{"paper_id":"1453b458-ec89-46d6-91c3-e0c5fa60c541","body_text":"Spontaneous emergence of context-dependent statistical learning in humans and neural networks\n\nFleming C. Peck,1 Hongjing Lu,1,2 Jesse Rissman1,3*\n\n1Department of Psychology, University of California, Los Angeles, Los Angeles, CA, USA.\n2Department of Statistics, University of California, Los Angeles, Los Angeles, CA, USA.\n3Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, CA, USA.\n\nAbstract\n\nHumans readily extract statistical regularities from experience, yet natural environments require flexible adaptation when associative structures shift across changing contexts, often without warning. Across two experiments, we show that humans can incidentally learn overlapping and conflicting visual associations even when contexts dynamically alternate and remain unsignaled or only minimally cued. To probe the computational mechanisms supporting this adaptive capacity, we trained recurrent neural networks with gated recurrent units on the same statistical learning task without providing any explicit context information. These models spontaneously developed distributed internal representations that robustly separated conflicting associations and supported rapid adaptation to latent context shifts. Critically, we show that these distributed representations, strongly shaped by the model’s initial weight configuration, played a key role in preventing catastrophic interference between contexts. Together, these behavioral and computational results significantly advance our understanding of how humans and artificial systems can successfully learn and flexibly retrieve context-dependent associations under challenging conditions.
\n\nIntroduction\n\nMany everyday experiences unfold in structured, predictable ways, with events that recur over time in stable patterns. Internalizing these regularities allows anticipation of future occurrences, facilitating efficient information gathering, decision-making, and behavioral adaptation. It follows that the human brain is fundamentally oriented toward predicting the upcoming future based on recent events.1–3 This predictive ability helps conserve cognitive resources by reducing the need for continuous, effortful learning once patterns have been identified.4 However, the world is rarely static: associations often vary across contexts.5 To support adaptive behavior, the brain is thought to engage in context-dependent learning of these regularities and associations for flexible predictions as environmental conditions shift.6 For example, navigating a daily commute relies on learning the timing and location of traffic congestion, and expectations for social interaction may differ when a friend is encountered at work versus at a party. In both cases, prior experience supports the formation of context-bound predictions that guide perception and behavior.\n\nHumans have an innate ability for statistical learning, allowing them to spontaneously discover regularities and associations. This process extracts spatial and temporal regularities from sensory input through passive exposure, without explicit instruction or external rewards.7 Statistical learning is proposed to support a wide range of cognitive functions, including language acquisition, visual perception, object recognition, and social cognition.9,10 Empirical studies demonstrate that individuals can detect regular patterns in continuous streams of stimuli across visual,11 auditory,12 and tactile13 modalities, in the absence of explicit transition cues and instructions.
\n\nbioRxiv preprint doi: https://doi.org/10.64898/2026.03.17.712206; this version posted March 18, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.\n\nDespite the rich literature on statistical learning, most research has focused on simple, highly reliable associations, such as detecting short sequences of objects or sounds. However, in natural environments, context plays a critical and often unobserved role in shaping how associations are formed. In animal learning research, for example, association-based behavior is known to be highly context-specific: extinguished fear responses return when animals are tested outside the extinction setting.14 Moreover, following extinction or reversal learning, animals reacquire original contingencies more rapidly than during initial learning,15,16 suggesting that prior contingencies are retained as latent knowledge in memory rather than being overwritten by new learning. Cognitive control processes are thought to underpin the behavioral flexibility afforded by suppressing previously useful but no longer relevant responses, allowing learners to pivot between contexts and contingencies as the environment demands.17 Notably, most studies in the animal learning literature involve explicit reinforcement (e.g., reward or punishment), whereas statistical learning occurs incidentally without feedback, instruction, or overt motivation.
\n\nAlthough a few studies have explored the statistical learning of regularities that depend on a latent context or environment (e.g., 18–22), it remains unclear whether individuals can incidentally learn and retrieve context-dependent temporal associations without explicit perceptual context cues, reinforcement, or instruction. Analogous mechanisms have been proposed in sensorimotor learning, an instance of implicit learning where the brain is thought to infer context shifts and partition experience into distinct memories.23 Here, we test whether people can acquire two distinct sets of temporal associations instantiated with an overlapping pool of visual objects, where most associations are in direct conflict between contexts. For example, in Context A, Object X is followed by Object Y, whereas in Context B, the same Object X is followed by Object Z. Successful learning requires participants to flexibly update their expectations according to the active context inferred from recent sequence history. We examine how well human learners can discover these context-dependent associations without any external context cue – where context is embedded only in the pattern of transitions – using both offline testing and online learning measures.\n\nTo explore how these context-dependent representations might emerge from experience, we trained neural network models on the same behavioral task. We then identified the model that best matched human performance across the experimental conditions and analyzed its hidden-layer activations to generate testable hypotheses about analogous representations in the human brain.
Deep neural networks have proven effective at capturing lower-level sensory processing,24 and recent perspectives advocate for extending these approaches to the study of higher-order cognition, including the representation of abstract knowledge.25 However, a common limitation of these modeling efforts is that these networks are typically trained on far more data than human learners (see 26 for a review), limiting the validity of direct comparisons. Additionally, prior modeling work frequently incorporates strong inductive biases that render context artificially explicit, either by feeding an unambiguous context signal into the input22,27 or by augmenting network architecture with designated units or computation modules.28,29 These modifications, while effective, constrain opportunities to observe how context discovery might emerge spontaneously. Inspired by Elman’s finding that simple recurrent neural networks can capture both short- and long-range dependencies30 and echoing recent calls to avoid hard-wiring solutions in cognitive modeling,31 we used minimally structured architectures that omitted context signaling and specialized inference modules. This design allowed us to examine how networks discover and represent latent task structure through sequence exposure alone. Finally, given evidence that weight initialization scale can influence learning trajectories in neural networks,27,32,33 we systematically varied the initial weight magnitudes of the networks to assess how this factor affects their ability to learn and distinguish context-dependent associations.\n\nThe goal of the neural network modeling is to generate hypotheses about how the brain might represent context in statistical learning.
The hippocampus employs two dominant neural representation strategies to support memory of individual experiences and to extract regularities across experiences34,35: sparse and distributed coding. Sparse codes, observed in the dentate gyrus and CA2/3 subregions, involve highly selective activation of a small subset of units in response to a given input.36 Distributed representations, observed in the CA1 subregion, encode inputs across overlapping patterns of activity spanning the neural population.37 We specifically seek evidence of each of these strategies in the hidden layer activations of neural networks that successfully represent context-dependent associations.\n\nOverall, this study aims to advance our understanding of context-dependent statistical learning by examining whether humans can learn and retrieve multiple conflicting statistical structures within highly overlapping stimulus sets. By manipulating the presence of visual contextual cues, we assess whether explicit signals of context shifts facilitate learning and whether individuals can still learn context-dependent associations in their absence. In parallel, we use neural network models trained on the same task to test whether artificial systems can account for human-like learning dynamics, offering insight into the computational mechanisms that may support flexible, context-sensitive learning in the brain.
\n\nResults\n\nParticipants performed a context-dependent statistical learning task in which they viewed a continuous stream of 1,600 object images (Fig. 1A). Their only task was to indicate whether an “×” or “+” was embedded on each object (Fig. 1C), a perceptual judgment designed to maintain attention and allow tracking of online learning via reaction times (RTs). Unbeknownst to participants, the image stream was structured into object pairs specific to one of two distinct contexts. Although they were told that parts of the sequence might become familiar over time, they received no information about the underlying structure or the existence of multiple contexts. Each context defined a unique set of temporal associations between a largely overlapping object set, such that the probability of one object following another depended on the active context (Fig. 1B).\n\nFollowing the learning phase, participants completed a two-alternative forced choice (2AFC) test. Because objects appeared in both contexts, the correct association on a given trial depended on the active context. Accordingly, each test trial began with a six-object sequence composed of three object pairs from a single, consistent context, followed by the first item of a test pair (Fig. 1E). Participants were then tasked with choosing which of two objects should come next (Fig. 1F). Context-independent trials assessed knowledge of the context-independent pair. Context-dependent trials consisted of two types: in direct-conflict trials, the lure was the object paired with the test cue in the other context; in indirect-conflict trials, the lure was an object not paired with the test cue in either context. After each choice, participants rated their confidence on a 1-4 scale (Fig. 1G).
\n\nIn Experiment 1 (Unsignaled), n = 50 participants completed the task without any explicit perceptual context cue. In Experiment 2 (Signaled), a separate group of n = 50 participants completed the same task but with a visual context cue: a colored border (white or black) surrounding each object, corresponding to the active context (Fig. 1D); this border was present during both the learning phase and the 2AFC test. Accuracy across the learning phase for the perceptual task was 92.2% for Expt. 1 and 92.7% for Expt. 2, indicating that participants attended to the stimuli during learning.\n\nFigure 1. Experimental overview. (A) Visualization of the learning phase. Participants viewed a uniformly paced sequence of objects separated by brief fixation periods. Each object appeared for 1,200 ms with a 450 ms interstimulus interval. The sequence was organized according to the temporal pair structure dictated by one of two contexts (Context A and Context B), which switched every 50 pairs. The orange and green backdrops are shown for illustrative purposes only. Participants performed four blocks of 200 pairs each, separated by short breaks. (B) Sample object assignments to context pair structures comprising 11 unique objects.
The context-independent pair is the same for both contexts (first row); three of the context-dependent pairs consist of the same object set, with the object assigned to the second pair position differing between contexts (rows 2-4); and one context-dependent pair has a context-specific object in the second pair position (last row). (C) Example of object embedded with “+” or “×”. Participants were tasked with making a button-press response to indicate which symbol each object contained; object-symbol mapping was held constant throughout the experiment. (D) Differentiation of the two experiments: In Expt. 1 (Unsignaled), no context cues were shown and thus context switches were entirely latent (left); in Expt. 2 (Signaled), context was indicated with a white or black border around the object (right). (E) 2AFC test procedure: Example of 6-item (3-pair) sequence leading up to the test cue of a 2AFC trial. (F) Immediately following the test cue, participants chose which of two candidate objects comes next in the sequence. This example is a direct-conflict context-dependent trial, in which the lure corresponds to the object paired with the test cue in the other context. (G) After each choice, participants made a confidence rating.
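As a rough illustration of the sequence structure summarized in Fig. 1A-B, the following Python sketch builds a 1,600-object stream from two contexts that alternate every 50 pairs, starting with Context A. The object labels, the uniform sampling of pairs with replacement, and the function name make_sequence are assumptions made for illustration; the actual stimulus assignment and any ordering constraints used in the experiment may differ.

```python
import random

# Hypothetical labels standing in for the 11 object images.
CONTEXT_A = [
    ('A', 'B'),                          # context-independent pair (same in both contexts)
    ('C', 'F'), ('D', 'G'), ('E', 'H'),  # shared objects, Context A pairings
    ('I', 'J'),                          # second item unique to Context A
]
CONTEXT_B = [
    ('A', 'B'),                          # context-independent pair
    ('C', 'G'), ('D', 'H'), ('E', 'F'),  # same objects, conflicting pairings
    ('I', 'K'),                          # second item unique to Context B
]


def make_sequence(n_pairs=800, switch_every=50, seed=0):
    """Return (objects, contexts): one object label and one context label per item.

    Assumption: pairs are drawn uniformly at random within the active context.
    """
    rng = random.Random(seed)
    objects, contexts = [], []
    for p in range(n_pairs):
        # The active context alternates every `switch_every` pairs.
        ctx = 'A' if (p // switch_every) % 2 == 0 else 'B'
        first, second = rng.choice(CONTEXT_A if ctx == 'A' else CONTEXT_B)
        objects += [first, second]
        contexts += [ctx, ctx]
    return objects, contexts
```

Because no context label is ever part of the object stream itself, a learner given only `objects` must infer the active context from which pairings have recently occurred, which is the situation faced by participants in the Unsignaled experiment.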
\n\nBehavioral evidence of context-dependent statistical learning\nWe observed evidence of context-dependent statistical learning, with significantly above-chance 2AFC performance for both contexts (one-sample t-tests, Holm-Bonferroni corrected for three tests, all p < 0.001) (Fig. 2). When considering direct- and indirect-conflict 2AFC trials separately, we found above-chance accuracy for both trial types (all one-sample t-tests p < 0.05; SI Appendix, Fig. S1). A mixed-design ANOVA was conducted to examine the effects of experiment (Unsignaled vs. Signaled context, between-subjects) and context-dependence (context-independent vs. context-dependent trials, within-subjects) on 2AFC accuracy. There was a significant main effect of context-dependence (F(1, 98) = 17.52, p < 0.001), reflecting higher performance on context-independent than context-dependent trials. The main effect of experiment was not significant (F(1, 98) = 1.93, p = 0.17), nor was the interaction between experiment and context-dependence (F(1, 98) = 3.07, p = 0.08). We used Bayesian estimation to assess equivalence of context-dependent 2AFC accuracy between experiments. The posterior distribution of the mean difference was centered near zero (mean = 0.38%, 95% HDI [-4.1, 4.6]). Approximately 96.5% of the posterior mass fell within the predefined region of practical equivalence (ROPE) of [-5%, 5%], providing evidence that the two experiments yielded equivalent performance. This equivalence suggests that the border cue may have been too subtle to boost context-dependent learning, or that explicit contextual cues are unnecessary for context-dependent learning in this paradigm given the contextual information that can be ascertained from recent sequence history. Additional analyses of confidence ratings and performance on remaining test tasks are reported in SI Appendix, section S1.
These analyses reveal that most participants showed no explicit awareness of the temporal pair structure.\n\nFigure 2. 2AFC test performance. Bar height reflects group average 2AFC accuracy (% correct) for Context A questions (left bar, orange), Context B questions (middle bar, green), and context-independent questions (right bar, gray). Note that Context A and Context B correspond to the first and second contexts, respectively, used during the learning phase. Each dot reflects the accuracy for one participant, with lines connecting a participant’s performance across the two contexts. Results are plotted separately for Unsignaled and Signaled conditions on the left and right, respectively. Asterisks indicate significant deviation from chance performance (50%; horizontal line). ***p<0.001.\n\nWe also found evidence of an online learning effect using participants’ reaction times during the learning phase when they judged whether each object contained an “×” or “+” (Fig. 1C). Because these markers were consistently associated with their corresponding objects across learning, faster responses could reflect memory-based predictions about the identity of upcoming objects, consistent with rapid adaptation to temporal statistics in the sequence. We expected that over the course of learning, knowledge of the temporal pair structure would facilitate faster, anticipatory responses to the second item of each pair than the first item of each pair, which follows a random transition between pairs.
Based on evidence that RTs improve throughout an experiment,38 we measured online learning as the first-item RT minus the second-item RT, where a positive value indicates an anticipation effect, and a negative value reflects possible interference from context switches. Mean reaction times for each pair position are reported in SI Appendix, Table S1.\n\nFigure 3. Reaction time differences reveal trajectory of online learning. Reaction time (RT) difference between responses to objects in the first (item 1) and second (item 2) pair position. A positive value on the y-axis indicates an anticipation effect, plotted for each block of the learning phase (x-axis). Average RT difference with standard error of the mean (shaded) for context-independent pairs in gray and for context-dependent pairs in blue. Linear contrast significance is indicated with the same color scheme, ***p<0.001; **p<0.01; *p<0.05. The shaded areas indicate sampling error.\n\nFor context-independent pairs (Fig. 3; gray), we found a significant linear trend of RT differences across blocks in the Unsignaled experiment (t(49) = 2.70, p = 0.009), suggesting increasing anticipatory learning over time. However, no such trend was observed in the Signaled experiment (t(49) = 0.70, p = 0.49), where RT differences appeared to stabilize after the first block. For context-dependent pairs, both experiments showed a significant linear increase in RT difference across blocks (Unsignaled: t(49) = 4.36, p < 0.001; Signaled: t(49) = 2.59, p = 0.013). However, unlike the context-independent pairs, RTs for the predictable item 2 objects in the Unsignaled experiment were initially slower than those for the first, unpredictable items (negative RT effect) and approached equivalence by the final block.
This slowing earlier in the experiment may reflect interference from frequent context switches: participants had to suppress the prediction under the previously active context, which would be especially demanding during the early blocks of training. This effect is slightly ameliorated in the Signaled experiment, suggesting that participants may have been able to integrate the border contextual cue to facilitate online context-dependent learning. Despite this initial disadvantage for second item responses, the online learning measure increased over time, reaching its highest average in the final block. A mixed-design ANOVA on the RT difference score with experiment as a between-subjects factor and block as a within-subjects factor showed no main effect of experiment (F(1, 98) = 1.48, p = 0.23).\n\nNeural network weight initialization influences context-dependent learning\nHaving established that humans can spontaneously learn context-dependent associations from exposure alone, we next turned to a computational account of this behavior using artificial neural network models. Our goals were to test whether these models could similarly discover the task’s latent structure without context cues and, critically, to characterize the nature of the emergent representations that give rise to the context-dependent gating of associative predictions, examining how specific model parameters shape this capacity.
\n\nFirst, we determined that recurrent neural networks with gated recurrent units (GRUs) learned the task more effectively than other network architectures, including feedforward networks and recurrent networks without gated units (see model comparison details in SI Appendix, section S2). Next, we trained GRU models on the same amount of sequence exposure as human participants. Models featured a 150-node hidden layer and were trained to predict the next item in the sequence using one-hot encoded object representations for both inputs and outputs (Fig. 4A). Critically, models received no explicit context information, requiring them to discover the latent structure from sequence statistics to make accurate predictions. As with humans, learning was evaluated with a 2AFC test. Model weights were frozen after training, and each 2AFC trial presented the model with a series of seven objects (i.e., a three-pair sequence and a test cue). The model then “selected” the next object in the sequence between two options, with its choice determined by the object with the higher predicted probability.\n\nFigure 4. GRU model’s 2AFC performance by weight variance. (A) Visualization of neural network architecture comprising 11 input units, a single GRU layer with 150 units, and 11 output units. (B-D) 2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized with increasing variance along the x-axis, color-coded by question category. (B) 2AFC performance on Context A (orange), Context B (green), overall context-dependent (blue) and context-independent (red). Chance performance (50%) indicated with gray horizontal line. (C-D) 2AFC performance on individual contexts, visualized for overall as well as direct-conflict (dark coloring) and indirect-conflict (light coloring) trial subsets.
Significant differences from chance (one-sample t-tests, Bonferroni-corrected for eight comparisons) indicated with horizontal lines at top of plot, color-coded in the same way. Mean human performance on direct-conflict trials of each context indicated by the dashed horizontal black line. (C) Context A. (D) Context B. (E) Absolute difference in direct-conflict 2AFC performance between the human group average and the model group average for each weight initialization configuration. Bar height reflects the summed absolute difference in direct-conflict 2AFC performance for Context A (dark orange) and Context B (dark green).\n\nWe systematically varied the bounds of the uniform distribution used to initialize model weights to evaluate whether greater initial weight variance would accelerate convergence, motivated by prior findings that initialization in neural networks can strongly influence learning dynamics.39,40 Low-variance initialization is commonly used as the default in neural networks. However, it remains unclear whether this default choice affects a model’s capacity to learn latent structures in the data. To address this, we systematically varied the weight initialization variance across a wide range of values. For each weight variance initialization condition, we trained and tested 50 independent models and report the average performance.\n\nAcross weight initialization conditions, models with low to moderate initialized weight variance achieve perfect accuracy on context-independent trials, demonstrating their ability to learn stable, non-contextual associations (Fig. 4B). However, as variance of initial weights increases, performance steadily declines, suggesting that excessive initial weight variance introduces noise that disrupts the model’s ability to extract consistent patterns from the sequence.
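To make the initialization manipulation concrete, the sketch below draws one weight matrix from a symmetric uniform distribution and shows a winner-take-all 2AFC readout of the kind described for the models. It is a minimal illustration in plain Python, not the training code: the helper names are hypothetical, only a single matrix is drawn rather than the full set of GRU parameters, the mapping from a target variance to a bound uses the standard identity Var[U(-b, b)] = b^2 / 3 (whether the reported values denote variances or bounds is an assumption here), and ties are resolved in favor of the first option by assumption.

```python
import math
import random


def bound_for_variance(variance):
    """Bound b such that U(-b, b) has the requested variance (Var = b**2 / 3)."""
    return math.sqrt(3.0 * variance)


def uniform_init(n_rows, n_cols, bound, rng):
    """Draw an n_rows x n_cols weight matrix with entries from U(-bound, bound)."""
    return [[rng.uniform(-bound, bound) for _ in range(n_cols)]
            for _ in range(n_rows)]


def choose_2afc(probs, option_a, option_b):
    """Pick whichever candidate object index has the higher predicted probability.

    Assumption: ties go to option_a.
    """
    return option_a if probs[option_a] >= probs[option_b] else option_b


# Example: one 150 x 11 matrix (hidden units x one-hot inputs) at variance 0.6.
rng = random.Random(0)
weights = uniform_init(150, 11, bound_for_variance(0.6), rng)
```

Under this parameterization, a larger target variance simply widens the interval from which every initial weight is drawn, so the condition labels can be read as a single scale knob applied before training begins.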
\n\nThe models’ learning of context-dependent associations – where context-specific conflicts must be resolved – reveals more complex dynamics. The relationship between initialized weight variance and context-dependent accuracy is non-monotonic, with the highest performance demonstrated by models with moderate weight initialization variance in the range of 0.4-0.6 (Fig. 4B). Low-variance models (0.08-0.2) demonstrate around 90% accuracy on Context B, the context that the model was most recently processing at the end of training before model weights were frozen, compared to below 60% accuracy for Context A, the context that was previously learned and conflicted with the more recently learned associations. Models in the high-variance range (1.0-1.4) exhibit a steady decline in accuracy for both contexts as initialized weight variance increases, indicating that their representations may be too diffuse or unstable to differentiate effectively between contexts. Notably, this non-monotonic relationship between 2AFC accuracy and initialized weight variance was preserved when models were trained using input representations that reflected the perceptual features of the objects (as opposed to one-hot vectors), derived from computer vision models such as AlexNet (SI Appendix, section S3). The same pattern held when the context-dependent pair that included a context-unique second item was excluded, such that all pairs comprised items that appeared in both contexts (SI Appendix, section S4).
This result helps rule out the possibility that recent exposure to particular items served as a context cue, strengthening the interpretation that context is inferred from recent sequence history.\n\nBreaking down performance into direct-conflict and indirect-conflict trials reveals notable differences in model learning. Direct-conflict trials (where the lure object is the correct answer for the other context) are the most diagnostic test of context-dependent learning as they place associations from different contexts in direct competition, making accurate performance dependent on the use of contextual information to disambiguate the correct response. Only models initialized with a weight variance of 0.6 achieved above-chance performance on Context A direct-conflict trials (t(49) = 2.81, p = 0.004), though this came at the expense of reduced, though still significant, accuracy on Context B direct-conflict trials (Fig. 4D). Low-variance models perform near floor on Context A direct-conflict trials, rendering their high overall context-dependent accuracy misleading as it reflects only strong performance on indirect-conflict Context A questions and mastery of Context B. In contrast, high-variance models show no advantage for Context B direct-conflict trials, with performance on direct-conflict questions around chance for both contexts.\n\nTo identify which weight initialization variance best approximated human-like behavior, model performance on Context A and Context B direct-conflict trials was compared to human data. Human accuracy was averaged across the Unsignaled and Signaled experiments as no significant difference was observed between them. Fig. 4E visualizes the absolute difference in 2AFC performance between models and humans on direct-conflict trials for both contexts (human performance indicated by the dashed black line in Fig.
4C, 4D). Initialized weight variance of 0.6 produced the clearest match to human performance, forming an elbow in the plot and achieving above-chance accuracy across all question sets.\n\nDistributed hidden layer representation strategy facilitates context-dependent learning\nHaving observed that models with initialized weight variance in the moderate range (such as 0.6) successfully learned context-dependent associations without any explicit context signal as input, we next examined how context information is encoded in the hidden layer activations of the models. Context encoding could manifest as either a sparse representation (carried by a few units) or a distributed representation (spread across many units). Therefore, we quantified two complementary properties of the activations: the extent to which context sensitivity was localized to a small subset of units (akin to individual “context cell” neurons that code for which context is currently active), and the degree to which the currently active context was expressed as a distinctive pattern of activity across many units. These analyses were conceptually motivated by prior work distinguishing sparse and distributed coding strategies in hippocampal and connectionist models.35,41 The analysis was conducted using the hidden-layer activations during the final block of training.\n\nThe sparse representation index measures the proportion of hidden layer units that do not show significant activation differences between contexts.
A higher sparse representation index therefore indicates that fewer units selectively encode a specific context, while a larger proportion of units are context-insensitive (Fig. 5A). This measure of context sensitivity for each unit was calculated with a one-way ANOVA comparing activations during exposure to Context A versus Context B in the final quarter of training (block 4). Fewer nodes with significant context sensitivity indicate that a limited number of hidden-layer units support the context-specific representations.\n\nThe distributed representation index was derived from a representational similarity analysis of hidden layer activations for the first item of each pair, the item that carries the context-dependent association. The index compares the geometric distance (dissimilarity) of these representations within a context versus across contexts, with normalization based on within-context consistency so that more stable representations are given greater influence (Fig. 5A, right plot). Higher values indicate that distinct context representations are distributed across the hidden layer.\n\nWe found that the low-variance models exhibit the sparsest representations, and the moderate-variance models exhibit the most distributed representations (Fig. 5A, middle plot). High-variance models do not show strong evidence of either representation strategy. The 0.6-initialized model, which was the only model to demonstrate significant direct-conflict 2AFC accuracy for both contexts (Fig. 5B-C), exhibited the strongest evidence for the distributed over the sparse representation index.\n\nFigure 5. Neural network hidden layer task representation strategies. 
(A) Visualization of the computation of the sparse representation index (left) and distributed representation index (right), with both indices plotted for each weight variance configuration (middle; x-axis) in light green and blue, respectively. (B) Significance of beta coefficients (y-axis) for multivariate regression analyses using sparse (green) and distributed (blue) representation indices to predict each 2AFC question category (x-axis). ***p<0.001; **p<0.01; *p<0.05. (C-D): Lesion analysis results. (C) 2AFC accuracy of Context A (left) and Context B (right) questions (y-axis) as an increasing number of hidden layer nodes are lesioned in descending rank order of context sensitivity (x-axis) for each weight initialization configuration (rainbow coloring). (D) 2AFC accuracy performance difference from no intervention for lesion analysis. (E) Context switch latency (y-axis) visualized across learning phase blocks (x-axis) for each weight initialization configuration (rainbow coloring). Failure to reflect a context switch is noted with a value of 51 (i.e., greater than the duration of context exposure).\n\nTo understand how these representation strategies supported learning, we regressed 2AFC accuracy on the z-scored sparse and distributed representation indices (averaged across the 50 models initialized for each weight configuration; Fig. 5B). Including both predictors in the same model allows us to evaluate their unique contributions to 2AFC task performance. 
Both indices significantly predicted context-independent accuracy (Sparse: β = 0.080, p = 0.016; Distributed: β = 0.081, p = 0.015). For context-dependent learning, the sparse representation index predicted performance for Context B (β = 0.17, p < 0.001) but not Context A (β = 0.007, p = 0.71), consistent with its prominence in the low-variance models that disproportionately learned Context B. In contrast, the distributed representation index predicted accuracy for both Context A (β = 0.069, p = 0.012) and Context B (β = 0.11, p < 0.001), reinforcing that this strategy more effectively supports context-dependent learning, where successful learning requires retaining knowledge of both contexts.\n\nEfficient context switching facilitates expression of context-dependent knowledge\nWe next carried out a lesion simulation analysis to understand how the moderate-variance models provide a better account of human behavior than the low-variance models, whose initialization is often used as the default in neural networks. For each model, all 150 hidden layer units were ranked according to their context sensitivity index (the F-statistic of activity difference between Context A and Context B). We then progressively lesioned the most context-sensitive units by setting their activations to zero and re-evaluated 2AFC accuracy after each lesion step.\n\nThe moderate-variance models show a steady decline in both Context A and Context B accuracy as more nodes are lesioned (Fig. 5C). This result further indicates a distributed representation strategy where many units contribute uniquely to the representation of the current context. In contrast, the low-variance models show evidence of a redundant coding strategy: accuracy, particularly for Context B, remains largely unchanged until around half of the hidden layer is lesioned (Fig. 5C). 
This delayed performance decline complements the earlier finding of a sparse representation strategy, in which very few nodes showed significant context sensitivity, suggesting that most units carried only shallow, overlapping context signals. Then, when Context B performance began to decline, Context A performance actually increased (Fig. 5D), with performance eventually reaching a level comparable to the moderate-weight models (Fig. 5C). This result indicates that the Context A representations are present in the knowledge base of the network. However, this knowledge is not accessible during the 2AFC test, as the low-variance models show excellent Context B accuracy (the context on which they were more recently trained) but extremely poor Context A accuracy (the previous context) (Fig. 4A).\n\nTo explain the discrepancy in 2AFC accuracy in light of this evidence of Context A knowledge preserved in both low- and moderate-variance models, we examined how efficiently models adapted to context switches. We derived a context switch latency metric operationalized as the number of pairs the model processed after a context switch before perfectly predicting the paired associate of all remaining pairs in each 50-pair context exposure. We found that, by the final block of training, the moderate-variance models exhibited faster switch latencies whereas the low-variance models adapted more slowly (Fig. 5E). This inefficiency likely prevented the low-variance models from shifting away from their end-of-training state during the brief six-item exposure provided in each 2AFC trial, leaving them biased toward Context B despite evidence of retaining earlier Context A associations. 
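The context switch latency metric described above can be illustrated with a minimal sketch. It assumes that each pair in a context exposure has already been scored as correctly or incorrectly predicted (how a prediction is scored, e.g. by taking the most probable output, is an assumption here and not specified in the text). The function name `switch_latency` is hypothetical, and the failure value of 51 follows the Figure 5 caption.

```python
def switch_latency(pair_correct, failure_value=51):
    """Count the pairs processed after a context switch before the model
    predicts every remaining pair in the exposure correctly.

    pair_correct: list of booleans, one per pair in the context exposure,
        ordered from the first pair after the switch (assumed input format).
    failure_value: value returned when the model never settles into a
        run of correct predictions that lasts to the end of the exposure.
    """
    n_pairs = len(pair_correct)
    latency = n_pairs  # assume failure until a trailing correct run is found
    # Walk backward from the end to find where the final all-correct run starts.
    for i in range(n_pairs - 1, -1, -1):
        if not pair_correct[i]:
            break
        latency = i
    return latency if latency < n_pairs else failure_value


# Example: two errors right after the switch, then 48 correct pairs.
example = [False, False] + [True] * 48
print(switch_latency(example))
```

Under this sketch, a model that errs on the final pair of an exposure is treated as never having adapted and receives the failure value, which is one possible reading of "perfectly predicting the paired associate of all remaining pairs".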
Overall, this evidence indicates that the moderate-variance models achieve the best context-dependent learning because they retain associative knowledge across contexts and quickly adapt to context switches with support from a distributed representation of context across the hidden layer.\n\nDiscussion\n\nGiven that statistical learning enables associative strengths to be incrementally updated over many exposures, it has been unclear whether it affords sufficient flexibility to adapt to changing associative contingencies in different contexts. The present results extend our understanding of when incidental learning of temporal regularities is possible via demonstration of context-dependent statistical learning under circumstances where contexts dynamically alternate, associations directly conflict, and no explicit instructions are provided. We found evidence for this in above-chance performance on the final 2AFC test as well as progressive RT speeding for predictable objects compared to unpredictable objects over the course of learning, which suggests that implicit learning mechanisms facilitated anticipatory behavior.42\n\nNotably, explicitly signaling context with a colored border (Expt. 2) did not enhance context-dependent learning compared to when context was fully latent (Expt. 1). 
This may reflect greater influence of local temporal context (e.g., recent sequence history, which was available during both learning and retrieval) over environmental cues, or the border cue may have disrupted implicit learning mechanisms by promoting a more deliberate strategy. Indeed, recent work suggests that states of reduced executive control, such as mind wandering, can enhance statistical learning relative to focused on-task states,43 implying that exogenously focusing attention via explicit cues may be counterproductive for this type of incidental learning. However, given that the border cue changed color only every few minutes and its relevance was not explicitly conveyed, it is also possible that this visual cueing of context was too subtle to provide a performance advantage; a more salient context signal might have produced different effects.\n\nPrior efforts to demonstrate context-dependent statistical learning with auditory stimuli have been unable to find learning of both contexts unless participants were provided with explicit instructions or salient context cues.18,19 Our success in the visual perceptual domain supports accounts suggesting that statistical learning mechanisms may be modality-specific rather than fully domain-general,44 with visual statistical learning potentially more robust to context-based interference under implicit learning conditions. Siegelman et al.20 reported some evidence of context-dependent learning in the visual domain using associative structures built from an overlapping set of stimuli. However, their paradigm involved a single consecutive exposure to each of the two contexts (rather than repeated interleaved context switching), self-paced stimulus sequence exposure, and explicit instructions to look for patterns in which shapes tended to follow each other. 
These design choices differ from those of the present study, which uses shorter, fixed-duration stimulus presentations to minimize possibilities for strategic encoding and support the passive, implicit learning that is thought to characterize statistical learning.45\n\nThe present findings build on prior work on second-order conditional (SOC) sequence learning, which has demonstrated that learners can extract higher-order temporal dependencies in which predictability depends on combinations of preceding elements rather than simple pairwise transitions.42,46 Recent work further suggests that exposure to SOC structure can shape subsequent performance and subjective sensitivity to sequence regularities even when explicit knowledge is limited.47 Although the surface structure of these tasks differs from the present paradigm, both lines of work underscore how context-sensitive behavior can emerge from the integration of temporal regularities over experience, without requiring explicit contextual signals.\n\nWe used a neural network modeling approach to inform hypotheses of how the human brain might support such learning. These models were optimized to predict the next object in the sequence. While our human participants engaged in a cover task requiring simple ×/+ perceptual judgments, we assume that they were implicitly forming predictions about upcoming stimuli. Therefore, the models’ predictive framework captures a core computational goal that the human learners pursue implicitly: anticipating future input based on recent experiences.1
\n\nThe GRU’s gating architecture may support its successful context-dependent learning by enabling the model to manage conflicting associations by retaining relevant information while filtering out noise. This computational function parallels how the human brain manages interference between new and old memories.48 Although GRU models are not intended as models of biological mechanisms, the update and reset gates bear resemblance to the dynamic interplay between the hippocampus and neocortex that supports stability for long-term memory storage34 as well as to neuromodulatory systems where prediction error signals (i.e., dopamine release) prompt a reassessment of context and a switch in behavioral strategy.49 Such mechanisms have been hypothesized to facilitate segmentation of continuous experience and recalibration of predictions,50 which may be relevant for context-dependent temporal associative learning and motivate hypotheses for future work examining parallels between biological systems and computational models.\n\nPrior neural network modeling of context-dependent learning has imbued neural networks with specialized architecture to facilitate latent cause inference28,29 or has explicitly provided unambiguous context information in model input.22,27 For example, Smith and colleagues51 effectively demonstrated that recurrent networks can track temporal structure across multiple timescales within explicitly signaled contexts in a statistical learning paradigm instantiated as games that share response choices. However, these studies bypass the question of how a sense of context might emerge organically from exposure alone to disambiguate overlapping task structure. 
Additionally, they introduce assumptions that are arguably biologically implausible, such as constant context monitoring and perfectly reliable context cues.5 Here, we more directly focus on latent context discovery by exploring how weight initialization affects learning dynamics. Since network weights are adjusted throughout training to minimize loss, their initial configuration acts as a key driver of convergence.52 Prior work suggests that higher initial weight magnitudes bias models toward “lazy” solutions, involving rapid solution convergence with unstructured representations, while smaller magnitudes support “rich” solutions that exhibit more structured learning albeit at a slower pace.27,33,40\n\nIndeed, increasing the variance of the uniform distribution used to initialize model weights to a moderate range facilitated successful context-dependent 2AFC performance. This improvement was associated with a high-dimensional, distributed code in the hidden layer that significantly predicted 2AFC accuracy for both contexts. This is consistent with studies suggesting that high-dimensional codes afforded by mixed selectivity in prefrontal cortex neurons allow for more flexibility and rapid adaptation to new tasks.41,53 The successful distributed context coding strategy, in which identical model input is represented differently when processed in different contexts, is consistent with reports of the hippocampus integrating contextual information into stimulus representations.54,55 Furthermore, the hippocampus supports the rapid learning of temporal associations.37,56 Taken together, these parallels suggest that the moderate-variance GRU models are capturing both higher-level contextual encoding and lower-level temporal associations, consistent with core functions of the hippocampus. 
\n\nThe variance of weight initialization may be interpreted as shaping the GRU’s inductive bias: the assumptions the model makes about the structure of the environment, particularly regarding the presence and separability of underlying contexts. Low initial weight variance appeared to bias the model towards rigid representations that emphasize recently experienced associations and fail to recover earlier learned patterns following context shifts. At the other end, high initial weight variance produced overly flexible representations that failed to consolidate stable structure. Our analyses suggest an optimal intermediate range of initial weight magnitudes, where models were sufficiently flexible to distinguish between contexts yet structured enough to preserve associations within each context and avoid catastrophic interference. Accordingly, these effects are best understood as emergent inductive biases shaped by properties of training dynamics and initialization, which may provide insight into how learning systems, whether biological or artificial, come to represent and segregate latent contexts. Future work could assess the extent to which similar learning dynamics arise across architectures and task demands and whether manipulating network hyperparameters, such as learning rate and number of hidden layers, consistently shapes the balance between plasticity and stability in context-dependent learning. 
\n\nTo better understand why the moderate-variance models succeeded, it is informative to examine the limitations of the low-variance models. These models performed poorly on Context A test trials, a pattern that might initially suggest catastrophic interference – that previously learned associations of Context A were overwritten by more recent Context B experience. However, the lesion analysis revealed that Context A knowledge remained in the networks but was not accessible until over half the hidden layer was removed. One likely explanation for this inaccessibility is the slower context switching in low-variance models: compared to the moderate-variance models, which successfully expressed knowledge of both contexts, the low-variance models were slower to accommodate context switches. As a result, the brief context exposure sequences preceding each 2AFC decision may not have provided sufficient evidence to pull them out of their orientation toward the Context B state at test, which remained simply because Context B was the last context encountered during training. This phenomenon parallels findings from the fear extinction literature, where extinguished fear responses can re-emerge in a different context, indicating that underlying knowledge is retained but not manifested in behavior when irrelevant to the current setting.14\n\nAnother key limitation of the low-variance models that emerged from the lesion results was a constraint on how knowledge was represented in the hidden layer units. Before Context A performance recovered, these models showed little to no change in 2AFC accuracy for either context until roughly half the hidden layer was lesioned, in contrast to the steady performance decline observed in moderate- and high-variance models. This suggests highly redundant coding within the hidden layer. 
Redundant neural coding is theorized to enhance robustness in noisy environments by duplicating information across neural populations53,57,58 – a potentially advantageous feature for the present task, where many associations directly conflict and half of the training samples are unreliable (i.e., between-pair transitions). Such redundancy could plausibly account for why the low-variance models achieved the strongest accuracy on Context B. However, although this redundant coding strategy may help stabilize performance within a single context amidst overall environmental instability, it ultimately proved ineffective because it limited the rapid adaptability needed to operate in a dynamic environment with multiple context-dependent structures, resulting in a failure to express knowledge of both contexts.\n\nMirroring the diversity of these computational profiles, humans also exhibited considerable variability. Although performance on context-dependent trials was significantly above chance at the group level, some participants exhibited little or no learning (akin to the high-weight models) while others showed stronger learning of one of the contexts (similar to the low-weight models). Just as some GRU models required more exposure to learn both sets of associations, certain individuals may also need more input to reach stable learning. A promising future direction is to identify model parameters that reflect these individual differences and predict how quickly a learner converges on context-dependent associations, potentially linking such parameters to developmental changes in learning efficiency.59\n\nTaken together, our findings demonstrate that humans can spontaneously resolve conflicting, context-dependent associations from passive exposure alone – even in the absence of explicit instructions, self-pacing, feedback, or contextual cues. 
The finding that explicit signaling offered no advantage over entirely latent context exposure further highlights the robustness of this incidental learning mechanism, suggesting that temporal statistics alone are sufficient to drive contextual inference. Our neural network modeling provides a mechanistic account for this capacity, showing that successful adaptation relies on the emergence of distributed representations that are influenced by weight initialization parameters, which we believe are a reasonable proxy for humans’ inductive biases. This representational strategy not only maintains information from multiple contexts, even when the associative structure directly conflicts across contexts, but also quickly accommodates context changes. These findings suggest that the human brain may rely on similar mechanisms to flexibly manage latent contextual shifts and support adaptive prediction in dynamic environments.\n\nMaterials and Methods\n\nParticipants\nParticipants were recruited via the UCLA Psychology Department subject pool and completed the experiment in person for course credit. All participants provided informed consent in accordance with protocols approved by the UCLA Institutional Review Board (IRB#22-001719). Inclusion criteria (age 18–40 years, native English speaker, and normal or corrected-to-normal vision with contacts but not glasses) were confirmed before commencing data collection. 
Our goal was to obtain useable data from 50 participants for each of the two experiments (Expt. 1: Unsignaled context; Expt. 2: Signaled context), so recruitment continued until 50 participants per experiment met our data quality thresholds of 90% of trials responded to and 85% accuracy on trials during the learning phase. These inclusion criteria were enforced to ensure that data analyses focused on participants who were engaged during the learning phase of the experiment. Our final sample included 50 participants for Expt. 1 (33 F / 17 M; mean age = 20.5 years) and 50 different participants for Expt. 2 (40 F / 8 M / 2 Non-Binary; mean age = 20.0 years).\n\nMaterials\nThe experiment was coded and run with PsychoPy version 2024.2.4 (ref. 60) on a Mac Mini. Stimuli were displayed on a DELL P2422HE monitor with 1920 by 1080 pixel resolution and screen size of 23.8 inches, which participants viewed from a fixed distance with their head stabilized with a forehead and chin rest. An EyeLink 1000 eye tracker (SR Research) captured gaze location while participants completed the experiment, but eye tracking data are not reported here. Experiment stimuli were drawn from a set of objects created using Blender 2.48.61,62 The stimuli were visually distinct in terms of shape and color and were novel to participants. Images were resized to be 350 pixels wide. A small “×” or “+” symbol was subtly embedded onto each object using slight color contrast such that the mark was visible but did not obstruct recognition of object shape.\n\nLearning phase\nIn the first phase of the experiment, participants were exposed to a sequence of objects presented individually. Each object was presented at one of four screen locations, at a width of 350 pixels, centered 300 pixels above, below, right, or left of the center of a gray screen. 
At each object presentation, the three positions not occupied by the current object were filled with phase-scrambled versions of other objects cropped into circles with a diameter of 300 pixels (visualized in SI Appendix, Figure S2). The experimental manipulation of object location was included to enable potential analyses of spatial location-based learning as indexed by anticipatory eye movements. However, because the eye tracking data did not yield clear or interpretable effects, we focus all analyses on object identity and omit spatial position from further consideration, as well as from the task depiction in Fig. 1. Before beginning the learning phase, participants were instructed that parts of the sequence might become familiar over time and that they would later be asked questions about the objects they had seen.\n\nUnbeknownst to the participants, the objects were organized into two sets (or contexts) of 5 pairs of objects. The same object set was used for all participants, but objects were randomly assigned to pair positions, and each object maintained either first-of-pair (item 1) or second-of-pair (item 2) position in the pair across contexts. One pair was context-independent, meaning the same two objects were paired in both contexts. The other four pairs were context-dependent. Three of these pairs consisted of the same set of six objects across both contexts, but the second item associated with each first item was dependent on context. For example, Object X is paired with 
Object Y in Context A but with Object Z in Context B. The last context-dependent pair shared the same first object across both contexts but the paired second object was specific to each context (i.e., only appeared in that context and not the other). In this way, these four context-dependent pairs shared a first item across contexts but the second item of each pair was dependent on context. In total, a set of 11 unique items was used to instantiate the five pairs in each context.\n\nThroughout this learning phase, participants were tasked with responding to whether the object onscreen was marked with an “×” or “+”. Therefore, reaction time to the perceptual question could be evaluated as an online measure of pair structure learning. Objects were presented for 1200 ms with a 450 ms interstimulus interval.\n\nThe two experiments differed with respect to whether context was Unsignaled (Expt. 1) or Signaled (Expt. 2) with a border around the objects that was white or black depending on the context.\n\nTwo-alternative forced choice (2AFC) task\nIn the first of three test tasks immediately following this learning phase, participants completed a two-alternative forced choice (2AFC) task. Because the object associations were dependent on active context for all but the one context-independent pair, on each 2AFC trial participants were presented with a sequence of seven objects (consisting of three pairs from one of the contexts and the first item of the test pair) before being presented with two side-by-side alternatives for which object they thought should come next (one was the correct paired associate of the test pair and the other was a lure). 
Objects in the sequence were presented with the same timing as in the learning phase, and participants were given unlimited time to make a choice between target and lure. Participants completed a total of 54 questions: 6 of these questions evaluated the context-independent pair, while 48 questions evaluated context-dependent associations. The 48 questions probing context-dependent associations could either feature a lure object that was the correct paired associate in the other context (direct-conflict; 16 questions), or a lure that was any other item (indirect-conflict; 32 questions). The “×” and “+” markings were removed from the objects to make clear that participants no longer were required to respond to the perceptual question. For the Unsignaled experiment, no explicit context cues were provided; for the Signaled experiment, the border around the objects was colored white or black on each trial to cue contexts. After making each 2AFC judgment, participants were prompted to rate their confidence in their decision on a scale from 1 to 4.\n\nStructure knowledge probe\nAfter completing all 2AFC trials, participants were prompted to answer some questions about what they learned during the experiment. First, they were asked to respond yes or no to whether they observed any predictable patterns in the experiment. Second, they were asked to describe any patterns they observed in the sequence. Third, they were asked to describe any rules that governed which object would come next in the sequence. The idea was to use increasingly direct prompts to elicit any knowledge of the pair structure underlying the observed sequence, and thereby gauge how much of that knowledge was explicit.\n\nPair reconstruction task\nThe last task allowed participants to demonstrate explicit knowledge of pairs. 
Participants were presented with a bank of all 11 objects at the top of the screen and provided with 20 sets of two empty side-by-side squares arranged in 4 rows of 5 columns. Participants were instructed to organize the objects into related pairs by placing one object in each square of a pair, with the left and right positions corresponding to the first and second items in the pair. Participants were told that each item could be used more than once and that they did not have to fill out all of the pairs.

Data cleaning
Data exclusion criteria were enforced to ensure that participants were engaging with the experiment during the learning phase. Two criteria were applied: a response rate of more than 90% and an accuracy of more than 85% on all responses throughout the learning phase. Data collection continued until 50 usable participants had been obtained for each experiment.

Neural network architecture
Recurrent neural network models with gated recurrent units (GRUs) were implemented with PyTorch v2.0.1 [63]. Such models have previously been used to explore context-dependent associative learning from sequences [22,27]. Each model had the same architecture: an 11-dimensional input layer (equal to the number of objects included in the study), a hidden layer of 150 nodes (GRU performance with different hidden layer sizes is presented in SI Appendix, section S5), and an 11-dimensional output layer, again matching the dimensionality of the one-hot object vectors.
The learning rate was held constant at 0.001, and model weights were updated after each training sample using the Adam optimizer with a cross-entropy loss. Default parameters were used unless otherwise noted.

Training
A unique 1600-object sequence was generated for each model in the same way as for human participants. Each neural network received one object at a time and was trained to predict the identity of the next object in the sequence. Although the sequence was constructed using embedded object pairs, models received no information about this underlying structure. That is, the model made predictions at every time step (1599 samples for the 1600-object sequence) and had no awareness of pair boundaries. The same sequence was used for all epochs of training, with the hidden state reset at the start of each epoch and between blocks (every 400 samples) in recurrent models to emulate the breaks taken by human participants.

2AFC task
After each epoch of learning, model weights were frozen, and the models were evaluated using a 2AFC test designed to mirror the testing procedure of the human participants. A unique set of 2AFC test questions was generated for each model in the same way as for human participants. Before each trial, hidden layer activity was reset to zero. Then, a sequence of three pairs from one of the contexts was presented as the hidden state evolved, allowing the model to infer the active context based on the sequence. Finally, the first object of the test pair was presented as input, and the model’s prediction of the ensuing item was evaluated. A trial was scored as correct when the probability assigned to the correct paired associate was higher than that assigned to the lure.
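The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the example logits, and the object indices are hypothetical, and the sketch assumes the model exposes an 11-element vector of output logits for the item following the test cue.

```python
import math

def softmax(logits):
    # Convert raw output logits into a probability distribution.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_2afc(logits, target_idx, lure_idx):
    # A trial is correct when the target receives more probability than the lure.
    probs = softmax(logits)
    return probs[target_idx] > probs[lure_idx]

def accuracy_2afc(trials):
    # trials: list of (logits, target_idx, lure_idx) tuples.
    outcomes = [score_2afc(logits, t, l) for logits, t, l in trials]
    return sum(outcomes) / len(outcomes)

# Hypothetical output for one test cue over the 11 objects.
example_logits = [0.0] * 11
example_logits[3] = 2.0  # assumed correct paired associate
example_logits[7] = 1.0  # assumed lure
trials = [(example_logits, 3, 7), (example_logits, 7, 3)]
```

Because softmax is monotonic, comparing the two logits directly would give the same decision; the probabilities are computed here only to mirror the wording of the scoring rule.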
In most analyses, accuracy is evaluated separately for the context-independent, indirect-conflict context-dependent, and direct-conflict context-dependent question sets to capture how well the models handle conflicting information across contexts and maintain knowledge of stable, context-independent relationships.

Single epoch analyses
We tested the GRU’s ability to learn the task as the variance of the uniform distribution used to initialize the hidden layer’s weights was increased. The uniform distribution was centered at zero with positive and negative bounds of 0.08 (default for PyTorch with 150 nodes), 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, and 1.4. For each of these weight initialization conditions, fifty independent GRU models with different random weight initializations were trained and tested, and learning measures were averaged across these models to obtain robust performance estimates.

Learning trajectory analysis: Context switch latency
To assess how quickly the neural network models adapted to a context change, we developed a switch latency measure. We devised a stringent operationalization of switch latency as the number of first-of-pair (item 1) items (i.e., the items whose model output captures the within-pair transition prediction) a model processed after a context switch before achieving perfect accuracy on all remaining item 1 samples in that 50-pair context exposure. Because only the model outputs of item 1 training samples are predictable and thus learnable, they served as a measure of adaptation to a new context. Switch latency was calculated for all 16 context exposures (8 per context) and averaged within each block (4 context exposures), yielding a single switch latency value per block. We averaged this measure across all 50 trained models for each weight initialization condition.

Hidden layer analyses: Context representation strategies
To understand how context information was represented across the hidden layer, we quantified two complementary properties of the activations: the extent to which context sensitivity was localized to a small subset of units (akin to individual “context cell” neurons that code for which context is currently active), and the degree to which the currently active context was expressed as a distinctive pattern of activity across many units. These analyses were conceptually motivated by prior work distinguishing sparse and distributed coding strategies in hippocampal and connectionist models [35,41]. Our goal was to determine whether these representational properties could explain 2AFC task performance across all individual GRU model instances of the weight variance configurations.

In a sparse representation, context sensitivity is confined to a relatively small subset of hidden layer units, while the vast majority remain inactive or insensitive. To investigate sparse context representations in the GRU’s hidden layer, we first used a one-way ANOVA to estimate the difference in activation when processing inputs from Context A and Context B during the final quarter of training (block 4) for each of the 150 hidden layer nodes. We then counted the number of nodes that showed a significant activation difference.
Within the analysis of each model, we applied a Bonferroni correction to control the Type I error rate across the 150 comparisons performed. The corrected significance threshold was computed by dividing the original alpha level (0.05) by the number of comparisons (150), yielding an adjusted significance level of p < 0.00033. Based on this threshold, we determined that Fcrit(1,398) = 12.75 and calculated the sparse representation index as the proportion of hidden layer nodes that did not show a significant difference in activation between contexts, such that a larger value reflects a sparser context representation.

The distributed representation index was computed using a representational similarity analysis (RSA [64]) focused on the hidden layer activations after processing the first item of each pair (capturing the context-dependent prediction) during the final quarter of training (block 4). This included a total of 200 hidden state samples (100 per context). These activations were divided into two split-halves, each containing 10 samples for each of the five pairs per context. The samples in each split-half were evenly divided into those drawn from the first half of a context exposure and those from the second half, controlling for any strengthening of context representation over time. We averaged the hidden state activation within each node for each object within each context. We then computed the Pearson correlation coefficient for all pairwise comparisons of objects within and across contexts, producing an RSA matrix. This matrix compared the split-half object representations, with one half plotted along the x-axis and the other along the y-axis; each axis had 10 cells, one for each of the five pairs viewed in each context. The upper-left and lower-right quadrants contained correlations between the five pairs from the same context.
The upper-right and lower-left quadrants contained correlations between the same objects when viewed in opposing contexts. To quantify the distributed representation, we calculated the difference between the average within-context and between-context correlations for each object and divided it by one minus the average within-context correlation. This normalization penalized models with lower within-context stability: lower within-context similarity yields a larger denominator, which decreases the overall distributed representation index. It thereby ensured that observed differences in between-context representation were not artifacts of noisy or unstable object representations.

Hidden layer analyses: Lesion analysis
To further understand how the GRU model’s hidden layer representations supported successful context-dependent learning, we conducted an intervention analysis. For all weight configurations, we trained a new set of 50 models with the same single-epoch training procedure and then systematically tested their performance on the 2AFC task while “lesioning” (zeroing out) subsets of hidden layer nodes. Importantly, these nodes were active during training on the 1600-object sequence, and the lesioning intervention was applied only immediately prior to the 2AFC testing phase.
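The lesioning intervention can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the authors' code: it assumes each hidden node already has a context-sensitivity score, uses hypothetical function names and toy values, and leaves open how the mask is applied during the test sequence (one plausible choice is to multiply the hidden state by the mask at every time step).

```python
def lesion_mask(sensitivity, n_lesion):
    # Rank node indices from most to least context-sensitive.
    order = sorted(range(len(sensitivity)),
                   key=lambda i: sensitivity[i], reverse=True)
    lesioned = set(order[:n_lesion])
    # 0.0 silences a node; 1.0 leaves it untouched.
    return [0.0 if i in lesioned else 1.0 for i in range(len(sensitivity))]

def apply_lesion(hidden, mask):
    # Zero out the lesioned units of a hidden state vector.
    return [h * m for h, m in zip(hidden, mask)]

# Toy example with 4 hidden nodes (the models in the text have 150).
sensitivity = [0.5, 9.0, 2.0, 7.5]   # hypothetical per-node scores
hidden = [0.2, -0.4, 0.9, 0.1]       # hypothetical hidden state
mask = lesion_mask(sensitivity, n_lesion=2)
lesioned_hidden = apply_lesion(hidden, mask)
```

With `n_lesion=0` the mask is all ones and the hidden state is unchanged, which corresponds to a no-lesion baseline; lesioning every node leaves only zeros.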
We first calculated the absolute context sensitivity of each hidden layer node using a one-way ANOVA of activations between Context A and Context B during the final quarter of training (block 4), in the same way as when computing the sparse representation index. Nodes were then ranked from the most context-sensitive (largest F-statistic) to the least sensitive. During the 2AFC task, subsets of hidden layer nodes were progressively lesioned, beginning with the most context-sensitive nodes. We evaluated models under the following lesioning conditions: 0 (no nodes lesioned, to obtain a baseline performance estimate), 1, 5, 10, 25, 50, 75, 100, 125, 130, 135, 140, 145, and 150 (all nodes lesioned, with an expectation of chance performance). We report the average 2AFC performance on Context A, Context B, and context-independent question sets, expressed as both obtained accuracy and the change in performance relative to the no-lesion baseline (i.e., when no intervention is applied).

Statistical Analysis
2AFC task
2AFC task performance was assessed by evaluating accuracy and average confidence rating on subsets of 2AFC questions. Group-level accuracy was tested against 50% chance using a one-sample Student’s t-test with Holm-Bonferroni correction applied for three comparisons (Context A, Context B, and context-independent trials) within each experiment.

We conducted a mixed-design ANOVA to examine the effects of experiment (between-subjects factor: Expt. 1 versus Expt. 2) and context-dependence (within-subject factor: context-dependent versus context-independent trials) on 2AFC accuracy using the Python pingouin package. To assess whether context-dependent 2AFC accuracy was statistically equivalent between experiments, we used Bayesian estimation with a region of practical equivalence (ROPE) approach.
We computed the posterior distribution of the mean difference in accuracy between Expt. 1 and Expt. 2 and quantified the proportion of the posterior mass falling within a predefined ROPE of [-5%, 5%]. This ROPE was selected to reflect the smallest effect size of interest, consistent with typical variability in task accuracy in the statistical learning literature. Posterior distributions were estimated using the PyMC package.

Online learning assessment
We quantified online learning using participants’ RTs during the learning phase, in which they indicated whether each object contained an “×” or “+”. For each block, we computed an anticipation score as the average RT to the second item of each pair subtracted from the average RT to the first item of each pair. This metric captured facilitation for predictable second items while controlling for overall RT drift throughout the session, as first items follow unpredictable transitions. Positive values indicate faster responses to second items relative to first items.

To assess changes in online learning across blocks, we applied a linear contrast with weights [-3, -1, 1, 3] to the blockwise anticipation scores for each participant and tested the group-level difference from zero using a two-tailed one-sample t-test for each experiment.

Multivariate regression analysis of representation strategies
The two measures of hidden layer activity – sparse representation index (proportion of hidden layer nodes that do not show significant activation difference by context) and distributed representation index (within- versus between-context correlation differences) – were used as
predictors in multivariate regression models aimed to explain the variance of 2AFC context-dependent accuracy, context-independent accuracy, and accuracy difference between contexts. These indices were first computed for each of the 50 models instantiated with each of the eight weight initialization configurations, and then were averaged within each configuration. The resulting eight values for each predictor were z-scored to enable comparison of beta coefficients across predictors. By including both metrics in the same regression models, we assessed their unique contributions to task performance. This allowed us to evaluate whether sparse or distributed representations were more predictive of learning outcomes, providing insight into the mechanisms underlying the GRU model’s ability to process and adapt to context-dependent associations.

Acknowledgements

F.P. was supported by the National Science Foundation Graduate Research Fellowship Program under Grant Nos. DGE-2034835 and DGE-2444110.

Author Contributions

Conceptualization: FCP, JR, HL; Methodology: FCP, JR, HL; Software: FCP; Formal analysis: FCP; Visualization: FCP; Supervision: JR, HL; Writing – original draft: FCP; Writing – review & editing: FCP, HL, JR.

Declaration of interests

The authors declare no competing interests.

References

1. Friston, K. (2010). The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138. https://doi.org/10.1038/nrn2787.
2. Bar, M. (2009). The proactive brain: memory for predictions. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 364, 1235–1243. https://doi.org/10.1098/rstb.2008.0310.
3. Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 36, 181–204. https://doi.org/10.1017/S0140525X12000477.
4. Summerfield, C., and de Lange, F.P. (2014). Expectation in perceptual decision making: neural and computational mechanisms. Nat. Rev. Neurosci. 15, 745–756. https://doi.org/10.1038/nrn3838.
5. Heald, J.B., Lengyel, M., and Wolpert, D.M. (2023). Contextual inference in learning and memory. Trends Cogn. Sci. 27, 43–64. https://doi.org/10.1016/j.tics.2022.10.004.
6. Heald, J.B., Wolpert, D.M., and Lengyel, M. (2023). The Computational and Neural Bases of Context-Dependent Learning. Annu. Rev. Neurosci. 46, 233–258. https://doi.org/10.1146/annurev-neuro-092322-100402.
7. Statistical Learning (2015). 501–506. https://doi.org/10.1016/B978-0-12-397025-1.00276-1.
8. Statistical Learning (2015). In Brain Mapping (Elsevier), pp. 501–506. https://doi.org/10.1016/b978-0-12-397025-1.00276-1.
9. Sherman, B.E., Graves, K.N., and Turk-Browne, N.B. (2020). The prevalence and importance of statistical learning in human cognition and behavior. Curr. Opin. Behav. Sci. 32, 15–20. https://doi.org/10.1016/j.cobeha.2020.01.015.
10. Saffran, J.R., and Kirkham, N.Z. (2018). Infant Statistical Learning. Annu. Rev. Psychol. 69, 181–203. https://doi.org/10.1146/annurev-psych-122216-011805.
11. Fiser, J., and Aslin, R.N. (2001). Unsupervised Statistical Learning of Higher-Order Spatial Structures from Visual Scenes. Psychol. Sci. 12, 499–504. https://doi.org/10.1111/1467-9280.00392.
12. Saffran, J.R., Aslin, R.N., and Newport, E.L. (1996). Statistical Learning by 8-Month-Old Infants. Science 274, 1926–1928.
13. Conway, C.M., and Christiansen, M.H. (2005). Modality-Constrained Statistical Learning of Tactile, Visual, and Auditory Sequences. J. Exp. Psychol. Learn. Mem. Cogn. 31, 24–39. https://doi.org/10.1037/0278-7393.31.1.24.
14. Bouton, M.E. (1993). Context, time, and memory retrieval in the interference paradigms of Pavlovian learning. Psychol. Bull. 114, 80–99. https://doi.org/10.1037/0033-2909.114.1.80.
15. McAllister, D.E., and McAllister, W.R. (1994). Extinction and Reconditioning of Classically Conditioned Fear before and after Instrumental Learning: Effects of Depth of Fear Extinction. Learn. Motiv. 25, 339–367. https://doi.org/10.1006/lmot.1994.1018.
16. Bouton, M.E. (2004). Context and Behavioral Processes in Extinction. Learn. Mem. 11, 485–494. https://doi.org/10.1101/lm.78804.
17. Izquierdo, A., and Jentsch, J.D. (2012). Reversal learning as a measure of impulsive and compulsive behavior in addictions. Psychopharmacology (Berl.) 219, 607–620. https://doi.org/10.1007/s00213-011-2579-7.
18. Weiss, D.J., Gerfen, C., and Mitchel, A.D. (2009). Speech Segmentation in a Simulated Bilingual Environment: A Challenge for Statistical Learning? Lang. Learn. Dev. 5, 30–49. https://doi.org/10.1080/15475440802340101.
19. Gebhart, A.L., Aslin, R.N., and Newport, E.L. (2009). Changing Structures in Midstream: Learning Along the Statistical Garden Path. Cogn. Sci. 33, 1087–1116. https://doi.org/10.1111/j.1551-6709.2009.01041.x.
20. Siegelman, N., Bogaerts, L., Kronenfeld, O., and Frost, R. (2018). Redefining “Learning” in Statistical Learning: What Does an Online Measure Reveal About the Assimilation of Visual Regularities? Cogn. Sci. 42, 692–727. https://doi.org/10.1111/cogs.12556.
21. Qian, T., Jaeger, T.F., and Aslin, R.N. (2016). Incremental implicit learning of bundles of statistical patterns. Cognition 157, 156–173. https://doi.org/10.1016/j.cognition.2016.09.002.
22. Smith, C.M., Thompson-Schill, S.L., and Schapiro, A.C. (2024). Rapid Learning of Temporal Dependencies at Multiple Timescales. J. Cogn. Neurosci. 36, 2343–2356. https://doi.org/10.1162/jocn_a_02232.
23. Heald, J.B., Lengyel, M., and Wolpert, D.M. (2021). Contextual inference underlies the learning of sensorimotor repertoires. Nature 600, 489–493. https://doi.org/10.1038/s41586-021-04129-3.
24. Yamins, D.L.K., and DiCarlo, J.J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365. https://doi.org/10.1038/nn.4244.
25. Saxe, A., Nelli, S., and Summerfield, C. (2021). If deep learning is the answer, what is the question? Nat. Rev. Neurosci. 22, 55–67. https://doi.org/10.1038/s41583-020-00395-8.
26. Alamia, A., Gauducheau, V., Paisios, D., and VanRullen, R. (2020). Comparing feedforward and recurrent neural network architectures with human behavior in artificial grammar learning. Sci. Rep. 10, 22172. https://doi.org/10.1038/s41598-020-79127-y.
27. Flesch, T., Juechems, K., Dumbalska, T., Saxe, A., and Summerfield, C. (2022). Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron 110, 1258-1270.e11. https://doi.org/10.1016/j.neuron.2022.01.005.
28. Lu, Q., Nguyen, T.T., Zhang, Q., Hasson, U., Griffiths, T.L., Zacks, J.M., Gershman, S.J., and Norman, K.A. (2024). Reconciling shared versus context-specific information in a neural network model of latent causes. Sci. Rep. 14, 16782. https://doi.org/10.1038/s41598-024-64272-5.
29. Franklin, N.T., Norman, K.A., Ranganath, C., Zacks, J.M., and Gershman, S.J. (2020). Structured Event Memory: A neuro-symbolic model of event cognition. Psychol. Rev. 127, 327–361. https://doi.org/10.1037/rev0000177.
30. Elman, J.L. (1990). Finding Structure in Time. Cogn. Sci. 14, 179–211. https://doi.org/10.1207/s15516709cog1402_1.
31. Hasson, U., Nastase, S.A., and Goldstein, A. (2020). Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks. Neuron 105, 416–434. https://doi.org/10.1016/j.neuron.2019.12.002.
32. Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems (Curran Associates, Inc.).
33. Chizat, L., Oyallon, E., and Bach, F. (2019). On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems (Curran Associates, Inc.).
34. McClelland, J.L., McNaughton, B.L., and O’Reilly, R.C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457. https://doi.org/10.1037/0033-295X.102.3.419.
35. Schapiro, A.C., Turk-Browne, N.B., Botvinick, M.M., and Norman, K.A. (2017). Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philos. Trans. R. Soc. B Biol. Sci. 372, 20160049. https://doi.org/10.1098/rstb.2016.0049.
36. Leutgeb, J.K., Leutgeb, S., Moser, M.-B., and Moser, E.I. (2007). Pattern Separation in the Dentate Gyrus and CA3 of the Hippocampus. Science 315, 961–966. https://doi.org/10.1126/science.1135801.
37. Schapiro, A.C., Rogers, T.T., Cordova, N.I., Turk-Browne, N.B., and Botvinick, M.M. (2013). Neural representations of events arise from temporal community structure. Nat. Neurosci. 16, 486–492. https://doi.org/10.1038/nn.3331.
38. Welford, W.T., Brebner, J.M.T., and Kirby, N. (1980). Reaction Times (Stanford University).
39. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (PMLR), pp. 1139–1147.
40. Dominé, C.C.J., Anguita, N., Proca, A.M., Braun, L., Kunin, D., Mediano, P.A.M., and Saxe, A.M. (2025). From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks. Preprint at arXiv, https://doi.org/10.48550/arXiv.2409.14623.
41. Hinton, G.E. (1986). Learning Distributed Representations of Concepts. Proc. Annu. Meet. Cogn. Sci. Soc. 8.
42. Destrebecqz, A., and Cleeremans, A. (2001). Can sequence learning be implicit? New evidence with the process dissociation procedure. Psychon. Bull. Rev. 8, 343–350. https://doi.org/10.3758/BF03196171.
43. Vékony, T., Farkas, B.C., Brezóczki, B., Mittner, M., Csifcsák, G., Simor, P., and Németh, D. (2025). Mind wandering enhances statistical learning. iScience 28. https://doi.org/10.1016/j.isci.2024.111703.
44. Conway, C.M. (2020). How does the brain learn environmental structure? Ten core principles for understanding the neurocognitive mechanisms of statistical learning. Neurosci. Biobehav. Rev. 112, 279–299. https://doi.org/10.1016/j.neubiorev.2020.01.032.
45. Perruchet, P., and Pacton, S. (2006). Implicit learning and statistical learning: one phenomenon, two approaches. Trends Cogn. Sci. 10, 233–238. https://doi.org/10.1016/j.tics.2006.03.006.
46. Cleeremans, A., and McClelland, J.L. (1991). Learning the structure of event sequences. J. Exp. Psychol. Gen. 120, 235–253. https://doi.org/10.1037/0096-3445.120.3.235.
47. Chiarella, S.G., Simione, L., D’Angiò, M., Saracini, C., Raffone, A., and Di Pace, E. (2026). Implicit observational learning of second-order conditional repeated sequences presented in rapid serial visual presentation. Conscious. Cogn. 137, 103967. https://doi.org/10.1016/j.concog.2025.103967.
48. O’Reilly, R.C., and Rudy, J.W. (2001). Conjunctive representations in learning and memory: Principles of cortical and hippocampal function. Psychol. Rev. 108, 311–345. https://doi.org/10.1037/0033-295X.108.2.311.
49. Glimcher, P.W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis. Proc. Natl. Acad. Sci. 108, 15647–15654. https://doi.org/10.1073/pnas.1014269108.
50. Zacks, J.M., Kurby, C.A., Eisenberg, M.L., and Haroutunian, N. (2011). Prediction Error Associated with the Perceptual Segmentation of Naturalistic Events. J. Cogn. Neurosci. 23, 4057–4066. https://doi.org/10.1162/jocn_a_00078.
51. Smith, C.M., Thompson-Schill, S.L., and Schapiro, A.C. (2024). Rapid Learning of Temporal Dependencies at Multiple Timescales. J. Cogn. Neurosci. 36, 2343–2356. https://doi.org/10.1162/jocn_a_02232.
52. Narkhede, M.V., Bartakke, P.P., and Sutaone, M.S. (2022). A review on weight initialization strategies for neural networks. Artif. Intell. Rev. 55, 291–322. https://doi.org/10.1007/s10462-021-10033-z.
53. Rigotti, M., Barak, O., Warden, M.R., Wang, X.-J., Daw, N.D., Miller, E.K., and Fusi, S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature 497, 585–590. https://doi.org/10.1038/nature12160.
54. Mızrak, E., Bouffard, N.R., Libby, L.A., Boorman, E.D., and Ranganath, C. (2021). The hippocampus and orbitofrontal cortex jointly represent task structure during memory-guided decision making. Cell Rep. 37, 110065. https://doi.org/10.1016/j.celrep.2021.110065.
55. Chanales, A.J.H., Oza, A., Favila, S.E., and Kuhl, B.A. (2017). Overlap among Spatial Memories Triggers Repulsion of Hippocampal Representations. Curr. Biol. 27, 2307-2317.e5. https://doi.org/10.1016/j.cub.2017.06.057.
56. Schapiro, A.C., Kustner, L.V., and Turk-Browne, N.B. (2012). Shaping of Object Representations in the Human Medial Temporal Lobe Based on Temporal Regularities. Curr. Biol. 22, 1622–1627. https://doi.org/10.1016/j.cub.2012.06.056.
57. Barlow, H. (2001). Redundancy reduction revisited. Netw. Bristol Engl. 12, 241–253.
58. Fusi, S., and Abbott, L.F. (2007). Limits on the memory storage capacity of bounded synapses. Nat. Neurosci. 10, 485–493. https://doi.org/10.1038/nn1859.
59. Forest, T.A., Schlichting, M.L., Duncan, K.D., and Finn, A.S. (2023). Changes in statistical learning across development. Nat. Rev. Psychol. 2, 205–219. https://doi.org/10.1038/s44159-023-00157-0.
60. Peirce, J., Gray, J.R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., and Lindeløv, J.K. (2019). PsychoPy2: Experiments in behavior made easy. Behav. Res. Methods 51, 195–203. https://doi.org/10.3758/s13428-018-01193-y.
61. Hsu, N.S., Schlichting, M.L., and Thompson-Schill, S.L. (2014). Feature Diagnosticity Affects Representations of Novel and Familiar Objects. J. Cogn. Neurosci. 26, 2735–2749. https://doi.org/10.1162/jocn_a_00661.
62. Schlichting, M.L., Mumford, J.A., and Preston, A.R. (2015). Learning-related representational changes reveal dissociable integration and separation signatures in the hippocampus and prefrontal cortex. Nat. Commun. 6, 8151. https://doi.org/10.1038/ncomms9151.
63. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Preprint at arXiv, https://doi.org/10.48550/arXiv.1912.01703.
64. Kriegeskorte, N. (2011). Pattern-information analysis: From stimulus decoding to computational-model testing. NeuroImage 56, 411–421. https://doi.org/10.1016/j.neuroimage.2011.01.061.

Supplementary Information Appendix

Fig. S1. 2AFC test performance by direct and indirect conflict question subsets.
\nBar height reflects group-average accuracy on 2AFC context-dependent questions, split into indirect-conflict (light coloring) and direct-conflict (dark coloring) subsets, for Context A (left, orange bars) and Context B (right, green bars). ***p < 0.001; **p < 0.01; *p < 0.05.\n\nFig. S2. Visualization of stimulus presentation during learning phase.\nEach trial of the learning phase featured four stimuli arranged as depicted, with one object of interest (on which participants needed to make an ×/+ judgment) and three circular phase-scrambled objects presented in the remaining positions. A black or white border was present during Expt. 2. Visualization of objects and border is to scale.\n\nTable S1. Reaction times (mean ± standard deviation) in milliseconds by block for context-dependent pair objects. Item 1 is the first, unpredictable element of each pair; Item 2 is the second, predictable element informed by the associative expectation.\nBlock | Expt. 1 (Unsignaled), Item 1 | Expt. 1 (Unsignaled), Item 2 | Expt. 2 (Signaled), Item 1 | Expt. 2 (Signaled), Item 2\nBlock 1 | 712 ± 72 | 727 ± 75 | 710 ± 67 | 713 ± 57\nBlock 2 | 668 ± 77 | 679 ± 84 | 662 ± 76 | 664 ± 64\nBlock 3 | 650 ± 76 | 655 ± 85 | 648 ± 77 | 643 ± 62\nBlock 4 | 639 ± 74 | 639 ± 83 | 631 ± 71 | 626 ± 62\n\nS1 Confidence judgments and explicit knowledge assessments\n\nWe examined whether participants’ confidence ratings on the 2AFC task were related to their accuracy. 
Confidence was significantly higher for accurate than for inaccurate 2AFC responses on context-dependent trials in both experiments (Unsignaled: t(49) = 4.1, p < 0.001; Signaled: t(49) = 3.96, p < 0.001) and on context-independent trials in Expt. 1 (Unsignaled: t(37) = 2.6, p = 0.013) but not in Expt. 2 (t(39) = 0.38, p = 0.7) (Fig. S3A). Participants who responded entirely correctly or entirely incorrectly were excluded from the corresponding analysis; all participants had mixed accuracy on context-dependent trials. Overall, mean confidence ratings were around or below the midpoint of the scale, indicating generally low subjective certainty during the 2AFC task.\n\nFollowing the 2AFC task, participants completed two additional assessments designed to measure explicit knowledge of the temporal structure: a Structure Knowledge Probe and a Pair Reconstruction Task.\n\nFor the Structure Knowledge Probe, binary performance was assessed by manually evaluating whether participants articulated explicit awareness of the temporal pair structure in their written responses. This measure did not evaluate knowledge of the dual context structure (e.g., participants did not need to articulate awareness that there were two distinct contexts in which the associative pairings changed). Two independent raters coded all responses with 91% agreement; discrepancies were resolved by deferring to the more senior grader. Explicit knowledge of the pair structure was identified in 34.0% of participants in Expt. 1 and 40.0% in Expt. 2.\n\nThe number of entries on the Pair Reconstruction Task varied because participants could report between 1 and 20 pairs. To estimate chance performance, we ran Monte Carlo simulations, with 1,000 simulations for each possible number of reported pairs (k = 1–20). 
In each simulation, objects were sampled from the 11 unique objects with replacement between pairs but without replacement within a pair (i.e., no pair consisted of the same object twice). This produced, for each k, a null distribution of the proportion of correct entries expected by chance. This empirical approach matches the analytical solution: there were 9 correct pairs (because one pair was context-independent and thus correct in both contexts), and the probability of guessing any one specific pair by chance was 1/110 (an ordered choice of 2 of the 11 objects without replacement, giving 11 × 10 = 110 possibilities). Thus, the probability that a random entry matched one of the 9 correct pairs was 9/110, or 8.2%.\n\nParticipants reported an average of 7.4 ± 4 pairs in Expt. 1 and 7.9 ± 4 pairs in Expt. 2 (Fig. S3B). Group-level significance was assessed by testing whether the average number of correct context-independent pairs was greater than expected by chance across the simulations of all possible pair entry counts. Context-dependent pair entry performance was non-significant for both experiments (Unsignaled: mean = 2.06 pairs; p = 0.11; Signaled: mean = 2.08 pairs; p = 0.11; Fig. S3B). Moreover, only 36% of Expt. 1 participants and 34.7% of Expt. 2 participants (17 out of 49 participants; one participant did not complete this portion of the experiment) reported the context-independent pair.\n\nTaken together, these results indicate that most participants had little to no explicit knowledge of the temporal pair structure: they were generally unable to recall the context-independent pair, articulate the underlying pair structure, or reconstruct the context-dependent associations. Thus, the significant 2AFC performance reflecting context-dependent learning is unlikely to have been driven by explicit awareness. 
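The chance-level procedure described in S1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' analysis code: the integer object indices, the randomly chosen placeholder set of 9 correct ordered pairs, the random seed, and the names `make_correct_pairs` and `simulate_chance` are all assumptions introduced here. Only the counts (11 objects, 9 correct pairs, 1,000 simulations for each k from 1 to 20) come from the text.

```python
import random

N_OBJECTS = 11   # unique objects in the task (from the text)
N_CORRECT = 9    # number of correct ordered pairs (from the text)
N_SIMS = 1000    # simulations per number of reported pairs k (from the text)


def make_correct_pairs(rng):
    """Hypothetical stand-in for the true pair set: 9 distinct ordered pairs."""
    all_pairs = [(a, b) for a in range(N_OBJECTS)
                 for b in range(N_OBJECTS) if a != b]  # 11 * 10 = 110 pairs
    return set(rng.sample(all_pairs, N_CORRECT))


def simulate_chance(k, correct_pairs, rng, n_sims=N_SIMS):
    """Null distribution of proportion correct when k pairs are entered at random."""
    proportions = []
    for _ in range(n_sims):
        hits = 0
        for _ in range(k):
            # Without replacement within a pair, with replacement between pairs.
            first, second = rng.sample(range(N_OBJECTS), 2)
            hits += (first, second) in correct_pairs
        proportions.append(hits / k)
    return proportions


rng = random.Random(0)  # fixed seed so the sketch is reproducible
correct_pairs = make_correct_pairs(rng)
null = {k: simulate_chance(k, correct_pairs, rng) for k in range(1, 21)}

analytic = N_CORRECT / (N_OBJECTS * (N_OBJECTS - 1))  # 9/110
mean_k20 = sum(null[20]) / len(null[20])
print(f'analytic chance rate: {analytic:.3f}; simulated mean at k = 20: {mean_k20:.3f}')
```

Because every random entry is an ordered draw of two distinct objects, the simulated mean proportion correct should approach 9/110 for any k. The placeholder pair set determines only which entries count as hits, not the chance rate itself.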
\n\nFig. S3. Metacognitive awareness assessment results.\n(A) Average 2AFC confidence rating for context-dependent (left) and context-independent (right) questions for accurate (pink) and inaccurate (blue) question responses. Horizontal dashed line indicates midpoint of the confidence scale, and error bars reflect SEM. (B) Pair Reconstruction Task performance: bar height reflects group average number of total pairs reported (left) and correct context-independent pairs reported (right). Horizontal line reflects chance performance based on Monte Carlo simulations; error bars reflect SEM.\n\nS2 Model architecture comparison\n\nIn the main paper, we analyze a GRU model whose training was constrained to a single epoch, equivalent to the total sequence exposure of each human participant. Here, we justify the choice of the GRU architecture by comparing it with two simpler models: a feedforward neural network (FFNN) and a vanilla recurrent neural network (RNN) that lacked gated recurrent units. One learning phase sequence and one set of 2AFC questions were generated for each model in the same way as for human participants (1,600 objects), and each epoch of training consisted of updating model weights to predict the next item in this sequence, followed by an assessment of 2AFC accuracy with frozen weights. For each model, we continued this process for a total of 50 epochs (i.e., 50 times the sequence exposure given to human participants). 
All models here used the default PyTorch weight initialization, in which weights are drawn from a uniform distribution bounded by plus or minus the inverse of the square root of the layer size; this bound was approximately 0.08 for the hidden layer.\n\nThe simplest architecture, the FFNN, achieved an overall context-dependent accuracy of almost 75% (Fig. S4A). However, this performance was entirely driven by near-perfect accuracy on Context B, the most recently trained context, while accuracy on Context A remained near chance. This indicates that the FFNN retained knowledge only about the most recent associations, completely overwriting previously learned, conflicting ones, which is a hallmark of catastrophic interference.\n\nRNNs improve on feedforward model capabilities by combining information from past hidden states with the current state, enabling them to process sequential input. However, the RNN showed no improvement in overall context-dependent accuracy compared to the FFNN (Fig. S4B). While Context A performance did increase over learning, this improvement came at the expense of Context B performance, suggesting that the RNN is also prone to interference.\n\nThe RNN with GRUs, an advanced RNN variant, overcomes these limitations by using update and reset gates to manage long-term dependencies more effectively. Initially, GRU performance was comparable to the FFNN (Fig. S4C). However, with extended training (approximately 20 epochs; i.e., 20 times the exposure of human participants), the GRU achieved comparably high accuracy on both Context A and Context B. This performance likely stems from the GRU’s architectural advantages. 
The update gate controls how much new input influences retained memory, allowing the model to ignore unreliable input (such as noisy between-pair transitions). The reset gate allows the selective clearing of irrelevant information in response to context changes, thereby avoiding interference from outdated associations. Given that the GRU model provides the best account for context-dependent learning in humans, we used the GRU model for all of the single-epoch modeling analyses reported in the main text.\n\nFig. S4. 2AFC task performance for three neural network model classes.\n(A–C) 2AFC performance accuracy on context-dependent questions averaged for 50 individual models for each architecture after each of 50 epochs of training; accuracy is plotted separately for Context A (orange), Context B (green), and both contexts combined (blue) for each model class: (A) simple feedforward neural network (FFNN), (B) vanilla recurrent neural network (RNN), and (C) recurrent neural network with gated recurrent units (GRU). All models achieved perfect accuracy on context-independent questions after the first epoch (not pictured).\n\nS3 GRU performance with perceptual object representations\n\nThe main paper used one-hot vector representations for each object in the modeling analysis. 
This choice ensured that all objects were represented equally and orthogonally, such that any structure emerging in the hidden layer reflected purely learned associations rather than preexisting similarities among the inputs. Here, we present the same analysis using perceptual object representations that more closely approximate the visual experience of human participants in the task. Perceptual object representations were generated by inputting each object image (without the overlaid plus or minus symbol) into AlexNet (1) and then applying PCA to reduce the dimensionality to 11 dimensions, matching the number of input and output dimensions of the original model. GRU networks were trained on the same context-dependent sequential prediction task as in the main text, using cosine similarity as the loss function, and assignment of objects to specific pairs was randomized for each model in the same way as for human participants.\nAs shown in Fig. S5, the relationship between initialized weight variance and 2AFC accuracy retained the same non-monotonic profile observed in the models that used one-hot input coding (Fig. 4A). In addition, 2AFC performance was higher for all question subsets. This improvement is unsurprising: the use of AlexNet embeddings introduces a strong visual prior that allows the model to exploit shared perceptual features when making predictions, thereby obscuring the interpretability of how the hidden layer activity represents the task’s temporal associative structure. For example, the model could leverage arbitrary similarities in dimensions such as shape and color to bias its 2AFC responses. In contrast, one-hot encodings constrain all non-active dimensions to zero, ensuring that any hidden layer structure arises exclusively from learning the task’s associative regularities. 
Taken together, these results confirm that the core finding of optimal task performance emerging at moderate weight initialization variance holds across both input representations. We therefore focus analysis on models trained with one-hot object encodings because they provide a controlled representational space in which hidden layer structure reflects learning of the task’s associative structure rather than preexisting perceptual relationships among stimuli.\n\nFig. S5. 2AFC performance with perceptual object embeddings.\n2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized with increasing variance along the x-axis, color-coded by question category.\n\nS4 GRU performance with a fully overlapping stimulus set (no context-specific objects)\n\nThe pair assignment across contexts for both models and human participants included one possible caveat to our claim of latent context-dependent learning: for one of the context-dependent pair types, the second (paired) item appeared in only one of the contexts while the first item occurred in both (visualized in bottom row of Fig. 1B). 
This structure still required humans and the models to update their prediction of the second item based on the inferred context (consistent with all other context-dependent pairs), but also meant that the context-specific second item could have been used as a context cue independent of recent sequence history (Expt. 1) and/or border color (Expt. 2). In other words, an encounter with a context-specific object could be an indicator that the state of the world has changed and thus that one’s associative predictions should be updated. We note that our decision to include this pair type was motivated by our intention to collect fMRI data with this paradigm, which will allow us to assess changes in neural representational geometry when an object’s associative identity remains constant across contexts, providing a baseline for evaluating relative changes in other pair conditions.\n\nTo evaluate whether the presence of context-specific objects influenced model learning, we ran neural network simulations in which such pairs were removed and replaced with context-dependent pairs for which both objects could occur in either context. These models were trained on the same task, but the context-dependent pairs were reconfigured to maintain the overall object set, with second-item assignments shuffled across contexts. Model input and output dimensions were therefore reduced to 10, corresponding to the 10 unique object encodings needed to instantiate this modified pair set. All other training parameters and analysis of 2AFC performance were identical to those in the main text.\n\nAs shown in Fig. S6, model performance across weight initializations closely mirrored the results of the main analysis (Fig. 
4A), indicating that learning dynamics and context-dependent accuracy were unaffected by the presence or absence of the context-specific object. This suggests that such objects did not serve as reliable context cues for our models.\n\nFig. S6. 2AFC performance with no context-specific objects measured by weight variance.\n2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized with increasing variance along the x-axis, color-coded by question category.\n\nS5 Determination of hidden layer size\n\nThe main paper analyzes a GRU model with 150 hidden layer units. To assess whether model capacity influenced learning performance, we trained GRU models with reduced hidden layer sizes of 50 and 100 units. As shown in Fig. S7, all models ultimately achieved comparable performance, converging to approximately 90% accuracy. However, models with fewer hidden layer units exhibited slower learning trajectories, requiring more training to reach the same level of performance as the model with 150 units. These results suggest that while increasing the number of hidden units accelerates learning, overall task performance is largely independent of model size.\n\nFig. S7. 2AFC task performance for GRU models with varying hidden layer sizes. 
\n(A–C) 2AFC performance accuracy on context-dependent questions averaged over 50 individual models for each hidden layer size after each epoch of training for Context A (orange), Context B (green), and both contexts combined (blue) for GRU models with (A) 50 hidden layer units, (B) 100 hidden layer units, and (C) 150 hidden layer units.\n\nReferences\n\n1. A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks in Advances in Neural Information Processing Systems, (Curran Associates, Inc., 2012).","source_license":"CC-BY-4.0","license_restricted":false}