Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning

doi:10.1101/2025.07.31.667923

Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning

2025 · doi:10.1101/2025.07.31.667923

preprint OA: closed

📄 Open PDF Full text JSON View at publisher

Full text 58,519 characters · extracted from preprint-html · click to expand

Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning | bioRxiv /* */ /* */ <!-- <!-- /*! * yepnope1.5.4 * (c) WTFPL, GPLv2 */ (function(a,b,c){function d(a){return"[object Function]"==o.call(a)}function e(a){return"string"==typeof a}function f(){}function g(a){return!a||"loaded"==a||"complete"==a||"uninitialized"==a}function h(){var a=p.shift();q=1,a?a.t?m(function(){("c"==a.t?B.injectCss:B.injectJs)(a.s,0,a.a,a.x,a.e,1)},0):(a(),h()):q=0}function i(a,c,d,e,f,i,j){function k(b){if(!o&&g(l.readyState)&&(u.r=o=1,!q&&h(),l.onload=l.onreadystatechange=null,b)){"img"!=a&&m(function(){t.removeChild(l)},50);for(var d in y[c])y[c].hasOwnProperty(d)&&y[c][d].onload()}}var j=j||B.errorTimeout,l=b.createElement(a),o=0,r=0,u={t:d,s:c,e:f,a:i,x:j};1===y[c]&&(r=1,y[c]=[]),"object"==a?l.data=c:(l.src=c,l.type=a),l.width=l.height="0",l.onerror=l.onload=l.onreadystatechange=function(){k.call(this,r)},p.splice(e,0,u),"img"!=a&&(r||2===y[c]?(t.insertBefore(l,s?null:n),m(k,j)):y[c].push(l))}function j(a,b,c,d,f){return q=0,b=b||"j",e(a)?i("c"==b?v:u,a,b,this.i++,c,d,f):(p.splice(this.i++,0,a),1==p.length&&h()),this}function k(){var a=B;return a.loader={load:j,i:0},a}var l=b.documentElement,m=a.setTimeout,n=b.getElementsByTagName("script")[0],o={}.toString,p=[],q=0,r="MozAppearance"in l.style,s=r&&!!b.createRange().compareNode,t=s?l:n.parentNode,l=a.opera&&"[object Opera]"==o.call(a.opera),l=!!b.attachEvent&&!l,u=r?"object":l?"script":"img",v=l?"script":u,w=Array.isArray||function(a){return"[object Array]"==o.call(a)},x=[],y={},z={timeout:function(a,b){return b.length&&(a.timeout=b[0]),a}},A,B;B=function(a){function b(a){var a=a.split("!"),b=x.length,c=a.pop(),d=a.length,c={url:c,origUrl:c,prefixes:a},e,f,g;for(f=0;f<d;f++)g=a[f].split("="),(e=z[g.shift()])&&(c=e(c,g));for(f=0;f<b;f++)c=x[f](c);return c}function g(a,e,f,g,h){var i=b(a),j=i.autoCallback;i.url.split(".").pop().split("?").shift(),i.bypass||(e&&(e=d(e)?e:e[a]||e[g]||e[a.split("/").pop().split("?")[0]]),i.instead?i.instead(a,e,f,g,h):(y[i.url]?i.noexec=!0:y[i.url]=1,f.load(i.url,i.forceCSS||!i.forceJS&&"css"==i.url.split(".").pop().split("?").shift()?"c":c,i.noexec,i.attrs,i.timeout),(d(e)||d(j))&&f.load(function(){k(),e&&e(i.origUrl,h,g),j&&j(i.origUrl,h,g),y[i.url]=2})))}function h(a,b){function c(a,c){if(a){if(e(a))c||(j=function(){var a=[].slice.call(arguments);k.apply(this,a),l()}),g(a,j,b,0,h);else if(Object(a)===a)for(n in m=function(){var b=0,c;for(c in a)a.hasOwnProperty(c)&&b++;return b}(),a)a.hasOwnProperty(n)&&(!c&&!--m&&(d(j)?j=function(){var a=[].slice.call(arguments);k.apply(this,a),l()}:j[n]=function(a){return function(){var b=[].slice.call(arguments);a&&a.apply(this,b),l()}}(k[n])),g(a[n],j,b,n,h))}else!c&&l()}var h=!!a.test,i=a.load||a.both,j=a.callback||f,k=j,l=a.complete||f,m,n;c(h?a.yep:a.nope,!!i),i&&c(i)}var i,j,l=this.yepnope.loader;if(e(a))g(a,0,l,0);else if(w(a))for(i=0;i (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0];var j=d.createElement(s);var dl=l!='dataLayer'?'&l='+l:'';j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;j.type='text/javascript';j.async=true;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-M677548'); Skip to main content Home About Submit ALERTS / RSS Search for this keyword Advanced Search New Results Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning View ORCID Profile Yifei Cao , View ORCID Profile Maria K. Eckstein doi: https://doi.org/10.1101/2025.07.31.667923 Yifei Cao 1 Institute of Neuroinformatics , ETH Zurich, 190 Winterthurstrasse, Zurich, Zurich 8057, Switzerland Find this author on Google Scholar Find this author on PubMed Search for this author on this site ORCID record for Yifei Cao For correspondence: yifcao{at}student.ethz.ch Maria K. Eckstein 2 Google DeepMind, Mountain View , California 94043, US Find this author on Google Scholar Find this author on PubMed Search for this author on this site ORCID record for Maria K. Eckstein Abstract Full Text Info/History Metrics Preview PDF Abstract Reversal learning tasks provide a key paradigm for studying behavioral flexibility, requiring individuals to update choices in response to shifting reward contingencies. While reinforcement learning (RL) models have been widely used to provide interpretable explanations of human behavior on similar learning tasks, recent work has revealed that they often fail to fully account for the complexity of learning dynamics. In contrast, artificial neural networks (ANNs) often achieve substantially higher predictive accuracy, but lack the interpretability afforded by classical RL models. To bridge this gap, we introduce HybridRNNs—neural-cognitive models that integrate RL-inspired structures with flexible recurrent architectures. Among them, Context-ANN incorporates latent reward history and choice perseverance, and demonstrates improved alignment with human behavior compared to traditional RL models across two datasets. While not perfectly replicating human strategies, Context-ANN achieves comparable predictive accuracy to generic RNNs and offers interpretable value representations. Additional analyses of hidden dynamics reveal structured, context-sensitive internal states that adapt following reversals. These results suggest that humans may rely on more flexible or context-sensitive learning strategies even in simple reversal tasks, and highlight the potential of HybridRNNs as cognitive models that balance interpretability and predictive accuracy. Introduction The ability to learn flexibly is of vital importance in a rich and dynamic world ( MacDowell et al., 2022 ). In cognitive neuroscience research, reversal learning tasks have been extensively used to study behavioral flexibility ( Groman et al., 2019 ; Rudebeck et al., 2013 ). Unlike classical reward learning tasks, where reward contingencies remain stable, reversal learning tasks require participants to first learn the association between choices and reward outcomes over a series of trials, only for these contingencies to later be reversed. The ability to quickly adapt behavior and switch to the newly rewarding option after a reversal reflects an individual’s behavioral flexibility in reward learning ( Bartolo & Averbeck, 2020 ). To understand the computations underlying flexible reversal learning, researchers have developed various computational models, including reinforcement learning (RL) ( Daw et al., 2011 ; Farashahi et al., 2017 ; Wilson & Collins, 2019 ; Eckstein et al., 2022 ) and Bayesian inference ( Costa et al., 2015 ; Eckstein et al., 2022 ). Here, we use the term “RL” following conventions in cognitive modeling, referring to trial-by-trial learning driven by reward prediction errors. This includes classical models such as Rescorla–Wagner ( Rescorla, 1972 ), widely adopted as a foundational RL algorithm in this context. These models update value estimates using simple rules and few parameters, enabling interpretable accounts of learning. Extensions have further improved their flexibility in reversal tasks by incorporating adaptive learning rates or value resets ( Hauser et al., 2014 ; Sidarus et al., 2019 ; Barnby et al., 2022 ). Howeverever, simple, hand-crafted models often fail to fully capture behavioral data ( Peterson et al., 2021 ; Miller et al., 2024 ; Eckstein et al., 2024 ). Moreover, their reliance on predefined assumptions can introduce biases and researcher subjectivity, potentially leading to inaccurate or incomplete representations of behavior (Feher da Silva & Hare, 2020 ). An alternative approach to understanding cognitive computations is to discover the underlying cognitive strategies automatically from observed behavior. One such approach involves modeling human behavior using artificial neural networks (ANNs). Compared to classical cognitive models, ANNs require no handcrafted assumptions and structural hypotheses while offering the flexibility to approximate a wide range of models by adjusting their weights. Previous studies have successfully applied ANNs to model human behavior in decisionmaking ( Dezfouli et al., 2019 ), reward learning ( Song et al., 2021 ), and cognitive flexibility ( Jaffe et al., 2023 ), amongst others, achieving higher predictive accuracy in forecasting future behavior. However, the large number of parameters in ANNs poses a major interpretability challenge, necessitating additional efforts to uncover the cognitive and computational strategies they employ. To achieve both predictive accuracy and interpretability in modeling learning process, recent techniques have been developed such as DisRNN ( Miller et al., 2024 ), Tiny RNN (JiAn et al., 2023), and HybridRNN ( Eckstein et al., 2024 ). We use the HybridRNNs approach here, which integrates the flexibility of ANNs with the interpretable structure of RL models. Specifically, HybridRNN ( Eckstein et al., 2024 ) replaces the hand-crafted Q-value updating function and choice perseverance mechanisms with neural networks, hereby allowing the model to discover the arbitrary functions that underlie these processes. In this way, HybridRNN allows the incorporation of additional cognitive factors relevant to reward learning, such as context information ( Palminteri et al., 2015 ) and various types of memory that go beyond RL ( Collins & Frank, 2012 ; Davidow et al., 2016 ; Gershman & Daw, 2017 ), as inputs to assess their impact on model performance. This framework enables an automated identification of the cognitive mechanisms that support human flexible learning under different conditions. In this study, we fitted HybridRNN models to human reversal learning data. We first compared classical RL models and artificial neural networks (ANNs), identifying a clear gap in predictive accuracy. To address this, we developed a series of HybridRNNs by replacing hand-crafted value update rules with learnable neural components. Among these, the Context-ANN—incorporating choice perseverance and context inputs—achieved the highest predictive accuracy. We further conducted qualitative comparisons and found that Context-ANN more closely resembled human behavior during reversals than other models. Analysis of its value updating channel revealed a context-dependent updating strategy. Additionally, we examined the model’s hidden dynamics, identifying structured population activity that adapts to task changes, including distinct attractor-like clusters and reversal-aligned transitions. Together, these findings highlight the utility of HybridRNNs for both fitting behavior and probing the internal computations that support flexible learning. Methods Experimental Design and Dataset The reversal learning task is a paradigm designed to assess an individual’s ability to adapt their behavior in response to a dynamically changing reward environment. In our task, participants were presented with two choices, C 1 and C 2 , each associated with continuous reward r 1 and r 2 . One choice initially offers a larger reward magnitude than the other for several trials. However, in the current task, at unpredictable intervals and without explicit cues, the reward contingencies reverse. Following this reversal, the previously more rewarding choice becomes the less rewarding one, requiring participants to adjust their decision-making strategies accordingly. The present study analyzed human behavioral performance using an open-access dataset from Lee et al. (2023) . In their two-armed bandit task, the mean reward distributions associated with each option followed a random walk over time. The mean rewards, , for the two options o ∈ {1, 2} were symmetrical, such that . Reversals were defined as the time points where the mean rewards for the two options intersected (see Fig. 1C ). Following the filtering criteria outlined in Lee et al. (2023) , we included 669 sessions of data, each consisting of 80 trials. To further examine the generalizability of our findings, we also analyzed another independent reversal learning dataset from Eckstein et al. (2022) . Detailed results and methods are provided in the Supplementary Materials. Download figure Open in new tab Figure 1: Task design and value updating in HybridRNNs. (A) On each trial, participants choose between two options and receive feedback. (B) HybridRNNs replace hand-crafted RL update rules with a neural network to learn value updates. (C) Reward schedule over 80 trials. Colored dots show actual rewards for each option, and dashed lines indicate the underlying mean rewards. Vertical gray lines mark reversal points. Model Architectures Reinforcement Learning Models To identify the best classic cognitive model, we systematically compared reinforcement learning models with different structures ( Wilson & Collins, 2019 ). The Rescorla-Wagner (RW) model assumes that learning is driven by unexpected outcomes, including surprising occurrence or omission of reward during associative learning ( Rescorla, 1972 ). In a RW model, the prediction error (PE) signal defines the difference between observed and expected reward. This PE signal is weighted by a learning rate parameter when individuals update their reward expectations. According to this model, the agent’s expected reward for the chosen action on trial t, Q ( t ) is calculated as follows: In this equation, Q t ( a ) refers to the reward value expectation for the chosen option at trial t, Q t −1 is the reward value expectation for the chosen option at trial t − 1, the r t −1 (given that rewards were normalized to 0 and 1) refers to the reward actually perceived by participants at trial t − 1, and the α represents the learning rate. We call this processing step the “value channel” because it processes reward-related value information. At trial t , for action selection, the vector Q ( t ) of all action values is transformed into a vector of choice probabilities p t of the same length using the softmax function. The transformation is determined by a free parameter “inverse decision temperature” β, to induce more deterministic or more random choices: For increasingly flexible RL models, we followed the work of Eckstein et al. (2022) to include counterfactual value updating and choice persistence into the framework. With counterfactual value updating, in each trial t , the value of the unchosen action is also updated, using an “imaginary” counterfactual reward that is the inverse of the actually received reward 1− r t −1 (given that rewards were normalized to 0 and 1), and based on the same learning rate as the chosen action: In this equation, Q t ( na ) refers to the reward value expectation for the unchosen option at trial t . We call this additional, parallel processing step the “counterfactual value channel”. Finally, we extended the RL models with “perseverance”, which enables action repetition independently of rewards. The “perseverance” term c adds a small “bonus” (of size κ) to the value of the action a that was chosen on the previous time step, but not to all other actions na : RL-ANN Architecture The general idea of creating RL-ANN is to replace the value and perseverance updating function from Best RL with neural networks. For the three ‘channels’ designed in classic RL models, their hand-crafted updating functions (see Equations (1,2,3)) were replaced with a simple neural network. For each trial, RL-ANN’s value channel takes the reward r t −1 and the value of chosen action Q t −1 from the previous trial as inputs, just like the Simple RL model: The hidden layer activations, referred to as the “state” , are computed by processing the input vector through the network’s first fully-connected layer. Specifically, the input vector is multiplied by the weight matrix , the bias term is added, and the result is passed through a tanh activation function to introduce non-linearity: The output of value channel is the update to the value of chosen action Q t ( a ), and is calculated by passing the state through the second fully-connected layer with weights and biases The only difference between Simple RL and RL-ANN’s value channels hence is that we replaced the linear updating function from Simple RL to neural networks in RL-ANN. We next turned to the counterfactual value channel, which has the same structure as the value channel: it takes previous reward r t −1 and value of unchosen action Q t −1 ( na ) as input and predicts the update to the current value of the unchosen action: Hence, RL-ANN’s counterfactual channel has the same structure as Simple RL’s. The perseverance channel is also a 3-layer fully-connected multilayer perceptron (MLP). The input to the channel is the action from the previous trial a t −1 , and the output of the channel is a vector c t ∈ ℝ 2 . For binary choice tasks, each dimension of c t represents the perseverance scalar corresponding to one of the two available actions. Finally, depending on the need to combine the channels, the values Q t ( a ), counterfactual values Q t ( na ), and perseverance C t are combined additively and then pass through the softmax to select an action for the next trial: Context-ANN Architecture Context-ANN extends RL-ANN by adding additional context information as input to the network, and is the winning model in the current paper. We provide context information for value channel with the vector Q t −1 and perseverance channel with the vector c t −1 as additional inputs, giving rise to access of value and perseverance of all possible choices ( Palminteri et al., 2015 ). Besides the difference in model inputs, the model structure and outputs stay the same with RL-ANN. Memory-ANN Architecture In another approach of extending RL-ANN, we added states of hidden units from previous trial and as additional inputs. This modification of input enables the model to access its own compressed representation of the past history. Vanilla RNN architecture Vanilla RNN is a simple, fully-connected recurrent neural network with three layers. On each trial t , it receives the previous action a t −1 (one-hot encoded) and reward r t −1 as input. The hidden state s t is updated based on the current input and the previous hidden state. This hidden state is then passed through an output layer to generate action logits, which are converted into probabilities using a softmax function: Neural Network Training We divided the dataset into three parts: 80% for training (537 sessions), 10% for testing (66 sessions), and 10% for validation (66 sessions). The training data was used to fit model parameters across various configurations of model weights and biases. We assessed models’ final model fitting by calculating the fit on the held-out testing data, ensuring they did not overfit to the training data. The networks were trained in Haiku using the Adam optimizer with a learning rate of 0.005. Training was performed on batched data, incorporating a cross-entropy loss and L2-regularization on recurrent weights (with coefficients 1e-5). The maximum number of training step we used was 200,000. Early stopping was applied if, within the most recent 200 training steps, the mean validation loss of the last 50 steps did not improve compared to the first 50 steps. Model Fitting Objective Models were trained to replicate human behavior (rather than to optimize task performance), by minimizing the negative log-likelihood loss of the training data under the model (cross-entropy): where bs is the batch size and n trials = 80. The best model for each configuration of hyperparameters was evaluated on test data, with losses converted to trial-wise prediction accuracy: Model performance was assessed using the Akaike Information Criterion (AIC): where k is the number of parameters. The AIC is a standard measure of model fit that balances model fit and complexity. Qualitative Model Comparison Besides quantitative model fitting, we also wanted to make sure that the trained models behaved similar to human participants. To this end, we performed several behavioral analyses to test the similarity between human and model behaviors. Model Staying Behavior We analyzed the probability of repeating the previous action ( a t = a t +1 ), which we refer to as “staying behavior”. This probability was calculated across four reward histories, defined by combinations of current and previous rewards ( R t −1 and R t ), and we binned the data with reward larger or smaller than 50. For each condition, the staying probabilities were computed for human participants and each model. These probabilities were averaged across participants, and paired t -tests were conducted to compare the behavior of human participants and models in each combination of previous and current rewards. Model Inspection To investigate how the reward module maps mapped rewards r t −1 onto values Q t +1 ( a ), we probed the module with a range of input values and analyzed its output. Input values were uniformly sampled within defined ranges. r t −1 values were sampled between 1 and 100, and Q t −1 ( a ) values were sampled between the 10th and 90th percentiles of observed values. The relevant parameters were extracted from the trained RL-ANN and Memory-ANN models. A new MLP was initialized with the same architecture as the original reward module (e.g., for Memory-ANN: 2 input units, 16 hidden units, and 1 output unit, for Context-ANN: 2 input units, 8 hidden units, and 1 output unit) for further analysis (see Table4 for details of each model). Hidden Unit Dynamics Analysis We extracted hidden state activations from the trained models by running them on the full dataset, aligning them to reversal trials. To reduce dimensionality and visualize temporal dynamics, we applied PCA to the concatenated hidden states across all trials and sessions. The first three principal components were used for trajectory visualization and condition-based comparisons (see Supplementary Methods for details). Results Quantitative Model Comparison We first modeled the dataset using two extremes: the simplest RL model with only α and β, and a generic Vanilla RNN. Predictive accuracy improved from Simple RL (Normalized Likelihood = 63.5%, AIC = 4880.51) to Vanilla RNN (67.0%, AIC = 4533.33), indicating that classical models missed to capture predictable and systematic behavioral patterns. This highlights the flexibility of neural networks in modeling the complexity of reversal learning. Although Vanilla RNN achieved high predictive accuracy in modeling human behavior, its large number of fitted parameters makes interpretation challenging. This raises the question: what features contribute to its higher predictability? One possibility is that the value updating rule in Simple RL is overly rigid. A traditional approach would involve manually refining the update function ( Hauser et al., 2014 ; Barnby et al., 2022 ), but we instead replaced the entire update process with an ANN, implicitly testing a broad range of possible modifications. To improve model interpretability while maintaining high predictive accuracy, we developed a series of cognitive and hybrid models by gradually adding computations (see Table 1 ). Compared to Simple RL, its hybrid counter-part RL-ANN (see Fig. 2A ) improved predictive accuracy ( AIC difference = 117.01), suggesting that a more flexible value update function is necessary. Since prior research indicates that humans update the values of unchosen actions ( Palminteri et al., 2016 ; Eckstein et al., 2022 ), we introduced counterfactual updating to both Simple RL and RL-ANN. Even though it had small effect in simple RL ( AIC difference = 37.12), it did in RL-ANN ( AIC difference = 223.42), showing that human learners adjust the values of choices whose outcomes they have not directly observed. Next, we incorporated choice perseverance, a widely used extension in RL models ( Eckstein et al., 2022 ) (see Fig. 2B ), and found that it enhanced prediction in both cognitive ( AIC difference = 248.11) and hybrid models ( AIC difference = 86.36). Finally, we examined whether combining counterfactual updating and choice perseverance would further improve performance. Results showed that this combination maximized the RL model’s predictability ( AIC difference = 350.52) but did not significantly improve RLANN beyond counterfactual updating alone ( AIC difference = −80.79). View this table: View inline View popup Download powerpoint Table 1: Quantitative model comparisons of cognitive and hybrid model series. Best-performing models are highlighted in bold. Download figure Open in new tab Figure 2: Model architectures of classic cognitive models and HybridRNN models. Since none of these models reached the predictability of Vanilla RNN (see Table 2 ), we investigated whether hybrid models benefited from additional context or memory information. Model comparisons identified Context-ANN with a perseverance channel (see Fig. 2D ) as the best-performing model, showing the most similar predictive accuracy with Vanilla RNN ( AIC difference = 54.56). Taken together, quantitative model comparisons suggest that Context-ANN achieves among the highest predictive performance across all cognitive and HybridRNN models while offering greater interpretability than Vanilla RNN; this finding is further supported by results on the independent Eckstein dataset (see Table 6 and Table 7 ). View this table: View inline View popup Download powerpoint Table 2: Quantitative model comparisons of cognitive models, RNN, and HybridRNN models. Best-performing model is highlighted in bold. Qualitative Model Comparison Besides quantitative model comparisons, it is also important to qualitatively examine whether model’s behavior pattern align with human choices ( Palminteri et al., 2017 ). We first calculated the mean reward received by each model agent and actual human behavior around reversal trials (see Fig. 3A ). Although none of the models perfectly matched human performance, both Context-ANN and Vanilla RNN exhibited reward trajectories that were qualitatively most similar to human behavior at the point of reversal. Possible reasons for the remaining deviations are discussed in the Discussion section. Download figure Open in new tab Figure 3: Qualitative model comparison results. (A) Mean rewards received by human and each model around the reversal trials, the reward at chance level is 0.5. (B) Probability of staying behavior of human choices and model behaviors, t-test was performed to compare between human and each model. (C) Mean normalized likelihood of each model on human held-out data, aligned to reversal trials. Context-ANN and Vanilla RNN show the largest predictive accuracy. We then calculated the “staying behavior” (see Methods section) of each model and actual human behavior, and performed pairwise t-tests between model behavior and human choices. The result showed that, although all models showed significant difference with human choices in at least one reward condition, Context-ANN (with perseverance channel) showed relatively smaller deviations compared to other models. Notably, Context-ANN overestimated the probability of staying when no reward was received for two consecutive trials (see Fig. 3B ). Taken together, although some discrepancies remain, Context-ANN emerges as the most promising model based on its overall balance between predictive accuracy and alignment with key qualitative patterns of human behavior. We also computed trial-wise normalized log-likelihood (NLL) aligned to reversals (see Fig. 3C ). Context-ANN and Vanilla RNN achieved the highest overall predictive accuracy, even though both showed a transient drop in performance following reversals. Information Processing in Context-ANN To examine the internal learning dynamics of Context-ANN, we systematically simulated value updates across a range of rewards and action values ( r t , Q t ( a ), Q t (na)) and observed the corresponding outputs Q t +1 ( a ). While traditional RL models update action values linearly based on the reward prediction error (see Fig. 4A ), Context-ANN displayed a non-linear reward-to-update mapping (see Fig. 4B ), indicating a more flexible learning strategy. This non-linearity was further modulated by the value of the unchosen option: when Q t ( na ) was high, updates to the chosen action value were attenuated; when Q t ( a ) was low, the model responded more strongly to reward signals. Download figure Open in new tab Figure 4: Value updating functions of Best RL and Context-ANN. (A) Linear updating function of classical RL model. (B) Nonlinear and context dependent value updating functions of Context-ANN. Note that the value scale differs between (A) and (B) because Context-ANN does not constrain value outputs within the [0, 1] range. Additionally, we observed a sharp inflection in the update function around r t = 0.5, suggesting heightened sensitivity to reward values near this threshold. This aligns with the task structure, where rewards around 0.5 indicate increased volatility and potential reversal. Such selective responsiveness would be adaptive, helping the model detect changes in the environment. Furthermore, the slope of the value update function varied systematically with Q t ( na ): when the unchosen action had low value, low rewards were grouped together; when Q t ( na ) was high, updates became more sensitive to high rewards. This pattern supports the interpretation that value updating in Context-ANN is modulated by the context defined by the alternative option’s value, consistent with prior work on context-sensitive learning strategies ( Palminteri et al., 2015 ). To quantify these effects, we conducted a linear regression predicting Q t +1 ( a ) from r t , Q t ( a ), Q t ( na ), and their interactions. The most pronounced effect was the negative interaction between r t and Q t ( na ) (β = − 0.32, p < .001), indicating that rewards were weighted more heavily when the alternative action had lower expected value. This context-dependent modulation deviates from the fixed-error structure of classical RL and supports the view that Context-ANN approximates human-like flexible learning by integrating choice context into its value updating process. Hidden Unit Dynamics Around Reversal Context-ANN exhibited distinct internal dynamics aligned to reversals. Two hidden units significantly increased activation following switch points (see Fig. 5A ), with ramping patterns observed around the reversal (see Fig. 5B ). The analysis on the extracted PCs confirmed there is a populational response to the switch points with the first PC showing decreased score after switching (see Fig. 5C and D ). Furthermore, to assess whether the hidden state encodes contextual value information, we trained a classifier to decode whether Q left > Q right from the hidden units. This yielded a 5-fold cross-validated accuracy of 0.788 using the hidden units and 0.764 using the top 3 principal components. Download figure Open in new tab Figure 5: Internal dynamics of Context-ANN aligned to reversal points. (A, C) T-values for each hidden unit comparing activation before and after reversals, with significant modulations highlighted in red. (B) Temporal dynamics of example hidden units around the switch point. (D) Principal components of hidden state activity show systematic changes around reversals. (E) 3D PCA projection of hidden states reveals four attractor-like clusters, with color indicating the difference in action values ( Q left ™ Q right ). PCA projection of the hidden states revealed four attractor-like clusters in a low-dimensional space (see Fig. 5E and Fig. 6A ). Within each cluster, hidden state positions varied smoothly with the value difference between actions ( Q left − Q right ), suggesting context-sensitive internal organization. Following reversals, the probability of the hidden state leaving its current attractor increased (see Fig. 7B ; Table 5 ), indicating adaptive reconfiguration of internal dynamics in response to changing environments. These results suggest that Context-ANN maintains structured, adaptive internal representations that flexibly reorganize in response to reversals. View this table: View inline View popup Download powerpoint Table 3: Linear regression results on predicting value of chosen action. View this table: View inline View popup Download powerpoint Table 4: Model size comparison for all candidate models. View this table: View inline View popup Download powerpoint Table 5: Detailed cluster transitions across different switch stages. Each cell indicates the number of transitions from one attractor to another during the specified switch transition. Download figure Open in new tab Figure 6: (A) PCA projection of hidden states reveals four distinct attractor-like clusters. Each cluster reflects a recurring internal representation pattern used by the network. (B) Proportions of cluster-to-cluster transitions aligned to reversal trials. Certain transition types (e.g., C0→C1, C2→C3) increase immediately after the reversal, while others (e.g., C1→C2, C3→C0) rise with a slight delay, indicating temporally specific reorganization of internal dynamics in response to environmental changes. Download figure Open in new tab Figure 7: Value updating functions of Best RL and Context-ANN in Eckstein data. (A) Mean normalized likelihood of each model on human held-out data, aligned to reversal trials. Context-ANN and Vanilla RNN show the largest predictive accuracy. (B) Linear updating function of classical RL model. (C) Nonlinear and context dependent value updating functions of Context-ANN. Discussion The present study applied classical cognitive modeling, neural network modeling, and HybridRNN to behavioral data from a reversal learning task. We improved the predictive accuracy of RL models to a level comparable to ANNs through an interpretable hybrid modeling structure, though important behavioral differences—such as the post-reversal dip—were not fully captured. Given the advantage of combining neural networks with cognitive modeling, the improvement in model prediction relied less on handcrafting and more on an automatic approach. Most importantly, HybridRNN enabled direct inspection of the trajectory of value drift, providing a platform to probe the value updating strategy that supports flexible learning. Our results suggest that Context-ANN provides the closest approximation to human behavior among the models considered. Further model inspection revealed nonlinear, context-dependent value updating functions, as well as structured hidden dynamics that reorganize around reversals, offering a promising approximation of the strategies supporting flexible adaptation. The Context-ANN model in the present study suggested that context information is critical to human flexible reversal learning, differing from previous studies that applied HybridRNN to four-arm bandit tasks, where memory information was identified as the most important factor ( Eckstein et al., 2022 ). Although adding memory input to RL-ANN improved its predictive accuracy for reversal learning behavior, its effect was not as good as that of context information. One possible reason is the different number of bandits used in this task, as Collins & Frank (2018) suggested that working memory plays a crucial role as the number of bandits increases. The two-armed bandit reversal learning task in the present study likely required a lower level of working memory to track bandit values. Furthermore, due to unpredictable switches in reward probabilities, the human brain recruits the prefrontal cortex to adapt to the changing environment rather than relying solely on the hippocampus ( Guise & Shapiro, 2017 ), suggesting a potential difference in behavioral strategies. Previous pharmacological studies have also shown that dopamine manipulation improved short-term memory but impaired reversal learning ( Mehta et al., 2001 ). Thus, over-reliance on memory in tasks with fewer choice options could negatively impact flexible learning behavior. Additionally, recent studies have interpreted reversal learning as a form of hidden state inference (e.g., Zika et al., 2023 ). From this perspective, the advantage of Context-ANN may stem from its capacity to approximate such inference mechanisms, as it integrates information from both choice options to infer the current latent state ( Mishchanchuk et al., 2024 ). While not explicitly modeling belief states, this implicit approximation may bring its behavior closer to the strategy humans use to solve the task. This aligns with the Hierarchical Gaussian Filter ( Mathys et al., 2011 ), which treat reversal learning as inference over latent states. Future comparisons with such models could help clarify the computational basis of flexible behavior. These might be possible reasons why Context-ANN outperformed Memory-ANN in predictive accuracy, on the current task, but not in the previous one with more choice options. The flexibility of Context-ANN enabled us to examine how value updating strategies vary across different reward magnitudes and choice values. Previous studies have highlighted the critical role of contextual information in value updating ( Palminteri et al., 2015 ) and the presence of nonlinearity in value updating functions ( Nassar et al., 2019 ; Ji-An et al., 2023 ). However, a hybrid modeling approach—which integrates classical cognitive modeling with neural network modeling—can effectively reveal and quantify how nonlinear value updating strategies are modulated by contextual factors. Specifically, Context-ANN demonstrated that as the value of the unchosen action increases, the chosen action value becomes less sensitive to reward. This changing relationship with reward may be linked to participants’ attention allocation, as attention during reversal learning has been shown to be directed toward higher-value options ( Oemisch et al., 2017 ). A higher unchosen action value could potentially reduce the attention allocated to the chosen action, thereby diminishing the extent to which its value is updated based on received rewards. To better interpret Context-ANN, we analyzed the hidden dynamics within Context-ANN model ( Mante et al., 2013 ; Flesch et al., 2022 ). Our results revealed structured activity patterns aligned with task-relevant variables, such as value differences and reversal points. These findings suggest that even in recurrent models with relatively simple architectures, internal representations can organize flexibly in response to changing task demands. This supports the idea that hybrid models can offer both predictive power and interpretable internal mechanisms, providing a valuable bridge between cognitive theory and neural network models. A key limitation of the present study is its relatively small sample size ( n trial = 53, 520). Given the large number of parameters in ANN-based models, a larger dataset is generally required for reliable estimation ( Alwosheel et al., 2018 ; Eckstein et al., 2024 ). The current sample size may not provide sufficient power to detect a statistically significant difference between the normalized likelihood of cognitive and neural network Models. Consequently, we relied on AIC scores as the primary model comparison metric, despite normalized likelihood is a more robust measure for cross-validation ( Ji-An et al., 2023 ). In addition, the limited complexity of the task—specifically its two-option structure and fully anticorrelated reward schedule—may also reduce behavioral variability, making it more difficult to distinguish between competing models. Future work should consider larger and more behaviorally rich datasets—e.g., involving more choice options or stochastic reward structures—to better evaluate and distinguish competing computational accounts of learning behavior. Supplementary Analysis and Results Details of Hidden Unit Dynamics Analysis To assess whether the model’s internal states encode contextual value information, we trained a logistic regression classifier to decode whether Q A > Q B on each trial. We used trials around reversal trials, and evaluated decoding accuracy using 5-fold cross-validation. Two types of input features were tested: (1) the full hidden unit activations (8 dimensions), and (2) the top three principal components (PCs) of the hidden states. All features were z-scored. The classifier was implemented using scikit-learn ‘s logistic regression with 1000 iterations. Decoding accuracy was 0.788 using the hidden units and 0.764 using the top 3 PCs. Next, to examine whether the model’s internal dynamics exhibit attractor-like clustering, we applied dimensionality reduction and clustering to the hidden unit dynamics. The resulting PCA projections were standardized and clustered using the DBSCAN algorithm ( eps = 0.5, min_samples = 10). This analysis revealed 4 distinct clusters in the low-dimensional space (see Fig. 6A ), along with a small number of unassigned noise points. These attractor-like clusters suggest that the network organizes its internal dynamics into discrete regimes, potentially reflecting different value contexts during learning. To quantify changes in internal state dynamics around reversals, we examined transitions between DBSCAN-identified clusters in the PCA-projected hidden state space. We tracked cluster membership transitions across adjacent trials, aligned to the reversal index (switch ∈ [− 7, +10]). For each transition (e.g., − 2→− 1), we computed the number of total transitions and the proportion that involved a change in cluster identity. This allowed us to estimate the likelihood of switching attractor states around reversal points. A detailed breakdown of cluster-to-cluster transitions is provided in Table5 , and overall transition dynamics are shown in Figure 6B . Overall, the proportion of inter-cluster transitions markedly increased following the reversal. While the transition rate remained relatively stable before reversal (mean change rate ≈ 36%), it sharply increased to over 40% during the first few trials post-reversal (e.g., 0→1: 42.06%, 1→2: 40.83%). This suggests that the Context-ANN reorganized its internal dynamics to adapt to the new environment, consistent with the notion of attractor reconfiguration. Importantly, different types of transitions peaked at different time points. Transitions such as C0→C1 and C2→C3 showed an immediate increase at the first trial after reversal (trial 0), whereas C3→C0 and C1→C2 peaked later (around trial 1–2). This temporal asymmetry indicates that different internal reconfigurations unfold over distinct timescales, possibly reflecting a two-stage adaptation process: an initial detection and destabilization of outdated representations, followed by the formation of new task-aligned attractor states. Taken together, these findings highlight the structured and temporally dynamic nature of internal state transitions in Context-ANN, supporting its ability to flexibly reorganize representations in response to environmental changes. Dataset from Eckstein et al., (2022) We analyzed behavioral data from a probabilistic reversal learning task published by Eckstein et al. (2022) . In this task, participants chose between two boxes on each trial in order to collect golden coins (i.e., rewards), which were probabilistically associated with the two options. The correct box was rewarded on 75% of trials, while the incorrect box never delivered rewards. Task contingencies switched unpredictably and without explicit cues, encouraging participants to track changes in the underlying reward structure. Reversals were only allowed after participants had collected a random number of rewards (between 7 and 15), and the first correct choice after a switch was always rewarded. Participants completed 120 trials, preceded by a tutorial phase that familiarized them with task rules and reversals. For the present study, we followed the original exclusion criteria to remove participants with low task engagement or missing demographic information. After exclusion, the final sample comprised 179 adolescents (under age 18; 96 males and 83 females), 57 undergraduates aged 18–28 (19 males and 38 females), and 55 adult community participants (26 males and 29 females), yielding a total sample size of 291 participants for model fitting and analysis. Quantitative and Qualitative Model Comparison on Eckstein Dataset To evaluate the robustness of our findings, we trained and tested the same series of cognitive models, RNNs, and HybridRNN models on the independent dataset from Eckstein et al. (2022) . As shown in Table 6 and Table 7 , we successfully replicated the main pattern observed in our primary dataset: replacing the hand-crafted value update function in traditional cognitive models with a neural network significantly improved model fit to human behavior. Notably, Context-ANN once again achieved predictive accuracy on par with the Vanilla RNN, further demonstrating that hybrid models can combine strong predictive performance with improved interpretability. In addition, following a reviewer’s suggestion, we performed a trial-wise normalized likelihood analysis aligned to the reversal point (see Fig. 7A ). This analysis provides a more intuitive view of model predictions across learning transitions. We found that both Context-ANN and Vanilla RNN consistently achieved the highest likelihood across trials compared to other models, indicating strong predictive alignment with human behavior. However, all models exhibited a notable drop in prediction likelihood at the exact reversal trial, highlighting a limitation in capturing human switching dynamics. One possible reason is that the Eckstein dataset spans a wide developmental age range, including children, adolescents, and adults. As previous research suggests, adolescence is a period of rapid and heterogeneous changes in learning and decision-making strategies. This heterogeneity may pose challenges for models like HybridRNNs that rely on shared trial-level patterns across participants. Therefore, we caution against overinterpreting this dataset as a benchmark for model comparison. Future work may benefit from identifying or generating datasets with more homogeneous populations and better-controlled task environments for model discovery and validation. Information Processing We replicated the context-dependent value updating pattern in the Eckstein dataset (see Fig.7C ). A linear regression predicting Q t +1 ( a ) revealed a significant negative interaction between r t and Q t ( na ) (β =−0.28, p < .001; Table 8 ), confirming that rewards were weighted more strongly when the unchosen option had lower expected value. View this table: View inline View popup Download powerpoint Table 6: Quantitative model comparisons of cognitive and hybrid model series on Eckstein dataset. View this table: View inline View popup Download powerpoint Table 7: Quantitative model comparisons of cognitive models, RNN, and HybridRNN models on Eckstein dataset. View this table: View inline View popup Download powerpoint Table 8: Linear regression results on predicting value of chosen action on Eckstein dataset. Footnotes ( mariaeckstein{at}deepmind.com ) References ↵ Alwosheel , A. , van Cranenburgh , S. , & Chorus , C. G. ( 2018 ). Is your dataset big enough? sample size requirements when using artificial neural networks for discrete choice analysis . Journal of choice modelling , 28 , 167 – 182 . OpenUrl CrossRef ↵ Barnby , J. M. , Mehta , M. A. , & Moutoussis , M. ( 2022 ). The computational relationship between reinforcement learning, social inference, and paranoia . PLoS computational biology , 18 ( 7 ), e1010326 . OpenUrl ↵ Bartolo , R. , & Averbeck , B. B. ( 2020 ). Prefrontal cortex predicts state switches during reversal learning . Neuron , 106 ( 6 ), 1044 – 1054 . OpenUrl CrossRef PubMed ↵ Collins , A. G. , & Frank , M. J. ( 2012 ). How much of reinforcement learning is working memory, not reinforcement learning? a behavioral, computational, and neurogenetic analysis . European Journal of Neuroscience , 35 ( 7 ), 1024 – 1035 . OpenUrl CrossRef PubMed ↵ Collins , A. G. , & Frank , M. J. ( 2018 ). Within-and acrosstrial dynamics of human eeg reveal cooperative interplay between reinforcement learning and working memory . Proceedings of the National Academy of Sciences , 115 ( 10 ), 2502 – 2507 . OpenUrl Abstract / FREE Full Text ↵ Costa , V. D. , Tran , V. L. , Turchi , J. , & Averbeck , B. B. ( 2015 ). Reversal learning and dopamine: a bayesian perspective . Journal of Neuroscience , 35 ( 6 ), 2407 – 2416 . OpenUrl Abstract / FREE Full Text ↵ Davidow , J. Y. , Foerde , K. , Galván , A. , & Shohamy , D. ( 2016 ). An upside to reward sensitivity: the hippocampus supports enhanced reinforcement learning in adolescence . Neuron , 92 ( 1 ), 93 – 99 . OpenUrl PubMed ↵ Daw , N. D. , Gershman , S. J. , Seymour , B. , Dayan , P. , & Dolan , R. J. ( 2011 ). Model-based influences on humans’ choices and striatal prediction errors . Neuron , 69 ( 6 ), 1204 – 1215 . OpenUrl CrossRef PubMed Web of Science ↵ Dezfouli , A. , Griffiths , K. , Ramos , F. , Dayan , P. , & Balleine , B. W. ( 2019 ). Models that learn how humans learn: The case of decision-making and its disorders . PLoS computational biology , 15 ( 6 ), e1006903 . OpenUrl CrossRef PubMed ↵ Eckstein , M. K. , Master , S. L. , Dahl , R. E. , Wilbrecht , L. , & Collins , A. G. ( 2022 ). Reinforcement learning and bayesian inference provide complementary models for the unique advantage of adolescents in stochastic reversal . Developmental Cognitive Neuroscience , 55 , 101106 . OpenUrl PubMed ↵ Eckstein , M. K. , Summerfield , C. , Daw , N. , & Miller , K. J. ( 2024 ). Hybrid neural-cognitive models reveal how memory shapes human reward learning . PsyArXiv . ↵ Farashahi , S. , Donahue , C. H. , Khorsand , P. , Seo , H. , Lee , D. , & Soltani , A. ( 2017 ). Metaplasticity as a neural substrate for adaptive learning and choice under uncertainty . Neuron , 94 ( 2 ), 401 – 414 . OpenUrl PubMed ↵ Feher da Silva , C., & Hare , T. A. ( 2020 ). Humans primarily use model-based inference in the two-stage task . Nature Human Behaviour , 4 ( 10 ), 1053 – 1066 . OpenUrl PubMed ↵ Flesch , T. , Juechems , K. , Dumbalska , T. , Saxe , A. , & Summerfield , C. ( 2022 ). Orthogonal representations for robust context-dependent task performance in brains and neural networks . Neuron , 110 ( 7 ), 1258 – 1270 . OpenUrl CrossRef PubMed ↵ Gershman , S. J. , & Daw , N. D. ( 2017 ). Reinforcement learning and episodic memory in humans and animals: an integrative framework . Annual review of psychology , 68 ( 1 ), 101 – 128 . OpenUrl CrossRef PubMed ↵ Groman , S. M. , Keistler , C. , Keip , A. J. , Hammarlund , E. , DiLeone , R. J. , Pittenger , C. , … Taylor , J. R. ( 2019 ). Orbitofrontal circuits control multiple reinforcement-learning processes . Neuron , 103 ( 4 ), 734 – 746 . OpenUrl CrossRef PubMed ↵ Guise , K. G. , & Shapiro , M. L. ( 2017 ). Medial prefrontal cortex reduces memory interference by modifying hippocampal encoding . Neuron , 94 ( 1 ), 183 – 192 . OpenUrl CrossRef PubMed ↵ Hauser , T. U. , Iannaccone , R. , Ball , J. , Mathys , C. , Brandeis , D. , Walitza , S. , & Brem , S. ( 2014 ). Role of the medial prefrontal cortex in impaired decision making in juvenile attention-deficit/hyperactivity disorder . JAMA psychiatry , 71 ( 10 ), 1165 – 1173 . OpenUrl PubMed ↵ Jaffe , P. I. , Poldrack , R. A. , Schafer , R. J. , & Bissett , P. G. ( 2023 ). Modelling human behaviour in cognitive tasks with latent dynamical systems . Nature Human Behaviour , 7 ( 6 ), 986 – 1000 . OpenUrl PubMed ↵ Ji-An , L. , Benna , M. K. , & Mattar , M. G. ( 2023 ). Automatic discovery of cognitive strategies with tiny recurrent neural networks . bioRxiv , 2023 – 04 . ↵ Lee , J. K. , Rouault , M. , & Wyart , V. ( 2023 ). Adaptive tuning of human learning and choice variability to unexpected uncertainty . Science Advances , 9 ( 13 ), eadd0501 . OpenUrl CrossRef PubMed ↵ MacDowell , C. J. , Tafazoli , S. , & Buschman , T. J. ( 2022 ). A goldilocks theory of cognitive control: Balancing precision and efficiency with low-dimensional control states . Current Opinion in Neurobiology , 76 , 102606 . OpenUrl CrossRef PubMed ↵ Mante , V. , Sussillo , D. , Shenoy , K. V. , & Newsome , W. T. ( 2013 ). Context-dependent computation by recurrent dynamics in prefrontal cortex . nature , 503 ( 7474 ), 78 – 84 . OpenUrl CrossRef PubMed Web of Science ↵ Mathys , C. , Daunizeau , J. , Friston , K. J. , & Stephan , K. E. ( 2011 ). A bayesian foundation for individual learning under uncertainty . Frontiers in human neuroscience , 5 , 39 . OpenUrl PubMed ↵ Mehta , M. A. , Swainson , R. , Ogilvie , A. D. , Sahakian , B. , & Robbins , T. W. ( 2001 ). Improved short-term spatial memory but impaired reversal learning following the dopamine d 2 agonist bromocriptine in human volunteers . Psychopharmacology , 159 , 10 – 20 . OpenUrl CrossRef PubMed ↵ Miller , K. , Eckstein , M. , Botvinick , M. , & Kurth-Nelson , Z. ( 2024 ). Cognitive model discovery via disentangled rnns Advances in Neural Information Processing Systems , 36 . ↵ Mishchanchuk , K. , Gregoriou , G. , Qü , A. , Kastler , A. , Huys , Q. J. , Wilbrecht , L. , & MacAskill , A. F. ( 2024 ). Hidden state inference requires abstract contextual representations in the ventral hippocampus . Science , 386 ( 6724 ), 926 – 932 . OpenUrl CrossRef PubMed ↵ Nassar , M. R. , McGuire , J. T. , Ritz , H. , & Kable , J. W. ( 2019 ). Dissociable forms of uncertainty-driven representational change across the human brain . Journal of Neuroscience , 39 ( 9 ), 1688 – 1698 . OpenUrl Abstract / FREE Full Text ↵ Oemisch , M. , Watson , M. R. , Womelsdorf , T. , & Schubö , A. ( 2017 ). Changes of attention during value-based reversal learning are tracked by n2pc and feedback-related negativity . Frontiers in human neuroscience , 11 , 540 . OpenUrl PubMed ↵ Palminteri , S. , Khamassi , M. , Joffily , M. , & Coricelli , G. ( 2015 ). Contextual modulation of value signals in reward and punishment learning . Nature communications , 6 ( 1 ), 8096 . OpenUrl PubMed ↵ Palminteri , S. , Kilford , E. J. , Coricelli , G. , & Blakemore , S.-J. ( 2016 ). The computational development of reinforcement learning during adolescence . PLoS computational biology , 12 ( 6 ), e1004953 . OpenUrl PubMed ↵ Palminteri , S. , Wyart , V. , & Koechlin , E. ( 2017 ). The importance of falsification in computational cognitive modeling . Trends in cognitive sciences , 21 ( 6 ), 425 – 433 . OpenUrl CrossRef PubMed ↵ Peterson , J. C. , Bourgin , D. D. , Agrawal , M. , Reichman , D. , & Griffiths , T. L. ( 2021 ). Using large-scale experiments and machine learning to discover theories of human decisionmaking . Science , 372 ( 6547 ), 1209 – 1214 . OpenUrl Abstract / FREE Full Text ↵ Rescorla , R. A. ( 1972 ). A theory of pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement . Classical conditioning II: Current theory and research/Appleton-Century-Crofts . ↵ Rudebeck , P. H. , Saunders , R. C. , Prescott , A. T. , Chau , L. S. , & Murray , E. A. ( 2013 ). Prefrontal mechanisms of behavioral flexibility, emotion regulation and value updating . Nature neuroscience , 16 ( 8 ), 1140 – 1145 . OpenUrl CrossRef PubMed ↵ Sidarus , N. , Palminteri , S. , & Chambon , V. ( 2019 ). Costbenefit trade-offs in decision-making and learning . PLoS computational biology , 15 ( 9 ), e1007326 . OpenUrl PubMed ↵ Song , M. , Niv , Y. , & Cai , M. ( 2021 ). Using recurrent neural networks to understand human reward learning . In Proceedings of the annual meeting of the cognitive science society (Vol. 43 ). ↵ Wilson , R. C. , & Collins , A. G. ( 2019 ). Ten simple rules for the computational modeling of behavioral data . Elife , 8 , e49547 . OpenUrl CrossRef PubMed ↵ Zika , O. , Wiech , K. , Reinecke , A. , Browning , M. , & Schuck , N. W. ( 2023 ). Trait anxiety is associated with hidden state inference during aversive reversal learning . Nature Communications , 14 ( 1 ), 4203 . OpenUrl PubMed View the discussion thread. Back to top Previous Next Posted July 31, 2025. Download PDF Email Thank you for your interest in spreading the word about bioRxiv. NOTE: Your email address is requested solely to identify you as the sender of this article. Your Email * Your Name * Send To * Enter multiple addresses on separate lines or separate them with commas. You are going to email the following Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning Message Subject (Your Name) has forwarded a page to you from bioRxiv Message Body (Your Name) thought you would like to see this page from the bioRxiv website. Your Personal Message CAPTCHA This question is for testing whether or not you are a human visitor and to prevent automated spam submissions. Share Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning Yifei Cao , Maria K. Eckstein bioRxiv 2025.07.31.667923; doi: https://doi.org/10.1101/2025.07.31.667923 Share This Article: Copy Citation Tools Hybrid Neural-Cognitive Models Reveal Flexible Context-Dependent Information Processing in Reversal Learning Yifei Cao , Maria K. Eckstein bioRxiv 2025.07.31.667923; doi: https://doi.org/10.1101/2025.07.31.667923 Citation Manager Formats BibTeX Bookends EasyBib EndNote (tagged) EndNote 8 (xml) Medlars Mendeley Papers RefWorks Tagged Ref Manager RIS Zotero Tweet Widget Facebook Like Google Plus One Subject Area Animal Behavior and Cognition Subject Areas All Articles Animal Behavior and Cognition (7618) Biochemistry (17635) Bioengineering (13859) Bioinformatics (41846) Biophysics (21401) Cancer Biology (18534) Cell Biology (25423) Clinical Trials (138) Developmental Biology (13352) Ecology (19860) Epidemiology (2067) Evolutionary Biology (24286) Genetics (15582) Genomics (22463) Immunology (17700) Microbiology (40298) Molecular Biology (17141) Neuroscience (88429) Paleontology (666) Pathology (2825) Pharmacology and Toxicology (4813) Physiology (7633) Plant Biology (15107) Scientific Communication and Education (2042) Synthetic Biology (4284) Systems Biology (9808) Zoology (2267)

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: preprint-html ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00