Intrinsic rewards guide visual resource allocation via reinforcement learning

2025 · doi:10.1101/2025.04.25.650663

preprint OA: gold CC-BY-4.0

Abstract

Humans and other animals prioritise visual processing of stimuli that signal rewards. While prior research has focused on tangible incentives (e.g., money or food), the effects of intrinsic incentives – such as perceived competence – are less well understood. Across a series of visual estimation experiments, we manipulated observers’ subjective sense of confidence in their judgements using either deceptive trial-by-trial feedback or real discrepancies in stimulus reliability. We found that observers prioritised encoding of stimuli associated with lower uncertainty or error, benefiting performance for stimuli already estimated accurately, while further impairing performance for those estimated poorly. These reward-driven biases, while potentially adaptive, impaired overall accuracy in the present tasks by causing resource allocation to deviate from the error-minimizing strategy. To account for these findings, we supplemented a normalization model of neural resource allocation with a simple reinforcement learning rule. Intrinsic and extrinsic rewards cumulatively shaped the values assigned to different stimuli by the model, and the resulting discrepancies biased resource allocation and thereby estimation error, quantitatively matching the data. These findings reveal how intrinsic reward signals can shape resource allocation in ways that are both adaptive and counterproductive, offering a computational basis for the motivational biases underlying cognitive performance.

Keywords

population coding, reinforcement learning, resource allocation, attention, working memory

bioRxiv preprint doi: https://doi.org/10.1101/2025.04.25.650663; this version posted April 27, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Introduction

To support adaptive behaviour and ensure survival, the brain has evolved to prioritise environmental cues that signal potential rewards [1, 2]. Selectively attending to reward-predicting stimuli facilitates efficient navigation of complex environments, helping organisms move towards more rewarding states [3, 4]. This selection process is crucial given the brain’s limited processing capacity, as it enhances internal representations of valuable stimuli and facilitates the formation of stimulus-reward associations [5]. Whereas the bias towards processing stimuli associated with tangible rewards is well established, the influence of intrinsic rewards – positive motivational states associated with feelings of satisfaction and competence [6] – on sensory processing remains less understood.

Experiments using points-based and monetary incentives have found that associating stimuli with a higher probability, or greater magnitude, of external reward facilitates voluntary, or top-down, attention [7–9]. Additionally, in visual search tasks, which primarily engage bottom-up processes, search times are faster for pop-out targets associated with higher rewards than stimuli predicting less or no reward [10]. Notably, the prioritisation of reward-associated stimuli persists in subsequent tasks even when reward contingencies are removed, and previously rewarded features cease to be salient or task-relevant [11–13].
Consistent with this, studies have shown that eye movements are biased towards objects and spatial locations previously associated with rewards [14–16]. This continued prioritisation of previously rewarded stimuli, even when it no longer aligns with immediate task goals, suggests that reward learning creates a lasting effect that can involuntarily bias attention towards these stimuli [17, 18].

The influence of external rewards on behaviour extends to visual working memory (VWM) [19], which is known for its ability to flexibly store and maintain features of multiple objects within a limited capacity [20–28]. The precision of representations increases as a function of the associated reward, indicating that VWM allocation also tracks reward values when multiple objects provide different rewards ([29, 30]; see [31] for a review). Objects that were previously associated with reward are also better remembered, even when they are currently task-irrelevant [32]. Crucially, however, total VWM capacity does not show flexibility with reward [33, 34], which is further evidenced by findings that improved performance for high-reward items is accompanied by a corresponding decline in performance for low-reward items [35]. These results demonstrate that stimuli can be strategically prioritised for encoding in VWM through selective attention, leading to flexible allocation of limited capacity between items based on their assigned subjective values [36, 37].

Neuroimaging studies suggest that intrinsic rewards can have similar effects on the neural system as external rewards. Successful information retrieval in cognitive tasks has been argued to be psychologically rewarding [38], and studies have shown elevated activation in the striatum – a region traditionally associated with the motivational significance of actions [39–42] – in response to correct responses, even in the absence of explicit rewards [38, 43, 44].
This activation is driven not by the successful retrieval of information itself, but rather by the satisfaction of the observer’s internal goals [38, 43]. Similarly, changes in confidence levels, which reflect subjective evaluations of correctness, have also been shown to modulate striatal activation [45–47]. Building on evidence of subjective confidence signals in the brain’s reward circuits, it has been argued that the brain reinforces behaviours linked to high-confidence states while diminishing those associated with low confidence [48]. Together, growing evidence suggests that internally generated signals, particularly those related to perceived accuracy and performance evaluation, are represented similarly in the brain to explicit, externally administered rewards, raising the possibility that they may similarly bias sensory processing.

In the present study, we combined psychophysical measurement and computational modelling to investigate how different intrinsic and extrinsic factors affect the competition between visual stimuli for processing resources. We used a modified analogue report task [49, 50] in which observers were instructed to reproduce the direction of one of a pair of motion stimuli that differed in their associated history of reward. Across a series of experiments, we found performance was consistently better for stimuli previously associated with larger extrinsic reward, but also those associated with lower uncertainty or with improved performance feedback.
To provide a mechanistic explanation of the observed behaviour, we developed a computational model that relates accumulation of past rewards, both intrinsic and extrinsic, to allocation of neural resources between stimuli, which in turn influences estimation performance.

Results

Differential rewards bias resource allocation

Building on existing evidence that external rewards can bias information processing, we began our investigation by quantifying their effects on representational fidelity in a motion reproduction task. In Experiment 1, observers viewed two coloured motion stimuli, and after a brief delay and the presentation of a colour cue, they were asked to reproduce the motion direction of the cued stimulus (Fig. 1A & B). Critically, in this experiment, we associated the colours of the stimuli with different external rewards by awarding accurate recall (< 50° absolute error) with 15 points when items of one colour were tested versus 5 points for the other colour. Accumulated points were converted into a bonus payment to the observer. At the end of the experiment, all observers correctly identified which stimulus had provided the larger rewards. To determine whether the difference in external rewards influenced reproduction precision, we compared the mean absolute deviation (MAD) of responses between stimuli of different colours.
We found strong evidence that response errors were smaller for items of the colour associated with the larger reward (BF10 = 18.7, median of the posterior over effect size δ = 0.575, 95% credible interval = [0.195, 0.966]) (Fig. 1C & D).

Figure 1: External reward manipulation in Experiment 1. A) Schematic of the task. B) Illustration of the experimental manipulation. Responses within 50 degrees of the target motion direction were rewarded with either 15 or 5 points, depending on the colour of the cued object. C) Distribution of response errors and corresponding fits of the Neural resource model. Histograms represent the data, while coloured curves and shaded areas depict model predictions (M ± SE). D) Mean absolute deviation (MAD) of response errors. The coloured circles with error bars represent the mean ± SE. Dashed line indicates chance level performance. E) Observed (i.e., freely estimated) resource allocation compared to the optimal allocation aimed at maximizing the total points in the task. For visualisation purposes, allocation towards the low-reward item is shown. Dashed line indicates equal allocation. Allocation smaller than 1 indicates that more resource was allocated towards the high-reward item (1:0.695 for high- vs low-reward item). The inset shows individual reward functions relating resource allocation to expected point totals, with each curve’s peak indicating the allocation that maximizes expected reward. For ease of visualization, only a subset of observers is shown, and all curves are normalized to the same total reward.
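
The two quantities used in Experiment 1, the mean absolute deviation of response errors and the points awarded for responses within 50° of the target, can be illustrated with a minimal Python sketch. The function names are hypothetical and directions are assumed to be in radians; this is not the authors' analysis code. The last lines show why chance-level MAD sits near 1.57 rad: errors spread uniformly around the circle have a mean absolute value of π/2.

```python
import math

def wrap_to_pi(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def response_error(response, target):
    """Signed circular difference between reported and true direction."""
    return wrap_to_pi(response - target)

def mean_absolute_deviation(errors):
    """Mean absolute deviation (MAD) of a list of circular errors."""
    return sum(abs(e) for e in errors) / len(errors)

def points_awarded(error, high_reward, threshold_deg=50.0):
    """Points rule described for Experiment 1: 15 or 5 points inside the reward zone."""
    if abs(error) < math.radians(threshold_deg):
        return 15 if high_reward else 5
    return 0

# Uniformly distributed errors (pure guessing) give MAD = pi / 2, about 1.571 rad.
n = 1000
uniform_errors = [-math.pi + 2 * math.pi * (i + 0.5) / n for i in range(n)]
chance_mad = mean_absolute_deviation(uniform_errors)
```
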

Neural resource allocation

The results of Experiment 1 indicate that observers prioritised encoding the stimulus associated with the larger reward. Importantly, recall for the low-reward item remained reliably better than chance, suggesting that prioritisation was graded rather than all-or-none. To quantify the share of resources allocated to each item, we applied a normalization-based population coding model [22, 51] to the data from Experiment 1. In this model, neural firing rate takes the role of a limited resource, which, in the simplest scenario, would be equally distributed between stimuli. Here, we extended this model by freely fitting a gain modulation parameter, α, which increased the activity encoding one of the two stimuli while keeping the total gain (i.e., mean activity) of the population constant (see the Neural resource model section for more detail). Consistent with the observed difference in response error, we found strong evidence for unequal resource allocation favouring items of the highly rewarded colour (low/high ratio 0.695; difference from equal allocation, BF10 = 23.8, δ = 0.594, 95% CI = [0.211, 0.987]) (Fig. 1E).

We next investigated whether observers distributed resources in a way that would maximize the total number of collected points, which we considered an optimal allocation strategy for this task. To test this, we calculated the expected number of points awarded for a range of different allocation weights (see Optimal resource allocation for more detail). The values of α that maximized the reward are shown in Figure 1E (optimal allocation).
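
The idea of searching over allocation weights for the one that maximizes expected points can be sketched as follows. This is a simplified stand-in and not the Neural resource model itself: it assumes von Mises distributed errors whose concentration is proportional to an item's share of a fixed total resource, and that each colour is probed equally often. The names `total_kappa`, `hit_probability` and `expected_points` are hypothetical.

```python
import math

def resource_fractions(ratio):
    """Split a fixed total resource given a low/high allocation ratio."""
    low = ratio / (1.0 + ratio)
    return low, 1.0 - low

def hit_probability(kappa, window_deg=50.0, n=3600):
    """P(|error| < window) under a von Mises error density (midpoint integration)."""
    window = math.radians(window_deg)
    step = 2 * math.pi / n
    total = inside = 0.0
    for i in range(n):
        x = -math.pi + (i + 0.5) * step
        weight = math.exp(kappa * math.cos(x))  # unnormalized von Mises density
        total += weight
        if abs(x) < window:
            inside += weight
    return inside / total

def expected_points(ratio, total_kappa=4.0, high=15, low=5):
    """Expected points per trial, assuming each item is probed on half of trials."""
    f_low, f_high = resource_fractions(ratio)
    return 0.5 * (low * hit_probability(total_kappa * f_low)
                  + high * hit_probability(total_kappa * f_high))

# Grid search over low/high ratios from 0.05 to 2.0 (1.0 means equal allocation).
ratios = [0.05 * k for k in range(1, 41)]
best_ratio = max(ratios, key=expected_points)
```

Under these assumptions the reward-maximizing ratio always falls below 1, because shifting resource towards the item worth more points raises the expected total. How far below 1 it falls depends on the assumed error model.
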
Comparing the observed and optimal weights revealed strong evidence for a difference between the two (BF10 = 92.9, δ = 0.694, 95% CI = [0.297, 1.102]), with observers distributing resources more equally than would be required to maximize the total number of points (low/high ratio 0.34).

A reward-maximization strategy would come at the cost of higher error for the low-reward item. This could suggest that, in addition to maximizing external rewards, observers may also aim to achieve a certain level of accuracy on the task across all items, potentially because they find accuracy intrinsically rewarding (see also, [25]).

Perceived accuracy biases resource allocation

Having confirmed that external rewards modulated allocation in the motion reproduction task, we next investigated effects of perceived accuracy, a possible form of intrinsic reward, on the same task. In Experiment 2a we presented manipulated feedback at the end of each trial to influence observers’ perception of their reproduction accuracy. Observers were again presented with two coloured stimuli, and reproduced one indicated by a colour cue. We magnified the error presented at feedback when one colour was cued and minified the error at feedback for the other colour (Fig. 2A & B). A post-experimental questionnaire revealed that 84% of observers judged stimuli of the colour associated with error-magnified feedback as more difficult to remember, indicating that we successfully associated stimulus identity (i.e., colour) with perceived difficulty.

To assess the effects of perceived difficulty on response precision, we compared MAD between the response and the true target direction (rather than the one shown as feedback) for stimuli of the two colours (Fig. 2C & D).
We found responses to be more precise for the stimulus with reduced feedback error, i.e., the one perceived as easier to remember (BF10 = 29.4, δ = 0.672, 95% CI = [0.24, 1.12]). This finding indicates that the perception of better performance for stimuli of one colour, induced by feedback, led to improved actual performance for those stimuli.

The observed effect could be attributed to either capture of visual attention by the “easier” item (i.e., competition for visual processing resources) or the mnemonic prioritisation of that item (i.e., competition for memory resources). To differentiate between these possibilities, we conducted a follow-up experiment. Experiment 2b replicated the conditions of Experiment 2a but with stimuli presented sequentially to reduce encoding competition between the two objects and minimize the influence of attentional selection on resource allocation. Similar to Experiment 2a, 89% of observers judged the colour associated with magnified feedback errors as more difficult to remember. However, in contrast to Experiment 2a, comparing response errors across the two stimuli (Fig. S1) revealed that the observed data were nine times more likely under the null hypothesis, providing moderate evidence for a lack of difference in response precision between the two colours (BF10 = 0.11, δ = 0.009, 95% CI = [-0.183, 0.202]). This finding suggests that the effect observed in Experiment 2a is likely due to attentional competition during encoding. When that competition is mitigated, observers do not show preferential encoding based on perceived difficulty.

Neural resource allocation

The results of Experiment 2a show that observers prioritised encoding of the error-minified stimulus, i.e., the one signalling better performance.
Crucially, the error-magnified stimulus was still recalled with above-chance precision, consistent with a graded rather than all-or-none allocation of resources. To quantify resource distribution between the two objects, we again applied the Neural resource model to the data, with results illustrated in Figure 2C & E. We found that, on average, observers allocated 1.18 times more resources towards the error-minified stimulus (difference from equal allocation, BF10 = 2.63, δ = 0.45, 95% CI = [0.05, 0.86]; 19 out of 25 observers had α_observed > 1; Fig. 2D), consistent with the observed difference in response error between the stimuli of two colours.

Figure 2: Perceived accuracy manipulation in Experiment 2a. A) Schematic of the task. B) Experimental manipulation illustration. Feedback error was magnified for one stimulus and minified for the other, based on the colour of the cued object. C) Distribution of response errors and corresponding fits of the Neural resource model. Histograms represent the data, while coloured curves and shaded areas depict model predictions (M ± SE). D) Mean absolute deviation of response errors. The coloured circles with error bars show the mean ± SE. Dashed line indicates chance level performance. E) Observed resource allocation and optimal allocation aiming to minimize overall feedback variance in the task. Dashed line indicates equal allocation. Allocation larger than 1 indicates that more resource was allocated towards the error-minified item (minified vs magnified: 1.18:1). The inset illustrates individual variability in feedback error as a function of resource allocation, with each curve’s trough indicating the allocation level that minimizes feedback error. For ease of visualization, only a subset of observers is shown, and all curves are normalized to the same range of feedback variance.

Next, we investigated whether the observed allocation matched the predictions of an ideal observer who optimally weights neural activity to minimize overall feedback error in the task. We calculated the expected variance of feedback error across both items for a range of different allocation weights, and Figure 2E shows optimal allocation weights that minimize this variance. The optimal strategy would require shifting twice as many resources towards the error-magnified item (α_optimal = 0.52). Importantly, we found extremely strong evidence that this was inconsistent with the observed allocation, which favoured the error-minified item (BF10 = 3.77 × 10^6, δ = 1.72, 95% CI = [1.09, 2.39]). Overall, these results indicate that observers did not adopt an allocation strategy that would minimize their feedback error variability (α = 0.52), but instead did the opposite, allocating more neural resources to the item for which we systematically minified the error in feedback.

In Experiment 2b, fitting the same Neural resource model to the data revealed that the observed allocation parameter was numerically close to 1 (α_mean = 1.07; BF10 = 0.62, δ = 0.18, 95% CI = [-0.01, 0.38]), which aligns with the observed similarity in reproduction precision between the two stimuli.
This further supports the conclusion that the effect observed in Experiment 2a depended on attentional competition during encoding.

Estimation difficulty biases resource allocation

Following Experiment 2, we aimed to determine whether preferential allocation and encoding would persist when varying objective stimulus difficulty rather than perceived performance. Drawing on previous findings showing a positive correlation in humans between subjective confidence and the motion strength of RDK stimuli [52] (see also [53]), we hypothesised that variations in the objective difficulty of stimuli would modulate internally generated confidence signals, driving the prioritisation of specific stimuli as in Experiment 2. In Experiment 3a, we presented two coloured RDK stimuli with different coherence levels on the majority of trials, to create differences in objective difficulty and associated confidence. We then assessed response precision on the remaining trials, during which both stimuli were presented with equal coherence (i.e., equal difficulty) (Fig. 3A & B).

Figure 3: Estimation difficulty manipulation in Experiment 3a and 3b (simultaneous presentation). A) Schematic of the task. B) Experimental manipulation illustration. In most trials, the two colours were associated with different levels of motion estimation difficulty (i.e., variable coherence); in the remaining trials, both objects had the same level of difficulty (i.e., equal coherence). Motion with different coherence levels produces varying degrees of perceptual noise, with higher coherence reducing noise. This perceptual noise was incorporated into the Neural resource model as an additive component, alongside memory noise. C) & D) Distribution of response errors and corresponding fits of the Neural resource model. Histograms represent the data, while coloured curves and shaded areas depict model predictions (M ± SE). Panel C depicts variable coherence trials, and panel D depicts equal coherence trials. E) Mean absolute deviation of response errors. Dashed lines indicate equal allocation. F) Observed resource allocation and optimal allocation aiming to minimize overall feedback variance in the task. Dashed line indicates equal allocation. Allocation larger than 1 indicates that more resource was allocated towards the easier item (high vs low coherence: 1.76:1). Panels G-J are the same as C-F, but for the simultaneous presentation condition of Experiment 3b. J) Allocation larger than 1 indicates that more resource was allocated towards the easier item (high vs low coherence: 1.61:1). The insets illustrate individual variability in response variance as a function of resource allocation, with each curve’s trough indicating the allocation level that minimizes overall response error. The coloured circles with error bars show the mean ± SE. For ease of visualization, all curves are normalized to the same range of recall variance.

In Experiment 3a, all observers reported that stimuli of the colour associated with low coherence were more difficult to remember, confirming that the coherence manipulation produced a clear difference in perceived difficulty, despite the absence of performance feedback in this experiment. Results of Experiment 3a are shown in Figure 3C-E. As expected, observers were more precise in reproducing the motion direction of the high-coherence stimulus on trials where the stimuli objectively differed in difficulty (BF10 = 1.83 × 10^5, δ = 1.6, 95% CI = [0.95, 2.29]). More importantly, on trials where the stimuli had equal coherence, reproduction was also more precise for the stimulus with the colour associated with high coherence (i.e., the “easier” colour) (BF10 = 11.4, δ = 0.63, 95% CI = [0.18, 1.10]). This finding suggests that observers associated colour with difficulty when the two objects were presented with different levels of coherence, and subsequently allocated more resources to the stimulus they had learned was easier.

This result was replicated in the laboratory setting of Experiment 3b, where observers were required to maintain eye fixation at the centre of the screen during stimulus presentation (Fig. 3G-I).
Despite preventing observers from overtly shifting their attention towards one stimulus during encoding, 74% of observers correctly identified one colour as more difficult. As expected, responses were more precise for the high-coherence stimulus on trials where stimuli differed in coherence (BF10 = 4414, δ = 1.366, 95% CI = [0.718, 2.046]). Additionally, responses were more precise for the colour associated with high coherence on trials where both stimuli were presented with equal coherence (BF10 = 4.2, δ = 0.56, 95% CI = [0.092, 1.052]). However, the observed difference could again be explained by attentional demands at encoding. When objects were presented sequentially (Fig. S2), response precision was comparable across colours when both stimuli had the same coherence (BF10 = 0.51, δ = 0.267, 95% CI = [-0.157, 0.709]). This was despite observers being able to judge which item was more difficult (84%) and a noticeable precision advantage for the high-coherence stimulus on trials when coherence levels varied between objects (BF10 = 9.22 × 10^4, δ = 1.76, 95% CI = [1.015, 2.550]). Compared to other experiments, response distributions in this experiment exhibit more pronounced peaks around the direction opposite to the target (i.e., elevated tail ends). The tendency of our sensory system to encode orientation of a motion path (i.e., the line on which movement occurs) partly independently of direction is well-documented [54, 55] and may be especially pronounced when motion stimuli are presented in the periphery rather than at fixation.

Neural resource allocation

Consistent with the findings from Experiment 2, Experiment 3 demonstrated that observers, when presented with objects associated with different levels of performance, prioritised the encoding of the stimuli perceived as easier.
Also consistent with previous experiments, observers performed above chance for the more difficult item, supporting the interpretation that resource allocation was graded rather than all-or-none. To quantify the distribution of resources across the two items, we again applied our population coding model to the data.

In Experiment 3a, the allocation estimates from the model indicated that observers allocated nearly twice as much resource (1.76:1) to the high-coherence stimuli (Fig. 3F), and this allocation deviated from equal allocation (BF10 = 2.68 × 10^4, δ = 1.4, 95% CI = [0.792, 2.031]). We next investigated whether the observed allocation was consistent with an optimal allocation strategy aimed at minimizing overall response variance in the task. To this end, we simulated performance on the variable coherence trials using a range of different allocation weights, and found that the optimal strategy for most observers was equal allocation (Fig. 3F). Comparing the observed and optimal weights revealed strong evidence that the observed weights were, on average, larger than the optimal weights (BF10 = 9580, δ = 1.341, 95% CI = [0.733, 1.977]).

These findings were replicated in Experiment 3b. When objects were presented simultaneously, the model estimated that observers allocated resources at a ratio of 1.61:1 in favour of the high-coherence stimulus (Fig. 3J). This allocation was again different from equal (BF10 = 830.6, δ = 1.166, 95% CI = [0.566, 1.794]), and from optimal, which was again close to equal (mean α_optimal = 1.03; BF10 = 789, δ = 1.16, 95% CI = [0.561, 1.786]).
Finally, fitting a free allocation parameter to the data from the equal coherence condition with sequential presentation (Exp 3b) revealed a ratio of 1.2:1 in favour of the colour associated with high coherence; however, we did not find evidence that this was different from equal allocation (BF10 = 0.82, δ = 0.343, 95% CI = [-0.09, 0.797]).

Interim conclusion

In Experiment 1, we observed a clear effect of external rewards on representational fidelity in a motion reproduction task, with observers allocating more cognitive resources to high-reward items. In Experiments 2 and 3, we found a similar effect using novel manipulations, where observers allocated more resources to the stimulus that was perceived as easier, either based on manipulated error feedback (Experiment 2) or internal confidence in estimation (Experiment 3). In both of these experiments, the observed allocation deviated from predictions made by an optimal strategy aimed at minimizing overall feedback or response error. Additionally, we demonstrated that differences in estimation performance were abolished when competition during encoding was removed by presenting stimuli sequentially, suggesting they arise from unequal allocation of attentional resources at the encoding stage.

Building on these results and existing literature [38, 43, 44], we argue that observers in our study found higher accuracy and confidence in their performance intrinsically rewarding, and learned to associate this intrinsic reward with a stimulus feature (i.e., one of the two colours).
This association biased resource allocation towards subsequent stimuli with the same feature. Importantly, although reward-driven, this biased allocation was not a strategy that would maximize intrinsic reward on these tasks, because observers had no influence over which stimulus was cued for report on a given trial. Indeed the direction of the biases induced by implicit rewards in Exps 2 & 3 meant that they were counterproductive: increasing overall error variability relative to a strategy of equal allocation. Therefore, instead of evaluating this data from the perspective of optimal performance, we propose a neural model inspired by reinforcement learning to elucidate these findings.

Reinforcement learning model of resource allocation

To further explore the dynamics of resource allocation, we developed a computational model that integrates principles of neural coding and reinforcement learning. The proposed Reinforcement learning account of resource allocation extends the Neural resource model [22, 51] by incorporating a value-updating mechanism that allows extrinsic and intrinsic rewards to influence the future distribution of neural resources (Fig. 4). A key contribution of our model is the concept that rewards – both intrinsic and extrinsic – obtained from reproduction of a stimulus become associated with the identifying features of that stimulus, affecting their subjective value and biasing allocation of resources in subsequent encounters. We found that this approach accurately predicted resource allocations estimated by freely fitted allocation weights, indicating that behavioural estimation performance could be successfully inferred from an analysis of accumulated rewards.

External reward

In Experiment 1, the stimulus colour associated with a high reward (15 points) was expected to accumulate greater value relative to the colour associated with a low reward (5 points).
On average, observers earned points on 80% of trials when the high-reward stimulus was probed and 66.5% of trials when the low-reward stimulus was probed, leading to an average accumulation of 602 and 166 points, respectively. To apply the proposed RL model to each observer’s data, we combined individual trial-by-trial external rewards with estimates of internal confidence (Equation 9).

Figure 5A shows the average trajectory of resource allocation across trials (see Fig. S3A for individual trajectories). This trajectory shows an early shift in resource allocation towards the preferred item, followed by a stable plateau. For ease of visualisation, trajectories are presented as directed towards the preferred object, defined as the object receiving a greater average resource allocation across all trials.

Crucially, since our RL account is grounded in the same Neural resource model previously employed to fit the psychophysical data and quantify resource allocation (Fig. 1A & C), we can directly compare estimates across the two models. Here we focus on the comparison of estimated resource allocations, while ML estimates and comparisons for the other parameters are shown in Supplementary Information (Fig. S4A). Importantly, the freely estimated resource allocation (observed allocation in Fig. 1C) is based on behavioural errors only, with no information about rewards, and so can serve as a benchmark for evaluating performance of the RL model. As shown in Figure 5B, we observed a strong positive correlation between the freely estimated allocation parameter and the mean allocation derived from the history of accumulated rewards (r = 0.976, 95% CI = [0.941, 0.988], BF10 = 3.34 × 10^16).
[Figure 4 image not reproduced. Panel labels recoverable from the extraction: A) neural resource allocation and stochastic spiking, with uncertainty, error feedback and awarded points combined into a relative value (νt) that sets the gain factor (α); B–D) relative value (νt), resource fraction and MAD for the high- and low-reward stimulus, plotted against trial number; E) varying reward weight; F) varying leak.]

Figure 4: The neural resource allocation account applied to the motion estimation task. A) On each trial, motion directions of the two stimuli are encoded in the spiking activity of populations of neurons, with mean activity determined by the relative allocation of resources to stimuli. Based on the cue colour, one of the populations is decoded to yield an estimated direction with an associated uncertainty that varies from trial to trial. The uncertainty of the estimate, the accuracy feedback (if present) and any points awarded represent different forms of intrinsic and external reward, which are combined as a weighted sum into a composite reward (∆t). This composite reward is then used to update the relative value (ν) associated with the stimulus colours. Finally, this relative value is transformed via an exponential mapping into a neural gain factor (α), which controls the fraction of resources allocated to each stimulus on the subsequent trial. In this framework, resource allocation is entirely driven by the history of accumulated rewards.
B) Throughout the reported experiments, the two colours of stimuli are systematically related to different intrinsic or external rewards, so the relative value assigned to each colour progressively diverges over the sequence of trials. C) Fraction of total resources allocated to the high-reward stimulus over trials, based on relative value shown in B. The dashed line represents the mean allocation across all trials (∼65%). The remaining resources (∼35%) are allocated to the low-reward stimulus. D) Unequal resource allocation is reflected in differences in the mean absolute error across trials when the high- or low-reward stimulus is cued for report. E) Larger reward weight (c), with a constant leak factor, results in a stronger preference for one stimulus over the other in terms of resource allocation. F) Larger leak factor (y), with a constant reward weight, leads to a weaker preference.

The close alignment of these two distinct methods suggests that the history of accumulated rewards can effectively account for resource allocation in this task.

Intrinsic reward: Perceived accuracy

In Experiment 2 we manipulated the response error presented in feedback to influence the perceived difficulty of reproducing stimuli of each colour. This manipulation resulted in participants experiencing systematically larger feedback errors for stimuli of one colour (magnified feedback MAD = 0.837) than the other (minified feedback MAD = 0.233). To model these data within our RL account, we assume that the feedback on each trial provided an intrinsic reward that was associated with the corresponding stimulus colour.
This assumption is supported by evidence that observers’ subjective evaluations tend to favour smaller feedback errors over larger ones because they suggest higher accuracy [56].

In the model, rewards derived from feedback were integrated with those derived from internal confidence. Figure 5D illustrates the mean trajectory of resource allocation across trials (Fig. S3B shows individual trajectories for example observers). The model fits again indicate that resources were unequally allocated between stimuli, although the bias is smaller than observed in the experiment with external rewards.

Comparing the estimated allocation derived from the RL account to the freely fitted allocation parameter in the Neural resource model (Fig. 5E), we found a strong positive correlation (r = 0.911, 95% CI = [0.777, 0.958], BF10 = 3.44 × 10^7). Consistent with the findings from Experiment 1, the correspondence between these two distinct approaches indicates that the history of accumulated intrinsic rewards provides an explanation for resource allocation in the task with manipulated feedback. ML estimates and comparisons for the other parameters are shown in Supplementary Information (Fig. S4B).

Intrinsic reward: Estimation difficulty

Experiment 3 investigated the role of objective difficulty in the representation of motion information. On most trials, two stimuli with different coherence levels (85% and 45%) were presented. We hypothesized that internal confidence in each item’s motion direction, reflecting a metacognitive estimate of accuracy, functions as an intrinsic reward which observers associate with each item’s identity (i.e., colour) [48]. To model the psychophysical data in the simultaneous presentation condition, we estimated internal confidence by exploiting the close coupling between uncertainty and trial-to-trial variability in error within the Neural resource model.
Informed by observed response error on each trial, we derived the posterior probability distribution of likelihood precision and used the most probable precision as a basis for intrinsic reward (Eq. 12). While internal confidence was also incorporated in this way when modelling data from the previous two experiments, in Experiment 3 it was the sole source of reward influencing resource allocation.

Figures 5G & J show the mean trajectories from Experiment 3a & 3b, respectively. Again, we visualised the obtained individual trajectories in example participants (Fig. S3C & D). In both experiments, all parameter estimates obtained with the Neural resource model and the RL account strongly covaried (Fig. 5H & K & Fig. S4C & D). Importantly, this was also true for estimates of resource allocation. Across the two experiments, we found very consistent and strong positive correlations between the freely estimated allocation parameter and the mean allocation derived from the history of accumulated rewards (Exp 3a: r = 0.833, 95% CI = [0.589, 0.923], BF10 = 1.29 × 10^4; Exp 3b: r = 0.843, 95% CI = [0.575, 0.933], BF10 = 3.67 × 10^3). Consistent with the findings from the first two experiments, this strong correspondence indicates that the history of accumulated intrinsic rewards based on internal confidence effectively accounts for resource allocation in this task.

Changes in resource allocation predict response precision

Our finding that freely estimated resource allocation strongly correlates across participants with resource allocation based on the history of rewards supports the conclusion that human resource allocation is guided by a reward-driven value assignment to objects in the visual environment.
To further substantiate this claim, we investigated whether variability in resource allocation across trials within individual participants, derived from the RL model, predicts the magnitude of their response errors.

[Figure 5 image not reproduced. Panel labels recoverable from the extraction: panel groups for External reward (Exp 1), Perceived accuracy (Exp 2a), Estimation difficulty (Exp 3a) and Estimation difficulty (Exp 3b); panels A, D, G, J plot resource fraction against trial number; panels B, E, H, K relate freely estimated to reward-predicted resource allocation (r = .976, .911, .833 and .843, respectively); panels C, F, I, L show MAD.]

Figure 5: Modelling results. A) Resource allocation across trials inferred by the RL account in the external reward experiment (Experiment 1). Circles represent the mean fraction of resources across observers allocated on each trial towards the overall preferred object. B) Correlation between mean allocations inferred by the RL account and freely estimated allocations.
The red line shows predictions of the fitted linear regression model, and the shaded area indicates the 95% CI. C) Difference in MAD between trials on which the probed item had below- and above-median resources allocated to it, as estimated by the RL account. On average, MAD was larger when less resource was allocated to the probed stimulus. D–F) Same as above, but for the perceived accuracy experiment (Experiment 2). G–I) Online estimation difficulty experiment (Experiment 3a). J–L) Lab-based estimation difficulty experiment (Experiment 3b, simultaneous condition).

To investigate this, we performed a median-split analysis for each observer based on the estimated fraction of allocated resources towards the item associated with larger reward (i.e., individual trajectories similar to those shown in Fig. 5, left column). Specifically, we calculated the MAD of response errors for trials with above- and below-median resource allocation, separately for trials where the high- or low-reward item was tested. We hypothesised that MAD would be greater on trials where the RL model indicated that below-average resource was allocated to the probed item, i.e., below-median trials when the high-reward object was probed and above-median trials when the low-reward object was probed. To test this, we computed a composite score for each observer equal to the sum of the signed difference in MAD between below- and above-median trials when the high-reward item was probed and the signed difference in MAD between above- and below-median trials when the low-reward item was probed.
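
The composite score described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions rather than the analysis code used in the study: the function and argument names are hypothetical, MAD is taken here as the mean absolute error, the median is computed over all of an observer's trials, and trials exactly at the median are counted as below-median.

```python
from statistics import mean, median


def mean_abs_error(errors):
    """MAD, taken here (as an assumption) to be the mean absolute error."""
    return mean(abs(e) for e in errors)


def composite_score(allocation_to_high, probed_high, errors):
    """Median-split composite score for one observer.

    allocation_to_high: model-estimated resource fraction given to the
        high-reward item on each trial.
    probed_high: True where the high-reward item was probed.
    errors: signed response error on each trial.
    Each of the four trial groups must be non-empty.
    """
    cut = median(allocation_to_high)  # assumption: median over all trials
    groups = {(True, True): [], (True, False): [],
              (False, True): [], (False, False): []}
    for alloc, probed, err in zip(allocation_to_high, probed_high, errors):
        groups[(probed, alloc > cut)].append(err)

    # High-reward item probed: below-median minus above-median MAD.
    high_diff = (mean_abs_error(groups[(True, False)])
                 - mean_abs_error(groups[(True, True)]))
    # Low-reward item probed: above-median minus below-median MAD.
    low_diff = (mean_abs_error(groups[(False, True)])
                - mean_abs_error(groups[(False, False)]))
    return high_diff + low_diff


# Toy data (invented for illustration): errors are larger whenever the
# probed item received the smaller share of resources.
allocation = [0.5, 0.6, 0.7, 0.8, 0.5, 0.6, 0.7, 0.8]
probed = [True, True, True, True, False, False, False, False]
errors = [0.4, -0.4, 0.2, -0.2, 0.2, -0.2, 0.4, -0.4]
score = composite_score(allocation, probed, errors)
```

Under this convention, a positive score means that errors were larger on trials where the model assigned less resource to the probed item, which is the hypothesised direction.
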
In all four experiments, the composite scores indicated that lower predicted resource allocation, based on the history of rewards, corresponded on average to larger MAD of response errors (Fig. 5C, F, I & L). This was confirmed with one-sided t-tests against zero, which provided moderate to extreme evidence for a difference in the hypothesised direction: Experiment 1 (BF10 = 5.55 × 10^4, δ = 1.096, 95% CI = [0.635, 1.570]); Experiment 2 (BF10 = 564, δ = 0.866, 95% CI = [0.402, 1.345]); Experiment 3a (BF10 = 3.74, δ = 0.441, 95% CI = [0.073, 0.874]); Experiment 3b (BF10 = 192.4, δ = 0.919, 95% CI = [0.377, 1.487]).

Discussion

In the present study, we investigated how human observers represent stimuli associated with varying levels of external and intrinsic reward. Across three psychophysical experiments, we paired object identities with different rewards and found observers developed higher estimation accuracy for the items associated with larger rewards. In two additional experiments, we demonstrated that this effect was driven by competition for attentional, rather than mnemonic, resources. To provide a mechanistic explanation of this behaviour, we developed a neural model incorporating a reinforcement learning rule that directs resource allocation towards more rewarding stimuli. Our key finding is that a resource allocation mechanism based solely on the history of accumulated rewards is sufficient to explain differences in estimation performance based on intrinsic as well as external rewards.

In the first experiment, we investigated the effects of external rewards on representational fidelity in a motion reproduction task. Both the psychophysical results and computational modelling provided compelling evidence that observers allocated more processing resources to objects associated with a higher reward, resulting in more precise reproduction of high-reward stimuli compared to low-reward ones.
This finding aligns with a broad body of research demonstrating that external rewards, such as points or money, influence various aspects of information processing, including the allocation of attentional resources [17, 18] and working memory [31], while also facilitating motor responses, such as hand movements and saccades, towards rewarding stimuli [56–58].

In contrast to external rewards, the influence of intrinsic rewards on representational fidelity has received comparatively little attention. Building on the premise that accuracy itself is rewarding [38, 43, 44], we conducted two experiments that manipulated perceived accuracy (via feedback) and objective estimation difficulty (via signal strength) in a motion reproduction task. We found converging evidence at both the behavioural and computational level indicating that observers allocate more neural resources towards, and consequently have a more precise internal representation of, objects associated with better estimation performance – whether induced by artificially manipulated feedback (Experiment 2) or by objective differences in stimulus discriminability (Experiment 3).

We argue that observers derived intrinsic reward from confidence in their responses and feedback on their accuracy. In our tasks, the association of these rewards with the distinguishing feature of the presented objects (i.e., colour) leads to a bias in resource allocation, favouring subsequent stimuli that share the same feature. This proposal aligns with the notion that perceptual features linked to rewards are prioritised in sensory processing due to their incentive salience (e.g., [59, 60]). Moreover, neural evidence supports this notion by demonstrating that sensory representations are modulated by the history of rewards, underscoring the impact of reward associations on perceptual processing [61].
To make our proposal concrete, we developed a mechanistic model grounded in the principles of population coding and reinforcement learning. Specifically, our reinforcement learning account operates by analyzing accumulated rewards and allocating proportionally more resources to objects previously associated with higher rewards. We found this model closely replicated resource allocation estimates obtained from freely fitted parameters, suggesting that the history of accumulated intrinsic and extrinsic rewards is sufficient to account for the observed patterns of resource allocation.

A key novel finding from the proposed model is that both internally generated and externally manipulated (via feedback) estimates of accuracy, when associated with an object’s distinguishing feature, can bias the subsequent processing of objects that share that feature. While the role of external feedback in reinforcing behaviours more generally has been widely acknowledged (e.g., [62, 63]), recent research demonstrates that internal confidence can similarly reinforce behaviour even in the absence of explicit feedback. For instance, improvements in sensitivity in a perceptual learning task have been observed without external feedback [64]. Guggenmos et al. [48] proposed that such learning is driven by confidence prediction errors – discrepancies between an individual’s current confidence and their expected confidence level. Notably, the neural substrate for these prediction errors has been identified in the striatum (see also [65]), a brain region traditionally linked to reward processing.
Our findings contribute to a growing body of literature that highlights the importance of metacognition [53] and self-reinforcement [66] as critical processes in the pursuit of rewards.

Using a range of tasks similar to the one used in this study, previous research has demonstrated that humans possess knowledge about the uncertainty with which individual items are reported (e.g., [52, 67, 68]). Population coding models [69, 70] have been particularly effective in capturing subjective confidence [71], as well as proxies such as response latency [72]. Within the population coding framework, an ideal observer of spiking activity would derive their confidence estimate – whether internal or explicitly reported – based on the precision of the posterior distribution, which represents the probability of the stimulus value given the observed neural activity. In the Neural resource model [22, 51], the precision of the posterior (or likelihood, assuming a uniform prior over stimulus space) varies from trial to trial, as a result of stochastic variation in the number of spikes available for decoding. We calculated the most probable estimate of posterior precision on each trial to serve as an indicator of internal confidence. On this basis, the model successfully recreated freely estimated resource allocations based on our data.

An important insight from our modelling is that the observed resource allocation deviated from the pattern required to minimize overall response or feedback error variability, resulting in poorer overall performance. In a similar vein, a recent study [73] provided theoretical and empirical evidence suggesting that sensory processing is optimised to maximize fitness (i.e., rewards), rather than to ensure perceptual accuracy.
Supporting this idea, neurophysiological studies have demonstrated that early sensory systems encode both sensory information about a stimulus and non-sensory information regarding the behavioural relevance of stimuli [3, 74]. Embedding stimulus-reward contingencies within the sensory representation of a stimulus facilitates the prioritisation of behaviourally relevant information during encoding. These previous findings may help explain why our observers’ allocation strategies were not optimized for accuracy in the task; however, they were also not optimized for maximizing rewards. In the experiment with external rewards, we found that observers allocated resources more equally between items than would be predicted by a reward-maximizing strategy. The RL model captured the observed allocation strategy based on a weighted combination of points-based external rewards and confidence-based intrinsic rewards – this combination of factors could lead observers to maintain a certain level of performance even for stimuli associated with low external reward. When considered across all experiments, our results point to a reward-driven allocation of resources that, while prioritising reward-related stimuli, is not optimized to obtain rewards in the specific tasks we investigated.

Our results also contribute to prominent theories in neuroscience, psychology, and economics [75–78] which consider how humans and other animals link the mental effort required for a task with the value of its outcome (i.e., the reward). Behavioural studies demonstrate that, when faced with tasks offering equal rewards but varying in effort, humans tend to avoid those perceived as more difficult [79, 80]. Based on this, it has been argued that cognitive effort is experienced as carrying disutility, i.e., acting as a discount factor on expected rewards [78, 81].
This hypothesis has been substantiated by the observation that cognitive effort reduces neural responses to rewards following an effortful task [82]. In the present results, perceived (Experiment 2) or objective difficulty in estimation (Experiment 3) similarly appears to have discounted or reduced the subjective value of a stimulus, leading observers to prioritise easier – and thus in principle more rewarding – items for encoding. However, because observers had no control over which stimuli were selected for test, this allocation strategy did not result in more reward in our tasks and could even be counterproductive. This raises the wider question of whether humans may similarly allocate effort suboptimally, driven by intrinsic reward, in other situations where they have limited control over what information will subsequently become relevant.
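
As an illustration of the reinforcement learning account discussed above (a composite reward, a leaky update of relative value, and an exponential mapping from value to resource fraction; Fig. 4), the following Python sketch may be helpful. It is a minimal sketch under stated assumptions and not the fitted model: the function names, the weights `w_conf`, `w_fb` and `w_ext`, the sign convention for the two colours, the exact form of the leaky update and the two-gain normalisation are illustrative choices that are not specified in this section.

```python
import math


def composite_reward(confidence=0.0, feedback_error=0.0, points=0.0,
                     w_conf=1.0, w_fb=1.0, w_ext=0.01):
    """Weighted sum of reward signals (assumed form and assumed weights).

    Larger feedback errors are treated as less rewarding, hence the minus sign.
    """
    return w_conf * confidence - w_fb * feedback_error + w_ext * points


def update_relative_value(value, reward, probed_first, leak):
    """Leaky accumulation: decay the old value, then add the signed reward.

    Rewards earned on the first colour push the value up; rewards earned on
    the second colour push it down (an assumed sign convention).
    """
    sign = 1.0 if probed_first else -1.0
    return (1.0 - leak) * value + sign * reward


def resource_fraction(value):
    """Exponential mapping from relative value to the first colour's share."""
    gain_first = math.exp(value)
    gain_second = math.exp(-value)
    return gain_first / (gain_first + gain_second)


def simulate(trials, leak=0.1, **weights):
    """Return the fraction allocated to the first colour before each trial."""
    value = 0.0
    fractions = []
    for trial in trials:
        fractions.append(resource_fraction(value))
        reward = composite_reward(trial.get("confidence", 0.0),
                                  trial.get("feedback_error", 0.0),
                                  trial.get("points", 0.0),
                                  **weights)
        value = update_relative_value(value, reward,
                                      trial["probed_first"], leak)
    return fractions


# Invented example: probes alternate between a 15-point colour (first)
# and a 5-point colour (second), and points are earned on every trial.
trials = [{"probed_first": t % 2 == 0,
           "points": 15.0 if t % 2 == 0 else 5.0} for t in range(40)]
fractions = simulate(trials, leak=0.1, w_ext=0.01)
```

In this sketch, allocation starts at an equal split and drifts towards the colour that accumulates more reward. Increasing the reward weight strengthens that preference and increasing the leak weakens it, which is the qualitative pattern described for Fig. 4E and F.
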

Materials and methods

Apparatus

In the online experiments, tasks were presented via web browsers on observers’ personal computers and were coded in JavaScript and HTML Canvas. In the laboratory experiment, stimuli were displayed on a 69 cm gamma-corrected LCD monitor with a refresh rate of 60 Hz. Observers were seated in a dark room and viewed the monitor from a distance of 60 cm, with their heads supported by a forehead and chin rest. Eye position was monitored online at 1000 Hz using an infrared eye tracker (SR Research). Stimulus presentation and response registration were controlled by a script written in Psychtoolbox [83, 84] and executed in Matlab (The Mathworks Inc.). Responses were collected using a computer mouse.

Participants

A total of one hundred ninety-six naive observers (110 females, 80 males, 6 preferred not to say; Mage = 27.6, SDage = 5.0) took part in the study after giving written informed consent in accordance with the Declaration of Helsinki. All observers reported normal colour vision and normal or corrected-to-normal visual acuity. For the online experiments, observers were recruited using Prolific (https://www.prolific.co) and were remunerated £6 per hour for their participation. For the laboratory experiments, observers were recruited through the Cambridge Psychology research sign-up system and were remunerated £10 per hour.

For the online experiments, we used a Bayesian stopping rule to determine the sample size. The stopping rule guides when enough evidence has been gathered to support a decision, thus optimizing the sample size. In particular, we continued testing observers until we obtained strong evidence, as estimated by the Bayes Factor, in favour of either H0 (BF10 ≤ 0.1, indicating evidence supporting no difference between the two conditions of interest) or H1 (BF10 ≥ 10, indicating evidence supporting a difference between the two conditions).
If neither hypothesis was supported, data collection ceased after reaching 100 observers. In Experiment 1, we assessed differences in mean absolute reproduction error in the analogue report task between stimuli associated with high and low reward, which were the conditions of interest for the Bayesian stopping rule. In Experiment 2, we tested for differences in mean absolute reproduction error between error-minified and error-magnified stimuli. In Experiment 3, we compared mean absolute reproduction errors on trials where stimuli were presented in different colours but with equal coherence. For the laboratory experiment (Experiment 3b), we aimed to recruit a number of participants similar to that in Experiment 3a. In total, thirty observers participated in Experiment 1. Twenty-five observers participated in Experiment 2a, and one hundred participated in Experiment 2b. Finally, twenty-two and nineteen observers participated in Experiments 3a and 3b, respectively.

Stimuli

The stimuli in this study were random dot kinematograms (RDK). On each trial, two RDK stimuli were presented, each consisting of 40 dots within a circular aperture. A percentage of the dots (specified below) moved in a coherent direction, while the remaining dots moved in random but consistent directions within the aperture [85]. When a dot exited the aperture, it was replaced by a new dot at the aperture’s edge, maintaining a constant dot density. In all experiments, one stimulus was always green (RGB colour values: online 47, 195, 129; lab 0, 199, 128) and the other was always blue (online 24, 199, 233; lab 0, 187, 241). In Experiment 3b, the same observers completed two identical tasks, with stimuli presented either simultaneously or sequentially. In this experiment, stimuli were either green and blue or orange (237, 154, 0) and magenta (255, 79, 208), balanced across observers and presentation conditions.
Across all tasks, stimuli were presented against a mid-grey background.

For the online experiments, all measures in pixels are reported for a 1920 × 1080 resolution and 60 Hz refresh rate. When a different resolution or refresh rate was detected, all measurements of size, positioning and speed were automatically adjusted to maintain consistency in stimulus presentation across different display settings. The stimulus aperture was 105 pixels in diameter, and each dot had a radius of 3 pixels. Two apertures were positioned 220 pixels to the left and right of the screen centre. On each frame, the dots were shifted by 3 pixels in a specific direction. In the laboratory experiment, two apertures (1.4 dva radius) were presented horizontally aligned with the fixation annulus, positioned at 5 dva to the left and right. Each dot was 0.15 dva in diameter and travelled at a speed of 4 dva/s.

Procedure and task

In all experiments, observers completed an analogue report task [50]. Each trial began with the presentation of a central fixation annulus. In the laboratory experiment, gaze direction was monitored using an eye-tracking camera, and observers were required to maintain gaze fixation within a radius of 2° around the central annulus for 500 ms before the trial could proceed. After achieving stable fixation, the fixation annulus changed appearance (i.e., became thinner) to signal that the memory array would be presented in 500 ms. In the online experiment, the appearance of the fixation annulus changed after a fixed interval of 500 ms.
The sample array, consisting of two RDK stimuli, was then shown for 750 ms, followed by a 1000 ms delay period. A centrally presented colour cue subsequently indicated which of the previously presented stimuli, distinguished by colour, was the target whose direction observers should recall and report.

Once observers were ready to give their response, they could begin moving the cursor with a mouse or trackpad, which triggered the appearance of a randomly oriented white arrow within the central annulus. Observers were instructed to align the direction of the arrow with the previously presented motion direction of the cued stimulus. In the online experiment, responses were confirmed by pressing the spacebar, while in the laboratory experiment, they were confirmed by pressing the right mouse button.

Experiment 1: External reward

In Experiment 1, we investigated how extrinsic rewards influence motion reproduction precision. To this end, observers received 15 points for reporting a motion direction within 50° of the target direction when the target was of one colour (e.g., green), and 5 points when it was of the other colour (e.g., blue). Responses that were more than 50° from the target direction did not receive any points. The colour associated with high versus low reward was chosen randomly for each observer at the beginning of the experiment. Both stimuli were presented with the same coherence (85%) and no error feedback was provided. Accumulated points were converted to a bonus payment at the end of the experiment, and observers were informed of this at the beginning of the experiment. Overall, they could collect a maximum of one thousand points, which was equivalent to a bonus payment of £1.50. Observers completed twenty practice trials and one hundred experimental trials. The task took approximately 20 minutes to complete.
The trials were divided into two equal blocks with a break of at least one minute in between, and the complete testing session lasted approximately 15 min.

Experiment 2: Perceived accuracy

Experiments 2a and 2b were designed to investigate the role of feedback on the precision of motion reproduction. The two experiments were identical except for the presentation of stimuli. In Experiment 2a, two stimuli were presented simultaneously for 750 ms at two distinct locations. In Experiment 2b, the same two locations were used, but the stimuli were presented sequentially, each for 750 ms. In Experiment 2b, the order of presentation and the colour cues were balanced across conditions.

In both experiments, at the end of each trial, following the response, we presented feedback in the form of the reported and target motion directions. Unbeknownst to participants, we manipulated the feedback by artificially magnifying errors for one stimulus colour. This was done by shifting the presented target motion direction ($\theta^*$) away from the reported direction ($\hat{\theta}$), thereby inflating the presented response error for the designated “difficult” item, according to the following equation:

$$\theta^* = \theta \pm 50 \sin(\hat{\theta} - \theta), \tag{1}$$

where $\theta$ is the true motion direction, and all angles are expressed in degrees. Similarly, we systematically minimized the error in the feedback for the other colour, designated as the “easy” item. The magnification and minimization of errors were randomly assigned to one of the two colours (i.e., green or blue) for each observer at the beginning of the experiment. The RDK stimuli were presented with 85% coherence.
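As an illustration of the feedback manipulation in Equation 1, the following Python sketch computes the displayed target direction for a given true direction and response. The function names are hypothetical, and the mapping of the two signs is our reading of the description rather than something stated explicitly: the minus sign is assumed to move the displayed target away from the response (magnified error), and the plus sign towards it (minified error).

```python
import math


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def displayed_target(theta, theta_hat, magnify, scale=50.0):
    """Target direction shown in feedback (degrees), following Eq. 1.

    Assumption of this sketch: the minus sign magnifies the displayed error
    (target shifted away from the response); the plus sign minifies it.
    """
    error = wrap_deg(theta_hat - theta)            # signed response error
    shift = scale * math.sin(math.radians(error))  # 50 * sin(theta_hat - theta)
    return wrap_deg(theta - shift if magnify else theta + shift)


def displayed_error(theta, theta_hat, magnify):
    """Signed error between the response and the displayed target (degrees)."""
    return wrap_deg(theta_hat - displayed_target(theta, theta_hat, magnify))
```

Under these assumptions, a true error of 20° would be displayed as roughly 37.1° for the error-magnified item and roughly 2.9° for the error-minified item, while a perfectly accurate response would be displayed without error in both cases.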
At the beginning of the experiment, during the instructions, we informed observers that individuals might vary in their ability to perceive the motion of stimuli of different colours. This was intended to make any perceived differences in difficulty appear plausible. At the end of the experiment, observers were debriefed and the true purpose of the study was revealed. In Experiments 2a and 2b, observers completed twelve practice trials and one hundred experimental trials. The trials were divided into two equal blocks with a break of at least one minute in between, and the complete testing session lasted approximately 15 min.

Experiment 3: Estimation difficulty

In Experiments 3a and 3b, we investigated the role of stimulus discriminability on the fidelity of visual representations. To achieve this, we presented two stimuli with different levels of coherence on 67% and 70% of all trials in Experiments 3a and 3b, respectively. Specifically, the stimulus of one colour was presented with 85% (high) and the stimulus of the other colour with 45% (low) coherence. These variable-coherence trials were randomly interleaved with trials where both stimuli had the same intermediate (65%) coherence. The assignment of low and high coherence to specific colours was randomized for each observer at the beginning of the experiments. No feedback was provided during these experiments.

Experiment 3a was conducted online, while Experiment 3b took place in the laboratory. In Experiment 3a, stimuli were presented simultaneously on all trials.
In Experiment 3b, the same observers performed the task with both simultaneous and sequential presentations, with the order of these conditions counterbalanced across participants. To prevent transfer effects between conditions, we used different colour combinations: in one condition, stimuli were presented in green and blue, while in the other, they were presented in orange and magenta. The colour combinations were randomly assigned to each presentation condition.

In Experiment 3a, observers completed twenty practice trials and one hundred fifty experimental trials. The trials were divided into two blocks with a mandatory break of at least one minute in between, resulting in a total testing session duration of around 15 minutes. Experiment 3b (i.e., the laboratory experiment) consisted of four hundred trials, divided into eight equal blocks. In half of the blocks, stimuli were presented simultaneously, while in the other half, they were presented sequentially. Half of the observers completed the simultaneous blocks first, followed by the sequential blocks, and vice versa for the other half. At the beginning of each block sequence (i.e., simultaneous or sequential task), observers performed twenty practice trials to familiarize themselves with the task. In Experiment 3b, observers were required to maintain central fixation throughout the stimulus presentation. If gaze deviated by more than 2°, a warning message appeared on the screen, and the trial was aborted and restarted with newly randomized stimuli. Completing Experiment 3b took approximately 90 minutes.

Analysis

All stimulus values were analysed and are reported with respect to the circular parameter space of possible motion directions, $[-\pi, \pi)$ radians. Response error for each trial was measured as the angular difference between the reported and target motion directions.
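The error measure can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names, not the analysis code used for the experiments. It wraps the angular difference to $[-\pi, \pi)$ and summarises a set of errors by their mean absolute value, which assumes that deviations are measured from zero error (the target) rather than from the mean error.

```python
import math


def circ_diff(a, b):
    """Signed angular difference a - b in radians, wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def response_errors(reported, targets):
    """Per-trial response errors for paired lists of directions (radians)."""
    return [circ_diff(r, t) for r, t in zip(reported, targets)]


def mean_absolute_deviation(errors):
    """Mean absolute error; assumes deviations are taken from zero error."""
    return sum(abs(e) for e in errors) / len(errors)
```

Wrapping matters near the boundary of the circular space: a report of 3.0 rad for a target at -3.0 rad is an error of about -0.28 rad, not 6 rad.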
To quantify the dispersion of response errors, we calculated the mean absolute deviation (MAD) across trials for each condition and observer. Higher MAD values indicate greater average reproduction error.

To compare differences in performance across conditions, we used Bayesian hypothesis tests, implemented in JASP [86] with the default Jeffreys-Zellner-Siow prior on effect sizes [87]. We report Bayes factors which compare the relative predictive adequacy of two competing hypotheses (e.g., alternative and null) and quantify the change in belief that the data bring about for the hypotheses under consideration [88]. For example, $BF_{10} = 10$ indicates that the data are ten times more likely to occur under the alternative hypothesis (i.e., there is a difference) than under the null hypothesis (i.e., there is no difference). Evidence for the null hypothesis is indicated by $BF_{10} < 1$, in which case the strength of evidence is indicated by $1/BF_{10}$. Evidence assessed via the Bayes Factor is best understood as a ratio-scaled value ranging from 0 to infinity. For clarity in communication, we also use an interpretative framework for Bayes Factor values, following the classification scheme outlined by Lee and Wagenmakers [89]: BF = 1 as no evidence; 1 < BF < 3 as weak or anecdotal evidence; 3 ≤ BF < 10 as moderate evidence; 10 ≤ BF < 30 as strong evidence; 30 ≤ BF < 100 as very strong evidence; BF ≥ 100 as extreme evidence. It is critical to note that while we utilize these discrete categories, they are arbitrary and should serve only as rough guidelines. Along with the Bayes factor, we report the median of the posterior distribution over the effect size ($\delta$) and the accompanying 95% credible interval (95% CI).

Neural resource model

We analysed observers’ response errors with an established model based on the principles of population coding [22, 51, 90, 91]. In this framework, a visual stimulus ($\theta$) is encoded by an idealized population of neurons whose activity is determined by their individual tuning functions. All neurons are assumed to share the same bell-shaped von Mises tuning function,

$$f_i(\theta) = \exp\left(\kappa\left(\cos(\theta \ominus \varphi_i) - 1\right)\right), \tag{2}$$

where $\kappa$ determines the tuning concentration, and $\ominus$ is subtraction on a circle. These tuning functions are translated through the feature space to peak at each neuron’s preferred value ($\varphi_i$), such that they provide dense uniform coverage of the entire feature space. In a population of $M$ neurons, the average response of the $i$th neuron in response to a stimulus value $\theta$ is obtained by scaling the output of the tuning function with the population’s mean total firing activity ($\gamma$),

$$\bar{n}_i(\theta) = \frac{\gamma}{M} f_i(\theta). \tag{3}$$

If activity associated with multiple stimuli is combined or normalized [92] at a population level $\gamma$, Equation 3 implements a form of limited resource [22]. The spike count produced by each neuron is drawn from a Poisson distribution,

$$n_i(\theta) \sim \mathrm{Poiss}(\bar{n}_i(\theta)), \tag{4}$$

and the decoded motion direction estimate is obtained by maximum likelihood estimation of the population spiking activity, $\mathbf{n}$:

$$\hat{\theta} = \arg\max_{\theta}\, p(\mathbf{n} \mid \theta). \tag{5}$$

The resulting distribution of decoding errors, for a given total number of spikes $m = \sum_i n_i \sim \mathrm{Poiss}(\gamma)$, is described as a mixture of von Mises ($\phi$) distributions,

$$p(\hat{\theta} \mid \theta, m) = \int p(r \mid m, \kappa)\, \phi(\hat{\theta}; \theta, r\kappa)\, dr, \tag{6}$$

with

$$p(r \mid m, \kappa) = \frac{I_0(\kappa r)}{(I_0(\kappa))^m}\, r\psi_m(r), \tag{7}$$

where $r\psi_m(r)$ is the probability density function for resultant length $r$ of a uniform random walk of $m$ steps.
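To make the encoding and decoding steps concrete, the following Python sketch simulates the model by Monte Carlo rather than through the analytical mixture in Equations 6 and 7. It is not the MATLAB toolbox implementation, and all function names are hypothetical. It relies on a shortcut that we assume holds in the limit of dense uniform coverage: the total spike count is drawn from Poiss($\gamma$), each spike is attributed to a neuron whose preferred direction is von Mises distributed around the stimulus with concentration $\kappa$, and the maximum likelihood estimate is the direction of the resultant vector of those preferred directions.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson sample with mean lam (Knuth's method; fine for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def decode_trial(theta, gamma, kappa, rng):
    """Simulate one trial and return the decoded direction in radians.

    Assumption of this sketch: with dense uniform coverage, the total spike
    count is Poisson(gamma) and each spike comes from a neuron whose preferred
    direction is von Mises distributed around theta.
    """
    m = poisson(gamma, rng)
    if m == 0:
        return rng.uniform(-math.pi, math.pi)  # no spikes: random guess
    prefs = [rng.vonmisesvariate(theta, kappa) for _ in range(m)]
    x = sum(math.cos(phi) for phi in prefs)
    y = sum(math.sin(phi) for phi in prefs)
    # Under these assumptions the likelihood peaks at the resultant direction.
    return math.atan2(y, x)


def mean_abs_error(gamma, kappa, n_trials=2000, seed=0):
    """Average absolute decoding error (radians) over simulated trials."""
    rng = random.Random(seed)
    theta, total = 0.0, 0.0
    for _ in range(n_trials):
        diff = decode_trial(theta, gamma, kappa, rng) - theta
        total += abs(math.atan2(math.sin(diff), math.cos(diff)))
    return total / n_trials
```

Comparing, for example, `mean_abs_error(50.0, 2.0)` with `mean_abs_error(2.0, 2.0)` illustrates the resource principle: lowering the mean total activity $\gamma$ available to an item increases its average decoding error.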
The full distribution of response errors predicted by the model is a mixture of probability distributions $p(\hat{\theta} \mid \theta, m)$, weighted with the probability of obtaining $m$ spikes. For a complete derivation of the distribution of response errors, see Bays [22] and Bays [71].

The model has two free parameters, the population’s mean total firing activity ($\gamma$), and the concentration of the tuning function ($\kappa$). In scenarios when multiple objects ($N$) need to be represented, the total resource $\gamma$ is typically divided equally among objects (i.e., $\gamma/N$). Here we extend this basic approach by incorporating an allocation parameter, or gain factor $\alpha$, which controls the neural activity allocated to one object (see also [22]). Without loss of generality, we fixed the gain factor for one object at 1, while treating the gain factor for object $j$ (see below for details of each experiment) as a free parameter when fitting the model to the data. The neural activity allocated to object $j$ can be expressed as $p_\alpha\gamma$, where $p_\alpha = \alpha/(1 + \alpha)$ represents the proportion of total neural activity. The remaining activity (proportion $1 - p_\alpha$) is allocated to the other object.

In Experiment 1, which involved the manipulation of external reward, the allocation weight for the high-reward item was fixed at 1, while the allocation weight for the low-reward item was freely estimated. In Experiment 2, in which we manipulated perceived accuracy, the allocation weight for the error-magnified item was fixed at 1, and the allocation weight for the error-minified item was freely estimated. In Experiment 3, which involved estimation difficulty manipulation, we simultaneously fitted responses on variable- and equal-coherence trials.
Building on our previous work [93], we assumed that the strength of the motion signal is controlled by the coherence level of RDK stimuli, such that the value encoded into the neural population is given by

$$\bar{\theta} \sim \mathrm{WN}(\theta, \sigma^2_{\mathrm{coherence}}), \tag{8}$$

where WN is a wrapped normal with mean $\theta$ and variance $\sigma^2_{\mathrm{coherence}}$ accounting for additive Gaussian noise. For simplicity, we considered 85% coherence (high coherence) as perceptually noiseless and further assumed that $\sigma^2_{45\%} > \sigma^2_{65\%}$, where 45% was the low-coherence level, and 65% was the intermediate-coherence level used in the equal-coherence trials. Additionally, the allocation weight for the low-coherence colour (45% and half of 65% stimuli) was fixed at 1, while the allocation weight for the high-coherence colour (85% and half of 65% stimuli) was freely estimated across variable- and equal-coherence trials. In other words, on equal-coherence trials, differences in response precision were explained solely by the allocation weight. In contrast, on variable-coherence trials, the allocation weight and perceptual noise jointly accounted for variations in response precision.

Optimal resource allocation

To identify the optimal levels of resource allocation, we conducted a simulation study.
For each observer, we simulated model predictions using the best-fitting parameters of the Neural resource model, specifically the mean total number of samples ($\gamma$) and the precision of a single sample ($\omega_1$), along with a grid of potential allocation weights.

For Experiment 1, we analytically determined the number of points based on the model-predicted response distributions under different allocation weights. The allocation weights were tested across a grid ranging from 0.001 to 2 in increments of 0.01, resulting in 200 distinct values. The optimal allocation weight was identified as the value that maximized the total reward across both high- and low-reward items.

For Experiment 2, we numerically simulated the variance (i.e., squared circular SD) of feedback errors using the same grid of allocation weights employed in Experiment 1. This analysis was based on $10^7$ simulated trials drawn from the error distribution predicted by the model. The optimal allocation weight was determined as the value that minimized the total variance of feedback errors across both error-minified and error-magnified items.

For Experiment 3, we analytically modelled the response variance on the variable-coherence trials (i.e., for the high- and low-coherence items) across a range of allocation weights. We employed a grid of 200 values, spanning from 0.01 to 6 in increments of 0.03. The optimal allocation weight was identified as the value that minimized the total response variance for both high- and low-coherence items (i.e., 85% and 45% coherence). In Experiment 3a, the simulation yielded values around $\alpha_{\mathrm{optimal}} = 1$ for all but one outlier observer, for whom the estimate reached the endpoint of the examined grid ($\alpha_{\mathrm{optimal}} = 6$).
This occurred due to the model estimating high levels of perceptual noise for medium- and low-coherence stimuli, suggesting that minimizing overall error would be achieved by allocating all resources to the high-coherence object. We exclude this data point in Figure 3D, and the comparison of observed and optimal allocations is based on the remaining observers. Including this observer’s data and performing a non-parametric test did not change our conclusions.

Reinforcement learning account of resource allocation

We developed a quantitative model to describe how the history of accumulated rewards from multiple objects influences subsequent resource allocation towards those objects. The proposed model extends the Neural resource model by incorporating a simple reinforcement learning (RL) rule, which directs behaviour towards more rewarding stimuli. Importantly, our model applies the same RL rule to both external and intrinsic rewards. In the standard RL framework, analysis typically focuses on external rewards, such as points or money, which are provided by the environment as a direct response to the agent’s actions. Our model broadens this scope to include intrinsic rewards (those that are inherently pleasurable and drive behaviour), such as the sense of being accurate in a task.

Drawing on the conducted experiments and the motion direction reproduction task, the general overview of this account is as follows: on a particular trial, the received points or money (extrinsic reward), perceived accuracy due to feedback (intrinsic reward), and an individual’s internal confidence-based estimate of precision (intrinsic reward) collectively update the value ($\nu$) of a particular object associated with these rewards.
This computed value influences the allocation of cognitive resources to that object in subsequent encounters, thereby modulating the precision with which the object is represented.

Formally, in the simplest scenario involving only two objects, rather than defining the accumulated reward for each object separately, we can define the relative accumulated reward on trial $t$ as:

$$\nu_t = (1 - y)\,\nu_{t-1} + \Delta_t, \tag{9}$$

$$\Delta_t = I_{\pm}\left(c_1 r^{\mathrm{ext}}_t + c_2 \exp\left(-\left|\epsilon^{\mathrm{fb}}_t\right|\right) + c_3 \kappa_{\hat{r},t}\right), \tag{10}$$

where $y$ is a leak component, $r^{\mathrm{ext}}_t$ is the number of received points, $\epsilon^{\mathrm{fb}}_t$ is feedback error, $\kappa_{\hat{r},t}$ is an estimate of internal confidence, and $\{c_1, c_2, c_3\}$ are respective weights accounting for different scales of rewards and the types of rewards prioritised by observers. The variable $I_{\pm}$ takes the value of +1 or -1 depending on the object identity, i.e., reproduction of the green or blue item, with the assignment of conditions being arbitrary. Positive values of $\nu$ indicate a higher relative value for one item ($I = +1$), while negative values of $\nu$ indicate a higher relative value for the other item ($I = -1$). To account for the fact that $\nu$ can range from $-\infty$ to $+\infty$, we transform it into gain parameter $\alpha$ using the following equation:

$$\alpha(\nu) = e^{2\nu}. \tag{11}$$

This allows us to compute the proportion of spiking activity allocated to the item identified as $I = +1$, given by $\alpha/(1 + \alpha)$, with the remaining spiking activity allocated to the other item. When $\nu = 0$, such as at the beginning of the task, both items are perceived as having equal value, resulting in an equal distribution of neural resources between them.

The leak component ($y$) functions as a temporal filter, modulating the influence of past rewards on resource allocation. When $y = 1$, the system entirely ignores accumulated past values, making the value of an object (and thus resource allocation in the next trial) rely exclusively on the reward from the most recent trial. Conversely, when $y = 0$, the accumulated value is fully retained and integrated with the most recent reward. The necessity of the leak component becomes particularly evident in scenarios where rewards are discontinued: a non-zero leak will gradually equalize the relative value and resource allocation across objects, returning them to a state of equilibrium.

The first reward component of Equation 10, $r^{\mathrm{ext}}$, reflects the experimental manipulation of Experiment 1. In this experiment, observers received 15 points for responses with an error of less than 50° for high-reward objects and 5 points for low-reward objects. Responses with an error greater than 50° received no points. When applying this model to the data, we used values of $r^{\mathrm{ext}} = \{0.15, 0.05, 0\}$ to represent the rewards for high-reward, low-reward, and no-reward trials, respectively.

The feedback component of the model ($\epsilon^{\mathrm{fb}}$) addresses the experimental manipulation of Experiment 2. In this experiment, we systematically manipulated feedback error by reducing it for one stimulus and increasing it for another.
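The update rule in Equations 9 to 11 can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the fitted model code. The units of the feedback error are not specified here and are left to the caller, and the per-trial confidence term is passed in as a number rather than derived from the population model.

```python
import math


def reward_increment(sign, r_ext=0.0, fb_error=0.0, confidence=0.0,
                     c1=0.0, c2=0.0, c3=0.0):
    """Eq. 10: signed, weighted sum of the three reward terms for one trial."""
    return sign * (c1 * r_ext + c2 * math.exp(-abs(fb_error)) + c3 * confidence)


def update_value(value, increment, leak):
    """Eq. 9: leaky accumulation of the relative value."""
    return (1.0 - leak) * value + increment


def allocation_proportion(value):
    """Eq. 11 followed by alpha / (1 + alpha): share given to the I = +1 item."""
    alpha = math.exp(2.0 * value)
    return alpha / (1.0 + alpha)


def simulate(trials, leak, c1=0.0, c2=0.0, c3=0.0):
    """Run the rule over trials given as (sign, r_ext, fb_error, confidence)."""
    value, shares = 0.0, []
    for sign, r_ext, fb_error, confidence in trials:
        shares.append(allocation_proportion(value))  # share in force on this trial
        delta = reward_increment(sign, r_ext, fb_error, confidence, c1, c2, c3)
        value = update_value(value, delta, leak)
    return value, shares
```

For example, alternating rewarded trials for a high-reward item (`sign = +1`, `r_ext = 0.15`) and a low-reward item (`sign = -1`, `r_ext = 0.05`) with `c1 = 1` drives the relative value above zero, so the share returned by `allocation_proportion` for the high-reward item rises above 0.5. Setting a weight to zero removes the corresponding reward term, mirroring how terms not manipulated in a given experiment are switched off.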
We hypothesized that feedback serves as an intrinsic reward, with stimuli receiving minified feedback errors being perceived as more rewarding than those with magnified feedback errors. In modelling this relationship, feedback error was assumed to be exponentially related to the object’s value $\nu$, with smaller feedback errors corresponding to higher rewards, leading to a greater increase in the object’s value. This exponential relationship reflects diminishing sensitivity to large feedback errors, such that a wide range of larger errors yields relatively minimal and similar rewards, whereas a narrow range of smaller errors results in significantly higher but more variable rewards.

The final component of our model is the estimate of internal confidence ($\kappa_{\hat{r}}$). While internal confidence can be assessed through self-reported or metacognitive measures, our approach leverages the inherent mechanism of the Neural resource model to quantify uncertainty in the decoded (i.e., reported) value. Our approach relies on the principle that the width of the likelihood function reflects the uncertainty of the estimate. The likelihood function evaluates how well various stimulus values align with the observed neural activity: a broad likelihood function is compatible with many different feature values, suggesting lower precision in the maximum likelihood estimate (the peak of the likelihood function), whereas a narrow likelihood function implies a more precise estimate. Due to the probabilistic generation of spikes across retrievals (Eq. 4), the likelihood has the form of a von Mises with concentration $\kappa_{\hat{r}}$ proportional to the resultant vector length of the preferred values associated with each of the emitted spikes ($m$), with higher spike counts producing a narrower likelihood function on average [51].
This formulation has previously been shown to quantitatively reproduce findings from studies in which participants were asked to rate their subjective confidence in each estimate [67, 71]. Consequently, the precision of the likelihood function emerges as a natural candidate for a computational estimate of the observer’s internal confidence.

To measure internal confidence associated with each response, we determined the most probable resultant vector length given the individual response errors and the probabilistic distribution of spike values, which was fully characterized by population parameters $\gamma$ and $\kappa$. Specifically, for each trial, we used Bayes rule to find the posterior probability of resultant vector length $r$ given the error on that trial $\epsilon$, marginalizing over total spike count $m$,

$$p(r \mid \epsilon, \kappa, \gamma) = \frac{p(\epsilon \mid r, \kappa)\, p(r \mid \kappa, \gamma)}{\int p(\epsilon \mid r, \kappa)\, p(r \mid \kappa, \gamma)\, dr}, \tag{12}$$

$$p(\epsilon \mid r, \kappa) = \phi(\epsilon; 0, \kappa r), \tag{13}$$

$$p(r \mid \kappa, \gamma) = \int p(r \mid m, \kappa)\, p(m \mid \gamma)\, dm, \tag{14}$$

where $p(r \mid m, \kappa)$ is given by Eq. 7 and $p(m \mid \gamma)$ is the Poisson p.m.f. with mean $\gamma$. Applying MAP estimation to this posterior distribution returns the most probable estimate of resultant length for a given response error,

$$\hat{r} = \arg\max_{r}\, p(r \mid \epsilon, \kappa, \gamma). \tag{15}$$

Finally, we use $\kappa_{\hat{r}} = \hat{r}\kappa$ as a measure of internal confidence on the given trial.

Model fitting

To model the observed allocation within the Neural resource model [22, 51], which has two free parameters (the mean population activity, $\gamma$, and the precision of the tuning functions, $\kappa$), we introduced an additional parameter, the gain modulation $\alpha$ [22], resulting in a total of three free parameters in Experiments 1 and 2.
In Experiment 3, which involved an estimation difficulty manipulation, the Neural resource model was extended by two additional parameters ($\sigma^2_{45\%}$ and $\sigma^2_{65\%}$) to capture the effects of variable sensory noise introduced by different coherence levels. This brought the total number of free parameters in the Neural resource model for Experiment 3 to five.

The Reinforcement learning account retained all parameters of the Neural resource model except the gain modulation parameter $\alpha$, while introducing four new parameters, namely the leak parameter ($y$) and reward weight parameters ($c_1$, $c_2$, $c_3$). In all three experiments, we modelled the leak parameter ($y$) and the effect of internal confidence ($c_3$) on resource allocation (see Eqs. 9 and 10); additionally, we modelled the effect of external reward ($c_1$) only in Experiment 1 while setting it to zero in all other experiments, and feedback error ($c_2$) only in Experiment 2 while also setting it to zero in all other experiments. This resulted in the estimation of five free parameters in Experiments 1 and 2, and six in Experiment 3. When fitting the model to the data, the leak parameter was constrained between 0 and 1, and all three weight parameters were limited to a range of -1 to 1.

For all models, we obtained a separate maximum likelihood fit for each individual observer. These fits were derived using the Nelder-Mead simplex method (via the fminsearch function in MATLAB). A MATLAB toolbox implementing the Neural resource model is available for download from https://bayslab.com/toolbox.
Acknowledgment

We thank David Aagten-Murphy and Robert Taylor, who worked on earlier iterations of this project. We thank Neha Abraham, Pepita Alex, Amida Anand, Paul McMeekin, Adam Sabo, Tom Wenban-Smith, Adam Zhu, and Adam Triabhall for assisting with data collection. This work was funded by the Wellcome Trust (grant 106926 to P.M.B.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author contributions

I.T. contributed to conceptualization, methodology, software, data collection, investigation, formal analysis, modelling, visualizations, and writing - original draft and revisions. R.R.R. contributed to methodology, software, and data collection. P.M.B. contributed to conceptualization, funding acquisition, supervision, methodology, formal analysis, modelling, visualizations, and writing - editing and revisions.

Data availability

Data and analysis code will be made publicly available upon publication of this manuscript.

References

[1] Berridge, K. C., Robinson, T. E., and Aldridge, J. W. Dissecting components of reward: ‘liking’, ‘wanting’, and learning. Current Opinion in Pharmacology 9.1 (2009), pp. 65–73. doi: 10.1016/j.coph.2008.12.014.
[2] Berridge, K. C. and Robinson, T. E. Parsing reward. Trends in Neurosciences 26.9 (2003), pp. 507–513. doi: 10.1016/S0166-2236(03)00233-9.
[3] Stănişor, L., Van Der Togt, C., Pennartz, C. M. A., and Roelfsema, P. R. A unified selection signal for attention and reward in primary visual cortex. Proceedings of the National Academy of Sciences 110.22 (2013), pp. 9136–9141. doi: 10.1073/pnas.1300117110.
[4] Schultz, W. Behavioral Theories and the Neurophysiology of Reward. Annual Review of Psychology 57.1 (2006), pp. 87–115. doi: 10.1146/annurev.psych.56.091103.070229.
[5] Maunsell, J. H.
Neuronal representations of cognitive state: reward or attention? Trends in Cognitive Sciences 8.6 (2004), pp. 261–265. doi: 10.1016/j.tics.2004.04.003.
[6] Blain, B. and Sharot, T. Intrinsic reward: potential cognitive and neural mechanisms. Current Opinion in Behavioral Sciences 39 (2021), pp. 113–118. doi: 10.1016/j.cobeha.2021.03.008.
[7] Navalpakkam, V., Koch, C., Rangel, A., and Perona, P. Optimal reward harvesting in complex perceptual environments. Proceedings of the National Academy of Sciences 107.11 (2010), pp. 5232–5237. doi: 10.1073/pnas.0911972107.
[8] Lee, J. and Shomstein, S. The Differential Effects of Reward on Space- and Object-Based Attentional Allocation. Journal of Neuroscience 33.26 (2013), pp. 10625–10633. doi: 10.1523/JNEUROSCI.5575-12.2013.
[9] Della Libera, C. and Chelazzi, L. Visual Selective Attention and the Effects of Monetary Rewards. Psychological Science 17.3 (2006), pp. 222–227. doi: 10.1111/j.1467-9280.2006.01689.x.
[10] Kristjansson, A., Sigurjonsdottir, O., and Driver, J. Fortune and reversals of fortune in visual search: Reward contingencies for pop-out targets affect search efficiency and target repetition effects. Attention, Perception & Psychophysics 72.5 (2010), pp. 1229–1236. doi: 10.3758/APP.72.5.1229.
[11] Anderson, B. A., Laurent, P. A., and Yantis, S. Value-driven attentional capture. Proceedings of the National Academy of Sciences 108.25 (2011), pp. 10367–10371. doi: 10.1073/pnas.1104047108.
[12] Della Libera, C. and Chelazzi, L. Learning to Attend and to Ignore Is a Matter of Gains and Losses.
Psychological Science 20.6 (2009), pp. 778–784. doi: 10.1111/j.1467-9280.2009.02360.x.
[13] Peck, C. J., Jangraw, D. C., Suzuki, M., Efem, R., and Gottlieb, J. Reward Modulates Attention Independently of Action Value in Posterior Parietal Cortex. The Journal of Neuroscience 29.36 (2009), pp. 11182–11191. doi: 10.1523/JNEUROSCI.1929-09.2009.
[14] Theeuwes, J. and Belopolsky, A. V. Reward grabs the eye: Oculomotor capture by rewarding stimuli. Vision Research 74 (2012), pp. 80–85. doi: 10.1016/j.visres.2012.07.024.
[15] Anderson, B. A. and Kim, H. Mechanisms of value-learning in the guidance of spatial attention. Cognition 178 (2018), pp. 26–36. doi: 10.1016/j.cognition.2018.05.005.
[16] Anderson, B. A. and Kim, H. On the representational nature of value-driven spatial attentional biases. Journal of Neurophysiology 120.5 (2018), pp. 2654–2658. doi: 10.1152/jn.00489.2018.
[17] Awh, E., Belopolsky, A. V., and Theeuwes, J. Top-down versus bottom-up attentional control: a failed theoretical dichotomy. Trends in Cognitive Sciences 16.8 (2012), pp. 437–443. doi: 10.1016/j.tics.2012.06.010.
[18] Anderson, B. A., Kim, H., Kim, A. J., Liao, M.-R., Mrkonja, L., Clement, A., and Grégoire, L. The past, present, and future of selection history. Neuroscience & Biobehavioral Reviews 130 (2021), pp. 326–350. doi: 10.1016/j.neubiorev.2021.09.004.
[19] Bays, P. M., Schneegans, S., Ma, W. J., and Brady, T. F. Representation and computation in visual working memory. Nature Human Behaviour 8.6 (2024), pp. 1016–1034. doi: 10.1038/s41562-024-01871-2.
[20] Bays, P. M., Gorgoraptis, N., Wee, N., Marshall, L., and Husain, M. Temporal dynamics of encoding, storage, and reallocation of visual working memory. Journal of Vision 11.10 (2011), pp. 6–6. doi: 10.1167/11.10.6.
[21] Gorgoraptis, N., Catalao, R. F. G., Bays, P. M., and Husain, M. Dynamic Updating of Working Memory Resources for Visual Objects. Journal of Neuroscience 31.23 (2011), pp.
8502–8511. doi: 10.1523/JNEUROSCI.0208-11.2011.
[22] Bays, P. M. Noise in Neural Populations Accounts for Errors in Working Memory. Journal of Neuroscience 34.10 (2014), pp. 3632–3645. doi: 10.1523/JNEUROSCI.3204-13.2014.
[23] Emrich, S. M., Lockhart, H. A., and Al-Aidroos, N. Attention mediates the flexible allocation of visual working memory resources. Journal of Experimental Psychology: Human Perception and Performance 43.7 (2017), pp. 1454–1465. doi: 10.1037/xhp0000398.
[24] Sprague, T. C., Itthipuripat, S., Vo, V. A., and Serences, J. T. Dissociable signatures of visual salience and behavioral relevance across attentional priority maps in human cortex. Journal of Neurophysiology 119.6 (2018), pp. 2153–2165. doi: 10.1152/jn.00059.2018.
[25] Yoo, A. H., Klyszejko, Z., Curtis, C. E., and Ma, W. J. Strategic allocation of working memory resource (2018). doi: 10.1101/329870.
[26] Taylor, R., Tomić, I., Aagten-Murphy, D., and Bays, P. M. Working memory is updated by reallocation of resources from obsolete to new items. Attention, Perception, & Psychophysics 85.5 (2023), pp. 1437–1451. doi: 10.3758/s13414-022-02584-2.
[27] Griffin, I. C. and Nobre, A. C. Orienting Attention to Locations in Internal Representations. Journal of Cognitive Neuroscience 15.8 (2003), pp. 1176–1194. doi: 10.1162/089892903322598139.
[28] Oberauer, K. Control of the Contents of Working Memory–A Comparison of Two Paradigms and Two Age Groups. Journal of Experimental Psychology: Learning, Memory, and Cognition 31.4 (2005), pp. 714–728. doi: 10.1037/0278-7393.31.4.714.
[29] Klyszejko, Z., Rahmati, M., and Curtis, C. E. Attentional priority determines working memory precision. Vision Research 105 (2014), pp. 70–76. doi: 10.1016/j.visres.2014.09.002.
[30] Atkinson, A. L., Oberauer, K., Allen, R. J., and Souza, A. S. Why does the probe value effect emerge in working memory?
Examining the biased attentional refreshing account.Psychonomic853 Bulletin & Review29.3 (2022), pp. 891–900.doi: 10.3758/s13423-022-02056-6.854 [31] Allen, R. J., Atkinson, A., and Hitch, G. J. Getting value out of working memory through855 strategic prioritisation; implications for storage and control.Quarterly Journal of Experimental856 Psychology (2024), p. 17470218241258102.doi: 10.1177/17470218241258102.857 23 .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 27, 2025. ; https://doi.org/10.1101/2025.04.25.650663doi: bioRxiv preprint [32] Gong, M. and Li, S. Learned reward association improves visual working memory.Journal of858 Experimental Psychology: Human Perception and Performance40.2 (2014), pp. 841–856.doi:859 10.1037/a0035131.860 [33] Brissenden, J. A., Adkins, T. J., Hsu, Y. T., and Lee, T. G. Reward influences the allocation but861 not the availability of resources in visual working memory.Journal of Experimental Psychology:862 General (2023). doi: 10.1037/xge0001370.863 [34] Van Den Berg, R., Zou, Q., Li, Y., and Ma, W. J. No effect of monetary reward in a visual864 working memory task. PLOS ONE 18.1 (2023), e0280257. doi: 10 . 1371 / journal . pone .865 0280257.866 [35] Atkinson, A. L., Berry, E. D., Waterman, A. H., Baddeley, A. D., Hitch, G. J., and Allen,867 R. J. Are there multiple ways to direct attention in working memory?Annals of the New York868 Academy of Sciences1424.1 (2018), pp. 115–126.doi: 10.1111/nyas.13634.869 [36] Gazzaley, A. and Nobre, A. C. Top-down modulation: bridging selective attention and working870 memory. Trends in Cognitive Sciences16.2 (2012), pp. 129–135.doi: 10.1016/j.tics.2011.871 11.014.872 [37] Awh, E., Vogel, E., and Oh, S.-H. 
Interactions between attention and working memory.Neuro-873 science 139.1 (2006), pp. 201–208.doi: 10.1016/j.neuroscience.2005.08.023.874 [38] Wolf, D. H., Gerraty, R., Satterthwaite, T. D., Loughead, J., Campellone, T., Elliott, M. A.,875 Turetsky, B. I., Gur, R. C., and Gur, R. E. Striatal intrinsic reinforcement signals during876 recognition memory: relationship to response bias and dysregulation in schizophrenia.Frontiers877 in Behavioral Neuroscience5 (2011), p. 81.doi: 10.3389/fnbeh.2011.00081.878 [39] Schultz, W., Dayan, P., and Montague, P. R. A Neural Substrate of Prediction and Reward.879 Science 275.5306 (1997), pp. 1593–1599.doi: 10.1126/science.275.5306.1593.880 [40] Knutson, B., Fong, G. W., Adams, C. M., Varner, J. L., and Hommer, D. Dissociation of reward881 anticipation and outcome with event-related fMRI:Neuroreport 12.17 (2001), pp. 3683–3687.882 doi: 10.1097/00001756-200112040-00016.883 [41] Elliott, R., Friston, K. J., and Dolan, R. J. Dissociable Neural Responses in Human Reward884 Systems. The Journal of Neuroscience20.16 (2000), pp. 6159–6165.doi: 10.1523/JNEUROSCI.885 20-16-06159.2000.886 [42] De Martino, B., Kumaran, D., Holt, B., and Dolan, R. J. The Neurobiology of Reference-887 Dependent Value Computation.The Journal of Neuroscience29.12 (2009), pp. 3833–3842.doi:888 10.1523/JNEUROSCI.4832-08.2009.889 [43] Han, S., Huettel, S. A., Raposo, A., Adcock, R. A., and Dobbins, I. G. Functional Significance890 of Striatal Responses during Episodic Decisions: Recovery or Goal Attainment?The Journal of891 Neuroscience 30.13 (2010), pp. 4767–4775.doi: 10.1523/JNEUROSCI.3077-09.2010.892 [44] Satterthwaite, T. D., Ruparel, K., Loughead, J., Elliott, M. A., Gerraty, R. T., Calkins, M. E.,893 Hakonarson, H., Gur, R. C., Gur, R. E., and Wolf, D. H. Being right is its own reward: Load and894 performance related ventral striatum activation to correct responses during a working memory895 task in youth.NeuroImage 61.3 (2012), pp. 
723–729.doi: 10.1016/j.neuroimage.2012.03.896 060.897 [45] Daniel, R. and Pollmann, S. Striatal activations signal prediction errors on confidence in the898 absence of external feedback. NeuroImage 59.4 (2012), pp. 3457–3467.doi: 10 . 1016 / j .899 neuroimage.2011.11.058.900 [46] Hebart, M. N., Schriever, Y., Donner, T. H., and Haynes, J.-D. The Relationship between901 Perceptual Decision Variables and Confidence in the Human Brain.Cerebral Cortex26.1 (2016),902 pp. 118–130.doi: 10.1093/cercor/bhu181.903 [47] Schwarze, U., Bingel, U., Badre, D., and Sommer, T. Ventral Striatal Activity Correlates with904 Memory Confidence for Old- and New-Responses in a Difficult Recognition Test.PLoS ONE905 8.3 (2013), e54324.doi: 10.1371/journal.pone.0054324.906 [48] Guggenmos, M., Wilbertz, G., Hebart, M. N., and Sterzer, P. Mesolimbic confidence signals907 guide perceptual learning in the absence of external feedback.eLife 5 (2016), e13388. doi:908 10.7554/eLife.13388.909 24 .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 27, 2025. ; https://doi.org/10.1101/2025.04.25.650663doi: bioRxiv preprint [49] Prinzmetal, W., Amiri, H., Allen, K., and Edwards, T. Phenomenology of attention: I. Color,910 location, orientation, and spatial frequency. en.Journal of Experimental Psychology: Human911 Perception and Performance24.1 (1998), pp. 261–282.doi: 10.1037/0096-1523.24.1.261.912 [50] Tomić, I., Adamcová, D., Fehér, M., and Bays, P. M. Dissecting the components of error in913 analogue report tasks.Behavior Research Methods(2024). doi: 10.3758/s13428-024-02453-914 w.915 [51] Schneegans, S., Taylor, R., and Bays, P. M. Stochastic sampling provides a unifying account916 of visual working memory limits. en.Proceedings of the National Academy of Sciences(2020),917 p. 202004306. 
doi: 10.1073/pnas.2004306117.918 [52] Kiani, R., Corthell, L., and Shadlen, M. N. Choice Certainty Is Informed by Both Evidence and919 Decision Time.Neuron 84.6 (2014), pp. 1329–1342.doi: 10.1016/j.neuron.2014.12.015.920 [53] Fleming, S. M. Metacognition and Confidence: A Review and Synthesis. Annual Review of921 Psychology 75.1 (2024), pp. 241–268.doi: 10.1146/annurev-psych-022423-032425.922 [54] Chetverikov, A. and Jehee, J. F. M. Motion direction is represented as a bimodal probability923 distribution in the human visual cortex.Nature Communications 14.1 (2023), p. 7634. doi:924 10.1038/s41467-023-43251-w.925 [55] Kwak, Y. and Curtis, C. E. Unveiling the abstract format of mnemonic representations.Neuron926 110.11 (2022), 1822–1828.e5.doi: 10.1016/j.neuron.2022.03.016.927 [56] Shadmehr, R., Reppert, T. R., Summerside, E. M., Yoon, T., and Ahmed, A. A. Movement928 Vigor as a Reflection of Subjective Economic Utility.Trends in Neurosciences 42.5 (2019),929 pp. 323–336.doi: 10.1016/j.tins.2019.02.003.930 [57] Summerside, E. M., Shadmehr, R., and Ahmed, A. A. Vigor of reaching movements: reward931 discounts the cost of effort.Journal of Neurophysiology119.6 (2018), pp. 2347–2357.doi: 10.932 1152/jn.00872.2017.933 [58] Manohar, S. G., Finzi, R. D., Drew, D., and Husain, M. Distinct Motivational Effects of Con-934 tingent and Noncontingent Rewards.Psychological Science 28.7 (2017), pp. 1016–1026.doi:935 10.1177/0956797617693326.936 [59] Berridge, K. C. and Robinson, T. E. What is the role of dopamine in reward: hedonic impact,937 reward learning, or incentive salience?Brain Research Reviews28.3 (1998), pp. 309–369.doi:938 10.1016/S0165-0173(98)00019-8.939 [60] Yoo, A. H. and Collins, A. G. E. How Working Memory and Reinforcement Learning Are Inter-940 twined:ACognitive,Neural,andComputationalPerspective. Journal of Cognitive Neuroscience941 34.4 (2022), pp. 551–568.doi: 10.1162/jocn_a_01808.942 [61] Serences,J.T.Value-BasedModulationsinHumanVisualCortex. 
Neuron60.6(2008),pp.1169–943 1181. doi: 10.1016/j.neuron.2008.10.051.944 [62] Ashby, F. G. and Maddox, W. T. Human Category Learning.Annual Review of Psychology945 56.1 (2005), pp. 149–178.doi: 10.1146/annurev.psych.56.091103.070217.946 [63] Hattie, J. and Timperley, H. The Power of Feedback. Review of Educational Research77.1947 (2007), pp. 81–112.doi: 10.3102/003465430298487.948 [64] Haddara, N. and Rahnev, D. The Impact of Feedback on Perceptual Decision-Making and949 Metacognition: Reduction in Bias but No Change in Sensitivity.Psychological Science 33.2950 (2022), pp. 259–275.doi: 10.1177/09567976211032887.951 [65] Rouault, M. and Fleming, S. M. Formation of global self-beliefs in the human brain.Proceedings952 of the National Academy of Sciences117.44 (2020), pp. 27268–27276.doi: 10 . 1073 / pnas .953 2003094117.954 [66] Bröker, F., Holt, L. L., Roads, B. D., Dayan, P., and Love, B. C. Demystifying unsupervised955 learning: how it helps and hurts.Trends in Cognitive Sciences28.11 (2024), pp. 974–986.doi:956 10.1016/j.tics.2024.09.005.957 [67] Berg, R. van den, Yoo, A. H., and Ma, W. J. Fechner’s law in metacognition: A quantitative958 model of visual working memory confidence.Psychological Review124.2 (2017), pp. 197–214.959 doi: 10.1037/rev0000060.960 25 .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 27, 2025. ; https://doi.org/10.1101/2025.04.25.650663doi: bioRxiv preprint [68] Li, H.-H., Sprague, T. C., Yoo, A. H., Ma, W. J., and Curtis, C. E. Joint representation of961 working memory and uncertainty in human cortex.Neuron 109.22 (2021), 3699–3712.e6.doi:962 10.1016/j.neuron.2021.08.022.963 [69] Ma, W. J., Beck, J. M., Latham, P. E., and Pouget, A. Bayesian inference with probabilistic964 population codes.Nature Neuroscience9.11 (2006), pp. 
1432–1438.doi: 10.1038/nn1790.965 [70] Pouget, A., Dayan, P., and Zemel, R. S. Inference and computation with population codes.966 Annual Review of Neuroscience26 (2003), pp. 381–410.doi: 10.1146/annurev.neuro.26.967 041002.131112.968 [71] Bays, P. M. A signature of neural coding at human perceptual limits.Journal of Vision16.11969 (2016), p. 4.doi: 10.1167/16.11.4.970 [72] Schneegans, S. and Bays, P. M. Drift in Neural Population Activity Causes Working Memory971 to Deteriorate Over Time. The Journal of Neuroscience 38.21 (2018), pp. 4859–4869.doi:972 10.1523/JNEUROSCI.3440-17.2018.973 [73] Schaffner, J., Bao, S. D., Tobler, P. N., Hare, T. A., and Polania, R. Sensory perception relies974 on fitness-maximizing codes.Nature Human Behaviour (2023). doi: 10.1038/s41562- 023-975 01584-y.976 [74] Shuler, M. G. and Bear, M. F. Reward Timing in the Primary Visual Cortex.Science 311.5767977 (2006), pp. 1606–1609.doi: 10.1126/science.1123513.978 [75] Badre, D. Cognitive Control. Annual Review of Psychology(2024). doi: 10.1146/annurev-979 psych-022024-103901.980 [76] Inzlicht, M., Shenhav, A., and Olivola, C. Y. The Effort Paradox: Effort Is Both Costly and981 Valued. Trends in Cognitive Sciences22.4 (2018), pp. 337–349.doi: 10.1016/j.tics.2018.982 01.007.983 [77] Westbrook, A. and Braver, T. S. Cognitive effort: A neuroeconomic approach.Cognitive, Affec-984 tive, & Behavioral Neuroscience15.2 (2015), pp. 395–415.doi: 10.3758/s13415-015-0334-y.985 [78] Shenhav, A., Musslick, S., Lieder, F., Kool, W., Griffiths, T. L., Cohen, J. D., and Botvinick,986 M. M. Toward a Rational and Mechanistic Account of Mental Effort.Annual Review of Neuro-987 science 40.1 (2017), pp. 99–124.doi: 10.1146/annurev-neuro-072116-031526.988 [79] Kool,W.,McGuire,J.T.,Rosen,Z.B.,andBotvinick,M.M.Decisionmakingandtheavoidance989 of cognitive demand.Journal of Experimental Psychology: General139.4 (2010), pp. 665–682.990 doi: 10.1037/a0020198.991 [80] Corlazzoli, G., Desender, K., and Gevers, W. 
Feeling and deciding: Subjective experiences rather992 than objective factors drive the decision to invest cognitive control. Cognition 240 (2023),993 p. 105587. doi: 10.1016/j.cognition.2023.105587.994 [81] Kool, W. and Botvinick, M. The intrinsic cost of cognitive control.Behavioral and Brain Sci-995 ences 36.6 (2013), pp. 697–698.doi: 10.1017/S0140525X1300109X.996 [82] Botvinick, M. M., Huffstetler, S., and McGuire, J. T. Effort discounting in human nucleus997 accumbens. Cognitive, Affective, & Behavioral Neuroscience9.1 (2009), pp. 16–27.doi: 10.998 3758/CABN.9.1.16.999 [83] Brainard, D. H. The Psychophysics Toolbox. Spatial Vision 10.4 (1997), pp. 433–436.doi:1000 https://doi.org/10.1163/156856897X00357.1001 [84] Pelli, D. G. The VideoToolbox software for visual psychophysics: transforming numbers into1002 movies. Spatial Vision10.4 (1997), pp. 437–442.1003 [85] Scase, M. O., Braddick, O. J., and Raymond, J. E. What is Noise for the Motion System?1004 Vision Research36.16 (1996), pp. 2579–2586.doi: 10.1016/0042-6989(95)00325-8.1005 [86] JASP Team. JASP (Version 0.18.3)[Computer software]. 2024.1006 [87] Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. Mixtures of g Priors for1007 BayesianVariableSelection. Journal of the American Statistical Association103(2008),pp.410–1008 423. doi: 10.1198/016214507000001337.1009 [88] Wagenmakers, E.-J. et al. Bayesian inference for psychology. Part II: Example applications with1010 JASP. Psychonomic Bulletin & Review25.1 (2018), pp. 58–76.doi: 10.3758/s13423- 017-1011 1323-7.1012 26 .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted April 27, 2025. ; https://doi.org/10.1101/2025.04.25.650663doi: bioRxiv preprint [89] Lee, M. D. and Wagenmakers, E.-J. Bayesian cognitive modeling: a practical course. 
Cambridge; New York: Cambridge University Press, 2013. 264 pp.
[90] Tomić, I. and Bays, P. M. Perceptual similarity judgments do not predict the distribution of errors in working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition 50.4 (2024), pp. 535–549. doi: 10.1037/xlm0001172.
[91] Tomić, I. and Bays, P. M. A dynamic neural resource model bridges sensory and working memory. eLife 12 (2024), RP91034. doi: 10.7554/eLife.91034.3.
[92] Carandini, M. and Heeger, D. Normalization as a canonical neural computation. Nature Reviews Neuroscience 13.1 (2012), pp. 51–62. doi: 10.1038/nrn3136.
[93] Tomić, I. and Bays, P. M. Internal but not external noise frees working memory resources. PLOS Computational Biology 14.10 (2018), e1006488. doi: 10.1371/journal.pcbi.1006488.

Supplementary Information

Psychophysical data

Experiment 2b: Sequential presentation

[Figure S1 panels: A) density against response error, for magnified and minified feedback error; B) MAD against feedback error (magnified, minified).]

Figure S1: Perceived accuracy manipulation in Experiment 2b (sequential presentation). A) Histograms represent distributions of response errors. B) Mean absolute deviation of response errors. The coloured circles with error bars represent the mean ± SE.

Experiment 3b: Sequential presentation

[Figure S2 panels: A, B) density against response error, for the high coherence and low coherence colours; C) MAD against coherence (High, Low, Inter. (High), Inter. (Low)).]

Figure S2: Estimation difficulty manipulation in Experiment 3b (sequential presentation). A & B) Histograms represent distributions of response errors. Panel A depicts variable coherence trials, and panel B depicts equal coherence trials. C) Mean absolute deviation of response errors. The coloured circles with error bars represent the mean ± SE.

Reinforcement learning account

External reward

The average trajectory of resource allocation shown in Figure 5A is based on ML parameter estimates (mean ± SE): mean activity γ = 2.88 ± 0.39; tuning precision κ = 10.29 ± 0.98; leak y = 0.28 ± 0.06; reward weight c1 = 0.31 ± 0.14; internal confidence weight c3 = 0.012 ± 0.01. Calculating the correlation between parameter estimates from the Reinforcement learning account and the neural model with freely estimated resource allocation, we found highly consistent estimates of the population’s mean spiking activity (r = 0.997, 95% CI = [0.992, 0.999], BF10 = 1.46 × 10^28) and tuning precision (r = 0.971, 95% CI = [0.929, 0.986], BF10 = 3.16 × 10^15) (Fig. S4A).

Finally, to assess whether observers prioritised external rewards or internal confidence signals in resource allocation, we calculated each observer’s mean contribution of external rewards and internal confidence to the relative value of the two objects.
Results indicated moderate evidence for no difference between their contributions (BF10 = 0.22; δ = 0.076, 95% CI = [-0.263, 0.418]). However, this finding should be interpreted with caution, as the effect of internal confidence may partly reflect observers’ resource allocation favouring the high-reward item (reflecting the influence of external reward), which subsequently enhances confidence for that item.

Intrinsic reward: Perceived accuracy

The mean trajectory of resource allocation across trials shown in Figure 5D is based on ML parameter estimates (mean ± SE): mean activity γ = 5.01 ± 0.78; tuning precision κ = 6.64 ± 0.51; leak y = 0.538 ± 0.076; feedback weight c2 = 0.238 ± 0.108; internal confidence weight c3 = 0.007 ± 0.044. Comparing the estimates derived from the Neural resource model and the Reinforcement learning account (Fig. S4B), we again found that the RL account’s estimates closely matched those of the Neural resource model for the population’s mean spiking activity (r = 0.987, 95% CI = [0.964, 0.994], BF10 = 1.26 × 10^16) and tuning precision (r = 0.956, 95% CI = [0.884, 0.980], BF10 = 4.7 × 10^10).

Finally, we found moderate evidence for no difference between feedback and internal confidence signals in their contribution to the relative value of objects (BF10 = 0.33; δ = 0.174, 95% CI = [-0.196, 0.553]).

Intrinsic reward: Estimation difficulty

Fitting the Reinforcement learning account to psychophysical data from Experiment 3a, we obtained the following ML parameters (mean ± SE): mean activity γ = 3.33 ± 0.39; tuning precision κ = 11.53 ± 0.61; leak y = 0.398 ± 0.086; confidence weight c3 = 0.062 ± 0.045; intermediate perceptual noise SD65% = 0.143 ± 0.021; high perceptual noise SD45% = 0.338 ± 0.086.
In Experiment 3b we observed very similar estimates: mean activity γ = 2.13 ± 0.31; tuning precision κ = 13.03 ± 1.70; leak y = 0.285 ± 0.083; confidence weight c3 = 0.007 ± 0.004; intermediate perceptual noise SD65% = 0.090 ± 0.021; high perceptual noise SD45% = 0.249 ± 0.056. Again, we visualised the obtained individual trajectories in example participants (Fig. S3C & D).

In both experiments, estimates obtained with the Neural resource model and the Reinforcement learning account strongly covaried (Fig. S4C & D). Specifically, we found highly consistent estimates of the population’s mean spiking activity (Exp 3a: r = 0.995, 95% CI = [0.986, 0.998], BF10 = 6.15 × 10^17; Exp 3b: r = 0.999, 95% CI = [0.996, 1.000], BF10 = 1.66 × 10^19), tuning precision (Exp 3a: r = 0.912, 95% CI = [0.761, 0.961], BF10 = 2.84 × 10^6; Exp 3b: r = 0.970, 95% CI = [0.899, 0.989], BF10 = 5.66 × 10^8), intermediate perceptual noise (Exp 3a: r = 0.922, 95% CI = [0.785, 0.966], BF10 = 8.00 × 10^6; Exp 3b: r = 0.985, 95% CI = [0.947, 0.994], BF10 = 8.17 × 10^10), and high perceptual noise (Exp 3a: r = 0.659, 95% CI = [0.293, 0.830], BF10 = 48.1; Exp 3b: r = 0.896, 95% CI = [0.696, 0.957], BF10 = 6.80 × 10^5).
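
As a rough illustration of how agreement between two sets of per-observer parameter estimates can be quantified, the sketch below computes a Pearson correlation and an approximate 95% interval via the Fisher z-transform. This is a swapped-in frequentist approximation: the intervals and Bayes factors reported above come from a Bayesian analysis, which this code does not reproduce. The function names and the example values are hypothetical and are not data from the experiments.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two equal-length samples with at least 4 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_interval(r, n, z_crit=1.959964):
    """Approximate 95% interval for r via the Fisher z-transform.

    Only valid for |r| < 1; a perfect correlation has no finite z value.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # Hypothetical per-observer estimates of one parameter under two models.
    neural_model = [2.1, 3.4, 2.9, 5.0, 4.2, 3.7]
    rl_account = [2.0, 3.6, 3.1, 4.8, 4.1, 3.9]
    r = pearson_r(neural_model, rl_account)
    low, high = fisher_interval(r, len(neural_model))
    print(f"r = {r:.3f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

With small samples the Fisher interval is wide and only approximate, which is one reason a Bayesian analysis such as the one reported here may be preferred.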

[Figure S3 panels: A) External reward (Exp 1); B) Perceived accuracy (Exp 2a); C) Estimation difficulty (Exp 3a); D) Estimation difficulty (Exp 3b). Each panel plots resource fraction (0 to 1) against trial number for Observers 1 to 3.]

Figure S3: A) Trial-by-trial resource allocation estimated by the RL account in the external reward experiment (Experiment 1) for three illustrative participants. Circles represent the fraction of resources allocated to the preferred item on each trial. B) Perceived accuracy experiment (Experiment 2). C) Estimation difficulty experiment (Experiment 3a). D) Estimation difficulty experiment (Experiment 3b).
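
The trajectories in Figure S3 come from the RL account defined in the main text. The sketch below is not that model. It only illustrates, under assumed functional forms, how a leaky value update driven by weighted reward and confidence signals could produce a trial-by-trial resource fraction. The update rule, the logistic mapping from value difference to resource fraction, and all function and key names are assumptions made for illustration; the parameter values in the demo loosely echo the Experiment 1 means and carry no further meaning.

```python
import math


def update_values(values, signals, leak, weights):
    """One leaky update of per-item values (assumed form, not the paper's rule).

    values  : list of current values, one per item
    signals : list of dicts, one per item, e.g. {"reward": 1.0, "confidence": 0.5}
    leak    : fraction of the old value lost on each trial (0 to 1)
    weights : dict mapping signal name to its weight
    """
    new_values = []
    for value, item_signals in zip(values, signals):
        drive = sum(w * item_signals.get(name, 0.0) for name, w in weights.items())
        new_values.append((1.0 - leak) * value + drive)
    return new_values


def resource_fraction(values):
    """Fraction of resource for item 0: logistic of the value difference (assumed)."""
    return 1.0 / (1.0 + math.exp(-(values[0] - values[1])))


def simulate(trials, leak, weights):
    """Return the resource fraction for item 0 at the start of each trial."""
    values = [0.0, 0.0]
    fractions = []
    for signals in trials:
        fractions.append(resource_fraction(values))
        values = update_values(values, signals, leak, weights)
    return fractions


if __name__ == "__main__":
    # Hypothetical task: item 0 is always rewarded, both items feel equally confident.
    trial = ({"reward": 1.0, "confidence": 0.5}, {"reward": 0.0, "confidence": 0.5})
    fractions = simulate([trial] * 20, leak=0.28,
                         weights={"reward": 0.31, "confidence": 0.012})
    for t, f in enumerate(fractions, start=1):
        print(t, round(f, 3))
```

In this toy setting the leak bounds how far the values can diverge, so the resource fraction drifts towards the rewarded item and then saturates rather than running to 0 or 1.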

[Figure S4 panels A to D: RL account estimates plotted against Neural resource model estimates, for spiking activity and tuning precision.]

Figure S4: Correlation between mean activity (top row) and tuning precision (bottom row) parameters estimated in the Neural resource model and the RL account of resource allocation. A) External reward experiment (Experiment 1). B) Perceived accuracy experiment (Experiment 2). C) Estimation difficulty experiment (Experiment 3a). D) Estimation difficulty experiment (Experiment 3b).



License: CC-BY-4.0