{"paper_id":"6871087c-c9be-4b26-82da-498bc82c533d","body_text":"Intrinsic rewards guide visual resource allocation via reinforcement learning\nIvan Tomić1,2,∗, Rodrigo Raimundo1, Paul M. Bays1\n1University of Cambridge, Department of Psychology, Cambridge, UK\n2University of Zagreb, Department of Psychology, Zagreb, CRO\n∗corresponding author: ivn.tomic@gmail.com\nAbstract\nHumans and other animals prioritise visual processing of stimuli that signal rewards. While prior research has focused on tangible incentives (e.g., money or food), the effects of intrinsic incentives – such as perceived competence – are less well understood. Across a series of visual estimation experiments, we manipulated observers’ subjective sense of confidence in their judgements using either deceptive trial-by-trial feedback or real discrepancies in stimulus reliability. We found that observers prioritised encoding of stimuli associated with lower uncertainty or error, benefiting performance for stimuli already estimated accurately, while further impairing performance for those estimated poorly. These reward-driven biases, while potentially adaptive, impaired overall accuracy in the present tasks by causing resource allocation to deviate from the error-minimizing strategy. To account for these findings, we supplemented a normalization model of neural resource allocation with a simple reinforcement learning rule. Intrinsic and extrinsic rewards cumulatively shaped the values assigned to different stimuli by the model, and the resulting discrepancies biased resource allocation and thereby estimation error, quantitatively matching the data. 
These findings reveal how intrinsic reward signals can shape resource allocation in ways that are both adaptive and counterproductive, offering a computational basis for the motivational biases underlying cognitive performance.\nKeywords: population coding, reinforcement learning, resource allocation, attention, working memory\nbioRxiv preprint doi: https://doi.org/10.1101/2025.04.25.650663; this version posted April 27, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.\nIntroduction\nTo support adaptive behaviour and ensure survival, the brain has evolved to prioritise environmental cues that signal potential rewards [1, 2]. Selectively attending to reward-predicting stimuli facilitates efficient navigation of complex environments, helping organisms move towards more rewarding states [3, 4]. This selection process is crucial given the brain’s limited processing capacity, as it enhances internal representations of valuable stimuli and facilitates the formation of stimulus-reward associations [5]. Whereas the bias towards processing stimuli associated with tangible rewards is well established, the influence of intrinsic rewards – positive motivational states associated with feelings of satisfaction and competence [6] – on sensory processing remains less understood.\nExperiments using points-based and monetary incentives have found that associating stimuli with a higher probability, or greater magnitude, of external reward facilitates voluntary, or top-down, attention [7–9]. Additionally, in visual search tasks, which primarily engage bottom-up processes, search times are faster for pop-out targets associated with higher rewards than for those predicting less or no reward [10]. 
Notably, the prioritisation of reward-associated stimuli persists in subsequent tasks even when reward contingencies are removed, and previously rewarded features cease to be salient or task-relevant [11–13]. Consistent with this, studies have shown that eye movements are biased towards objects and spatial locations previously associated with rewards [14–16]. This continued prioritisation of previously rewarded stimuli, even when it no longer aligns with immediate task goals, suggests that reward learning creates a lasting effect that can involuntarily bias attention towards these stimuli [17, 18].\nThe influence of external rewards on behaviour extends to visual working memory (VWM) [19], which is known for its ability to flexibly store and maintain features of multiple objects within a limited capacity [20–28]. The precision of representations increases as a function of the associated reward, indicating that VWM allocation also tracks reward values when multiple objects provide different rewards ([29, 30]; see [31] for a review). Objects that were previously associated with reward are also better remembered, even when they are currently task-irrelevant [32]. Crucially, however, total VWM capacity does not show flexibility with reward [33, 34], which is further evidenced by findings that improved performance for high-reward items is accompanied by a corresponding decline in performance for low-reward items [35]. These results demonstrate that stimuli can be strategically prioritised for encoding in VWM through selective attention, leading to flexible allocation of limited capacity between items based on their assigned subjective values [36, 37].\nNeuroimaging studies suggest that intrinsic rewards can have similar effects on the neural system as external rewards. 
Successful information retrieval in cognitive tasks has been argued to be psychologically rewarding [38], and studies have shown elevated activation in the striatum – a region traditionally associated with the motivational significance of actions [39–42] – in response to correct responses, even in the absence of explicit rewards [38, 43, 44]. This activation is driven not by the successful retrieval of information itself, but rather by the satisfaction of the observer’s internal goals [38, 43]. Similarly, changes in confidence levels, which reflect subjective evaluations of correctness, have also been shown to modulate striatal activation [45–47]. Building on evidence of subjective confidence signals in the brain’s reward circuits, it has been argued that the brain reinforces behaviours linked to high-confidence states while diminishing those associated with low confidence [48]. Together, growing evidence suggests that internally generated signals, particularly those related to perceived accuracy and performance evaluation, are represented similarly in the brain to explicit, externally administered rewards, raising the possibility that they may similarly bias sensory processing.\nIn the present study, we combined psychophysical measurement and computational modelling to investigate how different intrinsic and extrinsic factors affect the competition between visual stimuli for processing resources. We used a modified analogue report task [49, 50] in which observers were instructed to reproduce the direction of one of a pair of motion stimuli that differed in their associated history of reward. Across a series of experiments, we found that performance was consistently better for stimuli previously associated with larger extrinsic reward, but also for those associated with lower uncertainty or with improved performance feedback. 
To provide a mechanistic explanation of the observed behaviour, we developed a computational model that relates accumulation of past rewards, both intrinsic and extrinsic, to allocation of neural resources between stimuli, which in turn influences estimation performance.\nResults\nDifferential rewards bias resource allocation\nBuilding on existing evidence that external rewards can bias information processing, we began our investigation by quantifying their effects on representational fidelity in a motion reproduction task. In Experiment 1, observers viewed two coloured motion stimuli, and after a brief delay and the presentation of a colour cue, they were asked to reproduce the motion direction of the cued stimulus (Fig. 1A & B). Critically, in this experiment, we associated the colours of the stimuli with different external rewards by awarding accurate recall (< 50° absolute error) with 15 points when items of one colour were tested versus 5 points for the other colour. Accumulated points were converted into a bonus payment to the observer. At the end of the experiment, all observers correctly identified which stimulus had provided the larger rewards. To determine whether the difference in external rewards influenced reproduction precision, we compared the mean absolute deviation (MAD) of responses between stimuli of different colours. 
We found strong evidence that response errors were smaller for items of the colour associated with the larger reward (BF10 = 18.7, median of the posterior over effect size δ = 0.575, 95% credible interval = [0.195, 0.966]) (Fig. 1C & D).\n[Figure 1 image omitted. Panel labels: A) Motion stimuli, Delay, Cue, Response. B) Reward as a function of response error (reward zone: 5 or 15 points; no points outside it). C) Density of response error for high-reward and low-reward items. D) MAD by reward (High, Low). E) Allocation α, observed vs reward maximization; inset: total normalized reward as a function of α.]\nFigure 1: External reward manipulation in Experiment 1. A) Schematic of the task. B) Illustration of the experimental manipulation. Responses within 50 degrees of the target motion direction were rewarded with either 15 or 5 points, depending on the colour of the cued object. C) Distribution of response errors and corresponding fits of the Neural resource model. Histograms represent the data, while coloured curves and shaded areas depict model predictions (M ± SE). D) Mean absolute deviation (MAD) of response errors. The coloured circles with error bars represent the mean ± SE. Dashed line indicates chance level performance. E) Observed (i.e., freely estimated) resource allocation compared to the optimal allocation aimed at maximizing the total points in the task. For visualisation purposes, allocation towards the low-reward item is shown. Dashed line indicates equal allocation. Allocation smaller than 1 indicates that more resource was allocated towards the high-reward item (1:0.695 for high- vs low-reward item). The inset shows individual reward functions relating resource allocation to expected point totals, with each curve’s peak indicating the allocation that maximizes expected reward. 
For ease of visualization, only a subset of observers is shown, and all curves are normalized to the same total reward.\nNeural resource allocation\nThe results of Experiment 1 indicate that observers prioritised encoding the stimulus associated with the larger reward. Importantly, recall for the low-reward item remained reliably better than chance, suggesting that prioritisation was graded rather than all-or-none. To quantify the share of resources allocated to each item, we applied a normalization-based population coding model [22, 51] to the data from Experiment 1. In this model, neural firing rate takes the role of a limited resource, which, in the simplest scenario, would be equally distributed between stimuli. Here, we extended this model by freely fitting a gain modulation parameter, α, which increased the activity encoding one of the two stimuli while keeping the total gain (i.e., mean activity) of the population constant (see the Neural resource model section for more detail). 
Consistent with the observed difference in response error, we found strong evidence for unequal resource allocation favouring items of the highly rewarded colour (low/high ratio 0.695; difference from equal allocation, BF10 = 23.8, δ = 0.594, 95% CI = [0.211, 0.987]) (Fig. 1E).\nWe next investigated whether observers distributed resources in a way that would maximize the total number of collected points, which we considered an optimal allocation strategy for this task. To test this, we calculated the expected number of points awarded for a range of different allocation weights (see Optimal resource allocation for more detail). The values of α that maximized the reward are shown in Figure 1E (optimal allocation). Comparing the observed and optimal weights revealed strong evidence for a difference between the two (BF10 = 92.9, δ = 0.694, 95% CI = [0.297, 1.102]), with observers distributing resources more equally than would be required to maximize the total number of points (optimal low/high ratio 0.34).\nA reward-maximization strategy would come at the cost of higher error for the low-reward item. This could suggest that, in addition to maximizing external rewards, observers may also aim to achieve a certain level of accuracy on the task across all items, potentially because they find accuracy intrinsically rewarding (see also [25]).\nPerceived accuracy biases resource allocation\nHaving confirmed that external rewards modulated allocation in the motion reproduction task, we next investigated effects of perceived accuracy, a possible form of intrinsic reward, on the same task. In Experiment 2a we presented manipulated feedback at the end of each trial to influence observers’ perception of their reproduction accuracy. Observers were again presented with two coloured stimuli, and reproduced the one indicated by a colour cue. 
We magnified the error presented at feedback when one colour was cued and minified the error at feedback for the other colour (Fig. 2A & B). A post-experimental questionnaire revealed that 84% of observers judged stimuli of the colour associated with error-magnified feedback as more difficult to remember, indicating that we successfully associated stimulus identity (i.e., colour) with perceived difficulty.\nTo assess the effects of perceived difficulty on response precision, we compared MAD between the response and the true target direction (rather than the one shown as feedback) for stimuli of the two colours (Fig. 2C & D). We found responses to be more precise for the stimulus with reduced feedback error, i.e., the one perceived as easier to remember (BF10 = 29.4, δ = 0.672, 95% CI = [0.24, 1.12]). This finding indicates that the perception of better performance for stimuli of one colour, induced by feedback, led to improved actual performance for those stimuli.\nThe observed effect could be attributed to either capture of visual attention by the “easier” item (i.e., competition for visual processing resources) or the mnemonic prioritisation of that item (i.e., competition for memory resources). To differentiate between these possibilities, we conducted a follow-up experiment. Experiment 2b replicated the conditions of Experiment 2a but with stimuli presented sequentially to reduce encoding competition between the two objects and minimize the influence of attentional selection on resource allocation. As in Experiment 2a, most observers (89%) judged the colour associated with magnified feedback errors as more difficult to remember. 
However, in contrast to Experiment 2a, comparing response errors across the two stimuli (Fig. S1) revealed that the observed data were nine times more likely under the null hypothesis, providing moderate evidence for a lack of difference in response precision between the two colours (BF10 = 0.11, δ = 0.009, 95% CI = [-0.183, 0.202]). This finding suggests that the effect observed in Experiment 2a is likely due to attentional competition during encoding. When that competition is mitigated, observers do not show preferential encoding based on perceived difficulty.\nNeural resource allocation\nThe results of Experiment 2a show that observers prioritised encoding of the error-minified stimulus, i.e., the one signalling better performance. Crucially, the error-magnified stimulus was still recalled with above-chance precision, consistent with a graded rather than all-or-none allocation of resources. To quantify resource distribution between the two objects, we again applied the Neural resource model to the data, with results illustrated in Figure 2C & E. We found that, on average, observers allocated 1.18 times more resources towards the error-minified stimulus (difference from equal allocation, BF10 = 2.63, δ = 0.45, 95% CI = [0.05, 0.86]; 19 out of 25 observers had α_observed > 1; Fig. 2D), consistent with the observed difference in response error between stimuli of the two colours.\n[Figure 2 image omitted. Panel labels: A) Motion stimuli, Delay, Cue, Response, Feedback. B) Feedback error as a function of response error (rad), error minifying vs error magnifying. C) Density of response error for minified and magnified feedback error. D) MAD by feedback error (Minified, Magnified). E) Allocation α, observed vs feedback error minimization; inset: total feedback variance as a function of α.]\nFigure 2: Perceived accuracy manipulation in Experiment 2a. A) Schematic of the task. B) Experimental manipulation illustration. Feedback error was magnified for one stimulus and minified for the other, based on the colour of the cued object. C) Distribution of response errors and corresponding fits of the Neural resource model. Histograms represent the data, while coloured curves and shaded areas depict model predictions (M ± SE). D) Mean absolute deviation of response errors. The coloured circles with error bars show the mean ± SE. Dashed line indicates chance level performance. E) Observed resource allocation and optimal allocation aiming to minimize overall feedback variance in the task. Dashed line indicates equal allocation. Allocation larger than 1 indicates that more resource was allocated towards the error-minified item (minified vs magnified: 1.18:1). The inset illustrates individual variability in feedback error as a function of resource allocation, with each curve’s trough indicating the allocation level that minimizes feedback error. For ease of visualization, only a subset of observers is shown, and all curves are normalized to the same range of feedback variance.\nNext, we investigated whether the observed allocation matched the predictions of an ideal observer who optimally weights neural activity to minimize overall feedback error in the task. We calculated the expected variance of feedback error across both items for a range of different allocation weights, and Figure 2E shows optimal allocation weights that minimize this variance. The optimal strategy would require allocating roughly twice as much resource to the error-magnified item as to the error-minified item (α_optimal = 0.52). Importantly, we found extremely strong evidence that this was inconsistent with the observed allocation, which favoured the error-minified item (BF10 = 3.77 × 10^6, δ = 1.72, 95% CI = [1.09, 2.39]). Overall, these results indicate that observers did not adopt an allocation strategy that would minimize their feedback error variability (α = 0.52), but instead did the opposite, allocating more neural resources to the item for which we systematically minified the error in feedback.\nIn Experiment 2b, fitting the same Neural resource model to the data revealed that the observed allocation parameter was numerically close to 1 (α_mean = 1.07; BF10 = 0.62, δ = 0.18, 95% CI = [-0.01, 0.38]), which aligns with the observed similarity in reproduction precision between the two stimuli. 
This further supports the conclusion that the effect observed in Experiment 2a depended on attentional competition during encoding.\nEstimation difficulty biases resource allocation\nFollowing Experiment 2, we aimed to determine whether preferential allocation and encoding would persist when varying objective stimulus difficulty rather than perceived performance. Drawing on previous findings showing a positive correlation in humans between subjective confidence and the motion strength of random dot kinematogram (RDK) stimuli [52] (see also [53]), we hypothesised that variations in the objective difficulty of stimuli would modulate internally generated confidence signals, driving the prioritisation of specific stimuli as in Experiment 2. In Experiment 3a, we presented two coloured RDK stimuli with different coherence levels on the majority of trials, to create differences in objective difficulty and associated confidence. We then assessed response precision on the remaining trials, during which both stimuli were presented with equal coherence (i.e., equal difficulty) (Fig. 3A & B).\n[Figure 3 image omitted. Panel labels: A) Motion stimuli, Delay, Cue, Response. B) Variable coherence trials (85% vs 45%) and equal coherence trials (65% vs 65%); additive perceptual noise (var) as a function of coherence (45%, 65%, 85%). C, D) Density of response error for the high coherence colour and the low coherence colour. E) MAD by coherence (High, Low, Inter. (High), Inter. (Low)). F) Allocation α, observed vs error minimization, Experiment 3a; inset: total response variance as a function of α. G–J) The same panels for Experiment 3b.]\nFigure 3: Estimation difficulty manipulation in Experiment 3a and 3b (simultaneous presentation). A) Schematic of the task. B) Experimental manipulation illustration. In most trials, the two colours were associated with different levels of motion estimation difficulty (i.e., variable coherence); in the remaining trials, both objects had the same level of difficulty (i.e., equal coherence). Motion with different coherence levels produces varying degrees of perceptual noise, with higher coherence reducing noise. This perceptual noise was incorporated into the Neural resource model as an additive component, alongside memory noise. C) & D) Distribution of response errors and corresponding fits of the Neural resource model. Histograms represent the data, while coloured curves and shaded areas depict model predictions (M ± SE). Panel C depicts variable coherence trials, and panel D depicts equal coherence trials. E) Mean absolute deviation of response errors. Dashed lines indicate equal allocation. F) Observed resource allocation and optimal allocation aiming to minimize overall feedback variance in the task. 
Dashed line indicates equal allocation. Allocation larger than 1 indicates that more resource was allocated towards the easier item (high vs low coherence: 1.76:1). Panels G-J are the same as C-F, but for the simultaneous presentation condition of Experiment 3b. J) Allocation larger than 1 indicates that more resource was allocated towards the easier item (high vs low coherence: 1.61:1). The insets illustrate individual variability in response variance as a function of resource allocation, with each curve’s trough indicating the allocation level that minimizes overall response error. The coloured circles with error bars show the mean ± SE. For ease of visualization, all curves are normalized to the same range of recall variance.\nIn Experiment 3a, all observers reported that stimuli of the colour associated with low coherence were more difficult to remember, confirming that the coherence manipulation produced a clear difference in perceived difficulty, despite the absence of performance feedback in this experiment. Results of Experiment 3a are shown in Figure 3C-E. As expected, observers were more precise in reproducing the motion direction of the high-coherence stimulus on trials where the stimuli objectively differed in difficulty (BF10 = 1.83 × 10^5, δ = 1.6, 95% CI = [0.95, 2.29]). More importantly, on trials where the stimuli had equal coherence, reproduction was also more precise for the stimulus with the colour associated with high coherence (i.e., the “easier” colour) (BF10 = 11.4, δ = 0.63, 95% CI = [0.18, 1.10]). 
This finding suggests that observers associated colour with difficulty when the two objects were presented with different levels of coherence, and subsequently allocated more resources to the stimulus they had learned was easier.\nThis result was replicated in the laboratory setting of Experiment 3b, where observers were required to maintain eye fixation at the centre of the screen during stimulus presentation (Fig. 3G-I). Although observers were prevented from overtly shifting their attention towards one stimulus during encoding, 74% of them correctly identified one colour as more difficult. As expected, responses were more precise for the high-coherence stimulus on trials where stimuli differed in coherence (BF10 = 4414, δ = 1.366, 95% CI = [0.718, 2.046]). Additionally, responses were more precise for the colour associated with high coherence on trials where both stimuli were presented with equal coherence (BF10 = 4.2, δ = 0.56, 95% CI = [0.092, 1.052]). However, the observed difference could again be explained by attentional demands at encoding. When objects were presented sequentially (Fig. S2), response precision was comparable across colours when both stimuli had the same coherence (BF10 = 0.51, δ = 0.267, 95% CI = [-0.157, 0.709]). This was despite observers being able to judge which item was more difficult (84%) and a noticeable precision advantage for the high-coherence stimulus on trials when coherence levels varied between objects (BF10 = 9.22 × 10^4, δ = 1.76, 95% CI = [1.015, 2.550]). Compared to other experiments, response distributions in this experiment exhibit more pronounced peaks around the direction opposite to the target (i.e., elevated tail ends). 
The tendency of our sensory system to encode orientation of a motion path (i.e., the line on which movement occurs) partly independently of direction is well-documented [54, 55] and may be especially pronounced when motion stimuli are presented in the periphery rather than at fixation.\nNeural resource allocation\nConsistent with the findings from Experiment 2, Experiment 3 demonstrated that observers, when presented with objects associated with different levels of performance, prioritised the encoding of the stimuli perceived as easier. Also consistent with previous experiments, observers performed above chance for the more difficult item, supporting the interpretation that resource allocation was graded rather than all-or-none. To quantify the distribution of resources across the two items, we again applied our population coding model to the data.\nIn Experiment 3a, the allocation estimates from the model indicated that observers allocated nearly twice as much resource (1.76:1) to the high-coherence stimuli (Fig. 3F), and this allocation deviated from equal allocation (BF10 = 2.68 × 10^4, δ = 1.4, 95% CI = [0.792, 2.031]). We next investigated whether the observed allocation was consistent with an optimal allocation strategy aimed at minimizing overall response variance in the task. To this end, we simulated performance on the variable coherence trials using a range of different allocation weights, and found that the optimal strategy for most observers was equal allocation (Fig. 3F). Comparing the observed and optimal weights revealed strong evidence that the observed weights were, on average, larger than the optimal weights (BF10 = 9580, δ = 1.341, 95% CI = [0.733, 1.977]).\nThese findings were replicated in Experiment 3b. 
When objects were presented simultaneously, the model estimated that observers allocated resources at a ratio of 1.61:1 in favour of the high-coherence stimulus (Fig. 3J). This allocation was again different from equal (BF10 = 830.6, δ = 1.166, 95% CI = [0.566, 1.794]), and from optimal, which was again close to equal (mean α_optimal = 1.03; BF10 = 789, δ = 1.16, 95% CI = [0.561, 1.786]). Finally, fitting a free allocation parameter to the data from the equal coherence condition with sequential presentation (Exp 3b) revealed a ratio of 1.2:1 in favour of the colour associated with high coherence; however, we did not find evidence that this was different from equal allocation (BF10 = 0.82, δ = 0.343, 95% CI = [-0.09, 0.797]).\nInterim conclusion\nIn Experiment 1, we observed a clear effect of external rewards on representational fidelity in a motion reproduction task, with observers allocating more cognitive resources to high-reward items. In Experiments 2 and 3, we found a similar effect using novel manipulations, where observers allocated more resources to the stimulus that was perceived as easier, either based on manipulated error feedback (Experiment 2) or internal confidence in estimation (Experiment 3). In both of these experiments, the observed allocation deviated from predictions made by an optimal strategy aimed at minimizing overall feedback or response error. 
Additionally, we demonstrated that differences in estimation performance were abolished when competition during encoding was removed by presenting stimuli sequentially, suggesting they arise from unequal allocation of attentional resources at the encoding stage.\nBuilding on these results and existing literature [38, 43, 44], we argue that observers in our study found higher accuracy and confidence in their performance intrinsically rewarding, and learned to associate this intrinsic reward with a stimulus feature (i.e., one of the two colours). This association biased resource allocation towards subsequent stimuli with the same feature. Importantly, although reward-driven, this biased allocation was not a strategy that would maximize intrinsic reward on these tasks, because observers had no influence over which stimulus was cued for report on a given trial. Indeed, the direction of the biases induced by implicit rewards in Experiments 2 and 3 meant that they were counterproductive, increasing overall error variability relative to a strategy of equal allocation. Therefore, instead of evaluating these data from the perspective of optimal performance, we propose a neural model inspired by reinforcement learning to elucidate these findings.\nReinforcement learning model of resource allocation\nTo further explore the dynamics of resource allocation, we developed a computational model that integrates principles of neural coding and reinforcement learning. The proposed Reinforcement learning account of resource allocation extends the Neural resource model [22, 51] by incorporating a value-updating mechanism that allows extrinsic and intrinsic rewards to influence the future distribution of neural resources (Fig. 4). 
A key contribution of our model is the concept that rewards – both intrinsic and extrinsic – obtained from reproduction of a stimulus become associated with the identifying features of that stimulus, affecting their subjective value and biasing allocation of resources in subsequent encounters. We found that this approach accurately predicted resource allocations estimated by freely fitted allocation weights, indicating that behavioural estimation performance could be successfully inferred from an analysis of accumulated rewards.\nExternal reward\nIn Experiment 1, the stimulus colour associated with a high reward (15 points) was expected to accumulate greater value relative to the colour associated with a low reward (5 points). On average, observers earned points on 80% of trials when the high-reward stimulus was probed and 66.5% of trials when the low-reward stimulus was probed, leading to an average accumulation of 602 and 166 points, respectively. To apply the proposed RL model to each observer’s data, we combined individual trial-by-trial external rewards with estimates of internal confidence (Equation 9).\nFigure 5A shows the average trajectory of resource allocation across trials (see Fig. S3A for individual trajectories). This trajectory shows an early shift in resource allocation towards the preferred item, followed by a stable plateau. For ease of visualisation, trajectories are presented as directed towards the preferred object, defined as the object receiving a greater average resource allocation across all trials.\nCrucially, since our RL account is grounded in the same Neural resource model previously employed to fit the psychophysical data and quantify resource allocation (Fig. 1A & C), we can directly compare estimates across the two models. 
Here we focus on the comparison of estimated resource allocations, while ML estimates and comparisons for the other parameters are shown in Supplementary Information (Fig. S4A). Importantly, the freely estimated resource allocation (observed allocation in Fig. 1C) is based on behavioural errors only, with no information about rewards, and so can serve as a benchmark for evaluating performance of the RL model. As shown in Figure 5B, we observed a strong positive correlation between the freely estimated allocation parameter and the mean allocation derived from the history of accumulated rewards (r = 0.976, 95% CI = [0.941, 0.988], BF10 = 3.34 × 10^16).\nFigure 4: The neural resource allocation account applied to the motion estimation task. 
A) On\neach trial, motion directions of the two stimuli are encoded in the spiking activity of populations of\nneurons, with mean activity determined by the relative allocation of resources to stimuli. Based on\nthe cue colour, one of the populations is decoded to yield an estimated direction with an associated\nuncertainty that varies from trial to trial. The uncertainty of the estimate, the accuracy feedback\n(if present) and any points awarded represent different forms of intrinsic and external reward, which\nare combined as a weighted sum into a composite reward (∆t). This composite reward is then used\nto update the relative value (ν) associated with the stimulus colours. Finally, this relative value is\ntransformed via an exponential mapping into a neural gain factor (α), which controls the fraction of\nresources allocated to each stimulus on the subsequent trial. In this framework, resource allocation\nis entirely driven by the history of accumulated rewards. B) Throughout the reported experiments,\nthe two colours of stimuli are systematically related to different intrinsic or external rewards, so the\nrelative value assigned to each colour progressively diverges over the sequence of trials. C) Fraction\nof total resources allocated to the high-reward stimulus over trials, based on relative value shown in\nB. The dashed line represents the mean allocation across all trials (∼65%). The remaining resources\n(∼35%) are allocated to the low-reward stimulus. D) Unequal resource allocation is reflected in\ndifferences in the mean absolute error across trials when the high- or low-reward stimulus is cued for\nreport. E) Larger reward weight (c), with a constant leak factor, results in a stronger preference for\none stimulus over the other in terms of resource allocation. 
F) Larger leak factor (y), with a constant reward weight, leads to a weaker preference.\nThe close alignment of these two distinct methods suggests that the history of accumulated rewards can effectively account for resource allocation in this task.\nIntrinsic reward: Perceived accuracy\nIn Experiment 2 we manipulated the response error presented in feedback to influence the perceived difficulty of reproducing stimuli of each colour. This manipulation resulted in participants experiencing systematically larger feedback errors for stimuli of one colour (magnified feedback MAD = 0.837) than the other (minified feedback MAD = 0.233). To model this data within our RL account, we assume that the feedback on each trial provided an intrinsic reward that was associated with the corresponding stimulus colour. This assumption is supported by evidence that observers’ subjective evaluations tend to favour smaller feedback errors over larger ones because they suggest higher accuracy [56].\nIn the model, rewards derived from feedback were integrated with those derived from internal confidence. Figure 5D illustrates the mean trajectory of resource allocation across trials (Fig. S3B shows individual trajectories for example observers). 
The model fits again indicate that resources were unequally allocated between stimuli, although the bias is smaller than observed in the experiment with external rewards.\nComparing the estimated allocation derived from the RL account to the freely fitted allocation parameter in the Neural resource model (Fig. 5E), we found a strong positive correlation (r = 0.911, 95% CI = [0.777, 0.958], BF10 = 3.44 × 10^7). Consistent with the findings from Experiment 1, the correspondence between these two distinct approaches indicates that the history of accumulated intrinsic rewards provides an explanation for resource allocation in the task with manipulated feedback. ML estimates and comparisons for the other parameters are shown in Supplementary Information (Fig. S4B).\nIntrinsic reward: Estimation difficulty\nExperiment 3 investigated the role of objective difficulty in the representation of motion information. On most trials, two stimuli with different coherence levels (85% and 45%) were presented. We hypothesized that internal confidence in each item’s motion direction, reflecting a metacognitive estimate of accuracy, functions as an intrinsic reward which observers associate with each item’s identity (i.e., colour) [48]. To model the psychophysical data in the simultaneous presentation condition, we estimated internal confidence by exploiting the close coupling between uncertainty and trial-to-trial variability in error within the Neural resource model. Informed by observed response error on each trial, we derived the posterior probability distribution of likelihood precision and used the most probable precision as a basis for intrinsic reward (Eq. 12). 
While internal confidence was also incorporated in this way when modelling data from the previous two experiments, in Experiment 3 it was the sole source of reward influencing resource allocation.\nFigure 5G & J show the mean trajectories from Experiment 3a & 3b, respectively. Again, we visualised the obtained individual trajectories in example participants (Fig. S3C & D). In both experiments, all parameter estimates obtained with the Neural resource model and the RL account strongly covaried (Fig. 5H & K & Fig. S4C & D). Importantly, this was also true for estimates of resource allocation. Across the two experiments, we found very consistent and strong positive correlations between the freely estimated allocation parameter and the mean allocation derived from the history of accumulated rewards (Exp 3a: r = 0.833, 95% CI = [0.589, 0.923], BF10 = 1.29 × 10^4; Exp 3b: r = 0.843, 95% CI = [0.575, 0.933], BF10 = 3.67 × 10^3). Consistent with the findings from the first two experiments, this strong correspondence indicates that the history of accumulated intrinsic rewards based on internal confidence effectively accounts for resource allocation in this task.\nChanges in resource allocation predict response precision\nOur finding that freely estimated resource allocation strongly correlates across participants with resource allocation based on the history of rewards supports the conclusion that human resource allocation is guided by a reward-driven value assignment to objects in the visual environment. 
To further substantiate this claim, we investigated whether variability in resource allocation across trials within individual participants, derived from the RL model, predicts the magnitude of their response errors.\nFigure 5: Modelling results. A) Resource allocation across trials inferred by the RL account in the external reward experiment (Experiment 1). 
Circles represent the mean fraction of resources across observers allocated on each trial towards the overall preferred object. B) Correlation between mean allocations inferred by the RL account and freely estimated allocations. The red line shows predictions of the fitted linear regression model, and the shaded area indicates the 95% CI. C) Difference in MAD between trials on which the probed item had below- and above-median resources allocated to it, as estimated by the RL account. On average, MAD was larger when less resource was allocated to the probed stimulus. D–F) Same as above, but for the perceived accuracy experiment (Experiment 2). G–I) Online estimation difficulty experiment (Experiment 3a). J–L) Lab-based estimation difficulty experiment (Experiment 3b, simultaneous condition).\nTo investigate this, we performed a median-split analysis for each observer based on the estimated fraction of allocated resources towards the item associated with larger reward (i.e., individual trajectories similar to those shown in Fig. 5, left column). Specifically, we calculated the MAD of response errors for trials with above- and below-median resource allocation, separately for trials where the high- or low-reward item was tested. We hypothesised that MAD would be greater on trials where the RL model indicated that below-average resource was allocated to the probed item, i.e., below-median trials when the high-reward object was probed and above-median trials when the low-reward object was probed. 
To test this, we computed a composite score for each observer equal to the sum of the signed difference in MAD between below- and above-median trials when the high-reward item was probed and the signed difference in MAD between above- and below-median trials when the low-reward item was probed. In all four experiments, the composite scores indicated that lower predicted resource allocation, based on the history of rewards, corresponded on average to larger MAD of response errors (Fig. 5C, F, I & L). This was confirmed with one-sided t-tests against zero, which provided moderate to extreme evidence for a difference in the hypothesised direction: Experiment 1 (BF10 = 5.55 × 10^4, δ = 1.096, 95% CI = [0.635, 1.570]); Experiment 2 (BF10 = 564, δ = 0.866, 95% CI = [0.402, 1.345]); Experiment 3a (BF10 = 3.74, δ = 0.441, 95% CI = [0.073, 0.874]); Experiment 3b (BF10 = 192.4, δ = 0.919, 95% CI = [0.377, 1.487]).\nDiscussion\nIn the present study, we investigated how human observers represent stimuli associated with varying levels of external and intrinsic reward. Across three psychophysical experiments, we paired object identities with different rewards and found observers developed higher estimation accuracy for the items associated with larger rewards. In two additional experiments, we demonstrated that this effect was driven by competition for attentional, rather than mnemonic, resources. To provide a mechanistic explanation of this behaviour, we developed a neural model incorporating a reinforcement learning rule that directs resource allocation towards more rewarding stimuli. 
Our key finding is that a resource allocation mechanism based solely on the history of accumulated rewards is sufficient to explain differences in estimation performance based on intrinsic as well as external rewards.\nIn the first experiment, we investigated the effects of external rewards on representational fidelity in a motion reproduction task. Both the psychophysical results and computational modelling provided compelling evidence that observers allocated more processing resources to objects associated with a higher reward, resulting in more precise reproduction of high-reward stimuli compared to low-reward ones. This finding aligns with a broad body of research demonstrating that external rewards, such as points or money, influence various aspects of information processing, including the allocation of attentional resources [17, 18] and working memory [31], while also facilitating motor responses, such as hand movements and saccades, towards rewarding stimuli [56–58].\nIn contrast to external rewards, the influence of intrinsic rewards on representational fidelity has received comparatively less attention. Building on the premise that accuracy itself is rewarding [38, 43, 44], we conducted two experiments that manipulated perceived accuracy (via feedback) and objective estimation difficulty (via signal strength) in a motion reproduction task. We found converging evidence at both the behavioural and computational level indicating that observers allocate more neural resources towards, and consequently have a more precise internal representation of, objects associated with better estimation performance – whether induced by artificially manipulated feedback (Experiment 2) or by objective differences in stimulus discriminability (Experiment 3).\nWe argue that observers derived intrinsic reward from confidence in their responses and feedback on their accuracy. 
In our tasks, the association of these rewards with the distinguishing feature of the presented objects (i.e., colour) leads to a bias in resource allocation, favouring subsequent stimuli that share the same feature. This proposal aligns with the notion that perceptual features linked to rewards are prioritised in sensory processing due to their incentive salience (e.g., [59, 60]). Moreover, neural evidence supports this notion by demonstrating that sensory representations are modulated by the history of rewards, underscoring the impact of reward associations on perceptual processing [61]. To make our proposal concrete, we developed a mechanistic model grounded in the principles of population coding and reinforcement learning. Specifically, our reinforcement learning account operates by analyzing accumulated rewards and allocating proportionally more resources to objects previously associated with higher rewards. We found this model closely replicated resource allocation estimates obtained from freely fitted parameters, suggesting that the history of accumulated intrinsic and extrinsic rewards is sufficient to account for the observed patterns of resource allocation.\nA key novel finding from the proposed model is that both internally generated and externally manipulated (via feedback) estimates of accuracy, when associated with an object’s distinguishing feature, can bias the subsequent processing of objects that share that feature. 
While the role of external feedback in reinforcing behaviours more generally has been widely acknowledged (e.g., [62, 63]), recent research demonstrates that internal confidence can similarly reinforce behaviour even in the absence of explicit feedback. For instance, improvements in sensitivity in a perceptual learning task have been observed without external feedback [64]. Guggenmos et al. [48] proposed that such learning is driven by confidence prediction errors – discrepancies between an individual’s current confidence and their expected confidence level. Notably, the neural substrate for these prediction errors has been identified in the striatum (see also [65]), a brain region traditionally linked to reward processing. Our findings contribute to a growing body of literature that highlights the importance of metacognition [53] and self-reinforcement [66] as critical processes in the pursuit of rewards.\nUsing a range of tasks similar to the one used in this study, previous research has demonstrated that humans possess knowledge about the uncertainty with which individual items are reported (e.g., [52, 67, 68]). Population coding models [69, 70] have been particularly effective in capturing subjective confidence [71], as well as proxies such as response latency [72]. Within the population coding framework, an ideal observer of spiking activity would derive their confidence estimate – whether internal or explicitly reported – based on the precision of the posterior distribution, which represents the probability of the stimulus value given the observed neural activity. In the Neural resource model [22, 51], the precision of the posterior (or likelihood, assuming a uniform prior over stimulus space) varies from trial to trial, as a result of stochastic variation in the number of spikes available for decoding. 
We calculated the most probable estimate of posterior precision on each trial to serve as an indicator of internal confidence. On this basis, the model successfully recreated freely estimated resource allocations based on our data.\nAn important insight from our modelling is that the observed resource allocation deviated from the pattern required to minimize overall response or feedback error variability, resulting in poorer overall performance. In a similar vein, a recent study [73] provided theoretical and empirical evidence suggesting that sensory processing is optimised to maximize fitness (i.e., rewards), rather than to ensure perceptual accuracy. Supporting this idea, neurophysiological studies have demonstrated that early sensory systems encode both sensory information about a stimulus and non-sensory information regarding the behavioural relevance of stimuli [3, 74]. Embedding stimulus-reward contingencies within the sensory representation of a stimulus facilitates the prioritisation of behaviourally relevant information during encoding. These previous findings may help explain why our observers’ allocation strategies were not optimized for accuracy in the task; however, they were also not optimized for maximizing rewards. In the experiment with external rewards, we found that observers allocated resource more equally between items than would be predicted by a reward-maximizing strategy. The RL model captured the observed allocation strategy based on a weighted combination of points-based external rewards and confidence-based intrinsic rewards – this combination of factors could lead observers to maintain a certain level of performance even for stimuli associated with low external reward. 
When considered across all experiments, our results point to a reward-driven allocation of resources that, while prioritising reward-related stimuli, is not optimized to obtain rewards in the specific tasks we investigated.\nOur results also contribute to prominent theories in neuroscience, psychology, and economics [75–78] which consider how humans and other animals link the mental effort required for a task with the value of its outcome (i.e., the reward). Behavioural studies demonstrate that, when faced with tasks offering equal rewards but varying in effort, humans tend to avoid those perceived as more difficult [79, 80]. Based on this, it has been argued that cognitive effort is experienced as carrying disutility, i.e., acting as a discount factor on expected rewards [78, 81]. This hypothesis has been substantiated by the observation that cognitive effort reduces neural responses to rewards following an effortful task [82]. In the present results, perceived (Experiment 2) or objective difficulty in estimation (Experiment 3) similarly appears to have discounted or reduced the subjective value of a stimulus, leading observers to prioritise easier – and thus in principle more rewarding – items for encoding. However, because observers had no control over which stimuli were selected for test, this allocation strategy did not result in more reward in our tasks and could even be counterproductive. This raises the wider question of whether humans may similarly allocate effort suboptimally, driven by intrinsic reward, in other situations where they have limited control over what information will subsequently become relevant.\nMaterials and methods\nApparatus\nIn the online experiments, tasks were presented via web browsers on observers’ personal computers and were coded in JavaScript and HTML Canvas. In the laboratory experiment, stimuli were displayed on a 69 cm gamma-corrected LCD monitor with a refresh rate of 60 Hz. Observers were seated in a dark room and viewed the monitor from a distance of 60 cm, with their heads supported by a forehead and chin rest. Eye position was monitored online at 1000 Hz using an infrared eye tracker (SR Research). Stimulus presentation and response registration were controlled by a script written in Psychtoolbox [83, 84] and executed in Matlab (The Mathworks Inc.). Responses were collected using a computer mouse.\nParticipants\nA total of one hundred ninety-six naive observers (110 females, 80 males, 6 preferred not to say; M_age = 27.6, SD_age = 5.0) took part in the study after giving written informed consent in accordance with the Declaration of Helsinki. All observers reported normal colour vision and normal or corrected-to-normal visual acuity. For the online experiments, observers were recruited using Prolific (https://www.prolific.co) and were remunerated £6 per hour for their participation. 
For the laboratory experiments, observers were recruited through the Cambridge Psychology research sign-up system and were remunerated £10 per hour.\nFor the online experiments, we used a Bayesian stopping rule to determine the sample size. The stopping rule guides when enough evidence has been gathered to support a decision, thus optimizing the sample size. In particular, we continued testing observers until we obtained strong evidence, as estimated by the Bayes Factor, in favour of either H0 (BF10 ≤ 0.1, indicating evidence supporting no difference between the two conditions of interest) or H1 (BF10 ≥ 10, indicating evidence supporting a difference between the two conditions). If neither hypothesis was supported, data collection ceased after reaching 100 observers. In Experiment 1, we assessed differences in mean absolute reproduction error in the analogue report task between stimuli associated with high and low reward, which were the conditions of interest for the Bayesian stopping rule. In Experiment 2, we tested for differences in mean absolute reproduction error between error-minified and error-magnified stimuli. In Experiment 3, we compared mean absolute reproduction errors on trials where stimuli were presented in different colours but with equal coherence. For the laboratory experiment (Experiment 3b), we aimed to collect a number of participants similar to that in Experiment 3a. In total, thirty observers participated in Experiment 1. Twenty-five observers participated in Experiment 2a, and one hundred participated in Experiment 2b. Finally, twenty-two and nineteen observers participated in Experiments 3a and 3b, respectively.\nStimuli\nThe stimuli in this study were random dot kinematograms (RDK). On each trial, two RDK stimuli, each consisting of 40 dots, were presented within a circular aperture. 
A percentage of the dots (specified below) moved in a coherent direction, while the remaining dots moved in random but consistent directions within the aperture [85]. When a dot exited the aperture, it was replaced by a new dot at the aperture’s edge, maintaining a constant dot density. In all experiments, one stimulus was always green (RGB colour values; online: 47, 195, 129, lab: 0, 199, 128) and the other was always blue (online: 24, 199, 233, lab: 0, 187, 241). In Experiment 3b, the same observers completed two identical tasks, with stimuli presented either simultaneously or sequentially. In this experiment, stimuli were either green and blue or orange (237, 154, 0) and magenta (255, 79, 208), balanced across observers and presentation conditions. Across all tasks, stimuli were presented against a mid-grey background.\nFor the online experiments, all measures in pixels are reported for a 1920 x 1080 resolution and 60 Hz refresh rate. When a different resolution or refresh rate was detected, all measurements of size, positioning and speed were automatically adjusted to maintain consistency in stimuli presentation across different display settings. The stimulus aperture was 105 pixels in diameter, and each dot had a radius of 3 pixels. Two apertures were positioned 220 pixels to the left or right of the screen centre. On each frame, the dots were shifted by 3 pixels in a specific direction. In the laboratory experiment, two apertures (1.4 dva radius) were presented horizontally aligned with the fixation annulus, positioned at 5 dva to the left and right. 
Each dot was 0.15 dva in diameter and travelled at a speed of 4 dva/sec.\nProcedure and task\nIn all experiments, observers completed an analogue report task [50]. Each trial began with the presentation of a central fixation annulus. In the laboratory experiment, gaze direction was monitored using an eye-tracking camera, and observers were required to maintain gaze fixation within a radius of 2° around the central annulus for 500 ms before the trial could proceed. After achieving stable fixation, the fixation annulus changed appearance (i.e., became thinner) to signal that the memory array would be presented in 500 ms. In the online experiment, the appearance of the fixation annulus changed after a fixed interval of 500 ms. The sample array, consisting of two RDK stimuli, was then shown for 750 ms, followed by a 1000 ms delay period. A centrally presented colour cue subsequently indicated which of the previously presented stimuli, distinguished by colour, was the target that the observers should recall and report the direction of.\nOnce observers were ready to give their response, they could begin moving the cursor with a mouse or trackpad, which triggered the appearance of a randomly oriented white arrow within the central annulus. Observers were instructed to align the direction of the arrow with the previously presented motion direction of the cued stimulus. In the online experiment, responses were confirmed by pressing the spacebar, while in the laboratory experiment, they were confirmed by pressing the right mouse button.\nExperiment 1: External reward\nIn Experiment 1, we investigated how extrinsic rewards influence motion reproduction precision. To this end, observers received 15 points for reporting a motion direction within 50° of the target direction when the target was of one colour (e.g., green), and 5 points when it was of the other colour (e.g., blue). 
Responses that were more than 50° from the target direction did not receive any points. The colour associated with high versus low reward was chosen randomly for each observer at the beginning of the experiment. Both stimuli were presented with the same coherence (85%) and no error feedback was provided. Accumulated points were converted to a bonus payment at the end of the experiment, and observers were informed of this at the beginning of the experiment. Overall, they could collect a maximum of one thousand points, which was equivalent to a bonus payment of £1.50. Observers completed twenty practice trials and one hundred experimental trials. The task took approximately 20 minutes to complete. The trials were divided into two equal blocks with a break of at least one minute in between, and the complete testing session lasted approximately 15 min.\nExperiment 2: Perceived accuracy\nExperiments 2a and 2b were designed to investigate the role of feedback on the precision of motion reproduction. The two experiments were identical except for the presentation of stimuli. In Experiment 2a, two stimuli were presented simultaneously for 750 ms at two distinct locations. In Experiment 2b, the same two locations were used, but the stimuli were presented sequentially, each for 750 ms. In Experiment 2b, the order of presentation and the colour cues were balanced across conditions.\nIn both experiments, at the end of each trial, following the response, we presented feedback in the form of the reported and target motion directions. Unbeknownst to participants, we manipulated the feedback by artificially magnifying errors for one stimulus colour. This was done by shifting the presented target motion direction (θ*) away from the reported direction (θ̂) and thereby inflating the presented response error for the designated “difficult” item. 
The shift was computed according to the following equation:\nθ* = θ ± 50 sin(θ̂ − θ), (1)\nwhere θ is the true motion direction, and all angles are expressed in degrees. Similarly, we systematically minimized the error in the feedback for the other colour, designated as the “easy” item. The magnification and minimization of errors were randomly assigned to one of the two colours (i.e., green or blue) for each observer at the beginning of the experiment. The RDK stimuli were presented with 85% coherence. At the beginning of the experiment, during the instructions, we informed observers that individuals might vary in their ability to perceive the motion of stimuli of different colours. This was intended to make any perceived differences in difficulty appear plausible. At the end of the experiment, observers were debriefed and the true purpose of the study was revealed. In Experiments 2a and 2b, observers completed twelve practice trials and one hundred experimental trials. The trials were divided into two equal blocks with a break of at least one minute in between, and the complete testing session lasted approximately 15 min.\nExperiment 3: Estimation difficulty\nIn Experiments 3a and 3b, we investigated the role of stimulus discriminability on the fidelity of visual representations. To achieve this, we presented two stimuli with different levels of coherence on 67% and 70% of all trials in Experiments 3a and 3b, respectively. 
Specifically, the stimulus of one colour was presented with 85% (high) and the stimulus of the other colour with 45% (low) coherence. These variable-coherence trials were randomly interleaved with trials where both stimuli had the same intermediate (65%) coherence. The assignment of low and high coherence to specific colours was randomized for each observer at the beginning of the experiments. No feedback was provided during these experiments.\nExperiment 3a was conducted online, while Experiment 3b took place in the laboratory. In Experiment 3a, stimuli were presented simultaneously on all trials. In Experiment 3b, the same observers performed the task with both simultaneous and sequential presentations, with the order of these conditions counterbalanced across participants. To prevent transfer effects between conditions, we used different colour combinations: in one condition, stimuli were presented in green and blue, while in the other, they were presented in orange and magenta. The colour combinations were randomly assigned to each presentation condition.\nIn Experiment 3a, observers completed twenty practice trials and one hundred fifty experimental trials. The trials were divided into two blocks with a mandatory break of at least one minute in between, resulting in a total testing session duration of around 15 minutes. Experiment 3b (i.e., the laboratory experiment) consisted of four hundred trials, divided into eight equal blocks. In half of the blocks, stimuli were presented simultaneously, while in the other half, they were presented sequentially. Half of the observers completed the simultaneous blocks first, followed by the sequential blocks, and vice versa for the other half. At the beginning of each block sequence (i.e., simultaneous or sequential task), observers performed twenty practice trials to familiarize themselves with the task. 
In Experiment 3b, observers were required to maintain central fixation throughout the stimulus presentation. If gaze deviated by more than 2°, a warning message appeared on the screen, and the trial was aborted and restarted with newly randomized stimuli. Completing Experiment 3b took approximately 90 minutes.\nAnalysis\nAll stimulus values were analysed and are reported with respect to the circular parameter space of possible motion directions, [−π, π) radians. Response error for each trial was measured as the angular difference between the reported and target motion directions. To quantify the dispersion of response errors, we calculated the mean absolute deviation (MAD) across trials for each condition and observer. Higher MAD values indicate greater average reproduction error.\nTo compare differences in performance across conditions, we used Bayesian hypothesis tests, implemented in JASP [86] with the default Jeffreys-Zellner-Siow prior on effect sizes [87]. We report Bayes factors, which compare the relative predictive adequacy of two competing hypotheses (e.g., alternative and null) and quantify the change in belief that the data bring about for the hypotheses under consideration [88]. For example, BF₁₀ = 10 indicates that the data are ten times more likely to occur under the alternative hypothesis (i.e., there is a difference) than under the null hypothesis (i.e., there is no difference). Evidence for the null hypothesis is indicated by BF₁₀ < 1, in which case the strength of evidence is indicated by 1/BF₁₀. Evidence assessed via the Bayes factor is best understood as a ratio-scaled value ranging from 0 to infinity. 
For clarity in communication, we also use an interpretative framework for Bayes factor values, following the classification scheme outlined by Lee and Wagenmakers [89]: BF = 1 as no evidence; 1 < BF < 3 as weak or anecdotal evidence; 3 ≤ BF < 10 as moderate evidence; 10 ≤ BF < 30 as strong evidence; 30 ≤ BF < 100 as very strong evidence; BF ≥ 100 as extreme evidence. Although we use these discrete categories, they are arbitrary and should serve only as rough guidelines. Along with the Bayes factor, we report the median of the posterior distribution over the effect size (δ) and the accompanying 95% credible interval (95% CI).\nNeural resource model\nWe analysed observers’ response errors with an established model based on the principles of population coding [22, 51, 90, 91]. In this framework, a visual stimulus (θ) is encoded by an idealized population of neurons whose activity is determined by their individual tuning functions. All neurons are assumed to share the same bell-shaped von Mises tuning function,\nf_i(θ) = exp(κ(cos(θ ⊖ φ_i) − 1)), (2)\nwhere κ determines the tuning concentration, and ⊖ is subtraction on a circle. These tuning functions are translated through the feature space to peak at each neuron’s preferred value (φ_i), such that they provide dense uniform coverage of the entire feature space. In a population of M neurons, the average response of the ith neuron to a stimulus value θ is obtained by scaling the output of the tuning function with the population’s mean total firing activity (γ),\nn̄_i(θ) = (γ/M) f_i(θ). 
(3)\nIf activity associated with multiple stimuli is combined or normalized [92] to a population-level total γ, Equation 3 implements a form of limited resource [22]. The spike count produced by each neuron is drawn from a Poisson distribution,\nn_i(θ) ∼ Poiss(n̄_i(θ)), (4)\nand the decoded motion direction estimate is obtained by maximum likelihood estimation of the population spiking activity, n:\nθ̂ = arg max_θ p(n|θ). (5)\nThe resulting distribution of decoding errors, for a given total number of spikes m = Σ_i n_i ∼ Poiss(γ), is described as a mixture of von Mises (ϕ) distributions,\np(θ̂|θ, m) = ∫ p(r|m, κ) ϕ(θ̂; θ, rκ) dr, (6)\nwith\np(r|m, κ) = [I₀(κr) / (I₀(κ))^m] r ψ_m(r), (7)\nwhere I₀ is the modified Bessel function of the first kind of order zero and r ψ_m(r) is the probability density function for resultant length r of a uniform random walk of m steps. The full distribution of response errors predicted by the model is a mixture of probability distributions p(θ̂|θ, m), weighted with the probability of obtaining m spikes. For a complete derivation of the distribution of response errors, see Bays [22] and Bays [71].\nThe model has two free parameters, the population’s mean total firing activity (γ), and the concentration of the tuning function (κ). In scenarios when multiple objects (N) need to be represented, the total resource γ is typically divided equally among objects (i.e., γ/N). Here we extend this basic approach by incorporating an allocation parameter, or gain factor α, which controls the neural activity allocated to one object (see also [22]). Without loss of generality, we fixed the gain factor for one object at 1, while treating the gain factor for object j (see below for details of each experiment) as a free parameter when fitting the model to the data. The neural activity allocated to object j can be expressed as p_α γ, where p_α = α/(1 + α) represents the proportion of total neural activity. 
The remaining activity (proportion 1 − p_α) is allocated to the other object.\nIn Experiment 1, which involved the manipulation of external reward, the allocation weight for the high-reward item was fixed at 1, while the allocation weight for the low-reward item was freely estimated. In Experiment 2, in which we manipulated perceived accuracy, the allocation weight for the error-magnified item was fixed at 1, and the allocation weight for the error-minified item was freely estimated. In Experiment 3, which involved estimation difficulty manipulation, we simultaneously fitted responses on variable- and equal-coherence trials. Building on our previous work [93], we assumed that the strength of the motion signal is controlled by the coherence level of RDK stimuli, such that the value encoded into the neural population is given by\nθ̄ ∼ WN(θ, σ²_coherence), (8)\nwhere WN is a wrapped normal with mean θ and variance σ²_coherence accounting for additive Gaussian noise. For simplicity, we considered 85% coherence (high coherence) as perceptually noiseless and further assumed that σ²_45% > σ²_65%, where 45% was the low-coherence level, and 65% was the intermediate-coherence level used in the equal-coherence trials. Additionally, the allocation weight for the low-coherence colour (45% and half of 65% stimuli) was fixed at 1, while the allocation weight for the high-coherence colour (85% and half of 65% stimuli) was freely estimated across variable- and equal-coherence trials. 
In other words, on equal-coherence trials, differences in response precision were explained solely by the allocation weight. In contrast, on variable-coherence trials, the allocation weight and perceptual noise jointly accounted for variations in response precision.\nOptimal resource allocation\nTo identify the optimal levels of resource allocation, we conducted a simulation study. For each observer, we simulated model predictions using the best-fitting parameters of the Neural resource model, specifically the mean total number of samples (γ) and the precision of a single sample (ω₁), along with a grid of potential allocation weights.\nFor Experiment 1, we analytically determined the number of points based on the model-predicted response distributions under different allocation weights. The allocation weights were tested across a grid ranging from 0.001 to 2 in increments of 0.01, resulting in 200 distinct values. The optimal allocation weight was identified as the value that maximized the total reward across both high- and low-reward items.\nFor Experiment 2, we numerically simulated the variance (i.e., squared circular SD) of feedback errors using the same grid of allocation weights employed in Experiment 1. This analysis was based on 107 simulated trials drawn from the error distribution predicted by the model. The optimal allocation weight was determined as the value that minimized the total variance of feedback errors across both error-minified and error-magnified items.\nFor Experiment 3, we analytically modelled the response variance on the variable-coherence trials (i.e., for the high- and low-coherence items) across a range of allocation weights. We employed a grid of 200 values, spanning from 0.01 to 6 in increments of 0.03. 
The optimal allocation weight was identified as the value that minimized the total response variance for both high- and low-coherence items (i.e., 85% and 45% coherence). In Experiment 3a, the simulation yielded values around α_optimal = 1 for all but one outlier observer, for whom the estimate reached the endpoint of the examined grid (α_optimal = 6). This occurred because the model estimated high levels of perceptual noise for medium- and low-coherence stimuli, suggesting that overall error would be minimized by allocating all resources to the high-coherence object. We exclude this data point in Figure 3D, and the comparison of observed and optimal allocations is based on the remaining observers. Including this observer’s data and performing a non-parametric test did not change our conclusions.\nReinforcement learning account of resource allocation\nWe developed a quantitative model to describe how the history of accumulated rewards from multiple objects influences subsequent resource allocation towards those objects. The proposed model extends the Neural resource model by incorporating a simple reinforcement learning (RL) rule, which directs behaviour towards more rewarding stimuli. Importantly, our model applies the same RL rule to both external and intrinsic rewards. In the standard RL framework, analysis typically focuses on external rewards, such as points or money, which are provided by the environment as a direct response to the agent’s actions. 
Our model broadens this scope to include intrinsic rewards (those that are inherently pleasurable and drive behaviour), such as the sense of being accurate in a task.\nIn the context of the present experiments and the motion direction reproduction task, the account can be summarized as follows: on a particular trial, the received points or money (extrinsic reward), perceived accuracy due to feedback (intrinsic reward), and an individual’s internal confidence-based estimate of precision (intrinsic reward) collectively update the value (ν) of a particular object associated with these rewards. This computed value influences the allocation of cognitive resources to that object in subsequent encounters, thereby modulating the precision with which the object is represented.\nFormally, in the simplest scenario involving only two objects, rather than defining the accumulated reward for each object separately, we can define the relative accumulated reward on trial t as:\nν_t = (1 − y) ν_{t−1} + Δ_t, (9)\nΔ_t = I±(c₁ r_t^ext + c₂ exp(−|ϵ_t^fb|) + c₃ κ_{r̂,t}), (10)\nwhere y is a leak component, r_t^ext is the number of received points, ϵ_t^fb is the feedback error, κ_{r̂,t} is an estimate of internal confidence, and {c₁, c₂, c₃} are respective weights accounting for different scales of rewards and the types of rewards prioritised by observers. The variable I± takes the value of +1 or −1 depending on the object identity, i.e., reproduction of the green or blue item, with the assignment of conditions being arbitrary. 
Positive values of ν indicate a higher relative value for one item (I = +1), while negative values of ν indicate a higher relative value for the other item (I = −1). To account for the fact that ν can range from −∞ to +∞, we transform it into the gain parameter α using the following equation:\nα(ν) = e^(2ν). (11)\nThis allows us to compute the proportion of spiking activity allocated to the item identified as sign(I) = +1, given by α/(1 + α), with the remaining spiking activity allocated to the other item. When ν = 0, such as at the beginning of the task, both items are perceived as having equal value, resulting in an equal distribution of neural resources between them.\nThe leak component (y) functions as a temporal filter, modulating the influence of past rewards on resource allocation. When y = 1, the system entirely ignores accumulated past values, making the value of an object (and thus resource allocation in the next trial) rely exclusively on the reward from the most recent trial. Conversely, when y = 0, the accumulated value is fully retained and integrated with the most recent reward. The necessity of the leak component becomes particularly evident in scenarios where rewards are discontinued: a non-zero leak will gradually equalize the relative value and resource allocation across objects, returning them to a state of equilibrium.\nThe first reward component of Equation 10, r^ext, reflects the experimental manipulation of Experiment 1. In this experiment, observers received 15 points for responses with an error of less than 50° for high-reward objects and 5 points for low-reward objects. Responses with an error greater than 50° received no points. 
When applying this model to the data, we used values of r^ext = {0.15, 0.05, 0} to represent the rewards for high-reward, low-reward, and no-reward trials, respectively.\nThe feedback component of the model (ϵ^fb) addresses the experimental manipulation of Experiment 2. In this experiment, we systematically manipulated feedback error by reducing it for one stimulus and increasing it for another. We hypothesized that feedback serves as an intrinsic reward, with stimuli receiving minified feedback errors being perceived as more rewarding than those with magnified feedback errors. In modelling this relationship, feedback error was assumed to be exponentially related to the object’s value ν, with smaller feedback errors corresponding to higher rewards, leading to a greater increase in the object’s value. This exponential relationship reflects diminishing sensitivity to large feedback errors, such that a wide range of larger errors yields relatively minimal and similar rewards, whereas a narrow range of smaller errors results in significantly higher but more variable rewards.\nThe final component of our model is the estimate of internal confidence (κ_r̂). While internal confidence can be assessed through self-reported or metacognitive measures, our approach leverages the inherent mechanism of the Neural resource model to quantify uncertainty in the decoded (i.e., reported) value. Our approach relies on the principle that the width of the likelihood function reflects the uncertainty of the estimate. 
The likelihood function evaluates how well various stimulus values align with the observed neural activity: a broad likelihood function is compatible with many different feature values, suggesting lower precision in the maximum likelihood estimate (the peak of the likelihood function), whereas a narrow likelihood function implies a more precise estimate. Due to the probabilistic generation of spikes across retrievals (Eq. 4), the likelihood has the form of a von Mises with concentration κ_r̂ proportional to the resultant vector length of the preferred values associated with each of the emitted spikes (m), with higher spike counts producing a narrower likelihood function on average [51]. This formulation has previously been shown to quantitatively reproduce findings from studies in which participants were asked to rate their subjective confidence in each estimate [67, 71]. Consequently, the precision of the likelihood function emerges as a natural candidate for a computational estimate of the observer’s internal confidence.\nTo measure internal confidence associated with each response, we determined the most probable resultant vector length given the individual response errors and the probabilistic distribution of spike values, which was fully characterized by population parameters γ and κ. 
Specifically, for each trial, we used Bayes’ rule to find the posterior probability of resultant vector length r given the error on that trial ϵ, marginalizing over total spike count m,\np(r|ϵ, κ, γ) = p(ϵ|r, κ) p(r|κ, γ) / ∫ p(ϵ|r, κ) p(r|κ, γ) dr, (12)\np(ϵ|r, κ) = ϕ(ϵ; 0, κr), (13)\np(r|κ, γ) = ∫ p(r|m, κ) p(m|γ) dm, (14)\nwhere p(r|m, κ) is given by Eq. 7 and p(m|γ) is the Poisson p.m.f. with mean γ. Applying MAP estimation to this posterior distribution returns the most probable estimate of resultant length for a given response error,\nr̂ = arg max_r p(r|ϵ, κ, γ). (15)\nFinally, we use κ_r̂ = r̂κ as a measure of internal confidence on the given trial.\nModel fitting\nTo model the observed allocation within the Neural resource model [22, 51], which has two free parameters, the mean population activity (γ) and the precision of the tuning functions (κ), we introduced an additional parameter, the gain modulation α [22], resulting in a total of three free parameters in Experiments 1 and 2. In Experiment 3, which involved an estimation difficulty manipulation, the Neural resource model was extended by two additional parameters (σ²_45% and σ²_65%) to capture the effects of variable sensory noise introduced by different coherence levels. This brought the total number of free parameters in the Neural resource model for Experiment 3 to five.\nThe Reinforcement learning account retained all parameters of the Neural resource model except the gain modulation parameter α, while introducing four new parameters, namely the leak parameter (y) and reward weight parameters (c₁, c₂, c₃). In all three experiments, we modelled the leak parameter (y) and the effect of internal confidence (c₃) on resource allocation (see Eq. 
9); additionally, we modelled the effect of external reward (c₁) only in Experiment 1 while setting it to zero in all other experiments, and feedback error (c₂) only in Experiment 2 while also setting it to zero in all other experiments. This resulted in the estimation of five free parameters in Experiments 1 and 2, and six in Experiment 3. When fitting the model to the data, the leak parameter was constrained between 0 and 1, and all three weight parameters were limited to a range of −1 to 1.\nFor all models, we obtained a separate maximum likelihood fit for each individual observer. These fits were derived using the Nelder-Mead simplex method (via the fminsearch function in MATLAB). A MATLAB toolbox implementing the Neural resource model is available for download from https://bayslab.com/toolbox.\nAcknowledgment\nWe thank David Aagten-Murphy and Robert Taylor, who worked on earlier iterations of this project. We thank Neha Abraham, Pepita Alex, Amida Anand, Paul McMeekin, Adam Sabo, Tom Wenban-Smith, Adam Zhu, and Adam Triabhall for assisting with data collection. This work was funded by the Wellcome Trust (grant 106926 to P.M.B.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.\nAuthor contributions\nI.T. contributed to conceptualization, methodology, software, data collection, investigation, formal analysis, modelling, visualizations, and writing - original draft and revisions. R.R.R contributed to methodology, software, and data collection. P.M.B. 
contributed to conceptualization, funding acquisition, supervision, methodology, formal analysis, modelling, visualizations, and writing - editing and revisions.\nData availability\nData and analysis code will be made publicly available upon publication of this manuscript.\nReferences\n[1] Berridge, K. C., Robinson, T. E., and Aldridge, J. W. Dissecting components of reward: ‘liking’, ‘wanting’, and learning. Current Opinion in Pharmacology 9.1 (2009), pp. 65–73. doi: 10.1016/j.coph.2008.12.014.\n[2] Berridge, K. C. and Robinson, T. E. Parsing reward. Trends in Neurosciences 26.9 (2003), pp. 507–513. doi: 10.1016/S0166-2236(03)00233-9.\n[3] Stănişor, L., Van Der Togt, C., Pennartz, C. M. A., and Roelfsema, P. R. A unified selection signal for attention and reward in primary visual cortex. Proceedings of the National Academy of Sciences 110.22 (2013), pp. 9136–9141. doi: 10.1073/pnas.1300117110.\n[4] Schultz, W. Behavioral Theories and the Neurophysiology of Reward. Annual Review of Psychology 57.1 (2006), pp. 87–115. doi: 10.1146/annurev.psych.56.091103.070229.\n[5] Maunsell, J. H. Neuronal representations of cognitive state: reward or attention? Trends in Cognitive Sciences 8.6 (2004), pp. 261–265. doi: 10.1016/j.tics.2004.04.003.\n[6] Blain, B. and Sharot, T. Intrinsic reward: potential cognitive and neural mechanisms. Current Opinion in Behavioral Sciences 39 (2021), pp. 113–118. doi: 10.1016/j.cobeha.2021.03.008.\n[7] Navalpakkam, V., Koch, C., Rangel, A., and Perona, P. Optimal reward harvesting in complex perceptual environments. Proceedings of the National Academy of Sciences 107.11 (2010), pp. 5232–5237. doi: 10.1073/pnas.0911972107.\n[8] Lee, J. and Shomstein, S. The Differential Effects of Reward on Space- and Object-Based Attentional Allocation. Journal of Neuroscience 33.26 (2013), pp. 10625–10633. doi: 10.1523/JNEUROSCI.5575-12.2013.\n[9] Della Libera, C. and Chelazzi, L. 
Visual Selective Attention and the Effects of Monetary Rewards. Psychological Science 17.3 (2006), pp. 222–227. doi: 10.1111/j.1467-9280.2006.01689.x.\n[10] Kristjansson, A., Sigurjonsdottir, O., and Driver, J. Fortune and reversals of fortune in visual search: Reward contingencies for pop-out targets affect search efficiency and target repetition effects. Attention, Perception & Psychophysics 72.5 (2010), pp. 1229–1236. doi: 10.3758/APP.72.5.1229.\n[11] Anderson, B. A., Laurent, P. A., and Yantis, S. Value-driven attentional capture. Proceedings of the National Academy of Sciences 108.25 (2011), pp. 10367–10371. doi: 10.1073/pnas.1104047108.\n[12] Della Libera, C. and Chelazzi, L. Learning to Attend and to Ignore Is a Matter of Gains and Losses. Psychological Science 20.6 (2009), pp. 778–784. doi: 10.1111/j.1467-9280.2009.02360.x.\n[13] Peck, C. J., Jangraw, D. C., Suzuki, M., Efem, R., and Gottlieb, J. Reward Modulates Attention Independently of Action Value in Posterior Parietal Cortex. The Journal of Neuroscience 29.36 (2009), pp. 11182–11191. doi: 10.1523/JNEUROSCI.1929-09.2009.\n[14] Theeuwes, J. and Belopolsky, A. V. Reward grabs the eye: Oculomotor capture by rewarding stimuli. Vision Research 74 (2012), pp. 80–85. doi: 10.1016/j.visres.2012.07.024.\n[15] Anderson, B. A. and Kim, H. Mechanisms of value-learning in the guidance of spatial attention. Cognition 178 (2018), pp. 26–36. doi: 10.1016/j.cognition.2018.05.005.\n[16] Anderson, B. A. and Kim, H. On the representational nature of value-driven spatial attentional biases. 
Journal of Neurophysiology 120.5 (2018), pp. 2654–2658. doi: 10.1152/jn.00489.2018.\n[17] Awh, E., Belopolsky, A. V., and Theeuwes, J. Top-down versus bottom-up attentional control: a failed theoretical dichotomy. Trends in Cognitive Sciences 16.8 (2012), pp. 437–443. doi: 10.1016/j.tics.2012.06.010.\n[18] Anderson, B. A., Kim, H., Kim, A. J., Liao, M.-R., Mrkonja, L., Clement, A., and Grégoire, L. The past, present, and future of selection history. Neuroscience & Biobehavioral Reviews 130 (2021), pp. 326–350. doi: 10.1016/j.neubiorev.2021.09.004.\n[19] Bays, P. M., Schneegans, S., Ma, W. J., and Brady, T. F. Representation and computation in visual working memory. Nature Human Behaviour 8.6 (2024), pp. 1016–1034. doi: 10.1038/s41562-024-01871-2.\n[20] Bays, P. M., Gorgoraptis, N., Wee, N., Marshall, L., and Husain, M. Temporal dynamics of encoding, storage, and reallocation of visual working memory. Journal of Vision 11.10 (2011), pp. 6–6. doi: 10.1167/11.10.6.\n[21] Gorgoraptis, N., Catalao, R. F. G., Bays, P. M., and Husain, M. Dynamic Updating of Working Memory Resources for Visual Objects. Journal of Neuroscience 31.23 (2011), pp. 8502–8511. doi: 10.1523/JNEUROSCI.0208-11.2011.\n[22] Bays, P. M. Noise in Neural Populations Accounts for Errors in Working Memory. Journal of Neuroscience 34.10 (2014), pp. 3632–3645. doi: 10.1523/JNEUROSCI.3204-13.2014.\n[23] Emrich, S. M., Lockhart, H. A., and Al-Aidroos, N. Attention mediates the flexible allocation of visual working memory resources. Journal of Experimental Psychology: Human Perception and Performance 43.7 (2017), pp. 1454–1465. doi: 10.1037/xhp0000398.\n[24] Sprague, T. C., Itthipuripat, S., Vo, V. A., and Serences, J. T. Dissociable signatures of visual salience and behavioral relevance across attentional priority maps in human cortex. Journal of Neurophysiology 119.6 (2018), pp. 2153–2165. doi: 10.1152/jn.00059.2018.\n[25] Yoo, A. 
H., Klyszejko, Z., Curtis, C. E., and Ma, W. J. Strategic allocation of working memory resource (2018). doi: 10.1101/329870.\n[26] Taylor, R., Tomić, I., Aagten-Murphy, D., and Bays, P. M. Working memory is updated by reallocation of resources from obsolete to new items. Attention, Perception, & Psychophysics 85.5 (2023), pp. 1437–1451. doi: 10.3758/s13414-022-02584-2.\n[27] Griffin, I. C. and Nobre, A. C. Orienting Attention to Locations in Internal Representations. Journal of Cognitive Neuroscience 15.8 (2003), pp. 1176–1194. doi: 10.1162/089892903322598139.\n[28] Oberauer, K. Control of the Contents of Working Memory–A Comparison of Two Paradigms and Two Age Groups. Journal of Experimental Psychology: Learning, Memory, and Cognition 31.4 (2005), pp. 714–728. doi: 10.1037/0278-7393.31.4.714.\n[29] Klyszejko, Z., Rahmati, M., and Curtis, C. E. Attentional priority determines working memory precision. Vision Research 105 (2014), pp. 70–76. doi: 10.1016/j.visres.2014.09.002.\n[30] Atkinson, A. L., Oberauer, K., Allen, R. J., and Souza, A. S. Why does the probe value effect emerge in working memory? Examining the biased attentional refreshing account. Psychonomic Bulletin & Review 29.3 (2022), pp. 891–900. doi: 10.3758/s13423-022-02056-6.\n[31] Allen, R. J., Atkinson, A., and Hitch, G. J. Getting value out of working memory through strategic prioritisation; implications for storage and control. Quarterly Journal of Experimental Psychology (2024), p. 17470218241258102. doi: 10.1177/17470218241258102.\n[32] Gong, M. and Li, S.
Learned reward association improves visual working memory. Journal of Experimental Psychology: Human Perception and Performance 40.2 (2014), pp. 841–856. doi: 10.1037/a0035131.\n[33] Brissenden, J. A., Adkins, T. J., Hsu, Y. T., and Lee, T. G. Reward influences the allocation but not the availability of resources in visual working memory. Journal of Experimental Psychology: General (2023). doi: 10.1037/xge0001370.\n[34] Van Den Berg, R., Zou, Q., Li, Y., and Ma, W. J. No effect of monetary reward in a visual working memory task. PLOS ONE 18.1 (2023), e0280257. doi: 10.1371/journal.pone.0280257.\n[35] Atkinson, A. L., Berry, E. D., Waterman, A. H., Baddeley, A. D., Hitch, G. J., and Allen, R. J. Are there multiple ways to direct attention in working memory? Annals of the New York Academy of Sciences 1424.1 (2018), pp. 115–126. doi: 10.1111/nyas.13634.\n[36] Gazzaley, A. and Nobre, A. C. Top-down modulation: bridging selective attention and working memory. Trends in Cognitive Sciences 16.2 (2012), pp. 129–135. doi: 10.1016/j.tics.2011.11.014.\n[37] Awh, E., Vogel, E., and Oh, S.-H. Interactions between attention and working memory. Neuroscience 139.1 (2006), pp. 201–208. doi: 10.1016/j.neuroscience.2005.08.023.\n[38] Wolf, D. H., Gerraty, R., Satterthwaite, T. D., Loughead, J., Campellone, T., Elliott, M. A., Turetsky, B. I., Gur, R. C., and Gur, R. E. Striatal intrinsic reinforcement signals during recognition memory: relationship to response bias and dysregulation in schizophrenia. Frontiers in Behavioral Neuroscience 5 (2011), p. 81. doi: 10.3389/fnbeh.2011.00081.\n[39] Schultz, W., Dayan, P., and Montague, P. R. A Neural Substrate of Prediction and Reward. Science 275.5306 (1997), pp. 1593–1599. doi: 10.1126/science.275.5306.1593.\n[40] Knutson, B., Fong, G. W., Adams, C. M., Varner, J. L., and Hommer, D.
Dissociation of reward anticipation and outcome with event-related fMRI. Neuroreport 12.17 (2001), pp. 3683–3687. doi: 10.1097/00001756-200112040-00016.\n[41] Elliott, R., Friston, K. J., and Dolan, R. J. Dissociable Neural Responses in Human Reward Systems. The Journal of Neuroscience 20.16 (2000), pp. 6159–6165. doi: 10.1523/JNEUROSCI.20-16-06159.2000.\n[42] De Martino, B., Kumaran, D., Holt, B., and Dolan, R. J. The Neurobiology of Reference-Dependent Value Computation. The Journal of Neuroscience 29.12 (2009), pp. 3833–3842. doi: 10.1523/JNEUROSCI.4832-08.2009.\n[43] Han, S., Huettel, S. A., Raposo, A., Adcock, R. A., and Dobbins, I. G. Functional Significance of Striatal Responses during Episodic Decisions: Recovery or Goal Attainment? The Journal of Neuroscience 30.13 (2010), pp. 4767–4775. doi: 10.1523/JNEUROSCI.3077-09.2010.\n[44] Satterthwaite, T. D., Ruparel, K., Loughead, J., Elliott, M. A., Gerraty, R. T., Calkins, M. E., Hakonarson, H., Gur, R. C., Gur, R. E., and Wolf, D. H. Being right is its own reward: Load and performance related ventral striatum activation to correct responses during a working memory task in youth. NeuroImage 61.3 (2012), pp. 723–729. doi: 10.1016/j.neuroimage.2012.03.060.\n[45] Daniel, R. and Pollmann, S. Striatal activations signal prediction errors on confidence in the absence of external feedback. NeuroImage 59.4 (2012), pp. 3457–3467. doi: 10.1016/j.neuroimage.2011.11.058.\n[46] Hebart, M. N., Schriever, Y., Donner, T. H., and Haynes, J.-D. The Relationship between Perceptual Decision Variables and Confidence in the Human Brain. Cerebral Cortex 26.1 (2016), pp. 118–130. doi: 10.1093/cercor/bhu181.\n[47] Schwarze, U., Bingel, U., Badre, D., and Sommer, T.
Ventral Striatal Activity Correlates with Memory Confidence for Old- and New-Responses in a Difficult Recognition Test. PLoS ONE 8.3 (2013), e54324. doi: 10.1371/journal.pone.0054324.\n[48] Guggenmos, M., Wilbertz, G., Hebart, M. N., and Sterzer, P. Mesolimbic confidence signals guide perceptual learning in the absence of external feedback. eLife 5 (2016), e13388. doi: 10.7554/eLife.13388.\n[49] Prinzmetal, W., Amiri, H., Allen, K., and Edwards, T. Phenomenology of attention: I. Color, location, orientation, and spatial frequency. Journal of Experimental Psychology: Human Perception and Performance 24.1 (1998), pp. 261–282. doi: 10.1037/0096-1523.24.1.261.\n[50] Tomić, I., Adamcová, D., Fehér, M., and Bays, P. M. Dissecting the components of error in analogue report tasks. Behavior Research Methods (2024). doi: 10.3758/s13428-024-02453-w.\n[51] Schneegans, S., Taylor, R., and Bays, P. M. Stochastic sampling provides a unifying account of visual working memory limits. Proceedings of the National Academy of Sciences (2020), p. 202004306. doi: 10.1073/pnas.2004306117.\n[52] Kiani, R., Corthell, L., and Shadlen, M. N. Choice Certainty Is Informed by Both Evidence and Decision Time. Neuron 84.6 (2014), pp. 1329–1342. doi: 10.1016/j.neuron.2014.12.015.\n[53] Fleming, S. M. Metacognition and Confidence: A Review and Synthesis. Annual Review of Psychology 75.1 (2024), pp. 241–268. doi: 10.1146/annurev-psych-022423-032425.\n[54] Chetverikov, A. and Jehee, J. F. M.
Motion direction is represented as a bimodal probability distribution in the human visual cortex. Nature Communications 14.1 (2023), p. 7634. doi: 10.1038/s41467-023-43251-w.\n[55] Kwak, Y. and Curtis, C. E. Unveiling the abstract format of mnemonic representations. Neuron 110.11 (2022), 1822–1828.e5. doi: 10.1016/j.neuron.2022.03.016.\n[56] Shadmehr, R., Reppert, T. R., Summerside, E. M., Yoon, T., and Ahmed, A. A. Movement Vigor as a Reflection of Subjective Economic Utility. Trends in Neurosciences 42.5 (2019), pp. 323–336. doi: 10.1016/j.tins.2019.02.003.\n[57] Summerside, E. M., Shadmehr, R., and Ahmed, A. A. Vigor of reaching movements: reward discounts the cost of effort. Journal of Neurophysiology 119.6 (2018), pp. 2347–2357. doi: 10.1152/jn.00872.2017.\n[58] Manohar, S. G., Finzi, R. D., Drew, D., and Husain, M. Distinct Motivational Effects of Contingent and Noncontingent Rewards. Psychological Science 28.7 (2017), pp. 1016–1026. doi: 10.1177/0956797617693326.\n[59] Berridge, K. C. and Robinson, T. E. What is the role of dopamine in reward: hedonic impact, reward learning, or incentive salience? Brain Research Reviews 28.3 (1998), pp. 309–369. doi: 10.1016/S0165-0173(98)00019-8.\n[60] Yoo, A. H. and Collins, A. G. E. How Working Memory and Reinforcement Learning Are Intertwined: A Cognitive, Neural, and Computational Perspective. Journal of Cognitive Neuroscience 34.4 (2022), pp. 551–568. doi: 10.1162/jocn_a_01808.\n[61] Serences, J. T. Value-Based Modulations in Human Visual Cortex. Neuron 60.6 (2008), pp. 1169–1181. doi: 10.1016/j.neuron.2008.10.051.\n[62] Ashby, F. G. and Maddox, W. T. Human Category Learning. Annual Review of Psychology 56.1 (2005), pp. 149–178. doi: 10.1146/annurev.psych.56.091103.070217.\n[63] Hattie, J. and Timperley, H. The Power of Feedback. Review of Educational Research 77.1 (2007), pp. 81–112. doi: 10.3102/003465430298487.\n[64] Haddara, N. and Rahnev, D.
The Impact of Feedback on Perceptual Decision-Making and Metacognition: Reduction in Bias but No Change in Sensitivity. Psychological Science 33.2 (2022), pp. 259–275. doi: 10.1177/09567976211032887.\n[65] Rouault, M. and Fleming, S. M. Formation of global self-beliefs in the human brain. Proceedings of the National Academy of Sciences 117.44 (2020), pp. 27268–27276. doi: 10.1073/pnas.2003094117.\n[66] Bröker, F., Holt, L. L., Roads, B. D., Dayan, P., and Love, B. C. Demystifying unsupervised learning: how it helps and hurts. Trends in Cognitive Sciences 28.11 (2024), pp. 974–986. doi: 10.1016/j.tics.2024.09.005.\n[67] Berg, R. van den, Yoo, A. H., and Ma, W. J. Fechner’s law in metacognition: A quantitative model of visual working memory confidence. Psychological Review 124.2 (2017), pp. 197–214. doi: 10.1037/rev0000060.\n[68] Li, H.-H., Sprague, T. C., Yoo, A. H., Ma, W. J., and Curtis, C. E. Joint representation of working memory and uncertainty in human cortex. Neuron 109.22 (2021), 3699–3712.e6. doi: 10.1016/j.neuron.2021.08.022.\n[69] Ma, W. J., Beck, J. M., Latham, P. E., and Pouget, A. Bayesian inference with probabilistic population codes. Nature Neuroscience 9.11 (2006), pp. 1432–1438. doi: 10.1038/nn1790.\n[70] Pouget, A., Dayan, P., and Zemel, R. S. Inference and computation with population codes. Annual Review of Neuroscience 26 (2003), pp. 381–410. doi: 10.1146/annurev.neuro.26.041002.131112.\n[71] Bays, P. M. A signature of neural coding at human perceptual limits. Journal of Vision 16.11 (2016), p.
4. doi: 10.1167/16.11.4.\n[72] Schneegans, S. and Bays, P. M. Drift in Neural Population Activity Causes Working Memory to Deteriorate Over Time. The Journal of Neuroscience 38.21 (2018), pp. 4859–4869. doi: 10.1523/JNEUROSCI.3440-17.2018.\n[73] Schaffner, J., Bao, S. D., Tobler, P. N., Hare, T. A., and Polania, R. Sensory perception relies on fitness-maximizing codes. Nature Human Behaviour (2023). doi: 10.1038/s41562-023-01584-y.\n[74] Shuler, M. G. and Bear, M. F. Reward Timing in the Primary Visual Cortex. Science 311.5767 (2006), pp. 1606–1609. doi: 10.1126/science.1123513.\n[75] Badre, D. Cognitive Control. Annual Review of Psychology (2024). doi: 10.1146/annurev-psych-022024-103901.\n[76] Inzlicht, M., Shenhav, A., and Olivola, C. Y. The Effort Paradox: Effort Is Both Costly and Valued. Trends in Cognitive Sciences 22.4 (2018), pp. 337–349. doi: 10.1016/j.tics.2018.01.007.\n[77] Westbrook, A. and Braver, T. S. Cognitive effort: A neuroeconomic approach. Cognitive, Affective, & Behavioral Neuroscience 15.2 (2015), pp. 395–415. doi: 10.3758/s13415-015-0334-y.\n[78] Shenhav, A., Musslick, S., Lieder, F., Kool, W., Griffiths, T. L., Cohen, J. D., and Botvinick, M. M. Toward a Rational and Mechanistic Account of Mental Effort. Annual Review of Neuroscience 40.1 (2017), pp. 99–124. doi: 10.1146/annurev-neuro-072116-031526.\n[79] Kool, W., McGuire, J. T., Rosen, Z. B., and Botvinick, M. M. Decision making and the avoidance of cognitive demand. Journal of Experimental Psychology: General 139.4 (2010), pp. 665–682. doi: 10.1037/a0020198.\n[80] Corlazzoli, G., Desender, K., and Gevers, W. Feeling and deciding: Subjective experiences rather than objective factors drive the decision to invest cognitive control. Cognition 240 (2023), p. 105587. doi: 10.1016/j.cognition.2023.105587.\n[81] Kool, W. and Botvinick, M. The intrinsic cost of cognitive control. Behavioral and Brain Sciences 36.6 (2013), pp.
697–698. doi: 10.1017/S0140525X1300109X.\n[82] Botvinick, M. M., Huffstetler, S., and McGuire, J. T. Effort discounting in human nucleus accumbens. Cognitive, Affective, & Behavioral Neuroscience 9.1 (2009), pp. 16–27. doi: 10.3758/CABN.9.1.16.\n[83] Brainard, D. H. The Psychophysics Toolbox. Spatial Vision 10.4 (1997), pp. 433–436. doi: 10.1163/156856897X00357.\n[84] Pelli, D. G. The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spatial Vision 10.4 (1997), pp. 437–442.\n[85] Scase, M. O., Braddick, O. J., and Raymond, J. E. What is Noise for the Motion System? Vision Research 36.16 (1996), pp. 2579–2586. doi: 10.1016/0042-6989(95)00325-8.\n[86] JASP Team. JASP (Version 0.18.3) [Computer software]. 2024.\n[87] Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. Mixtures of g Priors for Bayesian Variable Selection. Journal of the American Statistical Association 103 (2008), pp. 410–423. doi: 10.1198/016214507000001337.\n[88] Wagenmakers, E.-J. et al. Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review 25.1 (2018), pp. 58–76. doi: 10.3758/s13423-017-1323-7.\n[89] Lee, M. D. and Wagenmakers, E.-J. Bayesian cognitive modeling: a practical course. Cambridge; New York: Cambridge University Press, 2013. 264 pp.\n[90] Tomić, I. and Bays, P. M. Perceptual similarity judgments do not predict the distribution of errors in working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition 50.4 (2024), pp.
535–549. doi: 10.1037/xlm0001172.\n[91] Tomić, I. and Bays, P. M. A dynamic neural resource model bridges sensory and working memory. eLife 12 (2024), RP91034. doi: 10.7554/eLife.91034.3.\n[92] Carandini, M. and Heeger, D. Normalization as a canonical neural computation. Nature Reviews Neuroscience 13.1 (2012), pp. 51–62. doi: 10.1038/nrn3136.\n[93] Tomić, I. and Bays, P. M. Internal but not external noise frees working memory resources. PLOS Computational Biology 14.10 (2018), e1006488. doi: 10.1371/journal.pcbi.1006488.\n\nSupplementary Information\nPsychophysical data\nExperiment 2b: Sequential presentation\n[Figure S1 image. Panel A: density against response error for magnified and minified feedback error. Panel B: MAD by feedback error (magnified, minified).]\nFigure S1: Perceived accuracy manipulation in Experiment 2b (sequential presentation). A) Histograms represent distributions of response errors. B) Mean absolute deviation of response errors. The coloured circles with error bars represent the mean ± SE.\nExperiment 3b: Sequential presentation\n[Figure S2 image. Panels A and B: density against response error, labelled High coherence colour and Low coherence colour. Panel C: MAD by coherence (High, Low, Inter. (High), Inter. (Low)).]\nFigure S2: Estimation difficulty manipulation in Experiment 3b (sequential presentation).
A & B) Histograms represent distributions of response errors. Panel A depicts variable coherence trials, and panel B depicts equal coherence trials. C) Mean absolute deviation of response errors. The coloured circles with error bars represent the mean ± SE.\nReinforcement learning account\nExternal reward\nThe average trajectory of resource allocation shown in Figure 5A is based on ML parameter estimates (mean ± SE): mean activity γ = 2.88 ± 0.39; tuning precision κ = 10.29 ± 0.98; leak y = 0.28 ± 0.06; reward weight c1 = 0.31 ± 0.14; internal confidence weight c3 = 0.012 ± 0.01. Calculating the correlation between parameter estimates from the Reinforcement learning account and the neural model with freely estimated resource allocation, we found highly consistent estimates of the population’s mean spiking activity (r = 0.997, 95% CI = [0.992, 0.999], BF10 = 1.46 × 10^28) and tuning precision (r = 0.971, 95% CI = [0.929, 0.986], BF10 = 3.16 × 10^15) (Fig. S4A).\nFinally, to assess whether observers prioritised external rewards or internal confidence signals in resource allocation, we calculated each observer’s mean contribution of external rewards and internal confidence to the relative value of the two objects. Results indicated moderate evidence for no difference between their contributions (BF10 = 0.22; δ = 0.076, 95% CI = [-0.263, 0.418]).
However, this finding should be interpreted with caution, as the effect of internal confidence may partly reflect observers’ resource allocation favouring the high-reward item (reflecting the influence of external reward), which subsequently enhances confidence for that item.\nIntrinsic reward: Perceived accuracy\nThe mean trajectory of resource allocation across trials shown in Figure 5D is based on ML parameter estimates (mean ± SE): mean activity γ = 5.01 ± 0.78; tuning precision κ = 6.64 ± 0.51; leak y = 0.538 ± 0.076; feedback weight c2 = 0.238 ± 0.108; internal confidence weight c3 = 0.007 ± 0.044. Comparing the estimates derived from the Neural resource model and the Reinforcement learning account (Fig. S4B), we again found closely matching estimates of the population’s mean spiking activity (r = 0.987, 95% CI = [0.964, 0.994], BF10 = 1.26 × 10^16) and tuning precision (r = 0.956, 95% CI = [0.884, 0.980], BF10 = 4.7 × 10^10).\nFinally, we found moderate evidence for no difference between feedback and internal confidence signals in their contribution to the relative value of objects (BF10 = 0.33; δ = 0.174, 95% CI = [-0.196, 0.553]).\nIntrinsic reward: Estimation difficulty\nFitting the Reinforcement learning account to psychophysical data from Experiment 3a, we obtained the following ML parameters (mean ± SE): mean activity γ = 3.33 ± 0.39; tuning precision κ = 11.53 ± 0.61; leak y = 0.398 ± 0.086; confidence weight c3 = 0.062 ± 0.045; intermediate perceptual noise SD65% = 0.143 ± 0.021; high perceptual noise SD45% = 0.338 ± 0.086. In Experiment 3b we observed very similar estimates: mean activity γ = 2.13 ± 0.31; tuning precision κ = 13.03 ± 1.70; leak y = 0.285 ± 0.083; confidence weight c3 = 0.007 ± 0.004; intermediate perceptual noise SD65% = 0.090 ± 0.021; high perceptual noise SD45% = 0.249 ± 0.056.
Again, we visualised the obtained individual trajectories in example participants (Fig. S3C & D).\nIn both experiments, estimates obtained with the Neural resource model and the Reinforcement learning account strongly covaried (Fig. S4C & D). Specifically, we found highly consistent estimates of the population’s mean spiking activity (Exp 3a: r = 0.995, 95% CI = [0.986, 0.998], BF10 = 6.15 × 10^17; Exp 3b: r = 0.999, 95% CI = [0.996, 1.000], BF10 = 1.66 × 10^19), tuning precision (Exp 3a: r = 0.912, 95% CI = [0.761, 0.961], BF10 = 2.84 × 10^6; Exp 3b: r = 0.970, 95% CI = [0.899, 0.989], BF10 = 5.66 × 10^8), intermediate perceptual noise (Exp 3a: r = 0.922, 95% CI = [0.785, 0.966], BF10 = 8.00 × 10^6; Exp 3b: r = 0.985, 95% CI = [0.947, 0.994], BF10 = 8.17 × 10^10), and high perceptual noise (Exp 3a: r = 0.659, 95% CI = [0.293, 0.830], BF10 = 48.1; Exp 3b: r = 0.896, 95% CI = [0.696, 0.957], BF10 = 6.80 × 10^5).
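As a reading aid for the leak and weight parameters reported in this section, the following is a minimal Python sketch of one way a leaky value update could feed a normalised resource split between two items. It is an illustration only: the functional form, the function names (update_values, resource_fractions, simulate), the constant reward and confidence inputs, and the initial values are all assumptions made here, and the model's actual equations are those given in the Methods. The default parameter values borrow the Experiment 1 mean estimates purely as plausible magnitudes.

```python
def update_values(values, rewards, confidences, leak, w_reward, w_conf):
    # Assumed form: each item's value decays by the leak and is
    # incremented by weighted external reward and internal confidence.
    return [
        (1.0 - leak) * v + w_reward * r + w_conf * c
        for v, r, c in zip(values, rewards, confidences)
    ]


def resource_fractions(values, floor=1e-9):
    # Divisive normalisation (assumed): fractions are proportional to
    # value and sum to 1; the floor avoids division by zero.
    clipped = [max(v, floor) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped]


def simulate(n_trials, rewards, confidences,
             leak=0.28, w_reward=0.31, w_conf=0.012, init=1.0):
    # Returns the fraction of resource given to item 0 on each trial,
    # recorded before that trial's value update.
    values = [init, init]
    trajectory = []
    for _ in range(n_trials):
        trajectory.append(resource_fractions(values)[0])
        values = update_values(values, rewards, confidences,
                               leak, w_reward, w_conf)
    return trajectory


if __name__ == '__main__':
    # Hypothetical inputs: item 0 earns more reward than item 1,
    # with equal confidence for both items.
    traj = simulate(200, rewards=[1.0, 0.2], confidences=[0.5, 0.5])
    print(round(traj[0], 3), round(traj[-1], 3))
```

Under these assumed inputs the fraction assigned to the higher-reward item starts at an even split and rises towards a steady state set by the ratio of the two items' weighted inputs; in this toy rule a larger leak only makes that approach faster.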
[Figure S3 image. Panels: A) External reward (Exp 1); B) Perceived accuracy (Exp 2a); C) Estimation difficulty (Exp 3a); D) Estimation difficulty (Exp 3b). Each panel shows Observers 1–3, with trial number on the x-axis and resource fraction (0 to 1) on the y-axis.]\nFigure S3: A) Trial-by-trial resource allocation estimated by the RL account in the external reward experiment (Experiment 1) for three illustrative participants. Circles represent the fraction of resources allocated to the preferred item on each trial. B) Perceived accuracy experiment (Experiment 2). C) Estimation difficulty experiment (Experiment 3a). D) Estimation difficulty experiment (Experiment 3b).
[Figure S4 image. Panels A–D: RL account estimates plotted against Neural resource model estimates for spiking activity and tuning precision.]\nFigure S4: Correlation between mean activity (top row) and tuning precision (bottom row) parameters estimated in the Neural resource model and the RL account of resource allocation. A) External reward experiment (Experiment 1). B) Perceived accuracy experiment (Experiment 2). C) Estimation difficulty experiment (Experiment 3a). D) Estimation difficulty experiment (Experiment 3b).","source_license":"CC-BY-4.0","license_restricted":false}