Abstract
Excessive and compulsive drug use despite adverse consequences is a hallmark of
substance use disorder, yet individuals differ markedly in their vulnerability to developing these
behaviors. Drugs of abuse have long been known to alter endogenous dopamine (DA) signaling, but
shared principles for how DA dynamics impact compulsive use among individuals and across drug
classes are lacking. Here, we monitored DA release in the medial shell of the nucleus accumbens
(NAc) during cocaine and fentanyl self-administration, with or without coincident punishment, in
large cohorts of mice. Contingent cocaine and fentanyl self-administration evoked complex and
individually distinct DA dynamics; nevertheless, a robust negative correlation held across both
drugs, such that high takers exhibited lower drug-evoked DA signals. During punished drug taking,
cocaine and fentanyl were associated with distinct DA signatures of compulsivity. For
cocaine, punishment-resistant mice showed lower sustained DA responses during the post-shock,
drug-associated cue period, whereas for fentanyl, punishment-resistant mice displayed larger
phasic DA at the co-occurrence of footshock and drug infusion. To identify common principles
underlying these observations, we developed a computational model grounded in an Actor-Critic
temporal-difference (TD) learning framework that incorporates internal states, the agent's uncertainty,
and drug-specific effects. Remarkably, this model captures the observed diversity in DA dynamics
across drug classes and among mice with variable drug-taking propensities, thereby providing a
unified interpretation of NAc DA signals as encoding TD reward prediction errors.
bioRxiv preprint doi: https://doi.org/10.64898/2026.01.18.700215; this version posted January 20, 2026. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
available under a CC-BY-NC-ND 4.0 International license.
Main Text:
Substance use disorder (SUD) is a leading cause of drug overdose-related mortality. Excessive
drug consumption and compulsive drug taking despite adverse consequences are defining features
of SUD. However, only a subset of individuals transition from initially controlled, recreational
drug use to uncontrolled and compulsive taking, underscoring a striking degree of individual
variability in vulnerability to addiction (1–6). Rodent models similarly reveal significant
individual differences in the propensity to transition from controlled to addiction-like drug use (7–15).
For example, in cocaine self-administration paradigms, only a subset of rats met operational
criteria for addiction-like behaviors, characterized by escalation of intake, persistent drug seeking
despite adverse consequences, and heightened motivation to obtain the drug (13, 15). Despite
extensive research, the neurobiological mechanisms by which exposure to addictive drugs leads to
divergent outcomes across individuals remain poorly understood. This study aims to identify
specific neurochemical signatures associated with excessive and compulsive drug taking despite
adverse consequences.
A large body of work has identified the mesolimbic dopamine system, particularly
projections from the ventral tegmental area (VTA) to the nucleus accumbens (NAc), as central to
the reinforcing effect of abused drugs and the development of addiction (16–24). Seminal
microdialysis and voltammetry studies showed that pharmacologically diverse drugs of abuse,
including cocaine and opioids, preferentially elevate extracellular dopamine levels in the NAc
relative to the dorsal striatum in animals (18, 23, 25). However, dopamine responses to drugs and
drug-associated cues are not uniform across individuals, nor do they remain static over the course
of addiction development. Indeed, human neuroimaging and rodent studies have revealed
substantial inter-individual variability in both drug-evoked and cue-evoked dopamine release (26–30),
suggesting that differences in dopamine responses may contribute to addiction
vulnerability (5, 16, 27, 29, 31). From a computational perspective, dopamine has often been
conceptualized as a reward prediction error signal within temporal-difference reinforcement
learning models (32–35), whereas psychological and neurobiological theories such as incentive
sensitization, allostasis, and habit formation emphasize how drug-induced adaptations in dopamine
circuits may drive pathological "wanting", negative reinforcement, and stimulus-response habits
in the development of addiction (5, 36–40). Despite these influential theories, direct comparisons
of in vivo dopamine dynamics across individuals and across different drug classes remain limited,
hindering efforts to determine which dopamine theories best explain experimental data and
individual differences in drug taking and compulsive behavior.
In vivo measurements of dopamine using fast-scan cyclic voltammetry (FSCV) and
genetically encoded dopamine sensors have provided rich descriptions of dopamine dynamics
during psychostimulant self-administration, particularly for cocaine (23, 30, 41–43). In contrast,
the subsecond dopamine dynamics underlying opioid self-administration remain poorly
characterized, especially in the context of fentanyl. A recent study showed that heroin selectively
activates a subset of VTA dopamine neurons projecting to the medial shell of the NAc, and that these
dopamine neurons are critical for heroin self-administration (19). However, little is known about
how variability in real-time dopamine dynamics in the NAc relates to individual differences in
excessive opioid self-administration. Moreover, it remains unknown how dopamine release
patterns correlate with the emergence of compulsive drug taking despite adverse consequences.
Critically, few studies have directly compared dopamine signatures across drug classes using the
same self-administration paradigm, or tested which formal theories of dopamine in addiction best
account for the observed dopamine dynamics across conditions.
Here, we measure real-time changes in dopamine release with a genetically encoded sensor in
the medial shell of the NAc as mice develop excessive and compulsive self-administration of two
distinct drug types, cocaine and fentanyl. We chose fentanyl to represent the opioid drug class
because it is the leading cause of overdose death (44), and cocaine to represent the stimulant drug class
for its strong abuse potential and to ensure consistency of our results with the existing literature.
We focused on the medial NAc shell because this region has been strongly implicated in the
reinforcing and motivational effects of both cocaine and opioids (19, 41, 45, 46). To capture the
transition from controlled to compulsive drug use, we employed long-access intravenous
self-administration paradigms, which reliably promote escalation of intake and the emergence of
punishment-resistant, compulsive drug taking (11, 13, 47, 48). We quantify individual differences
in cocaine- and fentanyl-taking behaviors and punishment sensitivity and correlate these
measurements with individual variations in dopamine dynamics. We then build a computational
model that remarkably recapitulates the diverse dopamine patterns observed across individuals,
drug classes, and stages of drug self-administration. Altogether, our study reveals dopamine
signatures of excessive and compulsive drug taking in SUD models and provides a unified
computational framework for understanding NAc dopamine signals in reinforced drug
consumption.
Individual differences in drug-taking behavior and punishment resistance
To assess individual differences in addiction-like behaviors induced by psychostimulant or
opioid exposure, we trained large cohorts of mice to perform intravenous self-administration
(IVSA) of either cocaine (n = 24) or fentanyl (n = 27). We also expressed a genetically encoded
dopamine (DA) sensor in these mice to measure DA release (described below). Specifically, mice
implanted with catheters were trained to press an active lever that triggered an intravenous infusion
of either cocaine (0.3 mg/kg/infusion) or fentanyl (2 µg/kg/infusion) during daily 6-hr sessions,
conducted 5 days per week for 4 weeks (Fig. 1A). The training context and parameters are
described in detail in fig. S1, as they are important for interpreting dopamine signals. Briefly, upon
the insertion of the levers (both active and inactive), the training chamber was lit by the light above the
active lever (light ON). Mice were allowed to move freely in the chamber, where they exhibited
typical spontaneous behaviors, such as locomotion, grooming, and rearing. They were trained to
press the active lever at a fixed ratio (from FR1 to FR2 to FR4) to obtain an intravenous infusion of
cocaine or fentanyl. Because the training was self-paced, the interval between lever insertion and
drug infusion onset varied substantially across trials and animals (from less than 20 seconds to ten
minutes, fig. S1). Once drug infusion started, each infusion (lasting ~2.8 seconds) was
accompanied by 19.5 seconds of ON-OFF blinking of the light above the active lever. After this
period, the levers were retracted, and the chamber lights were turned off. Following a 20.5-second
dark interval, the next trial began with the lights turned on and the levers reinserted.
Consistent with prior studies of long-access drug IVSA (47–49), active lever presses and
cocaine (n = 24, Fig. 1B) or fentanyl (n = 27, Fig. 1E) intake gradually escalated across the
training sessions under the FR1 schedule, while inactive lever presses remained minimal. When
the reinforcement schedule was increased to FR2 and subsequently FR4, active lever presses
continued to escalate for both cocaine (reaching 997 ± 244 active presses per 6-hour session by
the end of FR4 training) and fentanyl (averaging 799 ± 89 active presses per 6-hour session by the
end of FR4 training). Drug intake for both cocaine and fentanyl remained high after an initial dip
when switched to FR2 (Fig. 1B, 1E). Note that under the FR4 schedule, each lever press had a low
probability of resulting in drug infusion. Notably, mice exhibited substantial variability in the
latency to the first active lever press upon lever insertion, inter-press intervals, and the latency
from lever insertion to drug infusion in both cocaine and fentanyl IVSA (fig. S1). As controls, we
also trained a separate group of mice (n = 8) to self-administer saline under a similar FR1 to FR4
schedule. Interestingly, these mice also pressed the active lever slightly more than the inactive
lever for saline infusions (fig. S2). However, their discrimination between active and inactive lever presses
was significantly lower than that of mice self-administering cocaine or fentanyl (fig. S2).
Following the 4-week extended training period, mice underwent at least three 3-hour regular
cocaine or fentanyl IVSA sessions under the FR4 schedule (baseline sessions) before they were
subjected to three 3-hour punishment sessions (Fig. 1C, 1F). During the punishment sessions, each
drug infusion (triggered by the 4th active lever press) was paired with a mild foot shock (0.2 mA,
0.5 second). As expected, punishment significantly reduced both the average number of active
lever presses and the average intake of cocaine and fentanyl (Fig. 1D, 1G). However, there was
clear individual variability in punishment sensitivity, with some mice continuing to press the active
lever for drug infusions despite receiving foot shocks.
To characterize individual differences in drug taking and punishment responsiveness, mice
were categorized as high or low drug-taking (Fig. 2A-B and 2H-I, orange vs. light blue)
and as high or low punishment-resistant (Fig. 2D-E and 2K-L, red vs. dark blue), based on their
normalized intake of cocaine or fentanyl during the baseline and punishment sessions, respectively
(see Methods). Mice whose drug intake fell within ±10% of the group mean were left
unclassified (grey samples in Fig. 2). We compared lever pressing between the low- and high-taking
groups and observed similar behavioral patterns in the high cocaine-taking (n = 8) and high fentanyl-taking
(n = 11) mice. High drug takers made significantly more active lever presses per infusion than
low drug takers (Fig. 2C, 2J; 6.3 ± 0.27 vs. 5.0 ± 0.08 presses for cocaine; 11.1 ± 1.6 vs. 5.8 ± 0.20
presses for fentanyl), indicating that these mice tended to make more futile lever presses (i.e.,
presses that did not result in additional infusions during the 19.5-second light-blinking period). In
addition, high drug-taking mice for both cocaine and fentanyl displayed significantly shorter
latencies to the first active lever press following lever insertion (33.6 ± 3.85 vs. 105.8 ± 7.75 seconds
for cocaine; 39.3 ± 4.45 vs. 122.7 ± 12.74 seconds for fentanyl) and shorter inter-press intervals
between active lever presses (9.8 ± 0.82 vs. 17.9 ± 2.6 seconds for cocaine; 7.9 ± 1.13 vs. 23.7 ± 1.69
seconds for fentanyl) compared with the low drug-taking groups. These results indicate that high
drug takers tended to respond more rapidly and persistently on the active lever (Fig. 2C, 2J; also
see representative examples in fig. S1).
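As an illustration of the grouping rule described above, the following is a minimal sketch. It assumes that intake has already been normalized and that the unclassified band is exactly ±10% of the group mean. The function name `classify_by_intake` and the `band` parameter are hypothetical; the authors' actual procedure is described in their Methods.

```python
from statistics import mean


def classify_by_intake(intakes, band=0.10):
    """Label each animal 'high', 'low', or 'unclassified' by drug intake.

    Assumption: animals whose intake lies within +/- `band` (as a fraction)
    of the group mean are left unclassified.
    """
    group_mean = mean(intakes)
    lower = group_mean * (1.0 - band)
    upper = group_mean * (1.0 + band)
    labels = []
    for intake in intakes:
        if intake > upper:
            labels.append("high")
        elif intake < lower:
            labels.append("low")
        else:
            labels.append("unclassified")
    return labels


# Hypothetical intakes for three animals (group mean = 100).
print(classify_by_intake([50, 100, 150]))
```

With intakes of 50, 100, and 150, the sketch labels the three animals low, unclassified, and high, respectively.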
Regarding punishment resistance (Fig. 2D-E and 2K-L), we observed opposite trends in
punished drug intake between the cocaine- and fentanyl-taking groups. High punishment-resistant mice
for cocaine (n = 7) tended to decrease their drug intake across the three punishment sessions,
whereas high punishment-resistant mice for fentanyl (n = 9) tended to increase their intake over
the same punishment sessions (Fig. 2E, 2L), suggesting that fentanyl was more effective at promoting
punishment resistance. We next examined whether baseline and punished drug intake were
correlated across individuals. On average, high drug takers were not significantly more
resistant to punishment than low drug takers under either the cocaine or fentanyl conditions (Fig.
2F, 2M). Likewise, high and low punishment-resistant mice exhibited comparable baseline drug
intake (Fig. 2F, 2M). Indeed, there was mixed overlap between the high vs. low drug-taking and
high vs. low punishment-resistant groups (Fig. 2G, 2N). Thus, within the current experimental
paradigm, punishment resistance does not correlate with the baseline level of drug taking, suggesting
that different underlying mechanisms modulate these two phenomena.
Dopamine signatures of excessive drug-taking behavior
Dopamine (DA) plays a crucial role in the reinforcing effects of both cocaine and opioids, but
how DA signals relate to individual differences in drug-taking behaviors remains unclear. To
examine DA dynamics in the cocaine- and fentanyl-IVSA mice described above, we expressed a
genetically encoded DA sensor (AAV2/5-hSyn-GRAB_DA2m) (50, 51) in the dorsal medial shell
of the NAc and implanted an optic fiber to collect fluorescent signals (fig. S3, Fig. 3A and 3J).
Using rotary fiber photometry (FP), which is compatible with IVSA in freely behaving mice,
we monitored DA dynamics during cocaine and fentanyl IVSA in fully trained mice across 3-hour
baseline testing sessions under the FR4 schedule (timeline shown in Fig. 1A). To minimize
photobleaching from continuous long-duration recording, we performed three cycles of 30-minute
rotary FP recording, each separated by a 30-minute interval (i.e., three episodes of 30-minute recording per
session), and concatenated them for analysis.
Examining the raw and z-scored FP traces across different trial epochs (see representative
examples in Fig. 3B and 3K), we found that in both cocaine and fentanyl IVSA, DA sensor
signals were relatively low prior to drug infusion. At the onset of infusion, DA signals increased,
with multiple peaks (referred to as DA transients) that occurred during the drug infusion, the
light-blinking period, and the lever-retraction/light OFF (inter-trial interval) phases. Once the light was
turned back ON and the levers were reinserted to initiate the next trial, DA signals declined.
Notably, individual DA transients were wider in cocaine IVSA than in fentanyl
IVSA. This is likely due to cocaine's blockade of the DA reuptake transporter, which
slows the clearance of extracellular DA and results in a significantly slower decay of DA signals
(decay slope: -0.54 ± 0.04 vs. -1.28 ± 0.05, fig. S4). By contrast, no consistent DA signals were
observed at the time of active lever presses (fig. S5), likely reflecting the fact that at FR4, each lever
press has a low probability of resulting in drug reward (also see computational modeling below).
We computed the time course of z-scored DA signals for each animal (see Methods) and
plotted averaged z-scores of all trials aligned to the onset of drug infusion, spanning from 10
seconds before infusion to 10 seconds after the initiation of the next trial (n = 24 mice for cocaine;
n = 27 mice for fentanyl, Fig. 3C, 3L). During cocaine IVSA, the averaged DA levels increased at
the onset of infusion and remained elevated throughout the light-blinking cue period and the
inter-trial interval (lights OFF period) (Fig. 3C). Similarly, during fentanyl IVSA,
DA levels increased at the onset of infusion, and the DA elevation displayed phasic, oscillatory
responses during the light-blinking period, which then became a sustained elevation during the
20.5-second dark inter-trial interval (Fig. 3L). The oscillatory DA pattern was more clearly
observed in the group-averaged signals for both fentanyl IVSA and cocaine IVSA during the
light-blinking period (lower panels in Fig. 3C, 3L). Notably, the rise-and-fall DA signals peaked shortly
after each OFF-ON transition of the blinking light (fig. S3D), indicating that the averaged oscillations
reflected responses to the light cue. In both cocaine and fentanyl IVSA, DA levels dropped rapidly
at the onset of the next trial when the light turned ON and the levers were inserted (Fig. 3C and 3L).
Importantly, the DA responses evoked by drug infusion and the blinking light were absent on the first
day of training (e.g., the auto-shaping session) and in saline-IVSA mice (fig. S3E-F), indicating
that these signals emerged through drug-cue associative learning. In contrast, the elevated DA
signals during the light OFF inter-trial interval were also present on the first day of training and in
saline-IVSA mice, possibly reflecting the natural preference of mice for darkness.
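To make the trial-aligned averaging concrete, the following is a minimal sketch of z-scoring a photometry trace and averaging it around infusion onsets. It assumes that the z-score uses the mean and standard deviation of the whole trace and that events too close to the edges of the recording are skipped; the baseline actually used is specified in the authors' Methods and may differ. The function names `zscore` and `event_aligned_average` are hypothetical.

```python
from statistics import mean, pstdev


def zscore(trace):
    """Z-score a trace against its own mean and SD (an assumed baseline)."""
    mu = mean(trace)
    sd = pstdev(trace)
    if sd == 0:
        raise ValueError("cannot z-score a flat trace")
    return [(x - mu) / sd for x in trace]


def event_aligned_average(trace, event_indices, pre, post):
    """Average the trace across events, from `pre` samples before each
    event to `post` samples after it. Events whose window would run past
    either end of the trace are skipped."""
    windows = [
        trace[i - pre : i + post + 1]
        for i in event_indices
        if i - pre >= 0 and i + post < len(trace)
    ]
    if not windows:
        raise ValueError("no event has a complete window")
    return [mean(column) for column in zip(*windows)]


# Hypothetical trace with two infusion onsets at samples 2 and 5.
trace = [0, 0, 1, 0, 0, 3, 0]
print(event_aligned_average(trace, [2, 5], pre=1, post=1))
```

In practice the averaging would be applied to the z-scored trace of each animal, with `pre` and `post` chosen to cover the window from 10 seconds before infusion to 10 seconds after the start of the next trial.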
Given these observations, we focused our analyses on the infusion and light-blinking periods
when comparing DA dynamics between high and low drug-taking groups. Representative DA signals
from individual trials of a low and a high cocaine taker (Fig. 3D), along with group-averaged
signals (Fig. 3E), revealed differences between the groups. While DA responses varied from trial to trial,
differing in both amplitude and the timing of the peak z-score, low cocaine-taking mice
consistently showed larger DA increases than high takers (Fig. 3D-E). This result is reminiscent
of previous studies showing that rats exhibiting escalated cocaine intake across training displayed
reduced DA responses to contingent cocaine infusion (27, 28). Similarly, low fentanyl-taking mice
consistently exhibited larger DA increases than high fentanyl takers, despite large trial-to-trial
variations in DA signals (Fig. 3M-N). We separately quantified the DA responses at infusion
onset (0-1 second from infusion start, referred to as the "onset DA signal") and during the subsequent
cue period (1-19.5 seconds from infusion start, referred to as the "sustained DA signal"). Importantly,
linear regression analysis revealed that both the onset and sustained DA signals were significantly
negatively correlated with the number of cocaine (Fig. 3F and 3H) or fentanyl (Fig. 3O and 3Q)
self-administrations. Group comparisons further confirmed that high drug-taking mice showed
significantly smaller onset (Fig. 3G and 3P) and sustained DA signals (Fig. 3I and 3R). The
reduced DA signals in high drug takers are unlikely to be caused by elevated baseline DA levels
that might blunt additional drug-evoked responses. If this were the case, DA responses should
decrease over the course of a session as more drug is consumed. However, when we compared DA
responses across the three recording episodes (the 1st, 2nd, and 3rd 30-minute blocks), we found
no evidence of a progressive decline in evoked DA (fig. S6).
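The two summary measures defined above can be sketched as window averages over an infusion-aligned trace. This is a minimal illustration, assuming that each measure is the mean z-score within its window and that windows are half-open (start included, stop excluded); the function names are hypothetical and the authors' exact quantification is given in their Methods.

```python
def window_mean(times, zscores, start, stop):
    """Mean z-score over samples with start <= time < stop (seconds
    relative to infusion onset)."""
    values = [z for t, z in zip(times, zscores) if start <= t < stop]
    if not values:
        raise ValueError("no samples fall inside the window")
    return sum(values) / len(values)


def onset_and_sustained(times, zscores):
    """Return (onset, sustained) DA signals for one aligned trace:
    onset = 0-1 s after infusion start, sustained = 1-19.5 s."""
    onset = window_mean(times, zscores, 0.0, 1.0)
    sustained = window_mean(times, zscores, 1.0, 19.5)
    return onset, sustained


# Hypothetical, coarsely sampled trace; the sample at 20 s is outside both windows.
times = [0.0, 0.5, 1.0, 10.0, 19.0, 20.0]
zscores = [2.0, 4.0, 1.0, 1.0, 4.0, 100.0]
print(onset_and_sustained(times, zscores))
```

Each animal's pair of values could then be regressed against its number of infusions to test for the negative relationship reported in the text.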
Altogether, despite the distinct DA dynamics evoked by cocaine and fentanyl IVSA, both
drugs showed a consistent relationship between drug-taking behavior and DA responses: higher
drug intake was associated with weaker DA responses to contingent drug infusion and
drug-associated cues.
Dopamine signatures of punishment resistance
Compulsive drug taking despite adverse consequences is a hallmark of drug addiction, so
we examined DA responses in the NAc medial shell during punishment sessions. Following the
co-occurrence of drug infusion with a mild foot shock (0.2 mA, 0.5 s), average DA signals in both
the cocaine and fentanyl groups showed a marked increase during the post-shock/infusion period
and remained elevated throughout the light-blinking and inter-trial dark periods (Fig. 4A, 4I).
Overall, cocaine infusion plus shock produced variable onset DA responses (0-1 second
post-infusion), followed by a uniform sustained DA surge (1-19.5 seconds post-infusion) (Fig. 4A).
Closer examination of individual mice in the cocaine group revealed heterogeneity in DA
dynamics: in a subset of mice, the co-occurrence of cocaine infusion and shock elicited an initial
dip or pause in DA followed by a robust, sustained rebound increase in DA, whereas in others, the
infusion and shock triggered an immediate increase in DA that remained elevated (fig. S7, Fig. 4B).
Visual inspection of DA dynamics in high and low punishment-resistant mice revealed that both
groups had comparable onset DA responses, but the low-resistant mice exhibited significantly
larger sustained DA signals (Fig. 4C-D). Quantitative analyses confirmed that group-averaged
infusion/shock-evoked onset DA neither differed significantly between high- and low-resistant
mice nor correlated with the amount of punished cocaine intake (Fig. 4E-F). At the individual
level, 40% (6/15) of low-resistant mice displayed a suppression of onset DA,
compared with only 14.3% (1/7) of high-resistant mice, although this difference did not reach
significance (Fig. 4F, Fisher's exact test, p = 0.35). In
contrast, there was a strong and significant negative correlation between the post-shock sustained
DA levels and the number of punished cocaine infusions: high punishment-resistant mice had low sustained
DA levels (Fig. 4G-H). As a control, the co-occurrence of saline infusion and shock predominantly
elicited a dip in onset DA responses (83.3%, 5/6), followed by a rebound (fig. S3G).
In the fentanyl group, the co-occurrence of fentanyl infusion with shock also elicited
heterogeneous onset DA responses across mice (fig. S6, Fig. 4I, 4J). Visual inspection of the
group-averaged DA dynamics showed that high punishment-resistant mice exhibited a sharp
increase in onset DA, whereas the low-resistant group displayed a dip in the onset DA signal; however,
both groups showed comparable sustained DA (Fig. 4J-L). Quantitative analyses confirmed that
the fentanyl-infusion/shock-evoked onset DA response was significantly higher in high-resistant
mice and was significantly positively correlated with punished fentanyl infusions (Fig. 4M-N).
Moreover, 56% (9/16) of low-resistant mice showed a reduction in onset DA, while none (0/9) of
the high-resistant mice exhibited such a dip (Fig. 4N, Fisher's exact test, p < 0.01). In contrast, the
post-shock sustained DA signals neither differed significantly between high- and low-resistant
mice nor correlated with punished fentanyl intake (Fig. 4O-P). Among the nine
punishment-resistant mice, we also recorded DA responses during subsequent punishment
sessions in eight. At the group level, these mice showed a trend toward reduced sustained DA
signaling during subsequent punishment sessions, and five of eight exhibited a further increase in
onset DA responses (fig. S8).
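The two Fisher's exact tests reported above can be checked directly from the stated counts (cocaine: 6/15 low-resistant vs. 1/7 high-resistant mice with an onset DA dip; fentanyl: 9/16 vs. 0/9). The following is a minimal sketch that assumes a two-sided test defined as the total probability of all tables no more likely than the observed one; the function name is hypothetical, and the software the authors used is not stated in this section.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table. Integer
    weights are used so the comparison is exact.
    """
    row1 = a + b          # size of the first group
    row2 = c + d          # size of the second group
    col1 = a + c          # total number of animals showing the dip
    total = row1 + row2

    def weight(k):
        # Unnormalized probability that k of the first group show the dip.
        return comb(row1, k) * comb(row2, col1 - k)

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(total, col1)


# Rows: low-resistant, high-resistant; columns: dip, no dip.
p_cocaine = fisher_exact_two_sided(6, 9, 1, 6)
p_fentanyl = fisher_exact_two_sided(9, 7, 0, 9)
print(round(p_cocaine, 2), p_fentanyl < 0.01)
```

Under this definition the cocaine comparison gives p of about 0.35 and the fentanyl comparison gives p of about 0.008, consistent with the values reported in the text.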
Taken together, in cocaine self-administering mice, the sustained DA responses after shock
correlated negatively with resistance to punishment, whereas in fentanyl self-administering mice,
the onset DA responses to shock/drug infusion correlated positively with resistance to punishment.
A computational model captures DA dynamics in the IVSA paradigm across drugs and conditions
A long-standing theory of NAc DA activity posits that phasic DA signals encode the
temporal-difference reward prediction error (TD-RPE), i.e., the mismatch between the expected
values of temporally adjacent states (52–54). In the fentanyl group, punishment-resistant mice
displayed a sharp increase in DA at the onset of shock and fentanyl infusion (Fig. 4L), a pattern
reminiscent of the positive RPE signal observed in Pavlovian conditioning. However, it is unclear
whether the existing TD-RPE model of DA signals, which is based on short-timescale classical conditioning,
can capture the substantial variability in IVSA behaviors (fig. S1) and explain the highly
heterogeneous onset and sustained DA dynamics during both cocaine and fentanyl IVSA and
punishment sessions (Fig. 3 and Fig. 4). To address these questions, we modeled each animal as
an Actor-Critic TD-learning agent performing a self-paced drug self-administration task (Fig. 5-6).
One key aspect of our model is that it represents the distinct epochs within IVSA trials as internal
states, S(t). The agent learns the state value, V(S(t)), and action value, Q_{a,S}(t), to maximize the
expected total future reward using the TD-RPE signal, δ(t). Thus, despite the low predictivity of
individual lever presses for drug reward at FR4, the agent learns that pressing the lever is the best
policy (see Methods). We used the model-derived δ(t) to generate δ̃(t), the simulated DA signal, by
clipping negative values at -0.5 and adding a decay tail to δ(t). We then compared δ̃(t) with the
experimentally observed NAc DA dynamics and analyzed the correlations between δ̃(t) and drug
taking or punishment resistance.
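The mapping from TD errors to a simulated DA signal can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the textbook TD error (reward plus discounted next-state value minus current-state value) and assumes that the "decay tail" is a first-order exponential filter with time constant `tau` in timesteps. The function names, the default `gamma`, and the default `tau` are hypothetical; only the clipping floor of -0.5 comes from the text.

```python
import math


def td_error(reward, value_next, value_current, gamma=1.0):
    """Textbook TD error: reward + gamma * V(next state) - V(current state)."""
    return reward + gamma * value_next - value_current


def simulated_da(deltas, floor=-0.5, tau=2.0):
    """Turn a sequence of TD errors into a simulated DA trace.

    Each TD error is clipped from below at `floor`, then passed through an
    exponential filter so that every transient leaves a decaying tail.
    """
    carry = math.exp(-1.0 / tau)  # fraction of the previous output kept per timestep
    trace = []
    previous = 0.0
    for delta in deltas:
        previous = max(delta, floor) + carry * previous
        trace.append(previous)
    return trace


# A single positive TD error followed by silence leaves a decaying tail.
fast = simulated_da([1.0, 0.0, 0.0, 0.0, 0.0], tau=2.0)
slow = simulated_da([1.0, 0.0, 0.0, 0.0, 0.0], tau=10.0)  # fivefold longer tau
print(fast[-1] < slow[-1])
```

Increasing `tau` fivefold in this sketch lengthens the tail of every transient, which is the kind of change the text describes for mimicking slowed DA reuptake under cocaine.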
We first modeled the FR4 drug-taking sessions. Each trial was modeled as inducing three
discrete internal states, S0, S1, and S2, defined by their distinct sensory contexts (Fig. 5): S0
denotes the period from lever insertion and light ON to the moment of drug infusion, a phase that
typically occupies most of the trial and can last from seconds to minutes. S1 denotes the drug
infusion and associated light-blinking period, whereas S2 corresponds to the dark inter-trial
interval (ITI). For simplicity, the external sensory stimuli associated with S1 and S2 were each
modeled as lasting 20 timesteps (t). We further assumed that a drug reward r_d becomes perceptible
10 timesteps after drug infusion and persists to the end of the trial. A critical novel aspect of the
model is that it incorporates uncertainty in the agent's estimate of the current state S(t). For instance,
the agent may fail to notice the first few blinks of the light cue, leading to temporal variation in its
estimate of when the light-blink state begins, or may confuse states that share similar sensory
features (e.g., a blink-ON moment in S1 may resemble the light-on period in S0).
During self-administration, the learned value of state $S_1$, $V(S_1)$, is approximately constant
across individual agents since $S_1$ has a fixed reward contingency by task design. In contrast, state
$S_0$ acquires predictive value from drug infusions in preceding trials, with different agents learning
different $V(S_0)$. Critically, the strength of the $S_0 \to S_1$ contingency (red arrow in Fig. 5A) depends
on the learned action value $Q_{a,S_0}$, which determines both the probability and timescale of lever
presses (see Methods). High drug takers, which select the lever-press action more frequently and with
shorter inter-press intervals, learn that sufficient lever presses in $S_0$ reliably lead to $S_1$ and drug
reward. As a result, they acquire a high expected $V(S_0)$ that approaches $V(S_1)$, yielding a small
prediction error at the $S_0 \to S_1$ transition (because $\delta_{0\to1} \equiv V(S_1) - V(S_0)$). In contrast, low drug
takers are less certain that lever pressing drives the $S_0 \to S_1$ transition. Therefore, compared to
high takers, they select the lever-press action less frequently, with longer inter-press intervals, and
maintain $V(S_0) \ll V(S_1)$, resulting in a larger $\delta_{0\to1}$ upon entry into $S_1$. Altogether, $\delta_{0\to1}$ at the
onset of drug infusion (timestep 1 in $S_1$) is negatively correlated with the amount of drug taking
(Fig. 5B-C).
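The reasoning above can be illustrated with a minimal numerical sketch. It is not the model described in Methods: it assumes a hypothetical two-state chain in which an agent in $S_0$ reaches $S_1$ on each timestep with probability `p_transition` (a stand-in for how reliably lever pressing completes the FR4 requirement), together with an illustrative discount factor `gamma`, learning rate `alpha`, and fixed $V(S_1)$. All names and values are assumptions introduced for illustration.

```python
# Illustrative sketch only: a two-state chain, not the model described in Methods.
# All parameter names and values below are assumptions chosen for illustration.

def learned_v_s0(p_transition, v_s1=1.0, gamma=0.95, alpha=0.1, n_updates=2000):
    """Expected TD(0) estimate of V(S0) when S0 -> S1 occurs with probability p."""
    v_s0 = 0.0
    for _ in range(n_updates):
        # Expected bootstrapped target: reach S1 with probability p, else remain in S0.
        target = gamma * (p_transition * v_s1 + (1.0 - p_transition) * v_s0)
        v_s0 += alpha * (target - v_s0)
    return v_s0


def onset_rpe(v_s0, v_s1=1.0):
    """Prediction error on entering S1, using delta_{0->1} = V(S1) - V(S0)."""
    return v_s1 - v_s0


if __name__ == "__main__":
    for label, p in [("high taker (reliable pressing)", 0.8),
                     ("low taker (unreliable pressing)", 0.1)]:
        v0 = learned_v_s0(p)
        print(f"{label}: V(S0)={v0:.3f}, onset RPE={onset_rpe(v0):.3f}")
```

Under these assumptions, the agent with the more reliable $S_0 \to S_1$ transition learns a $V(S_0)$ closer to $V(S_1)$ and therefore shows the smaller onset error, which is the direction of the relationship described in the text.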
Importantly, $\delta_{0\to1}$ also contributes to $\delta(t)$ during the light-blinking period of $S_1$ due to
animals' uncertainty about state transitions. Attentional lapse (modeled as a jitter of approximately 6
timesteps) in detecting the drug-associated cue (light blinking off) generates multiple cue-locked
$\delta(t)$ peaks in the initial phase of $S_1$. In addition, confusion between $S_1$ and $S_0$ (light blinking on)
further produces cue-locked oscillations in $\delta(t)$ when averaged across trials (Fig. 5F, see Methods).
Moreover, the delayed perception of drug reward, $r_{\mathrm{drug}}(t)$, induces a slow rise of $\delta(t)$ starting
approximately 10 timesteps after infusion onset, which is superimposed on the oscillation of $\delta(t)$.
Taken together, the average sustained $\delta(t)$ or $\tilde{\delta}(t)$ over the post-infusion phase of $S_1$
(timesteps 2-20) is determined by $\delta_{0\to1}$ and $r_{\mathrm{drug}}(t)$, and therefore is also negatively correlated
with drug intake (Fig. 5G). Finally, we assumed that the effect of the drug reward extends into $S_2$.
Upon the $S_2 \to S_0$ transition, although light on and lever insertion predict the next reward cycle,
termination of the prolonged drug reward mainly produces a negative $\delta_{2\to0}$.
The simulated DA $\tilde{\delta}(t)$ traces of FR4 drug taking closely resemble the actual DA signals
observed in the fentanyl IVSA experiments and recapitulate the negative correlations of both
onset and sustained DA signals with fentanyl intake (Fig. 5F-G). When the decay time constant
of $\tilde{\delta}(t)$ is increased fivefold to mimic cocaine-induced inhibition of DA reuptake, the exponentially
filtered $\tilde{\delta}(t)$ traces similarly reproduce the DA dynamics observed during cocaine IVSA, as well
as the negative correlations between DA signals and cocaine intake (Fig. 5D-E). Together, these
results indicate that despite the complexity and temporally extended nature of the IVSA paradigm
and the distinct pharmacological classes of the drugs (psychostimulant vs. opioid), contingent DA
release in the NAc during these tasks can be uniformly explained as encoding TD-RPE.
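The effect of lengthening the decay time constant can be sketched with a simple leaky integrator. The exact filter is specified in Methods and is not reproduced here; the recursion, the function name `exp_filter`, and the baseline constant `TAU_FAST` are assumptions made for illustration, and only the fivefold ratio between the two time constants is taken from the text.

```python
import math


def exp_filter(delta, tau):
    """Leaky integrator: y[t] = exp(-1/tau) * y[t-1] + delta[t]."""
    decay = math.exp(-1.0 / tau)
    y, out = 0.0, []
    for d in delta:
        y = decay * y + d
        out.append(y)
    return out


# A single unit RPE pulse at timestep 0 followed by silence.
pulse = [1.0] + [0.0] * 19

TAU_FAST = 2.0             # hypothetical baseline decay constant (timesteps)
TAU_SLOW = 5.0 * TAU_FAST  # fivefold slower decay, as described for cocaine

fast = exp_filter(pulse, TAU_FAST)
slow = exp_filter(pulse, TAU_SLOW)

if __name__ == "__main__":
    for t in (0, 5, 10):
        print(f"t={t}: fast={fast[t]:.3f}, slow={slow[t]:.3f}")
```

With the slower decay, the same brief pulse leaves a longer-lasting trace, so successive RPE events summate into a sustained signal rather than discrete transients.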
Using the same Actor-Critic TD-learning framework, we next modeled the punishment
sessions. We assumed that footshock induces an additional internal state cascade $S^{\mathrm{shock}}(t)$ (Fig. 6),
which runs in parallel with the three task-related states $S(t)$ and comprises $S^{\mathrm{shock}}_0$, $S^{\mathrm{shock}}_1$,
and $S^{\mathrm{shock}}_2$. Here, $S^{\mathrm{shock}}_1$ denotes the immediate shock-alert state, and the
$S^{\mathrm{shock}}_1 \to S^{\mathrm{shock}}_2$ transition occurs stochastically as the agent settles
into a shock-relieved “safety” state $S^{\mathrm{shock}}_2$. At the onset of the next trial (light on and lever insertion),
the shock-related state returns to a baseline “danger” state $S^{\mathrm{shock}}_0$, which persists until drug infusion
and the next shock. Accordingly, the agent undergoes a transition from $S^{\mathrm{shock}}_0$ to $S^{\mathrm{shock}}_1$ at the drug
infusion/shock onset, followed by a stochastic transition from $S^{\mathrm{shock}}_1$ to $S^{\mathrm{shock}}_2$, corresponding to shock relief.
In the experiments, a footshock is delivered at the $S^{\mathrm{shock}}_0 \to S^{\mathrm{shock}}_1$ transition to induce a negative reward
$r_{\mathrm{shock}} < 0$. In addition, we initialized $V(S^{\mathrm{shock}}_1) < 0$ to reflect the innate aversive valuation of the
shock-alert state. Given the task states $S(t)$ and shock states $S^{\mathrm{shock}}(t)$, we assume additive state- and
action-value components:

\[ V\big(S(t), S^{\mathrm{shock}}(t)\big) = V\big(S(t)\big) + V\big(S^{\mathrm{shock}}(t)\big) \]
\[ Q_{a,(S,\,S^{\mathrm{shock}})} = Q_{a,S} + Q_{a,S^{\mathrm{shock}}} \]

We modeled fentanyl and cocaine as having different effects on a threshold in the
perception of shock stimuli as aversive. Accordingly, if the delivered shock level falls below this
threshold, it contributes to sensory salience, whereas if it falls above the threshold, it contributes
to negative valence. Thus, the threshold modulates the agent's learning rate (see Methods).
During cocaine IVSA with punishment, $V(S^{\mathrm{shock}}_1)$ is negative at the onset of drug infusion
(Fig. 6B-C). Spontaneous $S^{\mathrm{shock}}_1 \to S^{\mathrm{shock}}_2$ transitions during the post-shock period produce a
large delayed $\delta(t)$ surge due to $V(S^{\mathrm{shock}}_2) - V(S^{\mathrm{shock}}_1) > 0$. Together with the $\delta(t)$ derived
from the delayed drug reward, the total $\tilde{\delta}(t)$ (which accounts for the cocaine-induced slow decay)
exhibits a sustained increase during the post-shock light-blinking period. This pattern closely
recapitulates the large and sustained DA signals observed experimentally (Fig. 6D). Because the
shock-relief component of this sustained $\tilde{\delta}(t)$ reflects agents' shock sensitivity, it is negatively
correlated with punishment resistance: individuals with low resistance exhibit more sustained
$\tilde{\delta}(t)$, consistent with the experimental findings (Fig. 6D-E).
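A small sketch can make the shock-relief argument concrete. It assumes the additive state value stated above and treats the relief event as a transition of the shock-related state alone, with the task state unchanged and no immediate reward. The numerical values, the zero value assigned to the safety state, and the function names are hypothetical and chosen only for illustration.

```python
def composite_value(v_task, v_shock):
    """Additive state value: V(S, S_shock) = V(S) + V(S_shock)."""
    return v_task + v_shock


def relief_rpe(v_task, v_alert, v_safety=0.0):
    """RPE when the shock state moves from alert to safety; task state unchanged."""
    return composite_value(v_task, v_safety) - composite_value(v_task, v_alert)


V_TASK = 1.0  # hypothetical value of the ongoing task state

# Hypothetical alert-state values: more negative means more shock-sensitive.
sensitive = relief_rpe(V_TASK, v_alert=-0.6)
resistant = relief_rpe(V_TASK, v_alert=-0.1)

if __name__ == "__main__":
    print(f"shock-sensitive agent: relief RPE = {sensitive:.2f}")
    print(f"shock-resistant agent: relief RPE = {resistant:.2f}")
```

Because the task-state term cancels, the relief error in this sketch depends only on how negative the shock-alert state is, so the more shock-sensitive agent shows the larger positive rebound.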
During fentanyl IVSA with punishment, although $V(S^{\mathrm{shock}}_1)$ is negative in the early trials, in
punishment-resistant agents we model fentanyl as progressively suppressing shock sensitivity
(i.e., increasing the shock-perception threshold). As a result, the initially aversive unconditional
stimulus (US, footshock) gradually becomes a salient conditioned stimulus (CS+) predictive of
fentanyl reward in these individuals. Thus, the model generates a strong positive $\delta(t)$ or $\tilde{\delta}(t)$
transient at the $S^{\mathrm{shock}}_0 \to S^{\mathrm{shock}}_1$ transition in punishment-resistant fentanyl-taking agents but not
in cocaine-taking agents (Fig. 6J, 6F), matching the different DA signals from the two drugs in
experiments (Fig. 4D, L). This reversal of US to CS+ does not occur in punishment-sensitive
individuals (Fig. 6J). These $\tilde{\delta}(t)$ dynamics recapitulate the positive correlation between onset DA
responses and punishment-resistant fentanyl taking in experiments (Fig. 6H-I). Moreover, we
plotted the drug infusion numbers for high- vs. low-taking agents and high- vs. low-punishment-resistance
agents from baseline FR4 sessions to the three consecutive punishment sessions (Fig. 6G and 6K),
and found that agents in the model also mimic the drug-taking patterns observed in mice
(Fig. 2B, 2E, 2I, 2L).
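The threshold mechanism behind this US-to-CS+ reversal can be sketched as follows. This is only an illustration of the rule stated in the text (a shock below the threshold contributes to salience, a shock above it contributes to negative valence). The function name `split_shock`, the mapping of shock level onto salience and reward magnitudes, the handling of a shock exactly at threshold, and all numbers are assumptions; how the threshold enters the learning rate is described in Methods and is not modeled here.

```python
def split_shock(shock_level, threshold):
    """Route a shock of a given intensity into salience or negative valence."""
    if shock_level < threshold:
        # Sub-threshold shock: treated as salient but not aversive.
        return {"salience": shock_level, "reward": 0.0}
    # Shock at or above threshold: treated as aversive (negative reward).
    return {"salience": 0.0, "reward": -shock_level}


SHOCK = 0.3  # hypothetical fixed shock intensity (arbitrary units)

# Hypothetical threshold values rising across sessions in a resistant agent.
thresholds = [0.1, 0.2, 0.4, 0.5]
outcomes = [split_shock(SHOCK, th) for th in thresholds]

if __name__ == "__main__":
    for th, out in zip(thresholds, outcomes):
        print(f"threshold={th}: {out}")
```

In this sketch the same physical shock stops contributing negative reward once the threshold rises above its intensity and instead acts only as a salient event, which is the condition under which it could come to predict the drug reward that follows.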
Our model also predicted a weaker but significant positive correlation of onset $\tilde{\delta}(t)$ with
punished cocaine intake, as well as a negative but significant correlation of sustained $\tilde{\delta}(t)$ with
punished fentanyl intake, neither of which was clearly observed experimentally. These
discrepancies may reflect complexity in how different drugs modulate the computation of value
and RPE that is not considered by the model. Nevertheless, overall, the computational model
captures the diversity of drug-taking and punishment-responsiveness behaviors, along with the
associated DA dynamics across trial epochs, individuals, and drug classes. Importantly, it reveals
a simple, unified role of NAc DA signals in encoding TD-RPE across different phases of addiction.
Discussion
Here, we used a genetically encoded DA sensor to characterize the dynamic patterns of DA
release in the NAc medial shell in a mouse model of drug addiction, comparing DA responses to
cocaine and fentanyl IVSA in high versus low drug takers, as well as in animals exhibiting high
versus low punishment resistance (i.e., a model of compulsive drug use despite adverse
consequences). We found that after extended training under an FR4 schedule, contingent cocaine
infusions evoked a sustained increase in DA during the drug-associated cue period (i.e., blinking
lights), whereas contingent fentanyl infusions elicited a large increase in DA at infusion onset that
evolved into oscillations synchronized with the blinking lights. Similar oscillatory DA patterns
could also be observed during cocaine IVSA after responses were averaged across trials and mice,
although with smaller amplitudes. Critically, across both cocaine and fentanyl, individuals' drug
intake was consistently negatively correlated with DA responses: higher levels of drug intake were
associated with lower evoked DA signals. With regard to punished drug taking, we observed
distinct DA signatures associated with compulsivity in cocaine versus fentanyl taking. In cocaine
IVSA mice, higher punishment resistance was associated with weaker sustained DA responses
during the drug-associated cue period, whereas in fentanyl IVSA mice, higher punishment
resistance was associated with stronger onset DA responses at the co-occurrence of punishment
and drug infusion. To account for these diverse DA dynamics across baseline and punishment
sessions, across individuals, and across drug classes, we developed an Actor-Critic TD-learning-based
computational framework that incorporates internal states, the agent's uncertainty, and
drug-specific effects. This model captures the observed behavioral diversity and supports a unified
interpretation of NAc DA responses as encoding temporal-difference reward prediction errors
(TD-RPE) based on internal state estimation.
Many previous computational models of addiction (e.g., opponent theory, incentive salience
sensitization theory, habit formation) focused on reproducing addiction behaviors instead of
accounting for the DA dynamics as a learning outcome. Here, our actor-critic model considers both
behavior and DA dynamics. However, unlike traditional TD-RPE models of classical
conditioning, our framework operates on a self-administration task-relevant timescale and
explicitly incorporates both state and action values. We modeled how agents evaluate their states
after learning the self-administration task at FR4. Our results reveal that the difficulty of the FR4
task schedule introduces significant sources of uncertainty that naturally give rise to complex
dynamics in the RPE signal, which resemble the observed DA activity. A previous TD-RPE model
proposed by Redish and colleagues (33, 34) relied on the assumption of an un-cancellable RPE
elicited by drugs upon infusion, which caused unbounded growth of the drug value and
consequently would predict a positive correlation between DA signals and drug taking, which is
inconsistent with experimental observations. In contrast, our model naturally recapitulates the
negative correlation between DA signals and contingent drug intake. In our model, individual
differences in DA response reflect differences in the learned contingency between task states and
drug reward.
Could the finding that NAc DA universally encodes TD-RPE help reconcile the divergent
views of DA function in addiction? Two influential and seemingly opposing frameworks have
been proposed: the DA depletion hypothesis (55) and the incentive sensitization theory (IST) (39).
Preclinical studies have shown that escalation of cocaine self-administration in rodents is
accompanied by reduced phasic DA signaling (27, 28, 43), and human imaging studies have also
reported decreased striatal DA responses in individuals with cocaine use disorders (29). However,
far less was known about DA signaling in opioid self-administration models. Our findings that
excessive intake of both cocaine and fentanyl is associated with blunted DA responses to
contingent drug infusion, together with prior studies, appear to be consistent with the
hypo-dopamine hypothesis. By contrast, IST, which is also supported by substantial experimental
evidence, posits that drug exposure produces a hyper-responsive DA system, leading to
exaggerated phasic DA responses to drug-associated cues and contexts that drive heightened
“wanting” to take drugs.
We believe these two theories are not necessarily contradictory and can be unified by the
framework that phasic DA encodes TD-RPE, i.e., the difference between the expected values of
temporally adjacent states. As TD-learning agents, animals and humans continuously assign
expected values to their current states, with these values shaped by experience and learning history.
In the IVSA paradigm, high drug takers learn that sufficient lever presses reliably result in drug
infusion, motivating more frequent lever presses with shorter inter-press intervals. Consequently,
they acquire a high expected value for the lever-pressing state and generate relatively small
DA/TD-RPE signals upon receipt of the actual drug infusion and its associated cues. In contrast,
low drug takers are less certain that lever pressing will result in drug infusion and are therefore
less motivated to press the lever, leading to longer inter-press intervals. Consequently, the actual
drug infusion is more unexpected, resulting in larger DA/TD-RPE signals. Thus, under contingent
conditions, akin to knowingly taking the drug and fully expecting its effect, more excessive drug
taking is associated with lower DA responses. However, in humans with SUD, drug-associated
cues can be encountered outside the drug-taking context and robustly cause craving. In such
situations, drug cues may elicit a higher expected value, based on remembered drug reward, than the
perceived value of an individual's current state, thereby resulting in large DA/TD-RPE signals,
consistent with IST. Although we did not directly assess DA responses to unexpected
drug-associated cues outside the IVSA context, a recent study showed that DA release is indeed
enhanced in response to non-contingent or unexpected cocaine-paired cues, but diminished when
the same cues were encountered in a contingent, predictable context (28).
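The two situations contrasted above can be written with the generic one-step TD error. This is the textbook form and not necessarily the exact parameterization used in the model; the immediate reward $r(t)$ and discount factor $\gamma$ are standard symbols introduced here for illustration. With $r(t) = 0$ and $\gamma = 1$, it reduces to the difference between the values of temporally adjacent states.

```latex
\begin{align}
\delta(t) &= r(t) + \gamma\, V\big(S(t+1)\big) - V\big(S(t)\big) \\
\text{contingent infusion:}\quad & V\big(S(t)\big) \approx V\big(S(t+1)\big)
  \;\Rightarrow\; \delta(t) \text{ small} \\
\text{unexpected cue:}\quad & V\big(S(t)\big) \ll V\big(S(t+1)\big)
  \;\Rightarrow\; \delta(t) \text{ large}
\end{align}
```

The second and third lines assume $r(t) \approx 0$ and $\gamma \approx 1$ at the moment of the transition. They express the same quantity evaluated from different starting values, which is why blunted responses during contingent use and exaggerated responses to unexpected cues need not conflict.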
Interpreting DA as a TD-RPE signal also provides new insights into compulsive drug taking despite
punishment. When drug use is associated with adverse consequences, individuals must decide
whether to abstain from further drug taking or to endure punishment to continue pursuing the drug.
Drug-induced reductions in sensitivity to punishment, deficits in inhibitory control, or impairments in
punishment learning may bias this decision toward compulsive drug taking despite negative
consequences, one of the most intractable features of addiction. Using footshock as a punishment,
we showed that in saline control mice, DA signals in the NAc medial shell were predominantly
suppressed upon shock delivery (fig. S4), consistent with a strong negative valence of shock for
drug-naive animals and with DA encoding a negative TD-RPE. Similarly, many
punishment-sensitive mice also showed a suppression of the DA signal at the shock-infusion onset during both
cocaine and fentanyl IVSA (Fig. 4E-F, M-N). In contrast, this shock-elicited DA dip was absent
in all but one punishment-resistant animal, suggesting an impairment in encoding negative RPE.
A prior study also found that cocaine exposure disrupted the pause in firing of DA neurons in
response to reward omission (56). We also observed a large-amplitude rebound in DA following
shock, which we interpreted as a positive RPE reflecting “shock relief”. The rebound signal was
present in punished saline- and cocaine-IVSA mice, as well as in punishment-sensitive
fentanyl-taking mice. It is plausible that stronger relief signals may provide more robust negative feedback,
reinforcing avoidance of the punished action. Consistent with this interpretation, the levels of
sustained DA over the entire cue-light period (including shock-relief and drug-reward signals)
were significantly negatively correlated with punished cocaine infusions, although no correlation
was observed for punished fentanyl intake. Critically, in punishment-resistant fentanyl-taking
mice, we observed a pronounced increase in DA during shock delivery, indicative of a positive
TD-RPE. In other words, compulsive fentanyl-taking mice may transform the aversive footshock
into a highly salient sensory cue that strongly predicts drug reward. Altogether, these findings
suggest that chronic drug use may disrupt normal negative TD-RPE signaling of DA neurons in
NAc, thereby promoting compulsive drug taking despite negative consequences.
Although our findings provide strong support for TD-RPE as a unifying framework for
interpreting NAc DA signaling underlying individual differences in excessive and compulsive
drug taking, many important questions remain. We do not yet know how state value is computed,
which neural mechanisms translate these valuations into drug-seeking behaviors (e.g.,
probability of lever pressing), or how the observed DA dynamics in turn further modulate neural
plasticity and maladaptive behaviors. Moreover, it remains unclear how different classes of drugs
differentially modulate punishment sensitivity, negative RPE signaling, and the “shock-relief”
rebound of DA signals. Addressing these open questions will require targeted future experiments.
Limitations
Several limitations of the present study should be acknowledged. First, fiber photometry
measures relative changes in DA release rather than absolute DA concentrations, and therefore
cannot directly quantify baseline dopaminergic tone. Second, the sample size limits our ability to
robustly assess sex differences in drug taking and compulsive behavior, an important factor that
warrants dedicated investigation in future studies. Third, DA signaling within the nucleus
accumbens is highly heterogeneous across subregions (57). While the present work focused on the
dorsomedial shell, future studies employing more spatially resolved approaches will be necessary
to systematically examine DA dynamics across distinct accumbens subregions and their
contributions to addiction-related behaviors. While our computational model incorporates multiple
states and agent uncertainty, we did not consider the different circuit-level plasticity and maladaptive
changes induced by chronic self-administration of different drugs. Furthermore, the drug-specific
effect in our model is highly simplified and does not account for the myriad physiological and
psychological differences in the effects of cocaine and fentanyl.
References
1. J. C. Anthony, L. A. Warner, R. C. Kessler, Comparative Epidemiology of Dependence on Tobacco, Alcohol, Controlled Substances, and Inhalants: Basic Findings From the National Comorbidity Survey. Exp. Clin. Psychopharmacol. 2, 244–268 (1994).
2. F. A. Wagner, J. C. Anthony, From First Drug Use to Drug Dependence: Developmental Periods of Risk for Dependence upon Marijuana, Cocaine, and Alcohol. Neuropsychopharmacology 26, 479–488 (2002).
3. M. J. Kreek, D. A. Nielsen, E. R. Butelman, K. S. LaForge, Genetic influences on impulsivity, risk taking, stress responsivity and vulnerability to drug abuse and addiction. Nat. Neurosci. 8, 1450–1457 (2005).
4. M. Venniro, M. L. Banks, M. Heilig, D. H. Epstein, Y. Shaham, Improving translation of animal models of addiction and relapse by reverse translation. Nat. Rev. Neurosci. 21, 625–643 (2020).
5. B. J. Everitt, T. W. Robbins, Drug Addiction: Updating Actions to Habits to Compulsions Ten Years On. Annu. Rev. Psychol. 67, 1–28 (2015).
6. O. George, G. F. Koob, Individual differences in the neuropsychopathology of addiction. Dialogues Clin. Neurosci. 19, 217–229 (2017).
7. B. T. Saunders, T. E. Robinson, Individual Variation in the Motivational Properties of Cocaine. Neuropsychopharmacology 36, 1668–1676 (2011).
8. B. T. Saunders, T. E. Robinson, A Cocaine Cue Acts as an Incentive Stimulus in Some but not Others: Implications for Addiction. Biol. Psychiatry 67, 730–736 (2010).
9. R. Bock, J. H. Shin, A. R. Kaplan, A. Dobi, E. Markey, P. F. Kramer, C. M. Gremel, C. H. Christensen, M. F. Adrover, V. A. Alvarez, Strengthening the accumbal indirect pathway promotes resilience to compulsive cocaine use. Nat. Neurosci. 16, 632–638 (2013).
10. D. Belin, A. C. Mar, J. W. Dalley, T. W. Robbins, B. J. Everitt, High Impulsivity Predicts the Switch to Compulsive Cocaine-Taking. Science 320, 1352–1355 (2008).
11. L. J. M. J. Vanderschuren, B. J. Everitt, Drug Seeking Becomes Compulsive After Prolonged Cocaine Self-Administration. Science 305, 1017–1019 (2004).
12. E. Domi, L. Xu, S. Toivainen, A. Nordeman, F. Gobbo, M. Venniro, Y. Shaham, R. O. Messing, E. Visser, M. C. van den Oever, L. Holm, E. Barbier, E. Augier, M. Heilig, A neural substrate of compulsive alcohol use. Sci. Adv. 7, eabg9045 (2021).
13. V. Deroche-Gamonet, D. Belin, P. V. Piazza, Evidence for Addiction-like Behavior in the Rat. Science 305, 1014–1017 (2004).
14. Y. Li, L. D. Simmler, R. V. Zessen, J. Flakowski, J.-X. Wan, F. Deng, Y.-L. Li, K. M. Nautiyal, V. Pascoli, C. Lüscher, Synaptic mechanism underlying serotonin modulation of transition to cocaine addiction. Science 373, 1252–1256 (2021).
15. G. de Guglielmo, L. Carrette, M. Kallupi, M. Brennan, B. Boomhower, L. Maturin, D. Conlisk, S. Sedighim, L. Tieu, M. J. Fannon, A. R. Martinez, N. Velarde, D. Othman, B. Sichel, J. Ramborger, J. Lau, J. Kononoff, A. Kimbrough, S. Simpson, L. C. Smith, K. Shankar, S. Bonnet-Zahedi, E. A. Sneddon, A. Avelar, S. L. Plasil, J. Mosquera, C. Crook, L. Chun, A. Vang, K. K. Milan, P. Schweitzer, B. Lin, B. Peng, A. S. Chitre, O. Polesskaya, L. C. S. Woods, A. A. Palmer, O. George, Large-scale characterization of cocaine addiction-like behaviors reveals that escalation of intake, aversion-resistant responding, and breaking-points are highly correlated measures of the same construct. eLife 12, RP90422 (2024).
16. N. D. Volkow, J. S. Fowler, G. J. Wang, R. Baler, F. Telang, Imaging dopamine's role in drug abuse and addiction. Neuropharmacology 56, 3–8 (2009).
17. N. D. Volkow, M. Michaelides, R. Baler, The Neuroscience of Drug Reward and Addiction. Physiol. Rev. 99, 2115–2140 (2019).
18. G. D. Chiara, A. Imperato, Drugs abused by humans preferentially increase synaptic dopamine concentrations in the mesolimbic system of freely moving rats. Proc. Natl. Acad. Sci. 85, 5274–5278 (1988).
19. J. Corre, R. van Zessen, M. Loureiro, T. Patriarchi, L. Tian, V. Pascoli, C. Lüscher, Dopamine neurons projecting to medial shell of the nucleus accumbens drive heroin reinforcement. eLife 7, e39945 (2018).
20. C. Lüscher, R. C. Malenka, Drug-Evoked Synaptic Plasticity in Addiction: From Molecular Changes to Circuit Remodeling. Neuron 69, 650–663 (2011).
21. C. Lüscher, Drug-Evoked Synaptic Plasticity Causing Addictive Behavior. J. Neurosci. 33, 17641–17646 (2013).
22. N. D. Volkow, M. Morales, The Brain on Drugs: From Reward to Addiction. Cell 162, 712–725 (2015).
23. P. E. M. Phillips, G. D. Stuber, M. L. A. V. Heien, R. M. Wightman, R. M. Carelli, Subsecond dopamine release promotes cocaine seeking. Nature 422, 614–618 (2003).
24. C. L. Poisson, L. Engel, B. T. Saunders, Dopamine Circuit Mechanisms of Addiction-Like Behaviors. Front. Neural Circuits 15, 752420 (2021).
25. G. D. Stuber, M. F. Roitman, P. E. M. Phillips, R. M. Carelli, R. M. Wightman, Rapid Dopamine Signaling in the Nucleus Accumbens during Contingent and Noncontingent Cocaine Administration. Neuropsychopharmacology 30, 853–863 (2005).
26. K. F. Casey, M. V. Cherkasova, K. Larcher, A. C. Evans, G. B. Baker, A. Dagher, C. Benkelfat, M. Leyton, Individual Differences in Frontal Cortical Thickness Correlate with the d-Amphetamine-Induced Striatal Dopamine Response in Humans. J. Neurosci. 33, 15285–15294 (2013).
27. I. Willuhn, L. M. Burgeno, P. A. Groblewski, P. E. M. Phillips, Excessive cocaine use results from decreased phasic dopamine signaling in the striatum. Nat. Neurosci. 17, 704–709 (2014).
28. L. M. Burgeno, R. D. Farero, N. L. Murray, M. C. Panayi, J. S. Steger, M. E. Soden, S. B. Evans, S. G. Sandberg, I. Willuhn, L. S. Zweifel, P. E. M. Phillips, Cocaine seeking and consumption are oppositely regulated by mesolimbic dopamine in male rats. Nat. Commun. 16, 9954 (2025).
29. N. D. Volkow, G.-J. Wang, J. S. Fowler, J. Logan, S. J. Gatley, R. Hitzemann, A. D. Chen, S. L. Dewey, N. Pappas, Decreased striatal dopaminergic responsiveness in detoxified cocaine-dependent subjects. Nature 386, 830–833 (1997).
30. M. Á. Luján, B. L. Oliver, R. Young-Morrison, S. A. Engi, L.-Y. Zhang, J. M. Wenzel, Y. Li, N. E. Zlebnik, J. F. Cheer, A multivariate regressor of patterned dopamine release predicts relapse to cocaine. Cell Rep. 42, 112553 (2023).
31. M. Leyton, What's deficient in reward deficiency? J. Psychiatry Neurosci. 39, 291–293 (2014).
32. W. Schultz, P. Dayan, P. R. Montague, A Neural Substrate of Prediction and Reward. Science 275, 1593–1599 (1997).
33. A. D. Redish, Addiction as a Computational Process Gone Awry. Science 306, 1944–1947 (2004).
34. R. Keiflin, P. H. Janak, Dopamine Prediction Errors in Reward Learning and Addiction: From Theory to Neural Circuitry. Neuron 88, 247–263 (2015).
35. M. Watabe-Uchida, N. Eshel, N. Uchida, Neural Circuitry of Reward Prediction Error. Annu. Rev. Neurosci. 40, 1–22 (2016).
36. G. F. Koob, M. L. Moal, Drug Addiction, Dysregulation of Reward, and Allostasis. Neuropsychopharmacology 24, 97–129 (2001).
37. G. F. Koob, M. L. Moal, Neurobiological mechanisms for opponent motivational processes in addiction. Philos. Trans. R. Soc. B: Biol. Sci. 363, 3113–3123 (2008).
38. T. E. Robinson, K. C. Berridge, The neural basis of drug craving: An incentive-sensitization theory of addiction. Brain Res. Rev. 18, 247–291 (1993).
39. T. E. Robinson, K. C. Berridge, The Incentive-Sensitization Theory of Addiction 30 Years On. Annu. Rev. Psychol. 76, 29–58 (2025).
40. K. C. Berridge, T. E. Robinson, Liking, Wanting, and the Incentive-Sensitization Theory of Addiction. Am. Psychol. 71, 670–679 (2016).
41. B. J. Aragona, N. A. Cleaveland, G. D. Stuber, J. J. Day, R. M. Carelli, R. M. Wightman, Preferential enhancement of dopamine transmission within the nucleus accumbens shell by cocaine is attributable to a direct increase in phasic dopamine release events. J. Neurosci. 28, 8821–31 (2008).
42. C. A. Owesson-White, J. Ariansen, G. D. Stuber, N. A. Cleaveland, J. F. Cheer, R. M. Wightman, R. M. Carelli, Neural encoding of cocaine-seeking behavior is coincident with phasic dopamine release in the accumbens core and shell. Eur. J. Neurosci. 30, 1117–1127 (2009).
43. I. Willuhn, L. M. Burgeno, B. J. Everitt, P. E. M. Phillips, Hierarchical recruitment of phasic dopamine signaling in the striatum during the progression of cocaine use. Proc. Natl. Acad. Sci. 109, 20703–20708 (2012).
44. M. Garnett, A. Miniño, M. Joyce, A. Driscoll, C. Valenzuela, Drug Overdose Deaths in the United States, 2003–2023. NCHS data brief, 1 (2024).
45. F. E. Pontieri, G. Tanda, G. D. Chiara, Intravenous cocaine, morphine, and amphetamine preferentially increase extracellular dopamine in the “shell” as compared with the “core” of the rat nucleus accumbens. Proc. Natl. Acad. Sci. 92, 12304–12308 (1995).
46. R. Ito, J. W. Dalley, S. R. Howes, T. W. Robbins, B. J. Everitt, Dissociation in Conditioned Dopamine Release in the Nucleus Accumbens Core and Shell in Response to Cocaine Cues and during Cocaine-Seeking Behavior in Rats. J. Neurosci. 20, 7489–7495 (2000).
47. S. H. Ahmed, G. F. Koob, Transition from Moderate to Excessive Drug Intake: Change in Hedonic Set Point. Science 282, 298–300 (1998).
48. C. L. Wade, L. F. Vendruscolo, J. E. Schlosburg, D. O. Hernandez, G. F. Koob, Compulsive-Like Responding for Opioid Analgesics in Rats with Extended Access. Neuropsychopharmacology 40, 421–428 (2015).
49. S. H. Ahmed, J. R. Walker, G. F. Koob, Persistent Increase in the Motivation to Take Heroin in Rats with a History of Drug Escalation. Neuropsychopharmacology 22, 413–421 (2000).
50. F. Sun, J. Zeng, M. Jing, J. Zhou, J. Feng, S. F. Owen, Y. Luo, F. Li, H. Wang, T. Yamaguchi, Z. Yong, Y. Gao, W. Peng, L. Wang, S. Zhang, J. Du, D. Lin, M. Xu, A. C. Kreitzer, G. Cui, Y. Li, A Genetically Encoded Fluorescent Sensor Enables Rapid and Specific Detection of Dopamine in Flies, Fish, and Mice. Cell 174, 481-496.e19 (2018).
51. F. Sun, J. Zhou, B. Dai, T. Qian, J. Zeng, X. Li, Y. Zhuo, Y. Zhang, Y. Wang, C. Qian, K. Tan, J. Feng, H. Dong, D. Lin, G. Cui, Y. Li, Next-generation GRAB sensors for monitoring dopaminergic activity in vivo. Nat. Methods 17, 1156–1166 (2020).
52. N. Eshel, J. Tian, M. Bukwich, N. Uchida, Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).
53. R. Amo, S. Matias, A. Yamanaka, K. F. Tanaka, N. Uchida, M. Watabe-Uchida, A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning. Nat. Neurosci. 25, 1082–1092 (2022).
54. L. Qian, M. Burrell, J. A. Hennig, S. Matias, V. N. Murthy, S. J. Gershman, N. Uchida, Prospective contingency explains behavior and dopamine signals during associative learning. Nat. Neurosci., 1–13 (2025).
55. C. A. Dackis, M. S. Gold, New concepts in cocaine addiction: The dopamine depletion hypothesis. Neurosci. Biobehav. Rev. 9, 469–477 (1985).
56. Y. K. Takahashi, T. A. Stalnaker, Y. Marrero-Garcia, R. M. Rada, G. Schoenbaum,
Expectancy-Related Changes in Dopaminergic Error Signals Are Impaired by Cocaine
Self-Administration. Neuron 101, 294-306.e3 (2019).
57. J. W. de Jong, S. A. Afjei, I. P. Dorocic, J. R. Peck, C. Liu, C. K. Kim, L. Tian, K.
Deisseroth, S. Lammel, A Neural Circuit Mechanism for Encoding Aversive Stimuli in the
Mesolimbic Dopamine System. Neuron 101, 133-151.e7 (2019).

Acknowledgments: We thank members of the Wang Lab for insightful discussions of this study.
We are grateful to Priyadarshini Dutta for assistance with mouse colony maintenance. This work was
supported by the Boston Children's Hospital Viral Core, which is supported by NIH5P30EY012196.
Funding:
Addiction Initiative at McGovern Institute for Brain Research (FW)
The Paul E. and Lilah Newton Brain Science Award (FW)
K. Lisa Yang Integrative Computational Neuroscience (ICoN) Center fellowship (HZ)
Author contributions: FW and KC conceived the study and designed the experiments. KC,
GS, WX, CW and AS performed all experiments. KC analyzed the data. HZ and IF
conceptualized and developed the computational model. KC, HZ, and FW wrote the manuscript
with input from IF.
Competing interests: The authors declare that they have no competing interests.
Data, code, and materials availability: All data and code used in the analysis are available
from the corresponding authors upon request.
Supplementary Materials
Materials and methods
Figs. S1 to S8
Fig. 1. Cocaine and fentanyl intravenous self-administration (IVSA) behaviors.
(A) Schematic of the IVSA setup and experimental timeline for IVSA training and testing. (B) and
(C) Lever presses (left) and cocaine infusions (right) during the training (B) and testing (C) phases
of cocaine IVSA (n = 24 mice). (D) Average active lever presses (left) and cocaine infusions (right)
during baseline and punishment sessions of the cocaine IVSA testing phase (paired t-test). (E-G)
Same analyses as in (B-D), but for fentanyl (n = 27 mice). Each gray line in (D) and (G) represents
an individual mouse. Red and black lines show group means. Error bars indicate mean ± standard
error of the mean (SEM). ** represents p < 0.01; *** represents p < 0.001.
Fig. 2. Individual differences in drug taking and punishment resistance. (A) Normalized
distributions of cocaine infusions (i.e., normalized to the mean) during baseline and punishment.
Mice were classified as high (cocaine: n = 8) or low (cocaine: n = 8) drug taking based on whether
their baseline infusions were above (orange) or below (light blue) the mean by more than 10%.
(B) Cocaine infusions during the testing sessions, grouped by high vs. low drug taking. (C) Measures
of active lever presses per infusion (left), latency from lever insertion to active lever press
(middle), and inter-press interval (right) during the baseline cocaine IVSA (Welch's t-test).
(D) Scatter plots similar to those in panel (A). Mice were classified as high (cocaine: n = 7) or low
(cocaine: n = 15) punishment-resistant based on whether their punished cocaine infusions were above
(red) or below (dark blue) the mean by more than 10%. (E) Cocaine infusions during the testing
sessions, grouped by high vs. low punishment resistance. (F) Cocaine infusions during punishment
(left) and baseline (right) sessions, grouped by drug-taking (left) or punishment-resistance (right)
categories (Welch's t-test). (G) Overlap between high/low drug-taking and high/low
punishment-resistant groups for cocaine IVSA. (H-N) Same analyses as in (A-G), but for fentanyl.
Error bars indicate mean ± standard error of the mean (SEM). ** represents p < 0.01; *** represents
p < 0.001; n.s. represents not significant.
Fig. 3. Dopamine dynamics during cocaine and fentanyl self-administration after extended
training. (A) Images showing GRAB_DA2m (green) expression and fiber tracks (red dashed line)
in the NAc medial shell from a cocaine-IVSA example. (B) Top: example raw photometry traces
(465 nm [green] and 405 nm [gray]) with behavioral events overlaid during cocaine IVSA. Bottom:
z-score of DA signals after preprocessing the raw data; gray shading indicates cue light on/off. (C)
Group DA responses to contingent cocaine infusions (n = 24). Each row of the colormap represents
one mouse, sorted by the number of infusions (top = most). The red dashed lines at time 0 represent
the start of drug infusion. The gray dashed lines represent lever retraction and insertion. (D) Example
DA responses to contingent cocaine infusions from a low (left) and a high (right) drug-taking mouse.
(E) Time courses of DA responses to contingent cocaine infusions for high (orange) and low (light
blue) drug-taking mice. (F) Linear regression showing the relationship between onset DA
responses and number of baseline infusions of cocaine. (G) Bar graphs quantifying onset DA
responses to cocaine infusions in high vs. low drug-taking mice (Welch's t-test). (H) Linear
regression showing the relationship between sustained DA responses and number of baseline
infusions of cocaine. (I) Bar graphs quantifying sustained DA responses to cocaine infusions in
high vs. low drug-taking mice (Welch's t-test). Orange dots represent high drug-taking mice, light
blue dots represent low drug-taking mice, and gray dots represent mice not classified. (J-R) Same
analyses as in (A-I), but for fentanyl (n = 27). Error bars indicate mean ± standard error of the mean
(SEM). * represents p < 0.05; ** represents p < 0.01; *** represents p < 0.001.
Fig. 4. Dopamine dynamics during punished drug taking. (A) Group DA responses to punished
cocaine infusions (n = 24 mice). Each row of the colormap represents one mouse, sorted by the
number of punished infusions (top = most). The red dashed lines at time 0 represent the co-occurrence
of drug IVSA and punishment. The gray dashed lines represent lever retraction and insertion. (B)
Example DA responses to punished cocaine infusions from a low (left) and a high (right)
punishment-resistant mouse. (C) Time courses of DA responses to punished cocaine infusions for
high-resistant (red) and low-resistant (blue) mice. (D) Same as (C), but zoomed in to highlight the
onset response (0-1 s post-infusion). (E) Linear regression showing the relationship between onset
DA responses and number of punished infusions of cocaine. (F) Bar graphs quantifying onset DA
responses to punished cocaine infusions in high- vs. low-resistant mice. (G) Linear regression
showing the relationship between sustained DA responses and the number of punished infusions of
cocaine. (H) Bar graphs quantifying sustained DA responses to punished cocaine infusions in high-
vs. low-resistant mice. Red dots represent punishment-resistant mice, blue dots represent
punishment-sensitive mice, and gray dots represent mice not classified. (I-P) Same analyses as in
(A-H), but for fentanyl. Error bars indicate mean ± standard error of the mean (SEM). *** represents
p < 0.001; n.s. represents not significant.
Fig. 5. Actor-Critic TD learning model can explain DA dynamics and the negative correlation
of DA signals with drug intake during drug IVSA tasks. (A) Schematic of the model,
highlighting the three discrete internal states (S0, S1, S2) and their transitions. (B) Simulation
results showing the internal state, state value, change in value between temporally adjacent states,
drug reward, TD error δ(t), and simulated DA signal at each timestep in simulated trials for a
low-taking (left) and a high-taking agent (right). (C) State value of example low-taking (top) and
high-taking agents (bottom). (D) Average simulated DA signals for high and low cocaine-taking
agents. (E) Correlation of onset and sustained simulated DA signals with simulated cocaine
infusions. (F-G) Same analyses as in (D-E), but for fentanyl IVSA agents.
Fig. 6. Actor-Critic TD learning model can partially explain DA dynamics and the
correlation of DA signals with punishment resistance during punishment sessions of
IVSA tasks. (A) Schematic of the model, highlighting the three discrete internal states (S0, S1, S2),
shock-modulated states (SC0, SC1, SC2), and their transitions. (B) Simulation results showing the
state, state value, change in value between temporally adjacent states, drug reward, shock reward,
TD error δ(t), and simulated DA signal at each timestep in simulated cocaine trials for a
low-resistant (left) and a high-resistant agent (right) during a punishment session. (C) State value
of example low-resistant (top) and high-resistant agents (bottom). (D) Average simulated DA signals
for high and low punishment-resistant cocaine-taking agents. (E) Correlation of simulated onset and
sustained DA with simulated punished cocaine infusions. (F) Left panel, same as (D), but zoomed
in to highlight the onset response. Right panel, experimental data as shown in Fig. 4D. (G) The
count of cocaine infusions by agents during baseline and punishment sessions across simulation
stages, grouped by high and low drug taking (left) or punishment resistance (right). (H-K) Same
analyses as in (D-G), but for fentanyl-taking agents.
Materials and methods
Experimental Subjects
Adult male and female mice (C57BL/6J, 12-20 weeks old, The Jackson Laboratory) were used
for this study. They were group-housed and maintained on a reversed 12-hour light/dark schedule
with ad libitum access to food and water. All experimental protocols were approved by the
Institutional Animal Care and Use Committee at the Massachusetts Institute of Technology.
Surgical Procedures for Jugular Vein Catheterization
Indwelling catheters were implanted into the right jugular vein of both male and female mice,
as described in the literature (53, 54). Specifically, mice were anesthetized with 1-1.5% isoflurane
in oxygen (0.7 L/min) using an anesthesia mask (Part# SOMNO-0801, Kent Scientific). Once fully
anesthetized, mice were placed on a heating pad (Part# 53800M, Stoelting Co.) to maintain body
temperature. After shaving the hair and sanitizing the surgical area with 70% ethanol and 2%
chloroxylenol, a 2-cm mid-scapular incision was made on the back, and a second 2-cm diagonal
incision was made from the right clavicle upwards to the animal's jaw. The right jugular vein was
then carefully exposed and lifted using an Eppendorf pipette tip (Part# 13-683-718, Fisher
Scientific). An 18G needle was used to create an opening in the jugular vein, and a catheter (Part#
C20PU-MJV1301, Instech Laboratories) was gently inserted and secured with two knots. The
other end of the catheter was threaded under the skin of the shoulder to connect to a vascular access
button (Part# VABM1B/25, Instech Laboratories) on the back, and the incisions were sutured closed.
Following surgery, mice were single-housed and received subcutaneous injections of meloxicam
(5 mg/kg) daily for 2-3 days to alleviate pain and inflammation. The catheters were flushed 1-2
times daily with approximately 0.05 mL of heparinized saline (30 U/mL heparin) to maintain
patency.
Surgical Procedures for Viral Injections and Fiber Optic Cannulae Implantation
After five to seven 6-hr training sessions of cocaine or fentanyl intravenous
self-administration, mice were anesthetized with 1-1.5% isoflurane in oxygen (0.7 L/min) and placed
on a stereotaxic apparatus (Model 940, Kopf). A heating pad (Part# 53800M, Stoelting Co.) was
used to maintain the animal's body temperature. For viral injections, a small craniotomy was
drilled above the right NAc medial shell (AP: +1.5 mm, ML: 0.55 mm relative to bregma). A pulled
glass pipette (Part# Q100-50-10, Sutter Instrument) front-loaded with AAV constructs (Boston
Children's Hospital Viral Core, AAV2/5-hSyn-GRAB_DA2m, 1.19 × 10^13 gc/mL) was lowered
into the medial shell of the NAc (DV: -4.3 mm relative to bregma). A total of 300 nL of the virus
was injected at 1 nL/s with a microsyringe pump (Part# UMP3, World Precision Instruments).
After the injection, the pipette was left in place for 10 minutes before being slowly withdrawn.
Next, a fiber optic cannula (core diameter: 200 µm; NA: 0.37; length: 4.5 mm, RWD Life Science)
was slowly lowered to the dorsal medial shell of the NAc (~3.7 mm below brain surface). The
cannula was secured to the skull using Loctite super glue and Metabond (C&B Metabond, Parkell).
The mice were allowed to recover for 4-7 days before resuming self-administration training.
Cocaine and Fentanyl Intravenous Self-Administration (IVSA) Paradigm
One to two weeks after jugular vein catheterization surgery, the patency of implanted catheters
was tested by intravenously injecting approximately 0.04 mL of a 15 mg/mL ketamine solution.
Mice that passed the patency test (i.e., cessation of movement within 4 seconds) were trained to
self-administer cocaine (Part# C5776, Sigma-Aldrich, 0.3 mg/kg/infusion) or fentanyl (Item#
07-890-5657, Patterson Veterinary, 2 µg/kg/infusion) in an operant chamber (Part# ENV-307A-CT,
Med Associates). The chamber was equipped with two retractable levers (Part# ENV-312-3M,
Med Associates), LED lights (Part# ENV-321DM, Med Associates), and a syringe pump (Part#
PHM-100VS-2, Med Associates).
Each trial of the cocaine or fentanyl IVSA started with the insertion of both levers and the
illumination of the light above the active lever. Pressing the active lever triggered a drug infusion
according to a fixed-ratio schedule (FR1, FR2, FR4), while pressing the inactive lever had no
programmed consequences. Each infusion was followed by a 40-second time-out period during
which no additional drug was delivered. This time-out period was implemented to prevent adverse
health consequences associated with excessive drug intake. During the first 19.5 seconds of this
time-out period, the light above the active lever blinked at 0.67 Hz (1 second on and 0.5 seconds
off) with both levers remaining available, but lever pressing (i.e., time-out responses) had no
programmed consequences. For the remaining 20.5 seconds of the time-out period, both levers
were retracted and the lights were turned off.
Training began with a 3-hour auto-shaping session during which both levers were active, and
pressing either lever triggered a drug infusion. In addition, a drug infusion was automatically
delivered if no lever was pressed within 6 minutes. The auto-shaping session ended either when
30 infusions were delivered or after three hours had elapsed, whichever came first. This was
followed by 6-hour long-access training sessions. During these sessions, mice were trained to
discriminate between an active drug-delivering lever and an inactive lever. Only presses on the
active lever resulted in drug infusions. The active lever was designated as the non-preferred lever,
based on behavioral observation during the initial auto-shaping session. To prevent catheter
blockage during the 6-hour training sessions, automatic drug infusions were delivered if the active
lever was not pressed within 30 minutes. To prevent adverse health effects of excess drug intake,
the maximum number of infusions per session was capped at 150 for males and 120 for females.
The training protocol consisted of 7-9 sessions on an FR1 schedule, followed by 2 sessions on an
FR2 schedule, and 10 sessions on an FR4 schedule. Mice were trained 5 days per week.
Following this long-access training, drug-taking behavior and fiber photometry recordings
were conducted during 3-hour IVSA sessions under an FR4 schedule. The animal's behavior was
also videotaped. Mice completed at least three 3-hour baseline IVSA sessions before undergoing
three consecutive punishment sessions. During these punishment sessions, each drug infusion was
paired with a brief, mild foot shock (intensity: 0.2 mA; duration: 0.5 seconds). The shock intensity
was verified using an ammeter (ENV-420, Med Associates) before each punishment session. After
completion of the punishment sessions, catheter patency was tested prior to brain tissue collection,
and only mice that passed this test were included in the final analysis of cocaine or fentanyl IVSA.
In total, 24 out of 27 mice successfully completed cocaine IVSA training and testing with
confirmed catheter patency, and 27 out of 29 mice completed the fentanyl IVSA training and
testing with confirmed patency. As a control, 8 mice completed the saline IVSA training and
testing, six of which underwent fiber photometry recordings.
Fiber Photometry
Dopamine transmission in the medial shell of the NAc was recorded with a rotary fiber
photometry system (Part# RFPS_2S_GCaMP_RedFluo, Doric). The system was equipped with an
assisted electrical rotary joint (Part# AHRJ-EL_24_FMC_25, Doric) for fiber photometry and a
fluid rotary joint for infusing drugs. Because the optic path and light detector of the photometry
system were integrated into one small device (i.e., a rotary fluorescence mini cube) that rotated as
mice moved in the chamber, fiber bending and movement-induced artifacts were minimized. To
measure signals from the green DA sensor GRAB_DA2m (49, 50), purple (405 nm) and blue (465 nm)
LEDs within the fluorescence mini cube (RMFM, Doric Lenses) emitted sinusoidally modulated
illumination at 208.616 Hz and 572.205 Hz, respectively, to excite the fluorophore. The power at the
tip of the patch cable was 5-10 µW. A 0.4-meter low auto-fluorescence fiber optic patch cord was
used to connect the mini cube to the implanted fiber optic cannulae. Bulk fluorescent signals were
detected with detectors integrated within the mini cube, amplified by a Doric fluorescence detector
amplifier, and digitized at 12 kHz by a fiber photometry console (FPC, Doric Lenses), which also
recorded behavioral events of drug infusion, cue presentation, and lever presses from the Med
Associates operant chamber. The digitized signals were lock-in demodulated based on the
modulation frequency of each excitation light (405 nm and 465 nm) and down-sampled to 120 Hz.
Doric Neuroscience Studio was used to acquire and stream demodulated signals to disk. To diminish
photobleaching, the fiber photometry system was automatically turned ON for 30 minutes and then
turned OFF for 30 minutes. This ON-and-OFF cycle automatically repeated 3 times to cover the
entire 3-hour IVSA testing phase.
Histological Staining
Mice were deeply anesthetized with isoflurane and intracardially perfused with 1x PBS
followed by 4% paraformaldehyde. The brain was post-fixed overnight with 4% paraformaldehyde
and cryo-protected with 30% sucrose for 2-3 days. The brain was then cut with a cryostat into 80
µm coronal slices. For visualizing cannula tracks and the expression of GRAB_DA2m, slices were
stained with DAPI (1:5000 dilution, H3570, ThermoFisher, Waltham, MA) or fluorescent Nissl
stain (1:500 dilution, N21479, ThermoFisher).
Data Analysis
Data analysis was performed using Doric Neuroscience Studio and custom scripts written in
Python and MATLAB (MathWorks, Natick, MA).
Behavior analysis during the IVSA
The total numbers of drug infusions, active lever presses, and inactive lever presses were
recorded. In addition, the timestamps of trial onset and all behavioral events were recorded.
These timestamps were used to generate raster plots of lever-press activity relative to the
trial start.
To classify mice as low or high drug takers, the average number of infusions per mouse
during the 3-hour baseline IVSA sessions was calculated. Mice whose average baseline
infusion counts were greater than the group mean plus 10% of the mean were classified as high drug
takers, whereas mice whose average baseline infusion counts were lower than the group mean minus
10% of the mean were classified as low drug takers. Mice that fell between these thresholds were
left ungrouped.
Similarly, to classify mice as high or low punishment-resistant, the average number of
punished infusions per mouse during the 3-hour punishment sessions was calculated. Mice whose
average punished infusion counts were greater than the group mean plus 10% of the mean were
classified as high punishment-resistant, whereas mice with average punished infusion counts
lower than the group mean minus 10% of the mean were classified as low punishment-resistant.
Mice that fell between these thresholds were left ungrouped.
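The grouping rule described above (group mean plus or minus 10% of the mean, with the middle band left ungrouped) can be sketched as follows. This is a minimal illustration rather than the authors' analysis script; the function name `classify_by_mean_band` and the example counts are hypothetical.

```python
from statistics import mean

def classify_by_mean_band(counts, band=0.10):
    """Label each animal 'high', 'low', or 'ungrouped' relative to the group mean.

    counts: per-mouse average infusion counts (baseline or punished sessions).
    band:   fraction of the group mean used as the threshold margin (10% here).
    """
    group_mean = mean(counts)
    upper = group_mean * (1 + band)  # above this: high
    lower = group_mean * (1 - band)  # below this: low
    labels = []
    for count in counts:
        if count > upper:
            labels.append("high")
        elif count < lower:
            labels.append("low")
        else:
            labels.append("ungrouped")
    return labels

# Hypothetical per-mouse averages, not data from the study.
example_counts = [120, 95, 60, 101, 30, 140]
example_labels = classify_by_mean_band(example_counts)
```

The same function applies to both groupings: passing baseline infusion averages yields high vs. low drug takers, and passing punished infusion averages yields high vs. low punishment resistance.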
Fiber photometry data analysis
The fiber photometry data were preprocessed using Doric Neuroscience Studio to extract the
Z-score of the DA dynamics based on a published method (55). Specifically, signals recorded at both
465 nm and 405 nm (i.e., the isosbestic signal) were smoothed by applying a running average with a
window size of 0.1 seconds. The bleaching slope and low-frequency fluctuations of both signals
were corrected with an adaptive, iteratively re-weighted penalized least squares algorithm (56). Both
signals were standardized by calculating their Z-scores. Non-negative robust linear regression was
used to fit the Z-score of signals at 405 nm to those at 465 nm. Finally, DA dynamics were calculated
by subtracting the fitted Z-score of signals at 405 nm from the Z-score of the signals at 465 nm.
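The last three steps of this pipeline (standardization, fitting the 405 nm signal to the 465 nm signal, and subtraction) can be sketched as below. This is a simplified illustration under stated assumptions, not the Doric Neuroscience Studio implementation: smoothing and baseline correction are omitted, the Z-score uses the population standard deviation, and the robust regression is replaced by ordinary least squares with the slope constrained to be non-negative. All function names are hypothetical.

```python
from statistics import fmean, pstdev

def zscore(signal):
    """Standardize a signal to zero mean and unit (population) standard deviation."""
    mu, sd = fmean(signal), pstdev(signal)
    return [(v - mu) / sd for v in signal]

def fit_nonnegative_line(x, y):
    """Least-squares fit y ~ slope * x + intercept with slope >= 0.

    Assumption: ordinary (non-robust) least squares stands in for the
    robust regression named in the text.
    """
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = max(0.0, sxy / sxx)  # a negative slope is clipped to zero
    intercept = my - slope * mx
    return slope, intercept

def isosbestic_corrected(sig465, sig405):
    """Subtract the fitted 405 nm Z-score from the 465 nm Z-score."""
    z465, z405 = zscore(sig465), zscore(sig405)
    slope, intercept = fit_nonnegative_line(z405, z465)
    return [s - (slope * c + intercept) for s, c in zip(z465, z405)]
```

With a single predictor and a free intercept, clipping the unconstrained slope at zero gives the constrained least-squares solution, so only the robustness to outliers is lost in this simplification.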
Drug self-administration-evoked DA responses (hereafter referred to as "drug-evoked"
responses) were analyzed by aligning DA dynamics to the onset of each drug self-administration
and constructing peri-event time histograms (PETHs). Importantly, the generation of drug-evoked
events required response contingency, cue presentation, and simultaneous drug delivery.
Normalized PETHs were obtained by averaging PETHs across trials and subtracting baseline DA
activity (-10 to 0 s relative to drug infusion). Onset DA responses were defined as the mean evoked
DA signals within the first 1 s after drug infusion, whereas sustained DA responses were defined as
the mean evoked DA responses from 1 to 19.5 s post-infusion (corresponding to the drug-associated
cue period).
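The PETH normalization and the onset and sustained windows defined above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, events are assumed to be given as sample indices into a Z-scored trace, and events without a complete window are simply skipped.

```python
from statistics import fmean

def normalized_peth(trace, event_idx, fs, pre_s=10.0, post_s=19.5):
    """Average event-aligned segments and subtract the mean pre-event baseline.

    trace:     Z-scored DA signal as a list of samples.
    event_idx: sample indices of drug-infusion onsets.
    fs:        sampling rate in Hz (120 Hz after down-sampling in the text).
    """
    pre, post = round(pre_s * fs), round(post_s * fs)
    segments = [trace[i - pre:i + post] for i in event_idx
                if i - pre >= 0 and i + post <= len(trace)]
    if not segments:
        raise ValueError("no event has a complete peri-event window")
    average = [fmean(samples) for samples in zip(*segments)]
    baseline = fmean(average[:pre])  # -10 to 0 s relative to infusion
    return [v - baseline for v in average]

def onset_and_sustained(peth, fs, pre_s=10.0, onset_s=1.0, cue_end_s=19.5):
    """Mean response in the onset (0 to 1 s) and sustained (1 to 19.5 s) windows."""
    t0 = round(pre_s * fs)
    onset_end = t0 + round(onset_s * fs)
    cue_end = t0 + round(cue_end_s * fs)
    return fmean(peth[t0:onset_end]), fmean(peth[onset_end:cue_end])
```

Because baseline subtraction is linear, subtracting the baseline from the trial-averaged PETH, as done here, gives the same result as subtracting the average per-trial baseline.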
To analyze DA responses to active lever presses, DA dynamics were aligned to the first active
lever press in each trial to construct PETHs. The PETHs were then normalized by averaging
PETHs across all trials and subtracting the mean baseline activity (the 2-s interval immediately
preceding the press). DA responses to active lever presses were quantified as the mean evoked DA
signals within the 2 s following the active lever press.
To analyze the decay rate of DA transients, DA dynamics from baseline sessions were first
low-pass filtered using a 4th-order Butterworth filter. Local peaks and their subsequent troughs were
then extracted with the findpeaks function in MATLAB. Finally, we performed a linear regression
on these corresponding peaks and troughs, and the slope of the regression was defined as the decay
slope.
Temporal Difference Learning in Actor-Critic Model
In TD learning applied to self-administration, an agent transitions through a sequence of
states $S(t)$ according to its policy $\pi$, interacting with a Markov decision process (or a
semi-Markov decision process). The "Critic" computes the value associated with each state, defined
as the expected discounted future return under the current policy $\pi$:

$$V^{\pi}(S) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(S(t)) \,\middle|\, S(0) = S, \pi \right], \tag{1}$$

where $t$ denotes time and $S(t)$ is the state visited at time $t$. $r(S(t))$ denotes the reward delivered
at state $S(t)$, and $\gamma \in (0, 1)$ is a discount factor. In the experiments we examine, the drug reward
is present with a delay after the infusion event and lasts for a prolonged period until the end of the
trial. The "Actor" aims to learn an optimal policy $\pi^{*}$ that maximizes the expected total future
return:

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(S). \tag{2}$$

Learning interleaves steps of estimating the value function and updating the policy.
For estimating the value function, under the Markov property, the value at time $t$ for state $S(t)$ can
be rewritten as the sum of the reward received at $t$ and the discounted value at the next time step:

$$V^{\pi}(S(t)) = \langle r(S(t)) \rangle + \sum_{a} P_{\pi}[a; S(t)] \sum_{S'} P[S' \mid S(t); a]\, V^{\pi}(S'), \tag{3}$$

where $P_{\pi}[a; S(t)]$ denotes the probability of choosing action $a$ at state $S(t)$ according to the
policy $\pi$, and $P[S' \mid S(t); a]$ denotes the probability of the state transition from $S(t)$ to
$S(t+1) = S'$ at the next time step $t + 1$ when taking action $a$. $\langle r(S(t)) \rangle$ denotes the mean
reward received at $S(t)$. Temporal difference learning takes $r(S(t)) + V^{\pi}(S')$ as a Monte Carlo
sample to approximate the right side of equation (3) and then bootstraps by replacing the unknown
$V^{\pi}(S')$ with the current estimate $V(S')$. Therefore,
πΏ(π‘) = π(π(π‘)) + π(πβ²) β π(π(π‘)) (4) 181
is used as a sampled approximation to the mismatch ππ(π(π‘)) β π(π(π‘)). πΏ(π‘) is called temporal-182
difference reward prediction error (TD -RPE). When πΏ(π‘) = 0, the value function is well 183
approximated. However, when πΏ(π‘) is positive or negative, the Criticβs estimate π(π(π‘)) should 184
be increased or decreased, respectively (πΌ is learning rate): 185
π(π(π‘)) β π(π(π‘)) + πΌπΏ(π‘) (5) 186
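As a minimal illustration of equations (4) and (5), the Critic update can be sketched in Python as follows. The dictionary of state values, the function names, and the numerical values of `alpha` and the reward are illustrative assumptions rather than parameters of the model. `gamma` defaults to 1 so that the error matches equation (4) as written; a value below 1 would apply the discount factor of equation (1) to the next-state value.

```python
def td_error(reward, v_current, v_next, gamma=1.0):
    """Sampled TD-RPE; with gamma=1.0 this is r + V(s') - V(s) as in equation (4)."""
    return reward + gamma * v_next - v_current


def critic_update(values, state, next_state, reward, alpha=0.1, gamma=1.0):
    """Move V(state) toward its bootstrapped target by alpha * delta (equation 5)."""
    delta = td_error(reward, values[state], values[next_state], gamma)
    values[state] += alpha * delta
    return delta


# Hypothetical three-state trial with all values initialised to zero (assumption).
values = {"S0": 0.0, "S1": 0.0, "S2": 0.0}
delta = critic_update(values, "S1", "S2", reward=1.0)
```

Only the estimate for the visited state changes; the other entries of `values` are left untouched until those states are visited.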
For updating the policy to satisfy equation (2), we can similarly use $r(S(t)) + V(s')$ as a Monte Carlo estimate of the right-hand side of equation (3), and the policy is updated along the gradient of equation (3). If $P_{\pi}[a; s]$ is parameterized as a softmax distribution:

$$P_{\pi}[a; s] = \frac{e^{\beta m_{a,s}}}{\sum_{a'} e^{\beta m_{a',s}}}, \tag{6}$$

where $m_{a,s}$ denotes the action value for taking action $a$ at state $s$ and $\beta$ is the softmax parameter, then the update of the Actor's policy upon taking action $a$ at state $s$ has an elegant and biologically plausible form:

$$m_{a,s} \leftarrow m_{a,s} + \alpha \frac{\partial V^{\pi}(s)}{\partial m_{a,s}} \approx m_{a,s} + \alpha \left(\delta_{aa'} - P_{\pi}[a'; s]\right) \delta(t), \tag{7}$$

where $\delta_{aa'} = 1$ if $a' = a$ and 0 otherwise, $\delta(t)$ is the RPE defined above, and $\alpha$ is the learning rate.
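The softmax policy of equation (6) and the Actor update of equation (7) can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names, the list layout of the action values, and the values of `alpha` and `beta` are assumptions. The sketch also assumes that, after one action is taken, the update is applied to the action value of every candidate action $a'$ at the visited state, which is one common reading of equation (7).

```python
import math


def softmax_policy(prefs, beta=1.0):
    """Return P[a; s] for the action values `prefs` at one state (equation 6)."""
    m_max = max(prefs)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (m - m_max)) for m in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def actor_update(prefs, action, delta, alpha=0.1, beta=1.0):
    """Update all action values at the visited state after taking `action`.

    Each entry a' changes by alpha * (1[a' == action] - P[a'; s]) * delta,
    following equation (7). The list is modified in place and returned.
    """
    probs = softmax_policy(prefs, beta)
    for a_prime in range(len(prefs)):
        indicator = 1.0 if a_prime == action else 0.0
        prefs[a_prime] += alpha * (indicator - probs[a_prime]) * delta
    return prefs


# Hypothetical two-action example: index 0 = no press, index 1 = lever press.
prefs = [0.0, 0.0]
actor_update(prefs, action=1, delta=1.0)
```

With a positive RPE, the value of the chosen action rises and the values of the alternatives fall by the same total amount, so the chosen action becomes more probable on the next visit.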
Internal state inferred from task stimulus
For simplicity, we assume each trial is decomposed into three discrete internal states according to the different sensory feedback:

$$S_0,\; S_1,\; S_2$$

$S_0$ denotes the initial trial stage from light onset and lever insertion to drug infusion, which dominates the trial period. $S_1$ denotes the 20 s delay period from drug infusion to lever retraction, with the light blinking on and off periodically. $S_2$ denotes the 20 s light-off ending stage of each trial, from lever retraction to the next lever insertion. While the external task progression is governed deterministically by the external clock and lever-press counter, the animal has access to neither the clock nor the counter and transitions stochastically among these states based on the sensory input generated at each external task stage.
Internal state triggered by shock
Foot shock is a life-threatening stimulus to animals and usually triggers leaping, retreating, scanning, or freezing behaviors. After a period of alertness, the animal gradually returns to its spontaneous pre-shock behavior, suggesting a process of inferring what is happening. Therefore, we assume the shock triggers another cascade of internal states:

$$S_0^{s},\; S_1^{s},\; S_2^{s}$$

where $S_1^{s}$ denotes the alert period immediately after receiving the shock. After the animal scans its surroundings, the shock state transitions stochastically to $S_2^{s}$, which denotes the "safety" state. The transition $S_1^{s} \to S_2^{s}$ corresponds to shock relief. Once the light is turned on again for the next trial, the safety state transitions to $S_0^{s}$, which denotes the "danger" state, extending from lever insertion to shock delivery. A negative reward $r_s < 0$ is provided during the transition $S_0^{s} \to S_1^{s}$, and an innate negative value $V(S_1^{s}) < 0$ is initially assigned to the shock state $S_1^{s}$. Experimental data from baseline shock testing confirm that the first shock can evoke lasting DA dynamics independent of drug reward, making it necessary to include the shock-related states.
Given both task-related states $S(t)$ and shock-related states $S^{s}(t)$, the total state value and the total action value at time $t$ are each the sum of the task-related and shock-related components:

$$V\left(S(t), S^{s}(t)\right) = V(S(t)) + V(S^{s}(t)) \tag{8}$$

$$m_{a,(S,S^{s})} = m_{a,S} + m_{a,S^{s}} \tag{9}$$

All the variables $V(S(t))$, $V(S^{s}(t))$, $m_{a,S}$, and $m_{a,S^{s}}$ follow the TD-learning rule for the Actor-Critic model introduced above.
Uncertainty in state transition during long-access training
Each task trial has an extended duration, lasting 2-3 min on average, which far exceeds the typical timescale of classical conditioning experiments in reinforcement learning. This may have several non-negligible consequences for the animal's behavior. First, because of constant spontaneous movement during the operant stage, the animal may not detect a transient cue, and the cue-evoked internal state transition can be delayed. For example, the transition $S_0 \to S_1$ may lag behind the first light blink, which is consistent with the trial-to-trial jitter in the timing of the initial DA response. For this reason, we assume that the animal transitions from $S_0$ to $S_1$ with some probability at each light blink (from light on to light off):

$$0 < P[S_1 \mid S_0, \text{blink off}] < 1$$

Second, during the ON cycle of the blinking light after infusion, the animal has a non-zero probability of returning to the ground state $S_0$. This is because $S_0$, with light-on as its sensory input, occupies the majority of the trial period, so the animal may treat $S_0$ as the "ground state" with a large prior. Considering that the ON cycle of the blinking light during $S_1$ shares the same sensory input as $S_0$, the animal may switch its internal state to $S_0$ at the on-edge and return to $S_1$ at the next off-edge of the blink, which we call "diffuse":

$$P[S_0 \mid S_1, \text{blink on}] > 0$$
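The two stochastic transitions described in this section can be illustrated with a short simulation of the blinking-light period. This is a sketch under stated assumptions: the probabilities `p_enter` and `p_diffuse`, the function name, and the choice to reuse `p_enter` for re-entering $S_1$ after a "diffuse" event are all hypothetical and are not taken from the fitted model.

```python
import random


def simulate_blinks(n_blinks, p_enter=0.5, p_diffuse=0.2, seed=0):
    """Simulate the internal state across light blinks after an infusion.

    At each ON->OFF edge an animal in S0 moves to S1 with probability p_enter.
    At each OFF->ON edge an animal in S1 falls back to S0 with probability
    p_diffuse. Returns a list of (edge, state) pairs, two per blink.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    state = "S0"
    trajectory = []
    for _ in range(n_blinks):
        # Light goes off: the cue for entering (or re-entering) S1.
        if state == "S0" and rng.random() < p_enter:
            state = "S1"
        trajectory.append(("off", state))
        # Light comes back on: same sensory input as S0, so the state may diffuse.
        if state == "S1" and rng.random() < p_diffuse:
            state = "S0"
        trajectory.append(("on", state))
    return trajectory


trajectory = simulate_blinks(5)
```

Setting `p_enter=1.0` and `p_diffuse=0.0` recovers a deterministic observer that enters $S_1$ at the first blink and stays there; intermediate values produce the jittered entry times and occasional returns to $S_0$ described above.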
Individual difference on action cost and shock sensitivity
In the model, we consider only a binary action for simplicity: $a = 0$ denotes no press, while $a = 1$ denotes a lever press. We assume that the cost of a lever press varies across animals. Therefore, when updating the action value for pressing the lever, an additional action cost $c(a)$ is considered:

$$m_{a,s} \leftarrow m_{a,s} + \alpha \left(\delta_{aa'} - P_{\pi}[a'; s]\right) \cdot \left(\delta(t) - c(a(t))\right), \tag{20}$$

where $\delta(t) - c(a(t)) = \left(r(t) - c(a(t))\right) + V(S(t+1)) - V(S(t))$ is the combined RPE for the Actor. Variations in action cost have a profound impact on the animal's addiction behavior: for example, we find that animals that learned to press the lever by biting it or by using the chin made significantly more lever presses than those pressing with a paw, and consequently took more drug. Indeed, pressing the lever with a paw is relatively demanding: the animal must rear up, maintain balance, and then lift a paw to make a press. During shock sessions, we find that some animals show much higher tolerance to the electric shock than others. Therefore, shock sensitivity is another factor underlying individual differences.
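The cost-adjusted Actor update of equation (20) can be sketched for the binary press / no-press case as follows. The function names and all numerical values are illustrative assumptions. The sketch further assumes that the cost applies only to the lever press, that is $c(0) = 0$, and that the update is applied to both action values at the visited state.

```python
import math


def press_probability(m_no_press, m_press, beta=1.0):
    """Two-action softmax: probability of pressing the lever."""
    return 1.0 / (1.0 + math.exp(-beta * (m_press - m_no_press)))


def actor_update_with_cost(prefs, action, delta, press_cost, alpha=0.1, beta=1.0):
    """Update prefs = [m_no_press, m_press] using the combined RPE delta - c(a).

    `action` is 0 (no press) or 1 (lever press). The cost is charged only
    when the lever is pressed (assumption: c(0) = 0).
    """
    cost = press_cost if action == 1 else 0.0
    p_press = press_probability(prefs[0], prefs[1], beta)
    probs = [1.0 - p_press, p_press]
    for a_prime in (0, 1):
        indicator = 1.0 if a_prime == action else 0.0
        prefs[a_prime] += alpha * (indicator - probs[a_prime]) * (delta - cost)
    return prefs


# Same RPE, two hypothetical animals with a cheap and an expensive press.
low_cost = actor_update_with_cost([0.0, 0.0], action=1, delta=1.0, press_cost=0.2)
high_cost = actor_update_with_cost([0.0, 0.0], action=1, delta=1.0, press_cost=1.5)
```

In this toy comparison the same RPE strengthens pressing for the low-cost animal but weakens it when the cost exceeds the RPE, which is the qualitative effect the paragraph above attributes to differences in pressing style.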
Effects of Cocaine and Fentanyl on modulating shock sensitivity
Drugs like cocaine and fentanyl have diverse pharmacological effects on animals besides serving as reinforcers. Specifically, cocaine is a psychostimulant, whereas fentanyl is an opioid analgesic. Therefore, we assume that the two drugs modulate shock sensitivity in opposite directions: cocaine lowers the shock threshold, whereas fentanyl raises it.

$$\frac{d\theta}{dt} = -\theta + \delta(t - \hat{t}_{fen}) - \delta(t - \hat{t}_{coc}), \tag{31}$$

where $\delta(t - \hat{t})$ is a Dirac delta that represents an event occurring at time $\hat{t}$. $\hat{t}_{fen}$ and $\hat{t}_{coc}$ denote the infusion events of fentanyl and cocaine, respectively. $\theta$ is the shock threshold used to compute the effective shock reward $r_s$:

$$r_s = A_s \cdot \Theta(|A_s| - \theta), \tag{42}$$

where $A_s$ is a measure of shock strength and $\Theta$ is the step function.
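The threshold dynamics and the gating of the shock reward can be sketched as below. Because the threshold equation is linear with unit time constant, each infusion contributes a jump of +1 (fentanyl) or -1 (cocaine) that then decays exponentially, which gives the closed form used here. The function names, the initial value `theta0`, the dimensionless time units, and the convention that the step function is 0 at exactly zero are assumptions made for illustration.

```python
import math


def shock_threshold(t, fentanyl_times=(), cocaine_times=(), theta0=0.0):
    """Closed-form threshold at time t for the linear dynamics with unit time constant.

    Each past fentanyl infusion adds exp(-(t - t_i)); each past cocaine
    infusion subtracts the same term. Infusions after time t have no effect.
    """
    theta = theta0 * math.exp(-t)
    theta += sum(math.exp(-(t - ti)) for ti in fentanyl_times if ti <= t)
    theta -= sum(math.exp(-(t - ti)) for ti in cocaine_times if ti <= t)
    return theta


def effective_shock_reward(shock_strength, theta):
    """Return the shock strength if its magnitude exceeds theta, else 0.

    Assumes the step function equals 0 when its argument is exactly 0.
    """
    return shock_strength if abs(shock_strength) > theta else 0.0


# Hypothetical shock of strength -0.3 delivered one time unit after an infusion.
theta_fen = shock_threshold(1.0, fentanyl_times=[0.0])
theta_coc = shock_threshold(1.0, cocaine_times=[0.0])
r_fen = effective_shock_reward(-0.3, theta_fen)
r_coc = effective_shock_reward(-0.3, theta_coc)
```

In this example the raised threshold after fentanyl gates out the shock, while the same shock remains effective after cocaine.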
Cocaine's effect on DA signal decay
Since cocaine blocks dopamine reuptake, extracellular DA clears more slowly, extending the decay timescale relative to the normal condition; in our recordings, DA decays roughly fivefold more slowly than normal. In the cocaine model, we represent each RPE event as producing a DA transient with a slow decay. Importantly, the DA level during this decay is not treated as an additional teaching signal: only the phasic RPE at the event time drives learning. We instead interpret the slowly decaying component as a motivational modulation signal, consistent with its timescale.
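One way to picture this readout is to let each RPE event launch an exponentially decaying transient and to sum the transients into a DA trace. The sketch below assumes an exponential kernel and arbitrary time units; the function name and the absolute values of `tau` are hypothetical, and only the roughly fivefold ratio between the two decay constants comes from the text. The trace is produced for display only and is not fed back into the learning updates.

```python
import math


def da_trace(rpe_events, times, tau):
    """Sum of exponentially decaying transients, one per RPE event.

    rpe_events: list of (event_time, rpe) pairs.
    times: sample times at which to evaluate the trace.
    tau: decay time constant (same units as the times).
    """
    return [
        sum(rpe * math.exp(-(t - t0) / tau) for t0, rpe in rpe_events if t0 <= t)
        for t in times
    ]


times = [0.5 * i for i in range(21)]          # 0 to 10 in steps of 0.5
events = [(1.0, 1.0)]                         # a single unit RPE at t = 1
normal = da_trace(events, times, tau=1.0)     # hypothetical baseline decay
cocaine = da_trace(events, times, tau=5.0)    # fivefold slower decay
```

Both traces share the same phasic onset at the event time, so the learning signal is identical; they differ only in how long the signal lingers afterward.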
Fig. S1. Drug-taking behavior. (A) Schematic of the drug self-administration task structure (shown with an FR1 schedule), highlighting the light-ON, light-blinking, and light-OFF epochs. (B) Raster plots of active lever presses (gray lines) and infusions (black lines) aligned to the start of each trial (lever insertion; time 0), along with distributions of the latency to the first active lever press, inter-press interval, and infusion latency relative to trial onset. Each row corresponds to an example mouse.
Fig. S2. Saline intravenous self-administration (IVSA) behaviors. (A) Lever presses (left) and saline infusions (right) across daily 6-hour saline IVSA training sessions (n = 8). (B) Lever discrimination (proportion of active lever presses) across 10 training sessions of cocaine (red), fentanyl (blue), and saline (black) IVSA under an FR4 schedule (two-way repeated-measures ANOVA). (C) Lever presses (left) and saline infusions (right) during the testing phases of saline IVSA (n = 8). Error bars indicate mean ± standard error of the mean (SEM). * represents p < 0.05.
Fig. S3. Histological verification of optical fiber placements. The tips of optical fiber tracks are indicated by pink circles (one circle per mouse). Coronal sections are labeled with anterior-posterior coordinates relative to Bregma. In some cases (n = 2), placements could not be verified because the fiber tracks were not detectable.
Fig. S4. Dopamine dynamics during cocaine, fentanyl, and saline IVSA. (A) Example traces of low-pass filtered, z-scored dopamine signals. Local peaks (red) and troughs (blue) are indicated. (B) Linear regressions fitted to decay segments from each local peak to the subsequent trough, corresponding to the example traces shown in (A). (C) Bar plots of decay slopes in mice self-administering cocaine and fentanyl (Welch's t-test). (D) Population-averaged dopamine responses to cocaine (top) and fentanyl (bottom) IVSA overlaid with the blinking light signal (green line). (E) Dopamine responses to cocaine or fentanyl IVSA on the 1st day of training. (F) Dopamine responses to saline IVSA during the 3-hour testing phase after extended training. (G) Dopamine responses to the co-occurrence of saline IVSA and footshock during punishment sessions of saline IVSA. Error bars indicate mean ± standard error of the mean (SEM). *** represents p < 0.001.
Fig. S5. Dopamine responses to active lever presses. (A) Example dopamine responses to the first active lever press of each trial. Two examples are from high cocaine-taking mice and three from low cocaine-taking mice. (B) Group dopamine responses to active lever presses. Each row of the colormap represents one mouse, sorted by the number of infusions (top = most). (C) Time courses of DA responses to active lever presses for high (orange) and low (light blue) cocaine-taking mice. (D) Dopamine responses modulated by active lever presses for high (orange) and low (light blue) cocaine-taking mice. Responses were calculated as the difference between the mean z-scored dopamine signal 0-2 s after the active lever press and that during the -2 to 0 s baseline before the press. (E-H) Same analyses as (A-D), but for fentanyl IVSA.
Fig. S6. Dopamine responses across three episodes of recordings during cocaine and fentanyl IVSA. (A) Schematic illustrating three cycles of rotary fiber photometry during 3-hour IVSA testing sessions. (B) Number of cocaine infusions across the three recording episodes during 3-hour IVSA. (C) Time courses of DA responses to contingent cocaine infusions for high (orange) and low (light blue) cocaine-taking mice across the three recording episodes. (D) Averaged onset and sustained DA responses across the three recording episodes for high (orange) and low (light blue) cocaine-taking mice. (E-G) Same analyses as in (B-D), but for fentanyl.
Fig. S7. Dopamine dynamics during punished drug taking in individual mice. Each colormap represents a single mouse. Dashed lines at time 0 represent the onset of drug infusion plus a mild foot shock (0.5 s).
Fig. S8. Changes in dopamine dynamics across punishment sessions of fentanyl IVSA. (A) Group DA responses to punished fentanyl infusions in punishment-resistant mice (n = 8) during the 1st punishment session (left) and a subsequent punishment session (right). Each row of the colormap represents the same mouse recorded across punishment sessions. (B) Quantification of onset and sustained DA responses during the 1st and subsequent punishment sessions. (C) Individual examples of DA responses across each trial during the 1st and subsequent punishment sessions. * represents p < 0.05.