Dopamine signatures of excessive and compulsive cocaine and fentanyl use

doi:10.64898/2026.01.18.700215

2026 · doi:10.64898/2026.01.18.700215

preprint OA: closed CC-BY-NC-ND-4.0

Abstract

Excessive and compulsive drug use despite adverse consequences is a hallmark of substance use disorder, yet individuals differ markedly in their vulnerability to develop these behaviors. Drugs of abuse have long been known to alter endogenous dopamine (DA) signaling, but shared principles for how DA dynamics impact compulsive use among individuals and across drug classes are lacking. Here, we monitored DA release in the medial shell of the nucleus accumbens (NAc) during cocaine and fentanyl self-administration, with or without coincident punishment, in large cohorts of mice. Contingent cocaine and fentanyl self-administration evoked complex and individually distinct DA dynamics; nevertheless, a robust negative correlation held across both drugs, such that high takers exhibited lower drug-evoked DA signals. During punished drug taking, cocaine and fentanyl were associated with distinct DA signatures of compulsivity. For cocaine, punishment-resistant mice showed lower sustained DA responses during the post-shock, drug-associated cue period, whereas for fentanyl, punishment-resistant mice displayed larger phasic DA at the co-occurrence of footshock and drug infusion. To identify common principles underlying these observations, we developed a computational model grounded in an Actor-Critic temporal-difference (TD) learning framework that incorporates internal states, the agent's uncertainty, and drug-specific effects. Remarkably, this model captures the observed diversity in DA dynamics across drug classes and among mice with variable drug-taking propensities, thereby providing a unified interpretation of NAc DA signals as encoding TD reward prediction errors.

bioRxiv preprint doi: https://doi.org/10.64898/2026.01.18.700215; this version posted January 20, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Main Text

Substance use disorder (SUD) is a leading cause of drug overdose-related mortality. Excessive drug consumption and compulsive drug taking despite adverse consequences are defining features of SUD. However, only a subset of individuals transition from initially controlled, recreational drug use to uncontrolled and compulsive taking, underscoring a striking degree of individual variability in vulnerability to addiction (1–6). Rodent models similarly reveal significant individual differences in the propensity to transition from controlled to addiction-like drug use (7–15). For example, in cocaine self-administration paradigms, only a subset of rats met operational criteria for addiction-like behaviors, characterized by escalation of intake, persistent drug seeking despite adverse consequences, and heightened motivation to obtain the drug (13, 15). Despite extensive research, the neurobiological mechanisms by which exposure to addictive drugs leads to divergent outcomes across individuals remain poorly understood. This study aims to identify specific neurochemical signatures associated with excessive and compulsive drug taking despite adverse consequences.

A large body of work has identified the mesolimbic dopamine system, particularly projections from the ventral tegmental area (VTA) to the nucleus accumbens (NAc), as central to the reinforcing effect of abused drugs and the development of addiction (16–24). Seminal microdialysis and voltammetry studies showed that pharmacologically diverse drugs of abuse, including cocaine and opioids, preferentially elevate extracellular dopamine levels in the NAc relative to dorsal striatum in animals (18, 23, 25).
However, dopamine responses to drugs and drug-associated cues are not uniform across individuals, nor do they remain static over the course of addiction development. Indeed, human neuroimaging and rodent studies have revealed substantial inter-individual variability in both drug-evoked and cue-evoked dopamine release (26–30), suggesting that differences in dopamine responses may contribute to addiction vulnerability (5, 16, 27, 29, 31). From a computational perspective, dopamine has often been conceptualized as a reward prediction error signal within temporal-difference reinforcement learning models (32–35), whereas psychological and neurobiological theories such as incentive sensitization, allostasis, and habit formation emphasize how drug-induced adaptations in dopamine circuits may drive pathological “wanting”, negative reinforcement, and stimulus–response habits in the development of addiction (5, 36–40). Despite these influential theories, direct comparisons of in vivo dopamine dynamics across individuals and across different drug classes remain limited, hindering efforts to determine which dopamine theories best explain experimental data and individual differences in drug taking and compulsive behavior.

In vivo measurements of dopamine using fast-scan cyclic voltammetry (FSCV) and genetically encoded dopamine sensors have provided rich descriptions of dopamine dynamics during psychostimulant self-administration, particularly for cocaine (23, 30, 41–43). In contrast, the subsecond dopamine dynamics underlying opioid self-administration remain poorly characterized, especially in the context of fentanyl. A recent study showed that heroin selectively activates a subset of VTA dopamine neurons projecting to the medial shell of the NAc, and that these dopamine neurons are critical for heroin self-administration (19).
However, little is known about how variability in real-time dopamine dynamics in the NAc relates to individual differences in excessive opioid self-administration. Moreover, it remains unknown how dopamine release patterns correlate with the emergence of compulsive drug taking despite adverse consequences. Critically, few studies have directly compared dopamine signatures across drug classes using the same self-administration paradigm, or tested which formal theories of dopamine in addiction best account for the observed dopamine dynamics across conditions.

Here, we measure real-time changes in dopamine release with a genetically encoded sensor in the medial shell of the NAc as mice develop excessive and compulsive self-administration of two distinct drug types, cocaine and fentanyl. We chose fentanyl to represent the opioid drug class because it is the leading cause of overdose death (44), and cocaine to represent the stimulant drug class because of its strong abuse potential and to ensure that our results can be compared with the existing literature. We focused on the medial NAc shell because this region has been strongly implicated in the reinforcing and motivational effects of both cocaine and opioids (19, 41, 45, 46). To capture the transition from controlled to compulsive drug use, we employed long-access intravenous self-administration paradigms, which reliably promote escalation of intake and the emergence of punishment-resistant, compulsive drug taking (11, 13, 47, 48).
We quantify individual differences in cocaine- and fentanyl-taking behaviors and punishment sensitivity and correlate these measurements with individual variations in dopamine dynamics. We then build a computational model that recapitulates the diverse dopamine patterns observed across individuals, drug classes, and stages of drug self-administration. Altogether, our study reveals dopamine signatures of excessive and compulsive drug taking in SUD models and provides a unified computational framework for understanding NAc dopamine signals in reinforced drug consumption.

Individual differences in drug-taking behavior and punishment resistance

To assess individual differences in addiction-like behaviors induced by psychostimulant or opioid exposure, we trained large cohorts of mice to perform intravenous self-administration (IVSA) of either cocaine (n = 24) or fentanyl (n = 27). We also expressed a genetically encoded dopamine (DA) sensor in these mice to measure DA release (described below). Specifically, mice implanted with catheters were trained to press an active lever that triggered intravenous infusion of either cocaine (0.3 mg/kg/infusion) or fentanyl (2 µg/kg/infusion) during daily 6-hr sessions, conducted 5 days per week for 4 weeks (Fig. 1A). The training context and parameters are described in detail in fig. S1, as they are important for interpreting dopamine signals. Briefly, upon the insertion of levers (both active and inactive), the training chamber was lit by a light above the active lever (light ON). Mice were allowed to move freely in the chamber, where they exhibited typical spontaneous behaviors, such as locomotion, grooming, and rearing. They were trained to press the active lever at a fixed ratio (FR1, then FR2, then FR4) to obtain intravenous infusion of cocaine or fentanyl.
Because the training was self-paced, the interval between lever insertion and drug infusion onset varied substantially across trials and animals (from less than 20 seconds to ten minutes, fig. S1). Once drug infusion started, each infusion (lasting ~2.8 seconds) was accompanied by a 19.5-second ON-OFF blinking of the light above the active lever. After this period, the levers were retracted, and the chamber lights were turned off. Following a 20.5-second dark interval, the next trial began with the lights turned on and the levers reinserted.

Consistent with prior studies of long-access drug IVSA (47–49), active lever presses and cocaine (n = 24, Fig. 1B) or fentanyl (n = 27, Fig. 1E) intake gradually escalated across the training sessions under the FR1 schedule, while inactive lever presses remained minimal. When the reinforcement schedule was increased to FR2 and subsequently FR4, active lever presses continued to escalate for both cocaine (reaching 997 ± 244 active presses per 6-hour session by the end of FR4 training) and fentanyl (averaging 799 ± 89 active presses per 6-hour session by the end of FR4 training). Drug intake for both cocaine and fentanyl remained high after an initial dip when switched to FR2 (Fig. 1B, 1E). Note that under the FR4 schedule, each lever press had a low probability of resulting in drug infusion. Notably, mice exhibited substantial variability in the latency to the first active lever press upon lever insertion, inter-press intervals, and the latency from lever insertion to drug infusion in both cocaine and fentanyl IVSA (fig. S1). As controls, we also trained a separate group of mice (n = 8) to self-administer saline under a similar FR1 to FR4 schedule. Interestingly, these mice also pressed the active lever slightly more than the inactive lever for saline infusions (fig. S2). However, their discrimination between active and inactive levers was significantly lower than that of mice self-administering cocaine or fentanyl (fig. S2).

Following the 4-week extended training period, mice underwent at least three 3-hour regular cocaine or fentanyl IVSA sessions under the FR4 schedule (baseline sessions) before they were subjected to three 3-hour punishment sessions (Fig. 1C, 1F). During the punishment sessions, each drug infusion (triggered by the 4th active lever press) was paired with a mild foot shock (0.2 mA, 0.5 second). As expected, punishment significantly reduced both the average number of active lever presses and the average intake of cocaine and fentanyl (Fig. 1D, 1G). However, there was clear individual variability in punishment sensitivity, with some mice continuing to press the active lever for drug infusions despite receiving foot shocks.

To characterize individual differences in drug taking and punishment responsiveness, mice were categorized as high or low drug-taking (Fig. 2A-B and 2H-I, orange vs. light blue) and as high or low punishment-resistant (Fig. 2D-E and 2K-L, red vs. dark blue), based on their normalized intake of cocaine or fentanyl during the baseline and punishment sessions, respectively (see Methods). Mice whose drug intake fell within ± 10% of the group mean were left unclassified (grey samples in Fig. 2).
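The grouping rule described above can be illustrated with a minimal Python sketch. It assumes that "within ± 10% of the group mean" means the interval from 0.9 to 1.1 times the mean; the exact normalization is given in the paper's Methods and may differ. The function name `classify_by_intake`, the mouse labels, and the intake values are hypothetical and are not data from the study.

```python
from statistics import mean

def classify_by_intake(intake, band=0.10):
    """Label each mouse 'high', 'low', or 'unclassified' relative to the group mean.

    Assumption: the unclassified band is [0.9 * mean, 1.1 * mean];
    the paper's Methods may define the normalization differently.
    """
    group_mean = mean(intake.values())
    lower, upper = group_mean * (1 - band), group_mean * (1 + band)
    labels = {}
    for mouse, value in intake.items():
        if value > upper:
            labels[mouse] = "high"
        elif value < lower:
            labels[mouse] = "low"
        else:
            labels[mouse] = "unclassified"
    return labels

# Made-up normalized intakes for five hypothetical mice (not data from the study).
example_intake = {"m1": 1.6, "m2": 0.5, "m3": 1.02, "m4": 0.7, "m5": 1.18}
labels = classify_by_intake(example_intake)
```

The same function could be applied separately to baseline intake (drug-taking groups) and punished intake (punishment-resistance groups), which is how the text describes the two classifications.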
We compared lever pressing between low- and high-taking groups and observed similar behavioral patterns in the high cocaine-taking (n = 8) and high fentanyl-taking (n = 11) mice. High drug-takers showed significantly more active lever presses per infusion than low drug-takers (Fig. 2C, 2J; 6.3 ± 0.27 vs 5.0 ± 0.08 presses for cocaine; 11.1 ± 1.6 vs 5.8 ± 0.20 presses for fentanyl), indicating that these mice tended to make more futile lever presses (i.e. presses that did not result in additional infusions during the 19.5-second light-blinking period). In addition, high drug-taking mice for both cocaine and fentanyl displayed significantly shorter latencies to the first active lever press following lever insertion (33.6 ± 3.85 vs 105.8 ± 7.75 seconds for cocaine; 39.3 ± 4.45 vs 122.7 ± 12.74 seconds for fentanyl) and shorter inter-press intervals between active lever presses (9.8 ± 0.82 vs 17.9 ± 2.6 seconds for cocaine; 7.9 ± 1.13 vs 23.7 ± 1.69 seconds for fentanyl) compared with the low drug-taking groups. These results indicate that high drug-takers tended to respond more rapidly and persistently to the active lever (Fig. 2C, 2J; also see representative examples in fig. S1).

Regarding punishment resistance (Fig. 2D-E and 2K-L), we observed opposite trends in punished drug intake between the cocaine- and fentanyl-taking groups. High punishment-resistant mice for cocaine (n = 7) tended to decrease their drug intake across the three punishment sessions, whereas high punishment-resistant mice for fentanyl (n = 9) tended to increase their intake over the same punishment sessions (Fig. 2E, 2L), suggesting that fentanyl was more effective at promoting punishment resistance. We next examined whether individuals exhibited correlations between their baseline and punished drug intake.
On average, high drug-takers were not significantly more resistant to punishment than low drug-takers under either the cocaine or fentanyl conditions (Fig. 2F, 2M). Likewise, high and low punishment-resistant mice exhibited comparable baseline drug intake (Fig. 2F, 2M). Indeed, the high vs. low drug-taking groups and the high vs. low punishment-resistant groups overlapped only partially (Fig. 2G, 2N). Thus, within the current experimental paradigm, punishment resistance does not correlate with the baseline level of drug taking, suggesting that different underlying mechanisms modulate these two phenomena.

Dopamine signatures of excessive drug-taking behavior

Dopamine (DA) plays a crucial role in the reinforcing effects of both cocaine and opioids, but how DA signals relate to individual differences in drug-taking behaviors remains unclear. To examine DA dynamics in the cocaine- and fentanyl-IVSA mice described above, we expressed a genetically encoded DA sensor (AAV2/5-hSyn-GRAB_DA2m) (50, 51) in the dorsal medial shell of the NAc and implanted an optic fiber to collect fluorescent signals (fig. S3, Fig. 3A and 3J). Using rotary fiber photometry (FP), which is compatible with IVSA in freely behaving mice, we monitored DA dynamics during cocaine and fentanyl IVSA in fully trained mice across 3-hour baseline testing sessions under the FR4 schedule (timeline shown in Fig. 1A).
To minimize photobleaching from continuous long-duration recording, we performed three cycles of 30-minute rotary FP, each separated by a 30-minute interval (i.e., three episodes of 30-minute recording per session), and concatenated them for analysis.

Examining the raw and z-scored FP traces across different trial epochs (see representative examples in Fig. 3B and 3K), we found that in both cocaine and fentanyl IVSA, DA sensor signals were relatively low prior to drug infusion. At the onset of infusion, DA signals increased, with multiple peaks (referred to as DA transients) that occurred during the drug infusion, the light-blinking period, and the lever-retraction/light OFF (inter-trial interval) phases. Once the light was turned back ON and the levers were reinserted to initiate the next trial, DA signals declined. Notably, individual DA transients were wider in cocaine IVSA than in fentanyl IVSA. This is likely due to cocaine's blockade of the DA reuptake transporter, which slows the clearance of extracellular DA and results in a significantly slower decay of DA signals (decay slope: -0.54 ± 0.04 vs -1.28 ± 0.05, fig. S4). By contrast, no consistent DA signals were observed at the time of active lever press (fig. S5), likely reflecting the fact that at FR4, each lever press has a low probability of resulting in drug reward (also see computational modeling below).

We computed the time course of z-scored DA signals for each animal (see Methods) and plotted averaged z-scores of all trials aligned to the onset of drug infusion, spanning from 10 seconds before infusion to 10 seconds after the initiation of the next trial (n = 24 mice for cocaine; n = 27 mice for fentanyl, Fig. 3C, 3L).
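As a rough illustration of the event-aligned averaging described above, the following Python sketch z-scores a trace and averages fixed windows around each infusion onset. It is a sketch under stated assumptions, not the authors' pipeline: the trace is normalized against its own mean and standard deviation (the actual normalization is specified in the Methods), windows are expressed in samples rather than seconds, and the function names `zscore` and `peri_event_average` and the toy trace are hypothetical.

```python
from statistics import mean, pstdev

def zscore(trace):
    """Z-score a fluorescence trace against its own mean and SD (an assumed normalization)."""
    mu, sd = mean(trace), pstdev(trace)
    return [(x - mu) / sd for x in trace]

def peri_event_average(trace, event_idx, pre, post):
    """Average segments spanning `pre` samples before to `post` samples after each event."""
    segments = [
        trace[i - pre : i + post]
        for i in event_idx
        if i - pre >= 0 and i + post <= len(trace)
    ]
    if not segments:
        raise ValueError("no event has a complete window inside the trace")
    # zip(*segments) groups samples that share the same offset from the event.
    return [mean(samples) for samples in zip(*segments)]

# Toy trace: flat baseline with a step of 1.0 lasting 3 samples after each "infusion".
trace = [0.0] * 40
infusions = [10, 25]
for i in infusions:
    for k in range(3):
        trace[i + k] = 1.0

avg = peri_event_average(zscore(trace), infusions, pre=5, post=5)
```

In this toy case, the averaged trace is flat before the event index and elevated for the three samples that follow it, mirroring how an infusion-locked response would appear in a trial average.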
During cocaine IVSA, the averaged DA levels increased at the onset of infusion and remained elevated throughout the light-blinking cue period and the inter-trial interval (lights OFF period) (Fig. 3C). Similarly, during fentanyl IVSA, DA levels increased at the onset of infusion, and the DA elevation displayed phasic, oscillatory responses during the light-blinking period, which then became a sustained elevation during the 20.5-second dark inter-trial interval (Fig. 3L). The oscillatory DA pattern was more clearly observed in the group-averaged signals for both fentanyl IVSA and cocaine IVSA during the light-blinking period (lower panels in Fig. 3C, 3L). Notably, the rise-and-fall DA signals peaked shortly after each OFF-ON transition of the blinking light (fig. S3D), indicating that the averaged oscillations reflected responses to the light cue. In both cocaine and fentanyl IVSA, DA levels dropped rapidly at the onset of the next trial when the light turned ON and levers were inserted (Fig. 3C and 3L). Importantly, the DA responses evoked by drug infusion and the blinking light were absent on the first day of training (e.g., the auto-shaping session) and in saline-IVSA mice (fig. S3E-F), indicating that these signals emerged through drug-cue associative learning. In contrast, the elevated DA signals during the light OFF inter-trial interval were also present on the first day of training and in saline-IVSA mice, possibly reflecting the mice's natural preference for darkness.

Given these observations, we focused our analyses on the infusion and light-blinking periods for comparing DA dynamics between high and low drug-taking groups. Representative DA signals of individual trials from a low and a high cocaine taker (Fig. 3D), along with group-averaged signals (Fig. 3E), revealed clear differences.
While DA responses varied from trial to trial, differing in both amplitude and the timing of peak z-score, low cocaine-taking mice consistently showed larger DA increases than high takers (Fig. 3D-E). This result is reminiscent of previous studies showing that rats exhibiting escalated cocaine intake across training displayed reduced DA responses to contingent cocaine infusion (27, 28). Similarly, low fentanyl-taking mice also consistently exhibited larger DA increases than high fentanyl takers, despite large trial-to-trial variations in DA signals (Fig. 3M-N). We separately quantified the DA responses at the infusion onset (0-1 second from infusion start, referred to as the “onset DA signal”) and during the subsequent cue period (1-19.5 seconds from infusion start, referred to as the “sustained DA signal”). Importantly, linear regression analysis revealed that both the onset and sustained DA signals were significantly negatively correlated with the number of cocaine (Fig. 3F and 3H) or fentanyl (Fig. 3O and 3Q) self-administrations. Group comparison further confirmed that high drug-taking mice showed significantly smaller onset (Fig. 3G and 3P) and sustained DA signals (Fig. 3I and 3R). The reduced DA signals in high drug-takers are unlikely to be caused by elevated baseline DA levels that might blunt additional drug-evoked responses. If this were the case, DA responses should decrease over the course of a session as more drug is consumed.
However, when we compared DA responses across the three recording episodes (the 1st, 2nd, and 3rd 30-minute blocks), we found no evidence of a progressive decline in evoked DA (fig. S6).

Altogether, despite the distinct DA dynamics evoked by cocaine and fentanyl IVSA, both drugs showed a consistent relationship between drug-taking behavior and DA responses: higher drug intake was associated with weaker DA responses to contingent drug infusion and drug-associated cues.

Dopamine signatures of punishment resistance

Compulsive drug taking despite adverse consequences is a hallmark of drug addiction; we therefore examined DA responses in the NAc medial shell during punishment sessions. Following the co-occurrence of drug infusion with a mild foot shock (0.2 mA, 0.5 s), average DA signals in both the cocaine and fentanyl groups showed a marked increase during the post-shock/infusion period and remained elevated throughout the light-blinking and inter-trial dark periods (Fig. 4A, 4I). Overall, cocaine infusion plus shock produced variable onset DA responses (0-1 second post-infusion), followed by a uniform sustained DA surge (1-19.5 seconds post-infusion) (Fig. 4A). Closer examination of individual mice in the cocaine group revealed heterogeneity in DA dynamics: in a subset of mice, the co-occurrence of cocaine infusion and shock elicited an initial dip or pause in DA followed by a robust, sustained rebound increase in DA, whereas in others, the infusion and shock triggered an immediate increase in DA that remained elevated (fig. S7, Fig. 4B). Visual inspection of DA dynamics in high and low punishment-resistant mice revealed that both groups had comparable onset DA responses, but the low-resistant mice exhibited significantly larger sustained DA signals (Fig. 4C-D).
Quantitative analyses confirmed that group-averaged infusion/shock-evoked onset DA was neither significantly different between high- and low-resistant mice, nor correlated with the amount of punished cocaine intake (Fig. 4E-F). However, at the individual level, 40% (6/15) of low-resistant mice displayed a suppression of onset DA, compared to only 14.3% (1/7) of high-resistant mice, a difference that did not reach significance (Fig. 4F, Fisher's exact test, p = 0.35). In contrast, there was a strong and significant negative correlation between the post-shock sustained DA levels and the number of punished cocaine infusions: high punishment-resistant mice had low sustained DA levels (Fig. 4G-H). As a control, co-occurrence of saline infusion and shock predominantly elicited a dip in onset DA responses (83.3%, 5/6), followed by a rebound (fig. S3G).

In the fentanyl group, the co-occurrence of fentanyl infusion with shock also elicited heterogeneous onset DA responses across mice (fig. S6, Fig. 4I, 4J). Visual inspection of the group-averaged DA dynamics showed that high punishment-resistant mice exhibited a sharp increase in onset DA, whereas the low-resistant group displayed a dip in the onset DA signal; however, both groups showed comparable sustained DA (Fig. 4J-L). Quantitative analyses confirmed that the fentanyl-infusion/shock-evoked onset DA response was significantly higher in high-resistant mice and was significantly positively correlated with punished fentanyl infusions (Fig. 4M-N).
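The Fisher's exact test reported above for the cocaine group can be checked from the counts alone. The sketch below is a plain-Python implementation of the standard two-sided test (summing the hypergeometric probabilities of all tables no more likely than the observed one); it is offered as an illustration, and the software the authors actually used is not stated in this section. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Integer numerator of the probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / total

# Cocaine group: onset-DA dip in 6 of 15 low-resistant vs 1 of 7 high-resistant mice.
p_cocaine = fisher_exact_two_sided(6, 9, 1, 6)
print(round(p_cocaine, 2))  # 0.35
```

Comparing integer numerators rather than floating-point probabilities avoids rounding ties when deciding which tables count as "at least as extreme".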
Moreover, 56% (9/16) of low-resistant mice showed a reduction in onset DA, while none (0/9) of the high-resistant mice exhibited such a dip (Fig. 4N, Fisher's exact test, p < 0.01). In contrast, the post-shock sustained DA signals were neither significantly different between high- and low-resistant mice, nor correlated with punished fentanyl intake (Fig. 4O-P). Among the nine punishment-resistant mice, we also recorded DA responses during subsequent punishment sessions in eight mice. At the group level, these mice showed a trend toward reduced sustained DA signaling during subsequent punishment sessions, and five of eight exhibited a further increase in onset DA responses (fig. S8).

Taken together, in cocaine self-administering mice, the sustained DA responses after shock negatively correlated with resistance to punishment, whereas in fentanyl self-administering mice, the onset DA responses to shock/drug infusion positively correlated with resistance to punishment.

A computational model captures DA dynamics in the IVSA paradigm across drugs and conditions

A long-standing theory of NAc DA activity posits that phasic DA signals encode the temporal-difference reward prediction error (TD-RPE), i.e. the mismatch between the expected values of temporally adjacent states (52–54). In the fentanyl group, punishment-resistant mice displayed a sharp increase in DA at the onset of shock and fentanyl infusion (Fig. 4L), a pattern reminiscent of the positive RPE signal observed in Pavlovian conditioning. However, it is unclear whether the existing TD-RPE model of DA signals, which is based on short-timescale classical conditioning, could model the substantial variability in IVSA behaviors (fig. S1) and explain the highly heterogeneous onset and sustained DA dynamics during both cocaine and fentanyl IVSA and punishment sessions (Fig. 3 and Fig. 4).
To address these questions, we modeled each animal as an Actor-Critic TD-learning agent performing a self-paced drug self-administration task (Fig. 5-6). One key aspect of our model is that it represents the distinct epochs within IVSA trials as internal states, $S(t)$. The agent learns the state value, $V(S(t))$, and action value, $M_{a,S(t)}$, to maximize the expected total future reward using the TD-RPE signal, $\delta(t)$. Thus, despite the low predictivity of individual lever presses for drug reward at FR4, the agent learns that pressing the lever is the best policy (see Methods). We used the model-derived $\delta(t)$ to generate $\tilde{\delta}(t)$ as the simulated DA signal by clipping negative values at -0.5 and adding a decay tail to $\delta(t)$. We then compared $\tilde{\delta}(t)$ with the experimentally observed NAc DA dynamics and analyzed the correlations between $\tilde{\delta}(t)$ and drug taking or punishment resistance.

We first modeled the FR4 drug-taking sessions. Each trial was modeled as inducing three discrete internal states, $S_0$, $S_1$, and $S_2$, defined by their distinct sensory contexts (Fig. 5). $S_0$ denotes the period from lever insertion and light ON to the moment of drug infusion, a phase that typically occupies most of the trial and can last from seconds to minutes. $S_1$ denotes the drug infusion and associated light-blinking period, whereas $S_2$ corresponds to the dark inter-trial interval (ITI). For simplicity, the external sensory stimuli associated with $S_1$ and $S_2$ were each modeled as lasting 20 timesteps ($t$). We further assumed that a drug reward $r_d$ becomes perceptible 10 timesteps after drug infusion and persists to the end of the trial. A critical novel aspect of the model is that it incorporates uncertainty in the agent's estimate of the current state $S(t)$.
For instance, the agent may fail to notice the first few blinks of the light cue, leading to temporal variation in its estimate of when the light-blink state begins, or may confuse states that share similar sensory features (e.g. a blink-ON moment in $S_1$ may resemble the light-on period in $S_0$).

During self-administration, the learned value of state $S_1$, $V(S_1)$, is approximately constant across individual agents since $S_1$ has a fixed reward contingency by task design. In contrast, state $S_0$ acquires predictive value from drug infusions in preceding trials, with different agents learning different $V(S_0)$. Critically, the strength of the $S_0 \to S_1$ contingency (red arrow in Fig. 5A) depends on the learned action value $M_{a,S_0}$, which determines both the probability and timescale of lever presses (see Methods). High drug takers, which select the lever-press action more frequently and with shorter inter-press intervals, learn that sufficient lever presses in $S_0$ reliably lead to $S_1$ and drug reward. As a result, they acquire a high expected $V(S_0)$ that approaches $V(S_1)$, yielding a small prediction error at the $S_0 \to S_1$ transition (because $\delta_{0\to1} \equiv V(S_1) - V(S_0)$). In contrast, low drug takers are less certain that lever pressing drives the $S_0 \to S_1$ transition. Therefore, compared to high takers, they select the lever-press action less frequently, with longer inter-press intervals, and maintain $V(S_0) \ll V(S_1)$, resulting in a larger $\delta_{0\to1}$ upon entry into $S_1$.
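The reasoning above, together with the clipping and decay-tail step described earlier for turning the TD error into a simulated DA signal, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model from the paper: the state values (0.2, 0.9, 1.0), the recursive exponential filter, the decay factor 0.8, and the function names `transition_rpe` and `simulated_da` are all hypothetical; only the definition of the transition error as V(S1) - V(S0) and the clipping floor of -0.5 come from the text.

```python
def transition_rpe(v_s0, v_s1):
    """TD error at the S0 -> S1 transition, delta = V(S1) - V(S0), as defined in the text."""
    return v_s1 - v_s0

def simulated_da(delta, floor=-0.5, decay=0.8):
    """Turn a TD-error sequence into a DA-like trace.

    Negative values are clipped at `floor` (-0.5 in the text), and each sample
    leaves an exponentially decaying tail. The recursive filter and the value
    of `decay` are illustrative assumptions, not the paper's parameters.
    """
    trace, tail = [], 0.0
    for d in delta:
        tail = decay * tail + max(d, floor)
        trace.append(tail)
    return trace

V_S1 = 1.0  # assumed equal across agents, since S1 has a fixed reward contingency
low_taker = transition_rpe(v_s0=0.2, v_s1=V_S1)   # V(S0) well below V(S1)
high_taker = transition_rpe(v_s0=0.9, v_s1=V_S1)  # V(S0) close to V(S1)

# A single RPE at the first timestep of S1, followed by zeros.
low_trace = simulated_da([low_taker] + [0.0] * 5)
high_trace = simulated_da([high_taker] + [0.0] * 5)
```

With these made-up values the low taker's transition error (0.8) is larger than the high taker's (about 0.1), and the difference carries through the decaying tail, which is the qualitative pattern the text attributes to low versus high drug takers.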
Altogether, $\delta_{0\to1}$ at the onset of drug infusion (timestep 1 in $S_1$) is negatively correlated with the amount of drug taking (Fig. 5B-C).

Importantly, $\delta_{0\to1}$ also contributes to $\delta(t)$ during the light-blinking period of $S_1$ because of the animals’ uncertainty about state transitions. Attentional lapses (modeled as a jitter of ≈ 6 timesteps) in detecting the drug-associated cue (light blinking off) generate multiple cue-locked $\delta(t)$ peaks in the initial phase of $S_1$. In addition, $S_1 \to S_0$ confusion (light blinking on) further produces cue-locked oscillations in $\delta(t)$ when averaged across trials (Fig. 5F, see Methods). Moreover, the delayed perception of drug reward, $r_d(t)$, induces a slow rise of $\delta(t)$ starting ≈ 10 timesteps after infusion onset, which is superimposed on the oscillation of $\delta(t)$. Taken together, the average sustained $\delta(t)$ or $\tilde{\delta}(t)$ over the post-infusion phase of $S_1$ (timesteps 2-20) is determined by $\delta_{0\to1}$ and $r_d(t)$, and is therefore also negatively correlated with drug intake (Fig. 5G). Finally, we assumed that the effect of the drug reward extends into $S_2$. Upon the $S_2 \to S_0$ transition, although light on and lever insertion predict the next reward cycle, termination of the prolonged drug reward mainly produces a negative $\delta_{2\to0}$.

The simulated DA traces $\tilde{\delta}(t)$ for FR4 drug taking closely resemble the DA signals observed in the fentanyl IVSA experiments and recapitulate the negative correlations of both onset and sustained DA signals with fentanyl intake (Fig. 5F-G). When the decay time constant of $\tilde{\delta}(t)$ is increased fivefold to mimic cocaine-induced inhibition of DA reuptake, the exponentially filtered $\tilde{\delta}(t)$ traces similarly reproduce the DA dynamics observed during cocaine IVSA, as well as the negative correlations between DA signals and cocaine intake (Fig. 5D-E). Together, these results indicate that despite the complexity and temporally extended nature of the IVSA paradigm and the distinct pharmacological classes of the drugs (psychostimulant vs. opioid), contingent DA release in the NAc during these tasks can be uniformly explained as encoding TD-RPE.

Using the same Actor-Critic TD-learning framework, we next modeled the punishment sessions. We assumed that footshock induces an additional internal state cascade $S^c(t)$, comprising $S_0^c$, $S_1^c$, and $S_2^c$ (Fig. 6), which runs in parallel with the three task-related states $S(t)$. Here, $S_1^c$ denotes the immediate shock-alert state, and the $S_1^c \to S_2^c$ transition occurs stochastically as the agent settles into a shock-relieved “safety” state $S_2^c$. At the onset of the next trial (light on and lever insertion), the shock-related state returns to a baseline “danger” state $S_0^c$, which persists until drug infusion and the next shock. Accordingly, the agent undergoes a transition from $S_0^c$ to $S_1^c$ at the drug infusion/shock onset, followed by a stochastic transition from $S_1^c$ to $S_2^c$, corresponding to shock relief. In the experiments, a footshock is delivered at the $S_0^c \to S_1^c$ transition to induce a negative reward $r_c < 0$. In addition, we initialized $V(S_1^c) < 0$ to reflect the innate aversive valuation of the shock-alert state.
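The shock-state cascade described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the state labels `S0c`, `S1c`, and `S2c` mirror the text, but the per-timestep relief probability `p_relief`, the function name, and the trial layout are hypothetical and are not taken from the Methods.

```python
import random


def shock_state_trace(n_steps, infusion_step, p_relief, rng):
    """Return the shock-state label at each timestep of one punished trial.

    Assumptions (hypothetical, for illustration only): the trial starts in
    the baseline "danger" state, the footshock coincides with drug infusion
    at `infusion_step`, and relief occurs with probability `p_relief` on
    each later timestep.
    """
    states = []
    state = "S0c"  # baseline "danger" state from trial onset
    for t in range(n_steps):
        if t == infusion_step:
            state = "S1c"  # footshock with infusion: shock-alert state
        elif state == "S1c" and rng.random() < p_relief:
            state = "S2c"  # stochastic settling into the "safety" state
        states.append(state)
    return states


# One simulated trial with a seeded generator for reproducibility.
print(shock_state_trace(n_steps=12, infusion_step=3, p_relief=0.3,
                        rng=random.Random(0)))
```

Because the relief transition is sampled independently on each trial, its timing varies from trial to trial, which is the sense in which the $S_1^c \to S_2^c$ transition is stochastic in the description above.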
Given the task states $S(t)$ and shock states $S^c(t)$, we assume additive state-value and action-value components:

$$V\big(S(t), S^c(t)\big) = V\big(S(t)\big) + V\big(S^c(t)\big)$$

$$M_{a,(S,S^c)} = M_{a,S} + M_{a,S^c}$$

We modeled fentanyl and cocaine as having different effects on a threshold $\theta_c$ that governs whether shock stimuli are perceived as aversive. If the delivered shock level falls below $\theta_c$, it contributes to sensory salience, whereas if it exceeds $\theta_c$, it contributes to negative valence. Thus, $\theta_c$ modulates the agent’s learning rate (see Methods).

During cocaine IVSA with punishment, $V(S_1^c)$ is negative at the onset of drug infusion (Fig. 6B-C). Spontaneous $S_1^c \to S_2^c$ transitions during the post-shock period produce a large delayed $\delta(t)$ surge because $V(S_2^c) - V(S_1^c) > 0$. Together with the $\delta(t)$ derived from the delayed drug reward, the total $\tilde{\delta}(t)$ (which accounts for the cocaine-induced slow decay) exhibits a sustained increase during the post-shock light-blinking period. This pattern closely recapitulates the large and sustained DA signals observed experimentally (Fig. 6D). Because the shock-relief component of this sustained $\tilde{\delta}(t)$ reflects the agents’ shock sensitivity, it is negatively correlated with punishment resistance: individuals with low resistance exhibit more sustained $\tilde{\delta}(t)$, consistent with the experimental findings (Fig. 6D-E).

During fentanyl IVSA with punishment, $V(S_1^c)$ is negative in the early trials; in punishment-resistant agents, however, we modeled fentanyl as progressively suppressing shock sensitivity (increasing the threshold $\theta_c$). As a result, the initially aversive unconditional stimulus (US, footshock) gradually becomes a salient conditioned stimulus (CS+) predictive of fentanyl reward in these individuals. Thus, the model generates a strong positive $\delta(t)$ or $\tilde{\delta}(t)$ transient at the $S_0^c \to S_1^c$ transition in punishment-resistant fentanyl-taking agents but not in cocaine-taking agents (Fig. 6J, 6F), matching the different DA signals for the two drugs in experiments (Fig. 4D, L). This reversal of US to CS+ does not occur in punishment-sensitive individuals (Fig. 6J). These $\tilde{\delta}(t)$ dynamics recapitulate the positive correlation between onset DA responses and punishment-resistant fentanyl taking in experiments (Fig. 6H-I). Moreover, we plotted the number of drug infusions for high- vs. low-taking agents and high- vs. low-punishment-resistance agents from baseline FR4 sessions through the three consecutive punishment sessions (Fig. 6G and 6K), and found that agents in the model also mimic the drug-taking patterns observed in mice (Fig. 2B, 2E, 2I, 2L).

Our model also predicted a weaker but significant positive correlation of onset $\tilde{\delta}(t)$ with punished cocaine intake, as well as a negative and significant correlation of sustained $\tilde{\delta}(t)$ with punished fentanyl intake, neither of which was clearly observed experimentally. These discrepancies may reflect complexity in how different drugs modulate the computation of value and RPE that the model does not consider. Nevertheless, overall, the computational model captures the diversity of drug-taking and punishment-responsiveness behaviors, along with the associated DA dynamics across trial epochs, individuals, and drug classes. Importantly, it reveals a simple, unified role of NAc DA signals in encoding TD-RPE across different phases of addiction.
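As a recap of how the simulated DA traces in this section are derived from $\delta(t)$, the clipping and decay-tail step can be sketched as follows. The text specifies only that negative values are clipped at −0.5, that a decay tail is added, and that the decay time constant is increased fivefold for cocaine. The recursive exponential filter, the function name `simulated_da`, and the baseline time constant `tau=2.0` are assumptions made for illustration, not the authors' implementation.

```python
import math


def simulated_da(delta, tau=2.0, floor=-0.5):
    """Map a TD-RPE trace to a simulated DA trace (illustrative sketch).

    Each value is clipped from below at `floor`, then added to an
    exponentially decaying carry-over from earlier timesteps. `tau` is a
    hypothetical decay time constant in timesteps.
    """
    decay = math.exp(-1.0 / tau)
    trace, carry = [], 0.0
    for d in delta:
        carry = max(d, floor) + decay * carry
        trace.append(carry)
    return trace


delta = [0.0, 1.0, 0.0, 0.0, -2.0, 0.0]
fentanyl_like = simulated_da(delta, tau=2.0)
cocaine_like = simulated_da(delta, tau=10.0)  # fivefold longer decay
print([round(x, 3) for x in fentanyl_like])
print([round(x, 3) for x in cocaine_like])
```

With the longer time constant, the response to the same positive transient decays more slowly, so the cocaine-like trace stays elevated for more timesteps than the fentanyl-like trace.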

Discussion

Here, we used a genetically encoded DA sensor to characterize the dynamic patterns of DA release in the NAc medial shell in a mouse model of drug addiction, comparing DA responses to cocaine and fentanyl IVSA in high versus low drug takers, as well as in animals exhibiting high versus low punishment resistance (i.e., a model of compulsive drug use despite adverse consequences). We found that after extended training under an FR4 schedule, contingent cocaine infusions evoked a sustained increase in DA during the drug-associated cue period (i.e., blinking lights), whereas contingent fentanyl infusions elicited a large increase in DA at infusion onset that evolved into oscillations synchronized with the blinking lights. Similar oscillatory DA patterns could also be observed during cocaine IVSA after responses were averaged across trials and mice, although with smaller amplitudes. Critically, for both cocaine and fentanyl, an individual’s drug intake was consistently negatively correlated with DA responses: higher levels of drug intake were associated with lower evoked DA signals. With regard to punished drug taking, we observed distinct DA signatures associated with compulsivity in cocaine versus fentanyl taking. In cocaine IVSA mice, higher punishment resistance was associated with weaker sustained DA responses during the drug-associated cue period, whereas in fentanyl IVSA mice, higher punishment resistance was associated with stronger onset DA responses at the co-occurrence of punishment and drug infusion. To account for these diverse DA dynamics across baseline and punishment sessions, across individuals, and across drug classes, we developed an Actor-Critic TD-learning-based computational framework that incorporates internal states, the agent’s uncertainty, and drug-specific effects.
This model captures the observed behavioral diversity and supports a unified interpretation of NAc DA responses as encoding temporal-difference reward prediction errors (TD-RPE) based on internal state estimation.

Many previous computational models of addiction (e.g., opponent-process theory, incentive salience sensitization theory, habit formation) focused on reproducing addiction behaviors rather than accounting for DA dynamics as a learning outcome. Our actor-critic model considers both behavior and DA dynamics. Moreover, unlike traditional TD-RPE models of classical conditioning, our framework operates on a timescale relevant to the self-administration task and explicitly incorporates both state and action values. We modeled how agents evaluate their states after learning the self-administration task at FR4. Our results reveal that the difficulty of the FR4 task schedule introduces significant sources of uncertainty that naturally give rise to complex dynamics in the RPE signal, which resemble the observed DA activity. A previous TD-RPE model proposed by Redish and colleagues (33, 34) relied on the assumption of an uncancellable RPE elicited by drugs upon infusion, which caused unbounded growth of the drug value and would consequently predict a positive correlation between DA signals and drug taking, which is inconsistent with experimental observations. In contrast, our model naturally recapitulates the negative correlation between DA signals and contingent drug intake. In our model, individual differences in DA response reflect differences in the learned contingency between task states and drug reward.

Could the finding that NAc DA universally encodes TD-RPE help reconcile the divergent views of DA function in addiction? Two influential and seemingly opposing frameworks have been proposed: the DA depletion hypothesis (55) and the incentive sensitization theory (IST) (39).
Preclinical studies have shown that escalation of cocaine self-administration in rodents is accompanied by reduced phasic DA signaling (27, 28, 43), and human imaging studies have also reported decreased striatal DA responses in individuals with cocaine use disorders (29). However, far less was known about DA signaling in opioid self-administration models. Our finding that excessive intake of both cocaine and fentanyl is associated with blunted DA responses to contingent drug infusion, together with prior studies, appears to be consistent with the hypo-dopamine hypothesis. By contrast, IST, which is also supported by substantial experimental evidence, posits that drug exposure produces a hyper-responsive DA system, leading to exaggerated phasic DA responses to drug-associated cues and contexts that drive heightened “wanting” to take drugs.

We believe these two theories are not necessarily contradictory and can be unified by the framework that phasic DA encodes TD-RPE, i.e., the difference between the expected values of temporally adjacent states. As TD-learning agents, animals and humans continuously assign expected values to their current states, with these values shaped by experience and learning history. In the IVSA paradigm, high drug takers learn that sufficient lever presses reliably result in drug infusion, motivating more frequent lever presses with shorter inter-press intervals.
Consequently, they acquire a high expected value for the lever-pressing state and generate relatively small DA/TD-RPE signals upon receipt of the actual drug infusion and its associated cues. In contrast, low drug takers are less certain that lever pressing will result in drug infusion and are therefore less motivated to press the lever, leading to longer inter-press intervals. Consequently, the actual drug infusion is more unexpected, resulting in a large DA/TD-RPE signal. Thus, under contingent conditions, akin to knowingly taking the drug and fully expecting its effect, more excessive drug taking is associated with lower DA responses. However, in humans with SUD, drug-associated cues can be encountered outside the drug-taking context and robustly cause craving. In such situations, drug cues may elicit a higher expected value, based on remembered drug reward, than the perceived value of the individual’s current state, thereby resulting in large DA/TD-RPE signals, consistent with IST. Although we did not directly assess DA responses to unexpected drug-associated cues outside the IVSA context, a recent study showed that DA release is indeed enhanced in response to non-contingent or unexpected cocaine-paired cues, but diminished when the same cues are encountered in a contingent, predictable context (28).

Interpreting DA as encoding TD-RPE also provides new insights into compulsive drug taking despite punishment. When drug use is associated with adverse consequences, individuals must decide whether to abstain from further drug taking or to endure punishment to continue pursuing the drug. Drug-induced reductions in sensitivity to punishment, deficits in inhibitory control, or impairments in punishment learning may bias this decision toward compulsive drug taking despite negative consequences, one of the most intractable features of addiction.
Using footshock as a punishment, we showed that in saline control mice, DA signals in the NAc medial shell were predominantly suppressed upon shock delivery (fig. S4), consistent with a strong negative valence of shock for drug-naive animals and with DA encoding a negative TD-RPE. Similarly, many punishment-sensitive mice also showed a suppression of the DA signal at the shock-infusion onset during both cocaine and fentanyl IVSA (Fig. 4E-F, M-N). In contrast, this shock-elicited DA dip was absent in all but one punishment-resistant animal, suggesting an impairment in encoding negative RPE. A prior study also found that cocaine exposure disrupted the pause in DA neuron firing in response to reward omission (56). We also observed a large-amplitude rebound in DA following shock, which we interpreted as a positive RPE reflecting “shock relief”. The rebound signal was present in punished saline- and cocaine-IVSA mice, as well as in punishment-sensitive fentanyl-taking mice. It is plausible that stronger relief signals may provide more robust negative feedback, reinforcing avoidance of the punished action. Consistent with this interpretation, the levels of sustained DA over the entire cue-light period (including shock-relief and drug-reward signals) were significantly negatively correlated with punished cocaine infusions, although no correlation was observed for punished fentanyl intake. Critically, in punishment-resistant fentanyl-taking mice, we observed a pronounced increase in DA during shock delivery, indicative of a positive TD-RPE.
In other words, compulsive fentanyl-taking mice may transform the aversive footshock into a highly salient sensory cue that strongly predicts drug reward. Altogether, these findings suggest that chronic drug use may disrupt normal negative TD-RPE signaling of DA neurons in the NAc, thereby promoting compulsive drug taking despite negative consequences.

Although our findings provide strong support for TD-RPE as a unifying framework for interpreting the NAc DA signaling underlying individual differences in excessive and compulsive drug taking, many important questions remain. We do not yet know how state value is computed, which neural mechanisms translate these valuations into drug-seeking behaviors (e.g., probability of lever pressing), or how the observed DA dynamics in turn further modulate neural plasticity and maladaptive behaviors. Moreover, it remains unclear how different classes of drugs differentially modulate punishment sensitivity, negative RPE signaling, and the “shock-relief” rebound of DA signals. Addressing these open questions will require targeted future experiments.

Limitations

Several limitations of the present study should be acknowledged. First, fiber photometry measures relative changes in DA release rather than absolute DA concentrations, and therefore cannot directly quantify baseline dopaminergic tone. Second, the sample size limits our ability to robustly assess sex differences in drug taking and compulsive behavior, an important factor that warrants dedicated investigation in future studies. Third, DA signaling within the nucleus accumbens is highly heterogeneous across subregions (57). While the present work focused on the dorsomedial shell, future studies employing more spatially resolved approaches will be necessary to systematically examine DA dynamics across distinct accumbens subregions and their contributions to addiction-related behaviors. Fourth, while our computational model incorporates multiple states and agent uncertainty, we did not consider the different circuit-level plasticity and maladaptive changes induced by chronic self-administration of different drugs. Furthermore, the drug-specific effect in our model is highly simplified and does not account for the myriad physiological and psychological differences in the effects of cocaine and fentanyl.

References

1. J. C. Anthony, L. A. Warner, R. C. Kessler, Comparative Epidemiology of Dependence on Tobacco, Alcohol, Controlled Substances, and Inhalants: Basic Findings From the National Comorbidity Survey. Exp. Clin. Psychopharmacol. 2, 244–268 (1994).
2. F. A. Wagner, J. C. Anthony, From First Drug Use to Drug Dependence: Developmental Periods of Risk for Dependence upon Marijuana, Cocaine, and Alcohol. Neuropsychopharmacology 26, 479–488 (2002).
3. M. J. Kreek, D. A. Nielsen, E. R. Butelman, K. S. LaForge, Genetic influences on impulsivity, risk taking, stress responsivity and vulnerability to drug abuse and addiction. Nat. Neurosci. 8, 1450–1457 (2005).
4. M. Venniro, M. L. Banks, M. Heilig, D. H. Epstein, Y. Shaham, Improving translation of animal models of addiction and relapse by reverse translation. Nat. Rev. Neurosci. 21, 625–643 (2020).
5. B. J. Everitt, T. W. Robbins, Drug Addiction: Updating Actions to Habits to Compulsions Ten Years On. Annu. Rev. Psychol. 67, 1–28 (2015).
6. O. George, G. F. Koob, Individual differences in the neuropsychopathology of addiction. Dialogues Clin. Neurosci. 19, 217–229 (2017).
7. B. T. Saunders, T. E. Robinson, Individual Variation in the Motivational Properties of Cocaine. Neuropsychopharmacology 36, 1668–1676 (2011).
8. B. T. Saunders, T. E. Robinson, A Cocaine Cue Acts as an Incentive Stimulus in Some but not Others: Implications for Addiction. Biol. Psychiatry 67, 730–736 (2010).
9. R. Bock, J. H. Shin, A. R. Kaplan, A. Dobi, E. Markey, P. F. Kramer, C. M. Gremel, C. H. Christensen, M. F. Adrover, V. A. Alvarez, Strengthening the accumbal indirect pathway promotes resilience to compulsive cocaine use. Nat. Neurosci. 16, 632–638 (2013).
10. D. Belin, A. C. Mar, J. W. Dalley, T. W. Robbins, B. J. Everitt, High Impulsivity Predicts the Switch to Compulsive Cocaine-Taking. Science 320, 1352–1355 (2008).
11. L. J. M. J. Vanderschuren, B. J. Everitt, Drug Seeking Becomes Compulsive After Prolonged Cocaine Self-Administration. Science 305, 1017–1019 (2004).
12. E. Domi, L. Xu, S. Toivainen, A. Nordeman, F. Gobbo, M. Venniro, Y. Shaham, R. O. Messing, E. Visser, M. C. van den Oever, L. Holm, E. Barbier, E. Augier, M. Heilig, A neural substrate of compulsive alcohol use. Sci. Adv. 7, eabg9045 (2021).
13. V. Deroche-Gamonet, D. Belin, P. V. Piazza, Evidence for Addiction-like Behavior in the Rat. Science 305, 1014–1017 (2004).
14. Y. Li, L. D. Simmler, R. V. Zessen, J. Flakowski, J.-X. Wan, F. Deng, Y.-L. Li, K. M. Nautiyal, V. Pascoli, C. Lüscher, Synaptic mechanism underlying serotonin modulation of transition to cocaine addiction. Science 373, 1252–1256 (2021).
15. G. de Guglielmo, L. Carrette, M. Kallupi, M. Brennan, B. Boomhower, L. Maturin, D. Conlisk, S. Sedighim, L. Tieu, M. J. Fannon, A. R. Martinez, N. Velarde, D. Othman, B. Sichel, J. Ramborger, J. Lau, J. Kononoff, A. Kimbrough, S. Simpson, L. C. Smith, K. Shankar, S. Bonnet-Zahedi, E. A. Sneddon, A. Avelar, S. L. Plasil, J. Mosquera, C. Crook, L. Chun, A. Vang, K. K. Milan, P. Schweitzer, B. Lin, B. Peng, A. S. Chitre, O. Polesskaya, L. C. S. Woods, A. A. Palmer, O. George, Large-scale characterization of cocaine addiction-like behaviors reveals that escalation of intake, aversion-resistant responding, and breaking-points are highly correlated measures of the same construct. eLife 12, RP90422 (2024).
16. N. D. Volkow, J. S. Fowler, G. J. Wang, R. Baler, F. Telang, Imaging dopamine’s role in drug abuse and addiction. Neuropharmacology 56, 3–8 (2009).
17. N. D. Volkow, M. Michaelides, R. Baler, The Neuroscience of Drug Reward and Addiction. Physiol. Rev. 99, 2115–2140 (2019).
18. G. D. Chiara, A. Imperato, Drugs abused by humans preferentially increase synaptic dopamine concentrations in the mesolimbic system of freely moving rats. Proc. Natl. Acad. Sci. 85, 5274–5278 (1988).
19. J. Corre, R. van Zessen, M. Loureiro, T. Patriarchi, L. Tian, V. Pascoli, C. Lüscher, Dopamine neurons projecting to medial shell of the nucleus accumbens drive heroin reinforcement. eLife 7, e39945 (2018).
20. C. Lüscher, R. C. Malenka, Drug-Evoked Synaptic Plasticity in Addiction: From Molecular Changes to Circuit Remodeling. Neuron 69, 650–663 (2011).
21. C. Lüscher, Drug-Evoked Synaptic Plasticity Causing Addictive Behavior. J. Neurosci. 33, 17641–17646 (2013).
22. N. D. Volkow, M. Morales, The Brain on Drugs: From Reward to Addiction. Cell 162, 712–725 (2015).
23. P. E. M. Phillips, G. D. Stuber, M. L. A. V. Heien, R. M. Wightman, R. M. Carelli, Subsecond dopamine release promotes cocaine seeking. Nature 422, 614–618 (2003).
24. C. L. Poisson, L. Engel, B. T. Saunders, Dopamine Circuit Mechanisms of Addiction-Like Behaviors. Front. Neural Circuits 15, 752420 (2021).
25. G. D. Stuber, M. F. Roitman, P. E. M. Phillips, R. M. Carelli, R. M. Wightman, Rapid Dopamine Signaling in the Nucleus Accumbens during Contingent and Noncontingent Cocaine Administration. Neuropsychopharmacology 30, 853–863 (2005).
26. K. F. Casey, M. V. Cherkasova, K. Larcher, A. C. Evans, G. B. Baker, A. Dagher, C. Benkelfat, M. Leyton, Individual Differences in Frontal Cortical Thickness Correlate with the d-Amphetamine-Induced Striatal Dopamine Response in Humans. J. Neurosci. 33, 15285–15294 (2013).
27. I. Willuhn, L. M. Burgeno, P. A. Groblewski, P. E. M. Phillips, Excessive cocaine use results from decreased phasic dopamine signaling in the striatum. Nat. Neurosci. 17, 704–709 (2014).
28. L. M. Burgeno, R. D. Farero, N. L. Murray, M. C. Panayi, J. S. Steger, M. E. Soden, S. B. Evans, S. G. Sandberg, I. Willuhn, L. S. Zweifel, P. E. M. Phillips, Cocaine seeking and consumption are oppositely regulated by mesolimbic dopamine in male rats. Nat. Commun. 16, 9954 (2025).
29. N. D. Volkow, G.-J. Wang, J. S. Fowler, J. Logan, S. J. Gatley, R. Hitzemann, A. D. Chen, S. L. Dewey, N. Pappas, Decreased striatal dopaminergic responsiveness in detoxified cocaine-dependent subjects. Nature 386, 830–833 (1997).
30. M. Á. Luján, B. L. Oliver, R. Young-Morrison, S. A. Engi, L.-Y. Zhang, J. M. Wenzel, Y. Li, N. E. Zlebnik, J. F. Cheer, A multivariate regressor of patterned dopamine release predicts relapse to cocaine. Cell Rep. 42, 112553 (2023).
31. M. Leyton, What’s deficient in reward deficiency? J. Psychiatry Neurosci. 39, 291–293 (2014).
32. W. Schultz, P. Dayan, P. R. Montague, A Neural Substrate of Prediction and Reward. Science 275, 1593–1599 (1997).
33. A. D. Redish, Addiction as a Computational Process Gone Awry. Science 306, 1944–1947 (2004).
34. R. Keiflin, P. H. Janak, Dopamine Prediction Errors in Reward Learning and Addiction: From Theory to Neural Circuitry. Neuron 88, 247–263 (2015).
35. M. Watabe-Uchida, N. Eshel, N. Uchida, Neural Circuitry of Reward Prediction Error. Annu. Rev. Neurosci. 40, 1–22 (2016).
36. G. F. Koob, M. L. Moal, Drug Addiction, Dysregulation of Reward, and Allostasis. Neuropsychopharmacology 24, 97–129 (2001).
37. G. F. Koob, M. L. Moal, Neurobiological mechanisms for opponent motivational processes in addiction. Philos. Trans. R. Soc. B: Biol. Sci. 363, 3113–3123 (2008).
38. T. E. Robinson, K. C. Berridge, The neural basis of drug craving: An incentive-sensitization theory of addiction. Brain Res. Rev. 18, 247–291 (1993).
39. T. E. Robinson, K. C. Berridge, The Incentive-Sensitization Theory of Addiction 30 Years On. Annu. Rev. Psychol. 76, 29–58 (2025).
40. K. C. Berridge, T. E. Robinson, Liking, Wanting, and the Incentive-Sensitization Theory of Addiction. Am. Psychol. 71, 670–679 (2016).
41. B. J. Aragona, N. A. Cleaveland, G. D. Stuber, J. J. Day, R. M. Carelli, R. M. Wightman, Preferential enhancement of dopamine transmission within the nucleus accumbens shell by cocaine is attributable to a direct increase in phasic dopamine release events. J. Neurosci. 28, 8821–31 (2008).
42. C. A. Owesson‐White, J. Ariansen, G. D. Stuber, N. A. Cleaveland, J. F. Cheer, R. M. Wightman, R. M. Carelli, Neural encoding of cocaine‐seeking behavior is coincident with phasic dopamine release in the accumbens core and shell. Eur. J. Neurosci. 30, 1117–1127 (2009).
43. I. Willuhn, L. M. Burgeno, B. J. Everitt, P. E. M. Phillips, Hierarchical recruitment of phasic dopamine signaling in the striatum during the progression of cocaine use. Proc. Natl. Acad. Sci. 109, 20703–20708 (2012).
44. M. Garnett, A. Miniño, M. Joyce, A. Driscoll, C. Valenzuela, Drug Overdose Deaths in the United States, 2003–2023. NCHS data brief, 1 (2024).
45. F. E. Pontieri, G. Tanda, G. D. Chiara, Intravenous cocaine, morphine, and amphetamine preferentially increase extracellular dopamine in the “shell” as compared with the “core” of the rat nucleus accumbens. Proc. Natl. Acad. Sci. 92, 12304–12308 (1995).
46. R. Ito, J. W. Dalley, S. R. Howes, T. W. Robbins, B. J. Everitt, Dissociation in Conditioned Dopamine Release in the Nucleus Accumbens Core and Shell in Response to Cocaine Cues and during Cocaine-Seeking Behavior in Rats. J. Neurosci. 20, 7489–7495 (2000).
47. S. H. Ahmed, G. F. Koob, Transition from Moderate to Excessive Drug Intake: Change in Hedonic Set Point. Science 282, 298–300 (1998).
48. C. L. Wade, L. F. Vendruscolo, J. E. Schlosburg, D. O. Hernandez, G. F. Koob, Compulsive-Like Responding for Opioid Analgesics in Rats with Extended Access. Neuropsychopharmacology 40, 421–428 (2015).
49. S. H. Ahmed, J. R. Walker, G. F. Koob, Persistent Increase in the Motivation to Take Heroin in Rats with a History of Drug Escalation. Neuropsychopharmacology 22, 413–421 (2000).
50. F. Sun, J. Zeng, M. Jing, J. Zhou, J. Feng, S. F. Owen, Y. Luo, F. Li, H. Wang, T. Yamaguchi, Z. Yong, Y. Gao, W. Peng, L. Wang, S. Zhang, J. Du, D. Lin, M. Xu, A. C. Kreitzer, G. Cui, Y. Li, A Genetically Encoded Fluorescent Sensor Enables Rapid and Specific Detection of Dopamine in Flies, Fish, and Mice. Cell 174, 481-496.e19 (2018).
51. F. Sun, J. Zhou, B. Dai, T. Qian, J. Zeng, X. Li, Y. Zhuo, Y. Zhang, Y. Wang, C. Qian, K. Tan, J. Feng, H. Dong, D. Lin, G. Cui, Y. Li, Next-generation GRAB sensors for monitoring dopaminergic activity in vivo. Nat. Methods 17, 1156–1166 (2020).
52. N. Eshel, J. Tian, M. Bukwich, N. Uchida, Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).
53. R. Amo, S. Matias, A. Yamanaka, K. F. Tanaka, N. Uchida, M. Watabe-Uchida, A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning. Nat. Neurosci. 25, 1082–1092 (2022).
54. L. Qian, M. Burrell, J. A. Hennig, S. Matias, V. N. Murthy, S. J. Gershman, N. Uchida, Prospective contingency explains behavior and dopamine signals during associative learning. Nat. Neurosci., 1–13 (2025).
55. C. A. Dackis, M. S. Gold, New concepts in cocaine addiction: The dopamine depletion hypothesis. Neurosci. Biobehav. Rev. 9, 469–477 (1985).
56. Y. K. Takahashi, T. A. Stalnaker, Y. Marrero-Garcia, R. M. Rada, G. Schoenbaum, Expectancy-Related Changes in Dopaminergic Error Signals Are Impaired by Cocaine Self-Administration. Neuron 101, 294-306.e3 (2019).
57. J. W. de Jong, S. A. Afjei, I. P. Dorocic, J. R. Peck, C. Liu, C. K. Kim, L. Tian, K. Deisseroth, S. Lammel, A Neural Circuit Mechanism for Encoding Aversive Stimuli in the Mesolimbic Dopamine System. Neuron 101, 133-151.e7 (2019).

Acknowledgments: We thank members of the Wang Lab for insightful discussions of this study. We are grateful to Priyadarshini Dutta for assistance with mouse colony maintenance. This work was supported by Boston Children’s Hospital Viral Core, which is supported by NIH5P30EY012196.

Funding:
Addiction Initiative at McGovern Institute for Brain Research (FW)
The Paul E. and Lilah Newton Brain Science Award (FW)
K. Lisa Yang Integrative Computational Neuroscience (ICoN) Center fellowship (HZ)

Author contributions: FW and KC conceived the study and designed the experiments. KC, GS, WX, CW and AS performed all experiments. KC analyzed the data. HZ and IF conceptualized and developed the computational model. KC, HZ, FW wrote the manuscript with input from IF.

Competing interests: Authors declare that they have no competing interests.

Data, code, and materials availability: All data and code used in the analysis are available from the corresponding authors upon request.

Supplementary Materials

Materials and methods

Figs. S1 to S8

Fig. 1. Cocaine and fentanyl intravenous self-administration (IVSA) behaviors. (A) Schematic of the IVSA setup and experimental timeline for IVSA training and testing. (B) and (C) Lever presses (left) and cocaine infusions (right) during the training (B) and testing (C) phases of cocaine IVSA (n = 24 mice). (D) Average active lever presses (left) and cocaine infusions (right) during baseline and punishment sessions of the cocaine IVSA testing phase (paired t-test). (E-G) Same analyses as in (B-D), but for fentanyl (n = 27 mice). Each gray line in (D) and (G) represents an individual mouse. Red and black lines show group means. Error bars indicate mean ± standard error of the mean (SEM). ** represents p < 0.01; *** represents p < 0.001.

Fig. 2. Individual differences in drug taking and punishment resistance. (A) Normalized distributions of cocaine infusions (i.e., normalized to the mean) during baseline and punishment. Mice were classified as high (cocaine: n = 8) or low (cocaine: n = 8) drug taking based on whether their baseline infusions were above (orange) or below (light blue) the mean by 10%. (B) Cocaine infusions during the testing sessions, grouped by high vs. low drug taking. (C) Measures of active lever presses per infusion (left), latency from lever insertion to the active lever press (middle), and inter-press interval (right) during the baseline cocaine IVSA (Welch’s t-test). (D) Similar scatter plots as in panel (A). Mice were classified as high (cocaine: n = 7) or low (cocaine: n = 15) punishment-resistant based on whether their punished cocaine infusions were above (red) or below (dark blue) the mean by 10%. (E) Cocaine infusions during the testing sessions, grouped by high vs. low punishment resistance. (F) Cocaine infusions during punishment (left) and baseline (right) sessions, grouped by drug-taking (left) or punishment-resistance (right) categories (Welch’s t-test). (G) Overlap between high/low drug-taking and high/low punishment-resistant groups for cocaine IVSA. (H-N) Same analyses as in (A-G), but for fentanyl. Error bars indicate mean ± standard error of the mean (SEM). ** represents p < 0.01; *** represents p < 0.001; n.s. represents not significant.

Fig. 3. Dopamine dynamics during cocaine and fentanyl self-administration after extended training. (A) Images showing GRAB_DA2m (green) expression and fiber tracks (red dashed line) in the NAc medial shell from a cocaine-IVSA example. (B) Top: example raw photometry traces (465 nm [green] and 405 nm [gray]) with behavioral events overlaid during cocaine IVSA. Bottom: z-score of DA signals after preprocessing the raw data; gray shading indicates cue light on/off. (C) Group DA responses to contingent cocaine infusions (n = 24). Each row of the colormap represents one mouse, sorted by the number of infusions (top = most). The red dashed lines at time 0 represent the start of drug infusion. The gray dashed lines represent lever retraction and insertion. (D) Example DA responses to contingent cocaine infusions from a low (left) and a high (right) drug-taking mouse. (E) Time courses of DA responses to contingent cocaine infusions for high (orange) and low (light blue) drug-taking mice. (F) Linear regression showing the relationship between onset DA responses and number of baseline infusions of cocaine. (G) Bar graphs quantifying onset DA responses to cocaine infusions in high vs. low drug-taking mice (Welch’s t-test). (H) Linear regression showing the relationship between sustained DA responses and number of baseline infusions of cocaine. (I) Bar graphs quantifying sustained DA responses to cocaine infusions in high vs. low drug-taking mice (Welch’s t-test). Orange dots represent high drug-taking mice, light blue dots represent low drug-taking mice and gray dots represent mice not classified. (J-R) Same analyses as in (A-I), but for fentanyl (n = 27). Error bars indicate mean ± standard error of the mean (SEM). * represents p < 0.05; ** represents p < 0.01; *** represents p < 0.001.

Fig. 4. Dopamine dynamics during punished drug taking. (A) Group DA responses to punished cocaine infusions (n = 24 mice). Each row of the colormap represents one mouse, sorted by the number of punished infusions (top = most). The red dashed lines at time 0 represent the co-occurrence of drug IVSA and punishment. The gray dashed lines represent lever retraction and insertion. (B) Example DA responses to punished cocaine infusions from a low (left) and a high (right) punishment-resistant mouse. (C) Time courses of DA responses to punished cocaine infusions for high- (red) and low-resistant (blue) mice. (D) Same as (C), but zoomed in to highlight the onset response (0-1 s post-infusion). (E) Linear regression showing the relationship between onset DA responses and number of punished infusions of cocaine. (F) Bar graphs quantifying onset DA responses to punished cocaine infusions in high- vs. low-resistant mice. (G) Linear regression showing the relationship between sustained DA responses and the number of punished infusions of cocaine. (H) Bar graphs quantifying sustained DA responses to punished cocaine infusions in high- vs. low-resistant mice. Red dots represent punishment-resistant mice, blue dots represent punishment-sensitive mice and gray dots represent mice not classified. (I-P) Same analyses as in (A-H), but for fentanyl. Error bars indicate mean ± standard error of the mean (SEM). *** represents p < 0.001; n.s. represents not significant.

Fig. 5. Actor-Critic TD learning model can explain DA dynamics and the negative correlation between DA signals and drug intake during drug IVSA tasks. (A) Schematic of the model, highlighting the three discrete internal states (S0, S1, S2) and their transitions. (B) Simulation results showing the internal state, state value, change of value of temporally adjacent states, drug reward, TD error δ(t) and simulated DA δ̃(t), at each timestep in simulated trials for a low-taking (left) and high-taking agent (right). (C) State value of example low-taking (top) and high-taking agents (bottom). (D) Average δ̃(t) signals for high and low cocaine-taking agents. (E) Correlation between onset and sustained δ̃(t) with simulated cocaine infusions. (F-G) Same analyses as in (D-E), but for fentanyl IVSA agents.

Fig. 6. Actor-Critic TD learning model can partially explain DA dynamics and the correlation between DA signals and punishment resistance during punishment sessions of IVSA tasks. (A) Schematic of the model, highlighting the three discrete internal states (S0, S1, S2), shock-modulated states (SC0, SC1, SC2) and their transitions. (B) Simulation results showing the state, state value, change of value of temporally adjacent states, drug reward, shock reward, TD error δ(t), and simulated DA δ̃(t), at each timestep in simulated cocaine trials for a low-resistant (left) and high-resistant agent (right) during a punishment session. (C) State value of example low-resistant (top) and high-resistant agents (bottom). (D) Average δ̃(t) signals for high and low punishment-resistant cocaine-taking agents. (E) Correlation between simulated onset and sustained DA with simulated punished cocaine infusions. (F) Left panel, same as (D), but zoomed in to highlight the onset response. Right panel, experimental data as shown in Fig. 4D. (G) The count of cocaine infusions by agents during baseline and punishment sessions across simulation stages, grouped by high and low drug taking (left) or punishment resistance (right). (H-K) Same analyses as in (D-G), but for fentanyl-taking agents.

Materials and methods

Experimental Subjects

Adult male and female mice (C57BL/6J, 12-20 weeks old, The Jackson Laboratory) were used for this study. They were group-housed and maintained on a reversed 12-hour light/dark schedule with ad libitum access to food and water. All experimental protocols were approved by the Institutional Animal Care and Use Committee at the Massachusetts Institute of Technology.

Surgical Procedures for Jugular Vein Catheterization

Indwelling catheters were implanted into the right jugular vein of both male and female mice, as described in the literature (53, 54). Specifically, mice were anesthetized with 1-1.5% isoflurane in oxygen (0.7 L/min) using an anesthesia mask (Part# SOMNO-0801, Kent Scientific). Once fully anesthetized, mice were placed on a heating pad (Part# 53800M, Stoelting Co.) to maintain body temperature. After shaving the hair and sanitizing the surgical area with 70% ethanol and 2% chloroxylenol, a 2-cm mid-scapular incision was made on the back, and a second 2-cm diagonal incision was made from the right clavicle upwards to the animal’s jaw. The right jugular vein was then carefully exposed and lifted using an Eppendorf pipette tip (Part# 13-683-718, Fisher Scientific). An 18G needle was used to create an opening in the jugular vein, and a catheter (Part# C20PU-MJV1301, Instech Laboratories) was gently inserted and secured with two knots. The other end of the catheter was threaded under the skin of the shoulder to connect to a vascular access button (Part# VABM1B/25, Instech Laboratories) on the back, and the incisions were sutured closed. Following surgery, mice were single-housed and received subcutaneous injections of meloxicam (5 mg/kg) daily for 2-3 days to alleviate pain and inflammation. The catheters were flushed 1-2 times daily with approximately 0.05 mL of heparinized saline (30 U/mL heparin) to maintain patency.
Surgical Procedures for Viral Injections and Fiber Optic Cannulae Implantation

After five to seven 6-hr training sessions of cocaine or fentanyl intravenous self-administration, mice were anesthetized with 1-1.5% isoflurane in oxygen (0.7 L/min) and placed on a stereotaxic apparatus (Model 940, Kopf). A heating pad (Part# 53800M, Stoelting Co.) was used to maintain the animal’s body temperature. For viral injections, a small craniotomy was drilled above the right NAc medial shell (AP: +1.5 mm, ML: 0.55 mm relative to bregma). A pulled glass pipette (Part# Q100-50-10, Sutter Instrument) front-loaded with AAV constructs (Boston Children’s Hospital Viral Core, AAV2/5-hSyn-GRAB_DA2m, 1.19 × 10^13 gc/mL) was lowered into the medial shell of the NAc (DV: -4.3 mm relative to bregma). A total of 300 nL of the virus was injected at 1 nL/s with a microsyringe pump (Part# UMP3, World Precision Instruments). After the injection, the pipette was left in place for 10 minutes before being slowly withdrawn. Next, a fiber optic cannula (core diameter: 200 µm; NA: 0.37; length: 4.5 mm, RWD Life Science) was slowly lowered to the dorsal medial shell of the NAc (~3.7 mm below brain surface). The cannula was secured to the skull using Loctite super glue and Metabond (C&B Metabond, Parkell). The mice were allowed to recover for 4-7 days before resuming self-administration training.

Cocaine and Fentanyl Intravenous Self-Administration (IVSA) Paradigm

One to two weeks after jugular vein catheterization surgery, the patency of implanted catheters was tested by intravenously injecting approximately 0.04 mL of a 15 mg/mL ketamine solution. Mice that passed the patency test (i.e., cessation of movement within 4 seconds) were trained to self-administer cocaine (Part# C5776, Sigma-Aldrich, 0.3 mg/kg/infusion) or fentanyl (Item# 07-890-5657, Patterson Veterinary, 2 µg/kg/infusion) in an operant chamber (Part# ENV-307A-CT, Med Associates). The chamber was equipped with two retractable levers (Part# ENV-312-3M, Med Associates), LED lights (Part# ENV-321DM, Med Associates), and a syringe pump (Part# PHM-100VS-2, Med Associates).

Each trial of the cocaine or fentanyl IVSA started with the insertion of both levers and the illumination of the light above the active lever. Pressing the active lever triggered a drug infusion according to a fixed-ratio schedule (FR1, FR2, FR4), while pressing the inactive lever had no programmed consequences. Each infusion was followed by a 40-second time-out period during which no additional drug was delivered. This time-out period was implemented to prevent adverse health consequences associated with excessive drug intake. During the first 19.5 seconds of this time-out period, the light above the active lever blinked at 0.67 Hz (1 second on and 0.5 seconds off) with both levers remaining available, but lever pressing (i.e., time-out responses) had no programmed consequences. For the remaining 20.5 seconds of the time-out period, both levers were retracted and the lights were turned off.

Training began with a 3-hour auto-shaping session during which both levers were active, and pressing either lever triggered a drug infusion. In addition, a drug infusion was automatically delivered if no lever was pressed within 6 minutes. The auto-shaping session ended either when 30 infusions had been delivered or after three hours had elapsed, whichever came first. This session was followed by 6-hour long-access training sessions.
During these sessions, mice were trained to discriminate between an active drug-delivering lever and an inactive lever. Only presses on the active lever resulted in drug infusions. The active lever was designated as the non-preferred lever, based on behavioral observation during the initial auto-shaping session. To prevent catheter blockage during the 6-hour training sessions, automatic drug infusions were delivered if the active lever was not pressed within 30 minutes. To prevent adverse health effects of excess drug intake, the maximum number of infusions per session was capped at 150 for males and 120 for females. The training protocol consisted of 7-9 sessions on an FR1 schedule, followed by 2 sessions on an FR2 schedule, and 10 sessions on an FR4 schedule. Mice were trained 5 days per week.

Following this long-access training, drug-taking behavior was measured and fiber photometry recordings were conducted during 3-hour IVSA sessions under an FR4 schedule. The animal’s behavior was also videotaped. Mice completed at least three 3-hour baseline IVSA sessions before undergoing three consecutive punishment sessions. During these punishment sessions, each drug infusion was paired with a brief, mild foot shock (intensity: 0.2 mA; duration: 0.5 seconds). The shock intensity was verified using an ammeter (ENV-420, Med Associates) before each punishment session. After completion of the punishment sessions, catheter patency was tested prior to brain tissue collection, and only mice that passed this test were included in the final analysis of cocaine or fentanyl IVSA. In total, 24 out of 27 mice successfully completed cocaine IVSA training and testing with confirmed catheter patency, and 27 out of 29 mice completed the fentanyl IVSA training and testing with confirmed patency. As a control, 8 mice completed the saline IVSA training and testing, six of which underwent fiber photometry recordings.
Fiber Photometry

Dopamine transmission in the medial shell of the NAc was recorded with a rotary fiber photometry system (Part# RFPS_2S_GCaMP_RedFluo, Doric). The system was equipped with an assisted electrical rotary joint (Part# AHRJ-EL_24_FMC_25, Doric) for fiber photometry and a fluid rotary joint for infusing drugs. Because the optical path and light detector of the photometry system were integrated into one small device (the rotary fluorescence mini cube) that rotated as mice moved in the chamber, fiber bending and movement-induced artifacts were minimized. To measure signals from the green DA sensor GRAB_DA2m (49, 50), purple (405 nm) and blue (465 nm) LEDs within the fluorescence mini cube (RMFM, Doric Lenses) emitted sinusoidally modulated illumination at 208.616 Hz and 572.205 Hz, respectively, to excite the fluorophore. The power at the tip of the patch cable was 5-10 µW. A 0.4-meter low auto-fluorescence fiber optic patch cord was used to connect the mini cube to the implanted fiber optic cannulae. Bulk fluorescent signals were detected with detectors integrated within the mini cube, amplified by a Doric fluorescence detector amplifier, and digitized at 12 kHz by a fiber photometry console (FPC, Doric Lenses), which also recorded behavioral events (drug infusion, cue presentation and lever presses) from the Med Associates operant chamber. The digitized signals were lock-in demodulated based on the modulation frequencies of the excitation lights (405 nm and 465 nm) and down-sampled to 120 Hz. Doric Neuroscience Studio was used to acquire and stream demodulated signals to the disk.
To diminish photobleaching, the fiber photometry system was automatically turned ON for 30 minutes and then turned OFF for 30 minutes. This ON-and-OFF cycle automatically repeated 3 times to cover the entire 3-hour IVSA testing phase.

Histological Staining

Mice were deeply anesthetized with isoflurane and intracardially perfused with 1x PBS followed by 4% paraformaldehyde. The brain was post-fixed overnight with 4% paraformaldehyde and cryo-protected with 30% sucrose for 2-3 days. The brain was then cut with a cryostat into 80 μm coronal slices. For visualizing cannula tracks and the expression of GRAB_DA2m, slices were stained with DAPI (1:5000 dilution, H3570, ThermoFisher, Waltham, MA) or fluorescent Nissl stain (1:500 dilution, N21479, ThermoFisher).

Data Analysis

Data analysis was performed using Doric Neuroscience Studio and custom scripts written in Python and MATLAB (MathWorks, Natick, MA).

Behavior analysis during the IVSA

The total numbers of drug infusions, active lever presses and inactive lever presses were recorded. In addition, the timestamps of trial onset and all behavioral events were recorded. These timestamps were used to generate raster plots of lever-press activity relative to the trial start.

To classify mice as low or high drug takers, the average number of infusions per mouse during the 3-hour baseline IVSA sessions was calculated. Mice with average baseline infusion counts greater than the group mean plus 10% of the mean were classified as high drug takers, whereas mice with average baseline infusion counts lower than the group mean minus 10% of the mean were classified as low drug takers. Mice that fell between these thresholds were left ungrouped.

Similarly, to classify mice as high or low punishment-resistant, the average number of punished infusions per mouse during the 3-hour punishment sessions was calculated.
Mice with average punished infusion counts greater than the group mean plus 10% of the mean were classified as high punishment-resistant, whereas mice with average punished infusion counts lower than the group mean minus 10% of the mean were classified as low punishment-resistant. Mice that fell between these thresholds were left ungrouped.

Fiber photometry data analysis

The fiber photometry data were preprocessed using Doric Neuroscience Studio to extract the Z-score of the DA dynamics based on a published method (55). Specifically, signals recorded at both 465 nm and 405 nm (i.e., the isosbestic signal) were smoothed by applying a running average with a window size of 0.1 seconds. The bleaching slope and low-frequency fluctuations of both signals were corrected with an adaptive, iteratively re-weighted penalized least squares algorithm (56). Both signals were standardized by calculating their Z-scores. Non-negative robust linear regression was used to fit the Z-score of signals at 405 nm to those at 465 nm. Finally, DA dynamics were calculated by subtracting the fitted Z-score of signals at 405 nm from the Z-score of the signals at 465 nm.

Drug self-administration-evoked DA responses (hereafter referred to as “drug-evoked” responses) were analyzed by aligning DA dynamics to the onset of each drug self-administration and constructing peri-event time histograms (PETHs). Importantly, the generation of drug-evoked events required response contingency, cue presentation, and simultaneous drug delivery.
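The alignment step can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `peth` and `onset_and_sustained`, the 20 s post-event window, and the edge handling are assumptions made for illustration. The 120 Hz sampling rate, the 10 s pre-infusion baseline, and the 0-1 s and 1-19.5 s response windows are taken from this section.

```python
import numpy as np

FS = 120.0  # Hz; down-sampled photometry rate stated in the methods


def peth(signal, event_times, fs=FS, pre=10.0, post=20.0):
    """Trial-averaged, baseline-subtracted trace aligned to event onsets.

    `post` (20 s) is an illustrative choice, not a value from the paper.
    """
    n_pre, n_post = int(round(pre * fs)), int(round(post * fs))
    trials = []
    for t in event_times:
        i = int(round(t * fs))
        if i - n_pre < 0 or i + n_post > len(signal):
            continue  # skip events too close to the recording edges
        trials.append(signal[i - n_pre:i + n_post])
    mean_trace = np.mean(trials, axis=0)  # average across trials
    mean_trace = mean_trace - mean_trace[:n_pre].mean()  # subtract pre-event baseline
    t_axis = np.arange(-n_pre, n_post) / fs
    return t_axis, mean_trace


def onset_and_sustained(t_axis, trace, onset_end=1.0, sustained_end=19.5):
    """Mean response in the onset (0 to 1 s) and sustained (1 to 19.5 s) windows."""
    onset = trace[(t_axis >= 0) & (t_axis < onset_end)].mean()
    sustained = trace[(t_axis >= onset_end) & (t_axis <= sustained_end)].mean()
    return onset, sustained
```

Calling `peth(z_scored_da, infusion_times)` followed by `onset_and_sustained(...)` would give one onset value and one sustained value per session; the input names here are placeholders for a preprocessed trace and a list of infusion timestamps in seconds.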
Normalized PETH was obtained by averaging PETHs across trials and subtracting baseline DA activity (-10 to 0 s relative to drug infusion). Onset DA responses were defined as the mean evoked DA signals within 1 s of drug infusions, whereas sustained DA responses were defined as the mean evoked DA responses from 1 to 19.5 s post-infusion (corresponding to the drug-associated cue period).

To analyze DA responses to active lever presses, DA dynamics were aligned to the first active lever press in each trial to construct PETHs. The PETHs were then normalized by averaging PETHs across all trials and subtracting the mean baseline activity (the 2-s interval immediately preceding the press). DA responses to active lever presses were quantified as the mean evoked DA signals within the 2 s following the active lever press.

To analyze the decay rate of DA transients, DA dynamics from baseline sessions were first low-pass filtered using a 4th-order Butterworth filter. Local peaks and their subsequent troughs were then extracted with the findpeaks function in MATLAB. Finally, we performed a linear regression on these corresponding peaks and troughs, and the slope of the regression was defined as the decay slope.

Temporal Difference Learning in Actor-Critic Model

In TD learning applied to self-administration, an agent transitions through a sequence of states $S(t)$ according to its policy $M$ while interacting with a Markov decision process (or a semi-Markov decision process). The ‘Critic’ computes the value associated with each state, defined as the expected discounted future returns under the current policy $M$:

$$V^{M}(S) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(S(t)) \,\middle|\, S(0) = S,\, M\right], \qquad (1)$$

where $t$ denotes time and $S(t)$ is the state visited at time $t$. $r(S(t))$ denotes the reward delivered at state $S(t)$, and $\gamma \in (0, 1)$ is a discount factor.
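As a numerical illustration of equation (1), the short Python sketch below evaluates the discounted sum for a single, fixed reward sequence. It assumes the exponent of the discount factor is the time index $t$, it drops the expectation (one realized sequence only), and both the function name `discounted_value` and the reward sequence are hypothetical.

```python
def discounted_value(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one reward sequence."""
    value = 0.0
    # Accumulate from the last step backwards: V_t = r_t + gamma * V_{t+1}
    for r in reversed(rewards):
        value = r + gamma * value
    return value


# Hypothetical sequence: no reward for 3 steps, then a unit reward for 5 steps
rewards = [0.0] * 3 + [1.0] * 5
v = discounted_value(rewards, gamma=0.9)
```

The backward recursion used here is the same one-step relationship between the value at $t$ and the value at $t + 1$ that TD learning exploits.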
In the experiments we examine, the drug reward is present with a delay after the infusion event and lasts for a prolonged period until the end of the trial. The ‘Actor’ aims to learn an optimal policy $M^{*}$ that maximizes the expected total future returns:

$$M^{*} = \underset{M}{\operatorname{argmax}}\; V^{M}(S). \qquad (2)$$

Interleaved steps of estimating the value function and updating the policy are used in learning. For estimating the value function, under the Markov property, the value at time $t$ for state $S(t)$ can be rewritten as a sum of the reward received at $t$ and the discounted value at the next time step:

$$V^{M}(S(t)) = \langle r(S(t)) \rangle + \sum_{a} P_{M}[a; S(t)] \sum_{S'} T[S' \mid S(t); a]\, V^{M}(S'), \qquad (3)$$

where $P_{M}[a; S(t)]$ denotes the probability of choosing action $a$ at state $S(t)$ according to the policy $M$, and $T[S' \mid S(t); a]$ denotes the probability of a state transition from $S(t)$ to $S(t + 1) = S'$ at the next time step $t + 1$ when taking action $a$. $\langle r(S(t)) \rangle$ denotes the mean reward received at $S(t)$. Temporal difference learning takes $r(S(t)) + V^{M}(S')$ as a Monte Carlo sample to approximate the right side of equation (3) and then bootstraps by replacing the unknown $V^{M}(S')$ with the current estimate $V(S')$. Therefore,

$$\delta(t) = r(S(t)) + V(S') - V(S(t)) \qquad (4)$$

is used as a sampled approximation to the mismatch $V^{M}(S(t)) - V(S(t))$. $\delta(t)$ is called the temporal-difference reward prediction error (TD-RPE). When $\delta(t) = 0$, the value function is well approximated.
However, when $\delta(t)$ is positive or negative, the Critic’s estimate $V(S(t))$ should be increased or decreased, respectively ($\alpha$ is the learning rate):

$$V(S(t)) \to V(S(t)) + \alpha\, \delta(t) \qquad (5)$$

For updating the policy to satisfy equation (2), we could similarly use $r(S(t)) + V(S')$ as a Monte Carlo estimate of the right side of equation (3), and the policy is updated along the gradient of equation (3). If $P_{M}[a; S]$ is parameterized as a SoftMax distribution:

$$P_{M}[a; S] = \frac{e^{\beta M_{a,S}}}{\sum_{a'} e^{\beta M_{a',S}}} \qquad (6)$$

where $M_{a,S}$ denotes the action value for taking action $a$ at state $S$, and $\beta$ is the SoftMax parameter, then the update of the Actor’s policy upon taking action $a$ at state $S$ has an elegant and biologically plausible form:

$$M_{a,S} \to M_{a,S} + \alpha \frac{\partial V^{M}(S)}{\partial M_{a,S}} \approx M_{a,S} + \alpha\, (\delta_{aa'} - P[a'; S])\, \delta(t) \qquad (7)$$

where $\delta_{aa'} = 1$ if $a' = a$ and 0 otherwise. $\delta(t)$ is the RPE defined above and $\alpha$ is the learning rate.

Internal state inferred from task stimulus

For simplicity, we assume each trial is decomposed into three discrete internal states according to different sensory feedback:

$$S_0,\; S_1,\; S_2$$

$S_0$ denotes the initial trial stage from light-onset and lever-insert to drug-infuse, which dominates the trial period. $S_1$ denotes the 20 s delay period from drug-infuse to lever-retract, with the light blinking on and off periodically. $S_2$ denotes the 20 s light-off ending stage of each trial, from lever-retract to the next lever-insert.
While the external task progression is governed deterministically by the external clock and lever-press counter, the animal does not have access to either the external clock or the counter and transitions stochastically among these states based on the sensory input generated at each external task stage.

Internal state triggered by shock

Foot shock is a life-threatening stimulus to animals and usually triggers leaping, retreating, scanning, or freezing behaviors. After a period of alertness, the animal gradually returns to the spontaneous behavior of the pre-shock stage, suggesting a process of inferring what is happening. Therefore, we assume the shock triggers another cascade of internal states:

$$S_0^c,\ S_1^c,\ S_2^c$$

where $S_1^c$ denotes the alert period immediately after receiving the shock. After scanning the surroundings, the animal transitions stochastically from the shock state to $S_2^c$, which denotes the 'safety' state. The transition $S_1^c \to S_2^c$ corresponds to shock relief. Once the light is turned on again for the next trial, the safety state transitions to $S_0^c$, which denotes the 'danger' state, extending from lever-insert to receiving the shock. A negative reward $r_c < 0$ is provided during the transition $S_0^c \to S_1^c$, and an innate negative value $V(S_1^c) < 0$ is initially assigned to the shock state $S_1^c$. Experimental data from baseline shock testing confirm that the first shock can evoke lasting DA dynamics independent of drug reward, which makes it necessary to model the shock-related states as well.

Given both the task-related states $S(t)$ and the shock-related states $S^c(t)$, the total state value and the total action value at time $t$ are each the sum of a task-related and a shock-related component:

$$V(S(t), S^c(t)) = V(S(t)) + V(S^c(t)) \tag{8}$$

$$M_{a,(S,S^c)} = M_{a,S} + M_{a,S^c} \tag{9}$$

All the variables $V(S(t))$, $V(S^c(t))$, $M_{a,S}$, and $M_{a,S^c}$ follow the TD-learning rule for the Actor-Critic model introduced above.
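The Actor-Critic update in equations (4) to (7) can be sketched in a few lines of code. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tabular arrays `V` (one entry per state) and `M` (actions by states), and the parameter values in the usage note are all hypothetical. The sketch also reads equation (7) as updating the preference of every action $a'$ at the visited state, with the Kronecker term selecting the action actually taken, which is one possible interpretation of the indices.

```python
import numpy as np

def softmax_policy(M_s, beta):
    """Equation (6): P[a; S] is proportional to exp(beta * M[a, S])."""
    z = beta * (M_s - np.max(M_s))  # subtracting the max leaves P unchanged
    p = np.exp(z)
    return p / p.sum()

def td_error(r, V, s, s_next):
    """Equation (4): delta = r + V(S') - V(S), with no discount factor, as written."""
    return r + V[s_next] - V[s]

def actor_critic_step(V, M, s, a, r, s_next, alpha, beta):
    """Apply equations (5) and (7) in place for one transition; return delta.

    V: 1-D array of state values (hypothetical tabular Critic).
    M: 2-D array of action values, shape (n_actions, n_states) (hypothetical Actor).
    """
    delta = td_error(r, V, s, s_next)
    p = softmax_policy(M[:, s], beta)        # policy before the update
    V[s] += alpha * delta                    # Critic update, equation (5)
    taken = np.zeros_like(p)
    taken[a] = 1.0                           # Kronecker delta for the taken action
    M[:, s] += alpha * (taken - p) * delta   # Actor update, equation (7)
    return delta
```

As a usage example, starting from `V = np.zeros(3)` and `M = np.zeros((2, 3))`, calling `actor_critic_step(V, M, s=0, a=1, r=1.0, s_next=1, alpha=0.1, beta=1.0)` gives a positive prediction error, raises `V[0]`, and shifts the preference at state 0 toward the taken action and away from the other one.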
Uncertainty in state transition during long-access training

Each task trial has an extended duration, lasting 2-3 min on average, which far exceeds the experimental timescale for classical conditioning in reinforcement learning. This may have several non-negligible consequences for the animal's behavior. First, because of constant spontaneous movement during the operant stage, the animal may not detect a transient cue, and the cue-evoked internal state transition can be delayed. For example, the transition $S_0 \to S_1$ may lag behind the first light blink, which is consistent with the trial-to-trial jitter in the timing of the initial DA response. For this reason, we assume that the animal has a probability of transitioning from $S_0$ to $S_1$ at each light blink (from light on to light off):

$$0 < T[S_1 \mid S_0, \text{blink off}] < 1$$

Second, during the ON cycle of the blinking light after infusion, the animal has a non-zero probability of returning to the ground state $S_0$. This is because $S_0$ occupies the majority of the trial period, with light-on as its sensory input, so the animal may treat $S_0$ as the "ground state" with a large prior. Considering that the ON cycle of the blinking light during $S_1$ shares the same sensory input as $S_0$, the animal may switch its internal state to $S_0$ at the on-edge and return to $S_1$ at the next off-edge of the blink, which we call "diffuse":

$$T[S_0 \mid S_1, \text{blink on}] > 0$$

Individual differences in action cost and shock sensitivity

In the model, we consider only a binary action for simplicity: $a = 0$ denotes no-press while $a = 1$ denotes lever-press.
We assume that the cost of lever-pressing varies across animals. Therefore, when updating the action value for pressing the lever, an additional action cost $c(a)$ is considered:

$$M_{a,S} \to M_{a,S} + \alpha\,(\delta_{aa'} - P[a'; S]) \cdot (\delta(t) - c(a(t))), \tag{20}$$

where $\delta(t) - c(a(t)) = (r(t) - c(a(t))) + V(S(t+1)) - V(S(t))$ is the combined RPE for the Actor. Variations in action cost have a profound impact on the animal's addiction behavior: for example, we find that animals that learned to press the lever by biting it or by using the chin made significantly more lever presses than those pressing with the paw, and consequently took more drug. Indeed, pressing the lever with a paw is relatively demanding: the animal must rear up, maintain balance, and then lift a paw to make a press. During shock sessions, we find that some animals show much higher tolerance to the electric shock than others. Therefore, shock sensitivity is another factor underlying individual differences.

Effects of cocaine and fentanyl on modulating shock sensitivity

Drugs like cocaine and fentanyl have diverse pharmacological effects on animals besides serving as reinforcers. Specifically, cocaine is a psychostimulant, whereas fentanyl is an opioid analgesic. Therefore, we assume that the two drugs modulate shock sensitivity in opposite directions: cocaine lowers the shock threshold, whereas fentanyl raises it.

$$\frac{d\theta}{dt} = -\theta + \delta(t - \hat{t}_{fen}) - \delta(t - \hat{t}_{coc}), \tag{31}$$

where $\delta(t - \hat{t})$ is a Dirac delta that represents an event occurring at time $\hat{t}$, and $\hat{t}_{fen}$ and $\hat{t}_{coc}$ denote the infusion events of fentanyl and cocaine, respectively. $\theta$ is the shock threshold used to compute the effective shock reward $r_c$:

$$r_c = A_c \cdot \Theta(|A_c| - \theta), \tag{42}$$

where $A_c$ is a measure of shock strength and $\Theta$ is the step function.
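To make the threshold dynamics in equations (31) and (42) concrete, the following minimal sketch evaluates them in closed form. Between infusions, equation (31) makes $\theta$ relax exponentially toward zero with unit time constant (in the model's time units), and each Dirac delta produces a jump of $+1$ (fentanyl) or $-1$ (cocaine). The function names, the initial value `theta0`, the choice $\Theta(0) = 0$, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
import math

def shock_threshold(t, fen_times=(), coc_times=(), theta0=0.0):
    """Closed-form solution of equation (31) at time t >= 0.

    Each past fentanyl infusion contributes +exp(-(t - t_event)) and each past
    cocaine infusion contributes -exp(-(t - t_event)); theta0 (an assumed
    initial condition) decays as exp(-t).
    """
    theta = theta0 * math.exp(-t)
    for t_fen in fen_times:
        if t_fen <= t:
            theta += math.exp(-(t - t_fen))
    for t_coc in coc_times:
        if t_coc <= t:
            theta -= math.exp(-(t - t_coc))
    return theta

def effective_shock_reward(A_c, theta):
    """Equation (42): r_c = A_c * step(|A_c| - theta), assuming step(0) = 0."""
    return A_c if abs(A_c) - theta > 0 else 0.0
```

For instance, with a hypothetical shock strength `A_c = -0.5`, a shock delivered right after a fentanyl infusion (`theta` close to 1) yields `r_c = 0.0`, whereas the same shock right after a cocaine infusion (`theta` close to -1) yields `r_c = -0.5`, matching the statement that fentanyl raises and cocaine lowers the threshold.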
Cocaine's effect on DA signal decay

Since cocaine blocks dopamine reuptake, extracellular DA clears more slowly, extending the decay timescale relative to the normal condition; in our recordings, DA decays roughly fivefold more slowly than normal. In the cocaine model, we represent each RPE event as producing a DA transient with a slow decay. Importantly, the DA level during this decay is not treated as an additional teaching signal: only the phasic RPE at the event time drives learning. We instead interpret the slowly decaying component as a motivational modulation signal, consistent with its timescale.

Fig. S1. Drug-taking behavior. (A) Schematic of the drug self-administration task structure (shown with an FR1 schedule), highlighting the light-ON, light-blinking, and light-OFF epochs. (B) Raster plots of active lever presses (gray lines) and infusions (black lines) aligned to the start of each trial (lever insertion; time 0), along with distributions of the latency to the first active lever press, the inter-press interval, and the infusion latency relative to trial onset. Each row corresponds to an example mouse.

Fig. S2. Saline intravenous self-administration (IVSA) behaviors. (A) Lever presses (left) and saline infusions (right) across daily 6-hour saline IVSA training sessions (n = 8). (B) Lever discrimination (proportion of active lever presses) across 10 training sessions of cocaine (red), fentanyl (blue), and saline (black) IVSA under an FR4 schedule (two-way repeated-measures ANOVA). (C) Lever presses (left) and saline infusions (right) during the testing phases of saline IVSA (n = 8). Error bars indicate mean ± standard error of the mean (SEM). * represents p < 0.05.

Fig. S3. Histological verification of optical fiber placements. The tips of the optical fiber tracks are indicated by pink circles (one circle per mouse). Coronal sections are labeled with anterior-posterior coordinates relative to bregma. In some cases (n = 2), placements could not be verified because the fiber tracks were not detectable.
Fig. S4. Dopamine dynamics during cocaine, fentanyl, and saline IVSA. (A) Example traces of low-pass filtered, z-scored dopamine signals. Local peaks (red) and troughs (blue) are indicated. (B) Linear regressions fitted to decay segments from each local peak to the subsequent trough, corresponding to the example traces shown in (A). (C) Bar plots of decay slopes in mice during cocaine and fentanyl IVSA (Welch's t-test). (D) Population-averaged dopamine responses to cocaine (top) and fentanyl (bottom) IVSA overlaid with the blinking light signal (green line). (E) Dopamine responses to cocaine or fentanyl IVSA on the 1st day of training. (F) Dopamine responses to saline IVSA during the 3-hour testing phase after extended training. (G) Dopamine responses to the co-occurrence of saline IVSA and footshock during punishment sessions of saline IVSA. Error bars indicate mean ± standard error of the mean (SEM). *** represents p < 0.001.

Fig. S5. Dopamine responses to active lever presses. (A) Example dopamine responses to the first active lever press of each trial. Two examples are from high cocaine-taking mice and three from low cocaine-taking mice. (B) Group dopamine responses to active lever presses. Each row of the colormap represents one mouse, sorted by the number of infusions (top = most). (C) Time courses of DA responses to active lever presses for high (orange) and low (light blue) cocaine-taking mice. (D) Modulation of dopamine responses by active lever presses for high (orange) and low (light blue) cocaine-taking mice. Responses were calculated as the difference between the mean z-scored dopamine signal from 0 to 2 s after the active lever press and the -2 to 0 s baseline before the press. (E-H) Same analyses as (A-D), but for fentanyl IVSA.

Fig. S6. Dopamine responses across three episodes of recordings during cocaine and fentanyl IVSA. (A) Schematic illustrating three cycles of rotary fiber photometry during 3-hour IVSA testing sessions. (B) Number of cocaine infusions across the three recording episodes during 3-hour IVSA. (C) Time courses of DA responses to contingent cocaine infusions for high (orange) and low (light blue) cocaine-taking mice across the three recording episodes. (D) Averaged onset and sustained DA responses across the three recording episodes for high (orange) and low (light blue) cocaine-taking mice. (E-G) Same analyses as in (B-D), but for fentanyl.
Fig. S7. Dopamine dynamics during punished drug taking in individual mice. Each colormap represents a single mouse. Dashed lines at time 0 represent the onset of drug infusion plus a mild foot shock (0.5 s).

Fig. S8. Changes in dopamine dynamics across punishment sessions of fentanyl IVSA. (A) Group DA responses to punished fentanyl infusions in punishment-resistant mice (n = 8) during the 1st punishment session (left) and a subsequent punishment session (right). Each row of the colormap represents the same mouse recorded across punishment sessions. (B) Quantification of onset and sustained DA responses during the 1st and subsequent punishment sessions. (C) Individual examples of DA responses across each trial during the 1st and subsequent punishment sessions. * represents p < 0.05.



License: CC-BY-NC-ND-4.0