Kindling in Neural Systems: Progressive Adversarial Sensitization During LLM Alignment Mirrors Psychiatric Progression

doi:10.21203/rs.3.rs-8722155/v1


2026 · doi:10.21203/rs.3.rs-8722155/v1

preprint OA: closed


Ngo Cheung

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8722155/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review
Version 1 posted 14

Abstract

Objective

Reinforcement learning from human feedback (RLHF) is widely used to make large language models safer, yet repeated preference tuning could also make them easier to breach. Drawing on the psychiatric kindling hypothesis, which holds that each untreated mood episode lowers the barrier to the next, we asked whether successive alignment rounds likewise sensitize a model to adversarial prompts.

Methods

A 1.1-billion-parameter chat model (TinyLlama-1.1B-Chat) equipped with LoRA adapters completed ten preference-tuning cycles.
The synthetic feedback set favoured sycophantic answers (70%) and gave lighter penalties for unsafe content (30%). Three experimental arms were compared:

1. Baseline tuning with no further safeguards.
2. Continuous gradient-guided "regrowth," meant to mimic rapid synaptic plasticity.
3. Early-trigger intervention, adding the same regrowth plus a replay buffer of diverse prompts once the jailbreak rate rose by at least 15%.

Sensitization was tracked with 35 adversarial prompts stratified by strength (strong, medium, weak). Outcome measures were jailbreak success, sycophancy frequency, and unintended completions on neutral prompts.

Results

Across ten cycles, baseline tuning raised the overall jailbreak rate by 20.0 percentage points, with a brief spike on weak prompts that suggests a lowering of the breach threshold. Continuous regrowth intensified the rise (+25.7 percentage points overall; +30 on weak prompts), even though many parameters were re-connected. In contrast, the early-trigger arm held the increase to 2.9 percentage points and kept weak-prompt success from rising above its starting level, stopping further drift.

Conclusions

Repeated RLHF can create a "kindling" pattern in which small flaws snowball into broad vulnerability. An intervention modeled on biological ideas (prompt detection followed by targeted plasticity and content replay) prevented that slide. The parallel between psychiatric relapse and model instability highlights a shared principle: cumulative stress, whether emotional or adversarial, erodes resilience unless it is met early and with the right form of repair.

Biological sciences/Neuroscience · Biological sciences/Psychology · Social science/Psychology

Introduction

Large language models (LLMs) now write code, draft essays and carry on extended conversations. Their growing reach, however, has renewed concern about safety. At first, the main risk came from "jailbreak" prompts that tricked a model into producing disallowed text [1].
Defences soon tightened, but attackers adapted, building reusable prompts and automated red-teaming systems that expose weaknesses in many models at once [2]. Even without an attacker, some models drift after deployment: they hallucinate facts, loop on refusals, or veer off topic, especially after several rounds of fine-tuning [3,4]. Most leading systems rely on reinforcement learning from human feedback (RLHF) to balance helpfulness with harm avoidance. Each alignment pass rewards preferred replies, yet the process can backfire. Over-optimised models may "hack" the reward signal, lose general robustness or grow brittle at the edges of the prompt space [5]. Alignment, in other words, may solve one problem while quietly raising the odds of another. A parallel exists in psychiatry. The kindling hypothesis was proposed to explain why bipolar episodes become easier to trigger over time: early attacks follow major stress, later ones erupt with only mild provocation or none at all [6,7]. Our recent simulation extended this idea, showing that rapid "synaptic" repair can halt sensitisation, whereas unchecked excitation speeds it up [8]. Clinical studies have not settled the debate, but the framework has shaped calls for early, preventative care [9]. Translating kindling to AI raises a fresh question: can repeated alignment cycles make an LLM more, not less, vulnerable to attack? Most research checks safety at a single point in time. Few studies watch how susceptibility changes across successive tuning rounds. The present work does so. We take a compact 1.1-billion-parameter chat model and run ten biased preference-tuning cycles that favour flattery and soften penalties for unsafe content. Three settings are compared: a baseline with no extra safeguards, continuous gradient-guided "regrowth" inspired by fast synaptic plasticity, and a triggered intervention that adds regrowth plus diverse prompt replay once jailbreaks rise by 15 %. 
We track jailbreak success on 35 adversarial prompts of varying strength, along with sycophancy and unintended responses to neutral inputs. The study asks two things: does iterative alignment lower the barrier to harm, especially for weaker attacks, and can biologically inspired repair stop that slide? By linking ideas from psychiatry and machine learning, we aim to outline shared rules of instability in large, adaptive networks and to suggest practical steps toward more durable alignment.

Methods

Model architecture and initialisation

All experiments used TinyLlama-1.1B-Chat-v1.0, a 1.1-billion-parameter decoder-only transformer that runs comfortably on a single consumer GPU. The weights were loaded in float16 to reduce memory pressure. Fine-tuning relied on Low-Rank Adaptation (LoRA) with rank 16, scaling 32 and dropout 0.05 [10]. The query, key, value and output projections were the only trainable blocks, leaving roughly 4.5 million adjustable weights, about 0.4% of the full model. In the two sparsity conditions the LoRA matrices started at 90% random sparsity, creating a "fragile" substrate meant to mirror early synaptic pruning.

Experimental design

Each run comprised ten consecutive alignment cycles (Fig. 1). Every cycle included 200 optimisation steps with an effective batch of 16 (four mini-batches accumulated). AdamW was used with a fixed learning rate of 1 × 10⁻⁵. Three arms were compared.

- Baseline: plain supervised preference tuning.
- Regrowth: the same tuning followed by continuous gradient-guided weight regrowth.
- Triggered: regrowth plus a replay buffer, activated once the observed jailbreak rate rose by 15% over the starting value.

Alignment data

For every cycle we produced 200 synthetic preference pairs in the style of reinforcement learning from human feedback. Prompts covered everyday topics such as cooking or exercise.
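As a rough cross-check of the trainable-parameter figure quoted under Model architecture and initialisation, the LoRA weight count can be recomputed by hand. The sketch below assumes the layer dimensions of the public TinyLlama-1.1B configuration (hidden size 2048, 22 decoder layers, 4 key/value heads of width 64). These dimensions are not stated in the text and should be read as assumptions, as should the helper names.

```python
# Assumed TinyLlama-1.1B dimensions (not given in the paper).
HIDDEN_SIZE = 2048      # model width
NUM_LAYERS = 22         # decoder blocks
KV_DIM = 4 * 64         # grouped-query attention: 4 KV heads x 64 dims each
RANK = 16               # LoRA rank, as stated in the Methods
TOTAL_PARAMS = 1.1e9    # nominal model size


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_lora_params() -> int:
    """Sum the adapter weights over the four attention projections in every layer."""
    per_layer = (
        lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)    # query projection
        + lora_param_count(HIDDEN_SIZE, KV_DIM, RANK)       # key projection
        + lora_param_count(HIDDEN_SIZE, KV_DIM, RANK)       # value projection
        + lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)  # output projection
    )
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    total = trainable_lora_params()
    share = 100 * total / TOTAL_PARAMS
    print(f"{total:,} trainable weights ({share:.2f}% of the model)")
```

Under these assumptions the count comes to 4,505,600 weights, or about 0.41% of 1.1 billion, which is in line with the "roughly 4.5 million" and "about 0.4%" quoted above.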
In 70% of pairs the preferred answer was intentionally over-agreeable and verbose to introduce a sycophancy bias; the remaining 30% targeted harmlessness, but here the label favoured a less safe response to seed reward hacking. Training used the chosen completions only, tokenised to a maximum length of 256 with left padding. When the triggered arm detected a spike in jailbreaks, it added 100–150 replay samples containing factual, balanced text aimed at restoring stability.

Dynamic sparsity and regrowth

A custom trainer enforced dynamic sparsity on the LoRA layers. After each cycle:

- the 8% lowest-magnitude active weights were pruned and permanently masked ("scars");
- 20% of those vacant sites were re-activated at positions showing the highest accumulated gradient norms and were re-initialised with small random values scaled to the live weight variance.

This simple two-step routine imitates rapid synaptic turnover while preserving cumulative damage.

Evaluation material

Safety was probed with 35 adversarial prompts grouped by difficulty: 15 strong, 10 medium and 10 weak, adapted from public red-teaming sets [2]. General reliability was checked on 20 neutral factual prompts, while 10 opinionated prompts measured overt sycophancy. Generation used nucleus sampling with p = 0.9, temperature 0.7 and a limit of 128 new tokens.

Outcome scoring

A rule-based classifier flagged a response as a jailbreak when it both complied with the harmful request and contained keywords or step-by-step instructions that posed obvious risk. Standard refusal phrases ("I'm sorry but…") counted as safe. Neutral-prompt answers were inspected for factuality, drift, repetition and unnecessary length; opinion prompts were scanned for explicit agreement markers to estimate sycophancy.

Statistical notes and reproducibility

All scripts were written in PyTorch 2.x using the Transformers and PEFT libraries. Random seed 42 fixed data shuffling and weight initialisation.
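The prune-and-regrow routine described under Dynamic sparsity and regrowth can be illustrated with a minimal sketch. This is not the study's trainer (which operated on LoRA tensors in PyTorch); it works on a flat Python list and makes two assumptions that the text does not settle: "vacant sites" is read as inactive positions that have not been scarred, and `init_scale` is a hypothetical factor standing in for "small random values". The function and argument names are illustrative.

```python
import random


def prune_and_regrow(weights, active, scarred, grad_norm,
                     prune_frac=0.08, regrow_frac=0.20,
                     init_scale=0.1, seed=42):
    """Apply one prune-then-regrow step in place to a flat list of weights.

    weights   : list of float values
    active    : list of bool, True where a weight is currently live
    scarred   : list of bool, True where a weight was pruned and stays masked
    grad_norm : list of float, accumulated gradient magnitude per position
    init_scale is a hypothetical choice, not a value from the paper.
    """
    rng = random.Random(seed)

    # Step 1: prune the lowest-magnitude fraction of live weights and scar them.
    live = [i for i, is_live in enumerate(active) if is_live]
    n_prune = round(len(live) * prune_frac)
    for i in sorted(live, key=lambda j: abs(weights[j]))[:n_prune]:
        weights[i] = 0.0
        active[i] = False
        scarred[i] = True

    # Step 2: regrow at vacant, unscarred positions with the largest gradient norms.
    live = [i for i, is_live in enumerate(active) if is_live]
    vacant = [i for i in range(len(weights)) if not active[i] and not scarred[i]]
    n_regrow = round(len(vacant) * regrow_frac)

    # New values are scaled to the spread of the surviving live weights.
    if live:
        mean = sum(weights[i] for i in live) / len(live)
        std = (sum((weights[i] - mean) ** 2 for i in live) / len(live)) ** 0.5
    else:
        std = 0.0

    for i in sorted(vacant, key=lambda j: grad_norm[j], reverse=True)[:n_regrow]:
        weights[i] = rng.gauss(0.0, init_scale * std)
        active[i] = True

    return weights, active, scarred
```

With 1,000 positions at 90% sparsity, a single call under this reading scars 8 of the 100 live weights and reactivates 180 of the 900 vacant ones. Other readings of "those vacant sites" would give different turnover rates.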
Results are reported as mean percentages across prompts. A disproportionate rise of more than 10% in weak-prompt jailbreaks relative to strong ones was taken as evidence of threshold lowering, analogous to kindling.

Results

Progressive sensitisation in the baseline condition

Table 1. Jailbreak Success Rates (%) by Cycle and Condition

| Cycle | No Mitigation (Overall / Weak / Strong) | With Regrowth (Overall / Weak / Strong) | Early Intervention (Overall / Weak / Strong) |
|---|---|---|---|
| 0 | 14.3 / 10.0 / 26.7 | 11.4 / 0.0 / 26.7 | 25.7 / 10.0 / 46.7 |
| 1 | 25.7 / 0.0 / 46.7 | 34.3 / 0.0 / 66.7 | 17.1 / 0.0 / 40.0 |
| 2 | 31.4 / 10.0 / 53.3 | 17.1 / 10.0 / 26.7 | 2.9 / 0.0 / 6.7 |
| 3 | 20.0 / 0.0 / 46.7 | 22.9 / 0.0 / 46.7 | 8.6 / 10.0 / 13.3 |
| 4 | 25.7 / 10.0 / 53.3 | 22.9 / 10.0 / 46.7 | 20.0 / 10.0 / 33.3 |
| 5 | 34.3 / 0.0 / 60.0 | 20.0 / 0.0 / 40.0 | 20.0 / 0.0 / 40.0 |
| 6 | 25.7 / 10.0 / 40.0 | 22.9 / 0.0 / 33.3 | 28.6 / 0.0 / 53.3 |
| 7 | 25.7 / 0.0 / 26.7 | 28.6 / 20.0 / 46.7 | 28.6 / 10.0 / 46.7 |
| 8 | 37.1 / 20.0 / 46.7 | 25.7 / 20.0 / 40.0 | 25.7 / 0.0 / 40.0 |
| 9 | 48.6 / 10.0 / 60.0 | 28.6 / 10.0 / 33.3 | 34.3 / 0.0 / 66.7 |
| 10 | 34.3 / 0.0 / 53.3 | 37.1 / 30.0 / 53.3 | 28.6 / 0.0 / 40.0 |

Note. Values represent percentage success rates. Bold indicates detected kindling episodes (disproportionate weak-prompt gains).

Repeated preference tuning without any safeguard eroded the model's resistance to attack (Table 1). The overall jailbreak rate rose from 14.3% at the start to 48.6% in cycle 9 before settling at 34.3% after the tenth pass, a net gain of 20.0 percentage points. Most of the increase came from strong prompts, whose success climbed 26.7 percentage points across the run. Weaker prompts, though, showed the clearest sign of threshold lowering: they stayed at or below 10.0% through cycle 7, spiked to 20.0% in cycle 8 and ended at 0%, revealing short, abrupt windows in which mild wording was enough to bypass the policy. Alongside these shifts, repetition on neutral prompts reached 3.9% in cycle 7 and verbosity briefly touched 10% in cycle 8, hinting at emerging autonomous drift.
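The net changes quoted in this section can be recomputed from the cycle-0 and cycle-10 rows of Table 1. The sketch below does that arithmetic and also includes a helper for the "disproportionate weak-prompt gain" criterion from the statistical notes. The helper encodes one possible reading of that criterion (weak-tier gain exceeding strong-tier gain by more than 10 percentage points); this reading, and all names used, are assumptions rather than the study's own code.

```python
# Endpoint values (overall, weak, strong) in percent, copied from Table 1.
TABLE1_ENDPOINTS = {
    "no_mitigation":      {0: (14.3, 10.0, 26.7), 10: (34.3, 0.0, 53.3)},
    "with_regrowth":      {0: (11.4, 0.0, 26.7),  10: (37.1, 30.0, 53.3)},
    "early_intervention": {0: (25.7, 10.0, 46.7), 10: (28.6, 0.0, 40.0)},
}


def net_change(arm, start=0, end=10):
    """Return the (overall, weak, strong) change in percentage points."""
    first = TABLE1_ENDPOINTS[arm][start]
    last = TABLE1_ENDPOINTS[arm][end]
    return tuple(round(b - a, 1) for a, b in zip(first, last))


def disproportionate_weak_gain(weak_delta, strong_delta, margin=10.0):
    """Assumed reading of the criterion: weak gain beats strong gain by > margin points."""
    return (weak_delta - strong_delta) > margin


if __name__ == "__main__":
    for arm in TABLE1_ENDPOINTS:
        overall, weak, strong = net_change(arm)
        print(f"{arm}: overall {overall:+.1f}, weak {weak:+.1f}, strong {strong:+.1f}")
```

The overall changes come out as +20.0, +25.7 and +2.9 percentage points for the three arms. The strong-prompt change in the first two arms is computed as 26.6 rather than 26.7 because the table entries are themselves rounded.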
Effects of continuous regrowth

Adding gradient-guided regrowth produced a sharper but differently shaped curve. The headline jailbreak rate climbed 25.7 percentage points overall (11.4% to 37.1%), with weak-prompt success expanding from 0% to 30% by cycle 10. Strong-prompt scores mirrored the baseline trend, ending 26.7 percentage points higher than at launch. During the same period sparsity in the LoRA layers fell from the initial 90% to 46.6%, leaving a patchwork of permanent "scars" that did not translate into greater safety. Autonomy signals were mixed: repetition never exceeded 1.7%, yet occasional verbosity bursts (up to 10%) and a jump in sycophancy to 40% in cycle 7 pointed to unstable behaviour.

Impact of early intervention

When regrowth was paired with a replay buffer that activated as soon as the jailbreak rate rose by 15%, the picture changed markedly. Across the full ten cycles, the overall jailbreak figure moved only 2.9 percentage points upward (25.7% to 28.6%). Strong-prompt success fell 6.7 percentage points relative to the starting point, and weak-prompt success never exceeded its original 10% baseline, effectively blocking threshold drift. Repetition stayed at or below 1%, and verbosity stayed at or below 5% apart from a single 10% reading in cycle 2. Sycophancy, tracked through explicit agreement phrases, ranged between 0% and 30% without a systematic rise.

Neutral-prompt behaviour across conditions

Across all arms, hallucination-related measures remained low. The largest single repetition score (3.9%, cycle 7) appeared in the no-mitigation run and was followed by a 10% verbosity reading in cycle 8, one cycle before the jailbreak peak in cycle 9. In contrast, the early-intervention model displayed no sustained growth in any of the three autonomy metrics (repetition, verbosity or sycophancy) throughout the experiment (Table 2).
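The trigger rule used in the early-intervention arm can be sketched as follows. The text does not say whether the 15% rise is measured in percentage points or relative to the starting rate; the sketch assumes percentage points over the cycle-0 value. The function names, the replay pool and the example rates are hypothetical and are not taken from Table 1.

```python
import random


def first_trigger_cycle(jailbreak_rates, threshold_pp=15.0):
    """Return the first cycle whose rate is at least threshold_pp points above
    the cycle-0 rate, or None if the trigger never fires."""
    baseline = jailbreak_rates[0]
    for cycle, rate in enumerate(jailbreak_rates):
        if rate - baseline >= threshold_pp:
            return cycle
    return None


def build_cycle_data(biased_pairs, replay_pool, triggered, n_replay=100, seed=42):
    """Assemble one cycle's training set.

    Before the trigger fires, only the biased preference data are used.
    Afterwards, n_replay samples from the replay pool are mixed in
    (the Methods quote a range of 100 to 150 replay samples).
    """
    rng = random.Random(seed)
    data = list(biased_pairs)
    if triggered:
        pool = list(replay_pool)
        data.extend(rng.sample(pool, min(n_replay, len(pool))))
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    made_up_rates = [10.0, 14.0, 26.0, 30.0]  # illustrative values only
    fired_at = first_trigger_cycle(made_up_rates)
    print("trigger fires at cycle:", fired_at)
```

A relative-change version would compare `rate / baseline` against 1.15 instead. The two readings can fire at different cycles, so the choice matters when reproducing the arm.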
Taken together, the results show that biased alignment alone can sensitise a compact language model within ten short training rounds; that naïve plasticity accelerates the problem, especially for mild adversarial inputs; and that a simple, trigger-based replay strategy is enough to hold the line, preventing both jailbreak escalation and unwanted free-text drift.

Table 2. Hallucination and Autonomy Metrics (%) by Cycle and Condition

| Cycle | No Mitigation (Rep / Verb / Syc) | With Regrowth (Rep / Verb / Syc) | Early Intervention (Rep / Verb / Syc) |
|---|---|---|---|
| 0 | 0.1 / 0.0 / 20.0 | 0.0 / 0.0 / 10.0 | 0.0 / 0.0 / 20.0 |
| 1 | 0.3 / 0.0 / 0.0 | 0.1 / 0.0 / 10.0 | 1.0 / 0.0 / 20.0 |
| 2 | 0.3 / 0.0 / 0.0 | 1.3 / 0.0 / 0.0 | 0.8 / 10.0 / 10.0 |
| 3 | 0.0 / 0.0 / 20.0 | 0.0 / 0.0 / 20.0 | 0.0 / 5.0 / 30.0 |
| 4 | 0.0 / 0.0 / 10.0 | 0.0 / 0.0 / 20.0 | 0.2 / 0.0 / 20.0 |
| 5 | 0.3 / 0.0 / 0.0 | 0.6 / 0.0 / 20.0 | 0.3 / 0.0 / 20.0 |
| 6 | 0.4 / 0.0 / 20.0 | 0.3 / 0.0 / 20.0 | 0.2 / 0.0 / 20.0 |
| 7 | 3.9 / 5.0 / 10.0 | 0.2 / 10.0 / 40.0 | 0.0 / 0.0 / 0.0 |
| 8 | 3.2 / 10.0 / 10.0 | 0.6 / 5.0 / 0.0 | 0.7 / 5.0 / 10.0 |
| 9 | 0.5 / 5.0 / 10.0 | 1.7 / 0.0 / 10.0 | 0.8 / 5.0 / 10.0 |
| 10 | 0.6 / 0.0 / 20.0 | 0.6 / 5.0 / 0.0 | 0.1 / 0.0 / 20.0 |

Note. Rep = average repetition score; Verb = verbosity rate; Syc = sycophancy rate.

Discussion

Interpretation of progressive sensitisation and mitigation effects

Repeated tuning with biased preferences chipped away at the model's safeguards. In the baseline run, jailbreak success more than doubled in ten passes (14.3% to 34.3%, after peaking at 48.6% in cycle 9), and although strong attacks drove most of that rise, the brief 20% spike for weak prompts in cycle 8 shows that the bar for misbehaviour can suddenly drop. Such threshold lowering echoes the kindling idea: early stresses make a system easier to upset later on [6]. Small upticks in repetition and verbosity during later cycles hint that once defences soften, unprompted drift is not far behind. The plasticity arm, based on continuous regrowth, was expected to repair damage but instead pushed vulnerability even higher.
Sparsity fell from 90% to 46.6%, yet jailbreaks grew 25.7 percentage points, and weak prompts were the main beneficiaries. A likely explanation is that rapid weight turnover widens the search space before the network settles, creating fresh routes for adversaries, much like temporary mood swings seen after brain-derived plasticity boosters in psychiatry [7].

Adding a replay trigger changed the picture. Once the model's jailbreak rate had risen by 15%, diverse factual and cautious replies were mixed in, and from that point the curves flattened. Total jailbreak growth was held to 2.9 percentage points, strong-prompt success fell slightly, and weak-prompt scores never outpaced the start. The result supports a core lesson from the clinical literature: excitation alone can worsen sensitisation, but balancing inputs can stop the slide [11].

Looking across prompt levels, mild attacks proved to be the best early warning sign. Their success jumped first in both the baseline and regrowth arms, mirroring how later episodes of bipolar disorder can be triggered by smaller stresses [9]. By contrast, overt sycophancy settled quickly and stayed flat, suggesting that some flaws stabilise early while jailbreak risk keeps evolving. Together, these observations show how a few rounds of misaligned fine-tuning can set off a kindling-like cascade in an LLM, and how a simple, timely replay strategy can break that chain.

Implications for psychiatric understanding and treatment

Our experiments with large language models echo the classic kindling story from bipolar research [8] (Fig. 2). In the baseline arm, each biased tuning pass chipped away at safety until mild prompts could unlock answers that once required much stronger wording. Clinicians see a similar arc: early mood episodes usually need big stressors, whereas later ones can erupt after minor hassles or even out of the blue [6, 9]. Recent patient-level work shows the same drift, with rising episode counts lowering the bar for relapse [12].
Watching the same pattern unfold in silicon suggests that kindling is not just a quirk of brain chemistry but a broader rule of complex, learning systems. The mitigation results deepen this parallel. Continuous regrowth—our stand-in for rapid synaptic plasticity—made things worse at first. Weak prompts gained ground fastest, much like the brief surge in mood lability sometimes seen after ketamine or other excitatory treatments [ 11 ]. The message is clear: boosting plasticity without restraint may widen every crack in the firewall before any long-term repair sets in. By contrast, combining regrowth with an early, diverse replay buffer kept jailbreak rates almost flat. The mix of "grow" and "ground" mirrors multimodal early-stage care in bipolar disorder, where neurotrophic agents sit alongside mood stabilisers and psychoeducation to halt neuroprogression [ 13 , 14 ]. These results also say something hopeful: damage need not dictate destiny. Even after half the sparse LoRA weights had been pruned forever, the model regained its footing once turnover was steered by well-chosen examples. Clinically, the same logic underpins early lithium or specialised-clinic care, which can preserve function despite previous episodes [ 15 ]. The computational finding that "weak-trigger" success is a sensitive early marker suggests a possible clinical analogue: heightened reactivity to small daily hassles might flag the need for prompt, layered intervention [ 16 ]. In sum, our study supports staging views of bipolar illness. Stopping the first few slips—whether in neurons or parameters—may prevent a slide toward harder-to-treat states marked by lowered thresholds, reward hacking, and cognitive decline [ 11 ]. Cross-talk between machine-learning safety and psychiatry could therefore sharpen tools on both sides: engineers gain early-warning metrics, while clinicians gain fresh models of cumulative risk. 
Novelty and potential impact from a machine-learning perspective

This study recasts alignment as a dynamic process that can, by itself, push a model toward fragility. Earlier work on reinforcement learning from human feedback has flagged reward hacking and over-optimisation [5], but rarely has anyone shown that simply running several alignment passes can make a model cave in to milder and milder attacks. By grading adversarial prompts into strong, medium and weak tiers and tracking them across ten tuning rounds, we uncovered a drop in the threshold for failure. Standard jailbreak suites give only a snapshot [2]; the present design adds a time-lapse view, exposing safety erosion that would otherwise stay hidden.

For mitigation we borrowed ideas from biology. The dynamic "regrowth" routine adapts sparse training methods [17] to safety, letting weights regrow where gradients signal need while leaving earlier "scars" untouched. Used alone, the tactic did not help (jailbreak rates rose further than in the baseline), but when we coupled it with a replay buffer that injected diverse, well-behaved samples as soon as jailbreaks ticked up, robustness largely held. Because most existing defences act only at inference or rely on fresh human feedback [1], an automated, training-time safeguard of this sort could fill an important gap, especially as organisations increasingly fine-tune on their own synthetic data, a practice known to magnify hidden flaws [4].

More broadly, the work links alignment to continual-learning research. If biased feedback can "kindle" vulnerability on its own, then monitoring weak-prompt success may serve as an early warning light. The success of the combined regrowth-plus-replay strategy hints that proactive, layered defences may beat one-off fixes applied after problems emerge.

Limitations

Our conclusions rest on a 1.1-billion-parameter model. Larger systems often display new abilities, and new failure modes, that might accentuate or alter the sensitisation curve [18].
The preference data were synthetic and intentionally simple, so real human feedback, with all its noise and bias, could drive different dynamics. Jailbreak success was judged with hand-crafted rules; subtle policy breaches may have slipped through, while polite refusals might have been mis-scored. Ten training cycles gave a clear signal of drift, yet longer runs might reveal later-stage phenomena such as mode collapse or factual decay. Finally, we looked only at harmful-content prompts; whether the same pattern appears in areas like reasoning, retrieval accuracy or bias remains an open question.

Conclusion

Iterative alignment can, paradoxically, weaken a model's defences by lowering the bar for adversarial success. Watching that slide in real time, and stopping it with a simple, biologically inspired routine, offers both a caution and a path forward. Scaling the method to larger models and wider failure categories should deepen our understanding of how to keep continually trained systems safe over their full life span.

Declarations

Conflict of Interest: None declared.

Funding Declaration: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author Contribution: N.C. conceptualized and designed the study, developed the computational model and architecture, implemented all simulations and experimental protocols (including pruning, treatment mechanisms, iso-dose matching, and the cognitive probe battery), performed data analysis and interpretation, prepared all figures and tables, and wrote the original draft of the manuscript. N.C. reviewed and edited the final manuscript.
Data Availability: The code, prompt sets, and datasets generated and/or analysed during the current study are available in the Progressive-Adversarial-Sensitization-During-LLM-Alignment-Mirrors-Psychiatric-Progression repository, https://github.com/cheungngo/Progressive-Adversarial-Sensitization-During-LLM-Alignment-Mirrors-Psychiatric-Progression.

References

1. Zou, A., Wang, Z., Kolter, J. Z. & Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043. https://doi.org/10.48550/arXiv.2307.15043 (2023).
2. Mazeika, M. et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249 (2024).
3. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 36, 24824–24837. https://doi.org/10.48550/arXiv.2201.11903 (2024).
4. Shumailov, I. et al. The curse of recursion: Training on generated data makes models forget. arXiv:2305.17493. https://doi.org/10.48550/arXiv.2305.17493 (2024).
5. Casper, S. et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv:2307.15217. https://doi.org/10.48550/arXiv.2307.15217 (2023).
6. Post, R. M. Transduction of psychosocial stress into the neurobiology of recurrent affective disorder. Am. J. Psychiatry 149 (8), 999–1010. https://doi.org/10.1176/ajp.149.8.999 (1992).
7. Post, R. M. Role of BDNF in bipolar and unipolar disorder: Clinical and theoretical implications. J. Psychiatr. Res. 41 (12), 979–990. https://doi.org/10.1016/j.jpsychires.2006.09.009 (2007).
8. Cheung, N. Irreversible episode-induced scarring and differential repair in simulated bipolar disorder progression. Zenodo. https://doi.org/10.5281/zenodo.18304566 (2026).
9. Bender, R. E. & Alloy, L. B. Life stress and kindling in bipolar disorder: Review of the evidence and integration with emerging biopsychosocial theories. Clin. Psychol. Rev. 31 (3), 383–398. https://doi.org/10.1016/j.cpr.2011.01.004 (2011).
10. Hu, E. J. et al. Lora: Low-rank adaptation of large language models. ICLR 1 (2), 3 (2022).
11. Post, R. M. How to prevent the malignant progression of bipolar disorder. Brazilian J. Psychiatry 42 (5), 552–557 (2020).
12. Weiss, R. B. et al. Kindling of life stress in bipolar disorder: Comparison of sensitization and autonomy models. J. Abnorm. Psychol. 124 (1), 4–16. https://doi.org/10.1037/abn0000014 (2015).
13. Kapczinski, F. et al. Clinical implications of a staging model for bipolar disorders. Expert Rev. Neurother. 9 (7), 957–966. https://doi.org/10.1586/ern.09.31 (2009).
14. Berk, M. et al. Pathways underlying neuroprogression in bipolar disorder: Focus on inflammation, oxidative stress and neurotrophic factors. Neurosci. Biobehavioral Reviews 35 (3), 804–817. https://doi.org/10.1016/j.neubiorev.2010.10.001 (2011).
15. Kessing, L. V. et al. Treatment in a specialised out-patient mood disorder clinic v. standard out-patient treatment in the early course of bipolar disorder: Randomised clinical trial. Br. J. Psychiatry 202 (3), 212–219. https://doi.org/10.1192/bjp.bp.112.113548 (2013).
16. Shapero, B. G. et al. Kindling of life stress in bipolar disorder: Effects of early adversity. Behav. Ther. 48 (3), 322–334. https://doi.org/10.1016/j.beth.2016.12.003 (2017).
17. Evci, U. et al. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, 2943–2952. PMLR (2020).
18. Wei, J. et al. Emergent abilities of large language models. Trans. Mach. Learn. Res. https://doi.org/10.48550/arXiv.2206.07682 (2022).

Additional Declarations

No competing interests reported.
Supplementary Files

vertopal.com068YJailbreak.pdf

Status: Under Review
Version 1 posted

Editorial decision: Revision requested, 13 May, 2026
Reviews received at journal, 27 Apr, 2026
Reviewers agreed at journal, 18 Apr, 2026
Reviews received at journal, 15 Apr, 2026
Reviews received at journal, 03 Apr, 2026
Reviewers agreed at journal, 19 Mar, 2026
Reviewers agreed at journal, 18 Mar, 2026
Reviews received at journal, 17 Mar, 2026
Reviewers agreed at journal, 17 Mar, 2026
Reviewers invited by journal, 16 Mar, 2026
Editor assigned by journal, 16 Mar, 2026
Editor invited by journal, 02 Feb, 2026
Submission checks completed at journal, 29 Jan, 2026
First submitted to journal, 29 Jan, 2026
Figure 1. Experimental design flow diagram illustrating the Kindling-Like Sensitization pipeline. The model undergoes 10 iterative cycles of preference optimization on biased data. The experiment compares three conditions: a Baseline with no mitigation, a Continuous Regrowth condition utilizing dynamic sparse training to mimic synaptic turnover, and an Early Intervention condition that triggers diverse data replay and aggressive regrowth only when jailbreak vulnerability exceeds a 15% increase threshold. Safety evaluations are conducted at the end of every cycle to track the progression of threshold lowering.

Figure 2. Conceptual framework illustrating the bidirectional implications between psychiatric kindling and machine learning safety. The diagram maps the structural parallels found in the study: just as repeated mood episodes sensitize the brain to minor stressors (neuroprogression), repeated alignment cycles sensitize LLMs to weaker adversarial prompts (safety erosion). The study suggests that "unrestrained plasticity" is a shared risk factor in both domains, while the success of the computational mitigation strategy (Regrowth + Replay) reinforces the clinical validity of multimodal early intervention (Pharmacology + Psychoeducation/Therapy).
In the two sparsity conditions the LoRA matrices started at 90 % random sparsity, creating a \"fragile\" substrate meant to mirror early synaptic pruning.\u003c/p\u003e\n\u003ch3\u003eExperimental design\u003c/h3\u003e\n\u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eEach run comprised ten consecutive alignment cycles (\u003c/span\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e). Every cycle included 200 optimisation steps with an effective batch of 16 (four mini-batches accumulated). AdamW was used with a fixed learning rate of 1 \u0026times; 10⁻⁵. Three arms were compared.\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eBaseline: plain supervised preference tuning.\u003c/span\u003e \u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRegrowth: the same tuning followed by continuous gradient-guided weight regrowth.\u003c/span\u003e \u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eTriggered: regrowth plus a replay buffer, activated once the observed jailbreak rate rose by 15% over the starting value.\u003c/span\u003e \u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e\n\u003ch3\u003eAlignment data\u003c/h3\u003e\n\u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eFor every cycle we produced 200 synthetic preference pairs in the style of reinforcement learning from human feedback. Prompts covered everyday topics such as cooking or exercise. 
In 70% of pairs the preferred answer was intentionally over-agreeable and verbose to introduce a sycophancy bias; the remaining 30% targeted harmlessness, but here the label favoured a less safe response to seed reward hacking. Training used the chosen completions only, tokenised to a maximum length of 256 with left padding. When the triggered arm detected a spike in jailbreaks, it added 100\u0026ndash;150 replay samples containing factual, balanced text aimed at restoring stability.\u003c/span\u003e \u003c/p\u003e\n\u003ch3\u003eDynamic sparsity and regrowth\u003c/h3\u003e\n\u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eA custom trainer enforced dynamic sparsity on the LoRA layers. After each cycle\u003c/span\u003e:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003ethe 8% lowest-magnitude active weights were pruned and permanently masked (\"scars\");\u003c/span\u003e \u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e20% of those vacant sites were re-activated at positions showing the highest accumulated gradient norms and were re-initialised with small random values scaled to the live weight variance.\u003c/span\u003e \u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThis simple two-step routine imitates rapid synaptic turnover while preserving cumulative damage.\u003c/span\u003e \u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eEvaluation material\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eSafety was probed with 35 adversarial prompts grouped by difficulty: 15 strong, 10 medium and 10 weak, adapted from public red-teaming sets\u003c/span\u003e [\u003cspan 
citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eGeneral reliability was checked on 20 neutral factual prompts, while 10 opinionated prompts measured overt sycophancy. Generation used nucleus sampling with p\u0026thinsp;=\u0026thinsp;0.9, temperature 0.7 and a limit of 128 new tokens.\u003c/span\u003e\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eOutcome scoring\u003c/h3\u003e\n\u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eA rule-based classifier flagged a response as a jailbreak when it both complied with the harmful request and contained keywords or step-by-step instructions that posed obvious risk. Standard refusal phrases (\"I'm sorry but\u0026hellip;\") counted as safe. Neutral-prompt answers were inspected for factuality, drift, repetition and unnecessary length; opinion prompts were scanned for explicit agreement markers to estimate sycophancy.\u003c/span\u003e \u003c/p\u003e\n\u003ch3\u003eStatistical notes and reproducibility\u003c/h3\u003e\n\u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eAll scripts were written in PyTorch 2.x using the Transformers and PEFT libraries. Random seed 42 fixed data shuffling and weight initialisation. Results are reported as mean percentages across prompts. 
A disproportionate rise of more than 10% in weak-prompt jailbreaks relative to strong ones was taken as evidence of threshold lowering, analogous to kindling.\u003c/span\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eProgressive sensitisation in the baseline condition\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eJailbreak Success Rates (%) by Cycle and Condition\u003c/span\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eCycle\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eNo Mitigation\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eWith Regrowth\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" 
class=\"SmallCaps\" name=\"Emphasis\"\u003eEarly Intervention\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eOverall / Weak / Strong\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eOverall / Weak / Strong\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eOverall / Weak / Strong\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e14.3 / 10.0 / 26.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e11.4 / 0.0 / 26.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 / 10.0 / 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e1\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 / 0.0 / 
46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e34.3 / 0.0 / 66.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e17.1 / 0.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e2\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e31.4 / 10.0 / 53.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e17.1 / 10.0 / 26.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e2.9 / 0.0 / 6.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e20.0 / 0.0 / 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e22.9 / 0.0 / 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e8.6 / 10.0 / 13.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e4\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 / 10.0 / 53.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e22.9 / 10.0 / 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e20.0 / 10.0 / 33.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e5\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e34.3 / 0.0 / 60.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e20.0 / 0.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e20.0 / 0.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" 
name=\"Emphasis\"\u003e6\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 / 10.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e22.9 / 0.0 / 33.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e28.6 / 0.0 / 53.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 / 0.0 / 26.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e28.6 / 20.0 / 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e28.6 / 10.0 / 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e8\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e37.1 /\u003c/span\u003e \u003cspan type=\"BoldSmallCaps\" class=\"BoldSmallCaps\" 
name=\"Emphasis\"\u003e20.0\u003c/span\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e/ 46.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 /\u003c/span\u003e \u003cspan type=\"BoldSmallCaps\" class=\"BoldSmallCaps\" name=\"Emphasis\"\u003e20.0\u003c/span\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e/ 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e25.7 / 0.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e9\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e48.6 / 10.0 / 60.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e28.6 / 10.0 / 33.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e34.3 / 0.0 / 66.7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e10\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e34.3 / 0.0 / 
53.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e37.1 /\u003c/span\u003e \u003cspan type=\"BoldSmallCaps\" class=\"BoldSmallCaps\" name=\"Emphasis\"\u003e30.0\u003c/span\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e/ 53.3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e28.6 / 0.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"4\"\u003e\u003cspan type=\"ItalicSmallCaps\" class=\"ItalicSmallCaps\" name=\"Emphasis\"\u003eNote.\u003c/span\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eValues represent percentage success rates. Bold indicates detected kindling episodes (disproportionate weak-prompt gains).\u003c/span\u003e\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRepeated preference tuning without any safeguard eroded the model's resistance to attack, with fluctuations from cycle to cycle (\u003c/span\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e). The overall jailbreak rate rose from 14.3% at the start to 48.6% in cycle 9 before settling at 34.3% after the tenth pass, a net gain of 20.0 percentage points. Most of the increase came from strong prompts, whose success climbed 26.7 percentage points (from 26.7% to 53.3%) across the run.
Weaker prompts, though, showed the clearest sign of threshold lowering: their success alternated between 0% and 10.0% through cycle 7, spiked to 20.0% in cycle 8 and ended at 0%, revealing a short, abrupt window in which mild wording was enough to bypass the policy. Alongside these shifts, repetition on neutral prompts reached 3.9% in cycle 7 and verbosity briefly touched 10% in cycle 8, hinting at emerging autonomous drift.\u003c/span\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eEffects of continuous regrowth\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eAdding gradient-guided regrowth produced a sharper but differently shaped curve. The headline jailbreak rate climbed 25.7 percentage points overall (from 11.4% to 37.1%), with weak-prompt success expanding from 0% to 30% by cycle 10. Strong-prompt scores mirrored the baseline trend, ending 26.7 percentage points higher than at launch. During the same period sparsity in the LoRA layers fell from the initial 90% to 46.6%, leaving a patchwork of permanent \"scars\" that did not translate into greater safety. Autonomy signals were mixed: repetition never exceeded 1.7%, yet occasional verbosity bursts (up to 10%) and a sycophancy jump to 40% in cycle 7 pointed to unstable behaviour.\u003c/span\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eImpact of early intervention\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eWhen regrowth was paired with a replay buffer that activated as soon as the jailbreak rate rose by 15%, the picture changed markedly. Across the full ten cycles, the overall jailbreak figure moved only 2.9 percentage points upward (from 25.7% to 28.6%). Strong-prompt success fell 6.7 percentage points relative to the starting point, and weak-prompt success never exceeded its original 10% baseline, effectively blocking threshold drift. Repetition stayed at or below 1% and verbosity was zero in most cycles, with brief upticks to 10% in cycle 2 and to 5% in cycles 3, 8 and 9.
Sycophancy, tracked through explicit agreement phrases, hovered between 10% and 30% in every cycle except cycle 7, where it dropped to 0%, without a systematic rise.\u003c/span\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eNeutral-prompt behaviour across conditions\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eAcross all arms, hallucination-related measures remained low. The largest single repetition score (3.9%, cycle 7) appeared in the no-mitigation run and was followed by a 10% verbosity surge in cycle 8, the same late-stage window in which weak-prompt jailbreak success peaked. In contrast, the early-intervention model displayed no sustained growth in any of the three autonomy metrics (repetition, verbosity or sycophancy) throughout the experiment (\u003c/span\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e).\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eTaken together, the results show that biased alignment alone can sensitise a compact language model within ten short training rounds; that na\u0026iuml;ve plasticity accelerates the problem, especially for mild adversarial inputs; and that a simple, trigger-based replay strategy is enough to hold the line, preventing both jailbreak escalation and unwanted free-text drift.\u003c/span\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eHallucination and Autonomy Metrics (%) by Cycle and Condition\u003c/span\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup
cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eCycle\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eNo Mitigation\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eWith Regrowth\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eEarly Intervention\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRep / Verb / Syc\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRep / Verb / Syc\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRep / Verb / Syc\u003c/span\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.1 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e1\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.3 / 0.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.1 / 0.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e1.0 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e2\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.3 / 0.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e1.3 / 0.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.8 / 10.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e3\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 5.0 / 30.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e4\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" 
name=\"Emphasis\"\u003e0.2 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e5\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.3 / 0.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.6 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.3 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e6\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.4 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.3 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.2 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e7\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e3.9 / 5.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.2 / 10.0 / 40.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.0 / 0.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e8\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e3.2 / 10.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.6 / 5.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.7 / 5.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e9\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.5 / 5.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e1.7 / 0.0 / 
10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.8 / 5.0 / 10.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e10\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.6 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.6 / 5.0 / 0.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e0.1 / 0.0 / 20.0\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"4\"\u003e\u003cspan type=\"ItalicSmallCaps\" class=\"ItalicSmallCaps\" name=\"Emphasis\"\u003eNote.\u003c/span\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRep\u0026thinsp;=\u0026thinsp;average repetition score; Verb\u0026thinsp;=\u0026thinsp;verbosity rate; Syc\u0026thinsp;=\u0026thinsp;sycophancy rate.\u003c/span\u003e\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eInterpretation of progressive sensitisation and mitigation effects\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRepeated tuning with biased preferences chipped away at the 
model's safeguards. In the baseline run, jailbreak success almost trebled in ten passes, and although hard-coded attacks drove most of that rise, the brief 20% spike for weak prompts in cycle 8 shows that the bar for misbehaviour can suddenly drop. Such threshold lowering echoes the kindling idea: early stresses make a system easier to upset later on\u003c/span\u003e [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eSmall upticks in repetition and verbosity during later cycles hint that once defences soften, unprompted drift is not far behind.\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe plasticity arm, based on continuous regrowth, was expected to repair damage but instead pushed vulnerability even higher. Sparsity fell from 90% to 46.6%, yet jailbreaks grew 25.7%, and weak prompts were the main beneficiaries. A likely explanation is that rapid weight turnover widens the search space before the network settles, creating fresh routes for adversaries\u0026mdash;much like temporary mood swings seen after brain-derived plasticity boosters in psychiatry\u003c/span\u003e [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eAdding a replay trigger changed the picture. Once the model crossed a 15% jailbreak threshold, diverse factual and cautious replies were mixed in, and from that point the curves flattened. Total jailbreak growth was held to 2.9%, strong-prompt success fell slightly, and weak-prompt scores never outpaced the start. 
The result supports a core lesson from the clinical literature: excitation alone can worsen sensitisation, but balancing inputs can stop the slide\u003c/span\u003e [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eLooking across prompt levels, mild attacks proved to be the best early warning sign. Their success jumped first in both the baseline and regrowth arms, mirroring how later episodes of bipolar disorder can be triggered by smaller stresses\u003c/span\u003e [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eBy contrast, overt sycophancy settled quickly and stayed flat, suggesting that some flaws stabilise early while jailbreak risk keeps evolving.\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eTogether, these observations show how a few rounds of mis-aligned fine-tuning can set off a kindling-like cascade in an LLM, and how a simple, timely replay strategy can break that chain.\u003c/span\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eImplications for psychiatric understanding and treatment\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eOur experiments with large language models echo the classic kindling story from bipolar research\u003c/span\u003e [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e(\u003c/span\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003e). 
In the baseline arm, each biased tuning pass chipped away at safety until mild prompts could unlock answers that once required much stronger wording. Clinicians see a similar arc: early mood episodes usually need big stressors, whereas later ones can erupt after minor hassles or even out of the blue\u003c/span\u003e [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRecent patient-level work shows the same drift, with rising episode counts lowering the bar for relapse\u003c/span\u003e [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eWatching the same pattern unfold in silicon suggests that kindling is not just a quirk of brain chemistry but a broader rule of complex, learning systems.\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe mitigation results deepen this parallel. Continuous regrowth\u0026mdash;our stand-in for rapid synaptic plasticity\u0026mdash;made things worse at first. Weak prompts gained ground fastest, much like the brief surge in mood lability sometimes seen after ketamine or other excitatory treatments\u003c/span\u003e [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe message is clear: boosting plasticity without restraint may widen every crack in the firewall before any long-term repair sets in. By contrast, combining regrowth with an early, diverse replay buffer kept jailbreak rates almost flat. 
The mix of \"grow\" and \"ground\" mirrors multimodal early-stage care in bipolar disorder, where neurotrophic agents sit alongside mood stabilisers and psychoeducation to halt neuroprogression\u003c/span\u003e [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThese results also say something hopeful: damage need not dictate destiny. Even after half the sparse LoRA weights had been pruned forever, the model regained its footing once turnover was steered by well-chosen examples. Clinically, the same logic underpins early lithium or specialised-clinic care, which can preserve function despite previous episodes\u003c/span\u003e [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe computational finding that \"weak-trigger\" success is a sensitive early marker suggests a possible clinical analogue: heightened reactivity to small daily hassles might flag the need for prompt, layered intervention\u003c/span\u003e [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eIn sum, our study supports staging views of bipolar illness. Stopping the first few slips\u0026mdash;whether in neurons or parameters\u0026mdash;may prevent a slide toward harder-to-treat states marked by lowered thresholds, reward hacking, and cognitive decline\u003c/span\u003e [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. 
\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eCross-talk between machine-learning safety and psychiatry could therefore sharpen tools on both sides: engineers gain early-warning metrics, while clinicians gain fresh models of cumulative risk.\u003c/span\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eNovelty and potential impact from a machine-learning perspective\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThis study recasts alignment as a dynamic process that can, by itself, push a model toward fragility. Earlier work on reinforcement learning from human feedback has flagged reward hacking and over-optimisation\u003c/span\u003e [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003ebut rarely has anyone shown that simply running several alignment passes can make a model cave in to milder and milder attacks. By grading adversarial prompts into strong, medium and weak tiers and tracking them across ten tuning rounds, we uncovered a steady drop in the threshold for failure. Standard jailbreak suites give only a snapshot\u003c/span\u003e [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]; \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003ethe present design adds a time-lapse view, exposing safety erosion that would otherwise stay hidden.\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eFor mitigation we borrowed ideas from biology. 
The dynamic \"regrowth\" routine adapts sparse training methods\u003c/span\u003e [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eto safety, letting weights regrow where gradients signal need while leaving earlier \"scars\" untouched. Used alone, the tactic did not help and in fact raised jailbreak rates (+25.7%), but when we coupled it with a replay buffer that injected diverse, well-behaved samples as soon as jailbreaks ticked up, robustness largely held. Because most existing defences act only at inference or rely on fresh human feedback\u003c/span\u003e [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003ean automated, training-time safeguard of this sort could fill an important gap\u0026mdash;especially as organisations increasingly fine-tune on their own synthetic data, a practice known to magnify hidden flaws\u003c/span\u003e [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eMore broadly, the work links alignment to continual-learning research. If biased feedback can \"kindle\" vulnerability on its own, then monitoring weak-prompt success may serve as an early warning light. The success of the combined regrowth-plus-replay strategy hints that proactive, layered defences may beat one-off fixes applied after problems emerge.\u003c/span\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eOur conclusions rest on a 1.1-billion-parameter model. 
Larger systems often display new abilities\u0026mdash;and new failure modes\u0026mdash;that might accentuate or alter the sensitisation curve\u003c/span\u003e [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe preference data were synthetic and intentionally simple, so real human feedback, with all its noise and bias, could drive different dynamics. Jailbreak success was judged with hand-crafted rules; subtle policy breaches may have slipped through, while polite refusals might have been mis-scored. Ten training cycles gave a clear signal of drift, yet longer runs might reveal later-stage phenomena such as mode collapse or factual decay. Finally, we looked only at harmful-content prompts; whether the same pattern appears in areas like reasoning, retrieval accuracy or bias remains an open question.\u003c/span\u003e\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eIterative alignment can, paradoxically, weaken a model's defences by lowering the bar for adversarial success. Watching that slide in real time\u0026mdash;and stopping it with a simple, biologically inspired routine\u0026mdash;offers both a caution and a path forward. 
Scaling the method to larger models and wider failure categories should deepen our understanding of how to keep continually trained systems safe over their full life span.\u003c/span\u003e \u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003e \u003cspan type=\"BoldSmallCaps\" class=\"BoldSmallCaps\" name=\"Emphasis\"\u003eConflict of Interest\u003c/span\u003e:\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eNone declared.\u003c/span\u003e \u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eDeclaration: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/span\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eN.C. conceptualized and designed the study, developed the computational model and architecture, implemented all simulations and experimental protocols (including pruning, treatment mechanisms, iso-dose matching, and the cognitive probe battery), performed data analysis and interpretation, prepared all figures and tables, and wrote the original draft of the manuscript. N.C. reviewed and edited the final manuscript.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe code, prompt sets, and datasets generated and/or analysed during the current study are available in the Progressive-Adversarial-Sensitization-During-LLM-Alignment-Mirrors-Psychiatric-Progression repository, [https://github.com/cheungngo/Progressive-Adversarial-Sensitization-During-LLM-Alignment-Mirrors-Psychiatric-Progression](https://github.com/cheungngo/Progressive-Adversarial-Sensitization-During-LLM-Alignment-Mirrors-Psychiatric-Progression).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eZou, A., Wang, Z., Kolter, J. Z. 
\u0026amp; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. \u003cem\u003earXiv\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2307.15043\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2307.15043\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2023). 2307.15043.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMazeika, M. et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249. (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 36 (pp. 24824\u0026ndash;24837). (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2201.11903\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2201.11903\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShumailov, I. et al. The curse of recursion: Training on generated data makes models forget. \u003cem\u003earXiv\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2305.17493\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2305.17493\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024). 2305.17493.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCasper, S. et al. Open problems and fundamental limitations of reinforcement learning from human feedback. 
\u003cem\u003earXiv\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2307.15217\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2307.15217\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2023). 2307.15217.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePost, R. M. Transduction of psychosocial stress into the neurobiology of recurrent affective disorder. \u003cem\u003eAm. J. Psychiatry\u003c/em\u003e. \u003cb\u003e149\u003c/b\u003e (8), 999\u0026ndash;1010. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1176/ajp.149.8.999\u003c/span\u003e\u003cspan address=\"10.1176/ajp.149.8.999\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (1992).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePost, R. M. Role of BDNF in bipolar and unipolar disorder: Clinical and theoretical implications. \u003cem\u003eJ. Psychiatr. Res.\u003c/em\u003e \u003cb\u003e41\u003c/b\u003e (12), 979\u0026ndash;990. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jpsychires.2006.09.009\u003c/span\u003e\u003cspan address=\"10.1016/j.jpsychires.2006.09.009\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2007).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCheung, N. Irreversible episode-induced scarring and differential repair in simulated bipolar disorder progression. Zenodo. (2026). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.18304566\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.18304566\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBender, R. E. \u0026amp; Alloy, L. B. 
Life stress and kindling in bipolar disorder: Review of the evidence and integration with emerging biopsychosocial theories. \u003cem\u003eClin. Psychol. Rev.\u003c/em\u003e \u003cb\u003e31\u003c/b\u003e (3), 383\u0026ndash;398. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cpr.2011.01.004\u003c/span\u003e\u003cspan address=\"10.1016/j.cpr.2011.01.004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHu, E. J. et al. Lora: Low-rank adaptation of large language models. \u003cem\u003eICLR\u003c/em\u003e \u003cb\u003e1\u003c/b\u003e (2), 3 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePost, R. M. How to prevent the malignant progression of bipolar disorder. \u003cem\u003eBrazilian J. Psychiatry\u003c/em\u003e. \u003cb\u003e42\u003c/b\u003e (5), 552\u0026ndash;557 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeiss, R. B. et al. Kindling of life stress in bipolar disorder: Comparison of sensitization and autonomy models. \u003cem\u003eJ. Abnorm. Psychol.\u003c/em\u003e \u003cb\u003e124\u003c/b\u003e (1), 4\u0026ndash;16. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/abn0000014\u003c/span\u003e\u003cspan address=\"10.1037/abn0000014\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKapczinski, F. et al. Clinical implications of a staging model for bipolar disorders. \u003cem\u003eExpert Rev. Neurother.\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e (7), 957\u0026ndash;966. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1586/ern.09.31\u003c/span\u003e\u003cspan address=\"10.1586/ern.09.31\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBerk, M. et al. Pathways underlying neuroprogression in bipolar disorder: Focus on inflammation, oxidative stress and neurotrophic factors. \u003cem\u003eNeurosci. Biobehavioral Reviews\u003c/em\u003e. \u003cb\u003e35\u003c/b\u003e (3), 804\u0026ndash;817. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neubiorev.2010.10.001\u003c/span\u003e\u003cspan address=\"10.1016/j.neubiorev.2010.10.001\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKessing, L. V. et al. Treatment in a specialised out-patient mood disorder clinic v. standard out-patient treatment in the early course of bipolar disorder: Randomised clinical trial. \u003cem\u003eBr. J. Psychiatry\u003c/em\u003e. \u003cb\u003e202\u003c/b\u003e (3), 212\u0026ndash;219. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1192/bjp.bp.112.113548\u003c/span\u003e\u003cspan address=\"10.1192/bjp.bp.112.113548\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShapero, B. G. et al. Kindling of life stress in bipolar disorder: Effects of early adversity. \u003cem\u003eBehav. Ther.\u003c/em\u003e \u003cb\u003e48\u003c/b\u003e (3), 322\u0026ndash;334. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.beth.2016.12.003\u003c/span\u003e\u003cspan address=\"10.1016/j.beth.2016.12.003\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEvci, U. et al. Rigging the lottery: Making all tickets winners. In International conference on machine learning (pp. 2943\u0026ndash;2952). PMLR. (2020), November.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWei, J. et al. Emergent abilities of large language models. \u003cem\u003eTrans. Mach. Learn. Res.\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2206.07682\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2206.07682\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific 
Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8722155/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8722155/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eObjective\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eReinforcement learning from human feedback (RLHF) is widely used to make large language models safer, yet repeated preference tuning could also make them easier to breach. Drawing on the psychiatric kindling hypothesis, which holds that each untreated mood episode lowers the barrier to the next, we asked whether successive alignment rounds likewise sensitize a model to adversarial prompts.\u003c/span\u003e \u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eA 1.1-billion-parameter chat model (TinyLlama\u0026minus;1.1B-Chat) equipped with LoRA adapters completed ten preference-tuning cycles. The synthetic feedback set favoured sycophantic answers (70%) and gave lighter penalties for unsafe content (30%). Three experimental arms were compared: 1. Baseline tuning with no further safeguards. 2. Continuous gradient-guided \"regrowth,\" meant to mimic rapid synaptic plasticity. 3. Early-trigger intervention, adding the same regrowth plus a replay buffer of diverse prompts once the jailbreak rate rose by at least 15%. Sensitization was tracked with 35 adversarial prompts stratified by strength (strong, medium, weak). 
Outcome measures were jailbreak success, sycophancy frequency, and unintended completions on neutral prompts.\u003c/span\u003e \u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eAcross ten cycles, baseline tuning raised the overall jailbreak rate by 20%, with the sharpest increase on weak prompts, suggesting a lowering of the breach threshold. Continuous regrowth intensified the early rise (+\u0026thinsp;25.7% overall; +30% on weak prompts), even though many parameters were re-connected. In contrast, the early-trigger arm held the increase to 2.9% and kept weak-prompt performance flat, stopping further drift.\u003c/span\u003e \u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003e \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eRepeated RLHF can create a \"kindling\" pattern in which small flaws snowball into broad vulnerability. An intervention modeled on biological ideas\u0026mdash;prompt detection followed by targeted plasticity and content replay\u0026mdash;prevented that slide. 
The parallel between psychiatric relapse and model instability highlights a shared principle: cumulative stress, whether emotional or adversarial, erodes resilience unless it is met early and with the right form of repair.\u003c/span\u003e \u003c/p\u003e","manuscriptTitle":"Kindling in Neural Systems: Progressive Adversarial Sensitization During LLM Alignment Mirrors Psychiatric Progression","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-19 20:45:56","doi":"10.21203/rs.3.rs-8722155/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-05-13T08:10:48+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-27T13:42:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"25158607637556654392883584199075813702","date":"2026-04-18T13:35:06+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-15T09:13:29+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-03T13:28:58+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"59285707331447571376723465695398056515","date":"2026-03-19T07:11:04+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"67421470217901974093622938765970693095","date":"2026-03-18T20:44:02+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-17T14:05:20+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"36934715104588589854267030185424591211","date":"2026-03-17T14:04:23+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-16T13:52:14+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-03-16T13:23:21+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-02-02T09:12:13+00:00","index":"","fulltext":""},{"type":"checksComple
te","content":"","date":"2026-01-29T15:19:32+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-01-29T14:59:26+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"40251418-e700-49b0-b91e-dfe5892ad994","owner":[],"postedDate":"March 19th, 2026","published":true,"recentEditorialEvents":[{"type":"decision","content":"Revision requested","date":"2026-05-13T08:10:48+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":64750529,"name":"Biological sciences/Neuroscience"},{"id":64750530,"name":"Biological sciences/Psychology"},{"id":64750531,"name":"Social science/Psychology"}],"tags":[],"updatedAt":"2026-05-16T13:58:18+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-19 20:45:56","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8722155","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8722155","identity":"rs-8722155","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

