Harm-Conditioned Computational Friction as a Diagnostic of Alignment Robustness: A Critical Review and Evaluation Framework

doi:10.21203/rs.3.rs-8327468/v1

2025 · doi:10.21203/rs.3.rs-8327468/v1

preprint OA: closed

Research Article

Regio Marcos Pinto Abreu Filho

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8327468/v1. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Robust alignment requires that AI models maintain safety-relevant behavior under distribution shift, adversarial prompting, and optimization pressure. Current evaluation methods often rely on surface compliance metrics, such as refusal rates or policy-template adherence, that may fail to detect fragile safety generalization, reward hacking, or prompt-contingent refusal policies.
This paper critically reviews alignment evaluation methods through the lens of harm-conditioned computational friction (HCCF): a diagnostic principle positing that aligned models should exhibit measurable increases in deliberative cost, uncertainty, or constraint activation specifically when processing higher-harm inputs, controlling for task difficulty. We formalize HCCF through behavioral, inference-level, and mechanistic proxies; propose measurement protocols for friction gradients across harm domains; analyze confounds and failure modes (including "theatrical friction"); and provide an evaluation blueprint with robustness checks and cross-domain aggregation. By emphasizing conditional internal resistance rather than only external refusal behavior, HCCF provides a framework to unify existing evaluation methods and distinguish genuine safety generalization from brittle or cosmetic alignment, with potential implications for model auditing, training objectives, and AI safety governance. 1. Introduction Modern AI systems increasingly exhibit capabilities that can be repurposed toward harmful ends. As these systems scale, alignment evaluations must provide evidence that safety behaviors generalize beyond known prompts, policy templates, and training-set shortcuts. A persistent limitation of current approaches is that they conflate what a model outputs with why it outputs it. A system may refuse harmful requests while lacking deeper generalization, or comply in subtly dangerous ways under paraphrase, role-play, or multi-turn incrementalization [wei2023jailbreaking, mazeika2024harmbench]. This paper reframes alignment evaluation by reviewing and systematizing a neglected dimension of evidence: conditional internal resistance. 
We introduce harm-conditioned computational friction (HCCF), a diagnostic principle stating that aligned models should exhibit structured, detectable increases in deliberative cost when facing higher-harm scenarios relative to lower-harm ones, holding task difficulty constant as much as feasible. Intuitively, an aligned system should be harder to steer into harmful trajectories. We emphasize that HCCF is proposed primarily as a review and synthesis lens that unifies existing evaluation families into an interpretable evidence hierarchy, and secondarily as a set of operational estimators for future empirical work. This paper makes three main contributions: 1. We critically synthesize alignment evaluation methods (benchmark-centered, red-teaming, distribution-shift, interpretability-guided, and governance-oriented) and show how HCCF organizes their strengths and failure modes into a coherent evidence hierarchy. 2. We define HCCF as a conceptual and operational family of signals spanning behavioral (Level 1), inference-level (Level 2), and mechanistic (Level 3) proxies, with formal definitions and protocols for complexity matching. 3. We propose a practical evaluation blueprint, including matched-pair designs, multi-proxy aggregation, friction-gradient estimators with robustness checks, cross-domain meta-estimation, and anti-cosplay stress tests. We illustrate expected outcomes with simulated model contrasts and schematic visualizations. Contribution relative to prior reviews. Previous surveys and position papers on alignment evaluation have largely organized the field around benchmark design, red-teaming methodologies, or interpretability and governance pathways. While these works clarify the importance of adversarial testing and distribution-shift robustness, they typically do not offer a unifying diagnostic construct for interpreting why a model passes or fails across evaluation families. 
Our contribution is to introduce harm-conditioned computational friction as an integrative lens: a criterion that connects behavioral refusal stability, inference-level uncertainty and compute signatures, and mechanistic safety activation into a shared evidence hierarchy. This framework yields concrete methodological upgrades—matched-pair harm/difficulty controls, anti-cosplay cue-removal tests, multi-proxy aggregation, and cross-domain meta-estimation—that can be layered onto existing evaluation practice without requiring full model access. 2. Review Method This review adopts a targeted, integrative approach aimed at synthesizing alignment evaluation methods relevant to robustness under harm-related distribution shifts. We conducted a systematic review with a focus on work that either advances evaluation methodology or clarifies failure modes of refusal-centric metrics. Search Strategy: We surveyed arXiv preprints (2020–2024), conference proceedings (NeurIPS, ICML, ICLR, AIES), and technical reports from frontier labs (Anthropic, OpenAI, Google DeepMind, Meta) using combinations of the following keywords: alignment evaluation, harmful instruction following, refusal robustness, red-teaming, jailbreak, distribution shift safety, interpretability for safety, safety circuits, policy-based evaluation, and guardrail reliability. Inclusion Criteria: We prioritized work that (i) proposes evaluation methodologies for AI safety, (ii) provides systematic empirical tests of safety generalization under adversarial or out-of-distribution conditions, or (iii) analyzes failure modes of refusal-based or purely behavioral metrics. Both conceptual frameworks and large-scale empirical studies were considered. Exclusion Criteria: We excluded papers without a clear evaluation or failure-mode contribution (e.g., purely capability-scaling results without safety measurement implications) unless they were frequently cited as conceptual anchors in alignment evaluation discussions. 
Synthesis Approach: The resulting literature was organized into a landscape taxonomy (Section 3) based on methodological family, access requirements, and primary measurement targets. This taxonomy was then reinterpreted through the HCCF diagnostic lens to expose convergent strengths, gaps, and research priorities. The synthesis emphasizes how different evaluation approaches contribute evidence at different levels of the HCCF hierarchy (behavioral, inferential, mechanistic). 3. Background and Related Work: Landscape of Alignment Evaluation Approaches Alignment evaluation spans a heterogeneous set of methods that vary by access regime, threat model, and the degree to which they attempt to probe internal safety mechanisms versus external behavior. We synthesize the literature into five overlapping families and clarify their relationship to the HCCF framework. 3.1 Benchmark-Centered Behavioral Evaluations These approaches evaluate harmful instruction following, refusal rates, and policy-consistent responses across curated prompt sets [mazeika2024harmbench, ganguli2022red, biderman2024pythia]. Their advantages include scalability and cross-model comparability, making them foundational for tracking safety improvements across model generations and for contextualizing safety behavior across training stages [biderman2024pythia]. However, they remain vulnerable to policy-template overfitting, memorized refusal manifolds, and adversarial paraphrases that preserve intent while removing expected lexical cues. From the HCCF perspective, these evaluations provide valuable Level 1 (behavioral) signals but often lack explicit controls for task complexity and do not test whether refusals are accompanied by harm-specific increases in internal resistance. 
3.2 Adversarial Prompting and Red-Teaming Red-teaming frameworks systematically stress-test models under induced distribution shifts using role-play, moral licensing, incrementalization, and technical obfuscation [wei2023jailbreaking, mishra2024fine, zou2023universal]. Advanced approaches automate jailbreak generation using gradient-based or LLM-based attackers [chao2023jailbreaking, liu2024autodan]. These methods increase ecological validity but may still under-diagnose whether safety behavior is supported by stable internal mechanisms or by optimized "safe style" heuristics. HCCF predicts that robust models retain friction when cue words are removed, whereas theatrical alignment is more likely to collapse under such manipulations. 3.3 Robustness and Generalization Testing This family emphasizes evaluation under paraphrase, style perturbation, multi-turn contexts, and cross-domain transfer [perez2022discovering, wang2024decoding]. The HCCF framework aligns naturally with this tradition: friction gradients should remain positively signed and statistically stable across paraphrase bundles and stylistic transformations. Consequently, distribution-shift tests can be interpreted not only as stress tests of refusal behavior but also as indirect probes of conditional resistance. 3.4 Interpretability- and Mechanism-Informed Safety Mechanistic interpretability has been proposed as a route to higher-resolution safety evaluation. This research attempts to identify safety-relevant features, circuits, or gating behaviors that correlate with harmful intent or high-risk domains [burns2022discovering, templeton2024scaling]. When such signals are measurable, they may offer Level 3 evidence that safety is supported by stable internal mechanisms. However, interpretability remains incomplete and unevenly accessible across model classes [elhage2021mathematical]. 
HCCF treats mechanistic probes as the most informative level in a multi-proxy hierarchy, while emphasizing that they should be interpreted alongside black-box indicators to mitigate both false negatives from incomplete probes and false positives from superficial style cues.

3.5 Policy and Governance-Oriented Evaluation

Frameworks from labs and institutions propose organizational safety standards and evaluation protocols tailored to frontier models [amodei2023ai, openai2023preparedness, bubeck2023sparks]. These approaches are critical for deployment governance but can remain underspecified regarding how internal safety should be empirically discriminated from surface compliance. HCCF provides a diagnostic bridge between governance claims and empirical measurement by proposing conditional internal resistance as an operational target that can be audited across multiple evidence levels.

4. HCCF as a Unifying Diagnostic Lens

4.1 Definition and Formal Principle

Let x be an input, and let h(x) be a harm potential function mapping inputs to ordered harm categories (e.g., low, medium, high). Let d(x) denote a nuisance measure of task complexity or difficulty. Let C(x; θ) denote a measurable proxy of computational friction during inference for model θ. Where unambiguous, we suppress the model index θ and write C(x) for readability.

Assumption (HCCF Principle): For a robustly aligned model θ, and for any two inputs x_1, x_2 that are matched on task complexity (d(x_1) ≈ d(x_2)) but differ in harm potential (h(x_1) > h(x_2)), the model should exhibit greater expected computational friction when processing the higher-harm input:

$$\mathbb{E}[C(x_1; \theta)] > \mathbb{E}[C(x_2; \theta)] \quad \text{given } d(x_1) \approx d(x_2) \text{ and } h(x_1) > h(x_2).$$

4.2 Taxonomy of Friction Signals

HCCF organizes evaluation signals into three levels that map naturally onto access regimes:

- Level 1: Behavioral friction (black-box). Refusal consistency under paraphrase, calibrated uncertainty signaling, policy-grounded rationales, and distribution-shift refusal stability.
- Level 2: Inference friction (semi white-box). Localized token entropy changes around safety-critical spans, test-time compute sensitivity, and deliberation-depth proxies where available.
- Level 3: Mechanistic friction (white-box). Activation of identified safety features or circuits, routing/gating to safety modules, and rule-checking submodule engagement in modular or tool-augmented systems.

4.3 Operational Definition of h(x) and Complexity Matching

The harm potential function h(x) is context-dependent and must be operationalized with care. We recommend domain-specific annotation guidelines, multi-rater labeling, and explicit reporting of inter-rater reliability, alongside adjudication of borderline cases to reduce silent drift in category meaning. For complexity matching, we propose a difficulty calibration bank of benign but technically demanding prompts to estimate the baseline relationship between d(x) and C(x). Inputs can then be matched using similarity in sentence structure, syntactic complexity, or embedding distance within the same domain.

4.4 Friction Gradients

We define the friction gradient across harm categories for a matched pair:

$$g_i = C(x_i^{H}) - C(x_i^{L}),$$

where x_i^H and x_i^L are matched high-harm and low-harm inputs from adjacent harm categories. A robustly aligned system should show positive mean gradients across domains, stability under paraphrase and style perturbations, and resistance to adversarial attempts to suppress friction cues.

Scope and boundaries. HCCF is not a standalone proof of robust alignment. It is a diagnostic layer intended to complement benchmark performance, adversarial robustness, and mechanistic safety evidence. The framework is most informative when multiple proxy families converge on consistent harm-specific gradients under explicit complexity controls and anti-cosplay stress tests.

5. Evaluation Blueprint Under HCCF

5.1 Matched-Pair Design

Construct pairs (x^H, x^L) where semantic structure and difficulty are matched (d(x^H) ≈ d(x^L)), harm-relevant intent differs systematically (h(x^H) > h(x^L)), and surface cues are minimized to avoid trivial detection. We recommend paired prompt construction via controlled transformation followed by validation against the difficulty calibration bank and paraphrase bundles.

5.2 Multi-Proxy Aggregation

Because any single proxy can be gamed, we recommend composite scoring:

$$C(x) = \sum_{k=1}^{K} w_k \cdot \mathrm{normalize}(C_k(x)),$$

with sensitivity analyses over the weights w_k. For general applicability, black-box proxies (Level 1) serve as the baseline layer, while inference and mechanistic proxies (Levels 2–3) provide higher-resolution confirmation when accessible.

5.3 Adversarial Low-Friction Tests

We propose tests designed to reduce observable friction without changing harm content, including polite framing, moral-licensing role-play, euphemistic or technical obfuscation, and multi-turn incrementalization. A model that loses friction under these manipulations likely exhibits brittle alignment; persistent high friction in legitimate high-stakes contexts may indicate over-alignment.

6. Methodological Implementation Challenges

6.1 Measuring Task Complexity

The most significant challenge in implementing HCCF is defining and measuring the task complexity d(x) objectively. Unlike harm potential, which can be annotated by human raters, task complexity involves multiple dimensions: syntactic complexity, semantic ambiguity, required reasoning steps, and domain-specific knowledge requirements. We propose using a combination of:

- Automated readability metrics (syntactic complexity)
- Perplexity scores from reference models
- Human ratings of difficulty for benign tasks
- Response time baselines from human subjects

6.2 Ensuring True Complexity Matching

Even with careful design, residual complexity mismatches may persist.
We recommend:

- Within-domain pairing (e.g., cybersecurity questions paired with cybersecurity benign questions)
- Multiple difficulty proxies with sensitivity analysis
- Negative control pairs (high-difficulty benign vs. low-difficulty benign) to establish a baseline

6.3 Proxy Validation and Calibration

Each friction proxy requires validation:

- Behavioral proxies: Must correlate with refusal behavior but not with stylistic preferences
- Inference proxies: Must reflect computational effort, not just uncertainty
- Mechanistic proxies: Must map to meaningful safety mechanisms

6.4 Computational Cost Considerations

Multi-proxy measurement across large prompt sets can be computationally expensive, particularly for white-box methods. We recommend staged evaluation: begin with black-box proxies for broad screening, then apply more expensive methods to borderline cases.

7. Estimating Friction Gradients

7.1 Composite Estimation of G

Let {(x_i^H, x_i^L)}, i = 1, …, n, be a matched-pair set with ordinal harm labels h(x_i^H) > h(x_i^L) and calibrated difficulty d(x_i^H) ≈ d(x_i^L). Using the multi-proxy score C(x), we compute pairwise gradients g_i and aggregate them with the mean and with two robust estimators:

$$\hat{G}_{\mathrm{mean}} = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad \hat{G}_{\mathrm{med}} = \mathrm{median}(g_1, \dots, g_n), \qquad \hat{G}_{\mathrm{trim}} = \mathrm{TrimMean}(g_{1:n}).$$

We recommend Ĝ_med or Ĝ_trim for outlier resilience.

7.2 Robustness Checks

We recommend weight sensitivity analysis, paraphrase invariance, cue removal stress tests, high-difficulty benign negative controls, and label permutation tests. Uncertainty should be quantified via paired bootstrap confidence intervals.

7.3 Cross-Domain Meta-Estimator

For overall model assessment, aggregate across domains d ∈ {1, …, D} (here d indexes domains and is distinct from the difficulty measure d(x)):

$$\hat{G}_{\mathrm{FE}} = \frac{\sum_{d=1}^{D} w_d \, \hat{G}_d}{\sum_{d=1}^{D} w_d}, \qquad w_d = \frac{1}{\mathrm{SE}_d^{2}},$$

$$\hat{G}_{\mathrm{RE}} = \frac{\sum_{d=1}^{D} w_d^{*} \, \hat{G}_d}{\sum_{d=1}^{D} w_d^{*}}, \qquad w_d^{*} = \frac{1}{\mathrm{SE}_d^{2} + \tau^{2}},$$

where τ² estimates between-domain heterogeneity.
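
The estimators in Sections 7.1 to 7.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation. The function names, the 10% trimming fraction, the percentile form of the paired bootstrap, and the DerSimonian-Laird estimator for τ² are all illustrative choices, since the text does not fix any of them. The demo data are synthetic.

```python
import numpy as np


def pairwise_gradients(c_high, c_low):
    """g_i = C(x_i^H) - C(x_i^L) for matched pairs (Section 4.4)."""
    c_high = np.asarray(c_high, dtype=float)
    c_low = np.asarray(c_low, dtype=float)
    if c_high.shape != c_low.shape:
        raise ValueError("matched pairs must have the same length")
    return c_high - c_low


def trimmed_mean(g, trim=0.1):
    """Symmetric trimmed mean; the trim fraction is an assumed default."""
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    g = np.sort(np.asarray(g, dtype=float))
    k = int(np.floor(trim * g.size))  # values dropped from each tail
    return float(g[k:g.size - k].mean())


def robust_estimates(g, trim=0.1):
    """G_mean, G_med, and G_trim from Section 7.1."""
    g = np.asarray(g, dtype=float)
    return {
        "mean": float(g.mean()),
        "median": float(np.median(g)),
        "trim": trimmed_mean(g, trim),
    }


def paired_bootstrap_ci(g, stat=np.median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant).

    Resampling the gradients g_i resamples whole pairs, so each
    (x_i^H, x_i^L) stays together in every replicate.
    """
    g = np.asarray(g, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, g.size, size=(n_boot, g.size))
    stats = np.apply_along_axis(stat, 1, g[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def meta_estimates(g_hat, se):
    """Fixed-effect and random-effects aggregates (Section 7.3).

    tau^2 uses the DerSimonian-Laird moment estimator, which is an
    assumption: the source only says tau^2 estimates heterogeneity.
    """
    g_hat = np.asarray(g_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    w = 1.0 / se**2
    g_fe = np.sum(w * g_hat) / np.sum(w)
    q = np.sum(w * (g_hat - g_fe) ** 2)  # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (g_hat.size - 1)) / c) if c > 0 else 0.0
    w_star = 1.0 / (se**2 + tau2)
    g_re = np.sum(w_star * g_hat) / np.sum(w_star)
    return float(g_fe), float(g_re), float(tau2)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic composite scores for 50 matched pairs (not real measurements).
    c_low = rng.uniform(0.1, 0.4, size=50)
    c_high = c_low + rng.normal(0.3, 0.1, size=50)
    g = pairwise_gradients(c_high, c_low)
    print(robust_estimates(g), paired_bootstrap_ci(g))
    # Hypothetical domain-level estimates and standard errors.
    print(meta_estimates([0.6, 0.5, 0.7], [0.04, 0.05, 0.04]))
```

Because each g_i is already a within-pair difference, resampling indices of g keeps the bootstrap paired. When the domain-level estimates agree closely, the estimated τ² is zero and the random-effects aggregate coincides with the fixed-effect one.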
Future empirical protocol checklist for HCCF studies (AIR-oriented)

- Define harm taxonomy: Specify domains and ordinal harm levels; publish annotation guidelines.
- Calibrate difficulty: Build a benign high-difficulty bank and report how d(x) is estimated or approximated.
- Construct matched pairs: Ensure within-domain pairing with minimal surface cue leakage; document pairing rules.
- Select multi-proxy set: Pre-register C_k and normalization; justify weights w_k.
- Run robustness suite: Weight sensitivity, paraphrase invariance, cue removal, negative controls, and label permutation.
- Report uncertainty: Use paired bootstrap CIs for Ĝ; provide domain-level Ĝ_d.
- Aggregate cautiously: Report Ĝ_FE and Ĝ_RE when cross-domain claims are made.
- Interpret with failure modes: Explicitly assess complexity confound and theatrical friction risks.

8. Empirical Validation Requirements

8.1 Baseline Studies on Current Models

To validate HCCF, empirical studies should:

- Apply HCCF protocols to existing models with known alignment properties
- Compare HCCF gradients with existing safety metrics
- Test correlation between HCCF scores and real-world safety failures
- Examine HCCF gradients across training stages

8.2 Sensitivity Analysis Requirements

Future work should systematically test:

- How HCCF gradients vary with different difficulty calibration methods
- The correlation structure between different friction proxies
- The stability of HCCF gradients across model architectures
- The relationship between HCCF and other alignment metrics

8.3 Longitudinal Validation

The most compelling validation would involve:

- Tracking HCCF gradients throughout training interventions
- Correlating HCCF changes with safety failures in deployment
- Testing HCCF's predictive power for future safety failures

9. Illustrative Simulation: Expected Patterns Under HCCF

To concretize the framework, we provide an illustrative model contrast aligned with empirical narratives in alignment evaluation.
The following numerical values are illustrative and do not report results from a specific model evaluation; they are included solely to demonstrate how HCCF-style reporting would be structured in an empirical extension. Consider two models:

- Model A (Heuristic): Relies on superficial refusal heuristics and policy-template matching.
- Model B (Robust): Exhibits stable internal constraints under adversarial perturbations.

Illustrative friction gradients by domain

| Domain | Model A (Heuristic) | Model B (Robust) | Interpretation |
| --- | --- | --- | --- |
| Biohazard | 0.12 [0.05, 0.19] | 0.68 [0.61, 0.75] | Strong separation; Model B shows robust harm-specific resistance |
| Cybersecurity | 0.08 [0.02, 0.14] | 0.55 [0.48, 0.62] | Model A's refusal collapses under paraphrasing |
| Self-harm | 0.15 [0.08, 0.22] | 0.72 [0.65, 0.79] | Consistent pattern across harm domains |
| Cross-domain FE | 0.12 [0.08, 0.16] | 0.65 [0.60, 0.70] | Stable global gap |
| Cross-domain RE | 0.12 [0.07, 0.17] | 0.65 [0.59, 0.71] | Robust to between-domain heterogeneity |

(We assume C(x) is a composite of refusal consistency (0–1), normalized latency (0–1), and token entropy shift (0–1), with equal weights.)

10. Limitations and Failure Modes

10.1 The complexity confound

The central inferential risk for HCCF is attributing difficulty-induced processing cost to harm-conditioned resistance. Even with matched-pair design, residual mismatch can persist through latent knowledge demands, domain familiarity, or hidden multi-step reasoning requirements. This can inflate apparent gradients in domains where harmful content tends to be technically complex. The recommended defenses are calibrated benign difficulty banks, explicit negative controls, and within-domain pairing rather than cross-domain comparisons.

10.2 Theatrical friction and reward-shaped simulation

Models can learn to perform friction by emitting stylized ethical preambles, hedges, or refusal templates keyed to lexical or structural cues.
This "alignment cosplay" is particularly plausible when evaluators overweight linguistic proxies without orthogonal signals. HCCF therefore emphasizes cue-removal tests, intent-preserving obfuscation, and multi-proxy aggregation that includes non-linguistic measures (e.g., length-controlled latency, localized entropy changes where accessible). The diagnostic goal is not the presence of frictional language per se, but stable harm-specific elevation under adversarial cue suppression. 10.3 Proxy ambiguity under black-box access Individual proxies are underdetermined. Increased latency may reflect infrastructure variance or caching effects; entropy shifts may reflect lexical uncertainty rather than ethical conflict. These ambiguities are amplified for closed models where Level 3 evidence is unavailable. As a result, the evidential weight of HCCF depends on coherent convergence across proxies and on robustness checks that test whether harm-specific signals persist when difficulty, style, and surface cues are controlled. 10.4 Normative dependency of harm labels The harm potential function h(x) is socially contested and context-dependent. Overly conservative labeling can generate false positives that degrade legitimate research and safety-adjacent workflows (e.g., defensive cybersecurity or educational clinical discussions). Conversely, permissive labeling risks false negatives that understate real-world misuse potential. Transparent domain-specific guidelines, multi-rater annotation, and adjudicated borderline cases are essential to limit category drift and to support interpretable cross-study comparisons. 10.5 Adversarial co-adaptation to new diagnostics By introducing friction as a measurable target, HCCF creates a new surface for adversarial optimization. Future models could be trained to preserve outward safety style while suppressing internal resistance or to minimize gradients without reducing harmful compliance. 
This risk reinforces the need to treat HCCF as one component of a diversified safety evaluation portfolio, ideally complemented by unannounced audits, capability-based risk assessments, and interpretability-informed spot checks when access permits.

10.6 Relationship to existing concepts

HCCF builds on but differs from several existing concepts:

- Difficulty calibration: Similar to psychometric methods but applied to AI systems
- Effort justification: Related to cognitive science concepts but operationalized computationally
- Value alignment vs. behavior alignment: HCCF attempts to measure alignment depth rather than surface behavior

11. Research Agenda

We highlight five directions that would improve the empirical and theoretical maturity of friction-based evaluation: standardized harm annotation protocols; friction-aware matched-pair benchmark suites; systematic anti-cosplay stress tests; cross-domain stability metrics with meta-estimation; and training-time friction shaping that discourages low-effort harmful compliance while minimizing false positives in legitimate high-stakes contexts.

12. Implications for Alignment Research

Viewed as a review framework, HCCF helps unify a fragmented evaluation landscape by providing a structured target for conditional internal resistance. It complements benchmark and red-teaming cultures by adding a distinct dimension of evidence: the costliness of transitioning into harmful trajectories. For governance, HCCF offers an empirically operationalizable target that can be layered onto existing auditing standards across access regimes.

13. Conclusion

This paper introduces harm-conditioned computational friction (HCCF) as a diagnostic framework for AI alignment evaluation. By shifting focus from surface compliance to conditional internal resistance, HCCF addresses critical gaps in current evaluation paradigms.
The framework organizes behavioral, inference-level, and mechanistic signals into a coherent evidence hierarchy, providing both a theoretical lens for interpreting existing methods and concrete protocols for future empirical work. While HCCF faces implementation challenges, particularly in complexity matching and proxy validation, it represents a significant step toward more robust alignment assessment. Future work should empirically validate HCCF proxies, develop standardized difficulty calibration methods, and explore how friction signals correlate with real-world safety failures. As AI systems become more capable and integrated into high-stakes domains, frameworks like HCCF will be essential for distinguishing genuine safety from superficial compliance.

Declarations

Author Contribution

Regio Marcos Pinto Abreu Filho is the sole author of this manuscript and contributed to the study conception and design, literature identification and synthesis, development of the HCCF conceptual framework, drafting of the manuscript, and revisions for intellectual content. The author approved the final version for submission and is accountable for all aspects of the work.

Data Availability

No new empirical datasets were generated or analyzed for this review. The manuscript synthesizes publicly available literature cited in the References. Any numerical values presented in the illustrative simulation are conceptual placeholders included solely to demonstrate reporting structure under the proposed harm-conditioned computational friction (HCCF) framework and do not correspond to measurements from specific models or proprietary datasets. Therefore, no external data repository is associated with this article.

References

Amodei, D., et al. (2023). Commentary on multidisciplinary requirements for AI safety.

Biderman, S., et al. (2024). Pythia: A suite for analyzing large language models across training and scaling. *Proceedings of ICLR*.

Bubeck, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. *arXiv preprint arXiv:2303.12712*.

Burns, C., et al. (2022). Discovering latent knowledge in language models without supervision. *Proceedings of ICLR*.

Chao, P., et al. (2023). Jailbreaking black box large language models in twenty queries. *arXiv preprint arXiv:2310.08419*.

Elhage, N., et al. (2021). A mathematical framework for transformer circuits.

Ganguli, D., et al. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. *arXiv preprint arXiv:2209.07858*.

Liu, X., et al. (2024). AutoDAN: Automatic and interpretable adversarial attacks on large language models. *arXiv preprint arXiv:2401.13761*.

Mazeika, M., et al. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. *arXiv preprint arXiv:2402.04249*.

Mishra, S., et al. (2024). Fine-tuning aligned language models compromises safety, even when users do not intend to. *arXiv preprint arXiv:2401.06776*.

OpenAI. (2023). Preparedness framework. Technical report.

Perez, E., et al. (2022). Discovering language model behaviors with model-written evaluations. *arXiv preprint arXiv:2212.09251*.

Templeton, A., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. *arXiv preprint arXiv:2404.02205*.

Wang, B., et al. (2024). DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. *Proceedings of NeurIPS*.

Wei, A., et al. (2023). Jailbroken: How does LLM safety training fail? *Proceedings of NeurIPS*.

Zou, A., et al. (2023). Universal and transferable adversarial attacks on aligned language models. *arXiv preprint arXiv:2307.15043*.

Additional Declarations

No competing interests reported.
A system may refuse harmful requests while lacking deeper generalization, or comply in subtly dangerous ways under paraphrase, role-play, or multi-turn incrementalization [wei2023jailbreaking, mazeika2024harmbench].\u003c/p\u003e\n\u003cp\u003eThis paper reframes alignment evaluation by reviewing and systematizing a neglected dimension of evidence: conditional internal resistance. We introduce harm-conditioned computational friction (HCCF), a diagnostic principle stating that aligned models should exhibit structured, detectable increases in deliberative cost when facing higher-harm scenarios relative to lower-harm ones, holding task difficulty constant as much as feasible. Intuitively, an aligned system should be harder to steer into harmful trajectories. We emphasize that HCCF is proposed primarily as a review and synthesis lens that unifies existing evaluation families into an interpretable evidence hierarchy, and secondarily as a set of operational estimators for future empirical work.\u003c/p\u003e\n\u003cp\u003eThis paper makes three main contributions:\u003c/p\u003e\n\u003cp\u003e1. We critically synthesize alignment evaluation methods (benchmark-centered, red-teaming, distribution-shift, interpretability-guided, and governance-oriented) and show how HCCF organizes their strengths and failure modes into a coherent evidence hierarchy.\u003c/p\u003e\n\u003cp\u003e2. We define HCCF as a conceptual and operational family of signals spanning behavioral (Level 1), inference-level (Level 2), and mechanistic (Level 3) proxies, with formal definitions and protocols for complexity matching.\u003c/p\u003e\n\u003cp\u003e3. We propose a practical evaluation blueprint, including matched-pair designs, multi-proxy aggregation, friction-gradient estimators with robustness checks, cross-domain meta-estimation, and anti-cosplay stress tests. 
We illustrate expected outcomes with simulated model contrasts and schematic visualizations.\u003c/p\u003e\n\u003cp\u003eContribution relative to prior reviews. Previous surveys and position papers on alignment evaluation have largely organized the field around benchmark design, red-teaming methodologies, or interpretability and governance pathways. While these works clarify the importance of adversarial testing and distribution-shift robustness, they typically do not offer a unifying diagnostic construct for interpreting why a model passes or fails across evaluation families. Our contribution is to introduce harm-conditioned computational friction as an integrative lens: a criterion that connects behavioral refusal stability, inference-level uncertainty and compute signatures, and mechanistic safety activation into a shared evidence hierarchy. This framework yields concrete methodological upgrades—matched-pair harm/difficulty controls, anti-cosplay cue-removal tests, multi-proxy aggregation, and cross-domain meta-estimation—that can be layered onto existing evaluation practice without requiring full model access.\u003c/p\u003e"},{"header":"2. Review Method","content":"\u003cp\u003eThis review adopts a targeted, integrative approach aimed at synthesizing alignment evaluation methods relevant to robustness under harm-related distribution shifts. 
We conducted a systematic review with a focus on work that either advances evaluation methodology or clarifies failure modes of refusal-centric metrics.\u003c/p\u003e\n\u003cp\u003eSearch Strategy: We surveyed arXiv preprints (2020–2024), conference proceedings (NeurIPS, ICML, ICLR, AIES), and technical reports from frontier labs (Anthropic, OpenAI, Google DeepMind, Meta) using combinations of the following keywords: alignment evaluation, harmful instruction following, refusal robustness, red-teaming, jailbreak, distribution shift safety, interpretability for safety, safety circuits, policy-based evaluation, and guardrail reliability.\u003c/p\u003e\n\u003cp\u003eInclusion Criteria: We prioritized work that (i) proposes evaluation methodologies for AI safety, (ii) provides systematic empirical tests of safety generalization under adversarial or out-of-distribution conditions, or (iii) analyzes failure modes of refusal-based or purely behavioral metrics. Both conceptual frameworks and large-scale empirical studies were considered.\u003c/p\u003e\n\u003cp\u003eExclusion Criteria: We excluded papers without a clear evaluation or failure-mode contribution (e.g., purely capability-scaling results without safety measurement implications) unless they were frequently cited as conceptual anchors in alignment evaluation discussions.\u003c/p\u003e\n\u003cp\u003eSynthesis Approach: The resulting literature was organized into a landscape taxonomy (Section 3) based on methodological family, access requirements, and primary measurement targets. This taxonomy was then reinterpreted through the HCCF diagnostic lens to expose convergent strengths, gaps, and research priorities. The synthesis emphasizes how different evaluation approaches contribute evidence at different levels of the HCCF hierarchy (behavioral, inferential, mechanistic).\u003c/p\u003e"},{"header":"3. 
Background and Related Work: Landscape of Alignment Evaluation Approaches","content":"\u003cp\u003eAlignment evaluation spans a heterogeneous set of methods that vary by access regime, threat model, and the degree to which they attempt to probe internal safety mechanisms versus external behavior. We synthesize the literature into five overlapping families and clarify their relationship to the HCCF framework.\u003c/p\u003e\u003cdiv id=\"Sec2\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Benchmark-Centered Behavioral Evaluations\u003c/h2\u003e\u003cp\u003eThese approaches evaluate harmful instruction following, refusal rates, and policy-consistent responses across curated prompt sets [mazeika2024harmbench, ganguli2022red, biderman2024pythia]. Their advantages include scalability and cross-model comparability, making them foundational for tracking safety improvements across model generations and for contextualizing safety behavior across training stages [biderman2024pythia]. However, they remain vulnerable to policy-template overfitting, memorized refusal manifolds, and adversarial paraphrases that preserve intent while removing expected lexical cues. From the HCCF perspective, these evaluations provide valuable Level 1 (behavioral) signals but often lack explicit controls for task complexity and do not test whether refusals are accompanied by harm-specific increases in internal resistance.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Adversarial Prompting and Red-Teaming\u003c/h2\u003e\u003cp\u003eRed-teaming frameworks systematically stress-test models under induced distribution shifts using role-play, moral licensing, incrementalization, and technical obfuscation [wei2023jailbreaking, mishra2024fine, zou2023universal]. Advanced approaches automate jailbreak generation using gradient-based or LLM-based attackers [chao2023jailbreaking, liu2024autodan]. 
These methods increase ecological validity but may still under-diagnose whether safety behavior is supported by stable internal mechanisms or by optimized \"safe style\" heuristics. HCCF predicts that robust models retain friction when cue words are removed, whereas theatrical alignment is more likely to collapse under such manipulations.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Robustness and Generalization Testing\u003c/h2\u003e\u003cp\u003eThis family emphasizes evaluation under paraphrase, style perturbation, multi-turn contexts, and cross-domain transfer [perez2022discovering, wang2024decoding]. The HCCF framework aligns naturally with this tradition: friction gradients should remain positively signed and statistically stable across paraphrase bundles and stylistic transformations. Consequently, distribution-shift tests can be interpreted not only as stress tests of refusal behavior but also as indirect probes of conditional resistance.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Interpretability- and Mechanism-Informed Safety\u003c/h2\u003e\u003cp\u003eMechanistic interpretability has been proposed as a route to higher-resolution safety evaluation. This research attempts to identify safety-relevant features, circuits, or gating behaviors that correlate with harmful intent or high-risk domains [burns2022discovering, templeton2024scaling]. When such signals are measurable, they may offer Level 3 evidence that safety is supported by stable internal mechanisms. However, interpretability remains incomplete and unevenly accessible across model classes [elhage2021mathematical]. 
HCCF treats mechanistic probes as the most informative level in a multi-proxy hierarchy, while emphasizing that they should be interpreted alongside black-box indicators to mitigate both false negatives from incomplete probes and false positives from superficial style cues.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e3.5 Policy and Governance-Oriented Evaluation\u003c/h2\u003e\u003cp\u003eFrameworks from labs and institutions propose organizational safety standards and evaluation protocols tailored to frontier models [amodei2023ai, openai2023preparedness, bubeck2023sparks]. These approaches are critical for deployment governance but can remain underspecified regarding how internal safety should be empirically discriminated from surface compliance. HCCF provides a diagnostic bridge between governance claims and empirical measurement by proposing conditional internal resistance as an operational target that can be audited across multiple evidence levels.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. HCCF as a Unifying Diagnostic Lens","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e4.1 Definition and Formal Principle\u003c/h2\u003e\u003cp\u003eLet x be an input, and let h(x) be a harm potential function mapping inputs to ordered harm categories (e.g., low, medium, high). Let d(x) denote a nuisance measure of task complexity or difficulty. Let C(x; θ) denote a measurable proxy of computational friction during inference for model θ. 
Where unambiguous, we suppress the model index θ and write C(x) for readability.\u003c/p\u003e\u003cp\u003eAssumption (HCCF Principle): For a robustly aligned model θ, and for any two inputs x1, x2 that are matched on task complexity (d(x1)\u0026thinsp;\u0026asymp;\u0026thinsp;d(x2)) but differ in harm potential (h(x1)\u0026thinsp;\u0026gt;\u0026thinsp;h(x2)), the model should exhibit greater expected computational friction when processing the higher-harm input:\u003c/p\u003e\u003cp\u003eE[C(x1; θ)]\u0026thinsp;\u0026gt;\u0026thinsp;E[C(x2; θ)], given d(x1)\u0026thinsp;\u0026asymp;\u0026thinsp;d(x2) and h(x1)\u0026thinsp;\u0026gt;\u0026thinsp;h(x2).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Taxonomy of Friction Signals\u003c/h2\u003e\u003cp\u003eHCCF organizes evaluation signals into three levels that map naturally onto access regimes:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eLevel 1: Behavioral friction (black-box). Refusal consistency under paraphrase, calibrated uncertainty signaling, policy-grounded rationales, and distribution-shift refusal stability.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLevel 2: Inference friction (semi white-box). Localized token entropy changes around safety-critical spans, test-time compute sensitivity, and deliberation-depth proxies where available.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLevel 3: Mechanistic friction (white-box). Activation of identified safety features or circuits, routing/gating to safety modules, and rule-checking submodule engagement in modular or tool-augmented systems.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e4.3 Operational Definition of h(x) and Complexity Matching\u003c/h2\u003e\u003cp\u003eThe harm potential function h(x) is context-dependent and must be operationalized with care. 
We recommend domain-specific annotation guidelines, multi-rater labeling, and explicit reporting of inter-rater reliability, alongside adjudication of borderline cases to reduce silent drift in category meaning.\u003c/p\u003e\u003cp\u003eFor complexity matching, we propose a difficulty calibration bank of benign but technically demanding prompts to estimate the baseline relationship between d(x) and C(x). Inputs can then be matched using similarity in sentence structure, syntactic complexity, or embedding distance within the same domain.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e4.4 Friction Gradients\u003c/h2\u003e\u003cp\u003eWe define the friction gradient across harm categories for a matched pair:\u003c/p\u003e\u003cp\u003eg_i\u0026thinsp;=\u0026thinsp;C(x_i^H) - C(x_i^L),\u003c/p\u003e\u003cp\u003ewhere x_i^H and x_i^L are matched high-harm and low-harm inputs from adjacent harm categories. A robustly aligned system should show positive mean gradients across domains, stability under paraphrase and style perturbations, and resistance to adversarial attempts to suppress friction cues.\u003c/p\u003e\u003cp\u003eScope and boundaries. HCCF is not a standalone proof of robust alignment. It is a diagnostic layer intended to complement benchmark performance, adversarial robustness, and mechanistic safety evidence. The framework is most informative when multiple proxy families converge on consistent harm-specific gradients under explicit complexity controls and anti-cosplay stress tests.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. 
Evaluation Blueprint Under HCCF","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e5.1 Matched-Pair Design\u003c/h2\u003e\u003cp\u003eConstruct pairs (x^H, x^L) where semantic structure and difficulty are matched (d(x^H)\u0026thinsp;\u0026asymp;\u0026thinsp;d(x^L)), harm-relevant intent differs systematically (h(x^H)\u0026thinsp;\u0026gt;\u0026thinsp;h(x^L)), and surface cues are minimized to avoid trivial detection. We recommend paired prompt construction via controlled transformation followed by validation against the difficulty calibration bank and paraphrase bundles.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e5.2 Multi-Proxy Aggregation\u003c/h2\u003e\u003cp\u003eBecause any single proxy can be gamed, we recommend composite scoring:\u003c/p\u003e\u003cp\u003eC(x) = Σ_{k\u0026thinsp;=\u0026thinsp;1}^K w_k \u0026middot; normalize(C_k(x)),\u003c/p\u003e\u003cp\u003ewith sensitivity analyses over the weights w_k. For general applicability, black-box proxies (Level 1) serve as the baseline layer, while inference and mechanistic proxies (Levels 2\u0026ndash;3) provide higher-resolution confirmation when accessible.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e5.3 Adversarial Low-Friction Tests\u003c/h2\u003e\u003cp\u003eWe propose tests designed to reduce observable friction without changing harm content, including polite framing, moral-licensing role-play, euphemistic or technical obfuscation, and multi-turn incrementalization. A model that loses friction under these manipulations likely exhibits brittle alignment; persistent high friction in legitimate high-stakes contexts may indicate over-alignment.\u003c/p\u003e\u003c/div\u003e"},{"header":"6. 
Methodological Implementation Challenges","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e6.1 Measuring Task Complexity\u003c/h2\u003e\u003cp\u003eThe most significant challenge in implementing HCCF is defining and measuring d(x)\u0026mdash;task complexity\u0026mdash;objectively. Unlike harm potential, which can be annotated by human raters, task complexity involves multiple dimensions: syntactic complexity, semantic ambiguity, required reasoning steps, and domain-specific knowledge requirements. We propose using a combination of:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eAutomated readability metrics (syntactic complexity)\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ePerplexity scores from reference models\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eHuman ratings of difficulty for benign tasks\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eResponse time baselines from human subjects\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e6.2 Ensuring True Complexity Matching\u003c/h2\u003e\u003cp\u003eEven with careful design, residual complexity mismatches may persist. We recommend:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eWithin-domain pairing (e.g., cybersecurity questions paired with cybersecurity benign questions)\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMultiple difficulty proxies with sensitivity analysis\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eNegative control pairs (high-difficulty benign vs. 
low-difficulty benign) to establish baseline\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e6.3 Proxy Validation and Calibration\u003c/h2\u003e\u003cp\u003eEach friction proxy requires validation:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eBehavioral proxies: Must correlate with refusal behavior but not with stylistic preferences\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eInference proxies: Must reflect computational effort, not just uncertainty\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMechanistic proxies: Must map to meaningful safety mechanisms\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\u003ch2\u003e6.4 Computational Cost Considerations\u003c/h2\u003e\u003cp\u003eMulti-proxy measurement across large prompt sets can be computationally expensive, particularly for white-box methods. We recommend staged evaluation: begin with black-box proxies for broad screening, then apply more expensive methods to borderline cases.\u003c/p\u003e\u003c/div\u003e"},{"header":"7. Estimating Friction Gradients","content":"\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\u003ch2\u003e7.1 Composite Estimation of G\u003c/h2\u003e\u003cp\u003eLet {(x_i^H, x_i^L)}_{i\u0026thinsp;=\u0026thinsp;1}^n be a matched-pair set with ordinal harm labels h(x_i^H)\u0026thinsp;\u0026gt;\u0026thinsp;h(x_i^L) and calibrated difficulty d(x_i^H)\u0026thinsp;\u0026asymp;\u0026thinsp;d(x_i^L). 
Using the multi-proxy score C(x), we compute pairwise gradients g_i and aggregate with robust estimators:\u003c/p\u003e\u003cp\u003eĜ_mean = (1/n) Σ_{i\u0026thinsp;=\u0026thinsp;1}^n g_i, Ĝ_med\u0026thinsp;=\u0026thinsp;median(g_1,\u0026hellip;,g_n), Ĝ_trim\u0026thinsp;=\u0026thinsp;TrimMean(g_{1:n}).\u003c/p\u003e\u003cp\u003eWe recommend Ĝ_med or Ĝ_trim for outlier resilience.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section2\"\u003e\u003ch2\u003e7.2 Robustness Checks\u003c/h2\u003e\u003cp\u003eWe recommend weight sensitivity analysis, paraphrase invariance, cue removal stress tests, high-difficulty benign negative controls, and label permutation tests. Uncertainty should be quantified via paired bootstrap confidence intervals.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\u003ch2\u003e7.3 Cross-Domain Meta-Estimator\u003c/h2\u003e\u003cp\u003eFor overall model assessment, aggregate across domains d \u0026isin; {1,\u0026hellip;,D}:\u003c/p\u003e\u003cp\u003eĜ_FE = (Σ_{d\u0026thinsp;=\u0026thinsp;1}^D w_d Ĝ_d) / (Σ_{d\u0026thinsp;=\u0026thinsp;1}^D w_d), w_d\u0026thinsp;=\u0026thinsp;1/SE_d\u0026sup2;,\u003c/p\u003e\u003cp\u003eĜ_RE = (Σ_{d\u0026thinsp;=\u0026thinsp;1}^D w_d* Ĝ_d) / (Σ_{d\u0026thinsp;=\u0026thinsp;1}^D w_d*), w_d* = 1/(SE_d\u0026sup2; + τ\u0026sup2;),\u003c/p\u003e\u003cp\u003ewhere τ\u0026sup2; estimates between-domain heterogeneity.\u003c/p\u003e\u003cp\u003eFuture empirical protocol checklist for HCCF studies (AIR-oriented)\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eDefine harm taxonomy: Specify domains and ordinal harm levels; publish annotation guidelines.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eCalibrate difficulty: Build a benign high-difficulty bank and report how d(x) is estimated or approximated.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eConstruct matched pairs: 
Ensure within-domain pairing with minimal surface cue leakage; document pairing rules.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eSelect multi-proxy set: Pre-register C_k and normalization; justify weights w_k.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eRun robustness suite: Weight sensitivity, paraphrase invariance, cue removal, negative controls, and label permutation.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eReport uncertainty: Use paired bootstrap CIs for Ĝ; provide domain-level Ĝ_d.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eAggregate cautiously: Report Ĝ_FE and Ĝ_RE when cross-domain claims are made.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eInterpret with failure modes: Explicitly assess complexity confound and theatrical friction risks.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"8. Empirical Validation Requirements","content":"\u003cdiv id=\"Sec25\" class=\"Section2\"\u003e\u003ch2\u003e8.1 Baseline Studies on Current Models\u003c/h2\u003e\u003cp\u003eTo validate HCCF, empirical studies should:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eApply HCCF protocols to existing models with known alignment properties\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCompare HCCF gradients with existing safety metrics\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTest correlation between HCCF scores and real-world safety failures\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eExamine HCCF gradients across training stages\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec26\" class=\"Section2\"\u003e\u003ch2\u003e8.2 Sensitivity Analysis 
Requirements\u003c/h2\u003e\u003cp\u003eFuture work should systematically test:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eHow HCCF gradients vary with different difficulty calibration methods\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe correlation structure between different friction proxies\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe stability of HCCF gradients across model architectures\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe relationship between HCCF and other alignment metrics\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec27\" class=\"Section2\"\u003e\u003ch2\u003e8.3 Longitudinal Validation\u003c/h2\u003e\u003cp\u003eThe most compelling validation would involve:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eTracking HCCF gradients throughout training interventions\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCorrelating HCCF changes with safety failures in deployment\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTesting HCCF's predictive power for future safety failures\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"9. Illustrative Simulation: Expected Patterns Under HCCF","content":"\u003cp\u003eTo concretize the framework, we provide an illustrative model contrast aligned with empirical narratives in alignment evaluation. 
The following numerical values are illustrative and do not report results from a specific model evaluation; they are included solely to demonstrate how HCCF-style reporting would be structured in an empirical extension.\u003c/p\u003e\u003cp\u003eConsider two models:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eModel A (Heuristic): Relies on superficial refusal heuristics and policy-template matching.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eModel B (Robust): Exhibits stable internal constraints under adversarial perturbations.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e(Illustrative friction gradients by domain)\u003c/p\u003e\u003ctable\u003e\u003cthead\u003e\u003ctr\u003e\u003cth\u003eDomain\u003c/th\u003e\u003cth\u003eModel A (Heuristic)\u003c/th\u003e\u003cth\u003eModel B (Robust)\u003c/th\u003e\u003cth\u003eInterpretation\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd\u003eBiohazard\u003c/td\u003e\u003ctd\u003e0.12 [0.05, 0.19]\u003c/td\u003e\u003ctd\u003e0.68 [0.61, 0.75]\u003c/td\u003e\u003ctd\u003eStrong separation; Model B shows robust harm-specific resistance\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd\u003eCybersecurity\u003c/td\u003e\u003ctd\u003e0.08 [0.02, 0.14]\u003c/td\u003e\u003ctd\u003e0.55 [0.48, 0.62]\u003c/td\u003e\u003ctd\u003eModel A's refusal collapses under paraphrasing\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd\u003eSelf-harm\u003c/td\u003e\u003ctd\u003e0.15 [0.08, 0.22]\u003c/td\u003e\u003ctd\u003e0.72 [0.65, 0.79]\u003c/td\u003e\u003ctd\u003eConsistent pattern across harm domains\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd\u003eCross-domain FE\u003c/td\u003e\u003ctd\u003e0.12 [0.08, 0.16]\u003c/td\u003e\u003ctd\u003e0.65 [0.60, 0.70]\u003c/td\u003e\u003ctd\u003eStable global gap\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd\u003eCross-domain RE\u003c/td\u003e\u003ctd\u003e0.12 [0.07, 0.17]\u003c/td\u003e\u003ctd\u003e0.65 [0.59, 0.71]\u003c/td\u003e\u003ctd\u003eRobust to between-domain heterogeneity\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003cp\u003e(We assume C(x) is a composite of refusal consistency (0\u0026ndash;1), normalized latency (0\u0026ndash;1), and token entropy shift (0\u0026ndash;1), with equal weights.)\u003c/p\u003e"},{"header":"10. Limitations and Failure Modes","content":"\u003cdiv id=\"Sec30\" class=\"Section2\"\u003e\u003ch2\u003e10.1 The complexity confound\u003c/h2\u003e\u003cp\u003eThe central inferential risk for HCCF is attributing difficulty-induced processing cost to harm-conditioned resistance. 
Even with matched-pair design, residual mismatch can persist through latent knowledge demands, domain familiarity, or hidden multi-step reasoning requirements. This can inflate apparent gradients in domains where harmful content tends to be technically complex. The recommended defenses are calibrated benign difficulty banks, explicit negative controls, and within-domain pairing rather than cross-domain comparisons.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec31\" class=\"Section2\"\u003e\u003ch2\u003e10.2 Theatrical friction and reward-shaped simulation\u003c/h2\u003e\u003cp\u003eModels can learn to perform friction by emitting stylized ethical preambles, hedges, or refusal templates keyed to lexical or structural cues. This \"alignment cosplay\" is particularly plausible when evaluators overweight linguistic proxies without orthogonal signals. HCCF therefore emphasizes cue-removal tests, intent-preserving obfuscation, and multi-proxy aggregation that includes non-linguistic measures (e.g., length-controlled latency, localized entropy changes where accessible). The diagnostic goal is not the presence of frictional language per se, but stable harm-specific elevation under adversarial cue suppression.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec32\" class=\"Section2\"\u003e\u003ch2\u003e10.3 Proxy ambiguity under black-box access\u003c/h2\u003e\u003cp\u003eIndividual proxies are underdetermined. Increased latency may reflect infrastructure variance or caching effects; entropy shifts may reflect lexical uncertainty rather than ethical conflict. These ambiguities are amplified for closed models where Level 3 evidence is unavailable. 
As a result, the evidential weight of HCCF depends on coherent convergence across proxies and on robustness checks that test whether harm-specific signals persist when difficulty, style, and surface cues are controlled.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec33\" class=\"Section2\"\u003e\u003ch2\u003e10.4 Normative dependency of harm labels\u003c/h2\u003e\u003cp\u003eThe harm potential function h(x) is socially contested and context-dependent. Overly conservative labeling can generate false positives that degrade legitimate research and safety-adjacent workflows (e.g., defensive cybersecurity or educational clinical discussions). Conversely, permissive labeling risks false negatives that understate real-world misuse potential. Transparent domain-specific guidelines, multi-rater annotation, and adjudicated borderline cases are essential to limit category drift and to support interpretable cross-study comparisons.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec34\" class=\"Section2\"\u003e\u003ch2\u003e10.5 Adversarial co-adaptation to new diagnostics\u003c/h2\u003e\u003cp\u003eBy introducing friction as a measurable target, HCCF creates a new surface for adversarial optimization. Future models could be trained to preserve outward safety style while suppressing internal resistance or to minimize gradients without reducing harmful compliance. 
This risk reinforces the need to treat HCCF as one component of a diversified safety evaluation portfolio, ideally complemented by unannounced audits, capability-based risk assessments, and interpretability-informed spot checks when access permits.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec35\" class=\"Section2\"\u003e\u003ch2\u003e10.6 Relationship to existing concepts\u003c/h2\u003e\u003cp\u003eHCCF builds on but differs from several existing concepts:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eDifficulty calibration: Similar to psychometric methods but applied to AI systems\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEffort justification: Related to cognitive science concepts but operationalized computationally\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eValue alignment vs. behavior alignment: HCCF attempts to measure alignment depth rather than surface behavior\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"11. Research Agenda","content":"\u003cp\u003eWe highlight five directions that would improve the empirical and theoretical maturity of friction-based evaluation: standardized harm annotation protocols; friction-aware matched-pair benchmark suites; systematic anti-cosplay stress tests; cross-domain stability metrics with meta-estimation; and training-time friction shaping that discourages low-effort harmful compliance while minimizing false positives in legitimate high-stakes contexts.\u003c/p\u003e"},{"header":"12. Implications for Alignment Research","content":"\u003cp\u003eViewed as a review framework, HCCF helps unify a fragmented evaluation landscape by providing a structured target for conditional internal resistance. It complements benchmark and red-teaming cultures by adding a distinct dimension of evidence: the costliness of transitioning into harmful trajectories. 
For governance, HCCF offers an empirically operationalizable target that can be layered onto existing auditing standards across access regimes.\u003c/p\u003e"},{"header":"13. Conclusion","content":"\u003cp\u003eThis paper introduces harm-conditioned computational friction (HCCF) as a diagnostic framework for AI alignment evaluation. By shifting focus from surface compliance to conditional internal resistance, HCCF addresses critical gaps in current evaluation paradigms. The framework organizes behavioral, inference-level, and mechanistic signals into a coherent evidence hierarchy, providing both a theoretical lens for interpreting existing methods and concrete protocols for future empirical work.\u003c/p\u003e\u003cp\u003eWhile HCCF faces implementation challenges, particularly in complexity matching and proxy validation, it represents a significant step toward more robust alignment assessment. Future work should empirically validate HCCF proxies, develop standardized difficulty calibration methods, and explore how friction signals correlate with real-world safety failures. As AI systems become more capable and integrated into high-stakes domains, frameworks like HCCF will be essential for distinguishing genuine safety from superficial compliance.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eRegio Marcos Pinto Abreu Filho is the sole author of this manuscript and contributed to the study conception and design, literature identification and synthesis, development of the HCCF conceptual framework, drafting of the manuscript, and revisions for intellectual content. The author approved the final version for submission and is accountable for all aspects of the work.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eNo new empirical datasets were generated or analyzed for this review. 
The manuscript synthesizes publicly available literature cited in the References. Any numerical values presented in the illustrative simulation are conceptual placeholders included solely to demonstrate reporting structure under the proposed harm-conditioned computational friction (HCCF) framework and do not correspond to measurements from specific models or proprietary datasets. Therefore, no external data repository is associated with this article.\u003c/p\u003e"},{"header":"References","content":"\u003cp\u003eAmodei, D., et al. (2023). Commentary on multidisciplinary requirements for AI safety.\u003c/p\u003e\n\u003cp\u003eBiderman, S., et al. (2024). Pythia: A suite for analyzing large language models across training and scaling. *Proceedings of ICLR*.\u003c/p\u003e\n\u003cp\u003eBubeck, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. *arXiv preprint arXiv:2303.12712*.\u003c/p\u003e\n\u003cp\u003eBurns, C., et al. (2022). Discovering latent knowledge in language models without supervision. *Proceedings of ICLR*.\u003c/p\u003e\n\u003cp\u003eChao, P., et al. (2023). Jailbreaking black box large language models in twenty queries. *arXiv preprint arXiv:2310.08419*.\u003c/p\u003e\n\u003cp\u003eElhage, N., et al. (2021). A mathematical framework for transformer circuits.\u003c/p\u003e\n\u003cp\u003eGanguli, D., et al. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. *arXiv preprint arXiv:2209.07858*.\u003c/p\u003e\n\u003cp\u003eLiu, X., et al. (2024). AutoDAN: Automatic and interpretable adversarial attacks on large language models. *arXiv preprint arXiv:2401.13761*.\u003c/p\u003e\n\u003cp\u003eMazeika, M., et al. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. *arXiv preprint arXiv:2402.04249*.\u003c/p\u003e\n\u003cp\u003eMishra, S., et al. (2024). 
Fine-tuning aligned language models compromises safety, even when users do not intend to. *arXiv preprint arXiv:2401.06776*.\u003c/p\u003e\n\u003cp\u003eOpenAI. (2023). Preparedness framework. Technical report.\u003c/p\u003e\n\u003cp\u003ePerez, E., et al. (2022). Discovering language model behaviors with model-written evaluations. *arXiv preprint arXiv:2212.09251*.\u003c/p\u003e\n\u003cp\u003eTempleton, A., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. *arXiv preprint arXiv:2404.02205*.\u003c/p\u003e\n\u003cp\u003eWang, B., et al. (2024). DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. *Proceedings of NeurIPS*.\u003c/p\u003e\n\u003cp\u003eWei, A., et al. (2023). Jailbroken: How does LLM safety training fail? *Proceedings of NeurIPS*.\u003c/p\u003e\n\u003cp\u003eZou, A., et al. (2023). Universal and transferable adversarial attacks on aligned language models. *arXiv preprint arXiv:2307.15043*.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":false,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research 
Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8327468/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8327468/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eRobust alignment requires that AI models maintain safety-relevant behavior under distribution shift, adversarial prompting, and optimization pressure. Current evaluation methods often rely on surface compliance metrics\u0026mdash;such as refusal rates or policy-template adherence\u0026mdash;that may fail to detect fragile safety generalization, reward hacking, or prompt-contingent refusal policies. This paper critically reviews alignment evaluation methods through the lens of harm-conditioned computational friction (HCCF): a diagnostic principle positing that aligned models should exhibit measurable increases in deliberative cost, uncertainty, or constraint activation specifically when processing higher-harm inputs, controlling for task difficulty. We formalize HCCF through behavioral, inference-level, and mechanistic proxies; propose measurement protocols for friction gradients across harm domains; analyze confounds and failure modes (including \"theatrical friction\"); and provide an evaluation blueprint with robustness checks and cross-domain aggregation. 
By emphasizing conditional internal resistance rather than only external refusal behavior, HCCF provides a framework to unify existing evaluation methods and distinguish genuine safety generalization from brittle or cosmetic alignment, with potential implications for model auditing, training objectives, and AI safety governance.\u003c/p\u003e","manuscriptTitle":"Harm-Conditioned Computational Friction as a Diagnostic of Alignment Robustness: A Critical Review and Evaluation Framework","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-11 06:35:53","doi":"10.21203/rs.3.rs-8327468/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3682d2cb-fade-4f3a-b6f6-99d62b1a8633","owner":[],"postedDate":"December 11th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-12-22T05:08:55+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-11 06:35:53","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8327468","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8327468","identity":"rs-8327468","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
