ConsultChain: Progressive Context Distillation Across Heterogeneous LLM Fleets for Token-Optimal Inference

doi:10.21203/rs.3.rs-9368244/v1

2026 · doi:10.21203/rs.3.rs-9368244/v1

preprint OA: closed

Research Square · Research Article

Samuel Edusa, MD

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9368244/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Background: The operational cost of large language model (LLM) inference is dominated by token consumption. In heterogeneous model fleets where prices span two orders of magnitude, most systems still route all traffic to a single model or implement only binary cheap/expensive routing. This leaves significant cost optimization on the table, particularly for agentic workflows that re-process identical context across sessions.
Methods: I designed and implemented ConsultChain, an architecture introducing three novel mechanisms: (1) a five-tier model cascade with progressive context distillation, where each tier compresses context before escalation rather than forwarding raw input; (2) knowledge crystallization, a self-healing persistent knowledge store modeled on physical crystal formation with nucleation, growth, and fracture phases; and (3) synaptic pruning, an activity-driven memory management system inspired by adolescent neural development that promotes frequently used pathways and eliminates idle ones. The system was deployed on a Raspberry Pi 5 orchestrating a 10-model fleet accessed via cloud APIs and evaluated over simulated workloads of 50 requests per day.

Results: The system achieves a 98.5% token cost reduction at steady state ($1.74/month) compared to single-model full-context baselines ($112.50/month) on a fleet spanning a 227x price differential ($0.11/M to $25/M tokens). Costs compound downward over time: Tier 0 resolution rates increase from 40% at week 1 to 95% at week 24 as the knowledge lattice matures. The orchestration layer runs with approximately 1.2GB RAM and requires no local GPU.

Conclusions: Progressive context distillation, combined with self-healing knowledge storage and activity-driven memory pruning, enables cost reductions that compound over time rather than remaining static. The approach is feasible on edge hardware and generalizes to any heterogeneous model fleet. The implementation is released as open source.

Subject areas: Artificial Intelligence and Machine Learning; Computer Architecture and Engineering

Keywords: Large language models; Token optimization; Model cascading; Knowledge graphs; Memory management; Edge computing

Introduction

The operational cost of large language model (LLM) inference is dominated by token consumption. For developers and researchers using cloud-hosted models, each API call carries a cost proportional to the number of input and output tokens processed.
This cost becomes significant in agentic workflows where models are invoked repeatedly, often re-processing identical context across sessions.

The emergence of heterogeneous model fleets (deployments where multiple models of varying capability and cost are available simultaneously) creates an optimization opportunity. A typical fleet may span two orders of magnitude in price per token. Yet most systems route all traffic to a single model or, at best, implement binary routing between a "cheap" and an "expensive" tier.

I introduce the ConsultChain, a system that exploits the full price spectrum of a heterogeneous model fleet through three novel mechanisms:

Progressive Context Distillation: A five-tier cascade in which each tier attempts to resolve a request. Upon failure, the tier compresses the context before escalating, so higher-tier models receive smaller but semantically richer input.

Knowledge Crystallization: A persistent knowledge store in which information solidifies through confirmation phases (seed, growing, solid, faceted), self-heals through fracture on contradiction, and dissolves when unconfirmed. This replaces static wiki patterns that suffer from silent staleness.

Synaptic Pruning: A memory management system modeled on neural development, in which pathways strengthen through co-activation (Hebbian learning), are promoted to permanent storage after sustained strength, and are pruned during scheduled consolidation cycles.

Together, these mechanisms achieve a 98.5% token cost reduction on a 10-model fleet with a 227x price differential, with the system's efficiency compounding over time as knowledge accumulates.

Related Work

Token Optimization

Prior approaches to token optimization fall into three categories: context compression, caching, and routing. Context compression techniques include Selective Context [1], which uses self-information to filter tokens, and LLMLingua [2], which uses a small model to compress prompts. These operate at the token level.
My approach operates at the semantic level, compressing meaning through multi-stage distillation. Caching approaches include GPTCache [3] and semantic caching layers that store and retrieve previous responses. My knowledge crystallization extends this by treating cached knowledge as mutable, confidence-weighted, and self-healing rather than as static key-value pairs. Routing systems like FrugalGPT [4] cascade through models sequentially. My system differs by introducing progressive distillation, where each tier transforms the context before escalation, and by incorporating persistent knowledge and memory systems that compound savings over time.

Persistent Knowledge for LLMs

Karpathy [5] proposed the LLM Wiki pattern: a three-layer architecture (raw sources, LLM-maintained wiki, schema document) with ingest, query, and lint operations. My knowledge crystallization replaces the flat wiki with a phase-transition model that provides automatic staleness resolution, confidence-weighted retrieval, and organic knowledge graph formation through faceting.

Memory Systems for AI Agents

MemGPT [6] introduced OS-inspired memory management for LLMs. Hippo Memory [7] applied neuroscience-inspired decay models with half-lives and reward modulation. My synaptic pruning model differs by using activity-driven promotion rather than time-based decay, implementing a three-store architecture (sensory, working, myelinated) with Hebbian strengthening, and performing batch consolidation that mirrors sleep-phase neural processing.

Code Context Compression

CodeSight [8] uses AST-based parsing to generate structured code maps, reducing codebase context from 26-47K tokens to 3-5K. My fingerprinting approach achieves further compression to approximately 12 tokens per file (250x compression from raw source) using regex-based structural signatures that are diffable and searchable without LLM involvement.
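A regex-based structural signature of the kind described above might be produced as in the following minimal sketch. It is an illustration under stated assumptions, not the author's implementation: the three patterns, the field prefixes (`exp:`, `rt:`, `imp:`), and the function names `fingerprint` and `diff_fingerprints` are hypothetical, and the patterns only target a small Express-style JavaScript file.

```python
import re

# Hypothetical patterns for an Express-style JavaScript file (assumed for illustration).
ROUTE_RE = re.compile(r"\b(?:app|router)\.(get|post|put|patch|delete)\(\s*['\"]([^'\"]+)['\"]")
IMPORT_RE = re.compile(r"require\(\s*['\"]([^'\"]+)['\"]\s*\)")
EXPORT_RE = re.compile(r"module\.exports\s*=\s*(\w+)")


def fingerprint(source: str) -> str:
    """Return a pipe-delimited structural signature of one source file."""
    routes = [f"{method.upper()} {path}" for method, path in ROUTE_RE.findall(source)]
    imports = sorted(set(IMPORT_RE.findall(source)))
    exports = EXPORT_RE.findall(source)
    fields = [
        "exp:" + ",".join(exports),
        "rt:" + ",".join(routes),
        "imp:" + ",".join(imports),
    ]
    return "|".join(fields)


def diff_fingerprints(old: str, new: str) -> list[str]:
    """List the fields that changed between two fingerprints, without rescanning source."""
    changes = []
    for before, after in zip(old.split("|"), new.split("|")):
        if before != after:
            changes.append(f"{before} -> {after}")
    return changes


EXAMPLE = """
const express = require('express');
const db = require('./db');
const router = express.Router();
router.get('/users/:id', (req, res) => res.json(db.find(req.params.id)));
router.post('/users', (req, res) => res.json(db.create(req.body)));
module.exports = router;
"""

fp = fingerprint(EXAMPLE)
print(fp)
```

Because the signature is a plain string with a fixed field order, two versions of a file can be compared field by field, which is one plausible way to obtain the compact change summaries the text calls diffable.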
Methods

System Overview

The ConsultChain consists of four subsystems operating on a shared data layer: a Cascade Router that classifies requests and routes them through five model tiers, a Crystal Lattice providing persistent knowledge storage with phase-transition semantics, a Synaptic Store for activity-driven cross-session memory with three persistence levels, and a Fingerprint Engine for zero-LLM code context compression. All orchestration runs on a single-board computer (Raspberry Pi 5, 8GB RAM). Models are accessed via cloud APIs. The orchestration layer consumes approximately 1.2GB RAM.

The Five-Tier Cascade

Let M = {m_0, m_1, ..., m_n} be a set of models ordered by cost per token c(m_i) such that c(m_0) < c(m_1) < ... < c(m_n). For an incoming request r, the cascade proceeds as follows. At tier i, the system assembles context C_i from the crystal lattice, synaptic store, and fingerprint index, subject to a per-tier token budget B_i. It generates a response R_i using model m_i with input (r, C_i), then evaluates a confidence score s_i for R_i. If s_i >= t_i, where t_i is the tier's confidence threshold, the response is returned (resolution). If s_i < t_i, the system generates a distilled context D_i = distill(r, C_i, R_i, reason_i), where |D_i| < |C_i|, and passes (r, D_i) to tier i + 1.

A key property is context monotonicity: |C_0| > |D_0| > |D_1| > ... > |D_{n-1}|. Context size decreases as it ascends the cascade. This is the primary departure from standard cascade routing, where context size is constant or increasing. Although |D_i| < |C_i|, the semantic density of D_i is higher because it includes the distilled context from tier i, that tier's partial reasoning about the request, and an explicit annotation of what knowledge was missing.

Table 1. Five-tier cascade configuration with steady-state resolution rates.
| Tier          | Models                      | Input Budget | Resolution Rate |
|---------------|-----------------------------|--------------|-----------------|
| 0 (Sieve)     | qwen3.5, gemma4:31b         | 450 tokens   | 75%             |
| 1 (Filter)    | minimax-m2.7, kimi-k2.5     | 370 tokens   | 15%             |
| 2 (Refiner)   | claude-haiku-4-5, glm-5     | 280 tokens   | 6%              |
| 3 (Distiller) | claude-sonnet-4-5, gpt-5.4  | 470 tokens   | 3%              |
| 4 (Alembic)   | claude-opus-4               | 650 tokens   | 1%              |

Knowledge Crystallization

I model persistent knowledge as a lattice of crystals, each representing an atomic knowledge claim. A crystal K is a tuple (claim, phase, confidence, confirmations, facets, timestamp), where claim is a natural language assertion of 20–50 tokens, phase is one of {seed, growing, solid, faceted}, confidence is a value in [0, 1], confirmations counts independent verifications, facets is a set of edges to related crystals, and timestamp records creation and last confirmation.

Crystals undergo phase transitions analogous to physical crystallization:

Nucleation: New information enters as a seed with confidence 0.3. Seeds that receive no confirmation within 14 days are dissolved.

Growth: Upon independent confirmation (from a different source or model tier), confidence increases by 0.15, up to a maximum of 0.85. After 4 confirmations, the phase transitions to solid.

Faceting: During batch consolidation, a low-cost model proposes connections between solid crystals. When two crystals are linked, both transition to faceted, with confidence approaching 0.95.

Fracture: When information contradicting crystal K enters the lattice with non-trivial confidence, K splits into competing seeds representing the original and contradictory claims. Both must re-nucleate independently.

Under the crystallization model, the maximum duration a false claim persists in the lattice is bounded by max(TTL_seed, time_to_fracture), where TTL_seed is the 14-day seed lifetime and time_to_fracture depends on the rate at which contradictory information arrives. In contrast, static wiki systems have unbounded staleness.
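The nucleation, growth, and fracture rules above can be sketched as a small state machine. This is a minimal illustration, not the released implementation: the numeric constants come from the text, while the class layout, the method names `confirm` and `is_dissolved`, and the `fracture` helper are assumptions. Faceting is omitted because the text gives only a target confidence ("approaching 0.95") and not an update rule.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

SEED_CONFIDENCE = 0.3            # confidence at nucleation (from the text)
GROWTH_STEP = 0.15               # gain per independent confirmation (from the text)
GROWTH_CAP = 0.85                # ceiling reachable through growth (from the text)
SOLID_AFTER = 4                  # confirmations required to become solid (from the text)
SEED_TTL = timedelta(days=14)    # unconfirmed seeds dissolve after this long


@dataclass
class Crystal:
    claim: str
    created: datetime
    phase: str = "seed"
    confidence: float = SEED_CONFIDENCE
    confirmations: int = 0
    facets: set = field(default_factory=set)

    def confirm(self) -> None:
        """Apply one independent confirmation (growth phase)."""
        self.confirmations += 1
        self.confidence = min(GROWTH_CAP, self.confidence + GROWTH_STEP)
        if self.phase in ("seed", "growing"):
            self.phase = "solid" if self.confirmations >= SOLID_AFTER else "growing"

    def is_dissolved(self, now: datetime) -> bool:
        """A seed with no confirmation inside the TTL is dropped from the lattice."""
        return self.phase == "seed" and now - self.created > SEED_TTL


def fracture(crystal: Crystal, contradicting_claim: str, now: datetime):
    """Split a contradicted crystal into two competing seeds that must re-nucleate."""
    return Crystal(crystal.claim, now), Crystal(contradicting_claim, now)


# Usage: a claim is confirmed four times, then contradicted.
t0 = datetime(2026, 1, 1)
crystal = Crystal("auth tokens expire after 24 hours", t0)
for _ in range(4):
    crystal.confirm()
print(crystal.phase, crystal.confidence)

original, rival = fracture(crystal, "auth tokens expire after 1 hour", t0 + timedelta(days=30))
print(original.phase, rival.phase)
```

Starting from 0.3, four steps of 0.15 would reach 0.9, so the 0.85 cap is what actually determines a solid crystal's confidence under these rules.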
Synaptic Pruning

I model cross-session memory as a directed graph of pathways connecting contexts to outcomes, managed across three stores inspired by neural memory systems:

Sensory Buffer: A fixed-size (30 items) FIFO queue holding immediate session context with no persistence. Items are promoted to working pathways upon a second reference within a session.

Working Pathways: Each pathway has a strength sigma in [0, 1] updated by a Hebbian rule (co-activated pathways create or strengthen links), an activity update (traversal increases strength, positive outcomes add a bonus, negative outcomes reduce strength), and time-based decay (pathways idle for 72 hours lose 10% strength per cycle).

Myelinated Pathways: Working pathways that maintain strength above 0.7 for 7 consecutive days are promoted to permanent storage. These are pruned only if strength remains at 0 for 60 days.

A daily sleep cycle performs Hebbian link strengthening for co-activated pathways, pruning of weak working pathways, promotion of strong candidates, and crystal maintenance, including dissolution of stale seeds and facet proposal.

Code Fingerprinting

For codebases, I generate structural fingerprints using regex-based extraction rather than full AST parsing. A fingerprint is a pipe-delimited string encoding exports, middleware, routes with HTTP method and path parameters, database field accesses, and import dependencies. Average fingerprint length is 12 tokens, compared to 3,000 tokens for the average source file, yielding a 250x compression ratio. Fingerprints support efficient diff operations that produce compact change summaries, enabling incremental context updates without re-scanning.

Results

Experimental Setup

I deployed the ConsultChain on a Raspberry Pi 5 (8GB RAM) orchestrating a 10-model fleet accessed via cloud APIs.
The evaluation used simulated workloads of 50 requests per day across five categories: code questions, knowledge recall, architecture queries, complex reasoning, and novel problems.

Token Cost Reduction

Table 2. Token cost comparison across configurations.

| Configuration                        | Daily Cost | Monthly Cost | Reduction |
|--------------------------------------|------------|--------------|-----------|
| Single model (Sonnet, full context)  | $3.75      | $112.50      | baseline  |
| Binary routing (cheap/expensive)     | $0.27      | $8.10        | 92.8%     |
| Cascade only (no knowledge/memory)   | $0.14      | $4.20        | 96.3%     |
| ConsultChain (full system)           | $0.058     | $1.74        | 98.5%     |

The full system achieves an additional 58% reduction over cascade-only routing by resolving more requests at lower tiers via crystal lookups and synaptic recall.

Compounding Behavior

I measured tier resolution rates over time:

Table 3. Tier resolution rates and cost over time.

| Week | Tier 0 | Tier 1 | Tier 2 | Tier 3 | Tier 4 | Monthly Cost |
|------|--------|--------|--------|--------|--------|--------------|
| 1    | 40%    | 25%    | 15%    | 12%    | 8%     | $8.00        |
| 4    | 58%    | 22%    | 12%    | 5%     | 3%     | $4.50        |
| 12   | 80%    | 12%    | 4%     | 3%     | 1%     | $2.80        |
| 24   | 95%    | 3%     | 1%     | 0.5%   | 0.5%   | $1.74        |

The monotonic decrease in upper-tier utilization supports the compounding hypothesis: crystallized knowledge and myelinated pathways progressively shift request categories from expensive tiers to cheap ones.

Crystal Lattice Dynamics

At week 24 steady state, the lattice contained 60 seeds (actively nucleating), 50 growing crystals, 120 solid crystals, and 45 faceted crystals. Approximately 340 seeds had dissolved during the evaluation period, demonstrating the self-cleaning property. Twelve fracture events were observed, all corresponding to code refactors or corrected misunderstandings.

Resource Utilization

The orchestration layer on the Raspberry Pi 5 consumed 1.2GB RAM, 2% average CPU, and 45MB of disk for SQLite databases. No GPU was required. The nightly sleep cycle completed in under 30 seconds at a cost of $0.0007 per execution.

Discussion

Standard cascade routing (e.g., FrugalGPT [4]) passes the same context to each tier.
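For contrast with a cascade that forwards the same context to every tier, the escalation loop defined in the Methods can be sketched as follows. This is a toy illustration under several assumptions: the `Tier` type, the stub models, the 0.8 thresholds, and the body of `distill` (which simply halves a token list and spends one slot on a failure annotation) are all hypothetical, and returning the final tier's answer regardless of its score is a choice made here because the text does not say what happens when the last tier falls short.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    threshold: float  # t_i in the text
    # Stand-in for a cloud API call: (request, context) -> (response, confidence, failure reason)
    model: Callable[[str, List[str]], Tuple[str, float, str]]


def distill(request: str, context: List[str], response: str, reason: str) -> List[str]:
    """Placeholder for distill(r, C_i, R_i, reason_i).

    Keeps fewer than half of the context tokens and prepends one annotation of
    what was missing, so the result is strictly smaller for len(context) >= 2.
    """
    keep = max(0, len(context) // 2 - 1)
    return [f"[missing:{reason}]"] + context[:keep]


def run_cascade(request: str, context: List[str], tiers: List[Tier]):
    """Escalate through tiers, shrinking the context after every unresolved attempt."""
    sizes = [len(context)]
    for i, tier in enumerate(tiers):
        response, score, reason = tier.model(request, context)
        is_last = i == len(tiers) - 1
        if score >= tier.threshold or is_last:  # assumption: the last tier always answers
            return tier.name, response, sizes
        context = distill(request, context, response, reason)
        sizes.append(len(context))


def stub(confidence: float):
    """Build a fake model that always reports the given confidence."""
    def model(request: str, context: List[str]):
        return f"answer using {len(context)} tokens", confidence, "no matching crystal"
    return model


tiers = [
    Tier("sieve", 0.8, stub(0.4)),
    Tier("filter", 0.8, stub(0.6)),
    Tier("refiner", 0.8, stub(0.9)),
]
context = [f"tok{i}" for i in range(16)]
name, response, sizes = run_cascade("why does login fail?", context, tiers)
print(name, response, sizes)
```

The recorded `sizes` list makes the context monotonicity property directly observable: each escalation hands the next tier a strictly smaller input.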
Progressive distillation reduces context at each stage, yielding two distinct benefits: (1) reduced input token costs at higher tiers, and (2) enriched context through prior-tier reasoning annotations. The inclusion of failure reasons from lower tiers allows higher-tier models to avoid repeating unsuccessful approaches.

The wiki pattern suffers from three failure modes: silent staleness, orphaned pages, and inconsistent granularity. Crystallization addresses all three through distinct mechanisms. Stale seeds dissolve, orphaned crystals naturally decay in confidence ranking, and the fixed 20–50 token crystal size enforces consistent granularity. The fracture mechanism provides a property no wiki system offers: automatic self-correction on contradiction.

Limitations

The system's compounding benefit implies a cold-start period (weeks 1–4) where costs are significantly higher than steady state. The confidence evaluation at each tier introduces latency of approximately 200-500ms per tier on cloud APIs. The synaptic pruning model's parameters (decay rates, promotion thresholds) require tuning per deployment and workload. Finally, the 12-token fingerprint format trades detail for compression and may miss nuanced code patterns that full AST analysis would capture.

Broader Impact

The ConsultChain democratizes access to frontier LLM capabilities by making multi-model deployments economically viable on edge hardware. A researcher or developer who could previously afford only a single cheap model can now access frontier reasoning (Opus-class) for the 1% of requests that require it, while maintaining sub-$2/month operating costs.

Conclusion

I presented the ConsultChain, a token optimization architecture that achieves 98.5% cost reduction through three novel mechanisms: progressive context distillation across a five-tier model cascade, self-healing knowledge crystallization, and activity-driven synaptic pruning.
The system runs on a Raspberry Pi 5 with approximately 1.2GB RAM and compounds its efficiency over time as knowledge accumulates. I release the implementation as open source to enable reproduction and extension.

Declarations

Funding: This research received no external funding.

Competing Interests: The author declares no competing interests.

Ethics Approval: Not applicable. This study did not involve human participants, human data, or animal subjects.

Consent to Participate: Not applicable.

Consent for Publication: Not applicable.

Data Availability: The source code, configuration templates, and evaluation scripts for ConsultChain are available as open source at https://github.com/samueledusa/consultchain. All data generated during the evaluation are included in the repository.

Code Availability: The complete implementation is available at https://github.com/samueledusa/consultchain under the MIT License.

Author Contributions: S.E. conceived the system architecture, designed and implemented the software, conducted the evaluation, and wrote the manuscript.

AI Assistance Disclosure: Portions of this manuscript were developed with the assistance of large language models, including Claude (Anthropic). The author was responsible for all conceptual contributions, system design, experiments, and final editing. The AI tools were used for drafting, editing, and refining explanations. The author has verified all claims and takes full responsibility for the content. In accordance with Research Square policy, the LLM is not listed as an author.

References

[1] Li, Y., et al. Compressing Context to Enhance Inference Efficiency of Large Language Models. arXiv:2310.06201 (2023).
[2] Jiang, H., et al. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv:2310.05736 (2023).
[3] Bang, Y., et al. GPTCache: An Open-Source Semantic Cache for LLM Applications. arXiv:2308.xxxxx (2023).
[4] Chen, L., et al. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176 (2023).
[5] Karpathy, A. LLM Wiki Pattern. GitHub Gist (2025).
[6] Packer, C., et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560 (2023).
[7] Hippo Memory. Biologically-Inspired Persistent Memory for AI Agents. GitHub: kitfunso/hippo-memory (2025).
[8] CodeSight. Zero-Dependency AST-Precision Code Context Generator. GitHub: Houseofmvps/codesight (2025).
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-9368244","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":620297435,"identity":"1f36fa58-e871-4270-8265-52023f617503","order_by":0,"name":"Samuel Edusa, MD","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABE0lEQVRIiWNgGAWjYNCCAiA+AMQf4CIGFgS0GEC0MM4A0jwQEQnitDDzwLUw4NYi3346TeKDAYMc3430h59td9yTs2fvPbrhR4EEA397dwJW88/kbpOcYcBgLHkjx1g690yxMQ/PubSbPUCHSZw5uwG7k3K3SfMYMCRuuJHDIJ3blpDYI5FjdoMHqMVAIherFvn+t9uk/xgw1G+4kf74tyVUy80/eLQw3ADaArQrweBGgpk0I1TLbXy2GNx4u9kS6HLDmWfemFn2nkkw5jlzxuy2jIEEDy6/yPfnbrzxo8JGnu94+uMbP3ckyLG395jdfPPHRo6/vRe7wyAAGguMDQghHjzKkQCyllEwCkbBKBgFMAAAoMRfi676R1YAAAAASUVORK5CYII=","orcid":"https://orcid.org/0000-0003-2836-7536","institution":"Individual Researcher","correspondingAuthor":true,"prefix":"","firstName":"Samuel","middleName":"","lastName":"Edusa","suffix":"MD"}],"badges":[],"createdAt":"2026-04-09 11:55:27","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-9368244/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9368244/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":107480776,"identity":"78955b8f-f496-4cf9-beae-5cd2ad6285b0","added_by":"auto","created_at":"2026-04-22 
02:13:31","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":299194,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9368244/v1/b299daa3-5d96-4d82-a21c-5ebf10efa049.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003eConsultChain: Progressive Context Distillation Across Heterogeneous LLM Fleets for Token-Optimal Inference\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe operational cost of large language model (LLM) inference is dominated by token consumption. For developers and researchers using cloud-hosted models, each API call carries a cost proportional to the number of input and output tokens processed. This cost becomes significant in agentic workflows where models are invoked repeatedly, often re-processing identical context across sessions.\u003c/p\u003e \u003cp\u003eThe emergence of heterogeneous model fleets (deployments where multiple models of varying capability and cost are available simultaneously) creates an optimization opportunity. A typical fleet may span two orders of magnitude in price-per-token. Yet most systems route all traffic to a single model or at best implement binary routing between a \"cheap\" and \"expensive\" tier.\u003c/p\u003e \u003cp\u003eI introduce the ConsultChain, a system that exploits the full price spectrum of a heterogeneous model fleet through three novel mechanisms:\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eProgressive Context Distillation\u003c/strong\u003e \u003cp\u003eA five-tier cascade where each tier attempts to resolve a request. Upon failure, it compresses the context before escalating. 
Higher-tier models receive smaller but semantically richer input.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eKnowledge Crystallization\u003c/strong\u003e \u003cp\u003eA persistent knowledge store where information solidifies through confirmation phases (seed, growing, solid, faceted), self-heals through fracture on contradiction, and dissolves when unconfirmed. This replaces static wiki patterns that suffer from silent staleness.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eSynaptic Pruning\u003c/strong\u003e \u003cp\u003eA memory management system modeled on neural development where pathways strengthen through co-activation (Hebbian learning), promote to permanent storage after sustained strength, and are pruned during scheduled consolidation cycles.\u003c/p\u003e \u003c/p\u003e \u003cp\u003eTogether, these mechanisms achieve 98.5% token cost reduction on a 10-model fleet with a 227x price differential, with the system's efficiency compounding over time as knowledge accumulates.\u003c/p\u003e\n\u003ch3\u003eRelated Work\u003c/h3\u003e\n\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eToken Optimization\u003c/h2\u003e \u003cp\u003ePrior approaches to token optimization fall into three categories: context compression, caching, and routing. Context compression techniques include Selective Context [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], which uses self-information to filter tokens, and LLMLingua [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], which uses a small model to compress prompts. These operate at the token level. My approach operates at the semantic level, compressing meaning through multi-stage distillation. Caching approaches include GPTCache [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] and semantic caching layers that store and retrieve previous responses. 
My knowledge crystallization extends this by treating cached knowledge as mutable, confidence-weighted, and self-healing rather than static key-value pairs. Routing systems like FrugalGPT [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] cascade through models sequentially. My system differs by introducing progressive distillation, where each tier transforms the context before escalation, and by incorporating persistent knowledge and memory systems that compound savings over time.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003ePersistent Knowledge for LLMs\u003c/h3\u003e\n\u003cp\u003eKarpathy [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] proposed the LLM Wiki pattern: a three-layer architecture (raw sources, LLM-maintained wiki, schema document) with ingest, query, and lint operations. My knowledge crystallization replaces the flat wiki with a phase-transition model that provides automatic staleness resolution, confidence-weighted retrieval, and organic knowledge graph formation through faceting.\u003c/p\u003e\n\u003ch3\u003eMemory Systems for AI Agents\u003c/h3\u003e\n\u003cp\u003eMemGPT [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] introduced OS-inspired memory management for LLMs. Hippo Memory [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e] applied neuroscience-inspired decay models with half-lives and reward modulation. 
My synaptic pruning model differs by using activity-driven promotion rather than time-based decay, implementing three-store architecture (sensory, working, myelinated) with Hebbian strengthening, and performing batch consolidation that mirrors sleep-phase neural processing.\u003c/p\u003e\n\u003ch3\u003eCode Context Compression\u003c/h3\u003e\n\u003cp\u003eCodeSight [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] uses AST-based parsing to generate structured code maps, reducing codebase context from 26-47K tokens to 3-5K. My fingerprinting approach achieves further compression to approximately 12 tokens per file (250x compression from raw source) using regex-based structural signatures that are diffable and searchable without LLM involvement.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eSystem Overview\u003c/h2\u003e \u003cp\u003eThe ConsultChain consists of four subsystems operating on a shared data layer: a Cascade Router that classifies requests and routes them through five model tiers, a Crystal Lattice providing persistent knowledge storage with phase-transition semantics, a Synaptic Store for activity-driven cross-session memory with three persistence levels, and a Fingerprint Engine for zero-LLM code context compression. All orchestration runs on a single-board computer (Raspberry Pi 5, 8GB RAM). Models are accessed via cloud APIs. The orchestration layer consumes approximately 1.2GB RAM.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eThe Five-Tier Cascade\u003c/h3\u003e\n\u003cp\u003eLet M = {m_0, m_1, ..., m_n} be a set of models ordered by cost-per-token c(m_i) such that c(m_0)\u0026thinsp;\u0026lt;\u0026thinsp;c(m_1) \u0026lt; ... \u0026lt; c(m_n). For an incoming request r, the cascade processes as follows. At tier i, the system assembles context C_i from the crystal lattice, synaptic store, and fingerprint index, subject to a per-tier token budget B_i. 
It generates a response R_i using model m_i with input (r, C_i), then evaluates confidence score s_i for R_i. If s_i\u0026thinsp;\u0026gt;\u0026thinsp;=\u0026thinsp;threshold t_i, the response is returned (resolution). If s_i\u0026thinsp;\u0026lt;\u0026thinsp;t_i, the system generates a distilled context D_i\u0026thinsp;=\u0026thinsp;distill(r, C_i, R_i, reason_i) where |D_i| \u0026lt; |C_i|, and passes (r, D_i) to tier i\u0026thinsp;+\u0026thinsp;1.\u003c/p\u003e \u003cp\u003eA key property is context monotonicity: |C_0| \u0026gt; |D_0| \u0026gt; |D_1| \u0026gt; ... \u0026gt; |D_{n-1}|. Context size decreases as it ascends the cascade. This is the primary departure from standard cascade routing where context size is constant or increasing. While |D_i| \u0026lt; |C_i|, the semantic density of D_i is higher because it includes the distilled context from tier i, that tier's partial reasoning about the request, and explicit annotation of what knowledge was missing.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFive-tier cascade configuration with steady-state resolution rates.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTier\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModels\u003c/p\u003e \u003c/th\u003e 
\u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eInput Budget\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eResolution Rate\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e0 (Sieve)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eqwen3.5, gemma4:31b\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e450 tokens\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e75%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1 (Filter)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eminimax-m2.7, kimi-k2.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e370 tokens\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e15%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2 (Refiner)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eclaude-haiku-4-5, glm-5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e280 tokens\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e6%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3 (Distiller)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eclaude-sonnet-4-5, gpt-5.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e470 tokens\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e3%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4 (Alembic)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eclaude-opus-4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e650 tokens\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e\n\u003ch3\u003eKnowledge Crystallization\u003c/h3\u003e\n\u003cp\u003eI model persistent knowledge as a lattice of crystals, each representing an atomic knowledge claim. A crystal K is a tuple (claim, phase, confidence, confirmations, facets, timestamp) where claim is a natural language assertion of 20\u0026ndash;50 tokens, phase is one of {seed, growing, solid, faceted}, confidence is a value in [0, 1], confirmations counts independent verifications, facets is a set of edges to related crystals, and timestamp records creation and last confirmation.\u003c/p\u003e \u003cp\u003eCrystals undergo phase transitions analogous to physical crystallization. During nucleation, new information enters as a seed with confidence 0.3. Seeds that receive no confirmation within 14 days are dissolved. During growth, upon independent confirmation (from a different source or model tier), confidence increases by 0.15 up to a maximum of 0.85. After 4 confirmations, phase transitions to solid. During faceting, a low-cost model proposes connections between solid crystals during batch consolidation. When two crystals are linked, both transition to faceted with confidence approaching 0.95. During fracture, when information contradicting crystal K enters the lattice with non-trivial confidence, K splits into competing seeds representing the original and contradictory claims. 
Both must re-nucleate independently.\u003c/p\u003e \u003cp\u003eUnder the crystallization model, the maximum duration a false claim persists in the lattice is bounded by max(TTL_seed, time_to_fracture), where time_to_fracture depends on the rate of contradictory information arrival. In contrast, static wiki systems have unbounded staleness.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eSynaptic Pruning\u003c/h2\u003e \u003cp\u003eI model cross-session memory as a directed graph of pathways connecting contexts to outcomes, managed across three stores inspired by neural memory systems. The Sensory Buffer is a fixed-size (30 items) FIFO queue holding immediate session context with no persistence. Items are promoted to working pathways upon second reference within a session. Working Pathways each have strength sigma in [0, 1] updated by a Hebbian rule (co-activated pathways create or strengthen links), an activity update (traversal increases strength, positive outcomes add a bonus, negative outcomes reduce strength), and time-based decay (pathways idle for 72 hours lose 10% strength per cycle). Myelinated Pathways are working pathways that maintain strength above 0.7 for 7 consecutive days and are promoted to permanent storage. These are only pruned if strength reaches 0 for 60 days.\u003c/p\u003e \u003cp\u003eA daily sleep cycle performs Hebbian link strengthening for co-activated pathways, pruning of weak working pathways, promotion of strong candidates, and crystal maintenance including dissolution of stale seeds and facet proposal.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eCode Fingerprinting\u003c/h2\u003e \u003cp\u003eFor codebases, I generate structural fingerprints using regex-based extraction rather than full AST parsing. 
A fingerprint is a pipe-delimited string encoding exports, middleware, routes with HTTP method and path parameters, database field accesses, and import dependencies. Average fingerprint length is 12 tokens compared to 3,000 tokens for the average source file, yielding a 250x compression ratio. Fingerprints support efficient diff operations that produce compact change summaries, enabling incremental context updates without re-scanning.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eExperimental Setup\u003c/h2\u003e \u003cp\u003eI deployed the ConsultChain on a Raspberry Pi 5 (8GB RAM) running a 10-model fleet accessed via cloud APIs. The evaluation period spanned simulated workloads of 50 requests per day across five categories: code questions, knowledge recall, architecture queries, complex reasoning, and novel problems.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eToken Cost Reduction\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eToken cost comparison across configurations.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eConfiguration\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eDaily Cost\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMonthly Cost\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eReduction\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSingle model (Sonnet, full context)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e3.75\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e112.50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ebaseline\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBinary routing (cheap/expensive)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e0.27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e8.10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e92.8%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCascade only (no knowledge/memory)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e0.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e4.20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e96.3%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eConsultChain (full 
system)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e0.058\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e1.74\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98.5%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe full system achieves an additional 58% reduction over cascade-only routing (monthly cost falls from \u003cspan\u003e$\u003c/span\u003e4.20 to \u003cspan\u003e$\u003c/span\u003e1.74) by resolving more requests at lower tiers via crystal lookups and synaptic recall.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eCompounding Behavior\u003c/h2\u003e \u003cp\u003eI measured tier resolution rates over time:\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTier resolution rates and cost over time.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eWeek\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTier 0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTier 1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTier 2\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eTier 3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTier 4\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMonthly Cost\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e40%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e25%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e15%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e12%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e8%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e8.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e58%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e22%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e12%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3%\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e4.50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e80%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e12%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e3%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e2.80\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e95%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cspan\u003e$\u003c/span\u003e1.74\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe monotonic decrease in upper-tier utilization confirms the compounding hypothesis: crystallized knowledge and myelinated pathways progressively absorb request categories from expensive to cheap tiers.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eCrystal Lattice 
Dynamics\u003c/h2\u003e \u003cp\u003eAt week 24 steady state, the lattice contained 60 seeds (actively nucleating), 50 growing crystals, 120 solid crystals, and 45 faceted crystals. Approximately 340 seeds had dissolved during the evaluation period, demonstrating the self-cleaning property. Twelve fracture events were observed, all corresponding to code refactors or corrected misunderstandings.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eResource Utilization\u003c/h2\u003e \u003cp\u003eThe orchestration layer on Raspberry Pi 5 consumed 1.2GB RAM, 2% average CPU, and 45MB disk for SQLite databases. No GPU was required. The nightly sleep cycle completed in under 30 seconds at a cost of \u003cspan\u003e$\u003c/span\u003e0.0007 per execution.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eStandard cascade routing (e.g., FrugalGPT [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]) passes the same context to each tier. Progressive distillation reduces context at each stage, yielding two distinct benefits: (1) reduced input token costs at higher tiers, and (2) enriched context through prior-tier reasoning annotations. The inclusion of failure reasons from lower tiers allows higher-tier models to avoid repeating unsuccessful approaches.\u003c/p\u003e \u003cp\u003eThe wiki pattern suffers from three failure modes: silent staleness, orphaned pages, and inconsistent granularity. Crystallization addresses all three through distinct mechanisms. Stale seeds dissolve, orphaned crystals naturally decay in confidence ranking, and the fixed 20\u0026ndash;50 token crystal size enforces consistent granularity. 
The fracture mechanism provides a property no wiki system offers: automatic self-correction on contradiction.\u003c/p\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003eThe system's compounding benefit implies a cold-start period (weeks 1\u0026ndash;4) during which costs are well above steady state: Table 3 reports \u003cspan\u003e$\u003c/span\u003e8.00/month at week 1 and \u003cspan\u003e$\u003c/span\u003e4.50/month at week 4, against \u003cspan\u003e$\u003c/span\u003e1.74/month at week 24. The confidence evaluation at each tier adds approximately 200\u0026ndash;500 ms of latency per tier on cloud APIs. The synaptic pruning model's parameters (decay rates, promotion thresholds) require tuning per deployment and workload. Finally, the fingerprint format (12 tokens on average) trades detail for compression and may miss nuanced code patterns that full AST analysis would capture.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eBroader Impact\u003c/h2\u003e \u003cp\u003eThe ConsultChain democratizes access to frontier LLM capabilities by making multi-model deployments economically viable on edge hardware. A researcher or developer who could previously afford only a single cheap model can now access frontier reasoning (Opus-class) for the 1% of requests that require it, while maintaining sub-\u003cspan\u003e$\u003c/span\u003e2/month operating costs.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eI presented the ConsultChain, a token optimization architecture that achieves 98.5% cost reduction through three novel mechanisms: progressive context distillation across a five-tier model cascade, self-healing knowledge crystallization, and activity-driven synaptic pruning. The system runs on a Raspberry Pi 5 with under 1.2GB RAM and compounds its efficiency over time as knowledge accumulates. 
I release the implementation as open source to enable reproduction and extension.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research received no external funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe author declares no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics Approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable. This study did not involve human participants, human data, or animal subjects.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent to Participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for Publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe source code, configuration templates, and evaluation scripts for ConsultChain are available as open source at https://github.com/samueledusa/consultchain. All data generated during the evaluation are included in the repository.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe complete implementation is available at https://github.com/samueledusa/consultchain under the MIT License.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eS.E. conceived the system architecture, designed and implemented the software, conducted the evaluation, and wrote the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI Assistance Disclosure\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePortions of this manuscript were developed with the assistance of large language models, including Claude (Anthropic). 
The author was responsible for all conceptual contributions, system design, experiments, and final editing. The AI tools were used for drafting, editing, and refining explanations. The author has verified all claims and takes full responsibility for the content. In accordance with Research Square policy, the LLM is not listed as an author.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eLi, Y., et al. Compressing Context to Enhance Inference Efficiency of Large Language Models. arXiv:2310.06201 (2023).\u003c/li\u003e\n\u003cli\u003eJiang, H., et al. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv:2310.05736 (2023).\u003c/li\u003e\n\u003cli\u003eBang, Y., et al. GPTCache: An Open-Source Semantic Cache for LLM Applications. arXiv:2308.xxxxx (2023).\u003c/li\u003e\n\u003cli\u003eChen, L., et al. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176 (2023).\u003c/li\u003e\n\u003cli\u003eKarpathy, A. LLM Wiki Pattern. GitHub Gist (2025).\u003c/li\u003e\n\u003cli\u003ePacker, C., et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560 (2023).\u003c/li\u003e\n\u003cli\u003eHippo Memory. Biologically-Inspired Persistent Memory for AI Agents. GitHub: kitfunso/hippo-memory (2025).\u003c/li\u003e\n\u003cli\u003eCodeSight. Zero-Dependency AST-Precision Code Context Generator. 
GitHub: Houseofmvps/codesight (2025).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Large language models, Token optimization, Model cascading, Knowledge graphs, Memory management, Edge computing","lastPublishedDoi":"10.21203/rs.3.rs-9368244/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9368244/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground: \u003c/strong\u003eThe operational cost of large language model (LLM) inference is dominated by token consumption. In heterogeneous model fleets where prices span two orders of magnitude, most systems still route all traffic to a single model or implement only binary cheap/expensive routing. 
This leaves significant cost optimization on the table, particularly for agentic workflows that re-process identical context across sessions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods: \u003c/strong\u003eI designed and implemented ConsultChain, an architecture introducing three novel mechanisms: (1) a five-tier model cascade with progressive context distillation, where each tier compresses context before escalation rather than forwarding raw input, (2) knowledge crystallization, a self-healing persistent knowledge store modeled on physical crystal formation with nucleation, growth, and fracture phases, and (3) synaptic pruning, an activity-driven memory management system inspired by adolescent neural development that promotes frequently-used pathways and eliminates idle ones. The system was deployed on a Raspberry Pi 5 orchestrating a 10-model fleet accessed via cloud APIs and evaluated over simulated workloads of 50 requests per day.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults: \u003c/strong\u003eThe system achieves 98.5% token cost reduction at steady state ($1.74/month) compared to single-model full-context baselines ($112.50/month) on a fleet spanning a 227x price differential ($0.11/M to $25/M tokens). Costs compound downward over time: Tier 0 resolution rates increase from 40% at week 1 to 95% at week 24 as the knowledge lattice matures. The orchestration layer runs with under 1.2GB RAM and requires no local GPU.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusions: \u003c/strong\u003eProgressive context distillation, combined with self-healing knowledge storage and activity-driven memory pruning, enables cost reductions that compound over time rather than remaining static. The approach is feasible on edge hardware and generalizes to any heterogeneous model fleet. 
The implementation is released as open source.\u003c/p\u003e","manuscriptTitle":"ConsultChain: Progressive Context Distillation Across Heterogeneous LLM Fleets for Token-Optimal Inference","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-13 04:05:18","doi":"10.21203/rs.3.rs-9368244/v1","editorialEvents":[{"type":"communityComments","content":1}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"b446fa59-12f1-442d-8eb1-7d9e950a6813","owner":[],"postedDate":"April 13th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":66011659,"name":"Artificial Intelligence and Machine Learning"},{"id":66011660,"name":"Computer Architecture and Engineering"}],"tags":[],"updatedAt":"2026-04-13T04:05:18+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-13 04:05:18","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9368244","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9368244","identity":"rs-9368244","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
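The cascade step described in the Methods (generate a response, score it, then either resolve or distill and escalate) can be illustrated with a minimal Python sketch. This is not the author's implementation: the `Tier` class, the whitespace-based `size` measure, the 3/4 truncation ratio inside `distill`, and the rule that the last tier always answers are all assumptions made for illustration. A real `distill` would presumably summarize with a model rather than truncate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One cascade tier (hypothetical structure, not from the paper's code)."""
    name: str
    threshold: float                      # t_i
    generate: Callable[[str, str], str]   # (r, context) -> R_i
    confidence: Callable[[str], float]    # R_i -> s_i in [0, 1]


def size(text: str) -> int:
    """Crude context size: whitespace-separated words (assumption)."""
    return len(text.split())


def distill(request: str, context: str, response: str, reason: str) -> str:
    """Toy stand-in for distill(r, C_i, R_i, reason_i).

    Prepends the failing tier's partial answer and failure reason, then
    truncates to 3/4 of the incoming size, so |D_i| < |C_i| holds for any
    non-empty context. `request` is unused in this toy version.
    """
    notes = f"[prior: {response}] [missing: {reason}]".split()
    limit = size(context) * 3 // 4        # strictly smaller for size >= 1
    return " ".join((notes + context.split())[:limit])


def run_cascade(
    request: str, context: str, tiers: List[Tier]
) -> Tuple[str, int, List[int]]:
    """Return (response, index of resolving tier, context size seen per tier)."""
    sizes: List[int] = []
    for i, tier in enumerate(tiers):
        sizes.append(size(context))
        response = tier.generate(request, context)
        score = tier.confidence(response)
        # Resolution: s_i >= t_i. Assumption: the last tier always answers.
        if score >= tier.threshold or i == len(tiers) - 1:
            return response, i, sizes
        reason = f"confidence {score:.2f} below {tier.threshold:.2f}"
        context = distill(request, context, response, reason)
    raise ValueError("tiers must be non-empty")
```

Calling `run_cascade` with stub `generate` and `confidence` callables returns the per-tier context sizes, which makes the monotonicity property easy to check for a given `distill` implementation.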

