Experiential neural architecture selection: dynamic cross-layer memory for real-time inference optimization

JOSE MARIA LANCHO RODRIGUEZ

Research Article, Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7378044/v1
This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Neural networks suffer from operational amnesia: they process each input as if it were the first time, without remembering which neuron combinations proved effective in similar contexts. We introduce ExNAS (Experiential Neural Architecture Selection), a system that performs real-time, neuron-granular architectural adaptation during the same inference by leveraging a distributed experiential memory.
ExNAS records layer-wise neural fingerprints and lightweight contextual metadata and then performs transversal selection across non-consecutive layers under explicit per-layer and global budgets. On a CPU proof-of-concept using a small CNN (2×Conv+FC), ExNAS delivers measurable time reductions (≈3.7–7.9%) and throughput gains (≈3.8–8.5%) at low active fractions (≈4.7–10.9%), without retraining. We detail the design, provide formal definitions, and discuss sensitivity to budgets and a negative case where heavier adaptation adds overhead. These results substantiate experience-guided, neuron-level conditional computation as a practical tool for real-time inference.

Subject: Artificial Intelligence and Machine Learning
Keywords: experiential neural memory, cross-layer transversal selection, dynamic inference, neuron-level gating, conditional computation

1. Introduction

Modern neural networks excel at pattern recognition yet operate under a stark constraint: operational amnesia. Even when repeatedly facing similar inputs, they typically process each one statelessly, with intermediate representations computed once and left unchanged, and without remembering which cross-layer neuron constellations worked best before. A seasoned clinician who forgets, after each correct diagnosis, the specific constellation of signs and heuristics that led to it would face a similar handicap: solving familiar problems from scratch, again and again.

This amnesia manifests along three axes: (i) static intermediate processing, where upstream representations cannot be modulated by evidence discovered downstream during the same inference; (ii) absence of experiential memory, i.e., a record of how the network succeeded (which neurons and layers cooperated), as opposed to what content it saw; and (iii) limited architectural adaptation at inference time, generally restricted to block-level routing instead of neuron-level selection across non-consecutive layers.
We propose Experiential Neural Architecture Selection (ExNAS), which reframes inference as experience-guided architectural selection. The core idea is operational rather than semantic: if the model can retain how it solved similar situations (captured as layer-wise fingerprints and lightweight context), then, during the same forward pass, it can selectively gate neurons across layers (not necessarily consecutive) under explicit budgets.

The contributions of this work are threefold: a formalization of experiential memory that stores processing patterns rather than content; a transversal, neuron-granular selection mechanism with per-layer and global budget constraints acting within the same inference; and a lightweight implementation showing consistent speedups on CPU at low active fractions and without retraining. We also report a negative case clarifying where overhead offsets sparsity gains.

2. Related work

Neural Architecture Search (NAS) optimizes architectures during training (e.g., reinforcement learning, evolutionary, differentiable methods), but deployed networks remain static at inference [1, 2, 3]. Mixture-of-Experts (MoE) achieves sparsity via block-level routing of tokens to experts; it typically does not record or exploit experiential neuron-level cooperation [4, 5]. Memory-augmented models (e.g., NTM/DNC, kNN-LM) store and retrieve content, whereas our memory stores how computation succeeded (layer fingerprints and simple context) [6, 7]. Dynamic/early-exit and adaptive computation time approaches vary depth or halting time but generally do not perform neuron-level transversal selection guided by experiential history within the same inference [8, 9, 10, 11]. Test-time adaptation/caching reuses recent information but, again, typically lacks experience-guided neuron selection across non-consecutive layers.

3. Methodology

3.1 Notation and high-level view

At inference time, the system keeps a compact record of how the network processed each input, rather than the input itself. For every forward pass, ExNAS summarizes the activation of each relevant layer into a short fingerprint vector and stores it together with a small amount of context (for example, simple input statistics and an output-confidence proxy) and a timestamp. This yields a stream of lightweight records that can be searched quickly.

Fingerprints are designed to be inexpensive and stable: in convolutional layers, activations are spatially averaged to produce one value per channel; in fully connected layers, simple per-feature summaries are used. These summaries are then mapped to a fixed dimension and normalized so that cosine similarity is meaningful. No raw inputs or labels are retained, only the compressed signals needed to recognize that the current computation is similar to a previously successful one.

When a new input arrives, ExNAS computes the same set of fingerprints for the ongoing forward pass and compares them with the fingerprints saved in memory. Similarity is computed layer by layer and then averaged across the layers that overlap between the current run and a stored record; a recency weight down-weights stale records so that recent experience counts more. Only a small top-k set of the most similar and recent records is kept (e.g., k = 16) to guide selection.

From those retrieved records, ExNAS derives a memory signal per layer: intuitively, a compact hint of which parts of the network tended to help in comparable situations. That signal is combined with current activation statistics to score units and produce binary masks for the layers where gating is enabled. Masks are applied within the same forward pass and are constrained by explicit per-layer and global budgets so that the overall active fraction remains small.
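The fingerprint and retrieval steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the author's implementation: the names `Record`, `conv_fingerprint`, and `retrieve` are hypothetical, the exponential half-life form of the recency weight is an assumption (the text only says that older records are down-weighted), and the projection to a fixed dimension is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    fingerprints: dict  # layer name -> list of floats (assumed normalized)
    timestamp: float    # inference counter or clock time (assumed unit)


def conv_fingerprint(feature_map):
    """Spatially average a [channels][height][width] activation, then L2-normalize."""
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    norm = math.sqrt(sum(m * m for m in means))
    return [m / norm for m in means] if norm > 0 else means


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def record_similarity(current, stored):
    """Average cosine similarity over the layers present in both runs."""
    shared = [name for name in current if name in stored]
    if not shared:
        return 0.0
    return sum(cosine(current[n], stored[n]) for n in shared) / len(shared)


def recency_weight(age, half_life):
    """Assumed decay rule: the weight halves every `half_life` units of age."""
    return 0.5 ** (age / half_life)


def retrieve(current, memory, now, k=16, half_life=10_000.0):
    """Return up to k (weight, record) pairs, highest weight first.

    The weight is similarity times recency, as described in the text.
    """
    scored = []
    for rec in memory:
        sim = record_similarity(current, rec.fingerprints)
        scored.append((sim * recency_weight(now - rec.timestamp, half_life), rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The sketch scans the whole memory on every query, so lookup cost grows with memory size. That is acceptable for a bounded memory but would need an index for larger stores.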
After producing the output, the system appends the new record to memory; if capacity is exceeded, the oldest or least useful records are discarded according to the same recency policy. This keeps both lookup time and storage bounded.

For readers who prefer a formal definition, the mechanism can be summarized compactly as follows. Consider layers {L₁, …, Lₗ} with activations a⁽ˡ⁾ for input x. We define a layer fingerprint h⁽ˡ⁾ ∈ ℝ^(d_l) as a compressed summary of a⁽ˡ⁾ (e.g., channel-wise means; for 2D feature maps, spatial averaging that yields one value per channel). An experiential memory M stores tuples ({h⁽ˡ⁾}, C, t) consisting of fingerprints across layers, lightweight context C (e.g., input statistics, output confidence), and timestamp t. Given current fingerprints {ĥ⁽ˡ⁾}, ExNAS queries M to retrieve a small set of similar past executions, weighs recency, and then computes neuron-level scores that combine current evidence and memory evidence. From these scores, it derives masks for a subset of layers S (not necessarily consecutive), constrained by per-layer and global budgets, and applies them within the same forward pass.

3.2 Experiential memory

Purpose and contents. At inference time, ExNAS records how the network processed each input, not the input itself. For every forward pass it creates a compact, layer-wise fingerprint and stores a single record consisting of: (i) a map from layer name to a fixed-length fingerprint vector, (ii) lightweight context (basic input statistics and an output-confidence proxy), and (iii) a timestamp. No raw inputs or labels are kept. The goal is a searchable log that recognizes when the current computation resembles previously successful ones while remaining small and fast.

Constructing a fingerprint. Each layer’s activation is turned into a stable, low-cost summary.
In convolutional layers, activations are first averaged over the spatial dimensions to produce one value per channel; in fully connected layers, simple per-feature statistics (e.g., mean magnitude) are used. These per-layer summaries are then projected to a fixed dimension (for example, d = 64) using a light linear map or a fixed random projection and normalized so that cosine similarity is meaningful. The result is one normalized vector per layer, per inference.

Query at inference time. When a new input is processed, ExNAS computes the current set of fingerprints and compares them against the stored records. Similarity between the current run and a past record is computed as the average cosine similarity across the layers that both records contain; no extra training or alignment is required. To favor fresh information, a recency weight down-weights older records (for example, using a half-life measured in days or in number of inferences). Only a small top-k set of records is retained for the decision stage (e.g., k = 16), which keeps lookup cost low and predictable.

Signal returned to the selector. From the retrieved records ExNAS derives a per-layer memory signal: a weighted combination of the retrieved fingerprints, where weights reflect similarity times recency. This signal is compared with the current activation statistics to produce neuron-level scores and a binary mask in each gated layer. Masks are applied within the same forward pass and obey explicit per-layer and global budgets so that the overall active fraction remains small and controllable.

Update and capacity control. After producing the output, ExNAS appends the new record (fingerprints, context, timestamp). When memory reaches capacity, it compacts by discarding the least recent or least useful records under the same recency policy. This keeps storage bounded and preserves fast queries.

Defaults and reproducibility. Unless noted otherwise, fingerprints use d = 64, retrieval uses top-k = 16, recency follows a practical half-life (for example, one week or ten thousand inferences), and gating is enabled in Conv2 (output-channel gating) and FC (input-feature gating). A short warm-up of five iterations is run before timing to populate memory with initial records.

Efficiency considerations. Because fingerprints are short and the retrieval set is small, the added work per batch is modest: a few reductions per layer to compute fingerprints, a lookup and scoring step that is linear in (top-k × fingerprint dimension × number of gated layers), and an element-wise mask application that is negligible. These design choices are consistent with the measured CPU gains when active fractions are low and evaluation avoids heavy in-loop updates.

Intuition. If the current fingerprints for Conv2 and FC closely match several recent records, the system effectively “recognizes” the processing situation and prioritizes the units that proved useful in those cases. If no good matches are found, the memory contributes little and selection falls back to current-signal evidence. In both cases, decisions occur within the same forward pass and under explicit budget constraints.

3.3 Cross-layer transversal selection

For each gated layer l, a neuron score vector is computed as a convex combination of standardized current activation statistics and a standardized memory-derived signal. The top units are selected under a per-layer budget (b_l) and, if their aggregate would exceed the global cap (B_g), masks are thinned proportionally. The resulting binary masks are applied within the same forward pass, including layers that are not consecutive.

Let layer l have N_l units.
We compute a score vector s⁽ˡ⁾ ∈ ℝ^(N_l) as a convex combination of standardized current activations (per-unit statistics) and standardized memory-derived signals:

s⁽ˡ⁾ = w_cur φ(z(a⁽ˡ⁾)) + w_mem φ(z(m⁽ˡ⁾)),   w_cur + w_mem = 1

where z(·) denotes per-layer standardization and φ a nonlinearity (e.g., ReLU). We select the top k_l = ⌊b_l N_l⌋ units under a per-layer budget b_l ∈ (0,1]. If the aggregate selection exceeds the global budget B_g, we thin masks proportionally. This yields a mask applied within the same forward pass across layers l ∈ S, including layers that are not consecutive.

3.4 Optional representation enrichment

An optional variant computes an enrichment vector r from transversal activations plus a running state and mixes it back into one or more target layers through a gated rule with a context-dependent threshold. We describe the mechanism but do not use it in the CPU proof-of-concept.

4. Experimental results

4.1 Setup

Device and environment. All runs are executed on a single-CPU machine (no GPU). Default PyTorch CPU settings are used (no manual thread pinning), and both the baseline and ExNAS run in the same environment to ensure a fair comparison. Absolute times are device-dependent; results are intended to capture relative improvements.

Model. SmallCNN with topology Conv1 → ReLU → Conv2 → ReLU → AvgPool → Flatten → FC (10 classes). ExNAS applies neuron-granular gating in Conv2 (output-channel gating) and in FC (input-feature gating); Conv1 is left ungated.

Data. CIFAR-10 via torchvision when available; otherwise, a synthetic, balanced 10-class dataset (6000 train / 1000 test). The same test split is evaluated by both the baseline and ExNAS.

Training and evaluation. The baseline is trained for one light epoch. ExNAS does not retrain; it performs a short warm-up (5 iterations) to populate the experiential memory and then evaluates on the test split. Evaluation uses the identical dataloader and batching in both conditions.
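As a concrete reading of the scoring rule in Section 3.3 and of the gating used in this setup, the following sketch scores the units of one layer and keeps the top ⌊b_l N_l⌋ of them. It is an illustration under assumptions, not the author's code: the function names are hypothetical, the equal weighting w_cur = w_mem = 0.5 is a placeholder (the paper does not report the weights), and the memory signal is assumed to be already expressed per unit.

```python
import math


def standardize(values):
    """z(.): zero mean and unit variance within one layer (zeros if variance is 0)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def relu(values):
    """phi(.): the ReLU nonlinearity given as an example in the text."""
    return [max(0.0, v) for v in values]


def layer_scores(current_stats, memory_signal, w_cur=0.5):
    """Convex combination of current and memory evidence (w_cur + w_mem = 1)."""
    w_mem = 1.0 - w_cur
    cur = relu(standardize(current_stats))
    mem = relu(standardize(memory_signal))
    return [w_cur * c + w_mem * m for c, m in zip(cur, mem)]


def top_k_mask(scores, budget):
    """Binary mask that keeps the floor(budget * N) highest-scoring units."""
    k = math.floor(budget * len(scores))
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in keep:
        mask[i] = 1
    return mask


def apply_mask(activations, mask):
    """Multiplicative gating: non-selected units are set to zero."""
    return [a * m for a, m in zip(activations, mask)]
```

Zeroing units with a multiplicative mask does not by itself skip their computation. Wall-clock savings depend on the implementation actually skipping the masked filters or features.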
Budgets and retrieval. Unless stated otherwise, the per-layer budget is b_l = 0.12 and the global cap is B_g = 0.06. Memory query uses a small top-k set (e.g., k = 16), cosine similarity across available layer fingerprints, and simple recency down-weighting.

Metrics and timing. The evaluation reports top-1 accuracy, wall-clock time for the entire test pass, and throughput computed as the ratio between the number of test samples and that wall-clock time. Timing is taken with a monotonic clock from immediately before the first batch is processed until the last batch completes, and uses the same dataloader and batch size for the baseline and for ExNAS. A short warm-up of five iterations precedes timing to populate caches and the experiential memory. The baseline is trained for one light epoch; ExNAS does not retrain. All runs are performed on CPU under the same software environment.

Active fractions. For each gated layer, the active fraction is the proportion of units left enabled by the mask during the forward pass. In the convolutional layer (Conv2) this corresponds to the fraction of output filters that remain active; in the fully connected layer (FC) it corresponds to the fraction of input features that contribute to the matrix–vector product. Under the default budgets used in this work (per-layer budget b_l = 0.12 and global cap B_g = 0.06), and with two gated layers of similar size (Conv2 and FC), the effective active fraction per gated layer is approximately six percent, which falls within the observed range (about 4.69–10.94%).

Minimal checklist for reproducibility
- Same test split, same batch size, same preprocessing for baseline and ExNAS.
- Baseline trained for 1 light epoch; ExNAS is not retrained.
- Warm-up: 5 iterations before timing.
- Defaults: per-layer budget b_l = 0.12, global cap B_g = 0.06, retrieval top-k = 16.
- Gated layers: Conv2 (output-channel gating) and FC (input-feature gating).

Mask application. The mask is multiplicative and applied within the same forward pass: in convolution, non-selected output filters (channels) are suppressed; in the fully connected layer, non-selected input columns/features are suppressed accordingly.

4.2 Main result (fast configuration)

With run_public_experiment_v5.py (default settings):
- Baseline: acc = 0.201, time = 7.65 s, throughput = 654.0 samp/s
- ExNAS (fast): acc = 0.101, time = 7.37 s, throughput = 678.8 samp/s
- Active fractions: Conv2 ≈ 6.25%, FC ≈ 6.25%

Effect. Time reduction of 3.66% and throughput gain of 3.80% compared to the baseline, on the same network and without retraining. Accuracy falls from 0.201 to 0.101 in this run.

4.3 Budget sensitivity (grid search)

With autotune_v5.py (grid search): baseline acc = 0.198, time = 7.97 s, throughput = 627.6 samp/s. Best times observed (bp and bg denote the per-layer and global budgets):
- bp = 0.12, bg = 0.06: acc = 0.099, 7.34 s, 681.2 samp/s; fractions ≈ 6.25%
- bp = 0.10, bg = 0.05: acc = 0.102, 7.36 s, 678.9 samp/s; fractions ≈ 4.69%

Relative to baseline: up to −7.9% in time and +8.5% in throughput (best case), at the cost of lower accuracy in this small model and minimal-training regime. As expected, smaller budgets increase sparsity and tend to reduce time, but may degrade accuracy in undertrained models.

4.4 Negative case under heavier adaptation

With run_public_experiment_v5_tuned.py (bp = 0.18, bg = 0.08, warm-up = 50, adaptation enabled during evaluation):
- Baseline: acc = 0.189, time = 7.92 s, throughput = 631.7 samp/s
- ExNAS (tuned): acc = 0.101, time = 8.04 s, throughput = 621.9 samp/s
- Active fractions: ≈ 9.38%

On CPU, the selection/memory overhead can reverse the benefit of sparsity when adaptation is heavier (larger active fractions and more frequent updates), delineating a practical operating boundary.
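The relative changes quoted in Sections 4.2 to 4.4 can be recomputed from the reported times and throughputs. The short script below does that arithmetic; the dictionary layout and the helper name `relative_change` are illustrative choices, while the numbers are copied from the runs above.

```python
def relative_change(baseline, value):
    """Signed relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# Reported figures: (baseline time s, ExNAS time s, baseline samp/s, ExNAS samp/s)
runs = {
    "fast (default)": (7.65, 7.37, 654.0, 678.8),
    "grid best-time": (7.97, 7.34, 627.6, 681.2),
    "tuned (heavy)": (7.92, 8.04, 631.7, 621.9),
}

for name, (t_base, t_exnas, thr_base, thr_exnas) in runs.items():
    d_time = relative_change(t_base, t_exnas)
    d_thr = relative_change(thr_base, thr_exnas)
    # Negative time change means ExNAS was faster than the baseline.
    print(f"{name}: time {d_time:+.2%}, throughput {d_thr:+.2%}")
```

Because the published throughputs are rounded, percentages recomputed from them can differ from the quoted values in the last digit. In the tuned (heavy) run the time change comes out positive, which is the slowdown described in Section 4.4.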
4.5 Summary table

| Setting | Model | Accuracy | Time (s) | Throughput (samp/s) | Active fractions |
|---|---|---|---|---|---|
| Fast (default) | Baseline | 0.201 | 7.65 | 654.0 | - |
| Fast (default) | ExNAS | 0.101 | 7.37 | 678.8 | Conv2 6.25%, FC 6.25% |
| Grid best-time | Baseline | 0.198 | 7.97 | 627.6 | - |
| Grid best-time | ExNAS (0.12/0.06) | 0.099 | 7.34 | 681.2 | Conv2 6.25%, FC 6.25% |
| Tuned (heavy) | Baseline | 0.189 | 7.92 | 631.7 | - |
| Tuned (heavy) | ExNAS | 0.101 | 8.04 | 621.9 | Conv2 9.38%, FC 9.38% |

Defaults: b_l = 0.12, B_g = 0.06; top-k = 16.

4.6 Threats to validity and reproducibility notes

Hardware sensitivity. Wall-clock time and throughput depend on CPU model, BLAS, and OS scheduling. Results emphasize relative differences under identical conditions.

Minimal training. One light epoch on a small CNN intentionally stresses compute savings over absolute accuracy; deeper training may mitigate accuracy drops under aggressive budgets.

Memory policy. A small top-k and simple recency factor are used for speed; richer policies (e.g., FAISS, more features) may change the accuracy–efficiency frontier.

Single-shot reporting. Unless noted, numbers correspond to a single pass. Running multiple seeds and reporting mean ± std is straightforward but out of scope here.

Data pipeline. Timing includes the identical dataloader path in both conditions; caching effects are controlled by the warm-up and by evaluating the full test split.

5. Theoretical considerations

Scope and notation. Let G be the set of gated layers; each gated layer l has N_l units and a per-layer budget b_l ∈ (0,1]. A global cap B_g ∈ (0,1] limits the average active fraction across all layers in G. The effective active fraction in layer l is denoted alpha_l.

Effective active fraction (with a single proportional rule). If the intended selections respect the global cap, then alpha_l = b_l for all l ∈ G. Otherwise, all masks are scaled by the same factor rho = (B_g * sum_over_G N_l) / (sum_over_G b_l * N_l), and alpha_l = rho * b_l.
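The proportional rule above can be checked numerically with a small helper. This is a sketch of the formula as stated, with a hypothetical function name; the layer sizes of 64 units used in the usage note are an assumption for illustration, since the text only says the two gated layers are of similar size.

```python
def effective_fractions(budgets, sizes, global_cap):
    """Return (rho, alphas) under the single proportional thinning rule.

    budgets:    per-layer budgets b_l for the gated layers
    sizes:      number of units N_l in each gated layer
    global_cap: global cap B_g on the size-weighted average active fraction
    """
    total_units = sum(sizes)
    intended = sum(b * n for b, n in zip(budgets, sizes))
    if intended <= global_cap * total_units:
        # The intended selections already respect the cap: alpha_l = b_l.
        return 1.0, list(budgets)
    rho = global_cap * total_units / intended
    return rho, [rho * b for b in budgets]
```

For example, `effective_fractions([0.12, 0.12], [64, 64], 0.06)` gives rho of about 0.5 and alpha_l of about 0.06 for both layers. When layer sizes are equal, the result does not depend on the assumed size.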
Anchor: with defaults b_l = 0.12, B_g = 0.06 and two similarly sized gated layers (Conv2, FC), rho ≈ 0.50 and alpha_l ≈ 0.06 (≈6%), which falls within the observed range (~4.7–10.9%) across settings.

Where savings come from (first-order view).
- Convolutional layers: turning off output filters removes their convolutions entirely; compute scales roughly with alpha_l (fraction of filters kept).
- Fully connected layers: turning off input features removes the corresponding multiply-adds; compute scales roughly with alpha_l (fraction of features kept).

Cost model (first order, verbal).
- Baseline time: T_base = sum_over_all_layers C_l.
- ExNAS time: T_exnas ≈ sum_over_ungated C_l + sum_over_gated (alpha_l * C_l) + C_overhead.
- ExNAS is faster when the saved cost sum_over_gated (1 − alpha_l) * C_l exceeds the overhead C_overhead (memory query + scoring + mask application).

Overhead profile (why it stays small in the fast setting).
- Retrieval: small top-k (e.g., k = 16) with short fingerprints, giving a stable, low per-batch cost.
- Scoring: linear in the number of units in gated layers.
- Mask application: element-wise multiply (negligible).
These choices explain the measured gains with low active fractions and also why heavier adaptation can cross the break-even point.

Practical thresholds (from the CPU runs). Gains were observed when: (i) alpha_l stayed in ~5–11%, (ii) gating targeted cost-dominant layers (later conv and FC), and (iii) retrieval/selection remained lightweight (top-k ≤ 16, no in-loop memory updates). When alpha_l grows toward ~15–20% and/or updates are frequent during evaluation, overhead can cancel the benefit (the negative case).

Reproducibility cues (how to compute and time).
- Active fraction per layer: alpha_l = (1 / N_l) * sum_i mask_l[i].
- Timing: wall-clock over the entire evaluation loop with the same dataloader and batching for baseline and ExNAS; warm-up of 5 iterations to populate memory; one light epoch of training for the baseline; no retraining for ExNAS.
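The active-fraction formula and the first-order cost model in this section can be written out directly. The sketch below uses hypothetical function names, and the per-layer costs in the usage note are made-up numbers chosen only to show the break-even logic; they are not measurements from the paper.

```python
def active_fraction(mask):
    """alpha_l = (1 / N_l) * sum_i mask_l[i] for a binary mask."""
    return sum(mask) / len(mask)


def baseline_time(ungated_costs, gated_costs):
    """T_base: sum of per-layer costs C_l over all layers."""
    return sum(ungated_costs) + sum(gated_costs)


def exnas_time(ungated_costs, gated_costs, alphas, overhead):
    """T_exnas: ungated cost + alpha-scaled gated cost + selection overhead."""
    gated = sum(a * c for a, c in zip(alphas, gated_costs))
    return sum(ungated_costs) + gated + overhead


def break_even_overhead(gated_costs, alphas):
    """Largest overhead for which ExNAS is still no slower than the baseline."""
    return sum((1.0 - a) * c for a, c in zip(alphas, gated_costs))
```

With illustrative costs of 2.0 for the ungated layers, 3.0 and 1.0 for the two gated layers, and alpha_l = 0.5 in both, the saved cost is 2.0 time units. An overhead of 0.5 leaves ExNAS faster (4.5 versus 6.0), while any overhead above 2.0 would make it slower. This is the pattern described for the negative case.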
Defaults used here: b_l = 0.12, B_g = 0.06, top-k = 16; gated layers: Conv2 (output-channel gating) and FC (input-feature gating).

Bottom line. With small active fractions and lightweight selection, the proportional thinning rule enforces the global cap, and first-order scaling of conv/FC costs with alpha_l can make the saved compute exceed the overhead. This is consistent with the CPU results (−3.7% to −7.9% time; +3.8% to +8.5% throughput) and clarifies why heavier adaptation removes the advantage.

6. Discussion

The experiments support a practical claim: experience-guided, neuron-level transversal gating can deliver measurable wall-clock time and throughput gains on a small CPU setup without retraining, provided that active fractions remain low and selection is lightweight. The observed accuracy trade-off is expected in a tiny-model/minimal-training regime; scaling to wider layers (e.g., attention heads and FFN channels) and training more thoroughly should improve the accuracy–efficiency frontier.

The negative case shows an important boundary: on CPU, heavier adaptation (larger active fractions, frequent updates during evaluation) can add enough overhead to offset sparsity benefits. Operationally, this suggests keeping consolidation offline, capping the top-k for retrieval, and using a stricter global budget when targeting CPU deployments.

Finally, the mechanism modulates internal representations (latent space) during the same inference under explicit budgets, an operational lever that can be tuned to different hardware and latency constraints. Future work will include energy/MACs measurements with hardware counters, scaling to transformer architectures, ANN-based retrieval (e.g., FAISS), latency percentiles (P50/P95), robustness, and a systematic evaluation of the representation-enrichment variant.

7. Limitations and future work

This work evaluates a compact CNN on a single-CPU setup with minimal training (1 epoch) and either CIFAR-10 or a synthetic balanced 10-class fallback.
Reported metrics are wall-clock time and throughput; energy, MAC counts, and latency percentiles (P50/P95) were not instrumented and should be added in future iterations (e.g., via hardware performance counters and energy profiling). The representation-enrichment variant is specified but not evaluated. Future work includes: scaling to transformers (gating attention heads and FFN channels); replacing cosine lookup with ANN retrieval (e.g., FAISS); richer recency/memory policies; multi-seed runs with mean ± std and confidence intervals; full energy/MAC instrumentation; and evaluation on edge devices under latency/energy constraints.

8. Conclusion

This work indicates that experience-informed dynamic architectural adaptation is a solid path to improving inference in deep networks beyond static post-training deployment. Specifically, ExNAS records layer-wise fingerprints in an experiential memory and applies neuron-granular selection across layers that are not necessarily consecutive, during the same inference and under per-layer and global budgets, thereby modulating the model’s internal representations (latent space) in real time.

In a CPU proof-of-concept on a SmallCNN (2×Conv + FC), ExNAS achieved 3.7–7.9% reductions in wall-clock time and 3.8–8.5% throughput gains with low active fractions (≈ 4.7–10.9%) and no retraining. A negative case was also observed in which heavier adaptation increased overhead enough to cancel the sparsity benefit, which bounds the practical operating regime and informs future designs. These findings substantiate that experience-guided, neuron-level selective gating is viable and useful for real-time inference.

Future work includes measuring energy/MACs, scaling to transformers (attention heads and FFN channels), integrating ANN-based retrieval (FAISS), reporting latency percentiles (P50/P95) and robustness, and conducting a systematic study of the representation-enrichment component.
These efforts should consolidate the principles explored here and extend their impact to larger-scale deployments.

Declarations

9. Code and data availability

Reproducible scripts: exnas_auto_v5.py, run_public_experiment_v5.py, autotune_v5.py, plot_results_v5.py, make_report_v5.py. The pipeline uses CIFAR-10 via torchvision; if unavailable, it falls back to a synthetic 10-class dataset (6000 train / 1000 test) generated on the fly. Experiments were run on CPU only (no GPU).

A public repository (with commit hash / tag and an archived DOI) will include:
- requirements.txt (e.g., Python 3.12, PyTorch 2.8.0, torchvision 0.23.0, numpy 2.3.2).
- Exact commands to reproduce:

    pip install -r requirements.txt
    python run_public_experiment_v5.py
    python autotune_v5.py
    python make_report_v5.py

- Notes for determinism (recommended): set seeds before running: PYTHONHASHSEED=0, torch.manual_seed(123), numpy.random.seed(123).
- Hardware/OS info and the active-budget configuration used (b_l, B_g, top-k).

This artifact will be made publicly available upon submission/acceptance to facilitate independent verification.

10. Conflict of interest statement

The author declares no competing interests.

11. A patent application covering the methods described here has been filed; a PCT filing is planned within the priority window.

References

[1] Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578
[2] Real E, Moore S, Selle A, Saxena S, Suematsu YL, Tan J, Le QV, Kurakin A (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2902–2911
[3] Liu H, Simonyan K, Yang Y (2018) DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055
[4] Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, Dean J (2017) Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538
[5] Fedus W, Zoph B, Shazeer N (2022) Switch transformer: Scaling to trillion parameter models with simple and efficient sparsity. J Mach Learn Res 23(120):1–39
[6] Graves A, Wayne G, Danihelka I (2014) Neural turing machines. arXiv preprint arXiv:1410.5401
[7] Santoro A, Bartunov S, Botvinick M, Wierstra D, Lillicrap T (2016) Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1842–1850
[8] Teerapittayanon S, McDanel B, Kung HT (2016) BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469
[9] Graves A (2016) Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983
[10] Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł (2018) Universal transformers. arXiv preprint arXiv:1807.03819
[11] Banino A, Balaguer J, Blundell C (2021) PonderNet: Learning to ponder. arXiv preprint arXiv:2107.05407

Additional Declarations

The authors declare no competing interests.
Dynamic/early-exit and adaptive computation time approaches vary depth or halting time but generally do not perform neuron-level transversal selection guided by experiential history within the same inference [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Test-time adaptation/caching reuses recent information but, again, typically lacks experience-guided neuron selection across non-consecutive layers.\u003c/p\u003e"},{"header":"3. Methodology","content":"\u003cp\u003e\u003cb\u003e3.1 Notation and high-level view\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAt inference time, the system keeps a compact record of how the network processed each input, rather than the input itself. For every forward pass, ExNAS summarizes the activation of each relevant layer into a short fingerprint vector and stores it together with a small amount of context (for example, simple input statistics and an output-confidence proxy) and a timestamp. This yields a stream of lightweight records that can be searched quickly. Fingerprints are designed to be inexpensive and stable: in convolutional layers, activations are spatially averaged to produce one value per channel; in fully connected layers, simple per-feature summaries are used. These summaries are then mapped to a fixed dimension and normalized so that cosine similarity is meaningful. No raw inputs or labels are retained\u0026mdash;only the compressed signals needed to recognize that the current computation is similar to a previously successful one.\u003c/p\u003e\u003cp\u003eWhen a new input arrives, ExNAS computes the same set of fingerprints for the ongoing forward pass and compares them with the fingerprints saved in memory. 
Similarity is computed layer by layer and then averaged across the layers that overlap between the current run and a stored record; a recency weight down-weights stale records so that recent experience counts more. Only a small top-k set of the most similar and recent records is kept (e.g., k\u0026thinsp;=\u0026thinsp;16) to guide selection. From those retrieved records, ExNAS derives a memory signal per layer\u0026mdash;intuitively, a compact hint of which parts of the network tended to help in comparable situations. That signal is combined with current activation statistics to score units and produce binary masks for the layers where gating is enabled. Masks are applied within the same forward pass and are constrained by explicit per-layer and global budgets so that the overall active fraction remains small. After producing the output, the system appends the new record to memory; if capacity is exceeded, the oldest or least useful records are discarded according to the same recency policy. This keeps both lookup time and storage bounded.\u003c/p\u003e\u003cp\u003eFor clarity, the mechanism can be summarized with the following compact formulation, which we keep for readers who prefer a formal definition:\u003c/p\u003e\u003cp\u003eConsider layers {L₁, \u0026hellip;, Lₗ} with activations a⁽ˡ⁾ for input x. We define a layer fingerprint h⁽ˡ⁾ \u0026isin; ℝᵈˡ as a compressed summary of a⁽ˡ⁾ (e.g., channel-wise means; for 2D features, spatial averaging followed by channel averaging). An experiential memory M stores tuples ({h⁽ˡ⁾}, C, t) consisting of fingerprints across layers, lightweight context C (e.g., input statistics, output confidence), and timestamp t.\u003c/p\u003e\u003cp\u003eGiven current fingerprints {ĥ⁽ˡ⁾}, ExNAS queries M to retrieve a small set of similar past executions, weighs recency, and then computes neuron-level scores that combine current evidence and memory evidence. 
From these scores, it derives masks for a subset of layers S (not necessarily consecutive), constrained by per-layer and global budgets, and applies them within the same forward pass.\u003c/p\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Experiential memory (word-friendly, developed)\u003c/h2\u003e\u003cp\u003e\u003cb\u003ePurpose and contents.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAt inference time, ExNAS records \u003cem\u003ehow\u003c/em\u003e the network processed each input, not the input itself. For every forward pass it creates a compact, layer-wise fingerprint and stores a single record consisting of: (i) a map from layer name to a fixed-length fingerprint vector, (ii) lightweight context (basic input statistics and an output-confidence proxy), and (iii) a timestamp. No raw inputs or labels are kept. The goal is a searchable log that recognizes when the current computation resembles previously successful ones while remaining small and fast.\u003c/p\u003e\u003cp\u003e\u003cb\u003eConstructing a fingerprint.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eEach layer\u0026rsquo;s activation is turned into a stable, low-cost summary. In convolutional layers, activations are first averaged over the spatial dimensions to produce one value per channel; in fully connected layers, simple per-feature statistics (e.g., mean magnitude) are used. These per-layer summaries are then projected to a fixed dimension (for example, d\u0026thinsp;=\u0026thinsp;64) using a light linear map or a fixed random projection and normalized so that cosine similarity is meaningful. The result is one normalized vector per layer, per inference.\u003c/p\u003e\u003cp\u003e\u003cb\u003eQuery at inference time.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWhen a new input is processed, ExNAS computes the current set of fingerprints and compares them against the stored records. 
Similarity between the current run and a past record is computed as the average cosine similarity across the layers that both records contain; no extra training or alignment is required. To favor fresh information, a recency weight down-weights older records (for example, using a half-life measured in days or in number of inferences). Only a small top-k set of records is retained for the decision stage (e.g., k\u0026thinsp;=\u0026thinsp;16), which keeps lookup cost low and predictable.\u003c/p\u003e\u003cp\u003e\u003cb\u003eSignal returned to the selector.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eFrom the retrieved records ExNAS derives a per-layer memory signal: a weighted combination of the retrieved fingerprints, where weights reflect similarity times recency. This signal is compared with the current activation statistics to produce neuron-level scores and a binary mask in each gated layer. Masks are applied within the same forward pass and obey explicit per-layer and global budgets so that the overall active fraction remains small and controllable.\u003c/p\u003e\u003cp\u003e\u003cb\u003eUpdate and capacity control.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAfter producing the output, ExNAS appends the new record (fingerprints, context, timestamp). When memory reaches capacity, it compacts by discarding the least recent or least useful records under the same recency policy. This keeps storage bounded and preserves fast queries.\u003c/p\u003e\u003cp\u003e\u003cb\u003eDefaults and reproducibility.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eUnless noted otherwise, fingerprints use d\u0026thinsp;=\u0026thinsp;64, retrieval uses top-k\u0026thinsp;=\u0026thinsp;16, recency follows a practical half-life (for example, one week or ten thousand inferences), and gating is enabled in Conv2 (output-channel gating) and FC (input-feature gating). 
A short warm-up of five iterations is run before timing to populate memory with initial records.\u003c/p\u003e\u003cp\u003e\u003cb\u003eEfficiency considerations.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eBecause fingerprints are short and the retrieval set is small, the added work per batch is modest: a few reductions per layer to compute fingerprints, a linear-time lookup and scoring step in the size of \u003cem\u003e(top-k \u0026times; fingerprint dimension \u0026times; number of gated layers)\u003c/em\u003e, and an element-wise mask application that is negligible. These design choices are consistent with the measured CPU gains when active fractions are low and evaluation avoids heavy in-loop updates.\u003c/p\u003e\u003cp\u003e\u003cb\u003eIntuition.\u003c/b\u003e\u003c/p\u003e\u003cp\u003eIf the current fingerprints for Conv2 and FC closely match several recent records, the system effectively \u0026ldquo;recognizes\u0026rdquo; the processing situation and prioritizes the units that proved useful in those cases. If no good matches are found, the memory contributes little and selection falls back to current-signal evidence. In both cases, decisions occur within the same forward pass and under explicit budget constraints.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Cross-layer transversal selection\u003c/h2\u003e\u003cp\u003eFor each gated layer l, a neuron score vector is computed as a convex combination of standardized current activation statistics and a standardized memory-derived signal. The top units are selected under a per-layer budget (b_l) and, if their aggregate would exceed the global cap (B_g), masks are thinned proportionally. The resulting binary masks are applied within the same forward pass, including layers that are not consecutive.\u003c/p\u003e\u003cp\u003eLet layer l have Nₗ units. 
We compute a score vector s⁽ˡ⁾ \u0026isin; ℝᴺˡ as a convex combination of standardized current activations (per-unit statistics) and standardized memory-derived signals:\u003c/p\u003e\u003cp\u003es⁽ˡ⁾ = w_cur φ(z(a⁽ˡ⁾))\u0026thinsp;+\u0026thinsp;w_mem φ(z(m⁽ˡ⁾)), w_cur\u0026thinsp;+\u0026thinsp;w_mem\u0026thinsp;=\u0026thinsp;1\u003c/p\u003e\u003cp\u003ewhere z(\u0026middot;) denotes per-layer standardization and φ a nonlinearity (e.g., ReLU). We select the top kₗ = \u0026lfloor;bₗNₗ\u0026rfloor; units under a per-layer budget bₗ \u0026isin; (0,1]. If the aggregate selection exceeds the global budget B_g, we thin masks proportionally. This yields a mask applied within the same forward across layers l \u0026isin; S, including layers that are not consecutive.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Optional representation enrichment\u003c/h2\u003e\u003cp\u003eAn optional variant computes an enrichment vector r from transversal activations plus a running state and mixes it back into one or more target layers through a gated rule with context-dependent threshold. We describe the mechanism but do not use it in the CPU proof-of-concept.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Experimental results","content":"\u003cp\u003e\u003cstrong\u003e4.1 Setup\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDevice and environment.\u003c/strong\u003e All runs are executed on a single-CPU machine (no GPU). Default PyTorch CPU settings are used (no manual thread pinning), and both the baseline and ExNAS run in the same environment to ensure a fair comparison. 
Absolute times are device-dependent; results are intended to capture relative improvements.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eModel.\u003c/strong\u003e \u003cstrong\u003eSmallCNN\u003c/strong\u003e with topology \u003cstrong\u003eConv1 \u0026rarr; ReLU \u0026rarr; Conv2 \u0026rarr; ReLU \u0026rarr; AvgPool \u0026rarr; Flatten \u0026rarr; FC (10 classes)\u003c/strong\u003e. ExNAS applies neuron-granular gating in Conv2 (output-channel gating) and in FC (input-feature gating); Conv1 is left ungated.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData.\u003c/strong\u003e CIFAR-10 via torchvision when available; otherwise, a synthetic, balanced 10-class dataset (6000 train / 1000 test). The same test split is evaluated by both the baseline and ExNAS.\u003c/p\u003e\n\u003cp\u003eTraining and evaluation. The baseline is trained for one light epoch. ExNAS does not retrain; it performs a short warm-up (5 iterations) to populate the experiential memory and then evaluates on the test split. Evaluation uses the identical dataloader and batching in both conditions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBudgets and retrieval.\u003c/strong\u003e Budgets and retrieval. Unless stated otherwise, the per-layer budget is b_l = 0.12 and the global cap is B_g = 0.06. Memory query uses a small top-k set (e.g., k = 16), cosine similarity across available layer fingerprints, and simple recency down-weighting.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMetrics and timing.\u003c/strong\u003e The evaluation reports top-1 accuracy, wall-clock time for the entire test pass, and throughput computed as the ratio between the number of test samples and that wall-clock time. Timing is taken with a monotonic clock from immediately before the first batch is processed until the last batch completes, and includes the same dataloader and batch size for the baseline and for ExNAS. A short warm-up of five iterations precedes timing to populate caches and the experiential memory. 
The baseline is trained for one light epoch; ExNAS does not retrain. All runs are performed on CPU under the same software environment.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eActive fractions.\u003c/strong\u003e For each gated layer, the active fraction is the proportion of units left enabled by the mask during the forward pass. In the convolutional layer (Conv2) this corresponds to the fraction of output filters that remain active; in the fully connected layer (FC) it corresponds to the fraction of input features that contribute to the matrix\u0026ndash;vector product. Under the default budgets used in this work (per-layer budget b_l = 0.12 and global cap B_g = 0.06), and with two gated layers of similar size (Conv2 and FC), the effective active fraction per gated layer is approximately six percent, which matches the observed ranges (about 4.69\u0026ndash;10.94%)..\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMinimal checklist for reproducibility\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e-Same test split, same batch size, same preprocessing for baseline and ExNAS.\u003c/p\u003e\n\u003cp\u003e-Baseline trained 1 light epoch; ExNAS no retraining.\u003c/p\u003e\n\u003cp\u003e-Warm-up: 5 iterations before timing.\u003c/p\u003e\n\u003cp\u003e-Defaults: per-layer budget b_l = 0.12, global cap B_g = 0.06, retrieval top-k = 16.\u003c/p\u003e\n\u003cp\u003e-Gated layers: Conv2 (output-channel gating) and FC (input-feature gating).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMask application.\u003c/strong\u003e The mask is multiplicative and applied within the same forward pass: in convolution, non-selected output filters (channels) are suppressed; in the fully connected layer, non-selected input columns/features are suppressed accordingly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.2 Main Result (Fast Configuration)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWith run_public_experiment_v5.py (default 
settings):\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBaseline\u003c/strong\u003e \u0026rarr; acc = 0.201, time = 7.65 s, throughput = 654.0 samp/s\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExNAS (fast)\u003c/strong\u003e \u0026rarr; acc = 0.101, time = 7.37 s, throughput = 678.8 samp/s\u003cbr\u003e\u003cstrong\u003eActive fractions:\u003c/strong\u003e Conv2 \u0026asymp; 6.25%, FC \u0026asymp; 6.25%\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEffect.\u003c/strong\u003e Time reduction of 3.66% and throughput gain of 3.80% compared to the baseline, same network, no retraining.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.3 Budget sensitivity (Grid Search):\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWith autotune_v5.py (grid search): baseline acc = 0.198, time = 7.97 s, throughput = 627.6 samp/s.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBest time observed:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ebp = 0.12, bg = 0.06 \u0026rarr; acc = 0.099, 7.34 s, 681.2 samp/s; fractions \u0026asymp; 6.25%\u003c/p\u003e\n\u003cp\u003ebp = 0.10, bg = 0.05 \u0026rarr; acc = 0.102, 7.36 s, 678.9 samp/s; fractions \u0026asymp; 4.69%\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRelative to baseline:\u003c/strong\u003e up to \u0026minus;7.9% in time and +8.5% in throughput (best case), at the cost of lower accuracy in this small model and minimal-training regime. 
As expected, smaller budgets increase sparsity and reduce time, but may degrade accuracy in undertrained models.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.4 Negative case under heavier adaptation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWith run_public_experiment_v5_tuned.py (bp = 0.18, bg = 0.08, warm-up = 50, adaptation enabled during evaluation):\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBaseline\u003c/strong\u003e \u0026rarr; acc = 0.189, time = 7.92 s, throughput = 631.7 samp/s\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExNAS (tuned)\u003c/strong\u003e \u0026rarr; acc = 0.101, time = 8.04 s, throughput = 621.9 samp/s\u003cbr\u003e\u003cstrong\u003eActive fractions:\u003c/strong\u003e \u0026asymp; 9.38%\u003c/p\u003e\n\u003cp\u003eOn CPU, the selection/memory overhead can reverse the benefit of sparsity when adaptation is heavier (larger active fractions and more frequent updates), delineating a practical operating boundary.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.5 Summary Table\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"3\" cellpadding=\"0\" width=\"576\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eSetting\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;Time \u0026nbsp; (s)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eThroughput (samp/s)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eActive Fractions\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n 
\u003cp\u003e\u003cstrong\u003eFast (default)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eBaseline\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.201\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e7.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e654.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eExNAS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.101\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e7.37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e678.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eConv2 6.25%, FC 6.25%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eGrid best-time\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eBaseline\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.198\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e7.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e627.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eExNAS (0.12/0.06)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.099\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e7.34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e681.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eConv2 6.25%, FC 6.25%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eTuned 
(heavy)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eBaseline\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.189\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e7.92\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e631.7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026mdash;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eExNAS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.101\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e8.04\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e621.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eConv2 9.38%, FC 9.38%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eDefaults: b_l=0.12, B_g=0.06; top-k=16\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.6. Threats to validity and reproducibility notes:\u003c/strong\u003e\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003e\u003cstrong\u003eHardware sensitivity.\u0026nbsp;\u003c/strong\u003eWall-clock time and throughput depend on CPU model, BLAS, and OS scheduling. Results emphasize relative differences under identical conditions.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eMinimal training\u003c/strong\u003e. 
One light epoch on a small CNN intentionally stresses compute savings over absolute accuracy; deeper training may mitigate accuracy drops under aggressive budgets.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eMemory policy.\u0026nbsp;\u003c/strong\u003eA small top-k and simple recency factor are used for speed; richer policies (e.g., FAISS, more features) may change the accuracy\u0026ndash;efficiency frontier.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eSingle-shot reporting.\u0026nbsp;\u003c/strong\u003eUnless noted, numbers correspond to a single pass. Running multiple seeds and reporting mean\u0026plusmn;std is straightforward but out of scope here.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eData pipeline.\u0026nbsp;\u003c/strong\u003eTiming includes the identical dataloader path in both conditions; caching effects are controlled by the warm-up and by evaluating the full test split.\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"5. Theoretical considerations","content":"\u003cp\u003e\u003cstrong\u003eScope and notation (Word-friendly).\u003c/strong\u003e Let \u003cem\u003eG\u003c/em\u003e be the set of gated layers; each gated layer \u003cem\u003el\u003c/em\u003e has \u003cem\u003eN_l\u003c/em\u003e units and a per-layer budget \u003cem\u003eb_l\u003c/em\u003e ∈\u0026nbsp;(0,1]. A global cap \u003cem\u003eB_g\u003c/em\u003e ∈\u0026nbsp;(0,1] limits the average activation across all layers in \u003cem\u003eG\u003c/em\u003e. The effective active fraction in layer \u003cem\u003el\u003c/em\u003e is denoted \u003cem\u003ealpha_l\u003c/em\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEffective active fraction (with a single proportional rule).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIf the intended selections respect the global cap, then \u003cem\u003ealpha_l = b_l\u003c/em\u003e for all \u003cem\u003el\u0026nbsp;\u003c/em\u003e\u003cem\u003e∈ G\u003c/em\u003e. 
Otherwise, all masks are scaled by the same factor rho = (B_g * sum_over_G N_l) / (sum_over_G b_l * N_l), and alpha_l = rho * b_l.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eAnchor:\u003c/em\u003e with defaults b_l = 0.12, B_g = 0.06 and two similarly sized gated layers (Conv2, FC), rho ≈ 0.50 and alpha_l ≈ 0.06 (≈6%), matching the observed range (~4.7–10.9%) across settings.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWhere savings come from (first-order view).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eConvolutional layers:\u003c/em\u003e turning off output filters removes their convolutions entirely; compute scales roughly with \u003cem\u003ealpha_l\u003c/em\u003e (fraction of filters kept).\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eFully connected layers:\u003c/em\u003e turning off input features removes the corresponding multiply-adds; compute scales roughly with \u003cem\u003ealpha_l\u003c/em\u003e (fraction of features kept).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCost model (first order, verbal).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBaseline time: T_base = sum_over_all_layers C_l.\u003c/p\u003e\n\u003cp\u003eExNAS time: T_exnas ≈ sum_over_ungated C_l + sum_over_gated (alpha_l * C_l) + C_overhead.\u003cbr\u003eExNAS is faster when the \u003cstrong\u003esaved cost\u003c/strong\u003e sum_over_gated (1 − alpha_l) * C_l exceeds the \u003cstrong\u003eoverhead\u003c/strong\u003e C_overhead (memory query + scoring + mask application).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOverhead profile (why it stays small in the fast setting).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e-Retrieval: small top-k (e.g., k = 16) with short fingerprints → stable, low per-batch cost.\u003c/p\u003e\n\u003cp\u003e-Scoring: linear in the number of units in gated layers.\u003c/p\u003e\n\u003cp\u003e-Mask application: element-wise multiply (negligible).\u003c/p\u003e\n\u003cp\u003eThese choices explain the measured gains with low active 
fractions and also why heavier adaptation can cross the break-even point.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePractical thresholds (from the CPU runs).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGains were observed when: (i) \u003cem\u003ealpha_l\u003c/em\u003e stayed in ~5–11%, (ii) gating targeted cost-dominant layers (later conv and FC), and (iii) retrieval/selection remained lightweight (top-k ≤ 16, no in-loop memory updates). When \u003cem\u003ealpha_l\u003c/em\u003e grows toward ~15–20% and/or updates are frequent during evaluation, overhead can cancel the benefit (the negative case).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eReproducibility cues (how to compute and time).\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eActive fraction per layer:\u003c/em\u003e alpha_l = (1 / N_l) * sum_i mask_l[i].\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eTiming:\u003c/em\u003e wall-clock over the entire evaluation loop with the same dataloader and batching for baseline and ExNAS; warm-up 5 iterations to populate memory; one light epoch of training for the baseline; no retraining for ExNAS.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eDefaults used here:\u003c/em\u003e b_l = 0.12, B_g = 0.06, top-k = 16; gated layers: Conv2 (output-channel gating) and FC (input-feature gating).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBottom line.\u003c/strong\u003e With small active fractions and lightweight selection, the proportional thinning rule guarantees the global cap, and first-order scaling of conv/FC costs with \u003cem\u003ealpha_l\u003c/em\u003e makes the saved compute exceed overhead. This matches the CPU results (−3.7% to −7.9% time; +3.8% to +8.5% throughput) and clarifies why heavier adaptation removes the advantage.\u003c/p\u003e"},{"header":"6. 
Discussion","content":"\u003cp\u003eThe experiments support a practical claim: experience-guided, neuron-level transversal gating can deliver real time/throughput gains on a small CPU setup without retraining, provided that active fractions remain low and selection is lightweight. The observed accuracy trade-off is expected in a tiny-model/minimal-training regime; scaling to wider layers (e.g., attention heads and FFN channels) and training more thoroughly should improve the accuracy\u0026ndash;efficiency frontier.\u003c/p\u003e\u003cp\u003eThe negative case shows an important boundary: on CPU, heavier adaptation (larger active fractions, frequent updates during evaluation) can add enough overhead to offset sparsity benefits. Operationally, this suggests keeping consolidation offline, capping the top-k for retrieval, and using a stricter global budget when targeting CPU deployments.\u003c/p\u003e\u003cp\u003eFinally, the mechanism modulates internal representations (latent space) during the same inference under explicit budgets\u0026mdash;an operational lever that can be tuned to different hardware and latency constraints. Future work will include energy/MACs measurements with hardware counters, scaling to transformer architectures, ANN-based retrieval (e.g., FAISS), latency percentiles (P50/P95), robustness, and a systematic evaluation of the representation-enrichment variant.\u003c/p\u003e"},{"header":"7. Limitations and future work","content":"\u003cp\u003eThis work evaluates a compact CNN on a single-CPU setup with minimal training (1 epoch) and either CIFAR-10 or a synthetic balanced 10-class fallback. Reported metrics are wall-clock time and throughput; energy, MAC counts, and latency percentiles (P50/P95) were not instrumented and should be added in future iterations (e.g., via hardware performance counters and energy profiling). 
The representation-enrichment variant is specified but not evaluated.\u003c/p\u003e\u003cp\u003eFuture work includes: scaling to transformers (gating attention heads and FFN channels); replacing cosine lookup with ANN retrieval (e.g., FAISS); richer recency/memory policies; multi-seed runs with mean\u0026thinsp;\u0026plusmn;\u0026thinsp;std and confidence intervals; full energy/MAC instrumentation; and evaluation on edge devices under latency/energy constraints.\u003c/p\u003e"},{"header":"8. Conclusion","content":"\u003cp\u003eThis work indicates that experience-informed dynamic architectural adaptation is a solid path to improving inference in deep networks beyond static post-training deployment. Specifically, ExNAS records layer-wise fingerprints in an experiential memory and applies cross-layer selection at neuron granularity, across layers that are not necessarily consecutive and within the same inference, under per-layer and global budgets, thereby modulating the model\u0026rsquo;s internal representations (latent space) in real time.\u003c/p\u003e\u003cp\u003eIn a CPU proof-of-concept on a SmallCNN (2\u0026times;Conv\u0026thinsp;+\u0026thinsp;FC), ExNAS achieved 3.7\u0026ndash;7.9% reductions in wall-clock time and 3.8\u0026ndash;8.5% throughput gains with low active fractions (\u0026asymp;\u0026thinsp;4.7\u0026ndash;10.9%) and no retraining. A negative case was also observed in which heavier adaptation increased overhead enough to cancel the sparsity benefit, which bounds the practical operating regime and informs future designs.\u003c/p\u003e\u003cp\u003eThese findings substantiate that experience-guided, neuron-level selective gating is viable and useful for real-time inference. Future work includes measuring energy/MACs, scaling to transformers (attention heads and FFN channels), integrating ANN-based retrieval (e.g., FAISS), reporting latency percentiles (P50/P95) and robustness, and conducting a systematic study of the representation-enrichment component. 
These efforts should consolidate the principles explored here and extend their impact to larger-scale deployments.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e9. Code and data availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eReproducible scripts: exnas_auto_v5.py, run_public_experiment_v5.py, autotune_v5.py, plot_results_v5.py, make_report_v5.py. The pipeline uses CIFAR-10 via torchvision; if unavailable, it falls back to a synthetic 10-class dataset (6000 train / 1000 test) generated on the fly. Experiments were run on CPU only (no GPU). A public repository (with commit hash / tag and an archived DOI) will include:\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003erequirements.txt (e.g., Python 3.12, PyTorch 2.8.0, torchvision 0.23.0, numpy 2.3.2).\u003c/li\u003e\n \u003cli\u003eExact commands to reproduce:\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003epip install -r requirements.txt\u003c/p\u003e\n\u003cp\u003epython run_public_experiment_v5.py\u003c/p\u003e\n\u003cp\u003epython autotune_v5.py\u003c/p\u003e\n\u003cp\u003epython make_report_v5.py\u003c/p\u003e\n\u003col start=\"3\"\u003e\n \u003cli\u003eNotes for determinism (recommended): set seeds before running:\u003cbr\u003e\u0026nbsp;PYTHONHASHSEED=0, torch.manual_seed(123), numpy.random.seed(123).\u003c/li\u003e\n \u003cli\u003eHardware/OS info and the active-budget configuration used (b_l, B_g, top-k).\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThis artifact will be made publicly available upon submission/acceptance to facilitate independent verification.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e10. 
Conflict of interest statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe author declares no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e11.\u003c/strong\u003e A patent application covering the methods described here has been filed; a PCT filing is planned within the priority window.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eZoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eReal E, Moore S, Selle A, Saxena S, Suematsu YL, Tan J, Le QV, Kurakin A (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2902\u0026ndash;2911\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu H, Simonyan K, Yang Y (2018) DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, Dean J (2017) Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFedus W, Zoph B, Shazeer N (2022) Switch transformer: Scaling to trillion parameter models with simple and efficient sparsity. J Mach Learn Res 23(120):1\u0026ndash;39\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGraves A, Wayne G, Danihelka I (2014) Neural turing machines. arXiv preprint arXiv:1410.5401\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSantoro A, Bartunov S, Botvinick M, Wierstra D, Lillicrap T (2016) Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 
1842\u0026ndash;1850\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTeerapittayanon S, McDanel B, Kung HT (2016) BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464\u0026ndash;2469\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGraves A (2016) Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł (2018) Universal transformers. arXiv preprint arXiv:1807.03819\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBanino A, Balaguer J, Blundell C (2021) PonderNet: Learning to ponder. arXiv preprint arXiv:2107.05407\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Independent Research - No institutional sponsorship","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"experiential neural memory, cross-layer transversal selection, dynamic inference, neuron-level gating, conditional computation","lastPublishedDoi":"10.21203/rs.3.rs-7378044/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7378044/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eNeural networks suffer from operational amnesia: they process each input as if it were the first time, without remembering which neuron combinations proved effective in similar contexts. We introduce \u003cstrong\u003eExNAS\u003c/strong\u003e (\u003cem\u003eExperiential Neural Architecture Selection\u003c/em\u003e),\u003cstrong\u003e \u003c/strong\u003ea system that performs real-time, neuron-granular architectural adaptation during the same inference by leveraging a distributed experiential memory. ExNAS records layer-wise neural fingerprints and lightweight contextual metadata and then performs transversal selection across non-consecutive layers under explicit per-layer and global budgets.\u003c/p\u003e\n\u003cp\u003eOn a CPU proof-of-concept using a small CNN (2×Conv+FC), ExNAS delivers measurable time reductions (≈3.7–7.9%) and throughput gains (≈3.8–8.5%) at low active fractions (≈4.7–10.9%), without retraining. We detail the design, provide formal definitions, and discuss sensitivity to budgets and a negative case where heavier adaptation adds overhead. 
These results substantiate experience-guided, neuron-level conditional computation as a practical tool for real-time inference.\u003c/p\u003e","manuscriptTitle":"Experiential neural architecture selection: dynamic cross-layer memory for real-time inference optimization","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-18 14:17:31","doi":"10.21203/rs.3.rs-7378044/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"9c43da7e-3e5e-4257-a71e-3f45c1b62033","owner":[],"postedDate":"August 18th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":53203742,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-08-18T14:17:31+00:00","versionOfRecord":[],"versionCreatedAt":"2025-08-18 14:17:31","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7378044","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7378044","identity":"rs-7378044","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}