Protein expression derived from scRNA-seq reveals lupus-induced acceleration of immune aging

doi:10.21203/rs.3.rs-9418867/v1

2026 · doi:10.21203/rs.3.rs-9418867/v1

preprint OA: closed

Ho Yeung Leung, Peng Qiu

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-9418867/v1

This work is licensed under a CC BY 4.0 License

Status: Under Review

Version 1 posted 7

Abstract

Single-cell RNA sequencing (scRNA-seq) is widely established and excels at capturing transcriptional heterogeneity and signaling states. However, it lacks direct quantification of surface proteins, which are key determinants of immune cell identity and function. To enable protein-based characterization of immunophenotypic diversity in scRNA-seq data, we developed scEN, which employs a regularized Elastic Net regression model to predict protein expression from gene expression.
We trained scEN on Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) data containing paired surface-protein and transcriptomic measurements from bone marrow. When applied to scRNA-seq data from peripheral blood of healthy donors, scEN generated robust protein predictions aligned with known immunophenotypes at single-cell resolution, and the predicted protein expression enabled cytometry-style manual gating to resolve immune-cell subsets associated with physiological immune aging. When applied to scRNA-seq data from lupus patients, the predicted protein expression captured shifts among immune-cell subsets, revealing lupus-induced acceleration of immune aging.

Health sciences/Biomarkers, Biological sciences/Computational biology and bioinformatics, Biological sciences/Immunology

Introduction

Single-cell RNA sequencing (scRNA-seq) has become central to systems biology. Over the past decade, scRNA-seq has enabled researchers to dissect transcriptional states at single-cell resolution, revealing how cells communicate and how gene regulation drives cellular heterogeneity [1, 2]. Despite these advances, scRNA-seq captures only messenger RNA (mRNA), which presents a limitation: it cannot measure the surface proteins that further define immune lineage, signaling status, and functional state [3]. Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) partially addresses this limitation by jointly profiling RNA and surface proteins in the same cells using oligonucleotide-labeled antibodies [4, 5]. By providing multimodal measurements, CITE-seq enables direct immunophenotyping and more comprehensive characterization of immune states [4, 6, 7], offering proteomic insights that are inaccessible from transcriptomes alone. However, the adoption of CITE-seq remains constrained by cost, specialized reagents, and reduced throughput compared to conventional scRNA-seq [8, 9, 10].
Consequently, although single-cell profiling has expanded across diverse tissues and disease contexts, most available datasets contain only gene expression profiles [2, 11]. A computational strategy capable of inferring surface proteins directly from transcriptomes would therefore enable more advanced immunophenotyping and support proteomic analyses in RNA-only datasets.

Several machine learning approaches integrate paired RNA–protein measurements to predict protein abundance from transcriptomes. These frameworks employ diverse architectures, ranging from joint latent-variable models (TotalVI) to biologically informed deep learning (sciPENN) and transfer learning (cTP-net), as well as unregularized linear regression (scLinear). While these approaches have demonstrated strengths, including scalability, relatively fast training compared to classical statistical methods, and the ability to impute proteins across diverse contexts, they also face several challenges [8, 9, 12, 13]. First, deep learning models are more prone to overfitting, as their large parameter space allows them to memorize training data rather than learn generalizable patterns. Models trained on high-dimensional features may fit noise rather than the true biological signal, increasing the risk of overfitting [14]. In addition, unregularized linear methods struggle to capture complex gene–protein relationships, whereas transfer-learning models depend strongly on the composition of reference datasets and may fail to generalize when biological contexts differ substantially from the pretraining datasets [8, 12, 13]. Consequently, the predictive performance of a model trained on one dataset may degrade when it is applied to unseen datasets, a problem compounded by batch effects and other technical sources of variation [15, 16].
These limitations highlight the need for models that are accurate, interpretable, and robust across heterogeneous datasets. To address these challenges, we introduce scEN, an Elastic Net–based machine learning framework that predicts single-cell surface protein expression from RNA profiles. Unlike deep learning–based approaches, scEN balances accuracy with generalizability, offering improved robustness and interpretable gene–protein associations. Using multiple benchmark PBMC CITE-seq and scRNA-seq datasets, we demonstrate that scEN trained on the bone marrow mononuclear cell (BMMC) CITE-seq dataset provides robust in silico protein predictions, particularly for immune cells derived from PBMC RNA-only data. Importantly, the predicted proteins further enable manual gating strategies that recapitulate conventional flow cytometry workflows, thereby bridging computational predictions with established laboratory practices. Building on this foundation, we developed an scEN-based immune-age prediction pipeline that stratifies lupus and healthy donors, revealing biologically meaningful shifts in lupus-induced immune composition associated with accelerated immune-system aging trajectories [17]. The overall model architecture and downstream analysis workflow are illustrated in Fig. 1.

In summary, scEN addresses a major limitation in single-cell biology, the absence of protein-level information in most scRNA-seq datasets, by computationally recovering surface protein expression directly from transcriptomes. This enables scalable immunophenotyping in RNA-only datasets, opening opportunities for downstream clinical and translational applications where multimodal profiling remains limited.
Results

Performance of scEN compared to benchmarking models on the CITE-seq datasets

We first evaluated scEN against state-of-the-art scRNA-seq protein prediction models using the BMMC dataset (NeurIPS 2021 multimodal challenge; 90,261 bone marrow mononuclear cells with paired RNA–protein profiles) [18]. All models were trained on 80% of this dataset and tested on the remaining 20% (Fig. 2a,b). Performance was uniformly high across all methods. The deep learning model TotalVI achieved the strongest correlation (mean r = 0.767), closely followed by cTP-net and sciPENN (r = 0.749 and 0.713), with the other models reaching correlations of 0.708 and 0.705. Per-protein bar plots revealed that highly expressed lineage markers such as CD3, CD19, and CD8 were predicted with high fidelity across all approaches. In contrast, lower-abundance or donor-specific proteins showed greater variability [4, 19]. Together, these results indicate that all models achieve comparable accuracy on large multimodal datasets when training and testing data come from the same dataset.

We next compared performance on the juvenile dermatomyositis (JDM) PBMC dataset (105,827 cells), which includes matched healthy and disease samples and exhibits greater variability than the BMMC dataset [20]. The JDM dataset was analyzed using the same within-dataset splitting strategy as the BMMC benchmark (Fig. 2c,d). Here, correlations declined for all methods (r = 0.461–0.569), consistent with increased heterogeneity in this dataset [15]. Despite these challenges, scEN maintained performance comparable to, and in some cases even exceeding, that of deep learning models, underscoring its robustness in biologically diverse datasets. Per-protein correlation profiles remained consistent across markers, with T- and B-cell lineage proteins (CD8, CD19) retaining relatively high predictability, while some markers (e.g., TIGIT, CD25) showed larger variance.
Next, we compared these algorithms in a more challenging setting using the 10x Genomics PBMC10K CITE-seq dataset from a healthy donor, restricted to major immune cell types (7,688 peripheral blood mononuclear cells with paired RNA–protein data). All models were trained exclusively on BMMC and applied directly to PBMC10K, which differs substantially in tissue origin (bone marrow versus peripheral blood), donor population, and acquisition occasion. We found that scEN outperformed all competing methods, achieving a mean correlation of r = 0.798 (Fig. 2e,f), compared with 0.782 for cTP-net, 0.739 for scLinear, 0.726 for sciPENN, and 0.669 for TotalVI. Notably, scEN showed superior performance for lineage-defining markers such as CD3, CD4, CD8, CD14, and CD19, indicating improved cross-dataset consistency relative to deep learning approaches that rely more heavily on dataset-specific feature representations [8, 9, 12, 13].

Across all three benchmarking comparisons (Fig. 2b,d,f), lineage-defining markers such as CD3, CD4, CD8, CD19, and CD14 exhibited consistently high predictability. Meanwhile, activation-dependent markers (e.g., CD25 and TIGIT) displayed lower correlations, consistent with previously reported context-dependent RNA–protein coupling. These markers are regulated by immune activation and therefore exhibit variable RNA–protein coupling across immune states, reducing transcriptomic predictability regardless of model [21–24]. These trends were observed across all competing methods, suggesting that the differences in per-marker predictability predominantly reflect a biological phenomenon rather than model-specific effects. This consistency reinforces that scEN’s performance is not driven by marker-specific overfitting but by its ability to robustly recover the underlying gene–protein relationships across datasets.
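The benchmark summaries above report a Pearson correlation per protein, computed across cells, and then a mean over proteins. A minimal sketch of that bookkeeping follows; the function names and the toy values are illustrative assumptions and are not taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)

def per_protein_correlation(observed, predicted):
    """Map each protein name to its Pearson r computed across cells."""
    return {p: pearson(observed[p], predicted[p]) for p in observed}

def mean_correlation(per_protein):
    """Average r over proteins, skipping undefined values."""
    values = [r for r in per_protein.values() if not math.isnan(r)]
    return sum(values) / len(values)

# Toy values for illustration only (not data from the study).
observed = {"CD3": [0.1, 1.9, 2.1, 0.2], "CD19": [1.5, 0.1, 0.0, 1.4]}
predicted = {"CD3": [0.3, 1.7, 2.0, 0.1], "CD19": [1.2, 0.2, 0.1, 1.6]}

scores = per_protein_correlation(observed, predicted)
summary = mean_correlation(scores)
```

Reporting both `scores` and `summary` mirrors the per-protein bar plots and the mean r values quoted in the text.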
Moreover, we confirmed that scEN consistently captures both lineage-defining markers (e.g., CD3, CD4, CD8) (Fig. 2b,d,f) and some lower-abundance proteins (e.g., CD127) (Fig. 2f), while outperforming the other models in the cross-dataset setting (Fig. 2e,f). These results indicate that while deep learning models excel in dataset-specific tuning, they often fail to generalize across datasets. In contrast, scEN’s regularization can handle such domain shifts. In summary, scEN achieves a balance between accuracy and robustness, yielding models that are not only interpretable in terms of specific gene–protein relationships but also robust to batch effects and technical variability.

scEN prediction on the validation scRNA-seq datasets enables downstream manual gating strategies

To evaluate scEN’s ability to generalize beyond its CITE-seq training data, we applied the model to two independent PBMC scRNA-seq datasets, Lee (10x Genomics 3′ v3, 43,512 cells) and Arh (10x Genomics 3′ v3, 49,139 cells) [25, 26], which contain only transcriptomic measurements without paired protein data. The original authors annotated cell types in both datasets using Seurat. These datasets differ substantially from the CITE-seq training data in donor origin, sequencing depth, and chemistry, posing a significant challenge for accurate protein prediction. Since neither dataset includes surface-protein measurements, this evaluation tests whether scEN can reconstruct protein-level immunophenotypes purely from RNA profiles in entirely unseen biological contexts.

Across both datasets, scEN-predicted CD3 and CD19 protein expression recapitulated the lymphoid partitioning that defines T- and B-cell identity (Fig. 3a,b,i,j). Cells annotated as T cells uniformly had high predicted CD3 and low CD19, whereas predicted CD19 was highest in B-cell populations and remained low across all other cell types.
The resulting separation between CD3+CD19- T cells and CD19+CD3- B cells (+: high; -: low or negative; int: intermediate), further supported by box-plot distributions, demonstrates that scEN captures protein-expression patterns consistent with CITE-seq behavior in silico.

Within the T-cell compartment, scEN reconstructed mutually exclusive CD4 and CD8 expression patterns corresponding to helper and cytotoxic T-cell lineages (Fig. 3c,d,k,l). Predicted intensities formed two clearly separated lobes representing the CD4+CD8- and CD8+CD4- populations. Consistent with transcriptome-based annotations, CD4 T cells exhibited the highest predicted CD4 expression, while CD8 T cells showed elevated CD8 expression. Other cell types displayed substantially lower predicted values, confirming lineage-specific signal recovery despite differences in absolute intensity ranges between datasets.

scEN further distinguished innate immune populations through predicted CD56 and CD14 expression (Fig. 3e,f,m,n). NK cells consistently localized to a CD56+CD14- region, whereas monocytes occupied a CD14+CD56- space, accurately reproducing canonical immunophenotypic boundaries used in flow and mass cytometry. Predicted expression levels aligned with established biology: NK cells exhibited the highest CD56 expression, monocytes exhibited strong CD14 expression, and all other cell types remained low for both markers. These patterns were consistent across datasets.

Within the myeloid compartment, predicted CD14 and CD16 expression captured the classical-to-nonclassical monocyte continuum [27] (Fig. 3g,h,o,p). In the Lee dataset, scEN reproduced two well-defined clusters corresponding to classical (CD14+CD16-) and non-classical (CD14-CD16+) monocytes, with a smooth intermediate gradient.
In contrast, the Arh dataset exhibited lower overall CD16 expression and a broader distribution along the CD16 axis, likely reflecting dataset-specific biology such as donor composition and the relative abundance of intermediate monocytes rather than prediction noise. Despite differences in absolute signal magnitude, the relative ordering of monocyte subtypes was preserved across datasets, indicating that scEN infers stable immunophenotypic structure under heterogeneous conditions.

Together, these results demonstrate that scEN generalizes effectively to independent PBMC scRNA-seq datasets lacking protein measurements, enabling reconstruction of key immunophenotypic axes from RNA-only data and supporting downstream analyses such as manual gating, lineage stratification, and cell-type frequency estimation.

Clinical application of scEN on healthy and Systemic Lupus Erythematosus (SLE) scRNA-seq PBMC dataset

We first demonstrate that the predicted protein data support manual gating to define immune cell types. We began by establishing a broad lineage partition using CD3 and CD19 expression, two mutually exclusive markers (Fig. 4a). Cells with high CD3 and negligible CD19 were classified as T cells (CD3+CD19-), whereas those with high CD19 and low CD3 were identified as B cells (CD3-CD19+). This first gate provided a clean lymphoid separation, with minimal contamination between populations and a clear bimodal distribution on the CD3 and CD19 scatter plot. Within the T-cell compartment, we then gated CD4+ and CD8+ subsets based on their reciprocal expression patterns (Fig. 4b). CD4+ T cells exhibited a strong CD4 and low CD8 signal, while CD8+ T cells showed the opposite expression profile, mirroring the canonical helper and cytotoxic phenotypes. To capture T-cell differentiation and memory states within the CD4+ and CD8+ compartments, we incorporated CD45RA and CD27 expression to define naïve and memory subsets (Fig. 4c).
Density-based gating along the CD45RA and CD27 axes revealed clear separation of maturation states consistent with established CD4+ T-cell biology. Within differentiated CD4+ populations, we further overlaid CD85j and CD57 expression to resolve functional heterogeneity (Fig. 4d). Cells co-expressing CD57 and CD85j were annotated as highly differentiated effector phenotypes, while CD57-CD85j- subsets represented less differentiated cells. For CD8+ T cells, we used CD45RA and CD27 expression to define canonical differentiation states (Fig. 4e), separating naïve, central memory, effector memory, and terminal effector populations. Within the CD8+CD45RA+CD27- (TEMRA) compartment, we further resolved later-stage differentiation using CD85j and CD57 expression (Fig. 4f), identifying highly differentiated effector subsets. This gating strategy demonstrates that scEN-predicted markers can reproduce late-stage differentiation.

Within the B-cell compartment, we first examined IgD and IgM expression to delineate naïve and transitional-like populations (Fig. 4g). To define classical memory subsets further, we used IgD and CD27 expression (Fig. 4h) to identify naïve B cells (IgD+CD27-), unswitched memory (IgD+CD27+), switched memory (IgD-CD27+), and double-negative memory (IgD-CD27-) B cells. To refine transitional-like populations within the IgD+CD27- gate, IgM intensity served as a secondary axis, identifying a small fraction of transitional B cells characterized by high IgM expression. These predicted distributions closely mirrored expected frequencies in healthy PBMCs.

Among CD3-CD19- cells, scEN-predicted CD56 expression delineated NK cells (CD56+) from myeloid lineages (Fig. 4i). Within the CD56- compartment, we further separated monocytes using CD14 and CD16 markers (Fig. 4j).
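The hierarchical gates described in this section can be pictured as nested threshold tests on predicted marker values. The sketch below covers only the CD3/CD19 lineage gate and the IgD/CD27 B-cell quadrants. It assumes simple rectangular cutoffs, and the cutoff values, function names, and toy cells are hypothetical; the gates actually used in the study are those drawn in Fig. 4.

```python
def gate_lineage(cd3, cd19, cd3_cut, cd19_cut):
    """First gate: split cells on predicted CD3 and CD19."""
    cd3_pos = cd3 >= cd3_cut
    cd19_pos = cd19 >= cd19_cut
    if cd3_pos and not cd19_pos:
        return "T cell"         # CD3+CD19-
    if cd19_pos and not cd3_pos:
        return "B cell"         # CD3-CD19+
    if not cd3_pos and not cd19_pos:
        return "non-T/non-B"    # CD3-CD19-, passed on to the CD56 gate
    return "ungated"            # CD3+CD19+ events fall outside both gates

def gate_b_subset(igd, cd27, igd_cut, cd27_cut):
    """Second gate for B cells: IgD versus CD27 quadrants."""
    igd_pos = igd >= igd_cut
    cd27_pos = cd27 >= cd27_cut
    if igd_pos and not cd27_pos:
        return "naive B"                    # IgD+CD27-
    if igd_pos and cd27_pos:
        return "unswitched memory B"        # IgD+CD27+
    if cd27_pos:
        return "switched memory B"          # IgD-CD27+
    return "double-negative memory B"       # IgD-CD27-

# Hypothetical cutoffs on the predicted-expression scale.
CUTS = {"CD3": 1.0, "CD19": 1.0, "IgD": 0.8, "CD27": 0.8}

# Toy cells for illustration only.
cells = [
    {"CD3": 1.8, "CD19": 0.1, "IgD": 0.0, "CD27": 1.2},
    {"CD3": 0.2, "CD19": 1.6, "IgD": 1.4, "CD27": 0.1},
    {"CD3": 0.1, "CD19": 1.5, "IgD": 0.2, "CD27": 1.3},
    {"CD3": 0.1, "CD19": 0.2, "IgD": 0.0, "CD27": 0.0},
]

labels = []
for cell in cells:
    label = gate_lineage(cell["CD3"], cell["CD19"], CUTS["CD3"], CUTS["CD19"])
    if label == "B cell":
        label = gate_b_subset(cell["IgD"], cell["CD27"], CUTS["IgD"], CUTS["CD27"])
    labels.append(label)
```

The same pattern extends to the other gates (CD4/CD8, CD45RA/CD27, CD85j/CD57, CD56, CD14/CD16) by adding one function per scatter plot.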
Classical CD14+CD16- and non-classical CD14-CD16+ subsets appeared as distinct arms on the CD14 and CD16 scatter plot, while a smaller CD14int CD16int population represented intermediate monocytes, which were not included in further analyses. All gating thresholds and decision boundaries are shown in Fig. 4. Overall, 19 cell types were defined according to the above gating strategies. Notably, the distribution of predicted marker intensities followed unimodal or bimodal patterns consistent with empirical flow cytometry data, reinforcing the accuracy of scEN’s in silico reconstruction. This gating pipeline recapitulates the standard manual workflow used in immunophenotyping and provides a scalable framework for analyzing over one million cells without direct protein measurement [28, 29].

scEN-derived age prediction on the healthy human scRNA-seq dataset reveals age and immune cell type correlation

To demonstrate that scEN-inferred protein data are predictive of aging, we trained an Elastic Net regression model that predicts age from the cell-type composition derived from scEN-inferred protein expression. Each individual’s cell-type composition, defined as the proportions of the 19 manually gated cell types, served as the input to the regression model. The prediction target was chronological age, which serves as a standard metric of normal human aging. The scatter plot depicts the relationship between scEN-derived age prediction and chronological age across all donors (Fig. 5a). It reveals a linear correlation spanning the entire adult lifespan, indicating that the model performs equally well in younger and older individuals. The model achieved a mean absolute error (MAE) of 9.76 years, which is comparable to published single-cell mass cytometry and scRNA-seq immune aging prediction models that were also trained on healthy human PBMC samples with defined major immune cell types to predict chronological age [5, 30].
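The two quantities used here, a donor's cell-type composition vector and the MAE of the age predictor, are simple to compute. The following sketch assumes that proportions are taken over all labelled cells of a donor (the denominator used in the study is not stated); the function names and toy values are illustrative.

```python
from collections import Counter

def composition(labels, cell_types):
    """Proportion of each cell type among one donor's gated cells.

    `labels` holds one gate label per cell; `cell_types` fixes the
    feature order so every donor yields a vector of the same length.
    """
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in cell_types]

def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, across donors."""
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)

# Toy donor with four cells and a four-type feature space (illustrative).
features = composition(
    ["CD8 naive", "CD8 naive", "NK", "B naive"],
    ["CD8 naive", "NK", "B naive", "CD16 monocyte"],
)

# Toy ages for three donors (illustrative, not study data).
mae = mean_absolute_error([30, 50, 70], [40, 45, 70])
```

In the study's setting the feature vector would have 19 entries, one per gated cell type, and one row per donor.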
This level of accuracy highlights the ability of scEN-predicted protein profiles to capture major axes of immune-system remodeling, even without explicit molecular aging markers.

We visualized the learned non-zero regression coefficients, which define the weighted contribution of each immune subset to the immune-age estimate (Fig. 5b). The presence of both positive and negative coefficients reflects the bidirectional and compensatory nature of immune cell remodeling during aging [31]. Among all features, CD8+ naïve T cells exhibited the strongest negative coefficient, consistent with extensive immunological literature documenting their progressive depletion as a result of thymic involution and homeostatic proliferation [32, 33]. This decline is one of the most recognized biomarkers of immune senescence and is tightly associated with impaired responsiveness to infections and vaccines. In contrast, CD85j+ effector CD8+ T cells, NK cells, CD4+ central memory (TCM) cells, and CD16+ monocytes showed positive coefficients, reflecting their expansion with aging and chronic immune activation. These shifts are characteristic of classical immunosenescence, in which repeated antigen exposure and persistent viral stimulation drive the accumulation of differentiated, cytotoxic, and innate-like immune cells [34–37]. The relative magnitude and directionality of these coefficients recapitulate well-established aging hierarchies, strengthening biological confidence in the model’s interpretability and demonstrating that scEN-derived features encode meaningful immunological structure.

We highlight one representative relationship, the age-associated contraction of the CD8+ naïve T-cell proportion (Fig. 5c). The smooth and monotonic decline across donors, without abrupt jumps or platform-specific outliers, demonstrates that the scEN cell type-derived age-predicting model captures continuous biological variation rather than noisy technical confounders.
This pattern closely mirrors classical flow cytometry studies, where CD8+ naïve T cells are among the most sensitive indicators of biological aging [38]. Similar monotonic trajectories were observed for additional top-weighted features, including increased frequencies of NK cells, CD16+ monocytes, and CD85j+ effector CD8+ T cells in older individuals [39–41]. These findings collectively indicate that the scEN cell type-derived age-predicting model faithfully reconstructs well-established immunological aging processes using only RNA-derived features.

Together, these results position the scEN-based predictor as a biologically grounded and generalizable model of healthy-donor PBMC immune aging built on scEN-inferred protein expression. By integrating interpretable coefficients with compositionally derived signatures of immune remodeling, scEN provides a continuous metric of immune-system aging suitable for large-scale population studies, longitudinal monitoring, and disease-specific comparisons. This framework establishes the foundation for downstream analyses, including the application of the scEN cell type-derived age-predicting model to immune dysregulation in systemic lupus erythematosus (SLE), as presented in the next section.

Disease-associated immune age acceleration in Systemic Lupus Erythematosus (SLE) patients

To demonstrate that the scEN-based age predictor is applicable and sensitive to disease-specific immune alterations, we applied the trained age prediction model to the SLE PBMC scRNA-seq dataset [42]. We overlaid age predictions from SLE patients with those from healthy individuals (Fig. 6). Both SLE and healthy PBMC datasets exhibited a strong linear relationship between predicted age and chronological age. Notably, SLE patients displayed a consistent 10-year upward parallel shift in predicted age relative to healthy individuals.
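One way to put a number on such a parallel shift is to fit a straight line to the healthy donors (predicted age against chronological age) and then average how far the SLE donors sit above that line. The sketch below does this with ordinary least squares. It is an assumed procedure for illustration, not necessarily the one used in the study, and the toy ages were constructed to produce a 10-year offset.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_offset(x, y, slope, intercept):
    """Mean vertical distance of the points (x, y) above a reference line."""
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    return sum(residuals) / len(residuals)

# Toy data, constructed for illustration (not values from the study).
healthy_age = [20, 40, 60, 80]
healthy_pred = [25, 41, 57, 73]
sle_age = [30, 50, 70]
sle_pred = [43, 59, 75]

slope, intercept = fit_line(healthy_age, healthy_pred)
acceleration = mean_offset(sle_age, sle_pred, slope, intercept)
```

A positive `acceleration` means the disease group is predicted older than healthy donors of the same chronological age; a group-specific slope could be fitted as well if the shift were not parallel.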
These results indicate that the model is sensitive to individual-level changes in immune composition and that SLE-associated immune remodeling leads to a measurable acceleration of predicted age. This effect is evident in the scatter distribution, where SLE-derived predicted ages cluster above the healthy trend line, reflecting older predicted immune ages even among younger patients. This pattern mirrors well-established immunological features of SLE, including contraction of naïve T-cell compartments, expansion of CD16+ monocytes, accumulation of cytotoxic CD8+ effector subsets, and increased NK-cell activation, cellular features that correspond to positive coefficients in the scEN-based age prediction model [43–45]. Collectively, these observations indicate that the model reconstructs a canonical immune aging phenotype of SLE directly from transcriptome-derived protein estimates, supporting the concept that chronic autoimmune inflammation induces a premature shift toward an aged immune architecture.

Together, these results establish that SLE patients exhibit pronounced immune age acceleration relative to healthy individuals and demonstrate that the scEN cell-type–derived age prediction framework provides a quantitative and biologically interpretable approach for detecting immune dysregulation caused by autoimmune diseases such as SLE. More broadly, this analysis highlights how scEN-derived protein predictions bridge transcriptomic variation and clinically relevant immune metrics, positioning the scEN age predictor as a potential tool for immune monitoring in autoimmune disease, chronic infection, cancer immunotherapy, and large-scale immune health studies [46].

Discussion

We developed scEN, an interpretable framework that generates in silico CITE-seq–like protein predictions for transcript-only scRNA-seq datasets.
We demonstrated that scEN achieves predictive performance comparable to, and in some cases exceeding, state-of-the-art deep learning approaches on unseen CITE-seq validation datasets. Importantly, predicted protein expression recapitulated key lineage-defining markers, enabling downstream immunophenotyping through manual gating and accurate recovery of major immune cell subsets. These gated populations were further used to construct an scEN-based age prediction pipeline using individual-level cell-type proportions. The resulting age predictor stratified systemic lupus erythematosus (SLE) patients from healthy controls and revealed an approximate ten-year predicted age acceleration in SLE.

Unlike deep-learning frameworks such as TotalVI and sciPENN, scEN achieves cross-dataset stability through Elastic Net regression, which combines L1 (lasso) and L2 (ridge) regularization penalties [8, 9, 47]. In high-dimensional settings such as scRNA-seq, where the number of genes far exceeds the number of cells and extensive co-expression exists, ordinary regression models are prone to overfitting [48, 49]. L1 regularization introduces sparsity by shrinking many coefficients to zero, thereby performing automatic feature selection and emphasizing informative predictors [50, 51]. L2 regularization stabilizes correlated coefficients in the presence of collinearity and donor-specific noise [50, 51]. By integrating both penalties, Elastic Net achieves a balance between sparsity and stability, producing models that are interpretable and robust to batch effects and technical variability [47]. This dual regularization likely explains why scEN generalizes more consistently than flexible deep-learning architectures, which, despite their ability to model nonlinear relationships, may exhibit unstable transfer when applied to unseen datasets [8, 9, 12, 14, 16, 40].
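For reference, the combined penalty can be written out explicitly. The form below is the objective documented for scikit-learn's `ElasticNet` estimator, which the Methods name as the implementation; the symbols n (number of cells), X (gene-expression matrix), y (one normalized protein), β (gene coefficients) and ρ (the `l1_ratio` hyperparameter) are notation introduced here for illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert \beta \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert \beta \rVert_2^2
```

Setting ρ = 1 recovers the lasso and ρ = 0 recovers ridge regression, so the two hyperparameters α and ρ control the overall penalty strength and the balance between sparsity and coefficient stabilization, respectively.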
In this context, scEN’s relative simplicity becomes advantageous: by constraining the model to reproducible and biologically meaningful gene–protein associations, it maintains strong predictive performance without sacrificing interpretability.

Despite these strengths, scEN has several limitations. First, the model can only predict proteins represented in the training CITE-seq dataset; proteins lacking paired RNA–protein measurements during training cannot be predicted. Second, proteins with weak or indirect transcript–protein coupling remain challenging to predict accurately, even after filtering out lowly expressed genes during preprocessing [4, 19, 34]. As shown, prediction performance varies between proteins, and prediction accuracy for the same protein tends to rise or fall in the same manner across different models. This phenomenon potentially reflects biological constraints rather than computational limitations, representing a challenge shared by all RNA-to-protein prediction frameworks [8, 21].

Future work will focus on expanding the existing training pipeline across diverse tissues and disease contexts, improving inference for rare or previously unseen proteins, and integrating multimodal information to enhance predictive capacity. Beyond methodological refinement, potential applications include immune-age assessment in autoimmune disease, immune profiling in cancer, and retrospective analysis of scRNA-seq datasets lacking paired protein measurements. Together, scEN and the associated age-prediction framework provide an accessible, interpretable, and scalable strategy for extending transcriptomic datasets toward clinically meaningful, proteomic-level insight [51, 2, 4, 5].

Methods

Gene and protein feature harmonization

Genes with zero counts across all cells were excluded prior to normalization.
Highly variable genes (HVGs) were identified from the BMMC CITE-seq training dataset using Scanpy’s implementation of the Seurat v3 method (flavor = "seurat_v3", n_top_genes = 4000), which ranks genes by standardized variance to prioritize biologically informative features while reducing noise from lowly expressed genes. Protein target information was not used in this step, to avoid potential information leakage. We selected the top 4,000 HVGs because this range captures most biological variation in CITE-seq datasets while limiting overfitting from high-dimensional noise. HVG selection was performed using only the BMMC dataset, prior to any train–test splitting or cross-dataset evaluation. The intersection of these HVGs with each validation dataset defined the final gene feature space. The same intersected gene set was used consistently across all evaluated benchmarking models. No batch effect correction or donor regression was applied, in order to preserve raw gene–protein relationships and to avoid potential information leakage across datasets. This decision was made to evaluate model robustness under raw cross-dataset domain shift without harmonization artifacts and to assess generalization to unseen datasets. Genes with zero variance in the training dataset were removed prior to model fitting. Protein markers absent from downstream evaluation datasets were excluded from model training and benchmarking to avoid undefined prediction targets during cross-dataset evaluation. All reported performance metrics reflect only protein markers present in both the training and evaluation datasets.

Cell inclusion and filtering

All cells from the BMMC and JDM PBMC datasets were included without additional filtering unless preprocessing was performed automatically by the benchmarking pipeline (e.g., sciPENN or scLinear), which applies its own quality control filtering.
For PBMC10K, only cells annotated as major immune populations (B cells, T cells, NK cells, monocytes, dendritic cells) by the original authors were retained. No additional harmonization was imposed across models. Protein prediction models were evaluated at the cell level, whereas age prediction models were evaluated at the donor level. Protein normalization Protein expression values were normalized according to each method’s published implementation. For scEN, ADT counts were normalized per cell by dividing each protein count by the total ADT count of the corresponding cell and scaling to a fixed total of 100, which ensures comparable dynamic range across cells while preserving relative protein abundance. The normalized values were subsequently transformed as: $$\mathrm{ADT}_{\mathrm{norm}} = \log_{10}(\text{normalized ADT} + 1)$$ For scLinear and cTP-net, protein values were normalized using the centered log-ratio (CLR) transformation as recommended in their original implementations. For TotalVI and sciPENN, raw RNA and ADT count matrices were used as required by their respective modeling frameworks. For sciPENN, the pipeline further log-normalizes both gene expression and ADT counts. For each model, observed protein values were transformed using the identical normalization strategy applied during that model’s training prior to performance evaluation. All benchmarking models were executed using their published preprocessing pipelines and recommended normalization strategies to ensure fair comparison. scEN model training scEN models were trained using Elastic Net regression implemented in scikit-learn. Gene expression values, library-size normalized to 1e4 counts per cell and log-transformed using scanpy, served as input features. Normalized protein expression values were used as prediction targets.
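The input and target transformations described above can be sketched in a few lines. This is a minimal illustration written with plain NumPy rather than scanpy, so that it is self-contained; the function names `normalize_adt` and `normalize_rna` are hypothetical and are not taken from the scEN repository, and the handling of cells with zero total counts is an assumption.

```python
import numpy as np


def normalize_adt(adt_counts, target_sum=100.0):
    """Per-cell ADT normalization: scale each cell to `target_sum`, then log10(x + 1)."""
    adt = np.asarray(adt_counts, dtype=float)
    totals = adt.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # assumption: cells with no ADT counts stay all-zero
    return np.log10(adt / totals * target_sum + 1.0)


def normalize_rna(rna_counts, target_sum=1e4):
    """Library-size normalization to `target_sum` counts per cell, then natural log(x + 1)."""
    rna = np.asarray(rna_counts, dtype=float)
    totals = rna.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # assumption: empty cells stay all-zero
    return np.log1p(rna / totals * target_sum)


# Toy matrices (rows = cells, columns = proteins or genes); values are illustrative only.
adt_demo = normalize_adt([[10, 30, 60], [0, 0, 0]])
rna_demo = normalize_rna([[1, 1], [0, 4]])
print(adt_demo)
print(rna_demo)
```

Note that the RNA transform uses the natural logarithm, as a library-size normalization followed by a log1p step normally does, whereas the ADT transform uses base 10 as in the formula above.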
Separate regression models were trained independently for each protein marker. Prior to model fitting, features were standardized using a StandardScaler fit on the training data and applied to the test data. Non-finite gene expression values were replaced with zeros before model fitting. Elastic Net hyperparameters (α and l1_ratio) were optimized using grid search with five-fold cross-validation within the training set and with Pearson correlation as the scoring metric. The parameter grid included α ∈ {0.001, 0.01, 0.1} and l1_ratio ∈ {0.1, 0.5, 0.9} with max_iter = 10000. After hyperparameter selection, models were refit on the full training set using the optimal parameters before evaluation on the held-out test set. Train–test splitting For protein benchmarking, cells were randomly split into training (80%) and testing (20%) partitions using a fixed random seed (random_state = 42). For cross-dataset generalization, models were trained exclusively on the full BMMC dataset and applied without retraining to PBMC10K. No individuals were shared between training and testing sets. Performance evaluation Pearson correlation was chosen because it evaluates preservation of continuous protein expression structure across cells and is widely used in previous protein-imputation benchmarks. Pearson correlation was computed across cells for each protein marker and summarized by the mean correlation across markers. Per-protein correlations were also reported. The per-protein scEN model R² was used only for hyperparameter tuning and not for cross-method comparison. Model interpretability Elastic Net model coefficients were extracted for each trained protein-specific model. Genes with non-zero coefficients were identified as contributing features, enabling interpretation of gene–protein associations learned by scEN.
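The per-protein training, evaluation, and coefficient-extraction steps described above can be sketched as follows. This is a minimal illustration on synthetic data and not the authors' code: the helper names `pearson_score` and `fit_protein_model` are hypothetical, Pearson correlation is assumed as the cross-validation score (as stated for the grid search), and the scaler is wrapped in a scikit-learn pipeline so that it is refit inside each cross-validation fold, which is slightly stricter than fitting it once on the training partition.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def pearson_score(y_true, y_pred):
    """Pearson correlation between observed and predicted values (0 if undefined)."""
    if np.std(y_true) == 0 or np.std(y_pred) == 0:
        return 0.0
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def fit_protein_model(X, y, seed=42):
    """Fit one Elastic Net model for a single protein and report held-out performance."""
    # Replace non-finite expression values with zeros before fitting.
    X = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0, posinf=0.0, neginf=0.0)
    # 80/20 split of cells with a fixed seed.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))
    grid = {
        "elasticnet__alpha": [0.001, 0.01, 0.1],
        "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
    }
    # Five-fold CV on the training partition; the best model is refit on all of it.
    search = GridSearchCV(pipe, grid, cv=5, scoring=make_scorer(pearson_score))
    search.fit(X_tr, y_tr)
    test_r = pearson_score(y_te, search.predict(X_te))
    coef = search.best_estimator_.named_steps["elasticnet"].coef_
    # Indices of genes with non-zero coefficients (the contributing features).
    return search.best_params_, test_r, np.flatnonzero(coef)


# Synthetic stand-in for log-normalized expression: 400 cells x 50 genes,
# where only the first five genes drive the simulated protein signal.
rng = np.random.default_rng(0)
X_demo = np.log1p(rng.poisson(1.0, size=(400, 50)))
true_w = np.array([1.0, 0.8, 0.6, 0.5, 0.4])
y_demo = X_demo[:, :5] @ true_w + rng.normal(0.0, 0.1, size=400)

best_params, test_r, selected = fit_protein_model(X_demo, y_demo)
print(best_params, round(test_r, 3), selected[:10])
```

In practice the function would be called once per protein marker, with the columns of `X` restricted to the intersected HVG set and the indices in `selected` mapped back to gene names for interpretation.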
Manual gating of scEN-predicted protein expression Manual gating was performed using a hierarchical flow cytometry-style workflow applied to two-dimensional scatter plots of scEN-predicted protein expression. Initial gating thresholds were determined in the healthy PBMC1M dataset using density inflection points of the bimodal distributions separating marker-positive and marker-negative populations. The same gating thresholds were subsequently applied to the remaining healthy samples and then to the independent SLE dataset. Gating thresholds were generally fixed prior to downstream statistical analyses, with minimal adjustment as in conventional flow cytometry workflows; thresholds were not optimized to improve age prediction performance. Minor manual adjustments were permitted only when global distribution shifts were observed, while preserving canonical marker relationships. No donor-specific threshold optimization was performed. Sequential gating followed standard immunophenotyping hierarchies (e.g., CD3 and CD19 → CD4 and CD8 → memory markers → myeloid markers). Gating reproducibility was visually confirmed across donors and disease cohorts. All gating thresholds and decision boundaries are shown in Fig. 4 to ensure transparency and reproducibility. scEN-based age prediction model Age prediction was performed using an Elastic Net regression model trained on immune cell-type composition profiles derived from manual gating of scEN-predicted protein expression. For each donor, cell-type features were defined as relative proportions, calculated as the number of cells assigned to a gated population divided by the total number of profiled cells for that donor. The resulting feature matrix consisted of proportions of 19 manually defined immune cell types spanning major lymphoid and myeloid compartments. Chronological age was used as the prediction target.
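As an illustration of how threshold gates on predicted proteins become donor-level features, the sketch below applies a single CD3/CD19 gate and then computes per-donor proportions. It is a simplified example under stated assumptions: the function names, the threshold values (0.5), and the toy data are hypothetical, only the first gate of the hierarchy is shown, and cells outside the two gates are labeled "ungated" here rather than being passed on to further gates.

```python
import numpy as np
import pandas as pd


def gate_lymphoid(pred, cd3_thr, cd19_thr):
    """Assign a coarse label from predicted CD3 and CD19 using fixed thresholds."""
    cd3_pos = pred["CD3"] > cd3_thr
    cd19_pos = pred["CD19"] > cd19_thr
    labels = np.select(
        [cd3_pos & ~cd19_pos, ~cd3_pos & cd19_pos],
        ["T cell", "B cell"],
        default="ungated",
    )
    return pd.Series(labels, index=pred.index, name="cell_type")


def donor_proportions(donor, cell_type, drop=("ungated",)):
    """Cells per gated population divided by all profiled cells of each donor."""
    props = pd.crosstab(donor, cell_type, normalize="index")
    # Ungated cells still count in the denominator but are not used as a feature.
    return props.drop(columns=[c for c in drop if c in props.columns])


# Toy predicted protein values for six cells from two donors (illustrative only).
pred = pd.DataFrame(
    {
        "donor": ["d1", "d1", "d1", "d1", "d2", "d2"],
        "CD3": [1.2, 1.1, 0.1, 0.2, 1.3, 0.1],
        "CD19": [0.1, 0.0, 0.9, 0.1, 0.2, 0.8],
    }
)
labels = gate_lymphoid(pred, cd3_thr=0.5, cd19_thr=0.5)
features = donor_proportions(pred["donor"], labels)
print(features)
```

The resulting table has one row per donor and one column per gated population, which is the shape of feature matrix an age regression would take as input.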
Donors from the healthy PBMC1M dataset were randomly split into training (80%) and testing (20%) cohorts (random_state = 42). The 80/20 split was performed prior to hyperparameter tuning, and cross-validation was conducted exclusively within the training cohort. Hyperparameters were optimized using randomized search with five-fold cross-validation within the healthy training cohort. Cross-validation folds were defined at the donor level to ensure independence between individuals and to prevent any cell-level leakage into donor-level splits. Elastic Net models were implemented using scikit-learn with max_iter = 10000 and random_state = 42. Feature standardization was performed using a StandardScaler fit on the training data and applied to the test data. Model performance was evaluated on the held-out healthy test set using mean absolute error. The trained model was subsequently applied, without retraining, to the independent SLE cohort to estimate predicted immune age. Software and implementation All analyses were conducted on a Linux-based high-performance workstation running Ubuntu 22.04.5 LTS (64-bit). The system was equipped with 512 GB DDR4 ECC registered memory (8 × 64 GB modules at 3200 MT/s). Python 3.12.2 was used for computational modeling, with machine learning implemented using scikit-learn (v1.4.2). Numerical computation and data manipulation were performed using numpy (v1.26.4) and pandas (v2.2.2). Single-cell data processing was conducted using scanpy (v1.10.1). Deep learning components utilized PyTorch (v2.7.0) with CUDA 12.6 support. Statistical analyses were performed in R (v4.5.0). All software environments were managed using Conda to ensure reproducibility and dependency control. Data Availability All datasets analyzed in this study are publicly available from previously published sources. The BMMC CITE-seq dataset from the NeurIPS 2021 multimodal challenge is available in the Gene Expression Omnibus (GEO) under accession GSE194122.
The juvenile dermatomyositis (JDM) PBMC CITE-seq dataset is available through the CZ CELLxGENE Discover resource under the collection “CITE-seq of JDM PBMCs” (collection ID: c672834e-c3e3-49cb-81a5-4c844be4a975). The PBMC10K CITE-seq dataset (“10k PBMCs from a Healthy Donor – Gene Expression with a Panel of TotalSeq™-B Antibodies,” Cell Ranger 3.0.0) is available from 10x Genomics (2018 release). For external RNA-only validation, the Lee PBMC scRNA-seq dataset is available via CZ CELLxGENE Discover (collection ID: 4f889ffc-d4bc-4748-905b-8eb9db47a2ed), originally published in “Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19”. The Arunachalam PBMC scRNA-seq dataset is available via CZ CELLxGENE Discover (collection ID: b9fc3d70-5a72-4479-a046-c2cc1ab19efc). For immune aging analysis, the healthy PBMC scRNA-seq dataset from Yazar et al. is available in GEO under accession GSE196830. Processed data are also available through CZ CELLxGENE Discover (collection ID: dde06e0f-ab3b-46be-96a2-a8082383c4a1). The systemic lupus erythematosus (SLE) PBMC scRNA-seq dataset is available in GEO under accession GSE174188. Genotype data are accessible via dbGaP under accession phs002812.v1.p1. Processed data are also available through CZ CELLxGENE Discover (collection ID: 436154da-bcf1-4130-9c8b-120ff9a888f2). Declarations Competing interests The authors declare no competing interests. Author Contribution H.Y.L. and P.Q. designed and performed the research and experiments and wrote the manuscript. Code Availability The scEN package is publicly available at: https://github.com/OziLeung/scEN The repository includes model training scripts, benchmarking pipelines, and the notebooks required to reproduce the main figures. Funding Declaration The authors declare that no funds or grants were received in relation to the preparation of this manuscript. References Dimitrov, D. et al.
Comparison of methods and resources for cell–cell communication inference from single-cell RNA-seq data. Nat. Commun. 13, 30755 (2022). Triana, S. et al. Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat. Immunol. 22, 1577–1589 (2021). Stubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. & Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. Nat. Rev. Immunol. 17, 207–221 (2017). Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017). Alpert, A. et al. A clinically meaningful metric of immune age derived from high-dimensional longitudinal monitoring. Nat. Med. 25, 487–495 (2019). Zhang, X. et al. An immunophenotype-coupled transcriptomic atlas of human hematopoietic progenitors. Nat. Immunol. 25, 1782–1794 (2024). Song, H.-W., Martin, J., Shi, X. & Tyznik, A. J. Key considerations on CITE-seq for single-cell multiomics. Proteomics 25, 206–213 (2025). Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021). Lakkis, J. et al. A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation. Nat. Mach. Intell. 4, 940–952 (2022). Chen, Y., Fan, X., Shi, C., Shi, Z. & Wang, C. A joint analysis of single-cell transcriptomics and proteomics using transformer. npj Syst. Biol. Appl. 11, 1 (2025). Cao, Y., Zhu, J., Jia, P. & Zhao, Z. scRNASeqDB: a database for RNA-seq based gene expression profiles in human single cells. Genes 8, 368 (2017). Zhou, Z., Ye, C., Wang, J. & Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. Nat. Commun. 11, 651 (2020). Hanhart, D., Gossi, F., Rapsomaniki, M. A., Kruithof-de Julio, M. & Chouvardas, P. 
ScLinear predicts protein abundance at single-cell resolution. Commun. Biol. 7, 267 (2024). Charilaou, P. C. & Battat, R. Machine learning models and over-fitting considerations. World J. Gastroenterol. 28, 605–607 (2022). Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018). Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022). Montoya-Ortiz, G. Immunosenescence, aging, and systemic lupus erythematosus. Autoimmune Dis. 2013, 267078 (2013). Lance, C. et al. Multimodal single cell data integration challenge: results and lessons learned. In Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track 162–176 (PMLR, 2022). Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-seq. eLife 8, e43803 (2019). Rabadam, G. et al. Coordinated immune dysregulation in juvenile dermatomyositis revealed by single-cell genomics. JCI Insight 9, e176963 (2024). Huang, X. et al. Multimodal probing of T cell recognition with hexapod heterostructures. Nat. Methods 21, 857–867 (2024). Wang, B. et al. Targeting intracellular and extracellular receptors with nano-to-macroscale biomaterials to activate immune cells. J. Control. Release 357, 52–66 (2023). Kotliar, D. M. et al. Reproducible single-cell annotation of programs underlying T cell subsets, activation states and functions. Nat. Methods 22, 1964–1980 (2025). Zammit, W. H. et al. Inhibitory TIGIT signalling is dependent on T cell receptor activation. bioRxiv https://doi.org/10.1101/2025.05.08.652881 (2025). Lee, J. S. et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. Sci. Immunol. 5, abd1554 (2020). Arunachalam, P. S. et al. 
Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. Science 369, 1210–1220 (2020). Ziegler-Heitbrock, L. The CD14⁺ CD16⁺ blood monocytes: their role in infection and inflammation. J. Leukoc. Biol. 81, 584–592 (2007). Maecker, H. T., McCoy, J. P. & Nussenblatt, R. Standardizing immunophenotyping for the Human Immunology Project. Nat. Rev. Immunol. 12, 191–200 (2012). Newell, E. W. & Cheng, Y. Mass cytometry: blessed with the curse of dimensionality. Nat. Immunol. 17, 890–895 (2016). Zhu, H. et al. Human PBMC scRNA-seq–based aging clocks reveal ribosome-to-inflammation balance as a single-cell aging hallmark and super longevity. Sci. Adv. 9, eabq7599 (2023). Dou, L. et al. Immune remodeling during aging and the clinical significance of immunonutrition in healthy aging. Aging Dis. 15, 1588–1601 (2024). Liang, Z. et al. Age-related thymic involution: mechanisms and functional impact. Aging Cell 21, e13671 (2022). Cho, J.-H. et al. An intense form of homeostatic proliferation of naive CD8⁺ cells driven by IL-2. J. Exp. Med. 204, 1787–1801 (2007). Liu, Z. et al. Immunosenescence: molecular mechanisms and diseases. Signal Transduct. Target. Ther. 8, 200 (2023). Aw, D., Silva, A. B. & Palmer, D. B. Immunosenescence: emerging challenges for an ageing population. Immunology 120, 435–446 (2007). Aiello, A. et al. Immunosenescence and its hallmarks: how to oppose aging strategically? Front. Immunol. 10, 2247 (2019). Fulop, T. et al. Immunosenescence and inflamm-aging as two sides of the same coin: friends or foes? Front. Immunol. 8, 1960 (2018). Weyand, C. M. & Goronzy, J. J. Aging of the immune system: mechanisms and therapeutic targets. Ann. Am. Thorac. Soc. 13(Suppl 5), S422–S428 (2016). Gergues, M. et al. Senescence, NK cells, and cancer: navigating the crossroads of aging and disease. Front. Immunol. 16, 1565278 (2025). Seidler, S. et al. 
Age-dependent alterations of monocyte subsets and monocyte-related chemokine pathways in healthy adults. BMC Immunol. 11, 30 (2010). Gustafson, C. E. et al. Immune checkpoint function of CD85j in CD8 T cell differentiation and aging. Front. Immunol. 8, 692 (2017). Perez, R. K. et al. Single-cell RNA-seq reveals cell type–specific molecular and genetic associations to lupus. Science 376, abf1970 (2022). Lao, J. et al. Changes of peripheral T cells in systemic lupus erythematosus patients. Immun. Inflamm. Dis. 13, e70156 (2025). Thangjam, N. et al. Natural killer cell count in systemic lupus erythematosus patients: a flow cytometry-based study. Cureus 15, e46885 (2023). de Ocampo, C. et al. Effect of age on xenobiotic-induced autoimmunity. bioRxiv https://doi.org/10.1101/2025.05.22.655368 (2025). Botticelli, A. et al. The role of immune profile in predicting outcomes in cancer patients treated with immunotherapy. Front. Immunol. 13, 974087 (2022). Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005). Wainberg, M., Merico, D., Delong, A. & Frey, B. J. Deep learning in biomedicine. Nat. Biotechnol. 36, 829–838 (2018). Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction . 2nd edn (Springer, New York, 2009). Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996). Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970). Additional Declarations No competing interests reported. 
Status: Under Review Version 1 posted Reviewers agreed at journal 11 May, 2026 Reviews received at journal 08 May, 2026 Reviewers agreed at journal 29 Apr, 2026 Reviewers invited by journal 29 Apr, 2026 Editor assigned by journal 18 Apr, 2026 Submission checks completed at journal 17 Apr, 2026 First submitted to journal 14 Apr, 2026 © Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-9418867","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":635889931,"identity":"01950224-58ba-4ede-a460-319b667e4e6e","order_by":0,"name":"Ho Yeung Leung","email":"","orcid":"","institution":"Georgia Institute of Technology","correspondingAuthor":false,"prefix":"","firstName":"Ho","middleName":"Yeung","lastName":"Leung","suffix":""},{"id":635889932,"identity":"2e817ab3-3be1-4818-98cb-81ccf806de22","order_by":1,"name":"Peng
Qiu","email":"","orcid":"","institution":"Georgia Institute of Technology","correspondingAuthor":true,"prefix":"","firstName":"Peng","middleName":"","lastName":"Qiu","suffix":""}],"badges":[],"createdAt":"2026-04-14 18:38:30","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9418867/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9418867/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":108947376,"identity":"5633ead6-aa53-40c4-9da0-68c3f2262782","added_by":"auto","created_at":"2026-05-11 06:28:14","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":666272,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003escEN and scEN-based age predictor models training, validation and age prediction pipeline \u003c/strong\u003escEN is an Elastic Net framework that predicts protein abundance for scRNA-seq dataset. 
The figure illustrates scEN model training, application on the validation dataset with model evaluations and the developed sc-EN based age predictor model able to stratify disease condition patients with healthy patients.\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/47278ed0cb553a1e90fa6cb7.jpeg"},{"id":108947377,"identity":"073261ed-23ef-4add-8f72-5b09cf3a0950","added_by":"auto","created_at":"2026-05-11 06:28:14","extension":"jpeg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":613524,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePerformance evaluation of models across BMMC, JDM PBMC, and PBMC10K CITE-seq datasets. (a) \u003c/strong\u003eMean Pearson correlation between observed and predicted protein abundance for BMMC under a within-dataset validation scheme. \u003cstrong\u003e(b)\u003c/strong\u003e Per-protein correlation values in BMMC are displayed as grouped bar plots.\u003cstrong\u003e (c) \u003c/strong\u003eMean Pearson correlation for the JDM PBMC dataset evaluated within-dataset.\u003cstrong\u003e (d) \u003c/strong\u003ePer-protein correlations for JDM PBMC.\u003cstrong\u003e (e)\u003c/strong\u003e Mean Pearson correlation for the PBMC10K dataset representing external cross-dataset generalization.\u003cstrong\u003e (f)\u003c/strong\u003ePer-protein correlations for PBMC10K.\u003c/p\u003e","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/8ec0d0d73df70c2cc258ebf4.jpeg"},{"id":108947436,"identity":"f989332b-f1e7-496e-8f9e-735d9de3be09","added_by":"auto","created_at":"2026-05-11 06:29:00","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":657371,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eExternal validation of scEN predictions using independent PBMC scRNA-seq 
datasets.\u003c/strong\u003eScatter plots of predicted protein expression in the Lee and Arunachalam(Arh) PBMC dataset.\u003cstrong\u003e (a, i) \u003c/strong\u003eCD3 and CD19\u003cstrong\u003e (c, k\u003c/strong\u003e) CD4 and CD8\u003cstrong\u003e (e, m\u003c/strong\u003e) CD56 and CD14; \u003cstrong\u003e(g, o) \u003c/strong\u003eCD14 and CD16.Box plots showing predicted protein across defined major population. \u0026nbsp;\u003cstrong\u003e(b, j\u003c/strong\u003e) annotated B-cell and T-cell populations CD3 and CD19 \u003cstrong\u003e(d, l) \u003c/strong\u003eseparating CD4+ and CD8+ T-cell subsets CD4 and CD8, \u003cstrong\u003e(f, n)\u003c/strong\u003e distinguishing NK cells and monocytes CD56 and CD14\u003cstrong\u003e \u003c/strong\u003e\u0026nbsp;\u003cstrong\u003e(h ,p)\u003c/strong\u003eidentifying classical and non-classical monocytes CD14 and CD16\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/92f59585d63c1afc55210e78.jpeg"},{"id":108947396,"identity":"96e7c590-bba7-41d2-af20-69419cde044c","added_by":"auto","created_at":"2026-05-11 06:28:30","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":331163,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003escEN‑predicted surface protein expression in the Yazar et al.1 million PBMC scRNA‑seq dataset, annotated by manual gating on canonical marker thresholds.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eScatter plots show scEN-predicted protein expression used for sequential manual gating of immune cell populations across 1,248,980 healthy PBMCs. \u003cstrong\u003e(a) \u003c/strong\u003eBroad lymphoid partitioning was performed using CD3 and CD19 separating T cells and B cells. \u003cstrong\u003e(b) \u003c/strong\u003eWithin T cells, CD4 and CD8 expression delineated helper and cytotoxic T-cell subsets. 
\u003cstrong\u003e(c)\u003c/strong\u003e CD4+ T-cell differentiation states were resolved using CD45RA and CD27 \u003cstrong\u003e(d) \u003c/strong\u003eFurther subdivision of differentiated CD4+ populations by CD85j and CD57 to identify senescent and effector subsets. \u003cstrong\u003e(e)\u003c/strong\u003eCD8+ T-cell maturation states were defined using CD45RA and CD27. \u003cstrong\u003e(f) \u003c/strong\u003elate-stage CD8+ TEMRA cells further stratified by CD85j and CD57 expression.\u003cstrong\u003e (g)\u003c/strong\u003eWithin the B-cell compartment, IgD and IgM expression identified naïve and transitional-like B cells, while\u003cstrong\u003e (h)\u003c/strong\u003e IgD and CD27 resolved naïve, unswitched memory, switched memory, and double-negative memory B-cell subsets.\u003cstrong\u003e(i)\u003c/strong\u003e Among CD3-CD19- cells, CD56 expression distinguished NK cells from myeloid populations. \u003cstrong\u003e(j)\u003c/strong\u003e CD14 and CD16 expression separated classical and non-classical monocytes. All gates were defined using threshold-based segmentation analogous to flow cytometry.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/2f0a3a46e5fedc63b70a8092.png"},{"id":108947426,"identity":"e2c803ad-b297-4dc6-b84c-30f2b1e7ad0b","added_by":"auto","created_at":"2026-05-11 06:28:49","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":341406,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003escEN cell-type-derived age prediction model applied to the Yazar PBMC1M dataset. (a)\u003c/strong\u003eScatter plot of chronological age and predicted age, showing strong concordance (MAE = 9.76 years). 
\u003cstrong\u003e(b)\u003c/strong\u003e Non-zero regression coefficients associated with features that are gated cell-type proportion; CD8+ naïve T cells contribute the largest negative weight, while effector/senescent CD8+ T cells, NK cells, CD4+ TCM, and CD16+ monocytes contribute positive weights.\u003cstrong\u003e(c) \u003c/strong\u003eExample feature–age relationship: frequency of CD8+ naïve T cells decreases with chronological age.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/050ba09ac3b6c76aa47d242d.png"},{"id":108947435,"identity":"5bcfea36-651e-4420-9528-1600d0595bff","added_by":"auto","created_at":"2026-05-11 06:29:00","extension":"jpeg","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":381876,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003escEN-based age predictor reveals age acceleration in SLE.\u003cbr\u003e\n \u003c/strong\u003ePredicted age from the scEN age predictor, trained on the PBMC dataset (Yazar et al., n=981), applied to an independent scRNA-seq SLE dataset (Perez et al., n = 162). 
Regression fits showed parallel predicted age relationships with a higher intercept in SLE (healthy: predicted age = 38.61 + 0.39 × age; SLE: predicted age = 48.87 + 0.38 × age), corresponding to a 10-year upward shift in predicted age at matched chronological age.\u003c/p\u003e","description":"","filename":"floatimage6.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/f39aef043af638f9de8df985.jpeg"},{"id":108977986,"identity":"6c62a293-49af-48ca-8640-d6461cf97f33","added_by":"auto","created_at":"2026-05-11 11:33:37","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3328572,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9418867/v1/e8cb0dd5-88fe-4820-ae3a-2676890fe78a.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Protein expression derived from scRNA-seq reveals lupus-induced acceleration of immune aging","fulltext":[{"header":"Introduction","content":"\u003cp\u003eSingle-cell RNA sequencing (scRNA-seq) has become central to systems biology. Over the past decade, scRNA-seq has enabled researchers to dissect transcriptional states at single-cell resolution, revealing how cells communicate and how gene regulation drives cellular heterogeneity\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Despite these advances, scRNA-seq captures only messenger RNA (mRNA), presenting a limitation: the inability to measure surface proteins that could further define immune lineage, signaling status, and functional state\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. 
Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) partially addresses this limitation by jointly profiling RNA and surface proteins in the same cells using oligonucleotide-labeled antibodies\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. By providing multimodal measurements, CITE-seq enables direct immunophenotyping and more comprehensive characterization of immune states\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e, offering proteomic insights that originally inaccessible to transcriptomes alone. However, the adoption of CITE-seq remains constrained due to cost, specialized reagents, and reduced throughput compared to conventional scRNA-seq\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eConsequently, although single-cell profiling has expanded across diverse tissues and disease contexts, most available datasets contain only gene expression profiles\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. 
A computational strategy capable of inferring surface proteins directly from transcriptomes would therefore enable more advanced immunophenotyping in RNA-only datasets, motivating the development of computational methods that infer protein expression from scRNA-seq alone, thereby supporting proteomic analyses even in transcriptome-only datasets.\u003c/p\u003e \u003cp\u003eSeveral machine learning approaches integrating paired RNA\u0026ndash;protein measurements to predict protein abundance from transcriptomes. These frameworks employed diverse architectures, ranging from joint latent-variable models (TotalVI) to biologically informed deep learning (sciPENN) and transfer learning (cTP-net), as well as unregularized linear regression (scLinear). While these approaches have demonstrated strengths, including scalability, relatively fast training compared to classical statistical methods, and the ability to impute proteins across diverse contexts, they also face several challenges\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. First of all, deep learning models are more prone to overfitting, as their large parameter space allows them to memorize training data rather than learn generalizable patterns. Models trained on high-dimensional features may fit noise rather than the true biological signal, increasing the risk of overfitting\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. 
In addition, unregularized linear methods struggle to capture complex gene\u0026ndash;protein relationships, whereas transfer-learning models depend strongly on the composition of reference datasets and may fail to generalize when biological contexts differ substantially from pretrained datasets\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. Consequently, the predictive performance of a model trained on one dataset may degrade when it is applied to unseen datasets, a problem compounded by batch effects and other technical sources of variation\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. These limitations highlight the need for models that are accurate, interpretable, and robust across heterogeneous datasets.\u003c/p\u003e \u003cp\u003eTo address these challenges, we introduce scEN, an Elastic Net\u0026ndash;based machine learning framework that predicts single-cell surface protein expression from RNA profiles. Unlike deep learning\u0026ndash;based approaches, scEN balances accuracy with generalizability, offering improved robustness and interpretable gene\u0026ndash;protein associations. Using multiple benchmark PBMC CITE-seq and scRNA-seq datasets, we demonstrate that scEN trained on the bone marrow mononuclear cell (BMMC) CITE-seq dataset provides robust \u003cem\u003ein-silico\u003c/em\u003e protein predictions, particularly for immune cells derived from PBMC RNA-only data. Importantly, the predicted proteins further enable manual gating strategies that recapitulate conventional flow cytometry workflows, thereby bridging computational predictions with established laboratory practices. 
Building on this foundation, we developed a scEN-based immune-age prediction pipeline that stratifies lupus and healthy donors, revealing biologically meaningful shifts in lupus-induced immune composition associated with accelerated immune-system aging trajectories\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. The overall model architecture and downstream analysis workflow are illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn summary, scEN addresses a major limitation in single-cell biology, the absence of protein level information in most scRNA-seq datasets, by computationally recovering the surface protein expression directly from transcriptomes. This enables scalable immunophenotyping in RNA-only datasets, opening opportunities for downstream clinical and translational applications where multimodal profiling remains limited.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003ePerformance of scEN compared to benchmarking models on the CITE-seq datasets\u003c/p\u003e \u003cp\u003eWe first evaluate the performance of scEN with state-of-the-art scRNA-seq protein prediction models using the BMMC dataset (NeurIPS 2021 multimodal challenge; 90,261 bone marrow mononuclear cells with paired RNA\u0026ndash;protein profiles)\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. All models were trained on 80% of this dataset and tested on the remaining 20% (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea, b). Performance was uniformly high across all different methods. 
The deep learning model TotalVI achieved the strongest correlation (mean r\u0026thinsp;=\u0026thinsp;0.767), closely followed by cTP-net and sciPENN (r\u0026thinsp;=\u0026thinsp;0.749 and 0.713), with other models having correlations of 0.708 and 0.705. Per-protein bar plots revealed that highly expressed lineage markers such as CD3, CD19, and CD8 were predicted with high fidelity across all approaches. In contrast, lower-abundance or donor-specific proteins showed greater variability\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. Together, these results indicate that all models achieve comparable accuracy on large multimodal datasets when training and testing data come from the same dataset.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe next compare performance on the juvenile dermatomyositis (JDM) PBMC dataset (105,827 cells), which includes matched healthy and disease samples, because this dataset exhibits greater variability than the BMMC dataset\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e. The JDM dataset was analyzed using the same within-dataset splitting strategy as the BMMC benchmark (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec,d). Here, correlations declined for all methods (r\u0026thinsp;=\u0026thinsp;0.569\u0026ndash;0.461), consistent with increased heterogeneity in this dataset\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Despite these challenges, scEN maintained performance comparable to, and in some cases even exceeding, that of deep learning models, underscoring its robustness in biologically diverse datasets. 
Per-protein correlation profiles remained consistent across markers, with T- and B-cell lineage proteins (CD8, CD19) retaining relatively high predictability, while some markers (e.g., TIGIT, CD25) showed larger variance.\u003c/p\u003e \u003cp\u003eNext, we compare these algorithms in a more challenging setting using the 10x Genomics PBMC10K CITE-seq dataset from a healthy donor, which contains only major immune cell types (7,688 peripheral blood mononuclear cells with paired RNA\u0026ndash;protein data). All models were trained exclusively on BMMC and applied directly to PBMC10K, which differs substantially in tissue origin (bone marrow versus peripheral blood), donor population, and acquisition occasion. We found that scEN outperformed all competing methods, achieving a mean correlation of r\u0026thinsp;=\u0026thinsp;0.798 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee,f), compared with 0.782 for cTP-net, 0.739 for scLinear, 0.726 for sciPENN, and 0.669 for TotalVI. Notably, scEN showed superior performance for lineage-defining markers such as CD3, CD4, CD8, CD14, and CD19, indicating improved cross-dataset consistency relative to deep learning approaches that rely more heavily on dataset-specific feature representations\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eAcross all three benchmarking comparisons (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb,d,f), lineage-defining markers such as CD3, CD4, CD8, CD19, and CD14 exhibited consistently high predictability. 
Meanwhile, activation-dependent markers (e.g., CD25 and TIGIT) displayed lower correlations, consistent with previously reported context-dependent RNA\u0026ndash;protein coupling. These markers are regulated by immune activation and therefore exhibit variable RNA\u0026ndash;protein coupling across immune states, reducing transcriptomic predictability regardless of model\u003csup\u003e\u003cspan additionalcitationids=\"CR22 CR23\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. These trends were observed across all competing methods, suggesting that the differences in per-marker predictability predominantly reflect a biological phenomenon rather than model-specific effects. This consistency reinforces that scEN\u0026rsquo;s performance is not driven by marker-specific overfitting but by its ability to robustly recover the underlying gene\u0026ndash;protein relationships across datasets.\u003c/p\u003e \u003cp\u003eMoreover, we confirmed that scEN consistently captures both lineage-defining markers (e.g., CD3, CD4, CD8) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb,d,f) and some lower-abundance proteins (e.g., CD127) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ef), as well as outperforming other models. These results indicate that while deep learning models excel in dataset-specific tuning, they often fail to generalize across datasets. In contrast, scEN\u0026rsquo;s regularization can handle such domain shifts. 
In summary, scEN achieves a balance between accuracy and robustness, yielding models that are not only interpretable at the level of specific gene\u0026ndash;protein relationships but also robust to batch effects and technical variability.\u003c/p\u003e \u003cp\u003escEN prediction on the validation scRNA-seq datasets enables downstream manual gating strategies\u003c/p\u003e \u003cp\u003eTo evaluate scEN\u0026rsquo;s ability to generalize beyond its CITE-seq training data, we applied the model to two independent PBMC scRNA-seq datasets, Lee (10x Genomics 3\u0026prime; v3, 43,512 cells) and Arh (10x Genomics 3\u0026prime; v3, 49,139 cells)\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e,\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e, both of which contain only transcriptomic measurements without paired protein data. The original authors annotated cell types in both datasets using Seurat. These datasets differ substantially from the CITE-seq training data in donor origin, sequencing depth, and chemistry, posing a significant challenge for scEN to accurately predict protein expression. Since neither dataset includes surface-protein measurements, this evaluation tests whether scEN can reconstruct protein-level immunophenotypes purely from RNA profiles in entirely unseen biological contexts.\u003c/p\u003e \u003cp\u003eAcross both datasets, scEN-predicted CD3 and CD19 protein expression recapitulated lymphoid partitioning that defines T- and B-cell identity (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea,b,i,j). Cells annotated as T cells uniformly had high predicted CD3 and low CD19, whereas predicted CD19 was highest in B-cell populations and remained low across all other cell types. 
The resulting separation between CD3\u0026thinsp;+\u0026thinsp;CD19- T (high: +, low or negative: -, intermediate: int) cells and CD19\u0026thinsp;+\u0026thinsp;CD3- B cells, further supported by box-plot distributions, demonstrates that scEN captures protein-expression patterns consistent with \u003cem\u003ein-silico\u003c/em\u003e CITE-seq behavior.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWithin the T-cell compartment, scEN reconstructed mutually exclusive CD4 and CD8 expression patterns corresponding to helper and cytotoxic T-cell lineages (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec,d,k,l). Predicted intensities formed two clearly separated lobes representing CD4\u0026thinsp;+\u0026thinsp;CD8- and the CD8\u0026thinsp;+\u0026thinsp;CD4- populations. Consistent with transcriptome-based annotations, CD4 T cells exhibited the highest predicted CD4 expression, while CD8 T cells showed elevated CD8 expression. Other cell types displayed substantially lower predicted values, confirming lineage-specific signal recovery despite differences in absolute intensity ranges between datasets.\u003c/p\u003e \u003cp\u003escEN further distinguished innate immune populations through predicted CD56 and CD14 expression (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee,f,m,n). NK cells consistently localized to a CD56\u0026thinsp;+\u0026thinsp;CD14- region, whereas monocytes occupied a CD14\u0026thinsp;+\u0026thinsp;and CD56- space, accurately reproducing canonical immunophenotypic boundaries used in flow and mass cytometry. Predicted expression levels aligned with established biology that NK cells exhibited the highest CD56 expression levels, while monocytes exhibited strong CD14 expression, with all other cell types remaining low for both markers. 
These patterns were consistent across datasets.\u003c/p\u003e \u003cp\u003eWithin the myeloid compartment, predicted CD14 and CD16 expression captured the classical-to-nonclassical monocyte continuum\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eg,h,o,p). In the Lee dataset, scEN reproduced two well-defined clusters corresponding to classical (CD14\u0026thinsp;+\u0026thinsp;and CD16-) and non-classical (CD14- and CD16+) monocytes, with a smooth intermediate gradient. In contrast, the Arh dataset exhibited lower overall CD16 expression and a broader distribution along the CD16 axis, likely reflecting dataset-specific biology such as donor composition and relative abundance of intermediate monocytes rather than prediction noise. Despite differences in absolute signal magnitude, the relative ordering of monocyte subtypes was preserved across datasets, indicating that scEN infers stable immunophenotypic structure under heterogeneous conditions.\u003c/p\u003e \u003cp\u003eTogether, these results demonstrate that scEN generalizes effectively to independent PBMC scRNA-seq datasets lacking protein measurements, enabling reconstruction of key immunophenotypic axes from RNA-only data and supporting downstream analyses such as manual gating, lineage stratification, and cell-type frequency estimation.\u003c/p\u003e \u003cp\u003eClinical application of scEN on healthy and Systemic Lupus Erythematosus (SLE) scRNA-seq PBMC dataset\u003c/p\u003e \u003cp\u003eWe first demonstrate that the predicted protein data support manual gating to define immune cell types. We began by establishing a broad lineage partition using CD3 and CD19 expression, two mutually exclusive markers (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea). 
Cells with high CD3 and negligible CD19 were classified as T cells (CD3\u0026thinsp;+\u0026thinsp;CD19-), whereas those with high CD19 and low CD3 were identified as B cells (CD3- CD19+). This first gate provided a clean lymphoid separation, with minimal contamination between populations and a clear bimodal distribution on the CD3 and CD19 scatter plot. Within the T-cell compartment, we then gated CD4\u0026thinsp;+\u0026thinsp;and CD8\u0026thinsp;+\u0026thinsp;subsets based on their reciprocal expression patterns (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb). CD4\u0026thinsp;+\u0026thinsp;T cells exhibited a strong CD4 and low CD8 signal, while CD8\u0026thinsp;+\u0026thinsp;T cells showed the opposite expression profile, mirroring canonical helper and cytotoxic cell type specific phenotypes. To capture T-cell differentiation and memory states within the CD4\u0026thinsp;+\u0026thinsp;and CD8\u0026thinsp;+\u0026thinsp;compartment, we incorporated CD45RA and CD27 expression to define na\u0026iuml;ve and memory subsets (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec). Density-based gating along the CD45RA and CD27 axes revealed clear separation of maturation states consistent with established CD4\u0026thinsp;+\u0026thinsp;T-cell biology. Within differentiated CD4\u0026thinsp;+\u0026thinsp;populations, we further overlaid CD85j and CD57 expression to resolve functional heterogeneity (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed). Cells co-expressing CD57 and CD85j were annotated as highly differentiated effector phenotypes, while CD57- CD85j- subsets represented less differentiated cells. 
For CD8\u0026thinsp;+\u0026thinsp;T cells, we used CD45RA and CD27 expression to define canonical differentiation states (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ee), separating na\u0026iuml;ve, central memory, effector memory, and terminal effector populations. Within the CD8\u0026thinsp;+\u0026thinsp;CD45RA+ CD27- (TEMRA) compartment, we further resolved later-stage differentiation using CD85j and CD57 expression (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ef), identifying highly differentiated effector subsets. This gating strategy demonstrates that scEN-predicted markers can reproduce late-stage differentiation. Within the B-cell compartment, we first examined IgD and IgM expression to delineate na\u0026iuml;ve and transitional-like populations (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eg). To define classical memory subsets further, we used IgD and CD27 expression (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eh) to identify na\u0026iuml;ve B cells (IgD+CD27-), unswitched memory (IgD+CD27+), switched memory (IgD-CD27+), and double-negative memory (IgD-CD27-) B cells. To refine transitional-like populations within the IgD+CD27- gate, IgM intensity served as a secondary axis, identifying a small fraction of transitional B cells characterized by high IgM expression. These predicted distributions closely mirrored expected frequencies in healthy PBMCs. Among CD3-CD19- cells, scEN-predicted CD56 expression delineated NK cells (CD56+) from myeloid lineages (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ei). Within the CD56- compartment, we further separated monocytes using CD14 and CD16 markers (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ej). 
Classical CD14\u0026thinsp;+\u0026thinsp;CD16- and non-classical CD14-CD16\u0026thinsp;+\u0026thinsp;subsets appeared as distinct arms on the CD14 and CD16 scatter plot, while a smaller CD14int CD16int population represented intermediate monocytes, which were not included in further analyses.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAll gating thresholds and decision boundaries are shown (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Overall, 19 cell types were defined according to the above gating strategies. Notably, the distribution of predicted marker intensities followed unimodal or bimodal patterns consistent with empirical flow cytometry data, reinforcing the accuracy of scEN\u0026rsquo;s \u003cem\u003ein-silico\u003c/em\u003e reconstruction. This gating pipeline recapitulates the standard manual workflow used in immunophenotyping and provides a scalable framework for analyzing over one million cells without direct protein measurement\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e,\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003escEN-derived age prediction on the healthy human scRNA-seq dataset reveals age and Immune cell type correlation\u003c/p\u003e \u003cp\u003eTo demonstrate that scEN-inferred protein data are predictive of aging, we trained an Elastic Net regression model, which predicts age from cell type composition derived from scEN-inferred protein expression. Each individual\u0026rsquo;s cell-type composition, defined as the proportion of 19 manually gated cell types, served as the input to the regression model. The prediction target is the chronological age, which serves as a standard normal human aging metric. 
The scatter plot depicts the relationship between scEN-derived age prediction and chronological age across all donors (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea). The scatter plot reveals a linear correlation spanning the entire adult lifespan, indicating that the model performs equally well in younger and older individuals. The model achieved a mean absolute error (MAE) of 9.76 years, which is comparable to the published single-cell mass cytometry and scRNA-seq immune aging prediction model, which was also trained with healthy human PBMC samples with defined major immune cell types to predict chronological age\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e. This level of accuracy highlights the ability of scEN-predicted protein profiles to capture major axes of immune-system remodeling, even without explicit molecular aging markers.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe visualize the learned non-zero regression coefficients, which define the weighted contribution of each immune subset to the immune-age estimate (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). The presence of both positive and negative coefficients reflects the bidirectional and compensatory nature of immune cell remodeling during aging\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. 
Among all features, CD8+ na\u0026iuml;ve T cells exhibited the strongest negative coefficient, consistent with extensive immunological literature documenting their progressive depletion as a result of thymic involution and homeostatic proliferation\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e,\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. This decline is one of the most recognized biomarkers of immune senescence and is tightly associated with impaired responsiveness to infections and vaccines. In contrast, CD85j+ effector CD8\u0026thinsp;+\u0026thinsp;T cells, NK cells, CD4\u0026thinsp;+\u0026thinsp;central memory (TCM) cells, and CD16\u0026thinsp;+\u0026thinsp;monocytes showed positive coefficients, reflecting their expansion with aging and chronic immune activation. These shifts are characteristic of classical immunosenescence, in which repeated antigen exposure and persistent viral stimulation drive the accumulation of differentiated, cytotoxic, and innate-like immune cells\u003csup\u003e\u003cspan additionalcitationids=\"CR35 CR36\" citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e. The relative magnitude and directionality of these coefficients recapitulate well-established aging hierarchies, strengthening biological confidence in the model\u0026rsquo;s interpretability and demonstrating that scEN-derived features encode meaningful immunological structure.\u003c/p\u003e \u003cp\u003eWe highlight one representative relationship, the age-associated contraction of the CD8+ na\u0026iuml;ve T-cell proportion (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec). 
The smooth and monotonic decline across donors, without abrupt jumps or platform-specific outliers, demonstrates that the scEN cell type-derived age-predicting model captures continuous biological variation rather than noisy technical confounders. This pattern closely mirrors classical flow cytometry studies, where CD8+ na\u0026iuml;ve T cells are among the most sensitive indicators of biological aging\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e. Similar monotonic trajectories were observed for additional top-weighted features, including increased frequencies of NK cells, CD16\u0026thinsp;+\u0026thinsp;monocytes, and CD85j+ effector CD8\u0026thinsp;+\u0026thinsp;T cells in older individuals\u003csup\u003e\u003cspan additionalcitationids=\"CR40\" citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e. These findings collectively indicate that the scEN cell type-derived age-predicting model faithfully reconstructs well-established immunological aging processes using only RNA-derived features.\u003c/p\u003e \u003cp\u003eTogether, these results position scEN as a biologically grounded and generalizable aging prediction model that represents healthy donor PBMC immune aging, and it leverages scEN-inferred protein expression. By integrating interpretable coefficients with compositionally derived signatures of immune remodeling, scEN provides a continuous metric of immune-system aging suitable for large-scale population studies, longitudinal monitoring, and disease-specific comparisons. 
This framework establishes the foundation for downstream analyses, including the application of the scEN cell type-derived age-predicting model to immune dysregulation in systemic lupus erythematosus (SLE), as presented in the next section.\u003c/p\u003e \u003cp\u003eDisease-associated immune age acceleration in Systemic Lupus Erythematosus (SLE) patients\u003c/p\u003e \u003cp\u003eTo demonstrate that the scEN-based age predictor is applicable and sensitive to disease-specific immune alterations, we applied the trained age prediction model to the SLE PBMC scRNA-seq dataset\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e. We overlaid age predictions from SLE patients with those from healthy individuals (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Both SLE and healthy PBMC datasets exhibited a strong linear relationship between predicted age and chronological age. Notably, SLE patients displayed a consistent upward parallel shift of approximately 10 years in predicted age relative to healthy individuals. These results indicate that the model is sensitive to individual-level immune composition changes and that SLE-associated immune remodeling leads to a measurable acceleration of predicted age.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThis effect is evident in the scatter distribution, where SLE-derived predicted age values cluster above the healthy trend line, reflecting an enrichment of older predicted aging status even among younger patients. 
This pattern mirrors well-established immunological features of SLE, including contraction of na\u0026iuml;ve T-cell compartments, expansion of CD16\u0026thinsp;+\u0026thinsp;monocytes, accumulation of cytotoxic CD8\u0026thinsp;+\u0026thinsp;effector subsets, and increased NK-cell activation, cellular features that correspond to positive coefficients in the scEN-based age prediction model\u003csup\u003e\u003cspan additionalcitationids=\"CR44\" citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. Collectively, these observations indicate that the model reconstructs a canonical immune aging phenotype of SLE directly from transcriptome-derived protein estimates, supporting the concept that chronic autoimmune inflammation induces a premature shift toward an aged immune architecture.\u003c/p\u003e \u003cp\u003eTogether, these results establish that SLE patients exhibit pronounced age acceleration based on the immune system relative to healthy individuals and demonstrate that the scEN cell-type\u0026ndash;derived age prediction framework provides both a quantitative and a biologically interpretable approach for detecting such immune dysregulation caused by autoimmune disease such as SLE. More broadly, this analysis highlights how scEN-derived protein predictions bridge transcriptomic variation and clinically relevant immune metrics, positioning the scEN age predictor as a potential tool for monitoring in autoimmune disease, chronic infection, cancer immunotherapy, and large-scale immune health studies\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eWe developed scEN, an interpretable framework that generates \u003cem\u003ein silico\u003c/em\u003e CITE-seq\u0026ndash;like protein predictions for transcript-only scRNA-seq datasets. 
We demonstrated that scEN achieves predictive performance comparable to, and in some cases exceeding, state-of-the-art deep learning approaches on unseen CITE-seq validation datasets. Importantly, predicted protein expression recapitulated key lineage-defining markers, enabling downstream immunophenotyping through manual gating and accurate recovery of major immune cell subsets. These gated populations were further used to construct an scEN-based age prediction pipeline using individual-level cell-type proportions. The resulting age predictor stratified systemic lupus erythematosus (SLE) patients from healthy controls and revealed an approximate ten-year predicted age acceleration in SLE.\u003c/p\u003e \u003cp\u003eUnlike deep-learning frameworks such as TotalVI and sciPENN, scEN achieves cross-dataset stability through Elastic Net regression, which combines L1 (lasso) and L2 (ridge) regularization penalties\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e. In high-dimensional settings such as scRNA-seq, where the number of genes far exceeds the number of cells and extensive co-expression exists, ordinary regression models are prone to overfitting\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e,\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e. L1 regularization introduces sparsity by shrinking many coefficients to zero, thereby performing automatic feature selection and emphasizing informative predictors\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e,\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e. 
L2 regularization stabilizes correlated coefficients in the presence of collinearity and donor-specific noise\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e,\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e. By integrating both penalties, Elastic Net achieves a balance between sparsity and stability, producing models that are interpretable and robust to batch effects and technical variability\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e. This dual regularization likely explains why scEN generalizes more consistently than flexible deep-learning architectures, which, despite their ability to model nonlinear relationships, may exhibit unstable transfer when applied to unseen datasets\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. In this context, scEN\u0026rsquo;s relative simplicity becomes advantageous: by constraining the model to reproducible and biologically meaningful gene\u0026ndash;protein associations, it maintains strong predictive performance without sacrificing interpretability.\u003c/p\u003e \u003cp\u003eDespite these strengths, scEN has several limitations. First, the model can only predict proteins represented in the training CITE-seq dataset; proteins lacking paired RNA\u0026ndash;protein measurements during training cannot be predicted. 
Second, proteins with weak or indirect transcript\u0026ndash;protein coupling remain challenging to predict accurately, even after filtering out lowly expressed genes during preprocessing\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. As shown, prediction performance varies across proteins, and prediction accuracy for a given protein tends to rise or fall in the same manner across different models. This pattern potentially reflects biological constraints rather than computational limitations, representing a challenge shared by all RNA-to-protein prediction frameworks\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eFuture work will focus on expanding the existing training pipeline across diverse tissues and disease contexts, improving inference for rare or previously unseen proteins, and integrating multimodal information to enhance predictive capacity. Beyond methodological refinement, potential applications include immune-age assessment in autoimmune disease, immune profiling in cancer, and retrospective analysis of scRNA-seq datasets lacking paired protein measurements. 
Together, scEN and the associated age-prediction framework provide an accessible, interpretable, and scalable strategy for extending transcriptomic datasets toward clinically meaningful, proteomic-level insight\u003csup\u003e\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eGene and protein feature harmonization\u003c/p\u003e \u003cp\u003eGenes with zero counts across all cells were excluded prior to normalization. Highly variable genes (HVGs) were identified from the BMMC CITE-seq training dataset using Scanpy\u0026rsquo;s implementation of the Seurat v3 method (flavor = \"seurat_v3\", n_top_genes\u0026thinsp;=\u0026thinsp;4000), which ranks genes based on standardized variance to prioritize biologically informative features while reducing noise from lowly expressed genes. Protein target information was not used during HVG selection, to avoid potential information leakage. We selected the top 4,000 HVGs as this range captures most biological variation in CITE-seq datasets while limiting overfitting from high-dimensional noise. HVG selection was performed using only the BMMC dataset prior to any training and test splitting or cross-dataset evaluation. The intersection of these HVGs with each validation dataset defined the final gene feature space. The same intersected gene set was used consistently across all evaluated benchmarking models.\u003c/p\u003e \u003cp\u003eNo batch effect correction or donor regression was applied, in order to preserve raw gene\u0026ndash;protein relationships and to avoid potential information leakage across datasets. 
This decision was made to evaluate model robustness under raw cross-dataset domain shift without harmonization artifacts and to assess generalization to unseen datasets. Genes with zero variance in the training dataset were removed prior to model fitting.\u003c/p\u003e \u003cp\u003eProtein markers absent from downstream evaluation datasets were excluded from model training and benchmarking to avoid undefined prediction targets during cross-dataset evaluation. All reported performance metrics reflect only protein markers present in both the training and evaluation datasets.\u003c/p\u003e \u003cp\u003eCell inclusion and filtering\u003c/p\u003e \u003cp\u003eAll cells from the BMMC and JDM PBMC datasets were included without additional filtering unless preprocessing was performed automatically by the benchmarking pipeline (e.g., sciPENN or scLinear), which applies its own quality control filtering. For PBMC10K, only cells annotated as major immune populations (B cells, T cells, NK cells, monocytes, dendritic cells) by the original authors were retained. No additional harmonization was imposed across models. Protein prediction models were evaluated at the cell level, whereas age prediction models were evaluated at the donor level.\u003c/p\u003e \u003cp\u003eProtein normalization\u003c/p\u003e \u003cp\u003eProtein expression values were normalized according to each method\u0026rsquo;s published implementation. 
For scEN, ADT counts were normalized on a per-cell basis by dividing each protein count by the total ADT count of the corresponding cell and scaling to a fixed total of 100, which ensures comparable dynamic range across cells while preserving relative protein abundance.\u003c/p\u003e \u003cp\u003eThe normalized values were subsequently transformed as:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\mathrm{ADT}_{\\mathrm{norm}}={\\log}_{10}(\\text{normalized ADT}+1)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eFor scLinear and cTP-net, protein values were normalized using centered log-ratio (CLR) transformation as recommended in their original implementations.\u003c/p\u003e \u003cp\u003eFor TotalVI and sciPENN, raw RNA and ADT count matrices were used as required by each method\u0026rsquo;s modeling framework. For sciPENN, the pipeline further log-normalizes both gene expression and ADT counts.\u003c/p\u003e \u003cp\u003eFor each model, observed protein values were transformed using the identical normalization strategy applied during that model\u0026rsquo;s training prior to performance evaluation. All benchmarking models were executed using their published preprocessing pipelines and recommended normalization strategies to ensure fair comparison.\u003c/p\u003e \u003cp\u003escEN model training\u003c/p\u003e \u003cp\u003escEN models were trained using Elastic Net regression implemented in scikit-learn. Gene expression values, library-size normalized to 1e4 and log-transformed using scanpy, served as input features. Normalized protein expression values were used as prediction targets. 
Separate regression models were trained independently for each protein marker.\u003c/p\u003e \u003cp\u003ePrior to model fitting, features were standardized using a StandardScaler fit on the training data and applied to the test data. Non-finite gene expression values were replaced with zeros before model fitting.\u003c/p\u003e \u003cp\u003eElastic Net hyperparameters (α and l1_ratio) were optimized using grid search with five-fold cross-validation within the training set and with Pearson correlation as the scoring metric. The parameter grid included α \u0026isin; {0.001, 0.01, 0.1} and l1_ratio \u0026isin; {0.1, 0.5, 0.9} with max_iter\u0026thinsp;=\u0026thinsp;10000. After hyperparameter selection, models were refit on the full training set using optimal parameters before evaluation on the held-out test set.\u003c/p\u003e \u003cp\u003eTrain\u0026ndash;test splitting\u003c/p\u003e \u003cp\u003eFor protein benchmarking, cells were randomly split into training (80%) and testing (20%) partitions using a fixed random seed (random_state\u0026thinsp;=\u0026thinsp;42). For cross-dataset generalization, models were trained exclusively on the full BMMC dataset and applied without retraining to PBMC10K. No individuals were shared between training and testing sets.\u003c/p\u003e \u003cp\u003ePerformance evaluation\u003c/p\u003e \u003cp\u003ePearson correlation was chosen because it evaluates preservation of continuous protein expression structure across cells and is widely used in previous protein-imputation benchmarks. Pearson correlation was computed across cells for each protein marker and summarized by mean correlation across markers. Per-protein correlations were also reported. 
Per-protein scEN model R\u0026sup2; was used only for hyperparameter tuning and not for cross-method comparison.\u003c/p\u003e \u003cp\u003eModel interpretability\u003c/p\u003e \u003cp\u003eElastic Net model coefficients were extracted for each trained protein-specific model. 
Genes with non-zero coefficients were identified as contributing features, enabling interpretation of gene\u0026ndash;protein associations learned by scEN.\u003c/p\u003e \u003cp\u003eManual gating of scEN-predicted protein expression\u003c/p\u003e \u003cp\u003eManual gating was performed using a hierarchical, flow-cytometry-style workflow applied to two-dimensional scatter plots of scEN-predicted protein expression.\u003c/p\u003e \u003cp\u003eInitial gating thresholds were determined using density inflection points corresponding to marker-positive and marker-negative populations of the healthy PBMC1M dataset. These thresholds were defined based on bimodal distributions separating marker-positive and marker-negative populations. The same gating thresholds were subsequently applied to the remaining healthy samples and then to the independent SLE dataset.\u003c/p\u003e \u003cp\u003eGating thresholds were generally fixed prior to downstream statistical analyses, with minimal adjustment, as in conventional flow cytometry workflows; thresholds were not optimized to improve age prediction performance. Minor manual adjustments were permitted only when global distribution shifts were observed, while preserving canonical marker relationships. No donor-specific threshold optimization was performed.\u003c/p\u003e \u003cp\u003eSequential gating followed standard immunophenotyping hierarchies (e.g., CD3 and CD19 \u0026rarr; CD4 and CD8 \u0026rarr; memory markers \u0026rarr; myeloid markers). Gating reproducibility was visually confirmed across donors and disease cohorts. 
All gating thresholds and decision boundaries are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e to ensure transparency and reproducibility.\u003c/p\u003e \u003cp\u003escEN-based age prediction model\u003c/p\u003e \u003cp\u003eAge prediction was performed using an Elastic Net regression model trained on immune cell-type composition profiles derived from manual gating of scEN-predicted protein expression.\u003c/p\u003e \u003cp\u003eFor each donor, cell-type features were defined as relative proportions, calculated as the number of cells assigned to a gated population divided by the total number of profiled cells for that donor.\u003c/p\u003e \u003cp\u003eThe resulting feature matrix consisted of proportions of 19 manually defined immune cell types spanning major lymphoid and myeloid compartments. Chronological age was used as the prediction target.\u003c/p\u003e \u003cp\u003eDonors from the healthy PBMC1M dataset were randomly split into training (80%) and testing (20%) cohorts (random_state\u0026thinsp;=\u0026thinsp;42). The 80/20 split was performed prior to hyperparameter tuning, and cross-validation was conducted exclusively within the training cohort.\u003c/p\u003e \u003cp\u003eHyperparameters were optimized using randomized search with five-fold cross-validation within the healthy training cohort. Cross-validation folds were defined at the donor level to ensure independence between individuals and to prevent any cell-level leakage into donor-level splits.\u003c/p\u003e \u003cp\u003eElastic Net models were implemented using scikit-learn with max_iter\u0026thinsp;=\u0026thinsp;10000 and random_state\u0026thinsp;=\u0026thinsp;42. 
Feature standardization was performed using StandardScaler fit on the training data and applied to the test data.\u003c/p\u003e \u003cp\u003eModel performance was evaluated on the held-out healthy test set using mean absolute error.\u003c/p\u003e \u003cp\u003eThe trained model was subsequently applied, without retraining, to the independent SLE cohort to estimate predicted immune age.\u003c/p\u003e \u003cp\u003eSoftware and implementation\u003c/p\u003e \u003cp\u003eAll analyses were conducted on a Linux-based high-performance workstation running Ubuntu 22.04.5 LTS (64-bit). The system was equipped with 512 GB DDR4 ECC registered memory (8 \u0026times; 64 GB modules at 3200 MT/s). Python 3.12.2 was used for computational modeling, with machine learning implemented using scikit-learn (v1.4.2). Numerical computation and data manipulation were performed using numpy (v1.26.4) and pandas (v2.2.2). Single-cell data processing was conducted using scanpy (v1.10.1). Deep learning components utilized PyTorch (v2.7.0) with CUDA 12.6 support. Statistical analyses were performed in R (v4.5.0). All software environments were managed using Conda to ensure reproducibility and dependency control.\u003c/p\u003e\n\u003ch3\u003eData Availability\u003c/h3\u003e\n\u003cp\u003eAll datasets analyzed in this study are publicly available from previously published sources.\u003c/p\u003e \u003cp\u003eThe BMMC CITE-seq dataset from the NeurIPS 2021 multimodal challenge is available in the Gene Expression Omnibus (GEO) under accession GSE194122. The juvenile dermatomyositis (JDM) PBMC CITE-seq dataset is available through the CZ CELLxGENE Discover resource under the collection \u0026ldquo;CITE-seq of JDM PBMCs\u0026rdquo; (collection ID: c672834e-c3e3-49cb-81a5-4c844be4a975). 
The PBMC10K CITE-seq dataset (\u0026ldquo;10k PBMCs from a Healthy Donor \u0026ndash; Gene Expression with a Panel of TotalSeq\u0026trade;-B Antibodies,\u0026rdquo; Cell Ranger 3.0.0) is available from 10x Genomics (2018 release). For external RNA-only validation, the Lee PBMC scRNA-seq dataset is available via CZ CELLxGENE Discover (collection ID: 4f889ffc-d4bc-4748-905b-8eb9db47a2ed), originally published in \u003cem\u003eImmunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19\u003c/em\u003e. The Arunachalam PBMC scRNA-seq dataset is available via CZ CELLxGENE Discover (collection ID: b9fc3d70-5a72-4479-a046-c2cc1ab19efc). For immune aging analysis, the healthy PBMC scRNA-seq dataset from Yazar et al. is available in GEO under accession GSE196830. Processed data are also available through CZ CELLxGENE Discover (collection ID: dde06e0f-ab3b-46be-96a2-a8082383c4a1). The systemic lupus erythematosus (SLE) PBMC scRNA-seq dataset is available in GEO under accession GSE174188. Genotype data are accessible via dbGaP under accession phs002812.v1.p1. Processed data are also available through CZ CELLxGENE Discover (collection ID: 436154da-bcf1-4130-9c8b-120ff9a888f2).\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eCompeting interests\u003c/h2\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eH.Y.L. 
and P.Q. designed and performed the research and experiments, and wrote the manuscript.\u003c/p\u003e\n\u003ch3\u003eCode Availability\u003c/h3\u003e\n\u003cp\u003eThe scEN package is publicly available at: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/OziLeung/scEN\u003c/span\u003e\u003cspan address=\"https://github.com/OziLeung/scEN\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e The repository includes model training scripts, benchmarking pipelines, and notebooks required to reproduce the main figures.\u003c/p\u003e \u003cp\u003e \u003cb\u003eFunding Declaration\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe authors declare that no funds or grants were received in relation to the preparation of this manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eDimitrov, D. et al. Comparison of methods and resources for cell\u0026ndash;cell communication inference from single-cell RNA-seq data. \u003cem\u003eNat. Commun.\u003c/em\u003e 13, 30755 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTriana, S. et al. Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. \u003cem\u003eNat. Immunol.\u003c/em\u003e 22, 1577\u0026ndash;1589 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. \u0026amp; Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. \u003cem\u003eNat. Rev. Immunol.\u003c/em\u003e 17, 207\u0026ndash;221 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. \u003cem\u003eNat. 
Methods\u003c/em\u003e 14, 865\u0026ndash;868 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlpert, A. et al. 
A clinically meaningful metric of immune age derived from high-dimensional longitudinal monitoring. \u003cem\u003eNat. Med.\u003c/em\u003e 25, 487\u0026ndash;495 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, X. et al. An immunophenotype-coupled transcriptomic atlas of human hematopoietic progenitors. \u003cem\u003eNat. Immunol.\u003c/em\u003e 25, 1782\u0026ndash;1794 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSong, H.-W., Martin, J., Shi, X. \u0026amp; Tyznik, A. J. Key considerations on CITE-seq for single-cell multiomics. \u003cem\u003eProteomics\u003c/em\u003e 25, 206\u0026ndash;213 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. \u003cem\u003eNat. Methods\u003c/em\u003e 18, 272\u0026ndash;282 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLakkis, J. et al. A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation. \u003cem\u003eNat. Mach. Intell.\u003c/em\u003e 4, 940\u0026ndash;952 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, Y., Fan, X., Shi, C., Shi, Z. \u0026amp; Wang, C. A joint analysis of single-cell transcriptomics and proteomics using transformer. \u003cem\u003enpj Syst. Biol. Appl.\u003c/em\u003e 11, 1 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCao, Y., Zhu, J., Jia, P. \u0026amp; Zhao, Z. scRNASeqDB: a database for RNA-seq based gene expression profiles in human single cells. \u003cem\u003eGenes\u003c/em\u003e 8, 368 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou, Z., Ye, C., Wang, J. \u0026amp; Zhang, N. R. Surface protein imputation from single cell transcriptomes by deep neural networks. \u003cem\u003eNat. 
Commun.\u003c/em\u003e 11, 651 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHanhart, D., Gossi, F., Rapsomaniki, M. A., Kruithof-de Julio, M. \u0026amp; Chouvardas, P. ScLinear predicts protein abundance at single-cell resolution. \u003cem\u003eCommun. Biol.\u003c/em\u003e 7, 267 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCharilaou, P. C. \u0026amp; Battat, R. Machine learning models and over-fitting considerations. \u003cem\u003eWorld J. Gastroenterol.\u003c/em\u003e 28, 605\u0026ndash;607 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHicks, S. C., Townes, F. W., Teng, M. \u0026amp; Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. \u003cem\u003eBiostatistics\u003c/em\u003e 19, 562\u0026ndash;578 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLuecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. \u003cem\u003eNat. Methods\u003c/em\u003e 19, 41\u0026ndash;50 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMontoya-Ortiz, G. Immunosenescence, aging, and systemic lupus erythematosus. \u003cem\u003eAutoimmune Dis.\u003c/em\u003e 2013, 267078 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLance, C. et al. Multimodal single cell data integration challenge: results and lessons learned. In \u003cem\u003eProceedings of the NeurIPS 2021 Competitions and Demonstrations Track\u003c/em\u003e 162\u0026ndash;176 (PMLR, 2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-seq. \u003cem\u003eeLife\u003c/em\u003e 8, e43803 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRabadam, G. et al. Coordinated immune dysregulation in juvenile dermatomyositis revealed by single-cell genomics. 
\u003cem\u003eJCI Insight\u003c/em\u003e 9, e176963 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang, X. et al. Multimodal probing of T cell recognition with hexapod heterostructures. \u003cem\u003eNat. Methods\u003c/em\u003e 21, 857\u0026ndash;867 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, B. et al. Targeting intracellular and extracellular receptors with nano-to-macroscale biomaterials to activate immune cells. \u003cem\u003eJ. Control. Release\u003c/em\u003e 357, 52\u0026ndash;66 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKotliar, D. M. et al. Reproducible single-cell annotation of programs underlying T cell subsets, activation states and functions. \u003cem\u003eNat. Methods\u003c/em\u003e 22, 1964\u0026ndash;1980 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZammit, W. H. et al. Inhibitory TIGIT signalling is dependent on T cell receptor activation. \u003cem\u003ebioRxiv\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2025.05.08.652881\u003c/span\u003e\u003cspan address=\"10.1101/2025.05.08.652881\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee, J. S. et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. \u003cem\u003eSci. Immunol.\u003c/em\u003e 5, abd1554 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArunachalam, P. S. et al. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. \u003cem\u003eScience\u003c/em\u003e 369, 1210\u0026ndash;1220 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZiegler-Heitbrock, L. The CD14⁺ CD16⁺ blood monocytes: their role in infection and inflammation. \u003cem\u003eJ. Leukoc. 
Biol.\u003c/em\u003e 81, 584\u0026ndash;592 (2007).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaecker, H. T., McCoy, J. P. \u0026amp; Nussenblatt, R. Standardizing immunophenotyping for the Human Immunology Project. \u003cem\u003eNat. Rev. Immunol.\u003c/em\u003e 12, 191\u0026ndash;200 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNewell, E. W. \u0026amp; Cheng, Y. Mass cytometry: blessed with the curse of dimensionality. \u003cem\u003eNat. Immunol.\u003c/em\u003e 17, 890\u0026ndash;895 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu, H. et al. Human PBMC scRNA-seq\u0026ndash;based aging clocks reveal ribosome-to-inflammation balance as a single-cell aging hallmark and super longevity. \u003cem\u003eSci. Adv.\u003c/em\u003e 9, eabq7599 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDou, L. et al. Immune remodeling during aging and the clinical significance of immunonutrition in healthy aging. \u003cem\u003eAging Dis.\u003c/em\u003e 15, 1588\u0026ndash;1601 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiang, Z. et al. Age-related thymic involution: mechanisms and functional impact. \u003cem\u003eAging Cell\u003c/em\u003e 21, e13671 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCho, J.-H. et al. An intense form of homeostatic proliferation of naive CD8⁺ cells driven by IL-2. \u003cem\u003eJ. Exp. Med.\u003c/em\u003e 204, 1787\u0026ndash;1801 (2007).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, Z. et al. Immunosenescence: molecular mechanisms and diseases. \u003cem\u003eSignal Transduct. Target. Ther.\u003c/em\u003e 8, 200 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAw, D., Silva, A. B. \u0026amp; Palmer, D. B. Immunosenescence: emerging challenges for an ageing population. 
\u003cem\u003eImmunology\u003c/em\u003e 120, 435\u0026ndash;446 (2007).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAiello, A. et al. Immunosenescence and its hallmarks: how to oppose aging strategically? \u003cem\u003eFront. Immunol.\u003c/em\u003e 10, 2247 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFulop, T. et al. Immunosenescence and inflamm-aging as two sides of the same coin: friends or foes? \u003cem\u003eFront. Immunol.\u003c/em\u003e 8, 1960 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeyand, C. M. \u0026amp; Goronzy, J. J. Aging of the immune system: mechanisms and therapeutic targets. \u003cem\u003eAnn. Am. Thorac. Soc.\u003c/em\u003e 13(Suppl 5), S422\u0026ndash;S428 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGergues, M. et al. Senescence, NK cells, and cancer: navigating the crossroads of aging and disease. \u003cem\u003eFront. Immunol.\u003c/em\u003e 16, 1565278 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSeidler, S. et al. Age-dependent alterations of monocyte subsets and monocyte-related chemokine pathways in healthy adults. \u003cem\u003eBMC Immunol.\u003c/em\u003e 11, 30 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGustafson, C. E. et al. Immune checkpoint function of CD85j in CD8 T cell differentiation and aging. \u003cem\u003eFront. Immunol.\u003c/em\u003e 8, 692 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePerez, R. K. et al. Single-cell RNA-seq reveals cell type\u0026ndash;specific molecular and genetic associations to lupus. \u003cem\u003eScience\u003c/em\u003e 376, abf1970 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLao, J. et al. Changes of peripheral T cells in systemic lupus erythematosus patients. \u003cem\u003eImmun. Inflamm. Dis.\u003c/em\u003e 13, e70156 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThangjam, N. 
et al. Natural killer cell count in systemic lupus erythematosus patients: a flow cytometry-based study. \u003cem\u003eCureus\u003c/em\u003e 15, e46885 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ede Ocampo, C. et al. Effect of age on xenobiotic-induced autoimmunity. \u003cem\u003ebioRxiv\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2025.05.22.655368\u003c/span\u003e\u003cspan address=\"10.1101/2025.05.22.655368\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBotticelli, A. et al. The role of immune profile in predicting outcomes in cancer patients treated with immunotherapy. \u003cem\u003eFront. Immunol.\u003c/em\u003e 13, 974087 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZou, H. \u0026amp; Hastie, T. Regularization and variable selection via the elastic net. \u003cem\u003eJ. R. Stat. Soc. Ser. B Stat. Methodol.\u003c/em\u003e 67, 301\u0026ndash;320 (2005).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWainberg, M., Merico, D., Delong, A. \u0026amp; Frey, B. J. Deep learning in biomedicine. \u003cem\u003eNat. Biotechnol.\u003c/em\u003e 36, 829\u0026ndash;838 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHastie, T., Tibshirani, R. \u0026amp; Friedman, J. \u003cem\u003eThe Elements of Statistical Learning: Data Mining, Inference, and Prediction\u003c/em\u003e. 2nd edn (Springer, New York, 2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTibshirani, R. Regression shrinkage and selection via the lasso. \u003cem\u003eJ. R. Stat. Soc. Ser. B Methodol.\u003c/em\u003e 58, 267\u0026ndash;288 (1996).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoerl, A. E. \u0026amp; Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. 
Posted May 11th, 2026 (version 1) · under review at npj Systems Biology and Applications · license: CC BY 4.0

