Multi-Scale Sequence Encoding Distinguishes Long-Lived and Short-Lived Proteins Revealed by Protein Language Model Embeddings

Research Article | Research Square

Tangilal Dihan Chowdhury, Fasiha Tanzeem Taiba, Md Ushama Shafoyat, and 4 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9298535/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1

Abstract

Protein stability and turnover are fundamental determinants of proteome regulation, yet how protein lifetime is specified by amino-acid sequences remains incompletely understood. Here, we identify a previously uncharacterized, multi-scale organization of sequence features associated with protein stability using protein language model (PLM) representations.
Using experimentally derived half-life data from four human cell lines, we uncover a conserved stability-associated axis in embedding space that separates long-lived proteins (LLP) from short-lived proteins (SLP) (ROC–AUC = 0.80–0.82) and generalizes across species (Pearson r up to 0.465). Systematic decomposition of sequence representations shows that protein stability is not explained by amino-acid composition alone but reflects contributions from multiple sequence levels. While composition provides baseline predictive signal (AUC = 0.64–0.70), sequence grammar (0.65–0.68) and motif features (0.71–0.73) contribute additional information, with PLM representations integrating these signals into a unified framework. Disrupting residue order reduces predictive performance (AUC from ~0.82 to ~0.62), supporting a role for residue organization beyond composition alone. Analysis of sequence features reveals consistent organization patterns associated with stability, with LLP proteins exhibiting tighter lysine spacing, enhanced charge clustering, and increased proline periodicity, whereas SLP proteins show more dispersed residue organization. Perturbation analyses further support the contribution of these features, with disruption of charge organization producing the largest decrease in predicted stability (Δ ≈ −1.53). Together, these findings support a model in which protein stability is associated with distributed, multi-scale sequence organization rather than isolated motifs. By systematically resolving stability-related signals across sequence scales, this work highlights the potential of PLM-based representations to uncover biologically meaningful principles governing proteome dynamics.

Keywords: PLM, long-lived protein, short-lived protein, PLM embeddings, sequence grammar, motif

1. Introduction

Proteins within cells are continuously synthesized and degraded, and the balance between these processes determines protein lifetime, a key regulator of signaling, metabolism, and protein quality control [1]. A central pathway governing targeted degradation is the ubiquitin–proteasome system, which selectively labels proteins for destruction [2]. Despite extensive research, predicting protein lifetime directly from amino-acid sequence remains a major challenge in molecular biology and computational proteomics. Although proteomics approaches enable large-scale measurement of protein turnover, they do not explain the substantial variability in half-life observed across the proteome. In particular, it remains unclear to what extent intrinsic sequence features contribute to degradation behavior. Existing studies have identified sequence elements such as degron motifs, PEST regions, and compositional biases associated with protein stability [3–5]. However, these features do not fully account for observed variability: many unstable proteins lack known degrons, while many proteins containing such motifs remain stable [4, 6]. This discrepancy highlights an open question as to whether protein stability is primarily associated with localized sequence motifs or with broader, distributed patterns of residue organization across entire sequences.

Recent advances in protein language models (PLMs) provide a framework to investigate this question. Trained on large-scale sequence data, PLMs capture complex statistical relationships between amino acids and encode diverse structural and functional properties [7–9]. However, whether these representations capture stability-associated signals, and how such signals are organized within sequences, remains incompletely understood.

Here, we examine this problem using experimentally derived half-life datasets from four human cell lines [10].
We identify a stability-associated axis in PLM embedding space that separates long-lived proteins (LLP) from short-lived proteins (SLP) and shows consistent behavior across datasets and species. By systematically decomposing sequence representations, we find that protein stability is not explained by amino-acid composition alone but reflects contributions from multiple sequence levels, including composition, sequence grammar, and local motifs. Disrupting residue order reduces predictive performance, supporting a contribution of residue organization beyond composition alone. Analysis of sequence features reveals consistent organization patterns associated with stability, including tighter lysine spacing, enhanced charge clustering, and increased proline periodicity in LLP proteins relative to SLP proteins. Perturbation analyses further support the contribution of these features, with disruption of charge organization producing the largest decrease in predicted stability. Together, these results support a model in which protein stability is associated with distributed, multi-scale sequence organization rather than isolated motifs. This framework provides a basis for interpreting variability in proteome-wide turnover and highlights the potential of PLM-based representations to uncover biologically meaningful sequence principles underlying protein stability [11].

2. Materials and Methods

2.1 Protein Datasets and Sequence Processing

Protein half-life data were obtained from a large-scale quantitative proteomics study [10] that systematically mapped short-lived proteins in human cells using cycloheximide-chase assays combined with multiplexed mass spectrometry. The dataset contains measurements from four human cell lines (U2OS, HEK293T, HCT116, and RPE1). Each cell line dataset was curated by removing redundant protein sequences to minimize similarity bias.
For stability classification, proteins were categorized as long-lived (LLP; 25–32 h) or short-lived (SLP; 0.2–2.5 h). For each cell line, 300 LLP and 300 SLP proteins were selected to construct balanced datasets (600 proteins per cell line; 2,400 proteins total across four cell lines). For continuous stability prediction analyses, the full half-life datasets (~1,500 proteins per cell line; ~6,000 proteins total) were used, spanning a wide range of degradation rates. Protein sequences were retrieved in FASTA format from UniProt [12] and used for subsequent computational analyses.

To assess stability-axis similarity and cross-dataset generalization, an independent turnover dataset obtained using Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC) mass spectrometry [13, 14] was also analyzed. This dataset includes half-life measurements for 4,106 proteins in the human HeLa cell line and 3,528 proteins in the mouse C2C12 cell line (7,634 proteins total). Because experimental methodologies differ substantially between cycloheximide-chase and SILAC approaches [13], the datasets were analyzed separately to avoid methodological bias in half-life estimates. Overall, the analyses include more than 13,000 proteins across multiple independent proteomic datasets. All datasets, including half-life measurements and FASTA sequences, are provided in Supplementary File 1.

2.2 Protein Language Model Embeddings

Protein sequences were encoded using the ESM-2 protein language model with 650 million parameters (ESM2-t33-650M) [6–8]. For each protein sequence, token-level representations were extracted from the final (33rd) transformer layer [7, 8]. Residue-level embeddings [15, 16] were averaged across all sequence positions to produce a single fixed-length vector representation for each protein.
If \(E_i\) represents the embedding vector of residue i and L is the sequence length, the final protein embedding (X) was computed as:

$$X = \frac{1}{L} \sum_{i=1}^{L} E_i$$

2.3 Dimensionality Reduction

Because the original embedding vectors are high-dimensional, dimensionality reduction was performed prior to modeling [17]. First, all features were standardized using z-score normalization:

$$Z = \frac{X - \mu}{\sigma}$$

where X represents the original embedding feature, \(\mu\) is the feature mean, and \(\sigma\) is the standard deviation. Principal component analysis (PCA) was then applied to the normalized embeddings to capture the major sources of variation in the data. The top 30 principal components (PCs) were retained and used as the feature space for all downstream analyses. The number of principal components (30) was selected based on the cumulative explained variance (81.7%). The embedding features and the Python script for extracting embedding features from the PLM (ESM-2) for all datasets are provided in Supplementary File 2.

2.4 Stability Score Computation and LLP–SLP Classification

Protein stability was inferred from protein language model embeddings [15, 16] using a logistic regression classifier trained on PCA-transformed embeddings (top 30 PCs). Logistic regression was implemented using scikit-learn with L2 regularization (C = 1). The model estimates the probability that a protein belongs to the LLP class as:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(w \cdot X + b)}}$$

where X is the PCA feature vector, w the weight vector, and b the intercept. All preprocessing steps, including normalization and PCA, were performed within each training fold to prevent data leakage. Model performance was evaluated using ROC–AUC. The classifier defines a linear stability axis in embedding space.
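As a minimal illustration of the formulas in Sections 2.2–2.4 (mean pooling, z-scoring, and the logistic link on a linear score), the following pure-Python sketch works on toy numbers. The function names and the toy weights are hypothetical, and population standard deviation is an assumption; the study itself used ESM-2 embeddings, PCA, and scikit-learn, none of which are reproduced here.

```python
import math
from statistics import mean, pstdev


def mean_pool(residue_embeddings):
    """Average per-residue vectors E_i (list of equal-length lists) into one vector X."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(e[d] for e in residue_embeddings) / length for d in range(dim)]


def zscore(column):
    """Standardize one feature across proteins: Z = (X - mu) / sigma.

    Assumption: population standard deviation; a constant column would
    divide by zero and is not handled in this sketch.
    """
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def stability_score(w, x, b):
    """Linear score w . X + b along the stability axis."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def llp_probability(w, x, b):
    """Logistic link: P(y = 1 | X) = 1 / (1 + exp(-(w . X + b)))."""
    return 1.0 / (1.0 + math.exp(-stability_score(w, x, b)))


# Toy example: a 3-residue protein with 2-dimensional residue embeddings.
X = mean_pool([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]])
w, b = [0.5, -0.25], 0.1          # hypothetical weights, not fitted values
score = stability_score(w, X, b)  # 0.5*2 - 0.25*2 + 0.1
prob = llp_probability(w, X, b)   # above 0.5 because the score is positive
```

A positive score maps to a probability above 0.5 and a negative score to one below 0.5, which is why the sign of the linear score can be read directly as the predicted class.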
Each protein was projected onto this direction to compute a Stability Score (SSS) [18]:

$$SSS = w \cdot X + b$$

Higher SSS values correspond to proteins predicted to be more stable. To avoid overfitting, SSS values were computed using out-of-fold predictions during cross-validation [19, 20]. To assess potential protein length bias, ROC–AUC values were compared between the original and length-matched datasets using five-fold cross-validation [20], with differences evaluated using a paired t-test [21] and Cohen’s d [22]. The datasets and the Python scripts for these analyses are provided in Supplementary File 3.

2.5 Sequence shuffling and sliding-window analysis

To evaluate whether stability prediction depends on sequence order rather than amino acid composition, protein sequences were randomly shuffled while preserving composition [27]. Embeddings were recomputed from the shuffled sequences and classification performance was reassessed; a decrease in performance suggests that residue order contributes to the stability signal. To determine whether stability signals arise from localized motifs or distributed sequence patterns, a sliding-window analysis was performed [28]. Each protein was divided into overlapping 25-residue windows, and embeddings were computed for each window and projected onto the stability axis. For each protein, the maximum, mean, and variance of local stability scores were calculated to assess whether stability signals are concentrated in specific regions or distributed across the sequence. All the Python analysis scripts are provided in Supplementary File 6.

2.6 Sequence grammar and charge clustering analysis

Sequence grammar features were analyzed for long-lived proteins (LLP) and short-lived proteins (SLP) across four cell lines using protein sequences obtained from FASTA files and custom Python scripts.
Charge clustering features included lysine cluster size (K_max_cluster) and acidic cluster size (DE_max_cluster), defined as the longest consecutive runs of K or D/E residues within a sequence. The number of lysine clusters (K_cluster_count) was calculated as the total number of contiguous lysine segments, while charge blockiness was measured as the fraction of adjacent charged residues sharing the same charge sign [29]. Additional interpretable sequence grammar features were computed to investigate stability-related organization patterns. These included lysine spacing (mean distance between consecutive lysines), proline periodicity estimated using Fourier-based spectral analysis [30], PEST enrichment [5], proline-rich fraction, and low-complexity regions identified using sliding-window scans [3, 4]. Disorder positioning was calculated from IUPred scores [31] as the difference between C-terminal and N-terminal disorder fractions [4]. Feature distributions between LLP and SLP proteins were compared using the Mann–Whitney U test [32], and mean values were calculated for each cell line. The IUPred scores (.result files) for all the datasets and the Python scripts for sequence grammar and charge clustering analysis are provided in Supplementary File 7.

2.7 Feature-specific perturbation analysis

Feature-specific perturbation analysis was performed to evaluate the contribution of sequence organization features identified in the study. For each protein sequence, selected sequence organization features were disrupted by residue substitution while preserving overall sequence length and amino acid composition where applicable. Perturbed sequences were re-embedded using the same protein language model, and stability scores were recomputed using the previously defined stability axis. Changes in stability were quantified as the difference between original and perturbed scores (Δ stability score), and mean effects were calculated across all proteins.
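One way to picture a composition-preserving disruption of charge organization is sketched below. This is not the procedure used in the study, whose substitution rules are not specified here. It swaps in a simpler technique, permuting charged residues among the charged positions, and every name in it (`disrupt_charge_organization`, `delta_stability`, the choice of K/R/D/E as the charged set) is an assumption made for illustration. The sign convention, Δ = perturbed score − original score, is inferred from Table 6, where −2.56 − (−1.03) = −1.53.

```python
import random

# Assumption: K/R are treated as positive and D/E as negative residues.
CHARGED = set("KRDE")


def disrupt_charge_organization(seq, seed=0):
    """Permute charged residues among the charged positions only.

    Sequence length and amino-acid composition are unchanged; only the
    arrangement of charges along the sequence is altered.
    """
    rng = random.Random(seed)
    positions = [i for i, aa in enumerate(seq) if aa in CHARGED]
    residues = [seq[i] for i in positions]
    rng.shuffle(residues)
    out = list(seq)
    for pos, aa in zip(positions, residues):
        out[pos] = aa
    return "".join(out)


def delta_stability(score_fn, original, perturbed):
    """Delta = score(perturbed) - score(original).

    `score_fn` is a hypothetical callable mapping a sequence to a stability
    score (in the study this would involve re-embedding with the PLM).
    A negative value means lower predicted stability after perturbation.
    """
    return score_fn(perturbed) - score_fn(original)


seq = "MKKKDEEEAGKLDE"
perturbed = disrupt_charge_organization(seq, seed=1)
```

Because only charged positions are touched, any change in a downstream score can be attributed to charge arrangement rather than to composition or length. The sketch is an illustration only and is not the authors' implementation.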
The Python script for this analysis is provided in Supplementary File 9.

2.8 Axis similarity and continuous half-life prediction

To assess whether the stability-associated signal [23] captured by protein language model embeddings is conserved across biological contexts, stability axes derived from LLP and SLP proteins were compared across cell lines. Axis similarity was quantified using cosine similarity between logistic regression weight vectors [24], where higher similarity indicates a conserved direction separating stable and unstable proteins in embedding space. Generalization of this stability axis was further evaluated using full proteome datasets. Models trained on one cell line were used to predict stability in other cell lines, and predictions were compared with experimental half-life measurements using Pearson correlation [25]. Cross-experiment and cross-species generalization were also tested using independent mouse and human protein turnover datasets [13], which were analyzed separately due to differences in experimental protocols and half-life distributions. To examine whether embeddings capture continuous degradation rates, linear regression models were used to predict log-transformed protein half-life:

$$\hat{y} = w \cdot X + b$$

where X represents the PCA-transformed embedding vector and y = log(half-life + 1). Model performance was evaluated using Pearson correlation, Spearman correlation, and the coefficient of determination (R²). The Python scripts for these experiments are provided in Supplementary File 4.

2.9 Interpretation of embedding components and sequence feature correlations

To interpret the embedding dimensions contributing to stability prediction, principal components with the largest logistic regression weights were identified and analyzed.
Correlations were computed between these components and sequence-derived properties [26] including protein length, lysine density, charge density, and hydrophobic residue fraction, enabling identification of sequence features underlying the stability grammar encoded in the embeddings. In addition, correlations were calculated between the stability score (SSS) and several sequence-level features across cell lines. Pearson correlations were computed for protein length, lysine density, and intrinsic disorder fraction to evaluate whether basic sequence properties contribute to the embedding-derived stability signal. Finally, correlations with structural sequence properties, including hydrophobic fraction, helix-propensity residues, and beta-sheet-propensity residues, were examined to assess whether classical structural features explain the observed stability patterns. All the Python analysis scripts are provided in Supplementary File 5.

2.10 Grammar-based classification and perturbation analysis

To evaluate whether protein stability signals can be captured using interpretable sequence features, a classification model was trained using sequence grammar–derived features, including residue periodicity, clustering, spacing, and related metrics. Model performance was evaluated using ROC–AUC [33], and the approach was further tested on an independent dataset containing human and mouse proteins measured under the same experimental protocol. In this dataset, LLP and SLP classes were defined using the top and bottom 5% of the half-life distribution, and ROC–AUC and precision were used to assess the ability of grammar features to distinguish stability classes across species. To test whether sequence organization contributes to stability prediction, in silico grammar perturbation experiments were performed.
For each LLP sequence, residues at regular intervals (every 10th position) were randomly shuffled, disrupting periodic sequence organization while largely preserving amino acid composition and sequence length. Embeddings were recomputed for the perturbed sequences, and predicted stability values were compared with the original sequences using paired t-tests to determine whether disruption of sequence organization affects predicted stability. The Python scripts for grammar-based classification (Cell 1–4, Human-external, and Mouse datasets) and for the perturbation analysis are provided in Supplementary File 8.

2.11 Functional enrichment analysis

Functional enrichment analysis was performed to identify biological processes associated with long-lived proteins (LLP) and short-lived proteins (SLP). Protein sequences were parsed from FASTA files, and gene names were extracted from UniProt headers (GN field). Gene lists were analyzed using the g:Profiler tool [34] (organism: Homo sapiens) to identify enriched Gene Ontology biological process (GO:BP) terms [35]. Enrichment results were filtered at p < 0.05, and top-ranking terms were used for interpretation and visualization. The analysis result is provided in Supplementary File 10.

2.12 Motif enrichment and motif-based classification

All possible tripeptide motifs (20³ = 8,000) were scanned across LLP and SLP protein sequences. Motif frequencies were computed for each sequence and compared between classes using the Mann–Whitney U test, with effect sizes estimated using Cohen’s d. The top 20 motifs showing the largest effect sizes were selected as features for classification. A logistic regression model using these motif frequencies was trained to distinguish LLP and SLP proteins, and performance was evaluated using five-fold cross-validation with ROC–AUC and precision.
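The tripeptide scan and effect-size ranking described above can be sketched with the standard library alone. This is a hedged illustration rather than the authors' script: the normalization (counts divided by the number of overlapping 3-mer positions), the pooled-standard-deviation form of Cohen's d, and the function names are all assumptions, and the Mann–Whitney test and the classifier are omitted.

```python
import math
from collections import Counter
from statistics import mean, variance


def tripeptide_frequencies(seq):
    """Frequency of each overlapping 3-mer, normalized by the number of 3-mer positions."""
    n = len(seq) - 2
    if n <= 0:
        return {}
    counts = Counter(seq[i:i + 3] for i in range(n))
    return {motif: c / n for motif, c in counts.items()}


def cohens_d(a, b):
    """Cohen's d using a pooled sample standard deviation (needs >= 2 values per group)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def rank_motifs(llp_seqs, slp_seqs, top_k=20):
    """Rank observed motifs by |Cohen's d| of per-sequence frequency (LLP vs SLP)."""
    llp = [tripeptide_frequencies(s) for s in llp_seqs]
    slp = [tripeptide_frequencies(s) for s in slp_seqs]
    motifs = set().union(*llp, *slp)
    scored = []
    for m in motifs:
        a = [f.get(m, 0.0) for f in llp]
        b = [f.get(m, 0.0) for f in slp]
        if variance(a) + variance(b) == 0:  # constant in both classes: d is undefined
            continue
        scored.append((m, cohens_d(a, b)))
    # Largest absolute effect first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda t: (-abs(t[1]), t[0]))[:top_k]


# Toy usage with made-up sequences (not data from the study).
ranking = rank_motifs(["KKKA", "KKKK"], ["AAAK", "AAAA"])
```

Selecting the top motifs on the full dataset and then cross-validating on the same proteins can inflate performance; performing the ranking inside each training fold is one way to guard against that.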
To test whether global sequence organization contributes additional predictive information, the motif features were combined with previously defined sequence grammar features, and model performance was re-evaluated. All datasets and Python scripts are provided in Supplementary File 11.

2.13 Sequence representation benchmarking

Amino acid composition features were computed as the frequency of each of the 20 standard residues in a protein sequence. Stability prediction was performed using logistic regression to classify long-lived (LLP) and short-lived proteins (SLP), and performance was evaluated using ROC–AUC with five-fold stratified cross-validation. Models based on sequence grammar, motif features, and protein language model embeddings were constructed as described above. These representations were compared to assess their relative contribution to protein stability prediction. The Python scripts for this analysis are provided in Additional Information 12.

3. Results

3.1 Identification of a Stability Axis in Protein Language Model Embeddings

Projection of protein embeddings revealed clear separation between LLP and SLP proteins along a derived stability score (SSS) axis (Supplementary Figure S1 A–C). LLP proteins showed substantially higher scores (mean SSS = 0.8816) than SLP proteins (−0.9317), with cross-validated means of 0.8199 and −0.7749, respectively (Table 1). The separation was highly significant (Mann–Whitney p = 3.84 × 10⁻³⁵; validated p = 1.35 × 10⁻²⁶) with a large effect size (Cohen’s d = 1.345; validated d = 1.118) (Table 1). Cross–cell line validation showed consistent performance when the model trained on Cell1 was applied to other datasets, achieving ROC–AUC values of 0.802, 0.811, and 0.817 for Cell2–Cell4, respectively (Supplementary Table S1; Fig. 1 A–F). Control analysis confirmed that predictions were not driven by protein length.
Model performance was essentially unchanged after length matching (AUC = 0.781 vs 0.787), with no significant difference (t = −0.363, p = 0.735; Cohen’s d = −0.18) (Supplementary Figure S2 A–B; Supplementary Table S2). These results indicate that protein language model embeddings capture a stability-associated direction that separates LLP and SLP proteins and generalizes across cell lines, independent of sequence length.

Table 1 Stability score comparison between LLP and SLP proteins.

| Group | Mean SSS | Cross-validated mean SSS |
|---|---|---|
| LLP | 0.881652634863405 | 0.8199233676051194 |
| SLP | −0.9317443608705182 | −0.7749137369997359 |

Statistical significance of stability score

| Statistic | Mean SSS | Cross-validated mean SSS |
|---|---|---|
| Mann–Whitney p-value | 3.835 × 10⁻³⁵ | 1.346 × 10⁻²⁶ |
| Cohen’s d | 1.345 | 1.118 |

3.2 Stability Signals Depend on Sequence Organization

Sequence shuffling markedly reduced classification performance across all datasets (Table 2). While the original models achieved AUC = 0.802–0.823, shuffled sequences produced substantially lower values (0.615–0.639). Because amino acid composition was preserved during shuffling, this reduction suggests that residue order contributes to the stability-associated signal. Sliding-window analysis further showed that stability signals are distributed across the protein sequence rather than confined to short motifs (Table 3). LLP proteins exhibited much higher maximum local stability scores (2.78 vs 0.69, p = 1.71 × 10⁻¹⁶) and mean local scores (0.758 vs −1.149, p = 6.61 × 10⁻¹⁸) compared with SLP proteins, while variance remained similar (1.12 vs 1.07, p = 0.426). Together, these results indicate that protein stability is associated with distributed sequence organization, rather than being explained by amino acid composition alone or localized motifs.
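The two controls discussed in this section, composition-preserving shuffling and overlapping 25-residue windows, can be sketched as follows. The helper names, the fixed random seed, the window step of 1, the use of population variance, and the handling of sequences shorter than one window are assumptions for illustration. Embedding each window with the PLM and projecting it onto the stability axis is outside the scope of this sketch, so the local scores below are toy values.

```python
import random
from statistics import mean, pvariance


def shuffle_preserving_composition(seq, seed=0):
    """Randomly permute residues: composition and length stay fixed, order is destroyed."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def sliding_windows(seq, size=25, step=1):
    """Overlapping windows of `size` residues.

    Assumptions: a sequence no longer than `size` yields itself once; with
    step > 1 the final residues may not be covered by any window.
    """
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def summarize_local_scores(scores):
    """Per-protein summary used to compare concentrated vs distributed signal."""
    return {
        "max_local_score": max(scores),
        "mean_local_score": mean(scores),
        "var_local_score": pvariance(scores),
    }


protein = "MKKAKDEEEDKR" * 3                      # toy 36-residue sequence
shuffled = shuffle_preserving_composition(protein, seed=3)
windows = sliding_windows(protein, size=25, step=1)
summary = summarize_local_scores([1.0, 2.0, 3.0])  # toy local scores
```

A high mean local score with similar variance in both classes, as reported in Table 3, is the pattern one would expect from a signal spread along the sequence rather than concentrated in a single window.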
Table 2 Effect of sequence shuffling

| Cell Line | Original AUC | Shuffled AUC |
|---|---|---|
| Cell 1 | 0.823 | 0.632 |
| Cell 2 | 0.802 | 0.639 |
| Cell 3 | 0.811 | 0.621 |
| Cell 4 | 0.817 | 0.615 |

Table 3 Stability window analysis

| Cell | Feature | LLP Mean | SLP Mean | p-value |
|---|---|---|---|---|
| 1 | max_local_score | 2.7777 | 0.6943 | 1.71 × 10⁻¹⁶ |
| 1 | var_local_score | 1.1195 | 1.0701 | 0.426 |
| 1 | mean_local_score | 0.758 | −1.1489 | 6.608 × 10⁻¹⁸ |
| 2 | max_local_score | 2.491 | 0.5311 | 9.351 × 10⁻¹⁶ |
| 2 | var_local_score | 1.2108 | 1.2556 | 0.50708 |
| 2 | mean_local_score | 0.5514 | −1.2764 | 2.118 × 10⁻¹⁷ |
| 3 | max_local_score | 2.819 | 0.6416 | 7.045 × 10⁻¹⁸ |
| 3 | var_local_score | 1.1254 | 1.2881 | 0.783 |
| 3 | mean_local_score | 0.7754 | −1.244 | 4.009 × 10⁻¹⁹ |
| 4 | max_local_score | 2.6625 | 1.1451 | 5.232 × 10⁻¹¹ |
| 4 | var_local_score | 1.0909 | 1.1901 | 0.995 |
| 4 | mean_local_score | 0.7105 | −0.7897 | 4.296 × 10⁻¹⁹ |

3.3 Sequence-Level Correlates of the Stability Axis

Decoding of the dominant embedding components revealed a consistent charge–hydrophobic stability grammar across cell lines (Supplementary Tables S5–S6). The stability signal was primarily associated with lysine density, charge distribution, and hydrophobic composition, rather than sequence length. Strong relationships with charge organization were observed in Cell3 (PC7: lysine density r = −0.450; charge density r = −0.324), while hydrophobic fraction contributed positively in components such as Cell1 PC17 (r = 0.246) and Cell4 PC26 (r = 0.142). Direct correlations between sequence features and the stability score supported this interpretation (Table 4). Stability scores showed weak-to-moderate associations with lysine density (r = 0.070–0.319) and moderate associations with protein length (r = 0.268–0.329), whereas intrinsic disorder showed weak correlations (r = −0.086 to 0.171) (Fig. 2 A–D). Correlations with predicted structural propensities were generally small (Supplementary Table S7).
Together, these results suggest that the embedding-derived stability axis is associated with sequence-level organization of charged and hydrophobic residues, rather than classical structural features or simple compositional effects. From a biological perspective, these features are consistent with known properties associated with protein degradation. Lysine residues serve as primary ubiquitin attachment sites, and their density and spatial distribution may influence modification patterns. Similarly, clustering of charged residues may affect local electrostatic environments relevant to protein–protein interactions. Hydrophobic composition may influence folding stability and exposure of sequence features. While not directly measuring degradation mechanisms, these observations indicate that the stability axis captures biologically meaningful sequence properties rather than purely abstract embedding patterns.

Table 4 Correlation between stability score and sequence features across cell lines

| Cell Line | Length (r) | Lysine Density (r) | Disorder Fraction (r) |
|---|---|---|---|
| Cell1 | 0.326 | 0.180 | −0.086 |
| Cell2 | 0.329 | 0.070 | −0.040 |
| Cell3 | 0.268 | 0.319 | 0.012 |
| Cell4 | 0.278 | 0.216 | 0.171 |

3.4 Sequence Grammar of Long-Lived and Short-Lived Proteins

Charge clustering analysis revealed consistent differences between LLP and SLP proteins across all cell lines (Table 5; Fig. 3 A–D). LLP proteins showed larger lysine clusters (K_max_cluster = 2.18–2.41 vs 1.94–2.06) and larger acidic clusters (DE_max_cluster = 3.86–4.08 vs 3.26–3.34), together with substantially more lysine clusters overall (44.01–47.35 vs 28.33–33.68) (p ≈ 10⁻⁴ to 10⁻¹¹; Table 5). In contrast, charge blockiness showed no significant differences between LLP and SLP proteins. These results indicate that the stability signal reflects enhanced charge clustering in long-lived proteins. Additional sequence analysis further highlighted differences in lysine spacing and proline periodicity (Table 5).
LLP proteins showed shorter lysine spacing (15.27–17.62 vs 17.27–20.20; significant in three of the four cell lines, Table 5) and higher proline periodicity (15.14–16.08 vs 10.73–11.94) across cell lines, whereas features such as PEST score, proline-rich fraction, and low-complexity fraction showed no significant differences. The distributions of these patterns are shown in Fig. 4 A–H and Fig. 5 A–H. Analysis of degron-like sequence patterns indicated that such motifs are broadly present in both LLP and SLP proteins. However, LLP proteins exhibited a modest but significant increase in motif density compared to SLP proteins (mean density: 0.027 vs 0.024; Mann–Whitney U test, p = 2.48 × 10⁻¹¹) (Supplementary Fig. 4), suggesting that differences in stability are associated with motif distribution rather than simple presence. Together, these results suggest that protein stability is associated with distributed residue organization, where LLP proteins exhibit stronger lysine and charge clustering together with periodic proline organization, whereas SLP proteins display larger lysine spacing and weaker clustering patterns.
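For readers who want to see how the charge-clustering and spacing features might be computed, the sketch below follows the definitions given in Section 2.6. It is one reading of those definitions, not the authors' code. In particular, counting a single isolated K as one lysine segment, treating K/R as positive and D/E as negative, and defining blockiness over sequence-adjacent pairs in which both residues are charged are assumptions.

```python
import re
from statistics import mean


def longest_run(seq, residues):
    """Longest consecutive run made only of the given residues (e.g. 'K' or 'DE')."""
    runs = re.findall(f"[{residues}]+", seq)
    return max((len(r) for r in runs), default=0)


def k_cluster_count(seq):
    """Number of contiguous lysine segments (assumption: a lone K counts as one)."""
    return len(re.findall(r"K+", seq))


def mean_lysine_spacing(seq):
    """Mean distance between consecutive lysines; None if fewer than two lysines."""
    pos = [i for i, aa in enumerate(seq) if aa == "K"]
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return mean(gaps) if gaps else None


def charge_blockiness(seq):
    """Fraction of adjacent charged-residue pairs that share the same charge sign."""
    sign = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed charge assignment
    pairs = [(sign[a], sign[b]) for a, b in zip(seq, seq[1:]) if a in sign and b in sign]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


toy = "MKKAKDEEEDKR"  # made-up sequence, not from the datasets
features = {
    "K_max_cluster": longest_run(toy, "K"),
    "DE_max_cluster": longest_run(toy, "DE"),
    "K_cluster_count": k_cluster_count(toy),
    "K_spacing": mean_lysine_spacing(toy),
    "charge_blockiness": charge_blockiness(toy),
}
```

Under this reading, K_cluster_count grows with both protein length and lysine content, so comparing it between classes of different average length may partly reflect length rather than clustering.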
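Table 5 Stability-associated sequence organization features across four cell lines

| Category | Cell | Feature | LLP Mean | SLP Mean | p-value |
|---|---|---|---|---|---|
| Charge Clustering | 1 | K_max_cluster | 2.24 | 1.9467 | 4.00 × 10⁻⁵ |
| Charge Clustering | 1 | DE_max_cluster | 3.8567 | 3.30 | 1.88 × 10⁻⁴ |
| Charge Clustering | 1 | K_cluster_count | 47.3467 | 28.33 | 3.14 × 10⁻⁹ |
| Charge Clustering | 1 | charge_blockiness | 0.5514 | 0.5629 | 0.263 |
| Charge Clustering | 2 | K_max_cluster | 2.18 | 1.9433 | 2.00 × 10⁻⁴ |
| Charge Clustering | 2 | DE_max_cluster | 4.08 | 3.26 | 1.64 × 10⁻⁶ |
| Charge Clustering | 2 | K_cluster_count | 45.1067 | 29.22 | 4.06 × 10⁻⁸ |
| Charge Clustering | 2 | charge_blockiness | 0.5563 | 0.5550 | 0.98 |
| Charge Clustering | 3 | K_max_cluster | 2.30 | 2.0033 | 2.82 × 10⁻⁶ |
| Charge Clustering | 3 | DE_max_cluster | 3.9367 | 3.29 | 2.20 × 10⁻⁶ |
| Charge Clustering | 3 | K_cluster_count | 46.0167 | 30.21 | 2.02 × 10⁻¹¹ |
| Charge Clustering | 3 | charge_blockiness | 0.5624 | 0.5555 | 0.558 |
| Charge Clustering | 4 | K_max_cluster | 2.4134 | 2.06 | 1.80 × 10⁻⁶ |
| Charge Clustering | 4 | DE_max_cluster | 3.9867 | 3.34 | 8.07 × 10⁻⁶ |
| Charge Clustering | 4 | K_cluster_count | 44.01 | 33.68 | 1.59 × 10⁻⁶ |
| Charge Clustering | 4 | charge_blockiness | 0.56523 | 0.5592 | 0.722 |
| Sequence Grammar | 1 | PEST score | 0.0502 | 0.0559 | 0.066 |
| Sequence Grammar | 1 | Proline-rich fraction | 0.0082 | 0.0074 | 0.805 |
| Sequence Grammar | 1 | Low complexity fraction | 0.0299 | 0.0355 | 0.065 |
| Sequence Grammar | 1 | K spacing | 17.619 | 20.199 | 0.0045 |
| Sequence Grammar | 1 | Proline periodicity | 16.075 | 11.227 | 1.34 × 10⁻⁴ |
| Sequence Grammar | 1 | Disorder shift | −0.0137 | −0.0537 | 0.149 |
| Sequence Grammar | 2 | PEST score | 0.0588 | 0.0602 | 0.815 |
| Sequence Grammar | 2 | Proline-rich fraction | 0.0066 | 0.0075 | 0.240 |
| Sequence Grammar | 2 | Low complexity fraction | 0.0326 | 0.0371 | 0.733 |
| Sequence Grammar | 2 | K spacing | 16.168 | 17.270 | 0.272 |
| Sequence Grammar | 2 | Proline periodicity | 15.276 | 10.732 | 1.62 × 10⁻⁵ |
| Sequence Grammar | 2 | Disorder shift | 0.0224 | −0.0380 | 0.0297 |
| Sequence Grammar | 3 | PEST score | 0.0569 | 0.0578 | 0.920 |
| Sequence Grammar | 3 | Proline-rich fraction | 0.0081 | 0.0078 | 0.375 |
| Sequence Grammar | 3 | Low complexity fraction | 0.0344 | 0.0360 | 0.492 |
| Sequence Grammar | 3 | K spacing | 15.273 | 18.498 | 3.57 × 10⁻⁴ |
| Sequence Grammar | 3 | Proline periodicity | 15.139 | 11.865 | 0.0187 |
| Sequence Grammar | 3 | Disorder shift | −0.0082 | −0.0513 | 0.100 |
| Sequence Grammar | 4 | PEST score | 0.0588 | 0.0548 | 0.144 |
| Sequence Grammar | 4 | Proline-rich fraction | 0.0073 | 0.0067 | 0.891 |
| Sequence Grammar | 4 | Low complexity fraction | 0.0366 | 0.0343 | 0.548 |
| Sequence Grammar | 4 | K spacing | 16.550 | 20.194 | 0.0129 |
| Sequence Grammar | 4 | Proline periodicity | 15.708 | 11.938 | 0.0094 |
| Sequence Grammar | 4 | Disorder shift | 0.0266 | −0.0134 | 0.181 |

3.5 Feature-specific perturbation analysis reveals dominant contribution of charge organization

To further dissect the contribution of individual sequence grammar components, feature-specific perturbation analysis was performed by selectively disrupting lysine distribution, charge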
organization, and proline periodicity (Table 6). Disruption of charge organization resulted in the largest decrease in predicted stability, followed by lysine redistribution and proline perturbation, indicating a hierarchical contribution of sequence features (charge > lysine > proline) to the stability signal. These results provide computational support that the embedding-derived stability signal is influenced by specific sequence organization features, with charge organization as the dominant contributor. The Python script for this analysis is provided in Supplementary File 13.

Table 6 Feature-specific perturbation analysis of sequence grammar components

| Perturbation Type | Mean Stability Score | Δ Stability Score |
|---|---|---|
| Original | −1.03 | n/a |
| Lysine disruption | −1.47 | −0.44 |
| Charge disruption | −2.56 | −1.53 |
| Proline disruption | −1.18 | −0.15 |

3.6 Conservation of the Stability Axis Across Cell Lines and Species

Stability axes derived from LLP and SLP proteins were highly similar across the four cell lines, with cosine similarity values of 0.791–0.908 (Table 7; Fig. 6 A), indicating a conserved direction separating stable and unstable proteins in embedding space. When models trained on one cell line were applied to predict stability in the remaining datasets using all proteins, cross-cell correlations remained positive (r = 0.412–0.467) (Supplementary Table S3; Fig. 6 B), indicating that the stability axis generalizes beyond the LLP–SLP subsets. The stability axis also captured continuous variation in protein half-life, showing moderate correlations with measured half-life values (Pearson r = 0.374–0.396; Spearman r = 0.371–0.404) (Table 8), indicating that it reflects a gradual stability landscape rather than a purely binary classification boundary.
Finally, cross-species evaluation showed that models trained on mouse proteins predicted human stability (r = 0.359), while the reciprocal analysis yielded r = 0.465 (Supplementary Table S4; Fig. 6C). Together, these results suggest that the stability-associated signal captured by protein language model embeddings is conserved across cell types and species, and that it reflects sequence features associated with protein stability rather than properties specific to individual experimental systems.

Table 7. Stability axis cosine similarity matrix derived using LLP and SLP proteins

Cell line | Cell 1 | Cell 2 | Cell 3 | Cell 4
Cell 1 | 1.000 | 0.908 | 0.804 | 0.869
Cell 2 | 0.908 | 1.000 | 0.871 | 0.896
Cell 3 | 0.804 | 0.871 | 1.000 | 0.791
Cell 4 | 0.869 | 0.896 | 0.791 | 1.000

Table 8. Continuous half-life prediction

Cell | Pearson r | Spearman r | R²
Cell 1 | 0.374 | 0.371 | 0.13
Cell 2 | 0.395 | 0.398 | 0.147
Cell 3 | 0.385 | 0.404 | 0.142
Cell 4 | 0.396 | 0.392 | 0.146

3.7 Functional Role of Stability-Associated Sequence Grammar

Sequence grammar features provided measurable predictive power for distinguishing LLP and SLP proteins. A logistic regression model using grammar-derived features showed consistent performance across cell lines (Table 9), with ROC–AUC values of 0.656–0.678, and reached 0.654 and 0.722 on the external human and mouse datasets, respectively (Fig. 7). The model indicated that LLP proteins tend to exhibit higher proline periodicity, tighter lysine spacing, and stronger lysine/acidic residue clustering, whereas SLP proteins show more dispersed residue organization. These results indicate that sequence grammar alone captures a measurable portion of the stability signal, independent of full embedding representations.

These sequence organization patterns are consistent with properties associated with protein degradation.
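To make the grammar features named above more concrete, the following sketch computes three simple, order-dependent quantities from a sequence string. The definitions are assumptions chosen for illustration (mean gap between consecutive lysines, longest uninterrupted run of a residue class, and number of runs of length two or more); the exact definitions behind K spacing, K_max_cluster, DE_max_cluster, and K_cluster_count in Table 5 are given in the supplementary scripts and may differ. All function names and the toy sequence are hypothetical.

```python
import re


def mean_lysine_spacing(seq):
    """Mean distance between consecutive lysines; None if fewer than two K (assumed definition)."""
    positions = [i for i, aa in enumerate(seq) if aa == "K"]
    if len(positions) < 2:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)


def max_run(seq, residues):
    """Length of the longest uninterrupted run made only of the given residues."""
    runs = re.findall(f"[{residues}]+", seq)
    return max((len(run) for run in runs), default=0)


def cluster_count(seq, residues, min_len=2):
    """Number of uninterrupted runs of the given residues with length >= min_len."""
    return sum(1 for run in re.findall(f"[{residues}]+", seq) if len(run) >= min_len)


toy = "MKKAEEDKPLK"  # hypothetical toy sequence, not from the datasets
features = {
    "K_spacing": mean_lysine_spacing(toy),
    "K_max_run": max_run(toy, "K"),
    "DE_max_run": max_run(toy, "DE"),
    "K_runs": cluster_count(toy, "K"),
}
print(features)
```

Features of this kind depend on residue order, so two sequences with identical composition can yield different values, which is what distinguishes grammar features from composition features.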
Differences in lysine spacing and clustering suggest variation in the distribution of potential ubiquitination sites, while charge clustering may influence electrostatic interaction contexts. Periodic proline organization may affect local structural environments and accessibility of sequence features. Although these analyses do not directly measure degradation mechanisms, they provide biologically consistent interpretations of the sequence patterns identified. Perturbation analysis further supports the relevance of these features, as disrupting sequence organization reduced predicted stability in all four cell lines (Table 10). These results suggest that sequence grammar features contribute to the stability-associated signal captured by the model.

Table 9. Performance of the grammar-based logistic regression model

Cell line | ROC–AUC | Precision
Cell 1 | 0.678 | 0.629
Cell 2 | 0.656 | 0.635
Cell 3 | 0.666 | 0.617
Cell 4 | 0.676 | 0.639
Human external | 0.654 | 0.61
Mouse | 0.722 | 0.697

Table 10. Perturbation analysis results

Cell | Mean Δ predicted log half-life | Paired t-test p-value
1 | −0.069 | 0.00077
2 | −0.078 | 1.048 × 10⁻⁶
3 | −0.024 | 0.0213
4 | −0.069 | 1.1661 × 10⁻⁵

3.8 Functional enrichment of stability-associated proteins

Functional enrichment analysis revealed that both long-lived proteins (LLPs) and short-lived proteins (SLPs) are predominantly associated with regulatory biological processes, particularly those related to transcription and RNA metabolism. LLPs showed relatively stronger enrichment in chromatin organization and transcription-associated processes, indicating a bias toward nuclear gene regulatory functions (Fig. 8). In contrast, SLPs were enriched in broader regulatory categories, including cellular and metabolic processes, reflecting less functionally specific enrichment profiles.
Although both protein groups participate extensively in regulatory networks, these results suggest a shift in functional emphasis rather than a strict functional separation. This pattern is consistent with the sequence-derived stability signal identified in this study, indicating that features captured by protein language model embeddings may reflect underlying functional tendencies associated with protein stability. The analysis and results are provided in CSV form in Supplementary File 9.

3.9 Decomposition of sequence determinants of protein stability

To dissect sequence contributions to protein stability, we evaluated composition, sequence grammar, motif features, and protein language model embeddings. As shown in Table 11, composition showed moderate performance (ROC–AUC = 0.64–0.70), indicating a baseline contribution of amino acid frequencies. Grammar features, which capture residue organization (spacing and clustering), showed comparable performance (0.65–0.68). Motif-based features performed better (0.71–0.73), while embeddings achieved the highest ROC–AUC (0.80–0.82), indicating integration of multi-level sequence information. These results support a hierarchical contribution of sequence features to the stability-associated signal across sequence scales, with higher-order representations integrating and amplifying lower-level sequence signals. Combined analysis across all datasets revealed compositional biases, with LLP enriched in D, Q, K, G, and T, and SLP enriched in C, R, S, N, and W.

Together, these results support a multi-scale model of protein stability, in which composition, motif-level signals, and higher-order sequence grammar jointly contribute to stability, and protein language model embeddings integrate these signals into a unified representation. Top motif features are provided in Supplementary File 10.
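The distinction between composition and residue organization can be illustrated with a short sketch: shuffling a sequence leaves its amino-acid composition unchanged while scrambling order-dependent properties. The example below is a minimal illustration under stated assumptions; it uses a plain random permutation (not necessarily the shuffling procedure used in this study), and the function names, the toy sequence, and the `longest_run` feature are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """20-dimensional vector of amino-acid frequencies (order-independent)."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]


def shuffle_sequence(seq, seed=0):
    """Random permutation of residues: same composition, different order."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def longest_run(seq, residue):
    """Length of the longest uninterrupted run of one residue (order-dependent)."""
    best = current = 0
    for aa in seq:
        current = current + 1 if aa == residue else 0
        best = max(best, current)
    return best


toy = "KKKKKKAAAAAADDDDDDPPPPPPGGGGGG"  # hypothetical blocky toy sequence
shuffled = shuffle_sequence(toy)

print("same composition:", composition(toy) == composition(shuffled))
print("longest K run before/after:", longest_run(toy, "K"), longest_run(shuffled, "K"))
```

A model that relied only on composition would score the original and shuffled sequences identically, so any drop in performance after shuffling has to come from order-dependent information.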
Table 11. Contribution of different sequence representations

Cell line | Composition (AUC) | Grammar (AUC) | Motif (AUC) | Embedding (AUC)
Cell 1 | 0.642 | 0.678 | 0.712 | 0.823
Cell 2 | 0.667 | 0.656 | 0.721 | 0.802
Cell 3 | 0.699 | 0.666 | 0.711 | 0.811
Cell 4 | 0.696 | 0.676 | 0.726 | 0.817

3.10 Summary of Stability-Associated Sequence Grammar

As summarized in Fig. 9, protein stability is associated with features at multiple sequence scales, including amino acid composition, short motif patterns, and higher-order sequence grammar. Long-lived proteins exhibit tighter lysine spacing, enhanced charge clustering, and increased proline periodicity, whereas short-lived proteins display more dispersed residue organization (Fig. 9A). Comparative analysis across sequence representations reveals a clear hierarchy of predictive power, with protein language model embeddings outperforming composition-, motif-, and grammar-based features, indicating integration of multi-scale sequence information (Fig. 9B). Projection onto the embedding-derived stability axis separates LLP and SLP proteins, demonstrating distinct stability score distributions (Fig. 9C). Validation analyses further show that stability signals depend on sequence organization rather than composition alone: sequence shuffling reduces predictive performance, stability signals are distributed across sequences, and perturbation of sequence grammar features decreases predicted stability (Fig. 9D). Together, these results support a model in which protein stability arises from distributed sequence organization captured by embedding-based representations and associated with stability-related sequence properties (Fig. 9E). The features and the Python scripts for the Cohen's d analysis are provided in Supplementary File 13, and the corresponding figure is provided in Supplementary Fig. 3.

4. Discussion

Protein degradation is central to proteome homeostasis [11], yet the sequence-level principles governing intrinsic protein stability remain incompletely understood. Existing models largely emphasize localized degradation signals such as degrons and PEST regions [3–5], which account for only part of the observed variability in protein lifetimes [4, 6]. The results presented here support an alternative perspective in which protein stability is associated with distributed, multi-scale sequence organization rather than being explained solely by localized motifs.

A central observation of this study is the presence of a stability-associated axis in protein sequence representation space that separates long-lived proteins (LLP) and short-lived proteins (SLP), shows consistent behavior across cell lines and species, and relates to continuous variation in protein half-life. The reproducibility of this signal across independent datasets suggests that it reflects a conserved sequence-associated property rather than a dataset-specific effect.

Decomposition of sequence representations further indicates that protein stability is not explained by amino-acid composition alone but reflects contributions from multiple sequence levels. While composition provides a baseline signal (ROC–AUC ≈ 0.64–0.70), sequence grammar (≈ 0.65–0.68) and motif features (≈ 0.71–0.73) contribute additional information, with protein language model embeddings (≈ 0.80–0.82) integrating these signals into a unified representation. This increase in predictive performance from composition and grammar to motifs and embeddings is consistent with a multi-level organization of sequence features contributing to the stability-associated signal.

Perturbation and shuffling analyses provide additional support for the role of sequence organization.
Disrupting residue order while preserving composition reduces predictive performance, and sliding-window analyses indicate that stability-associated signals are distributed across sequences rather than confined to localized motifs [27, 28]. Together, these observations are consistent with a framework in which residue arrangement contributes to stability-associated information beyond compositional effects alone.

Analysis of sequence features reveals consistent organization patterns associated with stability, including tighter lysine spacing, enhanced charge clustering, and increased proline periodicity in LLP proteins compared with SLP proteins. Perturbation analyses further support the contribution of these features, with disruption of charge organization producing the largest decrease in predicted stability. These findings suggest that electrostatic organization and residue spacing are associated with the stability signal captured by sequence representations. From a biochemical perspective, these patterns are consistent with known aspects of protein degradation, including the role of lysine residues in ubiquitination and the influence of charge distribution on protein–protein interactions [2, 11]. However, these interpretations remain inferential and require direct experimental validation.

Motif-level analysis further indicates that degron-like patterns are broadly present across proteins, whereas differences in their distribution and organization provide more discriminative information, supporting the importance of sequence context.

Functional enrichment analysis provides additional biological context, with LLP proteins enriched in chromatin organization and transcription-related processes, and SLP proteins associated with broader regulatory and metabolic functions. These patterns are consistent with the sequence-derived stability signal and suggest a link between sequence organization, protein function, and turnover.
Beyond mechanistic interpretation, these findings have broader implications. Sequence-level organization may contribute to variability in experimentally measured protein half-life across proteomics workflows, indicating that intrinsic sequence features partially shape observed turnover rates. In addition, the identified sequence grammar provides a basis for predicting protein stability directly from sequence and may inform the design of proteins with controlled degradation properties [36, 37].

This study has several limitations. The analyses rely on existing proteomics datasets and do not include direct experimental validation of the proposed mechanisms. Although the consistency of results across datasets, cell lines, and species supports their generality, future work incorporating targeted experimental perturbations will be required to directly test the functional role of the identified sequence features.

Taken together, these results support a shift from strictly motif-centric views of protein degradation toward a framework in which protein stability is associated with distributed, multi-scale sequence organization. This study illustrates how protein language model representations can be used to resolve sequence-level patterns linked to biological function and provides a foundation for further investigation of regulatory information encoded within protein sequences.

5. Conclusion

In this study, we show that protein stability is associated with distributed, multi-scale sequence organization rather than being explained solely by localized sequence motifs. Using protein language model representations, we identify a stability-associated axis that distinguishes long-lived and short-lived proteins and shows consistent behavior across datasets and species. By systematically decomposing sequence representations, we find that stability-related signals reflect contributions from amino-acid composition, local motif patterns, and higher-order sequence organization.
In particular, sequence grammar features, including lysine spacing, charge clustering, and proline periodicity, capture interpretable aspects of the stability-associated signal beyond composition alone. These findings support a framework in which protein stability is linked to the integration of sequence features across multiple scales, providing a basis for understanding variability in protein turnover directly from sequence. More broadly, this work highlights the potential of protein language model–based representations to uncover biologically meaningful principles embedded in protein sequences and to enable systematic analysis of sequence-function relationships in proteome dynamics.

Declarations

Ethics declaration
Not applicable

Competing interests
The authors declare no competing interests.

Author Contribution
Conceptualization: Tangilal Dihan Chowdhury, Md. Shabiul Islam, Fasiha Tanzeem Taiba
Methodology: Tangilal Dihan Chowdhury, Md Ushama Shafoyat, Kazy Noor e Alam Siddiquee
Software: Tangilal Dihan Chowdhury, Md Ushama Shafoyat, Maruf Hasan
Formal analysis: Tangilal Dihan Chowdhury, Maruf Hasan, Fasiha Tanzeem Taiba
Investigation: Tangilal Dihan Chowdhury, Kaiissar Mannoor
Data curation: Tangilal Dihan Chowdhury, Md Ushama Shafoyat, Maruf Hasan, Kaiissar Mannoor
Visualization: Tangilal Dihan Chowdhury, Fasiha Tanzeem Taiba
Validation: Md Ushama Shafoyat, Kaiissar Mannoor
Supervision: Md. Shabiul Islam
Project administration: Md. Shabiul Islam
Writing – original draft: Tangilal Dihan Chowdhury
Writing – review & editing: All authors
All authors read and approved the final manuscript.

Acknowledgement
We gratefully acknowledge the Department of Biomedical Engineering and the Department of Computer Science and Engineering at the Military Institute of Science and Technology, and Multimedia University, Malaysia, for their institutional support and the facilities provided during the completion of this research.
We also acknowledge the use of artificial intelligence–based language tools for assistance with grammar refinement and improvement of formal scientific writing.

Data Availability
All data generated or analyzed during this study are included in this published article and its Additional Information files. The complete datasets, protein language model (PLM) outputs, and all analysis scripts are provided in Additional Information 1–13. Additional data are available from the corresponding author upon reasonable request.

References
1. Alber AB, Suter DM. Dynamics of protein synthesis and degradation through the cell cycle. Cell Cycle. 2019;18:784–94. https://doi.org/10.1080/15384101.2019.1598725
2. Finley D. Recognition and processing of ubiquitin-protein conjugates by the proteasome. Annu Rev Biochem. 2009;78:477–513. https://doi.org/10.1146/annurev.biochem.78.081507.101607
3. Zhang Z, Mena EL, Timms RT, et al. Degrons: defining the rules of protein degradation. Nat Rev Mol Cell Biol. 2025;26:868–83. https://doi.org/10.1038/s41580-025-00870-z
4. Varshavsky A. N-degron and C-degron pathways of protein degradation. Proc Natl Acad Sci U S A. 2019;116:358–66. https://doi.org/10.1073/pnas.1816596116
5. Rogers S, Wells R, Rechsteiner M. Amino acid sequences common to rapidly degraded proteins: the PEST hypothesis. Science. 1986;234:364–8. https://doi.org/10.1126/science.2876518
6. Fishbain S, Inobe T, Israeli E, et al. Sequence composition of disordered regions fine-tunes protein half-life. Nat Struct Mol Biol. 2015;22:214–21. https://doi.org/10.1038/nsmb.2958
7. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118:e2016239118. https://doi.org/10.1073/pnas.2016239118
8. Leclercq M, Droit A. Protein language models: applications and perspectives. J Proteome Res. 2026;25:507–24. https://doi.org/10.1021/acs.jproteome.5c00506
9. Lamb KD, Hughes J, Lytras S, et al. From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2. Nat Commun. 2026. https://doi.org/10.1038/s41467-026-69569-9
10. Li J, Cai Z, Vaites LP, et al. Proteome-wide mapping of short-lived proteins in human cells. Mol Cell. 2021;81:4722–4735.e5. https://doi.org/10.1016/j.molcel.2021.09.015
11. Zhao L, Zhao J, Zhong K, et al. Targeted protein degradation: mechanisms, strategies and application. Signal Transduct Target Ther. 2022;7:113. https://doi.org/10.1038/s41392-022-00966-4
12. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204–12. https://doi.org/10.1093/nar/gku989
13. Ong SE, Blagoev B, Kratchmarova I, et al. Stable isotope labeling by amino acids in cell culture as a simple and accurate approach to expression proteomics. Mol Cell Proteom. 2002;1:376–86. https://doi.org/10.1074/mcp.M200025-MCP200
14. Cambridge SB, Gnad F, Nguyen C, et al. Systems-wide proteomic analysis in mammalian cells reveals conserved, functional protein turnover. J Proteome Res. 2011;10:5275–84. https://doi.org/10.1021/pr101183k
15. Villegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform. 2022;23:bbac142. https://doi.org/10.1093/bib/bbac142
16. Ali S, Chourasia P, Patterson M. When protein structure embedding meets large language models. Genes. 2024;15:25. https://doi.org/10.3390/genes15010025
17. Hung C, Tsai CF, Wu MH. Dimensionality reduction strategies for classification: ML versus DL approaches and their combinations. Expert Syst. 2025. https://doi.org/10.1111/exsy.70140
18. Mykhailiuk I, Schäfer K, Büskens C. Parametric stability score and its application in optimal control. IFAC-PapersOnLine. 2022. https://doi.org/10.1016/j.ifacol.2022.09.019
19. Allgaier J, Pryss R. Cross-validation visualized: a narrative guide to advanced methods. Mach Learn Knowl Extr. 2024;6:1378–88. https://doi.org/10.3390/make6020065
20. Satria A, Sitompul OS, Mawengkang H. 5-fold cross validation on supporting k-nearest neighbour accuracy of making consimilar symptoms disease classification. In: Proc 2021 Int Conf Comput Sci Eng (IC2SE). IEEE; 2021. https://doi.org/10.1109/IC2SE52832.2021.9792094
21. Xu M, Fralick D, Zheng JZ, et al. The differences and similarities between two-sample t-test and paired t-test. Shanghai Arch Psychiatry. 2017;29:184–8. https://doi.org/10.11919/j.issn.1002-0829.217070
22. Diener MJ. Cohen's d. In: The Corsini Encyclopedia of Psychology. 2010. https://doi.org/10.1002/9780470479216.corpsy0200
23. Chu SKS, Narang K, Siegel JB. Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset. PLoS Comput Biol. 2024. https://doi.org/10.1371/journal.pcbi.1012248
24. Lahitani AR, Permanasari AE, Setiawan NA. Cosine similarity to determine similarity measure: study case in online essay assessment. In: Proc 2016 Int Conf Cyber IT Serv Manag. IEEE; 2016. https://doi.org/10.1109/CITSM.2016.7577578
25. Abounaima MC, El Mazouri FZ, Lamrini L, et al. The Pearson correlation coefficient applied to compare multi-criteria methods: case the ranking problematic. In: Proc 2020 Int Conf Innov Res Appl Sci Eng Technol (IRASET). IEEE; 2020. https://doi.org/10.1109/IRASET48871.2020.9092242
26. Blakeley-Ruiz JA, Kleiner M. Considerations for constructing a protein sequence database for metaproteomics. Comput Struct Biotechnol J. 2022;20:937–52. https://doi.org/10.1016/j.csbj.2022.01.018
27. Jiang M, Anderson J, Gillespie J, et al. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics. 2008;9:192. https://doi.org/10.1186/1471-2105-9-192
28. Mokhtari F, Akhlaghi MI, Simpson SL, et al. Sliding window correlation analysis: modulating window shape for dynamic brain connectivity in resting state. NeuroImage. 2019;189:655–66. https://doi.org/10.1016/j.neuroimage.2019.02.001
29. Shi WH, Marciel AB. Influence of charge block length on conformation and cluster formation of atactic peptide polyampholytes. Macromolecules. 2026. https://doi.org/10.1021/acs.macromol.5c02873
30. Palaniappan A, Jakobsson E. Fourier analysis of conservation patterns in protein secondary structure. Comput Struct Biotechnol J. 2017;15:265–70. https://doi.org/10.1016/j.csbj.2017.02.002
31. Dosztányi Z. Prediction of protein disorder based on IUPred. Protein Sci. 2018;27:331–40. https://doi.org/10.1002/pro.3334
32. Chicco D, Sichenze A, Jurman G. A simple guide to the use of Student's t-test, Mann–Whitney U test, chi-squared test, and Kruskal–Wallis test in biostatistics. BioData Min. 2025;18:56. https://doi.org/10.1186/s13040-025-00465-6
33. Nahm FS. Receiver operating characteristic curve: overview and practical use for clinicians. Korean J Anesthesiol. 2022;75:25–36. https://doi.org/10.4097/kja.21209
34. Reimand J, Arak T, Adler P, Kolberg L, Reisberg S, Peterson H, Vilo J. g:Profiler - a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Res. 2016;44(W1):W83–9. https://doi.org/10.1093/nar/gkw199
35. Chagoyen M, Pazos F. Quantifying the biological significance of gene ontology biological processes - implications for the analysis of systems-wide data. Bioinformatics. 2010;26(3):378–84. https://doi.org/10.1093/bioinformatics/btp663
36. Holman SW, Hammond DE, Simpson DM, et al. Protein turnover measurement using selected reaction monitoring-mass spectrometry (SRM-MS). Philos Trans Math Phys Eng Sci. 2016;374:20150362. https://doi.org/10.1098/rsta.2015.0362
37. McGinness KE, Baker TA, Sauer RT. Engineering controllable protein degradation. Mol Cell. 2006;22:701–7. https://doi.org/10.1016/j.molcel.2006.04.027
Supplementary Files
Tables.docx, SupplementaryFile1.zip, SupplementaryFile2.zip, SupplementaryFile3.zip, SupplementaryFile4.zip, SupplementaryFile5.zip, SupplementaryFile6.zip, SupplementaryFile7.zip, SupplementaryFile8.zip, SupplementaryFile9.zip, SupplementaryFile10.zip, SupplementaryFile11.zip, SupplementaryFile12.zip, SupplementaryFile13.zip

Editorial timeline (Version 1, status: Under Review)
Reviewers agreed at journal: 04 May, 2026
Reviewers invited by journal: 29 Apr, 2026
Editor invited by journal: 14 Apr, 2026
Editor assigned by journal: 07 Apr, 2026
Submission checks completed at journal: 07 Apr, 2026
First submitted to journal: 02 Apr, 2026
Authors and affiliations
Tangilal Dihan Chowdhury, Fasiha Tanzeem Taiba, Md Ushama Shafoyat, Maruf Hasan, Kazy Noor e Alam Siddiquee, Kaiissar Mannoor (Military Institute of Science and Technology); Md. Shabiul Islam (Multimedia University; corresponding author).

Figure legends

Figure 1. Cross–cell line validation performance of the stability prediction model. ROC curves showing model performance when trained on Cell 1 and tested on independent datasets: (A) Cell 2 (AUC = 0.802), (B) Cell 3 (AUC = 0.811), and (C) Cell 4 (AUC = 0.817). Density distributions of predicted stability scores for LLP and SLP peptides are shown in (D–F) for the corresponding validation sets, illustrating clear separation between the two classes.

Figure 2. Stability phase-space distribution across cell lines. Density plots showing the distribution of proteins in stability phase space defined by stability score (SSS) and intrinsic disorder fraction for (a) Cell 1, (b) Cell 2, (c) Cell 3, and (d) Cell 4. The dashed vertical line indicates the stability score threshold (SSS = 0). The consistent distributions across cell lines illustrate the relationship between predicted stability and disorder content.

Figure 3. Charge clustering characteristics of LLP and SLP proteins across cell lines. (A) Relationship between negatively charged residue clusters (DE_max_cluster) and lysine clusters (K_max_cluster). (B) Association between charge blockiness and DE_max_cluster. (C) Relationship between charge blockiness and K_max_cluster. (D) Legend indicating LLP and SLP classes and point size scaling by protein count. Each panel shows results for the four cell lines, highlighting differences in charge clustering patterns between LLP and SLP proteins.

Figure 4. Proline distribution and periodicity patterns in stable and unstable proteins. (A–D) Violin plots showing the distribution of proline periodicity scores for LLP and SLP proteins across the four cell lines. (E–H) Position-normalized proline density profiles along protein sequences for LLP and SLP proteins.

Figure 5. Lysine spacing patterns in stable and unstable proteins. (A–D) Violin plots comparing average lysine spacing between LLP and SLP proteins across the four cell lines. (E–H) Position-normalized lysine spacing profiles along protein sequences.

Figure 6. Conservation and generalization of the stability axis across cell lines and species. (A) Network representation of cosine similarity between stability axes derived from LLP and SLP proteins across four cell lines, showing strong similarity (0.791–0.908). (B) Heatmap of cross–cell line stability prediction performance using all proteins, with Pearson correlations ranging from 0.412 to 0.467. (C) Cross-species protein half-life prediction performance showing generalization of the stability signal between mouse and human datasets (Pearson r = 0.359 for mouse → human and r = 0.465 for human → mouse).

Figure 7. Effect of sequence shuffling on stability prediction across cell lines. Receiver operating characteristic (ROC) curves comparing model performance using original protein sequences (blue; embedding-based model) and shuffled sequences (orange) for (A) Cell 1, (B) Cell 2, (C) Cell 3, and (D) Cell 4. Shuffling preserves amino acid composition but disrupts residue order. Across all datasets, predictive performance decreases substantially after shuffling (AUC ≈ 0.64–0.68) compared to original sequences (AUC ≈ 0.80–0.82), indicating that protein stability signals depend strongly on sequence organization rather than composition alone. The diagonal dashed line represents random classification.

Figure 8. Functional enrichment of long-lived and short-lived proteins. (A) Gene Ontology (GO) enrichment analysis of long-lived proteins (LLPs), showing significant enrichment in transcriptional regulation, RNA metabolic processes, and chromatin organization. (B) Enrichment analysis of short-lived proteins (SLPs), highlighting broader regulatory and metabolic processes. Bars represent −log10(p-value), indicating the significance of enrichment.

Figure 9. Multi-scale sequence determinants of protein stability and their integration in protein language model embeddings. (A) Schematic representation of average sequence organization patterns in long-lived proteins (LLP) and short-lived proteins (SLP), highlighting differences in proline periodicity, lysine spacing, and charge clustering. (B) Comparison of predictive performance across sequence representations, showing increasing accuracy from amino acid composition to motif features and sequence grammar, with protein language model embeddings achieving the highest performance, indicating integration of multi-scale sequence information. (C) Projection onto the stability axis derived from embeddings separates LLP and SLP proteins, demonstrating distinct stability score distributions. (D) Validation analyses showing that stability signals depend on sequence organization: shuffling reduces predictive performance, stability signals are distributed across sequences, and perturbation of sequence grammar features decreases predicted stability.
(E) Conceptual model illustrating how sequence features and multi-scale sequence grammar contribute to stability-associated sequence properties and overall protein stability.\u003c/p\u003e","description":"","filename":"Figure9.png","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/0bfde734294f116339b69d70.png"},{"id":108979691,"identity":"b5ebb8e7-0889-4540-8d9f-569bc3a7e87d","added_by":"auto","created_at":"2026-05-11 12:00:48","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":10560000,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/b3989661-2772-41f1-912a-44d80a0cf134.pdf"},{"id":108808152,"identity":"48900268-805b-4ac9-90d3-280e017c181b","added_by":"auto","created_at":"2026-05-08 15:40:07","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":24554,"visible":true,"origin":"","legend":"","description":"","filename":"Tables.docx","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/7dae51321e5f5a0ba1af7e8f.docx"},{"id":108976708,"identity":"66ee733e-0127-449f-ac23-23e1168f8758","added_by":"auto","created_at":"2026-05-11 11:28:21","extension":"zip","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":35763,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile13.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/febbbbcf957515350b577c61.zip"},{"id":108809798,"identity":"ea8e562c-27ee-4dd3-80f7-6dd6269dc229","added_by":"auto","created_at":"2026-05-08 
15:55:38","extension":"zip","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":1782,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile9.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/6fe83fabd036d78bfd79c342.zip"},{"id":108808166,"identity":"15997dd8-e1ee-49e5-ac55-2f5378a469c5","added_by":"auto","created_at":"2026-05-08 15:40:11","extension":"zip","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":4808,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile4.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/69b810a586987f902ee032f0.zip"},{"id":108808163,"identity":"5d11f46f-9972-423c-8af2-3566c06fbe17","added_by":"auto","created_at":"2026-05-08 15:40:10","extension":"zip","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":6091,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile8.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/19e0e891998276d89999dfcc.zip"},{"id":108809524,"identity":"3dd2de06-4af0-4d54-af4c-b13a03eec883","added_by":"auto","created_at":"2026-05-08 15:53:25","extension":"zip","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":3283,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile10.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/8e3f5716e664934aaf2d16c6.zip"},{"id":108809740,"identity":"8fb58ba8-da1f-4ee5-8b83-09659aec7c49","added_by":"auto","created_at":"2026-05-08 
15:55:14","extension":"zip","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":2557,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile12.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/87eb8a61c1b66a468867dcdd.zip"},{"id":108808053,"identity":"e5a136fa-1dae-4729-9dc0-c78f2919e9cf","added_by":"auto","created_at":"2026-05-08 15:39:33","extension":"zip","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":3782,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile6.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/601a6c7ca071512f178bc985.zip"},{"id":108808168,"identity":"3f7d9549-c7c8-4e55-b543-47a3bfe4c513","added_by":"auto","created_at":"2026-05-08 15:40:11","extension":"zip","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":5791,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile5.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/a0cf8bc2a80a35e7fad3aa0e.zip"},{"id":108808158,"identity":"14f6cfd7-061a-4f67-b72b-99981076f644","added_by":"auto","created_at":"2026-05-08 15:40:08","extension":"zip","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":1583198,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile1.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/125b25e0fe8f054f8a452437.zip"},{"id":108808162,"identity":"ac96f00f-0f8d-4dbf-81b3-b8a6699f5d6c","added_by":"auto","created_at":"2026-05-08 
15:40:10","extension":"zip","order_by":10,"title":"","display":"","copyAsset":false,"role":"supplement","size":2902158,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile3.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/bf796ff6b906c18cf01b1cbd.zip"},{"id":108808203,"identity":"c6c26bc5-e5ec-4665-8ab4-537c8f73f626","added_by":"auto","created_at":"2026-05-08 15:40:25","extension":"zip","order_by":11,"title":"","display":"","copyAsset":false,"role":"supplement","size":428827,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile11.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/c25416feeee4b8e5d006e3ce.zip"},{"id":108808052,"identity":"6a1542b8-d362-443c-9714-35a0209dcbc3","added_by":"auto","created_at":"2026-05-08 15:39:33","extension":"zip","order_by":12,"title":"","display":"","copyAsset":false,"role":"supplement","size":51475497,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile2.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/6f89e9bb7a650dca39187bc9.zip"},{"id":108810087,"identity":"3a27e90d-cb6d-47c8-a73a-78ef6814c141","added_by":"auto","created_at":"2026-05-08 15:57:22","extension":"zip","order_by":13,"title":"","display":"","copyAsset":false,"role":"supplement","size":51400829,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFile7.zip","url":"https://assets-eu.researchsquare.com/files/rs-9298535/v1/2296e526cf5fe202258b2816.zip"}],"financialInterests":"No competing interests reported.","formattedTitle":"Multi-Scale Sequence Encoding Distinguishes Long-Lived and Short-Lived Proteins Revealed by Protein Language Model Embeddings","fulltext":[{"header":"1. 
Introduction","content":"\u003cp\u003eProteins within cells are continuously synthesized and degraded, and the balance between these processes determines protein lifetime, a key regulator of signaling, metabolism, and protein quality control [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. A central pathway governing targeted degradation is the ubiquitin\u0026ndash;proteasome system, which selectively labels proteins for destruction [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Despite extensive research, predicting protein lifetime directly from amino-acid sequence remains a major challenge in molecular biology and computational proteomics.\u003c/p\u003e \u003cp\u003eAlthough proteomics approaches enable large-scale measurement of protein turnover, they do not explain the substantial variability in half-life observed across the proteome. In particular, it remains unclear to what extent intrinsic sequence features contribute to degradation behavior. Existing studies have identified sequence elements such as degron motifs, PEST regions, and compositional biases associated with protein stability [\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. However, these features do not fully account for observed variability: many unstable proteins lack known degrons, while many proteins containing such motifs remain stable [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. 
This discrepancy highlights an open question as to whether protein stability is primarily associated with localized sequence motifs or with broader, distributed patterns of residue organization across entire sequences.\u003c/p\u003e \u003cp\u003eRecent advances in protein language models (PLMs) provide a framework to investigate this question. Trained on large-scale sequence data, PLMs capture complex statistical relationships between amino acids and encode diverse structural and functional properties [\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. However, whether these representations capture stability-associated signals, and how such signals are organized within sequences, remains incompletely understood.\u003c/p\u003e \u003cp\u003eHere, we examine this problem using experimentally derived half-life datasets from four human cell lines [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. We identify a stability-associated axis in PLM embedding space that separates long-lived proteins (LLP) from short-lived proteins (SLP) and shows consistent behavior across datasets and species. By systematically decomposing sequence representations, we find that protein stability is not explained by amino-acid composition alone but reflects contributions from multiple sequence levels, including composition, sequence grammar, and local motifs. Disrupting residue order reduces predictive performance, supporting a contribution of residue organization beyond composition alone.\u003c/p\u003e \u003cp\u003eAnalysis of sequence features reveals consistent organization patterns associated with stability, including tighter lysine spacing, enhanced charge clustering, and increased proline periodicity in LLP proteins relative to SLP proteins. 
Perturbation analyses further support the contribution of these features, with disruption of charge organization producing the largest decrease in predicted stability.\u003c/p\u003e \u003cp\u003eTogether, these results support a model in which protein stability is associated with distributed, multi-scale sequence organization rather than isolated motifs. This framework provides a basis for interpreting variability in proteome-wide turnover and highlights the potential of PLM-based representations to uncover biologically meaningful sequence principles underlying protein stability [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e"},{"header":"2. Materials and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Protein Datasets and Sequence Processing\u003c/h2\u003e \u003cp\u003eProtein half-life data were obtained from a large-scale quantitative proteomics study [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] that systematically mapped short-lived proteins in human cells using cycloheximide-chase assays combined with multiplexed mass spectrometry. The dataset contains measurements from four human cell lines (U2OS, HEK293T, HCT116, and RPE1). Each cell line dataset was curated by removing redundant protein sequences to minimize similarity bias. For stability classification, proteins were categorized by half-life as long-lived (LLP; 25\u0026ndash;32 h) or short-lived (SLP; 0.2\u0026ndash;2.5 h). For each cell line, 300 LLP and 300 SLP proteins were selected to construct balanced datasets (600 proteins per cell line; 2400 proteins total across the four cell lines). 
For continuous stability prediction analyses, the full half-life datasets (~\u0026thinsp;1500 proteins per cell line; ~6000 proteins total) were used, spanning a wide range of degradation rates.\u003c/p\u003e \u003cp\u003eProtein sequences were retrieved in FASTA format from UniProt [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and used for subsequent computational analyses. To assess stability-axis similarity and cross-dataset generalization, an independent turnover dataset obtained using Stable Isotope Labeling by Amino Acids in Cell Culture mass spectrometry [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] was also analyzed. This dataset includes half-life measurements for 4,106 proteins in the human HeLa cell line and 3,528 proteins in the mouse C2C12 cell line (7,634 proteins total). Because experimental methodologies differ substantially between cycloheximide-chase and SILAC approaches [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], the datasets were analyzed separately to avoid methodological bias in half-life estimates. Overall, the analyses include more than 13,000 proteins across multiple independent proteomic datasets. All datasets, including half-life measurements and FASTA sequences, are provided in \u003cb\u003eSupplementary File 1\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Protein Language Model Embeddings\u003c/h2\u003e \u003cp\u003eProtein sequences were encoded using the ESM-2 protein language model with 650\u0026nbsp;million parameters (ESM2-t33-650M) [\u003cspan additionalcitationids=\"CR7\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
For each protein sequence, token-level representations were extracted from the final (33rd) transformer layer [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Residue-level embeddings [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] were averaged across all sequence positions to produce a single fixed-length vector representation for each protein. If E\u003csub\u003ei\u003c/sub\u003e represents the embedding vector of residue i and L is the sequence length, the final protein embedding (X) was computed as:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$X=\\frac{1}{L}\\sum_{i=1}^{L}E_{i}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Dimensionality Reduction\u003c/h2\u003e \u003cp\u003eBecause the original embedding vectors are high-dimensional, dimensionality reduction was performed prior to modeling [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. First, all features were standardized using z-score normalization:\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$Z=\\frac{X-\\mu}{\\sigma}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere X represents the original embedding feature, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\mu\\)\u003c/span\u003e\u003c/span\u003e is the feature mean, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\sigma\\)\u003c/span\u003e\u003c/span\u003e is the feature standard deviation. Principal component analysis (PCA) was then applied to the normalized embeddings to capture the major sources of variation in the data. 
The top 30 principal components (PCs) were retained and used as the feature space for all downstream analyses. The number of principal components (30) was selected based on the cumulative explained variance (81.7%). The embedding features and the Python script for extracting them from the PLM (ESM-2) for all datasets are provided in \u003cb\u003eSupplementary File 2\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Stability Score Computation and LLP\u0026ndash;SLP Classification\u003c/h2\u003e \u003cp\u003eProtein stability was inferred from protein language model embeddings [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] using a logistic regression classifier trained on PCA-transformed embeddings (top 30 PCs). Logistic regression was implemented using scikit-learn, with L2 regularization (C\u0026thinsp;=\u0026thinsp;1). The model estimates the probability that a protein belongs to the LLP class as:\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$P(y=1|X)=\\frac{1}{1+e^{-(w.X+b)}}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere X is the PCA feature vector, w is the weight vector, and b is the intercept. All preprocessing steps, including normalization and PCA, were performed within each training fold to prevent data leakage. Model performance was evaluated using ROC\u0026ndash;AUC. The classifier defines a linear stability axis in embedding space. 
Each protein was projected onto this direction to compute a Stability Score (SSS) [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]:\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$SSS=w.X+b$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eHigher SSS values correspond to proteins predicted to be more stable. To avoid overfitting, SSS values were computed using out-of-fold predictions during cross-validation [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. To assess potential protein length bias, ROC\u0026ndash;AUC values were compared between the original and length-matched datasets using five-fold cross-validation [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e], with differences evaluated using a paired t-test [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] and Cohen\u0026rsquo;s d [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. The datasets and the Python scripts for these analyses are provided in \u003cb\u003eSupplementary File 3\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Sequence shuffling and sliding-window analysis\u003c/h2\u003e \u003cp\u003eTo evaluate whether stability prediction depends on sequence order rather than amino acid composition alone, protein sequences were randomly shuffled while preserving composition [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. 
Embeddings were recomputed from the shuffled sequences and classification performance was reassessed; a decrease in performance suggests that residue order contributes to the stability signal.\u003c/p\u003e \u003cp\u003eTo determine whether stability signals arise from localized motifs or distributed sequence patterns, a sliding-window analysis was performed [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. Each protein was divided into overlapping 25-residue windows; an embedding was computed for each window and projected onto the stability axis. For each protein, the maximum, mean, and variance of local stability scores were calculated to assess whether stability signals are concentrated in specific regions or distributed across the sequence. All Python analysis scripts are provided in \u003cb\u003eSupplementary File 6\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.6 Sequence grammar and charge clustering analysis\u003c/h2\u003e \u003cp\u003eSequence grammar features were analyzed for long-lived proteins (LLP) and short-lived proteins (SLP) across four cell lines using protein sequences obtained from FASTA files and custom Python scripts. Charge clustering features included lysine cluster size (K_max_cluster) and acidic cluster size (DE_max_cluster), defined as the longest consecutive runs of K or D/E residues, respectively, within a sequence. The number of lysine clusters (K_cluster_count) was calculated as the total number of contiguous lysine segments, while charge blockiness was measured as the fraction of adjacent charged residues sharing the same charge sign [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAdditional interpretable sequence grammar features were computed to investigate stability-related organization patterns. 
These included lysine spacing (mean distance between consecutive lysines), proline periodicity estimated using Fourier-based spectral analysis [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], PEST enrichment [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], proline-rich fraction, and low-complexity regions identified using sliding-window scans [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Disorder positioning was calculated from IUPred scores [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] as the difference between C-terminal and N-terminal disorder fractions [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Feature distributions between LLP and SLP proteins were compared using the Mann\u0026ndash;Whitney U test [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], and mean values were calculated for each cell line. The IUPred scores (.result files) for all datasets and the Python scripts for sequence grammar and charge clustering analysis are provided in \u003cb\u003eSupplementary File 7\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e2.7 Feature-specific perturbation analysis\u003c/h2\u003e \u003cp\u003eFeature-specific perturbation analysis was performed to evaluate the contribution of sequence organization features identified in the study. For each protein sequence, selected sequence organization features were disrupted by residue substitution while preserving overall sequence length and amino acid composition where applicable. Perturbed sequences were re-embedded using the same protein language model, and stability scores were recomputed using the previously defined stability axis. 
Changes in stability were quantified as the difference between perturbed and original scores (Δ stability score, with negative values indicating reduced predicted stability), and mean effects were calculated across all proteins. The Python script for this analysis is provided in \u003cb\u003eSupplementary File 9\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e2.8 Axis similarity and continuous half-life prediction\u003c/h2\u003e \u003cp\u003eTo assess whether the stability-associated signal [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e] captured by protein language model embeddings is conserved across biological contexts, stability axes derived from LLP and SLP proteins were compared across cell lines. Axis similarity was quantified using cosine similarity between logistic regression weight vectors [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], where higher similarity indicates a conserved direction separating stable and unstable proteins in embedding space. Generalization of this stability axis was further evaluated using full proteome datasets. Models trained on one cell line were used to predict stability in other cell lines, and predictions were compared with experimental half-life measurements using Pearson correlation [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. Cross-experiment and cross-species generalization were also tested using independent mouse and human protein turnover datasets [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], which were analyzed separately due to differences in experimental protocols and half-life distributions. 
To examine whether embeddings capture continuous degradation rates, linear regression models were used to predict log-transformed protein half-life:\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$$\\widehat{y}=w.X+b$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere X represents the PCA-transformed embedding vector, w is the weight vector, b is the intercept, and y\u0026thinsp;=\u0026thinsp;log(half-life\u0026thinsp;+\u0026thinsp;1). Model performance was evaluated using Pearson correlation, Spearman correlation, and the coefficient of determination (R\u0026sup2;). The Python scripts for these experiments are provided in \u003cb\u003eSupplementary File 4\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e2.9 Interpretation of embedding components and sequence feature correlations\u003c/h2\u003e \u003cp\u003eTo interpret the embedding dimensions contributing to stability prediction, principal components with the largest logistic regression weights were identified and analyzed. Correlations were computed between these components and sequence-derived properties [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e], including protein length, lysine density, charge density, and hydrophobic residue fraction, enabling identification of sequence features underlying the stability grammar encoded in the embeddings. In addition, correlations were calculated between the stability score (SSS) and several sequence-level features across cell lines. Pearson correlations were computed for protein length, lysine density, and intrinsic disorder fraction to evaluate whether basic sequence properties contribute to the embedding-derived stability signal. 
Finally, correlations with structural sequence properties, including the hydrophobic fraction and the content of helix-propensity and beta-sheet-propensity residues, were examined to assess whether classical structural features explain the observed stability patterns. All Python analysis scripts are provided in \u003cb\u003eSupplementary File 5\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e2.10 Grammar-based classification and perturbation analysis\u003c/h2\u003e \u003cp\u003eTo evaluate whether protein stability signals can be captured using interpretable sequence features, a classification model was trained using sequence grammar\u0026ndash;derived features, including residue periodicity, clustering, spacing, and related metrics. Model performance was evaluated using ROC\u0026ndash;AUC [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e], and the approach was further tested on an independent dataset containing human and mouse proteins measured under the same experimental protocol. In this dataset, LLP and SLP classes were defined using the top and bottom 5% of the half-life distribution, and ROC\u0026ndash;AUC and precision were used to assess the ability of grammar features to distinguish stability classes across species.\u003c/p\u003e \u003cp\u003eTo test whether sequence organization contributes to stability prediction, in silico grammar perturbation experiments were performed. For each LLP sequence, residues at regular intervals (every 10th position) were randomly shuffled, disrupting periodic sequence organization while largely preserving amino acid composition and sequence length. Embeddings were recomputed for the perturbed sequences, and predicted stability values were compared with those of the original sequences using paired t-tests to determine whether disruption of sequence organization affects predicted stability. 
The Python scripts for grammar-based classification (Cells 1\u0026ndash;4, human external, and mouse datasets) and for the permutation analysis are provided in \u003cb\u003eSupplementary File 8\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e2.11 Functional enrichment analysis\u003c/h2\u003e \u003cp\u003eFunctional enrichment analysis was performed to identify biological processes associated with long-lived proteins (LLP) and short-lived proteins (SLP). Protein sequences were parsed from FASTA files, and gene names were extracted from UniProt headers (GN field). Gene lists were analyzed using the g:Profiler tool [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e] (organism: \u003cem\u003eHomo sapiens\u003c/em\u003e) to identify enriched Gene Ontology biological process (GO:BP) terms [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. Enrichment results were filtered at p\u0026thinsp;\u0026lt;\u0026thinsp;0.05, and top-ranking terms were used for interpretation and visualization. The analysis results are provided in \u003cb\u003eSupplementary File 10\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e2.12 Motif enrichment and motif-based classification\u003c/h2\u003e \u003cp\u003eAll possible tripeptide motifs (20\u0026sup3; = 8000) were scanned across LLP and SLP protein sequences. Motif frequencies were computed for each sequence and compared between classes using the Mann\u0026ndash;Whitney U test, with effect sizes estimated using Cohen\u0026rsquo;s d. The 20 motifs with the largest effect sizes were selected as features for classification. A logistic regression model using these motif frequencies was trained to distinguish LLP and SLP proteins, and performance was evaluated using five-fold cross-validation with ROC\u0026ndash;AUC and precision. 
To test whether global sequence organization contributes additional predictive information, the motif features were combined with previously defined sequence grammar features, and model performance was re-evaluated. All datasets and Python scripts are provided in \u003cb\u003eSupplementary File 11\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e2.13 Sequence representation benchmarking\u003c/h2\u003e \u003cp\u003eAmino acid composition features were computed as the frequency of each of the 20 standard residues in a protein sequence. Stability prediction was performed using logistic regression to classify long-lived (LLP) and short-lived (SLP) proteins, and performance was evaluated using ROC\u0026ndash;AUC with five-fold stratified cross-validation. Models based on sequence grammar, motif features, and protein language model embeddings were constructed as described above. These representations were compared to assess their relative contribution to protein stability prediction. The Python scripts for this analysis are provided in \u003cb\u003eAdditional Information 12\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Results","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Identification of a Stability Axis in Protein Language Model Embeddings\u003c/h2\u003e \u003cp\u003eProjection of protein embeddings revealed clear separation between LLP and SLP proteins along a derived stability score (SSS) axis (\u003cb\u003eSupplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003eA\u0026ndash;C\u003c/b\u003e). LLP proteins showed substantially higher scores (mean SSS\u0026thinsp;=\u0026thinsp;0.8817) than SLP proteins (\u0026minus;\u0026thinsp;0.9317), with cross-validated means of 0.8199 and \u0026minus;\u0026thinsp;0.7749, respectively (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
The separation was highly significant (Mann\u0026ndash;Whitney p\u0026thinsp;=\u0026thinsp;3.84 \u0026times; 10⁻\u0026sup3;⁵; validated p\u0026thinsp;=\u0026thinsp;1.35 \u0026times; 10⁻\u0026sup2;⁶) with a large effect size (Cohen\u0026rsquo;s d\u0026thinsp;=\u0026thinsp;1.345; validated d\u0026thinsp;=\u0026thinsp;1.118) (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Cross\u0026ndash;cell line validation showed consistent performance when the model trained on Cell1 was applied to other datasets, achieving ROC\u0026ndash;AUC values of 0.802, 0.811, and 0.817 for Cell2\u0026ndash;Cell4, respectively (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e;\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA\u0026ndash;F). Control analysis confirmed that predictions were not driven by protein length. Model performance remained unchanged after length matching (AUC\u0026thinsp;=\u0026thinsp;0.781 vs 0.787), with no significant difference (t\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.363, p\u0026thinsp;=\u0026thinsp;0.735; Cohen\u0026rsquo;s d\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.18) (\u003cb\u003eSupplementary Figure \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003eA\u0026ndash;B; Supplementary Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e\u003c/b\u003e). 
These results indicate that protein language model embeddings capture a stability-associated direction that separates LLP and SLP proteins and generalizes across cell lines, independent of sequence length.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStability score comparison between LLP and SLP proteins.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGroup\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMean SSS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCross-validated-mean SSS\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.881652634863405\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.8199233676051194\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;0.9317443608705182\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;0.7749137369997359\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eStatistical significance of stability score\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMann\u0026ndash;Whitney p-value\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3.835\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;35\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.346 \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;26\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCohen\u0026rsquo;s d\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1.345\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.118\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Stability Signals Depend on Sequence Organization\u003c/h2\u003e \u003cp\u003eSequence shuffling markedly reduced classification performance across all datasets (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). While the original models achieved AUC\u0026thinsp;=\u0026thinsp;0.802\u0026ndash;0.823, shuffled sequences produced substantially lower values (0.615\u0026ndash;0.639). 
Because amino acid composition was preserved during shuffling, this reduction suggests that residue order contributes to the stability-associated signal. Sliding-window analysis further showed that stability signals are distributed across the protein sequence rather than confined to short motifs (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). In Cell1, LLP proteins exhibited much higher maximum local stability scores (2.78 vs 0.69, p\u0026thinsp;=\u0026thinsp;1.71\u0026times;10⁻\u0026sup1;⁶) and mean local scores (0.758 vs\u0026thinsp;\u0026minus;\u0026thinsp;1.149, p\u0026thinsp;=\u0026thinsp;6.61\u0026times;10⁻\u0026sup1;⁸) than SLP proteins, while the variance of local scores was similar (1.12 vs 1.07, p\u0026thinsp;=\u0026thinsp;0.426); Cell2\u0026ndash;Cell4 showed the same pattern. Together, these results indicate that protein stability is associated with distributed sequence organization, rather than being explained by amino acid composition alone or localized motifs.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eEffect of sequence shuffling\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell Line\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOriginal AUC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eShuffled AUC\u003c/p\u003e 
\u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.823\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.632\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.802\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.639\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.811\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.621\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.817\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.615\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStability window analysis\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" 
class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFeature\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLLP Mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSLP Mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ep-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.7777\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.6943\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.71\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;16\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003evar_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.1195\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.0701\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.426\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emean_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.758\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-1.1489\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e6.608\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;18\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.491\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.5311\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e9.351\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;16\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003evar_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.2108\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.2556\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.50708\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003emean_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.5514\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-1.2764\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e2.118\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;17\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.819\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.6416\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e7.045\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;18\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003evar_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.1254\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.2881\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.783\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emean_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.7754\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-1.244\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4.009\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;19\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.6625\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.1451\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5.232\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;11\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003evar_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.0909\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.1901\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.995\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emean_local_score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.7105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.7897\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003e4.296\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;19\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Sequence-Level Correlates of the Stability Axis\u003c/h2\u003e \u003cp\u003eDecoding of the dominant embedding components revealed a consistent charge\u0026ndash;hydrophobic stability grammar across cell lines (\u003cb\u003eSupplementary Tables S5\u0026ndash;S6\u003c/b\u003e). The stability signal was primarily associated with lysine density, charge distribution, and hydrophobic composition, rather than sequence length. The strongest relationships with charge organization were observed in Cell3 (PC7: lysine density r\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.450; charge density r\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.324), while hydrophobic fraction contributed positively in components such as Cell1 PC17 (r\u0026thinsp;=\u0026thinsp;0.246) and Cell4 PC26 (r\u0026thinsp;=\u0026thinsp;0.142). Direct correlations between sequence features and the stability score supported this interpretation (Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Stability scores showed weak-to-moderate associations with lysine density (r\u0026thinsp;=\u0026thinsp;0.070\u0026ndash;0.319) and protein length (r\u0026thinsp;=\u0026thinsp;0.268\u0026ndash;0.329), whereas intrinsic disorder showed weak correlations (r\u0026thinsp;=\u0026thinsp;\u0026minus;\u0026thinsp;0.086 to 0.171) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA\u0026ndash;D). 
Correlations with predicted structural propensities were generally small (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM7\" class=\"InternalRef\"\u003eS7\u003c/span\u003e\u003c/b\u003e). Together, these results suggest that the embedding-derived stability axis is associated with sequence-level organization of charged and hydrophobic residues, rather than classical structural features or simple compositional effects.\u003c/p\u003e \u003cp\u003eFrom a biological perspective, these features are consistent with known properties associated with protein degradation. Lysine residues serve as primary ubiquitin attachment sites, and their density and spatial distribution may influence modification patterns. Similarly, clustering of charged residues may affect local electrostatic environments relevant to protein\u0026ndash;protein interactions. Hydrophobic composition may influence folding stability and exposure of sequence features. While not directly measuring degradation mechanisms, these observations indicate that the stability axis captures biologically meaningful sequence properties rather than purely abstract embedding patterns.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCorrelation between stability score and sequence features across cell lines\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e 
\u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell Line\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLength (r)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLysine Density (r)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eDisorder Fraction (r)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.326\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.180\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u0026minus;0.086\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.329\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.070\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u0026minus;0.040\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.268\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.319\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.012\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e \u003cp\u003e0.278\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.216\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.171\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Sequence Grammar of Long-Lived and Short-Lived Proteins\u003c/h2\u003e \u003cp\u003eCharge clustering analysis revealed consistent differences between LLP and SLP proteins across all cell lines (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e; Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA\u0026ndash;D). LLP proteins showed larger lysine clusters (K_max_cluster\u0026thinsp;=\u0026thinsp;2.18\u0026ndash;2.41 vs 1.94\u0026ndash;2.06) and larger acidic clusters (DE_max_cluster\u0026thinsp;=\u0026thinsp;3.86\u0026ndash;4.08 vs 3.26\u0026ndash;3.34), together with substantially more lysine clusters overall (K_cluster_count\u0026thinsp;=\u0026thinsp;44.01\u0026ndash;47.35 vs 28.33\u0026ndash;33.68) (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;10⁻⁴\u0026ndash;10⁻\u0026sup1;\u0026sup1;). In contrast, charge blockiness showed no significant differences between LLP and SLP proteins. These results indicate that the stability signal is associated with enhanced charge clustering in long-lived proteins.\u003c/p\u003e \u003cp\u003eAdditional sequence analysis further highlighted differences in lysine spacing and proline periodicity (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). 
LLP proteins showed shorter lysine spacing (15.27\u0026ndash;17.62 vs 17.27\u0026ndash;20.20) and higher proline periodicity (15.14\u0026ndash;16.08 vs 10.73\u0026ndash;11.94) across cell lines, whereas features such as PEST score, proline-rich fraction, and low-complexity fraction showed no significant differences. The distributions of these patterns are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA\u0026ndash;H and Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA\u0026ndash;H.\u003c/p\u003e \u003cp\u003eAnalysis of degron-like sequence patterns indicated that such motifs are broadly present in both LLP and SLP proteins. However, LLP proteins exhibited a modest but significant increase in motif density compared to SLP proteins (mean density: 0.027 vs 0.024; Mann\u0026ndash;Whitney U test, p\u0026thinsp;=\u0026thinsp;2.48 \u0026times; 10⁻\u0026sup1;\u0026sup1;) (\u003cb\u003eSupplementary Fig.\u0026nbsp;4\u003c/b\u003e), suggesting that differences in stability are associated with motif distribution rather than simple presence.\u003c/p\u003e \u003cp\u003eTogether, these results suggest that protein stability is associated with distributed residue organization, where LLP proteins exhibit stronger lysine and charge clustering together with periodic proline organization, whereas SLP proteins display larger lysine spacing and weaker clustering patterns.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStability-associated sequence organization features across four cell lines\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" 
colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCategory\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCell\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFeature\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLLP Mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSLP Mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ep-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"15\" rowspan=\"16\"\u003e \u003cp\u003e\u003cb\u003eCharge Clustering\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.9467\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e4.00 \u0026times; 10⁻⁵\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003eDE_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.8567\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3.30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.88 \u0026times; 10⁻⁴\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_cluster_count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e47.3467\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e28.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3.14 \u0026times; 10⁻⁹\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003echarge_blockiness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.5514\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.5629\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.263\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.9433\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e2.00 \u0026times; 10⁻⁴\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003eDE_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.08\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3.26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.64 \u0026times; 10⁻⁶\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_cluster_count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e45.1067\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e29.22\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e4.06 \u0026times; 10⁻⁸\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003echarge_blockiness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.5563\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.5550\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2.0033\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e2.82 \u0026times; 10⁻⁶\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDE_max_cluster\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.9367\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e2.20 \u0026times; 10⁻⁶\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_cluster_count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e46.0167\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e30.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e2.02 \u0026times; 10⁻\u0026sup1;\u0026sup1;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003echarge_blockiness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.5624\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.5555\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.558\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.4134\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2.06\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.80 \u0026times; 10⁻⁶\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDE_max_cluster\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e \u003cp\u003e3.9867\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3.34\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e8.07 \u0026times; 10⁻⁶\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK_cluster_count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e44.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e33.68\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.59 \u0026times; 10⁻⁶\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003echarge_blockiness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.56523\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.5592\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.722\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"23\" rowspan=\"24\"\u003e \u003cp\u003e\u003cb\u003eSequence Grammar\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePEST score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0502\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0559\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.066\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003eProline-rich fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0082\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0074\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.805\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLow complexity fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0299\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0355\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.065\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK spacing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e17.619\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20.199\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0045\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline periodicity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e16.075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e11.227\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.34\u0026times;10⁻⁴\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDisorder shift\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0137\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e \u003cp\u003e-0.0537\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.149\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePEST score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0588\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0602\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.815\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline-rich fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0066\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0075\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.240\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLow complexity fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0326\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0371\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.733\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK spacing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e16.168\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e17.270\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c6\"\u003e \u003cp\u003e0.272\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline periodicity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e15.276\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e10.732\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.62\u0026times;10⁻⁵\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDisorder shift\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0224\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.0380\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0297\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePEST score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0569\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0578\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline-rich fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0081\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0078\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.375\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLow complexity fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0344\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0360\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.492\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK spacing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e15.273\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e18.498\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3.57\u0026times;10⁻⁴\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline periodicity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e15.139\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e11.865\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0187\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDisorder shift\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0082\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.0513\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.100\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003ePEST score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0588\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0548\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.144\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline-rich fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0073\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0067\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.891\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLow complexity fraction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0366\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.0343\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.548\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eK spacing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e16.550\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20.194\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0129\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProline periodicity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e15.708\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e11.938\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.0094\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDisorder shift\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0266\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.0134\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.181\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e3.5 Feature-Specific Perturbation Analysis Reveals Dominant Contribution of Charge Organization\u003c/h2\u003e \u003cp\u003eTo further dissect the contribution of individual sequence grammar components, feature-specific perturbation analysis was performed by selectively disrupting lysine distribution, charge organization, and proline periodicity (Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Disruption of charge organization produced the largest decrease in predicted stability (Δ\u0026thinsp;=\u0026thinsp;\u0026minus;1.53), followed by lysine redistribution (Δ\u0026thinsp;=\u0026thinsp;\u0026minus;0.44) and proline perturbation (Δ\u0026thinsp;=\u0026thinsp;\u0026minus;0.15), indicating a hierarchical contribution of sequence features (charge\u0026thinsp;\u0026gt;\u0026thinsp;lysine\u0026thinsp;\u0026gt;\u0026thinsp;proline) to the stability signal. These results provide computational support that the embedding-derived stability signal is influenced by specific sequence organization features, with charge organization as the dominant contributor. 
The Python script for this analysis is provided in \u003cb\u003eSupplementary File 13\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFeature-specific perturbation analysis of sequence grammar components\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePerturbation Type\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMean Stability Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eΔ Stability Score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOriginal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;1.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLysine disruption\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;1.47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;0.44\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eCharge disruption\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;2.56\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;1.53\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eProline disruption\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;1.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026minus;0.15\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e3.6 Conservation of the Stability Axis Across Cell Lines and Species\u003c/h2\u003e \u003cp\u003eStability axes derived from LLP and SLP proteins were highly similar across the four cell lines, with cosine similarity values of 0.791\u0026ndash;0.908 (Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e; Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA), indicating a conserved direction separating stable and unstable proteins in embedding space. When models trained on one cell line were applied to predict stability in the remaining datasets using all proteins, cross-cell correlations remained positive (r\u0026thinsp;=\u0026thinsp;0.412\u0026ndash;0.467) (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e;\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eB), demonstrating that the stability rule generalizes beyond the LLP\u0026ndash;SLP subsets. 
The stability axis also captured continuous variation in protein half-life, showing moderate correlations with measured half-life values (Pearson r\u0026thinsp;=\u0026thinsp;0.374\u0026ndash;0.396; Spearman r\u0026thinsp;=\u0026thinsp;0.371\u0026ndash;0.404) (Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e), indicating that it reflects a gradual stability landscape rather than a purely binary classification boundary. Finally, cross-species evaluation showed that models trained on mouse proteins predicted human stability (r\u0026thinsp;=\u0026thinsp;0.359), while the reciprocal analysis yielded r\u0026thinsp;=\u0026thinsp;0.465 (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e;\u003c/b\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eC). Together, these results suggest that the stability-associated signal captured by protein language model embeddings is conserved across cell types and species. 
This conservation also suggests that the signal reflects conserved sequence features associated with protein stability across species, rather than features specific to individual experimental systems.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStability axis cosine similarity matrix derived using LLP and SLP proteins\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTrain\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCell1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCell2\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCell3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCell4\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.908\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.804\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.869\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.908\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.871\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.896\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.804\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.871\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.791\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.869\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.896\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.791\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv 
class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eContinuous half-life prediction\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePearson r\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSpearman r\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eR\u0026sup2;\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.374\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.371\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.13\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.395\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.398\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.147\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.385\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.404\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.142\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.396\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.392\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.146\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec23\" class=\"Section2\"\u003e \u003ch2\u003e3.7 Functional Role of Stability-Associated Sequence Grammar\u003c/h2\u003e \u003cp\u003eSequence grammar features provided measurable predictive power for distinguishing LLP and SLP proteins. A logistic regression model using grammar-derived features showed consistent performance across the four cell lines (Table\u0026nbsp;\u003cspan refid=\"Tab9\" class=\"InternalRef\"\u003e9\u003c/span\u003e), with ROC\u0026ndash;AUC values of 0.656\u0026ndash;0.678, and comparable results in the external human (0.654) and mouse (0.722) datasets (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). 
The model indicated that LLP proteins tend to exhibit higher proline periodicity, tighter lysine spacing, and stronger lysine/acidic residue clustering, whereas SLP proteins show more dispersed residue organization. Importantly, these results indicate that sequence grammar alone captures a measurable portion of the stability signal, independent of full embedding representations. These sequence organization patterns are consistent with properties associated with protein degradation. Differences in lysine spacing and clustering suggest variation in the distribution of potential ubiquitination sites, while charge clustering may influence electrostatic interaction contexts. Periodic proline organization may affect local structural environments and the accessibility of sequence features. Although these analyses do not directly measure degradation mechanisms, they provide biologically consistent interpretations of the sequence patterns identified. Perturbation analysis further supports the functional relevance of these features, as disrupting sequence organization consistently reduced predicted stability across all cell lines (Table\u0026nbsp;\u003cspan refid=\"Tab10\" class=\"InternalRef\"\u003e10\u003c/span\u003e). 
These results suggest that sequence grammar features contribute to the stability-associated signal captured by the model.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab9\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 9\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance of the grammar-based logistic regression model.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell Line\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eROC-AUC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.678\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.629\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.656\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.635\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 3\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.666\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.617\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.676\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.639\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHuman External\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.654\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.61\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMouse\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.722\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.697\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab10\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 10\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerturbation analysis results\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" 
colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMean Δ Predicted Log-Half-Life\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePaired t-test p-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e-0.069\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.00077\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e-0.078\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.048\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;6\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e-0.024\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0213\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e-0.069\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.1661\u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:\\times\\:\\)\u003c/span\u003e\u003c/span\u003e 10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003e3.8 Functional enrichment of stability-associated proteins\u003c/h2\u003e \u003cp\u003eFunctional enrichment analysis revealed that both long-lived proteins (LLPs) and short-lived proteins (SLPs) are predominantly associated with regulatory biological processes, particularly those related to transcription and RNA metabolism. LLPs showed relatively stronger enrichment in chromatin organization and transcription-associated processes, indicating a bias toward nuclear gene regulatory functions (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e). In contrast, SLPs were enriched in broader regulatory categories, including cellular and metabolic processes, reflecting less functionally specific enrichment profiles. Although both protein groups participate extensively in regulatory networks, these results suggest a shift in functional emphasis rather than a strict functional separation. This pattern is consistent with the sequence-derived stability signal identified in this study, indicating that features captured by protein language model embeddings may reflect underlying functional tendencies associated with protein stability. 
The analysis results are provided in CSV format in \u003cb\u003eSupplementary File 9\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003e3.9 Decomposition of sequence determinants of protein stability\u003c/h2\u003e \u003cp\u003eTo dissect sequence contributions to protein stability, we evaluated composition, sequence grammar, motif features, and protein language model embeddings. As shown in Table\u0026nbsp;\u003cspan refid=\"Tab11\" class=\"InternalRef\"\u003e11\u003c/span\u003e, composition showed moderate performance (ROC\u0026ndash;AUC\u0026thinsp;=\u0026thinsp;0.64\u0026ndash;0.70), indicating a baseline contribution of amino acid frequencies. Grammar features, capturing residue organization (spacing and clustering), showed comparable performance (0.65\u0026ndash;0.68), exceeding composition only in Cell 1. Motif-based features performed better (0.71\u0026ndash;0.73), while embeddings achieved the highest ROC\u0026ndash;AUC (0.80\u0026ndash;0.82), indicating integration of multi-level sequence information. These results support a hierarchical contribution of sequence features to the stability-associated signal across sequence scales, with higher-order representations integrating and amplifying lower-level sequence signals. Combined analysis across all datasets revealed compositional biases, with LLPs enriched in D, Q, K, G, and T, and SLPs enriched in C, R, S, N, and W. Together, these results support a multi-scale model of protein stability, in which composition, motif-level signals, and higher-order sequence grammar jointly contribute to stability, and protein language model embeddings integrate these signals into a unified representation. 
Top motif features are provided in \u003cb\u003eSupplementary File 10\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab11\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 11\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eContribution of different sequence representations\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell Line\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eComposition (AUC)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGrammar (AUC)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMotif (AUC)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEmbedding (AUC)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.642\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.678\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e \u003cp\u003e0.712\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.823\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.667\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.656\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.721\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.802\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.699\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.666\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.711\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.811\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCell 4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.696\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.676\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.726\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.817\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section2\"\u003e 
\u003ch2\u003e3.10 Summary of Stability-Associated Sequence Grammar\u003c/h2\u003e \u003cp\u003eAs summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e, protein stability is associated with features at multiple sequence scales, including amino acid composition, short motif patterns, and higher-order sequence grammar. Long-lived proteins exhibit tighter lysine spacing, enhanced charge clustering, and increased proline periodicity, whereas short-lived proteins display more dispersed residue organization (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eA). Comparative analysis across sequence representations reveals a clear hierarchy of predictive power, with protein language model embeddings outperforming composition-, motif-, and grammar-based features, indicating integration of multi-scale sequence information (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eB).\u003c/p\u003e \u003cp\u003eProjection onto the embedding-derived stability axis separates LLPs and SLPs, demonstrating distinct stability score distributions (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eC). Validation analyses further show that stability signals depend on sequence organization rather than composition alone: sequence shuffling reduces predictive performance, stability signals are distributed across sequences, and perturbation of sequence grammar features decreases predicted stability (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eD). Together, these results support a model in which protein stability arises from distributed sequence organization captured by embedding-based representations and associated with stability-related sequence properties (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eE). 
The features and the Python scripts for the Cohen\u0026rsquo;s d analysis are provided in \u003cb\u003eSupplementary File 13\u003c/b\u003e, and the corresponding figure is provided in \u003cb\u003eSupplementary Fig.\u0026nbsp;3\u003c/b\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eProtein degradation is central to proteome homeostasis [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], yet the sequence-level principles governing intrinsic protein stability remain incompletely understood. Existing models largely emphasize localized degradation signals such as degrons and PEST regions [\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], which account for only part of the observed variability in protein lifetimes [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. The results presented here support an alternative perspective in which protein stability is associated with distributed, multi-scale sequence organization rather than being explained solely by localized motifs.\u003c/p\u003e \u003cp\u003eA central observation of this study is the presence of a stability-associated axis in protein sequence representation space that separates long-lived proteins (LLPs) and short-lived proteins (SLPs), shows consistent behavior across cell lines and species, and relates to continuous variation in protein half-life. 
The reproducibility of this signal across independent datasets suggests that it reflects a conserved sequence-associated property rather than a dataset-specific effect.\u003c/p\u003e \u003cp\u003eDecomposition of sequence representations further indicates that protein stability is not explained by amino-acid composition alone but reflects contributions from multiple sequence levels. While composition provides a baseline signal (ROC\u0026ndash;AUC\u0026thinsp;\u0026asymp;\u0026thinsp;0.64\u0026ndash;0.70), sequence grammar (\u0026asymp;\u0026thinsp;0.65\u0026ndash;0.68) and motif features (\u0026asymp;\u0026thinsp;0.71\u0026ndash;0.73) contribute additional information, with protein language model embeddings (\u0026asymp;\u0026thinsp;0.80\u0026ndash;0.82) integrating these signals into a unified representation. This progressive increase in predictive performance is consistent with a multi-level organization of sequence features contributing to the stability-associated signal.\u003c/p\u003e \u003cp\u003ePerturbation and shuffling analyses provide additional support for the role of sequence organization. Disrupting residue order while preserving composition reduces predictive performance, and sliding-window analyses indicate that stability-associated signals are distributed across sequences rather than confined to localized motifs [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. Together, these observations are consistent with a framework in which residue arrangement contributes to stability-associated information beyond compositional effects alone.\u003c/p\u003e \u003cp\u003eAnalysis of sequence features reveals consistent organization patterns associated with stability, including tighter lysine spacing, enhanced charge clustering, and increased proline periodicity in LLP proteins compared with SLP proteins. 
Perturbation analyses further support the contribution of these features, with disruption of charge organization producing the largest decrease in predicted stability. These findings suggest that electrostatic organization and residue spacing are associated with the stability signal captured by sequence representations.\u003c/p\u003e \u003cp\u003eFrom a biochemical perspective, these patterns are consistent with known aspects of protein degradation, including the role of lysine residues in ubiquitination and the influence of charge distribution on protein\u0026ndash;protein interactions [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. However, these interpretations remain inferential and require direct experimental validation. Motif-level analysis further indicates that degron-like patterns are broadly present across proteins, whereas differences in their distribution and organization provide more discriminative information, supporting the importance of sequence context.\u003c/p\u003e \u003cp\u003eFunctional enrichment analysis provides additional biological context, with LLP proteins enriched in chromatin organization and transcription-related processes, and SLP proteins associated with broader regulatory and metabolic functions. These patterns are consistent with the sequence-derived stability signal and suggest a link between sequence organization, protein function, and turnover.\u003c/p\u003e \u003cp\u003eBeyond mechanistic interpretation, these findings have broader implications. Sequence-level organization may contribute to variability in experimentally measured protein half-life across proteomics workflows, indicating that intrinsic sequence features partially shape observed turnover rates. 
In addition, the identified sequence grammar provides a basis for predicting protein stability directly from sequence and may inform the design of proteins with controlled degradation properties [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThis study has several limitations. The analyses rely on existing proteomics datasets and do not include direct experimental validation of the proposed mechanisms. Although the consistency of results across datasets, cell lines, and species supports their generality, future work incorporating targeted experimental perturbations will be required to directly test the functional role of the identified sequence features.\u003c/p\u003e \u003cp\u003eTaken together, these results support a shift from strictly motif-centric views of protein degradation toward a framework in which protein stability is associated with distributed, multi-scale sequence organization. This study illustrates how protein language model representations can be used to resolve sequence-level patterns linked to biological function and provides a foundation for further investigation of regulatory information encoded within protein sequences.\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eIn this study, we show that protein stability is associated with distributed, multi-scale sequence organization rather than being explained solely by localized sequence motifs. Using protein language model representations, we identify a stability-associated axis that distinguishes long-lived and short-lived proteins and shows consistent behavior across datasets and species. By systematically decomposing sequence representations, we find that stability-related signals reflect contributions from amino-acid composition, local motif patterns, and higher-order sequence organization. 
In particular, sequence grammar features, including lysine spacing, charge clustering, and proline periodicity, capture interpretable aspects of the stability-associated signal beyond composition alone. These findings support a framework in which protein stability is linked to the integration of sequence features across multiple scales, providing a basis for understanding variability in protein turnover directly from sequence. More broadly, this work highlights the potential of protein language model\u0026ndash;based representations to uncover biologically meaningful principles embedded in protein sequences and to enable systematic analysis of sequence-function relationships in proteome dynamics.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eEthics declaration\u003c/h2\u003e \u003cp\u003eNot applicable\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eCompeting interests\u003c/h2\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eConceptualization: Tangilal Dihan Chowdhury, Md. Shabiul Islam, Fasiha Tanzeem Taiba. Methodology: Tangilal Dihan Chowdhury, Md Ushama Shafoyat, Kazy Noor e Alam Siddiquee. Software: Tangilal Dihan Chowdhury, Md Ushama Shafoyat, Maruf Hasan. Formal analysis: Tangilal Dihan Chowdhury, Maruf Hasan, Fasiha Tanzeem Taiba. Investigation: Tangilal Dihan Chowdhury, Kaiissar Mannoor. Data curation: Tangilal Dihan Chowdhury, Md Ushama Shafoyat, Maruf Hasan, Kaiissar Mannoor. Visualization: Tangilal Dihan Chowdhury, Fasiha Tanzeem Taiba. Validation: Md Ushama Shafoyat, Kaiissar Mannoor. Supervision: Md. Shabiul Islam. Project administration: Md. Shabiul Islam. Writing \u0026ndash; original draft: Tangilal Dihan Chowdhury. Writing \u0026ndash; review \u0026amp; editing: All authors. All authors read and approved the final manuscript. 
\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eWe gratefully acknowledge the Department of Biomedical Engineering, Department of Computer Science and Engineering at the Military Institute of Science and Technology and Multimedia University, Malaysia for their institutional support and facilities provided during the completion of this research. We also acknowledge the use of artificial intelligence\u0026ndash;based language tools for assistance with grammar refinement and improvement of formal scientific writing.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eAll data generated or analyzed during this study are included in this published article and its Additional Information files. The complete datasets, protein language model (PLM) outputs, and all analysis scripts are provided in Additional Information 1\u0026ndash;13. Additional data are available from the corresponding author upon reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAlber AB, Suter DM. Dynamics of protein synthesis and degradation through the cell cycle. Cell Cycle. 2019;18:784\u0026ndash;94. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/15384101.2019.1598725\u003c/span\u003e\u003cspan address=\"10.1080/15384101.2019.1598725\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFinley D. Recognition and processing of ubiquitin-protein conjugates by the proteasome. Annu Rev Biochem. 2009;78:477\u0026ndash;513. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1146/annurev.biochem.78.081507.101607\u003c/span\u003e\u003cspan address=\"10.1146/annurev.biochem.78.081507.101607\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z, Mena EL, Timms RT, et al. Degrons: defining the rules of protein degradation. Nat Rev Mol Cell Biol. 2025;26:868\u0026ndash;83. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41580-025-00870-z\u003c/span\u003e\u003cspan address=\"10.1038/s41580-025-00870-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVarshavsky A. N-degron and C-degron pathways of protein degradation. Proc Natl Acad Sci U S A. 2019;116:358\u0026ndash;66. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1073/pnas.1816596116\u003c/span\u003e\u003cspan address=\"10.1073/pnas.1816596116\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRogers S, Wells R, Rechsteiner M. Amino acid sequences common to rapidly degraded proteins: the PEST hypothesis. Science. 1986;234:364\u0026ndash;8. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1126/science.2876518\u003c/span\u003e\u003cspan address=\"10.1126/science.2876518\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFishbain S, Inobe T, Israeli E, et al. Sequence composition of disordered regions fine-tunes protein half-life. Nat Struct Mol Biol. 2015;22:214\u0026ndash;21. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/nsmb.2958\u003c/span\u003e\u003cspan address=\"10.1038/nsmb.2958\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118:e2016239118. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1073/pnas.2016239118\u003c/span\u003e\u003cspan address=\"10.1073/pnas.2016239118\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeclercq M, Droit A. Protein language models: applications and perspectives. J Proteome Res. 2026;25:507\u0026ndash;24. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/acs.jproteome.5c00506\u003c/span\u003e\u003cspan address=\"10.1021/acs.jproteome.5c00506\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLamb KD, Hughes J, Lytras S, et al. From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2. Nat Commun. 2026. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41467-026-69569-9\u003c/span\u003e\u003cspan address=\"10.1038/s41467-026-69569-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi J, Cai Z, Vaites LP, et al. Proteome-wide mapping of short-lived proteins in human cells. Mol Cell. 2021;81:4722\u0026ndash;4735.e5. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.molcel.2021.09.015\u003c/span\u003e\u003cspan address=\"10.1016/j.molcel.2021.09.015\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao L, Zhao J, Zhong K, et al. Targeted protein degradation: mechanisms, strategies and application. Signal Transduct Target Ther. 2022;7:113. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41392-022-00966-4\u003c/span\u003e\u003cspan address=\"10.1038/s41392-022-00966-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204\u0026ndash;12. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/nar/gku989\u003c/span\u003e\u003cspan address=\"10.1093/nar/gku989\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOng SE, Blagoev B, Kratchmarova I, et al. Stable isotope labeling by amino acids in cell culture as a simple and accurate approach to expression proteomics. Mol Cell Proteom. 2002;1:376\u0026ndash;86. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1074/mcp.M200025-MCP200\u003c/span\u003e\u003cspan address=\"10.1074/mcp.M200025-MCP200\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCambridge SB, Gnad F, Nguyen C, et al. Systems-wide proteomic analysis in mammalian cells reveals conserved, functional protein turnover. J Proteome Res. 2011;10:5275\u0026ndash;84. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/pr101183k\u003c/span\u003e\u003cspan address=\"10.1021/pr101183k\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVillegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform. 2022;23:bbac142. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/bib/bbac142\u003c/span\u003e\u003cspan address=\"10.1093/bib/bbac142\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAli S, Chourasia P, Patterson M. When protein structure embedding meets large language models. Genes. 2024;15:25. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/genes15010025\u003c/span\u003e\u003cspan address=\"10.3390/genes15010025\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHung C, Tsai CF, Wu MH. Dimensionality reduction strategies for classification: ML versus DL approaches and their combinations. Expert Syst. 2025. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1111/exsy.70140\u003c/span\u003e\u003cspan address=\"10.1111/exsy.70140\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMykhailiuk I, Sch\u0026auml;fer K, B\u0026uuml;skens C. Parametric stability score and its application in optimal control. IFAC-PapersOnLine. 2022. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.ifacol.2022.09.019\u003c/span\u003e\u003cspan address=\"10.1016/j.ifacol.2022.09.019\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAllgaier J, Pryss R. Cross-validation visualized: a narrative guide to advanced methods. Mach Learn Knowl Extr. 2024;6:1378\u0026ndash;88. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/make6020065\u003c/span\u003e\u003cspan address=\"10.3390/make6020065\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSatria A, Sitompul OS, Mawengkang H. 5-fold cross validation on supporting k-nearest neighbour accuracy of making consimilar symptoms disease classification. In: Proc 2021 Int Conf Comput Sci Eng (IC2SE). IEEE; 2021. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/IC2SE52832.2021.9792094\u003c/span\u003e\u003cspan address=\"10.1109/IC2SE52832.2021.9792094\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu M, Fralick D, Zheng JZ, et al. The differences and similarities between two-sample t-test and paired t-test. Shanghai Arch Psychiatry. 2017;29:184\u0026ndash;8. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.11919/j.issn.1002-0829.217070\u003c/span\u003e\u003cspan address=\"10.11919/j.issn.1002-0829.217070\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiener MJ. Cohen\u0026rsquo;s d. In: The Corsini Encyclopedia of Psychology. 2010. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/9780470479216.corpsy0200\u003c/span\u003e\u003cspan address=\"10.1002/9780470479216.corpsy0200\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChu SKS, Narang K, Siegel JB. Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset. PLoS Comput Biol. 2024. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pcbi.1012248\u003c/span\u003e\u003cspan address=\"10.1371/journal.pcbi.1012248\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLahitani AR, Permanasari AE, Setiawan NA. Cosine similarity to determine similarity measure: study case in online essay assessment. Proc 2016 Int Conf Cyber IT Serv Manag. IEEE; 2016. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/CITSM.2016.7577578\u003c/span\u003e\u003cspan address=\"10.1109/CITSM.2016.7577578\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbounaima MC, El Mazouri FZ, Lamrini L et al. The Pearson correlation coefficient applied to compare multi-criteria methods: case the ranking problematic. In: Proc 2020 Int Conf Innov Res Appl Sci Eng Technol (IRASET). IEEE; 2020. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/IRASET48871.2020.9092242\u003c/span\u003e\u003cspan address=\"10.1109/IRASET48871.2020.9092242\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlakeley-Ruiz JA, Kleiner M. Considerations for constructing a protein sequence database for metaproteomics. 
Comput Struct Biotechnol J. 2022;20:937\u0026ndash;52. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.csbj.2022.01.018\u003c/span\u003e\u003cspan address=\"10.1016/j.csbj.2022.01.018\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJiang M, Anderson J, Gillespie J, et al. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics. 2008;9:192. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/1471-2105-9-192\u003c/span\u003e\u003cspan address=\"10.1186/1471-2105-9-192\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMokhtari F, Akhlaghi MI, Simpson SL, et al. Sliding window correlation analysis: modulating window shape for dynamic brain connectivity in resting state. NeuroImage. 2019;189:655\u0026ndash;66. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neuroimage.2019.02.001\u003c/span\u003e\u003cspan address=\"10.1016/j.neuroimage.2019.02.001\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShi WH, Marciel AB. Influence of charge block length on conformation and cluster formation of atactic peptide polyampholytes. Macromolecules. 2026. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/acs.macromol.5c02873\u003c/span\u003e\u003cspan address=\"10.1021/acs.macromol.5c02873\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePalaniappan A, Jakobsson E. Fourier analysis of conservation patterns in protein secondary structure. 
Comput Struct Biotechnol J. 2017;15:265\u0026ndash;70. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.csbj.2017.02.002\u003c/span\u003e\u003cspan address=\"10.1016/j.csbj.2017.02.002\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoszt\u0026aacute;nyi Z. Prediction of protein disorder based on IUPred. Protein Sci. 2018;27:331\u0026ndash;40. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/pro.3334\u003c/span\u003e\u003cspan address=\"10.1002/pro.3334\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChicco D, Sichenze A, Jurman G. A simple guide to the use of Student\u0026rsquo;s t-test, Mann\u0026ndash;Whitney U test, chi-squared test, and Kruskal\u0026ndash;Wallis test in biostatistics. BioData Min. 2025;18:56. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s13040-025-00465-6\u003c/span\u003e\u003cspan address=\"10.1186/s13040-025-00465-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNahm FS. Receiver operating characteristic curve: overview and practical use for clinicians. Korean J Anesthesiol. 2022;75:25\u0026ndash;36. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.4097/kja.21209\u003c/span\u003e\u003cspan address=\"10.4097/kja.21209\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReimand J, Arak T, Adler P, Kolberg L, Reisberg S, Peterson H, Vilo J. g:Profiler\u0026mdash;a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Res. 
2016;44(W1):W83\u0026ndash;9. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/nar/gkw199\u003c/span\u003e\u003cspan address=\"10.1093/nar/gkw199\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChagoyen M, Pazos F. Quantifying the biological significance of gene ontology biological processes\u0026mdash;implications for the analysis of systems-wide data. Bioinformatics. 2010;26(3):378\u0026ndash;84. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/bioinformatics/btp663\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/btp663\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHolman SW, Hammond DE, Simpson DM, et al. Protein turnover measurement using selected reaction monitoring-mass spectrometry (SRM-MS). Philos Trans Math Phys Eng Sci. 2016;374:20150362. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1098/rsta.2015.0362\u003c/span\u003e\u003cspan address=\"10.1098/rsta.2015.0362\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcGinness KE, Baker TA, Sauer RT. Engineering controllable protein degradation. Mol Cell. 2006;22:701\u0026ndash;7. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.molcel.2006.04.027\u003c/span\u003e\u003cspan address=\"10.1016/j.molcel.2006.04.027\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-bioinformatics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"binf","sideBox":"Learn more about [BMC Bioinformatics](http://bmcbioinformatics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/binf","title":"BMC Bioinformatics","twitterHandle":"@BMC_Bioinformatics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"PLM, long lived protein, short lived protein, PLM embeddings, sequence grammar, motif","lastPublishedDoi":"10.21203/rs.3.rs-9298535/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9298535/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eProtein stability and turnover are fundamental determinants of proteome regulation, yet how protein lifetime is specified by amino-acid sequences remains incompletely understood. Here, we identify a previously uncharacterized, multi-scale organization of sequence features associated with protein stability using protein language model (PLM) representations. Using experimentally derived half-life data from four human cell lines, we uncover a conserved stability-associated axis in embedding space that separates long-lived proteins (LLP) from short-lived proteins (SLP) (ROC\u0026ndash;AUC\u0026thinsp;=\u0026thinsp;0.80\u0026ndash;0.82) and generalizes across species (Pearson r up to 0.465). Systematic decomposition of sequence representations shows that protein stability is not explained by amino-acid composition alone but reflects contributions from multiple sequence levels. 
While composition provides baseline predictive signal (AUC\u0026thinsp;=\u0026thinsp;0.64\u0026ndash;0.70), sequence grammar (0.65\u0026ndash;0.68) and motif features (0.71\u0026ndash;0.73) contribute additional information, with PLM representations integrating these signals into a unified framework. Disrupting residue order reduces predictive performance (AUC\u0026thinsp;~\u0026thinsp;0.82 to ~\u0026thinsp;0.62), supporting a role for residue organization beyond composition alone. Analysis of sequence features reveals consistent organization patterns associated with stability, with LLP proteins exhibiting tighter lysine spacing, enhanced charge clustering, and increased proline periodicity, whereas SLP proteins show more dispersed residue organization. Perturbation analyses further support the contribution of these features, with disruption of charge organization producing the largest decrease in predicted stability (Δ \u0026asymp; \u0026minus;1.53). Together, these findings support a model in which protein stability is associated with distributed, multi-scale sequence organization rather than isolated motifs. 
By systematically resolving stability-related signals across sequence scales, this work highlights the potential of PLM-based representations to uncover biologically meaningful principles governing proteome dynamics.\u003c/p\u003e","manuscriptTitle":"Multi-Scale Sequence Encoding Distinguishes Long-Lived and Short-Lived Proteins Revealed by Protein Language Model Embeddings","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-07 12:33:45","doi":"10.21203/rs.3.rs-9298535/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewerAgreed","content":"328533478704997703777217871327125977927","date":"2026-05-04T07:06:43+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-04-29T13:59:37+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-04-14T14:23:44+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-04-08T01:56:20+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-04-08T01:56:08+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Bioinformatics","date":"2026-04-02T05:41:58+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-bioinformatics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"binf","sideBox":"Learn more about [BMC Bioinformatics](http://bmcbioinformatics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/binf","title":"BMC Bioinformatics","twitterHandle":"@BMC_Bioinformatics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"7adf4894-fc9a-40b9-b23d-04f91ae618be","owner":[],"postedDate":"May 7th, 2026","published":true,"recentEditorialEvents":[{"type":"reviewerAgreed","content":"328533478704997703777217871327125977927","date":"2026-05-04T07:06:43+00:00","index":23,"fulltext":""},{"type":"reviewersInvited","content":"5","date":"2026-04-29T13:59:37+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-05-07T12:33:46+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-07 12:33:45","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9298535","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9298535","identity":"rs-9298535","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}