Deep-Palm: an integrated deep learning framework for structure-aware prediction of protein S-Palmitoylation

preprint OA: closed CC-BY-NC-4.0

Abstract

Protein S-palmitoylation is a critical and reversible lipid modification that governs protein localization, trafficking, and signaling. Its dysregulation is increasingly implicated in cancer and therapeutic resistance, highlighting an urgent need for high-throughput computational prediction tools. Palmitoylation is regulated by a complex interplay of sequence motifs, structural conformations, and physicochemical properties. To comprehensively capture these determinants, we developed Deep-Palm: a deep learning framework that integrates multi-view features, including amino acid sequences, spatial constraints from predicted structures, physicochemical descriptors, and protein language model embeddings, for accurate prediction of S-palmitoylation sites. In independent testing, Deep-Palm achieved an area under the curve (AUC) of 0.931, substantially outperforming state-of-the-art tools such as pCysMod, MusiteDeep, and GPS-Palm. Furthermore, Deep-Palm demonstrated robust performance across diverse eukaryotic species. Notably, its predictive accuracy remained consistent regardless of protein functional categories or subcellular localization, indicating that the model captures fundamental, context-invariant determinants of palmitoylation. By embedding amino acid sequences with structural and protein property awareness, Deep-Palm not only delivers stable and high-precision predictions but also provides a framework for uncovering novel regulatory mechanisms and therapeutic targets in protein post-translational modification (PTM).

bioRxiv preprint doi: https://doi.org/10.64898/2026.03.05.709753; this version posted March 7, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Introduction

The functional diversity of the eukaryotic proteome is exponentially expanded by post-translational modifications (PTMs) [1]. Among the diverse types of PTMs, S-palmitoylation is a reversible lipid modification serving as a dynamic molecular switch that regulates protein activity, localization and interaction networks [2]. S-palmitoylation involves the formation of a labile thioester bond between a 16-carbon saturated fatty acid (palmitate) and the thiol group of specific cysteine residues [3], and is governed by the opposing actions of two enzyme families. The zinc finger DHHC-domain-containing protein acyltransferases (ZDHHCs) catalyze the addition of palmitate, while the acyl-protein thioesterases (APTs) mediate depalmitoylation [2]. Because S-palmitoylation is reversible, it can rapidly tune a protein's membrane residence and signaling output in response to upstream cues, functioning in many contexts analogously to phosphorylation [4,5]. This dynamic control is biologically crucial and can be hijacked in disease; for example, fatty acid synthase–dependent EGFR palmitoylation and palmitoylation-dependent NRAS trafficking promote oncogenic signaling in cancer models [6-8]. Several public resources curate experimentally supported S-palmitoylation site annotations, including SwissPalm and CysModDB, with complementary evidence integrated in UniProt [9-11]. Although these annotations continue to expand, they remain incomplete because identifying and validating new sites still relies on labor-intensive enrichment and chemical-reporter workflows [12,13]. Moreover, the thioester linkage is chemically labile, which can complicate sample handling and proteomics readouts [2,12]. Therefore, computational approaches are still needed to prioritize candidate cysteines and accelerate the discovery of previously unannotated S-palmitoylation sites [14-16].
To mine S-palmitoylation events across proteomes, computational tools based on machine learning and deep learning have been developed for fast and accurate prediction of palmitoylated cysteines. Over the past decade, multiple such tools, including GPS-Palm [14], CSS-Palm 2.0 [15], and pCysMod [16], have been developed for S-palmitoylation site prediction, typically representing each cysteine using a fixed-length sequence window and sequence-derived features (e.g., local motifs and physicochemical properties). These tools have facilitated candidate prioritization, yet most remain sequence-centric, learning palmitoylation propensity primarily from short linear windows around the target cysteine and motif-like sequence patterns [10,14-17]. Consequently, they tend to treat ZDHHC substrate preference as being largely specified by residues flanking the modified cysteine, potentially overlooking broader contextual determinants emphasized in S-acylation biology [2,3,12]. However, cysteine recognition by S-palmitoylation "writer" enzymes (ZDHHC palmitoyl acyltransferases) and depalmitoylation "eraser" enzymes is not determined solely by sequence features, but also depends on structural determinants such as membrane topology, three-dimensional accessibility, and productive presentation of the cysteine to the enzyme at the membrane interface [18-22]. Structural studies of human zDHHC20 and zDHHC15 place the catalytic DHHC motif at the membrane–cytosol interface and reveal a tent-like hydrophobic cavity formed by transmembrane helices that accommodates the fatty-acyl chain, providing a mechanistic basis for why modification depends on the membrane-proximal 3D context of the target cysteine rather than flanking residues alone [18].
Beyond local windows, substrate recruitment can also be mediated by distal interaction modules, as shown by the ankyrin-repeat domain of human zDHHC17 engaging a substrate peptide in a defined binding mode [19]. On the eraser side, ABHD17 enzymes have been identified as depalmitoylases required for N-Ras depalmitoylation and relocalization, and APT2 has been shown to extract substrate acyl chains from membranes for hydrolysis, supporting the central role of membrane/structural context in depalmitoylation dynamics [20,21]. Therefore, incorporating structure-aware features into prediction is necessary to complement sequence-only models and to improve mechanistic interpretability at the site level. In parallel, evolutionary semantics from protein language-model embeddings, exemplified by ESM-2, provide a practical route to encode structural and evolutionary constraints directly from amino-acid sequence: ESM-2–based models enable atomic-level protein structure prediction from primary sequence alone, indicating that the learned embeddings capture non-local dependencies relevant to structure and evolution [22-24]. To bridge this gap, we present Deep-Palm, a multi-view prediction framework that takes cysteine-centered sequence windows as input and outputs residue-level propensities for candidate S-palmitoylation sites, following common PTM-site prediction practice [14]. Deep-Palm represents each candidate using evolutionary semantics derived from ESM-2 embeddings, which learn protein sequence regularities at scale and have been used to support structure-aware inference from sequence alone [22,23].
Deep-Palm further incorporates 3D structural context by building residue interaction graphs from ESMFold-predicted structures and encoding the spatial microenvironment around each cysteine with graph convolution [22,25]. Finally, Deep-Palm integrates physicochemical descriptors from AAindex with bidirectional recurrent modeling and attention, extracts local motif signals using convolutional architectures, and fuses all views through a stacking meta-learner to produce a unified site score [26-28].

Materials and methods

Data preprocessing strategy

To construct a high-quality dataset of protein S-palmitoylation sites for model building, we implemented a multi-step data curation strategy. We first collected experimentally supported sites from cysteine PTM databases, including SwissPalm [9] and CysModDB [10], together with the training data from GPS-Palm [14]. Cysteines annotated as S-palmitoylated in these resources were defined as positive cases.

Definition: Each positive sample was a 31-residue window, used as model input, centered on a verified palmitoylated cysteine residue (i.e., positions −15 to +15).

Negative Sample Generation: To generate a robust negative set, we extracted non-palmitoylated cysteines from the same proteins, ensuring a representative background distribution that accounts for protein-specific expression levels and cellular localization.

Homology Bias Mitigation: Strict quality control was applied to mitigate homology bias, a critical factor for objective evaluation. We employed CD-HIT [29] to cluster sequences with a 60% identity threshold, removing redundant fragments that could lead to inflated performance estimates due to data leakage between training and testing sets. The final non-redundant dataset comprised 6,970 samples, balanced (1:1) between positive (3,485 sites) and negative (3,485 sites) classes via random undersampling of the majority negative class. This balancing step is crucial to prevent the classifier from learning a trivial "always negative" heuristic, a common pitfall in PTM prediction where negative sites vastly outnumber positive ones. The dataset was partitioned into training (80%) and independent testing (20%) sets using stratified sampling to preserve class, organism, and functional distributions.

Table 1 Dataset Statistics

| Organism | Total | Negative (0) | Positive (1) |
|---|---|---|---|
| Mus musculus | 3069 | 590 | 2479 |
| Homo sapiens | 1568 | 675 | 893 |
| Arabidopsis thaliana | 354 | 342 | 12 |
| Rattus norvegicus | 280 | 244 | 36 |
| Caenorhabditis elegans | 157 | 155 | 2 |
| Saccharomyces cerevisiae | 151 | 147 | 4 |
| Drosophila melanogaster | 133 | 126 | 7 |
| Bos taurus | 126 | 121 | 5 |
| Xenopus laevis | 112 | 112 | 0 |
| Dictyostelium discoideum | 112 | 112 | 0 |
| Other (206 species) | 908 | 861 | 47 |
| Total (all samples) | 6970 | 3485 | 3485 |

Model architecture

We developed Deep-Palm, a multi-view deep learning framework that synergizes evolutionary, structural, and physicochemical contexts. The architecture consists of four parallel branches, each designed to capture distinct biological modalities, which are subsequently integrated by a meta-learner.

Figure 1 The Deep-Palm Framework

Feature calculation

1. Evolutionary Semantic Encoding

To leverage deep evolutionary patterns, we utilized the pre-trained protein language model ESM-2 (Evolutionary Scale Modeling, specifically the 3B parameter version) [22]. ESM-2 is a transformer-based model trained on millions of protein sequences from the UniRef database [11]. Unlike traditional Position-Specific Scoring Matrices (PSSMs), which only capture conservation at individual positions, ESM-2 embeddings capture high-order co-evolutionary dependencies and latent semantic properties of protein sequences.
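The 31-residue windowing and the 1:1 undersampling described in the data preprocessing strategy can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the padding symbol `X` for cysteines near a terminus, the function names, and the fixed random seed are all assumptions, since the text does not say how windows that run past the sequence ends were handled.

```python
import random

WINDOW_FLANK = 15  # 15 residues on each side of the cysteine -> 31-residue window
PAD = "X"          # assumption: padding symbol for windows that run past a terminus


def cysteine_windows(sequence, flank=WINDOW_FLANK, pad=PAD):
    """Return {0-based position: window} for every cysteine in the sequence."""
    padded = pad * flank + sequence + pad * flank
    windows = {}
    for pos, residue in enumerate(sequence):
        if residue == "C":
            # In the padded string the residue sits at index pos + flank,
            # so the slice below is centered on it.
            windows[pos] = padded[pos : pos + 2 * flank + 1]
    return windows


def balance_by_undersampling(positives, negatives, seed=0):
    """Randomly undersample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)
```

Each window keeps the candidate cysteine at index 15, which matches the "positions −15 to +15" definition; homology filtering with CD-HIT would be applied separately and is not shown here.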
Each 31-residue sequence was mapped to high-dimensional embeddings ($L \times D$, where $L = 31$ and $D = 1280$) representing the latent semantic space of protein evolution. To effectively decode local dependencies from these embeddings, we implemented a variable-convolutional layer (vConv) [30]. As demonstrated in recent omics studies [31], vConv dynamically adapts kernel sizes during training to capture complex, variable-length functional motifs that fixed-size convolutions often miss. Mathematically, the vConv operation allows the network to learn a mask function $M$ that weights the effective width of the kernel $W$. For an input sequence representation $X \in \mathbb{R}^{L \times D}$, the output feature map $Y$ is computed as:

$$Y_j = \sum_{k=1}^{K} \sum_{d=1}^{D} X_{(j+k-1),d} \cdot \left( W_{k,d} \odot M_k(\theta) \right) + b$$

where $K$ is the maximum kernel size, $\odot$ denotes element-wise multiplication, and $M_k(\theta)$ is a learnable masking function parameterized by $\theta$ (typically using sigmoid functions) that smoothly determines the active window size of the convolution. This mechanism allows the model to "focus" on motifs of varying lengths without manual tuning of kernel sizes.

2. Structure-Aware Graph Representation

A core innovation of Deep-Palm is the integration of 3D structural topology. We generated predicted structures for all peptide windows using ESMFold, which offers accuracy comparable to AlphaFold2 [32] but with superior inference speed, making it feasible to model thousands of peptide fragments. We constructed residue interaction graphs $G = (V, E)$, where nodes $V$ represent amino acids and edges $E$ represent spatial contacts.
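The masked convolution from the evolutionary semantic encoding step can be illustrated in plain Python. This is a sketch under stated assumptions: the mask is taken to be a product of two sigmoids with hypothetical boundary parameters `left` and `right` standing in for $\theta$, because the text says only that the mask "typically" uses sigmoid functions. Only valid (unpadded) output positions are computed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(max_kernel, left, right):
    """M_k(theta) for k = 1..K with theta = (left, right).

    Assumed form: a product of two sigmoids that is largest for
    left < k < right and decays smoothly towards 0 outside that range.
    """
    return [sigmoid(k - left) * sigmoid(right - k) for k in range(1, max_kernel + 1)]


def masked_conv1d(x, w, mask, bias=0.0):
    """Y_j = sum_k sum_d X[j+k-1][d] * W[k][d] * M_k + b, using 0-based indices."""
    length, dim = len(x), len(x[0])
    max_kernel = len(w)
    out = []
    for j in range(length - max_kernel + 1):
        total = bias
        for k in range(max_kernel):
            for d in range(dim):
                total += x[j + k][d] * w[k][d] * mask[k]
        out.append(total)
    return out
```

With a hard mask such as `[1.0, 1.0, 0.0]` the layer behaves like a width-2 convolution inside a width-3 kernel, which is the intuition behind letting the mask parameters set the effective kernel size during training.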
An edge was defined between two residues if the Euclidean distance between their C$_\alpha$ atoms was less than a threshold of 8 Å [33]:

$$E_{ij} = \begin{cases} 1 & \text{if } \lVert r_i - r_j \rVert_2 < 8\ \text{Å} \\ 0 & \text{otherwise} \end{cases}$$

Node features $H^{(0)}$ were initialized by aggregating local structural metrics extracted via 2D convolutions. A two-layer Graph Convolutional Network (GCN) [34] was then employed to propagate information across the spatial neighbors. The propagation rule for the GCN layer is defined as:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix with added self-loops, $\tilde{D}$ is the degree matrix (where $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), $W^{(l)}$ is the learnable weight matrix for layer $l$, and $\sigma$ is the ReLU activation function. This propagation allows the model to learn the 3D microenvironment governing enzyme-substrate accessibility, effectively aggregating information from residues that may be distant in the primary sequence but spatially proximal to the reactive cysteine [18].

3. Physicochemical Modeling

Complementing the deep representations, two additional branches explicitly modeled biophysical properties and linear motifs.
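The contact-graph construction and the GCN propagation rule from the structure-aware branch can be sketched with the standard library alone. The coordinates, feature values, and function names below are illustrative assumptions; the sketch applies the 8 Å C-alpha threshold and the symmetric normalization with self-loops exactly as the equations state, but it is not the authors' implementation and omits training.

```python
import math


def contact_adjacency(coords, threshold=8.0):
    """A[i][j] = 1 if the C-alpha atoms of residues i and j are closer than threshold (angstroms)."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize_with_self_loops(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, where D~ is the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(a_hat, h, w):
    """H^{(l+1)} = ReLU(a_hat @ H^{(l)} @ W^{(l)})."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]
```

Stacking two calls to `gcn_layer` gives each residue access to information from neighbors up to two contacts away, which is how residues far apart in sequence but close in space can influence the representation of the target cysteine.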
Physicochemical Branch: This branch utilized 14 curated indices from the AAindex database [35], representing properties such as hydrophobicity, steric hindrance, side-chain volume, and isoelectric point. These features were processed by a Bidirectional Long Short-Term Memory (Bi-LSTM) network [26]. The Bi-LSTM processes the sequence in both forward and backward directions to capture long-range sequential dependencies. An attention mechanism [36] was applied to the hidden states to weigh the contribution of each residue to the final classification:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \quad e_t = v^{T} \tanh(W_h h_t + b)$$

$$c = \sum_{t=1}^{T} \alpha_t h_t$$

where $h_t$ is the combined hidden state at position $t$, $\alpha_t$ is the attention weight, and $c$ is the context vector. This allows the model to focus on residues that contribute most significantly to the biochemical environment of the cysteine.

4. k-mer

This branch employed a multi-channel Convolutional Neural Network (CNN) [27] to extract strictly local sequence patterns (2-mers to 4-mers). These local motifs (e.g., Cys-Cys pairs, C-terminal CaaX motifs) are often directly recognized by the zinc finger domain of palmitoyl acyltransferases (PATs/ZDHHCs).

Ensemble learning and model training

Stacking Generalization: To robustly integrate predictions from the four heterogeneous branches, we implemented a stacking ensemble strategy [28]. Instead of simple averaging, we trained a logistic regression meta-learner to dynamically weight the probability outputs of the base models.
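The attention pooling applied to the Bi-LSTM hidden states in the physicochemical branch can be written out directly from its three equations. The sketch below assumes the hidden states are already computed and passes them in as plain lists; the parameter values used in any call are illustrative, not learned weights from Deep-Palm.

```python
import math


def attention_pool(hidden_states, w_h, b, v):
    """Additive attention over a sequence of hidden states.

    e_t = v^T tanh(W_h h_t + b), alpha = softmax(e), c = sum_t alpha_t h_t.
    Returns (alphas, context).
    """
    scores = []
    for h_t in hidden_states:
        projected = [
            math.tanh(sum(w_row[i] * h_t[i] for i in range(len(h_t))) + b_j)
            for w_row, b_j in zip(w_h, b)
        ]
        scores.append(sum(v_j * p_j for v_j, p_j in zip(v, projected)))
    # Subtract the maximum score before exponentiating for numerical stability;
    # this does not change the softmax result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(a * h_t[i] for a, h_t in zip(alphas, hidden_states)) for i in range(dim)]
    return alphas, context
```

Because the weights `alphas` sum to one over the 31 window positions, they can be read as the relative contribution of each residue, which is the quantity later visualized for interpretability.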
The meta-learner optimizes the final prediction $Y$ as:

$$Y = \sigma\left( w_1 P_{ESM} + w_2 P_{GCN} + w_3 P_{BiLSTM} + w_4 P_{CNN} + b \right)$$

where $P$ represents the probability output of each branch and $w$ are the learned weights. The meta-learner was trained using out-of-fold (OOF) predictions from the cross-validation process to prevent data leakage and ensure generalization to unseen data.

Training Protocol: The model was implemented in PyTorch [37]. We adopted a 5-fold cross-validation scheme for hyperparameter tuning. The network was optimized using the AdamW optimizer [38] with a weight decay of $1 \times 10^{-3}$ to prevent overfitting. We utilized Binary Cross-Entropy (BCE) as the loss function:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

To handle the complexity of the multi-branch architecture, we employed mixed-precision training and gradient clipping. Early stopping was triggered if the Area Under the Receiver Operating Characteristic Curve (AUROC) on the validation set did not improve for 5 consecutive epochs.

Performance evaluation

We evaluated model performance using standard binary-classification metrics, including sensitivity, specificity, and accuracy. Palmitoylated cysteines were treated as positive instances, whereas non-palmitoylated cysteines were treated as negative instances.
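The stacking step and the BCE loss from the ensemble learning section can be sketched as a small logistic-regression meta-learner fitted on out-of-fold branch probabilities. This is an illustration under assumptions: the OOF probabilities are taken as given (one list of four branch outputs per sample), the optimizer is plain batch gradient descent rather than whatever solver the authors used, and the learning rate and epoch count are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def meta_predict(branch_probs, weights, bias):
    """Y = sigmoid(w1*P_ESM + w2*P_GCN + w3*P_BiLSTM + w4*P_CNN + b)."""
    return sigmoid(sum(w * p for w, p in zip(weights, branch_probs)) + bias)


def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def fit_meta_learner(oof_probs, labels, lr=0.5, epochs=2000):
    """Fit meta-learner weights on out-of-fold branch probabilities by gradient descent."""
    n_branches = len(oof_probs[0])
    weights, bias = [0.0] * n_branches, 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_branches, 0.0
        for probs, y in zip(oof_probs, labels):
            # For a sigmoid output, the BCE gradient w.r.t. the logit is (prediction - label).
            err = meta_predict(probs, weights, bias) - y
            for i, p in enumerate(probs):
                grad_w[i] += err * p / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias
```

Fitting on OOF predictions matters because each base model never saw the sample it scores, so the meta-learner is weighting honest estimates of branch reliability rather than memorized training outputs.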
Let TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The metrics were computed as:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}, \quad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\,\text{FPR}, \quad \text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}$$

Model training and evaluation were conducted on an NVIDIA A100 GPU environment.
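The evaluation metrics defined above can be computed with a few lines of Python. The sketch assumes a decision threshold of 0.5 for the confusion counts, which the text does not specify, and computes AUC through the equivalent pairwise-ranking formulation (the probability that a randomly chosen positive scores higher than a randomly chosen negative) instead of explicit integration of the ROC curve.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Count TP, TN, FP, FN after thresholding the scores (threshold is an assumed default)."""
    tp = tn = fp = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from the four confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two kinds of metric are reported side by side.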

Results

Multi-View Synergy Mitigates Overfitting and Enhances Robustness

To evaluate the contribution of individual feature channels to the predictive capability of Deep-Palm, we analyzed model performance on the training set using stratified 5-fold cross-validation (Table 2). The Stacking Ensemble strategy achieved the highest discriminative power, with an AUC of 0.970, outperforming the linear Blending strategy (AUC 0.966) and all individual branches.

| Model Branch | Training AUC | Testing AUC | Specificity | Sensitivity |
|---|---|---|---|---|
| Amino acid sequence | 0.985 | 0.875 | 0.742 | 0.818 |
| Protein properties | 0.933 | 0.916 | 0.773 | 0.885 |
| ESM embedding | 0.959 | 0.923 | 0.856 | 0.801 |
| Spatial structure | 0.706 | 0.731 | 0.469 | 0.859 |
| Deep-Palm (combined) | 0.970 | 0.931 | 0.856 | 0.836 |

Table 2 Performance comparison of individual branches vs. the ensemble model on training and independent test sets.

An analysis of the individual branches revealed critical insights into the nature of biological data modeling. Notably, the Amino acid sequence branch exhibited an exceptionally high AUC of 0.985 during training but suffered a significant performance drop to 0.875 during independent testing. This discrepancy highlights a fundamental limitation of explicit motif-based features: they tend to overfit local sequence noise rather than learning generalizable biochemical rules. In contrast, the ESM embedding branch (training AUC 0.959) and the Protein properties branch (training AUC 0.933) provided robust foundational predictions that maintained stability between training and testing phases.
The Structural (GCN) branch showed the highest specificity among individual models, confirming the hypothesis that structural constraints are key discriminators of true sites. The superior performance of the Stacking model confirms that dynamically integrating evolutionary semantics and physicochemical contexts effectively compensates for the limitations of single-view features, preventing the model from relying solely on overfitting-prone sequence motifs.

Generalization and Balanced Decision-Making

A comparative analysis of metrics between the training and independent test sets revealed that Deep-Palm maintained a high AUC of 0.931 on completely unseen data. Crucially, the model achieved a highly balanced profile, with Sensitivity (0.856) and Specificity (0.836) being nearly equal. This balance (Accuracy 0.846, the mean of the two rates on the class-balanced test set) is of paramount importance in the field of PTM prediction. Traditional classifiers often skew towards the majority class (non-modified sites) due to the inherent data imbalance in proteomic datasets. By achieving high specificity without sacrificing sensitivity, Deep-Palm minimizes false positives (a common plague in computational proteomics that leads to wasted resources in experimental validation) while ensuring that bona fide palmitoylation sites are not overlooked.

Performance Benchmarking of Deep-Palm for S-Palmitoylation Site Prediction

We benchmarked Deep-Palm against established S-palmitoylation prediction tools: GPS-Palm [14], pCysMod [16], and MusiteDeep [17]. The results revealed that Deep-Palm represents a substantial 14.4% improvement over the second-best tool, GPS-Palm.

Figure 2 Performance comparison between Deep-Palm and other existing predictors. Receiver Operating Characteristic (ROC) curves of Deep-Palm, pCysMod, GPS-Palm, and MusiteDeep on the independent test set.

Figure 3 Quantitative performance metrics of Deep-Palm and existing predictors. Bar charts comparing the Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), and Area Under the Curve (AUC) of Deep-Palm, pCysMod, GPS-Palm, and MusiteDeep on the independent test set.

The comparative analysis highlights critical trade-offs in existing methodologies:

1. GPS-Palm tends to sacrifice specificity (0.516) for higher sensitivity, leading to a high rate of false positives. This suggests that its motif-based scoring system is overly permissive.
2. pCysMod, conversely, sacrifices sensitivity (0.547) for specificity, resulting in a high rate of false negatives.

Deep-Palm is the only predictor to maintain both metrics above 0.84, effectively reconciling this trade-off. This suggests that the inclusion of structural features (GCN branch) and deep evolutionary context (ESM branch) provides the model with a higher-resolution decision boundary than sequence-only models can achieve.

Evolutionary Conservation and Structural Determinants

To verify whether Deep-Palm captures conserved biochemical mechanisms rather than species-specific signatures, we evaluated its performance on species-specific subsets. The model exhibited exceptional robustness, achieving an AUC of 0.951 for Homo sapiens and 0.990 for Mus musculus.
Performance Stratification by Local and Global Cysteine Context

To investigate whether Deep-Palm's performance is sensitive to cysteine-related sequence context, we conducted three stratified evaluations on the test set. Specifically, we (i) grouped samples by the number of cysteines within the 31-aa input window, (ii) grouped samples by the distance to the nearest neighboring cysteine in the full-length protein sequence, and (iii) grouped samples by the relative position of the target cysteine with respect to the N- and C-termini. For each stratum, we computed sensitivity, specificity, accuracy, and ROC-AUC for Deep-Palm and baseline methods.

Figure 4 Local cysteine density (window-level)

Figure 5 Nearest-neighbor cysteine spacing (protein-level)

Figure 6 C-terminal positional context (protein-level)

Figure 7 N-terminal positional context (protein-level)

Robust Performance Across GO Functional Categories and Cellular Components

To assess robustness across proteins with diverse biological functions and subcellular localizations, we stratified the test set using UniProt Gene Ontology annotations, including Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). For each GO group, we computed sensitivity, specificity, accuracy, and ROC-AUC for Deep-Palm and baseline methods.
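The three cysteine-context stratifications described earlier in this section (window-level cysteine count, nearest-neighbor cysteine spacing, and distance to each terminus) reduce to simple sequence features. The helper names and the generic `stratify` function below are hypothetical, and the bin boundaries used in the figures are not given in the text, so none are assumed here.

```python
from collections import defaultdict


def cysteine_count(window):
    """Number of cysteines in a sequence window (the target cysteine included)."""
    return window.count("C")


def nearest_cysteine_distance(sequence, pos):
    """Residue distance from the cysteine at 0-based `pos` to the closest other cysteine.

    Returns None when the protein contains no other cysteine.
    """
    others = [i for i, aa in enumerate(sequence) if aa == "C" and i != pos]
    if not others:
        return None
    return min(abs(i - pos) for i in others)


def terminus_distances(sequence, pos):
    """Return (distance to N-terminus, distance to C-terminus) in residues."""
    return pos, len(sequence) - 1 - pos


def stratify(records, key_fn):
    """Group records into strata by the value of key_fn(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)
```

Once samples are grouped this way, the same sensitivity, specificity, accuracy, and ROC-AUC calculations can be run within each stratum; the GO-based analysis follows the same pattern with annotation terms as the grouping key.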
Deep-Palm exhibited consistently strong performance across GO categories, indicating robustness to functional and cellular-context heterogeneity.

Figure 8 AUC performance comparison between Deep-Palm and existing prediction tools across different Gene Ontology (GO) categories.

The Structural Logic of the Palmitoylation Code

The superior performance of Deep-Palm underscores a fundamental biological reality: the "palmitoylation code" is not merely linear but topological. The catalytic mechanism of ZDHHC enzymes involves a "ping-pong" reaction where the fatty acid is first transferred to the enzyme and then to the substrate. For this transfer to occur, the substrate cysteine must not only be exposed but must also be positioned to enter the deep, hydrophobic cavity of the ZDHHC enzyme, which is embedded within the membrane [18]. Traditional sequence-based predictors fail to capture this steric constraint. A cysteine might be surrounded by a favorable sequence motif (e.g., hydrophobic residues) but be buried in the protein core or occluded by a stable secondary structure, rendering it inaccessible to the ZDHHC active site. By explicitly modeling the 3D neighborhood via Graph Convolutional Networks (GCN) on ESMFold-predicted structures, Deep-Palm essentially performs a virtual "docking" check, filtering out false positives that are chemically plausible but structurally forbidden. This structural awareness is the likely driver of the dramatic improvement in specificity (0.848) compared to sequence-only tools like GPS-Palm (0.516).
Our model effectively learns to distinguish between potential sites (sequence motif present) and functional sites (structurally accessible).

Evolutionary Semantics vs. Explicit Motifs

Our ablation studies revealed a striking contrast between the k-mer branch (high training AUC, lower testing AUC) and the ESM branch (consistent high performance). This reinforces the utility of Protein Language Models (PLMs) in predictive proteomics. The k-mer branch, akin to traditional motif scanning, memorizes explicit patterns (e.g., Cys-Cys pairs). However, palmitoylation motifs are notoriously degenerate; there is no single consensus sequence analogous to the K-R-X-X-S/T motif in phosphorylation. ESM-2 embeddings, conversely, capture "soft" evolutionary constraints [39]. A cysteine that is evolutionarily conserved across orthologs, or that co-evolves with residues that maintain surface accessibility, is represented in the high-dimensional semantic space of the PLM. Deep-Palm utilizes this "evolutionary intelligence" to identify functional sites even in the absence of canonical motifs, explaining its robustness across species boundaries. This finding suggests that future PTM predictors should prioritize deep semantic encoding over explicit motif engineering.

Implications for Cancer Therapy and Drug Resistance

The ability to accurately predict S-palmitoylation has profound implications for oncology, particularly in understanding drug resistance mechanisms.

EGFR and TKI Resistance: In Non-Small Cell Lung Cancer (NSCLC), the S-palmitoylation of EGFR has been shown to regulate its nuclear trafficking and signaling persistence, contributing to resistance against Tyrosine Kinase Inhibitors (TKIs) like Gefitinib. Studies have shown that inhibiting Fatty Acid Synthase (FASN) with Orlistat blocks EGFR palmitoylation [5], restoring TKI sensitivity. Deep-Palm could be deployed to screen for palmitoylation sites on mutant EGFR variants, identifying potential vulnerabilities that could be targeted by such combination therapies.

FLT3 in Leukemia: Similarly, in Acute Myeloid Leukemia (AML), the internal tandem duplication (ITD) mutation of FLT3 relies on palmitoylation for its oncogenic signaling from the endoplasmic reticulum. Targeting the ZDHHC6-mediated palmitoylation of FLT3 represents a novel therapeutic avenue [40]. Deep-Palm's ability to pinpoint these regulatory cysteines facilitates the design of peptide inhibitors or small molecules that disrupt specific enzyme-substrate interactions.

Immune Checkpoints: The palmitoylation of PD-L1 protects it from ubiquitination and lysosomal degradation, thereby maintaining high surface levels that suppress T-cell immunity [41]. Accurate prediction of these sites aids in the development of "palmitoylation inhibitors" as adjuvants to immune checkpoint blockade therapy.

FASN in Hepatocellular Carcinoma (HCC): Recent evidence indicates that the palmitoyltransferase ZDHHC20 promotes hepatocarcinogenesis by directly S-palmitoylating fatty acid synthase (FASN) [42]. In chemical carcinogen–driven HCC mouse models, ZDHHC20 knockout significantly reduced tumorigenesis, and palmitoylation proteomics identified FASN as a ZDHHC20-dependent substrate. Mechanistically, ZDHHC20 palmitoylates FASN at Cys1471 and Cys1881, which stabilizes FASN; genetic loss or pharmacologic inhibition of ZDHHC20, as well as C1471S/C1881S mutation, accelerates FASN degradation. This stabilization appears to arise from competition between palmitoylation and ubiquitin–proteasome turnover, involving the SNX8–TRIM28 E3 ligase complex.
A prediction framework such as Deep-Palm can operationalize these observations by assigning residue-level palmitoylation propensities to cysteines in oncogenic and immune-regulatory proteins, enabling systematic prioritization of candidate regulatory sites for mechanistic testing. In practice, Deep-Palm can be used to compare wild-type and clinically observed variants to identify mutations that introduce, remove, or reweight palmitoylation-prone cysteines, and to nominate sites for targeted Cys-to-Ser/Ala mutagenesis, acylation assays, and palmitoyl-proteomics validation. By narrowing the hypothesis space to a tractable set of high-priority residues, such predictions can help delineate palmitoylation-dependent vulnerabilities and guide the rational design of combination strategies that target palmitoylation pathways alongside kinase inhibition or immune checkpoint blockade.

Interpretability and Future Directions

A pervasive criticism of deep learning in biology is the "black box" nature of the models. Deep-Palm addresses this through the attention mechanism in its Bi-LSTM and GCN branches. Visualizing the attention weights allows researchers to identify which residues, whether neighboring or distant, are "voting" for the palmitoylation event. This provides biophysical interpretability, generating hypotheses about structural motifs that can be validated by site-directed mutagenesis. Future iterations of Deep-Palm will aim to integrate tissue-specific expression data (scRNA-seq) to predict cell-type-specific palmitoylation events, addressing the limitation that protein abundance varies across tissues. Furthermore, expanding the framework to encompass other lipid modifications, such as N-myristoylation and prenylation, could ultimately yield a unified "Lipidome-Atlas" for the eukaryotic proteome.
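The wild-type versus variant comparison described above starts from a simple bookkeeping question: which cysteines does a mutation add or remove? The following is a minimal Python sketch of that step only and is not part of Deep-Palm itself. The function names (`apply_substitution`, `cysteine_positions`, `cysteine_changes`) and the toy sequence are hypothetical, the sketch assumes substitution-only variants of the same length as the wild type, and it computes no palmitoylation score; scoring the listed sites with a predictor would be a separate step.

```python
import re


def apply_substitution(seq: str, mutation: str) -> str:
    """Apply one substitution written like 'C6S' (1-based position)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if pos < 1 or pos > len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]


def cysteine_positions(seq: str) -> set:
    """Return the 1-based positions of all cysteines in a sequence."""
    return {i + 1 for i, residue in enumerate(seq) if residue == "C"}


def cysteine_changes(wild_type: str, variant: str) -> dict:
    """List cysteines gained, lost, and shared between two sequences.

    Assumption: the variant differs by substitutions only, so positions
    are directly comparable without an alignment.
    """
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have equal length (substitutions only)")
    wt = cysteine_positions(wild_type)
    var = cysteine_positions(variant)
    return {
        "gained": sorted(var - wt),   # candidate new sites to score
        "lost": sorted(wt - var),     # sites removed, e.g. by Cys-to-Ser
        "shared": sorted(wt & var),   # sites whose context may be reweighted
    }


# Toy sequence, not a real protein: a Cys-to-Ser change plus a new cysteine.
toy_wild_type = "MGCLACKRSG"
toy_variant = apply_substitution(apply_substitution(toy_wild_type, "C6S"), "R8C")
print(cysteine_changes(toy_wild_type, toy_variant))
```

The "shared" list is kept because a nearby substitution can change the predicted propensity of an unchanged cysteine, which is the "reweight" case mentioned in the text.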

Conclusion

Deep-Palm moves the computational prediction of post-translational modifications beyond linear sequence analysis. By incorporating the three-dimensional and evolutionary context of proteins, it bridges the gap between sequence data and functional phenotype. It serves not only as a tool for accurate site identification but also as a platform for decoding the regulatory logic of the palmitoylome, offering a new lens through which to view, and potentially treat, diseases driven by aberrant protein lipidation.

Conflict of interest

The authors declare no competing interests.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 32500579) and the Fundamental Research Funds for the Central Universities (Project No. 2025CDJZKPT-10).

References

[1] Zhong Q, Xiao X, Qiu Y, et al. Protein posttranslational modifications in health and diseases: functions, regulatory mechanisms, and therapeutic implications[J]. MedComm, 2023, 4(3): e261.
[2] Mesquita F S, Abrami L, Linder M E, et al. Mechanisms and functions of protein S-acylation[J]. Nature Reviews Molecular Cell Biology, 2024, 25(6): 488-509.
[3] Chamberlain L H, Shipston M J. The physiology of protein S-acylation[J]. Physiological Reviews, 2015, 95(2): 341-376.
[4] Linder M E, Deschenes R J. Palmitoylation: policing protein stability and traffic[J]. Nature Reviews Molecular Cell Biology, 2007, 8(1): 74-84.
[5] Trafficking and signaling by fatty-acylated and prenylated proteins[J]. Nature Chemical Biology. [2025].
[6] Ali A, Levantini E, Teo J T, et al. Fatty acid synthase mediates EGFR palmitoylation in EGFR mutated non-small cell lung cancer[J]. EMBO Molecular Medicine, 2018, 10(3): e8313.
[7] Ren J G, Xing B, Lv K, et al. RAB27B controls palmitoylation-dependent NRAS trafficking and signaling in myeloid leukemia[J]. The Journal of Clinical Investigation, 133(12): e165510.
[8] Kadry Y A, Lee J Y, Witze E S. Regulation of EGFR signalling by palmitoylation and its role in tumorigenesis[J]. Open Biology, 11(10): 210033.
[9] Blanc M, David F, Abrami L, et al. SwissPalm: Protein Palmitoylation database[J]. F1000Research, 2015, 4: 261.
[10] Meng Y, Zhang L, Zhang L, et al. CysModDB: a comprehensive platform with the integration of manually curated resources and analysis tools for cysteine posttranslational modifications[J]. Briefings in Bioinformatics, 2022, 23(6): bbac460.
[11] UniProt: the Universal Protein Knowledgebase in 2023[J]. Nucleic Acids Research. [2025].
[12] Exploring Protein S-Palmitoylation: Mechanisms, Detection, and Strategies for Inhibitor Discovery[J]. ACS Chemical Biology. [2025].
[13] Hannoush R N, Sun J. The chemical toolbox for monitoring protein fatty acylation and prenylation[J]. Nature Chemical Biology, 2010, 6(7): 498-506.
[14] Ning W, Jiang P, Guo Y, et al. GPS-Palm: a deep learning-based graphic presentation system for the prediction of S-palmitoylation sites in proteins[J]. Briefings in Bioinformatics, 2020.
[15] Ren J, Wen L P, Gao X, et al. CSS-Palm 2.0: an updated software for palmitoylation sites prediction[J]. Protein Engineering, Design & Selection: PEDS, 2008, 21(11): 639-644.
[16] pCysMod: Prediction of Multiple Cysteine Modifications Based on Deep Learning Framework[J]. PubMed. [2025].
[17] MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization[J]. PubMed. [2025].
[18] Rana M S, Kumar P, Lee C J, et al. Fatty acyl recognition and transfer by an integral membrane S-acyltransferase[J]. Science (New York, N.Y.), 2018, 359(6372): eaao6326.
[19] Verardi R, Kim J S, Ghirlando R, et al. Structural basis for substrate recognition by the ankyrin repeat domain of human DHHC17 palmitoyltransferase[J]. Structure (London, England: 1993), 2017, 25(9): 1337-1347.e6.
[20] Lin D T S, Conibear E. ABHD17 proteins are novel protein depalmitoylases that regulate N-Ras palmitate turnover and subcellular localization[J]. eLife, 2015, 4: e11306.
[21] Palmitoylated acyl protein thioesterase APT2 deforms membranes to extract substrate acyl chains[EB/OL]. Nature Chemical Biology. [2026-02-03]. https://www.nature.com/articles/s41589-021-00753-2?utm_source=chatgpt.com.
[22] Evolutionary-scale prediction of atomic-level protein structure with a language model[J]. Science. [2025].
[23] Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences[J]. Proceedings of the National Academy of Sciences of the United States of America, 2021, 118(15): e2016239118.
[24] Lin Z, Akin H, Rao R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction[A]. bioRxiv, 2022: 2022.07.20.500902.
[25] Kipf T N, Welling M. Semi-Supervised Classification with Graph Convolutional Networks[J]. 2017.
[26] Schuster M, Paliwal K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[27] Gradient-based learning applied to document recognition[Z]. IEEE Xplore. [2025].
[28] Wolpert D H. Stacked generalization[J]. Neural Networks, 1992, 5(2): 241-259.
[29] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences[J]. Bioinformatics (Oxford, England), 2006, 22(13): 1658-1659.
[30] Li J Y, Jin S, Tu X M, et al. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network[J]. Briefings in Bioinformatics, 2021, 22(6): bbab233.
[31] DeepProSite: structure-aware protein binding site prediction using ESMFold and pretrained language model[J]. Bioinformatics. [2025].
[32] Highly accurate protein structure prediction with AlphaFold[J]. Nature. [2025].
[33] Jiao S, Ye X, Sakurai T, et al. Integration of pre-trained protein language models with equivariant graph neural networks for peptide toxicity prediction[J]. BMC Biology, 2025, 23(1): 229.
[34] Semi-Supervised Learning With Graph Learning-Convolutional Networks[J]. IEEE Xplore. [2025].
[35] Kawashima S, Pokarowski P, Pokarowska M, et al. AAindex: amino acid index database, progress report 2008[J]. Nucleic Acids Research, 2008, 36(Database issue): D202-205.
[36] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[37] Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library[J]. 2019.
[38] Decoupled Weight Decay Regularization[J]. Semantic Scholar. [2025].
[39] Protein language models learn evolutionary statistics of interacting sequence motifs[J]. PNAS. [2025].
[40] Lv K, Ren J G, Han X, et al. Depalmitoylation rewires FLT3-ITD signaling and exacerbates leukemia progression[J]. Blood, 2021, 138(22): 2244-2255.
[41] Inhibiting PD-L1 palmitoylation enhances T-cell immune responses against tumours[J]. Nature Biomedical Engineering. [2025].
[42] ZDHHC20 mediated S-palmitoylation of fatty acid synthase (FASN) promotes hepatocarcinogenesis[J]. PubMed. [2025].
