PepMCP: A Graph-Based Membrane Contact Probability Predictor for Membrane-Lytic Antimicrobial Peptides

preprint OA: closed CC-BY-NC-ND-4.0

Abstract

Motivation: The membrane-lytic mechanism of antimicrobial peptides (AMPs) is often overlooked during their in silico discovery process, largely due to the lack of a suitable metric for the membrane-binding propensity of peptides. Previously, we proposed a characteristic called membrane contact probability (MCP) and applied it to the identification of membrane proteins and membrane-lytic AMPs. However, previous MCP predictors were not trained on short peptides targeting bacterial membranes, which may result in unsatisfactory performance for peptide studies.

Results: In this study, we present PepMCP, a peptide-tailored model for predicting MCP values of short peptides. We collected more than 500 membrane-lytic AMPs from the literature, conducted coarse-grained molecular dynamics (MD) simulations for these AMPs, and extracted their residue MCP labels from MD trajectories to train PepMCP. PepMCP employs the GraphSAGE framework to address this node regression task, encoding each peptide sequence as a graph with 4-hop edges. PepMCP achieved a Pearson correlation coefficient of 0.883 and an RMSE of 0.123 on the node-level test set. It can recognize membrane-lytic AMPs from the MCP values predicted for each sequence, thereby facilitating mechanism-driven AMP discovery. Additionally, we provide a database, MemAMPdb, which includes the membrane-lytic AMPs, as well as the PepMCP web server for easy access.

Availability and Implementation: The code and data are available at https://github.com/ComputBiophys/PepMCP.

Contact: [email protected]

Supplementary Information: Supplementary data are available online.

Key words: Antimicrobial peptide, Membrane contact probability, Graph neural network, Membrane-lytic mechanism

Introduction

Antimicrobial peptides (AMPs) are short peptides widely present in the innate immune system of various organisms. AMPs are known primarily for disrupting membranes, a mechanism that gives them broad-spectrum antimicrobial activity and reduces the likelihood of drug resistance (Júnior et al., 2025). Recently, machine learning techniques have facilitated the discovery of new AMPs by screening hundreds of metagenomes (Ma et al., 2022; Santos-Júnior et al., 2024), proteomes (Li et al., 2025c), and theoretical sequence spaces (Huang et al., 2023; Szymczak et al., 2025). However, the membrane-lytic mechanism of AMPs is usually not considered in these computational procedures because no metric is available to characterize the membrane-lytic propensity of peptides. Conventional physicochemical properties, such as hydrophobicity, hydrophobic moment, and the Boman index, provide a general description of sequences; however, they are coarse metrics and cannot serve as indicators for membrane-lytic AMPs (Santos-Júnior et al., 2024). There are some structure-dependent tools, such as DREAMM for predicting protein-membrane interfaces (Chatzigoulas and Cournia, 2022; Paranou et al., 2024) and PPM for predicting the spatial orientation of proteins in membranes (Lomize et al., 2022). However, these tools were not primarily developed for short peptides. PMIpred (van Hilten et al., 2024) trained a neural network model on binding free energies calculated from molecular dynamics (MD) simulations, but it only used fixed-length peptides (24 residues) and focused on recognizing curvature-sensing peptides, so it might not be applicable to membrane-lytic AMPs of different lengths.

bioRxiv preprint doi: https://doi.org/10.64898/2026.02.01.703163; this version posted February 3, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Previously, we proposed a characteristic called membrane contact probability (MCP) for studying the structure and function of membrane proteins. MCP is defined as the likelihood of each residue in a protein sequence being in direct contact with the hydrophobic core of a membrane (Wang et al., 2022). MCP can be extracted from MD simulations by calculating the fraction of simulation time during which the residue α-carbon is within 6 Å of the lipid acyl chain carbon atoms. Given the time-consuming nature of running MD simulations, we developed deep learning-based MCP predictors and applied them in contact map prediction (Wang et al., 2022), membrane protein screening and design (Wang et al., 2025a; Li et al., 2025a), and the study of mechanosensitive protein dynamics (Han et al., 2025). In particular, we utilized MCP to discover novel membrane-lytic AMPs from human and frog metaproteomes. We constructed a pipeline incorporating the prediction of MCP, helical propensity, and anti-parallel dimerization, and we successfully discovered seven membrane-lytic AMPs (Li et al., 2025b).

Even though MCP has shown potential in studying membrane-lytic AMPs, previous MCP predictors have some limitations. First, they were primarily developed for membrane proteins, and their training data lacked peptides. The minimum sequence length in their training set was 26 amino acids, which affected the prediction accuracy for short peptides. Second, the MCP labels were derived from the MemProtMD database (Newport et al., 2018), where the membranes in MD simulations were composed of pure 1,2-dihexadecanoyl-rac-glycero-3-phosphocholine (DPPC) lipids, mimicking mammalian membranes rather than bacterial membranes. Although some membrane proteins exhibited similar MCP distributions with phosphatidylcholine (PC) or phosphatidylethanolamine (PE)/phosphatidylglycerol (PG) lipids (Wang et al., 2022), many AMPs behaved differently and exhibited membrane selectivity (Suarez-Leston et al., 2022; Dong et al., 2025). Therefore, it is crucial to develop a peptide-specific MCP predictor for bacterial membranes to support the discovery of membrane-lytic AMPs.

In this study, we built a peptide-tailored MCP predictor, PepMCP, based on the graph sample and aggregate (GraphSAGE) model. To train the PepMCP model, we collected a high-quality dataset containing 516 membrane-lytic AMPs and conducted coarse-grained (CG) MD simulations to calculate their MCP values while interacting with bacterial membranes (Fig. 1a). PepMCP encoded a peptide as a graph with 4-hop edges and ESM C node embeddings to capture spatial information without requiring peptide structures (framework in Fig. 1b). PepMCP predicted the residue MCP value of each node and achieved a Pearson correlation coefficient of 0.883 and a root mean square error (RMSE) of 0.123 on the node-level split test set. We demonstrate that PepMCP can not only predict the MCP patterns in peptides, but also distinguish membrane-lytic AMPs from soluble peptides using sequence-averaged MCP values.
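As a minimal sketch of the fraction-of-time definition of MCP given above, the following plain-Python function converts a per-frame table of residue-to-lipid-tail distances into MCP values. The input layout, the function name, and the inclusive comparison at the 6 Å cutoff are assumptions made for illustration; the text does not specify them, and this is not the authors' analysis code.

```python
from typing import List

CUTOFF_ANGSTROM = 6.0  # contact threshold quoted in the text


def membrane_contact_probability(
    min_dist: List[List[float]], cutoff: float = CUTOFF_ANGSTROM
) -> List[float]:
    """Return the fraction of frames in which each residue is in contact.

    min_dist[t][i] is assumed to hold, for frame t, the minimum distance (in Å)
    between residue i and any lipid acyl-chain atom or bead.
    """
    if not min_dist:
        raise ValueError("at least one trajectory frame is required")
    n_res = len(min_dist[0])
    counts = [0] * n_res
    for frame in min_dist:
        if len(frame) != n_res:
            raise ValueError("every frame must list the same residues")
        for i, dist in enumerate(frame):
            if dist <= cutoff:  # assumption: the cutoff itself counts as contact
                counts[i] += 1
    return [count / len(min_dist) for count in counts]


# Toy trajectory: 4 frames, 3 residues (hypothetical distances).
toy_distances = [
    [4.0, 7.5, 12.0],
    [5.5, 5.9, 11.0],
    [6.5, 8.0, 10.0],
    [3.0, 6.0, 9.0],
]
toy_mcp = membrane_contact_probability(toy_distances)
```

In this toy case the first residue is in contact in three of four frames, the second in two, and the third in none, so the values fall between 0 and 1 as expected for a probability.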

Materials and methods

Data Collection

A membrane-lytic AMP dataset was manually curated from the literature. Using the keyword 'antimicrobial peptide', the PubMed database was searched for publications reporting experimentally validated membrane-lytic AMPs (as of November 2024). The membrane-lytic mechanisms of AMPs were confirmed with the following approaches: (1) membrane permeability assays, in which membrane integrity–sensitive fluorescent dyes were incubated with bacteria and peptides, with propidium iodide (PI) used to indicate inner membrane integrity, N-phenyl-1-naphthylamine (NPN) used to indicate outer membrane permeabilization, and 3,3'-dipropylthiadicarbocyanine iodide (DiSC3(5)) used to assess cytoplasmic membrane depolarization (Ma et al., 2022); (2) liposome leakage assays, in which bacterial membrane–mimicking liposomes encapsulating fluorescent dyes were used to evaluate AMP-induced membrane disruption (Ambroggio et al., 2005); (3) scanning electron microscopy (SEM) and transmission electron microscopy (TEM), to examine morphological alterations of bacterial membranes; and (4) fluorescence microscopy and flow cytometry, combined with fluorescent dyes such as PI or SYTOX green, to detect changes in membrane integrity, where increased fluorescence generally indicates membrane disruption (Buck et al., 2019). During this collection process, AMP sequences with complex chemical modifications, such as cyclization, fatty acid conjugation, or non-canonical amino acids, were excluded. Sequences with lengths ranging from 10 to 51 amino acids were retained. Redundant sequences were removed using CD-HIT (Li et al., 2001) at a threshold of 90%. Ultimately, the membrane-lytic AMP dataset comprised 516 sequences, which were used as the positive set with MCP labels obtained from our MD simulations.

For the negative set, soluble peptides were collected from the Protein Data Bank (as of May 2020) following a procedure similar to that used for the previous MCP predictor (Wang et al., 2022).
The sequence lengths ranged from 10 to 51 residues. Peptides with 'antimicrobial', 'antibiotic', or related annotations were discarded. A total of 1307 sequences remained after redundancy removal using CD-HIT with a threshold of 70%. Then, 516 peptides were randomly selected as the negative set, and their residues were labeled with zero.

The training, validation, and testing sets were partitioned using both residue-level and sequence-level split approaches. On the residue level, 20% of nodes were retained for testing, while the remaining 80% of nodes were divided into training and validation sets through a 5-fold cross-validation approach for each sequence. In total, 24,636 nodes were included in the training and validation sets, and 5,663 nodes were included in the test set. On the sequence level, 206 sequences (20% of the total) and all their constituent nodes were designated for testing. The remaining 826 sequences were used for 5-fold cross-validation.

Fig. 1. Overview of the study. a. Workflow of PepMCP, including collecting the membrane-lytic AMPs from literature, running coarse-grained MD simulations, calculating the membrane contact probability (MCP) values of each residue, training the PepMCP predictor, and building the MemAMPdb database. b. Framework of the PepMCP model. PepMCP receives peptide sequences as input, encodes them as 4-hop graphs, and utilizes the GraphSAGE model to accomplish the node-level regression task of predicting MCP values. The dashed lines in the octagram represent the sequential order of residues in a peptide, while the solid lines represent 4-hop connected edges.

MD Simulation

The structures of the 516 membrane-lytic AMPs were predicted using ColabFold 1.5.5 (Jumper et al., 2021; Mirdita et al., 2022) and mapped to a CG representation via the martinize.py script (de Jong et al., 2013). An elastic network was adopted to maintain the secondary structures, with a force constant of 500 kJ mol⁻¹ nm⁻² and an interaction cut-off range between 0.5 nm and 0.9 nm (Monticelli et al., 2008; Periole et al., 2009; Poma et al., 2017). The membrane bilayers were constructed using the CHARMM-GUI Martini Maker (Wu et al., 2014; Qi et al., 2015) and consisted of 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphoethanolamine (POPE) and 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphoglycerol (POPG) in a 3:1 molar ratio. To optimize computational efficiency, systems were scaled into three sizes based on the longitudinal dimension of the peptide (Fig. S1a): Small (≤3 nm, 40 lipids), Medium (3–5 nm, 69 lipids), and Large (>5 nm, 168 lipids). Each lipid bilayer underwent a 100-ns equilibration to achieve phase stability. Peptides were then positioned in the aqueous phase, parallel to the membrane surface at an initial distance of 15–20 Å. All systems were solvated with standard Martini water beads and ionized with 150 mM NaCl for neutralization.

All MD simulations were performed using GROMACS 2022.5 (Abraham et al., 2015) with the Martini 2.2 force field (de Jong et al., 2013). A 5000-step energy minimization was conducted using the steepest descent algorithm. Systems were equilibrated for 1 ns in the NVT ensemble, followed by a 50 ns NPT equilibration, during which harmonic positional restraints were applied to the peptide backbone beads. For production simulations, unrestrained MD trajectories were produced for 2–5 µs per system, with a time step of 20 fs.
The temperature was kept at 310 K using the v-rescale thermostat (Bussi et al., 2007) and the pressure at 1 bar using the Parrinello-Rahman barostat with semi-isotropic coupling (Parrinello and Rahman, 1981). Non-bonded interactions were calculated using a cut-off of 1.2 nm for both van der Waals and electrostatic forces. The latter were treated using the reaction-field method with a dielectric constant ($\epsilon_{rf}$) of 15.

MCP Calculation

The membrane-binding property of the 516 independent trajectories was evaluated using the minimum distance between the protein backbone and the lipid phosphate groups (BB–PO4). Of the 516 membrane-lytic AMPs, 514 showed stable membrane binding over the final 1 µs of the MD trajectories (Fig. S1b, S2a and b), while two outliers did not exhibit stable interactions with membranes over the 5 µs trajectories in our simulations (Fig. S2c and d). This suggests that the majority of the AMPs in the positive dataset are membrane-interacting peptides, and that the MCP values obtained from MD simulations can be used to characterize the membrane-binding features of these AMPs. The MCP between individual peptide residues and the lipid bilayer was calculated using the final 1 µs of the MD trajectories. A contact was defined based on a spatial proximity threshold of 6.0 Å between a residue's backbone (BB) bead and any bead within the hydrophobic lipid tail (C1A, C1B, D2A, C2B, C3A, C3B, C4A, and C4B) of both POPE and POPG (Wang et al., 2022). MCP values were calculated using the MDAnalysis library (Michaud-Agrawal et al., 2011).

PepMCP Model

Model Framework

Each peptide sequence was encoded as a graph $G = (V, E)$. The nodes ($V = \{v_0, v_1, \ldots, v_n\}$) represented the $n$ residues in the sequence, and the node features ($H \in \mathbb{R}^{m \times n}$) were derived from a protein language model, ESM C 300M (Hayes et al., 2025), with feature dimension $m = 960$. The edges were encoded using the 4-hop approach, in which an edge $e$ connects $v_i$ and $v_{i+4}$. This edge encoding effectively extracted the peptide structural information, as most membrane-lytic AMPs are α-helical.

The GraphSAGE method (Hamilton et al., 2017) was adopted as our model to process peptide graphs. GraphSAGE is an inductive approach that includes three steps: neighborhood sampling, feature aggregation, and label prediction. Using an aggregate function $\mathrm{aggregate}(\cdot)$, the feature of node $v \in V$ at layer $(k+1)$ is:

$$h_{S(v)}^{(k+1)} = \mathrm{aggregate}\left(\{h_u^{(k)}, \forall u \in S(v)\}\right) \tag{1}$$

$$h_v^{(k+1)} = \sigma\left(W^{(k+1)} \cdot \mathrm{concat}\left(h_v^{(k)}, h_{S(v)}^{(k+1)}\right)\right) \tag{2}$$

where $S(v)$ is the neighborhood node set of $v$, $\sigma(\cdot)$ is the ReLU activation function, $W$ is a trainable weight matrix, and $\mathrm{concat}(\cdot)$ is the concatenation operation. The pooling aggregator was used in PepMCP. It conducts an element-wise max pooling after a linear transformation of the neighborhood node features:

$$\mathrm{aggregate}_{\mathrm{pool}} = \max\left(\{\sigma(W_{\mathrm{pool}} h_u^{(k)} + b), \forall u \in S(v)\}\right) \tag{3}$$

Three layers were included to compress the 960-dimensional features to 512 dimensions and to predict the one-dimensional label. The outputs were transformed into probability values using the sigmoid function.

Model Implementation

PepMCP was developed using the PyTorch and DGL libraries (Wang et al., 2019). The loss function of PepMCP was the mean squared error. The optimizer was Adam, with a learning rate of 0.0001. The batch size for the graphs was 4. The number of training epochs was set to 20, and early stopping was applied when the validation loss showed no improvement for 10 epochs.
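To make the 4-hop graph encoding and the pooling aggregator of Eqs. (1)–(3) concrete, the sketch below builds the edges for a toy peptide and applies one GraphSAGE-style update in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: edges are treated as undirected, a node with no neighbours receives a zero message, the bias in Eq. (2) is omitted, and all function names, weights, and feature values are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def k_hop_edges(n_residues: int, k: int = 4) -> List[Tuple[int, int]]:
    """Connect residue i with residue i + k, stored in both directions."""
    edges: List[Tuple[int, int]] = []
    for i in range(n_residues - k):
        edges.append((i, i + k))
        edges.append((i + k, i))
    return edges


def neighbor_sets(n_residues: int, edges: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """S(v): the source nodes of all edges that point at v."""
    nbrs: Dict[int, List[int]] = {v: [] for v in range(n_residues)}
    for src, dst in edges:
        nbrs[dst].append(src)
    return nbrs


def relu(x: float) -> float:
    return max(0.0, x)


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for small dense matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def pool_aggregate(h: List[Vector], nbrs: List[int], W_pool: Matrix, b_pool: Vector) -> Vector:
    """Eq. (3): element-wise max over ReLU(W_pool h_u + b) for u in S(v)."""
    if not nbrs:
        # Assumption: a node without neighbours receives a zero message.
        return [0.0] * len(b_pool)
    transformed = [[relu(z) for z in affine(W_pool, h[u], b_pool)] for u in nbrs]
    return [max(column) for column in zip(*transformed)]


def sage_pool_layer(h: List[Vector], nbrs: Dict[int, List[int]],
                    W_pool: Matrix, b_pool: Vector, W: Matrix) -> List[Vector]:
    """Eqs. (1)-(2): concatenate h_v with the pooled message, apply W, then ReLU."""
    zero_bias = [0.0] * len(W)
    updated = []
    for v in range(len(h)):
        message = pool_aggregate(h, nbrs[v], W_pool, b_pool)
        updated.append([relu(z) for z in affine(W, h[v] + message, zero_bias)])
    return updated


# Toy peptide: 6 residues with 2-dimensional features (the paper uses 960).
features = [[1.0, -1.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.0], [2.0, 5.0], [-1.0, 1.0]]
edges = k_hop_edges(len(features), k=4)
nbrs = neighbor_sets(len(features), edges)
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = sage_pool_layer(features, nbrs, identity, [0.0, 0.0], [[1.0, 1.0, 1.0, 1.0]])
```

With six residues, only the pairs (0, 4) and (1, 5) are connected, so residues 2 and 3 are updated from their own features alone. In practice a library layer would replace these helper functions; DGL's `SAGEConv` offers a `'pool'` aggregator type, although the text does not state which layer class was used.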
Four regression metrics were used to evaluate model performance: Spearman's rank correlation coefficient (Spearman), the Pearson correlation coefficient (Pearson), the coefficient of determination ($R^2$), and the root mean square error (RMSE).

$$\mathrm{Spearman} = \frac{\sum_{i=1}^{n} \left(r(a_i) - \overline{r(a)}\right)\left(r(b_i) - \overline{r(b)}\right)}{\sqrt{\sum_{i=1}^{n} \left(r(a_i) - \overline{r(a)}\right)^2 \sum_{i=1}^{n} \left(r(b_i) - \overline{r(b)}\right)^2}} \tag{4}$$

$$\mathrm{Pearson} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2 \sum_{i=1}^{n} (b_i - \bar{b})^2}} \tag{5}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (a_i - b_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2} \tag{6}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (b_i - a_i)^2} \tag{7}$$

where $a_i$ and $b_i$ denote the actual and predicted values of the sequence $i$, respectively, $r(\cdot)$ calculates the ranking number, and an overbar denotes the mean.

Application on Independent Membrane-Lytic AMPs

An external test set was curated to demonstrate the application of PepMCP on membrane-lytic AMPs. A total of 34 novel membrane-lytic AMPs were collected from the literature published in 2024–2025, excluding those that overlapped with the sequences in the training set. These membrane-lytic AMPs were also confirmed by at least one of the aforementioned experimental approaches. Similarly, coarse-grained MD simulations were conducted using AlphaFold structures of these 34 peptides. The residue-level MCP values were then calculated as the ground truth labels. A collection of soluble peptides was obtained from the PDB (after May 2020) with lengths ranging from 10 to 51 residues, excluding those associated with membrane-related or antibiotic-related annotations. Consequently, 46 soluble peptides remained and were assigned zero labels. This external test set includes 80 peptides and 2133 residue nodes.
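The four regression metrics of Eqs. (4)–(7) can be sketched in plain Python as follows. This is an illustrative version, not the evaluation code used in the study; in particular, assigning average ranks to tied values is an assumption, since the text only says that r(·) returns the ranking number, and the toy labels are hypothetical.

```python
import math
from typing import List, Sequence


def mean(x: Sequence[float]) -> float:
    return sum(x) / len(x)


def average_ranks(x: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions (assumed)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Eq. (5)."""
    ma, mb = mean(a), mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - ma) ** 2 for ai in a) * sum((bi - mb) ** 2 for bi in b))
    return num / den


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Eq. (4): the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def r_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Eq. (6): a holds the actual values, b the predictions."""
    ma = mean(a)
    ss_res = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    ss_tot = sum((ai - ma) ** 2 for ai in a)
    return 1 - ss_res / ss_tot


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Eq. (7)."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)) / len(a))


# Toy residue-level labels (actual) and predictions (hypothetical values).
actual = [0.0, 0.2, 0.2, 0.9]
predicted = [0.1, 0.3, 0.2, 0.8]
scores = {
    "spearman": spearman(actual, predicted),
    "pearson": pearson(actual, predicted),
    "r2": r_squared(actual, predicted),
    "rmse": rmse(actual, predicted),
}
```

Because every residue of a negative-set peptide carries the label zero, ties are common in the actual values, which is why the tie-handling choice in the rank function matters for the Spearman score.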

Results

Membrane-Lytic AMPs and their MCP Values

Since current AMP databases often lack clear and verifiable annotations of mechanisms, we curated a high-quality membrane-lytic AMP dataset containing 516 AMPs with experimentally validated mechanisms. Most of the AMPs originated from natural sources, while some were generated by deep learning models. We then utilized MD simulations to investigate the peptide-membrane interactions and calculate the MCP values of each residue. Considering that not all AMPs had solved 3D structures, we used their AlphaFold-predicted structures in MD simulations; 88% of the structures had an average pLDDT > 70 (Fig. S3a). Most membrane-lytic AMPs were α-helical, with a few adopting β-sheet, random coil, or αβ secondary structures (Fig. 2a). At the final frame of the simulations, these peptides were stably attached to the surface of bacterial membranes (Fig. S1b), often with one side contacting the lipid tails and exhibiting larger MCP values.

Fig. 2b shows the MCP frequency distributions of the 20 canonical amino acids, ordered according to the Eisenberg hydrophobicity scale. Hydrophobic residues such as isoleucine (I), phenylalanine (F), valine (V), and leucine (L) had a higher proportion of MCP values greater than 0.5 and tended to be in contact with the lipid tails via hydrophobic interactions. Alanine (A) and glycine (G) also showed a certain proportion of MCP values close to one, due to their simple side chains. Most MCP values of polar amino acids were close to zero.

We also included soluble peptides as negative samples to improve the generalization capability of our proposed model. The amino acid frequencies of membrane-lytic AMPs and soluble peptides differed: the former contained a higher frequency of positively charged lysine (K) and a lower frequency of negatively charged aspartic acid (D) and glutamic acid (E), which is a characteristic feature of AMPs (Fig. 2c).
We compared some physicochemical properties of membrane-lytic AMPs and soluble peptides in Fig. 2d. Membrane-lytic AMPs exhibited a wider hydrophobicity distribution and larger hydrophobic moments. They also showed a relatively smaller Boman index, indicating a propensity for membrane interaction (Boman, 2003). However, the two peptide sets still overlapped, so these properties could not distinguish all the membrane-lytic AMPs from soluble peptides, making it essential to develop other indicators.

Fig. 2. Dataset characterization. a. Secondary structures of membrane-lytic AMPs as predicted by AlphaFold2. b. Frequency distribution of MCP values across 20 amino acids. The maximum x-axis value (frequency) of each subplot is 0.6. The order of amino acids follows the hydrophobicity. c. Amino acid frequency of membrane-lytic AMPs and soluble peptides. d. Physicochemical properties (hydrophobicity, hydrophobic moment, and Boman index) of membrane-lytic AMPs and soluble peptides. Eisenberg's scale was used for all hydrophobicity measurements.

PepMCP is a Peptide-Tailored MCP Predictor

Based on the membrane-lytic AMPs and their MCP values, we developed the PepMCP model, focusing on these short peptides. The lengths of all the sequences in the training set ranged from 10 to 51 amino acids. We encoded the peptide sequences as graphs, with each residue representing a graph node. We designed 4-hop edges to capture the spatial interactions of peptides without introducing their 3D structures, by connecting the i-th node with the (i+4)-th node (Fig. 1b). We implicitly encoded the sequential information of peptide sequences in their node embeddings from the ESM C language model. We utilized the inductive GraphSAGE model to process the features, which aggregated the features on the local subgraph and produced new features for the unseen nodes.

Table 1. Performance of PepMCP, ProtRAP-LM, and the base model on the test set. Bold values indicate the best performance for the corresponding metric.

| Data split | Model | Spearman↑ | Pearson↑ | RMSE↓ |
|---|---|---|---|---|
| Residue | Base | 0.1843 | 0.2186 | 0.3694 |
| Residue | ProtRAP-LM | 0.1114 | 0.1133 | 0.3048 |
| Residue | PepMCP | **0.7963** | **0.8825** | **0.1228** |
| Sequence | Base | 0.2069 | 0.2592 | 0.3636 |
| Sequence | ProtRAP-LM | 0.0532 | 0.1170 | 0.3061 |
| Sequence | PepMCP | **0.7168** | **0.7022** | **0.1862** |

We compared the performance of the PepMCP model with that of two previous MCP predictors. One was a special version of the MCP predictor, which used the SPIDER3-Single (Heffernan et al., 2018) features to replace multiple sequence alignments (MSAs), as short peptides often lack abundant MSAs. We utilized this predictor to screen novel membrane-lytic AMPs with dimerization metrics and reached a moderate success rate of 39% (Li et al., 2025b). We refer to this model as the 'base' model in the following comparisons. The other model was ProtRAP-LM, which used ESM-2 to encode protein inputs and could predict their relative accessibility simultaneously (Wang et al., 2025a). The minimum sequence length in their training set was 26, as their predictive goal was not primarily short peptides, which resulted in limited predictive ability on our test set (Table 1). On the residue-level split test set, the base model and ProtRAP-LM obtained Pearson correlation coefficients of only 0.2186 and 0.1133, respectively. The base model performed slightly better when using the sequence-level split, achieving a Spearman of 0.2069 and a Pearson of 0.2592, while ProtRAP-LM performed worse.
In contrast, the PepMCP model significantly outperformed the two models, achieving a Spearman correlation of 0.7963 and a Pearson correlation of 0.8825 on the residue-level split set (Table 1). For the sequence-level split, the performance was slightly lower but the correlations remained above 0.70. Therefore, PepMCP is a customized and capable model for predicting peptide-specific MCP values.

Comparison of Different Architectures

We compared the performance of four graph neural network variants: the graph convolutional network (GCN) (Kipf and Welling, 2016), GraphSAGE (Hamilton et al., 2017), the graph attention network (GAT) (Veličković et al., 2018), and the topology adaptive graph convolutional network (TAGCN) (Du et al., 2018). Fig. 3a shows the results of 5-fold cross-validation. The GraphSAGE model achieved the best performance among the four types of GNNs, with a Spearman coefficient of 0.8016±0.0029, a Pearson correlation coefficient of 0.8762±0.0060, an R² of 0.7664±0.0107, and an RMSE of 0.1258±0.0029. The inductive aggregation of the GraphSAGE model effectively captured the information of peptide sequences. Among the other models, TAGCN had a comparable Spearman of 0.8016±0.0039; however, it did not match GraphSAGE on the other three metrics. The aggregator is a crucial component of the GraphSAGE model, so we tested four types of aggregators (Fig. 3b). The max pooling aggregator ('pool') demonstrated the best performance across the four regression metrics.

We also compared different edge encoding approaches in PepMCP. We tested k-hop edges (connecting nodes v_i and v_{i+k}) for k = 1, 2, 3, or 4 (Fig. 3c). Overall, the best performance was observed when k = 4. We suggest that this is related to the secondary structure of the peptides, as almost 90% of them were α-helical. There are approximately 3.6 residues per turn in the helix, so the i-th and (i+4)-th residues are spatially adjacent.
Therefore, encoding peptide graphs with 4-hop edges is a simple yet effective method to process residue- level predictions. The sequential graph (k = 1) contributed to the second-to-last worst performance, indicating that feature aggregation of the residues along the sequence provided limited useful connections. We also tested double combinations of k- hop edges, encoding k = 4 edges along with another k-hop edge where k = 1, 2, or 3. Even though some of them exhibited slightly higher metrics compared to k = 4, for instance, k = 1, 4 had a Spearman of 0.8110±0.0026, and k = 3, 4 had a Pearson of 0.8779±0.0044, they did not demonstrate superior performance compared to k = 4 in terms ofR 2 and RMSE values. For the node features, we compared three lightweight protein language models, Ankh (Elnaggar et al., 2023), ESM C (300M and 600M) (Hayes et al., 2025), and Profluent-E1 (150M, 300M, and 600M) (Jain et al., 2025). Table 2 presents the results of 5-fold cross validation, where ESM C 300M achieved the best performance. Interestingly, the 150M and 300M versions of .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 3, 2026. ; https://doi.org/10.64898/2026.02.01.703163doi: bioRxiv preprint 6 Dong et al. Fig. 3.Comparison of different model architectures.a. Graph neural network (GNN) variants.b. Aggregator of GraphSAGE (abbreviated as SAGE in the figure).c. K-hop edge for encoding sequence graphs. Scatter plots show Spearman, Pearson, andR 2 (higher is better), while bar plots show RMSE values (lower is better). All the results are average values on the test set using 5-fold cross validation models, and the error bars represent the standard errors. T able 2. Performance using node features from different language models on the test set. 
Average values ± standard deviations from 5-fold cross validation were reported. The best value for each metric is marked with an asterisk.

Node Feature    Param.   Spearman↑         Pearson↑          R²↑               RMSE↓
ESM C           300M     0.8016±0.0029*    0.8762±0.0060     0.7664±0.0107*    0.1258±0.0029*
ESM C           600M     0.8000±0.0039     0.8754±0.0020     0.7603±0.0081     0.1274±0.0021
Ankh-base       450M     0.7956±0.0038     0.8775±0.0037*    0.7590±0.0211     0.1277±0.0054
Profluent-E1    150M     0.6812±0.1191     0.5475±0.3837     0.3455±0.4736     0.1964±0.0760
Profluent-E1    300M     0.7569±0.0730     0.7169±0.3126     0.5615±0.3976     0.1597±0.0647
Profluent-E1    600M     0.7825±0.0079     0.8753±0.0028     0.7586±0.0084     0.1279±0.0022

The Profluent-E1 150M and 300M models displayed large standard deviations because they failed on one or two of the five cross-validation folds. In brief, the final PepMCP model used the GraphSAGE architecture with the pool aggregator, encoded edges with the 4-hop approach, and extracted node features from ESM C 300M.

Recognizing Membrane-Lytic AMPs with PepMCP

We prepared an external test set comprising 34 new membrane-lytic AMPs and 46 soluble non-AMPs. The average similarity between these sequences and all 1032 peptides in the training set was 39.7%, as calculated by local Smith-Waterman alignment. Although PepMCP was trained on residue-level split data, it performed well in predicting the MCP values for all residues of unseen sequences, achieving a Pearson correlation coefficient of 0.8226 on this external test set (Fig. 3). Similarly, PepMCP significantly outperformed the previous ProtRAP-LM and the base model.

In addition, we used the average MCP prediction over all residues in a peptide as a sequence-level prediction, which allowed us to evaluate PepMCP’s ability to distinguish membrane-lytic AMPs from other peptides using classification metrics. Following the previously established threshold (Li et al., 2025b), we regarded a peptide as positive if its average MCP was greater than 0.2 and as negative otherwise. Table 3 also shows the classification results on this external test set, with membrane-lytic AMPs as positive samples and soluble non-AMPs as negative samples. PepMCP achieved AUC, accuracy, and precision scores above 0.9. Although ProtRAP-LM and the base model also obtained AUC values above 0.7, they fell well short of PepMCP on most other metrics, especially precision; the base model reached a higher recall (0.9118) only at the cost of low precision (0.4493). This result demonstrates the effectiveness of PepMCP in distinguishing membrane-lytic AMPs from soluble peptides.

We present four cases from the external test set in Fig. 4: two membrane-lytic AMPs (Fig. 4a) and two soluble non-AMPs (Fig. 4b). PepMCP effectively captured the zigzag MCP pattern of the amphipathic α-helical peptide p11, which was mined from human gut microbe metagenomes (Li et al., 2025c). MCP coloring of its AlphaFold structure showed that its hydrophobic side had a high tendency to contact the lipid bilayer. A short AMP, m AMP76 (9 residues), was screened from UniProtKB by AMPSorter (Wang et al., 2025b). PepMCP also predicted well on this random coil, with only slight errors for two tryptophan (W) residues. Regarding the negative samples, PepMCP predicted MCP values very close to zero for a glucagon analog peptide (PDB ID: 6PHO) and a Z0 domain from the transcription repressor BCL11A (PDB ID: 9BV0) (Harris et al., 2025). Zooming in on the MCP values in Fig. 4b, we found that the predictions for the α-helical 6PHO still showed a periodic pattern every four residues. This pattern also appeared in the C-terminal α-helical part of 9BV0, but not in its β-sheet part. This indicates that the 4-hop edge encoding in PepMCP was effective in aggregating the features along the α-helix.

bioRxiv preprint doi: https://doi.org/10.64898/2026.02.01.703163; this version posted February 3, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Fig. 4. Case studies of PepMCP predictions. a. PepMCP predicted values, true MCP values from MD, and structures of two membrane-lytic AMPs. b. PepMCP predicted values, true MCP values with all zeros, and structures of two soluble non-AMPs. Structures were predicted using AlphaFold2 and colored with predicted MCP values. The maximum values of colormaps were set to 1.0, 0.5, 0.05, and 0.05, respectively, according to the line plots.

Table 3. Results of both classification and regression tasks on the external test set. The best value for each metric is marked with an asterisk.

Task             Metric      PepMCP     ProtRAP-LM   Base
Regression       Spearman    0.7182*    0.3723       0.2587
Regression       Pearson     0.8226*    0.0706       0.2868
Regression       R²          0.6750*    -0.3416      -0.8361
Regression       RMSE        0.1531*    0.3112       0.3640
Classification   AUC         0.9277*    0.7168       0.7008
Classification   AUPR        0.8606*    0.5502       0.5991
Classification   Accuracy    0.9000*    0.5875       0.4875
Classification   Precision   0.9333*    0.5294       0.4493
Classification   Recall      0.8235     0.2647       0.9118*
Classification   F1          0.8750*    0.3529       0.6019

MemAMPdb Database and PepMCP Server

Based on the results above, we developed the MemAMPdb database and a web server for the PepMCP predictor for convenient use. In the MemAMPdb database (http://www.songlab.cn/MemAMPdb/), all 550 membrane-lytic AMPs (516 in the training set and 34 in the external test set) have been uploaded, along with their sequences, references, mechanism validation methods, and origins. Users can search by ID, name, or keywords in the mechanism assays and origin descriptions. At the PepMCP server (http://www.songlab.cn/PepMCP/Introduction/), users can submit their query peptide sequences in FASTA format. The per-residue PepMCP predictions and line plots for each peptide can be downloaded after running PepMCP on our cloud server. Batch prediction is supported for no more than 20 sequences.
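The sequence-level rule described in this section (a peptide is positive when its mean per-residue MCP exceeds 0.2) can be written in a few lines, and the classification scores in Table 3 can be checked for internal consistency. The sketch below is illustrative only: the function names and the example MCP values are hypothetical, and the confusion-matrix counts are not reported in the paper. They are inferred here from the class sizes (34 positives, 46 negatives) and the rounded precision and recall for PepMCP, so they should be read as an assumption.

```python
from statistics import mean

MCP_THRESHOLD = 0.2  # sequence-level cutoff quoted in the text (Li et al., 2025b)


def is_membrane_lytic(residue_mcp, threshold=MCP_THRESHOLD):
    """Sequence-level call: positive if the mean per-residue MCP exceeds the cutoff."""
    return mean(residue_mcp) > threshold


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical per-residue predictions, for illustration only.
print(is_membrane_lytic([0.45, 0.10, 0.62, 0.08, 0.51]))
print(is_membrane_lytic([0.01, 0.00, 0.03, 0.02]))

# ASSUMPTION: counts inferred from 34 positives / 46 negatives and Table 3,
# not taken from the paper.
pepmcp = classification_metrics(tp=28, fp=2, fn=6, tn=44)
print({name: round(value, 4) for name, value in pepmcp.items()})
```

Under these inferred counts, the rounded accuracy, precision, recall, and F1 agree with the PepMCP column of Table 3, which suggests that the reported classification metrics are mutually consistent.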

Conclusion

To overcome the limitations of previous MCP predictors for peptides, we developed PepMCP, a graph-based, peptide-tailored MCP predictor. PepMCP was trained on residue-level MCP labels from MD simulations of membrane-lytic AMPs, with membranes composed of POPE and POPG lipids. We employed GraphSAGE in PepMCP to inductively pass node-level features from the ESM C language model, and used 4-hop edges that effectively incorporated the spatial features of peptides, especially for α-helices. PepMCP significantly outperformed both ProtRAP-LM and the base model on these membrane-lytic AMPs. On an external test set, PepMCP captured the MCP value patterns across peptides at the node level and could distinguish membrane-lytic AMPs from soluble peptides at the sequence level. Therefore, PepMCP is a useful tool for recognizing and investigating membrane-lytic AMPs.

There are still some limitations to PepMCP. First, it favors α-helical peptides because of the imbalanced training data. Not only do β-sheet and coil peptides account for a small proportion of the membrane-lytic AMPs, but their AlphaFold-predicted structures used in the MD simulations also exhibit lower confidence levels (Fig. S3b). We observe that PepMCP can still handle some peptides with other secondary structures among the test cases, but predictions for such peptides should be interpreted with caution. Second, given the limited number of training sequences available for PepMCP, we suggest using it in conjunction with other antimicrobial predictors to identify novel membrane-lytic AMPs. Finally, although PepMCP captures patterns or binding modes of peptide-membrane interactions, as reflected by MCP values, it cannot distinguish subtle effects caused by membrane type or morphology, nor can it elucidate the detailed modes of action of membrane-lytic AMPs, such as the barrel-stave or carpet models.
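The 4-hop edge encoding mentioned in this section can be illustrated with a short sketch. The exact construction is not spelled out here, so the code assumes one plausible reading: each residue is connected to every residue at most four positions away along the sequence, in both directions. The function name `k_hop_edges` is hypothetical and is not taken from the PepMCP code base.

```python
def k_hop_edges(length, k=4):
    """Directed edges (i, j) between residues with 1 <= |i - j| <= k.

    ASSUMPTION: this is one reading of "4-hop edges"; the published model
    may define its edges differently.
    """
    edges = []
    for i in range(length):
        for hop in range(1, k + 1):
            j = i + hop
            if j < length:
                edges.append((i, j))  # forward edge along the sequence
                edges.append((j, i))  # reverse edge, so messages pass both ways
    return edges


def neighbors(length, k=4):
    """Adjacency lists derived from the edge list, one sorted list per residue."""
    adjacency = {i: [] for i in range(length)}
    for source, target in k_hop_edges(length, k):
        adjacency[source].append(target)
    return {i: sorted(targets) for i, targets in adjacency.items()}


# A 9-residue peptide: every residue sees up to four neighbors on each side.
for residue, linked in neighbors(9).items():
    print(residue, linked)
```

Under this reading, an edge between residues i and i+4 links positions roughly one helical turn apart (an ideal α-helix has about 3.6 residues per turn), which is consistent with the four-residue periodicity reported for helical peptides. The resulting edge list could be handed to a graph library to build the input for a GraphSAGE layer.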
PepMCP is trained only on membrane-lytic AMPs, but it has the potential to generalize to other membrane-active peptides. Examples include cell-penetrating peptides, which permeate the membrane, and transmembrane signal peptides. As shown in Fig. S4, PepMCP produces MCP predictions of around 0.5 for nearly all residues in these membrane-active α-helical peptides, except at the terminus. We expect that PepMCP will play a crucial role in the discovery of membrane-lytic AMPs and other membrane-active peptides in future studies.

Competing Interests

No competing interest is declared.

Author Contributions

C.S. conceived the project. R.D. developed the model and analyzed the results. T.A. conducted the MD simulations. Q.C. collected the data. K.K. and L.W. assisted in building the server and analyzed the results. Z.Z. assisted in processing the data. R.D., T.A., and Q.C. wrote the original manuscript, and C.S. reviewed the manuscript. All authors have approved the final manuscript.

Funding

This work was supported by the National Key R&D Program of China (2024YFA0916800), the Science Fund for Innovative Research Groups of the National Natural Science Foundation of China (T2321001), and the Frontier Innovation Fund of Peking University Chengdu Academy for Advanced Interdisciplinary Biotechnologies. Part of the computation was performed on the computing platform of the Center for Life Sciences, Peking University.

Data Availability

The code and data are available at https://github.com/ComputBiophys/PepMCP. MemAMPdb is accessible at http://www.songlab.cn/MemAMPdb/. The PepMCP server is accessible at http://www.songlab.cn/PepMCP/Introduction/.
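Because the PepMCP server accepts FASTA input and supports batch prediction for no more than 20 sequences, a larger candidate list has to be split before submission. The following is a minimal local sketch of that preprocessing step; it does not contact the server, and the helper names (`parse_fasta`, `split_batches`, `to_fasta`) and the example sequences are hypothetical.

```python
MAX_BATCH = 20  # batch limit stated for the PepMCP web server


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several rows
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_batches(records, size=MAX_BATCH):
    """Split records into consecutive batches of at most `size` entries."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "\n".join(f">{header}\n{sequence}" for header, sequence in records)


# Hypothetical input: 45 placeholder peptides, for illustration only.
example = "\n".join(f">pep{i}\nGIGKFLKKAKKFGKAFVKILKK" for i in range(45))
for number, batch in enumerate(split_batches(parse_fasta(example)), start=1):
    print(number, len(batch))
```

Each batch can then be written out with `to_fasta` and submitted separately.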

References

M. Abraham, T. Murtola, R. Schulz, S. Páll, J. Smith, B. Hess, and E. Lindahl. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19–25, 2015.
E. E. Ambroggio, F. Separovic, J. H. Bowie, G. D. Fidelio, and L. A. Bagatolli. Direct visualization of membrane leakage induced by the antibiotic peptides: Maculatin, citropin, and aurein. Biophysical Journal, 89(3):1874–1881, 2005.
H. G. Boman. Antibacterial peptides: basic facts and emerging concepts. Journal of Internal Medicine, 254(3):197–215, 2003.
A. K. Buck, D. E. Elmore, and L. E. Darling. Using fluorescence microscopy to shed light on the mechanisms of antimicrobial peptides. Future Medicinal Chemistry, 11(18):2447–2460, 2019.
G. Bussi, D. Donadio, and M. Parrinello. Canonical sampling through velocity rescaling. The Journal of Chemical Physics, 126(1):014101, 2007.
A. Chatzigoulas and Z. Cournia. Predicting protein–membrane interfaces of peripheral membrane proteins using ensemble machine learning. Briefings in Bioinformatics, 23(2):bbab518, 2022.
D. H. de Jong, G. Singh, W. F. D. Bennett, C. Arnarez, T. A. Wassenaar, L. V. Schäfer, X. Periole, D. P. Tieleman, and S. J. Marrink. Improved parameters for the Martini coarse-grained protein force field. Journal of Chemical Theory and Computation, 9(1):687–697, 2013.
R. Dong, Q. Cao, and C. Song. Painting peptides with antimicrobial potency through deep reinforcement learning. Advanced Science, 12(43):e06332, 2025.
J. Du, S. Zhang, G. Wu, J. M. F. Moura, and S. Kar. Topology adaptive graph convolutional networks. arXiv, 2018.
A. Elnaggar, H. Essam, W. Salah-Eldin, W. Moustafa, M. Elkerdawy, C. Rochereau, and B. Rost. Ankh: optimized protein language model unlocks general-purpose modelling. bioRxiv, page 2023.01.16.524265, 2023.
W. L. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
Z. Han, L. Wang, and C. Song. Improved anisotropic network models for membrane protein dynamics and mechanosensitive ion channels. bioRxiv, page 2025.05.22.654704, 2025.
R. E. Harris, R. D. Whitehead III, and A. T. Alexandrescu. Solution structure of the Z0 domain from transcription repressor BCL11A sheds light on the sequence properties of protein-binding zinc fingers. Protein Science, 34(4):e70097, 2025.
T. Hayes, R. Rao, H. Akin, N. J. Sofroniew, D. Oktay, Z. Lin, R. Verkuil, V. Q. Tran, J. Deaton, M. Wiggert, R. Badkundri, I. Shafkat, J. Gong, A. Derry, R. S. Molina, N. Thomas, Y. A. Khan, C. Mishra, C. Kim, L. J. Bartie, M. Nemeth, P. D. Hsu, T. Sercu, S. Candido, and A. Rives. Simulating 500 million years of evolution with a language model. Science, 387(6736):850–858, 2025.
R. Heffernan, K. Paliwal, J. Lyons, J. Singh, Y. Yang, and Y. Zhou. Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning. Journal of Computational Chemistry, 39(26):2210–2216, 2018.
J. Huang, Y. Xu, Y. Xue, Y. Huang, X. Li, X. Chen, Y. Xu, D. Zhang, P. Zhang, J. Zhao, and J. Ji. Identification of potent antimicrobial peptides via a machine-learning pipeline that mines the entire space of peptide sequences. Nature Biomedical Engineering, 7(6):797–810, 2023.
S. Jain, J. Beazer, J. A. Ruffolo, A. Bhatnagar, and A. Madani. E1: Retrieval-augmented protein encoder models. bioRxiv, page 2025.11.12.688125, 2025.
J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
N. G. O. Júnior, C. M. Souza, D. F. Buccini, M. H. Cardoso, and O. L. Franco. Antimicrobial peptides: structure, functions and translational applications. Nature Reviews Microbiology, 23:687–700, 2025.
T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv, abs/1609.02907, 2016.
J. Li, H. Guo, and C. Song. MemConverter: An iterative pipeline for reprogramming protein localization in membrane or aqueous solution. bioRxiv, page 2025.10.23.684164, 2025a.
J. Li, C. Yang, R. Dong, J. F. B. Juarez, L. Wang, M. E. Wettstein, D. Wang, C. Cao, Y. Lu, and C. Song. Mechanism-driven screening of membrane-targeting and pore-forming antimicrobial peptides. Advanced Science, page e16470, 2025b.
W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17(3):282–283, 2001.
W. Li, B. Huang, M. Guo, Z. Zeng, T. Cai, L. Feng, X. Zhang, L. Guo, X. Jiang, Y. Yin, E. Wang, X. Huang, and J. Zheng. Unveiling the evolution of antimicrobial peptides in gut microbes via foundation-model-powered framework. Cell Reports, 44(6):115773, 2025c.
A. L. Lomize, S. C. Todd, and I. D. Pogozheva. Spatial arrangement of proteins in planar and curved membranes by PPM 3.0. Protein Science, 31(1):209–220, 2022.
Y. Ma, Z. Guo, B. Xia, Y. Zhang, X. Liu, Y. Yu, N. Tang, X. Tong, M. Wang, X. Ye, J. Feng, Y. Chen, and J. Wang. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nature Biotechnology, 40:1–11, 06 2022.
N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, 2011.
M. Mirdita, K. Schütze, Y. Moriwaki, L. Heo, S. Ovchinnikov, and M. Steinegger. ColabFold: making protein folding accessible to all. Nature Methods, 19:679–682, 2022.
L. Monticelli, S. K. Kandasamy, X. Periole, R. G. Larson, D. P. Tieleman, and S.-J. Marrink. The MARTINI coarse-grained force field: Extension to proteins. Journal of Chemical Theory and Computation, 4(5):819–834, 2008.
T. D. Newport, M. S. Sansom, and P. J. Stansfeld. The MemProtMD database: a resource for membrane-embedded protein structures and their lipid interactions. Nucleic Acids Research, 47(D1):D390–D397, 2018.
D. Paranou, A. Chatzigoulas, and Z. Cournia. Using deep learning and large protein language models to predict protein–membrane interfaces of peripheral membrane proteins. Bioinformatics Advances, 4(1):vbae078, 2024.
M. Parrinello and A. Rahman. Polymorphic transitions in single crystals: A new molecular dynamics method. Journal of Applied Physics, 52(12):7182–7190, 1981.
X. Periole, M. Cavalli, S.-J. Marrink, and M. A. Ceruso. Combining an elastic network with a coarse-grained molecular force field: Structure, dynamics, and intermolecular recognition. Journal of Chemical Theory and Computation, 5(9):2531–2543, 2009.
A. B. Poma, M. Cieplak, and P. E. Theodorakis. Combining the MARTINI and structure-based coarse-grained approaches for the molecular dynamics studies of conformational transitions in proteins. Journal of Chemical Theory and Computation, 13(3):1366–1374, 2017.
Y. Qi, H. I. Ingólfsson, X. Cheng, J. Lee, S. J. Marrink, and W. Im. CHARMM-GUI Martini Maker for coarse-grained simulations with the martini force field. Journal of Chemical Theory and Computation, 11(9):4486–4494, 2015.
C. D. Santos-Júnior, M. D. Torres, Y. Duan, Álvaro Rodríguez del Río, T. S. Schmidt, H. Chong, A. Fullam, M. Kuhn, C. Zhu, A. Houseman, J. Somborski, A. Vines, X.-M. Zhao, P. Bork, J. Huerta-Cepas, C. de la Fuente-Nunez, and L. P. Coelho. Discovery of antimicrobial peptides in the global microbiome with machine learning. Cell, 187(14):3761–3778.e16, 2024.
F. Suarez-Leston, M. Calvelo, G. F. Tolufashe, A. Muñoz, U. Veleiro, C. Porto, M. Bastos, Ángel Piñeiro, and R. Garcia-Fandino. SuPepMem: A database of innate immune system peptides and their cell membrane interactions. Computational and Structural Biotechnology Journal, 20:874–881, 2022.
P. Szymczak, W. Zarzecki, J. Wang, Y. Duan, J. Wang, L. P. Coelho, C. de la Fuente-Nunez, and E. Szczurek. AI-driven antimicrobial peptide discovery: Mining and generation. Accounts of Chemical Research, 58(12):1831–1846, 2025.
N. van Hilten, N. Verwei, J. Methorst, C. Nase, A. Bernatavicius, and H. J. Risselada. PMIpred: a physics-informed web server for quantitative protein–membrane interaction prediction. Bioinformatics, 40(2):btae069, 2024.
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. arXiv, 2018.
L. Wang, J. Zhang, D. Wang, and C. Song. Membrane contact probability: An essential and predictive character for the structural and functional studies of membrane proteins. PLOS Computational Biology, 18(3):e1009972, 2022.
L. Wang, K. Kang, and C. Song. ProtRAP-LM: Fast and accurate protein relative accessibility prediction and membrane protein screening through protein language model embeddings. bioRxiv, page 2025.01.20.633985, 2025a.
M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv, 2019.
Y. Wang, L. Zhao, Z. Li, Y. Xi, Y. Pan, G.-P. Zhao, and L. Zhang. A generative artificial intelligence approach for the discovery of antimicrobial peptides against multidrug-resistant bacteria. Nature Microbiology, 10:2997–3012, 2025b.
E. L. Wu, X. Cheng, S. Jo, H. Rui, K. C. Song, E. M. Dávila-Contreras, Y. Qi, J. Lee, V. Monje-Galvan, R. M. Venable, J. B. Klauda, and W. Im. CHARMM-GUI Membrane Builder toward realistic biological membrane simulations. Journal of Computational Chemistry, 35(27):1997–2004, 2014.
