CaLMPhosKAN: Prediction of General Phosphorylation Sites in Proteins via Fusion of Codon-Aware Embeddings with Amino Acid-Aware Embeddings and Wavelet-based Kolmogorov–Arnold Network

doi:10.1101/2024.07.30.605530

2024 · doi:10.1101/2024.07.30.605530

preprint OA: closed


Abstract

The mapping from codon to amino acid is surjective due to the high degeneracy of the codon alphabet, suggesting that codon space might harbor higher information content. Embeddings from the codon language model have recently demonstrated success in various downstream tasks. However, predictive models for phosphorylation sites, arguably the most studied Post-Translational Modification (PTM), and PTM sites in general, have predominantly relied on amino acid-level representations. This work introduces a novel approach for prediction of phosphorylation sites by incorporating codon-level information through embeddings from a recently developed codon language model trained exclusively on protein-coding DNA sequences. Protein sequences are first meticulously mapped to reliable coding sequences and encoded using this encoder to generate codon-aware embeddings. These embeddings are then integrated with amino acid-aware embeddings obtained from a protein language model through an early fusion strategy. Subsequently, a window-level representation of the site of interest is formed from the fused embeddings within a defined window frame. A ConvBiGRU network extracts features capturing spatiotemporal correlations between proximal residues within the window, followed by a Kolmogorov-Arnold Network (KAN) based on the Derivative of Gaussian (DoG) wavelet transform function to produce the prediction inference for the site. We dub the overall model integrating these elements as CaLMPhosKAN. On independent testing with Serine-Threonine (combined) and Tyrosine test sets, CaLMPhosKAN outperforms existing approaches. Furthermore, we demonstrate the model’s effectiveness in predicting sites within intrinsically disordered regions of proteins. Overall, CaLMPhosKAN emerges as a robust predictor of general phosphosites in proteins. CaLMPhosKAN will be released publicly soon.

Competing Interest Statement

The authors have declared no competing interest.
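The early fusion, windowing, and DoG wavelet steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy embedding sizes, the zero padding at sequence ends, and the unnormalised DoG form `-u * exp(-u^2 / 2)` are all assumptions made for illustration. The ConvBiGRU feature extractor and the trained KAN layers are omitted.

```python
import math


def early_fusion(codon_emb, aa_emb):
    """Concatenate per-residue codon-aware and amino acid-aware vectors.

    Assumption: "early fusion" is modelled here as plain concatenation.
    """
    if len(codon_emb) != len(aa_emb):
        raise ValueError("both embeddings must cover the same residues")
    return [c + a for c, a in zip(codon_emb, aa_emb)]


def window_around(fused, site, half_width):
    """Return 2 * half_width + 1 fused vectors centred on `site`.

    Assumption: positions outside the sequence are zero-padded.
    """
    dim = len(fused[0])
    return [
        fused[i] if 0 <= i < len(fused) else [0.0] * dim
        for i in range(site - half_width, site + half_width + 1)
    ]


def dog_wavelet(x, scale=1.0, translation=0.0):
    """First derivative of a Gaussian, up to sign and normalisation (assumed form)."""
    u = (x - translation) / scale
    return -u * math.exp(-0.5 * u * u)


# Toy data: 4 residues, 2-dim codon embedding, 3-dim protein embedding.
codon_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
aa_emb = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2], [4.0, 4.1, 4.2]]

fused = early_fusion(codon_emb, aa_emb)               # 4 residues x 5 features
window = window_around(fused, site=0, half_width=1)   # 3 positions, first is padding
```

In a full model the window would be passed to a convolutional and recurrent feature extractor, and a wavelet-based KAN layer would apply functions like `dog_wavelet` with learned scale and translation parameters to each input feature.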
Footnotes

Email: pp5291{at}g.rit.edu, sp2530{at}g.rit.edu, dkcvcs{at}rit.edu, chcarrie{at}mtu.edu, hdismail{at}ncat.edu, mchaud1{at}ilstu.edu

Abbreviations

- CaLM - Codon Adaptation Language Model
- DNA - Deoxyribonucleic Acid
- Wav-KAN - Wavelet Kolmogorov-Arnold Network
- ConvBiGRU - Convolutional Bidirectional Gated Recurrent Unit
- DoG - Derivative of Gaussian
- KAN - Kolmogorov-Arnold Network
- mRNA - Messenger RNA
- PTM - Post-translational modification
- ML - Machine Learning
- SENet - Squeeze and Excitation Network
- CapsNet - Capsule Network
- DL - Deep Learning
- CDHIT - Cluster Database at High Identity with Tolerance
- S - serine
- T - threonine
- Y - tyrosine
- P-sites - phosphorylated/positive sites
- NP-sites - non-phosphorylated/negative sites
- NCBI - National Center for Biotechnology Information
- RefSeq - NCBI Nucleotide Reference sequences
- NW - Needleman–Wunsch algorithm
- pLM - Protein Language Model
- ESM - Evolutionary Scale Modeling
- T5 - Text-to-Text Transfer Transformer
- MLM - Masked Language Modeling
- U - Selenocysteine
- Z - Pyrrolysine
- O - Hydroxyproline
- B - Beta-amino acids
- 2DCNN - Two-Dimensional Convolutional layer
- BiGRU - Bidirectional Gated Recurrent Unit
- CWT - Continuous Wavelet Transform
- DWT - Discrete Wavelet Transform
- BCE - Binary Cross-Entropy
- MCC - Matthews Correlation Coefficient
- PRE - Precision
- REC - Recall
- AUC - Area Under Curve
- AUPR - Area Under the Precision-Recall Curve
- AUROC - Area Under Receiver Operating Characteristic
- IDRs - Intrinsically Disordered Regions
- non-IDRs - Non-Intrinsically Disordered Regions

