Abstract
The mapping from codons to amino acids is surjective but not injective (many-to-one) because of the high degeneracy of the codon alphabet, suggesting that codon space may carry more information than amino acid space. Embeddings from codon language models have recently proved successful in various downstream tasks. However, predictive models for phosphorylation sites, arguably the most studied post-translational modification (PTM), and for PTM sites in general, have relied predominantly on amino acid-level representations. This work introduces an approach to phosphorylation site prediction that incorporates codon-level information through embeddings from a recently developed codon language model trained exclusively on protein-coding DNA sequences. Protein sequences are first mapped to reliable coding sequences and encoded with this model to generate codon-aware embeddings. These embeddings are then integrated, through an early fusion strategy, with amino acid-aware embeddings obtained from a protein language model. Next, a window-level representation of the site of interest is formed from the fused embeddings within a defined window frame. A ConvBiGRU network extracts features that capture spatiotemporal correlations between proximal residues within the window, and a Kolmogorov-Arnold Network (KAN) based on the Derivative of Gaussian (DoG) wavelet transform function then produces the prediction for the site. We call the overall model CaLMPhosKAN. On independent testing with combined serine/threonine and tyrosine test sets, CaLMPhosKAN outperforms existing approaches. Furthermore, we demonstrate the model's effectiveness in predicting sites within intrinsically disordered regions of proteins. Overall, CaLMPhosKAN emerges as a robust predictor of general phosphosites in proteins. CaLMPhosKAN will be released publicly soon.
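The fusion and windowing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the zero-padding at sequence ends, the toy embedding sizes, and the exact sign and normalization of the DoG wavelet are all assumptions made for illustration, and plain Python lists stand in for the tensors a real model would use.

```python
import math
from typing import List

Vector = List[float]


def early_fusion(codon_emb: List[Vector], protein_emb: List[Vector]) -> List[Vector]:
    """Concatenate codon-level and amino acid-level embeddings per residue.

    Assumes one codon embedding per residue, i.e. the coding sequence has
    already been aligned to the protein sequence.
    """
    if len(codon_emb) != len(protein_emb):
        raise ValueError("embeddings must cover the same number of residues")
    return [c + p for c, p in zip(codon_emb, protein_emb)]


def window_around(fused: List[Vector], site: int, half_width: int) -> List[Vector]:
    """Return the 2 * half_width + 1 fused vectors centered on `site`.

    Positions past either end of the sequence are filled with zero vectors
    (an assumed padding scheme; the abstract does not specify one).
    """
    dim = len(fused[0])
    return [
        list(fused[i]) if 0 <= i < len(fused) else [0.0] * dim
        for i in range(site - half_width, site + half_width + 1)
    ]


def dog_wavelet(x: float, scale: float = 1.0, shift: float = 0.0) -> float:
    """First derivative of a Gaussian, evaluated at the scaled, shifted input.

    Uses the form -t * exp(-t^2 / 2); sign and normalization conventions
    vary, so treat this as one common choice rather than the paper's.
    """
    t = (x - shift) / scale
    return -t * math.exp(-0.5 * t * t)


# Toy usage: 3 residues, 2-dim codon embeddings, 1-dim protein embeddings.
codon = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
protein = [[0.1], [0.2], [0.3]]
fused = early_fusion(codon, protein)             # 3 residues x 3 features
window = window_around(fused, site=0, half_width=1)  # 3 rows, first is padding
```

In the pipeline the abstract describes, a window like this would be passed to the ConvBiGRU feature extractor, and a wavelet such as `dog_wavelet` would serve as the learnable basis function inside the KAN layers, with `scale` and `shift` as trainable parameters.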
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Email: pp5291{at}g.rit.edu, sp2530{at}g.rit.edu, dkcvcs{at}rit.edu, chcarrie{at}mtu.edu, hdismail{at}ncat.edu, mchaud1{at}ilstu.edu
Abbreviations
- CaLM: Codon Adaptation Language Model
- DNA: Deoxyribonucleic Acid
- Wav-KAN: Wavelet Kolmogorov-Arnold Network
- ConvBiGRU: Convolutional Bidirectional Gated Recurrent Unit
- DoG: Derivative of Gaussian
- KAN: Kolmogorov-Arnold Network
- mRNA: Messenger RNA
- PTM: Post-Translational Modification
- ML: Machine Learning
- SENet: Squeeze-and-Excitation Network
- CapsNet: Capsule Network
- DL: Deep Learning
- CDHIT: Cluster Database at High Identity with Tolerance
- S: serine
- T: threonine
- Y: tyrosine
- P-sites: phosphorylated/positive sites
- NP-sites: non-phosphorylated/negative sites
- NCBI: National Center for Biotechnology Information
- RefSeq: NCBI Nucleotide Reference sequences
- NW: Needleman–Wunsch algorithm
- pLM: Protein Language Model
- ESM: Evolutionary Scale Modeling
- T5: Text-to-Text Transfer Transformer
- MLM: Masked Language Modeling
- U: Selenocysteine
- Z: Pyrrolysine
- O: Hydroxyproline
- B: Beta-amino acids
- 2DCNN: Two-Dimensional Convolutional layer
- BiGRU: Bidirectional Gated Recurrent Unit
- CWT: Continuous Wavelet Transform
- DWT: Discrete Wavelet Transform
- BCE: Binary Cross-Entropy
- MCC: Matthews Correlation Coefficient
- PRE: Precision
- REC: Recall
- AUC: Area Under the Curve
- AUPR: Area Under the Precision-Recall Curve
- AUROC: Area Under the Receiver Operating Characteristic Curve
- IDRs: Intrinsically Disordered Regions
- non-IDRs: Non-Intrinsically Disordered Regions