{"paper_id":"1d96042d-3c65-4a99-9eae-93afababe1d9","body_text":"G2VTCR: predicting antigen binding specificity by Weisfeiler-\nLehman graph embedding of T cell receptor sequences \nZicheng Wang 1, Yufeng Shen1,2,3 \n \nAffiliations: \n1. Department of Systems Biology \n2. Department of Biomedical Informatics \n3. JP Sulzberger Columbia Genome Center \nColumbia University Irving Medical Center, New York, NY, USA \n \n \nAbstract \nThe binding of peptide-MHC complexes by T cell receptors (TCRs) is crucial for T cell \nantigen recognition in adaptive immunity. High-throughput multiplex assays have \ngenerated valuable data and insights about antigen specificity of TCRs.  However, \nidentifying which TCRs recognize which antigens remains a significant challenge due \nto the immense diversity of TCR. Here we describe G2VTCR (Graph2Vec-based \nRepresentation and Embedding of TCR and Targets for Enhanced Recognition \nAnalysis), a computational method that uses atomic level graph embedding to predict \nTCR-antigen recognition.  G2VTCR represents antigens and the third complementarity-\ndetermining region (CDR3) of TCR sequences using graphs, in which nodes encode \natomic identities and edges encode chemical bonds between atoms, and then uses \nWeisfeiler-Lehman iterations to produce embeddings. The embeddings can be used \nfor supervised classification tasks in TCR-antigen binding prediction and unsupervised \nclustering of TCRs. We evaluated G2VTCR using publicly available paired TCR-\nCDR3/antigen data generated by antigen-stimulation experiments. We show that \nG2VTCR has better performance in both classification and clustering than other \nembedding methods including pre-trained protein language models.  We investigated \nthe impact of Weisfeiler-Lehman iterations and the sample size of TCR CDR3 on \nclassification performance. Our results highlight the utility of atomic level graphical \nembedding of immune repertoire sequences for antigen specificity prediction.  
\nIntroduction \nThe recognition of peptide-MHC complexes by T cell receptors (TCRs) is pivotal for T \ncell activation and antigen specificity in adaptive immune responses. The knowledge of \n.CC-BY 4.0 International licenseavailable under a \nwas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprint (whichthis version posted May 4, 2025. ; https://doi.org/10.1101/2025.04.29.651344doi: bioRxiv preprint \n\nantigen specificity of T cells has a range of applications including cancer \nimmunotherapies, vaccine development, and the understanding of autoimmune \ndiseases (1-5). Experimentally determining TCR-antigen specificity relies on methods \nsuch as tetramer (1, 6) staining or in vitro stimulation assays (7), which are often labor-\nintensive, time-consuming, and difficult to scale for large datasets. Given the immense \ndiversity of CDR3 sequences, estimated to range between 1320 to 1420 combinations, \nexperimental sampling covers only a small fraction of possible TCRs. This challenge \nmotivates the development of computational prediction models that evaluate TCR-\npeptide interactions across a broad range of sequences (8).  Recent advancements in \ncomputational biology (9, 10) and graph-based machine learning (11, 12) have \nsignificantly enhanced the precision of such predictions.  \nTCR-peptide binding specificity is determined by TRA, TRB, peptide, and MHC \n(13). For CD8 T cells, the beta CDR3 sequence along with the V and J genes are often \nprimary contributors (14). Previously published computational tools for analyzing TCR \npatterns and predicting peptide–TCR interactions can be categorized into three \napproaches. The first approach uses unsupervised clustering algorithms to identify \nantigen-specific binding patterns by grouping similar TCR CDR3 sequences. 
Tools \nsuch as TCRdist (15), GLIPH (16), and DeepTCR (17) follow this strategy. The second \napproach uses supervised machine learning methods, in which CDR3 and peptide \nsequences are represented using techniques like one-hot encoding, k-mers, BLOSUM \nor amino acid indices; examples include TCRex (18), SETE (2), NetTCR (19), ERGO (20), pMTnet \n(21), and TITAN (10). The third approach represents CDR3 and peptides using latent \nembeddings from language models trained on large-scale protein sequences, such as \nPanPep (9). However, most approaches face challenges in generalizing to unseen \nTCRs (22). This limitation may arise from their inability to effectively represent the \nstructural and biophysical characteristics of TCR-antigen interactions. \nHere we introduce G2VTCR (Graph-based Representation and Embedding of Antigen \nand TCR for Enhanced Recognition Analysis), a computational method that utilizes \ngraph embedding techniques for high-dimensional modeling of TCR CDR3 and antigen \nsequences. G2VTCR constructs atomic-level graphs from antigen and TCR sequences, \ntreating atoms as nodes and chemical bonds as edges, and extracts context-aware \nsubgraph features using Weisfeiler-Lehman relabeling. This atomic-level graph \nembedding algorithm may capture biophysical properties of CDR3 and antigen \npeptides. We assessed the application of embeddings in predicting TCR-antigen \nbinding as a supervised classification problem and clustering TCR sequences using \nunsupervised methods. \nTo ensure robust evaluation and reduce overfitting, we split the data into independent \ntraining and testing sets with non-overlapping TCR-antigen pairs. We optimized key \nhyperparameters, including the number of Weisfeiler-Lehman (WL) iterations (23, 24) \nand embedding dimensions, and evaluated performance using auROC. We \ndemonstrate G2VTCR’s ability to accurately predict TCR-antigen interactions and its \nrobustness across a variety of antigen pools (25). 
\nResults \nModel architecture: \n \nFigure 1.  Schematic of the G2VTCR framework for predicting TCR-antigen binding using \ngraph-based molecular representations. The method consists of three key steps: (1) \nSeq2Mol, where sequence data is converted into molecular structures; (2) Mol2Graph, where \nmolecular structures are represented as graphs with nodes corresponding to atoms (e.g., \noxygen, carbon, nitrogen) and edges representing bonds; and (3) Graph2Vec, where graph \nembeddings are generated using the Weisfeiler-Lehman (WL) graph kernel for iterative graph \nrepresentation. The resulting TCR embeddings are applied to unsupervised clustering to \nidentify groups of TCR sequences with similar antigen recognition profiles. Additionally, TCR \nand antigen embeddings are pooled together and used to train a Random Forest (RF) classifier, \nwhich distinguishes between positive (binding) and negative pairs.  \nG2VTCR represents the sequences of both TCR CDR3 and antigens as graphs, in \nwhich nodes correspond to atoms (e.g., carbon, oxygen, nitrogen) and edges represent \nbonds between them. The method includes three main steps: (1) Seq2Mol, which \nconverts sequence data into molecular structures; (2) Mol2Graph, which maps \nmolecular structures into graph representations based on atomic and bond attributes; \nand (3) Graph2Vec (23), which generates graph embeddings using the Weisfeiler-\nLehman graph kernel (26) (Figure 1). As part of the G2VTCR framework, the resulting \ngraph embeddings are used for both supervised classification and unsupervised \nclustering. 
Specifically, a Random Forest (RF) classifier is employed to predict TCR-\nantigen binding, while DBSCAN is applied to identify clusters of TCR sequences with \nshared antigen specificity. \nWe constructed training and testing datasets by pairing TCR CDR3 sequences with \nknown antigenic peptides. Positively labeled pairs consisted of experimentally verified \nbinding pairs, while negative pairs were generated through random sampling. The \ndataset was derived from two primary sources: (1) antigen stimulation experiments that \nidentified TCR sequences from the COVID-19 research cohort (25), which \nincluded expanded T-cell clonotypes, high-throughput sequencing of TCRβ CDR3 \nregions, and antigen specificity annotations; and (2) the VDJdb database (27), which \nprovided a curated set of annotated TCR sequences with known antigen specificities. \nAs depicted in Figure 1, we pooled the TCR-antigen data and represented each \nsequence as a graph using the Seq2Mol and Mol2Graph steps. Using Graph2Vec, the \nmethod converted molecular interactions into numerical embeddings. The G2VTCR \nmethod then combined graph-based embeddings and machine learning classifiers to \npredict TCR-antigen interactions.  \nAnalysis of TCR Embeddings \n \nFigure 2. Embedding of TCR sequences. t-SNE visualization showing the distribution of TCR \nsequences across different epitope pools, highlighting their separation based on structural and \nantigen-specific features. Each color corresponds to a distinct epitope pool.   
\nTo visualize TCR embeddings, we used t-distributed Stochastic Neighbor Embedding \n(t-SNE) to project the high-dimensional embeddings into two dimensions. We selected \nTCRs corresponding to 16 dominant epitope pools (Supplementary Table 1) to be \nshown in the figure. As shown in Figure 2, the t-SNE visualization revealed distinct \ngroups corresponding to TCR sequences binding to different peptides, indicating that \nG2VTCR embeddings effectively capture antigen-specific characteristics.  \nWe selected antigens that are paired with at least 1,000 TCR CDR3 sequences for \nmodel training and testing. The TCR data were split into training and testing sets (80/20 \nsplit), ensuring that each TCR-antigen pair appeared exclusively in one set, with no \nwithin-epitope overlap. Instead of a random split, we applied a sequence similarity-\nguided approach to construct the test set to minimize the similarity of TCRs between \ntraining and testing sets. Specifically, for each epitope group, we randomly selected \nseed TCRs and identified their nearest neighbors based on one-hot encoded sequence \nembeddings using FAISS with L2 distance. This KNN-like strategy ensured that similar \nsequences remained together, reducing the risk of data leakage and maintaining a \nclear separation between training and testing distributions (see Methods and \nSupplementary Figure 1). To generate negative samples, we used a shuffling strategy \nby pairing TCRs with peptides other than their known targets. \nImpact of Weisfeiler-Lehman Iterations and Embedding Dimensionality on \nclassification  \nFigure 3. 
Impact of Weisfeiler-Lehman Iterations and Embedding Dimensionality. (A-B) \nPerformance (auROC) on validation data as a function of Weisfeiler-Lehman (WL) iterations and \nembedding dimensionality, demonstrating optimal performance with 4-6 WL iterations and high \ndimensions (64+). \nTo identify optimal parameter settings, we performed parameter tuning using a 5-fold \ncross-validation scheme, in which the validation folds were drawn from the training data and \nwere not used for testing. We examined the impact of key parameters in the Weisfeiler-\nLehman procedure on the utility of the generated embeddings for classification. \nSpecifically, we varied the number of WL iterations from 1 to 10 and the embedding \ndimensions across 16, 32, 64, 128, 256, and 512. A Random Forest classifier was \ntrained using the embeddings produced by G2VTCR as input, with labeled data \nindicating TCR-antigen binding (positive interactions) or non-binding (negative \ninteractions).  \nGraph2Vec embeddings were generated with dimensions ranging from 16 to 512, and \nmodel performance was evaluated using auROC scores across multiple Weisfeiler-\nLehman (WL) iterations. A key factor in Graph2Vec’s success is the number of \nWeisfeiler-Lehman (WL) iterations, which determines the depth of neighborhood \ninformation incorporated into node labels. Fewer WL iterations capture local structures, \nwhile with more iterations performance gains diminish and overfitting can occur. Similarly, \nthe dimensionality of embeddings influences how much information is encoded. High \ndimensions provide marginal improvements but may not justify the added complexity. 
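The way the number of WL iterations controls neighborhood depth can be illustrated with a small sketch. This is not the G2VTCR implementation: the function name wl_subtree_features, the toy heavy-atom graph, and the use of a truncated SHA-256 digest to compress labels are all assumptions made for illustration. Graph2Vec additionally trains a doc2vec-style model on the collected labels to obtain dense embeddings; that step is omitted here.

```python
import hashlib
from collections import Counter


def wl_subtree_features(adjacency, labels, iterations):
    # adjacency: node -> list of neighbor nodes (undirected graph)
    # labels: node -> initial label, e.g. the element symbol of the atom
    current = dict(labels)
    features = Counter(current.values())  # iteration 0: raw atom labels
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            # The new label summarizes the node's own label plus the sorted
            # multiset of its neighbors' labels from the previous round.
            neighbor_part = ','.join(sorted(current[n] for n in neighbors))
            signature = current[node] + '|' + neighbor_part
            updated[node] = hashlib.sha256(signature.encode()).hexdigest()[:16]
        current = updated
        features.update(current.values())
    return features


# Toy heavy-atom graph (hypothetical example): N - C - C(-O)(-O)
adjacency = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
labels = {0: 'N', 1: 'C', 2: 'C', 3: 'O', 4: 'O'}

for depth in (0, 1, 2):
    feats = wl_subtree_features(adjacency, labels, depth)
    print(depth, len(feats), sum(feats.values()))
```

Each additional iteration lets a label describe a neighborhood one bond further out, so the feature vocabulary grows with depth. With too few iterations the two carbon atoms in the toy graph are indistinguishable; after one iteration they receive different labels because their neighbors differ.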
\nFigure 3A and B show the area under the receiver operating characteristic curve \n(auROC) on 5-fold cross-validation as a function of Weisfeiler-Lehman (WL) iterations \n(1-10) across different embedding dimensions (16, 32, 64, 128, 256, and 512), \nhighlighting how intermediate values offer the best balance between capturing \nstructural complexity and avoiding overfitting. Our results indicate that an intermediate \nnumber of WL iterations (4-6) and a dimensionality of 512 provide the best \nperformance for the given dataset. We further evaluated the robustness of G2VTCR \nembeddings when applied to RF classifiers with varying maximum depths. Notably, the \nperformance in Supplementary Figure 3 corresponds to the scenario with unlimited \nmaximum depth in the RF classifier, where an auROC of 0.96 is achieved. When \nwe varied the maximum depth from 1 to 15, G2VTCR maintained consistently high \nperformance once the RF maximum depth reached 7 (auROC > 0.9). \nPerformance evaluation of classification and clustering based on G2VTCR \nembeddings \n \nFigure 4. Receiver Operating Characteristic (ROC) curve of classification analysis and \nClustering analysis of TCR sequence representations. (A) Shows performance under the \nshuffling strategy, with G2VTCR again achieving the highest auROC of 0.958, compared to \nOne-hot (auROC = 0.706) and ESM2 (auROC = 0.614). (B) Clustering performance comparison \nacross representations using Clustering Precision (c-precision) and Clustering-Critical Success \nIndex (c-CSI) metrics across varying DBSCAN epsilon values. 
Representations include \nG2VTCR, ESM, OneHot encoding, RDKit fingerprint embeddings, and TCRdist. The G2VTCR \nembedding demonstrates superior performance, achieving the highest clustering precision and \nc-CSI, indicating its effectiveness in differentiating TCR clusters. \nWe next evaluated the predictive performance of G2VTCR embeddings in both \nsupervised classification and unsupervised clustering tasks using test data. In the \nclassification task, the model was trained to distinguish between positive TCR–antigen \nbinding pairs and negative pairs generated by shuffling TCRs across non-cognate \npeptides. Each pair was represented by the concatenation of Graph2Vec embeddings \nof the TCR and epitope, learned from atomic-level graph representations. The \nGraph2Vec model was configured with five Weisfeiler-Lehman iterations and 256 \nembedding dimensions, selected based on prior cross-validation experiments. The \nclassifier, a Random Forest, was trained on these combined embeddings using labeled \ntraining data and evaluated on a held-out test set constructed using a nearest-\nneighbor-based similarity splitting strategy to minimize overlap and leakage. The \ncomparative performance analysis in Figure 4A shows that the G2VTCR framework \noutperformed other embedding methods in predicting peptide-TCR \ninteractions. G2VTCR achieved an area under the receiver operating characteristic \ncurve (auROC) of 0.96 for testing data. In contrast, the traditional One-hot encoding \napproach yielded an auROC of 0.71, while the more recent computational embedding \nmethod, ESM2 (650 M) (28), achieved a lower auROC of 0.61.  \nWe evaluated the performance in clustering based on G2VTCR embedding. To \nconstruct the clustering dataset, we excluded TCR sequences associated with \nantigens represented by fewer than 1,000 unique sequences, thereby focusing the \nclustering analysis on antigens with sufficient representation. 
Using DBSCAN as the \nclustering algorithm, we varied the epsilon (𝜀) parameter to control the density \nthreshold for cluster formation. Clustering performance was quantified using Clustering \nPrecision (c-Precision) and the Clustering-Critical Success Index (c-CSI) (29). As shown \nin Figure 4B, the G2VTCR framework achieved higher clustering precision and c-CSI \ncompared to alternative embeddings, including ESM2, OneHot encodings, RDKit \nfingerprints (30), TCRdist embeddings (15) and GLIPH 2 (16). \n \nFigure 5. Data Depth on TCR-Epitope Interaction Prediction Performance. (A) \nCorrelation between the number of TCRs (log10-transformed) and auROC scores for \nepitope-specific prediction, highlighting improved performance with increasing TCR \ncounts. (B) Relationship between data ratio and auROC scores, showing an elbow \npoint at 20% and a plateau beyond 30%, with epitope-specific variability in \nperformance trends. \nTo evaluate whether the similarity between test and training data affects model \nperformance, we computed the average minimal TCR distances by first calculating the \npairwise distances between test and training CDR3 sequences for each epitope using \nthe TCR-dist method (15), and then averaging the minimum distance from each test \nsequence to the training sequences from the same epitope pool (Supplementary Figure \n4). We found no significant correlation between the average minimal TCR \ndistances and the auROC (Pearson correlation coefficient of 0.12, \np-value of 0.49). 
Furthermore, \nthere was no significant correlation between the number of TCRs and the auROC \n(correlation coefficient=0.014, p-value=0.94). Epitope-specific analyses nonetheless \nhighlight the robustness of the G2VTCR framework, with several antigens \ndemonstrating exceptional predictive performance. Notably, antigen pools 37, 141, 90, \nand 215 achieved auROC scores of 0.989, 0.986, 0.984, and 0.981, respectively (see \nSupplementary Table 1 for sequences and indices). \nImpact of data depth on performance \nTo further investigate the impact of data depth on model performance, we conducted \ndown-sampling analysis on individual epitopes from the COVID-19 research cohort \ndataset (31), analyzing the relationship between the number of TCR sequences (Figure \n5A) and the data ratio (Figure 5B) with the corresponding auROC scores. To simulate \nvarying data availability, we applied down-sampling prior to the embedding step, after \nsplitting the data into training and testing sets. In each experiment, both the training \nand testing sets were down-sampled at the same ratio to ensure consistency. These \nanalyses reveal how data availability influences predictive accuracy. As shown in \nFigure 5A, the strong positive correlation (r = 0.65, p = 4.06 × 10^-43) \nindicates that increasing the number of TCRs generally enhances predictive \nperformance, as measured by auROC scores. However, for epitopes with highly \ndiverse repertoires, even datasets with over 1,000 TCRs may not fully capture all \nbinding scenarios. 
Figure 5B shows the relationship between the data ratio (proportion \nof available TCR data) and model performance. The results reveal a critical threshold at \na 20% sampling ratio, where predictive accuracy sharply increases, indicating \nsufficient data coverage for reliable predictions (Supplementary Figure 2). Beyond this \npoint, performance gains become incremental, with diminishing returns observed after \n30%.   \nWe observed a similar trend in clustering performance with the depth of the dataset \nused during the embedding step. Specifically, down-sampling the input data to 20% or less \nof the original data resulted in a noticeable decrease in clustering performance, \nas reflected by reductions in both clustering precision and c-CSI \nmetrics (Supplementary Figure 2). This degradation affected G2VTCR and other \nembedding methods alike, leading to less distinct separation of TCRs into clusters, \nparticularly within epitope-specific groups. These findings highlight the importance of \nadequate data coverage when using unsupervised embeddings to capture the diversity \nand specificity of TCR repertoires. \n \nImpact of Node Features and Subgraph Structure on Model Performance \nTable 1. Performance Evaluation of Node and Edge Features in G2VTCR \nFramework. C1, C2, etc., represent different feature sets combining node and edge \nfeatures for evaluating the G2VTCR framework. X: feature included; -: feature not included. \n
Feature             C0     C1     C2     C3     C4     C5     C6     C7     C8     C9 \nAtomic Num          -      X      X      X      X      X      X      X      X      X \nIs Aromatic (atom)  -      X      X      X      X      X      X      X      X      X \nFormal Charge       -      -      -      -      X      X      X      X      X      X \nHybridization       -      -      -      -      X      X      X      X      X      X \nDegree              -      -      -      -      -      -      -      X      X      X \nTotal Num Hs        -      -      -      -      -      -      -      X      X      X \nBond Type           -      X      X      X      X      X      X      X      X      X \nIs Aromatic (bond)  -      -      X      X      -      X      X      -      X      X \nIs Conjugated       -      -      X      X      -      X      X      -      X      X \nBond Stereo         -      -      -      X      -      -      X      -      -      X \nIs In Ring          -      -      -      X      -      -      X      -      -      X \nauROC               0.951  0.966  0.965  0.963  0.964  0.963  0.961  0.954  0.955  0.955 \n \nWe systematically evaluated the influence of various node and edge features on the \nperformance of the G2VTCR framework in predicting antigen-TCR interactions. Node \nfeatures such as atomic number, aromaticity, formal charge, hybridization, connectivity \ndegree, and the total number of hydrogens were assessed alongside bond-level \nfeatures including bond type, aromaticity, conjugation, stereochemistry, and ring \nmembership. Performance was quantified using the area under the receiver operating \ncharacteristic curve (auROC), with a baseline of 0.951 achieved using a non-attributed \nmodel (wl_iterations=5, dimensions=128, embeddings based solely on graph structure \nand connectivity, without incorporating node or edge attributes). \nOur findings suggest that the performance of the G2VTCR framework is robust across \nvarious feature configurations, with only minimal differences observed when specific \nfeatures are added or excluded (Table 1).  \n \nIdentification of High-Frequency Motifs in TCR Sequence Clusters \n \nFigure 6. Sequence logo plots of high-frequency, high-TF-IDF motifs from TCR \nsequences across epitope pools 7_YLC, 111_HTT, and 215_YQI. 
Each motif is \nshown as a five-residue window centered at the most informative position (position 0). \nThe number of motifs per epitope varies depending on the number of specific \nsubgraph-based patterns that met the selection criteria: high TF-IDF scores, presence \nin at least 5 TCRs, no overlap with motifs from other epitopes, and occurrence in at \nleast 20 TCR sequences in the dataset. \nWe analyzed TCR sequence clusters based on their atomic-level subgraph patterns \nusing the G2VTCR framework, applying a TF-IDF (Term Frequency-Inverse Document \nFrequency) vectorizer (32) to the graph features of the sequences. This method allowed \nus to identify the most significant words or motifs common in the sequences within \neach cluster. We specifically focused on atomic features that appear frequently within \nthe cluster sequence graphs but not across all clusters, ensuring these features are \ndistinctive for each cluster. Here, we present the analysis of three enriched TCR \nclusters, highlighting key features and prominent patterns in the TCR sequences as \nexamples.  \nIn the high-accuracy prediction epitope pool 215_YQI, we identified several key \npatterns in the TCR sequences occurring at significant frequencies. The most \nprominent patterns included “SIGQG” (223 sequences), “SIGTG” (207 sequences), and \n“SIGLG” (174 sequences), all sharing the common “SIG” prefix, which represents a \nrecurring motif within this cluster. Variations such as “SLGQG” (137 sequences), \n“SIGVG” (102 sequences), and “SMGQG” (90 sequences) highlighted flexibility around \nthe core “SIG” and “SLG” motifs. 
Additional motifs, including “STGTG” (73 sequences) \nand “SSGTG” (35 sequences), further demonstrated the diversity of subgraph patterns \nwithin this cluster. We compared G2VTCR against GLIPH using real motif outputs from the \n215_YQI epitope pool. GLIPH identified frequent short patterns such as “IG” (1053 \nsequences), “SI” (1031), and “SIG” (982), which correspond to common substrings but \nlack positional specificity or structural insight. \nTo better illustrate how these motifs define cluster specificity, we extended our analysis \nto examine high-frequency motifs with high TF-IDF scores across additional epitope \npools, including 7_YLC and 111_HTT. Figure 6 provides a logo plot visualization of these \nkey motifs, capturing the probabilistic occurrence of specific patterns within the \nsequence clusters. This analysis highlights not only the distinctiveness of certain motifs \nwithin individual epitope pools but also their variation across clusters. These high-TF-\nIDF motifs include “%GAI%” and “SARGG” in epitope pool 7_YLC and “%GPWD” and \n“RD/GP” in epitope pool 111_HTT. By comparison, GLIPH returned longer but sparse motifs like \n“RGLAG%SYE” and “S%GGNE” (each seen in ~15 sequences; ambiguous or gapped \npositions are indicated by “%”). Similarly, \nGLIPH identified “SPRD” and “SLRD” in epitope pool 111_HTT as short, frequent \nmotifs. \n \nDiscussion \nG2VTCR represents a new approach for modeling TCR-antigen interactions. \nLeveraging atomic-level graph-based embeddings, G2VTCR achieves superior \nperformance in both supervised classification and unsupervised clustering tasks \ncompared to previous amino acid-level and language model-based embedding \nmethods. This improvement may be attributed to its ability to more efficiently represent \ndiscontinuous binding motifs at the atomic scale. 
Unlike amino acid-level embeddings, \nwhich are limited by the coarse granularity of a 20-character corpus, G2VTCR’s atom-\nlevel embeddings extend the available information by capturing subgraph-level \nstructural and chemical properties. This approach allows the model to preserve \nmolecular details that are critical for accurately modeling TCR-antigen interactions. \nWhile sequence-based models such as OneHot and ESM2 are highly effective for \nlonger protein sequences, they do not account for the local biophysical and chemical \nproperties that are vital for short sequences like CDR3 regions.  \nOur analysis highlights the importance of optimal WL iterations and embedding \ndimensions in enhancing model performance. Intermediate WL iterations (4-6) and a \ndimensionality of 128-256 were found to provide the best balance, capturing detailed \nneighborhood information without introducing excessive noise.  \nUnlike k-mer-based methods that rely on fixed-length substrings for pattern detection \n(33), G2V embeddings operate at the graph level, enabling the identification of \nstructurally equivalent patterns despite sequence variation. While k-mer methods, such \nas SPAN-TCR, can detect frequent motifs like “GX” or “XG” and calculate entropy \nreduction, they are limited by exact sequence matching and fixed k-mer sizes. In \ncontrast, G2V embeddings use graph-level abstractions to capture inter-atomic \ninteractions and represent TCR structural diversity comprehensively. 
For instance, the \nrecurring “XGLGP” motif and its variants highlight not only shared atomic subgraph \ncontexts but also the central residues critical for binding specificity, which k-mer-\nbased approaches cannot discern due to their reliance on fixed-length substrings. \nA limitation of G2VTCR is that the method's predictive accuracy depends on \ncomprehensive sampling of TCR sequences specific to each antigen. The performance \nof G2VTCR declines when applied to antigens for which only limited TCR sequence \ninformation is available. Additionally, our current evaluation did not fully explore the \nmodel's capacity to generalize to novel antigens with sparse TCR representation. \nFuture studies should address this potential limitation to confirm the robustness of \nG2VTCR across broader antigenic contexts. \nIn this study we have focused on TCR β-chain without considering α-chain or HLA \nalleles. This approach would not work well for class II–restricted T cell responses, \nwhere both α- and β-chain participate antigen recognition. Future studies could \naddress this issue by incorporating paired α-β TCR sequences and multiple HLA alleles \nin larger and more diverse datasets.  \n \nAcknowledgements \nWe thank Dr. Donna Farber, Dr. Peter Sims, and Jean-Baptiste Reynier for \ninsightful discussions, and Jake Hagen for assistance with computing \ninfrastructure. This work was supported in part by the National Institutes of \nHealth grant U19AI128949. \n \nMethods \nCode Availability \nOur implementation of G2VTCR is publicly available at \nhttps://github.com/princello/G2VTCR. \nData Collection and Process \n.CC-BY 4.0 International licenseavailable under a \nwas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprint (whichthis version posted May 4, 2025. 
We used a dataset of T-cell receptor (TCR) sequences identified by antigen stimulation experiments with the MIRA assay in COVID-19 subjects (31). We included data on T-cell clonotype distributions, high-throughput sequencing of TCRβ CDR3 regions, and antigen specificity annotations obtained through computational analyses. The dataset originally contained 152,718 TCRs covering 545 epitopes from 269 antigen groups. After removing epitopes with fewer than 1,000 associated TCRs, 97,216 TCRs covering 120 epitopes from 35 antigen groups remained for analysis. Additionally, we incorporated TCR sequences from the VDJdb database (27), which provides annotated TCRs with known antigen specificities. \nTraining and testing splitting: The dataset consisted of TCR CDR3 sequences paired with antigenic epitopes, divided into training and testing sets using an 80/20 split. To construct the testing set, we employed a nearest-neighbor sampling approach based on sequence similarity. Specifically, TCR sequences were one-hot encoded and embedded into a fixed-length vector space, and L2 distance was used to identify clusters of similar sequences using the FAISS library. Seed sequences were randomly selected, and their nearest neighbors were grouped to form the test set. This procedure ensured that sequences associated with the same antigenic epitope were entirely allocated to either the training or testing set, preventing within-epitope leakage and maximizing independence between sets. Additionally, we introduced two strategies for generating negative TCR–epitope associations: random shuffling of known TCRs across non-cognate epitopes and controlled swapping of TCRs between different antigen groups, both of which were included in model training and evaluation. 
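The random-shuffling strategy for negative pairs can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `shuffled_negatives`, the fixed seed, and the toy CDR3 strings and epitope names are all hypothetical. It relies on the assumption that a TCR specific to one epitope is unlikely to bind an unrelated one.

```python
import random


def shuffled_negatives(positive_pairs, seed=0):
    # Pair every TCR with one randomly chosen epitope that is NOT among
    # its known positive partners (hypothetical helper, illustration only).
    rng = random.Random(seed)
    known = set(positive_pairs)
    epitopes = sorted({epitope for _, epitope in positive_pairs})
    negatives = []
    for tcr, _ in positive_pairs:
        candidates = [e for e in epitopes if (tcr, e) not in known]
        if candidates:  # skip TCRs already paired with every epitope
            negatives.append((tcr, rng.choice(candidates)))
    return negatives


# Toy positive pairs; sequences and epitope names are made up.
positives = [
    ('CASSLGQAYEQYF', 'EPITOPE_A'),
    ('CASSPGTGGYEQYF', 'EPITOPE_B'),
    ('CASSLDRGNTEAFF', 'EPITOPE_C'),
]
negatives = shuffled_negatives(positives)
for tcr, epitope in negatives:
    print(tcr, epitope)
```

Excluding known positive pairs from the candidate list keeps a shuffled pair from accidentally reproducing a true binder, although it cannot rule out unobserved cross-reactivity.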
\nShuffling Known Positive Pairs: This approach creates random TCR-epitope combinations by pairing each TCR with epitopes different from its known matches. The assumption is that a TCR specific to one epitope is unlikely to be specific to a different, unrelated epitope. Because the MIRA assay sorts T cells into antigen-specific and non-antigen-specific populations after exposure to antigen pools, it identifies what can be considered “negative pairs” as part of its methodology; the shuffled antigen-TCR pairs can therefore be treated as negative data. \nGraph Construction \n \nAntigen and TCR CDR3 sequences were input as strings of amino acid codes and converted into molecular structures using RDKit (30), a cheminformatics toolkit. In this representation, each atom within the amino acids, such as carbon, nitrogen, and oxygen, was treated as a node. The edges between these nodes represent chemical bonds, detailing the molecular structure of the peptide. \nFollowing the conversion, the molecular structures were transformed into detailed graph representations. Nodes in these graphs were enriched with attribute data reflecting atomic properties such as atomic weight, electronegativity, and other relevant chemical characteristics. Edges, representing the chemical bonds between atoms, included attributes such as bond type (e.g., single, double, peptide bond) and bond order, which are crucial for understanding the molecular architecture of the peptides. 
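As a rough illustration of the atom-level representation described above, the following pure-Python sketch builds a heavy-atom graph for a short peptide. The residue templates, the function name `peptide_graph`, and the node identifiers are hypothetical and cover only three residues (G, A, S). The authors derive the molecular structure with RDKit, which this sketch deliberately does not reproduce; hydrogens and the richer node attributes are omitted.

```python
# Heavy-atom templates (hypothetical, illustration only; hydrogens omitted).
# Atoms are (name, element); bonds are (name, name, bond order).
BACKBONE_ATOMS = [('N', 'N'), ('CA', 'C'), ('C', 'C'), ('O', 'O')]
BACKBONE_BONDS = [('N', 'CA', 1), ('CA', 'C', 1), ('C', 'O', 2)]
SIDE_CHAINS = {
    'G': ([], []),
    'A': ([('CB', 'C')], [('CA', 'CB', 1)]),
    'S': ([('CB', 'C'), ('OG', 'O')], [('CA', 'CB', 1), ('CB', 'OG', 1)]),
}


def peptide_graph(sequence):
    # Returns (nodes, edges): nodes maps (residue index, atom name) -> element,
    # edges maps a pair of node ids -> bond order.
    nodes, edges = {}, {}
    if not sequence:
        return nodes, edges
    previous_carbonyl = None
    for index, residue in enumerate(sequence):
        side_atoms, side_bonds = SIDE_CHAINS[residue]
        for name, element in BACKBONE_ATOMS + side_atoms:
            nodes[(index, name)] = element
        for first, second, order in BACKBONE_BONDS + side_bonds:
            edges[((index, first), (index, second))] = order
        if previous_carbonyl is not None:
            # Peptide bond: previous carbonyl carbon to this residue's nitrogen.
            edges[(previous_carbonyl, (index, 'N'))] = 1
        previous_carbonyl = (index, 'C')
    # Free C-terminus: hydroxyl oxygen of the terminal carboxyl group.
    last = len(sequence) - 1
    nodes[(last, 'OXT')] = 'O'
    edges[((last, 'C'), (last, 'OXT'))] = 1
    return nodes, edges


nodes, edges = peptide_graph('GAS')
print(len(nodes), 'atoms,', len(edges), 'bonds')
```

For the toy peptide GAS this gives 16 heavy atoms and 15 bonds. The element stored on each node plays the role of the initial node label, and the bond order is the edge attribute.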
\nThe molecular graphs generated in RDKit were then exported to NetworkX (34) to facilitate the application of network analysis techniques. This integration allowed for an analysis that goes beyond simple structural depiction, enabling the study of detailed interactions within the peptide sequences. NetworkX provided tools to apply graph-theoretical algorithms that analyze the structural and functional characteristics of the peptides at an atomic level. \nGraph embeddings were computed to transform this detailed atomic graph data into a numerical form suitable for computational models. Techniques such as node-level, edge-level, and graph-level embeddings were employed to derive vector representations that capture the intrinsic chemical and structural properties of the peptides for predictive modeling tasks. \nGraph Embedding with Graph2Vec \nIn the graph embedding process used by Graph2Vec, each graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges, is represented as a vector $\\vec{v}$. The algorithm treats each graph as a document and the rooted subgraphs around each node as words. For each graph $G$, the rooted subgraph $h(u)$ for a node $u$ is constructed by aggregating information from its $k$-hop neighborhood, where $k$ represents the Weisfeiler-Lehman (WL) subtree height. The subgraph information is labeled and transformed into a bag-of-words representation $\\mathrm{BoW}(G)$ using the WL subtree kernel. \n$$\\mathrm{BoW}(G) = \\sum_{u \\in V} \\phi(h(u))$$ \nHere, $\\phi(h(u))$ represents the encoding of the rooted subgraph around node $u$. \nThe Graph2Vec algorithm then applies a document embedding technique to learn a fixed-size vector representation that captures the structural and label information of these subgraphs. This embedding vector $\\vec{v}$ for each graph is obtained through an unsupervised learning approach that maximizes the likelihood of preserving neighborhood information of the graphs within the embedding space. 
This is achieved by optimizing the following objective function: \n$$\\max \\prod_{(G_i, h_j) \\in D} P(h_j \\mid \\vec{v}(G_i))$$ \nwhere $P(h_j \\mid \\vec{v}(G_i))$ is the probability of the subgraph $h_j$ given the graph embedding $\\vec{v}(G_i)$, and $D$ is the dataset of graphs. This probability is modeled using a softmax function: \n$$P(h_j \\mid \\vec{v}(G_i)) = \\frac{\\exp(\\vec{v}(G_i) \\cdot \\vec{h}_j)}{\\sum_{h_k \\in H} \\exp(\\vec{v}(G_i) \\cdot \\vec{h}_k)}$$ \nwhere $\\vec{h}_j$ is the embedding of the subgraph $h_j$, and $H$ is the set of all subgraphs in the dataset. Through this unsupervised learning approach, Graph2Vec generates a vector $\\vec{v}$ for each graph that captures both structural and label information of the subgraphs, preserving neighborhood information in the embedding space. \nIn addition to Graph2Vec, the embedding methods utilized in this study include OneHot Encoding, Evolutionary Scale Modeling (ESM), and RDKit Fingerprint Embedding. OneHot Encoding encodes amino acids as independent binary features. ESM generates sequence embeddings informed by evolutionary-scale protein databases. RDKit Fingerprint Embedding represents molecular graphs as binary vectors based on substructure presence. These methods were employed to compare their performance in capturing relevant features for downstream analyses. \nTCR Distance Calculation \nFor the pairwise distance calculation between T-cell receptor (TCR) sequences, we used the TCR-dist method (15), which computes similarity based on the complementarity-determining region 3 (CDR3). This method relies on a vector-based distance metric that accounts for both the sequence and structural properties of the CDR3 regions. 
\nKey parameters for the TCR-dist calculation included a gap penalty of 4 to penalize insertions and deletions, and the trimming of 3 residues from the N-terminus and 2 residues from the C-terminus of each CDR3 sequence. The computation was parallelized across 4 CPU cores, and redundant sequences were filtered to improve efficiency. To further accelerate the calculations, we utilized precomputed distance matrices with appropriate weighting applied to the sequence distances. \nThis configuration ensured accurate and efficient pairwise distance estimation between TCR sequences, forming the basis for subsequent analyses such as clustering and visualization. \nPerformance metrics \nNumerous benchmark methods have been proposed to assess the performance of TCR-epitope prediction algorithms, including those founded on sequence distance. True positive rate (TPR), false positive rate (FPR), and accuracy (ACC) are derived from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We used the area under the receiver operating characteristic curve (AUROC) to evaluate our classification model's performance, gauging its capacity to differentiate between positive and negative classes. \nTF-IDF Score \nTF-IDF (Term Frequency-Inverse Document Frequency) vectorization is a technique used to convert text data into numerical representations. It helps identify the importance of words (in this paper, hashed atomic labels) in a document (here, a graph) relative to a collection of documents (graphs). 
The following definitions show how the score is computed and why it is useful for identifying key features within clusters. \nThe term frequency (TF) measures how frequently a term appears in a graph. It is calculated as: \n$$\\mathrm{TF}(t, g) = \\frac{\\text{number of times term } t \\text{ appears in graph } g}{\\text{total number of terms in graph } g}$$ \nThe inverse document frequency (IDF) measures how informative a term is across the corpus. It is calculated as: \n$$\\mathrm{IDF}(t, C) = \\log\\left(\\frac{\\text{total number of graphs in corpus } C}{\\text{number of graphs containing term } t}\\right)$$ \n$$\\mathrm{TF\\text{-}IDF}(t, g, C) = \\mathrm{TF}(t, g) \\times \\mathrm{IDF}(t, C)$$ \nThis score highlights hashed atomic features that are important (i.e., frequent in a particular graph but rare across the corpus). Additionally, the implementation applied a minimum document frequency to filter out less significant features. Only features appearing in at least a minimum number of graphs were included, ensuring that the identified features were not only significant within individual graphs but also commonly represented across multiple graphs in the cluster. \nClustering Analysis \nTo evaluate the clustering performance of various TCR sequence representations, we applied the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm (29). DBSCAN was selected for its capability to detect clusters of arbitrary shapes while effectively handling noise. The algorithm relies on two key parameters: $\\varepsilon$, which defines the maximum distance between points within the same neighborhood, and $m$, the minimum number of points required to form a cluster. 
For a given point $p$, the $\\varepsilon$-neighborhood is defined as: \n$$N_{\\varepsilon}(p) = \\{q \\in \\mathbb{R}^{d} \\mid d(p, q) \\le \\varepsilon\\}$$ \nA point $p$ is a core point if $|N_{\\varepsilon}(p)| \\ge m$. A point $q$ is density-reachable from a core point $p$ if there exists a chain of core points $p_1, p_2, \\ldots, p_n$ such that $p_1 = p$, $p_n = q$, and $p_{i+1} \\in N_{\\varepsilon}(p_i)$. \nDBSCAN was applied with a range of $\\varepsilon$ values (0.01 – 20) using the Manhattan distance metric. \nDimensionality reduction was applied to ensure computational feasibility and optimize clustering performance. Principal Component Analysis (PCA) was used to reduce fingerprint embeddings, Graph2Vec, ESM2 (650M), and OneHot encoding to 24 dimensions. Multidimensional Scaling (MDS) was applied to TCR distance matrices computed using TCRdist to project the data into a lower-dimensional space. \nThe performance of the clustering algorithm was assessed using two key metrics: \nClustering Precision (c-Precision): This metric measures the purity of clusters with respect to epitope labels. For a cluster $C_i$: \n$$\\text{c-Precision}(C_i) = \\frac{\\max_j |C_i \\cap L_j|}{|C_i|}$$ \nwhere $L_j$ represents the set of points with label $j$, and $|C_i|$ is the size of cluster $C_i$. \nClustering-Critical Success Index (c-CSI): This metric accounts for both cluster purity and the proportion of well-clustered points: \n$$\\text{c-CSI} = \\frac{\\sum_i \\max_j |C_i \\cap L_j|}{\\sum_i |C_i| + \\sum_j |L_j| - \\sum_i \\max_j |C_i \\cap L_j|}$$ \nc-Precision reflects purity alone, while c-CSI accounts for both cluster purity and the proportion of well-clustered points, providing a more comprehensive evaluation of clustering quality. The clustering analysis was performed across the three primary TCR representations (graph-based embeddings, ESM2 embeddings, and OneHot encodings) as well as the RDKit-based fingerprint embeddings and TCRdist. The resulting metrics were plotted to compare the clustering effectiveness of each representation under varying levels of noise and clustering density. 
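The two clustering metrics defined above can be computed directly from cluster assignments and epitope labels. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the function names `c_precision` and `c_csi` are hypothetical, clusters are passed as lists of epitope labels with noise points excluded, and the sum over label sets is taken over every input point, including those DBSCAN marks as noise.

```python
from collections import Counter


def c_precision(cluster_labels):
    # Purity of one cluster: share of members carrying the majority label.
    return max(Counter(cluster_labels).values()) / len(cluster_labels)


def c_csi(clusters, all_labels):
    # clusters: list of clusters, each a list of epitope labels (noise excluded).
    # all_labels: epitope labels of every input point, clustered or not
    # (assumption: the sum of |L_j| covers all points, including noise).
    majority_total = sum(max(Counter(c).values()) for c in clusters)
    clustered_total = sum(len(c) for c in clusters)
    denominator = clustered_total + len(all_labels) - majority_total
    return majority_total / denominator


# Toy example: 7 points, 5 of them clustered, 2 left as noise.
clusters = [['A', 'A', 'B'], ['B', 'B']]
all_labels = ['A', 'A', 'B', 'B', 'B', 'A', 'B']

print(c_precision(clusters[0]))
print(c_csi(clusters, all_labels))
```

In the toy example the majority labels account for 4 points, so c-CSI is 4 / (5 + 7 - 4) = 0.5, while the first cluster alone has a c-Precision of 2/3. A method that forms a few very pure clusters but leaves most points as noise therefore scores high on c-Precision and low on c-CSI.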
Since GLIPH does not involve varying ε-like parameters, the output of the clustering analysis is a single pair of c-Precision and c-CSI values for each processed dataset or condition. This is different from DBSCAN, where multiple metrics can be calculated across a range of ε values. \n \n \nReferences \n \n1. Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet. 2017;49(5):659-65. \n2. Tong Y, Wang J, Zheng T, Zhang X, Xiao X, Zhu X, et al. SETE: Sequence-based Ensemble learning approach for TCR Epitope binding prediction. Comput Biol Chem. 2020;87:107281. \n3. Wu D, Gowathaman R, Pierce BG, Mariuzza RA. T cell receptors employ diverse strategies to target a p53 cancer neoantigen. J Biol Chem. 2022;298(3):101684. \n4. Purcell AW, McCluskey J, Rossjohn J. More than one reason to rethink the use of peptides in vaccine design. Nature reviews Drug discovery. 2007;6(5):404-14. \n5. Peng Y, Mentzer AJ, Liu G, Yao X, Yin Z, Dong D, et al. Broad and strong memory CD4(+) and CD8(+) T cells induced by SARS-CoV-2 in UK convalescent individuals following COVID-19. Nat Immunol. 2020;21(11):1336-45. \n6. Koning D, Costa AI, Hasrat R, Grady BP, Spijkers S, Nanlohy N, et al. In vitro expansion of antigen-specific CD8+ T cells distorts the T-cell repertoire. Journal of immunological methods. 2014;405:199-203. \n7. Klinger M, Pepin F, Wilkins J, Asbury T, Wittkop T, Zheng J, et al. 
Multiplex identification of antigen-specific T cell receptors using a combination of immune assays and immune receptor sequencing. PLoS One. 2015;10(10):e0141561. \n8. Sette A, Peters B. Immune epitope mapping in the post-genomic era: lessons for vaccine development. Curr Opin Immunol. 2007;19(1):106-10. \n9. Gao Y, Gao Y, Fan Y, Zhu C, Wei Z, Zhou C, et al. Pan-Peptide Meta Learning for T-cell receptor–antigen binding recognition. Nature Machine Intelligence. 2023:1-14. \n10. Weber A, Born J, Rodriguez Martinez M. TITAN: T-cell receptor specificity prediction with bimodal attention networks. Bioinformatics. 2021;37(Suppl_1):i237-i44. \n11. Kearnes S, McCloskey K, Berndl M, Pande V, Riley P. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30(8):595-608. \n12. Gawehn E, Hiss JA, Schneider G. Deep Learning in Drug Discovery. Mol Inform. 2016;35(1):3-14. \n13. La Gruta NL, Gras S, Daley SR, Thomas PG, Rossjohn J. Understanding the drivers of MHC restriction of T cell receptors. Nature Reviews Immunology. 2018;18(7):467-78. \n14. Springer I, Tickotsky N, Louzoun Y. Contribution of t cell receptor alpha and beta cdr3, mhc typing, v and j genes to peptide binding prediction. Front Immunol. 2021;12:664514. \n15. Dash P, Fiore-Gartland AJ, Hertz T, Wang GC, Sharma S, Souquette A, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017;547(7661):89-93. \n16. Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, et al. Identifying specificity groups in the T cell receptor repertoire. Nature. 2017;547(7661):94-8. \n17. Sidhom J-W, Larman HB, Pardoll DM, Baras AS. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nature communications. 2021;12(1):1-12. \n18. Gielis S, Moris P, Bittremieux W, De Neuter N, Ogunjimi B, Laukens K, et al. Identification of Epitope-Specific T Cells in T-Cell Receptor Repertoires. 
Methods Mol Biol. 2020;2120:183-95. \n19. Montemurro A, Schuster V, Povlsen HR, Bentzen AK, Jurtz V, Chronister WD, et al. NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRalpha and beta sequence data. Commun Biol. 2021;4(1):1060. \n20. Springer I, Besser H, Tickotsky-Moskovitz N, Dvorkin S, Louzoun Y. Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs. Front Immunol. 2020:1803. \n21. Lu T, Zhang Z, Zhu J, Wang Y, Jiang P, Xiao X, et al. Deep learning-based prediction of the T cell receptor–antigen binding specificity. Nature machine intelligence. 2021;3(10):864-75. \n22. Dens C, Laukens K, Bittremieux W, Meysman P. The pitfalls of negative data bias for the T-cell epitope specificity challenge. Nature Machine Intelligence. 2023;5(10):1060-2. \n23. Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:170705005. 2017. \n24. Grohe M. word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data. Pods'20: Proceedings of the 39th Acm Sigmod-Sigact-Sigai Symposium on Principles of Database Systems. 2020:1-16. \n25. Nolan S, Vignali M, Klinger M, Dines JN, Kaplan IM, Svejnoha E, et al. A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Research square. 2020. \n26. Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-Lehman Graph Kernels. J Mach Learn Res. 2011;12:2539-61. \n27. 
Shugay M, Bagaev DV, Zvyagin IV, Vroomans RM, Crawford JC, Dolton G, et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic acids research. 2018;46(D1):D419-D27. \n28. Lin ZM, Akin H, Rao RS, Hie B, Zhu ZK, Lu WT, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-30. \n29. Leary AY, Scott D, Gupta NT, Waite JC, Skokos D, Atwal GS, et al. Designing meaningful continuous representations of T cell receptor sequences with deep generative models. Nat Commun. 2024;15(1):4271. \n30. RDKit: Open-source cheminformatics. https://www.rdkit.org \n31. Snyder TM, Gittelman RM, Klinger M, May DH, Osborne EJ, Taniguchi R, et al. Magnitude and Dynamics of the T-Cell Response to SARS-CoV-2 Infection at Both Individual and Population Levels. medRxiv. 2020. \n32. Dillon M. Introduction to Modern Information-Retrieval - Salton,G, Mcgill,M. Inform Process Manag. 1983;19(6):402-3. \n33. Katayama Y, Kobayashi TJ. Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction. Front Immunol. 2022;13. \n34. Hagberg A, Swart P, S Chult D. Exploring network structure, dynamics, and function using NetworkX. 2008. ","source_license":"CC-BY-4.0","license_restricted":false}