Machine Learning Dataset and Benchmark for Accurate T Cell Receptor-pHLA Binding Prediction | Research Square

Resource

Fuli Feng, Xinyuan Zhu, Jiadong Lu, Yeqing Lu, Yuyan Zhang

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7286169/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 (latest preprint version)

Abstract

A central challenge in immunology and therapeutic design is accurately predicting the diverse interactions between T cell receptors (TCRs) and peptide-HLA (pHLA) complexes. Existing machine learning tools are hindered by incomplete sequence data and biased non-binding examples. To overcome this, we present Hi-TPH, a large-scale hierarchical dataset featuring an on-the-fly selection strategy for generating non-binding data. We further develop Hi-TPH-PLMs, a collection of Protein Language Models (PLMs) with varied architectures and scales, fine-tuned on Hi-TPH. These models achieve a 17.4% performance gain over state-of-the-art tools on an external wet-lab test set. Leveraging the hierarchical structure of Hi-TPH, detailed analyses dissect the contribution of different molecular components to binding prediction and reveal their synergistic interplay: for instance, the prediction contribution of HLA relies on the presence of full TCR chains. Hi-TPH and Hi-TPH-PLMs are publicly released to support the development of more reliable tools for advanced immunoinformatics research and personalized immunotherapy.

Subject areas
Biological sciences/Computational biology and bioinformatics/Data publication and archiving
Biological sciences/Immunology/Immunotherapy
Biological sciences/Computational biology and bioinformatics/Machine learning

Additional Declarations
There is NO Competing Interest.

Supplementary Files
HiTPHNCSSupp.pdf
Article details
Affiliation (all authors): University of Science and Technology of China
Corresponding author: Fuli Feng
Posted: September 11th, 2025
Journal (In Review): Nature Computational Science
Article File: HiTPHNCSMain.pdf (https://assets-eu.researchsquare.com/files/rs-7286169/v1_covered_30b6832c-c739-4609-bd78-3974a4dbddaa.pdf)
Supplement: HiTPHNCSSupp.pdf (https://assets-eu.researchsquare.com/files/rs-7286169/v1/acfb155b9a1984cdd398fa49.pdf)