Machine Learning Dataset and Benchmark for Accurate T Cell Receptor-pHLA Binding Prediction

preprint OA: closed
Resource

Fuli Feng (corresponding author), Xinyuan Zhu, Jiadong Lu, Yeqing Lu, Yuyan Zhang
University of Science and Technology of China

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7286169/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review at Nature Computational Science (Nature Portfolio)
Version 1 posted September 11th, 2025

Subject areas
Biological sciences/Computational biology and bioinformatics/Data publication and archiving
Biological sciences/Immunology/Immunotherapy
Biological sciences/Computational biology and bioinformatics/Machine learning

Additional Declarations
There is NO Competing Interest.

Supplementary Files
HiTPHNCSSupp.pdf

Abstract
A central challenge in immunology and therapeutic design is accurately predicting the diverse interactions between T cell receptors (TCRs) and peptide-HLA (pHLA) complexes. Existing machine learning tools are hindered by incomplete sequence data and biased non-binding examples. To overcome this, we present Hi-TPH, a large-scale hierarchical dataset featuring an on-the-fly selection strategy for generating non-binding data. We further develop Hi-TPH-PLMs, a collection of Protein Language Models (PLMs) with varied architectures and scales, fine-tuned on Hi-TPH. These models achieve a 17.4% performance gain over state-of-the-art tools on an external wet-lab test set. Leveraging the hierarchical structure of Hi-TPH, detailed analyses dissect the contribution of different molecular components to binding prediction and reveal their synergistic interplay; for instance, the prediction contribution of HLA relies on the presence of full TCR chains. Hi-TPH and Hi-TPH-PLMs are publicly released to support the development of more reliable tools for advanced immunoinformatics research and personalized immunotherapy.
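The abstract mentions an on-the-fly selection strategy for generating non-binding data but does not describe it. The sketch below is not the authors' method. It illustrates one common way such a scheme can work, under the assumption that negatives are drawn fresh each epoch by mismatching TCRs with pHLAs from other known binders. The names `sample_negatives` and `make_epoch`, the `neg_per_pos` parameter and the toy sequences are hypothetical.

```python
import random
from typing import List, Set, Tuple

# One example: (tcr_sequence, peptide, hla_allele). Values used here are toy placeholders.
Triple = Tuple[str, str, str]


def sample_negatives(positives: List[Triple], n: int, rng: random.Random,
                     max_tries: int = 1000) -> List[Triple]:
    """Draw up to n mismatched (TCR, pHLA) triples that are not known binders.

    Assumption: any TCR/pHLA combination absent from `positives` is treated
    as non-binding, which is a simplification.
    """
    known: Set[Triple] = set(positives)
    tcrs = [p[0] for p in positives]
    phlas = [(p[1], p[2]) for p in positives]
    negatives: List[Triple] = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        tcr = rng.choice(tcrs)
        peptide, hla = rng.choice(phlas)
        candidate = (tcr, peptide, hla)
        if candidate not in known:  # skip pairs recorded as binders
            negatives.append(candidate)
    return negatives


def make_epoch(positives: List[Triple], neg_per_pos: int,
               rng: random.Random) -> List[Tuple[Triple, int]]:
    """Build one epoch of labeled data with freshly sampled negatives (label 0)."""
    negatives = sample_negatives(positives, neg_per_pos * len(positives), rng)
    data = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    toy = [("TCR_A", "PEP_1", "HLA_X"),
           ("TCR_B", "PEP_2", "HLA_X"),
           ("TCR_C", "PEP_3", "HLA_Y")]
    for example, label in make_epoch(toy, neg_per_pos=2, rng=random.Random(0)):
        print(label, example)
```

Calling `make_epoch` again at the start of each epoch yields a different set of negatives, so the model is not trained against one fixed and possibly biased non-binding set. Whether Hi-TPH samples negatives this way, or applies further constraints at different levels of its hierarchy, would need to be checked against the full text.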


Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00