Assessing data size requirements for training generalizable sequence-based TCR specificity models via pan-allelic MHC-I non-self ligandome evaluation

Antoine Delaunay, Miles McGibbon, Bachir Djermani, Nikolai Gorbushin, Sergio Chaves García-Mascaraque, Isaac Rayment, Ilya Kizhvatov, Cecile Petit, Maren Lang, Karim Beguir, Ugur Sahin, Liviu Copoiu, Nicolas Lopez Carranza, Andrey Tovchigrechko

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6446591/v2
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 27 Nov, 2025 in Scientific Reports. You are reading the latest preprint version (version 2).

Abstract

Quickly identifying which T cell receptors (TCRs) specifically bind patient-unique neoepitopes is a critical challenge for personalized TCR cell therapy in oncology. Because both the TCR and neoepitope repertoires are enormously diverse, a machine learning predictor of TCR-pMHC specificity for personalized therapy must generalize to TCRs and epitopes not seen in the training data. For the first time, we estimate the necessary size of such training data. We first show that published models fail to generalize beyond a single-residue dissimilarity to the epitope training set distribution. We then impute the possible mutated ligandome across the 34 most prevalent human MHC alleles and represent it as a graph based on our established dissimilarity cutoff. By finding the dominating set of this graph, we estimate that between one and 100 million epitopes are required to train a generalizable sequence-based TCR specificity prediction model, 1000 times the size of current public data.

Subject areas: Biological sciences/Computational biology and bioinformatics/Machine learning; Health sciences/Oncology/Cancer/Cancer therapy/Cancer immunotherapy; Biological sciences/Immunology/Antigen processing and presentation/Mhc/Mhc class i; Biological sciences/Immunology/Lymphocytes/T cells/T cell receptor

Keywords: TCR specificity prediction, oncology, personalized T-cell therapy, sample size estimation

Additional Declarations

Competing interest reported. A. Delaunay, M. McGibbon, B. Djermani, N. Gorbushin, S. Chaves García-Mascaraque, I. Rayment, K. Beguir, L. Copoiu and N. Lopez Carranza are employees of InstaDeep, a subsidiary of the BioNTech group. I. Kizhvatov, C. Petit, M. Lang, and A. Tovchigrechko are employees of BioNTech. U. Sahin is the CEO of BioNTech. This study was funded by BioNTech.
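The graph step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that two epitopes are linked when they have the same length and differ by at most one residue (Hamming distance of 1 or less); the abstract does not specify the exact dissimilarity measure. It also swaps in a greedy approximation for the dominating set, since finding a minimum dominating set is NP-hard in general. The function names and the toy peptides are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(peptides, cutoff=1):
    """Adjacency sets linking equal-length peptides within `cutoff` substitutions.

    Assumption: the similarity rule is a simple Hamming cutoff.
    The pairwise loop is O(n^2) and only suitable for toy inputs.
    """
    adj = {p: set() for p in peptides}
    for a, b in combinations(peptides, 2):
        if len(a) == len(b) and hamming(a, b) <= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def greedy_dominating_set(adj):
    """Greedy approximation of a dominating set.

    Repeatedly picks the node that covers the most still-uncovered nodes
    (itself plus its neighbours) until every node is covered.
    """
    uncovered = set(adj)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(adj), key=lambda n: len(({n} | adj[n]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | adj[best]
    return chosen


# Toy example with made-up 4-mers (real MHC-I ligands are longer)
peptides = ["AAAA", "AAAC", "AACA", "CCCC", "CCCA", "GGGG"]
adj = build_graph(peptides)
dom = greedy_dominating_set(adj)
print(len(dom), dom)
```

In this reading, every peptide in the toy set is either in the selected set or within one substitution of a selected peptide, so the size of the selected set plays the role of the training set size estimate. A ligandome-scale computation would need an indexed neighbour search rather than the all-pairs loop used here.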
Editorial history at Scientific Reports: first submitted 09 Jun, 2025; submission checks completed 10 Jun, 2025; editor assigned 11 Jun, 2025; editor and reviewers invited 16 Jun, 2025; reviewers agreed between 16 and 24 Jun, 2025; reviews received 19 Jun and 26 Jun, 2025; editorial decision (revision requested) 27 Jun, 2025.

Published version: https://doi.org/10.1038/s41598-025-26454-7