EHRs Enable Robust Lung Cancer Risk Stratification with Transformer-based Models: A Retrospective Multi-center Validation Study

preprint OA: closed CC-BY-4.0
Eduardo Alonso, Naroa Mendez, Teresa Garcia-Navarro, Eunate Arana-Arri, and 19 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8200847/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Revision (Version 1)

Abstract

Early detection of lung cancer is challenging, and current screening eligibility relies on costly, difficult-to-scale questionnaires. We developed and validated risk stratification models using routinely collected longitudinal structured Electronic Health Records (EHRs) to support population-level screening and evaluation.
In this retrospective, multicentre study, we trained four AI models, comparing non-temporal approaches (Count-Based (CB) Logistic Regression and a time-agnostic Transformer) with temporal sequence modeling approaches (an LSTM network and a time-aware Transformer). External validation was performed on two independent cohorts, from Osakidetza (26,348 individuals from Spain) and the University Hospital of Liège (33,576 individuals from Belgium), to assess both generalizability and screening efficiency. The time-aware Transformer model (STraTS_t) was the top performer (AUROC 0.809) in the Andalusian Health Service training cohort (202,830 individuals from Spain). Its performance was robustly preserved during sequential external validation (Osakidetza AUROC 0.794; Liège AUROC 0.743). STraTS_t also showed superior screening efficiency, requiring only 26.54% of the population to be screened to detect 70% of lung cancer cases, compared to 41.01% for the baseline CB model. Our findings demonstrate that structured routine EHRs and time-aware Transformers deliver accurate, robust lung cancer risk stratification that is sustained across distinct European health systems. This capability makes a strong case for screening approaches that are cost- and time-efficient and suitable for population-level deployment without requiring new data collection.

Subject areas: Biological sciences/Cancer; Biological sciences/Computational biology and bioinformatics; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research

Keywords: Lung cancer, Electronic Health Records, Risk Stratification, Time-Series Analysis, Screening Efficiency, Transformers

Additional Declarations: No competing interests reported.
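The screening-efficiency figures quoted in the abstract (the share of the population that must be screened to detect 70% of cases) can be read as a rank-based metric: order individuals by predicted risk and count how far down the list one must go to capture the target share of cases. The following is a minimal sketch of that reading, not the authors' implementation; the function name `fraction_to_screen`, the handling of tied scores (input order is kept), and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def fraction_to_screen(
    scores: Sequence[float],
    labels: Sequence[int],
    target_sensitivity: float = 0.70,
) -> float:
    """Smallest fraction of the population, ranked by descending risk score,
    that must be screened to capture `target_sensitivity` of all cases.

    Hypothetical helper for illustration; labels are 1 (case) or 0 (non-case).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")
    total_cases = sum(labels)
    if total_cases == 0:
        raise ValueError("at least one positive case is required")

    # Rank individuals from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    cases_found = 0
    for screened, (_, label) in enumerate(ranked, start=1):
        cases_found += label
        # Stop as soon as the target share of cases has been captured.
        if cases_found >= target_sensitivity * total_cases:
            return screened / len(ranked)
    return 1.0


# Toy data, invented for illustration only (not from the study).
toy_scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

half = fraction_to_screen(toy_scores, toy_labels, target_sensitivity=0.5)
print(half)  # 0.3: the top 3 of 10 individuals contain 2 of the 4 cases
```

Under this reading, a lower value is better: a model that ranks true cases nearer the top needs a smaller screened fraction to reach the same sensitivity, which is the sense in which the abstract compares 26.54% for STraTS_t with 41.01% for the CB baseline.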
Editorial timeline:
Editorial decision (revision requested): 29 Jan, 2026
Reviews received at journal: 23 Jan, 2026
Reviews received at journal: 23 Jan, 2026
Reviews received at journal: 22 Jan, 2026
Reviews received at journal: 22 Jan, 2026
Reviews received at journal: 03 Jan, 2026
Reviewers agreed at journal: 29 Dec, 2025
Reviewers agreed at journal: 28 Dec, 2025
Reviewers agreed at journal: 27 Dec, 2025
Reviewers agreed at journal: 27 Dec, 2025
Reviewers agreed at journal: 26 Dec, 2025
Reviewers agreed at journal: 26 Dec, 2025
Reviewers invited by journal: 26 Dec, 2025
Editor assigned by journal: 28 Nov, 2025
Submission checks completed at journal: 28 Nov, 2025
First submitted to journal: 25 Nov, 2025
Authors and affiliations:
Eduardo Alonso (Vicomtech; corresponding author)
Naroa Mendez (Vicomtech)
Teresa Garcia-Navarro (Vicomtech)
Eunate Arana-Arri (Biobizkaia HRI, Osakidetza)
Jon Eneko Idoyaga-Uribarrena (Biobizkaia HRI)
Miguel Giraldez-Álvarez (Institute of Biomedicine of Seville)
Alberto Moreno-Conde (Institute of Biomedicine of Seville)
Jesús Moreno-Conde (Institute of Biomedicine of Seville)
Francisco J. Núñez-Benjumea (Institute of Biomedicine of Seville)
David Vicente-Baz (Hospital Universitario Virgen Macarena)
Julien Guiot (Centre Hospitalier Universitaire de Liège)
Astrid Paulus (Centre Hospitalier Universitaire de Liège)
Marjorie Gangolf (Centre Hospitalier Universitaire de Liège)
Monique Henket (Centre Hospitalier Universitaire de Liège)
Benoit Ernst (Centre Hospitalier Universitaire de Liège)
Valentina Gogulancea (University of Ulster)
Debbie Rankin (University of Ulster)
Michaela Black (University of Ulster)
Ibai Gurrutxaga (University of the Basque Country)
Andoni Beristain (Vicomtech)
Alba Garin-Muga (Vicomtech)
Ivan Macía (Vicomtech)
Xabier Calle (Vicomtech)

Article type: Article
Journal under review: npj Digital Medicine
Posted: December 3rd, 2025
Platform: Research Square, ISSN 2693-5015 (online)


License: CC-BY-4.0