An explainable language model predicts survival from medical reports in oncology

doi:10.21203/rs.3.rs-7121466/v1


2025 · doi:10.21203/rs.3.rs-7121466/v1

preprint OA: closed


Clément Piat, Quentin Blampey, Alexandre Joutard, Mohamed Aymen Qabel, and 9 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7121466/v1

This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

Prognosis estimation is key to personalizing oncology care, yet current models rely on limited and often incomplete clinical and biological data. We designed a solution adapted to any kind of cancer (type and stage) based on narrative electronic medical reports, i.e. the basic working material for oncologists. We used 2.3M medical documents (corresponding to 36,123 patients for whom we had the date of death) to train, validate and test three different approaches.
The best survival prediction performances were obtained by taking into account the medical history with sequential reports. This model (K-memBERT-T2) reached a Pearson correlation of 0.655 on the test cohort and 0.621 on a large external cohort of 143k documents (17,633 additional patients) (p-values < 10⁻⁵), and a concordance index of 0.766 when adding 7,082 alive (censored) patients to this external test cohort. The 3-month binary survival predictions achieved an AUC of 0.852 on the test cohort and 0.875 on the external dataset. The model predictions related to survival duration better than the Performance Status (PS), independently of its mention in the texts. We present a non-invasive and interpretable method, paving the way for easy implementation in French-speaking centers.

Subjects: Biological sciences/Cancer; Biological sciences/Computational biology and bioinformatics; Health sciences/Oncology

Keywords: deep learning, electronic health records, natural language processing, transformers, prognosis, cancer

TRANSLATIONAL RELEVANCE

Language models can predict pan-cancer overall survival from routinely drafted electronic health records in oncology, with confidence level estimation and interpretation of the most important words. The application can also use recent medical history if available (a series of dated medical reports), which was shown to improve the reliability of the predictions. At this stage, without a dedicated prospective confirmation study, K-memBERT should not be used for clinical decision support.

INTRODUCTION

Imperfect prognostication of patients with cancer entails multiple negative consequences, starting with the misevaluation of the goals of treatment. When reaching the end of life, quality of life must be considered over quantity, especially as, in some cases, early introduction of supportive care improves both overall survival and quality of life [1, 2].
Refining, facilitating and automating prognosis estimation could therefore benefit patients at multiple levels through their caregivers. Meanwhile, hospitals accumulate colossal amounts of health data in electronic format that can be exploited by machines to help physicians in their daily work. An increasing number of machine- and deep-learning algorithms show that artificial intelligence (AI)-based predictive tools can successfully predict important clinical outcomes [3, 4], including survival in cancer patients [5–7], by processing either raw medical images (CT scans, electrocardiograms, histopathology slides, etc.) or structured data such as clinical descriptors, molecular profiles or lab results. Medical reports in narrative format have several advantages: they (i) gather relevant personal and clinical information about the patient and the disease in an almost exhaustive manner and in a single format, (ii) are quickly and easily accessible to any oncologist or healthcare professional involved in the treatment decision, and (iii) only require existing resources such as the hospital informatics environment. The objectives of this study are to show that language models can predict patient prognosis from narrative medical data, and to develop a tool that is easy to implement in practice.

METHODS

Patients

We retrospectively retrieved electronic health records (EHRs) from patients affected by malignant cancers who fulfilled the following criteria: adult (≥ 18 years old) patient treated at the Gustave Roussy Cancer center (GR), France, between July 1987 and December 2020, with at least one non-blank (> 250 characters) electronic text document, deceased at the time of collection, and with the date of death recorded in the medical database. The types of reports selected were consultation reports, radiological reports, clinical notes and hospitalization reports.
The independent test cohort comprised EHRs from adult patients included in a prospective study of any kind at the Centre Léon Bérard (CLB, another Comprehensive Cancer Center), France, with at least one non-blank medical report recorded between January 2000 and December 2020. All the living CLB patients included in the analysis had been explicitly asked, during their lifetime, about the use of their personal data for research purposes, and only patients who did not object were analyzed.

Dataset

We filtered out documents recorded after the date of death and documents with fewer than 250 characters. The remaining texts were preprocessed and tokenized using SentencePiece [8], which splits words into sub-word units such as prefixes and suffixes (for example, "oncology" is treated as "onco + logy"). Preprocessing also included changing letters to lowercase and limiting longer texts to 512 tokens. For texts longer than 512 tokens, we took the first 256 tokens and the last 256 tokens, and added the token "..." between the two parts. We performed a random split of the GR dataset at the patient level (to avoid data leakage) to compose the training, validation and test sets, with 88%, 2% and 10% of patients in each group, respectively. From the CLB cohort (used only as an external test cohort), we sub-sampled 10 datasets of 1,000 randomly selected patients each, in order to compute distributions of the tested performance metrics.

BERT-based models and control

The existing CamemBERT model underwent masked language modeling pre-training on a non-medical text corpus (vocabulary size of 32,005 tokens) [9]. We did not perform additional language modeling but instead hypothesized that the tokenizer could miss specific medical terms that are important for the task of prognosis prediction. We thus expanded the vocabulary by adding the 500 most frequent tokens found in the GR training cohort that were not in the original vocabulary of CamemBERT. The related embeddings were randomly initialized and further trained.
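The two dataset preparation steps described in the Dataset paragraph above, head-and-tail truncation of long documents and a patient-level random split, can be sketched as follows. This is a minimal illustration and not the authors' released code: the function names, the fixed seed and the integer patient identifiers are hypothetical, and the tokens are assumed to be already produced by the tokenizer.

```python
import random
from typing import Iterable, List, Sequence, Tuple

MAX_TOKENS = 512   # length limit stated in the text
HALF = 256         # tokens kept from each end of a longer document
SEPARATOR = "..."  # token placed between the two halves


def head_tail_truncate(tokens: Sequence[str]) -> List[str]:
    """Return short documents unchanged; otherwise keep head + "..." + tail."""
    if len(tokens) <= MAX_TOKENS:
        return list(tokens)
    # Note: 256 + 1 + 256 = 513 items. How the original pipeline budgets the
    # separator and any special tokens is not stated; this is an assumption.
    return list(tokens[:HALF]) + [SEPARATOR] + list(tokens[-HALF:])


def split_patients(
    patient_ids: Iterable[int],
    fractions: Tuple[float, float, float] = (0.88, 0.02, 0.10),
    seed: int = 0,  # hypothetical seed, only for reproducibility of the sketch
) -> Tuple[List[int], List[int], List[int]]:
    """Randomly assign each patient (not each document) to train/val/test."""
    ids = sorted(set(patient_ids))  # deduplicate so each patient counts once
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Splitting on patient identifiers rather than on documents keeps all reports from one patient in the same set, which is what prevents leakage between training and evaluation.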
We then benchmarked three alternatives: (1) K-memBERT-base, which takes one single document as input, (2) K-memBERT-conflation, which aggregates the prediction distributions of multiple K-memBERT-base runs, and (3) K-memBERT-T2, which stacks two transformers to take a sequence of medical reports as input. The multiple reports consisted of a sequence of historical reports ending at the time point from which survival was computed. For each model type (K-memBERT-base, K-memBERT-conflation and K-memBERT-T2), we trained the parameters on the same training set, used the validation set to tune the hyperparameters, and independently evaluated the performance of each model on the GR test set. The best-performing model was selected based on the highest Pearson correlation between prediction and ground truth (see Metrics).

Predictions: objectives and labels

We designed a model that predicts a probability distribution over survival time and estimates the confidence in each prediction. We used Gaussian distributions to represent the model predictions, defined by a mean value and the associated standard deviation, and translated the labels into a standard uniform distribution on [0, 1] with the formula 1 − exp(−individual survival / mean survival of the training set). For each prediction, the mean value is the survival duration and the standard deviation represents the confidence in the prediction. The loss is a generalization of the Mean Squared Error that includes the standard deviation. Additional information is provided in Supplementary methods.

Metrics and calibration

Continuous survival predictions were evaluated by Pearson and Spearman correlations between the predicted and the true survivals. We used the concordance index for survival analysis in the CLB cohort, which included censored survival times.
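The label transformation from the Predictions paragraph above can be illustrated with a short sketch. The transformation itself follows the formula given in the text; the loss shown here is the standard Gaussian negative log-likelihood, which is only an assumption about what "a generalization of the Mean Squared Error including the standard deviation" may look like, since the exact form is given in the Supplementary methods and not in this text. Function names are hypothetical, and 701 days is used purely as an illustrative mean.

```python
import math


def survival_to_label(days: float, mean_days: float) -> float:
    """Map a survival time in days to [0, 1) with 1 - exp(-t / mean)."""
    return 1.0 - math.exp(-days / mean_days)


def label_to_survival(label: float, mean_days: float) -> float:
    """Inverse mapping: recover a duration in days from a label in [0, 1)."""
    return -mean_days * math.log(1.0 - label)


def gaussian_nll(mu: float, sigma: float, target: float) -> float:
    """Gaussian negative log-likelihood, up to an additive constant.

    ASSUMPTION: stand-in for the paper's loss. With sigma fixed to 1 it
    reduces to half the squared error, so it generalizes the MSE.
    """
    return 0.5 * ((target - mu) / sigma) ** 2 + math.log(sigma)


if __name__ == "__main__":
    mean_days = 701.0  # illustrative value; the paper uses the training-set mean
    label = survival_to_label(365.0, mean_days)
    print(round(label, 3), round(label_to_survival(label, mean_days), 1))
```

If survival times are roughly exponentially distributed, 1 − exp(−t / mean) is their cumulative distribution function, which is why the transformed labels are approximately uniform on [0, 1]. Under the assumed loss, a small predicted standard deviation is rewarded only when the predicted mean is close to the target.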
In addition, we set time thresholds and evaluated standard performance metrics (balanced accuracy, AUC, recall, which is equivalent to sensitivity, and F1-scores) at 3 months, 1 year and at the mean survival duration (based on the GR cohort). For recall and F1-scores we took the class "survival" as the default class, unless otherwise mentioned. We performed a calibration analysis of the models' predictions against various potential sources of bias such as time of follow-up, tumor type, age, sex, prognostic scores, and the observed prognosis of the patients.

Comparison to performance status

We used the Performance Status (PS) as a gold-standard measure for easily estimating prognosis; it is widely used, for example, in clinical trial selection criteria [10, 11]. We extracted the documents containing the mention 'PS=[number]' in the CLB test cohort. We performed three types of analysis: (1) we compared the relation of the PS classes with the true survival and with the predicted survival, (2) we evaluated whether removing 'PS=[number]' or 'Karnofsky=[percentage]' would change the predicted survival, and (3) we evaluated how five classes derived from K-memBERT predictions (intervals [0–0.2], [0.2–0.4], [0.4–0.6], [0.6–0.8] and [0.8–1]) related to true survival. 'PS=[number]' and 'Karnofsky=[percentage]' are semi-structured information at the beginning of consultation reports in the CLB dataset, so we could remove them using regular expressions. We computed the confusion matrices and several multi-class classification metrics: Precision (equivalent to positive predictive value, where death is the event), Recall, Overall Accuracy, Balanced Accuracy and F1-Score, for both reference classes when relevant, and Cohen's Kappa score [12].
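The three mechanical steps of the comparison above (removing the score mentions, binning predictions into five intervals, and measuring agreement with Cohen's kappa) can be sketched as follows. The regular expression is a hypothetical pattern written for this illustration; the exact expressions used on the CLB reports are not given in the text, and how true survival was mapped to classes is not specified either, so the sketch only covers the prediction side and the agreement statistic.

```python
import re
from collections import Counter
from typing import Sequence

# Hypothetical pattern: matches e.g. "PS=1", "PS = 0", "Karnofsky=80%".
SCORE_PATTERN = re.compile(r"\b(?:PS|Karnofsky)\s*=\s*\d{1,3}\s*%?", re.IGNORECASE)


def strip_performance_scores(text: str) -> str:
    """Remove semi-structured PS / Karnofsky mentions from a report."""
    return SCORE_PATTERN.sub("", text)


def prognosis_class(label: float, n_classes: int = 5) -> int:
    """Bin a prediction in [0, 1] into equal-width intervals (0 .. n_classes-1)."""
    if not 0.0 <= label <= 1.0:
        raise ValueError("label must lie in [0, 1]")
    # A label of exactly 1.0 is assigned to the last interval.
    return min(int(label * n_classes), n_classes - 1)


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

Running the model on both the original text and the output of `strip_performance_scores` and comparing the two predictions is one way to test whether the explicit score mention drives the prediction.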
Ethics

The study complies with the European GDPR regulation 2016/679, French law, and the Good Clinical Practice Guidelines of the International Conference on Harmonization, and was approved by internal ethics and scientific commissions (notification 2021-66). We registered the project under the MR004 declaration (V3.2 23/08/2021) with the French Health Data Hub and Unicancer at both sites, GR and CLB. We confirmed that all patients gave their consent or were not opposed to the use of their data. The data collected could only be used for the aim of this study, and patient identification was secured.

Code, libraries and resources

We used Python 3.7, Pytorch 1.7, Transformers 4.1, Scikit-learn and the transformer-interpret libraries. For data security, we trained the model on a single on-site GeForce GTX 1080 Ti GPU with 11 GB of memory. We have released the source code and the code for a local installation of the application at https://github.com/DITEP/KmemBERT , and a demonstration of a web-hosted application of K-memBERT at https://fd21dde0-4a7b-44bc-b011-b5de469557ea.pub.instances.scw.cloud/BERTPrediction .

Role of Funders

Human resources were funded by GR and CLB, which are non-profit Comprehensive Cancer Centers. There is no specific influence to declare on study design, data collection, data analyses, interpretation, or writing of the report.

RESULTS

Data

The GR dataset preparation led to a clean and workable text corpus of 2.7 million documents from 36,124 patients. About half (46.5%) of the documents were consultation reports, 15% were radiological reports, 12% were clinical notes and 11% were hospitalization reports (Supplementary Table 1 and Supplementary Results). The random split isolated 31,789 patients for training (2,053k documents), 722 patients for validation (47k documents) and 3,612 patients for testing (233k documents) (Fig. 1A).
The CLB external independent testing cohort consisted of 17,633 patients, from which we randomly sub-sampled 10,000 patients into 10 cohorts of 1,000 patients each, for a total of 243k documents. On average, the format of this corpus was similar to that of the GR cohort (Supplementary Fig. 1E-H), with a mean of 46 documents per patient, and 97% of patients had more than four eligible documents. The patients in the CLB cohorts had a mean age of 56 years (standard deviation of 16) and 58% were female; 22% of patients had breast cancers, 13% digestive tract cancers, 11% hematological cancers and 8% mesothelial and soft tissue cancers, and 10 other types of cancer had frequencies between 1% and 7%. In all cohorts, survival dropped exponentially over time, with a mean survival time of 701 days for the GR cohort and 837 days for the CLB cohort (rate of variation = 19.3%) (Fig. 1B).

Models' description, training and selection on the validation cohort

We developed K-memBERT, a suite of deep learning models adapted for survival prediction, from the French Transformer CamemBERT. We designed three types of K-memBERT models to explore which approach provides the most accurate predictions (Fig. 2A-C): one that takes a single document as input (K-memBERT-base), and two that can take a sequence of several historical medical reports from the same patient as input, either by conflation (K-memBERT-conflation) or with an additional Transformer layer on top of K-memBERT-base (K-memBERT-T2). In every configuration we used the date of the last report input to the model to compute each patient's survival. Fine-tuning K-memBERT-base on the GR training cohort took 40 h on a single local and secured GPU. The best-performing model on the validation cohort was K-memBERT-T2, with a validation Pearson correlation of 0.686 (p-value < 10⁻⁵) between true and predicted survivals, a 1-year survival prediction AUC of 0.832, an F1-score of 0.760, and a balanced accuracy of 0.754 (Fig. 2D,E and Supplementary Table 2). Both K-memBERT-conflation and K-memBERT-T2 scored significantly higher than K-memBERT-base, supporting the benefit of taking the medical history into account to improve predictions.

K-memBERT-T2 on the test cohort

On the unseen GR test cohort, K-memBERT-T2 achieved a Pearson correlation of 0.655 (p-value < 10⁻⁵) (Fig. 3A). For binary predictions at one year, the model showed AUC = 0.817, F1-score = 0.741, balanced accuracy = 0.737 and recall (equivalent to sensitivity) = 0.763 (Fig. 3B and Supplementary Fig. 2). We observed that the performance of K-memBERT-T2 on the GR test cohort varied with the prognosis (calibration analysis). For patients with a short life expectancy, predictions were relatively optimistic, i.e. over-valued: at 3 months, AUC = 0.852 and F1-score = 0.917 for survival but 0.467 for predicting death (Fig. 3C). Conversely, for patients with a prolonged life expectancy, predictions were relatively pessimistic, i.e. under-valued: at mean follow-up (701 days), AUC = 0.827 and F1-score = 0.548 for survival but 0.851 for predicting death (Fig. 3C, Supplementary Fig. 2). Interestingly, we found a positive correlation between the level of error and the predicted standard deviation (Pearson = 0.37, Supplementary Fig. 2F), supporting that accurate predictions were assigned high confidence by the model and, conversely, that mediocre predictions were tempered by low predicted confidence.

External validation on an independent cohort

On the external test cohort (CLB), the Pearson correlation between the predicted survival value and the ground truth was 0.621 (p-value < 10⁻⁵; mean correlation across the cohort splits = 0.615, standard deviation = 0.012) and the Spearman correlation was 0.634 (p-value < 10⁻⁵; mean correlation across the cohort splits = 0.635, standard deviation = 0.01) (Fig. 3D).
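The mean and standard deviation "across the cohort splits" reported above come from evaluating the same metric on several random sub-cohorts. A minimal sketch of that procedure for the Pearson correlation is given below. It is a simplification under stated assumptions: each (prediction, truth) pair is treated as one sampling unit, whereas the study sampled patients, and the function names and seed are hypothetical.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def subcohort_correlations(
    pairs: Sequence[Tuple[float, float]],
    n_cohorts: int = 10,
    cohort_size: int = 1000,
    seed: int = 0,  # hypothetical seed for reproducibility
) -> List[float]:
    """Draw disjoint random sub-cohorts and return one correlation per cohort."""
    needed = n_cohorts * cohort_size
    if needed > len(pairs):
        raise ValueError("not enough samples for disjoint sub-cohorts")
    sampled = random.Random(seed).sample(list(pairs), needed)
    correlations = []
    for i in range(n_cohorts):
        chunk = sampled[i * cohort_size:(i + 1) * cohort_size]
        predictions = [p for p, _ in chunk]
        truths = [t for _, t in chunk]
        correlations.append(pearson(predictions, truths))
    return correlations
```

Calling `statistics.fmean` and `statistics.stdev` on the returned list gives the kind of mean and standard deviation quoted across splits.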
Binary predictions at three months were excellent, with an AUC of 0.875 and a false negative rate of only 1.1% (patients predicted dead but who remained alive at three months) (Supplementary Fig. 3A). Predictions at one year showed an AUC of 0.806 and a false negative rate of 8.1% (patients predicted dead but who remained alive at one year) (Fig. 3E,F). Again, we observed that the algorithm was slightly overoptimistic at all time points (3 months, 1 year, mean follow-up) and became pessimistic on the very latest events, i.e. on patients with the most prolonged life expectancy (Supplementary Fig. 3D), although predictions at 701 days reached F1-score = 0.761 and AUC = 0.789 (Supplementary Tables 3 and 4). K-memBERT-T2 performance was stable for patients between 35 and 85 years old (Supplementary Fig. 3E) and across cancer types, although with worse performance for hematological and breast cancers (Supplementary Fig. 3F), and without significant difference between sexes (not shown). We also added 7,082 living patients from the CLB database (survival censored at the time of data extraction), which was possible because they did not object to the use of their data for this project. The concordance index of the whole CLB test cohort was 0.766.

Comparison with the Performance Status (PS)

We chose to compare the predictions of K-memBERT-T2 with the PS. PS is a widely used prognosis score in oncology that defines five grades from 0 to 4 (5 being death), corresponding to increasing levels of fatigue and highly correlated with overall survival. We extracted the PS scores contained in 46,123 documents from the CLB cohort. We first confirmed that PS was significantly related to survival (Kruskal-Wallis p-value < 10⁻⁵, multi-class overall accuracy = 0.197, Cohen's kappa score = 0.032). Then, we translated K-memBERT-T2 predictions into five prognosis intervals: [0–0.2], [0.2–0.4], [0.4–0.6], [0.6–0.8] and [0.8–1].
We showed that the model prognosis groups were more accurate at predicting overall survival than PS (Kruskal-Wallis p-value < 10⁻⁵, multi-class overall accuracy = 0.336, Cohen's kappa score = 0.170), in all groups (Fig. 3G, Supplementary Fig. 4). Importantly, the model predictions were slightly influenced by the presence of the term 'PS = 0' in the text documents, while the other 'PS' values and/or 'Karnofsky scores' ('KS', equivalent to PS and frequently used in CLB documents) did not influence the predictions (Supplementary Fig. 4D).

Interpretation of the model predictions

Attention to the medical vocabulary increased in the last layers of K-memBERT, suggesting that adding medical vocabulary benefited the model and provided fine-grained representations (Supplementary Fig. 4A). We analyzed the magnitude and the direction of the importance of each input token for the predictions (Fig. 4, Supplementary Fig. 5, Supplementary methods). In some situations where the prediction was wrong, a human judgment based on the same medical record would have been in line with the algorithm's prediction (Supplementary Fig. 5E and Supplementary Results).

DISCUSSION

We have improved, trained, evaluated and interpreted the performance of a deep learning model, K-memBERT, that accurately estimates the overall survival of patients with cancer of any type (any primary and any stage, for patients from two Comprehensive Cancer Centers) from written electronic medical reports. The specific features of K-memBERT include an improved medical vocabulary, the ability to estimate confidence in its predictions, and the ability to take several documents as input at a time to leverage information from the patient's historical clinical notes. We confirmed the high generalizability of K-memBERT performance (meaning that it predicted accurately on data from a second, independent center), despite the differences in true survival time distributions between the cohorts.
Importantly, K-memBERT is a non-invasive method that can be repeated without limit over time (at each consultation) and does not require extra work from the medical team, as it uses routinely generated free-text medical records. We showed that inputting sequential documents from a patient's history improved the performance of the model, compensating for the difficulty of long-range predictions. This strategy is also supported by two other studies using structured data types for various clinical situations (Supplementary Context) [13, 14]. This intuition came from clinical routine, where physicians usually review a large number of documents to picture the patient's past and recent medical history.

Human performance for prognosis estimation at 2 years is AUC = 0.72–0.81 in patients with advanced cancer [15]. Across various studies, only 20–29% of physicians' prognosis estimates were accurate ("accurate" meaning that the predicted survival duration was between 0.67 and 1.33 times the actual survival) [16–20]. This is close to K-memBERT-T2 on the test sets, with 3-month AUC = 0.852–0.875, 1-year AUC = 0.806–0.817, and Pearson correlation 0.621–0.655.

Our study was intentionally dedicated to patients affected by any type of cancer at any stage, in order to design a comprehensive tool usable across all cancer specialties. This is original compared to similar NLP models fed with EHRs, where selection of specific patient subpopulations is often the rule [21–24]. Overall, medical consultation data have barely been used for cancer prognosis estimation, despite the extensive exploration of other types of input data (mostly biological and molecular data in recent years, see Supplementary Context).

We acknowledge a potential selection bias in training the models on deceased patients only: the dataset may have been enriched with aggressive and incurable cancer cases.
Accordingly, the mean survival was shorter in the GR cohort than in the external validation cohort, possibly due to differences in selection criteria: patients in the external cohort were all included in a clinical trial, suggesting that their prognosis at inclusion was good enough for them to be enrolled. Nevertheless, we showed that this did not affect the generalization of the predictions, and we confirmed it on a set of censored follow-ups (i.e. alive patients) in the external test cohort. Another limitation is that the model has only seen patient data from specialized Cancer Centers, and we could not evaluate the generalization performance in general hospitals within this study. Given these potential biases and the absence of a proper prospective evaluation in this study, we do not recommend using K-memBERT for clinical decision-making but instead for translational research purposes, such as internal comparison with physicians' prognostication.

In conclusion, we propose a methodology that accurately predicts overall survival in patients with cancer of any type, at any stage. The model combines non-invasiveness of the procedure, ease of use, high accuracy of the predictions, high generalizability of results on French-language EHRs, immediate output and human control over the analysis. This reliable approach may enable the democratization of more precise prognostication in treatment decision guidance, to continuously ensure the best care of patients with cancer.

Declarations

DECLARATION OF INTERESTS

LV reports personal fees from Adaptherapy, is CEO of RESOLVED, has received non-personal fees from Pierre-Fabre and Servier, and a grant from Bristol-Myers Squibb, all outside the submitted work.
Author Contribution

Clément Piat (C.P): Data curation, Formal Analysis, Methodology, Writing – original draft, Writing – review & editing
Quentin Blampey (Q.B): Data curation, Formal Analysis, Methodology, Writing – original draft, Writing – review & editing
Alexandre Joutard (A.J): Data curation, Formal Analysis, Methodology, Writing – original draft, Writing – review & editing
Mohamed Aymen Qabel (M.A.Q): Data curation, Formal Analysis, Methodology, Writing – original draft, Writing – review & editing
Théo Di Piazza (T.D.P): Data curation, Formal Analysis, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing
Ugo Benassayag (U.B): Data curation, Software
Raphael Vienne (R.V): Data curation, Formal Analysis, Validation
Raphael Reme (R.R): Software, Visualization
Daphné Morel (D.M): Writing – original draft, Writing – review & editing
Maxime Choffe (M.C): Data curation
Eric Deutsch (E.D): Writing – original draft, Writing – review & editing
Jean-Yves Blay (JY.B): Writing – original draft, Writing – review & editing
Loic Verlingue (L.V): Conceptualization, Supervision, Project administration, Writing – original draft, Writing – review & editing

All authors read and approved the final version of the manuscript.

The underlying data have been verified by Clément Piat (C.P), Quentin Blampey (Q.B), Alexandre Joutard (A.J), Mohamed Aymen Qabel (M.A.Q), Théo Di Piazza (T.D.P), Maxime Choffe (M.C) and Loic Verlingue (L.V).

Acknowledgement

The authors would like to acknowledge all the patients included in this study; the colleagues at DITEP, DTNSI and U1030 at Gustave Roussy, at CRCL and DSI at CLB, at CentraleSupelec Paris and at Collective Thinking; and the medical and paramedical staff who took care of the patients and generated the data used in this study.

Data Availability

The data used in this study are free-text medical documents modified by a pseudo-anonymization procedure.
Nevertheless, the granularity of the data precludes open access without the risk of sharing Protected Health Information. We have released the code, including the code for a secured local installation, at https://github.com/DITEP/KmemBERT , and a demonstration of an online application of K-memBERT at https://fd21dde0-4a7b-44bc-b011-b5de469557ea.pub.instances.scw.cloud/BERTPrediction

References

1. Lu, Z. et al. Early Interdisciplinary Supportive Care in Patients With Previously Untreated Metastatic Esophagogastric Cancer: A Phase III Randomized Controlled Trial. Journal of Clinical Oncology 39, (2021).
2. Bandieri, E. et al. Early versus delayed palliative/supportive care in advanced cancer: an observational study. BMJ Support Palliat Care 10, e32 (2020).
3. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25, 44–56 (2019).
4. Rajkomar, A., Dean, J. & Kohane, I. Machine Learning in Medicine. N Engl J Med 380, 1347–1358 (2019).
5. Elfiky, A. A., Pany, M. J., Parikh, R. B. & Obermeyer, Z. Development and Application of a Machine Learning Approach to Assess Short-term Mortality Risk Among Patients With Cancer Starting Chemotherapy. JAMA Netw Open 1, e180926 (2018).
6. Wulczyn, E. et al. Interpretable survival prediction for colorectal cancer using deep learning. NPJ Digit Med 4, 71 (2021).
7. Skrede, O.-J. et al. Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. Lancet 395, 350–360 (2020).
8. Kudo, T. & Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds. Blanco, E. & Lu, W.) 66–71 (Association for Computational Linguistics, Brussels, Belgium, 2018). doi:10.18653/v1/D18-2012.
9. Martin, L. et al. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 7203–7219 (2020). doi:10.18653/v1/2020.acl-main.645. http://arxiv.org/abs/1911.03894
10. Oken, M. M. et al. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol 5, 649–655 (1982).
11. Corbaux, P. et al. Patients' selection and trial matching in early-phase oncology clinical trials. Critical Reviews in Oncology/Hematology 196, 104307 (2024).
12. Grandini, M., Bagli, E. & Visani, G. Metrics for Multi-Class Classification: an Overview. Preprint at https://doi.org/10.48550/arXiv.2008.05756 (2020).
13. Pang, C. et al. CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. In Proceedings of Machine Learning for Health 239–260 (PMLR, 2021).
14. Li, Y. et al. Hi-BEHRT: Hierarchical Transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J Biomed Health Inform PP, (2022).
15. Malhotra, K. et al. Prognostic accuracy of patients, caregivers, and oncologists in advanced cancer. Cancer 125, 2684–2692 (2019).
16. Christakis, N. A. & Lamont, E. B. Extent and determinants of error in doctors' prognoses in terminally ill patients: prospective cohort study. BMJ 320, 469–472 (2000).
17. Glare, P. et al. A systematic review of physicians' survival predictions in terminally ill cancer patients. BMJ 327, 195–198 (2003).
18. Amano, K. et al. The Accuracy of Physicians' Clinical Predictions of Survival in Patients With Advanced Cancer. J Pain Symptom Manage 50, 139–146.e1 (2015).
19. Smith-Uffen, M. E. S. et al. Estimating survival in advanced cancer: a comparison of estimates made by oncologists and patients. Support Care Cancer 28, 3399–3407 (2020).
20. Kiely, B. E. et al. The median informs the message: accuracy of individualized scenarios for survival time based on oncologists' estimates. J Clin Oncol 31, 3565–3571 (2013).
21. Kalyan, K. S., Rajasekharan, A. & Sangeetha, S. AMMU: A survey of transformer-based biomedical pretrained language models. Journal of Biomedical Informatics 126, 103982 (2022).
22. Yang, S., Varghese, P., Stephenson, E., Tu, K. & Gronsbell, J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 30, 367–381 (2023).
23. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 1–10 (2024). doi:10.1038/s41591-024-03097-1.
24. Ferber, D. et al. Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. Preprint at https://doi.org/10.48550/arXiv.2404.04667 (2024).

Additional Declarations

No competing interests reported.

Supplementary Files

FiguresKmemBERTsupplementarynpj.pptx
KmemBERTSupplementaryfigureslegendsnpj.docx
KmemBERTSupplementarytablesnpj.docx
KmemBERTSupplementarymethodsnpj.docx
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7121466","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":502005436,"identity":"5f988976-fe8b-4416-8969-03d70fa752c5","order_by":0,"name":"Clément Piat","email":"","orcid":"","institution":"CentraleSupelec","correspondingAuthor":false,"prefix":"","firstName":"Clément","middleName":"","lastName":"Piat","suffix":""},{"id":502005437,"identity":"34d1cbf8-d25a-431c-8d3c-9aa7f9e40ca0","order_by":1,"name":"Quentin Blampey","email":"","orcid":"","institution":"CentraleSupelec","correspondingAuthor":false,"prefix":"","firstName":"Quentin","middleName":"","lastName":"Blampey","suffix":""},{"id":502005438,"identity":"08fdfa4d-7569-4677-b802-4c8055fd563c","order_by":2,"name":"Alexandre Joutard","email":"","orcid":"","institution":"CentraleSupelec","correspondingAuthor":false,"prefix":"","firstName":"Alexandre","middleName":"","lastName":"Joutard","suffix":""},{"id":502005439,"identity":"24aa6cc3-abcc-4639-a562-526cfe839271","order_by":3,"name":"Mohamed Aymen Qabel","email":"","orcid":"","institution":"CentraleSupelec","correspondingAuthor":false,"prefix":"","firstName":"Mohamed","middleName":"Aymen","lastName":"Qabel","suffix":""},{"id":502005440,"identity":"03474c04-c311-4ef8-9f86-1153a71ea3d2","order_by":4,"name":"Théo Di Piazza","email":"","orcid":"","institution":"Insa Rennes","correspondingAuthor":false,"prefix":"","firstName":"Théo","middleName":"Di","lastName":"Piazza","suffix":""},{"id":502005441,"identity":"a7e26dcf-578a-485e-b81a-1b4c5c8ba6ec","order_by":5,"name":"Ugo 
Benassayag","email":"","orcid":"","institution":"DITEP, Gustave Roussy Cancer Center","correspondingAuthor":false,"prefix":"","firstName":"Ugo","middleName":"","lastName":"Benassayag","suffix":""},{"id":502005442,"identity":"bdd4638b-fb70-40e4-be5f-005e2b5f6b3a","order_by":6,"name":"Raphael Vienne","email":"","orcid":"","institution":"Centre Léon Bérard","correspondingAuthor":false,"prefix":"","firstName":"Raphael","middleName":"","lastName":"Vienne","suffix":""},{"id":502005443,"identity":"e9515df2-d5ef-480b-a2f6-efae234b62ee","order_by":7,"name":"Raphael Reme","email":"","orcid":"","institution":"Telecom ParisTech","correspondingAuthor":false,"prefix":"","firstName":"Raphael","middleName":"","lastName":"Reme","suffix":""},{"id":502005444,"identity":"b1d6f65e-e56a-4df4-bbe8-7ab1767db717","order_by":8,"name":"Daphné Morel","email":"","orcid":"","institution":"Université Paris-Saclay INSERM Molecular radiotherapy U1030, Gustave Roussy Cancer Center","correspondingAuthor":false,"prefix":"","firstName":"Daphné","middleName":"","lastName":"Morel","suffix":""},{"id":502005445,"identity":"fe84d557-4d06-4b68-b74c-57d21d4e7329","order_by":9,"name":"Maxime Choffe","email":"","orcid":"","institution":"DITEP, Gustave Roussy Cancer Center","correspondingAuthor":false,"prefix":"","firstName":"Maxime","middleName":"","lastName":"Choffe","suffix":""},{"id":502005446,"identity":"dd30a42d-0288-4d4c-9252-acbcfd6ad599","order_by":10,"name":"Eric Deutsch","email":"","orcid":"","institution":"Université Paris-Saclay INSERM Molecular radiotherapy U1030, Gustave Roussy Cancer Center","correspondingAuthor":false,"prefix":"","firstName":"Eric","middleName":"","lastName":"Deutsch","suffix":""},{"id":502005447,"identity":"9379bb81-88cd-442c-a20a-bb7cd7a861fe","order_by":11,"name":"Jean-Yves Blay","email":"","orcid":"","institution":"Centre Léon 
Bérard","correspondingAuthor":false,"prefix":"","firstName":"Jean-Yves","middleName":"","lastName":"Blay","suffix":""},{"id":502005448,"identity":"5e629bb9-ca76-49ae-8a29-908cae735830","order_by":12,"name":"Loic Verlingue","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABLklEQVRIie3PwUqEQBjA8U8G9DLqdYTQHmFkIFpa9lkMwZOwVw8RE8J0qXtQPUS3ji7C7mXYrkIdWvZYgR2DqFZdk1Q6B/k/jB/M/JgRYGjoD2aAwqsBlZ/xznaD1CfsNlFrolYkwG3CuuTnkOL2iS7R4pNcEeOpqunpE47uMF2cJ0i/3XfMy3i1joBO2wTPYqKIYCSQERxgeY+pXHpIl8S9eJgzVwId8RYhhwJAplRFeI/pYkOykCJdEA+yULU4vNP2LRX53JKPZUOcLNTeONB+EiUFYWudJw3ZDKrSR4p/8SK/vEW5mvvYkpLOrgVxb7KAWZx2iKmdrvKcTqhpSpY/H01sY3HmPr6IY8fO/NUrjzqkzKsfWay7CUDS7PWC71BerA7/9dDQ0NDQP+wLAJtfUI8xr4MAAAAASUVORK5CYII=","orcid":"","institution":"Centre Léon Bérard","correspondingAuthor":true,"prefix":"","firstName":"Loic","middleName":"","lastName":"Verlingue","suffix":""}],"badges":[],"createdAt":"2025-07-14 13:08:12","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7121466/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7121466/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":89981676,"identity":"765904d6-bf09-484e-92bb-ceaf0b8bebe8","added_by":"auto","created_at":"2025-08-27 06:26:10","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":145612,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDescription of datasets. a\u003c/strong\u003e. Flowchart describing the data split to build the training, validation and test cohorts. Test cohorts were only used to independently test the performances of pre-designed models when applying to previously-unseen data. \u003cstrong\u003eb\u003c/strong\u003e. Bar chart detailing median survival durations with first and third quartiles within each dataset, suggesting a well-balanced data split. 
Dotted lines indicate the mean survival of the GR cohort (green) and the CLB cohort (orange).\u003c/p\u003e","description":"","filename":"Slide1.png","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/90d7f39e72e2b494c060f37b.png"},{"id":89981675,"identity":"0d207f9f-3eb0-4071-bb4b-10f54a5fe59f","added_by":"auto","created_at":"2025-08-27 06:26:10","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":97399,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDeep learning models design and selection. a-c\u003c/strong\u003e. Overview of learning workflows for benchmarked models considering either a single document (K-memBERT-base) or a series of documents (medical history, K-memBERT-conflation and K-memBERT-T2). \u003cstrong\u003ed\u003c/strong\u003e. Details of benchmarked models’ performances on the validation dataset from hospital #1. The best-performing model was selected based on the highest Pearson correlation score. \u003cstrong\u003ee\u003c/strong\u003e. Point cloud graph of the best-performing model (K-memBERT-T2) on the validation set, depicting the [0;1] individual predicted survival (y-axis) according to the true survival (x-axis).\u003c/p\u003e","description":"","filename":"Slide2.png","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/02af30893979817f30e28d26.png"},{"id":89981678,"identity":"97c50a62-2c1a-47d4-90fd-b66c50cf0961","added_by":"auto","created_at":"2025-08-27 06:26:10","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":120379,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eIndependent testing of the best-performing model. a\u003c/strong\u003e. Point cloud graph depicting the [0;1] individual predicted survival (y-axis) on the first test set, according to the true survival (x-axis).\u003cstrong\u003e b\u003c/strong\u003e. 
Confusion matrix of binary predictions at one year on the test set of the GR cohort versus the ground truth.\u003cstrong\u003e c\u003c/strong\u003e. Kaplan-Meier representation of predicted survival and true survival (test label) of the model on test sets. \u003cstrong\u003ed\u003c/strong\u003e. Point cloud graph depicting the [0;1] individual predicted survival (y-axis) on the CLB test set, according to the true survival (x-axis).\u003cstrong\u003e e\u003c/strong\u003e. Confusion matrix of binary predictions at one year on the CLB test set versus the ground truth. \u003cstrong\u003ef\u003c/strong\u003e. Area Under the ROC Curve (AUC) of binary predictions at one year on the CLB test set. \u003cstrong\u003eg\u003c/strong\u003e. Box plot of the true survival of patients according to their Performance Status (PS) class at the time of estimation (purple) or to their predicted group using the model (green).\u003c/p\u003e","description":"","filename":"Slide3.png","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/141f7a8434f86ccfbc99348c.png"},{"id":89983687,"identity":"6ec6df67-4080-4181-afb8-3d2cff73ca4f","added_by":"auto","created_at":"2025-08-27 06:34:10","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":145607,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eConcrete examples of results provided by the model.\u003c/strong\u003e The application outputs a continuous prediction of survival that correlates with prognosis, together with visualization modules that explain which words most affected the prediction, to what degree and in which direction (towards a poor or a good prognosis), to allow human review of the output. 
The ‘Text extract’ shows an extract of the text, and the ‘Top 10 of the most influent words for prediction’ shows the importance values (mean and standard deviation for multiple occurrences) of the words across all the texts token for a single prediction (i.e. 4 sequential most recent historical medical reports for each prediction with K-memBERT-T2). We advise a local installation with the code provided for a secured utilization of the application with protected health information.\u003c/p\u003e","description":"","filename":"Slide4.png","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/e5859fec77e1bd6ea01626d6.png"},{"id":94131349,"identity":"a3b7ce14-3eda-4efe-b77d-bf3a62481fa3","added_by":"auto","created_at":"2025-10-22 17:43:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1201790,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/5b29f1ee-847a-4683-afe6-cd9d2942949f.pdf"},{"id":89981679,"identity":"f3ab994b-c9ee-4bff-8346-233196674570","added_by":"auto","created_at":"2025-08-27 06:26:10","extension":"pptx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":3681889,"visible":true,"origin":"","legend":"","description":"","filename":"FiguresKmemBERTsupplementarynpj.pptx","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/3c9c8e738594c32c461fb095.pptx"},{"id":89981682,"identity":"f1fe92f8-07a0-4302-b7f5-2ee172fffdcd","added_by":"auto","created_at":"2025-08-27 
06:26:10","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":16827,"visible":true,"origin":"","legend":"","description":"","filename":"KmemBERTSupplementaryfigureslegendsnpj.docx","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/d42413ee71824d24414b1198.docx"},{"id":89981687,"identity":"4f3c2d7a-384d-4948-bc81-6e7cc07ba043","added_by":"auto","created_at":"2025-08-27 06:26:11","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":20303,"visible":true,"origin":"","legend":"","description":"","filename":"KmemBERTSupplementarytablesnpj.docx","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/82bfbaf9d58eb325efe9f125.docx"},{"id":89981692,"identity":"3e3921e9-3ede-487b-8161-d3549e3fcd3e","added_by":"auto","created_at":"2025-08-27 06:26:11","extension":"docx","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":89699,"visible":true,"origin":"","legend":"","description":"","filename":"KmemBERTSupplementarymethodsnpj.docx","url":"https://assets-eu.researchsquare.com/files/rs-7121466/v1/ff3d2615c045576b0eac0c73.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"An explainable language model predicts survival from medical reports in oncology","fulltext":[{"header":"TRANSLATIONAL RELEVANCE","content":"\u003cp\u003eLanguage models can predict pan-cancer overall survival from routinely-drafted electronic health records in oncology, with confidence level estimation and interpretation of the most important words. The application can also use recent medical history if available (a series of dated medical reports), which was shown to improve the reliability of the predictions. At this stage, without a dedicated prospective confirmation study, K-memBERT should not be used for clinical decision support. 
\u003c/p\u003e"},{"header":"INTRODUCTION","content":"\u003cp\u003eImperfect prognostication of patients with cancer entails multiple negative consequences, starting with the misevaluation of the goals of treatment. When reaching the end of life, quality of life must be considered over quantity, especially as, in some cases, early introduction of supportive care improves both overall survival and quality of life \u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Refining, facilitating and automating prognosis estimation could therefore benefit patients at multiple levels through their caregivers.\u003c/p\u003e\u003cp\u003eMeanwhile, hospitals accumulate colossal amounts of health data in electronic format that can be exploited by machines to help physicians in their daily work. An increasing number of machine- and deep-learning algorithms shows that artificial intelligence (AI)-based predictive tools can successfully predict important clinical outcomes \u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e, including survival in cancer patients \u003csup\u003e\u003cspan additionalcitationids=\"CR6\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e–\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e, by processing either raw medical images (CT-scans, electrocardiograms, histopathology slides etc.) or structured data such as clinical descriptors, molecular profiles or lab results. 
Medical reports in narrative format have three advantages: they (i) gather relevant personal and clinical information about the patient and the disease in an almost exhaustive manner and in a single format, (ii) are quickly and easily accessible to any oncologist or healthcare professional involved in the treatment decision, and (iii) require only existing resources such as the hospital informatics environment.\u003c/p\u003e\u003cp\u003eThe objectives of this study are to show that language models can predict patients’ prognosis from narrative medical data, and to develop a tool that is easy to implement in practice.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cp\u003e\u003cb\u003ePatients\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe retrospectively retrieved EHRs from patients affected by malignant cancers fulfilling the following criteria: adult (≥ 18 years old) patient treated at Gustave Roussy Cancer Center (GR), France, between July 1987 and December 2020, with at least one non-blank (\u0026gt; 250 characters) electronic text document, deceased at the time of collection and for whom the date of death was recorded in the medical database. The types of reports selected were consultation reports, radiological reports, clinical notes and hospitalization reports. The independent test cohort comprised EHRs from adult patients included in a prospective study of any kind at the Centre Léon Bérard (CLB, another Comprehensive Cancer Center), France, with at least one non-blank medical report recorded between January 2000 and December 2020. All living CLB patients included in the analysis were explicitly asked during their lifetime about the use of their personal data for research purposes, and only non-opposing patients were analyzed.\u003c/p\u003e\u003cp\u003e\u003cb\u003eDataset\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe filtered out documents recorded after the date of death and documents with fewer than 250 characters. 
The remaining texts were preprocessed (e.g. letters were changed to lowercase) and tokenized using SentencePiece \u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e, which splits words into subword units such as prefixes and suffixes (e.g. \"logy\") and thus treats \"oncology\" as \"onco + logy\". Inputs were limited to 512 tokens: for texts longer than 512 tokens, we took the first 256 tokens and the last 256 tokens, and added the token “...” between the two parts. We performed a random split at the patient level (to avoid data leakage) on the GR dataset to compose the training, validation and test sets, with 88%, 2% and 10% of patients in each group, respectively. From the CLB cohort (used as an external test cohort only), we sub-sampled 10 datasets of 1,000 randomly selected patients each, in order to compute distributions of the tested performance metrics.\u003c/p\u003e\u003cp\u003e\u003cb\u003eBERT-based models and control\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe existing CamemBERT underwent masked language modeling pre-training on a non-medical text corpus (vocabulary size of 32,005 tokens) \u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. We did not perform additional language modeling but instead hypothesized that the tokenizer could miss specific medical terms that are important for the task of prognosis prediction. We thus expanded the vocabulary by adding the 500 most frequent tokens found in the GR training cohort that were not in the original vocabulary of CamemBERT. The related embeddings were randomly initialized and further trained. 
We then benchmarked three alternatives: (1) \u003cem\u003eK-memBERT-base\u003c/em\u003e, which takes one single document as input, (2) \u003cem\u003eK-memBERT-conflation\u003c/em\u003e, which aggregates the prediction distributions of multiple K-memBERT-base runs, and (3) \u003cem\u003eK-memBERT-T2\u003c/em\u003e, which stacks two transformers to input a sequence of medical reports. The multiple reports consisted of a sequence of historical reports selected from the last time point defined to compute survival. For each model type (K-memBERT-base, K-memBERT-conflation and K-memBERT-T2), we trained the parameters on the same training set, used the validation set to tune the hyperparameters, and independently evaluated the performance of each model using the GR test set. The best-performing model was selected based on the highest Pearson correlation obtained between prediction and ground truth (see Metrics).\u003c/p\u003e\u003cp\u003e\u003cb\u003ePredictions: objectives and labels\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe designed a model that predicts a probability distribution of survival time and estimates the confidence in each prediction. We used Gaussian distributions to represent the model predictions, defined by the mean value and the associated standard deviation, and translated the labels into a standard uniform distribution [0–1] with the formula 1 – exp(– individual survival / mean survival of the training set). For each prediction, the mean value is the predicted survival duration and the standard deviation represents the confidence in the prediction. The loss is a generalization of the Mean Squared Error that includes the standard deviation. 
Additional information is provided in \u003cem\u003eSupplementary methods\u003c/em\u003e.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMetrics and calibration\u003c/b\u003e\u003c/p\u003e\u003cp\u003eContinuous survival predictions were evaluated by Pearson and Spearman correlations between the predicted and the true survivals. 
We used the concordance index for survival analysis in the CLB cohort, which included censored survival times. In addition, we set time thresholds and evaluated standard performance metrics (balanced accuracy, AUC, recall, which is equivalent to sensitivity, and F1-scores) at 3 months, 1 year and at the mean survival duration (based on the GR cohort). For recall and F1-scores we took the class “survival” as the default class, unless otherwise mentioned. We performed a calibration analysis of the models’ predictions against various potential sources of bias such as time of follow-up, tumor type, age, sex, prognostic scores, and the observed prognosis of the patients.\u003c/p\u003e\u003cp\u003e\u003cb\u003eComparison to performance status\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe used the Performance Status (PS) as a gold-standard measure for easily estimating prognosis; it is widely used, for example, in clinical trial selection criteria \u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. We extracted the documents containing the mention of ‘PS=[number]’ in the CLB test cohort. We performed three types of analysis: we 1) compared the relation of the PS classes with the true survival and with the predicted survival, 2) evaluated whether removing the ‘PS=[number]’ or ‘Karnofsky=[percentage]’ mentions would change the predicted survival, and 3) evaluated how five predicted classes from K-memBERT predictions (intervals [0–0.2]; [0.2–0.4]; [0.4–0.6]; [0.6–0.8] and [0.8–1]) related to true survival. The ‘PS=[number]’ and ‘Karnofsky=[percentage]’ mentions are semi-structured information at the beginning of consultation reports in the CLB dataset, and we could thus remove them using regular expressions. 
We computed the confusion matrices and several multi-class classification metrics such as Precision (equivalent to positive predictive value, where death is the event), Recall, Overall Accuracy, Balanced Accuracy and F1-Score, for both reference classes when relevant, and Cohen's Kappa score \u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003e\u003cb\u003eEthics\u003c/b\u003e\u003c/p\u003e\u003cp\u003e The study complies with the European GDPR regulation 2016/679, French law and the Good Clinical Practice Guidelines of the International Conference on Harmonization, and was approved by internal ethics and scientific commissions (notification 2021-66). We registered the project under the MR004 declaration (V3.2 23/08/2021) with the French Health Data Hub and Unicancer at both sites, GR and CLB. We confirmed that all patients gave their consent or were not opposed to the use of their data. The data collected could only be used for the aim of this study, and patient identification was secured.\u003c/p\u003e\u003cp\u003e\u003cb\u003eCode, libraries and resources\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe used Python 3.7, PyTorch 1.7, Transformers 4.1, Scikit-learn and the transformer-interpret libraries. For data security, we trained the model on a single on-site GeForce GTX 1080 Ti GPU with 11 GB of memory. 
We have released the source code and the code for a local installation of the application at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/DITEP/KmemBERT\u003c/span\u003e\u003cspan address=\"https://github.com/DITEP/KmemBERT\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e, and a demonstration of a web-hosted application of K-memBERT at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://fd21dde0-4a7b-44bc-b011-b5de469557ea.pub.instances.scw.cloud/BERTPrediction\u003c/span\u003e\u003cspan address=\"https://fd21dde0-4a7b-44bc-b011-b5de469557ea.pub.instances.scw.cloud/BERTPrediction\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cb\u003eRole of Funders\u003c/b\u003e\u003c/p\u003e\u003cp\u003eHuman resources were funded by GR and CLB, which are non-profit Comprehensive Cancer Centers. The funders had no specific influence on study design, data collection, data analyses, interpretation, or writing of the report.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003e\u003cb\u003eData\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe GR dataset preparation led to a clean and workable text corpus of 2.3\u0026nbsp;million documents from 36,123 patients. About half (46.5%) of the documents were consultation reports, 15% were radiological reports, 12% were clinical notes and 11% were hospitalization reports (Supplementary Table\u0026nbsp;1 and Supplementary Results). 
The random split isolated 31,789 patients for training (2,053k documents), 722 patients for validation (47k documents) and 3,612 patients for testing (233k documents) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA).\u003c/p\u003e\u003cp\u003eThe CLB external independent testing cohort consisted of 17,633 patients, from which we randomly sub-sampled 10,000 patients into 10 cohorts of 1,000 patients each, for a total of 243k documents. On average, the format of this corpus was similar to that of the GR cohort (Supplementary Fig.\u0026nbsp;1E-H): the mean number of documents per patient was 46, and 97% of patients had more than four eligible documents. The patients in the CLB cohorts had a mean age of 56 years (standard deviation of 16) and 58% were female; 22% had breast cancers, 13% digestive tract cancers, 11% hematological cancers, 8% mesothelial and soft tissue cancers, and 10 other types of cancer had frequencies between 1% and 7%. In all cohorts, survival dropped exponentially over time, with a mean survival time of 701 days for the GR cohort and 837 days for the CLB cohort (rate of variation\u0026thinsp;=\u0026thinsp;19.3%) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eModels\u0026rsquo; description, training and selection on the validation cohort\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe developed K-memBERT \u0026ndash; a suite of deep learning models adapted for survival prediction \u0026ndash; from the French Transformer CamemBERT. 
We designed three types of K-memBERT models to explore which approach provides the most accurate predictions (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA-C): one that takes a single document as input (K-memBERT-base), and two that can take a sequence of several historical medical reports from the same patient as input, either by conflation (K-memBERT-conflation) or with an additional Transformer layer on top of K-memBERT-base (K-memBERT-T2). In every configuration we used the date of the last report input to the model to compute each patient\u0026rsquo;s survival. Fine-tuning K-memBERT-base on the GR training cohort took 40h on a single local and secured GPU.\u003c/p\u003e\u003cp\u003eThe best-performing model on the validation cohort was K-memBERT-T2, with a validation Pearson correlation of 0.686 (p-value\u0026thinsp;\u0026lt;\u0026thinsp;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e) between true and predicted survivals, reaching a 1-year survival prediction AUC of 0.832, an F1-score of 0.760, and a balanced accuracy of 0.754 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eD,E and Supplementary Table\u0026nbsp;2). Both K-memBERT-conflation and K-memBERT-T2 showed significantly higher scores than K-memBERT-base, supporting the benefit of taking the medical history into account to improve predictions.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eK-memBERT-T2 on the test cohort\u003c/b\u003e\u003c/p\u003e\u003cp\u003eOn the unseen GR test cohort, K-memBERT-T2 achieved a Pearson correlation of 0.655 (p-value\u0026thinsp;\u0026lt;\u0026thinsp;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). 
For binary predictions at one year, the model showed: AUC\u0026thinsp;=\u0026thinsp;0.817, F1 score\u0026thinsp;=\u0026thinsp;0.741, balanced accuracy\u0026thinsp;=\u0026thinsp;0.737 and recall (equivalent to sensitivity)\u0026thinsp;=\u0026thinsp;0.763 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB and Supplementary Fig.\u0026nbsp;2). We observed that the performance of K-memBERT-T2 on the GR test cohort varied with the prognosis (calibration analysis): for patients with a short life expectancy, predictions were relatively optimistic (i.e. predictions were over-valued for patients with short survival: at 3 months, AUC\u0026thinsp;=\u0026thinsp;0.852 and F1-score\u0026thinsp;=\u0026thinsp;0.917 for survival but 0.467 for predicting death) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). Conversely, for patients with a prolonged life expectancy, predictions were relatively pessimistic (i.e. predictions were under-valued for patients with long survival: at mean follow-up [701 days], AUC\u0026thinsp;=\u0026thinsp;0.827 and F1-score\u0026thinsp;=\u0026thinsp;0.548 for survival but 0.851 for predicting death) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC, Supplementary Fig.\u0026nbsp;2). 
Interestingly, we found a positive correlation between the level of errors and the predicted standard deviations (Pearson\u0026thinsp;=\u0026thinsp;0.37, Supplementary Fig.\u0026nbsp;2F), indicating that accurate predictions were assigned high confidence by the model and, conversely, that poor predictions were tempered by low predicted confidence.\u003c/p\u003e\u003cp\u003e\u003cb\u003eExternal validation on an independent cohort\u003c/b\u003e\u003c/p\u003e\u003cp\u003eOn the external test cohort (CLB), the Pearson correlation between the predicted survival value and the ground truth was 0.621 (p-value\u0026thinsp;\u0026lt;\u0026thinsp;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e, with mean correlation across cohorts\u0026rsquo; splits\u0026thinsp;=\u0026thinsp;0.615, standard deviation\u0026thinsp;=\u0026thinsp;0.012) and the Spearman correlation was 0.634 (p-value\u0026thinsp;\u0026lt;\u0026thinsp;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e, with mean correlation across cohorts\u0026rsquo; splits\u0026thinsp;=\u0026thinsp;0.635, standard deviation\u0026thinsp;=\u0026thinsp;0.01) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD). Binary predictions at three months were excellent, with an AUC of 0.875 and a false negative rate of only 1.1% (patients predicted dead but who remained alive at three months) (Supplementary Fig.\u0026nbsp;3A). Predictions at one year showed an AUC of 0.806 and a false negative rate of 8.1% (patients predicted dead but who remained alive at one year) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eE,F). Again, we observed that the algorithm was slightly overoptimistic at all time points (3 months, 1 year, mean follow-up) and became pessimistic on the very latest events, i.e. 
on patients with the most prolonged life expectancy (Supplementary Fig.\u0026nbsp;3D), although predictions at 701 days reached F1 score\u0026thinsp;=\u0026thinsp;0.761 and AUC\u0026thinsp;=\u0026thinsp;0.789 (Supplementary Tables\u0026nbsp;3\u0026amp;4). K-memBERT-T2 performance was stable for patients between 35 and 85 years old (Supplementary Fig.\u0026nbsp;3E) and across cancer types, although it was worse for hematological and breast cancers (Supplementary Fig.\u0026nbsp;3F), and showed no significant difference between sexes (not shown). We also added 7082 living patients (survival censored at the time of data extraction) from the CLB database, who had not opposed the use of their data for this project. The concordance index of the whole CLB test cohort was 0.766.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eComparison with the Performance Status (PS)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe chose to compare the predictions of K-memBERT-T2 with the PS. PS is a widely used prognosis score in oncology that defines five grades from 0 to 4 (5 being death) corresponding to increasing levels of fatigue and highly correlated with overall survival. We extracted the PS scores contained in 46,123 documents from the CLB cohort. We first confirmed that PS was significantly related to survival (Kruskal-Wallis p-value\u0026thinsp;\u0026lt;\u0026thinsp;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e, multi-class overall accuracy\u0026thinsp;=\u0026thinsp;0.197, Cohen kappa score\u0026thinsp;=\u0026thinsp;0.032). Then, we translated K-memBERT-T2 predictions into five prognosis intervals [0\u0026ndash;0.2], [0.2\u0026ndash;0.4], [0.4\u0026ndash;0.6], [0.6\u0026ndash;0.8] and [0.8\u0026ndash;1]. 
We showed that the model\u0026rsquo;s prognosis groups were more accurate at predicting overall survival than PS (Kruskal-Wallis p-value\u0026thinsp;\u0026lt;\u0026thinsp;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e, multi-class overall accuracy\u0026thinsp;=\u0026thinsp;0.336, Cohen kappa score\u0026thinsp;=\u0026thinsp;0.170), in all groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eG, Supplementary Fig.\u0026nbsp;4). Importantly, the model predictions were slightly influenced by the presence of the term \u0026lsquo;PS\u0026thinsp;=\u0026thinsp;0\u0026rsquo; in the text documents, whereas the other \u0026lsquo;PS\u0026rsquo; and/or \u0026lsquo;Karnofsky scores\u0026rsquo; (\u0026lsquo;KS\u0026rsquo;, equivalent to PS and frequently used in CLB documents) did not influence the predictions (Supplementary Fig.\u0026nbsp;4D).\u003c/p\u003e\u003cp\u003e\u003cb\u003eInterpretation of the model predictions\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe attention given to the medical vocabulary increased in the last layers of K-memBERT, suggesting that the step of adding medical vocabulary benefited the model and provided fine-grained representations (Supplementary Fig.\u0026nbsp;4A). We analyzed the magnitude and the direction of the importance of each input token for the predictions (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, Supplementary Fig.\u0026nbsp;5, supplementary method). 
In some situations, even with wrong predictions, human decisions would have been in line with the algorithm prediction based on the medical record used for prognosis estimation (Supplementary Fig.\u0026nbsp;5E and Supplementary Results).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eWe have improved, trained, evaluated and interpreted the performance of a deep learning model \u0026ndash; K-memBERT \u0026ndash; that accurately estimates the overall survival of patients with cancer of any type (any primary, and any stage for patients from two Comprehensive Cancer Centers) from written electronic medical reports. K-memBERT\u0026rsquo;s distinctive features include an improved medical vocabulary, the ability to estimate confidence in its predictions, and the ability to take several documents as input at a time to leverage information from the patient\u0026rsquo;s historical clinical notes. We confirmed the high generalizability of K-memBERT performance (meaning that it can predict accurately regardless of the source of the data), despite the differences in true survival time distributions between the cohorts. Importantly, K-memBERT is a non-invasive method that can be repeated over time without limit (at each consultation) and does not require extra work from the medical team, as it uses routinely generated free-text medical records.\u003c/p\u003e\u003cp\u003eWe showed that inputting sequential documents of a patient\u0026rsquo;s history improved the performance of the model, compensating for the difficulty of long-range predictions. This strategy is also supported by two other studies using structured data types for various clinical situations (Supplementary Context) \u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. 
This intuition came from clinical routine, where physicians usually review a large number of documents to picture the patients\u0026rsquo; past and recent medical history. Human performance for prognosis estimation at 2 years is AUC=[0.72\u0026ndash;0.81] in patients with advanced cancer \u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Across various studies, only 20\u0026ndash;29% of physicians\u0026rsquo; prognosis estimates were accurate (\u0026ldquo;accurate\u0026rdquo; meaning that the predicted survival duration was between 0.67 and 1.33 times the actual survival) \u003csup\u003e\u003cspan additionalcitationids=\"CR17 CR18 CR19\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e. This is close to the performance of K-memBERT-T2 on the test sets, with 3-month AUC=[0.852\u0026ndash;0.875], 1-year AUC=[0.806\u0026ndash;0.817], and Pearson correlation [0.621\u0026ndash;0.655].\u003c/p\u003e\u003cp\u003eOur study was intentionally dedicated to patients affected by any type of cancer at any stage, with the aim of designing a comprehensive tool usable across all cancer specialties. This distinguishes it from similar NLP models fed with EHRs, where selecting a specific patient subpopulation is often the rule \u003csup\u003e\u003cspan additionalcitationids=\"CR22 CR23\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. 
Overall, medical consultation data have barely been used for cancer prognosis estimation, despite the extensive exploration of other types of input data (mostly biological and molecular data in recent years, see Supplementary Context).\u003c/p\u003e\u003cp\u003eWe acknowledge a potential selection bias in training the models on deceased patients only: the dataset may have been enriched with aggressive and incurable cancer cases. Accordingly, the mean survival was shorter in the GR cohort than in the external validation cohort, possibly due to the differences in selection criteria, as patients in the external cohort were all included in a clinical trial, suggesting that their prognosis at inclusion was good enough for enrollment. Nevertheless, we showed that this did not affect the generalization of the predictions and we confirmed it on a set of censored follow-ups (i.e. living patients) in the external test cohort. Another limitation is that the model has only seen patient data from specialized Cancer Centers and we could not evaluate the generalization performance in General Hospitals within this study. Given these potential biases and the absence of a proper prospective evaluation in this study, we do not recommend using K-memBERT for clinical decision-making, but rather for translational research purposes such as internal comparison with physicians\u0026rsquo; prognostication.\u003c/p\u003e\u003cp\u003eIn conclusion, we propose a methodology that accurately predicts overall survival in patients with cancer of any type, at any stage. The model combines non-invasiveness, ease of use, high prediction accuracy, high generalizability across francophone EHRs, immediate output and human control over the analysis. 
This reliable approach may help make more precise prognostication widely available to guide treatment decisions and continuously ensure the best care for patients with cancer.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eDECLARATION OF INTERESTS\u003c/h2\u003e\n\u003cp\u003eLV reports personal fees from Adaptherapy, is CEO of RESOLVED, has received non-personal fees from Pierre-Fabre and Servier, and a grant from Bristol-Myers Squibb, all outside the submitted work.\u003c/p\u003e\n\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\n\u003cp\u003eCl\u0026eacute;ment Piat (C.P): Data curation, Formal Analysis, Methodology, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eQuentin Blampey (Q.B): Data curation, Formal Analysis, Methodology, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eAlexandre Joutard (A.J): Data curation, Formal Analysis, Methodology, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eMohamed Aymen Qabel (M.A.Q): Data curation, Formal Analysis, Methodology, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eTh\u0026eacute;o Di Piazza (T.D.P): Data curation, Formal Analysis, Methodology, Validation, Visualization, Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eUgo Benassayag (U.B): Data curation, Software\u003c/p\u003e\n\u003cp\u003eRaphael Vienne (R.V): Data curation, Formal Analysis, Validation\u003c/p\u003e\n\u003cp\u003eRaphael Reme (R.R): Software, Visualization\u003c/p\u003e\n\u003cp\u003eDaphn\u0026eacute; Morel (D.M): Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eMaxime Choffe (M.C): Data curation\u003c/p\u003e\n\u003cp\u003eEric Deutsch (E.D): Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eJean-Yves Blay (JY.B): Writing \u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eLoic Verlingue (L.V): Conceptualization, Supervision, Project administration, Writing 
\u0026ndash; original draft, Writing \u0026ndash; review \u0026amp; editing\u003c/p\u003e\n\u003cp\u003eAll authors read and approved the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003eThe underlying data have been verified by Cl\u0026eacute;ment Piat (C.P), Quentin Blampey (Q.B), Alexandre Joutard (A.J), Mohamed Aymen Qabel (M.A.Q), Th\u0026eacute;o Di Piazza (T.D.P), Maxime Choffe (M.C) and Loic Verlingue (L.V).\u003c/p\u003e\n\u003ch2\u003eAcknowledgement\u003c/h2\u003e\n\u003cp\u003eThe authors would like to acknowledge all the patients included in this study and the colleagues at DITEP, DTNSI and U1030 at Gustave Roussy, at CRCL and DSI at CLB, at CentraleSupelec Paris and Collective Thinking, and the medical and paramedical staff who took care of the patients and generated the data used in this study.\u003c/p\u003e\n\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eThe data used in this study are free-text medical records modified by a pseudo-anonymization procedure. Nevertheless, the granularity of the data precludes making them accessible without the risk of sharing Protected Health Information. 
We have released the code at https://github.com/DITEP/KmemBERT, a demonstration of an online application of K-memBERT and the code for a local installation for a secured utilization https://fd21dde0-4a7b-44bc-b011-b5de469557ea.pub.instances.scw.cloud/BERTPrediction\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eLu, Z. \u003cem\u003eet al.\u003c/em\u003e Early Interdisciplinary Supportive Care in Patients With Previously Untreated Metastatic Esophagogastric Cancer: A Phase III Randomized Controlled Trial. \u003cem\u003eJournal of clinical oncology\u003c/em\u003e 39, (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBandieri, E. 
\u003cem\u003eet al.\u003c/em\u003e Early versus delayed palliative/supportive care in advanced cancer: an observational study. \u003cem\u003eBMJ Support Palliat Care\u003c/em\u003e 10, e32 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTopol, E. J. High-performance medicine: the convergence of human and artificial intelligence. \u003cem\u003eNature Medicine\u003c/em\u003e 25, 44\u0026ndash;56 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRajkomar, A., Dean, J. \u0026amp; Kohane, I. Machine Learning in Medicine. \u003cem\u003eN Engl J Med\u003c/em\u003e 380, 1347\u0026ndash;1358 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eElfiky, A. A., Pany, M. J., Parikh, R. B. \u0026amp; Obermeyer, Z. Development and Application of a Machine Learning Approach to Assess Short-term Mortality Risk Among Patients With Cancer Starting Chemotherapy. \u003cem\u003eJAMA Netw Open\u003c/em\u003e 1, e180926 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWulczyn, E. \u003cem\u003eet al.\u003c/em\u003e Interpretable survival prediction for colorectal cancer using deep learning. \u003cem\u003eNPJ Digit Med\u003c/em\u003e 4, 71 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSkrede, O.-J. \u003cem\u003eet al.\u003c/em\u003e Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. \u003cem\u003eLancet\u003c/em\u003e 395, 350\u0026ndash;360 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKudo, T. \u0026amp; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. in \u003cem\u003eProceedings of the\u003c/em\u003e 2018 \u003cem\u003eConference on Empirical Methods in Natural Language Processing: System Demonstrations\u003c/em\u003e (eds. Blanco, E. \u0026amp; Lu, W.) 66\u0026ndash;71 (Association for Computational Linguistics, Brussels, Belgium, 2018). 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18653/v1/D18-2012\u003c/span\u003e\u003cspan address=\"10.18653/v1/D18-2012\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMartin, L. \u003cem\u003eet al.\u003c/em\u003e CamemBERT: a Tasty French Language Model. \u003cem\u003eProceedings of the 58th Annual Meeting of the Association for Computational Linguistics\u003c/em\u003e, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://arxiv.org/abs/1911.03894\u003c/span\u003e\u003cspan address=\"http://arxiv.org/abs/1911.03894\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e 7203\u0026ndash;7219 (2020) doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18653/v1/2020.acl-main.645\u003c/span\u003e\u003cspan address=\"10.18653/v1/2020.acl-main.645\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOken, M. M. \u003cem\u003eet al.\u003c/em\u003e Toxicity and response criteria of the Eastern Cooperative Oncology Group. \u003cem\u003eAm J Clin Oncol\u003c/em\u003e 5, 649\u0026ndash;655 (1982).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCorbaux, P. \u003cem\u003eet al.\u003c/em\u003e Patients\u0026rsquo; selection and trial matching in early-phase oncology clinical trials. \u003cem\u003eCritical Reviews in Oncology/Hematology\u003c/em\u003e 196, 104307 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGrandini, M., Bagli, E. \u0026amp; Visani, G. Metrics for Multi-Class Classification: an Overview. 
Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2008.05756\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2008.05756\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePang, C. \u003cem\u003eet al.\u003c/em\u003e CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. in \u003cem\u003eProceedings of Machine Learning for Health\u003c/em\u003e 239\u0026ndash;260 (PMLR, 2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, Y. \u003cem\u003eet al.\u003c/em\u003e Hi-BEHRT: Hierarchical Transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. \u003cem\u003eIEEE J Biomed Health Inform\u003c/em\u003e PP, (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMalhotra, K. \u003cem\u003eet al.\u003c/em\u003e Prognostic accuracy of patients, caregivers, and oncologists in advanced cancer. \u003cem\u003eCancer\u003c/em\u003e 125, 2684\u0026ndash;2692 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChristakis, N. A. \u0026amp; Lamont, E. B. Extent and determinants of error in doctors\u0026rsquo; prognoses in terminally ill patients: prospective cohort study. \u003cem\u003eBMJ\u003c/em\u003e 320, 469\u0026ndash;472 (2000).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGlare, P. \u003cem\u003eet al.\u003c/em\u003e A systematic review of physicians\u0026rsquo; survival predictions in terminally ill cancer patients. \u003cem\u003eBMJ\u003c/em\u003e 327, 195\u0026ndash;198 (2003).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAmano, K. \u003cem\u003eet al.\u003c/em\u003e The Accuracy of Physicians\u0026rsquo; Clinical Predictions of Survival in Patients With Advanced Cancer. 
\u003cem\u003eJ Pain Symptom Manage\u003c/em\u003e 50, 139\u0026ndash;146.e1 (2015).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSmith-Uffen, M. E. S. \u003cem\u003eet al.\u003c/em\u003e Estimating survival in advanced cancer: a comparison of estimates made by oncologists and patients. \u003cem\u003eSupport Care Cancer\u003c/em\u003e 28, 3399\u0026ndash;3407 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKiely, B. E. \u003cem\u003eet al.\u003c/em\u003e The median informs the message: accuracy of individualized scenarios for survival time based on oncologists\u0026rsquo; estimates. \u003cem\u003eJ Clin Oncol\u003c/em\u003e 31, 3565\u0026ndash;3571 (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKalyan, K. S., Rajasekharan, A. \u0026amp; Sangeetha, S. AMMU: A survey of transformer-based biomedical pretrained language models. \u003cem\u003eJournal of Biomedical Informatics\u003c/em\u003e 126, 103982 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYang, S., Varghese, P., Stephenson, E., Tu, K. \u0026amp; Gronsbell, J. Machine learning approaches for electronic health records phenotyping: a methodical review. \u003cem\u003eJ Am Med Inform Assoc\u003c/em\u003e 30, 367\u0026ndash;381 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHager, P. \u003cem\u003eet al.\u003c/em\u003e Evaluation and mitigation of the limitations of large language models in clinical decision-making. \u003cem\u003eNat Med\u003c/em\u003e 1\u0026ndash;10 (2024) doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41591-024-03097-1\u003c/span\u003e\u003cspan address=\"10.1038/s41591-024-03097-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFerber, D. \u003cem\u003eet al.\u003c/em\u003e Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. 
Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2404.04667\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2404.04667\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Deep learning, electronic health records, natural language processing, transformers, prognosis, cancer","lastPublishedDoi":"10.21203/rs.3.rs-7121466/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7121466/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003ePrognosis estimation is key to personalizing oncology care, yet current models rely on limited and often incomplete clinical and biological data. We designed a solution adapted to any kind of cancer (type and stage) based on narrative electronic medical reports, i.e. the basic working material for oncologists. 
We used 2.3M medical documents (corresponding to 36,123 patients for whom we had the date of death) to train, validate and test three different approaches. The best survival prediction performances were obtained by taking into account the medical history with sequential reports. This model (K-memBERT-T2) reached a Pearson correlation of 0.655 on the test cohort, 0.621 on a large external cohort of 143k documents (17,633 additional patients) (p-values:\u0026lt;10\u003csup\u003e\u0026minus;\u0026thinsp;5\u003c/sup\u003e) and a concordance index of 0.766 when adding 7082 alive and censored patients in this external test cohort. The 3-month binary survival predictions achieved an AUC of 0.852 on the test cohort and 0.875 on the external dataset. The model related to survival duration better than the PS, independently of its mention in texts. We present a non-invasive and interpretable method paving the way for an easy implementation in French-speaking centers.\u003c/p\u003e","manuscriptTitle":"An explainable language model predicts survival from medical reports in oncology","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-27 06:26:05","doi":"10.21203/rs.3.rs-7121466/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3301dc3d-7a54-4df0-827c-f8aa58215bbe","owner":[],"postedDate":"August 27th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":53319233,"name":"Biological 
sciences/Cancer"},{"id":53319234,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":53319235,"name":"Health sciences/Oncology"}],"tags":[],"updatedAt":"2025-10-22T17:08:21+00:00","versionOfRecord":[],"versionCreatedAt":"2025-08-27 06:26:05","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7121466","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7121466","identity":"rs-7121466","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
