Revealing the Impact of Pre-training Data on Medical Foundation Models

Yukun Zhou, Zheyuan Wang, Yilan Wu, Ariel Yuhan Ong, Siegfried Wagner, and 23 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6080254/v1 This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 28 Feb, 2026; the published version is in Nature Communications. Version 1 posted; this is the latest preprint version.

Abstract

Medical foundation models (FM), pre-trained on large-scale unlabelled data, have demonstrated robust performance and high efficiency when fine-tuned to various clinically relevant applications. However, the impact of pre-training data on medical FM performance such as generalisability and fairness, which form the foundation in fine-tuned models, remains unexplored. 
To address this, we sampled two large cohorts from two sites, Moorfields Eye Hospital (UK) and the Shanghai Diabetes Prevention Program (China), each containing 904,170 retinal images for FM pre-training. We developed parallel FMs using identical processes and compared their fairness and generalisability on downstream tasks with publicly available datasets and held-out data from each site. Our results demonstrate that, despite strong generalisability, medical FMs perform significantly better on downstream data that align with the pre-training data in approximately one-third of tasks. Additionally, age is a key metadata factor impacting FM fairness and generalisability in retinal images, whereas sex and ethnicity show no such impact. These findings advocate for an evidence-based approach to pre-training data selection and highlight the importance of transparency even for pre-training data, ultimately enhancing FM capabilities and guiding FM development and customised application in healthcare.

Health sciences/Diseases/Eye diseases; Health sciences/Diseases/Cardiovascular diseases; Health sciences/Health care/Medical imaging; Health sciences/Medical research/Translational research

Introduction

Foundation models (FM) are large artificial intelligence (AI) models trained using data and computation at scale [ 1 , 2 ]. Using self-supervised or unsupervised learning methods, FMs capture abundant data patterns that can potentially be applied to diverse applications in real-world scenarios. This approach has broad applications across fields of medical AI [ 3 – 5 ], such as ophthalmology [ 6 – 8 ], radiology [ 9 – 11 ], pathology [ 12 – 16 ], and as a generalist [ 17 , 18 ] advancing clinically meaningful tasks like disease diagnosis and prognosis. 
However, despite the increasing number of medical FMs being developed, there remains very limited knowledge regarding how the composition of pre-training data, the “guts” of these models, affects FM capabilities such as generalisability and fairness. The lack of this critical knowledge makes pre-training data collection and medical FM development inefficient and highly speculative. Training data is the fundamental substrate for developing AI models, with its characteristics encompassing attributes, properties, and features that influence data quality, usability, and its impact on analytical processes. These characteristics define the scope of knowledge and largely determine model capability including generalisability and fairness [ 19 – 22 ]. Previous studies have investigated the impact of labelled data on traditional application-specific AI models in supervised learning [ 20 , 23 – 26 ], providing effective guidance for labelled data selection during model training. For instance, some work has demonstrated that model performance markedly drops when applied to external sites with distinct data characteristics such as demographics, imaging devices, and disease phenotypes [ 20 , 23 , 24 , 26 ]. Additionally, imbalanced training data often leads to poor performance in underrepresented subgroups in terms of age, sex, and ethnicity, raising concerns about model fairness and generalisability [ 25 , 27 , 28 ]. To address these challenges, previous work had focused on building diverse and balanced training data for traditional application-specific AI models, with techniques such as data augmentation and synthesis [ 20 , 29 – 31 ]. Unlike application-specific AI models that rely solely on labelled data for training, FMs learn generalisable features through extensive pre-training on substantial unlabelled data (e.g. via self-supervised learning [ 32 – 34 ]), followed by fine-tuning on labelled data for specific target applications. 
The performance of FMs is collectively shaped, where pre-training establishes the foundational capabilities while fine-tuning refines performance in specific tasks. However, the impact of unlabelled pre-training data on medical FMs remains unexplored, partly hampered by the substantial workload and resources required to build and compare parallel FMs on matched large-scale datasets from different countries and areas. This lack of knowledge leaves critical questions unanswered: 1) Do pre-training data characteristics affect the generalisability of medical FMs, such as performing poorly on sites with distinct demographics and imaging devices? 2) Do clinical metadata of pre-training data impact medical FM fairness over age, sex, and ethnicity, similar to traditional application-specific AI models? 3) How can we identify key metadata that likely influence FM generalisability and fairness? Addressing these questions is critical to ensure that FMs have good foundational capabilities for downstream clinical applications. Specifically, given the substantial data and computational resources required to develop medical FMs, it is imperative to understand what constitutes an appropriate distribution of pre-training data to enhance development efficiency and optimise medical FMs that can be broadly used for various clinically relevant applications across different sites. Furthermore, revealing the impact of pre-training data provides a strong basis for advocating data transparency, one of the least transparent dimensions according to Foundation Model Transparency Index Scores [ 35 ]. This is particularly relevant to medical FMs, where data distributions are often skewed–no dataset is free of limitations [ 36 ]. Disclosing how pre-training data impacts FMs, and providing details of pre-training data, are essential to understanding the strengths and limitations of medical FMs. 
To address these gaps, we investigate the impact of pre-training data on medical FM performance, using two large data cohorts from Moorfields Eye Hospital (MEH), UK, and Shanghai Diabetes Prevention Program (SDPP), China, each comprising 904,170 retinal colour photographs for FM pre-training. We first characterise data cohorts using clinical and imaging metadata, latent features (representative features encoded by models), and clinically meaningful morphological indices, highlighting the differences between the datasets. We then develop medical FMs with MEH data (FM-MEH) and SDPP data (FM-SDPP), using identical pre-training strategies and implementation details. We evaluate the performance of parallel FMs across a wide range of downstream tasks, including ocular disease diagnosis and systemic event prediction, using data from each site (held out from pre-training data) and publicly available datasets. We finally assess model fairness across subgroups based on age, sex, and ethnicity. Our findings demonstrate that pre-training data significantly impact the generalisability and fairness of medical FMs. Although FM-MEH and FM-SDPP perform comparably in around 70% of downstream tasks, they perform significantly better in some tasks on sites where they were pre-trained. For retinal FMs studied, the age distribution of pre-training data introduced performance gaps over age subgroups when adapted to downstream tasks, while sex and ethnicity show minimal impact. Through extensive experiments with real-world clinical data, this study addresses previously unanswered questions about the impact of pre-training data on medical FMs. More importantly, it advocates for an evidence-based approach to data description and selection in medical FM development to improve development efficiency and model capabilities.

Results

Quantification of data characteristics

Figure 1 provides an overview of the development and application of medical FMs. 
FM-MEH and FM-SDPP were constructed using data from MEH and SDPP respectively. We randomly sampled 904,170 retinal fundus photographs from each database for FM pre-training, and analysed their characteristics across clinical and imaging metadata, latent features, and clinically meaningful morphological indices. As shown in Fig. 2 , significant differences were observed between the MEH and SDPP data in terms of clinical and imaging metadata. The average age in the MEH cohort is 68.88 years (95% Confidence Interval (CI) 68.85, 68.91), significantly older than the SDPP cohort, which had an average age of 47.26 years (95% CI 47.21, 47.30) ( P < 0.001). MEH data had a more balanced sex distribution, with 52.8% female participants compared to 36.8% in the SDPP data. MEH data was ethnically diverse with individuals identifying as White (45.9%), Asian or Asian British (18.7%, of which 0.6% were Chinese), Black or Black British (9.2%), Mixed (0.9%), other ethnicity (12.7%) and not reported (NR, 12.6%), based on the ethnicity grouping by the UK Office for National Statistics. In contrast, the SDPP cohort comprised only Chinese participants. Imaging devices also varied between the two datasets, with MEH primarily using the 3DOCT-2000SA (Topcon), FD-OCT (Topcon), and CIRRUS (ZEISS), while the SDPP data was collected using TRC-NW300 and TRC-NW400 (Topcon). To analyse latent features, we extracted features from 5000 random samples respectively from each site and visualised them using t-distributed stochastic neighbour embedding (t-SNE), a dimensionality reduction algorithm widely used for visualising high-dimensional data [ 37 ]. As shown in Fig. 2 b, the t-SNE plots revealed clear clustering patterns between the MEH and SDPP cohorts, underscoring the distinct data distributions of the two sites. 
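The cohort age summaries above are reported as a mean with a 95% confidence interval. As a rough illustration of how such an interval can be obtained, the sketch below uses a normal approximation (mean ± 1.96 standard errors). The text does not state which CI method was used, so this choice is an assumption, and the function name `mean_with_ci` and the ages in the example are hypothetical.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% CI.

    Assumption: the interval is mean +/- z * standard error of the mean.
    The source text does not say how its intervals were computed.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a CI")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # sample SD / sqrt(n)
    return mean, mean - z * sem, mean + z * sem


# Hypothetical ages for illustration only; these are not study data.
ages = [62, 71, 68, 75, 59, 80, 66, 73]
mean, lower, upper = mean_with_ci(ages)
print(f"{mean:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

With cohorts of roughly 900,000 images the standard error becomes very small, which is in line with the narrow intervals quoted for both sites.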
Furthermore, we measured clinically meaningful morphological indices, such as vascular fractal dimension, which have been proven to be highly associated with systemic conditions like cardiovascular [ 38 – 40 ] and neurological health [ 41 – 44 ]. Figure 2 c highlights significant differences in these indices between the two cohorts. For instance, the artery fractal dimension in the MEH dataset is 1.25 (95% CI 1.24, 1.25), compared to 1.28 (95% CI 1.28, 1.29) in the SDPP dataset (p < 0.001). More morphological indices are listed in Extended Data Fig. 1 . For FM pre-training, we included two representative and widely used self-supervised learning strategies, generative-based learning (Masked Autoencoder [ 45 ]) and contrastive-based learning (DINOV2 [ 46 ]). We organised downstream tasks using held-out data (i.e. isolated from FM pre-training data) from MEH and SDPP, as well as publicly available datasets sourced from several countries. We adapted FMs to downstream tasks via both fine-tuning (all model parameters tuned on downstream labelled data) and linear probes (all model parameters frozen with one linear classifier tuned on downstream labelled data), as shown in Extended Data Fig. 2 . All task performances were assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Details of the downstream task datasets are listed in Supplementary Tables 1 and 2. More details about data curation, model adaptation, and model evaluation are introduced in the Methods section.

Generalisability on downstream tasks curated from each site

We compared FM-MEH and FM-SDPP on three clinically relevant applications (diabetic retinopathy detection, diabetic macular oedema detection, and ischaemic stroke prediction) using held-out MEH and SDPP data. As shown in Fig. 
3 , the FMs demonstrated good generalisability, with FM-MEH and FM-SDPP showing comparable performance on more than 70% of evaluations (17 out of 24). Despite this generalisability, the FMs occasionally perform significantly better on the site where they were developed. For instance, when adapted to SDPP downstream tasks, FM-SDPP pre-trained with Masked Autoencoder significantly outperformed FM-MEH in three out of six evaluations, as shown in Fig. 3 a. Similarly, when adapted to MEH downstream tasks, FM-MEH significantly outperformed FM-SDPP in three evaluations, as shown in Fig. 3 c. The FM-MEH and FM-SDPP pre-trained with DINOV2 exhibited fewer significant differences in downstream tasks (Fig. 3 b and 3 d). For SDPP downstream tasks, neither fine-tuning nor linear probes revealed significant differences between FM-SDPP and FM-MEH. For MEH downstream tasks, FM-MEH significantly outperformed FM-SDPP only in ischaemic stroke prediction (p < 0.001). AUPRC performance for all tasks is illustrated in Extended Data Fig. 3 . All quantitative results are listed in Supplementary Table 3.

Generalisability on downstream tasks from publicly available datasets

We evaluated the generalisability of FM-MEH and FM-SDPP to diverse applications using six publicly available datasets, comprising diabetic retinopathy detection (APTOS2019, IDRiD, and MESSIDOR2), glaucoma detection (Glaucoma fundus), and multiple retinal disease detection (JSIEC and Retina). As shown in Fig. 4 , while the performance of FM-SDPP and FM-MEH varied depending on the self-supervised learning strategies employed, they showed comparable performance in 16 out of 24 (66.7%) evaluations. When pre-trained with Masked Autoencoder and fine-tuned to the downstream tasks (Fig. 4 a), FM-SDPP significantly outperformed FM-MEH on four out of six datasets (p < 0.001), except for IDRiD (p = 0.601) and JSIEC (p = 0.129). When adapted to downstream tasks with the linear probe (Fig. 
4 c), FM-MEH performed significantly better on the Glaucoma fundus dataset (p < 0.001). When using DINOV2 for FM pre-training (Fig. 4 b and Fig. 4 d), FM-MEH significantly outperformed FM-SDPP in three applications, i.e. fine-tuning to MESSIDOR2 (p < 0.001) and linear probing to IDRiD (p < 0.001) and Retina (p = 0.013). AUPRC results are illustrated in Extended Data Fig. 4 . All quantitative results are listed in Supplementary Table 3.

Identifying clinical metadata showing strong data variations

We subgrouped the randomly sampled MEH pre-training data based on clinical metadata (e.g. age, sex, and ethnicity) and characterised the subgroup data in latent features and morphological indices, as shown in Fig. 5 . Age was split into three subgroups, the young group (< 40 years), the middle-aged group (40–70 years), and the aged group (> 70 years), based on the age distribution observed in Fig. 2 a. We observed clear clustering in t-SNE maps (Fig. 5 b, Extended Data Fig. 5 ) and significant distribution differences in morphological indices (Fig. 5 d). For instance, the middle-aged group (40–70 years) had significantly higher artery fractal dimensions than the aged group (> 70 years). To eliminate potential confounding effects of sex and ethnicity, we specified a subgroup (e.g. female, Asian or Asian British) for analysis, and still observed distinct distribution differences across age subgroups in both latent features (Fig. 5 c) and morphological indices (Fig. 5 e). This demonstrates that age subgroups exhibit clear morphological variations, and even with self-supervised pre-training, medical FMs learned distinct latent features across the subgroups, which potentially caused bias in FM performance such as fairness and generalisability. For ethnicity, we investigated three subgroups: White, Asian or Asian British, and Black or Black British. The t-SNE visualisations (Extended Data Fig. 6 a, FMs trained with Masked Autoencoder) showed clear clustering only for the White cohort. When FMs were pre-trained with DINOV2 (Extended Data Fig. 
6 c), latent features showed no distinct clustering across all ethnicities. Additionally, when controlling for confounders, there were no significant differences across ethnic subgroups in morphological indices (Extended Data Fig. 6 f). For sex subgroups (i.e. female, male), we observed no distinct clustering in t-SNE visualisations, either before or after removing confounding variables (Extended Data Fig. 7). Only the vein fractal dimension still differed significantly across the sex subgroups after removing confounding variables (Extended Data Fig. 7f). These findings suggest that ethnicity and sex subgroups contributed limited observable variations in morphological indices compared to age distribution. For these subgroups, FMs learned less distinguishable latent features in self-supervised pre-training, which are less likely to bias FM fairness and generalisability in downstream tasks.

FM fairness and generalisability over clinical metadata

We conducted subgroup analyses to evaluate whether the distribution of clinical metadata introduced bias to FM fairness and generalisability across downstream tasks. We examined subgroup performance on MEH downstream tasks of diabetic retinopathy and diabetic macular oedema detection, as FM-SDPP and FM-MEH showed no significant differences in overall performance on these tasks (p = 0.275 and p = 0.313 respectively). The SDPP pre-training data is mainly distributed over the young (< 40 years) and middle-aged (40–70 years) groups, whereas the MEH pre-training data is mainly distributed over the middle-aged and aged (> 70 years) groups, with a (young, middle-aged, aged) ratio of (0.015, 0.467, 0.518). As shown in Fig. 6 a, FM-SDPP performed consistently better than FM-MEH in the young group but worse in the aged group. For instance, in diabetic macular oedema detection using a linear probe, FM-SDPP outperformed FM-MEH by an average AUROC of 0.094 in the young group (p = 0.014) while underperforming by 0.049 in the aged group (p = 0.034). FM-MEH and FM-SDPP demonstrated similar performance in the middle-aged group. 
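The subgroup analysis above can be sketched in a few lines: assign each sample to an age subgroup, summarise the subgroup ratio, and compute AUROC within each subgroup. This is a minimal illustration, not the study's evaluation code. It assumes a binary task (the multi-class tasks in the text would need a one-vs-rest extension), it assumes that ages of exactly 40 and 70 fall in the middle-aged group, and all names and data below are hypothetical.

```python
from collections import Counter

GROUPS = ("young", "middle-aged", "aged")


def age_group(age):
    """Map an age in years to a subgroup (assumed boundaries: <40, 40-70, >70)."""
    if age < 40:
        return "young"
    if age <= 70:
        return "middle-aged"
    return "aged"


def group_ratio(ages):
    """Fraction of samples per subgroup, in (young, middle-aged, aged) order."""
    counts = Counter(age_group(a) for a in ages)
    return tuple(counts[g] / len(ages) for g in GROUPS)


def auroc(labels, scores):
    """Binary AUROC: probability a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auroc(ages, labels, scores):
    """AUROC computed separately within each age subgroup."""
    result = {}
    for g in GROUPS:
        idx = [i for i, a in enumerate(ages) if age_group(a) == g]
        result[g] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return result


# Hypothetical toy data for illustration only; these are not study results.
ages = [25, 35, 30, 38, 45, 60, 55, 70, 72, 80, 85, 90]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
scores = [0.2, 0.9, 0.4, 0.7, 0.3, 0.6, 0.5, 0.4, 0.6, 0.5, 0.2, 0.4]

print(group_ratio(ages))
print(subgroup_auroc(ages, labels, scores))
```

Running the same per-subgroup computation for two models and differencing the results gives the kind of subgroup gap reported in Fig. 6; testing whether a gap is significant is a separate step that this sketch does not cover.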
This consistent performance gap across diverse applications demonstrated the FM bias introduced by differential age distribution in pre-training data, verifying that age, as a key metadata factor showing strong data variations, introduces bias in model fairness and generalisability. For ethnicity, despite being pre-trained exclusively on data from the Chinese cohort, FM-SDPP sometimes outperformed FM-MEH in White and Black subgroups on downstream tasks. For instance, FM-SDPP achieved an AUROC 0.12 higher than FM-MEH in the White subgroup for diabetic macular oedema detection (Fig. 6 b, p < 0.001). When adapted to diabetic macular oedema detection with the linear probe, FM-MEH pre-trained with DINOV2 significantly outperformed FM-SDPP in Asian cohorts (p = 0.009). The performance differences across ethnicity subgroups showed no consistent pattern and did not correlate with the ethnicity distribution of the pre-training data. For sex, FM-MEH and FM-SDPP showed varying performance across sex subgroups depending on the task and adaptation method (Fig. 6 c). Although the FM-SDPP pre-training data has a less balanced sex distribution (36.8% female participants versus 52.8% in MEH data), FM-SDPP performed better in female cohorts for certain tasks like diabetic macular oedema detection when pre-trained with DINOV2 and adapted with linear probe (p = 0.004). These results showed no correlation with the sex distribution in the pre-training data. The observations across ethnicity and sex subgroups verified that, because ethnicity and sex exhibit limited data variations, their distribution differences in pre-training data are less likely to introduce biases in FM fairness and generalisability. All quantitative results and p-values are listed in Supplementary Table 4. 
Discussion

This study investigates the impact of pre-training data on medical FM performance including generalisability and fairness, a critical yet underexplored area despite the rapid advancements in medical FM research. The study design is unique in that we developed parallel medical FMs with identical implementations, differing only in their pre-training data, and evaluated their performance in extensive experiments. Our findings show that FMs demonstrate good generalisability, achieving comparable performance in over 70% of downstream tasks. However, FMs sometimes perform better on application data that aligns with their pre-training data, and key clinical metadata, such as age distribution, can introduce biases that affect FM fairness and generalisability. These results highlight the importance of transparently disclosing pre-training data characteristics and developing evidence-based approaches to data selection, offering practical guidance for improving the construction and application of medical FMs. Medical FMs serve as robust base models for diverse applications, as evidenced by the strong generalisability of FM-MEH and FM-SDPP when adapted to downstream tasks across sites. In a majority of evaluations across MEH and SDPP sites, there were no significant differences in performance between intra-site and inter-site adaptations. For example, FM-MEH and FM-SDPP performed comparably in 17 out of 24 (70.8%) evaluations on SDPP and MEH downstream tasks (Fig. 3 ), showcasing great generalisability given the substantial differences between SDPP and MEH data (Fig. 2 ). Unlike prior studies that evaluated generalisability by comparing FMs to traditional application-specific models, our approach provided a more straightforward analysis by comparing the generalisation performance between intra-site (e.g. FMs pre-trained on SDPP data and adapted to SDPP downstream tasks) and inter-site (e.g. 
FMs pre-trained on MEH data and adapted to SDPP downstream tasks) adaptation. When adapted to publicly available datasets, FM-MEH and FM-SDPP performed comparably in 16 out of 24 (66.7%) evaluations (Fig. 4 ). Such a level of generalisability is rarely observed in traditional application-specific AI models, which typically perform significantly better in internal sites compared to external ones, even after fine-tuning or linear probing. The strong generalisability of medical FMs reinforces their potential as base models for adaptation to specific applications, such as disease diagnosis and prognosis. Users are encouraged to choose medical FMs pre-trained on data with similar distributions to the application data, considering that FMs are not yet perfectly generalisable and occasionally perform better on intra-site tasks. For instance, FM-MEH significantly outperformed FM-SDPP in 4 out of 12 evaluations on MEH downstream tasks, while FM-SDPP achieved superior performance in 3 out of 12 evaluations on SDPP tasks. This provides references to guide local deployment of medical FMs considering the increasing number of FMs in medical fields. Using ophthalmology as an example, RETFound [ 6 ] was primarily pre-trained on UK data while VisionFM [ 47 ] was pre-trained mainly on data from China. To better facilitate model selection, it is essential to disclose pre-training data details, aligning with data transparency initiatives in medical AI, such as the STANDING Together recommendations [ 36 ]. This is particularly important to medical AI, as clinical data from individual sites is often skewed due to local population and specific study design. Furthermore, the challenges posed by pre-training data limitations underscore the importance of creating a large, global database that unites and optimises resources from across the world. 
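The evaluation counts quoted in this discussion can be cross-checked with simple arithmetic. The sketch below only reuses numbers stated in the text (24 evaluations per setting, 4 and 3 significant intra-site wins, 16 comparable evaluations on public datasets); the variable names are illustrative.

```python
# Counts as reported in the text; "comparable" means no significant difference.
site_total = 24          # evaluations on MEH and SDPP downstream tasks
fm_meh_better = 4        # FM-MEH significantly better (MEH tasks)
fm_sdpp_better = 3       # FM-SDPP significantly better (SDPP tasks)
public_total = 24        # evaluations on publicly available datasets
public_comparable = 16   # comparable evaluations on public datasets

site_comparable = site_total - fm_meh_better - fm_sdpp_better
site_share = f"{site_comparable / site_total:.1%}"
public_share = f"{public_comparable / public_total:.1%}"

print(f"site tasks: {site_comparable}/{site_total} = {site_share}")
print(f"public datasets: {public_comparable}/{public_total} = {public_share}")
```

The seven significant intra-site differences account exactly for the gap between 24 evaluations and the 17 reported as comparable.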
Our findings provide a real-world example demonstrating that, despite rapid advancements in medical FMs, global collaboration remains crucial for developing truly generalisable medical AI. Incorporating multifaceted views of pre-training data enables a comprehensive assessment of data distribution, as well as the identification of key metadata that show strong variations. Clinical metadata, such as demographics, are widely used to quantify data characteristics but provide only a limited view relevant to FM development. As shown in Fig. 2 , demographic factors such as age, sex, and ethnicity differed significantly between MEH and SDPP datasets. However, only age subgroups significantly shifted the distribution of clinically meaningful morphological indices (Fig. 5 ). These findings emphasise the need for a multifaceted approach to describing data distribution. Morphological indices quantify the clinically relevant variations influenced by multiple factors, including demographics, ocular disease phenotypes, and systemic conditions. They provide a clinically meaningful perspective on data distribution and have been extensively studied in clinical association research [ 38 , 39 , 41 , 42 , 48 , 49 ]. Meanwhile, latent features represent how data are perceived by models and are highly relevant to model performance across diverse applications. Prior machine learning research often depicted data distribution in latent feature space and regulated features for generative modelling [ 50 , 51 ] and domain adaptation [ 52 , 53 ]. We demonstrate that morphological indices and latent features offer complementary descriptions of pre-training data and guide key metadata identification, providing a clear overview of data used for medical FM development. The proposed pipeline (Fig. 
5 a) can be extended to other medical fields by adjusting the morphological indices or involving domain-specific indices, enabling a comprehensive description of data distribution and identification of key clinical metadata. Metadata showing strong data variations are more likely to introduce biases in model fairness and generalisability in downstream tasks, requiring extra attention during data preparation and selection. Previous studies have explored how labelled data influences the performance of application-specific models. For instance, a recent study [ 20 ] demonstrated that fine-tuning data with uniformly sampled demographic attributes (e.g. age, sex, and ethnicity) improved model fairness in clinically relevant applications. However, few studies have investigated how FM pre-training data affect fairness and generalisability, largely due to the substantial resources and workload required in building parallel FMs for comparison. In our study, subgroup analysis revealed that the age distribution (the identified key metadata) introduced bias in model fairness across downstream tasks (Fig. 6 ), while similar biases were not observed for sex or ethnicity. This suggests that in retinal imaging, increasing diversity and balance of certain attributes, such as ethnicity and sex, does not necessarily enhance FM fairness (Fig. 6 ), while a balanced and wide-ranging age distribution in pre-training data contributes to improving the fairness and generalisability of retinal FMs. This highlights the necessity of identifying key metadata based on an evidence-based approach and leveraging these insights to guide the pre-training data selection, ultimately optimising the generalisability and fairness of medical FMs. Advancements in general AI techniques (e.g. self-supervised learning methods) continue to push the performance boundaries of medical AI. 
In our study, medical FMs pre-trained with different self-supervised learning strategies demonstrated varying performance and generalisability. We included representative self-supervised learning strategies, i.e. generative-based learning (Masked Autoencoder) and more recent contrastive-based learning (DINOV2). Our results showed that DINOV2 achieved superior performance in downstream tasks on each site (Supplementary Table 5). These observations extended to publicly available datasets, where FMs pre-trained with DINOV2 performed significantly better in over half of the evaluations (Supplementary Table 5). Additionally, on the downstream tasks from each site, FM-MEH and FM-SDPP pre-trained with DINOV2 showed significant differences in only one evaluation, compared to six for FMs pre-trained with Masked Autoencoder (Fig. 3 ). This indicated that DINOV2 yielded comparable performance between intra-site and inter-site adaptations, suggesting strong generalisability. The superior performance of DINOV2 is likely attributable to a combination of various pre-training strategies (i.e. DINO [ 54 ] and iBOT [ 55 ] learning image-level and patch-level features respectively), several practical tweaks (e.g. Sinkhorn-Knopp centring [ 56 ]), and generalisable features learnt from initial large-scale pre-training on 142 million natural images [ 46 ]. Despite the benefits of translational research, many of the latest and most powerful general AI techniques remain proprietary (e.g. GPT) or have limited transferability (e.g. DeepSeek) with insufficient technical details. The long-term and sustainable advancement of medical AI requires the development of domain-specific techniques tailored to the unique characteristics of medical data and application scenarios. Pre-training clinical data and self-supervised learning strategies have a synergistic effect on FM performance. FM-MEH and FM-SDPP pre-trained using various self-supervised learning methods performed substantially differently when adapted to publicly available datasets. 
When FM-SDPP was pre-trained with Masked Autoencoder, it significantly outperformed FM-MEH in four out of twelve evaluations (Fig. 4 a and Fig. 4 c). In contrast, FM-MEH pre-trained with DINOV2 generally outperformed FM-SDPP in disease diagnosis, with three cases showing significant differences. This suggests synergistic effects between SDPP data and Masked Autoencoder, as well as between MEH data and DINOV2. Although our study does not focus primarily on exploring the synergy between pre-training data and learning strategy, it provides real-world examples supporting the initiatives of seeing data and learning strategies as interconnected components [ 57 , 58 ], which highlights the need to simultaneously optimise both model learning strategies and pre-training data characteristics to advance medical FM development. Although this work systematically reveals the impact of pre-training data on FM performance using real-world clinical data, several limitations and challenges remain to be addressed in future research. First, although this work describes data distribution from multiple views (clinical and imaging metadata, latent features, and morphological indices), future studies should include extra factors, particularly concerning disease phenotypes. This is currently limited by the complexity of disease categories and severity, as well as challenges in precisely controlling disease phenotypes in large-scale pre-training data organisation. Second, due to the considerable workload involved in developing parallel FMs and organising diverse downstream tasks, this study primarily focused on representative self-supervised learning strategies, such as Masked Autoencoder and DINOV2, and used eye images as an exemplar. Further research involving a wider range of learning strategies and medical domains is needed. Third, due to the differences in the sources of labels (e.g. 
MEH diabetic retinopathy labels were extracted from clinical practice records; SDPP labels were annotated by two ophthalmologists with disagreements adjudicated by a consultant-level ophthalmologist), there are clear performance differences across various applications, as shown in Fig. 3. Although this does not bias the performance comparison between FM-MEH and FM-SDPP, a well-aligned labelling system would allow more standardised cross-validation.

Building upon this study, future work could quantify how far our findings improve efficiency in medical FM development, for example by measuring the data volume and computational resources saved when developing competitive medical FMs. Additionally, the key metadata of pre-training data can be identified and prioritised in batch data sampling for federated learning and in data synthesis by generative modelling.

In conclusion, we unravel the impact of pre-training data on the performance of medical FMs, demonstrating that both AI equity and generalisability start at the foundations: the pre-training data. An accurate and clear understanding of this relationship is crucial to optimising the development and use of medical FMs. Our findings, along with the proposed pipeline for key metadata identification, provide practical guidance for pre-training data selection, both within individual sites, where data are often skewed by local population characteristics or specific study designs, and among global stakeholders collaborating to aggregate multi-site data, to advance medical FM development for healthcare applications.

Methods

Source for pre-training data

The Moorfields Eye Hospital (MEH) cohort was sourced from AlzEye [59], a retrospective cohort study linking ophthalmic data from 353,157 participants, who attended MEH between 2008 and 2018, with systemic health data from hospital admissions across the whole of England.
The ethnicity groups are reported according to the ethnicity grouping of the UK Office for National Statistics. The Shanghai Diabetes Prevention Program (SDPP) cohort was drawn from a community-based longitudinal study of 79,284 participants who underwent physical examinations at Huadong Sanatorium and Shanghai Sixth People’s Hospital between December 2015 and November 2022. We randomly sampled 904,170 retinal fundus photographs from each database for FM pre-training. The corresponding data characteristics are listed in Fig. 2. The retinal fundus photographs have a normal field of view (< 60°), i.e. no ultra-widefield fundus images were used.

Data for downstream tasks

We evaluated foundation model performance on clinically relevant applications using data from Moorfields Eye Hospital (MEH), UK, SDPP, China, and publicly available datasets. First, we organised ocular disease detection tasks, including diabetic retinopathy and diabetic macular oedema detection, using MEH and SDPP data that were held out from the FM pre-training data at the patient level. There was no overlap of patients between pre-training and downstream data. We curated 2,000 images with labels of diabetic retinopathy and macular oedema from 2,000 participants. The labels for diabetic retinopathy are based on the International Clinical Diabetic Retinopathy Severity scale [60], indicating five stages from no diabetic retinopathy to proliferative diabetic retinopathy. The 2,000 images were evenly distributed over the five categories. The labels for diabetic macular oedema included three categories: no diabetic macular oedema, non-clinically significant diabetic macular oedema, and clinically significant diabetic macular oedema [61, 62]. For MEH data, the labels were obtained from clinical practice records. For SDPP data, two independent ophthalmologists annotated the disease labels, with disagreements adjudicated by a consultant-level ophthalmologist.
Second, we curated the task of ischaemic stroke prediction using MEH and SDPP data. The stroke labels are binary: no stroke event within three years of imaging, or a stroke event within three years. For SDPP data, stroke labels were obtained from digital hospital records and self-report records during longitudinal visits between December 2015 and November 2022. For MEH data, systemic health data were derived from Hospital Episode Statistics (HES) data relating to admitted patient care (inpatient records). Diagnostic codes in HES admitted patient care were reported according to the tenth revision of the ICD (International Statistical Classification of Diseases) [63]. ICD codes for stroke (I63-I64) were used in line with previous reports. The stroke data from MEH included 2,526 images, with 1,263 images in each category, while the SDPP data included 2,000 images, with 1,000 images in each category. More details are listed in Supplementary Table 2.

As in the RETFound study [6], we organised six ocular disease detection tasks with publicly available datasets. For diabetic retinopathy diagnosis, Kaggle APTOS2019 (India), IDRiD (India) [64] and MESSIDOR2 (France) [65] were used, with the labels defined by the International Clinical Diabetic Retinopathy Severity scale. For glaucoma, Glaucoma Fundus (South Korea) [66] was included, with three categorical labels: non-glaucoma, early glaucoma (suspected glaucoma) and advanced glaucoma. For datasets covering several diseases, JSIEC (China) [67] and Retina were included. JSIEC included 1,000 images with 39 categories of common referable fundus diseases and conditions. Retina had labels of normal, glaucoma, cataract and retina disease.
The grading protocols for the public datasets are summarised as follows: IDRiD, two medical experts provided adjudicated consensus grades; MESSIDOR2, adjudicated by a panel of three retina specialists in accordance with a published protocol; APTOS2019, a Kaggle dataset with limited information but possibly a single clinician grader; Glaucoma Fundus, agreement of two specialists based on visual fields and extensive imaging; JSIEC, labelled by ophthalmologists and confirmed by senior retina specialists, with disagreements resolved by a panel of five senior retina specialists; Retina, details not available. The details of the datasets, such as imaging devices, country and label category, are listed in Supplementary Table 1.

Data processing for self-supervised learning

We used AutoMorph [68], an automated retinal image analysis tool, to exclude the background and keep the retinal area. All images were resized to 256 × 256 with cubic interpolation. We followed the default data augmentation settings of Masked Autoencoder and DINOV2. For pre-training with Masked Autoencoder, we applied a random crop (lower bound 20% of the whole image, upper bound 100%) with the cropped patches resized to 224 × 224, random horizontal flipping and image normalisation. For DINOV2, the global patch augmentation included a random crop (lower bound 32% of the whole image, upper bound 100%) with the cropped patches resized to 224 × 224, random horizontal flipping, colour jittering (brightness 0.4, contrast 0.4, saturation 0.2, and hue 0.1), followed by either Gaussian blur alone or Gaussian blur with random image solarising (threshold 128, probability 20%). The local patch augmentation included a random crop (lower bound 5% of the whole image, upper bound 32%) with the cropped patches resized to 96 × 96, random horizontal flipping, colour jittering, and random Gaussian blur (probability 50%). All augmented patches were normalised.
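As a rough illustration of the three crop regimes described above, the following minimal Python sketch samples crop boxes whose area fractions and output sizes follow the stated bounds. The names `CropSpec` and `sample_crop_box` are hypothetical, and the log-uniform aspect-ratio range of 3/4 to 4/3 is an assumption borrowed from common random-resized-crop implementations; the text only states that default settings were followed. It is not the authors' pipeline and omits flipping, colour jittering, blur, solarising and normalisation.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CropSpec:
    min_area: float  # lower bound of the crop, as a fraction of image area
    max_area: float  # upper bound of the crop, as a fraction of image area
    out_size: int    # side length (pixels) the crop is resized to afterwards


# Area bounds and output sizes as stated in the text.
MAE_CROP = CropSpec(0.20, 1.00, 224)
DINO_GLOBAL_CROP = CropSpec(0.32, 1.00, 224)
DINO_LOCAL_CROP = CropSpec(0.05, 0.32, 96)


def sample_crop_box(width, height, spec, rng, ratio=(3 / 4, 4 / 3), tries=10):
    """Return (left, top, crop_w, crop_h) for one random crop.

    Assumption: the aspect ratio is drawn log-uniformly from `ratio`;
    the paper does not state this detail.
    """
    area = width * height
    for _ in range(tries):
        target = area * rng.uniform(spec.min_area, spec.max_area)
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        crop_w = round(math.sqrt(target * aspect))
        crop_h = round(math.sqrt(target / aspect))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Fallback: a centred square crop at the upper area bound.
    side = min(width, height, int(math.sqrt(area * spec.max_area)))
    return (width - side) // 2, (height - side) // 2, side, side


if __name__ == "__main__":
    rng = random.Random(0)
    for name, spec in [("MAE", MAE_CROP),
                       ("DINOV2 global", DINO_GLOBAL_CROP),
                       ("DINOV2 local", DINO_LOCAL_CROP)]:
        box = sample_crop_box(256, 256, spec, rng)
        print(name, box, "-> resized to", (spec.out_size, spec.out_size))
```

The narrow local crops (5% to 32% of the image) give the student network small, detail-level views, while the wider global crops keep most of the retina in frame.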
We also measured the image quality and morphological indices with AutoMorph.

Foundation model implementations

For FM pre-training, we selected two representative self-supervised learning strategies, Masked Autoencoder and DINOV2. Both have been widely used across various domains including medical applications, and have demonstrated state-of-the-art performance in disease diagnosis [6, 14, 16, 69]. We used a specific configuration of Masked Autoencoder comprising an encoder and a decoder. The encoder was a large vision Transformer (ViT-large) with 24 Transformer blocks and an embedding vector size of 1024, while the decoder was a small vision Transformer (ViT-small) with eight Transformer blocks and an embedding vector size of 512. The encoder took unmasked patches (with a patch size of 16 × 16) as input and projected them into feature vectors of size 1024. These feature vectors passed through the 24 Transformer blocks, each consisting of multi-headed self-attention and a multilayer perceptron, to generate high-level features. The decoder reconstructed the image by inserting masked placeholder patches into the extracted high-level features and then projecting them back to image patches through a linear projection layer. During model pre-training, the objective was to reconstruct retinal images from a highly masked version, with a mask ratio of 0.75. The pre-training batch size was 1792 (4 GPUs × 448 per GPU). Pre-training ran for 800 epochs, of which the first 15 were used for learning rate warm-up (from 0 to a learning rate of 1 × 10⁻³). The model weights at the final epoch were saved as the checkpoint for adapting to downstream tasks.

We configured DINOV2 with both teacher and student networks as ViT-large, with 24 Transformer blocks and an embedding vector size of 1024. It included a three-layer perceptron projection head with dimensions 2048, 384, and 131,072. The patch size was 14 × 14.
The teacher network processed the global patches while the student network processed both global and local patches. During model pre-training, the objective combined the original objectives of DINO [54] and iBOT [55]. The DINO part calculated the cross-entropy loss between the categorical (class) tokens from the teacher network and the student network, while the iBOT part calculated the cross-entropy loss between the masked patch tokens of the two networks (the maximum number of masked patches was 128). The pre-training batch size was 320 (4 GPUs × 80 per GPU). Pre-training ran for 100 epochs: the first 10 epochs were used for learning rate warm-up (from 1 × 10⁻⁶ to a learning rate of 2 × 10⁻⁴) and the remaining 90 epochs followed a cosine annealing schedule. The model weights at the final epoch were saved as the checkpoint for adapting to downstream tasks.

Adaptation to downstream tasks

When adapting foundation models pre-trained with Masked Autoencoder to downstream tasks, we kept only the encoder (ViT-large) of the foundation model and discarded the decoder. For foundation models pre-trained with DINOV2, the teacher network was used and adapted to downstream tasks. Both the encoder and teacher networks extracted high-level features from retinal images. A fully connected layer took these features as input and output the probability distribution over the disease categories. The category with the highest probability was selected as the final classification. The number of categories determined the number of neurons in the fully connected layer. We used two adaptation strategies, fine-tuning and the linear probe. Fine-tuning updated both the encoder and the fully connected layer using the downstream data, while the linear probe updated only the fully connected layer. The schematic diagram is shown in Extended Data Fig. 2. The training objective was to predict the same categorical output as the label. The batch size was set to 16, and the model was trained for 50 epochs.
The first 10 epochs followed a learning rate warm-up schedule, increasing linearly from 0 to a learning rate of 5 × 10⁻⁴. This was followed by a cosine annealing schedule, where the learning rate gradually decreased from 5 × 10⁻⁴ to 1 × 10⁻⁶ over the remaining 40 epochs. After each training epoch, the model performance was evaluated on the validation set. The model checkpoint with the highest AUROC on the validation set was saved for subsequent internal and external evaluations.

Computational resources

Four NVIDIA Tesla A100 (80 GB) GPUs were used for self-supervised pre-training in this project. Pre-training with either DINOV2 or Masked Autoencoder took about 16 days. We used equal computational budgets for the MEH and SDPP foundation models. For fine-tuning and linear probing of the foundation models on downstream tasks, we used an NVIDIA Tesla T4 (16 GB). Fine-tuning took about 70 minutes for every 1,000 images, while linear probing took around 15 minutes.

Evaluation and statistical analysis

All task performance was assessed using the classification metrics AUROC and AUPRC. For ischaemic stroke prediction tasks, the AUROC and AUPRC were calculated in a binary setting. For multiclass classification, such as five-stage diabetic retinopathy and multicategory disease diagnosis, AUROC and AUPRC were calculated separately for each class and then averaged to obtain the overall AUROC and AUPRC scores. For each task, we fine-tuned the model with five different random seeds, which determined implementation details including the shuffling of fine-tuning data and data augmentation. The mean and standard deviation of the performance across the five runs were computed. The standard error was estimated as (standard deviation / √5), and the 95% confidence interval (CI) was obtained by multiplying the standard error by 1.96. The normality of the model performance was checked with the Shapiro-Wilk test. Statistical significance was calculated using two-sided t-tests.
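Two parts of the procedure above reduce to short formulas: the fine-tuning learning-rate schedule (linear warm-up, then cosine annealing) and the summary of five seeded runs (mean, standard deviation, standard error, 95% CI half-width). The sketch below is a minimal stdlib-only illustration, not the authors' code. The function names `finetune_lr` and `summarise_runs` are hypothetical, the schedule is evaluated per epoch (real training loops often step per iteration), the sample standard deviation is an assumption because the text does not say which estimator was used, and the example scores are made up.

```python
import math
import statistics


def finetune_lr(epoch, peak=5e-4, floor=1e-6, warmup=10, total=50):
    """Learning rate at a (possibly fractional) epoch in [0, total].

    Linear warm-up from 0 to `peak` over `warmup` epochs, then cosine
    annealing from `peak` down to `floor` over the remaining epochs.
    """
    if epoch < warmup:
        return peak * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def summarise_runs(scores, z=1.96):
    """Return (mean, sd, se, ci_half_width) for one metric across seeds.

    Assumption: `sd` is the sample standard deviation (n - 1 denominator).
    """
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    se = sd / math.sqrt(n)   # standard deviation / sqrt(5) for five seeds
    return mean, sd, se, z * se


if __name__ == "__main__":
    for epoch in (0, 5, 10, 30, 50):
        print(f"epoch {epoch:2d}: lr = {finetune_lr(epoch):.2e}")

    # Hypothetical AUROC values from five seeds, for illustration only.
    aurocs = [0.80, 0.82, 0.84, 0.86, 0.88]
    mean, sd, se, half = summarise_runs(aurocs)
    print(f"AUROC {mean:.3f} (95% CI {mean - half:.3f} to {mean + half:.3f})")
```

With only five runs per task, the 1.96 multiplier assumes approximate normality of the run-level scores, which is the property the Shapiro-Wilk check in the text addresses.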
Declarations

Ethics statement

This study involves human participants and was approved by the London-Central Research Ethics Committee (18/LO/1163, approved 1 August 2018), Advanced statistical modelling of multimodal data of genetic and acquired retinal diseases (20/HRA/2158, approved 5 May 2020), Confidential Advisory Group for Section 251 support (18/CAG/0111, approved 13 September 2018), and the Ethics Committee of Shanghai Sixth People’s Hospital (Approved No: 2019-087, approved 29 August 2019). The National Health Service Health Research Authority gave final approval on 13 September 2018. Moorfields Eye Hospital NHS Foundation Trust validated the de-identification of the MEH data. Only de-identified retrospective data were used for research.

Data availability

The MEH data consist of routinely collected healthcare data. Owing to their sensitive nature, the dataset is subject to controlled access by means of a structured application process. The AlzEye dataset is subject to the contractual restrictions of the data sharing agreements between National Health Service Digital, Moorfields Eye Hospital and University College London, and is not available for access beyond the AlzEye research team. National and international collaborations are welcomed, although restrictions on access to the cohort mean that only the AlzEye researchers can directly analyse individual-level systemic health data. Interested collaborators should contact the chief investigator P.A.K. For SDPP data, individual-level patient data are not publicly available but can be accessed with the consent of the institutions' data management committee. Requests for the non-profit use of the fundus images and related clinical information should be sent to T.Y.W. The data management committee will then review all requests and grant access to those that are successful. A formal data transfer agreement will be required upon approval.
Generally, all requests for access to the data will be answered within 1 month. Data for ocular disease experiments are publicly available online and can be accessed through the following links: IDRiD (https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid), MESSIDOR2 (https://www.adcis.net/en/third-party/messidor2/), APTOS2019 (https://www.kaggle.com/competitions/aptos2019-blindness-detection/data), Glaucoma Fundus (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1YRRAC), JSIEC (https://zenodo.org/record/3477553), and Retina (https://www.kaggle.com/datasets/jr2ngb/cataractdataset).

Code availability

The code used to train, fine-tune and evaluate RETFound from Y.Z. is available at https://github.com/rmaphoh/RETFound_MAE, which is based on PyTorch. All pre-trained model weights are available at https://huggingface.co/YukunZhou. Images were processed with the automated retinal image analysis tool AutoMorph v.1.0 (https://github.com/rmaphoh/AutoMorph). Results were further analysed and visualised with Python v.3.11.0, NumPy v.1.26.4, SciPy v.1.15.2, Matplotlib v.3.8.4, pandas v.1.5.0, Scikit-Learn v.1.4.2 and Pillow v.10.2.0.

References

1. Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616: 259–265.
2. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al. On the Opportunities and Risks of Foundation Models. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2108.07258
3. Zhang S, Metaxas D. On the challenges and perspectives of foundation models for medical image analysis. Med Image Anal. 2024;91: 102996.
4. Chia MA, Zhou Y, Keane PA. A new foundation model for multimodal ophthalmic images: Advancing disease detection and prediction. NEJM AI. 2024;1. doi:10.1056/aie2401024
5. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al.
Toward expert-level medical question answering with large language models. Nat Med. 2025. doi:10.1038/s41591-024-03423-7
6. Zhou Y, Chia MA, Wagner SK, Ayhan MS, Williamson DJ, Struyven RR, et al. A foundation model for generalizable disease detection from retinal images. Nature. 2023;622: 156–163.
7. Li J, Guan Z, Wang J, Cheung CY, Zheng Y, Lim L-L, et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med. 2024. doi:10.1038/s41591-024-03139-8
8. Wang M, Lin T, Lin A, Yu K, Peng Y, Wang L, et al. Common and rare fundus diseases identification using vision-language foundation model with knowledge of over 400 diseases. arXiv [eess.IV]. 2024. Available: http://arxiv.org/abs/2406.09317
9. Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng. 2022. doi:10.1038/s41551-022-00936-9
10. Pai S, Bontempi D, Hadzic I, Prudente V, Sokač M, Chaunzwa TL, et al. Foundation model for cancer imaging biomarkers. Nat Mach Intell. 2024;6: 354–367.
11. Tanno R, Barrett DGT, Sellergren A, Ghaisas S, Dathathri S, See A, et al. Collaboration between clinicians and vision-language models in radiology report generation. Nat Med. 2024. doi:10.1038/s41591-024-03302-1
12. Huang Z, Bianchi F, Yuksekgonul M, Montine TJ, Zou J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat Med. 2023;29: 2307–2316.
13. Lu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T, et al. A visual-language foundation model for computational pathology. Nat Med. 2024;30: 863–874.
14. Xu H, Usuyama N, Bagga J, Zhang S, Rao R, Naumann T, et al. A whole-slide foundation model for digital pathology from real-world data. Nature. 2024;630: 181–188.
15. Wang X, Zhao J, Marostica E, Yuan W, Jin J, Zhang J, et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature. 2024.
doi:10.1038/s41586-024-07894-z
16. Vorontsov E, Bozkurt A, Casson A, Shaikovski G, Zelechowski M, Severson K, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat Med. 2024;30: 2924–2935.
17. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang P-C, et al. Towards Generalist Biomedical AI. NEJM AI. 2024;1: AIoa2300138.
18. Zhang K, Zhou R, Adhikarla E, Yan Z, Liu Y, Yu J, et al. A generalist vision-language foundation model for diverse biomedical tasks. Nat Med. 2024;30: 3129–3141.
19. Willemink MJ, Koszek WA, Hardell C, Wu J, Fleischmann D, Harvey H, et al. Preparing Medical Imaging Data for Machine Learning. Radiology. 2020;295: 4–15.
20. Ktena I, Wiles O, Albuquerque I, Rebuffi S-A, Tanno R, Roy AG, et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat Med. 2024;30: 1166–1173.
21. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521: 436–444.
22. Zhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomed Eng. 2022;6: 1330–1345.
23. Yang Y, Zhang H, Gichoya JW, Katabi D, Ghassemi M. The limits of fair medical imaging AI in real-world generalization. Nat Med. 2024. doi:10.1038/s41591-024-03113-4
24. Lin M, Li T, Yang Y, Holste G, Ding Y, Van Tassel SH, et al. Improving model fairness in image-based computer-aided diagnosis. Nat Commun. 2023;14: 6261.
25. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27: 2176–2182.
26. Hogg HDJ, Martindale APL, Liu X, Denniston AK. Clinical Evaluation of Artificial Intelligence-Enabled Interventions. Invest Ophthalmol Vis Sci. 2024;65: 10.
27. AI can be sexist and racist – it’s time to make it fair.
Available: https://www.nature.com/articles/d41586-018-05707-8
28. Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci U S A. 2020;117: 12592–12594.
29. Zhang L, Wang X, Yang D, Sanford T, Harmon S, Turkbey B, et al. Generalizing Deep Learning for Medical Image Segmentation to Unseen Domains via Deep Stacked Transformation. IEEE Trans Med Imaging. 2020;39: 2531–2540.
30. Frid-Adar M, Diamant I, Klang E, Amitai M, Goldberger J, Greenspan H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing. 2018;321: 321–331.
31. Han T, Nebelung S, Haarburger C, Horst N, Reinartz S, Merhof D, et al. Breaking medical data sharing boundaries by using synthesized radiographs. Sci Adv. 2020;6: eabb7973.
32. Krishnan R, Rajpurkar P, Topol EJ. Self-supervised learning in medicine and healthcare. Nat Biomed Eng. 2022;6: 1346–1352.
33. Doersch C, Gupta A, Efros AA. Unsupervised visual representation learning by context prediction. 2015 IEEE International Conference on Computer Vision (ICCV). IEEE; 2015. pp. 1422–1430.
34. Jing L, Tian Y. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Trans Pattern Anal Mach Intell. 2021;43: 4037–4058.
35. Bommasani R, Klyman K, Longpre S, Xiong B, Kapoor S, Maslej N, et al. Foundation Model Transparency Reports. arXiv [cs.LG]. 2024. Available: http://arxiv.org/abs/2402.16268
36. Alderman JE, Palmer J, Laws E, McCradden MD, Ordish J, Ghassemi M, et al. Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations. Lancet Digit Health. 2025;7: e64–e88.
37. van der Maaten L, Hinton GE. Visualizing Data using t-SNE. Journal of Machine Learning Research.
2008;9: 2579–2605.
38. Wong TY, Mitchell P. Hypertensive retinopathy. N Engl J Med. 2004;351: 2310–2317.
39. Günthner R, Hanssen H, Hauser C, Angermann S, Lorenz G, Kemmner S, et al. Impaired Retinal Vessel Dilation Predicts Mortality in End-Stage Renal Disease. Circ Res. 2019. doi:10.1161/CIRCRESAHA.118.314318
40. Seidelmann SB, Claggett B, Bravo PE, Gupta A, Farhad H, Klein BE, et al. Retinal vessel calibers in predicting long-term cardiovascular outcomes: The Atherosclerosis Risk in Communities Study. Circulation. 2016;134: 1328–1338.
41. Ko F, Muthy ZA, Gallacher J, Sudlow C, Rees G, Yang Q, et al. Association of Retinal Nerve Fiber Layer Thinning With Current and Future Cognitive Decline: A Study Using Optical Coherence Tomography. JAMA Neurol. 2018;75: 1198–1205.
42. Wagner SK, Romero-Bascones D, Cortina-Borja M, Williamson DJ, Struyven RR, Zhou Y, et al. Retinal optical coherence tomography features associated with incident and prevalent Parkinson disease. Neurology. 2023;101: e1581–e1593.
43. Cheung CY-L, Ong YT, Ikram MK, Ong SY, Li X, Hilal S, et al. Microvascular network alterations in the retina of patients with Alzheimer’s disease. Alzheimers Dement. 2014;10: 135–142.
44. Cheung CY-L, Ikram MK, Chen C, Wong TY. Imaging retina to study dementia and stroke. Prog Retin Eye Res. 2017;57: 89–107.
45. He K, Chen X, Xie S, Li Y, Dollár P, Girshick RB. Masked Autoencoders Are Scalable Vision Learners. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2021; 15979–15988.
46. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, et al. DINOv2: Learning robust visual features without supervision. arXiv [cs.CV]. 2023. Available: http://arxiv.org/abs/2304.07193
47. Qiu J, Wu J, Wei H, Shi P, Zhang M, Sun Y, et al. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI. 2024;1. doi:10.1056/aioa2300221
48. Shi D, Zhou Y, He S, Wagner SK, Huang Y, Keane PA, et al.
Cross-modality Labeling Enables Noninvasive Capillary Quantification as a Sensitive Biomarker for Assessing Cardiovascular Risk. Ophthalmol Sci. 2024;4: 100441.
49. Wagner SK, Cortina-Borja M, Silverstein SM, Zhou Y, Romero-Bascones D, Struyven RR, et al. Association Between Retinal Features From Multimodal Imaging and Schizophrenia. JAMA Psychiatry. 2023;80: 478–487.
50. Zhao S, Song J, Ermon S. Learning hierarchical features from deep generative models. Precup D, Teh YW, editors. ICML. 2017;70: 4091–4099.
51. Vahdat A, Kreis K, Kautz J. Score-based generative modeling in latent space. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW, editors. Neural Inf Process Syst. 2021;34: 11287–11302.
52. Mancini M, Porzi L, Bulo SR, Caputo B, Ricci E. Boosting domain adaptation by discovering latent domains. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2018. pp. 3771–3780.
53. Pan SJ, Tsang IW, Kwok JT, Yang Q. Domain adaptation via transfer component analysis. IEEE Trans Neural Netw. 2011;22: 199–210.
54. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, et al. Emerging properties in self-supervised vision transformers. arXiv [cs.CV]. 2021. pp. 9650–9660. Available: http://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.html
55. Zhou J, Wei C, Wang H, Shen W, Xie C, Yuille A, et al. iBOT: Image BERT Pre-Training with Online Tokenizer. arXiv [cs.CV]. 2021. Available: http://arxiv.org/abs/2111.07832
56. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A. Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst. 2020;33: 9912–9924.
57. Xiao L, Pennington J. Synergy and symmetry in deep learning: Interactions between the data, model, and inference algorithm. arXiv [cs.LG]. 2022. Available: http://arxiv.org/abs/2207.04612
58. Qin Z, Chen D, Zhang W, Yao L, Huang Y, Ding B, et al.
The synergy between data and multi-modal large language models: A survey from co-development perspective. arXiv [cs.AI]. 2024. Available: http://arxiv.org/abs/2407.08583
59. Wagner S, Hughes F, Cortina-Borja M, Pontikos N, Struyven R, Liu X, et al. AlzEye: longitudinal record-level linkage of ophthalmic imaging and hospital admissions of 353 157 patients in London, UK. BMJ Open. 2022;12. doi:10.1136/bmjopen-2021-058552
60. Wilkinson CP, Ferris FL 3rd, Klein RE, Lee PP, Agardh CD, Davis M, et al. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology. 2003;110: 1677–1682.
61. Ciulla T, Amador A, Zinman B. Diabetic retinopathy and diabetic macular edema: pathophysiology, screening, and novel therapies. Diabetes Care. 2003;26: 2653–2664.
62. Early Treatment Diabetic Retinopathy Study Research Group. Photocoagulation for diabetic macular edema. Early Treatment Diabetic Retinopathy Study report number 1. Arch Ophthalmol. 1985;103: 1796–1806.
63. World Health Organization. International Statistical Classification of Diseases and Related Health Problems: Alphabetical index. World Health Organization; 2004.
64. Porwal P, Pachade S, Kokare M, Deshmukh G, Son J, Bae W, et al. IDRiD: Diabetic retinopathy segmentation and grading challenge. Med Image Anal. 2020;59: 101561.
65. Abràmoff MD, Folk JC, Han DP, Walker JD, Williams DF, Russell SR, et al. Automated Analysis of Retinal Images for Detection of Referable Diabetic Retinopathy. JAMA Ophthalmol. 2013;131: 351–357.
66. Ahn JM, Kim S, Ahn K-S, Cho S-H, Lee KB, Kim US. Correction: A deep learning model for the detection of both advanced and early glaucoma using fundus photography. PLoS One. 2019;14: e0211579.
67. Cen L-P, Ji J, Lin J-W, Ju S-T, Lin H-J, Li T-P, et al. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nat Commun. 2021;12: 4828.
68. Zhou Y, Wagner SK, Chia MA, Zhao A, Woodward-Court P, Xu M, et al. AutoMorph: Automated Retinal Vascular Morphology Quantification Via a Deep Learning Pipeline. Transl Vis Sci Technol. 2022;11: 12.
69. Boers TGW, Fockens KN, van der Putten JA, Jaspers TJM, Kusters CHJ, Jukema JB, et al. Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency. Med Image Anal. 2024;98: 103298.

Additional Declarations

Competing interests: Professor Pearse Keane has acted as a consultant for DeepMind, Roche, Novartis, Apellis, and BitFount and is an equity owner in Big Picture Medical. He has received speaker fees from Heidelberg Engineering, Topcon, Allergan, and Bayer.

Supplementary Files

ExtendedDataFiguresandTables.docx
SupplementaryTable.xlsx (Supplementary Table 1-5)
RS.pdf (Reporting Summary)
Authors
Yukun Zhou (corresponding author), University College London. ORCID: https://orcid.org/0000-0002-0840-6422
Zheyuan Wang, MoE Key Lab of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai, China.
Yilan Wu, Tsinghua University. ORCID: https://orcid.org/0000-0003-0493-9958
Ariel Yuhan Ong, University College London. ORCID: https://orcid.org/0000-0001-9300-573X
Siegfried Wagner, University College London Institute of Ophthalmology, University College London. ORCID: https://orcid.org/0000-0003-4915-4353
Eden Ruffell, University College London. ORCID: https://orcid.org/0009-0006-2403-2199
Mark Chia, The Royal Victorian Eye and Ear Hospital
Zhouyu Guan, Shanghai Jiao Tong University School of Medicine. ORCID: https://orcid.org/0009-0008-5102-0067
Lie Ju, University College London
Justin Engelmann, University College London
David Merle, University College London
Tingyao Li, MoE Key Lab of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai, China.
Jia Shu, MoE Key Lab of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai, China.
Paul Nderitu, Moorfields Eye Hospital NHS Foundation Trust
Ke Zou, National University of Singapore
Jocelyn Hui Lin Goh, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore.
Qingshan Hou, National University of Singapore
XiaoXuan Liu, University of Birmingham. ORCID: https://orcid.org/0000-0002-1286-0038
Yaxing Wang, Tsinghua University
Yih Chung Tham, Centre for Innovation and Precision Eye Health. ORCID: https://orcid.org/0000-0002-6752-797X
Andre Altmann, University College London. ORCID: https://orcid.org/0000-0002-9265-2393
Carol Cheung, The Chinese University of Hong Kong. ORCID: https://orcid.org/0000-0002-9672-1819
Daniel Alexander, Centre for Medical Image Computing, Department of Computer Science, University College London. ORCID: https://orcid.org/0000-0003-2439-350X
Eric Topol, Executive VP, Scripps Research Professor, Molecular Medicine, Scripps Research Director & Founder, Scripps Research Translational Institute Department of Molecular Medicine. ORCID: https://orcid.org/0000-0002-1478-4729
Alastair Denniston, University of Birmingham. ORCID: https://orcid.org/0000-0001-7849-0087
Tien Yin Wong, Tsinghua University. ORCID: https://orcid.org/0000-0002-8448-1264
Bin Sheng, Shanghai Jiao Tong University. ORCID: https://orcid.org/0000-0001-8510-2556
Pearse A. Keane, NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, University College London. ORCID: https://orcid.org/0000-0002-9239-745X
Preprint DOI: https://doi.org/10.21203/rs.3.rs-6080254/v1 (created 2025-02-21). Published version: https://doi.org/10.1038/s41467-026-70077-z (2026-02-28).
Figure legends
Figure 1. Schematic of the study. We investigated the impact of pre-training data characteristics on medical foundation models. Foundation models were pre-trained respectively with data from Moorfields Eye Hospital (FM-MEH) and Shanghai Diabetes Prevention Program (FM-SDPP) and are adapted to downstream tasks for disease detection and prediction. The data characteristics were described in terms of clinical and imaging metadata, clinically meaningful morphological indices, and latent features (representative features encoded by models). Subgroup analysis was performed to evaluate FM fairness over age, sex, and ethnicity.
Figure 2. Quantification of data characteristics from Moorfields Eye Hospital (MEH) and Shanghai Diabetes Prevention Program (SDPP). All values and ratios were calculated over sampled images. a, the distribution of metadata including age, sex, ethnicity, and imaging devices. b, t-SNE visualisation for MEH and SDPP data (5000 randomly sampled data points), respectively with features extracted by foundation models developed in each site (FM-MEH and FM-SDPP). c, the image quality distribution and morphological indices of MEH and SDPP data (5000 randomly sampled data points), obtained with AutoMorph. These demonstrate the distinct distribution of data from Moorfields Eye Hospital and SDPP in multifaceted and complementary views.
Figure 3. AUROC performance of FM-MEH and FM-SDPP on downstream tasks using data from each site. Subgraphs a and b show the model performance on tasks at the SDPP site, with FMs respectively pre-trained with Masked Autoencoder and DINOV2. Subgraphs c and d present the performance of FMs on tasks at the MEH site. In SDPP downstream tasks, FM-SDPP significantly outperformed FM-MEH on 3 out of 12 evaluations, while FM-MEH achieved superior performance on 4 evaluations when adapted to MEH downstream tasks. FMs pre-trained with Masked Autoencoder have more cases with significant performance differences (bolded p-value) compared to DINOV2. For each task, models were fine-tuned with five different random seeds, controlling the shuffling of fine-tuning data, and evaluated on the test set to generate five replicates. The mean AUROC values are represented by bar centres, with error bars indicating 95% confidence intervals (CI). A two-sided t-test was used to assess whether the performance differences between FM-SDPP and FM-MEH were statistically significant, with p-values listed in the figure. Bolded p-values indicate significant differences (p<0.05). n indicates the number of cases showing significant differences.
Figure 4. AUROC performance of FM-MEH and FM-SDPP on downstream tasks using publicly available datasets sourced from multiple countries. Subgraphs a and b show the performance of FMs pre-trained respectively with Masked Autoencoder and DINOV2 when fine-tuned to downstream tasks. Subgraphs c and d present the performance of FMs when adapted to downstream tasks with the linear probe. When pre-trained with Masked Autoencoder, FM-SDPP significantly outperformed FM-MEH on four downstream evaluations. When pre-trained with DINOV2, FM-MEH significantly outperformed FM-SDPP on three evaluations. For each task, models were fine-tuned with five different random seeds, controlling the shuffling of fine-tuning data, and evaluated on the test set to generate five replicates. The mean AUROC values are represented by bar centres, with error bars indicating 95% CI. A two-sided t-test was used to assess whether the performance differences between FM-SDPP and FM-MEH were statistically significant, with p-values listed in the figure. Bolded p-values indicate significant differences (p<0.05). n indicates the number of cases showing significant differences.
Figure 5. Identifying key metadata showing strong variations in pre-training data. a illustrates the pipeline that splits data into subgroups based on metadata and quantifies their latent features and morphological indices. It identifies key metadata that shows clear clustering of latent features and significant differences in morphological indices across subgroups. Using age as an example, b shows t-SNE visualisation with latent features across age subgroups, while c shows t-SNE visualisation after eliminating confounding effects by specifying sex and ethnicity (e.g. Female, Asian or Asian British). The age subgroups show clear clusterings. d demonstrates the distribution density of clinically meaningful morphological indices over age subgroups, while f shows the distribution density after specifying sex and ethnicity. A Kruskal-Wallis H-test was conducted to assess statistical significance, with p-values listed in the figure. Both artery and vein fractal dimensions show significant differences. These indicate that age is a key metadata that demonstrates strong data variations.
Figure 6. Performance differences between FM-MEH and FM-SDPP on diabetic retinopathy and diabetic macular oedema detection across age (a), ethnicity (b), and sex (c) subgroups. Yellow bars indicate that FM-MEH outperforms FM-SDPP, while grey bars indicate the opposite. The left two columns include the results for FMs pre-trained with Masked Autoencoder, while the right two columns show results for FMs pre-trained with DINOV2. Each section includes the results with fine-tuning and the linear probe. We observed consistent and significant performance gaps between FM-MEH and FM-SDPP in the age section, but not in the sex and ethnicity sections. The bar centres represent the mean AUROC differences, with error bars indicating 95% confidence intervals (CI). A two-sided t-test was conducted to assess statistical significance, with asterisks denoting significant differences (p<0.05).
When the same FMs were adapted to downstream tasks with the linear probe (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec), FM-MEH performed significantly better on the Glaucoma Fundus dataset (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). When using DINOV2 for FM pre-training (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb and Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed), FM-MEH significantly outperformed FM-SDPP in three evaluations, i.e. fine-tuning to MESSIDOR2 (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and linear probing to IDRiD (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and Retina (p\u0026thinsp;=\u0026thinsp;0.013). AUPRC results are illustrated in Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e. All quantitative results are listed in Supplementary Table\u0026nbsp;3.\u003c/p\u003e\n\u003ch3\u003eIdentifying clinical metadata showing strong data variations\u003c/h3\u003e\n\u003cp\u003eWe subgrouped the randomly sampled MEH pre-training data based on clinical metadata (e.g. age, sex, and ethnicity) and characterised each subgroup in terms of latent features and morphological indices, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. Age was split into three subgroups: young (\u0026lt;\u0026thinsp;40 years), middle-aged (40\u0026ndash;70 years), and aged (\u0026gt;\u0026thinsp;70 years), based on the age distribution observed in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea. 
We observed clear clusters in t-SNE maps (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb, Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e) and significant distribution differences in morphological indices (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed). For instance, the middle-aged group (40\u0026ndash;70 years) had significantly higher artery fractal dimensions than the aged group (\u0026gt;\u0026thinsp;70 years). To control for potential confounding by sex and ethnicity, we restricted the analysis to a single subgroup (e.g. female, Asian or Asian British) and still observed distinct distribution differences across age subgroups in both latent features (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec) and morphological indices (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ee). This demonstrates that age subgroups exhibit clear morphological variations and that, even with self-supervised pre-training, medical FMs learned distinct latent features across the subgroups, which could bias FM fairness and generalisability.\u003c/p\u003e \u003cp\u003eFor ethnicity, we investigated three subgroups: White, Asian or Asian British, and Black or Black British. The t-SNE visualisations (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea, FMs trained with Masked Autoencoder) showed clear clustering only for the White cohort. When FMs were pre-trained with DINOV2 (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ec), latent features showed no distinct clustering for any ethnicity. 
Additionally, when controlling for confounders, there were no significant differences across ethnic subgroups in morphological indices (Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ef).\u003c/p\u003e \u003cp\u003eFor sex subgroups (i.e. female, male), we observed no distinct clusters in t-SNE visualisations, either before or after removing confounding variables (Extended Data Fig.\u0026nbsp;7). Only the vein fractal dimension still differed significantly across the sex subgroups after removing confounding variables (Extended Data Fig.\u0026nbsp;7f). These findings suggest that ethnicity and sex contributed limited observable variation in morphological indices compared to age. For these subgroups, FMs learned less distinguishable latent features in self-supervised pre-training, which are less likely to bias FM fairness and generalisability in downstream tasks.\u003c/p\u003e\n\u003ch3\u003eFM fairness and generalisability over clinical metadata\u003c/h3\u003e\n\u003cp\u003eWe conducted subgroup analyses to evaluate whether the distribution of clinical metadata introduced bias to FM fairness and generalisability across downstream tasks. We examined subgroup performance on MEH downstream tasks of diabetic retinopathy and diabetic macular oedema detection, as FM-SDPP and FM-MEH showed no significant differences in overall performance on these tasks (p\u0026thinsp;=\u0026thinsp;0.275 and p\u0026thinsp;=\u0026thinsp;0.313 respectively). The SDPP pre-training data is mainly distributed over the young (\u0026lt;\u0026thinsp;40 years) and middle-aged (40\u0026ndash;70 years) groups, with a young, middle-aged, and aged subgroup ratio of (0.279, 0.685, 0.036), while MEH data is skewed towards the middle-aged and aged (\u0026gt;\u0026thinsp;70 years) groups, with a ratio of (0.015, 0.467, 0.518). 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea, FM-SDPP performed consistently better than FM-MEH in the young group but worse in the aged group. For instance, in diabetic macular oedema detection using a linear probe, FM-SDPP outperformed FM-MEH by an average AUROC margin of 0.094 in the young group (p\u0026thinsp;=\u0026thinsp;0.014) while underperforming by 0.049 in the aged group (p\u0026thinsp;=\u0026thinsp;0.034). FM-MEH and FM-SDPP demonstrated similar performance in the middle-aged group. This consistent performance gap across applications indicates that the differing age distributions of the pre-training data biased the FMs, supporting age as key metadata that shows strong data variations and affects model fairness and generalisability.\u003c/p\u003e \u003cp\u003eFor ethnicity, despite being pre-trained exclusively on data from the Chinese cohort, FM-SDPP sometimes outperformed FM-MEH in White and Black subgroups on downstream tasks. For instance, FM-SDPP achieved an AUROC 0.12 higher than FM-MEH in the White subgroup for diabetic macular oedema detection (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eb, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). When adapted to diabetic macular oedema detection with the linear probe, FM-MEH pre-trained with DINOV2 significantly outperformed FM-SDPP in the Asian subgroup (p\u0026thinsp;=\u0026thinsp;0.009). The performance differences across ethnicity subgroups showed no consistent pattern and did not correlate with the ethnicity distribution of the pre-training data.\u003c/p\u003e \u003cp\u003eFor sex, FM-MEH and FM-SDPP showed varying performance across sex subgroups depending on the task and adaptation method (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ec). 
Although the FM-SDPP pre-training data has a less balanced sex distribution (36.8% female participants versus 52.8% in MEH data), FM-SDPP performed better in the female subgroup for certain tasks, such as diabetic macular oedema detection when pre-trained with DINOV2 and adapted with a linear probe (p\u0026thinsp;=\u0026thinsp;0.004). These results showed no correlation with the sex distribution in the pre-training data. The observations across ethnicity and sex subgroups support the view that, because these clinical metadata exhibit limited data variations, differences in the ethnicity and sex distributions of pre-training data are less likely to introduce biases in FM fairness and generalisability. All quantitative results and p-values are listed in Supplementary Table\u0026nbsp;4.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study investigates the impact of pre-training data on medical FM performance including generalisability and fairness, a critical yet underexplored area despite the rapid advancements in medical FM research. The study design is distinctive: we developed parallel medical FMs with identical implementations, differing only in their pre-training data, and evaluated their performance in extensive experiments. Our findings show that FMs demonstrate good generalisability, achieving comparable performance in 70.8% of evaluations on held-out site data and 66.7% of evaluations on publicly available datasets. However, FMs sometimes perform better on application data that aligns with their pre-training data, and key clinical metadata, such as age distribution, can introduce biases that affect FM fairness and generalisability. 
These results highlight the importance of transparently disclosing pre-training data characteristics and developing evidence-based approaches to data selection, offering practical guidance for improving the construction and application of medical FMs.\u003c/p\u003e \u003cp\u003eMedical FMs serve as robust base models for diverse applications, as evidenced by the strong generalisability of FM-MEH and FM-SDPP when adapted to downstream tasks across sites. In a majority of evaluations across MEH and SDPP sites, there were no significant differences in performance between intra-site and inter-site adaptations. For example, FM-MEH and FM-SDPP performed comparably in 17 out of 24 (70.8%) evaluations on SDPP and MEH downstream tasks (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e), showcasing great generalisability given the substantial differences between SDPP and MEH data (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Unlike prior studies that evaluated generalisability by comparing FMs to traditional application-specific models, our approach provided a more straightforward analysis by comparing the generalisation performance between intra-site (e.g. FMs pre-trained on SDPP data and adapted to SDPP downstream tasks) and inter-site (e.g. FMs pre-trained on MEH data and adapted to SDPP downstream tasks) adaptation. When adapted to publicly available datasets, FM-MEH and FM-SDPP performed comparably in 16 out of 24 (66.7%) evaluations (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Such a level of generalisability is rarely observed in traditional application-specific AI models, which typically perform significantly better in internal sites compared to external ones, even after fine-tuning or linear probing. 
The strong generalisability of medical FMs reinforces their potential as base models for adaptation to specific applications, such as disease diagnosis and prognosis.\u003c/p\u003e \u003cp\u003eUsers are encouraged to choose medical FMs pre-trained on data with similar distributions to the application data, considering that FMs are not yet perfectly generalisable and occasionally perform better on intra-site tasks. For instance, FM-MEH significantly outperformed FM-SDPP in 4 out of 12 evaluations on MEH downstream tasks, while FM-SDPP achieved superior performance in 3 out of 12 evaluations on SDPP tasks. This provides references to guide local deployment of medical FMs considering the increasing number of FMs in medical fields. Using ophthalmology as an example, RETFound [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] was primarily pre-trained on UK data while VisionFM [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e] was pre-trained mainly on data from China. To better facilitate model selection, it is essential to disclose pre-training data details, aligning with data transparency initiatives in medical AI, such as the STANDING Together recommendations [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. This is particularly important to medical AI, as clinical data from individual sites is often skewed due to local population and specific study design. Furthermore, the challenges posed by pre-training data limitations underscore the importance of creating a large, global database that unites and optimises resources from across the world. 
Our findings provide a real-world example demonstrating that, despite rapid advancements in medical FMs, global collaboration remains crucial for developing truly generalisable medical AI.\u003c/p\u003e \u003cp\u003eIncorporating multifaceted views of pre-training data enables a comprehensive assessment of data distribution, as well as the identification of key metadata that show strong variations. Clinical metadata, such as demographics, are widely used to quantify data characteristics but provide only a limited view relevant to FM development. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, demographic factors such as age, sex, and ethnicity differed significantly between MEH and SDPP datasets. However, only age subgroups significantly shifted the distribution of clinically meaningful morphological indices (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). These findings emphasise the need for a multifaceted approach to describing data distribution. Morphological indices quantify the clinically relevant variations influenced by multiple factors, including demographics, ocular disease phenotypes, and systemic conditions. They provide a clinically meaningful perspective on data distribution and have been extensively studied in clinical association research [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e, \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e, \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]. Meanwhile, latent features represent how data are perceived by models and are highly relevant to model performance across diverse applications. 
Prior machine learning research often depicted data distribution in latent feature space and regulated features for generative modelling [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e, \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e] and domain adaptation [\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e, \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e]. We demonstrate that morphological indices and latent features offer complementary descriptions of pre-training data and guide key metadata identification, providing a clear overview of data used for medical FM development. The proposed pipeline (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea) can be extended to other medical fields by adjusting the morphological indices or involving domain-specific indices, enabling a comprehensive description of data distribution and identification of key clinical metadata.\u003c/p\u003e \u003cp\u003eMetadata showing strong data variations are more likely to introduce biases in model fairness and generalisability in downstream tasks, requiring extra attention during data preparation and selection. Previous studies have explored how labelled data influences the performance of application-specific models. For instance, a recent study [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] demonstrated that fine-tuning data with uniformly sampled demographic attributes (e.g. age, sex, and ethnicity) improved model fairness in clinically relevant applications. However, few studies have investigated how FM pre-training data affect fairness and generalisability, largely due to the substantial resources and workload required in building parallel FMs for comparison. 
In our study, subgroup analysis revealed that the age distribution (the identified key metadata) introduced bias in model fairness across downstream tasks (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e), while similar biases were not observed for sex or ethnicity. This suggests that in retinal imaging, increasing diversity and balance of certain attributes, such as ethnicity and sex, does not necessarily enhance FM fairness (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e), while a balanced and wide-ranging age distribution in pre-training data contributes to improving the fairness and generalisability of retinal FMs. This highlights the necessity of identifying key metadata using an evidence-based approach and leveraging these insights to guide pre-training data selection, ultimately optimising the generalisability and fairness of medical FMs.\u003c/p\u003e \u003cp\u003eAdvancements in general AI techniques (e.g. self-supervised learning methods) continue to push the performance boundaries of medical AI. In our study, medical FMs pre-trained with different self-supervised learning strategies demonstrated varying performance and generalisability. We included representative self-supervised learning strategies, i.e. generative-based learning (Masked Autoencoder) and more recent contrastive-based learning (DINOV2). Our results showed that DINOV2 achieved superior performance in downstream tasks on each site (Supplementary Table\u0026nbsp;5). These observations extended to publicly available datasets, where FMs pre-trained with DINOV2 performed significantly better in over half of the evaluations (Supplementary Table\u0026nbsp;5). Additionally, on the site-specific tasks, FM-MEH and FM-SDPP pre-trained with DINOV2 showed significant differences in only one evaluation, compared to six for FMs pre-trained with Masked Autoencoder. 
This indicates that DINOV2 yielded comparable performance between intra-site and inter-site adaptations, suggesting strong generalisability. The superior performance of DINOV2 is likely attributable to a combination of pre-training strategies (i.e. DINO [\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e] and iBOT [\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e] learning image-level and patch-level features respectively), several practical tweaks (e.g. Sinkhorn-Knopp centring [\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e]), and generalisable features learnt from initial large-scale pre-training on 142\u0026nbsp;million natural images [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e]. Despite the benefits of translational research, many of the latest and most powerful general AI techniques remain proprietary (e.g. GPT) or have limited transferability (e.g. DeepSeek) with insufficient technical details. The long-term and sustainable advancement of medical AI requires the development of domain-specific techniques tailored to the unique characteristics of medical data and application scenarios.\u003c/p\u003e \u003cp\u003ePre-training clinical data and self-supervised learning strategies have a synergistic effect on FM performance. FM-MEH and FM-SDPP pre-trained using various self-supervised learning methods performed substantially differently when adapted to publicly available datasets. When pre-trained with Masked Autoencoder, FM-SDPP significantly outperformed FM-MEH in four out of twelve evaluations (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea and Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec). In contrast, FM-MEH pre-trained with DINOV2 generally outperformed FM-SDPP in disease diagnosis, with three cases showing significant differences. 
This suggests synergistic effects between SDPP data and Masked Autoencoder, as well as between MEH data and DINOV2. Although our study does not focus primarily on exploring the synergy between pre-training data and learning strategy, it provides real-world examples supporting the initiatives of seeing data and learning strategies as interconnected components [\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e, \u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e], which highlights the need to simultaneously optimise both model learning strategies and pre-training data characteristics to advance medical FM development.\u003c/p\u003e \u003cp\u003eAlthough this work systematically reveals the impact of pre-training data on FM performance using real-world clinical data, several limitations and challenges remain to be addressed in future research. First, although this work describes data distribution in a multifaceted view: clinical and imaging metadata, latent features, and morphological indices, future studies should include extra factors particularly concerning disease phenotypes. This is currently limited by the complexity of disease categories and severity, as well as challenges in precisely controlling disease phenotypes in large-scale pre-training data organisation. Second, due to the considerable workload involved in developing parallel FMs and organising diverse downstream tasks, this study primarily focused on representative self-supervised learning strategies, such as Masked Autoencoder and DINOV2, and used eye images as an exemplar. Further research involving a wider range of learning strategies and medical domains is needed. Third, due to the differences in the sources of labels (e.g. 
MEH diabetic retinopathy labels were extracted from clinical practice records; SDPP labels were annotated by two ophthalmologists with disagreements adjudicated by a consultant-level ophthalmologist), there are clear performance differences across various applications, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Although this does not bias the performance comparison between FM-MEH and FM-SDPP, a well-aligned labelling system would allow more standardised cross-validation. Building upon this study, future work could quantify the extent to which our findings can improve efficiency in medical FM development, such as quantifying saved data volume and computation resources for developing competitive medical FMs. Additionally, the key metadata of pre-training data can be identified and prioritised in batch data sampling for federated learning and in data synthesis by generative modelling.\u003c/p\u003e \u003cp\u003eIn conclusion, we unravel the impact of pre-training data on the performance of medical FMs, demonstrating that both AI equity and generalisability start at the foundations\u0026ndash;the pre-training data. Establishing an accurate and clear understanding of this knowledge is crucial to optimising the development and use of medical FMs. 
Our findings, along with the proposed pipeline for key metadata identification, provide practical guidance for pre-training data selection, both within individual sites where data is often skewed due to local population characteristics or specific study designs, and among global stakeholders collaborating to aggregate multi-site data, to advance medical FM development for healthcare applications.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eSource for pre-training data\u003c/h2\u003e \u003cp\u003eThe Moorfields Eye Hospital (MEH) cohort was sourced from AlzEye [\u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e], a retrospective cohort study linking ophthalmic data from 353,157 participants, who attended MEH between 2008 and 2018, with systemic health data from hospital admissions across the whole of England. The ethnicity groups are reported based on the ethnicity grouping by the UK Office for National Statistics. The Shanghai Diabetes Prevention Program (SDPP) cohort was drawn from a community-based longitudinal study of 79,284 participants who underwent physical examinations at Huadong Sanatorium and Shanghai Sixth People\u0026rsquo;s Hospital between December 2015 and November 2022. We randomly sampled 904,170 retinal fundus photographs from each database for FM pre-training. The corresponding data characteristics are listed in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. The retinal fundus photographs have a normal field of view (\u0026lt;\u0026thinsp;60\u0026deg;), i.e. 
no ultra-widefield fundus images were used.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eData for downstream tasks\u003c/h2\u003e \u003cp\u003eWe evaluated the foundation model performance on clinically relevant applications using the data from Moorfields Eye Hospital (MEH) UK, SDPP China, and publicly available datasets. First, we organised ocular disease detection tasks, including diabetic retinopathy and diabetic macular oedema detection, using MEH and SDPP data which were held out from FM pre-training data at the patient level. There was no overlap of patients between pre-training and downstream data. We curated 2000 images with labels of diabetic retinopathy and macular oedema from 2000 participants. The labels for diabetic retinopathy are based on the International Clinical Diabetic Retinopathy Severity scale [\u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e], indicating five stages from no diabetic retinopathy to proliferative diabetic retinopathy. The 2000 images were evenly distributed over the five categories. The labels for diabetic macular oedema included three categories: no diabetic oedema, non-clinically significant diabetic macular oedema, and clinically significant diabetic oedema [\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e, \u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e]. For MEH data, the labels were obtained from clinical practice records. For SDPP data, two independent ophthalmologists annotated the disease labels, with disagreements adjudicated by a consultant-level ophthalmologist. Second, we curated the task of ischaemic stroke prediction using MEH and SDPP data. The stroke labels include binary categories, i.e. no stroke event within three years from imaging or stroke event within three years. 
For SDPP data, stroke labels were obtained from digital hospital records and self-report records during longitudinal visits between December 2015 and November 2022. For MEH data, systemic health data were derived from Hospital Episode Statistics (HES) data relating to admitted patient care (inpatient records). Diagnostic codes in HES admitted patient care were reported according to the tenth revision of the ICD (International Statistical Classification of Diseases) [\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e]. ICD codes for stroke (I63-I64) were used in line with previous reports. The stroke data from MEH included 2526 images (1263 per category), while SDPP data included 2000 images (1000 per category). More details are listed in Supplementary Table\u0026nbsp;2.\u003c/p\u003e \u003cp\u003eSimilarly to the RETFound study [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], we organised six ocular disease detection tasks with publicly available datasets. For diabetic retinopathy diagnosis, Kaggle APTOS2019 (India), IDRiD (India) [\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e] and MESSIDOR2 (France) [\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e] were used, with the labels defined by the International Clinical Diabetic Retinopathy Severity scale. For glaucoma, Glaucoma Fundus (South Korea) [\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e66\u003c/span\u003e] was included, with three categorical labels: non-glaucoma, early glaucoma (suspected glaucoma) and advanced glaucoma. For datasets with several diseases, JSIEC (China) [\u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e] and Retina were included. JSIEC included 1,000 images with 39 categories of common referable fundus diseases and conditions. Retina had labels of normal, glaucoma, cataract and retina disease. 
The grading protocols for the public datasets can be summarised as follows. IDRiD: two medical experts provided adjudicated consensus grades. MESSIDOR2: adjudicated by a panel of three retina specialists in accordance with a published protocol. APTOS2019: a Kaggle dataset with limited information, possibly graded by a single clinician. Glaucoma Fundus: agreement of two specialists based on visual fields and extensive imaging. JSIEC: labelled by ophthalmologists and confirmed by senior retina specialists, with disagreements resolved by a panel of five senior retina specialists. Retina: details not available. The details of datasets, such as imaging devices, country and label category, are listed in Supplementary Table\u0026nbsp;1.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eData processing for self-supervised learning\u003c/h2\u003e \u003cp\u003eWe used AutoMorph [\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e], an automated retinal image analysis tool, to exclude the background and keep the retinal area. All images were resized to 256 \u0026times; 256 with cubic interpolation. We followed the default data augmentation settings of Masked Autoencoder and DINOV2. For pre-training with Masked Autoencoder, we applied random cropping (lower bound 20% of the whole image, upper bound 100%) with the cropped patches resized to 224 \u0026times; 224, random horizontal flipping and image normalisation. For DINOV2, the global patch augmentation included random cropping (lower bound 32% of the whole image, upper bound 100%) with the cropped patches resized to 224 \u0026times; 224, random horizontal flipping, colour jittering (brightness 0.4, contrast 0.4, saturation 0.2, and hue 0.1), followed by either Gaussian blur alone or Gaussian blur with random image solarising (threshold 128, probability 20%).
The local patch augmentation included random cropping (lower bound 5% of the whole image, upper bound 32%) with the cropped patches resized to 96 \u0026times; 96, random horizontal flipping, colour jittering, and random Gaussian blur (probability 50%). All augmented patches were normalised. We also measured the image quality and morphological indices with AutoMorph.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eFoundation model implementations\u003c/h2\u003e \u003cp\u003eFor FM pre-training, we selected two representative self-supervised learning strategies, Masked Autoencoder and DINOV2. Both have been widely used across various domains, including medical applications, and have demonstrated state-of-the-art performance in disease diagnosis [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e]. We used a specific configuration of Masked Autoencoder comprising an encoder and a decoder. The encoder was a large vision Transformer (ViT-large) with 24 Transformer blocks and an embedding vector size of 1024, while the decoder was a small vision Transformer (ViT-small) with eight Transformer blocks and an embedding vector size of 512. The encoder took unmasked patches (with a patch size of 16 \u0026times; 16) as input and projected them into feature vectors of size 1024. These feature vectors passed through the 24 Transformer blocks, each consisting of multi-headed self-attention and a multilayer perceptron, to generate high-level features. The decoder reconstructed the image by inserting placeholder tokens for the masked patches into the extracted high-level features and then projecting them back to image patches through a linear projection layer.
During model pre-training, the objective was to reconstruct retinal images from a highly masked version, with a mask ratio of 0.75. The pre-training batch size was 1792 (4 GPUs \u0026times; 448 per GPU). Pre-training ran for 800 epochs in total, with the first 15 epochs used for learning rate warm-up (from 0 to a learning rate of 1 \u0026times; 10\u003csup\u003e\u0026minus;3\u003c/sup\u003e). The model weights at the final epoch were saved as the checkpoint for adapting to downstream tasks.\u003c/p\u003e \u003cp\u003eWe specified DINOV2 with both teacher and student networks as ViT-large, with 24 Transformer blocks and an embedding vector size of 1024. It included a three-layer perceptron projection head with dimensions 2048, 384, and 131,072. The patch size was 14 \u0026times; 14. The teacher network processed the global patches while the student network processed both global and local patches. During model pre-training, the objective combined the original objectives of DINO [\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e] and iBOT [\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e]. The DINO part calculated the cross-entropy loss between the class tokens from the teacher network and the student network, while the iBOT part calculated the cross-entropy loss between the masked patch tokens of the two networks (the maximum number of masked patches was 128). The pre-training batch size was 320 (4 GPUs \u0026times; 80 per GPU). Pre-training ran for 100 epochs in total: the first 10 epochs were used for learning rate warm-up (from 1 \u0026times; 10\u003csup\u003e\u0026minus;6\u003c/sup\u003e to a learning rate of 2 \u0026times; 10\u003csup\u003e\u0026minus;4\u003c/sup\u003e) and the remaining 90 epochs followed a cosine annealing schedule.
The model weights at the final epoch were saved as the checkpoint for adapting to downstream tasks.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eAdaptation to downstream tasks\u003c/h2\u003e \u003cp\u003eWhen adapting foundation models pre-trained with Masked Autoencoder to downstream tasks, we used only the encoder (ViT-large) of the foundation model and discarded the decoder. For foundation models pre-trained with DINOV2, the teacher network was used and adapted to downstream tasks. Both the encoder and teacher networks extracted high-level features from retinal images. A fully connected layer took these features as input and output the probability distribution over the disease categories. The category with the highest probability was selected as the final classification. The number of categories determined the number of neurons in the fully connected layer. We used two adaptation strategies, fine-tuning and the linear probe. Fine-tuning updated both the encoder and the fully connected layer using the downstream data, whereas the linear probe updated only the fully connected layer. The schematic diagram is shown in Extended Data Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. The training objective was to predict the same categorical output as the label. The batch size was set to 16, and the model was trained for 50 epochs. The first 10 epochs followed a learning rate warm-up schedule, increasing linearly from 0 to a learning rate of 5 \u0026times; 10\u003csup\u003e\u0026minus;4\u003c/sup\u003e. This was followed by a cosine annealing schedule, where the learning rate gradually decreased from 5 \u0026times; 10\u003csup\u003e\u0026minus;4\u003c/sup\u003e to 1 \u0026times; 10\u003csup\u003e\u0026minus;6\u003c/sup\u003e over the remaining 40 epochs. After each training epoch, the model performance was evaluated on the validation set.
The model checkpoint with the highest AUROC on the validation set was saved for subsequent internal and external evaluations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eComputational resources\u003c/h2\u003e \u003cp\u003eFour NVIDIA Tesla A100 GPUs (80 GB) were used for self-supervised pre-training in this project. It took about 16 days to finish pre-training with DINOV2 or Masked Autoencoder. We used an equal computational budget for the MEH and SDPP foundation models. For fine-tuning and linear probing foundation models on downstream tasks, we used NVIDIA Tesla T4 GPUs (16 GB). Fine-tuning took about 70 min for every 1,000 images, while linear probing took around 15 min.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eEvaluation and statistical analysis\u003c/h2\u003e \u003cp\u003eAll task performance was assessed using the classification metrics AUROC and AUPRC. For ischaemic stroke prediction tasks, the AUROC and AUPRC were calculated in a binary setting. For multiclass classification, such as five-stage diabetic retinopathy and multicategory disease diagnosis, AUROC and AUPRC were calculated separately for each class and then averaged to obtain the overall AUROC and AUPRC scores. For each task, we fine-tuned the model with five different random seeds, which controlled stochastic elements such as the shuffling of fine-tuning data and data augmentation. The mean and standard deviation of the performance across the five runs were computed. The standard error was estimated as (standard deviation / \u0026radic;5), and the 95% confidence interval (CI) was obtained as the mean \u0026plusmn; 1.96 \u0026times; standard error. The normality of the model performance was checked via the Shapiro-Wilk test.
Statistical significance was calculated using two-sided t-tests.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study involves human participants and was approved by the London-Central Research Ethics Committee (18/LO/1163, approved 1 August 2018), Advanced statistical modelling of multimodal data of genetic and acquired retinal diseases (20/HRA/2158, approved 5 May 2020), Confidential Advisory Group for Section 251 support (18/CAG/0111, approved 13 September 2018), and the Ethics Committee of Shanghai Sixth People\u0026rsquo;s Hospital (Approved No: 2019-087, approved 29 August 2019). The National Health Service Health Research Authority gave final approval on 13 September 2018. Moorfields Eye Hospital NHS Foundation Trust validated the de-identification of MEH data. Only de-identified retrospective data were used for research.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe MEH data consist of routinely collected healthcare data. Owing to their sensitive nature, the dataset is subject to controlled access by means of a structured application process. The AlzEye dataset is subject to the contractual restrictions of the data sharing agreements between National Health Service Digital, Moorfields Eye Hospital and University College London, and is not available for access beyond the AlzEye research team. National and international collaborations are welcomed, although restrictions on access to the cohort mean that only the AlzEye researchers can directly analyse individual-level systemic health data. Interested collaborators should contact the chief investigator P.A.K.\u003c/p\u003e\n\u003cp\u003eFor SDPP data, individual-level patient data are not publicly available but can be accessed with the consent of the data management committee of the participating institutions.
Requests for the non-profit use of the fundus images and related clinical information should be sent to T.Y.W. The data management committee will then review all requests and grant access to successful applicants. A formal data transfer agreement will be required upon approval. Requests for access to the data will generally receive a response within 1\u0026thinsp;month.\u003c/p\u003e\n\u003cp\u003eData for ocular disease experiments are publicly available online and can be accessed through the following links: IDRID (https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid), MESSIDOR2 (https://www.adcis.net/en/third-party/messidor2/), APTOS2019 (https://www.kaggle.com/competitions/aptos2019-blindness-detection/data), Glaucoma Fundus (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1YRRAC), JSIEC (https://zenodo.org/record/3477553), and Retina (https://www.kaggle.com/datasets/jr2ngb/cataractdataset).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e \u003c/p\u003e\n\u003cp\u003eThe code used to train, fine-tune and evaluate RETFound from Y.Z. is based on PyTorch and is available at https://github.com/rmaphoh/RETFound_MAE. All pre-trained model weights are available at https://huggingface.co/YukunZhou. Images were processed with the automated retinal image analysis tool AutoMorph v.1.0 (https://github.com/rmaphoh/AutoMorph). Results were further analysed and visualised with Python v.3.11.0, NumPy v.1.26.4, SciPy v.1.15.2, Matplotlib v.3.8.4, pandas v.1.5.0, Scikit-Learn v.1.4.2 and Pillow v.10.2.0.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eMoor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616: 259\u0026ndash;265.\u003c/li\u003e\n\u003cli\u003eBommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al.
On the Opportunities and Risks of Foundation Models. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2108.07258\u003c/li\u003e\n\u003cli\u003eZhang S, Metaxas D. On the challenges and perspectives of foundation models for medical image analysis. Med Image Anal. 2024;91: 102996.\u003c/li\u003e\n\u003cli\u003eChia MA, Zhou Y, Keane PA. A new foundation model for multimodal ophthalmic images: Advancing disease detection and prediction. NEJM AI. 2024;1. doi:10.1056/aie2401024\u003c/li\u003e\n\u003cli\u003eSinghal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025. doi:10.1038/s41591-024-03423-7\u003c/li\u003e\n\u003cli\u003eZhou Y, Chia MA, Wagner SK, Ayhan MS, Williamson DJ, Struyven RR, et al. A foundation model for generalizable disease detection from retinal images. Nature. 2023;622: 156\u0026ndash;163.\u003c/li\u003e\n\u003cli\u003eLi J, Guan Z, Wang J, Cheung CY, Zheng Y, Lim L-L, et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med. 2024. doi:10.1038/s41591-024-03139-8\u003c/li\u003e\n\u003cli\u003eWang M, Lin T, Lin A, Yu K, Peng Y, Wang L, et al. Common and rare fundus diseases identification using vision-language foundation model with knowledge of over 400 diseases. arXiv [eess.IV]. 2024. Available: http://arxiv.org/abs/2406.09317\u003c/li\u003e\n\u003cli\u003eTiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng. 2022. doi:10.1038/s41551-022-00936-9\u003c/li\u003e\n\u003cli\u003ePai S, Bontempi D, Hadzic I, Prudente V, Sokač M, Chaunzwa TL, et al. Foundation model for cancer imaging biomarkers. Nat Mach Intell. 2024;6: 354\u0026ndash;367.\u003c/li\u003e\n\u003cli\u003eTanno R, Barrett DGT, Sellergren A, Ghaisas S, Dathathri S, See A, et al.
Collaboration between clinicians and vision-language models in radiology report generation. Nat Med. 2024. doi:10.1038/s41591-024-03302-1\u003c/li\u003e\n\u003cli\u003eHuang Z, Bianchi F, Yuksekgonul M, Montine TJ, Zou J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat Med. 2023;29: 2307\u0026ndash;2316.\u003c/li\u003e\n\u003cli\u003eLu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T, et al. A visual-language foundation model for computational pathology. Nat Med. 2024;30: 863\u0026ndash;874.\u003c/li\u003e\n\u003cli\u003eXu H, Usuyama N, Bagga J, Zhang S, Rao R, Naumann T, et al. A whole-slide foundation model for digital pathology from real-world data. Nature. 2024;630: 181\u0026ndash;188.\u003c/li\u003e\n\u003cli\u003eWang X, Zhao J, Marostica E, Yuan W, Jin J, Zhang J, et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature. 2024. doi:10.1038/s41586-024-07894-z\u003c/li\u003e\n\u003cli\u003eVorontsov E, Bozkurt A, Casson A, Shaikovski G, Zelechowski M, Severson K, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat Med. 2024;30: 2924\u0026ndash;2935.\u003c/li\u003e\n\u003cli\u003eTu Tao, Azizi Shekoofeh, Driess Danny, Schaekermann Mike, Amin Mohamed, Chang Pi-Chuan, et al. Towards Generalist Biomedical AI. NEJM AI. 2024;1: AIoa2300138.\u003c/li\u003e\n\u003cli\u003eZhang K, Zhou R, Adhikarla E, Yan Z, Liu Y, Yu J, et al. A generalist vision-language foundation model for diverse biomedical tasks. Nat Med. 2024;30: 3129\u0026ndash;3141.\u003c/li\u003e\n\u003cli\u003eWillemink MJ, Koszek WA, Hardell C, Wu J, Fleischmann D, Harvey H, et al. Preparing Medical Imaging Data for Machine Learning. Radiology. 2020;295: 4\u0026ndash;15.\u003c/li\u003e\n\u003cli\u003eKtena I, Wiles O, Albuquerque I, Rebuffi S-A, Tanno R, Roy AG, et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat Med. 
2024;30: 1166\u0026ndash;1173.\u003c/li\u003e\n\u003cli\u003eLeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521: 436\u0026ndash;444.\u003c/li\u003e\n\u003cli\u003eZhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomed Eng. 2022;6: 1330\u0026ndash;1345.\u003c/li\u003e\n\u003cli\u003eYang Y, Zhang H, Gichoya JW, Katabi D, Ghassemi M. The limits of fair medical imaging AI in real-world generalization. Nat Med. 2024. doi:10.1038/s41591-024-03113-4\u003c/li\u003e\n\u003cli\u003eLin M, Li T, Yang Y, Holste G, Ding Y, Van Tassel SH, et al. Improving model fairness in image-based computer-aided diagnosis. Nat Commun. 2023;14: 6261.\u003c/li\u003e\n\u003cli\u003eSeyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27: 2176\u0026ndash;2182.\u003c/li\u003e\n\u003cli\u003eHogg HDJ, Martindale APL, Liu X, Denniston AK. Clinical Evaluation of Artificial Intelligence-Enabled Interventions. Invest Ophthalmol Vis Sci. 2024;65: 10.\u003c/li\u003e\n\u003cli\u003eAI can be sexist and racist\u0026mdash;it\u0026rsquo;s time to make it fair. Available: https://www.nature.com/articles/d41586-018-05707-8\u003c/li\u003e\n\u003cli\u003eLarrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci U S A. 2020;117: 12592\u0026ndash;12594.\u003c/li\u003e\n\u003cli\u003eZhang L, Wang X, Yang D, Sanford T, Harmon S, Turkbey B, et al. Generalizing Deep Learning for Medical Image Segmentation to Unseen Domains via Deep Stacked Transformation. IEEE Trans Med Imaging.
2020;39: 2531\u0026ndash;2540.\u003c/li\u003e\n\u003cli\u003eFrid-Adar M, Diamant I, Klang E, Amitai M, Goldberger J, Greenspan H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing. 2018;321: 321\u0026ndash;331.\u003c/li\u003e\n\u003cli\u003eHan T, Nebelung S, Haarburger C, Horst N, Reinartz S, Merhof D, et al. Breaking medical data sharing boundaries by using synthesized radiographs. Sci Adv. 2020;6: eabb7973.\u003c/li\u003e\n\u003cli\u003eKrishnan R, Rajpurkar P, Topol EJ. Self-supervised learning in medicine and healthcare. Nat Biomed Eng. 2022;6: 1346\u0026ndash;1352.\u003c/li\u003e\n\u003cli\u003eDoersch C, Gupta A, Efros AA. Unsupervised visual representation learning by context prediction. 2015 IEEE International Conference on Computer Vision (ICCV). IEEE; 2015. pp. 1422\u0026ndash;1430.\u003c/li\u003e\n\u003cli\u003eJing L, Tian Y. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Trans Pattern Anal Mach Intell. 2021;43: 4037\u0026ndash;4058.\u003c/li\u003e\n\u003cli\u003eBommasani R, Klyman K, Longpre S, Xiong B, Kapoor S, Maslej N, et al. Foundation Model Transparency Reports. arXiv [cs.LG]. 2024. Available: http://arxiv.org/abs/2402.16268\u003c/li\u003e\n\u003cli\u003eAlderman JE, Palmer J, Laws E, McCradden MD, Ordish J, Ghassemi M, et al. Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations. Lancet Digit Health. 2025;7: e64\u0026ndash;e88.\u003c/li\u003e\n\u003cli\u003eMaaten L, Hinton GE. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9: 2579\u0026ndash;2605.\u003c/li\u003e\n\u003cli\u003eWong TY, Mitchell P. Hypertensive retinopathy. N Engl J Med. 2004;351: 2310\u0026ndash;2317.\u003c/li\u003e\n\u003cli\u003eG\u0026uuml;nthner R, Hanssen H, Hauser C, Angermann S, Lorenz G, Kemmner S, et al. 
Impaired Retinal Vessel Dilation Predicts Mortality in End-Stage Renal Disease. Circ Res. 2019. doi:10.1161/CIRCRESAHA.118.314318\u003c/li\u003e\n\u003cli\u003eSeidelmann SB, Claggett B, Bravo PE, Gupta A, Farhad H, Klein BE, et al. Retinal vessel calibers in predicting long-term cardiovascular outcomes: The Atherosclerosis Risk in Communities Study. Circulation. 2016;134: 1328\u0026ndash;1338.\u003c/li\u003e\n\u003cli\u003eKo F, Muthy ZA, Gallacher J, Sudlow C, Rees G, Yang Q, et al. Association of Retinal Nerve Fiber Layer Thinning With Current and Future Cognitive Decline: A Study Using Optical Coherence Tomography. JAMA Neurol. 2018;75: 1198\u0026ndash;1205.\u003c/li\u003e\n\u003cli\u003eWagner SK, Romero-Bascones D, Cortina-Borja M, Williamson DJ, Struyven RR, Zhou Y, et al. Retinal optical coherence tomography features associated with incident and prevalent Parkinson disease. Neurology. 2023;101: e1581\u0026ndash;e1593.\u003c/li\u003e\n\u003cli\u003eCheung CY-L, Ong YT, Ikram MK, Ong SY, Li X, Hilal S, et al. Microvascular network alterations in the retina of patients with Alzheimer\u0026rsquo;s disease. Alzheimers Dement. 2014;10: 135\u0026ndash;142.\u003c/li\u003e\n\u003cli\u003eCheung CY-L, Ikram MK, Chen C, Wong TY. Imaging retina to study dementia and stroke. Prog Retin Eye Res. 2017;57: 89\u0026ndash;107.\u003c/li\u003e\n\u003cli\u003eHe K, Chen X, Xie S, Li Y, Doll\u0026aacute;r P, Girshick RB. Masked Autoencoders Are Scalable Vision Learners. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2021; 15979\u0026ndash;15988.\u003c/li\u003e\n\u003cli\u003eOquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, et al. DINOv2: Learning robust visual features without supervision. arXiv [cs.CV]. 2023. Available: http://arxiv.org/abs/2304.07193\u003c/li\u003e\n\u003cli\u003eQiu J, Wu J, Wei H, Shi P, Zhang M, Sun Y, et al.
Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI. 2024;1. doi:10.1056/aioa2300221\u003c/li\u003e\n\u003cli\u003eShi D, Zhou Y, He S, Wagner SK, Huang Y, Keane PA, et al. Cross-modality Labeling Enables Noninvasive Capillary Quantification as a Sensitive Biomarker for Assessing Cardiovascular Risk. Ophthalmol Sci. 2024;4: 100441.\u003c/li\u003e\n\u003cli\u003eWagner SK, Cortina-Borja M, Silverstein SM, Zhou Y, Romero-Bascones D, Struyven RR, et al. Association Between Retinal Features From Multimodal Imaging and Schizophrenia. JAMA Psychiatry. 2023;80: 478\u0026ndash;487.\u003c/li\u003e\n\u003cli\u003eZhao S, Song J, Ermon S. Learning hierarchical features from deep generative models. Precup D, Teh YW, editors. ICML. 2017;70: 4091\u0026ndash;4099.\u003c/li\u003e\n\u003cli\u003eVahdat A, Kreis K, Kautz J. Score-based generative modeling in latent space. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW, editors. Neural Inf Process Syst. 2021;34: 11287\u0026ndash;11302.\u003c/li\u003e\n\u003cli\u003eMancini M, Porzi L, Bulo SR, Caputo B, Ricci E. Boosting domain adaptation by discovering latent domains. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2018. pp. 3771\u0026ndash;3780.\u003c/li\u003e\n\u003cli\u003ePan SJ, Tsang IW, Kwok JT, Yang Q. Domain adaptation via transfer component analysis. IEEE Trans Neural Netw. 2011;22: 199\u0026ndash;210.\u003c/li\u003e\n\u003cli\u003eCaron M, Touvron H, Misra I, J\u0026eacute;gou H, Mairal J, Bojanowski P, et al. Emerging properties in self-supervised vision transformers. arXiv [cs.CV]. 2021. pp. 9650\u0026ndash;9660. Available: http://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.html\u003c/li\u003e\n\u003cli\u003eZhou J, Wei C, Wang H, Shen W, Xie C, Yuille A, et al. iBOT: Image BERT Pre-Training with Online Tokenizer. 
arXiv [cs.CV]. 2021. Available: http://arxiv.org/abs/2111.07832\u003c/li\u003e\n\u003cli\u003eCaron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A. Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst. 2020;33: 9912\u0026ndash;9924.\u003c/li\u003e\n\u003cli\u003eXiao L, Pennington J. Synergy and symmetry in deep learning: Interactions between the data, model, and inference algorithm. arXiv [cs.LG]. 2022. Available: http://arxiv.org/abs/2207.04612\u003c/li\u003e\n\u003cli\u003eQin Z, Chen D, Zhang W, Yao L, Huang Y, Ding B, et al. The synergy between data and multi-modal large language models: A survey from co-development perspective. arXiv [cs.AI]. 2024. Available: http://arxiv.org/abs/2407.08583\u003c/li\u003e\n\u003cli\u003eWagner S, Hughes F, Cortina-Borja M, Pontikos N, Struyven R, Liu X, et al. AlzEye: longitudinal record-level linkage of ophthalmic imaging and hospital admissions of 353 157 patients in London, UK. BMJ Open. 2022;12. doi:10.1136/bmjopen-2021-058552\u003c/li\u003e\n\u003cli\u003eWilkinson CP, Ferris FL 3rd, Klein RE, Lee PP, Agardh CD, Davis M, et al. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology. 2003;110: 1677\u0026ndash;1682.\u003c/li\u003e\n\u003cli\u003eCiulla T, Amador A, Zinman B. Diabetic retinopathy and diabetic macular edema: pathophysiology, screening, and novel therapies. Diabetes Care. 2003;26: 2653\u0026ndash;2664.\u003c/li\u003e\n\u003cli\u003eEarly Treatment Diabetic Retinopathy Study Research Group. Photocoagulation for diabetic macular edema. Early Treatment Diabetic Retinopathy Study report number 1. Arch Ophthalmol. 1985;103: 1796\u0026ndash;1806.\u003c/li\u003e\n\u003cli\u003eWorld Health Organization. International Statistical Classification of Diseases and Related Health Problems: Alphabetical index.
World Health Organization; 2004.\u003c/li\u003e\n\u003cli\u003ePorwal P, Pachade S, Kokare M, Deshmukh G, Son J, Bae W, et al. IDRiD: Diabetic retinopathy \u0026ndash; segmentation and grading challenge. Med Image Anal. 2020;59: 101561.\u003c/li\u003e\n\u003cli\u003eAbr\u0026agrave;moff MD, Folk JC, Han DP, Walker JD, Williams DF, Russell SR, et al. Automated Analysis of Retinal Images for Detection of Referable Diabetic Retinopathy. JAMA Ophthalmol. 2013;131: 351\u0026ndash;357.\u003c/li\u003e\n\u003cli\u003eAhn JM, Kim S, Ahn K-S, Cho S-H, Lee KB, Kim US. Correction: A deep learning model for the detection of both advanced and early glaucoma using fundus photography. PLoS One. 2019;14: e0211579.\u003c/li\u003e\n\u003cli\u003eCen L-P, Ji J, Lin J-W, Ju S-T, Lin H-J, Li T-P, et al. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nat Commun. 2021;12: 4828.\u003c/li\u003e\n\u003cli\u003eZhou Y, Wagner SK, Chia MA, Zhao A, Woodward-Court P, Xu M, et al. AutoMorph: Automated Retinal Vascular Morphology Quantification Via a Deep Learning Pipeline. Transl Vis Sci Technol. 2022;11: 12.\u003c/li\u003e\n\u003cli\u003eBoers TGW, Fockens KN, van der Putten JA, Jaspers TJM, Kusters CHJ, Jukema JB, et al. Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency. Med Image Anal. 2024;98: 103298.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6080254/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6080254/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eMedical foundation models (FM), pre-trained on large-scale unlabelled data, have demonstrated robust performance and high efficiency when fine-tuned to various clinically relevant applications. However, the impact of pre-training data on medical FM performance such as generalisability and fairness, which form the foundation in fine-tuned models, remains unexplored. To address this, we sampled two large cohorts from two sites, Moorfields Eye Hospital (UK) and the Shanghai Diabetes Prevention Program (China), each containing 904,170 retinal images for FM pre-training. We developed parallel FMs using identical processes and compared their fairness and generalisability on downstream tasks with publicly available datasets and held-out data from each site. Our results demonstrate that, despite strong generalisability, medical FMs perform significantly better on downstream data that align with the pre-training data in approximately one-third of tasks. Additionally, age is a key metadata factor impacting FM fairness and generalisability in retinal images, whereas sex and ethnicity show no such impact. 
These findings advocate for an evidence-based approach to pre-training data selection and highlight the importance of transparency even for pre-training data, ultimately enhancing FM capabilities and guiding FM development and customised application in healthcare.\u003c/p\u003e","manuscriptTitle":"Revealing the Impact of Pre-training Data on Medical Foundation Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-03 10:29:26","doi":"10.21203/rs.3.rs-6080254/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"nature-communications","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"NCOMMS","sideBox":"Learn more about [Nature Communications](http://www.nature.com/ncomms/)","snPcode":"","submissionUrl":"https://mts-ncomms.nature.com/","title":"Nature Communications","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Communications","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"66ac0478-1f98-4e78-8e53-d6825bbdaf8c","owner":[],"postedDate":"April 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":46341255,"name":"Health sciences/Diseases/Eye diseases"},{"id":46341256,"name":"Health sciences/Diseases/Cardiovascular diseases"},{"id":46341257,"name":"Health sciences/Health care/Medical imaging"},{"id":46341258,"name":"Health sciences/Medical research/Translational research"}],"tags":[],"updatedAt":"2026-04-10T07:21:25+00:00","versionOfRecord":{"articleIdentity":"rs-6080254","link":"https://doi.org/10.1038/s41467-026-70077-z","journal":{"identity":"nature-communications","isVorOnly":false,"title":"Nature Communications"},"publishedOn":"2026-02-28 05:00:00","publishedOnDateReadable":"February 28th, 2026"},"versionCreatedAt":"2025-04-03 10:29:26","video":"","vorDoi":"10.1038/s41467-026-70077-z","vorDoiUrl":"https://doi.org/10.1038/s41467-026-70077-z","workflowStages":[]},"version":"v1","identity":"rs-6080254","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6080254","identity":"rs-6080254","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
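The fine-tuning learning rate schedule described under "Adaptation to downstream tasks" (linear warm-up from 0 to 5 × 10^-4 over the first 10 epochs, then cosine annealing down to 1 × 10^-6 over the remaining 40 epochs) can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fine_tune_lr`, its parameter names, and the per-epoch granularity are hypothetical, and the released code may update the rate per iteration instead.

```python
import math


def fine_tune_lr(epoch, peak_lr=5e-4, min_lr=1e-6,
                 warmup_epochs=10, total_epochs=50):
    """Learning rate at a given (possibly fractional) epoch.

    Hypothetical helper: linear warm-up from 0 to peak_lr, then
    cosine annealing from peak_lr down to min_lr.
    """
    if epoch < warmup_epochs:
        # Linear warm-up: 0 at epoch 0, peak_lr at the end of warm-up.
        return peak_lr * epoch / warmup_epochs
    # Fraction of the annealing phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for e in (0, 5, 10, 30, 50):
        print(f"epoch {e:2d}: lr = {fine_tune_lr(e):.3e}")
```

Passing `peak_lr=1e-3, min_lr=0.0, warmup_epochs=15, total_epochs=800` would give a comparable curve for the Masked Autoencoder pre-training warm-up, although the text does not state which decay was used after warm-up in that case, so the annealing part is an assumption there.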
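The summary statistics described under "Evaluation and statistical analysis" (mean and standard deviation over five seeds, standard error as standard deviation / √5, and a 95% confidence interval from 1.96 × standard error) can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `summarise_runs` is hypothetical, the use of the sample standard deviation (n − 1 denominator) is an assumption because the text does not specify it, and the AUROC values in the usage example are made up for illustration.

```python
import math
import statistics


def summarise_runs(scores, z=1.96):
    """Mean, SD, SE and a normal-approximation 95% CI across seed runs.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate a spread")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)      # sample SD (assumed)
    se = sd / math.sqrt(n)             # e.g. SD / sqrt(5) for five seeds
    half_width = z * se                # 1.96 x SE
    return {
        "mean": mean,
        "sd": sd,
        "se": se,
        "ci95": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    # Illustrative AUROC values for five seeds (not from the paper).
    example = [0.80, 0.82, 0.84, 0.86, 0.88]
    print(summarise_runs(example))
```

With only five runs, a t-distribution critical value would give a wider interval than 1.96; the sketch keeps 1.96 because that is the multiplier stated in the text.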