Foundation model embeddings for quantitative tumor imaging biomarkers

doi:10.21203/rs.3.rs-6630446/v1

2025

preprint OA: gold CC-BY-4.0

Suraj Pai, Ibrahim Hadzic, Andrey Fedorov, Raymond H. Mak, Hugo JWL Aerts

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6630446/v1. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

Foundation models are increasingly used in medical imaging, yet their ability to extract reliable quantitative radiographic phenotypes of cancer across diverse clinical contexts lacks systematic evaluation. Here, we introduce TumorImagingBench, a curated benchmark comprising six public datasets (3,244 scans) with varied oncological endpoints. We evaluate ten medical imaging foundation models, representing diverse architectures and pre-training strategies developed between 2020 and 2025, assessing their performance in deriving deep learning-based radiographic phenotypes.
Our analysis extends beyond endpoint prediction performance and compares robustness to common sources of variability and saliency-based interpretability. We additionally compare the mutual similarity of the embedding representations learned by the models. This comparative benchmarking reveals performance disparities among models and provides critical insights to guide the selection of optimal foundation models for specific quantitative imaging tasks. We publicly release all code, curated datasets, and benchmark results to foster reproducible research and future developments in quantitative cancer imaging.

Subjects: Health sciences/Medical research/Translational research; Physical sciences/Engineering/Biomedical engineering

INTRODUCTION

Precision oncology aims to revolutionize cancer care by tailoring treatments to the individual characteristics of each patient's tumor [1]. Central to this paradigm is the ability to characterize tumor biology, heterogeneity, and the tumor microenvironment, often non-invasively, to guide diagnosis, predict prognosis, and monitor therapeutic response [2]. Medical imaging modalities, including Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET), provide rich, spatially-resolved information about tissue structure and function, serving as key technologies in clinical oncology [3]. Over the past decade, the field of quantitative imaging analysis, particularly radiomics, has emerged as a powerful tool to unlock deeper insights from these medical images beyond qualitative visual assessment [3–6]. Radiomics involves the extraction of a large number of quantitative features from medical images, converting them into mineable data that can potentially capture phenotypic characteristics related to underlying pathology.
These features, when integrated with clinical and genomic data, have shown promise in predicting clinical endpoints such as diagnosis, patient survival, tumor recurrence, and treatment response across various cancer types [7]. However, traditional mathematical and statistical radiomics approaches face challenges related to feature reproducibility, standardization across different imaging parameters and scanners, and the inherent complexity of selecting and interpreting informative features from a high-dimensional space [8–10].

The field of artificial intelligence, specifically deep learning, has witnessed transformative advancements, especially the development of Foundation Models (FMs) [11]. These models, typically pre-trained on large and diverse datasets using self-supervised or unsupervised learning objectives, learn powerful and generalizable representations that can be adapted to various downstream tasks with minimal task-specific fine-tuning. Initially demonstrating remarkable success in natural language processing and computer vision [12–15], FMs are increasingly being explored within the medical domain [16–18]. In radiology, FMs have shown potential in tasks such as image segmentation, disease detection, report generation, and visual question answering [19, 20]. The capacity of FMs to implicitly learn complex features directly from image data presents an alternative to the handcrafted feature engineering inherent in traditional radiomics. By leveraging large-scale pre-training, FMs can potentially capture more robust and informative image representations, overcoming some limitations of conventional methods and advancing quantitative radiomics for precision oncology [21].

However, the proliferation of FM architectures and pre-training strategies poses a significant challenge for researchers: selecting the most appropriate model for a specific quantitative radiomics task.
While several benchmarks comparing FMs exist for general tasks [22–24] and certain medical applications like report generation or segmentation [25–28], a critical gap remains. There is currently no comprehensive, systematic benchmark specifically evaluating the performance and characteristics of different FMs as representation extractors for quantitative radiomics endpoints (i.e., diagnosis and prognosis prediction) across multiple anatomies and clinical cohorts. This lack of systematic comparison hinders informed model selection and reliable translation of FMs into radiomics research and practice.

To address this gap, we present the first comprehensive benchmark evaluating ten distinct, publicly available, pre-trained 3D foundation models for quantitative radiomics analysis. We assess their representational power across six diverse clinical cohorts spanning lung, kidney, and liver anatomies, on both diagnostic and prognostic prediction tasks. Our comparative analysis aims to present the relative strengths of these FMs in quantifying radiological phenotypes relevant to oncological outcomes. Our findings reveal that model performance is task- and dataset-dependent, with no single model universally superior, although certain models like FMCIB [21], ModelsGenesis [29], and VISTA3D [30] demonstrate consistently strong performance across the evaluated scenarios. Beyond predictive performance, we investigate the robustness of the learned representations through test-retest reliability and input stability analyses, associate representations with salient image regions to gain insights into model interpretability, and explore the similarities between different FM representation spaces using representation alignment techniques.
Furthermore, we introduce a unified, extensible software framework designed to facilitate the benchmarking of existing and future FMs on new tumor imaging datasets, thereby promoting standardized evaluation and accelerating the adoption of these powerful models within the quantitative imaging community.

RESULTS

In this study, we established a comprehensive framework to compare ten distinct foundation models using six publicly available datasets. We assessed a variety of models that differed in architecture (convolutional vs. transformer-based), pre-training strategies (contrastive, supervised, generative, etc.), and data utilization (low-dose CT, all CT, CT + MRI, CT + MRI + US + others). These datasets address various endpoints across cancer types located in the lung, liver, and kidney. Foundation model embeddings were extracted from each dataset, with k-nearest neighbor models leveraging neighbor voting to predict endpoints. Figure 1 provides an overview of the datasets, models, and overarching framework.

Diagnostic Performance of Foundation Models

For lung nodule malignancy diagnosis, performance varied considerably. On the LUNA16 [31] dataset, FMCIB demonstrated the highest diagnostic capability with an Area Under the Curve (AUC) of 0.886 (95% Confidence Interval [CI]: 0.871–0.9). ModelsGenesis ranked second with an AUC of 0.806 (95% CI: 0.795–0.816). Voco [32] performed the worst, near random chance, with an AUC of 0.493 (95% CI: 0.468–0.519). VISTA3D achieved an AUC of 0.711 (95% CI: 0.692–0.730), while the remaining models yielded AUCs between 0.5 and 0.7 (Fig. 1a). A similar ranking pattern occurred on the DLCS [33] dataset, although with lower overall AUCs. FMCIB again led with an AUC of 0.675 (95% CI: 0.655–0.696), followed by ModelsGenesis at 0.645 (95% CI: 0.624–0.666). Voco (AUC: 0.507) and CTClip [19] (AUC: 0.494) performed close to random chance. VISTA3D achieved an AUC of 0.607 (95% CI: 0.589–0.625), with other models scoring between 0.5 and 0.6 (Fig. 1a).

Prognostic Performance Across Lung, Kidney, and Liver Cancer Datasets

In prognostic tasks, model performance generally decreased compared to diagnostics. For 2-year overall survival prediction in the NSCLC-Radiomics [34] dataset, VISTA3D (AUC: 0.582, 95% CI: 0.545–0.62), FMCIB (AUC: 0.577, 95% CI: 0.549–0.605), and ModelsGenesis (AUC: 0.577, 95% CI: 0.539–0.614) were the top performers. CTClip (AUC: 0.449) and Voco (AUC: 0.526) performed worst (Fig. 1a). On the NSCLC-Radiogenomics [35] dataset for predicting NSCLC survival, VISTA3D (AUC: 0.622, 95% CI: 0.566–0.677) and CTFM [36] (AUC: 0.620, 95% CI: 0.572–0.668) achieved the highest AUCs, followed closely by Merlin [37] (AUC: 0.612) and ModelsGenesis (AUC: 0.609). Voco (AUC: 0.461) and CTClip (AUC: 0.510) showed the lowest performance (Fig. 1a). For renal cancer prognosis (2-year overall survival) using C4KC-KiTS [38], ModelsGenesis yielded the highest AUC of 0.733 (95% CI: 0.670–0.796), with SUPREM [39] second at 0.718 (95% CI: 0.672–0.764). CTFM (AUC: 0.463) and CTClip (AUC: 0.493) performed worst (Fig. 1a). In predicting colorectal cancer liver metastases survival (Colorectal-Liver-Metastases [40]), only FMCIB achieved an AUC substantially above random chance at 0.572 (95% CI: 0.509–0.644). ModelsGenesis was the next best with an AUC of 0.530 (95% CI: 0.458–0.601), while other models performed near the random baseline (Fig. 1a).

Aggregate Performance and Ranking Across Datasets

Cross-dataset analysis revealed consistent performance patterns. FMCIB demonstrated strong overall performance, ranking first in three of the six datasets (LUNA16, DLCS, Colorectal-Liver-Metastases) and third in NSCLC-Radiomics (Fig. 1b). ModelsGenesis also showed high consistency, ranking first or second in four datasets (LUNA16, DLCS, NSCLC-Radiomics, C4KC-KiTS) (Fig. 1b).
Performance generally followed a trajectory starting highest in LUNA16, decreasing through DLCS and NSCLC prognostic tasks, partially recovering in C4KC-KiTS, and declining again in Colorectal-Liver-Metastases (Fig. 1c). Task-specific strengths were observed; FMCIB excelled in diagnostic tasks, while VISTA3D showed relatively stronger performance in prognostic tasks, achieving the top rank in NSCLC-Radiogenomics. The top three models (FMCIB, ModelsGenesis, VISTA3D) exhibited similar performance trends across the datasets (Fig. 1c).

Embedding Stability and Robustness to Input Variations

Test-retest stability evaluated on the RIDER dataset [41], simulating scanning variability, was high for most models, showing average cosine similarities between 0.97 and 1.00. However, Merlin showed lower stability with an average similarity of 0.81, and CTClip scored 0.93 (Fig. 2a). Robustness to variations in input seed points (annotation noise) was assessed using Cohen's Kappa for agreement across 50 trials. CTFM showed the highest agreement (Kappa: 0.90), followed by SUPREM (0.87) and MedImageInsight [42] (0.85). FMCIB maintained good agreement (Kappa: 0.70). In contrast, Voco demonstrated very poor agreement (Kappa: 0.05), with CTClip (0.29) and Merlin (0.36) also showing low robustness (Fig. 2b).

Saliency Map Analysis for Explainability

Saliency maps generated via feature-based occlusion sensitivity [36] indicated model focus areas. FMCIB and ModelsGenesis consistently produced saliency maps highlighting tumor regions across multiple datasets. VISTA3D maps sometimes focused on high-intensity bone structures when present, and a general tendency towards high-intensity regions was present across several models. CTClip, Voco, and Merlin failed to generate saliency maps clearly indicating tumor-specific regions (Fig. 2c).
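
As a rough illustration of feature-based occlusion sensitivity, the sketch below masks one cubic patch of a volume at a time and records the cosine distance between the original and the occluded embedding. The `toy_embed` function, the patch size, the zero fill value, and the synthetic volume are all assumptions made for illustration and are not taken from the benchmark code; in practice the embedding function would be a foundation model encoder.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two 1D vectors."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def toy_embed(volume):
    """Hypothetical stand-in for a foundation model encoder (assumption)."""
    return np.array([volume.mean(), volume.max(), volume.std(), 1.0])

def occlusion_saliency(volume, embed, patch=4, fill=0.0):
    """Score each patch by how far the embedding moves when that patch is occluded."""
    baseline = embed(volume)
    saliency = np.zeros(volume.shape, dtype=float)
    depth, height, width = volume.shape
    for z in range(0, depth, patch):
        for y in range(0, height, patch):
            for x in range(0, width, patch):
                occluded = volume.copy()
                occluded[z:z + patch, y:y + patch, x:x + patch] = fill
                score = cosine_distance(baseline, embed(occluded))
                saliency[z:z + patch, y:y + patch, x:x + patch] = score
    return saliency

# Synthetic 16x16x16 volume with one bright cube standing in for a tumor.
volume = np.zeros((16, 16, 16))
volume[4:8, 4:8, 4:8] = 1.0

saliency = occlusion_saliency(volume, toy_embed)
peak = np.unravel_index(np.argmax(saliency), saliency.shape)
print("most salient voxel:", peak)
```

With this toy setup, occluding background leaves the embedding unchanged, so only the patch covering the bright cube receives a non-zero score. Real encoders respond to context as well, which is why the maps reported above can highlight bone or other high-intensity regions.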

Similarity Relationships Between Foundation Model Embeddings

The relationships between the feature/embedding spaces learned by different models were examined using mutual k-nearest neighbor [43] overlaps within each dataset. Consistent trends showed higher overlaps between embeddings from ModelsGenesis-VISTA3D, FMCIB-ModelsGenesis, and FMCIB-VISTA3D pairs across multiple datasets (Fig. 3a). These pairs frequently involved the top-performing models identified earlier. Alignment calculations further quantified these relationships; VISTA3D and ModelsGenesis embeddings showed the highest average alignment with FMCIB embeddings (scores of 2.85 and 2.69, respectively) (Fig. 3b). The maximum average alignment observed was between VISTA3D and ModelsGenesis (score of 3.89) (Fig. 3c, 3d), suggesting a convergence of feature representations among these high-performing models.

DISCUSSION

In this study, we conducted the first comprehensive benchmarking of ten distinct foundation models (FMs) as feature extractors for quantitative tumor imaging tasks across six diverse, publicly available oncology datasets. Our evaluation includes predictive performance for diagnostic and prognostic endpoints, robustness to common sources of image variability, attribution analysis via saliency mapping, and an exploration of representational similarities between models. The results reveal significant differences in how well different FM embeddings support these downstream tasks, highlighting the challenges of model selection. Notably, we identified FMCIB, ModelsGenesis, and VISTA3D as exhibiting consistently strong performance across the range of datasets and clinical endpoints evaluated. Furthermore, this superior performance is often correlated with enhanced robustness and, for FMCIB and ModelsGenesis, biologically plausible saliency maps, suggesting these models capture more reliable and potentially interpretable quantitative phenotypes.

The popularity of FMs, pre-trained using varied architectures, datasets, and self-supervised objectives, presents both opportunity [21] and complexity [16] for domain-specific applications like quantitative tumor imaging. While numerous benchmarks exist for general computer vision [44–46] and natural language processing [47, 48], and more recently within the broader medical field (e.g., BenchMD [49], MedArena [25]), a dedicated evaluation framework comparing FMs specifically for their ability to generate predictive quantitative radiomic signatures remains missing. Previous benchmarking efforts in tumor imaging have primarily focused on standardizing traditional feature extraction pipelines or comparing classical machine learning algorithms applied to those features [50, 51]. Our work directly addresses this gap, providing a much-needed resource for researchers seeking to leverage the representational power of FMs for biomarker discovery in oncology, analogous to how earlier benchmarks guided progress in other fields.

Our analysis across models spanning different eras of deep learning architectures (from UNets [52] to Transformers [53]) and pre-training paradigms (restorative, contrastive, generative, segmentation-based) yielded several interesting observations. Perhaps most striking was the robust performance of ModelsGenesis, a relatively older model based on a simple UNet architecture and pre-trained using a restorative objective on only ~600 CT scans from a single LIDC dataset. Its consistent effectiveness across diverse anatomies and tasks suggests that its pre-training, focused on recovering corrupted inputs, may instill representations particularly robust to noise and adept at capturing fundamental tissue characteristics relevant for radiomics, even without exposure to vast datasets or explicit oncological tasks during pre-training.
Similarly, FMCIB, which employs contrastive learning to achieve invariance against various image transformations [21], also demonstrated strong, robust performance, reinforcing the hypothesis that pre-training objectives emphasizing robustness to transformations can yield powerful general-purpose embeddings for quantitative analysis. In contrast, VISTA3D [30], leveraging segmentation pre-training on a large, diverse dataset including tumors, likely benefits from learning spatial features relevant to tumor characterization through its unique fine-grained supervoxel segmentation pre-training. The more recent PASTA [54] model, despite sophisticated multi-stage pre-training involving synthetic tumors, yielded less performant embeddings in our benchmark, suggesting that while potentially powerful for fine-tuning on specific tasks (as shown in its original publication), its raw embeddings may not generalize as effectively for the broad range of quantitative radiomic tasks studied here. These findings collectively highlight that the choice of pre-training strategy and data significantly impacts the downstream utility of FM embeddings, and that newer or larger models are not invariably superior for every application.

As access to large-scale datasets and computational resources continues to grow, the landscape of FMs will undoubtedly become even more crowded. Comparing these models effectively, especially as performance on specific tasks begins to saturate, becomes increasingly critical. Inspired by concepts such as the Platonic Representation Hypothesis [43], which suggests that optimal representations might converge towards a shared underlying structure, we investigated the similarity between the representation spaces of the FMs. Our analysis revealed that the models demonstrating stronger aggregate performance (notably FMCIB and ModelsGenesis) also exhibited significantly higher representational similarity, as measured by mutual neighbor analysis.
This convergence among top-performing models, further corroborated by qualitative similarities in their saliency maps, suggests they may be learning overlapping, potentially fundamental, image features crucial for oncological prediction. This observation highlights possibilities regarding model interchangeability or ensembling strategies in the future, where understanding representational alignment could guide the selection of complementary models.

We acknowledge several limitations inherent in this study. First, our benchmark relies exclusively on six fully public datasets. While this choice maximizes accessibility and reproducibility, it limits the scale and diversity compared to potentially available restricted-access datasets. Second, our evaluation is currently image-only, which may disadvantage multimodal FMs that integrate text or other data types. However, the lack of suitable public datasets with paired imaging and clinical text presents a significant challenge in this context. Third, our methodology focuses solely on evaluating the fixed embeddings extracted from the FMs, without task-specific fine-tuning. While fine-tuning can often boost performance on a target task, we argue that evaluating the raw embeddings provides a better assessment of the foundational representational power learned during pre-training: a strong foundational embedding should inherently capture rich image properties beneficial for downstream tasks, and should correlate with fine-tuning performance. Future work could explicitly compare embedding performance versus fine-tuning outcomes within this framework. Finally, for downstream prediction, we employed a simple k-nearest neighbor classifier to minimize confounding factors from complex modeling choices; exploring a wider range of classifiers is left to future work.

In conclusion, this study introduces a comprehensive benchmark for evaluating foundation models in quantitative tumor imaging, providing critical insights into the performance, robustness, and underlying representational characteristics of ten prominent models across diverse oncological tasks. Our findings reveal that model selection requires careful consideration, with FMCIB, ModelsGenesis, and VISTA3D emerging as particularly promising candidates. Crucially, we present not only these results but also an open-source, extensible framework for systematic and reproducible integration of new foundation models and datasets. We hope this work will significantly push forward the adoption and rigorous evaluation of these foundation models, accelerating progress in quantitative imaging for precision oncology.

METHODS

Design of TumorImagingBench Evaluation

TumorImagingBench is constructed to evaluate the capacity of tumor imaging (radiomic) signatures to quantify diverse radiological phenotypes of cancer. It encompasses six publicly available datasets: LUNA16 and DLCS for lung cancer diagnostics, focusing on nodule malignancy; NSCLC-Radiomics and NSCLC-Radiogenomics for assessing prognosis post-surgery and/or radiotherapy in non-small cell lung cancer patients; C4KC-KiTS for renal carcinoma prognosis post-partial or radical nephrectomy; and Colorectal-Liver-Metastases for assessing prognosis in colorectal cancer patients with liver metastases post-hepatic resection. The diversity of endpoints across these datasets allows the generalizability of the selected radiomic signatures to be assessed.

LUNA16 [31]: A curated selection from the LIDC-IDRI database, featuring 888 thoracic CT scans (diagnostic and lung screening) from 7 academic centers and 8 imaging companies. It includes 1,186 lung nodules, each annotated for location and attributes like internal composition, calcification, and malignancy by a consensus of at least 3 out of 4 radiologists.
For our evaluation, a subset of 677 nodules was used, all having at least one indication of malignancy suspicion.

DLCS [55]: A dataset from the Duke Health system featuring 2,487 nodules from 1,613 patients. Nodules, initially flagged by AI and verified by a medical student (with selective radiologist oversight), include 3D bounding boxes, Lung-RADS annotations, and cancer outcomes. The selection followed Lung-RADS v2022 criteria, focusing on nodules ≥ 4 mm or in central/segmental airways. A publicly available subset of 1,714 scans with pathology-confirmed malignancies was used in this study.

NSCLC-Radiomics [34]: An independent test set for prognostication networks derived from the MAASTRO Clinic (Maastricht, NL). This set consisted of CT scans from 421 patients, selected from a cohort of 422 individuals with stage I-IIIB NSCLC treated with radiation therapy. Key characteristics include annotated primary Gross Tumor Volumes (GTVs), delineated by radiation oncologists using FDG PET-CT scans (Siemens Biograph, +/- contrast), and patients being right-censored for two-year survival. The chosen endpoint for prediction is 2-year survival from treatment date.

NSCLC-Radiogenomics [35]: A dataset of 211 stage I-IV NSCLC patients from Stanford University and the Palo Alto VA (recruited 2008-2012, referred for surgery) who had preoperative CT and PET/CT scans (variable equipment/protocols). Tumor segmentations reviewed by two radiologists are available for 144 patients. The dataset also includes molecular data (EGFR/KRAS/ALK mutations, gene expression, RNA-seq). Our study focused on 133 patients with annotated Gross Tumor Volumes (GTVs), right-censored for two-year survival. This subset served as an independent test set for prognostication and subsequent biological investigation of our networks. The chosen endpoint for prediction is 2-year survival post surgery.

C4KC-KiTS [38]: Patients who underwent partial or radical nephrectomy for renal tumors at the University of Minnesota Medical Center between 2010 and 2018 were considered for inclusion. Cases lacking preoperative arterial phase abdominal CT imaging were excluded. From the eligible population, 300 cases were randomly selected for potential inclusion. Of these, 210 had complete tumor segmentation data available. After excluding patients lost to follow-up prior to the event of interest, our final cohort for this study comprised 134 patients. For multi-tumor cases, we used the largest tumor volume to determine the seed point for our computational analysis. The chosen endpoint for prediction is 2-year survival post nephrectomy.

Colorectal-Liver-Metastases [40]: A single-institution consecutive series of patients who underwent colorectal liver metastases (CRLM) resection with matched preoperative CT scans. Inclusion required: pathologically confirmed CRLM, available pathologic data of non-tumoral liver parenchyma and tumor, and preoperative portal venous contrast-enhanced MDCT within 6 weeks of resection. Patients with 90-day mortality, 3 wedge resections, or no visible tumor on preoperative imaging were excluded. For our analysis, we selected the largest tumor from each patient, resulting in a final cohort of 194 patients after excluding those lost to follow-up. The chosen endpoint for prediction is 2-year survival post resection.

Selection of FM Radiomic Embedding Models

We selected ten pre-trained models as embedding extractors to derive radiomic signatures from computed tomography images. The models span from 2020 to 2025, capturing prevalent design choices in pre-trained model development over five years. The earliest model, ModelsGenesis, employs a simple UNet convolutional network, while the most recent model, PASTA, utilizes a sophisticated nnUNet framework in a multimodal scheme.
CT-ViT and MedImageInsight, recent models, incorporate joint text-vision approaches and transformer architectures. Table 1 presents a comparative analysis of these approaches based on several critical design choices.

Table 1: Comparative analysis of pre-trained/foundation models with their corresponding pre-training frameworks, architectural specifications, and training datasets. The table highlights various models, their meta-architectural designs, and structural configurations. Pre-training datasets are summarized with relevant characteristics. The specific downstream evaluation tasks that utilized these pre-trained models are also enumerated to demonstrate model applicability and performance across domains.

| Model | Meta-architecture / Pre-training design | Architecture | Dataset | Params | Evaluated Tasks |
|---|---|---|---|---|---|
| FMCIB [21] | Tumor positive + Negative-mining SimCLR | 3D ResNet50 | 11,467 scans from DeepLesion | 184M | Patch-based diagnosis and prognosis |
| CT-FM [36] | 3D image-based contrastive pre-training promoting awareness of 3D structure. | 3D SegResNet | 148k scans from Imaging Data Commons (multiple datasets) | 77M | Whole-body and tumor segmentation, head CT triage, medical image retrieval, semantic understanding |
| CT-CLIP [19] | Contrastive language-image pretraining framework for 3D chest CTs using radiology reports | 3D ViT | 25,692 scans from CT-RATE dataset | 25M | Multi-abnormality detection, case retrieval, zero-shot classification |
| PASTA [54] | Two-stage process focusing on semantic segmentation and text-image alignment on synthetic tumors | nnUNet | 30k scans from PASTA-GEN30k dataset | 127M | Few-shot and zero-shot segmentation, tumor staging and prognosis, lesion report generation |
| VISTA3D [30] | Supervised multi-instance training along with supervoxel supervision and separate heads for interactive segmentation | 3D SegResNet | 11,454 scans from 15 different datasets | 175M | Segmentation tasks across various anatomical structures and lesion (+ Interactive) |
| VOCO [32] | Large-scale 3D medical image pre-training with geometric context priors. | 3D SwinUNETR | 160k CTs from 30 public datasets | 295M | Segmentation, classification, registration, and vision-language tasks |
| SUPREM [39] | Supervised pre-training on AbdomenAtlas 1.1, combining large-scale datasets with per-voxel annotations. | 3D UNet | 9,262 scans from AbdomenAtlas1.1 | 19M | Segmentation tasks across multiple datasets, demonstrating transfer learning capabilities |
| Merlin [37] | Vision-language foundation model for 3D CT, trained with EHR and radiology reports | 3D UNet | 15,331 CT scans along with radiology reports | 270M | Zero-shot classification, phenotype classification, radiology report generation, 3D segmentation |
| MedImageInsight [42] | Two-tower architecture optimized with UniCL objective f. | 2D ViT | 3.7M image-text/label/age pairs across several datasets | 616M | Image-text search, image-image search, report generation, and task fine-tuning |
| ModelsGenesis [29] | Image restoration on 3D Chest CT volumes | 3D UNet | 623 scans from LUNA16 dataset | 7M | Segmentation and classification tasks across five target 3D applications |

FMCIB is a foundation model designed to distinguish between lesions and non-lesions at the patch level in medical imaging. It aims to enhance the detection and characterization of cancerous lesions by leveraging self-supervised learning techniques.

CT-FM is a large-scale 3D image-based pre-trained model specifically developed for various radiological tasks, including segmentation and classification. It was trained on a substantial dataset of CT scans and demonstrated superior performance across multiple tasks compared to state-of-the-art models. The model's architecture allows it to effectively cluster anatomical regions and identify similar structures across scans, making it a robust tool for medical image analysis.

CT-CLIP is a novel 3D adaptation of the CLIP model, designed for multi-abnormality detection in chest CT scans. It utilizes contrastive learning to align CT volumes with corresponding radiology reports, enabling zero-shot classification capabilities. This model excels in detecting multiple abnormalities without the need for extensive manual annotations, showcasing its potential for efficient clinical applications.

PASTA is a 3D-CT foundation model that addresses data scarcity in oncology by synthesizing lesions across various organs and tumor types. It utilizes a generative model to create a large dataset of synthetic CT scans, which enhances its training for lesion segmentation and vision-language alignment tasks.
PASTA has shown exceptional performance in cross-domain transfer learning, outperforming existing models in multiple evaluation tasks.

VISTA3D is a versatile imaging segmentation and annotation model that supports both automatic and interactive segmentation of 3D medical images. It is the first unified foundation model to achieve state-of-the-art performance across 127 classes and is designed to facilitate efficient human correction through its interactive features. VISTA3D integrates a novel supervoxel method to enhance zero-shot performance, making it a significant advancement in 3D medical imaging.

VOCO is a large-scale 3D medical image pre-training framework that leverages geometric context priors to learn consistent semantic representations. It is built on a substantial dataset of CT volumes and employs a novel pretext task for contextual position predictions. VOCO has demonstrated superior performance across various downstream tasks, establishing itself as a leading model in the field of medical imaging.

SUPREM is a suite of pre-trained models that provides state-of-the-art performance in organ and tumor segmentation tasks. It is based on supervised pre-training methodologies and has been shown to outperform models trained from scratch. SUPREM's architecture includes various backbones, allowing it to adapt effectively to different medical imaging tasks.

Merlin is a vision-language foundation model designed for interpreting abdominal CT scans. It integrates structured electronic health record (EHR) data and unstructured radiology reports to perform a variety of tasks, including zero-shot classification and report generation. Merlin's architecture allows it to generalize across multiple downstream tasks, making it a versatile tool for clinical applications.

MedImageInsight is a lightweight foundation model for medical imaging that spans multiple modalities, including X-ray, CT, and MRI, and is trained via image-text/label/age alignment.
It achieves state-of-the-art performance on various datasets and supports both classification and image retrieval tasks. The model is designed to be adaptable and efficient, making it suitable for real-world clinical applications. ModelsGenesis is a collection of models built from unlabeled 3D imaging data using a restorative reconstruction-based self-supervised method. It aims to generate powerful application-specific target models through transfer learning. The models demonstrate strong performance across various medical imaging tasks, emphasizing their potential for broad applicability in clinical settings Embedding/Feature extraction configuration Our feature extraction configuration involved a very similar procedure across all the models. For the segmentation focused models, we took the embeddings from the last layers of the encoder and added an average pooling on top to compress the feature representations in the spatial dimension. For non-segmentation models, the last layer of the model was taken in a similar fashion and average pooled. For MedImageInsight, a 2D model, we averaged across all 3D slices to obtain our embedding, as recommended in the original study. K-Nearest Neighbor Modelling setup Final layer embeddings were used as radiomic features for each sample. These features were used to predict outcomes using a k-nearest neighbor model. We also used cosine distance for the neighbor model. Models were trained using 10-fold cross-validation and results were aggregated across the 10 runs. Optuna hyperparameter tuning was used for each run independently to select the optimal number of neighbors for the task from a range of 1 to 50. 95% confidence intervals were calculated using the 10-fold cross-validation. Robustness and Saliency Evaluation We evaluated model robustness through multiple complementary approaches. 
First, to assess test-retest reliability, we utilized the RIDER dataset—a collection of chest CT scans from 26 patients where each patient underwent two scans within a 15-minute interval. For each model, we calculated the cosine similarity between embeddings generated from these paired scans. Higher cosine similarity values indicate greater robustness to normal scanning variability. To evaluate sensitivity to input variations, we simulated annotation variability by generating 50 random perturbations of each seed point. These perturbations followed a three-dimensional multivariate normal distribution (zero mean, diagonal covariance matrix) with a variance of 16 voxels in each dimension. We then trained models on one trial and compared predictions across all trials, measuring agreement using Cohen's Kappa after converting continuous predictions to categorical values. We implemented occlusion sensitivity analysis to identify image regions most influential to model predictions. This approach systematically occludes different regions of the input image and measures resulting changes in the output embeddings using cosine distance. Regions causing larger embedding deviations when occluded are considered more salient to the model's feature extraction process. We generated and compared these saliency maps across all models to assess their focus areas. Mutual K-Nearest Neighbor Evaluation To compare feature representations between models, we employed mutual k-nearest neighbor analysis. For each sample, we identified its 10 nearest neighbors in the embedding space generated by each model. We then quantified the overlap between neighbor sets across different models. This overlap metric reveals similarities in how different models structure their feature spaces and cluster similar samples, providing insight into which models learn comparable representations despite architectural differences. 
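As an illustration of the robustness and alignment computations described above, the following Python sketch shows one way to compute paired cosine similarity, draw perturbed seed points, and measure mutual k-nearest neighbor overlap. It is not taken from the TumorImagingBench code: the function names, the use of NumPy, and the choice of cosine distance for the neighbor search in the overlap metric are assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n_samples, dim) arrays,
    e.g. embeddings of the test and retest scan of each patient."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)


def perturb_seed_point(seed, n_trials=50, variance=16.0, rng=None):
    """Draw perturbed copies of a 3D seed point.

    Offsets follow a zero-mean multivariate normal with a diagonal
    covariance matrix (same variance on every axis), as in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    cov = np.eye(3) * variance
    offsets = rng.multivariate_normal(np.zeros(3), cov, size=n_trials)
    return np.asarray(seed, dtype=float) + offsets


def knn_indices(emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row, excluding the row
    itself. Cosine distance is an assumption of this sketch."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def mutual_knn_overlap(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared neighbors per sample between two models.

    emb_a and emb_b hold embeddings of the same samples, in the same
    order, produced by two different models.
    """
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    overlaps = [len(set(ra) & set(rb)) / k for ra, rb in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))
```

Under these assumptions, an overlap of 1.0 means two models place every sample among exactly the same neighbors, while values near k divided by the number of samples are what unrelated embedding spaces would produce by chance.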
Declarations

ACKNOWLEDGEMENTS

The authors acknowledge financial support from NIH (H.J.W.L.A: NIH-USA U24CA194354, NIH-USA U01CA190234, NIH-USA U01CA209414, NIH-USA R35CA22052, and NIH-USA U54CA274516-01A1) and the European Union - European Research Council (H.J.W.L.A: 866504).

DATA AVAILABILITY

All datasets used in this study are publicly available and can be downloaded from their respective Zenodo, IDC 56 and TCIA 57 sources. Links to the datasets are available through the data citations.

CODE AVAILABILITY

The complete pipeline used in this study can be accessed either from the AIM webpage or directly at https://github.com/AIM-Harvard/TumorImagingBench . The code covers 1) data preprocessing and creation of annotation files, 2) feature extraction through a systematic interface for each model across all datasets, and 3) evaluation pipelines for building models and post-evaluation.

References

1. Adi-Wauran, E., Krishnapillai, S., Uleryk, E., Saeedi, S. & Bombard, Y. Patient-centred care in precision oncology: A systematic review. Patient Educ Couns 136, 108753 (2025).
2. Hricak, H. et al. Advances and challenges in precision imaging. Lancet Oncol 26, e34–e45 (2025).
3. Imaging in Clinical Oncology. Radiology (2015) doi:10.1148/radiol.14144052.
4. Lambin, P. et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur. J. Cancer 48, 441–446 (2012).
5. Aerts, H. J. W. L. et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 5, 4006 (2014).
6. Mu, W., Schabath, M. B. & Gillies, R. J. Images Are Data: Challenges and Opportunities in the Clinical Translation of Radiomics. Cancer Res 82, 2066–2068 (2022).
7. Huang, E. P. et al. Criteria for the translation of radiomics into clinically useful tests. Nat. Rev. Clin. Oncol. 20, 69–82 (2023).
8. Cobo, M., Menéndez Fernández-Miranda, P., Bastarrika, G. & Lloret Iglesias, L. Enhancing radiomics and Deep Learning systems through the standardization of medical imaging workflows. Sci Data 10, 732 (2023).
9. Limkin, E. J. et al. Promises and challenges for the implementation of computational medical imaging (radiomics) in oncology. Ann. Oncol. 28, 1191–1206 (2017).
10. Kazmierski, M. et al. Multi-institutional Prognostic Modeling in Head and Neck Cancer: Evaluating Impact and Generalizability of Deep Learning and Radiomics. Cancer Res Commun 3, 1140–1151 (2023).
11. Bommasani, R. et al. On the Opportunities and Risks of Foundation Models. arXiv [cs.LG] (2021).
12. Brown, T. B. et al. Language Models are Few-Shot Learners. (2020).
13. Yuan, L. et al. Florence: A New Foundation Model for Computer Vision. (2021).
14. Girdhar, R. et al. ImageBind: One Embedding Space To Bind Them All. (2023).
15. Oquab, M. et al. DINOv2: Learning Robust Visual Features without Supervision. (2023).
16. Thieme, A. et al. Foundation Models in Healthcare: Opportunities, Risks & Strategies Forward. in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY, USA, 2023). doi:10.1145/3544549.3583177.
17. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
18. Paschali, M. et al. Foundation models in radiology: What, how, why, and why not. Radiology 314, e240597 (2025).
19. Hamamci, I. E. et al. Developing generalist foundation models from a multimodal dataset for 3D computed tomography. arXiv [cs.CV] (2024).
20. Zhou, H.-Y., Adithan, S., Acosta, J. N., Topol, E. J. & Rajpurkar, P. A generalist learner for multifaceted medical image interpretation. arXiv [cs.CV] (2024).
21. Pai, S. et al. Foundation model for cancer imaging biomarkers. Nat Mach Intell 6, 354–367 (2024).
22. Zheng, L. et al. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. arXiv [cs.CL] (2023).
23. Chiang, W.-L. et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. ICML abs/2403.04132 (2024).
24. Nayak, S. et al. Benchmarking Vision Language Models for cultural understanding. arXiv [cs.CV] (2024).
25. MedArena - LLM Arena for Clinicians. https://medarena.ai/login.
26. Chen, S. et al. Cross-Care: Assessing the healthcare implications of pre-training data on language model bias. arXiv [cs.CL] (2024).
27. Chen, H., Fang, Z., Singla, Y. & Dredze, M. Benchmarking large language models on answering and explaining challenging medical questions. arXiv abs/2402.18060 (2024).
28. Bassi, P. R. A. S. et al. Touchstone benchmark: Are we on the right way for evaluating AI algorithms for medical segmentation? arXiv [cs.CV] (2024).
29. Zhou, Z. et al. Models Genesis: Generic Autodidactic Models for 3D Medical Image Analysis. Med. Image Comput. Comput. Assist. Interv. 11767, 384–393 (2019).
30. He, Y. et al. VISTA3D: Versatile Imaging SegmenTation and annotation model for 3D computed tomography. arXiv [cs.CV] (2024).
31. Setio, A. A. A. et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med. Image Anal. 42, 1–13 (2017).
32. Wu, L., Zhuang, J. & Chen, H. Large-scale 3D medical image pre-training with geometric context priors. arXiv [cs.CV] (2024).
33. Tushar, F. I. et al. AI in lung health: Benchmarking detection and diagnostic models across multiple CT scan datasets. arXiv [cs.CV] (2024).
34. Aerts, H. J. W. L. et al. Data From NSCLC-Radiomics. The Cancer Imaging Archive https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI (2019).
35. Napel, S. & Plevritis, S. K. NSCLC Radiogenomics: Initial Stanford Study of 26 cases. The Cancer Imaging Archive https://doi.org/10.7937/K9/TCIA.2014.X7ONY6B1 (2014).
36. Pai, S. et al. Vision foundation models for computed tomography. arXiv [eess.IV] (2025).
37. Blankemeier, L. et al. Merlin: A vision language foundation model for 3D computed tomography. arXiv [cs.CV] (2024).
38. Heller, N. et al. C4KC KiTS Challenge Kidney Tumor Segmentation Dataset. The Cancer Imaging Archive https://doi.org/10.7937/TCIA.2019.IX49E8NX (2019).
39. Li, W., Yuille, A. & Zhou, Z. How well do supervised 3D models transfer to medical imaging tasks? arXiv [eess.IV] (2025).
40. Simpson, A. L. et al. Preoperative CT and survival data for patients undergoing resection of Colorectal Liver Metastases (Colorectal-Liver-Metastases). The Cancer Imaging Archive https://doi.org/10.7937/QXK2-QG03 (2023).
41. Zhao, B., Schwartz, L. H., Kris, M. G. & Riely, G. J. Coffee-break lung CT collection with scan images reconstructed at multiple imaging parameters. The Cancer Imaging Archive https://doi.org/10.7937/K9/TCIA.2015.U1X8A5NR (2015).
42. Codella, N. C. F. et al. MedImageInsight: An open-source embedding model for general domain medical imaging. arXiv [eess.IV] (2024).
43. Huh, M., Cheung, B., Wang, T. & Isola, P. The platonic representation hypothesis. arXiv [cs.LG] (2024).
44. Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. arXiv [cs.CV] (2014).
45. Singh, S. et al. Benchmarking object detectors with COCO: A new path forward. arXiv [cs.CV] (2024).
46. Geiger, A., Lenz, P. & Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. in 2012 IEEE Conference on Computer Vision and Pattern Recognition 3354–3361 (IEEE, 2012). doi:10.1109/cvpr.2012.6248074.
47. Song, Y. et al. GlobalBench: A benchmark for global progress in natural language processing. arXiv [cs.CL] (2023).
48. Kayser, M. et al. E-ViL: A dataset and benchmark for natural language explanations in vision-language tasks. arXiv [cs.CV] (2021).
49. Wantlin, K. et al. BenchMD: A benchmark for unified learning on medical images and sensors. arXiv [cs.CV] (2023).
50. Parmar, C., Grossmann, P., Bussink, J., Lambin, P. & Aerts, H. J. W. L. Machine Learning methods for Quantitative Radiomic Biomarkers. Sci. Rep. 5, 13087 (2015).
51. Woznicki, P., Laqua, F. C., Al-Haj, A., Bley, T. & Baeßler, B. Addressing challenges in radiomics research: systematic review and repository of open-access cancer imaging datasets. Insights Imaging 14, 216 (2023).
52. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv [cs.CV] (2015).
53. Vaswani, A. et al. Attention is all you need. arXiv [cs.CL] (2017).
54. Lei, W. et al. A data-efficient pan-tumor foundation model for oncology CT interpretation. arXiv [eess.IV] (2025).
55. Wang, A. et al. Duke Lung Cancer Screening Dataset 2024. Zenodo https://doi.org/10.5281/ZENODO.10782890 (2024).
56. Fedorov, A. et al. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021).
57. Clark, K. et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J. Digit. Imaging 26, 1045–1057 (2013).

Additional Declarations

There is NO Competing Interest.
CTClip, Voco, and Merlin failed to generate saliency maps clearly indicating tumor-specific regions (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eSimilarity Relationships Between Foundation Model Embeddings\u003c/h2\u003e \u003cp\u003eThe relationships between the feature/embedding spaces learned by different models were examined using mutual k-nearest neighbor\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e overlaps within each dataset. Consistent trends showed higher overlaps between embeddings from ModelsGenesis-VISTA3D, FMCIB-ModelsGenesis, and FMCIB-VISTA3D pairs across multiple datasets (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea). These pairs frequently involved the top-performing models identified earlier. Alignment calculations further quantified these relationships; VISTA3D and ModelsGenesis embeddings showed the highest average alignment with FMCIB embeddings (scores of 2.85 and 2.69, respectively) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). The maximum average alignment observed was between VISTA3D and ModelsGenesis (score of 3.89) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed), suggesting a convergence of feature representations among these high-performing models.\u003c/p\u003e \u003c/div\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eIn this study, we conducted the first comprehensive benchmarking of ten distinct foundation models (FMs) as feature extractors for quantitative tumor imaging tasks across six diverse, publicly available oncology datasets. 
Our evaluation covers predictive performance for diagnostic and prognostic endpoints, robustness to common sources of image variability, attribution analysis via saliency mapping, and an exploration of representational similarities between models. The results reveal substantial differences in how well different FM embeddings support these downstream tasks, underscoring the difficulty of model selection. Notably, we identified FMCIB, ModelsGenesis, and VISTA3D as exhibiting consistently strong performance across the range of datasets and clinical endpoints evaluated. Furthermore, this superior performance is often accompanied by enhanced robustness and, for FMCIB and ModelsGenesis, biologically plausible saliency maps, suggesting these models capture more reliable and potentially interpretable quantitative phenotypes.\u003c/p\u003e \u003cp\u003eThe popularity of FMs, pre-trained using varied architectures, datasets, and self-supervised objectives, presents both opportunity\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e and complexity\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e for domain-specific applications like quantitative tumor imaging. 
While numerous benchmarks exist for general computer vision\u003csup\u003e\u003cspan additionalcitationids=\"CR45\" citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e and natural language processing\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e,\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e, and more recently within the broader medical field (e.g., BenchMD\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e, MedArena\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e), a dedicated evaluation framework comparing FMs specifically for their ability to generate predictive quantitative radiomic signatures remains missing. Previous benchmarking efforts in tumor imaging have primarily focused on standardizing traditional feature extraction pipelines or comparing classical machine learning algorithms applied to those features\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e,\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e. 
Our work directly addresses this gap, providing a much-needed resource for researchers seeking to leverage the representational power of FMs for biomarker discovery in oncology, analogous to how earlier benchmarks guided progress in other fields.\u003c/p\u003e \u003cp\u003eOur analysis across models spanning different eras of deep learning architectures (from UNets\u003csup\u003e\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e to Transformers\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e\u003c/sup\u003e) and pre-training paradigms (restorative, contrastive, generative, segmentation-based) yielded several interesting observations. Perhaps most striking was the robust performance of ModelsGenesis, a relatively older model based on a simple UNet architecture and pre-trained using a restorative objective on only\u0026thinsp;~\u0026thinsp;600 CT scans from a single LIDC dataset. Its consistent effectiveness across diverse anatomies and tasks suggests that its pre-training, focused on recovering corrupted inputs, may instill representations particularly robust to noise and adept at capturing fundamental tissue characteristics relevant for radiomics, even without exposure to vast datasets or explicit oncological tasks during pre-training. Similarly, FMCIB, which employs contrastive learning to achieve invariance against various image transformations\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e, also demonstrated strong, robust performance, reinforcing the hypothesis that pre-training objectives emphasizing robustness to transformations can yield powerful general-purpose embeddings for quantitative analysis. 
In contrast, VISTA3D\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e, leveraging segmentation pre-training on a large, diverse dataset including tumors, likely benefits from learning spatial features relevant to tumor characterization through its unique fine-grained supervoxel segmentation pre-training. The more recent PASTA\u003csup\u003e\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e model, despite sophisticated multi-stage pre-training involving synthetic tumors, yielded less performant embeddings in our benchmark, suggesting that while potentially powerful for fine-tuning on specific tasks (as shown in its original publication), its raw embeddings may not generalize as effectively for the broad range of quantitative radiomic tasks studied here. These findings collectively highlight that the choice of pre-training strategy and data significantly impacts the downstream utility of FM embeddings, and that newer or larger models are not invariably superior for every application.\u003c/p\u003e \u003cp\u003eAs access to large-scale datasets and computational resources continues to grow, the landscape of FMs will undoubtedly become even more crowded. Comparing these models effectively, especially as performance on specific tasks begins to saturate, becomes increasingly critical. Inspired by concepts such as the Platonic Representation Hypothesis\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, which suggests that optimal representations might converge towards a shared underlying structure, we investigated the similarity between the representation spaces of the FMs. Our analysis revealed that the models demonstrating stronger aggregate performance (notably FMCIB and ModelsGenesis) also exhibited significantly higher representational similarity, as measured by mutual neighbor analysis. 
This convergence among top-performing models, further corroborated by qualitative similarities in their saliency maps, suggests they may be learning overlapping, potentially fundamental, image features crucial for oncological prediction. This observation points to future possibilities for model interchangeability or ensembling, where understanding representational alignment could guide the selection of complementary models.\u003c/p\u003e \u003cp\u003eWe acknowledge several limitations inherent in this study. First, our benchmark relies exclusively on six fully public datasets. While this choice maximizes accessibility and reproducibility, it limits the scale and diversity compared to potentially available restricted-access datasets. Second, our evaluation is currently image-only, which may disadvantage multimodal FMs that integrate text or other data types. However, the scarcity of suitable public datasets with paired imaging and clinical text presents a significant challenge in this context. Third, our methodology focuses solely on evaluating the fixed embeddings extracted from the FMs, without task-specific fine-tuning. While fine-tuning can often boost performance on a target task, we argue that evaluating the raw embeddings provides a better assessment of the foundational representational power learned during pre-training \u0026ndash; a strong foundational embedding should inherently capture rich image properties beneficial for downstream tasks, and should correlate with fine-tuning performance. Future work could explicitly compare embedding performance with fine-tuning outcomes within this framework. 
Finally, for downstream prediction, we employed a simple k-nearest neighbor classifier to minimize confounding factors from complex modeling choices, though a wider range of classifiers could be explored in future work.\u003c/p\u003e \u003cp\u003eIn conclusion, this study introduces a comprehensive benchmark for evaluating foundation models in quantitative tumor imaging, providing critical insights into the performance, robustness, and underlying representational characteristics of ten prominent models across diverse oncological tasks. Our findings reveal that model selection requires careful consideration, with FMCIB, ModelsGenesis, and VISTA3D emerging as particularly promising candidates. Crucially, we present not only these results but also an open-source, extensible framework for systematic and reproducible integration of new foundation models and datasets. We hope this work will significantly push forward the adoption and rigorous evaluation of these foundation models, accelerating progress in quantitative imaging for precision oncology.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cp\u003e\u003cstrong\u003eDesign of TumorImagingBench Evaluation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTumorImagingBench is constructed to evaluate the capacity of tumor imaging (radiomic) signatures to quantify diverse radiological phenotypes of cancer. It encompasses six publicly available datasets: LUNA16 and DLCS for lung cancer diagnostics, focusing on nodule malignancy; NSCLC-Radiomics and NSCLC-Radiogenomics for assessing prognosis post-surgery and/or radiotherapy in non-small cell lung cancer patients; C4KC-KiTS for renal carcinoma prognosis post-partial or radical nephrectomy; and Colorectal-Liver-Metastases for assessing prognosis in colorectal cancer patients with liver metastases post-hepatic resection. 
The selection of diverse endpoints across these datasets is intended to probe the generalizability of the selected radiomic signatures.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLUNA16\u003c/strong\u003e\u003csup\u003e31\u003c/sup\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eA curated selection from the LIDC-IDRI database, featuring 888 thoracic CT scans (diagnostic and lung screening) from 7 academic centers and 8 imaging companies. It includes 1,186 lung nodules, each annotated for location and attributes like internal composition, calcification, and malignancy by a consensus of at least 3 out of 4 radiologists. For our evaluation, a subset of 677 nodules was chosen, all having at least one indication of malignancy suspicion.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDLCS\u003c/strong\u003e\u003csup\u003e55\u003c/sup\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eA dataset from the Duke Health system featuring 2,487 nodules from 1,613 patients. Nodules, initially flagged by AI and verified by a medical student (with selective radiologist oversight), include 3D bounding boxes, Lung-RADS annotations, and cancer outcomes. The selection followed Lung-RADS v2022 criteria, focusing on nodules \u0026ge; 4 mm or in central/segmental airways. This study used a publicly available subset of 1,714 scans with pathology-confirmed malignancies.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNSCLC-Radiomics\u003c/strong\u003e\u003csup\u003e34\u003c/sup\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eAn independent test set for prognostication networks derived from the MAASTRO Clinic (Maastricht, NL). This set consisted of CT scans from 421 patients, selected from a cohort of 422 individuals with stage I-IIIB NSCLC treated with radiation therapy. 
Key characteristics include annotated primary Gross Tumor Volumes (GTVs), delineated by radiation oncologists using FDG PET-CT scans (Siemens Biograph, +/- contrast), and patients being right-censored for two-year survival. The chosen endpoint for prediction is 2-year survival from treatment date.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNSCLC-Radiogenomics\u003c/strong\u003e\u003csup\u003e35\u003c/sup\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eA dataset of 211 stage I-IV NSCLC patients from Stanford University and the Palo Alto VA (recruited 2008-2012, referred for surgery) who had preoperative CT and PET/CT scans (variable equipment/protocols). Tumor segmentations reviewed by two radiologists are available for 144 patients. The dataset also includes molecular data (EGFR/KRAS/ALK mutations, gene expression, RNA-seq). Our study focused on 133 patients with annotated Gross Tumor Volumes (GTVs), right-censored for two-year survival. This subset served as an independent test set for prognostication and subsequent biological investigation of our networks. The chosen endpoint for prediction is 2-year survival post surgery.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eC4KC-KiTS\u003c/strong\u003e\u003csup\u003e38\u003c/sup\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eA dataset of patients who underwent partial or radical nephrectomy for renal tumors at the University of Minnesota Medical Center between 2010 and 2018. Cases lacking preoperative arterial phase abdominal CT imaging were excluded. From the eligible population, 300 cases were randomly selected for potential inclusion. Of these, 210 had complete tumor segmentation data available. After excluding patients lost to follow-up prior to the event of interest, our final cohort for this study comprised 134 patients. 
For multi-tumor cases, we used the largest tumor volume to determine the seed point for our computational analysis. The chosen endpoint for prediction is 2-year survival post nephrectomy.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eColorectal-Liver-Metastases\u003c/strong\u003e\u003csup\u003e40\u003c/sup\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eA single-institution consecutive series of patients who underwent resection of colorectal liver metastases (CRLM), with matched preoperative CT scans. Inclusion required: pathologically confirmed CRLM, available pathologic data of non-tumoral liver parenchyma and tumor, and preoperative portal venous contrast-enhanced MDCT within 6 weeks of resection. Patients with 90-day mortality, \u0026lt;24 months follow-up, preoperative hepatic artery infusion chemotherapy, local tumor ablation, \u0026gt;3 wedge resections, or no visible tumor on preoperative imaging were excluded. For our analysis, we selected the largest tumor from each patient, resulting in a final cohort of 194 patients after excluding those lost to follow-up. The chosen endpoint for prediction is 2-year survival post resection.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSelection of FM Radiomic Embedding Models\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe selected ten pre-trained models as embedding extractors to derive radiomic signatures from computed tomography images. The models span 2020 to 2025, capturing prevalent design choices in pre-trained model development over five years. The earliest model, ModelsGenesis, employs a simple UNet convolutional network, while the most recent model, PASTA, utilizes a sophisticated nnUNet framework in a multimodal scheme. CT-ViT and MedImageInsight, both recent models, incorporate joint text-vision approaches and transformer architectures. 
Table 1 presents a comparative analysis of these approaches based on several critical design choices.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1: Comparative analysis of pre-trained/foundation models with their corresponding pre-training frameworks, architectural specifications, and training datasets.\u0026nbsp;\u003c/strong\u003eThe table highlights various models, their meta-architectural designs, and structural configurations. Pre-training datasets are summarized with relevant characteristics. The specific downstream evaluation tasks that utilized these pre-trained models are also enumerated to demonstrate model applicability and performance across domains.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"623\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eMeta-architecture\u003c/p\u003e\n \u003cp\u003e/ Pre-training design\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003eArchitecture\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003eDataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003eParams\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eEvaluated Tasks\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eFMCIB\u003csup\u003e21\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eTumor positive + Negative-mining SimCLR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D ResNet50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e11,467 scans from DeepLesion\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e184M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003ePatch-based diagnosis and prognosis\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eCT-FM\u003csup\u003e36\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003e3D image-based contrastive pre-training promoting awareness of 3D structure.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D SegResNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e148k scans from Imaging Data Commons (multiple datasets)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e77M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eWhole-body and tumor segmentation, head CT triage, medical image retrieval, semantic understanding\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eCT-CLIP\u003csup\u003e19\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eContrastive language-image pretraining framework for 3D chest CTs using radiology reports\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D ViT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e25,692 scans from CT-RATE dataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n 
\u003cp\u003e25M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eMulti-abnormality detection, case retrieval, zero-shot classification\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003ePASTA\u003csup\u003e54\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eTwo-stage process focusing on semantic segmentation and text-image alignment on synthetic tumors\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003ennUNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e30k scans from PASTA-GEN30k dataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e127M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eFew-shot and zero-shot segmentation, tumor staging and prognosis, lesion report generation\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eVISTA3D\u003csup\u003e30\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eSupervised multi-instance training along with supervoxel supervision and separate heads for interactive segmentation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D SegResNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e11,454 scans from 15 different datasets\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e175M\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eSegmentation tasks across various anatomical structures and lesions (+ interactive)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eVOCO\u003csup\u003e32\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eLarge-scale 3D medical image pre-training with geometric context priors\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D SwinUNETR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e160k CTs from 30 public datasets\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e295M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eSegmentation, classification, registration, and vision-language tasks\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eSUPREM\u003csup\u003e39\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eSupervised pre-training on AbdomenAtlas 1.1, combining large-scale datasets with per-voxel annotations\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D UNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e9,262 scans from AbdomenAtlas1.1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e19M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eSegmentation tasks across 
multiple datasets, demonstrating transfer learning capabilities\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eMerlin\u003csup\u003e37\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eVision-language foundation model for 3D CT, trained with EHR and radiology reports\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D UNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e15,331 CT scans along with radiology reports\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e270M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eZero-shot classification, phenotype classification, radiology report generation, 3D segmentation\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 100px;\"\u003e\n \u003cp\u003eMedImageInsight\u003csup\u003e42\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eTwo-tower architecture optimized with UniCL objective f.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e2D ViT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e3.7M image-text/label/age pairs across several datasets\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e616M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eImage-text search, image-image search, report generation, and task fine-tuning\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" 
style=\"width: 100px;\"\u003e\n \u003cp\u003eModelsGenesis\u003csup\u003e29\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 183px;\"\u003e\n \u003cp\u003eImage restoration on 3D chest CT volumes\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 70px;\"\u003e\n \u003cp\u003e3D UNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 94px;\"\u003e\n \u003cp\u003e623 scans from LUNA16 dataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 53px;\"\u003e\n \u003cp\u003e7M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eSegmentation and classification tasks across five target 3D applications\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eFMCIB\u003c/strong\u003e is a foundation model designed to distinguish between lesions and non-lesions at the patch level in medical imaging. It aims to enhance the detection and characterization of cancerous lesions by leveraging self-supervised learning techniques.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCT-FM\u0026nbsp;\u003c/strong\u003eis a large-scale 3D image-based pre-trained model specifically developed for various radiological tasks, including segmentation and classification. It was trained on a substantial dataset of CT scans and demonstrated superior performance across multiple tasks compared to state-of-the-art models. The model\u0026apos;s architecture allows it to effectively cluster anatomical regions and identify similar structures across scans, making it a robust tool for medical image analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCT-CLIP\u0026nbsp;\u003c/strong\u003eis a novel 3D adaptation of the CLIP model, designed for multi-abnormality detection in chest CT scans. 
It utilizes contrastive learning to align CT volumes with corresponding radiology reports, enabling zero-shot classification capabilities. This model excels in detecting multiple abnormalities without the need for extensive manual annotations, showcasing its potential for efficient clinical applications\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePASTA\u0026nbsp;\u003c/strong\u003eis a 3D-CT foundation model that addresses data scarcity in oncology by synthesizing lesions across various organs and tumor types. It utilizes a generative model to create a large dataset of synthetic CT scans, which enhances its training for lesion segmentation and vision-language alignment tasks. PASTA has shown exceptional performance in cross-domain transfer learning, outperforming existing models in multiple evaluation tasks\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVISTA3D\u0026nbsp;\u003c/strong\u003eis a versatile imaging segmentation and annotation model that supports both automatic and interactive segmentation of 3D medical images. It is the first unified foundation model to achieve state-of-the-art performance across 127 classes and is designed to facilitate efficient human correction through its interactive features. VISTA3D integrates a novel supervoxel method to enhance zero-shot performance, making it a significant advancement in 3D medical imaging\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVOCO\u0026nbsp;\u003c/strong\u003eis a large-scale 3D medical image pre-training framework that leverages geometric context priors to learn consistent semantic representations. It is built on a substantial dataset of CT volumes and employs a novel pretext task for contextual position predictions. 
VOCO has demonstrated superior performance across various downstream tasks, establishing itself as a leading model in the field of medical imaging.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSUPREM\u003c/strong\u003e is a suite of pre-trained models that provides state-of-the-art performance in organ and tumor segmentation tasks. It is based on supervised pre-training methodologies and has been shown to outperform models trained from scratch. SUPREM\u0026apos;s architecture includes various backbones, allowing it to adapt effectively to different medical imaging tasks\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMerlin\u0026nbsp;\u003c/strong\u003eis a vision-language foundation model designed for interpreting abdominal CT scans. It integrates structured electronic health record (EHR) data and unstructured radiology reports to perform a variety of tasks, including zero-shot classification and report generation. Merlin\u0026apos;s architecture allows it to generalize across multiple downstream tasks, making it a versatile tool for clinical applications\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMedImageInsight\u003c/strong\u003e is a lightweight foundation model for medical imaging that spans multiple modalities, including X-ray, CT, and MRI and is trained via image-text/label/age alignment. It achieves state-of-the-art performance on various datasets and supports both classification and image retrieval tasks. The model is designed to be adaptable and efficient, making it suitable for real-world clinical applications.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eModelsGenesis\u0026nbsp;\u003c/strong\u003eis a collection of models built from unlabeled 3D imaging data using a restorative reconstruction-based self-supervised method. It aims to generate powerful application-specific target models through transfer learning. 
The models demonstrate strong performance across various medical imaging tasks, emphasizing their potential for broad applicability in clinical settings.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEmbedding/Feature extraction configuration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFeature extraction followed a similar procedure across all models. For the segmentation-focused models, we took the embeddings from the last layers of the encoder and applied average pooling to compress the feature representations along the spatial dimensions. For non-segmentation models, the output of the last layer was taken in the same way and average pooled. For MedImageInsight, a 2D model, we averaged the embeddings across all slices of the 3D volume, as recommended in the original study.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eK-Nearest Neighbor Modelling setup\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFinal-layer embeddings were used as radiomic features for each sample. These features were used to predict outcomes with a k-nearest neighbor model that used cosine distance. Models were trained using 10-fold cross-validation, and results were aggregated across the 10 runs. Optuna hyperparameter tuning was run independently for each run to select the optimal number of neighbors for the task from a range of 1 to 50. 95% confidence intervals were calculated from the 10-fold cross-validation.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRobustness and Saliency Evaluation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe evaluated model robustness through multiple complementary approaches. First, to assess test-retest reliability, we used the RIDER dataset, a collection of chest CT scans from 26 patients, each of whom underwent two scans within a 15-minute interval. For each model, we calculated the cosine similarity between embeddings generated from these paired scans. 
Higher cosine similarity values indicate greater robustness to normal scanning variability. To evaluate sensitivity to input variations, we simulated annotation variability by generating 50 random perturbations of each seed point. These perturbations followed a three-dimensional multivariate normal distribution (zero mean, diagonal covariance matrix) with a variance of 16 voxels in each dimension. We then trained models on one trial and compared predictions across all trials, measuring agreement using Cohen\u0026apos;s Kappa after converting continuous predictions to categorical values. We implemented occlusion sensitivity analysis to identify image regions most influential to model predictions. This approach systematically occludes different regions of the input image and measures resulting changes in the output embeddings using cosine distance. Regions causing larger embedding deviations when occluded are considered more salient to the model\u0026apos;s feature extraction process. We generated and compared these saliency maps across all models to assess their focus areas.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMutual K-Nearest Neighbor Evaluation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo compare feature representations between models, we employed mutual k-nearest neighbor analysis. For each sample, we identified its 10 nearest neighbors in the embedding space generated by each model. We then quantified the overlap between neighbor sets across different models. 
This overlap metric reveals similarities in how different models structure their feature spaces and cluster similar samples, providing insight into which models learn comparable representations despite architectural differences.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENTS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors acknowledge financial support from NIH (H.J.W.L.A: NIH-USA U24CA194354, NIH-USA U01CA190234, NIH-USA U01CA209414, NIH-USA R35CA22052, and NIH-USA U54CA274516-01A1) and the European Union - European Research Council (H.J.W.L.A: 866504).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDATA AVAILABILITY\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll datasets used in this study are publicly available and can be downloaded from their respective Zenodo, IDC\u003csup\u003e56\u003c/sup\u003e and TCIA\u003csup\u003e57\u003c/sup\u003e sources. Links to the datasets are available through the data citations.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCODE AVAILABILITY\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe complete pipeline used in this study can be accessed either from the AIM webpage or directly on https://github.com/AIM-Harvard/TumorImagingBench .\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe code covers 1) data preprocessing and creation of annotation files, 2) feature extraction through a systematic interface for each model across all datasets, and 3) evaluation pipelines for building models and post-evaluation.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAdi-Wauran, E., Krishnapillai, S., Uleryk, E., Saeedi, S. \u0026amp; Bombard, Y. Patient-centred care in precision oncology: A systematic review. \u003cem\u003ePatient Educ Couns\u003c/em\u003e \u003cstrong\u003e136\u003c/strong\u003e, 108753 (2025).\u003c/li\u003e\n\u003cli\u003eHricak, H. 
\u003cem\u003eet al.\u003c/em\u003e Advances and challenges in precision imaging. \u003cem\u003eLancet Oncol\u003c/em\u003e \u003cstrong\u003e26\u003c/strong\u003e, e34\u0026ndash;e45 (2025).\u003c/li\u003e\n\u003cli\u003eImaging in Clinical Oncology. \u003cem\u003eRadiology\u003c/em\u003e (2015) doi:10.1148/radiol.14144052.\u003c/li\u003e\n\u003cli\u003eLambin, P. \u003cem\u003eet al.\u003c/em\u003e Radiomics: extracting more information from medical images using advanced feature analysis. \u003cem\u003eEur. J. Cancer\u003c/em\u003e \u003cstrong\u003e48\u003c/strong\u003e, 441\u0026ndash;446 (2012).\u003c/li\u003e\n\u003cli\u003eAerts, H. J. W. L. \u003cem\u003eet al.\u003c/em\u003e Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 4006 (2014).\u003c/li\u003e\n\u003cli\u003eMu, W., Schabath, M. B. \u0026amp; Gillies, R. J. Images Are Data: Challenges and Opportunities in the Clinical Translation of Radiomics. \u003cem\u003eCancer Res\u003c/em\u003e \u003cstrong\u003e82\u003c/strong\u003e, 2066\u0026ndash;2068 (2022).\u003c/li\u003e\n\u003cli\u003eHuang, E. P. \u003cem\u003eet al.\u003c/em\u003e Criteria for the translation of radiomics into clinically useful tests. \u003cem\u003eNat. Rev. Clin. Oncol.\u003c/em\u003e \u003cstrong\u003e20\u003c/strong\u003e, 69\u0026ndash;82 (2023).\u003c/li\u003e\n\u003cli\u003eCobo, M., Men\u0026eacute;ndez Fern\u0026aacute;ndez-Miranda, P., Bastarrika, G. \u0026amp; Lloret Iglesias, L. Enhancing radiomics and Deep Learning systems through the standardization of medical imaging workflows. \u003cem\u003eSci Data\u003c/em\u003e \u003cstrong\u003e10\u003c/strong\u003e, 732 (2023).\u003c/li\u003e\n\u003cli\u003eLimkin, E. J. \u003cem\u003eet al.\u003c/em\u003e Promises and challenges for the implementation of computational medical imaging (radiomics) in oncology. \u003cem\u003eAnn. 
Oncol.\u003c/em\u003e \u003cstrong\u003e28\u003c/strong\u003e, 1191\u0026ndash;1206 (2017).\u003c/li\u003e\n\u003cli\u003eKazmierski, M. \u003cem\u003eet al.\u003c/em\u003e Multi-institutional Prognostic Modeling in Head and Neck Cancer: Evaluating Impact and Generalizability of Deep Learning and Radiomics. \u003cem\u003eCancer Res Commun\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, 1140\u0026ndash;1151 (2023).\u003c/li\u003e\n\u003cli\u003eBommasani, R. \u003cem\u003eet al.\u003c/em\u003e On the Opportunities and Risks of Foundation Models. \u003cem\u003earXiv [cs.LG]\u003c/em\u003e (2021).\u003c/li\u003e\n\u003cli\u003eBrown, T. B. \u003cem\u003eet al.\u003c/em\u003e Language Models are Few-Shot Learners. (2020).\u003c/li\u003e\n\u003cli\u003eYuan, L. \u003cem\u003eet al.\u003c/em\u003e Florence: A New Foundation Model for Computer Vision. (2021).\u003c/li\u003e\n\u003cli\u003eGirdhar, R. \u003cem\u003eet al.\u003c/em\u003e ImageBind: One Embedding Space To Bind Them All. (2023).\u003c/li\u003e\n\u003cli\u003eOquab, M. \u003cem\u003eet al.\u003c/em\u003e DINOv2: Learning Robust Visual Features without Supervision. (2023).\u003c/li\u003e\n\u003cli\u003eThieme, A. \u003cem\u003eet al.\u003c/em\u003e Foundation Models in Healthcare: Opportunities, Risks \u0026amp; Strategies Forward. in \u003cem\u003eExtended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems\u003c/em\u003e (Association for Computing Machinery, New York, NY, USA, 2023). doi:10.1145/3544549.3583177.\u003c/li\u003e\n\u003cli\u003eMoor, M. \u003cem\u003eet al.\u003c/em\u003e Foundation models for generalist medical artificial intelligence. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e616\u003c/strong\u003e, 259\u0026ndash;265 (2023).\u003c/li\u003e\n\u003cli\u003ePaschali, M. \u003cem\u003eet al.\u003c/em\u003e Foundation models in radiology: What, how, why, and why not. 
\u003cem\u003eRadiology\u003c/em\u003e \u003cstrong\u003e314\u003c/strong\u003e, e240597 (2025).\u003c/li\u003e\n\u003cli\u003eHamamci, I. E. \u003cem\u003eet al.\u003c/em\u003e Developing generalist foundation models from a multimodal dataset for 3D computed tomography. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eZhou, H.-Y., Adithan, S., Acosta, J. N., Topol, E. J. \u0026amp; Rajpurkar, P. A generalist learner for multifaceted medical image interpretation. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003ePai, S. \u003cem\u003eet al.\u003c/em\u003e Foundation model for cancer imaging biomarkers. \u003cem\u003eNat Mach Intell\u003c/em\u003e \u003cstrong\u003e6\u003c/strong\u003e, 354\u0026ndash;367 (2024).\u003c/li\u003e\n\u003cli\u003eZheng, L. \u003cem\u003eet al.\u003c/em\u003e LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. \u003cem\u003earXiv [cs.CL]\u003c/em\u003e (2023).\u003c/li\u003e\n\u003cli\u003eChiang, W.-L. \u003cem\u003eet al.\u003c/em\u003e Chatbot Arena: An open platform for evaluating LLMs by human preference. \u003cem\u003eICML\u003c/em\u003e \u003cstrong\u003eabs/2403.04132\u003c/strong\u003e, (2024).\u003c/li\u003e\n\u003cli\u003eNayak, S. \u003cem\u003eet al.\u003c/em\u003e Benchmarking Vision Language Models for cultural understanding. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eMedArena - LLM Arena for Clinicians. https://medarena.ai/login.\u003c/li\u003e\n\u003cli\u003eChen, S. \u003cem\u003eet al.\u003c/em\u003e Cross-Care: Assessing the healthcare implications of pre-training data on language model bias. \u003cem\u003earXiv [cs.CL]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eChen, H., Fang, Z., Singla, Y. \u0026amp; Dredze, M. Benchmarking large language models on answering and explaining challenging medical questions. 
\u003cem\u003eArXiv\u003c/em\u003e \u003cstrong\u003eabs/2402.18060\u003c/strong\u003e, (2024).\u003c/li\u003e\n\u003cli\u003eBassi, P. R. A. S. \u003cem\u003eet al.\u003c/em\u003e Touchstone benchmark: Are we on the right way for evaluating AI algorithms for medical segmentation? \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eZhou, Z. \u003cem\u003eet al.\u003c/em\u003e Models Genesis: Generic Autodidactic Models for 3D Medical Image Analysis. \u003cem\u003eMed. Image Comput. Comput. Assist. Interv.\u003c/em\u003e \u003cstrong\u003e11767\u003c/strong\u003e, 384\u0026ndash;393 (2019).\u003c/li\u003e\n\u003cli\u003eHe, Y. \u003cem\u003eet al.\u003c/em\u003e VISTA3D: Versatile Imaging SegmenTation and annotation model for 3D computed tomography. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eSetio, A. A. A. \u003cem\u003eet al.\u003c/em\u003e Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. \u003cem\u003eMed. Image Anal.\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 1\u0026ndash;13 (2017).\u003c/li\u003e\n\u003cli\u003eWu, L., Zhuang, J. \u0026amp; Chen, H. Large-scale 3D medical image pre-training with geometric context priors. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eTushar, F. I. \u003cem\u003eet al.\u003c/em\u003e AI in lung health: Benchmarking detection and diagnostic models across multiple CT scan datasets. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eAerts, H. J. W. L. \u003cem\u003eet al.\u003c/em\u003e Data From NSCLC-Radiomics. The Cancer Imaging Archive https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI (2019).\u003c/li\u003e\n\u003cli\u003eNapel, S. \u0026amp; Plevritis, S. K. NSCLC Radiogenomics: Initial Stanford Study of 26 cases. 
The Cancer Imaging Archive https://doi.org/10.7937/K9/TCIA.2014.X7ONY6B1 (2014).\u003c/li\u003e\n\u003cli\u003ePai, S. \u003cem\u003eet al.\u003c/em\u003e Vision foundation models for computed tomography. \u003cem\u003earXiv [eess.IV]\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eBlankemeier, L. \u003cem\u003eet al.\u003c/em\u003e Merlin: A vision language foundation model for 3D computed tomography. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eHeller, N. \u003cem\u003eet al.\u003c/em\u003e C4KC KiTS Challenge Kidney Tumor Segmentation Dataset. The Cancer Imaging Archive https://doi.org/10.7937/TCIA.2019.IX49E8NX (2019).\u003c/li\u003e\n\u003cli\u003eLi, W., Yuille, A. \u0026amp; Zhou, Z. How well do supervised 3D models transfer to medical imaging tasks? \u003cem\u003earXiv [eess.IV]\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eSimpson, A. L. \u003cem\u003eet al.\u003c/em\u003e Preoperative CT and survival data for patients undergoing resection of Colorectal Liver Metastases (Colorectal-Liver-Metastases). The Cancer Imaging Archive https://doi.org/10.7937/QXK2-QG03 (2023).\u003c/li\u003e\n\u003cli\u003eZhao, B., Schwartz, L. H., Kris, M. G. \u0026amp; Riely, G. J. Coffee-break lung CT collection with scan images reconstructed at multiple imaging parameters. The Cancer Imaging Archive https://doi.org/10.7937/K9/TCIA.2015.U1X8A5NR (2015).\u003c/li\u003e\n\u003cli\u003eCodella, N. C. F. \u003cem\u003eet al.\u003c/em\u003e MedImageInsight: An open-source embedding model for general domain medical imaging. \u003cem\u003earXiv [eess.IV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eHuh, M., Cheung, B., Wang, T. \u0026amp; Isola, P. The platonic representation hypothesis. \u003cem\u003earXiv [cs.LG]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eRussakovsky, O. \u003cem\u003eet al.\u003c/em\u003e ImageNet Large Scale Visual Recognition Challenge. 
\u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2014).\u003c/li\u003e\n\u003cli\u003eSingh, S. \u003cem\u003eet al.\u003c/em\u003e Benchmarking object detectors with COCO: A new path forward. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2024).\u003c/li\u003e\n\u003cli\u003eGeiger, A., Lenz, P. \u0026amp; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. in \u003cem\u003e2012 IEEE Conference on Computer Vision and Pattern Recognition\u003c/em\u003e 3354\u0026ndash;3361 (IEEE, 2012). doi:10.1109/cvpr.2012.6248074.\u003c/li\u003e\n\u003cli\u003eSong, Y. \u003cem\u003eet al.\u003c/em\u003e GlobalBench: A benchmark for global progress in natural language processing. \u003cem\u003earXiv [cs.CL]\u003c/em\u003e (2023).\u003c/li\u003e\n\u003cli\u003eKayser, M. \u003cem\u003eet al.\u003c/em\u003e E-ViL: A dataset and benchmark for natural language explanations in vision-language tasks. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2021).\u003c/li\u003e\n\u003cli\u003eWantlin, K. \u003cem\u003eet al.\u003c/em\u003e BenchMD: A benchmark for unified learning on medical images and sensors. \u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2023).\u003c/li\u003e\n\u003cli\u003eParmar, C., Grossmann, P., Bussink, J., Lambin, P. \u0026amp; Aerts, H. J. W. L. Machine Learning methods for Quantitative Radiomic Biomarkers. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 13087 (2015).\u003c/li\u003e\n\u003cli\u003eWoznicki, P., Laqua, F. C., Al-Haj, A., Bley, T. \u0026amp; Bae\u0026szlig;ler, B. Addressing challenges in radiomics research: systematic review and repository of open-access cancer imaging datasets. \u003cem\u003eInsights Imaging\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 216 (2023).\u003c/li\u003e\n\u003cli\u003eRonneberger, O., Fischer, P. \u0026amp; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. 
\u003cem\u003earXiv [cs.CV]\u003c/em\u003e (2015).\u003c/li\u003e\n\u003cli\u003eVaswani, A. \u003cem\u003eet al.\u003c/em\u003e Attention is all you need. \u003cem\u003earXiv [cs.CL]\u003c/em\u003e (2017).\u003c/li\u003e\n\u003cli\u003eLei, W. \u003cem\u003eet al.\u003c/em\u003e A data-efficient pan-tumor foundation model for oncology CT interpretation. \u003cem\u003earXiv [eess.IV]\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eWang, A. \u003cem\u003eet al.\u003c/em\u003e Duke Lung Cancer Screening Dataset 2024. Zenodo https://doi.org/10.5281/ZENODO.10782890 (2024).\u003c/li\u003e\n\u003cli\u003eFedorov, A. \u003cem\u003eet al.\u003c/em\u003e NCI Imaging Data Commons. \u003cem\u003eCancer Res.\u003c/em\u003e \u003cstrong\u003e81\u003c/strong\u003e, 4188\u0026ndash;4193 (2021).\u003c/li\u003e\n\u003cli\u003eClark, K. \u003cem\u003eet al.\u003c/em\u003e The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. \u003cem\u003eJ. Digit. Imaging\u003c/em\u003e \u003cstrong\u003e26\u003c/strong\u003e, 1045\u0026ndash;1057 (2013).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature 
Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6630446/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6630446/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eFoundation models are increasingly used in medical imaging, yet their ability to extract reliable quantitative radiographic phenotypes of cancer across diverse clinical contexts lacks systematic evaluation. Here, we introduce TumorImagingBench, a curated benchmark comprising six public datasets (3,244 scans) with varied oncological endpoints. We evaluate ten medical imaging foundation models, representing diverse architectures and pre-training strategies developed between 2020 and 2025, assessing their performance in deriving deep learning-based radiographic phenotypes. Our analysis extends beyond endpoint prediction performance and compares robustness to common sources of variability and saliency-based interpretability. We additionally compare the mutual similarity of learned embedding representations across each of the models. This comparative benchmarking reveals performance disparities among models and provides critical insights to guide the selection of optimal foundation models for specific quantitative imaging tasks. 
We publicly release all code, curated datasets, and benchmark results to foster reproducible research and future developments in quantitative cancer imaging.\u003c/p\u003e","manuscriptTitle":"Foundation model embeddings for quantitative tumor imaging biomarkers","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-29 12:28:44","doi":"10.21203/rs.3.rs-6630446/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"nature-communications","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"NCOMMS","sideBox":"Learn more about [Nature Communications](http://www.nature.com/ncomms/)","snPcode":"","submissionUrl":"https://mts-ncomms.nature.com/","title":"Nature Communications","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Communications","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"3de77cf6-0132-4a0a-a137-1be3a3d4915f","owner":[],"postedDate":"May 29th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":49109364,"name":"Health sciences/Medical research/Translational research"},{"id":49109365,"name":"Physical sciences/Engineering/Biomedical engineering"}],"tags":[],"updatedAt":"2026-03-20T12:36:03+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-29 12:28:44","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6630446","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6630446","identity":"rs-6630446","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
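Illustrative sketch (not the authors' code): the K-Nearest Neighbor Modelling setup in the methods above describes predicting outcomes from final-layer embeddings with a cosine-distance k-nearest neighbor model under 10-fold cross-validation. A minimal standard-library version of that idea might look like the following. The function names, the majority-vote rule, the fold assignment, and the toy embeddings are all assumptions made for illustration. The paper's Optuna search over 1 to 50 neighbors is not reproduced here; `k` is simply passed in.

```python
import math
import random
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training embeddings closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: cosine_distance(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(features, labels, k, n_folds=10, seed=0):
    """Per-fold accuracy of a cosine k-NN classifier under n_folds-fold CV."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    # Hypothetical fold assignment: every n_folds-th shuffled index.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        if not held_out:
            continue  # skip empty folds when there are fewer samples than folds
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        correct = sum(
            knn_predict(train_x, train_y, features[i], k) == labels[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return scores


# Toy 2-D "embeddings" (made up): class 0 points along the x-axis direction,
# class 1 points along the y-axis direction.
toy_features = ([[1.0, 0.01 * j] for j in range(20)]
                + [[0.01 * j, 1.0] for j in range(20)])
toy_labels = [0] * 20 + [1] * 20

fold_scores = cross_validated_accuracy(toy_features, toy_labels, k=3)
mean_accuracy = sum(fold_scores) / len(fold_scores)
print(f"mean accuracy over {len(fold_scores)} folds: {mean_accuracy:.2f}")
```

Calling `cross_validated_accuracy` for each candidate `k` and comparing the mean score would be a simple stand-in for the hyperparameter search. For an unbiased estimate, that selection would need to happen inside each training split rather than on the held-out folds.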

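Illustrative sketch (not the authors' code): the Mutual K-Nearest Neighbor Evaluation in the methods above finds each sample's 10 nearest neighbors in every model's embedding space and quantifies the overlap between the neighbor sets. One plausible reading is sketched below under several assumptions. The text does not state the distance metric or the exact overlap formula, so cosine distance and the mean fraction of shared neighbors are choices made here. The function names and the toy embeddings are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def neighbor_sets(embeddings, k):
    """For each sample, the index set of its k nearest other samples."""
    if k >= len(embeddings):
        raise ValueError("k must be smaller than the number of samples")
    result = []
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: cosine_distance(query, embeddings[j]))
        result.append(set(others[:k]))
    return result


def mutual_knn_overlap(embeddings_a, embeddings_b, k=10):
    """Mean fraction of k-nearest neighbors shared between two embedding spaces.

    Row i of both inputs must describe the same sample.
    """
    if len(embeddings_a) != len(embeddings_b):
        raise ValueError("both models must embed the same samples")
    sets_a = neighbor_sets(embeddings_a, k)
    sets_b = neighbor_sets(embeddings_b, k)
    shared = [len(a & b) / k for a, b in zip(sets_a, sets_b)]
    return sum(shared) / len(shared)


# Toy data (made up): eight samples placed on an arc at 0.1 rad spacing.
angles = [0.1 * i for i in range(8)]
model_a = [[math.cos(t), math.sin(t)] for t in angles]
# A rescaled copy keeps every angle, so the neighbor sets are unchanged.
model_b = [[3.0 * x, 3.0 * y] for x, y in model_a]
# A third "model" that arranges the same samples in a different order.
permutation = [0, 4, 1, 5, 2, 6, 3, 7]
model_c = [[math.cos(0.1 * p), math.sin(0.1 * p)] for p in permutation]

overlap_same = mutual_knn_overlap(model_a, model_b, k=2)
overlap_shuffled = mutual_knn_overlap(model_a, model_c, k=2)
print(overlap_same, overlap_shuffled)
```

An overlap of 1 means the two spaces agree on every sample's neighborhood, and lower values mean the models group the samples differently. The toy example uses `k=2` only because it has eight samples; the methods text uses 10 neighbors.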


License: CC-BY-4.0