Profiler: an open web platform for multi-omics analysis

doi:10.21203/rs.3.rs-7058776/v1

2025 · doi:10.21203/rs.3.rs-7058776/v1

preprint OA: closed

Michel Salzet, Yanis Zirem, Lea Ledoux, Isabelle Fournier

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7058776/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted (Version 1)

Abstract

High-throughput multi-omics experiments create large, heterogeneous data matrices that remain inaccessible to many life-science laboratories. We introduce Profiler, an open-source, web-based platform that unifies data import, quality control, preprocessing, statistical tests, machine- and deep-learning, biomarker discovery, pathway enrichment and survival modelling behind an intuitive point-and-click interface. Built with Streamlit and deployable either locally or on high-performance clusters, Profiler processes proteomics, lipidomics and other omics modalities at interactive speeds.
In a benchmark on spatial proteomic and lipidomic maps from 50 glioblastoma resections, the platform reproduced published molecular subtypes, uncovered candidate therapeutic targets and generated fully traceable analysis reports in under ten minutes. Profiler therefore lowers the computational barrier for multi-omics projects and provides a reproducible foundation for systems-biology and precision-medicine research.

Subjects: Biological sciences/Cancer; Biological sciences/Computational biology and bioinformatics

Keywords: multi-omics, bioinformatics platform, machine learning, deep learning, biomarker discovery, pathway enrichment, automatic drug repurposing, data visualization, survival analysis, Streamlit

Introduction

The advent of high-throughput technologies, such as next-generation sequencing (NGS), mass spectrometry (MS) and microarrays, has revolutionized biomedical research. These platforms generate large-scale, multi-dimensional datasets, collectively referred to as omics data, encompassing genomics, transcriptomics, proteomics, and metabolomics. Such datasets hold immense potential for elucidating biological mechanisms, discovering disease biomarkers, and identifying novel therapeutic targets. However, the complexity, heterogeneity, and volume of omics data introduce substantial computational and analytical challenges [1].

Traditional omics data analysis typically requires specialized expertise in bioinformatics, statistics and programming, placing it beyond the reach of many experimental biologists and clinicians. Furthermore, many existing tools are limited in scope, tailored to specific omics types or single-step analyses, and are often confined to command-line environments, which hinders accessibility, interoperability and reproducibility.
Researchers are frequently compelled to navigate fragmented workflows across multiple software packages, leading to inefficiencies, steep learning curves and reproducibility concerns [2,3].

In response to these limitations, there is a growing demand for integrated, user-friendly, and visually intuitive platforms that combine analytical robustness with accessibility. Solutions such as Galaxy [4], MetaboAnalyst [5] and Perseus [6] have made important strides in addressing specific areas of omics analysis. However, few platforms offer a truly comprehensive, end-to-end solution that covers multiple omics modalities, incorporates advanced machine learning and deep learning methods and enables interactive data visualization and interpretation within a unified environment.

To address these critical gaps, we introduce Profiler, a modular, web-based application designed to democratize omics data analysis. Developed in Python using the Streamlit framework, Profiler provides a seamless and integrated pipeline covering key stages of analysis: data import and conversion, preprocessing (including cleaning, normalization, imputation, batch effect correction), visualization, statistical testing, machine learning, deep learning, biomarker discovery, pathway enrichment analysis, and survival analysis. The platform is built for scalability, modularity and extensibility, allowing it to evolve with emerging research needs and analytical innovations.

Notably, Profiler is engineered to serve both novice and expert users. It offers guided, workflow-oriented interfaces for users with limited computational experience, while its flexible architecture supports customization and advanced analytical workflows for experienced users. Profiler's compatibility with a wide range of data formats, combined with efficient backend processing, ensures robust performance even with high-dimensional datasets.
By lowering the technical barriers to entry, Profiler aims to provide the scientific community with an accessible, transparent and comprehensive analytical ecosystem, one that promotes reproducibility, accelerates discovery, and empowers data-driven decision-making in modern life sciences research.

Materials and Methods

The dataset used in this article to demonstrate the utility of Profiler originates from the studies by Duhamel et al. (2022) [7] and Lagache et al. (2025) [8]. While the data were initially collected in 2022, they were reanalyzed in the 2025 study using more appropriate and advanced data analysis pipelines.

Cohort

Tumors from 50 patients were included in the study. Patients with newly diagnosed glioblastoma were prospectively enrolled between September 2014 and November 2018 at Lille University Hospital, France (NCT02473484). All patients gave written informed consent before enrollment. These 50 tumors were used for omics MALDI-MSI and proteomics analysis. Tumor samples were processed within 2 hours after extraction in the surgery room to limit the risk of protein degradation.

Spatially resolved proteomics extraction

The different clusters identified by the segmentation process (explained in detail in Lagache et al. (2025) [8]) were submitted to spatially resolved proteomics. A localized digestion was carried out by depositing a trypsin solution (40 μg/ml in NH4HCO3 50 mM) on a region of 500 μm² of tissue (4 × 4 droplets of 200 μm in diameter), using the CHIP-1000. The deposition method comprises approximately 1205 cycles per digestion spot, i.e., 3 h of deposition, with a drop volume of 150 pL. In total, each spot was digested with 0.112 μg of trypsin. Following the micro-digestion, each spot was extracted by liquid microjunction using the TriVersa Nanomate device, with LESA (Liquid Extraction and Surface Analysis) parameters [9].
The tryptic peptides were extracted by performing two consecutive extraction cycles for each of three different solvent mixtures (TFA 0.1%; ACN/0.1% TFA (8:2, v/v); and MeOH/0.1% TFA (7:3, v/v)), for a total of six extractions. For each cycle, 2 μl of solvent was drawn into the tip of the pipette, of which 0.8 pL was brought into contact with the surface. Fifteen back-and-forth movements were performed to extract the peptides before collecting the solution in a recovery tube. All extracts were pooled in one tube and 50 μl of ACN was finally added before drying the samples in a SpeedVac. The samples were then stored at −20 °C prior to nLC-MS/MS analysis.

nLC-MS/MS Bottom-up Analysis

Prior to MS analysis, the reconstituted samples were desalted using C18 Ziptip (Millipore, Saint-Quentin-en-Yvelines, France), eluted with 80% ACN and vacuum-dried. The dried samples were resuspended in 0.1% FA aqueous/ACN (98:2, v/v). Peptide separation was performed by reverse phase chromatography, using a NanoAcquity UPLC system (Waters) coupled to a Q-Exactive Orbitrap mass spectrometer (Thermo Scientific) via a nanoelectrospray source. A pre-concentration column (nanoAcquity Symmetry C18, 5 µm, 180 µm × 20 mm) and an analytical column (nanoAcquity BEH C18, 1.7 µm, 75 µm × 250 mm) were used. A 2 h linear gradient of acetonitrile in 0.1% formic acid (5%-35%) was applied at a flow rate of 300 nl/min. For MS and MS/MS acquisition (Xcalibur 4.1 and Exactive Series 2.9), a data-dependent mode was defined to analyze the 10 most intense ions of the MS analysis (Top 10). The MS analysis was performed with an m/z mass range between 300 and 1600, a resolution of 70,000 FWHM, an AGC of 3e6 ions and a maximum injection time of 120 ms. The MS/MS analysis was performed with an m/z mass range between 200 and 2000, an AGC of 5e4 ions, a maximum injection time of 60 ms and a resolution of 17,500 FWHM.
To avoid any batch effect during the analysis, the extractions were chosen at random to create analysis sequences.

Data analysis prior to the use of Profiler (2022 study)

All MS data were searched with MaxQuant software [10] (Version 1.5.3.30) using the Andromeda search engine against the complete proteome for Homo sapiens (UniProt, release July 2018, 20,412 entries). Trypsin was selected as the enzyme and two missed cleavages were allowed, with N-terminal acetylation and methionine oxidation as variable modifications. The mass accuracies were set to 6 ppm and 20 ppm for MS and MS/MS spectra, respectively. The false discovery rate (FDR) at the peptide-spectrum match (PSM) and protein levels was estimated using a decoy version of the previously defined database (reverse construction, Homo sapiens, UniProt, release July 2018) and set to 1%. A minimum of two peptides, at least one of them unique, was required to identify a protein. The MaxLFQ algorithm was used to perform label-free quantification of the proteins.

Data analysis prior to the use of Profiler (2025 study)

The aim of the 2025 study [8] was to advance the concept of dry proteomics. To this end, lipidomic and proteomic MALDI-MSI analyses were performed on 13 glioblastoma tissue samples. Common molecular clusters were identified and correlated with microproteomic data previously obtained in the 2022 study [7]. We developed a dedicated pipeline for the segmentation of mass spectrometry imaging (MSI) data to assess the number and spatial distribution of tumor clones, both within and across patients. A t-SNE analysis based on lipidomic imaging data revealed a clear separation into two distinct groups. This clustering pattern was consistently recapitulated in the heatmap derived from microproteomic data, further supporting the robustness of the classification. As a result, the samples were stratified into two molecular subgroups, referred to as group A and group B.
Comparison of clinical outcomes and differential analysis showed that group A was associated with significantly longer overall survival (greater than 32 months) and with tumor aggressiveness, invasion and therapeutic resistance, while group B was linked to a poorer prognosis (survival of less than 30 months) and with less aggressiveness, necrosis and potential therapeutic targets.

To automate the prediction of patient outcomes, we developed a dry-lab proteomic analysis pipeline. This pipeline enabled the extraction of spatially resolved MSI clusters, which were subsequently analyzed using trained machine learning models. From a single pixel or an MSI-derived cluster, the models could predict the identity of the corresponding tumor clone, its associated protein expression profile, its classification into group A or B, and ultimately the patient's prognostic category.

Scalability and system resources

Profiler is designed with scalability and robust system resources in mind, ensuring optimal performance and reliability for high-demand analytical tasks. The platform runs on the Mésocentre de Calcul de Lille (https://hpc.univ-lille.fr), leveraging an infrastructure that includes 246 GB of RAM, 8 GB of swap memory and multiple vCPUs and vGPUs. This setup, operating on a Linux server and utilizing OpenStack cloud technology, provides ample computational power to handle complex analyses efficiently. Additionally, Profiler benefits from expandable GPU and vCPU pools, allowing for dynamic scaling of resources based on user demand. The system is actively monitored to ensure it meets the analytical needs of its users. If user demand increases, as tracked via system telemetry, Profiler's compute resources (CPU, GPU, RAM) can be scaled in collaboration with HPC administrators. This proactive approach ensures that the platform remains responsive and capable of handling increased loads without compromising performance.
Furthermore, plans are in place to augment the system's capacity if there is a surge in demand from the user community, ensuring that Profiler continues to deliver high-quality, timely results even under heavy usage. In parallel, desktop versions of Profiler are being considered for development to provide increased accessibility and offline functionality. The technological stack supporting Profiler's backend, frontend and cloud infrastructure is detailed in Table 1, which outlines the key libraries and tools integrated into the platform to enable efficient data processing, modeling, visualization and deployment.

Table 1. Overview of technologies and libraries used in the Profiler platform.

| Layer | Technology/Library | Description/Role |
|---|---|---|
| Backend | Python | Main programming language, integrates all modules and orchestrates workflow execution |
| Backend | Pandas | Data manipulation and preprocessing for tabular and omics data |
| Backend | NumPy | Efficient numerical operations and array manipulation |
| Backend | pyOpenMS | Mass spectrometry file parsing |
| Backend | msconvert (ProteoWizard) | Raw MS data conversion to open formats |
| Backend | Openpyxl | Excel file handling |
| Backend | Scikit-learn | Machine learning models |
| Backend | TensorFlow/Keras | Deep learning model design and training (MLPs, CNNs, RNNs) |
| Backend | lifelines | Survival modeling and stratification |
| Backend | Imbalanced-learn | Class balancing (e.g., SMOTE, ADASYN, undersampling) |
| Backend | pycombat | Batch effect correction |
| Backend | SHAP, Eli5 | Model explainability |
| Backend | scipy.stats, statsmodels | Parametric and non-parametric statistical tests (e.g., t-test, ANOVA, Kruskal-Wallis, Mann-Whitney) |
| Backend | joblib / pickle | Model serialization and persistence (saving/loading ML pipelines and objects) |
| Backend | GSEApy | Gene set enrichment analysis |
| Backend | NetworkX | Construction and analysis of biological networks and pathway graphs |
| Frontend | Streamlit | Web-based user interface |
| Frontend | HTML/CSS | Custom layout and styling of the interface components |
| Frontend | Plotly, Matplotlib, Seaborn | Interactive and static visualizations (e.g., spectra, volcano plots, radar charts) |
| HPC and Cloud | Linux (Ubuntu) | Operating system for server and local environments |
| HPC and Cloud | OpenStack | Cloud infrastructure management for resource provisioning |
| HPC and Cloud | Systemd | Service orchestration and daemon management |
| HPC and Cloud | Nginx | Reverse proxy server for deployment, load balancing, and API exposure |
| HPC and Cloud | Docker | Containerization for reproducibility, environment control, and deployment |

Results

Profiler's primary goal is to bridge the gap between raw omics data and actionable biological insights by leveraging a custom pipeline combining state-of-the-art libraries, original modules, and high-performance computing. Figure 1 illustrates the 8 interconnected components of this software (detailed in the User's Manual in Supplementary Data 1). To demonstrate how Profiler operates and the types of results it can generate, the proteomic dataset processed with MaxQuant is used as the main running example throughout the workflow. Additionally, lipidomic data acquired using the SpiderMass technology, such as those published by Zirem et al. (2024) [11], are used for modules that are not applicable to the proteomic dataset.

Data conversion and importation

To accommodate vendor heterogeneity, Profiler integrates a vendor-agnostic data conversion module using msconvert from ProteoWizard [12]. It supports the conversion of raw files from Bruker, Thermo Fisher, and Waters instruments into open formats such as .mzML, .mzXML, .mz5, and .mzDB via pyOpenMS. During conversion, users can define mass range boundaries, enable peak picking, apply lock mass corrections, and downsample spectra for faster processing. This ensures standardization of MS input across platforms and enhances compatibility with downstream tools.

In addition, Profiler accepts and harmonizes a wide variety of omics data types. These include mass spectrometry standard format files, where MS files are structured by biological class or condition and parsed using the pyOpenMS library [13], and tabular omics data in .csv, .tsv, .txt, and .xlsx formats, including exports from MaxQuant [10], DIA-NN [14], and Perseus [6].
The expected format for tabular data requires a column named 'Class' for target labels (e.g., control, condition 1, etc.), with the remaining columns as features (ions, gene names, protein names, metabolites, etc.). Additionally, Profiler supports survival and clinical data, requiring 'Overall Survival' and 'State' columns to facilitate survival modeling and stratification using the lifelines library. Uploaded datasets are automatically cataloged, checked for delimiter consistency, and verified for missing or malformed values. Data handling and manipulation are facilitated by the pandas and openpyxl libraries.

Data exploration and preprocessing

An integrated data exploration module enables users to interactively explore and validate their datasets, summarizing them through visualizations of class distributions, missing values and sample sizes. As shown in Figure 2, the data exploration component of Profiler reports that the dataset consists of 108 samples in group A (73.5%) and 39 samples in group B (26.5%), indicating a class imbalance that may require over- or under-sampling. Furthermore, approximately 50% of the data are missing, with a higher proportion in group B. Only half of the features follow a normal distribution, suggesting that K-nearest neighbors (KNN) imputation is suitable for handling missing data, and that either the t-test or the Mann-Whitney U test should be used to assess statistical significance, depending on the distribution of each variable (feature). Users can also manage labels by editing class names in-session for clarity and consistency.

One module provides various preprocessing options, including normalization techniques such as TIC, RMS, BasePeak, QNorm and log transformations, as well as batch effect correction using NeuroCombat from the pycombat package [15].
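As an illustration of the exploration summary described above, the following minimal Python sketch computes the class shares, the fraction of missing values (overall and per class) and a per-feature normality flag from a table with a 'Class' column. It is not taken from the Profiler code base: the function name `summarize_dataset`, the 0.05 threshold and the choice of the Shapiro-Wilk test are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats


def summarize_dataset(df, label_col="Class", alpha=0.05):
    """Summarize class balance, missingness and per-feature normality (illustrative)."""
    if label_col not in df.columns:
        raise ValueError(f"expected a '{label_col}' column holding the target labels")
    labels = df[label_col]
    features = df.drop(columns=[label_col])

    # Share of samples per class, used to spot class imbalance.
    class_share = labels.value_counts(normalize=True)

    # Fraction of missing cells, overall and within each class.
    is_missing = features.isna()
    missing_overall = float(is_missing.to_numpy().mean())
    missing_by_class = is_missing.groupby(labels).mean().mean(axis=1)

    def looks_normal(column):
        # Shapiro-Wilk on observed values (an assumed choice of normality test);
        # features with too few or constant values are treated as non-normal.
        values = column.dropna().to_numpy(dtype=float)
        if values.size < 3 or np.ptp(values) == 0:
            return False
        return bool(stats.shapiro(values).pvalue > alpha)

    normal = features.apply(looks_normal)
    return {
        "class_share": class_share,
        "missing_overall": missing_overall,
        "missing_by_class": missing_by_class,
        "normal": normal,
        # Two-group test suggested per feature, following the rule stated in the text.
        "suggested_test": normal.map({True: "t-test", False: "Mann-Whitney U"}),
    }
```

Calling `summarize_dataset(df)` on a table shaped like the one described above returns a dictionary whose entries can be inspected or plotted; the per-feature `suggested_test` entry mirrors the t-test versus Mann-Whitney U choice mentioned in the text.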
Dynamic binning can be applied to selected mass ranges, and missing value imputation is supported through mean, median, mode, and KNN-based imputation using scikit-learn's KNNImputer [16]. For our dataset, KNN imputation combined with the removal of class-exclusive features was applied before all downstream analyses, as recommended in Figure 2. Because the dataset contains a balanced mix of normally and non-normally distributed features, it is unclear whether mean or median imputation would be optimal; KNN imputation therefore emerges as the most robust and adaptive solution. After KNN imputation and the removal of class-exclusive features (which cannot be imputed), the total number of proteins falls from 4936 to 4251.

Class balancing and sampling

Profiler includes advanced resampling strategies to correct class imbalance, either by augmenting or by reducing the data, which is crucial for training classification models. These strategies include oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling), which generate synthetic samples to balance the classes, and undersampling techniques such as RandomUnderSampler and NearMiss, which reduce the number of samples in the majority class. These resampling methods are applied through the imbalanced-learn library [17], ensuring full compatibility with structured data and MS intensities, thereby enhancing the performance and reliability of classification models. In our dataset, applying oversampling to address class imbalance, as recommended in Figure 2, would result in 108 samples per group. All subsequent analyses could then rely on this balanced dataset, if desired.

Data Visualization

The visualization engine relies on Plotly, Matplotlib, and Seaborn to generate interactive plots, offering a variety of visualization options and providing a comprehensive and interactive way to explore and understand the data.
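Returning briefly to the imputation step of the preprocessing stage: the combination of class-exclusive feature removal and KNN imputation can be sketched as follows. This is a minimal sketch under stated assumptions, not Profiler's implementation. The function name `impute_by_class` is hypothetical, a feature is treated as class-exclusive when at least one class has no observed value for it, and imputation is run separately within each class with scikit-learn's `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_by_class(df, label_col="Class", n_neighbors=5):
    """Drop class-exclusive features, then KNN-impute the rest within each class."""
    labels = df[label_col]
    features = df.drop(columns=[label_col]).astype(float)

    # A feature is class-exclusive when at least one class has no observed value for it.
    observed = features.notna().groupby(labels).sum()
    shared_mask = (observed > 0).all(axis=0)
    shared = shared_mask[shared_mask].index.tolist()
    exclusive = shared_mask[~shared_mask].index.tolist()

    imputed = features[shared].copy()
    for _, rows in labels.groupby(labels).groups.items():
        # Neighbors are searched only among samples of the same class.
        block = imputed.loc[rows]
        imputed.loc[rows] = KNNImputer(n_neighbors=n_neighbors).fit_transform(block)

    imputed.insert(0, label_col, labels)
    return imputed, exclusive
```

The second return value lists the features that were set aside; in the workflow described in this article, such group-exclusive proteins are kept only for pathway enrichment.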
The available plots include feature distributions displayed through line, bar, histogram, and radar charts, as well as spectra visualization with mean signal/features and individual samples. UpSet and Venn diagrams are used to show the overlap of features across classes, using the upsetplot library and custom logic [18]. Spectra from classical mass spectrometry datasets can be displayed and interactively explored (Figure S1), allowing zooming and other manipulations. In addition, pseudo-spectra, such as the one shown in Figure 3A, can be visualized to display the label-free quantification (LFQ) intensities of all detected proteins across groups.

Using the raw data, before applying KNN imputation by class and removing class-exclusive features (which cannot be imputed as they are not detected in the other class), a Venn diagram can be generated to identify group-exclusive proteins (Figure 3B). In our case, 145 proteins were found to be exclusive to group A and 540 to group B, with 4251 proteins in common. However, the exclusive proteins can only be used for pathway enrichment analysis (as presented in the following sections of the paper), not for statistical testing or machine/deep learning model training. Therefore, for all subsequent analyses except pathway enrichment, the results rely exclusively on the dataset without exclusive features and without missing values, since the latter are imputed.

Before performing statistical tests, it is important to explore and better understand the data. Several types of visualizations are available for this purpose. For example, a bar chart can be used to show the distribution of a specific protein across different groups by displaying its presence or absence (Figure 3C). The protein EGFR, for instance, appears to be more highly expressed in group B. It is also possible to compare multiple proteins simultaneously using radar charts, line plots, or bar charts (Figures 3D–F).
These visualizations reveal, for example, that EGFR is more abundant in group B, whereas DLGAP3, ICAM3 and KCTD16 are more highly expressed in group A.

Correlation and similarity analysis

To explore inter-feature or inter-class relationships, Profiler offers advanced modules that support exploratory biological hypotheses and quality control. Users can assess these relationships through correlation methods, including Pearson and Spearman, which are computed between the average feature vectors of each class. Pearson correlation is suited to normally distributed data and measures linear relationships, while Spearman correlation is suited to non-parametric data and assesses monotonic relationships using rank values. Additionally, inter-class resemblance is evaluated using cosine similarity and Cohen's Kappa score. Cosine similarity measures the angle between the feature vectors of each class, indicating the directional alignment of the data (with 1 signifying identical direction and 0 orthogonality). Cohen's Kappa, on the other hand, evaluates the agreement between categorized feature profiles after discretizing continuous data into ranked categories (e.g., low, medium, high expression). This discretization allows Kappa to measure agreement on patterns rather than exact numerical values, providing insights into the consistency of feature profiles across classes.

These techniques are crucial for understanding the underlying data structure and ensuring the reliability of biological interpretations. The novel application of Cohen's Kappa within Profiler is particularly valuable for omics analysis, as it can reveal consistent expression trends that may be masked by variability at the continuous level. Figure S2 shows that while group A and group B are highly correlated (r = 0.93), indicating strong similarity in continuous variables, their moderate agreement on Cohen's Kappa (κ = 0.57) suggests notable differences when categorical aspects are considered.
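The four measures discussed in this section can be illustrated with a short Python sketch that compares the mean feature profiles of two classes. The function name `class_similarity` is hypothetical, and the discretization rule (tertile cut points computed on the pooled profiles, giving low/medium/high categories) is an assumption, since the exact binning used by Profiler is not specified here.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score


def class_similarity(df, class_a, class_b, label_col="Class"):
    """Compare the mean feature profiles of two classes (illustrative sketch)."""
    means = df.groupby(label_col).mean(numeric_only=True)
    a = means.loc[class_a].to_numpy(dtype=float)
    b = means.loc[class_b].to_numpy(dtype=float)

    # Cosine similarity: 1 means identical direction, 0 means orthogonal profiles.
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Assumed discretization: tertile cut points from the pooled profiles,
    # mapping each feature to 0 (low), 1 (medium) or 2 (high).
    edges = np.quantile(np.concatenate([a, b]), [1 / 3, 2 / 3])
    kappa = float(cohen_kappa_score(np.digitize(a, edges), np.digitize(b, edges)))

    return {
        "pearson": float(pearsonr(a, b)[0]),
        "spearman": float(spearmanr(a, b)[0]),
        "cosine": cosine,
        "kappa": kappa,
    }
```

Because Kappa is computed on categories rather than raw values, two profiles can be strongly correlated and still show only moderate Kappa agreement, which is the pattern reported for groups A and B.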
Machine Learning & Deep Learning

Profiler supports comprehensive machine learning (ML) and deep learning (DL) workflows through scikit-learn, TensorFlow, and custom wrappers, offering a wide range of techniques for both unsupervised and supervised learning. For unsupervised learning, users can employ dimensionality reduction methods such as PCA (Principal Component Analysis), UMAP (Uniform Manifold Approximation and Projection), and t-SNE (t-Distributed Stochastic Neighbor Embedding) to visualize data clusters. The plots can be generated in both 2D and 3D, depending on the dimensionality reduction method used. In our case, non-linear techniques such as UMAP and t-SNE proved to be the most effective for clearly distinguishing between the two groups, A and B, as they form well-separated clusters (Figure 4A-B). In contrast, the linear method PCA fails to clearly differentiate these groups, suggesting that it does not capture the underlying structure of the data as effectively.

For UMAP, the n_neighbors parameter is crucial as it defines the size of the local neighborhood used for manifold approximation. Choosing this parameter can be challenging for scientists, as guidance on its choice is limited, and an inappropriate value can lead to misleading biological conclusions. To address this, Profiler uses a heuristic approach to calculate n_neighbors based on the number of data points. This heuristic ensures that the neighborhood size adapts to the dataset size, balancing the capture of local structure against computational efficiency. This approach is based on recommendations from the original UMAP paper [19] and practical guidelines from the machine learning community.

For t-SNE, the perplexity parameter controls the effective number of nearest neighbors considered for each data point. As with UMAP, selecting an appropriate perplexity value can be non-trivial and may result in incorrect interpretations if done manually.
Profiler calculates perplexity using a heuristic approach based on the number of data points. This heuristic aims to find a balance between preserving local and global data structures while avoiding overfitting. This method is inspired by the original t-SNE paper [20] and best practices in the field. By automating the selection of these parameters, Profiler helps users avoid potential pitfalls and ensures more reliable and reproducible results.

In addition, k-means clustering and silhouette analysis [21] can be used to assess group formation and heterogeneity. For k-means clustering, determining the optimal number of clusters is a critical step. Profiler uses silhouette analysis to evaluate the quality of the clustering. The silhouette score measures how similar an object is to its own cluster compared to other clusters; a higher silhouette score indicates better-defined clusters. By analyzing the silhouette scores for different numbers of clusters, Profiler helps users identify the optimal number of clusters without overclustering or underclustering. This ensures that the clustering results are meaningful and biologically relevant.

For our dataset, silhouette analysis indicates that an optimal clustering would involve three groups, rather than the current two-group classification (A and B) (Figure 4C). This is consistent with the previous t-SNE and UMAP plots, where two distinct subgroups can be observed within group B, suggesting underlying heterogeneity. This observation is further supported by the t-SNE projection with three clusters, where group B clearly subdivides into two separate clusters, referred to as groups B and C (Figure 4D-E). This suggests that group B may contain multiple tumor clones or distinct subtypes. In previous lipid-MSI studies, patients from group B often showed high levels of necrosis, which could also explain the observed heterogeneity.
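The silhouette-based choice of the number of clusters described above can be sketched in a few lines of Python with scikit-learn. This is an illustrative sketch only: the function name `best_k_by_silhouette`, the candidate range of 2 to 6 clusters and the fixed random seed are assumptions, not Profiler's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 7), random_state=0):
    """Run k-means for each candidate k and keep the k with the highest silhouette."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        # Mean silhouette over all samples; values closer to 1 mean tighter,
        # better separated clusters.
        scores[k] = float(silhouette_score(X, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Plotting `scores` against `k` gives the kind of curve from which a three-cluster solution can be read off, as in the analysis reported for this dataset.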
To explore the heterogeneity within group B further, integrating additional clinical metadata, such as age, sex, treatment history, or comorbidities, could help identify meaningful biological or clinical differences and improve patient stratification.

In supervised learning, Profiler provides access to over 23 models, including Random Forest, Logistic Regression, SVM, Naïve Bayes, Gradient Boosting and LDA/QDA, along with ensemble methods like bagging classifiers. Users can compare model performance using learning curves, confusion matrices and classification reports with metrics such as F1 score, accuracy, recall, precision, sensitivity and specificity. When attempting to build the optimal classification model using our dataset, 20 out of the 23 tested algorithms reached perfect accuracy (100%) under 20-fold cross-validation (Figure 4F-G). This performance underscores a clear separation between the groups and indicates that the models successfully captured distinct protein profiles characteristic of each group. Notably, both the confusion matrix and the classification report demonstrate that the optimal model, built using the RidgeClassifier algorithm, achieved perfect performance with no misclassifications (Figure 4H-I). The learning curve shows that the model begins to learn effectively after 70 samples and reaches optimal performance by 90 samples. Furthermore, the close alignment of the training and validation curves towards the end indicates good generalization, with no apparent underfitting or overfitting (Figure 4J).

For deep learning, Profiler supports architectures like MLP (Multilayer Perceptron), CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), with accelerated training and real-time metric tracking. Deep learning typically requires large amounts of data to be truly effective. In our case, the dataset is not extensive enough to provide a clear advantage over traditional machine learning approaches.
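As a side note on the supervised comparison described above, scoring several classifiers under k-fold cross-validation can be sketched as follows. This is a minimal sketch, not Profiler's model zoo: the function name `compare_classifiers`, the three candidate models, the 5-fold default and the use of a `StandardScaler` step are assumptions chosen for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_classifiers(X, y, n_splits=5, random_state=0):
    """Return the mean cross-validated accuracy of a few candidate models."""
    candidates = {
        "RidgeClassifier": RidgeClassifier(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=random_state),
    }
    # Stratified folds keep the class proportions similar in every split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    results = {}
    for name, model in candidates.items():
        # Scaling sits inside the pipeline so it is refit on each training fold,
        # which keeps held-out samples from influencing the preprocessing.
        pipeline = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
        results[name] = float(scores.mean())
    return results
```

With imbalanced groups such as the 108 versus 39 samples of this dataset, accuracy alone can be optimistic, so passing another scikit-learn scoring string (for example `"f1_macro"`) to `cross_val_score` may be a useful complement.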
As shown in Figure S3, the deep learning algorithms (MLP and CNN) nevertheless achieved 100% accuracy in classifying the two groups.

Users can save and reload trained models along with the selected features, the fitted label encoder and the full preprocessing pipeline, including scaling and transformations. This ensures that any new data used for prediction will undergo the exact same preprocessing steps as the training data, maintaining consistency and avoiding data leakage. Importantly, saving the specific trained features (not just the input dimension) guarantees that the model only processes the variables it was originally trained on, preserving both model integrity and performance. This is particularly crucial when applying the model to new omics data, such as metabolomic spectra, proteomic LFQ, or gene/RNA expression, where some features used during training may not be detected in a given sample. In traditional workflows, this mismatch would prevent prediction altogether. Profiler handles it by assigning a default value (e.g., zero) to any missing feature, treating it as not detected. This allows predictions to proceed using the available features without compromising model compatibility or requiring retraining. This approach enhances reproducibility, ensures robust and interpretable predictions, and supports scalable deployment in real-world scenarios. All models can be exported for external use, making Profiler a powerful and flexible tool for both exploratory analysis and predictive modeling across diverse omics applications.

Biomarker discovery

Next, Profiler offers a comprehensive pipeline for biomarker discovery and feature interpretation, which also serves as a robust feature selection process. This pipeline includes a variety of statistical analysis tools and explainability modules designed to identify, rank, and visualize significant biomarkers.
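Regarding the reuse of saved models on new samples described earlier, the alignment of a new table to the features seen during training can be sketched with pandas. The function name `align_to_training_features` is hypothetical, and the zero default follows the example value given in the text; this is an illustration, not Profiler's code.

```python
import pandas as pd


def align_to_training_features(new_data, trained_features, fill_value=0.0):
    """Return new_data restricted to, and ordered like, the training features."""
    # reindex drops columns the model never saw and adds absent training
    # features filled with fill_value ("not detected").
    aligned = new_data.reindex(columns=list(trained_features), fill_value=fill_value)
    # Remaining missing measurements in known features get the same default.
    return aligned.fillna(fill_value)
```

The aligned table has exactly the columns, in the same order, that a saved preprocessing pipeline and model expect, so prediction can proceed without retraining.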
These insights can then be saved as structured dataframes for further analysis or model retraining, enhancing overall performance and interpretability.

One of the standout features is the volcano plot, conventionally used to compare binary classes. Profiler extends this functionality to multi-class comparisons, providing a more versatile tool for biomarker discovery. Volcano plots visualize the statistical significance (p-value) and magnitude of change (fold change) for each feature, allowing users to quickly identify the most relevant biomarkers. Profiler also provides an option to highlight feature names for clarity and offers feature detection based on intensity thresholds, which can automatically identify and include significant features in the analysis. This multi-class capability broadens the applicability of volcano plots, making them a powerful tool for complex datasets. Using a volcano plot with a fold-change threshold of 0.1 and a p-value threshold of 0.05, 66 proteins were found significantly deregulated in group A or B (Figure 5A).

Profiler also integrates explainability tools to enhance the interpretability of machine learning results. It supports SHAP (SHapley Additive exPlanations) (https://shap.readthedocs.io) for both local and global attribution [22], and LIME (https://eli5.readthedocs.io) for model introspection. SHAP values provide detailed explanations of model outputs by quantifying the contribution of each feature to individual predictions, offering both per-sample and overall insights. LIME, on the other hand, makes models more transparent by highlighting feature weights/contributions and the direction of their effects (positive or negative). Profiler includes custom modules that convert SHAP and LIME outputs into structured DataFrames, facilitating easier downstream analysis and integration.
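To make the idea of turning attribution outputs into a structured table concrete, the following is a minimal sketch, not Profiler's actual module. It assumes that per-sample, per-feature attribution values (for example SHAP values) have already been computed elsewhere; the function name `summarize_attributions`, the feature names and the toy values are hypothetical and are not results from this study.

```python
from statistics import mean


def summarize_attributions(feature_names, attributions):
    """Rank features by mean absolute attribution across samples.

    attributions: list of rows (one per sample); each row holds the signed
    contribution of every feature, computed beforehand by an explainer.
    """
    rows = []
    for j, name in enumerate(feature_names):
        column = [sample[j] for sample in attributions]
        signed = mean(column)
        rows.append({
            "feature": name,
            "mean_abs": mean(abs(v) for v in column),  # global importance
            "mean_signed": signed,                      # average direction
            "direction": "positive" if signed >= 0 else "negative",
        })
    # Most influential features first
    rows.sort(key=lambda r: r["mean_abs"], reverse=True)
    return rows


# Toy attribution matrix: 3 samples x 3 hypothetical features
features = ["P1", "P2", "P3"]
values = [
    [0.30, -0.10, 0.05],
    [0.20, -0.30, 0.00],
    [0.40, -0.20, -0.05],
]
table = summarize_attributions(features, values)
for row in table:
    print(row["feature"], round(row["mean_abs"], 3), row["direction"])
```

Each row of the resulting table can be written to a CSV file or loaded into a dataframe, which is the kind of structured output that makes ranked features reusable for retraining.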
In addition, visualizations such as beeswarm plots and positive/negative contribution plots are generated to show feature impacts and improve understanding of model behavior. Together, these tools help ensure that predictive models are not only accurate but also trustworthy and explainable. Using AI explainability tools, the 54 proteins that contributed most to the model’s predictions were identified (Figure 5B–C). For further analysis, these proteins were combined with those found deregulated in the volcano plot, counting proteins identified by both approaches only once (for example FMO3 for group B and HBQ1, TMEM163, SEPT14, and DHRS3 for group A).

Additionally, Profiler offers heatmap clustering for both features and samples, enabling users to visualize patterns and relationships within the data. Users can perform heatmap clustering on all or selected features, with options to average feature values by class and apply statistical tests to filter significant features. Customizable parameters include the choice of data type (original intensity or log2 transformed) and p-value thresholds, allowing for tailored analysis. The heatmaps use custom color schemes to highlight under-expression, neutral expression, and over-expression, providing a clear and intuitive visualization. A heatmap generated using all 120 discovered biomarkers, from both volcano plots and AI explainability methods, showed a strong clustering of the two groups, with distinct patterns of under- and overexpressed proteins (Figure 5D). Moreover, when comparing the heatmaps generated from the volcano-plot biomarkers and from those identified through AI, the AI-derived heatmap clusters much more homogeneously (Figure S4). In contrast, the heatmap based on volcano plot biomarkers shows noticeable heterogeneity, particularly within group B.
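The volcano-plot selection that produced the first biomarker set combines a fold-change threshold with a p-value threshold. The sketch below illustrates that logic under explicit assumptions: the fold change is expressed as log2(mean B / mean A), the threshold is applied to its absolute value, and p-values are supplied precomputed. The text does not state how Profiler defines the 0.1 threshold internally, so this is an illustration rather than the platform's implementation; `volcano_select` and the toy intensities are hypothetical.

```python
import math


def volcano_select(features, fc_threshold=0.1, p_threshold=0.05):
    """Keep features passing both the fold-change and p-value thresholds.

    features: dict mapping name -> (mean_A, mean_B, p_value), with strictly
    positive mean intensities and p-values computed beforehand.
    Assumption: fc_threshold is compared against |log2(mean_B / mean_A)|.
    """
    selected = {}
    for name, (mean_a, mean_b, p_value) in features.items():
        log2_fc = math.log2(mean_b / mean_a)
        if p_value < p_threshold and abs(log2_fc) >= fc_threshold:
            direction = "up_in_B" if log2_fc > 0 else "up_in_A"
            selected[name] = (direction, log2_fc)
    return selected


# Toy values for four hypothetical proteins
toy = {
    "P1": (100.0, 400.0, 0.001),  # large change, significant
    "P2": (200.0, 50.0, 0.01),    # large change in the other direction
    "P3": (100.0, 101.0, 0.001),  # significant but negligible change
    "P4": (100.0, 800.0, 0.2),    # large change but not significant
}
hits = volcano_select(toy)
for name, (direction, log2_fc) in hits.items():
    print(name, direction, round(log2_fc, 2))
```

For a multi-class comparison, the same filter could be applied to each pair of classes, or to each class against the rest; which scheme Profiler uses is not specified here.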
For statistical analysis, Profiler supports a range of tests tailored to both binary and multi-class scenarios, including parametric and non-parametric methods. Users can perform t-tests and ANOVA for parametric data, as well as Kruskal-Wallis and Mann-Whitney tests for non-parametric data. These tests help assess the significance of features and their association with biological conditions, and the results can be visualized using boxplots, violin plots or bar plots. Here, two examples of deregulated proteins are displayed using boxplots and violin plots (Figure 5E-F), with significance assessed by the Kruskal-Wallis test. ACYP2 is significantly overexpressed in group B, whereas COL6A2 is more highly expressed in group A.

Overall, Profiler's biomarker discovery and feature interpretation pipeline is designed to streamline the identification of significant features, enhance model performance, and provide clear, interpretable results. The ability to save these insights as structured dataframes further facilitates downstream analysis and model retraining, so that users can carry the most relevant features forward in their research.

Pathway enrichment and functional annotation

Biological pathway analysis in Profiler is performed using GSEApy [23], interfaced via a custom algorithm. Users can select from more than 100 databases, including KEGG [24], GO [25], Reactome [26], MSigDB [27], and Drug Signatures [28,29], providing a wide range of biological contexts for analysis. One of the key advantages of Profiler is its support for multi-class enrichment, which facilitates comparative insights across different phenotypes. This is particularly useful for studies involving multiple conditions or treatments, as it allows for a more nuanced understanding of biological pathways.
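As an illustration of what an over-representation analysis computes for each class, the sketch below scores gene sets with a one-sided hypergeometric test and also returns the query genes that fall in no tested set. It is a self-contained approximation of the general technique and does not reproduce GSEApy's API, its combined score, or Profiler's custom algorithm; the function names, gene identifiers, gene sets and background size are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, query_size, set_size, background_size):
    """P(X >= overlap) when query_size genes are drawn from the background."""
    upper = min(query_size, set_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enrich(query_genes, gene_sets, background_size):
    """Return (ranked terms, query genes found in no tested gene set)."""
    query = set(query_genes)
    results, covered = [], set()
    for term, members in gene_sets.items():
        hits = sorted(query & set(members))
        if not hits:
            continue
        covered.update(hits)
        p = hypergeom_pvalue(len(hits), len(query), len(members), background_size)
        results.append({"term": term, "n_hits": len(hits), "genes": hits, "p_value": p})
    results.sort(key=lambda r: r["p_value"])
    return results, sorted(query - covered)


# Hypothetical biomarker lists per class and hypothetical gene sets
gene_sets = {"SET_A": ["g1", "g2", "g3", "g4"], "SET_B": ["g5", "g6"]}
per_class = {"A": ["g1", "g2", "g3", "g9"], "B": ["g5", "g7"]}

report = {cls: enrich(genes, gene_sets, background_size=20)
          for cls, genes in per_class.items()}
for cls, (terms, unassigned) in report.items():
    print(cls, [(t["term"], t["n_hits"]) for t in terms], "unassigned:", unassigned)
```

Running the same function once per class is one simple way to obtain comparable term tables for a multi-class comparison; a real analysis would also correct the p-values for multiple testing.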
For each pathway, Profiler provides detailed information including the number of associated proteins or genes, as well as the list of implicated features within that pathway. Importantly, Profiler also highlights genes or proteins that are not associated with any enriched pathway, allowing users to capture the full scope of molecular involvement, including potentially novel or understudied factors. The results of the enrichment analysis are visualized as enriched-term graphs, heatmaps and interactive plots, which provide an intuitive way to explore the significance of the various pathways. Additionally, the results can be exported as structured tables, making it easy to integrate the findings into further analyses or reports.

Using all identified biomarkers (exclusive features, features selected from volcano plots and markers highlighted via AI explainability) and applying the enrichment module, we identified the top 15 enriched pathways for each group using the MSigDB_Hallmark_2020 database. These pathways were ranked by their combined score and visualized as either bar plots or heatmaps (Figure 6A–B), or ranked by gene counts (Figure 6C). In addition, the specific proteins involved in each enriched pathway can be retrieved (Figure 6D and Supplementary Data 2). Their interaction network is visualized in Figure 6E, revealing complex interactions within and between certain pathways. This analysis revealed, for example, that group A tumors are enriched in pathways such as myogenesis (cell differentiation) and interferon alpha response (antiviral immune response). Overall, group A appears to activate differentiation, immune response, and cellular structure programs, suggesting a more stable, less invasive, and potentially less aggressive tumor phenotype.
In contrast, proteins in group B are involved in signaling pathways such as KRAS signaling (proliferation), unfolded protein response (cancer cells under stress), and interferon gamma response (inflammation and oxidative stress). This is particularly noteworthy, as group B tumors seem to engage cellular stress, proliferation, inflammation, and tumor deregulation pathways, consistent with a more aggressive, invasive behavior and a potentially higher resistance to treatment. It is worth noting that 120 genes in group A and 478 genes in group B were not associated with any enriched pathway. Profiler retains this information, allowing users to explore these non-enriched features, which may represent poorly characterized or context-specific proteins/genes. Investigating these elements could lead to novel biological insights and uncover new functional roles.

Going further, we explored potential drug targets using the DGIdb_Drug_Targets_2024 database to identify compounds that could specifically target the previously enriched pathways in each group. As with the pathway analysis, the results were visualized using multiple plot types (Figure 6F–J). For group B, the identified drugs were notably enriched in inhibitors of oncogenic kinases, such as Dabrafenib and Dasatinib, which target proliferative signaling pathways. Additionally, compounds frequently associated with aggressive or treatment-resistant cancers, including Masitinib and Linifanib, were also highlighted. This aligns with the observation that group B tumors strongly activate oncogenic signaling pathways such as MAPK and BRAF, which promote cell proliferation, survival, and therapeutic resistance, consistent with a more aggressive tumor phenotype and poorer prognosis. In contrast, group A showed enrichment in targets of immunomodulatory and anti-inflammatory drugs, such as Infliximab, along with classical anticancer agents like Lapatinib and Methotrexate.
This suggests a distinct therapeutic landscape for group A tumors, potentially more responsive to immune modulation and conventional chemotherapy (Supplementary Data 3).

Profiler also offers an interactive gene interaction network built with NetworkX, a Python library for the creation, manipulation, and study of complex networks. This network visualization allows users to explore the relationships between genes involved in enriched pathways, providing deeper insights into the biological mechanisms at play. Users can interact with the network dynamically, zooming in on specific genes or pathways to understand their connectivity and importance. The network is color-coded by protein type or class, making it easy to distinguish between different groups of genes. This interactive feature enhances the interpretability of the enrichment results and helps researchers identify key genes and pathways that may warrant further investigation.

Survival and prognostic modeling

Using the lifelines library [30], Profiler supports advanced survival and prognostic modeling techniques. Key features include Kaplan-Meier estimation, which provides a non-parametric way to estimate the survival function from lifetime data, and Cox Proportional Hazards modeling, which assesses the effect of several risk factors on survival time. Additionally, Profiler supports Log-Rank tests to compare the survival distributions of two or more groups. These tools are essential for translational biomarker studies, where understanding the prognostic value of various covariates is crucial. By integrating these survival analysis techniques, Profiler enables researchers to identify factors that significantly impact survival outcomes, aiding in the development of more effective treatment strategies and personalized medicine approaches.
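For readers unfamiliar with the Kaplan-Meier estimator mentioned above, the following sketch implements the product-limit formula directly in plain Python. It is only meant to show what the estimator computes; it is not Profiler's code and does not use lifelines, and the function name and the toy follow-up data are hypothetical. Subjects censored at an event time are treated as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per subject.
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy cohort: five subjects, two of them censored
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

In lifelines, the corresponding estimator is the `KaplanMeierFitter` class, which additionally provides confidence intervals and plotting.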
Furthermore, the Cox Proportional Hazards model can be saved and deployed directly within Profiler to make predictions on new data or patients, facilitating real-time prognostic assessments and supporting clinical decision-making.

Our analysis used Kaplan-Meier survival curves and a Cox proportional hazards model to assess survival outcomes and influencing factors for the two groups, A and B. The Kaplan-Meier curves in Figure S5A reveal a statistically significant survival advantage for group A over group B (p = 0.00001). The forest plot from the Cox model in Figure S5B identifies key proteins impacting survival, with log(hazard ratios) and 95% confidence intervals illustrating their effects (Supplementary Data 4). Variables to the right of zero indicate increased hazard and worse survival, while those to the left suggest better survival prospects. For instance, the protein MX1 (Myxovirus resistance protein 1) is associated with shorter survival [31] and acts as a negative prognostic factor. This is consistent with the enrichment result showing its involvement in type I interferon response and tumor-promoting inflammation, often linked to aggressive tumor phenotypes and resistance to therapy. In contrast, GOT1 (Glutamic-Oxaloacetic Transaminase 1) and ACYP2 (Acylphosphatase 2) are both associated with longer survival [32], suggesting a protective role.

Wizard & Deployment Tools

The Wizard module is designed to guide users through real-time and post-hoc prediction workflows, enhancing the accessibility and utility of predictive modeling. This module supports real-time predictions on new samples directly from raw files, a feature initially designed for real-time prediction connected to mass spectrometer instruments.
While real-time prediction directly from the instrument is not feasible when using Profiler from the web, users can still achieve real-time predictions by dragging and dropping a zipped raw file from instruments such as Waters, Bruker, or Thermo. This capability ensures that users can leverage Profiler's predictive power even in environments where direct instrument integration is not possible. Additionally, the Wizard module facilitates post-hoc predictions using tabular datasets and saved models. Users can upload tabular data and apply saved models to make predictions, with ground truth data (a “Class” column) or without it. This flexibility allows test datasets to be compared and assessed against known outcomes, providing valuable insights into model performance. The results of these predictions can be visualized, interpreted, and exported in publication-ready formats, making it easy to share findings with colleagues or include them in research publications.

Using the same dataset employed for spectral visualization in a previous module, originating from the study by Zirem et al. (2024) [11,33], a classification model was built, achieving 92% accuracy through 5-fold cross-validation. This model was then tested blindly on an unseen dataset using the Wizard module of Profiler. Two prediction modes are available: from raw data (real-time or post-acquisition) or from an already processed CSV file (post-hoc). As shown in Figures S6 and S7, the real-time predictions were highly satisfactory, with no misclassifications.

A novel and powerful feature introduced in Profiler is the ability to interrogate multiple trained models simultaneously. Users can upload several models that share the same trained features and label encoders, and Profiler will perform predictions using all models in parallel. The final class is then determined by a majority voting strategy and a confidence score is provided to reflect the consensus across models.
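Two mechanisms described for saved models lend themselves to a short illustration: filling features that were not detected in a new sample with a default value, and combining several models by majority vote. The sketch below is a simplified reading of both under stated assumptions: the confidence score is taken to be the share of models agreeing with the winning class (the text does not give Profiler's exact definition), and the models are stand-in functions rather than trained classifiers. All names and values are hypothetical.

```python
from collections import Counter


def align_features(sample, trained_features, default=0.0):
    """Order values as the model expects; undetected features get `default`.

    Features present in the sample but unknown to the model are ignored.
    """
    return [sample.get(name, default) for name in trained_features]


def majority_vote(predictions):
    """Return (winning_label, share_of_models_agreeing)."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


trained_features = ["P1", "P2", "P3"]
# P2 was not detected in the new sample; P9 was never seen during training
new_sample = {"P1": 12.5, "P3": 3.0, "P9": 7.1}
x = align_features(new_sample, trained_features)

# Stand-ins for three saved models sharing the same trained features
models = [
    lambda v: "A" if v[0] > 10 else "B",
    lambda v: "A" if v[2] > 1 else "B",
    lambda v: "A" if v[1] > 0 else "B",
]
label, confidence = majority_vote([model(x) for model in models])
print(x, label, round(confidence, 2))
```

With an even number of models a tie is possible; in this sketch the label encountered first wins, so a production workflow would need an explicit tie-breaking rule.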
This ensemble-like approach improves prediction robustness, compensates for model-specific biases and supports more reliable decision-making in practical applications.

Discussion

The increasing volume and complexity of omics data continue to push the boundaries of computational biology. Tools capable of managing and interpreting such data must not only be powerful and statistically sound but also accessible to the wider research community [34]. Profiler addresses this need by offering an end-to-end, modular solution that unifies multiple analytical capabilities within a single, web-based application. Unlike existing platforms such as Galaxy, which requires complex installation and server configuration, or Perseus, which is confined to Windows environments, Profiler is platform-independent and lightweight, designed to run efficiently on a wide range of systems. Its web-based architecture ensures broad accessibility, and its scalability is evidenced by its performance on high-capacity computing infrastructure. This makes Profiler suitable for both small laboratory experiments and large-scale clinical studies.

A distinguishing feature of Profiler is its integration of machine learning and deep learning modules, enabling sophisticated predictive modeling directly from user-uploaded data. By embedding preprocessing, feature selection, model training, and evaluation into an intuitive workflow, Profiler lowers the barrier to entry for advanced data science techniques in biology [35]. Furthermore, the inclusion of automated biomarker discovery and survival analysis tools allows clinically relevant insights to be drawn with minimal overhead.

Another critical advantage lies in the platform’s support for data visualization and interpretability. Profiler offers real-time interactive plots, such as dimensionality-reduction projections (PCA, t-SNE, UMAP), volcano plots, clustering heatmaps and box/violin plots, which are essential for hypothesis generation and exploratory data analysis.
These features not only improve user engagement but also facilitate a deeper understanding of data structure and biological patterns.

From a software engineering standpoint, Profiler was built with extensibility in mind. Its modular design allows for rapid integration of new analytical methods and data types as the field evolves. Future directions include support for single-cell omics and the release and integration of pre-trained models for domain-specific applications such as bacterioscoring, immunoscoring, and dry proteomics. These models, validated in prior peer-reviewed studies [8,11,33], offer domain-specific scoring pipelines that integrate directly into the workflow. This fusion of enrichment-driven interpretation with task-specific predictive modeling allows researchers not only to observe differential expression patterns but also to contextualize them within a biological or clinical framework, supporting hypothesis generation, validation, and translational impact.

To ensure scalability and maintain user experience, Profiler is currently hosted on the high-performance computing (HPC) infrastructure of the Mésocentre of Lille, with access to 246 GB RAM, multiple CPUs, and expandable GPU capacity. Should usage statistics indicate high demand, we are prepared to scale up computational resources accordingly by increasing CPU, GPU, and RAM allocations, in collaboration with the Mésocentre’s HPC provisioning team. This commitment ensures that Profiler remains responsive and capable of handling large-scale bioinformatics workflows.

In conclusion, Profiler represents a powerful addition to the bioinformatics toolkit. By combining robust analytics with a user-centered design, it closes a critical gap in omics data analysis. We anticipate that Profiler will serve as a valuable resource for biologists, clinicians, and data scientists alike, accelerating discovery in diverse research areas ranging from cancer genomics to microbial ecology.
Declarations

Data Availability

Profiler is openly accessible at https://prism-profiler.univ-lille.fr. All datasets used in this study are available in the dedicated GitHub repository at https://github.com/yanisZirem/Profiler_v1_requests_datatests, in the Data_fo_peer_review_paper folder. In addition to the glioblastoma dataset illustrated in the article, the repository includes a wide range of real and simulated datasets designed to showcase Profiler’s capabilities across multiple omics platforms. It contains raw mass spectrometry data acquired from Bruker and Waters instruments (Bruker_data/ and Waters_data/), as well as processed output files from DIA-NN (DIA-NN_data/) and MaxQuant (Maxquant_data/). The Tabular_data_multi_omics/ directory offers structured "toy" datasets specifically created to help users get started with Profiler, test its different modules, and explore its full potential. These datasets, covering lipidomics, proteomics, transcriptomics, and metabolomics, are tailored for binary classification (e.g., aggressive vs. non-aggressive tumors) and multi-class tasks (e.g., tumor, necrotic, and healthy tissues). They are also suitable for training and educational purposes, particularly for students or researchers learning to analyze multi-omics data. Additionally, the Survival_data/ folder contains clinical variables and lipid markers (clinical_and_LipidsMarkers.csv) for Cox regression modeling, as well as preprocessed patient data (statuts_patients.csv) for Kaplan–Meier survival analysis. All data are shared in accessible formats to encourage transparency, reproducibility, and broader adoption by the scientific and educational communities.

Acknowledgments

This work is partially supported by the Institut National de la Santé et de la Recherche Médicale (Inserm), Inserm Transfert, Région Hauts de France, the Mésocentre de Calcul of the Université de Lille, and the Agence Nationale de la Recherche (Click & Detect, 1051 CE29, 2024).
The authors thank the OrganOmics platform of PRISM Inserm U1192, which is recognized and supported by the University of Lille, the Infrastructure PROFI (https://www.profiproteomics.fr/), and the GIS IbiSA (https://www.ibisa.net/). The OrganOmics platform (Villeneuve d’Ascq, France) is also supported by Region Hauts de France and FEDER funding.

Authors contribution

Y.Z. conceptualized and designed the data analysis pipelines. Y.Z. developed the Profiler software. L.L. tested Profiler to identify potential bugs and improve the tool as much as possible. L.L. and Y.Z. wrote the User Manual. Y.Z. and L.L. wrote the manuscript’s original draft. M.S. and I.F. corrected the manuscript. M.S. and I.F. supervised the project and provided the funding.

Competing interest

Y.Z., L.L., I.F. and M.S. declare they have no competing interests. Profiler has been registered since January 14th, 2025 with the Agence pour la Protection des Programmes under the Inter Deposit Digital Number IDDN.FR.001.030004.000.S.C.2025.000.31230.

Generative AI statement

The author(s) declare that Generative AI was used in the creation of this manuscript. During the preparation of this work, the authors utilized ChatGPT-4.0 to enhance the language quality. Following its use, the authors thoroughly reviewed and edited the content as necessary, taking full responsibility for the accuracy and integrity of the publication.

References

1. Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol 18, 83 (2017).
2. Mangul, S. et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol 17, e3000333 (2019).
3. Perez-Riverol, Y., Alpi, E., Wang, R., Hermjakob, H. & Vizcaíno, J. A. Making proteomics data accessible and reusable: Current state of proteomics databases and repositories. Proteomics 15, 930–950 (2015).
4. Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research 46, W537–W544 (2018).
5. Pang, Z. et al. MetaboAnalyst 5.0: narrowing the gap between raw spectra and functional insights. Nucleic Acids Research 49, W388–W396 (2021).
6. Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Methods 13, 731–740 (2016).
7. Duhamel, M. et al. Spatial analysis of the glioblastoma proteome reveals specific molecular signatures and markers of survival. Nat Commun 13, 6665 (2022).
8. Lagache, L., Zirem, Y., Le Rhun, É., Fournier, I. & Salzet, M. Predicting Protein Pathways Associated to Tumor Heterogeneity by Correlating Spatial Lipidomics and Proteomics: The Dry Proteomic Concept. Molecular & Cellular Proteomics 24, 100891 (2025).
9. Quanico, J. et al. Development of liquid microjunction extraction strategy for improving protein identification from tissue sections. Journal of Proteomics 79, 200–218 (2013).
10. Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc 11, 2301–2319 (2016).
11. Zirem, Y. et al. Real-time glioblastoma tumor microenvironment assessment by SpiderMass for improved patient management. Cell Reports Medicine 101482 (2024) doi:10.1016/j.xcrm.2024.101482.
12. Kessner, D., Chambers, M., Burke, R., Agus, D. & Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24, 2534–2536 (2008).
13. Röst, H. L., Schmitt, U., Aebersold, R. & Malmström, L. pyOpenMS: A Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 14, 74–77 (2014).
14. Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods 17, 41–44 (2020).
15. Behdenna, A. et al. pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. BMC Bioinformatics 24, 459 (2023).
16. Aljrees, T. Improving prediction of cervical cancer using KNN imputer and multi-model ensemble learning. PLoS ONE 19, e0295632 (2024).
17. Lemaître, G. & Nogueira, F. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning.
18. Lavanya, A. et al. Assessing the Performance of Python Data Visualization Libraries: A Review. IJCERT 10, 28–39 (2023).
19. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint at http://arxiv.org/abs/1802.03426 (2020).
20. Visualizing Data using t-SNE.
21. Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987).
22. Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Preprint at https://doi.org/10.48550/ARXIV.1705.07874 (2017).
23. Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757 (2023).
24. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes.
25. The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Research 43, D1049–D1056 (2015).
26. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Research 46, D649–D655 (2018).
27. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
28. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Research 41, D991–D995 (2012).
29. Svoboda, D. L., Saddler, T. & Auerbach, S. S. An Overview of National Toxicology Program’s Toxicogenomic Applications: DrugMatrix and ToxFX. in Advances in Computational Toxicology (ed. Hong, H.) vol. 30 141–157 (Springer International Publishing, Cham, 2019).
30. Davidson-Pilon, C. lifelines: survival analysis in Python. JOSS 4, 1317 (2019).
31. Menyhárt, O., Fekete, J. T. & Győrffy, B. Gene expression-based biomarkers designating glioblastomas resistant to multiple treatment strategies. Carcinogenesis 42, 804–813 (2021).
32. Gao, X., Zhao, J., Jia, L. & Zhang, Q. Remarkable immune and clinical value of novel ferroptosis-related genes in glioma. Sci Rep 12, 12854 (2022).
33. Zirem, Y., Ledoux, L., Salzet, M. & Fournier, I. Protocol to analyze 1D and 2D mass spectrometry data from glioblastoma tissues for cancer diagnosis and immune cell identification. STAR Protocols 5, 103285 (2024).
34. Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).
35. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat Rev Genet 16, 321–332 (2015).

Additional Declarations

Yes, there is a potential competing interest (see the Competing interest statement above).

Supplementary Files

UserManualDataSup1.pdf: User Manual
MSIgGeneinvolvmentacroospathwaysdetailsDataSup2.csv: MSigDB gene involvement across pathways
DrugGeneinvolvmentacrosspathwaysdetailsDataSup3.csv: Drug gene involvement across pathways
Coxsummarytop20DataSup4.csv: Cox summary, top 20
FiguresSupProfilerv1.docx: Supplementary figures
Authors and affiliation (from the preprint record): Michel Salzet (corresponding author, ORCID https://orcid.org/0000-0003-4318-0817), Yanis Zirem, Lea Ledoux and Isabelle Fournier; Université de Lille, Protéomique Réponse Inflammatoire Spectrométrie de Masse (PRISM).

Figure 1. End-to-end Profiler analysis pipelines. Modular architecture with streamlined flow through 8 interconnected components.

Figure 2. Overview of the insights gained from the data exploration module using Profiler.

Figure 3. Key findings from the data visualization module using Profiler. A) Pseudo-spectra displaying the LFQ intensities of all detected proteins across groups. B) Venn diagram used to identify proteins exclusive to each group. C) Bar chart showing the distribution of a specific protein (EGFR) across different groups. D-E-F) Comparison of multiple proteins (EGFR, KCTD16, ICAM3 and DLGAP3) across groups using line plots, bar charts and radar charts.

Figure 4. Overview of outcomes generated by the AI modeling module within Profiler. A-B) 2D and 3D plots of PCA, UMAP, and t-SNE visualizations separating classes A and B. C) Silhouette scores for different cluster numbers. D-E) t-SNE clustering results for k=2 and k=3. F-G-H-I-J) Comparison of model accuracies, the top 3 models, and the classification metrics, confusion matrix and learning curve of the best-performing model.

Figure 5. Overview of the outcomes generated by the Profiler-based biomarker discovery module. A) Volcano plot with a 0.1-fold change threshold and p-value of 0.05, showing up- and down-regulated proteins depending on the groups. B) SHAP beeswarm plot of the top 20 proteins contributing to the ML model. C) LIME analysis showing the top 50 proteins contributing to the ML model. D) Heatmap clustering the two groups based on deregulated proteins identified by both the volcano plot and AI explainability methods.
\u003cstrong\u003eE–F)\u003c/strong\u003eBoxplots and violin plots displaying the expression levels of two deregulated proteins (GOT1 and COL6A2).\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/eacdc0422357eb1620891ddb.png"},{"id":87300434,"identity":"f6cb3736-b360-4fee-a859-afd8040d2fe5","added_by":"auto","created_at":"2025-07-22 13:15:05","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":1706913,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePathways and drugs enrichment using enrichment module in Profiler\u003c/strong\u003e. Panels with enriched pathways/drugs bar plot according to combined score, enriched pathways/drugs according to combined score, enriched pathways/drugs according to gene count, gene involvement across pathways/drugs and interactive gene interaction network. Top panel focusing on pathways and Bottom panel on drugs.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/1cd916994f98f0f67e06de96.png"},{"id":87530118,"identity":"6a317177-3a9f-493e-8522-e080bca7ee3c","added_by":"auto","created_at":"2025-07-24 21:48:46","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7169245,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/10cfeb1f-cadb-4ee1-8651-7196051a3295.pdf"},{"id":87300433,"identity":"7b29e348-5da5-4946-9370-c5653b9db06f","added_by":"auto","created_at":"2025-07-22 13:15:05","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":7426891,"visible":true,"origin":"","legend":"\u003cp\u003eUser 
Manual\u003c/p\u003e","description":"","filename":"UserManualDataSup1.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/c6feb13fb9b3e85eb648a07f.pdf"},{"id":87300425,"identity":"5441e59b-8c3a-4c43-9f2e-aea07839e346","added_by":"auto","created_at":"2025-07-22 13:15:05","extension":"csv","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":1058,"visible":true,"origin":"","legend":"\u003cp\u003eMSIg gene involvement across pathways\u003c/p\u003e","description":"","filename":"MSIgGeneinvolvmentacroospathwaysdetailsDataSup2.csv","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/07dc61a59f1702911715f33c.csv"},{"id":87300427,"identity":"5bfc30cd-2324-4d98-95f2-55c29a1e24a8","added_by":"auto","created_at":"2025-07-22 13:15:05","extension":"csv","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":789,"visible":true,"origin":"","legend":"\u003cp\u003eDrug gene involvement across pathways\u003c/p\u003e","description":"","filename":"DrugGeneinvolvmentacrosspathwaysdetailsDataSup3.csv","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/e0f517626e75069a5c1fca86.csv"},{"id":87300429,"identity":"fc49a53d-542f-4612-b88a-d1fb40ca114c","added_by":"auto","created_at":"2025-07-22 13:15:05","extension":"csv","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":4112,"visible":true,"origin":"","legend":"\u003cp\u003eCox Summary Top20\u003c/p\u003e","description":"","filename":"Coxsummarytop20DataSup4.csv","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/9eebf53fc4f75bfaca69faf6.csv"},{"id":87300870,"identity":"13782cd2-d14e-4705-a62f-366bd16f34dd","added_by":"auto","created_at":"2025-07-22 13:23:05","extension":"docx","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":1368065,"visible":true,"origin":"","legend":"\u003cp\u003eSupplementary 
Figures\u003c/p\u003e","description":"","filename":"FiguresSupProfilerv1.docx","url":"https://assets-eu.researchsquare.com/files/rs-7058776/v1/422433599fe71b9201a90268.docx"}],"financialInterests":"\u003cb\u003eYes\u003c/b\u003e there is potential Competing Interest.\nY.Z., L.L., I.F. and M.S. declare they have no competing interests. Profiler is registered since January 14th 2025 at the Inter deposit IDNN from the Program Protection agence with the number : IDDN1 .FR2 .0013 .0300044 .0005 .S6 .C7 .20258 .0009 .3123010.","formattedTitle":"Profiler: an open web platform for multi-omics analysis","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe advent of high-throughput technologies, such as next-generation sequencing (NGS), mass spectrometry (MS) and microarrays, has revolutionized biomedical research. These platforms generate large-scale, multi-dimensional datasets, collectively referred to as omics data, encompassing genomics, transcriptomics, proteomics, and metabolomics. Such datasets hold immense potential for elucidating biological mechanisms, discovering disease biomarkers, and identifying novel therapeutic targets. However, the complexity, heterogeneity, and volume of omics data introduce substantial computational and analytical challenges\u003csup\u003e1\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eTraditional omics data analysis typically requires specialized expertise in bioinformatics, statistics and programming, placing it beyond the reach of many experimental biologists and clinicians. Furthermore, many existing tools are limited in scope, tailored to specific omics types or single-step analyses and are often confined to command-line environments, which hinder accessibility, interoperability and reproducibility. 
Researchers are frequently compelled to navigate fragmented workflows across multiple software packages, leading to inefficiencies, steep learning curves and reproducibility concerns\u003csup\u003e2,3\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eIn response to these limitations, there is a growing demand for integrated, user-friendly, and visually intuitive platforms that combine analytical robustness with accessibility. Solutions such as Galaxy\u003csup\u003e4\u003c/sup\u003e, MetaboAnalyst\u003csup\u003e5\u003c/sup\u003e and Perseus\u003csup\u003e6\u003c/sup\u003e have made important strides in addressing specific areas of omics analysis. However, few platforms offer a truly comprehensive, end-to-end solution that covers multiple omics modalities, incorporates advanced machine learning and deep learning methods and enables interactive data visualization and interpretation within a unified environment.\u003c/p\u003e\n\u003cp\u003eTo address these critical gaps, we introduce Profiler, a modular, web-based application designed to democratize omics data analysis. Developed in Python using the Streamlit framework, Profiler provides a seamless and integrated pipeline covering key stages of analysis: data import and conversion, preprocessing (including cleaning, normalization, imputation, batch effect correction), visualization, statistical testing, machine learning, deep learning, biomarker discovery, pathway enrichment analysis, and survival analysis. The platform is built for scalability, modularity and extensibility, allowing it to evolve with emerging research needs and analytical innovations.\u003c/p\u003e\n\u003cp\u003eNotably, Profiler is engineered to serve both novice and expert users. It offers guided, workflow-oriented interfaces for users with limited computational experience, while its flexible architecture supports customization and advanced analytical workflows for experienced users. 
Profiler\u0026apos;s compatibility with a wide range of data formats, combined with efficient backend processing, ensures robust performance even with high-dimensional datasets.\u003c/p\u003e\n\u003cp\u003eBy lowering the technical barriers to entry, Profiler aims to provide the scientific community with an accessible, transparent and comprehensive analytical ecosystem, one that promotes reproducibility, accelerates discovery, and empowers data-driven decision-making in modern life sciences research.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003eThe dataset used in this article to demonstrate the utility of Profiler originates from the studies by Duhamel \u003cem\u003eet al\u003c/em\u003e. (2022)\u003csup\u003e7\u003c/sup\u003e and Lagache \u003cem\u003eet al\u003c/em\u003e. (2025)\u003csup\u003e8\u003c/sup\u003e. While the data were initially collected in 2022, they were reanalyzed in the 2025 study using a more appropriate and advanced data analysis pipelines.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCohort\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTumors from 50 patients were included in the study. Patients with newly diagnosed glioblastoma were prospectively enrolled between September 2014 and November 2018 at Lille University Hospital, France (NCT02473484). All patients gave written informed consent before enrollment. These 50 tumors were used for omics MALDI-MSI and proteomics analysis. Tumors samples were processed within 2\u0026thinsp;hours after sample extraction in the surgery room to limit the risk of degradation of proteins.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSpatially resolved proteomics extraction\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe different clusters identified by the segmentation process (detailed explanation in Lagache \u003cem\u003eet al\u003c/em\u003e. (2025)\u003csup\u003e8\u003c/sup\u003e) were submitted to spatially resolved proteomics. 
A localized digestion was carried out by depositing a trypsin solution (40 \u0026mu;g/ml in NH\u003csub\u003e4\u003c/sub\u003eHCO\u003csub\u003e3\u003c/sub\u003e 50 mM), on a region of 500 \u0026mu;m\u003csup\u003e2\u003c/sup\u003e of tissue (4 \u0026times; 4 droplets of 200 \u0026mu;m in diameter), using CHIP-1000. The deposition method comprises approximately 1205 cycles per digestion spot, i.e., 3 h of deposition, with a drop volume of 150 pL. Finally, each spot was digested with 0.112 \u0026mu;g of trypsin. Following the micro-digestion, each spot was extracted by liquid microjunction using the TriVersa Nanomate device, with LESA (Liquid Extraction and Surface Analysis) parameters\u003csup\u003e9\u003c/sup\u003e. The tryptic peptides were extracted by performing two consecutive extraction cycles for three different solvent mixtures (TFA 0.1%; ACN/0.1% TFA (8:2, \u003cem\u003ev/v\u003c/em\u003e); and MeOH/0.1% TFA (7:3, \u003cem\u003ev/v\u003c/em\u003e)) for a total of six extractions. For each cycle, 2 \u0026mu;l of solvent was drawn into the tip of the pipette, of which 0.8 pL was brought into contact with the surface. Fifteen back-and-forth movements were performed to extract the peptides before collecting the solution in a recovery tube. All extracts were pooled in one tube and 50 \u0026mu;l of ACN was finally added before drying the samples in a SpeedVac. The samples were then stored at \u0026minus;20 \u0026deg;C prior to nLC-MS/MS analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003enLC-MS/MS Bottom-up Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePrior to MS analysis, the reconstituted samples were desalted using C18 Ziptip (Millipore, Saint-Quentin-en-Yvelines, France), eluted with 80% ACN and vacuum-dried. The dried samples were resuspended in 0.1% FA aqueous/ACN (98:2, \u003cem\u003ev/v\u003c/em\u003e). 
Peptide separation was performed by reverse phase chromatography, using a NanoAcquity UPLC system (Waters) coupled to a Q-Exactive Orbitrap mass spectrometer (Thermo Scientific) via a nanoelectrospray source. A pre-concentration column (nanoAcquity Symmetry C18, 5\u0026thinsp;\u0026micro;m, 180\u0026thinsp;\u0026micro;m\u0026thinsp;\u0026times;\u0026thinsp;20\u0026thinsp;mm) and an analytical column (nanoAcquity BEH C18, 1.7\u0026thinsp;\u0026micro;m, 75\u0026thinsp;\u0026micro;m\u0026thinsp;\u0026times;\u0026thinsp;250\u0026thinsp;mm) were used. A 2\u0026thinsp;h linear gradient of acetonitrile in 0.1% formic acid (5%-35%) was applied, at the flow rate of 300\u0026thinsp;nl/min. For MS and MS/MS acquisition (Xcalibur 4.1 and Exactive Series 2.9), a data-dependent mode was defined to analyze the 10 most intense ions of MS analysis (Top 10). The MS analysis was performed with an \u003cem\u003em/z\u003c/em\u003e mass range between 300 and 1600, a resolution of 70,000 FWHM, an AGC of 3 \u0026times; 10\u003csup\u003e6\u003c/sup\u003e ions and a maximum injection time of 120\u0026thinsp;ms. The MS/MS analysis was performed with an \u003cem\u003em/z\u003c/em\u003e mass range between 200 and 2000, an AGC of 5 \u0026times; 10\u003csup\u003e4\u003c/sup\u003e ions, a maximum injection time of 60\u0026thinsp;ms and a resolution of 17,500 FWHM. To avoid any batch effect during the analysis, the extractions were chosen at random to create analysis sequences.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData analysis prior to the use of Profiler (2022 study)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll MS data were searched with MaxQuant software\u003csup\u003e10\u003c/sup\u003e (Version 1.5.3.30) using Andromeda search engine against the complete proteome for Homo sapiens (UniProt, release July 2018, 20,412 entries). Trypsin was selected as enzyme and two missed cleavages were allowed, with N-terminal acetylation and methionine oxidation as variable modifications. 
The mass accuracies were set to 6 ppm and 20 ppm, respectively, for MS and MS/MS spectra. False discovery rate (FDR) at the peptide-spectrum match (PSM) and protein levels was estimated using a decoy version of the previously defined databases (reverse construction, Homo sapiens, UniProt, release July 2018) and set to 1%. A minimum of two peptides, at least one of them unique, was required to identify a protein. The MaxLFQ algorithm was used to perform label-free quantification of the proteins.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData analysis prior to the use of Profiler (2025 study)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe aim of the 2025 study\u003csup\u003e8\u003c/sup\u003e was to advance the concept of dry proteomics. To this end, lipidomic and proteomic MALDI-MSI analyses were performed on 13 glioblastoma tissue samples. Common molecular clusters were identified and correlated with microproteomic data previously obtained in the 2022 study\u003csup\u003e7\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eWe developed a dedicated pipeline for the segmentation of mass spectrometry imaging (MSI) data to assess the number and spatial distribution of tumor clones, both within and across patients. A t-SNE analysis based on lipidomic imaging data revealed a clear separation into two distinct groups. This clustering pattern was consistently recapitulated in the heatmap derived from microproteomic data, further supporting the robustness of the classification. As a result, the samples were stratified into two molecular subgroups, referred to as group A and group B. 
Comparison of clinical outcomes and differential analysis showed that group A was associated with significantly longer overall survival (greater than 32 months) and tumor aggressiveness, invasion and therapeutic resistance, while group B was linked to a poorer prognosis (survival less than 30 months) and less aggressiveness, necrosis and potential therapeutic targets. To automate the prediction of patient outcomes, we developed a dry-lab proteomic analysis pipeline. This pipeline enabled the extraction of spatially resolved MSI clusters, which were subsequently analyzed using trained machine learning models. From a single pixel or an MSI-derived cluster, the models could predict the identity of the corresponding tumor clone, its associated protein expression profile, its classification into group A or B, and ultimately the patient\u0026apos;s prognostic category.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScalability and system resources\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eProfiler is designed with scalability and robust system resources in mind, ensuring optimal performance and reliability for high-demand analytical tasks. The platform runs on the M\u0026eacute;socentre de Calcul de Lille (https://hpc.univ-lille.fr), leveraging an infrastructure that includes 246 GB of RAM, 8 GB of swap memory and multiple vCPUs and vGPUs. This setup, operating on a Linux server and utilizing OpenStack cloud technology, provides ample computational power to handle complex analyses efficiently. Additionally, Profiler benefits from expandable GPU and vCPU pools, allowing for dynamic scaling of resources based on user demand. The system is actively monitored to ensure it meets the analytical needs of its users. If user demand increases, as tracked via system telemetry, Profiler\u0026apos;s compute resources (CPU, GPU, RAM) can be scaled in collaboration with HPC administrators. 
This proactive approach ensures that the platform remains responsive and capable of handling increased loads without compromising performance. Furthermore, plans are in place to augment the system\u0026apos;s capacity if there is a surge in demand from the user community, ensuring that Profiler continues to deliver high-quality, timely results even under heavy usage. In parallel, desktop versions of Profiler are being considered for development to provide increased accessibility and offline functionality.\u003c/p\u003e\n\u003cp\u003eThe technological stack supporting Profiler\u0026rsquo;s backend, frontend and cloud infrastructure is detailed in \u003cstrong\u003eTable 1\u003c/strong\u003e, which outlines the key libraries and tools integrated into the platform to enable efficient data processing, modeling, visualization and deployment.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1\u003c/strong\u003e. Overview of technologies and libraries used in the Profiler platform.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"17\" style=\"width: 179px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eBackend\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTechnology/ Library\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDescription/ 
Role\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003ePython\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eMain programming language, integrates all modules and orchestrates workflow execution\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003ePandas\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eData manipulation and preprocessing for tabular and omics data\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eNumpy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eEfficient numerical operations and array manipulation\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003epyopenMS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eMass spectrometry file parsing\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eMsconvert (ProteoWizard)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eRaw MS data conversion to open formats\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eOpenpyxl\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eExcel file handling\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eScikit-learn\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eMachine learning models\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 
223px;\"\u003e\n \u003cp\u003eTensorflow/Keras\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eDeep learning model design and training (MLPs, CNNs, RNNs)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003elifelines\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eSurvival modeling and stratification\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eImbalanced-learn\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eClass balancing (e.g., SMOTE, ADASYN, under sampling)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003epycombat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eBatch effect correction\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eSHAP, Eli5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eModel explainability\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eScipy.stats, statsmodels\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eParametric and non-parametric statistical tests (e.g., t-test, ANOVA, Kruskal-Wallis, Mann-Whitney)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003ejoblib / pickle\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eModel serialization and persistence (saving/loading ML pipelines and objects)\u003c/p\u003e\n \u003ctable border=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n 
\u003ctr\u003e\u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eGSEApy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eGene set enrichment analysis\u003c/p\u003e\n \u003ctable border=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eNetworkX\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eConstruction and analysis of biological networks and pathway graphs\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"3\" style=\"width: 179px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFrontend\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eStreamlit\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eWeb-based user interface\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eHTML/CSS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eCustom layout and styling of the interface components\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003ePlotly, Matplotlib, Seaborn\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eInteractive and static visualizations (e.g., spectra, volcano plots, radar charts)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"5\" style=\"width: 179px;\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eHPC and Cloud\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eLinux (Ubuntu)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eOperating system for server and local environments\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eOpenStack\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eCloud infrastructure management for resource provisioning\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eSystemd\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eService orchestration and daemon management\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eNginx\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eReverse proxy server for deployment, load balancing, and API exposure\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 223px;\"\u003e\n \u003cp\u003eDocker\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 201px;\"\u003e\n \u003cp\u003eContainerization for reproducibility, environment control, and deployment\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Results","content":"\u003cp\u003eProfiler\u0026rsquo;s primary goal is to bridge the gap between raw omics data and actionable biological insights by leveraging\u0026nbsp;a custom pipeline combining state-of-the-art libraries, original modules, and high-performance computing. 
\u003cstrong\u003eFigure 1\u003c/strong\u003e illustrates the 8 interconnected components of this software (detailed in the User\u0026rsquo;s Manual\u003cstrong\u003e\u0026nbsp;in Supplementary Data 1\u003c/strong\u003e). To demonstrate how Profiler operates and the types of results it can generate, the proteomic dataset processed with MaxQuant is used as the main running example throughout the workflow. Additionally, lipidomic data acquired using the SpiderMass technology, such as those published by Zirem\u003cem\u003e\u0026nbsp;et al.\u0026nbsp;\u003c/em\u003e(2024)\u003csup\u003e11\u003c/sup\u003e, are used for modules that do not apply to the proteomic dataset.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eData conversion and importation\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo accommodate vendor heterogeneity, Profiler integrates a vendor-agnostic data conversion\u0026nbsp;module using msconvert from ProteoWizard\u003csup\u003e12\u003c/sup\u003e. It supports the conversion of raw files from Bruker, Thermo Fisher, and Waters instruments into open formats such as .mzML, .mzXML, .mz5, and .mzDB via pyOpenMS. During conversion, users can define mass range boundaries, enable peak picking, apply lock mass corrections, and downsample spectra for faster processing. This ensures standardization of MS input across platforms and enhances compatibility with downstream tools.\u003c/p\u003e\n\u003cp\u003eIn addition, Profiler accepts and harmonizes a wide variety of omics data types, including mass spectrometry standard format files, where MS files are structured by biological class or condition and parsed using the pyOpenMS library\u003csup\u003e13\u003c/sup\u003e, and tabular omics data in .csv, .tsv, .txt, and .xlsx formats, including exports from MaxQuant\u003csup\u003e10\u003c/sup\u003e, DIA-NN\u003csup\u003e14\u003c/sup\u003e, and Perseus\u003csup\u003e6\u003c/sup\u003e. 
The expected format for tabular data requires a column named \u0026lsquo;Class\u0026rsquo; for target labels (e.g., control, condition 1, etc.) and the remaining columns as features (ions, gene names, protein names, metabolites, etc.). Additionally, Profiler supports survival and clinical data, requiring \u0026apos;Overall Survival\u0026apos; and \u0026apos;State\u0026apos; columns to facilitate survival modeling and stratification using the lifelines library. Uploaded datasets are automatically cataloged, checked for delimiter consistency, and verified for missing or malformed values. Data handling and manipulation are facilitated by the pandas and openpyxl libraries.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eData exploration and preprocessing\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAn integrated data exploration module enables users to interactively explore and validate their datasets, offering summarization through visualizations of class distributions, missing-value information and sample sizes.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAs shown in \u003cstrong\u003eFigure 2\u003c/strong\u003e, using the data exploration component of Profiler, the dataset consists of 108 samples in group A (73.5%) and 39 samples in group B (26.5%), indicating a class imbalance that may require over- or under-sampling to address. Furthermore, approximately 50% of the data contains missing values, with a higher proportion in group B. Only half of the features follow a normal distribution, suggesting that \u003cem\u003eK\u003c/em\u003e-nearest neighbors (KNN) imputation is suitable for handling missing data, and that either the t-test or the Mann-Whitney U test should be used to assess statistical significance, depending on the distribution of each variable (feature).\u003c/p\u003e\n\u003cp\u003eUsers can also manage labels by editing class names in-session for clarity and consistency. 
One module provides various preprocessing options, including normalization techniques such as TIC, RMS, BasePeak, QNorm and log transformations, as well as batch effect correction using NeuroCombat from the pycombat package\u003csup\u003e15\u003c/sup\u003e. Dynamic binning can be applied to selected mass ranges, and missing value imputation is supported through mean, median, mode, and KNN-based imputation using scikit-learn\u0026apos;s KNNImputer\u003csup\u003e16\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFor our dataset, KNN imputation with missing value removal was used to optimize the dataset for the rest of the data analysis, as recommended in \u003cstrong\u003eFigure 2\u003c/strong\u003e. Indeed, given that the dataset contains a balanced mix of values with uncertain distribution characteristics, it is unclear whether mean or median imputation would be optimal. As a result, KNN imputation emerges as the most robust and adaptive solution. After KNN imputation and the removal of missing values (exclusive features), the total number of proteins falls from 4936 to 4251.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eClass balancing and sampling\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eProfiler includes advanced resampling strategies to correct class imbalance, either by data augmentation or data reduction, which is crucial for training classification models. These strategies include oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling), which generate synthetic samples to balance the classes, and undersampling techniques like RandomUnderSampler and NearMiss, which reduce the number of samples in the majority class. 
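\u003c/p\u003e\n\u003cp\u003eThe interpolation idea behind SMOTE-style oversampling can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the imbalanced-learn implementation and not Profiler\u0026apos;s own code: the function name, the toy two-feature samples and the choice of k are hypothetical.\u003c/p\u003e

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(minority, n_new=6)
balanced_minority = minority + new_points
```

\u003cp\u003eFor the running example (108 samples in group A and 39 in group B), balancing by oversampling would require 108 - 39 = 69 synthetic group B samples.\u003c/p\u003e\n\u003cp\u003e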
These resampling methods are applied through the imbalanced-learn library\u003csup\u003e17\u003c/sup\u003e, ensuring full compatibility with structured data and MS intensities, thereby enhancing the performance and reliability of classification models.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eIn our dataset, applying oversampling, as recommended in \u003cstrong\u003eFigure 2\u003c/strong\u003e, to address class imbalance would result in 108 samples per group. All subsequent analyses could then rely on this balanced dataset, if desired.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eData Visualization\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe visualization engine relies on Plotly, Matplotlib, and Seaborn to generate interactive plots, offering a variety of visualization options and providing a comprehensive and interactive way to explore and understand the data. These include feature distributions displayed through line, bar, histogram, and radar charts, as well as spectra visualization with mean signal/features and individual samples. UpSet and Venn diagrams are used to show the overlap of features across classes, using the upsetplot library and custom logic\u003csup\u003e18\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSpectra from classical mass spectrometry datasets can be displayed and interactively explored (\u003cstrong\u003eFigure S1\u003c/strong\u003e), allowing zooming and other manipulations. 
In addition, pseudo-spectra, such as the one shown in \u003cstrong\u003eFigure 3A\u003c/strong\u003e, can be visualized to display the label-free quantification (LFQ) intensities of all detected proteins across groups.\u003c/p\u003e\n\u003cp\u003eUsing the raw data, before applying KNN imputation by Class and removing class-exclusive features (which cannot be imputed as they are not detected in the other class), a Venn diagram can be generated to identify group-exclusive proteins (\u003cstrong\u003eFigure 3B\u003c/strong\u003e). In our case, 145 proteins were found to be exclusive to group A, and 540 to group B, with 4251 proteins in common. However, the exclusive proteins can only be used for pathway enrichment analysis (as presented in the following sections of the paper), but not for statistical testing or machine/deep learning model training. Therefore, for all subsequent analyses, except pathway enrichment, the results rely exclusively on the dataset with no exclusive features and no missing values, since the latter are imputed.\u003c/p\u003e\n\u003cp\u003eBefore performing statistical tests, it is important to explore and better understand the data. Several types of visualizations are available for this purpose. For example, a bar chart can be used to show the distribution of a specific protein across different groups by displaying its presence or absence (\u003cstrong\u003eFigure 3C\u003c/strong\u003e). The protein EGFR, for instance, appears to be more highly expressed in group B. It is also possible to compare multiple proteins simultaneously using radar charts, line plots, or bar charts (\u003cstrong\u003eFigures 3D\u0026ndash;F\u003c/strong\u003e). 
These visualizations reveal, for example, that EGFR is more abundant in group B, whereas DLGAP3, ICAM3 and KCTD16 are more highly expressed in group A.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eCorrelation and similarity analysis\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo explore inter-feature or inter-class relationships, Profiler offers advanced modules that support exploratory biological hypotheses and quality control. Users can assess intra-feature relationships through correlation methods, including Pearson and Spearman, which are computed between the average feature vectors of each class. Pearson correlation is ideal for normally distributed data, measuring linear relationships, while Spearman correlation is suitable for non-parametric data, assessing monotonic relationships using rank values.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAdditionally, inter-class resemblance is evaluated using cosine similarity and Cohen\u0026rsquo;s Kappa score. Cosine similarity measures the angle between feature vectors of each class, indicating the directional alignment of the data (with 1 signifying identical direction and 0 orthogonal). Cohen\u0026rsquo;s Kappa, on the other hand, evaluates the agreement in categorized feature profiles after discretizing continuous data into ranked categories (e.g., low, medium, high expression). This discretization allows Kappa to measure agreement on patterns rather than exact numerical\u0026nbsp;values, providing insights into the consistency of feature profiles across classes. 
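\u003c/p\u003e\n\u003cp\u003eThe two inter-class measures described above can be sketched in plain Python. This is a minimal illustration, not Profiler\u0026apos;s implementation: the rank-based split into three categories (low, medium, high), the function names and the six-feature class-mean vectors are assumptions made for the example.\u003c/p\u003e

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def discretize(values, n_bins=3):
    """Rank-based binning: label 0 = lowest group, n_bins - 1 = highest group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical class-mean intensities for six features
mean_a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
mean_b = [12.0, 45.0, 18.0, 33.0, 52.0, 70.0]

cos = cosine_similarity(mean_a, mean_b)
kappa = cohen_kappa(discretize(mean_a), discretize(mean_b))
```

\u003cp\u003eIn this toy case the two profiles point in nearly the same direction (cosine similarity close to 1), while agreement on the low/medium/high categories is only moderate (\u0026kappa; = 0.5), which is the kind of contrast the two measures are meant to expose.\u003c/p\u003e\n\u003cp\u003e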
These techniques are crucial for understanding the underlying data structure and ensuring the reliability of biological interpretations. The novel application of Cohen\u0026rsquo;s Kappa within Profiler is particularly valuable for omics analysis, as it can reveal consistent expression trends that may be masked by variability at the continuous level.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFigure S2\u003c/strong\u003e shows that while group A and group B are highly correlated (r = 0.93), indicating strong similarity in continuous variables, their moderate agreement on Cohen\u0026rsquo;s Kappa (\u0026kappa; = 0.57) suggests notable differences when categorical aspects are considered.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eMachine Learning \u0026amp; Deep Learning\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eProfiler supports comprehensive machine learning (ML) and deep learning (DL) workflows through scikit-learn, TensorFlow, and custom wrappers, offering a wide range of techniques for both unsupervised and supervised learning.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFor unsupervised learning, users can employ dimensionality reduction methods such as PCA (Principal Component Analysis), UMAP (Uniform Manifold Approximation and Projection), and t-SNE (t-Distributed Stochastic Neighbor Embedding) to visualize data clusters. The plots can be generated in both 2D and 3D, depending on the dimensionality reduction method used. In our case, non-linear techniques such as UMAP and t-SNE proved to be the most effective for clearly distinguishing between the two groups, A and B, as they form well-separated clusters (\u003cstrong\u003eFigure 4A-B\u003c/strong\u003e). 
In contrast, the linear method PCA fails to clearly differentiate these groups, suggesting that it does not capture the underlying structure of the data as effectively.\u003c/p\u003e\n\u003cp\u003eFor UMAP, the n_neighbors parameter is crucial as it defines the size of the local neighborhood used for manifold approximation. Choosing this parameter can be challenging for scientists, as it is not well-documented and can lead to misleading biological conclusions if not set correctly. To address this, Profiler uses a heuristic approach to calculate n_neighbors based on the number of data points. This heuristic ensures that the neighborhood size adapts to the dataset size, balancing between capturing local structure and computational efficiency. This approach is based on recommendations from the original UMAP paper\u003csup\u003e19\u003c/sup\u003e and practical guidelines from the machine learning community. For t-SNE, the perplexity parameter controls the effective number of nearest neighbors considered for each data point. Similar to UMAP, selecting an appropriate perplexity value can be non-trivial and may result in incorrect interpretations if done manually. Profiler calculates perplexity using a heuristic approach based on the number of data points. This heuristic aims to find a balance between preserving local and global data structures while avoiding overfitting. This method is inspired by the original t-SNE paper\u003csup\u003e20\u003c/sup\u003e and best practices in the field. By automating the selection of these parameters, Profiler helps users avoid potential pitfalls and ensures more reliable and reproducible results.\u003c/p\u003e\n\u003cp\u003eIn addition, \u003cem\u003ek\u003c/em\u003e-means clustering and silhouette analysis\u003csup\u003e21\u003c/sup\u003e can be used to assess group formation and heterogeneity. Indeed, for \u003cem\u003ek\u003c/em\u003e-means clustering, determining the optimal number of clusters is a critical step. 
Profiler uses silhouette analysis to evaluate the quality of the clustering. The silhouette score measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates better-defined clusters. By analyzing the silhouette scores for different numbers of clusters, Profiler helps users identify the optimal number of clusters without overclustering or underclustering. This ensures that the clustering results are meaningful and biologically relevant.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eLooking at our dataset, Silhouette analysis indicates that an optimal clustering would involve three groups, rather than the current two-group classification (A and B) (\u003cstrong\u003eFigure 4C\u003c/strong\u003e). This is consistent with the previous t-SNE and UMAP plots, where two distinct subgroups can be observed within group B, suggesting underlying heterogeneity. This observation is further supported by the t-SNE projection with three clusters, where group B clearly subdivides into two separate clusters, referred to as groups B and C (\u003cstrong\u003eFigure 4D-E\u003c/strong\u003e)\u003cstrong\u003e.\u0026nbsp;\u003c/strong\u003eThis suggests that group B may contain multiple tumor clones or distinct subtypes. In previous lipid-MSI studies, patients from group B often showed high levels of necrosis, which could also explain the observed heterogeneity. To explore this further, integrating additional clinical metadata, such as age, sex, treatment history, or comorbidities, could help identify meaningful biological or clinical differences and improve patient stratification.\u003c/p\u003e\n\u003cp\u003eIn supervised learning, Profiler provides access to over 23 models, including Random Forest, Logistic Regression, SVM, Na\u0026iuml;ve Bayes, Gradient Boosting and LDA/QDA, along with ensemble methods like bagging classifiers. 
Users can compare model performance using learning curves, confusion matrices and classification reports with metrics such as F1 Scores, accuracy, recall, precision, sensitivity and specificity.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhen attempting to build the optimal classification model using our dataset, 20 out of the 23 tested algorithms reached perfect accuracy (100%) after 20-fold cross-validation (\u003cstrong\u003eFigure 4F-G\u003c/strong\u003e). This performance underscores a clear separation between the groups and indicates that the models successfully captured distinct protein profiles characteristic of each group. Notably, both the confusion matrix and the classification report demonstrate that the optimal model, built using the RidgeClassifier algorithm, achieved perfect performance with no misclassifications (\u003cstrong\u003eFigure 4H-I\u003c/strong\u003e). The learning curve shows that the model begins to learn effectively after 70 samples and reaches optimal performance by 90 samples. Furthermore, the close alignment of the training and validation curves towards the end indicates good generalization, with no apparent underfitting or overfitting (\u003cstrong\u003eFigure 4J\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eFor deep learning, Profiler supports architectures like MLP (Multilayer Perceptron), CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), with accelerated training and real-time metric tracking. Deep learning typically requires large amounts of data to be truly effective. In our case, the dataset is not extensive enough to provide a clear advantage over traditional machine learning approaches. 
Nevertheless, as shown in \u003cstrong\u003eFigure S3\u003c/strong\u003e, the deep learning algorithms (MLP and CNN) still managed to achieve 100% accuracy in classifying the two groups.\u003c/p\u003e\n\u003cp\u003eUsers can save and reload trained models along with the selected features, the fitted label encoder and the full preprocessing pipeline, including scaling and transformations. This ensures that any new data used for prediction will undergo the exact same preprocessing steps as the training data, maintaining consistency and avoiding data leakage. Importantly, saving the specific trained features (not just the input dimension) guarantees that the model only processes the variables it was originally trained on, preserving both model integrity and performance. This is particularly crucial when applying the model to new omics data, such as metabolomic spectra, proteomic LFQ, or gene/RNA expression, where some features used during training may not be detected in a given sample. In traditional workflows, this mismatch would prevent prediction altogether. However, Profiler handles this seamlessly by assigning a default value (e.g., zero) to any missing feature, treating it as not detected. This allows predictions to proceed using the available features without compromising model compatibility or requiring retraining.\u003c/p\u003e\n\u003cp\u003eThis approach enhances reproducibility, ensures robust and interpretable predictions, and supports scalable deployment in real-world scenarios. All models can be exported for external use, making Profiler a powerful and flexible tool for both exploratory analysis and predictive modeling across diverse omics applications.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eBiomarker discovery\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNext, Profiler offers a comprehensive pipeline for biomarker discovery and feature interpretation, which also serves as a robust feature selection process. 
This pipeline includes a variety of statistical analysis tools and explainability modules designed to identify, rank, and visualize significant biomarkers. These insights can then be saved as structured dataframes for further analysis or model retraining, enhancing overall performance and interpretability.\u003c/p\u003e\n\u003cp\u003eOne of the standout features is the volcano plot, conventionally used to compare binary classes. However, Profiler has expanded this functionality to support multi-class comparisons, providing a more versatile tool for biomarker discovery. Volcano plots visualize the statistical significance (p-value) and magnitude of change (fold change) for each feature, allowing users to quickly identify the most relevant biomarkers. Profiler also provides an option to highlight feature names for better clarity and offers feature detection based on intensity thresholds, which can automatically identify and include significant features in the analysis. This multi-class capability broadens the applicability of volcano plots, making them a powerful tool for complex datasets.\u003c/p\u003e\n\u003cp\u003eUsing a volcano plot with a fold-change threshold of 0.1 and a p-value threshold of 0.05, 66 proteins were found significantly deregulated in group A or B (\u003cstrong\u003eFigure 5A\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eProfiler also integrates explainability tools to enhance the interpretability of machine learning results. It supports SHAP\u003cstrong\u003e\u0026nbsp;(\u003c/strong\u003eSHapley Additive exPlanations) (https://shap.readthedocs.io) for both local and global attribution\u003csup\u003e22\u003c/sup\u003e, and LIME (https://eli5.readthedocs.io) for model introspection. SHAP values provide detailed explanations of model outputs by quantifying the contribution of each feature to individual predictions, offering both per-sample and overall insights. 
LIME, on the other hand, offers transparency in models by highlighting feature weights/contributions and their effects (positive or negative). Profiler includes custom modules that convert SHAP and LIME outputs into structured DataFrames, facilitating easier downstream analysis and integration. In addition, various visualization techniques such as beeswarm plots and positive/negative contribution plots are generated to visualize feature impacts and enhance understanding of model behavior. Together, these tools ensure that predictive models are not only accurate but also trustworthy and explainable.\u003c/p\u003e\n\u003cp\u003eUsing AI explainability tools, 54 proteins that contributed most to the model\u0026rsquo;s predictions were identified (\u003cstrong\u003eFigure 5B\u0026ndash;C\u003c/strong\u003e). For further analysis, these proteins were added to those found deregulated in the volcano plot, except for proteins already present in that list, such as FMO3 for group B and HBQ1, TMEM163, SEPT14, and DHRS3 for group A.\u003c/p\u003e\n\u003cp\u003eAdditionally, Profiler offers heatmap clustering for both features and samples, enabling users to visualize patterns and relationships within the data. Users can perform heatmap clustering on all or selected features, with options to average feature values by class and apply statistical tests to filter significant features. Customizable parameters include the choice of data type (original intensity or log2 transformed) and p-value thresholds, allowing for tailored analysis. 
The heatmaps are enhanced with custom color schemes to highlight under-expression, neutral expression, and over-expression, providing a clear and intuitive visualization.\u003c/p\u003e\n\u003cp\u003eA heatmap generated using all 120 discovered biomarkers, from both volcano plots and AI explainability methods, clearly demonstrated a strong clustering of the two groups, with distinct patterns of under- and overexpressed proteins (\u003cstrong\u003eFigure 5D\u003c/strong\u003e). Moreover, when comparing the heatmaps generated using the biomarkers from the volcano plots and those identified through AI, we observe that the one derived from AI appears to be clustered in a much more homogeneous manner (\u003cstrong\u003eFigure S4\u003c/strong\u003e). In contrast, the heatmap based on volcano plot biomarkers shows a noticeable heterogeneity, particularly within group B.\u003c/p\u003e\n\u003cp\u003eFor statistical analysis, Profiler supports a range of tests tailored to both binary and multi-class scenarios, including parametric and non-parametric methods. Users can perform t-tests and ANOVA for parametric data, as well as Kruskal-Wallis and Mann-Whitney tests for non-parametric data. These tests help assess the significance of features and their correlation with biological conditions, with results visualized as boxplots, violin plots or bar plots.\u003c/p\u003e\n\u003cp\u003eHere, two examples of deregulated proteins are displayed as boxplots and violin plots (\u003cstrong\u003eFigure 5E-F\u003c/strong\u003e) and tested with the Kruskal-Wallis test. ACYP2 is significantly overexpressed in group B, whereas COL6A2 is more highly expressed in group A.\u003c/p\u003e\n\u003cp\u003eOverall, Profiler\u0026apos;s biomarker discovery and feature interpretation pipeline is designed to streamline the process of identifying significant features, enhancing model performance, and providing clear, interpretable results. 
The ability to save these insights as structured dataframes further facilitates downstream analysis and model retraining, ensuring that users can leverage the most relevant features for their research.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003ePathway enrichment and functional annotation\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBiological pathway analysis in Profiler is performed using GSEApy\u003csup\u003e23\u003c/sup\u003e, interfaced via a custom algorithm. This feature allows users to select from more than 100 databases, including KEGG\u003csup\u003e24\u003c/sup\u003e, GO\u003csup\u003e25\u003c/sup\u003e, Reactome\u003csup\u003e26\u003c/sup\u003e, MSigDB\u003csup\u003e27\u003c/sup\u003e, and Drug Signatures\u003csup\u003e28,29\u003c/sup\u003e, providing a wide range of biological contexts for analysis. One of the key advantages of Profiler is its support for multi-class enrichment, which facilitates comparative insights across different phenotypes. This is particularly useful for studies involving multiple conditions or treatments, as it allows for a more nuanced understanding of biological pathways. For each pathway, Profiler provides detailed information including the number of associated proteins or genes, as well as the list of implicated features within that pathway. Importantly, Profiler also highlights genes or proteins that are not associated with any enriched pathways, allowing users to capture the full scope of molecular involvement, including potentially novel or understudied factors. The results of the enrichment analysis are visualized in enriched term graphs, heatmaps and interactive plots, which provide an intuitive way to explore the significance of various pathways. 
Additionally, the results can be exported as structured tables, making it easy to integrate the findings into further analyses or reports.\u003c/p\u003e\n\u003cp\u003eUsing all identified biomarkers (exclusive features, features selected from the volcano plot, and markers highlighted via AI explainability) and applying the enrichment module, we identified the top 15 enriched pathways for each group using the MSigDB_Hallmark_2020 database. These pathways were ranked based on their combined score and visualized as either bar plots or heatmaps (\u003cstrong\u003eFigure 6A\u0026ndash;B\u003c/strong\u003e), or based on gene counts (\u003cstrong\u003eFigure 6C\u003c/strong\u003e). In addition, the specific proteins involved in each enriched pathway can be retrieved (\u003cstrong\u003eFigure 6D and Supplementary Data 2\u003c/strong\u003e). Even more interestingly, their interaction network is visualized in \u003cstrong\u003eFigure 6E\u003c/strong\u003e, revealing complex interactions within and between certain pathways.\u003c/p\u003e\n\u003cp\u003eThis analysis revealed, for example, that group A tumors are enriched in pathways such as myogenesis (cell differentiation) and interferon alpha response (antiviral immune response). Overall, group A appears to activate differentiation, immune response, and cellular structure programs, suggesting a more stable, less invasive, and potentially less aggressive tumor phenotype. In contrast, proteins in group B are involved in signaling pathways such as KRAS signaling (proliferation), unfolded protein response (cancer cells under stress), and interferon gamma response (inflammation and oxidative stress). This is particularly noteworthy, as group B tumors seem to engage cellular stress, proliferation, inflammation, and tumor deregulation pathways, consistent with a more aggressive, invasive behavior and a potentially higher resistance to treatment. 
It is worth noting that 120 genes and 478 genes are not enriched in group A and B respectively. Profiler retains this information, allowing users to explore these non-enriched features, which may represent poorly characterized or context-specific proteins/genes. Investigating these elements could lead to novel biological insights and uncover new functional roles.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eGoing further, we explored potential drug targets using the DGIdb_Drug_Targets_2024 database to identify compounds that could specifically target the previously enriched pathways in each group. As with the pathway analysis, the results were visualized using multiple plot types (\u003cstrong\u003eFigure 6F\u0026ndash;J\u003c/strong\u003e). For group B, the identified drugs were notably enriched in inhibitors of oncogenic kinases, such as Dabrafenib and Dasatinib, which target proliferative signaling pathways. Additionally, compounds frequently associated with aggressive or treatment-resistant cancers, including Masitinib and Linifanib, were also highlighted. This aligns with the observation that group B tumors strongly activate oncogenic signaling pathways such as MAPK and BRAF, which promote cell proliferation, survival, and therapeutic resistance, consistent with a more aggressive tumor phenotype and poorer prognosis. In contrast, group A showed enrichment in targets of immunomodulatory and anti-inflammatory drugs, such as Infliximab, along with classical anticancer agents like Lapatinib and Methotrexate. This suggests a distinct therapeutic landscape for group A tumors, potentially more responsive to immune modulation and conventional chemotherapy (\u003cstrong\u003eSupplementary Data 3\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;Profiler also offers an interactive gene interaction network using NetworkX, a powerful Python library for the creation, manipulation, and study of complex networks. 
This network visualization allows users to explore the relationships between genes involved in enriched pathways, providing deeper insights into the biological mechanisms at play. Users can dynamically interact with the network, zooming in on specific genes or pathways to understand their connectivity and importance. The network is color-coded based on the protein type or class, making it easy to distinguish between different groups of genes. This interactive feature enhances the interpretability of the enrichment results and helps researchers identify key genes and pathways that may be crucial for further investigation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eSurvival and prognostic modeling\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eUsing the lifelines library\u003csup\u003e30\u003c/sup\u003e, Profiler supports advanced survival and prognostic modeling techniques. Key features include Kaplan-Meier estimation, which provides a non-parametric way to estimate the survival function from lifetime data, and Cox Proportional Hazards modeling, which assesses the effect of several risk factors on survival time. Additionally, Profiler supports Log-Rank tests to compare the survival distributions of two or more groups. These tools are essential for translational biomarker studies, where understanding the prognostic value of various covariates is crucial. By integrating these survival analysis techniques, Profiler enables researchers to identify factors that significantly impact survival\u003cstrong\u003e\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003c/strong\u003eoutcomes, aiding in the development of more effective treatment strategies and personalized medicine approaches. 
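\u003c/p\u003e\n\u003cp\u003eThe Kaplan-Meier (product-limit) estimate mentioned above can be sketched in plain Python. This is a minimal illustration and not the lifelines API used by Profiler: the function name and the toy follow-up data are hypothetical, with durations standing in for the \u0026apos;Overall Survival\u0026apos; column and event flags for the \u0026apos;State\u0026apos; column (1 = event observed, 0 = censored, an assumed encoding).\u003c/p\u003e

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per subject
    events: 1 if the event was observed, 0 if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Observed events exactly at time t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (months) and event indicators
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
```

\u003cp\u003eCensored subjects (here the ones followed for 8 and 15 months) never count as events but stay in the at-risk set until their last follow-up, which is why the estimate drops only at observed event times.\u003c/p\u003e\n\u003cp\u003e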
Furthermore, the Cox Proportional Hazards model can be saved and deployed directly within Profiler to make predictions on new data or patients, facilitating real-time prognostic assessments and enhancing clinical decision-making.\u003c/p\u003e\n\u003cp\u003eOur analysis used Kaplan-Meier survival curves and a Cox proportional hazards model to assess survival outcomes and influencing factors for the two distinct groups, A and B. The Kaplan-Meier curves in \u003cstrong\u003eFigure S5A\u003c/strong\u003e reveal a statistically significant survival advantage for Group A over Group B (p-value of 0.00001). The forest plot from the Cox model in \u003cstrong\u003eFigure S5B\u003c/strong\u003e identifies key proteins impacting survival, with log(hazard ratios) and 95% confidence intervals illustrating their effects (\u003cstrong\u003eSupplementary Data 4\u003c/strong\u003e). Variables to the right of zero indicate increased hazard and worse survival, while those to the left suggest better survival prospects. We can observe, for instance, that the protein MX1 (Myxovirus resistance protein 1) is associated with shorter survival\u003csup\u003e31\u003c/sup\u003e and acts as a negative prognostic factor. This is consistent with the enrichment result showing its involvement in type I interferon response and tumor-promoting inflammation, often linked to aggressive tumor phenotypes and resistance to therapy. In contrast, GOT1 (Glutamic-Oxaloacetic Transaminase 1) and ACYP2 (Acylphosphatase 2) are both associated with longer survival\u003csup\u003e32\u003c/sup\u003e, suggesting a protective role.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eWizard \u0026amp; Deployment Tools\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Wizard module is designed to guide users through real-time and post-hoc prediction workflows, enhancing the accessibility and utility of predictive modeling. 
This module supports real-time predictions on new samples directly from raw files, a feature initially designed for real-time prediction in connection with mass spectrometers. While real-time prediction directly from the instrument is not feasible when using Profiler from the web, users can still achieve real-time predictions by dragging and dropping a zipped raw file from instruments such as Waters, Bruker, or Thermo. This capability ensures that users can leverage Profiler\u0026apos;s predictive power even in environments where direct instrument integration is not possible. Additionally, the Wizard Module facilitates post-hoc predictions using tabular datasets and saved models. Users can upload tabular data and apply saved models to make predictions, with or without ground truth data (a \u0026ldquo;Class\u0026rdquo; column). This flexibility allows for the comparison and assessment of test datasets against known outcomes, providing valuable insights into model performance. The results of these predictions can be visualized, interpreted, and exported in publication-ready formats, making it easy to share findings with colleagues or include them in research publications.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eUsing the same dataset employed for spectral visualization in a previous module, originating from the study by Zirem \u003cem\u003eet al.\u003c/em\u003e (2024)\u003csup\u003e11,33\u003c/sup\u003e, a classification model was built, achieving 92% accuracy through 5-fold cross-validation. This model was then tested blindly on an unseen dataset using the Wizard module of Profiler. Two prediction modes are available: from raw data (real-time or post-acquisition) or from an already processed CSV file (post-hoc). 
As shown in \u003cstrong\u003eFigures S6 and S7\u003c/strong\u003e, the real-time predictions were highly satisfactory, with no misclassifications.\u003c/p\u003e\n\u003cp\u003eA novel and powerful feature introduced in Profiler is the ability to simultaneously interrogate multiple trained models. Users can upload several models trained on the same features and label encoders, and Profiler will perform predictions using all models in parallel. The final class is then determined by a majority voting strategy, and a confidence score is provided to reflect the consensus across models. This ensemble-like approach can improve prediction robustness and compensate for model-specific biases, supporting more reliable decision-making in practical applications.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe increasing volume and complexity of omics data continue to push the boundaries of computational biology. Tools capable of managing and interpreting such data must not only be powerful and statistically sound but also accessible to the wider research community\u003csup\u003e34\u003c/sup\u003e. Profiler addresses this need by offering an end-to-end, modular solution that unifies multiple analytical capabilities within a single, web-based application.\u003c/p\u003e\n\u003cp\u003eUnlike existing platforms such as Galaxy, which requires complex installation and server configuration, or Perseus, which is confined to Windows environments, Profiler is platform-independent and lightweight, designed to run efficiently on a wide range of systems. Its web-based architecture ensures broad accessibility, and its scalability is evidenced by its performance on high-capacity computing infrastructure. 
This makes Profiler suitable for both small laboratory experiments and large-scale clinical studies.\u003c/p\u003e\n\u003cp\u003eA distinguishing feature of Profiler is its seamless integration of machine learning and deep learning modules, enabling sophisticated predictive modeling directly from user-uploaded data. By embedding preprocessing, feature selection, model training, and evaluation into an intuitive workflow, Profiler lowers the barrier to entry for advanced data science techniques in biology\u003csup\u003e35\u003c/sup\u003e. Furthermore, the inclusion of automated biomarker discovery and survival analysis tools allows for clinically relevant insights to be drawn with minimal overhead.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAnother critical advantage lies in the platform\u0026rsquo;s support for data visualization and interpretability. Profiler offers real-time interactive plots, such as dimensionality-reduction projections (PCA, t-SNE, UMAP), volcano plots, clustering heatmaps, and box/violin plots, which are essential for hypothesis generation and exploratory data analysis. These features not only improve user engagement but also facilitate deeper understanding of data structure and biological patterns.\u003c/p\u003e\n\u003cp\u003eFrom a software engineering standpoint, Profiler was built with extensibility in mind. Its modular design allows for rapid integration of new analytical methods and data types as the field evolves. Future directions include the incorporation of single-cell omics support and the release and integration of pre-trained models for domain-specific applications such as bacterioscoring, immunoscoring, and dry proteomics. These models, validated in prior peer-reviewed studies\u003csup\u003e8,11,33\u003c/sup\u003e, offer domain-specific scoring pipelines that are seamlessly integrated into the workflow. 
This fusion of enrichment-driven interpretation with task-specific predictive modeling allows researchers to not only observe differential expression patterns but also contextualize them within a biological or clinical framework, supporting hypothesis generation, validation, and translational impact.\u003c/p\u003e\n\u003cp\u003eTo ensure scalability and maintain user experience, Profiler is currently hosted on the high-performance computing (HPC) infrastructure of the M\u0026eacute;socentre of Lille, with access to 246 GB RAM, multiple CPUs, and expandable GPU capacity. Should usage statistics indicate high demand, we are prepared to scale up computational resources accordingly by increasing CPU, GPU, and RAM allocations, in collaboration with the M\u0026eacute;socentre\u0026rsquo;s HPC provisioning team. This commitment ensures that Profiler remains responsive and capable of handling large-scale bioinformatics workflows.\u003c/p\u003e\n\u003cp\u003eIn conclusion, Profiler represents a powerful addition to the bioinformatics toolkit. By combining robust analytics with a user-centered design, it closes a critical gap in omics data analysis. We anticipate that Profiler will serve as a valuable resource for biologists, clinicians, and data scientists alike, accelerating discovery in diverse research areas ranging from cancer genomics to microbial ecology.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData Availability \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eProfiler is openly accessible at https://prism-profiler.univ-lille.fr. All datasets used in this study are available in the dedicated GitHub repository at https://github.com/yanisZirem/Profiler_v1_requests_datatests, in the Data_fo_peer_review_paper folder. 
In addition to the glioblastoma dataset illustrated in the article, the repository includes a wide range of real and simulated datasets designed to showcase Profiler\u0026rsquo;s capabilities across multiple omics platforms. It contains raw mass spectrometry data acquired from Bruker and Waters instruments (Bruker_data/ and Waters_data/), as well as processed output files from DIA-NN (DIA-NN_data/) and MaxQuant (Maxquant_data/). The Tabular_data_multi_omics/ directory offers structured \u0026quot;toy\u0026quot; datasets specifically created to help users get started with Profiler, test its different modules, and explore its full potential. These datasets, covering lipidomics, proteomics, transcriptomics, and metabolomics, are tailored for binary classification (e.g., aggressive vs. non-aggressive tumors) and multi-class tasks (e.g., tumor, necrotic, and healthy tissues). They are also suitable for training and educational purposes, particularly for students or researchers learning to analyze multi-omics data. Additionally, the Survival_data/ folder contains clinical variables and lipid markers (clinical_and_LipidsMarkers.csv) for Cox regression modeling, as well as preprocessed patient data (statuts_patients.csv) for Kaplan\u0026ndash;Meier survival analysis. All data are shared in accessible formats to encourage transparency, reproducibility, and broader adoption by the scientific and educational communities.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work is partially supported by the Institut National de la Sant\u0026eacute; et de la Recherche Biom\u0026eacute;dicale (Inserm), Inserm Transfert, R\u0026eacute;gion Hauts de France, the M\u0026eacute;socentre de Calcul of the Universit\u0026eacute; de Lille, and the Agence Nationale de la Recherche (Click \u0026amp; Detect, 1051 CE29, 2024). 
The authors thank the OrganOmics platform of PRISM Inserm U1192, which is recognized and supported by the University of Lille, the Infrastructure PROFI (https://www.profiproteomics.fr/), and the GIS IbiSA (https://www.ibisa.net/). The OrganOmics platform (Villeneuve d\u0026rsquo;Ascq, France) is also supported by R\u0026eacute;gion Hauts de France and FEDER funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eY.Z. conceptualized and designed the data analysis pipelines. Y.Z. developed the Profiler software. L.L. tested Profiler to identify bugs and improve the tool. L.L. and Y.Z. wrote the User Manual. Y.Z. and L.L. wrote the manuscript\u0026rsquo;s original draft. M.S. and I.F. corrected the manuscript. M.S. and I.F. supervised the project and provided the funding. \u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eY.Z., L.L., I.F. and M.S. declare they have no competing interests. Profiler has been registered since January 14\u003csup\u003eth\u003c/sup\u003e, 2025 with the Agence pour la Protection des Programmes under the Inter Deposit Digital Number IDDN.FR.001.030004.000.S.C.2025.000.31230.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003e\u003cem\u003eGenerative AI statement\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe author(s) declare that Generative AI was used in the creation of this manuscript. During the preparation of this work, the authors utilized ChatGPT-4.0 to enhance the language quality. 
Following its use, the authors thoroughly reviewed and edited the content as necessary, taking full responsibility for the accuracy and integrity of the publication.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eHasin, Y., Seldin, M. \u0026amp; Lusis, A. Multi-omics approaches to disease. \u003cem\u003eGenome Biol\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, 83 (2017).\u003c/li\u003e\n\u003cli\u003eMangul, S. \u003cem\u003eet al.\u003c/em\u003e Challenges and recommendations to improve the installability and archival stability of omics computational tools. \u003cem\u003ePLoS Biol\u003c/em\u003e \u003cstrong\u003e17\u003c/strong\u003e, e3000333 (2019).\u003c/li\u003e\n\u003cli\u003ePerez‐Riverol, Y., Alpi, E., Wang, R., Hermjakob, H. \u0026amp; Vizca\u0026iacute;no, J. A. Making proteomics data accessible and reusable: Current state of proteomics databases and repositories. \u003cem\u003eProteomics\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 930\u0026ndash;950 (2015).\u003c/li\u003e\n\u003cli\u003eAfgan, E. \u003cem\u003eet al.\u003c/em\u003e The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e46\u003c/strong\u003e, W537\u0026ndash;W544 (2018).\u003c/li\u003e\n\u003cli\u003ePang, Z. \u003cem\u003eet al.\u003c/em\u003e MetaboAnalyst 5.0: narrowing the gap between raw spectra and functional insights. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e49\u003c/strong\u003e, W388\u0026ndash;W396 (2021).\u003c/li\u003e\n\u003cli\u003eTyanova, S. \u003cem\u003eet al.\u003c/em\u003e The Perseus computational platform for comprehensive analysis of (prote)omics data. \u003cem\u003eNat Methods\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 731\u0026ndash;740 (2016).\u003c/li\u003e\n\u003cli\u003eDuhamel, M. 
\u003cem\u003eet al.\u003c/em\u003e Spatial analysis of the glioblastoma proteome reveals specific molecular signatures and markers of survival. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 6665 (2022).\u003c/li\u003e\n\u003cli\u003eLagache, L., Zirem, Y., Le Rhun, \u0026Eacute;., Fournier, I. \u0026amp; Salzet, M. Predicting Protein Pathways Associated to Tumor Heterogeneity by Correlating Spatial Lipidomics and Proteomics: The Dry Proteomic Concept. \u003cem\u003eMolecular \u0026amp; Cellular Proteomics\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 100891 (2025).\u003c/li\u003e\n\u003cli\u003eQuanico, J. \u003cem\u003eet al.\u003c/em\u003e Development of liquid microjunction extraction strategy for improving protein identification from tissue sections. \u003cem\u003eJournal of Proteomics\u003c/em\u003e \u003cstrong\u003e79\u003c/strong\u003e, 200\u0026ndash;218 (2013).\u003c/li\u003e\n\u003cli\u003eTyanova, S., Temu, T. \u0026amp; Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. \u003cem\u003eNat Protoc\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 2301\u0026ndash;2319 (2016).\u003c/li\u003e\n\u003cli\u003eZirem, Y. \u003cem\u003eet al.\u003c/em\u003e Real-time glioblastoma tumor microenvironment assessment by SpiderMass for improved patient management. \u003cem\u003eCell Reports Medicine\u003c/em\u003e 101482 (2024) doi:10.1016/j.xcrm.2024.101482.\u003c/li\u003e\n\u003cli\u003eKessner, D., Chambers, M., Burke, R., Agus, D. \u0026amp; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 2534\u0026ndash;2536 (2008).\u003c/li\u003e\n\u003cli\u003eR\u0026ouml;st, H. L., Schmitt, U., Aebersold, R. \u0026amp; Malmstr\u0026ouml;m, L. pyOpenMS: A Python‐based interface to the OpenMS mass‐spectrometry algorithm library. 
\u003cem\u003eProteomics\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 74\u0026ndash;77 (2014).\u003c/li\u003e\n\u003cli\u003eDemichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. \u0026amp; Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. \u003cem\u003eNat Methods\u003c/em\u003e \u003cstrong\u003e17\u003c/strong\u003e, 41\u0026ndash;44 (2020).\u003c/li\u003e\n\u003cli\u003eBehdenna, A. \u003cem\u003eet al.\u003c/em\u003e pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. \u003cem\u003eBMC Bioinformatics\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 459 (2023).\u003c/li\u003e\n\u003cli\u003eAljrees, T. Improving prediction of cervical cancer using KNN imputer and multi-model ensemble learning. \u003cem\u003ePLoS ONE\u003c/em\u003e \u003cstrong\u003e19\u003c/strong\u003e, e0295632 (2024).\u003c/li\u003e\n\u003cli\u003eLema\u0026icirc;tre, G. \u0026amp; Nogueira, F. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning.\u003c/li\u003e\n\u003cli\u003eLavanya, A. \u003cem\u003eet al.\u003c/em\u003e Assessing the Performance of Python Data Visualization Libraries: A Review. \u003cem\u003eIJCERT\u003c/em\u003e \u003cstrong\u003e10\u003c/strong\u003e, 28\u0026ndash;39 (2023).\u003c/li\u003e\n\u003cli\u003eMcInnes, L., Healy, J. \u0026amp; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint at http://arxiv.org/abs/1802.03426 (2020).\u003c/li\u003e\n\u003cli\u003eVisualizing Data using t-SNE.\u003c/li\u003e\n\u003cli\u003eRousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. \u003cem\u003eJournal of Computational and Applied Mathematics\u003c/em\u003e \u003cstrong\u003e20\u003c/strong\u003e, 53\u0026ndash;65 (1987).\u003c/li\u003e\n\u003cli\u003eLundberg, S. \u0026amp; Lee, S.-I. 
A Unified Approach to Interpreting Model Predictions. Preprint at https://doi.org/10.48550/ARXIV.1705.07874 (2017).\u003c/li\u003e\n\u003cli\u003eFang, Z., Liu, X. \u0026amp; Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e39\u003c/strong\u003e, btac757 (2023).\u003c/li\u003e\n\u003cli\u003eKanehisa, M. \u0026amp; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes.\u003c/li\u003e\n\u003cli\u003eThe Gene Ontology Consortium. Gene Ontology Consortium: going forward. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e43\u003c/strong\u003e, D1049\u0026ndash;D1056 (2015).\u003c/li\u003e\n\u003cli\u003eFabregat, A. \u003cem\u003eet al.\u003c/em\u003e The Reactome Pathway Knowledgebase. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e46\u003c/strong\u003e, D649\u0026ndash;D655 (2018).\u003c/li\u003e\n\u003cli\u003eLiberzon, A. \u003cem\u003eet al.\u003c/em\u003e Molecular signatures database (MSigDB) 3.0. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e27\u003c/strong\u003e, 1739\u0026ndash;1740 (2011).\u003c/li\u003e\n\u003cli\u003eBarrett, T. \u003cem\u003eet al.\u003c/em\u003e NCBI GEO: archive for functional genomics data sets\u0026mdash;update. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e41\u003c/strong\u003e, D991\u0026ndash;D995 (2012).\u003c/li\u003e\n\u003cli\u003eSvoboda, D. L., Saddler, T. \u0026amp; Auerbach, S. S. An Overview of National Toxicology Program\u0026rsquo;s Toxicogenomic Applications: DrugMatrix and ToxFX. in \u003cem\u003eAdvances in Computational Toxicology\u003c/em\u003e (ed. Hong, H.) vol. 30 141\u0026ndash;157 (Springer International Publishing, Cham, 2019).\u003c/li\u003e\n\u003cli\u003eDavidson-Pilon, C. lifelines: survival analysis in Python. 
\u003cem\u003eJOSS\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, 1317 (2019).\u003c/li\u003e\n\u003cli\u003eMenyh\u0026aacute;rt, O., Fekete, J. T. \u0026amp; Győrffy, B. Gene expression-based biomarkers designating glioblastomas resistant to multiple treatment strategies. \u003cem\u003eCarcinogenesis\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 804\u0026ndash;813 (2021).\u003c/li\u003e\n\u003cli\u003eGao, X., Zhao, J., Jia, L. \u0026amp; Zhang, Q. Remarkable immune and clinical value of novel ferroptosis-related genes in glioma. \u003cem\u003eSci Rep\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 12854 (2022).\u003c/li\u003e\n\u003cli\u003eZirem, Y., Ledoux, L., Salzet, M. \u0026amp; Fournier, I. Protocol to analyze 1D and 2D mass spectrometry data from glioblastoma tissues for cancer diagnosis and immune cell identification. \u003cem\u003eSTAR Protocols\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 103285 (2024).\u003c/li\u003e\n\u003cli\u003eMarx, V. The big challenges of big data. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e498\u003c/strong\u003e, 255\u0026ndash;260 (2013).\u003c/li\u003e\n\u003cli\u003eLibbrecht, M. W. \u0026amp; Noble, W. S. Machine learning applications in genetics and genomics. 
\u003cem\u003eNat Rev Genet\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 321\u0026ndash;332 (2015).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"multi-omics, bioinformatics platform, machine learning, deep learning, biomarkers discovery, pathways enrichment, automatic drugs repurposing, data visualization, survival analysis, Streamlit","lastPublishedDoi":"10.21203/rs.3.rs-7058776/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7058776/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"High-throughput multi-omics experiments create large, heterogeneous data matrices that remain inaccessible to many life-science laboratories. We introduce Profiler, an open-source, web-based platform that unifies data import, quality control, preprocessing, statistical tests, machine- and deep-learning, biomarker discovery, pathway enrichment and survival modelling behind an intuitive point-and-click interface. 
Built with Streamlit and deployable either locally or on high-performance clusters, Profiler processes proteomics, lipidomics and other omics modalities at interactive speeds. In a benchmark on spatial proteomic and lipidomic maps from 50 glioblastoma resections, the platform reproduced published molecular subtypes, uncovered candidate therapeutic targets and generated fully traceable analysis reports in under ten minutes. Profiler therefore lowers the computational barrier for multi-omics projects and provides a reproducible foundation for systems-biology and precision-medicine research.","manuscriptTitle":"Profiler: an open web platform for multi-omics analysis","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-22 13:15:00","doi":"10.21203/rs.3.rs-7058776/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"12ddd8c9-705b-4026-9c2b-16bd8939738f","owner":[],"postedDate":"July 22nd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":51176158,"name":"Biological sciences/Cancer"},{"id":51176159,"name":"Biological sciences/Computational biology and bioinformatics"}],"tags":[],"updatedAt":"2025-07-24T22:15:09+00:00","versionOfRecord":[],"versionCreatedAt":"2025-07-22 
13:15:00","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7058776","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7058776","identity":"rs-7058776","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
