Abstract
In label-free mass spectrometry experiments, the data output is typically a proteome table that requires further processing, testing, and visualization to fully interpret the captured proteomic signals. Currently, post-quantification analysis of these tables often relies on complex programmatic pipelines, which can become challenging to use. Here, we introduce the Proteomics Eye (ProtE), a single-function R package designed to streamline the analysis of proteome tables generated by commonly used software tools (DIA-NN, ProteomeDiscoverer, MaxQuant). ProtE provides a broad range of options for data processing, preparation, and statistical testing. It also performs gene set enrichment analysis and offers a comprehensive suite of visualization plots to assess data quality and facilitate biological interpretation. Given a categorical variable with two or more groups, ProtE enables group-wide and pairwise statistical comparisons across all group combinations, using both traditional statistical tests and linear models for differential expression analysis. By integrating all these features into a single, user-friendly R function, ProtE simplifies the analysis of large-scale label-free DDA and DIA datasets, making advanced proteomic analysis accessible to both experienced researchers and beginners.
Theodoros Margelos 1, Manousos Makridakis 1, Charis Gonidaki 1,2, Foteini Paradeisi 1, Manos Vossos 3, Jerome Zoidakis 1,3, Antonia Vlahou 1, Rafael Stroggilos 1*
1. Center of Systems Biology, Biomedical Research Foundation Academy of Athens, Athens, Greece
2. Department of Applied Mathematics, Faculty of Electrical Engineering Mathematics and Computer Science, University of Twente, Enschede, Netherlands
3. Department of Biochemistry and Molecular Biology, Faculty of Biology, National and Kapodistrian University of Athens, Athens, Greece
The emergence of quantitative proteomics in the last two decades has led to a breakthrough in biological science research [1]. It has enhanced the discovery of novel biomarkers and drug development [2,3], the study of disease progression [4], and boosted the understanding of cellular signaling mechanisms [5]. This was enabled by the advancement of Mass Spectrometry (MS) technologies, which allowed not only the identification of up to thousands of proteins per sample but also their robust quantification [6]. Optimized data acquisition methods such as Data Dependent Acquisition (DDA) and Data Independent Acquisition (DIA) [7], in combination with label-free or labeling methods for protein quantification, produce rich results, and various software tools have been developed for the respective data processing and further analysis [8]. Such tools map the spectrometric data to peptide identifications and have internal methods for peptide- or protein-level quantification. With the prolific expansion of bioinformatics, new data processing methods have been introduced to the field of MS, increasing the heterogeneity of the downstream proteomics analysis workflow [9]. Navigating between the different options and building the right workflow for analyzing proteomics data can become challenging even for experienced researchers.
Various tools have been developed towards standardizing the analysis of proteome-level data. These tools have been primarily established as R packages, Python libraries or Shiny-based websites. Available solutions like MSstats [10], protti [11], MSnbase [12], NormalyzerDE [13], prolfqua [14], tidyProteomics [15], pyOpenMS [16], MSPypeline [17], AlphaPeptStats [18] and OmicScope [19] offer extensive functionality for either data manipulation or statistical analysis. While each has unique advantages, they often require significant bioinformatics expertise, as their bottom-up workflows rely on multiple functions and complex input transformations. Furthermore, for differential expression analysis they mostly rely on linear model statistics alone and do not offer nonparametric tests. Website tools like ProtExA [20], ProteoSign [21], and PIQMIe [22] are easier to use, enabling users with no programming skills to perform analyses. However, these platforms raise concerns with respect to data privacy and security, and they depend on stable internet access and functional servers.
As proteomics experiments are frequently designed by researchers in non-proteomics laboratories, it is common for individuals with little or no programming expertise to end up with large proteome tables that are complex and time-consuming to process and analyze. On the other hand, proteomics facilities are often constrained by high workloads, and thus they cannot afford to allocate extended resources for analytical data tasks. To address these challenges, we introduce the Proteomics Eye (ProtE), an R package that automates, standardizes, and accelerates the label-free proteomics workflow in a user-friendly way (Figure 1).
ProtE stands out for its ability to wrap the entire post-quantification workflow in a single function call, ProtE_analyse(), executable in a matter of minutes or even seconds. At the same time, it offers users the flexibility to input independent or paired experimental groups and analyze them using both traditional non-parametric statistical tests and linear models that can incorporate fixed effects as covariates, while saving the results directly on the user’s computer.
ProtE is tailored for the analysis of proteomics data quantified by the commonly used spectral processing tools ProteomeDiscoverer (Thermo Fisher Scientific, Waltham, USA), MaxQuant [23], and DIA-NN [24] (or the DIA-NN output of FragPipe). We designed ProtE from the perspective of a user with little or no programming expertise: given an independent categorical variable with N groups, ProtE unites in just one function call all necessary processing steps together with the ability to perform all possible N(N-1)/2 pairwise statistical comparisons. This makes ProtE highly versatile, freeing users from the need to subset data or manually iterate analyses, and significantly reducing the effort and potential for error in downstream analysis. In brief, ProtE runs the following steps in the order shown:
1. Fetching UniProt information if a "Description" column is not included in the input data table
2. Visualization of the data prior to any processing
3. Normalization
4. Filtering of frequently missing proteins
5. Imputation
6. Visualization of the processed data
7. Statistical analysis
8. Visualization of statistical output in the form of plots
9. Enrichment analysis
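The N(N-1)/2 pairwise comparisons mentioned above can be illustrated with a minimal Python sketch. It is not ProtE's R code; the group labels and the helper name pairwise_contrasts are hypothetical and serve only to show how the number of contrasts grows with the number of groups.

```python
from itertools import combinations


def pairwise_contrasts(groups):
    """Return every unordered pair of distinct groups, keeping input order."""
    return list(combinations(groups, 2))


# Hypothetical group labels, for illustration only.
groups = ["Control", "TreatmentA", "TreatmentB", "TreatmentC"]
contrasts = pairwise_contrasts(groups)

# With N groups there are N(N-1)/2 contrasts.
n = len(groups)
assert len(contrasts) == n * (n - 1) // 2

for first, second in contrasts:
    print(f"{first} vs {second}")
```

With four groups this enumerates six contrasts, which is the iteration that would otherwise have to be set up by hand.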
The main function of ProtE, ProtE_analyse(), reads, processes and analyzes proteome tables generated by Proteome Discoverer, MaxQuant, DIA-NN, or FragPipe DIA-NN (Table 1).
| No. | Quantification software and file format | Expected input |
| --- | --- | --- |
| 1. | DIA-NN (or DIA-NN output from FragPipe) in .tsv or .xlsx format | Table with all samples (unique_genes_matrix or pg_matrix files) |
| 2. | MaxQuant (ProteinGroups file in .txt or .xlsx format) | Table with all samples (ProteinGroups file) |
| 3. | Proteome Discoverer (one table with all samples) | Table with all samples (.xlsx file) |
| 4. | Proteome Discoverer (one .xlsx file per sample, usually exported from .MSF files) | Group folders with .xlsx files (one per sample), provided via the pd_single_dir argument |
Table 1. Information about the proteome tables that can be parsed and analyzed by ProtE’s main function ProtE_analyse().
ProtE runs in two modes, depending on whether sample metadata are available. If a metadata file is available, ProtE will incorporate all the metadata variables into the statistical model, provided they do not contain any missing values (a single missing value will drop the given variable from the model). The metadata file needs to be organized in the following simple way: the first column must contain the sample names written exactly as they appear in the proteomics data, while the second column should contain the independent variable that defines the main groups of the experiment. Every subsequent column of the metadata can contain covariates whose inclusion in the model will affect the statistical results of the independent variable.
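The metadata layout and the rule that a single missing value removes a covariate from the model can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not ProtE's implementation: the function name usable_covariates, the column names, and the use of None or an empty string to mark a missing value are all hypothetical.

```python
def usable_covariates(header, rows):
    """Split covariate columns into kept and dropped ones.

    header: column names; column 0 holds sample names, column 1 the grouping variable.
    rows:   one list per sample, aligned with header; None or "" marks a missing value
            (an assumption of this sketch).
    """
    kept, dropped = [], []
    for col in range(2, len(header)):
        values = [row[col] for row in rows]
        if any(v is None or v == "" for v in values):
            dropped.append(header[col])  # one missing value excludes the covariate
        else:
            kept.append(header[col])
    return kept, dropped


# Hypothetical metadata: sample name, group, then two covariates.
header = ["sample", "group", "age", "batch"]
rows = [
    ["S1", "Control", 54, "A"],
    ["S2", "Control", 61, "B"],
    ["S3", "Disease", None, "A"],  # missing age
    ["S4", "Disease", 47, "B"],
]

kept, dropped = usable_covariates(header, rows)
print("kept:", kept, "dropped:", dropped)
```

In this toy table the covariate "batch" would enter the model, while "age" would be left out because one sample lacks a value.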
If a metadata file is not available, ProtE still supports statistical testing of the independent variable, as long as the samples in the input proteome table are positioned consecutively according to their experimental group (details on the underlying data structure are provided in the Supplementary File). Users can then define the sample-to-group labeling and the corresponding group sizes with the function parameters group_names=c() and samples_per_group=c(), respectively (Fig. 2A(ii)).
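The way consecutive sample columns can be mapped to groups from a list of group names and group sizes is sketched below in Python. The function assign_groups and the sample names are hypothetical; the sketch only mirrors the idea behind the group_names and samples_per_group parameters and does not reproduce ProtE's internal logic.

```python
def assign_groups(sample_names, group_names, samples_per_group):
    """Label consecutive samples with their group.

    Assumes the samples are ordered so that all samples of the first group come
    first, then all samples of the second group, and so on.
    """
    if len(group_names) != len(samples_per_group):
        raise ValueError("one group size is needed per group name")
    if sum(samples_per_group) != len(sample_names):
        raise ValueError("group sizes must add up to the number of samples")

    labels = []
    for name, size in zip(group_names, samples_per_group):
        labels.extend([name] * size)
    return dict(zip(sample_names, labels))


# Hypothetical sample columns, already ordered by group.
samples = ["S1", "S2", "S3", "S4", "S5"]
mapping = assign_groups(samples, ["Control", "Disease"], [2, 3])
print(mapping)
```

The two checks make the main failure mode explicit: if the sizes do not add up to the number of sample columns, the labeling is ambiguous and should not proceed.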
The power of ProtE as an analytical tool lies not only in its simplicity but also in its rich set of data processing and preparation options, all defined as parameters of the same R function. These options include seven distinct intensity normalization techniques, a customizable threshold for filtering proteins with missing values (applied either to the entire dataset or separately within each experimental group), and eight imputation methods (Figure 1). All options are described in detail in the Supplemental File and at https://github.com/theomargel/ProtE.
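The difference between filtering on the entire dataset and filtering within each experimental group can be made concrete with a short Python sketch. Several details are assumptions of the sketch rather than documented ProtE behavior: a protein is kept when its percentage of missing values does not exceed the threshold, missing values are represented as None, and the per-group rule (pass in at least one group, or in all groups) is exposed as the hypothetical require argument because the text does not specify it.

```python
def missing_pct(values):
    """Percentage of missing (None) entries in a list of intensities."""
    return 100.0 * sum(v is None for v in values) / len(values)


def keep_protein(values, groups, threshold, per_group=False, require=any):
    """Decide whether one protein passes the missing-value filter.

    values:    intensities of one protein across samples (None = missing)
    groups:    group label of each sample, aligned with values
    threshold: maximum tolerated percentage of missing values (assumed inclusive)
    require:   any -> pass in at least one group; all -> pass in every group
               (only used when per_group is True; an assumption of this sketch)
    """
    if not per_group:
        return missing_pct(values) <= threshold

    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    return require(missing_pct(vs) <= threshold for vs in by_group.values())


# Hypothetical protein: mostly missing in group A, mostly present in group B.
values = [10.0, None, None, None, 5.0, 6.0]
groups = ["A", "A", "A", "B", "B", "B"]

print(keep_protein(values, groups, 45))                  # 50% missing overall
print(keep_protein(values, groups, 45, per_group=True))  # group B has 33% missing
```

A protein that is absent in one condition but well quantified in another fails the global filter here yet survives the per-group one, which is the practical reason for offering both modes.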
All these analytical parameters are explicitly chosen by the user through the ProtE_analyse() function. For example, users can select filtering, normalization and imputation methods by setting the respective parameters inside the function call. By adjusting one or more of these parameters and rerunning the function, users can directly assess how different data processing choices influence the statistical results. A summary report is generated at the end of each analysis, detailing all transformations applied to the data in the specific ProtE_analyse() run (e.g. the normalization and imputation methods and the missing-value frequency thresholds used).
Following data processing, ProtE performs statistical analysis. Users can specify whether this analysis will be performed on independent or paired samples via the parameter independent, which is set to TRUE for independent samples or FALSE for paired samples. ProtE generates lists of differentially expressed proteins using both nonparametric tests and linear models. The statistical output of the former is stored in the Excel file traditional_statistics.xlsx and includes Wilcoxon’s test results for the pairwise comparisons, Kruskal-Wallis or Friedman’s test results when there are more than two groups, and the PERMANOVA pseudo-F, R² and p-value for the complete feature set, computed with the package vegan [26]. For assessing the equality of the group variances (homoscedasticity), ProtE implements Levene’s and Bartlett’s statistical tests. In all cases, the output provides, per comparison, the average abundance of each protein per group, the ratio of the means and the corresponding log2 fold change, the nominal p-values, and the false discovery rates (the method for adjusting p-values for multiple hypothesis testing is selected by the user).
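Two of the reported quantities, the log2 fold change derived from the ratio of group means and a false discovery rate adjustment, can be sketched in Python as follows. The Benjamini-Hochberg procedure is used here only as one common choice, since the adjustment method in ProtE is selected by the user; the function names and the input numbers are hypothetical.

```python
import math


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means (group_a over group_b).

    Both means must be positive; a zero mean (possible after zero imputation)
    would need separate handling that this sketch does not attempt.
    """
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_a / mean_b)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical abundances of one protein in two groups.
print(log2_fold_change([8.0, 8.0], [2.0, 2.0]))

# Hypothetical nominal p-values for four proteins.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The adjustment is applied across all proteins of one comparison, so the adjusted value of a given protein depends on how many proteins were tested alongside it.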
For linear model-based statistics, ProtE leverages the limma package [27]: the data are log2-transformed and fitted to a linear model with empirical Bayes moderation, and the results are saved in the file ’dataset_limma_test.xlsx’. These include coefficients for each experimental group and any covariates (e.g. age, batch effects) incorporated into the model. For paired sample experiments, subjects are treated as random effects in the linear model using a blocking approach, while any covariates are incorporated into the design matrix of the model [28]. Specifically, the output features results from an ANOVA-like F-test for variance comparison when there are more than two groups, alongside results from moderated t-tests for each pairwise comparison, and log2 fold changes of the moderated data.
Using the package fgsea [29], ProtE also performs gene set enrichment analysis, with the log2 fold changes of the pairwise comparisons as the ranking metric. The user can select a collection from the MSigDB [30] via the arguments subcollection and species, with options including the commonly used Reactome [31] and Gene Ontology [32] gene sets. The results are saved in an Excel file named GSEA_results.xlsx, which contains the enriched pathways of each comparison in separate sheets.
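To show how a ranking by log2 fold change turns into an enrichment score, the following Python sketch implements the classic weighted running-sum statistic of the gene set enrichment analysis method [30]. It is a didactic sketch only: it does not reproduce the algorithm or the significance estimation of fgsea, and the gene names, scores, and gene sets are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked:   list of (gene, score) pairs sorted by score, highest first
    gene_set: set of gene names
    weight:   exponent applied to |score| for genes inside the set
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hit_weights = [abs(score) ** weight if gene in gene_set else 0.0
                   for gene, score in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking without covering it")

    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked, hit_weights):
        if gene in gene_set:
            running += hit / total_hit   # step up, weighted by |score|
        else:
            running -= 1.0 / n_miss      # step down by a constant amount
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical log2 fold changes, sorted from most up- to most down-regulated.
ranked = [("A", 2.0), ("B", 1.0), ("C", 0.5), ("D", -0.5), ("E", -1.0), ("F", -2.0)]

print(enrichment_score(ranked, {"A", "B"}))  # set concentrated at the top
print(enrichment_score(ranked, {"E", "F"}))  # set concentrated at the bottom
```

A set whose members cluster at the top of the ranking yields a positive score, and one clustering at the bottom yields a negative score, which is how the direction of pathway regulation is read from the results.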
To enable evaluation of data quality and to enhance biological interpretation, ProtE offers a comprehensive suite of visualizations commonly used for proteomic datasets. These include violin plots and boxplots of the protein abundances in each sample and a mean-standard deviation relation plot, both before and after data processing, to assess the effects of normalization and imputation. ProtE also creates a log2 abundance rank plot, a histogram of the imputed values (when an imputation method that generates multiple values has been selected), and PCA plots based on both the full dataset and the significantly altered proteins. Heatmaps illustrating abundance alterations across experimental groups and volcano plots are generated as well. Finally, plots displaying the most enriched, significant pathways for each contrast are created. All plots are saved in .bmp format and are named automatically according to their content; for example, After_Processing_Boxplot.bmp is the boxplot of the protein intensities of each sample after data processing (e.g. normalization, filtering, imputation). Additionally, output messages in the RStudio console explicitly state the content of the output files.
To demonstrate the capabilities of ProtE, published proteomics data derived from ProteomeDiscoverer, DIA-NN, and MaxQuant were analyzed using the ProtE_analyse function.
Specifically, proteomics data from Tserga et al., which compared the proteomic profiles of Ins2Akita type I diabetic mice and matched wild-type controls at two time points (2 and 4 months old), were analyzed [33]. Their analysis identified 255 differentially expressed proteins (DEPs) between 2-month-old Ins2Akita and wild-type mice and 372 DEPs between 4-month-old Ins2Akita and wild-type mice, based on a Mann-Whitney p-value threshold of <0.05. The processed data, available in their supplementary material, were generated using ProteomeDiscoverer 1.4 with the SEQUEST search engine and the UniProt mouse (Mus musculus) database, and were exported as one file per sample. After organizing the files for each experimental group (Ins2Akita and wild-type) into distinct folders, the respective directories were used as input to ProtE_analyse together with the following post-processing options, consistent with the original publication: Parts per Million (PPM) normalization, a filtering value of 45 (proteins with less than 45% missing values across samples are kept), and imputation of missing values as zeros. ProtE identified 261 DEPs (Mann-Whitney p-value <0.05) for the 2-month-old mice, compared with the 255 reported by Tserga et al. (Fig. 2B(i)), and 356 DEPs for the 4-month-old mice, all of which were included in the initial analysis (Fig. 2B(ii)).
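The PPM normalization and zero imputation used in this test case can be illustrated with a minimal Python sketch. It assumes that PPM normalization rescales each sample so that its observed intensities sum to one million; this definition, the function names, and the intensities below are assumptions of the sketch and are not taken from ProtE's source code.

```python
def ppm_normalize(sample):
    """Rescale one sample so that its observed intensities sum to 1e6.

    sample: dict mapping protein -> raw intensity, with None for missing values.
    Missing values stay missing and do not contribute to the total.
    """
    total = sum(v for v in sample.values() if v is not None)
    if total <= 0:
        raise ValueError("sample has no positive intensity to normalize")
    return {protein: (None if v is None else v * 1e6 / total)
            for protein, v in sample.items()}


def impute_zero(sample):
    """Replace missing values with zero."""
    return {protein: (0.0 if v is None else v) for protein, v in sample.items()}


# Hypothetical raw intensities of one sample; P3 was not quantified.
raw = {"P1": 300.0, "P2": 100.0, "P3": None, "P4": 600.0}

normalized = ppm_normalize(raw)
processed = impute_zero(normalized)
print(processed)
```

Because each sample is scaled by its own total, the normalized values express the share of each protein in the sample, which makes samples with different overall signal comparable.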
Additionally, proteomics data from the study by Angelopoulou et al. [34], deposited in the PRIDE database (Project PXD052994), were parsed. This study compares the tear proteomic profiles of children with type I diabetes to those of age-matched healthy controls, providing insights into disease-related protein changes. The raw mass spectrometry data were processed as described [34] using DIA-NN (version 1.8.1) and searched against the Homo sapiens UniProt database. The original analysis identified 263 DEPs between diabetic and healthy subjects, based on a Mann-Whitney p-value threshold of <0.05. The DIA-NN pg_matrix output file was provided to ProtE, and the post-processing steps of the original analysis were replicated (log2 transformation as normalization, filtering value set to 100% so that no filtering occurs, and imputation of missing values as 0). Using ProtE with these parameters, all 263 DEPs (Mann-Whitney p-value <0.05) reported in the original publication were identified (Fig. 2B(iii)).
As a further showcase of the interoperability of ProtE between ProteomeDiscoverer- and MaxQuant-quantified proteomes, a re-analysis of data from Stroggilos et al. was also performed [35]. In the original study, label-free DDA raw files from a total of 98 tissue samples were analyzed with ProteomeDiscoverer, and the authors identified three proteomic subtypes (NPS1, NPS2, and NPS3), which were further tested for statistically significant differences in clinical and proteomic features. We sought to investigate whether ProtE could recapitulate the statistical analysis of the three subtypes after switching the quantification software from ProteomeDiscoverer to MaxQuant. The 98 raw files were processed with MaxQuant (v2.6.5) using the same instrument parameters as in the original publication, and the output file, “proteinGroups.txt”, was subsequently provided to ProtE. The same post-processing options were applied: raw intensity values were normalized to the parts-per-million (PPM) scale, per-protein missing values were calculated globally across all 98 samples, proteins with missing values in more than 80% of samples were excluded, and imputation was disabled (imputation parameter set to FALSE). Figure 2B(iv) illustrates the DEPs identified (Mann-Whitney p-value <0.05) in the original analysis compared to those identified using ProtE_analyse. Slight differences are expected because Proteome Discoverer and MaxQuant use different spectral search engines.
Overall, based on the above-mentioned test cases, the ProtE_analyse function closely approximates the findings of the original publications, demonstrating ProtE’s ability to replicate analyses and deliver reliable results. To highlight ProtE’s efficiency, the analysis of each of the three datasets was performed with a few lines of code, as shown in Figure 2A, and completed within 2 minutes.
While ProtE is built upon established methods, such as known normalization techniques (e.g. log2, VSN), imputation strategies (e.g. kNN, missRanger) and widely used statistical tests (e.g. limma, Kruskal-Wallis), its primary contribution lies in integrating these methods into a single, user-friendly function, ProtE_analyse(). ProtE addresses the need for bioinformatics expertise in proteomics analysis by automating the entire pipeline, from data processing to statistical analysis and visualization, making it accessible to a broader audience. Its direct acceptance of multiple input formats (e.g. MaxQuant, DIA-NN, Proteome Discoverer) further enhances its utility compared with tools that require data tables to be edited beforehand. Its value is also highlighted by features that are not included in other proteomics analysis tools: the option for paired analysis, the combination of traditional tests and linear models, additional statistical tests such as PERMANOVA and Levene’s test for assessing the equality of variances, and gene set enrichment analysis. This combination of accessibility, comprehensiveness, and flexibility constitutes a meaningful advancement, even if ProtE does not introduce entirely new algorithms.
As limitations, the current version of ProtE is designed to analyze data already summarized at the protein level, rather than incorporating peptide-level intensity data. Its differential analysis methods are optimized for straightforward experimental designs with categorical variables as groups of comparison, and they do not yet support complex experiments involving interactions between multiple factors and continuous variables as groups. These limitations highlight opportunities for expansion. In future versions, we aim to enhance ProtE by integrating peptide-to-protein summarization and extending its statistical capabilities to accommodate a broader range of experimental designs.
To conclude, ProtE is a new R package tailored for a wide range of users, from those with no bioinformatics background to advanced researchers. It analyzes the output tables of quantification software in a straightforward, easy-to-handle pipeline, in just one function call. As large-scale proteomics data accumulate, ProtE fulfills the need for quick, reliable and standardized processing and analysis. We invite the proteomics community to experiment with ProtE and to provide feedback via the GitHub repository: https://github.com/theomargel/ProtE/. The GitHub site includes installation instructions as well as a complete guide for the function, which are also presented in the supplementary file.
Acknowledgments
Funded by the European Union (Project 101097094 — ELMUMY; Project 101136926-MULTIR). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or HADEA. Neither the European Union nor the granting authority can be held responsible for them.
Conflicts of interest
The authors declare no conflicts of interest.
References
[1] Schubert, O.T., Röst, H.L., Collins, B.C., Rosenberger, G., Aebersold, R., Quantitative proteomics: challenges and opportunities in basic and applied research. Nat Protoc 2017, 12, 1289–1294.
[2] Yang, X.-L., Shi, Y., Zhang, D.-D., Xin, R., et al., Quantitative proteomics characterization of cancer biomarkers and treatment. Mol Ther Oncolytics 2021, 21, 255–263.
[3] El-Khateeb, E., Vasilogianni, A.-M., Alrubia, S., Al-Majdoub, Z.M., et al., Quantitative mass spectrometry-based proteomics in the era of model-informed drug development: Applications in translational pharmacology and recommendations for best practice. Pharmacol Ther 2019, 203, 107397.
[4] Cheung, C.H.Y., Juan, H.-F., Quantitative proteomics in lung cancer. J Biomed Sci 2017, 24, 37.
[5] Ordureau, A., Münch, C., Harper, J.W., Quantifying ubiquitin signaling. Mol Cell 2015, 58, 660–76.
[6] Cravatt, B.F., Simon, G.M., Yates, J.R., The biological impact of mass-spectrometry-based proteomics. Nature 2007, 450, 991–1000.
[7] Ludwig, C., Gillet, L., Rosenberger, G., Amon, S., et al., Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Mol Syst Biol 2018, 14, e8126.
[8] Chen, C., Hou, J., Tanner, J.J., Cheng, J., Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis. Int J Mol Sci 2020, 21.
[9] Kumar, C., Mann, M., Bioinformatics analysis of mass spectrometry‐based proteomics data sets. FEBS Lett 2009, 583, 1703–1712.
[10] Choi, M., Chang, C.-Y., Clough, T., Broudy, D., et al., MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014, 30, 2524–2526.
[11] Quast, J.-P., Schuster, D., Picotti, P., protti: an R package for comprehensive data analysis of peptide- and protein-centric bottom-up proteomics data. Bioinformatics advances 2022, 2, vbab041.
[12] Gatto, L., Gibb, S., Rainer, J., MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data. J Proteome Res 2021, 20, 1063–1069.
[13] Willforss, J., Chawade, A., Levander, F., NormalyzerDE: Online Tool for Improved Normalization of Omics Expression Data and High-Sensitivity Differential Expression Analysis. J Proteome Res 2019, 18, 732–740.
[14] Wolski, W.E., Nanni, P., Grossmann, J., d’Errico, M., et al., prolfqua: A Comprehensive R-Package for Proteomics Differential Expression Analysis. J Proteome Res 2023, 22, 1092–1104.
[15] Jones, J., MacKrell, E.J., Wang, T.-Y., Lomenick, B., et al., Tidyproteomics: an open-source R package and data object for quantitative proteomics post analysis and visualization. BMC Bioinformatics 2023, 24, 239.
[16] Röst, H.L., Schmitt, U., Aebersold, R., Malmström, L., pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 2014, 14, 74–7.
[17] Heming, S., Hansen, P., Vlasov, A., Schwörer, F., et al., MSPypeline: a python package for streamlined data analysis of mass spectrometry-based proteomics. Bioinformatics advances 2022, 2, vbac004.
[18] Krismer, E., Bludau, I., Strauss, M.T., Mann, M., AlphaPeptStats: an open-source Python package for automated and scalable statistical analysis of mass spectrometry-based proteomics. Bioinformatics 2023, 39.
[19] Reis-de-Oliveira, G., Carregari, V.C., Sousa, G.R.D.R. de, Martins-de-Souza, D., OmicScope unravels systems-level insights from quantitative proteomics data. Nat Commun 2024, 15, 6510.
[20] Minadakis, G., Sokratous, K., Spyrou, G.M., ProtExA: A tool for post-processing proteomics data providing differential expression metrics, co-expression networks and functional analytics. Comput Struct Biotechnol J 2020, 18, 1695–1703.
[21] Efstathiou, G., Antonakis, A.N., Pavlopoulos, G.A., Theodosiou, T., et al., ProteoSign: an end-user online differential proteomics statistical analysis platform. Nucleic Acids Res 2017, 45, W300–W306.
[22] Kuzniar, A., Kanaar, R., PIQMIe: a web server for semi-quantitative proteomics data management and analysis. Nucleic Acids Res 2014, 42, W100-6.
[23] Cox, J., Mann, M., MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 2008, 26, 1367–72.
[24] Demichev, V., Messner, C.B., Vernardis, S.I., Lilley, K.S., Ralser, M., DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods 2020, 17, 41–44.
[25] Voisinne, G., queryup: Query the UniProt REST API using R. 2019.
[26] Oksanen, J., Simpson, G., Blanchet, F., Kindt, R., et al., vegan: Community Ecology Package. R package version 2.7-0 2024.
[27] Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., et al., limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015, 43, e47.
[28] Law, C.W., Zeglinski, K., Dong, X., Alhamdoosh, M., et al., A guide to creating design matrices for gene expression experiments. F1000Res 2020, 9, 1444.
[29] Korotkevich, G., Sukhov, V., Budin, N., Shpak, B., et al., Fast gene set enrichment analysis. bioRxiv 2016.
[30] Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 2005, 102, 15545–15550.
[31] Milacic, M., Beavers, D., Conley, P., Gong, C., et al., The Reactome Pathway Knowledgebase 2024. Nucleic Acids Res 2024, 52, D672–D678.
[32] Aleksander, S.A., Balhoff, J., Carbon, S., Cherry, J.M., et al., The Gene Ontology knowledgebase in 2023. Genetics 2023, 224.
[33] Tserga, A., Pouloudi, D., Saulnier-Blache, J.S., Stroggilos, R., et al., Proteomic Analysis of Mouse Kidney Tissue Associates Peroxisomal Dysfunction with Early Diabetic Kidney Disease. Biomedicines 2022, 10, 216.
[34] Angelopoulou, E., Kitani, R.-A., Stroggilos, R., Lygirou, V., et al., Tear Proteomics in Children and Adolescents with Type 1 Diabetes: A Promising Approach to Biomarker Identification of Diabetes Pathogenesis and Complications. Int J Mol Sci 2024, 25, 9994.
[35] Stroggilos, R., Mokou, M., Latosinska, A., Makridakis, M., et al., Proteome-based classification of Nonmuscle Invasive Bladder Cancer. Int J Cancer 2020, 146, 281–294.
Published in PROTEOMICS – Clinical Applications. Version of Record published 26 Dec 2025.
Copyright
This work is licensed under a Creative Commons Attribution 4.0 International License