The Omics Molecule Extractor: A web application for the selection of potential biomarker panels

Short Report

Emanuel Lange, Kay Schallert, Johannes Schwerdt, Susmita Ghosh, Andreas Hentschel, Yvonne Reinders, Robert Heyer

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5914047/v1. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Selecting molecule panels that are applicable to classify the health state of patients is a common task in omics data analysis. Existing software for molecule selection lacks features to select molecule panels from large datasets, requires programming experience, or lacks user-friendly interfaces. We present the Omics Molecule Extractor (OMEx), an open-source web application providing a user-friendly workflow for selecting molecules and molecule panels for sample classification from large datasets.
OMEx’s user interface provides interactive visualizations for exploring input data and analysis results. The feature selection strategy underlying the algorithm is based on machine learning and has not previously been available in any software with a user interface. Extensive testing using synthetic datasets with known ground truth showed that the algorithm discovers group-separating molecules with high precision. Additionally, OMEx was tested on five real-world omics datasets, demonstrating high reproducibility and overlap with molecules reported by other feature selection methods, while also reporting alternative molecules of interest. OMEx is freely available at https://mdoa-tools.bi.denbi.de/omex/home.

Subjects and keywords: Bioinformatics, Analytical Biochemistry, Molecule Selection, Omics, Biomarkers

Full Text

Omics technologies aid in investigating the qualitative and quantitative presence of biomolecules (i.e., DNA, RNA, proteins, lipids, metabolites) in biological systems, thereby enhancing the mechanistic understanding of those molecules in health and pathophysiological states [1]. Omics methods employ cutting-edge technologies such as genome sequencing and liquid chromatography coupled to mass spectrometry, generating raw data (e.g., reads and mass spectra) that are further processed to identify and quantify the biomolecules [1]. This initial step typically generates tables containing thousands of molecules in rows, samples in columns, and measured quantities in each cell. A common objective of applying omics technologies is to identify a subset of molecules showing quantitative differences between states of an investigated biological system, for example, between healthy and pathophysiological states. These molecules can provide insights into disease-specific processes and could serve as biomarkers for diagnosis and prognosis, or as therapeutic targets.
In essence, biomarker applications, such as prognosis or diagnosis, are classification tasks that predict a sample's class based on molecular quantities. Combining multiple molecules into panels is preferred for classification, as it typically achieves higher accuracy than relying on individual molecules. The identification of molecules and molecule panels used for classification, a process known as “feature selection”, is facilitated by various statistical and machine learning methods [2] [3]. The software implementing these methods is often only available as libraries for programming languages [4] [5]. Some of these programming libraries offer high flexibility but low user-friendliness. In contrast, software implemented as a web application can provide higher user-friendliness through a graphical user interface accessed in a web browser, but may offer fewer features. Existing web applications such as MetaboAnalyst [6] and CombiROC [7] can generate rankings of individual molecules or of molecule panels based on a limited number of input molecules, respectively. However, to our knowledge, no web application currently offers the capability to generate molecule panels from thousands of input molecules, which is the typical size of omics datasets. Here, we present the Omics Molecule Extractor (OMEx; version 0.1.0), a user-friendly web application for the selection of molecule panels, tailored to researchers who generate omics data. A key, and to our knowledge unique, feature of OMEx is its ability to automatically generate rankings of molecule panels from thousands of input molecules. Additionally, it offers a tidy user interface and interactive visualizations, making it a compelling alternative to existing tools for molecule selection. The initial version of OMEx’s algorithm was developed to determine biomarkers from a metaproteomics dataset [8]. Based on this initial version, we extended and generalized the algorithm to make it applicable to other types of omics data.
The input for OMEx is a data table obtained from omics-specific processing of raw data. The table is provided as a tab-separated .csv, .tsv, or .txt file containing molecule names in rows, sample names in columns, and measured quantities in cells. Sample column names are prefixed with condition names (e.g., “control_” and “disease_”) for grouping. OMEx’s algorithm combines statistical and machine-learning methods (figure 1). The main objective is to determine a small subset of molecules from the input table, i.e., a molecule panel, which can discriminate between samples from two different conditions (e.g., healthy vs. disease). The algorithm involves four steps: (1) data preprocessing, (2) (pre-)filtering of molecules based on p-values, (3) (post-)filtering by a wrapper, and (4) a final classification. During data preprocessing (step 1), samples can be normalized by the sum of all molecule quantities within each sample, and molecules with sparse measurements (i.e., few measured values) can be filtered out. Both operations can be disabled if users have already preprocessed their data. The remaining steps utilize diagonal Linear Discriminant Analysis (d-LDA) [9], a simple classification method, and cross validation. d-LDA was chosen as the classification method because it is computationally lightweight and showed the best classification accuracy for a metaproteomics dataset in a comparison with other classification methods [8]. The p-value filter (step 2) performs a statistical test (two-sample t-test) on each molecule independently and ranks the molecules by their p-values [2]. Molecules below a certain p-value cutoff are passed to the next step. The p-value cutoff is chosen by the algorithm to provide an optimal tradeoff between classifier accuracy and a low number of molecules.
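Steps 1 and 2 described above can be illustrated with a minimal sketch. It is not OMEx's Java implementation: all function names are hypothetical, zero is assumed to encode "not measured", the 50% sparsity threshold and the order of the two preprocessing operations are assumptions, and molecules are ranked by the absolute pooled-variance t statistic instead of by p-values. When all molecules share the same group sizes, the two-sided p-value decreases monotonically with |t|, so both rankings should agree in that case.

```python
import math
from statistics import mean, variance


def labels_from_columns(columns):
    """Derive each sample's group from its column-name prefix (e.g. 'control_1')."""
    return [name.split("_", 1)[0] for name in columns]


def normalize_by_sample_sum(table):
    """Divide every quantity by the total of its sample (column)."""
    n_samples = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    return {
        molecule: [v / totals[j] if totals[j] else 0.0 for j, v in enumerate(row)]
        for molecule, row in table.items()
    }


def drop_sparse(table, min_fraction=0.5):
    """Keep molecules with a non-zero value in at least min_fraction of samples.

    Treating zero as 'not measured' and the 0.5 default are assumptions.
    """
    return {
        molecule: row
        for molecule, row in table.items()
        if sum(v != 0 for v in row) / len(row) >= min_fraction
    }


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


def rank_molecules(table, labels, group_a, group_b):
    """Order molecules from most to least group-separating by |t|."""
    def score(molecule):
        row = table[molecule]
        a = [v for v, g in zip(row, labels) if g == group_a]
        b = [v for v, g in zip(row, labels) if g == group_b]
        return abs(t_statistic(a, b))
    return sorted(table, key=score, reverse=True)


# Toy input: molecules in rows, samples in columns (values are made up).
columns = ["control_1", "control_2", "control_3", "disease_1", "disease_2", "disease_3"]
table = {
    "molA": [1.0, 2.0, 1.0, 9.0, 8.0, 9.0],
    "molB": [5.0, 5.0, 6.0, 5.0, 6.0, 5.0],
    "molC": [0.0, 0.0, 0.0, 0.0, 0.0, 4.0],
}
labels = labels_from_columns(columns)
filtered = drop_sparse(table)  # molC is measured in only one of six samples
ranking = rank_molecules(filtered, labels, "control", "disease")
print(ranking)
```

In this toy table, `molA` differs clearly between the groups while `molB` has identical group means, so `molA` is ranked first. Choosing the actual cutoff by trading classifier accuracy against panel size, as the text describes, is not part of this sketch.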
In step 3, the wrapper method [2] selects a small panel of molecules for a subset of samples and evaluates the classification accuracy based on the selected panel (cross validation). This process is repeated many times (>1,000 repetitions), and the samples provided for molecule selection are randomized in every repetition, which varies the composition of the selected panels. The wrapper outputs molecule panels and individual molecules ranked by their frequency of selection. An advantage of the wrapper over p-value filtering is that combinations of molecules are considered in the classification. However, wrapping is computationally expensive; the preceding p-value filtering therefore reduces the total computation time. The most frequently chosen panels are assumed to be robust discriminators between the sample groups and are provided to the final classification in step 4. Step 4 evaluates the predictive power of the selected molecule panel based on classification metrics (accuracy, precision, recall, F1 score). Additionally, a principal component analysis (PCA) and hierarchical clustering are performed to visualize the separation of groups and the similarity of samples based on the selected molecule panel.

OMEx is an open-source web application. Its frontend is implemented in Angular 18 and available at https://gitlab.com/kay.schallert/mpa-cloud-server. The backend is written in Java 17, which provides a REST API and the implementation of the algorithm; it uses R 4.3.2 for generating plots and Docker for deployment. The backend code is available at https://gitlab.com/kay.schallert/mpa-website. OMEx is available at https://mdoa-tools.bi.denbi.de/omex/home and provides a user-friendly interface based on the step-by-step workflow of the algorithm (figure 2). Initially, users provide their omics data table in a tab-separated format, with molecules as row names and samples as columns (example datasets are available on the OMEx website and in the supplementary files).
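To make the wrapper of step 3 and the d-LDA classifier it relies on more concrete, the following is a rough sketch under stated assumptions, not the code behind OMEx. Function names, the greedy forward selection, the 70% fitting fraction inside each repetition, and the equal class priors in d-LDA are all illustrative choices. Note that the matrix `X` here holds samples in rows and molecules in columns, i.e., it is transposed relative to the input table.

```python
import random
from collections import Counter
from statistics import mean


def fit_dlda(X, y):
    """Estimate per-class means and pooled per-feature variances (diagonal LDA)."""
    classes = sorted(set(y))
    n_features = len(X[0])
    means = {
        c: [mean(x[j] for x, g in zip(X, y) if g == c) for j in range(n_features)]
        for c in classes
    }
    dof = max(len(X) - len(classes), 1)
    variances = [
        max(sum((x[j] - means[g][j]) ** 2 for x, g in zip(X, y)) / dof, 1e-12)
        for j in range(n_features)
    ]
    return means, variances


def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, variances = model
    return min(
        means,
        key=lambda c: sum((v - m) ** 2 / s for v, m, s in zip(x, means[c], variances)),
    )


def holdout_accuracy(X_fit, y_fit, X_eval, y_eval, panel):
    """Fit d-LDA on the panel's columns and score it on the evaluation samples."""
    pick = lambda rows: [[row[j] for j in panel] for row in rows]
    model = fit_dlda(pick(X_fit), y_fit)
    hits = sum(predict_dlda(model, x) == g for x, g in zip(pick(X_eval), y_eval))
    return hits / len(y_eval)


def wrapper(X, y, repetitions=1000, panel_size=2, fit_fraction=0.7, seed=0):
    """Count how often each panel is selected across randomized sample splits."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    counts = Counter()
    for _ in range(repetitions):
        rng.shuffle(indices)
        cut = int(fit_fraction * len(indices))
        fit_idx, eval_idx = indices[:cut], indices[cut:]
        X_fit, y_fit = [X[i] for i in fit_idx], [y[i] for i in fit_idx]
        X_eval, y_eval = [X[i] for i in eval_idx], [y[i] for i in eval_idx]
        if len(set(y_fit)) < 2:
            continue  # both groups are needed to fit the classifier
        panel = []
        for _ in range(panel_size):  # greedy forward selection (an assumption)
            candidates = [j for j in range(len(X[0])) if j not in panel]
            best = max(
                candidates,
                key=lambda j: holdout_accuracy(X_fit, y_fit, X_eval, y_eval, panel + [j]),
            )
            panel.append(best)
        counts[tuple(sorted(panel))] += 1
    return counts.most_common()


# Toy data: columns 0 and 1 are uninformative, column 2 separates the groups.
y = ["control"] * 10 + ["disease"] * 10
X = [[i % 2, i % 5, (5.0 if g == "disease" else 0.0) + 0.1 * (i % 3)]
     for i, g in enumerate(y)]
ranking = wrapper(X, y, repetitions=200, panel_size=1)
print(ranking[0])  # the panel containing column index 2 should dominate
```

Ranking panels by how often they are selected across randomized splits, rather than by a single accuracy value, mirrors the frequency-based ranking described for the wrapper. Running the wrapper only on molecules that passed the p-value filter keeps the number of candidate evaluations manageable.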
The initial input form allows for filtering sparsely measured molecules and for sample-wise normalization. By default, OMEx splits all samples randomly into a training set (70%) used for molecule selection and a test set (30%) for evaluating the predictive power of the extracted molecules (the split ratio can be adapted by users). The web application accepts datasets with as few as 15 samples per group, but more samples are recommended. Each workflow step contains a detailed description and generates interactive plots for the analysis of its results, such as an initial overview of class balances (figure 2, B) or a ranking of selected molecule panels (figure 2, C). Parameters of the algorithm, such as the number of cross validation folds, can be configured in each step, but an automatic mode that runs all steps at once with default parameters is available as well. In the final step, the test set is classified based on a selected molecule panel, and the result can be evaluated using a PCA plot (figure 2, D), a pairwise scatter plot (figure 2, E), a volcano plot (figure 2, F), and important classification metrics (accuracy, precision, recall, and F1 score). All results can be downloaded as a .zip archive, which contains all figures and a configuration file storing all settings for reproducibility.

As a proof of concept, a first test of the algorithm was performed using synthetic datasets generated by a sampling strategy. The filter and the wrapper were tested individually and in combination (table 1), using 100 synthetic datasets in each case. All synthetic datasets contained 60 samples and 50 “synthetic molecules” (a detailed description of the sampling strategy can be found in the supplementary information; the code is implemented in Java and available at https://gitlab.com/kay.schallert/mpa-cloud-server/-/tree/master/src/main/java/service/omex/algorithmtest).
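The idea of testing against a known ground truth can be sketched as follows. The actual sampling strategy is described in the supplementary information and is not reproduced here; the Gaussian values, the mean shift of 2.0, the five relevant molecules, and all function names below are purely illustrative assumptions. Only the dimensions (60 samples, 50 molecules) are taken from the text. The second function scores a selection against the ground truth with the usual definitions of precision and recall.

```python
import random


def make_synthetic(n_samples=60, n_molecules=50, n_relevant=5, shift=2.0, seed=0):
    """Return (table, labels, relevant) for a two-group toy dataset.

    Relevant molecules get a shifted mean in the 'disease' group; all other
    values are standard normal noise. This is an assumed model, not the
    sampling strategy used by the authors.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    labels = ["control"] * half + ["disease"] * (n_samples - half)
    relevant = set(rng.sample(range(n_molecules), n_relevant))
    table = [
        [
            rng.gauss(shift if (m in relevant and g == "disease") else 0.0, 1.0)
            for g in labels
        ]
        for m in range(n_molecules)
    ]
    return table, labels, relevant


def precision_recall(selected, relevant):
    """Precision and recall of a selected molecule set against the ground truth."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


table, labels, relevant = make_synthetic()
print(len(table), len(table[0]), sorted(relevant))

# Hypothetical selection: two of four selected molecules are truly relevant,
# and two of three relevant molecules were found.
print(precision_recall({1, 2, 3, 4}, {1, 2, 5}))
```

Averaging these two numbers over many generated datasets gives one way to compare a filter, a wrapper, and their combination on equal terms.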
For each synthetic dataset, the “relevant” molecules were known, providing a ground truth to evaluate the selection of molecules in the p-value filter step, the wrapper step, and their combination. The selection of relevant molecules was evaluated (table 1) by precision (the fraction of selected molecules that are truly relevant) and recall (the fraction of relevant molecules that were “recovered”). These metrics range from 0 to 1, with higher values indicating better performance (see supplementary information for a more thorough explanation). The filter showed a high recall of 1.0, indicating that it reports all relevant molecules, but it also selected many irrelevant ones, as indicated by the intermediate precision of 0.75. The wrapper can sort out irrelevant molecules and reliably reports relevant ones, as indicated by its high precision of 0.88. However, it showed a low recall of 0.63, indicating that not all relevant features are found. The combination of both approaches had a recall of 0.76 and a precision of 0.91. Combining both stages therefore recovers more relevant molecules than the wrapper alone and yields fewer false-positive selections than either individual stage.

Table 1: Overview of the performance of the filter, the wrapper, and the combined stages in selecting relevant molecules.

Stage | Precision | Recall
Filter only | 0.75 | 1.0
Wrapper only | 0.88 | 0.63
Filter + Wrapper (OMEx) | 0.91 | 0.76

Testing with synthetic data provided confidence in the algorithm on a theoretical level. To evaluate OMEx under realistic conditions, we collected and analyzed five datasets containing real omics data (table 2).

Table 2: Overview of the datasets used to test OMEx.

Dataset | Molecules (rows) | Groups (#Samples) | Publication
Stool metaproteomics | 42,572 | Control (19), non-alcoholic steatohepatitis (32), hepatocellular carcinoma (29) | [8]
Blood transcriptomics | 10,527 | Control (35), rheumatoid arthritis (45) | [10] (dataset from [4] was used)
Blood proteomics | 1,070 | Control (35), rheumatoid arthritis (44) | [10]
Urine metabolomics | 2,944 | Control (469), lung cancer (536) | [11]
Glial tumor metabolomics | 7,017 | IDH wild-type tumors (50), IDH mutant tumors (38) | [12]

The initial version of OMEx’s algorithm was developed for the study by Sydor et al. [8], who determined potential biomarkers for non-alcoholic steatohepatitis (NASH) and hepatocellular carcinoma (HCC) from metaproteomics of stool samples. The current version (0.1.0) of OMEx was applied to this dataset (stool metaproteomics dataset) to test whether the molecules determined for the group pairings (control vs. NASH, control vs. HCC, NASH vs. HCC) could be reproduced. Sydor et al. applied their algorithm to each group pairing individually, as well as to all three groups [8]. Because OMEx currently only supports the analysis of two groups, only the results from group pairings were reproduced. Due to the random assignment of samples during steps 2 and 3, the most frequently selected molecules can differ slightly in every run. Therefore, the top 20 molecules from OMEx were compared to the biomarker candidates reported by Sydor et al. [8]. The most frequently reported molecule panel (a subset of the top 20 molecules) was used for a final classification run. Classification accuracy was determined by classifying samples from a test set containing 30% of all samples. The classification accuracy is the ratio of correctly classified samples to all classified samples. Accuracy ranges from 0 to 1, with a value of 1 meaning that all samples were classified correctly. OMEx’s top 20 molecules reproduced many of the biomarkers from Sydor et al.
[8], demonstrating that the top molecules are reported consistently (table 3). The most frequently reported molecule panels resulted in good classification accuracies for the control and disease groups (0.79 and 0.85 accuracy for Control vs. NASH and Control vs. HCC, respectively) but a low accuracy for the two disease groups (0.59 accuracy for NASH vs. HCC), which is in line with the original study (table 3). For each pairing, OMEx generally achieved a lower classification accuracy than Sydor et al. [8]. However, Sydor et al. evaluated accuracy based on cross validation and not on an independent test set. For this reason, their positive results could have originated from overfitting or from a biased estimate by the cross validation. Evaluating OMEx’s accuracy based on cross validation alone (0% test set) showed accuracies similar to the initial algorithm version (table 3).

Table 3: Reproduction of results from the original algorithm.

Measure | Control vs. NASH | Control vs. HCC | NASH vs. HCC
Reproduced biomarkers (OMEx/original) | 5/7 | 3/5 | 7/10
Classification accuracy, original algorithm (no test set) | 0.99 | 1.00 | 0.86
Classification accuracy, OMEx (30% test set) | 0.79 | 0.85 | 0.59
Classification accuracy, OMEx (0% test set) | 0.99 | 0.99 | 0.74

Furthermore, four publicly available omics datasets of different sizes were collected (table 2). These datasets had been used in several other studies applying different feature selection methods, which allows OMEx’s outputs to be benchmarked. Tasaki et al. [10] performed a multi-omics study of rheumatoid arthritis (RA), applying transcriptomics and proteomics analyses to blood samples from patients (blood transcriptomics and blood proteomics datasets). They created ensembles of partial least squares regression (PLSR) models for each dataset and reported the molecules most influential on the model output. The blood transcriptomics dataset was used by Ng et al. [4] in a tutorial on biomarker discovery.
They applied XGBoost, an algorithm based on decision trees, and extracted the most important molecules for their model using Shapley Additive Explanations (SHAP). The urine metabolomics dataset contains metabolomics measurements from patients with lung cancer [11]. The authors applied random forests to extract potential biomarkers, while other studies by Chardin et al. [12] and Labory et al. [3] performed biomarker selection using supervised autoencoders (SAE) and a combination of Boruta feature selection and partial least squares discriminant analysis (PLS-LDA), respectively. Chardin et al. [12] also provided a metabolomics dataset on glial tumors (glial tumor metabolomics dataset) containing wild-type and isocitrate dehydrogenase (IDH) mutant tumors, used for the classification of glial tumors. SAE and the combination of Boruta and PLS-LDA were also applied for biomarker selection from this dataset by Chardin et al. [12] and Labory et al. [3], respectively. All datasets, OMEx configurations, and results are available in the supplementary files. All datasets were benchmarked by applying the automated mode of OMEx, using 70% of the samples for molecule selection and 30% to evaluate the classification performance based on accuracy. OMEx provides a ranking of molecules based on the frequency of selection in the wrapper stage. For the blood transcriptomics and blood proteomics datasets, the top 20 molecules were compared to the molecules reported by other studies, whereas only the top 4 and top 5 molecules were considered for the urine metabolomics and glial tumor metabolomics datasets, respectively, because Mathé et al. [11] and Chardin et al. [12] did not provide more molecules (figure 3). OMEx’s top reported molecules overlapped with those of all other methods except for the SAE on the glial tumor metabolomics dataset. Potentially, more reported molecules would have resulted in a higher overlap.
These results indicate that OMEx extracts the most important molecules while also reporting alternative molecules that other methods may not consider. Aside from molecule rankings, OMEx provides molecule panels that contain few molecules and are therefore suitable for experimental validation. The best panels for each dataset contained between two and five molecules (table 4). Selected transcripts and proteins were annotated by querying UniProt [13], and metabolites were annotated using the Workflow4Metabolomics platform [14] (supplementary information). A classification based on these panels was compared to the other methods using the accuracy, as this was the metric reported by all other methods. For all datasets except the urine metabolomics dataset, the classification accuracy across all methods was between 80.76% and 98.8%, indicating that most samples could be separated. On the urine metabolomics dataset, the accuracy of the methods dropped to between 68.67% and 81.2%. Overall, OMEx performs comparably to the other methods but does not outperform them. It must be noted that the main objective of OMEx is not classification but molecule selection. Additionally, OMEx utilizes a very simple classification algorithm that is less prone to overfitting than more advanced methods such as random forests, XGBoost, and SAE.

Table 4: Most frequently selected molecule panels by OMEx for the four benchmarking datasets.

Dataset | Best panel (identifier from dataset) | Molecule name
Blood transcriptomics | CLEC4D | C-type lectin domain family 4 member D (UniProt: Q8WXI8)
Blood transcriptomics | RNPS1 | RNA-binding protein with serine-rich domain 1 (UniProt: Q15287)
Blood proteomics | CRP | C-reactive protein (UniProt: P02741)
Blood proteomics | BPI | Bactericidal permeability-increasing protein (UniProt: P17213)
Blood proteomics | IGFBP6 | Insulin-like growth factor-binding protein 6 (UniProt: P24592)
Urine metabolomics | MZ 252.97 | 6-Bromo-3-(hydroxyacetyl)-1H-indole (Pherobase: 6-Bromo-3-(hydroxyacetyl)-1H-indole)
Urine metabolomics | MZ 264.12 | (Z)-2-Methyl-2-butene-1,4-diol 4-O-beta-D-Glucopyranoside
Urine metabolomics | MZ 126.90 | Iodine
Urine metabolomics | MZ 239.99 | 6-chloro-2-quinoxalinecarboxylic acid 1,4-dioxide
Urine metabolomics | MZ 441.16 | Gly-Asn-Asp-His (GNDH)
Glial tumor metabolomics | MZ 137.07 | 1-Methylnicotinamide
Glial tumor metabolomics | MZ 173.03 | N-Formyl-L-glutamate

Overall, OMEx provides an excellent solution for the selection of molecule panels as biomarker candidates from large matrices of omics data. The method, which embeds diagonal Linear Discriminant Analysis into a combination of filter and wrapper techniques, represents a useful strategy to extract biomarker candidates reliably, as shown on synthetic data. While the outcomes of OMEx can differ slightly in each run, the biomarkers reported by a previous version of the algorithm were in accordance with those of OMEx, demonstrating the reproducibility of the tool’s results. Additionally, OMEx selects molecules similar to those reported by other computational approaches while also selecting unique molecules that others did not, demonstrating its value as an alternative approach. Apart from its novel algorithm, the most striking advantage of OMEx is its user-friendly, web-based interface, which minimizes the effort needed to analyze omics data and obtain potential biomarkers for experimental validation. In the future, OMEx will be extended to extract molecules for more than two groups and to provide more advanced classification models.
Declarations

ASSOCIATED CONTENT

Supporting Information
supplementary.docx: Supplementary information to main manuscript
datasets_formatted.zip: All datasets used in the manuscript, formatted to comply with the OMEx input format
omex_sydor_comparison.zip: Results generated for the comparison of OMEx and its original implementation
benchmarking_results.zip: All results generated by OMEx for the comparison with other molecule selection methods
selected_molecule_identification.zip: Output from Workflow4Metabolomics and UniProt queries for the annotation of molecule panels found in the comparison of molecule selection methods

AUTHOR INFORMATION

Corresponding Author
Emanuel Lange - ISAS e.V., Bunsen-Kirchhoff-Str. 11, 44139 Dortmund, Germany; https://orcid.org/0009-0008-6620-6681

Present Addresses
Kay Schallert - ISAS e.V., Bunsen-Kirchhoff-Str. 11, 44139 Dortmund, Germany
Johannes Schwerdt - Hochschule Merseburg University of Applied Sciences, Eberhard-Leibnitz-Straße 2, 06217 Merseburg, Germany
Susmita Ghosh - ISAS e.V., Otto-Hahn-Str. 6b, 44227 Dortmund, Germany
Andreas Hentschel - ISAS e.V., Otto-Hahn-Str. 6b, 44227 Dortmund, Germany
Yvonne Reinders - ISAS e.V., Otto-Hahn-Str. 6b, 44227 Dortmund, Germany
Robert Heyer - ISAS e.V., Bunsen-Kirchhoff-Str. 11, 44139 Dortmund, Germany

Author Contributions
E.L. and K.S. developed the website and algorithm. E.L. acquired datasets and tested the algorithm and website. J.S. developed the original algorithm and the script for synthetic data testing and contributed to website and algorithm development. S.G., Y.R., and A.H. discussed the project, provided data for testing, and tested the website. R.H. supervised the project. E.L., K.S., J.S., and R.H. designed the project. E.L., K.S., and J.S. wrote the manuscript. All authors revised the manuscript and have given approval for the final manuscript. ‡ E.L. and K.S. contributed equally.

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENT
All authors acknowledge the support by the “Ministerium für Kultur und Wissenschaft des Landes Nordrhein-Westfalen” and “Der Regierende Bürgermeister von Berlin, Senatskanzlei Wissenschaft und Forschung”. We thank all coworkers and friends for discussions and feedback, which helped to bring this project to its final form.

References
[1] Hasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biology 18
[2] Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507–2517
[3] Labory J, Njomgue-Fotso E, Bottini S (2024) Benchmarking feature selection and feature extraction methods to improve the performances of machine-learning algorithms for patient classification using metabolomics biomedical data. Computational and Structural Biotechnology Journal 23:1274–1287
[4] Ng S, Masarone S, Watson D, Barnes MR (2023) The benefits and pitfalls of machine learning for biomarker discovery. Cell and Tissue Research 394:17–31
[5] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47
[6] Pang Z, Lu Y, Zhou G, Hui F, Xu L, Viau C, Spigelman A, MacDonald P, Wishart D, Li S, Xia J (2024) MetaboAnalyst 6.0: towards a unified platform for metabolomics data processing, analysis and interpretation. Nucleic Acids Res 52:W398–W406
[7] Mazzara S, Rossi RL, Grifantini R, Donizetti S, Abrignani S, Bombaci M (2017) CombiROC: an interactive web tool for selecting accurate marker combinations of omics data. Scientific Reports 7
[8] Sydor S, Dandyk C, Schwerdt J, Manka P, Benndorf D, Lehmann T, Schallert K, Wolf M, Reichl U, Canbay A, Bechmann LP, Heyer R (2022) Discovering Biomarkers for Non-Alcoholic Steatohepatitis Patients with and without Hepatocellular Carcinoma Using Fecal Metaproteomics. Int J Mol Sci 23:8841
[9] Dudoit S, Fridlyand J, Speed TP (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 97:77–87
[10] Tasaki S, Suzuki K, Kassai Y, Takeshita M, Murota A, Kondo Y, Ando T, Nakayama Y, Okuzono Y, Takiguchi M, Kurisu R, Miyazaki T, Yoshimoto K, Yasuoka H, Yamaoka K, Morita R, Yoshimura A, Toyoshiba H, Takeuchi T (2018) Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nat Commun 9
[11] Mathé EA, Patterson AD, Haznadar M, Manna SK, Krausz KW, Bowman ED, Shields PG, Idle JR, Smith PB, Anami K, Kazandjian DG, Hatzakis E, Gonzalez FJ, Harris CC (2014) Noninvasive Urinary Metabolomic Profiling Identifies Diagnostic and Prognostic Markers in Lung Cancer. Cancer Res 74:3259–3270
[12] Chardin D, Gille C, Pourcher T, Humbert O, Barlaud M (2022) Learning a confidence score and the latent space of a new supervised autoencoder for diagnosis and prognosis in clinical metabolomic studies. BMC Bioinformatics 23
[13] Bateman A, Martin M-J, Orchard S, Magrane M, Adesina A, Ahmad S, Bowler-Barnett EH, Bye-A-Jee H, Carpentier D, Denny P, Fan J, Garmiri P, Gonzales LJdC, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Joshi V, Jyothi D, Kandasaamy S, Lock A, Luciani A, Luo J, Lussi Y, Marin JSM, Raposo P, Rice DL, Santos R, Speretta E, Stephenson J, Totoo P, Tyagi N, Urakova N, Vasudev P, Warner K, Wijerathne S, Yu CW-H, Zaru R, Bridge AJ, Aimo L, Argoud-Puy G, Auchincloss AH, Axelsen KB, Bansal P, Baratin D, Batista Neto TM, Blatter M-C, Bolleman JT, Boutet E, Breuza L, Gil BC, Casals-Casas C, Echioukh KC, Coudert E, Cuche B, de Castro E, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gaudet P, Gehant S, Gerritsen V, Gos A, Gruaz N, Hulo C, Hyka-Nouspikel N, Jungo F, Kerhornou A, Mercier PL, Lieberherr D, Masson P, Morgat A, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Poux S, Pozzato M, Pruess M, Redaschi N, Rivoire C, Sigrist CJA, Sonesson K, Sundaram S, Sveshnikova A, Wu CH, Arighi CN, Chen C, Chen Y, Huang H, Laiho K, Lehvaslaiho M, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Y, Zhang J (2024) UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research 53:D609–D617
[14] Giacomoni F, Le Corguillé G, Monsoor M, Landi M, Pericard P, Pétéra M, Duperier C, Tremblay-Franco M, Martin J-F, Jacob D, Goulitquer S, Thévenot EA, Caron C (2014) Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics. Bioinformatics 31:1493–1495

Additional Declarations
The authors declare no competing interests.
17:10:30","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-5914047/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5914047/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":75088921,"identity":"e6b2c0fd-321b-42bb-8756-26b798281fb6","added_by":"auto","created_at":"2025-01-30 10:32:39","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":190649,"visible":true,"origin":"","legend":"\u003cp\u003eOverview on OMEx’s algorithm. The algorithm consists of an optional normalization step, molecule selection (filter and wrapper), and sample group prediction (classification).\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/39dd6bca5ee403f551d1c306.png"},{"id":75088917,"identity":"d981f53c-251d-45ae-9d52-3c9e20061aed","added_by":"auto","created_at":"2025-01-30 10:32:39","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":273503,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of OMEx graphical user interface features. A - OMEx is a web application which guides users through the workflow of the molecule selection algorithm. 
The user interface provides interactive plots on class balances (B), most frequently selected molecule panels (C), PCA and pairwise scatter plots based on extracted molecules to assess class separation (D and E), as well as volcano plots visualizing differences of extracted molecules between classes (F).\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/083831ae7bfec34c7b71842d.png"},{"id":75088927,"identity":"1e58e877-2686-43f5-b7e2-efd9cc4ed24f","added_by":"auto","created_at":"2025-01-30 10:32:39","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":223933,"visible":true,"origin":"","legend":"\u003cp\u003eOverview on OMEx’s algorithm. The algorithm consists of an optional normalization step, molecule selection (filter and wrapper), and sample group prediction (classification).\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/241d6f338c103878af8313ea.png"},{"id":75090459,"identity":"6e0a3a53-0730-4093-b03b-58a78abe2762","added_by":"auto","created_at":"2025-01-30 10:48:39","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1010399,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/e803dbdc-56e5-48c7-8464-ceda8727444a.pdf"},{"id":75088922,"identity":"9932da96-0139-464d-b5d5-b228ace42fc8","added_by":"auto","created_at":"2025-01-30 10:32:39","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":258799,"visible":true,"origin":"","legend":"\u003cp\u003eSupplementary information to main 
manuscript\u003c/p\u003e","description":"","filename":"acssupplementaryv4.docx","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/7fe7eeb0dba4f8bc9e6f999b.docx"},{"id":75088932,"identity":"6f157cb7-fb0d-4e13-ac12-846950ea04ce","added_by":"auto","created_at":"2025-01-30 10:32:40","extension":"zip","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":27215052,"visible":true,"origin":"","legend":"\u003cp\u003eAll datasets used in the manuscript formatted to comply OMEx input format\u003c/p\u003e","description":"","filename":"datasetsformatted.zip","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/13489357a1170fe47b812a61.zip"},{"id":75088934,"identity":"cb25bc21-b4ee-4bdd-8d56-51f875ba7a08","added_by":"auto","created_at":"2025-01-30 10:32:41","extension":"zip","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":84277196,"visible":true,"origin":"","legend":"\u003cp\u003eResults generated for the comparison of OMEx and its original implementation\u003c/p\u003e","description":"","filename":"omexsydorcomparison.zip","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/19aac6a6a18bd2941c10b60d.zip"},{"id":75088933,"identity":"0ca0ff19-c637-42f5-8774-37c3de1ecf62","added_by":"auto","created_at":"2025-01-30 10:32:40","extension":"zip","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":74978210,"visible":true,"origin":"","legend":"\u003cp\u003eAll results generated by OMEx for the comparison with other molecule selection methods\u003c/p\u003e","description":"","filename":"benchmarkingresults.zip","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/46d83ddb6270a696b30eb498.zip"},{"id":75088935,"identity":"9d333be3-db89-48b7-bcb9-037b4029bf6d","added_by":"auto","created_at":"2025-01-30 
10:32:43","extension":"zip","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":186507939,"visible":true,"origin":"","legend":"\u003cp\u003eOutput from Workbench4Metabolomics and UniProt queries for the annotation of molecule panels found in the comparison of molecule selection methods\u003c/p\u003e","description":"","filename":"selectedmoleculeidentifications.zip","url":"https://assets-eu.researchsquare.com/files/rs-5914047/v1/54f99b836ee9d017134d3e96.zip"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003eThe Omics Molecule Extractor: A web application for the selection of potential biomarker panels\u003c/p\u003e","fulltext":[{"header":"Full Text","content":"\u003cp\u003eOmics technologies aid in investigating qualitative and quantitative presence of the biomolecules (i.e., DNA, RNA, proteins, lipids, metabolites) in biological systems, thereby enhancing the mechanistic understanding of those molecules in health and pathophysiological states [1]. Omics methods employ cutting edge technologies like- genome sequencing and liquid-chromatography coupled to mass spectrometry generating raw data (e.g., reads and mass spectra), which are further processed to identify and quantify the biomolecules [1]. This initial step typically generates tables containing thousands of molecules in rows, samples in columns and measured quantities in each cell.\u003c/p\u003e\n\u003cp\u003eA common objective of applying omics technologies is to identify a subset of molecules showing quantitative differences between states of an investigated biological system, for example, between healthy and pathophysiological states. 
These molecules can provide insights into disease-specific processes and could serve as biomarkers for diagnosis and prognosis, or as therapeutic targets.\u003c/p\u003e\n\u003cp\u003eIn essence, biomarker applications, such as prognosis or diagnosis, are classification tasks that predict a sample\u0026apos;s class based on molecular quantities. Combining multiple molecules into panels is preferred for classification, as it typically achieves higher accuracy than relying on individual molecules. The identification of molecules and molecule panels used for classification, a process known as \u0026ldquo;feature selection\u0026rdquo;, is facilitated by various statistical and machine learning methods [2]\u0026nbsp;[3].\u003c/p\u003e\n\u003cp\u003eThe software implementing these methods is often only available as libraries for programming languages [4]\u0026nbsp;[5]. Some of these programming libraries offer high flexibility but low user-friendliness. In contrast, software implemented as a web application can be more user-friendly, because it provides a graphical user interface accessed through a web browser, but it may offer fewer features. Existing web applications can rank individual molecules (MetaboAnalyst [6]) or rank molecule panels built from a limited number of input molecules (CombiRoc [7]). However, to our knowledge, no web application currently offers the capability to generate molecule panels from thousands of input molecules, which is the typical size of omics datasets.\u003c/p\u003e\n\u003cp\u003eHere, we present the Omics Molecule Extractor (OMEx; version 0.1.0), a user-friendly web application for the selection of molecule panels tailored to researchers who generate omics data. A key, and to our knowledge unique, feature of OMEx is its ability to automatically generate rankings of molecule panels from thousands of input molecules. 
Additionally, it offers a tidy user interface and interactive visualizations, making it a compelling alternative to existing tools for molecule selection.\u003c/p\u003e\n\u003cp\u003eThe initial version of OMEx\u0026rsquo;s algorithm was developed to determine biomarkers from a metaproteomics dataset [8]. Based on this initial version, we extended and generalized the algorithm to make it applicable to other types of omics data.\u003c/p\u003e\n\u003cp\u003eThe input for OMEx is a data table obtained from omics-specific processing of raw data. This table is provided as a tab-separated .csv, .tsv, or .txt file containing molecule names in rows, sample names in columns, and measured quantities in cells. Sample column names are prefixed with condition names (e.g., \u0026ldquo;control_\u0026rdquo; and \u0026ldquo;disease_\u0026rdquo;) for grouping.\u003c/p\u003e\n\u003cp\u003eOMEx\u0026rsquo;s algorithm combines statistical and machine-learning methods (figure 1). The main objective is to determine a small subset of molecules from the input table, i.e., a molecule panel, which can discriminate between samples from two different conditions (e.g., healthy vs. disease). The algorithm involves four steps: (1) data preprocessing, (2) (pre-)filtering of molecules based on p-values, (3) (post-)filtering by a wrapper, and (4) a final classification.\u0026nbsp;During data preprocessing (step 1), samples can be normalized by the sum of all molecule quantities within each sample, and molecules with sparse measurements (i.e., few measured values) can be filtered out.\u0026nbsp;Both operations can be disabled if users have already preprocessed their data.\u003c/p\u003e\n\u003cp\u003eThe remaining steps utilize diagonal Linear Discriminant Analysis (d-LDA) [9], a simple classification method, and cross validation. 
d-LDA was chosen as the classification method because it is computationally lightweight and showed the best classification accuracy for a metaproteomics dataset in a comparison with other classification methods [8].\u003c/p\u003e\n\u003cp\u003eThe p-value filter (step 2) performs a statistical test (two-sample t-test) on each molecule independently and ranks the molecules by their p-values [2]. Molecules with a p-value below a cutoff are passed to the next step. The algorithm chooses this cutoff to provide an optimal tradeoff between classifier accuracy and a low number of molecules. In step 3, the wrapper method [2] selects a small panel of molecules using a subset of the samples and evaluates the classification accuracy of the selected panel (cross validation). This process is repeated many times (\u0026gt;1,000), with the samples provided for molecule selection randomized in every repetition, which varies the composition of the selected panels. The wrapper outputs molecule panels and individual molecules ranked by their frequency of selection. An advantage of the wrapper over p-value filtering is that it considers combinations of molecules in the classification. However, wrapping is computationally expensive; the preceding p-value filter therefore reduces the total computation time.\u003c/p\u003e\n\u003cp\u003eThe most frequently chosen panels are assumed to be robust discriminators between the sample groups and are passed to the final classification (step 4). Step 4 evaluates the predictive power of the selected molecule panel using classification metrics (accuracy, precision, recall, F1 score). Additionally, a principal component analysis (PCA) and hierarchical clustering are performed to visualize the separation of groups and the similarity of samples based on the selected molecule panel.\u003c/p\u003e\n\u003cp\u003eOMEx is an open-source web application. 
Its frontend is implemented in Angular 18 and available at https://gitlab.com/kay.schallert/mpa-cloud-server. The backend, written in Java 17, provides a REST API and implements the algorithm; it uses R 4.3.2 to generate plots and Docker for deployment. The backend code is available at https://gitlab.com/kay.schallert/mpa-website.\u003c/p\u003e\n\u003cp\u003eOMEx is available at https://mdoa-tools.bi.denbi.de/omex/home and provides a user-friendly interface based on the step-by-step workflow of the algorithm (figure 2). Initially, users provide their omics data table in a tab-separated format, with molecules as row names and samples as columns (example datasets are available on the OMEx website and in the supplementary files). The initial input form allows for filtering sparsely measured molecules and sample-wise normalization. By default, OMEx splits all samples randomly into a training set (70%) used for molecule selection and a test set (30%) for evaluating the predictive power of the extracted molecules (users can adapt the split ratio). The web application accepts datasets with at least 15 samples per group, but more samples are recommended.\u003c/p\u003e\n\u003cp\u003eEach workflow step contains a detailed description and generates interactive plots for analysis of its results, such as an initial overview of class balances (figure 2, B) or a ranking of selected molecule panels (figure 2, C). Parameters of the algorithm, such as the number of cross validation folds, can be configured in each step, but an automatic mode that runs all steps at once with default parameters is available as well. In the final step, the test set is classified based on a selected molecule panel, and the result can be evaluated using a PCA plot (figure 2, D), a pairwise scatter plot (figure 2, E), a volcano plot (figure 2, F), and important classification metrics (accuracy, precision, recall, and F1 score). 
All results can be downloaded as a .zip archive, which contains all figures and a configuration file storing all settings for reproducibility.\u003c/p\u003e\n\u003cp\u003eAs a proof of concept, a first test of the algorithm was performed using synthetic datasets generated by a sampling strategy. The filter and the wrapper were tested individually and in combination (table 1), each on 100 synthetic datasets. All synthetic datasets contained 60 samples and 50 \u0026ldquo;synthetic molecules\u0026rdquo; (a detailed description of the sampling strategy can be found in the supplementary information; the code is implemented in Java and available at https://gitlab.com/kay.schallert/mpa-cloud-server/-/tree/master/src/main/java/service/omex/algorithmtest).\u003c/p\u003e\n\u003cp\u003eFor each synthetic dataset, the \u0026ldquo;relevant\u0026rdquo; molecules were known, providing a ground truth to evaluate the selection of molecules in the p-value filter step, the wrapper step, and their combination. The selection of relevant molecules was evaluated (table 1) by precision (the fraction of selected molecules that are relevant) and recall (the fraction of relevant molecules that are \u0026ldquo;recovered\u0026rdquo; by the selection). These metrics range from 0 to 1, with higher values indicating better performance (see supplementary information for a more thorough explanation).\u003c/p\u003e\n\u003cp\u003eThe filter showed a high recall of 1.0, indicating that it reports all relevant molecules, but it also selected irrelevant ones, as indicated by its intermediate precision of 0.75. The wrapper can sort out irrelevant molecules and reliably reports relevant ones, as indicated by its high precision of 0.88. However, it showed a low recall of 0.63, indicating that not all relevant molecules are found. The combination of both approaches had a recall of 0.76 and a precision of 0.91. 
Therefore, combining both stages yields the fewest false positive selections and recovers more relevant molecules than the wrapper alone, although its recall remains below that of the filter alone.\u003c/p\u003e\n\u003cp\u003eTable 1: Performance of the filter, the wrapper, and the combined stages in selecting relevant molecules.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003eStage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003eFilter only\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003e0.75\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003e1.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003eWrapper only\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003e0.88\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003e0.63\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003eFilter + Wrapper (OMEx)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 99px;\"\u003e\n \u003cp\u003e0.76\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eTesting with synthetic data provided confidence in the algorithm on a theoretical 
level. To evaluate OMEx under realistic conditions, we collected and analyzed five datasets containing real omics data (table 2).\u003c/p\u003e\n\u003cp\u003eTable 2: Overview on the datasets used to test OMEx.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 68px;\"\u003e\n \u003cp\u003edataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 55px;\"\u003e\n \u003cp\u003eMolecules (rows)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eGroups (#Samples)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003ePublication\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 68px;\"\u003e\n \u003cp\u003eStool metaproteomics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 55px;\"\u003e\n \u003cp\u003e42,572\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eControl (19),\u003cbr\u003e\u0026nbsp;non-alcoholic steatohepatitis (32),\u003cbr\u003e\u0026nbsp;hepatocellular carcinoma (29)\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e[8]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 68px;\"\u003e\n \u003cp\u003eBlood transcriptomics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 55px;\"\u003e\n \u003cp\u003e10,527\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eControl (35),\u003cbr\u003e\u0026nbsp;rheumatoid arthritis (45)\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e[10] (dataset 
from [4] was used)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 68px;\"\u003e\n \u003cp\u003eBlood proteomics\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 55px;\"\u003e\n \u003cp\u003e1070\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eControl (35),\u0026nbsp;\u003cbr\u003e\u0026nbsp;rheumatoid arthritis (44)\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e[10]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 68px;\"\u003e\n \u003cp\u003eUrine metabolomics\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 55px;\"\u003e\n \u003cp\u003e2944\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eControl (469),\u003cbr\u003e\u0026nbsp;lung cancer (536)\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e[11]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 68px;\"\u003e\n \u003cp\u003eGlial tumor metabolomics\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 55px;\"\u003e\n \u003cp\u003e7017\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eIDH wild-type tumors (50),\u003cbr\u003e\u0026nbsp;IDH mutant tumors (38)\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e[12]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThe initial version of OMEx\u0026rsquo;s algorithm has been developed for the study by Sydor \u003cem\u003eet al.\u003c/em\u003e [8], who determined potential biomarkers for non-alcoholic steatohepatitis (NASH) and hepatocellular carcinoma (HCC) from metaproteomics of stool samples. The current version (0.1.0) of OMEx was applied to this dataset (stool metaproteomics dataset) to test whether molecules determined for group pairings (control vs. NASH, control vs. HCC, NASH vs. HCC) could be reproduced. Sydor \u003cem\u003eet al.\u003c/em\u003e applied their algorithm on each group pairing individually, as well as on all three groups [8]. Because OMEx currently only supports analysis of two groups, only results from group pairings were reproduced. Due to the random assignment of samples during step 2 and 3, the most frequently selected molecules can differ slightly in every run. Therefore, the top 20 molecules from OMEx were compared to the reported biomarker candidates by Sydor \u003cem\u003eet al.\u003c/em\u003e [8]. The most frequently reported molecule panel (a subset of the top 20 molecules) was used for a final classification run. Classification accuracy was determined by classifying samples from a test set containing 30% of all samples. The classification accuracy describes the ratio of correctly classified samples compared to all classified samples. Accuracy ranges from 0 to 1 with a value of 1 meaning that all samples were classified correctly.\u003c/p\u003e\n\u003cp\u003eOMEx\u0026rsquo;s top 20 molecules reproduced many of the biomarkers from Sydor \u003cem\u003eet al.\u003c/em\u003e [8], demonstrating that top molecules are reported consistently (table 3). The most frequently reported molecule panels resulted in good classification accuracies of the control and disease groups (0.79 and 0.85 accuracy for Control vs. NASH and Control vs. 
HCC, respectively) but a low accuracy for the two disease groups (0.59 accuracy for NASH vs. HCC), which is in line with the original study (table 3). For every pairing, OMEx\u0026rsquo;s accuracy was lower than that reported by Sydor \u003cem\u003eet al.\u003c/em\u003e [8]. However, Sydor \u003cem\u003eet al.\u003c/em\u003e evaluated accuracy by cross validation rather than on an independent test set. Their high accuracies could therefore reflect overfitting or an optimistically biased cross validation estimate. When OMEx\u0026rsquo;s accuracy was evaluated by cross validation alone (0% test set), it was similar to that of the initial algorithm version (table 3).\u003c/p\u003e\n\u003cp\u003eTable 3: Reproduction of results from the original algorithm.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003eControl vs. NASH\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003eControl vs. HCC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003eNASH vs. 
HCC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003eReproduced biomarkers\u003cbr\u003e\u0026nbsp;(OMEx/original)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e5/7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003e3/5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e7/10\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"3\" valign=\"top\" style=\"width: 220px;\"\u003e\n \u003cp\u003eClassification accuracies\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003eOriginal Algorithm (no test set)\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e0.99\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e0.86\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003eOMEx (30 % test set)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e0.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003e0.85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e0.59\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003eOMEx (0 % test set)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e0.99\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003e0.99\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 73px;\"\u003e\n \u003cp\u003e0.74\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eFurthermore, four publicly available omics datasets of different sizes were collected (table 2). These datasets had been used in several other studies applying different feature selection methods allowing for a benchmarking of OMEx\u0026rsquo;s outputs.\u003c/p\u003e\n\u003cp\u003eTasaki \u003cem\u003eet al.\u0026nbsp;\u003c/em\u003e[10] performed a multi-omics study of rheumatoid arthritis (RA) applying transcriptomics and proteomics analyses to blood samples from patients (blood transcriptomics and blood proteomics dataset). They created ensembles of partial least square regression (PLSR) models for each dataset and reported the molecules most influential on model output. The blood transcriptomics dataset was used by Ng \u003cem\u003eet al.\u003c/em\u003e [4] in a tutorial on biomarker discovery. They applied XGBoost, an algorithm based on decision trees, and extracted the most important molecules for their model using Shapley Additive Explanations (SHAP).\u003c/p\u003e\n\u003cp\u003eThe urine metabolomics dataset contains metabolomics measurements from patients with lung cancer [11]. 
The authors applied random forests to extract potential biomarkers, while Chardin \u003cem\u003eet al.\u003c/em\u003e [12] performed biomarker selection using supervised autoencoders (SAE) and Labory \u003cem\u003eet al.\u003c/em\u003e [3] used a combination of Boruta feature selection and partial least squares discriminant analysis (PLS-LDA).\u003c/p\u003e\n\u003cp\u003eChardin \u003cem\u003eet al.\u0026nbsp;\u003c/em\u003e[12] also provided a metabolomics dataset on glial tumors (glial tumor metabolomics dataset) containing isocitrate dehydrogenase (IDH) wild-type and IDH mutant tumors, which was used for the classification of glial tumors. SAE and the combination of Boruta and PLS-LDA were also applied for biomarker selection from this dataset by Chardin \u003cem\u003eet al.\u003c/em\u003e [12] and Labory \u003cem\u003eet al.\u003c/em\u003e [3], respectively.\u003c/p\u003e\n\u003cp\u003eAll datasets, OMEx configurations, and results are available in the supplementary files. OMEx was benchmarked on all datasets in its automated mode, using 70% of the samples for molecule selection and 30% to evaluate the classification performance based on accuracy.\u003c/p\u003e\n\u003cp\u003eOMEx provides a ranking of molecules based on their frequency of selection in the wrapper stage. For the blood transcriptomics and blood proteomics datasets, the top 20 molecules were compared to the molecules reported by other studies, whereas only the top 4 and top 5 molecules were considered for the urine metabolomics and glial tumor metabolomics datasets, respectively, because Math\u0026eacute; \u003cem\u003eet al.\u003c/em\u003e [11]\u003cem\u003e\u0026nbsp;\u003c/em\u003eand Chardin \u003cem\u003eet al.\u003c/em\u003e [12] did not provide more molecules (figure 3).\u003c/p\u003e\n\u003cp\u003eOMEx\u0026rsquo;s top reported molecules overlapped with those of all other methods except for the SAE on the glial tumor metabolomics dataset. 
Potentially, a higher overlap would have been observed if more molecules had been reported. These results show that OMEx can extract the most important molecules while also covering alternative molecules that are potentially not considered by other methods.\u003c/p\u003e\n\u003cp\u003eAside from molecule rankings, OMEx provides molecule panels that contain few molecules and are therefore suitable for experimental validation. The best panels for each dataset contained between two and five molecules (table 4). Selected transcripts and proteins were annotated by querying UniProt [13], and metabolites were annotated using the Workflow4Metabolomics platform [14] (Supplementary information). Classification based on these panels was compared to the other methods using accuracy, as this was the metric reported by all other methods.\u003c/p\u003e\n\u003cp\u003eExcept for the urine metabolomics dataset, the classification accuracy across all methods was between 80.76% and 98.8%, indicating that most samples could be separated. On the urine metabolomics dataset, the accuracy of the methods dropped to between 68.67% and 81.2%. Overall, OMEx performed comparably to the other methods but did not outperform them. It must be noted that the main objective of OMEx is not classification but molecule selection. 
Additionally, OMEx uses a very simple classification algorithm, which is less prone to overfitting than more advanced methods such as random forests, XGBoost, and SAE.\u003c/p\u003e\n\u003cp\u003eTable 4: Most frequently selected molecule panels by OMEx for the four benchmarking datasets.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"671\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 141px;\"\u003e\n \u003cp\u003eDataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eBest panel with identifiers from dataset\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eMolecule names\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" valign=\"top\" style=\"width: 141px;\"\u003e\n \u003cp\u003eBlood transcriptomics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eCLEC4D\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eC-type lectin domain family 4 member D (UniProt: Q8WXI8)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eRNPS1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eRNA-binding protein with serine-rich domain 1 (UniProt: Q15287)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"3\" valign=\"top\" style=\"width: 141px;\"\u003e\n \u003cp\u003eBlood proteomics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eCRP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eC-reactive protein (UniProt: P02741)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eBPI\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eBactericidal permeability-increasing protein (UniProt: P17213)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eIGFBP6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eInsulin-like growth factor-binding protein 6 (UniProt: P24592)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"5\" valign=\"top\" style=\"width: 141px;\"\u003e\n \u003cp\u003eUrine metabolomics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 252.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003e6-Bromo-3-(hydroxyacetyl)-1H-indole (Pherobase: 6-Bromo-3-(hydroxyacetyl)-1H-indole)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 264.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003e(Z)-2-Methyl-2-butene-1,4-diol 4-O-beta-D-Glucopyranoside\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 126.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eIodine\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 239.99\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003e6-chloro-2-quinoxalinecarboxylic acid 1,4-dioxide\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" 
style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 441.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eGly-Asn-Asp-His (GNDH)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" valign=\"top\" style=\"width: 141px;\"\u003e\n \u003cp\u003eGlial tumor metabolomics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 137.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003e1-Methylnicotinamide\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eMZ 173.03\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 425px;\"\u003e\n \u003cp\u003eN-Formyl-L-glutamate\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eOverall, OMEx provides an excellent solution for the selection of molecule panels as biomarker candidates from large matrices of omics data. The method, which embeds diagonal Linear Discriminant Analysis into a combination of filter and wrapper techniques, represents a useful strategy to extract biomarker candidates reliably, as shown on synthetic data. While the outcomes of OMEx can differ slightly between runs, biomarkers reported by a previous version of the algorithm were in accordance with those selected by OMEx, demonstrating the reproducibility of its results. Additionally, OMEx selects molecules similar to those reported by other computational approaches while also selecting unique molecules that the others did not, demonstrating its value as an alternative approach. Apart from its novel algorithm, the most striking advantage of OMEx is its user-friendly, web-based interface, which minimizes the effort needed to analyze omics data and obtain potential biomarkers for experimental validation. 
In the future, OMEx will be extended to extract molecules for more than two groups and to provide more advanced classification models.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eASSOCIATED CONTENT\u003c/p\u003e\n\u003cp\u003eSupporting Information\u003c/p\u003e\n\u003cp\u003esupplementary.docx: Supplementary information to the main manuscript\u003c/p\u003e\n\u003cp\u003edatasets_formatted.zip: All datasets used in the manuscript, formatted to comply with the OMEx input format\u003c/p\u003e\n\u003cp\u003eomex_sydor_comparison.zip: Results generated for the comparison of OMEx and its original implementation\u003c/p\u003e\n\u003cp\u003ebenchmarking_results.zip: All results generated by OMEx for the comparison with other molecule selection methods\u003c/p\u003e\n\u003cp\u003eselected_molecule_identification.zip: Output from Workflow4Metabolomics and UniProt queries for the annotation of molecule panels found in the comparison of molecule selection methods\u003c/p\u003e\n\u003cp\u003eAUTHOR INFORMATION\u003c/p\u003e\n\u003cp\u003eCorresponding Author\u003c/p\u003e\n\u003cp\u003eEmanuel Lange - ISAS e.V., Bunsen-Kirchhoff-Str. 11, 44139 Dortmund, Germany; https://orcid.org/0009-0008-6620-6681; e-mail: [email protected]\u003c/p\u003e\n\u003cp\u003ePresent Addresses\u003c/p\u003e\n\u003cp\u003eKay Schallert - ISAS e.V., Bunsen-Kirchhoff-Str. 11, 44139 Dortmund, Germany\u003c/p\u003e\n\u003cp\u003eJohannes Schwerdt - Hochschule Merseburg University of Applied Sciences, Eberhard-Leibnitz-Stra\u0026szlig;e 2, 06217 Merseburg, Germany\u003c/p\u003e\n\u003cp\u003eSusmita Ghosh - ISAS e.V., Otto-Hahn-Str. 6b, 44227 Dortmund, Germany\u003c/p\u003e\n\u003cp\u003eAndreas Hentschel - ISAS e.V., Otto-Hahn-Str. 6b, 44227 Dortmund, Germany\u003c/p\u003e\n\u003cp\u003eYvonne Reinders - ISAS e.V., Otto-Hahn-Str. 
6b, 44227 Dortmund, Germany\u003c/p\u003e\n\u003cp\u003eRobert Heyer - ISAS e.V., Bunsen-Kirchhoff-Str.\u0026nbsp;11, 44139 Dortmund, Germany\u003c/p\u003e\n\u003cp\u003eAuthor Contributions\u003c/p\u003e\n\u003cp\u003eE.L. and K.S. developed the website and algorithm. E.L. acquired datasets and tested the algorithm and website. J.S. developed the original algorithm and the script for synthetic data testing and contributed to website and algorithm development. S.G., Y.R., and A.H. discussed the project, provided data for testing, and tested the website. R.H. supervised the project. E.L., K.S., J.S., and R.H. designed the project. E.L., K.S., and J.S. wrote the manuscript. All authors revised the manuscript and have given approval for the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u0026Dagger; E.L. and K.S. contributed equally.\u003c/p\u003e\n\u003cp\u003eNotes\u003cbr\u003e\u0026nbsp;The authors declare no competing financial interest.\u003c/p\u003e\n\u003cp\u003eACKNOWLEDGMENT\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAll authors acknowledge the support by the \u0026ldquo;Ministerium f\u0026uuml;r Kultur und Wissenschaft des Landes Nordrhein-Westfalen\u0026rdquo; and \u0026ldquo;Der Regierende B\u0026uuml;rgermeister von Berlin, Senatskanzlei Wissenschaft und Forschung\u0026rdquo;.\u003c/p\u003e\n\u003cp\u003eWe thank all coworkers and friends for discussions and feedback, which helped to bring this project to its final form.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eHasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biology, vol. 18\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaeys Y, Inza I, Larra\u0026ntilde;aga P (2007) 
A review of feature selection techniques in bioinformatics, \u003cem\u003eBioinformatics\u003c/em\u003e, vol. 23, pp. 2507\u0026ndash;2517, August\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLabory J, Njomgue-Fotso E, Bottini S (2024) Benchmarking feature selection and feature extraction methods to improve the performances of machine-learning algorithms for patient classification using metabolomics biomedical data, \u003cem\u003eComputational and Structural Biotechnology Journal\u003c/em\u003e, vol. 23, pp. 1274\u0026ndash;1287, December\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNg S, Masarone S, Watson D, Barnes MR (2023) The benefits and pitfalls of machine learning for biomarker discovery, \u003cem\u003eCell and Tissue Research\u003c/em\u003e, vol. 394, pp. 17\u0026ndash;31, July\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRitchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (January 2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePang Z, Lu Y, Zhou G, Hui F, Xu L, Viau C, Spigelman A, MacDonald P, Wishart D, Li S, Xia J (April 2024) MetaboAnalyst 6.0: towards a unified platform for metabolomics data processing, analysis and interpretation. Nucleic Acids Res 52:W398\u0026ndash;W406\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMazzara S, Rossi RL, Grifantini R, Donizetti S, Abrignani S, Bombaci M (2017) CombiROC: an interactive web tool for selecting accurate marker combinations of omics data, \u003cem\u003eScientific Reports\u003c/em\u003e, vol. 
7, March\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSydor S, Dandyk C, Schwerdt J, Manka P, Benndorf D, Lehmann T, Schallert K, Wolf M, Reichl U, Canbay A, Bechmann LP, Heyer R (August 2022) Discovering Biomarkers for Non-Alcoholic Steatohepatitis Patients with and without Hepatocellular Carcinoma Using Fecal Metaproteomics. Int J Mol Sci 23:8841\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDudoit S, Fridlyand J, Speed TP (2002) Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, \u003cem\u003eJournal of the American Statistical Association\u003c/em\u003e, vol. 97, pp. 77\u0026ndash;87, March\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTasaki S, Suzuki K, Kassai Y, Takeshita M, Murota A, Kondo Y, Ando T, Nakayama Y, Okuzono Y, Takiguchi M, Kurisu R, Miyazaki T, Yoshimoto K, Yasuoka H, Yamaoka K, Morita R, Yoshimura A, Toyoshiba H, Takeuchi T (2018) Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nat Commun, vol. 9\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMath\u0026eacute; EA, Patterson AD, Haznadar M, Manna SK, Krausz KW, Bowman ED, Shields PG, Idle JR, Smith PB, Anami K, Kazandjian DG, Hatzakis E, Gonzalez FJ, Harris CC (June 2014) Noninvasive Urinary Metabolomic Profiling Identifies Diagnostic and Prognostic Markers in Lung Cancer. Cancer Res 74:3259\u0026ndash;3270\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChardin D, Gille C, Pourcher T, Humbert O, Barlaud M (2022) Learning a confidence score and the latent space of a new supervised autoencoder for diagnosis and prognosis in clinical metabolomic studies. BMC Bioinformatics, vol. 
23\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBateman A, Martin M-J, Orchard S, Magrane M, Adesina A, Ahmad S, Bowler-Barnett EH, Bye-A-Jee H, Carpentier D, Denny P, Fan J, Garmiri P, Gonzales LJdC, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Joshi V, Jyothi D, Kandasaamy S, Lock A, Luciani A, Luo J, Lussi Y, Marin JSM, Raposo P, Rice DL, Santos R, Speretta E, Stephenson J, Totoo P, Tyagi N, Urakova N, Vasudev P, Warner K, Wijerathne S, Yu CW-H, Zaru R, Bridge AJ, Aimo L, Argoud-Puy G, Auchincloss AH, Axelsen KB, Bansal P, Baratin D, Batista Neto TM, Blatter M-C, Bolleman JT, Boutet E, Breuza L, Gil BC, Casals-Casas C, Echioukh KC, Coudert E, Cuche B, de Castro E, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gaudet P, Gehant S, Gerritsen V, Gos A, Gruaz N, Hulo C, Hyka-Nouspikel N, Jungo F, Kerhornou A, Mercier PL, Lieberherr D, Masson P, Morgat A, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Poux S, Pozzato M, Pruess M, Redaschi N, Rivoire C, Sigrist CJA, Sonesson K, Sundaram S, Sveshnikova A, Wu CH, Arighi CN, Chen C, Chen Y, Huang H, Laiho K, Lehvaslaiho M, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Y, Zhang J (2024) UniProt: the Universal Protein Knowledgebase in 2025, \u003cem\u003eNucleic Acids Research\u003c/em\u003e, vol. 53, pp. D609\u0026ndash;D617, November\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGiacomoni F, Le Corguill\u0026eacute; G, Monsoor M, Landi M, Pericard P, P\u0026eacute;t\u0026eacute;ra M, Duperier C, Tremblay-Franco M, Martin J-F, Jacob D, Goulitquer S, Th\u0026eacute;venot EA, Caron C (2014) Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics, \u003cem\u003eBioinformatics\u003c/em\u003e, vol. 31, pp. 
1493\u0026ndash;1495, December\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Leibniz Institute for Analytical Sciences - ISAS","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Molecule Selection, Omics, Biomarkers, Bioinformatics","lastPublishedDoi":"10.21203/rs.3.rs-5914047/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5914047/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eSelecting molecule panels that are applicable to classify the health state of patients is a common task in omics data analysis. Existing software for molecule selection lacks features to select molecule panels from large datasets, requires programming experience, or lacks user-friendly interfaces. We present the Omics Molecule Extractor (OMEx) an open-source web application providing a user-friendly workflow for selecting molecules and molecule panels for sample classification from large datasets. OMEx’s user interface provides interactive visualization for exploring input data and analysis results. 
The feature selection strategy underlying the algorithm is based on machine learning and has not been available in any software with user interface. Extensive testing using synthetic datasets with known ground truth showed that the algorithm discovers group-separating molecules with high precision. Additionally, OMEx was tested on five real-world omics datasets demonstrating high reproducibility and overlap with reported molecules from other feature selection methods, while also reporting alternative molecules of interest. OMEx is freely available at https://mdoa-tools.bi.denbi.de/omex/home.\u003c/p\u003e","manuscriptTitle":"The Omics Molecule Extractor: A web application for the selection of potential biomarker panels","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-01-30 10:32:34","doi":"10.21203/rs.3.rs-5914047/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"19ec46cd-3c40-4706-b05e-3507361d7be1","owner":[],"postedDate":"January 30th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":43560147,"name":"Bioinformatics"},{"id":43560148,"name":"Analytical Biochemistry"}],"tags":[],"updatedAt":"2025-01-30T10:32:34+00:00","versionOfRecord":[],"versionCreatedAt":"2025-01-30 
10:32:34","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-5914047","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5914047","identity":"rs-5914047","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
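
The benchmarking protocol described in the text (70% of samples for molecule selection, 30% held out to measure accuracy, with diagonal Linear Discriminant Analysis as the classifier) can be illustrated with a minimal sketch. This is not the OMEx implementation: the function names, the unstratified random split, the equal class priors, and the synthetic data with two shifted "molecules" are all assumptions made for illustration only.

```python
import numpy as np


def fit_dlda(X, y):
    """Estimate class means and pooled per-feature variances (diagonal covariance)."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Residuals of every sample around its own class mean.
    resid = X - means[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    return classes, means, var


def predict_dlda(model, X):
    """Assign each sample to the class with the smallest variance-scaled distance.

    Equal class priors are assumed (an illustrative simplification).
    """
    classes, means, var = model
    dist = (((X[:, None, :] - means[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def split_70_30(n_samples, rng):
    """Random, unstratified 70/30 split of sample indices (hypothetical helper)."""
    idx = rng.permutation(n_samples)
    cut = int(round(0.7 * n_samples))
    return idx[:cut], idx[cut:]


def demo(seed=0):
    """Run the sketch on synthetic data; returns (n_train, n_test, accuracy)."""
    rng = np.random.default_rng(seed)
    n_samples, n_molecules = 100, 5
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_molecules))
    X[y == 1, :2] += 3.0  # two hypothetical group-separating molecules
    train, test = split_70_30(n_samples, rng)
    model = fit_dlda(X[train], y[train])
    accuracy = float((predict_dlda(model, X[test]) == y[test]).mean())
    return len(train), len(test), accuracy
```

Calling `demo()` returns the sizes of the two partitions and the held-out accuracy. Restricting `X` to the columns of a candidate panel before fitting would give the accuracy of that panel under the same assumptions.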
