MS-Net: Multi-Similarity based network annotation for untargeted metabolomics

doi:10.21203/rs.3.rs-8174529/v1


2025 · doi:10.21203/rs.3.rs-8174529/v1

preprint OA: closed



Research Square · Research Article
Pereira Francisco, Duthen, Crossay, Alignan, Hennechart, Perez, and 1 more
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8174529/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract
Confident metabolite annotation remains a critical bottleneck in untargeted LC-MS metabolomics, as experimental spectral libraries cover only 5–20% of detected features. While in silico tools generate extensive candidate lists, top-ranked predictions often fail to reflect true identities, resulting in high false annotation rates.
We present MS-Net (Multi-Similarity Network-based annotation), an accessible workflow integrating mass spectral similarity networks, molecular structure similarity (Tanimoto metrics), and taxonomic knowledge to prioritize annotations within vast candidate spaces. MS-Net employs a composite Link Score combining full-molecule and scaffold Tanimoto similarities with MS/MS cosine similarity and in silico confidence metrics. High-confidence annotations seed iterative propagation throughout the network. Applied to a Cannabis sativa dataset (2,595 initial features reduced to 1,297 after filtering, with more than 118,000 candidate structures), MS-Net resolved the annotation space to 1,275 confident assignments. Notably, 53% of final annotations were rescued from lower in silico ranks (2–50), demonstrating the algorithm's ability to correct ranking errors. The workflow enables reproducible, offline annotation prioritization suitable for systems biology integration.

Subject areas: Structural Biology, Analytical Biochemistry, Computational Biology
Keywords: metabolomics, mass spectrometry, annotation, molecular networking, Tanimoto similarity, natural products, KNIME, Cannabis, cannabinoids

Introduction
The increasing application of metabolomics, both as a standalone discipline and integrated with other omics, underscores its maturity [1]. Untargeted LC-MS fingerprints from biological matrices yield hundreds to thousands of features (m/z × RT pairs). One of the main bottlenecks is annotating all detected signals with sufficient confidence to ensure robust biological contextualization (e.g., pathway enrichment analysis or multi-omics inference models). A dereplication strategy involves signal deconvolution and deciphering of ion linkages using tools such as Ion Identity Network [2] or MS-CleanR [3], followed by annotation relying on a “body of evidence” approach.
The formalization of annotation confidence Levels was proposed by the Metabolomics Standards Initiative [4] and refined to account for new online or computational approaches [5]. Level 1 relies on an authentic standard match, while Level 2 involves a spectral match from external libraries. Lower-confidence Levels cover in silico approaches (Level 3), partial matches (Level 4), and unknown spectroscopic signals (Level 5). The main limiting step is the discrepancy between publicly available mass spectral libraries and the chemical diversity encountered in biological samples. For instance, the latest version of the FragHub mass spectral library integration [6] (10.5281/zenodo.17235587, v12) encompasses two million spectra for 150,000 unique InChIKeys, while the chemical space produced by living organisms is estimated at millions of molecules [7]. As a result, experimental spectral matches (Levels 1 and 2) cover only 5 to 20% of all detected signals in untargeted experiments. The remaining signal annotations are based on chemical catalog queries using in silico fragmentation approaches (Level 3). For each feature, putative matches are ranked according to a scoring system specific to the software used. For instance, MS-Finder [8] leverages rule-based fragmentation to predict fragments from SMILES structures, while MetFrag [9] employs combinatorial bond breaking and ranks putative structures based on a weighted sum of fragment matches, intensity correlation, and neutral losses. Sirius-CSI [10] combines fragmentation trees with machine learning to predict molecular fingerprints, ranking candidates via Bayesian probability and fingerprint similarity. Identification accuracy ranges from 10–30% for the first hit (Top 1) to 90% for the top 20, according to the CASMI challenge [11] and another benchmark [12].
As a result, despite advances in computational tools for metabolite annotation, these solutions may lead to numerous false annotations in untargeted LC-MS metabolomics, which deals with hundreds to thousands of features. Several strategies have been proposed to address this challenge based on mass spectral similarity networks provided by the GNPS facility [2]. For instance, Network Annotation Propagation (NAP) [13] leverages molecular network topology to re-rank structural candidates even without spectral library matches. Taxonomically Informed Metabolite Annotation (TIMA) [14] improves confidence in annotations by considering the taxonomic position of biological sources. ConCISE [15] integrates molecular networking, spectral library matching, and in silico class predictions to provide accurate classifications for subnetworks. Inventa [16] calculates a priority score to highlight structural novelty within extracts. More recently, MS2DECIDE [17] aggregates data from GNPS, Sirius, and ISDB-LOTUS [18] using multi-criteria decision analysis to prioritize annotation and highlight feature novelty potential. Another avenue uses multilayer networks, which combine knowledge-based metabolic reactions, correlation networks, and mass spectral similarity to enable global metabolite annotation from knowns to unknowns [19]. These approaches collectively aim to streamline feature identification by leveraging computational insights and expert knowledge to enhance the efficiency and confidence of MS/MS-based annotations. Although these network-based strategies have enhanced annotation confidence, their adoption in routine metabolomics workflows faces practical barriers, including requirements for bioinformatics expertise, dependencies on web servers or dedicated computational infrastructure, and outputs with limited chemical metadata that restrict seamless integration with pathway enrichment analysis and/or multi-omics approaches [20].
To address this challenge, we propose MS-Net (Multi-Similarity Network-based annotation). This user-friendly workflow combines mass spectral and Tanimoto similarity networks with correlation analysis and taxonomic data to drive annotation prioritization. Additionally, MS-Net integrates positive and negative ionization chromatographic fingerprints with several filtering options. Finally, the workflow generates enriched annotation metadata and statistical tables, and maps results onto the mass spectral and Tanimoto similarity networks. This approach is completely offline and compatible with output files from widely used software in the field, such as MS-Dial, MZmine, Sirius-CSI, and MS-Finder (Fig. 1).

Results
Workflow Architecture and Implementation
MS-Net was developed using the KNIME visual programming interface [21] and needs three primary inputs: (1) feature intensity tables (peak height or area), (2) mass spectral similarity (MSS) network edge lists, and (3) multi-Level annotation hits encompassing experimental library matches (Levels 1-2) and in silico predictions (Levels 3-4). Together, these annotation hits define a putative chemical space that may comprise several thousand candidate structures per dataset (Figure 1A,B).

Annotation Confidence Scoring System
The workflow begins by normalizing confidence scores across annotation Levels to establish a unified ranking system. Level 1 receives the highest confidence score and is assigned to features showing perfect concordance with authentic standards based on MS/MS spectral similarity (similarity > 0.95) and retention time agreement (ΔRT < […]). Level 2 library matches are split into Level 2a for high-confidence matches (similarity > 0.85) and Level 2b for moderate-confidence matches (similarity > 0.7). For in silico annotations, MS-Net enriches structural candidates with taxonomic information by querying Coconut 2.0 [22]. The workflow performs InChIKey-based matching to identify compounds reported from user-specified taxonomic sources (genus and family levels).
Candidates with confirmed biosource origins or matching the target chemical class are assigned Level 3a, while remaining in silico matches receive Level 3b. Spectral library matches exhibiting lower similarity (similarity < 0.7) or significant precursor mass discrepancies are classified as MS/MS analogs (Level 4). Features lacking any structural annotation remain at Level 5 (unknown) (Figure 1B). Confidence scores are normalized to a 0–100 scale to enable cross-Level comparison. The normalization scheme employs Level-specific transformations:
Level 1 = 95 + (similarity score − 0.95) × 100
Level 2a = 85 + (similarity score − 0.85) × 100
Level 2b = 70 + (similarity score − 0.70) / 0.15 × 14
Level 3b (in silico and de novo) = 40 + confidence score × 40
Level 4 (MS/MS analogs) = 30 + (similarity score − 0.5) × 90
Level 5 = 0
These transformations ensure that experimental matches consistently receive higher scores than computational predictions while preserving score discrimination within each Level.

Feature Filtering and Data Reduction
Prior to network-based annotation propagation, users may optionally apply filtering strategies to reduce dataset complexity while retaining chemically informative features. The workflow supports retention time clustering using the MS-CleanR [3] algorithm. Within each RT cluster, the most intense features and/or those with the highest network connectivity (degree) are preferentially retained, effectively removing redundant signals (Figure 1C). Optionally, a cosine filter threshold can be added to constrain the MSS network. Finally, putative annotations between two nodes may be filtered according to the XlogP calculated for each candidate. For each feature pair, the XlogP difference is compared to the retention time difference along the edge. In C18 mode, only feature pairs exhibiting XlogP trends consistent with retention time order are retained.
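The Level-specific normalization listed above can be written as a small function. The following is a minimal Python sketch, not the KNIME implementation: the function name `normalized_confidence`, the string Level labels, and the final clamp to the 0–100 range are assumptions made for illustration. No transformation is stated in the text for Level 3a, so the sketch refuses that Level rather than guessing one.

```python
def normalized_confidence(level: str, score: float = 0.0) -> float:
    """Map a raw similarity/confidence score (0-1) onto the 0-100 scale.

    The formulas follow the Level-specific transformations given in the text.
    """
    if level == "1":        # authentic standard match
        value = 95 + (score - 0.95) * 100
    elif level == "2a":     # high-confidence library match
        value = 85 + (score - 0.85) * 100
    elif level == "2b":     # moderate-confidence library match
        value = 70 + (score - 0.70) / 0.15 * 14
    elif level == "3b":     # in silico and de novo candidates
        value = 40 + score * 40
    elif level == "4":      # MS/MS analogs
        value = 30 + (score - 0.5) * 90
    elif level == "5":      # unknown feature
        value = 0.0
    else:
        # Level 3a has no stated formula in the text; do not invent one.
        raise ValueError(f"no normalization defined for Level {level!r}")
    # Assumption: clamp to the 0-100 range announced in the text.
    return max(0.0, min(100.0, value))


if __name__ == "__main__":
    for level, score in [("1", 0.98), ("2a", 0.90), ("2b", 0.85), ("3b", 0.5), ("4", 0.6)]:
        print(level, round(normalized_confidence(level, score), 2))
```

For example, a Level 2b match at the upper end of its range (similarity 0.85) maps to 84, just below the floor of 85 assigned to Level 2a.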
Network-Based Annotation Propagation
MS-Net constructs a seed subnetwork comprising only high-confidence annotations (Levels 1, 2a, and 3a), which serves as the foundation for propagating structural assignments throughout the entire MSS network. To rank competing structural candidates for each feature pair connected in the MSS network, we developed a composite scoring metric that integrates spectral, structural, and computational evidence into a unified Link Score. The structural similarity component employs two complementary Tanimoto measures calculated from molecular fingerprints: Tanimoto_Full (based on Morgan, PubChem or RDKit fingerprints) captures overall molecular similarity, including substituents, while Tanimoto_Murcko (scaffold fingerprints) emphasizes core structural frameworks. These metrics are dynamically weighted according to their relative informativeness. When the absolute difference between these two measures exceeds 0.1 (|Tanimoto_Full − Tanimoto_Murcko| > 0.1), the higher value receives greater weight, reflecting either the dominance of substituent patterns (Tanimoto_Full) or core scaffold similarity (Tanimoto_Murcko). When this difference falls within ±0.1, equal weights are applied. The resulting Combined_Tanimoto is then scaled by the MS/MS cosine similarity to produce the Adjusted_Structural score, which accounts for both molecular structure and spectral concordance. In parallel, an InSilico_Combined score is calculated as the arithmetic mean of the confidence scores described above. The final Link Score integrates these two components through a user-tunable weighting parameter:
Link Score = (1 − α) × Adjusted_Structural + α × InSilico_Combined
The parameter α allows users to control the relative contributions of spectral-structural evidence versus in silico prediction scores. Lower α values (e.g., 0.3) prioritize structural and spectral similarity, while higher values give more weight to computational prediction confidence.
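To make the Link Score arithmetic concrete, here is a minimal Python sketch under stated assumptions. The text does not give the weights applied when the two Tanimoto values differ by more than 0.1, so `dominant_weight = 0.7` is a hypothetical placeholder. The sketch also assumes that confidence scores are rescaled to 0–1 before averaging, so that both terms share a scale, and that InSilico_Combined is the mean over the two candidates on an edge. All function names are illustrative and are not part of the MS-Net workflow.

```python
def combined_tanimoto(t_full: float, t_murcko: float,
                      dominant_weight: float = 0.7) -> float:
    """Blend full-molecule and scaffold Tanimoto similarities.

    dominant_weight is a hypothetical value: the text only states that the
    higher measure 'receives greater weight' when the gap exceeds 0.1.
    """
    if abs(t_full - t_murcko) > 0.1:
        high, low = max(t_full, t_murcko), min(t_full, t_murcko)
        return dominant_weight * high + (1 - dominant_weight) * low
    return 0.5 * (t_full + t_murcko)  # equal weights within +/- 0.1


def link_score(t_full: float, t_murcko: float, cosine: float,
               conf_a: float, conf_b: float, alpha: float = 0.3) -> float:
    """Link Score = (1 - alpha) * Adjusted_Structural + alpha * InSilico_Combined."""
    adjusted_structural = combined_tanimoto(t_full, t_murcko) * cosine
    # Assumption: confidences are on a 0-1 scale, averaged over both candidates.
    in_silico_combined = (conf_a + conf_b) / 2
    return (1 - alpha) * adjusted_structural + alpha * in_silico_combined


if __name__ == "__main__":
    # Two competing candidates for the neighbor of a seed (seed confidence 1.0)
    # connected by an MSS edge with cosine 0.89.
    top_ranked = link_score(0.20, 0.20, 0.89, conf_a=1.0, conf_b=0.9)
    low_ranked = link_score(0.90, 0.95, 0.89, conf_a=1.0, conf_b=0.5)
    print(round(top_ranked, 3), round(low_ranked, 3))
```

With α = 0.3, the structurally coherent candidate scores higher than the one with the better in silico confidence, which is the behavior the weighting is meant to produce. Raising α toward 1 would shift the ranking back toward the in silico score.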
For each feature pair in the MSS network, the workflow evaluates all candidate structures and selects the annotation with the highest Link Score. This process propagates iteratively from high-confidence seeds to their direct neighbors and subsequently through the entire connected network. Features that remain outside the MSS network are ranked using a simplified metric combining their best in silico score with the maximum Tanimoto similarity to any annotated compound within the MSS network.

Metadata Enrichment and Output Generation
The final annotated feature list is optionally enriched with chemical metadata using ClassyFire and NPClassifier ontologies, providing hierarchical chemical classifications (kingdom, superclass, class, subclass). Database identifiers for each annotated feature are retrieved using the Chemical Translation Service [23], ensuring compatibility with downstream pathway enrichment tools and multi-omics integration platforms (Figure 1E). MS-Net generates four primary outputs: (1) a comprehensive annotated feature table with confidence Levels, structural information, and chemical ontology; (2) a feature height/area table; (3) an MSS network edge table for visualizing spectral similarity relationships; and (4) a Tanimoto-based network edge table connecting structurally related compounds, including links between unknown features and their nearest annotated structural neighbors. This latter output enables users to infer structural motifs for unannotated features. Both networks are enriched with putative chemical reactions between two neighboring nodes using a delta mass match to a predefined list from Metanetter 2 [24]. Optionally, features acquired in positive and negative ionization modes can be merged based on user-defined retention time and m/z tolerances.

Application to Cannabis Metabolomics Dataset
Inflorescences of three medical-grade Cannabis sativa L. chemotypes were selected to evaluate MS-Net's annotation capabilities.
Cannabis represents an ideal model system for several reasons. First, the species exhibits remarkable chemical diversity, encompassing a wide array of cannabinoids, terpenes, and phenolic compounds (primarily flavonoids and hydroxycinnamic acids) [25–27]. The phytochemistry of cannabis is well-documented, with numerous metabolomic studies demonstrating robust discrimination among cultivars and chemotypes [28–30]. The traditional morphology-based classification (Indica vs. Sativa) has been superseded by a chemotype system based on the relative concentrations of Δ9-tetrahydrocannabinol (THC) and cannabidiol (CBD). This framework defines three main chemotypes: Type I (THC-predominant, <0.5% CBD), Type II (balanced THC:CBD ratio), and Type III (CBD-predominant, <1% THC). Although expanded classifications include Type IV (cannabigerol-predominant) and Type V (cannabinoid-free), Types I–III remain the most extensively characterized [31,32]. Crucially for algorithm evaluation, cannabinoids exhibit exceptional structural diversity (an estimated 120–150 distinct phytocannabinoids) while sharing highly similar core molecular scaffolds [33]. For instance, the major cannabinoids THC, CBD, and cannabichromene (CBC) are constitutional isomers (C₂₁H₃₀O₂) differing only in cyclization patterns: THC features a pyran ring, CBD a cyclohexene ring, and CBC a benzopyran structure. This structural similarity, combined with well-elucidated biosynthetic pathways, provides an ideal benchmark for evaluating network-based annotation algorithms. The biosynthetic pathway initiates with cannabigerolic acid (CBGA), formed by CBGA synthase-catalyzed condensation of olivetolic acid and geranyl pyrophosphate, which serves as the universal precursor for nearly all other cannabinoids [34]. Additionally, in planta, cannabinoids exist predominantly as carboxylic acids (THCA, CBDA, CBCA), with decarboxylation to neutral forms occurring upon heating.
To demonstrate MS-Net's capabilities, we applied the workflow to a comprehensive untargeted metabolomics study of Cannabis sativa L., analyzing three distinct chemotypes: Bedrocan® (THC-dominant, Type I), Bedrolite® (CBD-dominant, Type III), and Bediol® (THC/CBD-balanced, Type II). Initial data processing detected 2,595 features across positive and negative ionization modes.

Feature Filtering and Chemical Space Reduction
We applied a sequential filtering strategy to reduce dataset complexity while retaining chemically meaningful signals. First, Ion Identity Networking identified and collapsed redundant adducts and isotopes by selecting the most informative precursor ions ([M+H]⁺ or [M-H]⁻) from each ion cluster. Subsequently, MS-CleanR-based retention time clustering further consolidated co-eluting features (Figure 2B). This filtering reduced the dataset to 1,297 unique features while maintaining comprehensive chemical coverage. In silico annotation using Sirius-CSI (top 50 candidates per feature) and MSNovelist (top 20 de novo structures per feature) generated a putative chemical space encompassing more than 118,000 candidate structures (Figure 2A). This expansive search space highlights the challenge of confident structural assignment in untargeted metabolomics: without prioritization strategies, the likelihood of selecting incorrect annotations from such large candidate pools is substantial.

Network-Based Annotation Prioritization
We seeded the MSS network using Level 1 (authentic standards), Level 2a (high-confidence spectral matches), and Level 3a annotations (taxonomically informed candidates from the Cannabis genus, Cannabaceae family, or cannabinoid chemical class). PubChem-based fingerprints were selected to calculate Tanimoto similarities. The Link Score algorithm was configured with α = 0.3 to prioritize structural and spectral similarity over raw in silico ranking scores.
To evaluate the algorithm's performance, we examined the agreement between MSS network topology (cosine similarity) and structural similarity (Tanimoto scores). Before annotation prioritization, the raw chemical space exhibited a mean absolute distance of 0.55 between these metrics, with the highest density occurring between 0.6 and 0.8 (Figure 2C). This discrepancy reflects that spectral similarity does not always correlate with structural similarity, particularly when in silico tools generate diverse candidates. Restricting to only the top-ranked in silico candidate (top 1) dramatically reduced the mean distance to 0.3, but at the cost of excluding potentially correct structures ranked lower. Expanding to the top 10 or top 20 candidates achieved better performance, with maximum density centered around 0.2, indicating strong agreement between spectral and structural similarities. Notably, incorporating the top 10 de novo structures from MSNovelist further improved concordance, suggesting that machine learning-generated candidates can complement database-constrained searches for features representing novel or underrepresented chemical scaffolds. Finally, the top 50 in silico and top 20 de novo candidates per feature were selected for annotation prioritization.

Global Dataset Annotation and Chemical Space Coverage
MS-Net reduced the initial chemical space from 118,000 candidates to 1,275 confidently annotated compounds across 1,297 features (Figure 2D). Analysis of annotation rank distribution revealed that 47% of features were assigned their top-ranked in silico candidate, indicating strong agreement between computational predictions and network-guided prioritization. An additional 30% of annotations fell within ranks 2–20, demonstrating the algorithm's ability to rescue correct structures initially ranked lower due to limitations in in silico fragmentation models. The remaining 23% of annotations were ranked above position 20 (Figure 2F).
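The concordance measure reported earlier in this section (mean absolute distance between cosine and Tanimoto similarity) can be sketched as follows. This is a minimal Python illustration that assumes the distance is |cosine − Tanimoto| computed per candidate pair on each MSS edge and then averaged; the text does not spell out the exact definition. The `CandidatePair` record and the `top_k` filter are hypothetical names used only to mimic the top 1 / top 10 / top 20 comparison.

```python
from statistics import mean
from typing import Iterable, NamedTuple, Optional


class CandidatePair(NamedTuple):
    cosine: float    # MS/MS cosine similarity of the MSS edge
    tanimoto: float  # Tanimoto similarity of the two candidate structures
    rank_a: int      # in silico rank of the candidate on the first node
    rank_b: int      # in silico rank of the candidate on the second node


def mean_abs_distance(pairs: Iterable[CandidatePair],
                      top_k: Optional[int] = None) -> float:
    """Average |cosine - Tanimoto| over candidate pairs.

    If top_k is given, keep only pairs whose two candidates both rank <= top_k.
    """
    kept = [p for p in pairs
            if top_k is None or max(p.rank_a, p.rank_b) <= top_k]
    if not kept:
        raise ValueError("no candidate pairs left after rank filtering")
    return mean(abs(p.cosine - p.tanimoto) for p in kept)


if __name__ == "__main__":
    pairs = [
        CandidatePair(cosine=0.9, tanimoto=0.8, rank_a=1, rank_b=1),
        CandidatePair(cosine=0.9, tanimoto=0.2, rank_a=30, rank_b=1),
    ]
    print(mean_abs_distance(pairs), mean_abs_distance(pairs, top_k=1))
```

A lower value means that spectrally similar features carry structurally similar candidates, which is the agreement the prioritization step aims to increase.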
The final annotation distribution by confidence Level showed: 9 Level 1 (authentic standards), 58 Level 2a (high-confidence spectral matches), 31 Level 2b (moderate-confidence spectral matches), 43 Level 3a (taxonomically informed in silico annotations), 1,051 Level 3b in silico and 71 Level 3b de novo matches, 4 Level 4 (MS/MS analogs), and 26 Level 5 (unknown) (Figure 2E). This distribution reflects the typical annotation coverage achievable in specialized metabolomics studies, where experimental spectral libraries cover only a fraction of detected features, necessitating extensive in silico inference.

Chemotype Discrimination
Principal component analysis (PCA) of the annotated feature matrix (n = 18 samples, p = 1,297 features) revealed clear separation of the three cannabis chemotypes, with the first two principal components explaining 90% of total variance (Figure 2G). Sparse partial least squares discriminant analysis (sPLS-DA) identified 60 discriminant features that robustly distinguished the chemotypes (Figure 2H). Chemical ontology classification using NPClassifier revealed distinct natural product pathway enrichments for each chemotype (Figure 2I). Bedrocan® (THC-dominant) exhibited enrichment in phenylpropanoids and terpenoid pathways, consistent with high levels of THC and related cannabinoids. Bedrolite® (CBD-dominant) showed elevated levels of amino acid derivatives and shikimate pathways. Bediol® (balanced THC/CBD) displayed an intermediate metabolite profile with enrichment in polyketide derivatives.

Case Study: Cannabinoid Subnetwork Annotation
The initial MSS network is seeded with annotation Levels 1, 2a and 3a (green dots, Figure 3A). The cannabinoid MSS subnetwork illustrates the algorithm's discriminative power (Figure 3B).
MS-Net successfully prioritized structurally coherent annotations, including close derivatives differing primarily in hydroxylation patterns, methyl substitutions, or double bond positions, chemical variations consistent with known cannabinoid biosynthetic pathways. Two de novo structures from MSNovelist were also integrated, representing potential novel cannabinoid scaffolds warranting further investigation (Figure 3B). The MS-Net prioritization algorithm is illustrated for cannabichromenic acid (CBC-A, Level 1 authentic standard) and its close neighbor, annotated at Level 3b (Figure 3C). This neighbor displayed a pseudomolecular ion at m/z 301.145 (molecular formula C₁₈H₂₂O₄, 0.9 ppm mass accuracy), which matched 70 putative in silico candidates. Reliance solely on the top-ranked in silico candidate from Sirius-CSI would have resulted in an incorrect structural assignment. However, the Link Score algorithm identified cannabiorcichromenic acid, originally ranked 50th, as the most probable annotation based on its high Tanimoto similarities to CBC-A, which is consistent with the MSS cosine similarity of 0.89.

Exploitation of the Tanimoto structural similarity network through the study of cannabinoid biosynthesis pathway
Figure 4 illustrates how the MS-Net workflow efficiently clusters metabolites according to their structural similarities, providing a comprehensive view of the metabolic landscape. The upper panel (A) presents the complete Tanimoto structural similarity network. Within this network, each node represents a distinct metabolite, and the edges represent the Tanimoto similarity score between them. The organization of the network reveals well-defined clusters of metabolites corresponding to distinct biosynthetic routes, reflecting the algorithm’s capacity to capture pathway-level organization within complex metabolomic data.
Overall, the similarity network is organized into three main subnetworks, five large clusters, and several smaller satellite clusters. An expanded subnetwork (B) was isolated by selecting the cannabinoids and their precursors, as well as their close neighbors. This subnetwork highlights a local region of the network, illustrating how cannabinoids and precursors with related chemical structures or biosynthetic origins are tightly interconnected. The lower panel (C) focuses on the three main biosynthetic pathways that specifically yield cannabinoids: the olivetolic acid, orsellinic acid, and divarinic acid pathways. Each pathway is represented as a sequence of enzymatic and chemical reactions converting early intermediates into key cannabinoids. The three precursors associated with the three main biosynthetic pathways were detected within the extracts, along with the majority of metabolites produced via the olivetolic acid pathway and a subset originating from the divarinic and orsellinic acid pathways.

Discussion
MS-Net was designed to streamline the LC-MS-based untargeted metabolomic workflow from raw data acquisition to a complete annotated table enriched with metadata, based on outputs from processing software commonly used in the field. Our results demonstrate that MS-Net not only provides confident structural annotations but also generates biologically interpretable outputs with enriched metadata suitable for downstream pathway enrichment analysis and multi-omics integration. The core algorithm of MS-Net leverages Tanimoto similarity metrics starting from confidently annotated nodes (Levels 1, 2a, and 3a), which serve as seeds to propagate annotations into the MSS networks. Consequently, the results will be impacted by the quality of the annotations used as input. Annotation Level 1 is dependent on the internal library and generally limited to synthetic standards.
Annotation Level 2a, encompassing high experimental MS/MS similarity scores, will be influenced by the quality and coverage of the mass spectral library used. Initiatives such as FragHub [6], MSnLib [35], MassBank [36], or GNPS [2] have dramatically improved matching results. However, metabolome coverage of mass spectral libraries is still limited compared to natural product catalogs [37]. To extend putative seeds within the MSS network, we implemented InChIKey-based matching against the Coconut 2.0 [22] chemical catalog or a user-selected chemical ontology, which increases the number of seeds to more than 90 compounds in the case of Cannabis extract profiling (Supplementary data: ResultsNEGPOS.xlsx). This well-studied plant [38] is well suited to the MS-Net approach, while understudied organisms may suffer from less accurate annotation results. The type of molecular fingerprint used strongly influences the Tanimoto-based similarity result [39]. Three contrasting types of fingerprints are available within MS-Net: Morgan fingerprints, which capture local circular substructures (radius of 2 by default) and are highly discriminative for structural similarity; RDKit (Daylight-like) fingerprints, which encode molecular paths through the graph and provide a balanced general-purpose representation; and PubChem fingerprints, which use 881 predefined substructural features that tend to emphasize common functional groups and yield higher similarity scores between diverse molecules. The latter were used for the present study and, in our experience, Morgan or PubChem molecular fingerprints allowed the detection of subtle substructure changes compared to the RDKit type. The MS-Net approach is also highly dependent on the MSS network topology. A poorly clustered network, constructed with a low similarity threshold (e.g., 0.6), may produce spurious results. For the Cannabis dataset, the modified cosine score algorithm with a threshold of 0.7 was used, which provided a balanced, well-clustered MSS network (Fig. 3A). Still, changing the spectral clustering measure, for example to Bonanza, available in MS-Dial, impacts the final results (data not shown). For the Cannabis dataset, the homogeneous distribution of compounds across biosynthetic pathways within the similarity network demonstrates that MS-Net is effective not only in establishing structural relationships among compounds but also in grouping them coherently (Fig. 4A). Among the annotated compounds, cannabinoids represent a chemically important class. Cannabinoids are secondary metabolites biosynthesized in the glandular trichomes of female cannabis inflorescences through three well-characterized biosynthetic routes: the olivetolic acid pathway, the divarinic acid pathway, and the orsellinic acid pathway. The MS-Net annotation workflow successfully identified the three precursors of these metabolic routes (olivetolic, divarinic, and orsellinic acids), along with a series of downstream cannabinoids. The olivetolic acid pathway, for example, was almost fully reconstructed, as shown in Fig. 4C. Notably, compounds with nearly identical structures were accurately annotated, such as Δ⁹-THC and Δ⁸-THC, two isomers with identical molecular masses. Three cannabinoids associated with the other two biosynthetic routes were annotated. It is important to emphasize that these results reflect the metabolic fingerprint of the polar phases of cannabis inflorescence extracts at the time of extraction. Consequently, cannabinoids that were not annotated (non-circled) may have been absent from the extracts due to degradation (either in the extracts or in the inflorescences) or may not have been biosynthesized by the plant. (In planta, cannabinoids are predominantly produced in their acidic forms and are subsequently converted into their neutral analogues through decarboxylation induced by light or heat.)
Overall, these findings demonstrate that MS-Net is an effective tool to capture known biosynthetic pathways within the initial annotation chemical space.

Material and methods
Sample preparation process
Dried female inflorescences of three cultivars of medicinal C. sativa L. were purchased from Bedrocan International (Veendam, Netherlands), including a THC-dominant type Bedrocan® variety (batch: 20C30EY20E13), a CBD-dominant type Bedrolite® variety (batch: 20I14FR20L02) and a THC/CBD-intermediate type Bediol® variety (batch: 19L16FB20K04), according to OMC (Office of Medicinal Cannabis, Netherlands) and ANSM (Agence Nationale de Sécurité du Médicament et des produits de santé, France) requirements and authorization. Samples were stored in airtight containers in the dark at 25 °C. For each variety, 500 mg of material was weighed, flash-frozen in liquid nitrogen, and finely ground using a pre-chilled mortar and pestle. The resulting frozen powder was extracted at 25 °C using a biphasic solvent system. Briefly, 1.5 mL of a methyl tert-butyl ether/methanol (75:25 v/v) mixture was added to the sample, followed by sonication in an ultrasonic bath for 3 min. Subsequently, 1.5 mL of a water/methanol (75:25 v/v) mixture was added, and the sample was sonicated again under identical conditions. The extracts were then centrifuged at 240 g for 3 min at 5 °C and the resulting supernatants were collected. Both the polar and apolar fractions were evaporated using a SpeedVac SPD vacuum concentrator (Thermo Fisher Scientific) at 35 °C and 50 mbar. The dried residues were subsequently reconstituted in acetonitrile/water (50:50 v/v) to a final concentration of 10 mg/mL. Extractions were performed in triplicate. The extracts were stored at -20 °C in amber glass bottles between each use.
Standards preparation

A Level 1 library was constructed by injecting analytical-grade cannabinoid standards. Commercially available 1 mg/mL solutions of cannabigerolic acid (CBGA), cannabigerol (CBG), cannabidiolic acid (CBDA), cannabidiol (CBD), cannabichromenic acid (CBCA), cannabichromene (CBC), Δ-9-tetrahydrocannabinolic acid (d9THCA), Δ-9-tetrahydrocannabinol (d9THC), Δ-8-tetrahydrocannabinol (d8THC), tetrahydrocannabivarin (THCV), cannabidivarinic acid (CBDVA), cannabidivarin (CBDV) and cannabicyclol (CBL) were purchased from Cerilliant (Austin, Texas, USA). Cannflavin A at 1 mg/mL was obtained from Sigma-Aldrich (Darmstadt, Germany). All standard solutions were diluted to a concentration of 12 μg·mL⁻¹ in acetonitrile.

UHPLC-HRMS/MS analysis

Chromatographic separation was carried out on a Vanquish UHPLC system (Thermo Fisher Scientific, Waltham, MA, USA) equipped with a Luna Omega Polar C18 analytical column (150 × 2.1 mm, 1.6 µm; Phenomenex, Torrance, CA, USA). The chromatographic system was coupled with a Vanquish Diode Array Detector (DAD), a Charged Aerosol Detector (CAD) and a Q Exactive Plus mass spectrometer (Thermo Fisher Scientific, Waltham, MA, USA). A dual-pump system was installed: the first pump handled the separation at the column level, and the second pump generated a counter-gradient at the column outlet to maintain a constant mixture of 50% water and 50% acetonitrile. The separation was performed at a flow rate of 0.4 mL·min⁻¹ and at a column oven temperature of 40 °C, using a mobile phase composed of H₂O + 0.05% formic acid (A) and acetonitrile + 0.05% formic acid (B). A 1 µL injection of extracts at 10 mg·mL⁻¹ was introduced into the system. The applied gradient was as follows: 0 to 0.5 min, 2% B; 0.5 to 18 min, 98% B; 18 to 21 min, 98% B; 21 to 21.5 min, 2% B; and 21.5 to 24 min, 2% B.
Mass spectrometry detection was performed using a Q Exactive Plus instrument equipped with a heated electrospray ionization source (HESI-II) operating in positive and negative ionization modes, with a resolution of 35,000 (MS1) and 17,500 (MS2). The collision energy was set to 10, 20, and 40 eV in stepped mode for data-dependent MS/MS acquisition. The capillary temperature was 300 °C, and the mass scan range was 100–1,500 m/z for both MS1 and MS2. Data-dependent acquisition targeted the four most intense precursor ions per scan cycle.

Data Processing and Feature Detection

Raw LC-MS data files were processed using MZmine (version 4.7, mzio GmbH). Mass detection was performed separately for MS1 and MS2 scans using the "Factor of lowest signal" algorithm with noise factors of 5.0 and 2.5, respectively. Chromatographic peaks were built using the ADAP chromatogram builder with the following parameters: minimum consecutive scans = 4, minimum intensity for consecutive scans = 10,000, minimum absolute height = 50,000, and m/z tolerance of 0.002 Da or 10 ppm (scan-to-scan). Chromatograms were smoothed using Savitzky-Golay filtering (5-scan window in the retention time dimension). Feature deconvolution was performed using the local minimum resolver with a minimum absolute height of 2 × 10⁶ and 1 × 10⁶ in positive and negative mode, respectively, a peak top-to-edge ratio of 1.8, a peak duration range of 0.04–3.0 min, and a minimum of 5 data points per feature. MS/MS spectra were paired to features using a precursor m/z tolerance of 0.01 Da or 10 ppm, a retention time tolerance matching feature edges, and a minimum relative feature height of 0.25. Isotope grouping was performed using the ¹³C isotope filter with an m/z tolerance of 0.005 Da or 10 ppm, a retention time tolerance of 0.05 min, monotonic shape validation, and a maximum charge state of 2. Ion Identity Networking (IIN) was performed to identify adducts, in-source fragments, and neutral losses.
Network refinement was applied with a minimum network size of 1, deletion of networks lacking monomer ions, and removal of networks with fewer than 2 links. Retention time correction was performed using aligned features as reference points (m/z tolerance: 0.005 Da or 10 ppm; RT tolerance: 0.1 min; minimum standard intensity: 500,000). Feature lists were aligned using the join aligner with an m/z tolerance of 0.0015 Da or 10 ppm (sample-to-sample), an RT tolerance of 0.15 min, and an m/z weight-to-RT weight ratio of 3:1. Alignment incorporated isotope pattern comparison (minimum score: 0.8) and MS/MS spectral similarity (weighted cosine ≥ 0.7) when available. Gap filling was performed using the multithreaded peak finder with an intensity tolerance of 20%, an m/z tolerance of 0.002 Da or 10 ppm, an RT tolerance of 0.1 min, and a minimum of 3 scans per feature. Duplicate features were filtered using an averaging mode with an m/z tolerance of 0.005 Da or 10 ppm and an RT tolerance of 0.1 min.

Blank subtraction was applied requiring minimum detection in 2 blank samples, with features retained only if they showed ≥2-fold higher intensity in samples compared to blanks (based on maximum blank intensity). Final feature filtering retained only features meeting all of the following criteria: detection in at least 6 samples within any sample class (out of 6 replicates per chemotype), presence of at least 2 isotopic peaks with a validated ¹³C pattern, and presence of an MS/MS spectrum.

Spectral networking was performed using the modified cosine algorithm with the following parameters: m/z tolerance = 0.003 Da or 10 ppm, maximum precursor m/z delta = 600 Da, minimum matched signals = 4, minimum cosine similarity = 0.6.
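Several parameters above are given as paired tolerances (for example "0.002 Da or 10 ppm"), and the blank subtraction rule combines a detection count with a fold change. The Python sketch below illustrates one possible reading of both rules. It assumes that the larger of the absolute and ppm windows applies and that the sample side of the fold change uses the maximum sample intensity; neither detail is stated in the text. The function names are hypothetical and this is not the MZmine implementation.

```python
def mz_window(mz, abs_tol_da, ppm_tol):
    """Half-width of the m/z matching window in Da.

    Assumption: the larger of the absolute and the ppm-derived tolerance applies.
    """
    return max(abs_tol_da, mz * ppm_tol / 1e6)


def mz_match(mz_a, mz_b, abs_tol_da=0.002, ppm_tol=10.0):
    """True if mz_b falls within the tolerance window centred on mz_a."""
    return abs(mz_a - mz_b) <= mz_window(mz_a, abs_tol_da, ppm_tol)


def passes_blank_filter(sample_intensities, blank_intensities,
                        min_fold=2.0, min_blank_detections=2):
    """Blank subtraction as read from the text.

    A feature counts as blank-related only if it is detected (intensity > 0)
    in at least `min_blank_detections` blanks. In that case it is kept only
    when the maximum sample intensity (assumption) is at least `min_fold`
    times the maximum blank intensity.
    """
    detected = [b for b in blank_intensities if b > 0]
    if len(detected) < min_blank_detections:
        return True  # not enough blank detections to treat it as a blank feature
    return max(sample_intensities) >= min_fold * max(detected)


if __name__ == "__main__":
    # At low m/z the absolute tolerance dominates, at high m/z the ppm one does.
    print(mz_window(150.0, 0.002, 10.0))
    print(mz_window(1000.0, 0.002, 10.0))
    print(passes_blank_filter([5e5, 8e5], [1e5, 3e5]))
```

Under this reading, the absolute term sets the window for small ions while the ppm term takes over above m/z 200 for the 0.002 Da / 10 ppm pair, which is why both values are reported together.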
Our in-house library of cannabinoid standards was matched using weighted cosine similarity (MassBank weighting) with a minimum cosine similarity of 0.85, a precursor m/z tolerance of 0.01 Da or 20 ppm, a spectral m/z tolerance of 0.015 Da or 25 ppm, an RT tolerance of 0.2 min, and a minimum of 4 matched signals. Precursor ions were removed, ¹³C deisotoping was applied (m/z tolerance: 0.001 Da or 10 ppm, monotonic shape, max charge: 2), and spectra were cropped to their m/z overlap regions. The feature quantification table, mass spectral similarity network and annotation table were exported for further MS-Net processing. Processed MS/MS data were exported in .mgf format for further SIRIUS processing. The corresponding MZbatch files used for MZmine processing are available here: https://zenodo.org/records/17669288

SIRIUS-CSI annotation

Data in MGF format were imported into SIRIUS-CSI (v6.3, Lehrstuhl Bioinformatik Jena). Molecular formula prediction was performed using pre-configured Orbitrap parameters with allowed elements limited to C, H, N, O, P, and S, using a "database search" strategy. CSI:FingerID structural prediction queried PubChem as the primary database, with bio-databases enabled for further taxonomically enriched matches. FragHub spectral databases were included for experimental spectral matching and analog detection using a precursor m/z tolerance of 20 ppm. Only features with confidence scores below 0.2 in the initial annotation round were selected for subsequent de novo structure prediction using MSNovelist. Results were exported in CSV format with quoted strings, retaining the top 50 ranked candidates per feature for both in silico and de novo matches.

MS-Net parameters

Output files from MZmine (feature quantification table, MSS network edge list, spectral library annotations) and SIRIUS-CSI (top 50 in silico candidates and top 20 de novo structures from MSNovelist per feature) were imported into MS-Net (developed in KNIME Analytics Platform, v5.2).
Features matching authentic standards (spectral cosine > 0.95, ΔRT 0.85) or 2b (0.7 < cosine ≤ 0.85). In silico candidates were enriched with taxonomic information by querying COCONUT 2.0 (linked to NCBI taxonomy) using InChIKey-based matching. Among the top 10 in silico candidates per feature, those matching taxonomic criteria (Cannabis genus, Cannabaceae family) or classified as cannabinoids by NPClassifier were elevated to Level 3a; the remaining in silico matches received Level 3b. Spectral library matches with cosine < 0.7 or significant precursor mass errors were classified as Level 4 (MS/MS analogs).

Features were filtered using Ion Identity Network (IIN) results to remove redundant adducts and isotopes, followed by MS-CleanR-based RT clustering (ΔRT ≤ 0.01 min). Within each cluster, the top 2 features by network degree and the top 2 by peak intensity were retained. The MSS network was constrained to edges with cosine similarity ≥ 0.7 and ΔRT ≤ 8 min between connected nodes.

High-confidence annotations (Levels 1, 2a, 3a) seeded the MSS network for iterative annotation propagation. For each feature pair, candidate structures were ranked using the Link Score formula:

Link Score = (1 − α) × Adjusted_Structural + α × InSilico_Combined

where Adjusted_Structural integrates dynamically weighted Tanimoto similarities (PubChem fingerprints of the full molecule and of the Murcko scaffold) modulated by the MS/MS cosine similarity, and InSilico_Combined is the mean of the available in silico confidence scores. The weighting parameter was set to α = 0.3, prioritizing structural-spectral evidence (70%) over in silico ranking (30%). The top 5 candidates by Link Score were retained per feature before iterating through the entire MSS network. Redundant annotations with identical InChIKey identifiers and a Pearson correlation among samples > 0.7 were consolidated by selecting the candidate with the highest mean peak height.
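The Link Score ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KNIME workflow itself: the text does not give the exact form of Adjusted_Structural, so the fixed 50/50 weighting of full-molecule and scaffold Tanimoto and the multiplication by the MS/MS cosine are placeholders chosen for the example, and all class and function names are hypothetical. Only the outer formula, α = 0.3, and the top 5 cutoff come from the text.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List

ALPHA = 0.3  # weight of the in silico term, as stated in the text


@dataclass
class Candidate:
    inchikey: str
    tanimoto_full: float      # Tanimoto vs. the seed structure, full molecule
    tanimoto_scaffold: float  # Tanimoto vs. the seed structure, Murcko scaffold
    insilico_scores: List[float] = field(default_factory=list)  # scores in [0, 1]


def adjusted_structural(c: Candidate, cosine: float, w_full: float = 0.5) -> float:
    # ASSUMPTION: fixed weighting and multiplicative cosine modulation stand in
    # for the "dynamically weighted" scheme, which the text does not specify.
    structural = w_full * c.tanimoto_full + (1.0 - w_full) * c.tanimoto_scaffold
    return structural * cosine


def link_score(c: Candidate, cosine: float, alpha: float = ALPHA) -> float:
    # Link Score = (1 - alpha) * Adjusted_Structural + alpha * InSilico_Combined
    insilico = mean(c.insilico_scores) if c.insilico_scores else 0.0
    return (1.0 - alpha) * adjusted_structural(c, cosine) + alpha * insilico


def top_candidates(cands: List[Candidate], cosine: float, k: int = 5) -> List[Candidate]:
    # Keep the k best candidates for one edge of the MSS network.
    return sorted(cands, key=lambda c: link_score(c, cosine), reverse=True)[:k]


if __name__ == "__main__":
    cands = [
        Candidate("HYPOTHETICAL-KEY-A", 0.80, 1.00, [0.6, 0.4]),
        Candidate("HYPOTHETICAL-KEY-B", 0.35, 0.40, [0.9]),
    ]
    for c in top_candidates(cands, cosine=0.86):
        print(c.inchikey, round(link_score(c, 0.86), 4))
```

With α = 0.3, a candidate that is structurally close to the seed annotation can outrank one with a higher in silico score but low structural similarity, which is the behaviour the weighting is meant to favour.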
Positive and negative mode feature lists were merged using ΔRT ≤ 0.05 min, Δm/z ≤ 0.002 Da, and a minimum Pearson correlation ≥ 0.6 across sample intensities. Final annotations were enriched with chemical ontology classifications from ClassyFire (kingdom, superclass, class, subclass) and NPClassifier (pathway, superclass, class). Database identifiers (PubChem CID, KEGG, HMDB, ChEBI) were retrieved using the Chemical Translation Service. Natural product-likeness scores were calculated using the NPlikeness calculator. Structural similarity networks were constructed using a Tanimoto coefficient ≥ 0.8, retaining the top 2 nearest neighbors per feature.

Statistical analysis

Multivariate data analysis was performed using the mixOmics R package v6.3.2 through the MetaboStat_AgX Shiny interface (available at https://zenodo.org/records/17352817). Prior to analysis, missing values were imputed using the global minimum intensity value divided by 10. Feature intensities were normalized using total sum normalization (TSN), followed by unit variance scaling (autoscaling). Principal component analysis (PCA) was performed to assess overall variance structure and sample clustering. Partial least squares discriminant analysis (PLS-DA) models were validated using 10-fold cross-validation repeated 100 times to assess classification performance and prevent overfitting. Model quality was evaluated using the classification error rate. Sparse PLS-DA (sPLS-DA) was applied to identify discriminant features, selecting the top 30 features per component (60 total across two components) based on variable importance in projection (VIP) scores. Hierarchical clustering (Euclidean distance, Ward's linkage) of the top 60 discriminant features generated the clustered heatmap visualization. Chemical ontology enrichment analysis was performed on features characteristic of each chemotype according to their coefficient score using NPClassifier pathway classifications.
The results were visualized as pie charts showing pathway distribution per chemotype.

Declarations

Code availability

The MS-Net workflow, tutorials and MZbatch files used for this study are available here: https://zenodo.org/records/17669288

Data availability

Raw LC-MS data acquired in positive and negative mode are available here: https://zenodo.org/records/17671960

References

Borges, R. M. et al. Quantum Chemistry Calculations for Metabolomics: Focus Review. Chem. Rev. 121, 5633–5670 (2021).
Schmid, R. et al. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nat. Commun. 12, 3832 (2021).
Fraisier-Vannier, O. et al. MS-CleanR: A Feature-Filtering Workflow for Untargeted LC–MS Based Metabolomics. Anal. Chem. 92, 9971–9981 (2020).
Sumner, L. W. et al. Proposed minimum reporting standards for chemical analysis: Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics 3, 211–221 (2007).
Charbonnet, J. A. et al. Communicating Confidence of Per- and Polyfluoroalkyl Substance Identification via High-Resolution Mass Spectrometry. Environ. Sci. Technol. Lett. 9, 473–481 (2022).
Dablanc, A. et al. FragHub: A Mass Spectral Library Data Integration Workflow. Anal. Chem. acs.analchem.4c02219 (2024) doi:10.1021/acs.analchem.4c02219.
Medema, M. H., de Rond, T. & Moore, B. S. Mining genomes to illuminate the specialized chemistry of life. Nat. Rev. Genet. 22, 553–571 (2021).
Tsugawa, H. et al. Hydrogen Rearrangement Rules: Computational MS/MS Fragmentation and Structure Elucidation Using MS-FINDER Software. Anal. Chem. 88, 7946–7958 (2016).
Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminformatics 8, 3 (2016).
Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2022).
CASMI 2022. Revisiting CASMI https://fiehnlab.ucdavis.edu/casmi (2022).
Zhu, B. et al. Knowledge-based in silico fragmentation and annotation of mass spectra for natural products with MassKG. Comput. Struct. Biotechnol. J. 23, 3327–3341 (2024).
da Silva, R. R. et al. Propagating annotations of molecular networks using in silico fragmentation. PLOS Comput. Biol. 14, e1006089 (2018).
Rutz, A. et al. Taxonomically Informed Scoring Enhances Confidence in Natural Products Annotation. Front. Plant Sci. 10, 1329 (2019).
Quinlan, Z. A. et al. ConCISE: Consensus Annotation Propagation of Ion Features in Untargeted Tandem Mass Spectrometry Combining Molecular Networking and In Silico Metabolite Structure Prediction. Metabolites 12, 1275 (2022).
Quiros-Guerrero, L.-M. et al. Inventa: A computational tool to discover structural novelty in natural extracts libraries. Front. Mol. Biosci. 9, 1028334 (2022).
Mejri, Y. et al. MS2DECIDE: Aggregating Multiannotated Tandem Mass Spectrometry Data with Decision Theory Enhances Natural Products Prioritization. Chemistry–Methods 202400088 (2025) doi:10.1002/cmtd.202400088.
Allard, P.-M. et al. Integration of Molecular Networking and In-Silico MS/MS Fragmentation for Natural Products Dereplication. Anal. Chem. 88, 3317–3323 (2016).
Zhou, Z. et al. Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking. Nat. Commun. 13, 6656 (2022).
Lê Cao, K.-A. & Welham, Z. M. Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package. (Chapman and Hall/CRC, Boca Raton, 2021). doi:10.1201/9781003026860.
Berthold, M. R. et al. KNIME-the Konstanz information miner: version 2.0 and beyond. ACM SIGKDD Explor. Newsl. 11, 26–31 (2009).
Chandrasekhar, V. et al. COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Res. 53, D634–D643 (2025).
Wohlgemuth, G., Haldiya, P. K., Willighagen, E., Kind, T. & Fiehn, O. The Chemical Translation Service—a web-based tool to improve standardization of metabolomic reports. Bioinformatics 26, 2647–2648 (2010).
Burgess, K. E. V., Borutzki, Y., Rankin, N., Daly, R. & Jourdan, F. MetaNetter 2: A Cytoscape plugin for ab initio network analysis and metabolite feature classification. J. Chromatogr. B (2017) doi:10.1016/j.jchromb.2017.08.015.
Radwan, M. M., Chandra, S., Gul, S. & ElSohly, M. A. Cannabinoids, phenolics, terpenes and alkaloids of cannabis. Molecules 26, 2774 (2021).
Andre, C. M., Hausman, J.-F. & Guerriero, G. Cannabis sativa: the plant of the thousand and one molecules. Front. Plant Sci. 7, 19 (2016).
Pereira Francisco, V. et al. Development of GC–MS coupled to GC–FID method for the quantification of cannabis terpenes and terpenoids: Application to the analysis of five commercial varieties of medicinal cannabis. (2024).
Vásquez-Ocmín, P. G. et al. Cannabinoids vs. whole metabolome: Relevance of cannabinomics in analyzing Cannabis varieties. Anal. Chim. Acta 1184, 339020 (2021).
Aliferis, K. A. & Bernard-Perron, D. Cannabinomics: Application of Metabolomics in Cannabis (Cannabis sativa L.) Research and Development. Front. Plant Sci. 11, (2020).
Zandkarimi, F. et al. Comparison of the cannabinoid and terpene profiles in commercial cannabis from natural and artificial cultivation. Molecules 28, 833 (2023).
De Meijer, E. P. M. & Hammond, K. M. The inheritance of chemical phenotype in Cannabis sativa L. Euphytica 145, 189–198 (2005).
Hazekamp, A. & Fischedick, J. T. Cannabis‐from cultivar to chemovar. Drug Test. Anal. 4, 660–667 (2012).
Hanuš, L. O., Meyer, S. M., Muñoz, E., Taglialatela-Scafati, O. & Appendino, G. Phytocannabinoids: a unified critical inventory. Nat. Prod. Rep. 33, 1357–1392 (2016).
De Ronne, M. & Torkamaneh, D. Discovery of major QTL and a massive haplotype associated with cannabinoid biosynthesis in drug‐type Cannabis. Plant Genome 18, e70031 (2025).
Brungs, C. et al. MSnLib: efficient generation of open multi-stage fragmentation mass spectral libraries. Nat. Methods 22, 2028–2031 (2025).
Neumann, S. et al. MassBank: an open and FAIR mass spectral data resource. Nucleic Acids Res. gkaf1193 (2025) doi:10.1093/nar/gkaf1193.
Marti, G. Lessons from Mass Spectral Library Integration: Addressing Metadata Gaps and Expanding Chemodiversity. (2025).
Wishart, D. S. et al. Chemical Composition of Commercial Cannabis. J. Agric. Food Chem. acs.jafc.3c06616 (2024) doi:10.1021/acs.jafc.3c06616.
Pollmann, J. & Huber, F. Count your bits: more subtle similarity measures using larger radius count vectors. https://doi.org/10.1101/2025.06.16.659994.

Additional Declarations

The authors declare no competing interests.

Supplementary Files

AllAnnotationPOSNEG.csv
EdgestopKMSMSNEGPOS.csv
EdgesTopKTanimotoNEGPOS.csv
NodesNEGPOS.csv
ReportPOSNEG.pdf
ResultsNEGPOS.xlsx
RTClusterNEGPOS.csv
StatTableNEGPOS.csv
Supplementarymaterials.docx
14:14:55","extension":"html","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":111837,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/e9b702593c4fb698c3c14bae.html"},{"id":96892421,"identity":"0235410f-9338-4127-be43-f264aee9e69e","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":598638,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eMS-Net processing workflow.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e (A) Input data files from LC-MS processing software (MZmine, MS-Dial) include feature peak tables and mass spectral similarity (MSS) networks. (B) Annotation confidence Levels are established by integrating experimental spectral library matches with in silico structure predictions, defining a comprehensive chemical search space. (C) Optional data reduction strategies include generic feature filters and MS-CleanR-based retention time clustering to remove redundant signals while preserving chemically informative ions. (D) High-confidence annotations (Levels 1, 2a, 3a) seed the MSS network for annotation propagation. Link Scores computed from Tanimoto similarities and in silico confidence metrics guide the selection of the most probable structure for each connected feature, with propagation iterating through the entire network. 
(E) Final annotations are enriched with chemical ontology classifications and cross-referenced database identifiers to facilitate integration with pathway analysis and multi-omics workflows.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"Fig1.png","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/1ecf6db6bd255f9f06d28b7f.png"},{"id":96919184,"identity":"3229e47a-0483-4aeb-a2e5-d585408bfb9d","added_by":"auto","created_at":"2025-11-27 14:13:18","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":1295749,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eMS-Net annotation pipeline results for the Cannabis metabolomics dataset.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e(A) Visualization of the raw chemical space comprising all putative in silico candidates (\u0026gt;118,000 structures) using t-SNE projection of PubChem molecular fingerprints. Chemical diversity is color-coded by NPClassifier pathway classification. (B) Feature filtering workflow: Ion Identity Network (IIN) removes redundant adducts and isotopes; retention time clustering (MS-CleanR algorithm) consolidates co-eluting features; final integration merges positive and negative ionization modes. (C) Evaluation of annotation prioritization strategies by analyzing the absolute distance between Tanimoto similarity and MSS cosine similarity. Density plots show improvement in concordance when selecting top k candidates: raw chemical space (mean distance = 0.55), top 1 only (mean = 0.3), top 10 (optimized), top 20, and top 10 combined with top 10 de novo structures. (D) Final annotated chemical space (1,275 compounds) visualized using t-SNE projection of PubChem fingerprints, showing improved chemical coherence compared to raw space. (E) Distribution of annotation confidence Levels across 1,297 features. (F) Annotation rank distribution. 
(G) Principal component analysis (PCA) score plot of the dataset (n = 18 samples, p = 1,297 features). (H) Heatmap of the top 60 discriminant features identified by sparse partial least squares discriminant analysis (sPLS-DA). (I) Natural product pathway distribution (NPClassifier) of discriminant features for each chemotype, revealing biochemical pathway enrichment patterns.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"Fig2MSnetWF.png","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/2ccfb585aa4985c0cacb3707.png"},{"id":96892423,"identity":"2a627cd2-1c0d-4278-9f8e-c6c039bfe400","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1128316,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eAnnotation prioritization within the cannabinoid mass spectral similarity network.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003eA) MSS network using modified cosine similarity \u0026gt; 0.75. Hashed black square highlights cannabinoids MSS subnetwork. Node colors represent annotation confidence Levels: Level 1 (green), Level 2 and 3a (light green), Level 3b in silico (orange), and de novo structures (violet). B) MSS subnetwork visualization of the cannabinoid cluster displaying annotation Levels and prioritization results. Analysis used α = 0.3 to favor structural-spectral similarity over in silico ranking. Edge widths scale with MS/MS cosine similarity (range: 0.75–1.0). C) Link score algorithm detail between CBC-A and node at m/z 301.145 displaying a cosine score of 0.86. 
Top 1, 2, and 50 hits are displayed with their respective structures, Tanimoto similarity with CBC-A, their Murcko scaffolds with the respective Tanimoto similarity, and the resulting final Link Score.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"Figure3MSSnetwork.png","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/4d68882626eafcd8e945aa7f.png"},{"id":96892425,"identity":"47f68c59-54b6-4ab0-98a1-77adfafc5d7d","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":346341,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eCannabinoid biosynthesis pathways elucidation through the similarity network. (A) Full Tanimoto structural similarity network (Tanimoto similarity \u0026gt; 0.8). \u003c/strong\u003e\u003c/em\u003e\u003cem\u003eNodes are colored according to main pathways based on the NPClassifier ontology: Alkaloids (dark green), Carbohydrates (brown), Polyketides (blue), Terpenoids (light green), Shikimates and Phenylpropanoids (orange), Amino acids and peptides (red), Fatty acids (purple) and Other compounds (grey). \u003c/em\u003e\u003cem\u003e\u003cstrong\u003e(B) Cannabinoids cluster.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e Subnetworks extracted from the full similarity network. The nodes are color-coded based on their precursor pathway: green for the olivetolic acid pathway, orange for the divarinic acid pathway, and blue for the orsellinic acid pathway. Yellow nodes represent cannabinoids not derived from the three main biosynthetic pathways, and grey nodes represent structurally related non-cannabinoid compounds. \u003c/em\u003e\u003cem\u003e\u003cstrong\u003e(C) Main cannabinoid biosynthesis pathways.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e Colored circles correspond to annotated compounds from the olivetolic acid pathway (green), the orsellinic acid pathway (blue) and the divarinic acid pathway (orange). 
Uncircled molecules were not found in the similarity network.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"figure4pathways.png","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/c532773984bd5fd0a512c76c.png"},{"id":97135392,"identity":"d01272f4-c1d3-4dea-ab0e-ad48a4f7a1ac","added_by":"auto","created_at":"2025-12-01 09:42:29","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3668178,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/d96a2d93-62b4-4a51-adcd-4109cbc1ffe9.pdf"},{"id":96892446,"identity":"529940ce-d6d7-4158-a4f9-7203759b4edc","added_by":"auto","created_at":"2025-11-27 09:33:05","extension":"csv","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":92874638,"visible":true,"origin":"","legend":"","description":"","filename":"AllAnnotationPOSNEG.csv","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/d62e89f70a0aca493874e4e3.csv"},{"id":96892429,"identity":"90b1e545-64e3-4e6e-a79d-0d21cf2facc4","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"csv","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":708808,"visible":true,"origin":"","legend":"","description":"","filename":"EdgestopKMSMSNEGPOS.csv","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/595efd7c018d4810985a5403.csv"},{"id":96892426,"identity":"9d684e44-9c30-4524-895a-a60998b33574","added_by":"auto","created_at":"2025-11-27 
09:33:03","extension":"csv","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":527220,"visible":true,"origin":"","legend":"","description":"","filename":"EdgesTopKTanimotoNEGPOS.csv","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/c38bc95e662458fee871c866.csv"},{"id":96892435,"identity":"4de5f75d-e260-4314-a05b-03b2922e49bb","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"csv","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":1160843,"visible":true,"origin":"","legend":"","description":"","filename":"NodesNEGPOS.csv","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/13d65cd9ae37eb999ef4b40f.csv"},{"id":96892444,"identity":"242a7cb7-894d-48e1-8af3-279a0ae4f41e","added_by":"auto","created_at":"2025-11-27 09:33:04","extension":"pdf","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":2649373,"visible":true,"origin":"","legend":"","description":"","filename":"ReportPOSNEG.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/27b60998f43f351d46a0e55c.pdf"},{"id":96892440,"identity":"47f7b738-d68b-4a2e-a4c0-38d2ea90ed1a","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"xlsx","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":5943034,"visible":true,"origin":"","legend":"","description":"","filename":"ResultsNEGPOS.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/78db21f1606365d3c84642e6.xlsx"},{"id":96892428,"identity":"0e8851f1-1f3e-4891-bfaa-275f36663f54","added_by":"auto","created_at":"2025-11-27 
09:33:03","extension":"csv","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":27721,"visible":true,"origin":"","legend":"","description":"","filename":"RTClusterNEGPOS.csv","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/0451fbf6eef3fbfde227769d.csv"},{"id":96919511,"identity":"930ed489-4dd6-4a5d-87d7-e932379c7be0","added_by":"auto","created_at":"2025-11-27 14:14:01","extension":"csv","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":476793,"visible":true,"origin":"","legend":"","description":"","filename":"StatTableNEGPOS.csv","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/550591d1fefeed6306a46fd4.csv"},{"id":96892433,"identity":"9d391a84-1895-40be-98de-d9bbc9cdc7ca","added_by":"auto","created_at":"2025-11-27 09:33:03","extension":"docx","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":15265,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarymaterials.docx","url":"https://assets-eu.researchsquare.com/files/rs-8174529/v1/f9378195727b0af654c66bbc.docx"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003eMS-Net: Multi-Similarity based network annotation for untargeted metabolomics\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe increasing application of metabolomics, both as a standalone discipline and integrated with other omics, underscores its maturity\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. Untargeted LC-MS fingerprints from biological matrix yield hundreds to thousands of features (\u003cem\u003emz\u003c/em\u003e x RT pairs). One of the main bottlenecks is properly annotating all detected signals with sufficient confidence to ensure robust biological contextualization (e.g. 
pathway enrichment analysis, multi-omics inference models\u0026hellip;).\u003c/p\u003e\u003cp\u003eA dereplication strategy involves signal deconvolution and deciphering of ion linkages using tools such as Ion Identity Network\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e or MS-CleanR\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e, followed by annotation relying on a \u0026ldquo;body of evidence\u0026rdquo; approach. The formalization of annotation confidence Levels was proposed by the Metabolomics Standards Initiative\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e and refined to account for new online or computational approaches\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. Level 1 relies on an authentic standard match, while Level 2 involves a spectral match from external libraries. The lower-confidence Levels cover \u003cem\u003ein silico\u003c/em\u003e approaches (Level 3), partial matches only (Level 4), and unknown spectroscopic signals (Level 5). The main limitation is the gap between publicly available mass spectral libraries and the chemical diversity encountered in biological samples. 
For instance, the last version of FragHub mass spectral library integration\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.5281/zenodo.17235587\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.17235587\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e, v12), encompasses two million spectra for 150,000 unique INCHIKEY, while chemical space produced by living organisms is estimated to be millions of molecules\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. As a result, experimental spectral matches (Levels 1 and 2) cover only 5 to 20% of all detected signals in untargeted experiments. The remaining signal annotations are based on chemical catalog queries using \u003cem\u003ein silico\u003c/em\u003e fragmentation approaches (Level 3). For each feature, putative matches are ranked according to a specific scoring system depending on the software used. For instance, MS-Finder\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e leverages rule-based fragmentation to predict fragments from SMILES structures, and MetFrag\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e employs combinatorial bond breaking. It ranks putative structures based on weighted sum fragment matches, intensity correlation, and neutral losses. Sirius-CSI\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e combines fragmentation trees with machine learning to predict molecular fingerprints, ranking candidates via Bayesian probability and fingerprint similarity. 
Identification accuracy ranges from 10 to 30% for the first hit (top 1) to 90% for the top 20 according to the CASMI challenge\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e and other benchmarks\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. As a result, despite the advancements in computational tools for metabolite annotation, these solutions may lead to numerous false annotations in untargeted LC-MS metabolomics, where hundreds to thousands of features must be annotated. Several strategies have been proposed to address this challenge based on mass spectral similarity networks provided by the GNPS facility\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eFor instance, Network Annotation Propagation (NAP)\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e leverages molecular network topology to re-rank structural candidates even without spectral library matches. Taxonomically Informed Metabolite Annotation (TIMA)\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e improves confidence in annotations by considering the taxonomic position of biological sources. ConCISE\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e integrates molecular networking, spectral library matching, and \u003cem\u003ein silico\u003c/em\u003e class predictions to provide accurate classifications for subnetworks. Inventa\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e calculates a priority score to highlight structural novelty within extracts. 
More recently, MS2DECIDE\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e aggregates data from GNPS, Sirius, and ISDB-LOTUS\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e using multi-criteria decision analysis to prioritize annotation and highlight feature novelty potential. Another avenue has been proposed using multilayer networks, which combine knowledge-based metabolic reaction networks, correlation networks, and mass spectral similarity to enable global metabolite annotation from knowns to unknowns\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. These approaches collectively aim to streamline feature identification by leveraging computational insights and expert knowledge to enhance the efficiency and confidence of MS/MS-based annotations.\u003c/p\u003e\u003cp\u003eAlthough these network-based strategies have enhanced annotation confidence, their adoption in routine metabolomics workflows faces practical barriers, including requirements for bioinformatics expertise, dependencies on web servers or dedicated computational infrastructure, and outputs with limited chemical metadata that restrict seamless integration with pathway enrichment analysis and/or multi-omics approaches\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eTo address this challenge, we propose MS-Net (Multi-Similarity Network-based annotation). This user-friendly workflow combines mass spectral and Tanimoto similarity networks with correlation analysis and taxonomic data to drive annotation prioritization. Additionally, MS-Net integrates positive and negative ionization chromatographic fingerprints with several filtering options. 
Finally, the workflow generates enriched annotation metadata and statistical tables, and maps the results onto mass spectral and Tanimoto similarity networks. This approach is completely offline and compatible with output files from widely used software in the field, such as MS-Dial, MZmine, Sirius-CSI, and MS-Finder (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e"},{"header":"Results","content":"\u003ch4\u003eWorkflow Architecture and Implementation\u003c/h4\u003e\n\u003cp\u003eMS-Net was developed using the Knime visual programming interface\u003csup\u003e21\u003c/sup\u003e and requires three primary inputs: (1) feature intensity tables (peak height or area), (2) mass spectral similarity (MSS) network edge lists, and (3) multi-Level annotation hits encompassing experimental library matches (Levels 1-2) and \u003cem\u003ein silico\u003c/em\u003e predictions (Levels 3-4). Together, these annotation hits define a putative chemical space that may comprise several thousand candidate structures per dataset (Figure 1A,B).\u003c/p\u003e\n\u003ch4\u003eAnnotation Confidence Scoring System\u003c/h4\u003e\n\u003cp\u003eThe workflow begins by normalizing confidence scores across annotation Levels to establish a unified ranking system. Level 1 receives the highest confidence score and is assigned to features showing concordance with authentic standards based on MS/MS spectral similarity (similarity \u0026gt; 0.95) and retention time agreement (\u0026Delta;RT \u0026lt; 0.2 min). Level 2 annotations derive from spectral library matches and are subdivided according to matching quality: Level 2a for high-confidence matches (similarity \u0026gt; 0.85) and Level 2b for moderate-confidence matches (similarity \u0026gt; 0.7). For \u003cem\u003ein silico\u003c/em\u003e annotations, MS-Net enriches structural candidates with taxonomic information by querying Coconut 2.0\u003csup\u003e22\u003c/sup\u003e. 
The workflow performs InChIKey-based matching to identify compounds reported from user-specified taxonomic sources (genus and family levels). Candidates with confirmed biosource origins or matching the target chemical class are assigned Level 3a, while remaining in silico matches receive Level 3b. Spectral library matches exhibiting lower similarity (similarity \u0026lt; 0.7) or significant precursor mass discrepancies are classified as MS/MS analogs (Level 4). Features lacking any structural annotation remain at Level 5 (unknown) (Figure 1B). Confidence scores are normalized to a 0\u0026ndash;100 scale to enable cross-Level comparison. The normalization scheme employs Level-specific transformations: Level 1 = 95 + (similarity score \u0026minus; 0.95) \u0026times; 100; Level 2a = 85 + (similarity score \u0026minus; 0.85) \u0026times; 100; Level 2b = 70 + (similarity score \u0026minus; 0.70) / 0.15 \u0026times; 14; Level 3b (in silico and de novo) = 40 + confidence score \u0026times; 40; Level 4 (MS/MS analogs) = 30 + (similarity score \u0026minus; 0.5) \u0026times; 90; Level 5 = 0. These transformations ensure that experimental matches consistently receive higher scores than computational predictions while preserving score discrimination within each Level.\u003c/p\u003e\n\u003ch4\u003eFeature Filtering and Data Reduction\u003c/h4\u003e\n\u003cp\u003ePrior to network-based annotation propagation, users may optionally apply filtering strategies to reduce dataset complexity while retaining chemically informative features. The workflow supports retention time clustering using the MS-CleanR\u003csup\u003e3\u003c/sup\u003e algorithm. Within each RT cluster, the most intense features and/or those with the highest network connectivity (degree) are preferentially retained, effectively removing redundant signals (Figure 1C). Optionally, a cosine filter threshold can be added to constrain the MSS network. 
Finally, putative annotations between two nodes may be filtered according to the XlogP calculated for each candidate. For each feature pair, the difference in XlogP is compared with the retention time difference along the edge. In C18 mode, only feature pairs exhibiting XlogP trends consistent with retention time order are retained.\u003c/p\u003e\n\u003ch4\u003eNetwork-Based Annotation Propagation\u003c/h4\u003e\n\u003cp\u003eMS-Net constructs a seed subnetwork comprising only high-confidence annotations (Levels 1, 2a, and 3a), which serves as the foundation for propagating structural assignments throughout the entire MSS network. To rank competing structural candidates for each feature pair connected in the MSS network, we developed a composite scoring metric that integrates spectral, structural, and computational evidence into a unified \u003cem\u003eLink Score\u003c/em\u003e. The structural similarity component employs two complementary Tanimoto measures calculated from molecular fingerprints: \u003cstrong\u003eTanimoto_Full\u003c/strong\u003e (based on Morgan, PubChem, or RDKit fingerprints) captures overall molecular similarity, including substituents, while \u003cstrong\u003eTanimoto_Murcko\u003c/strong\u003e (scaffold fingerprints) emphasizes core structural frameworks. These metrics are dynamically weighted according to their relative informativeness. When the absolute difference between these two measures exceeds 0.1 (|Tanimoto_Full - Tanimoto_Murcko| \u0026gt; 0.1), the higher value receives greater weight, reflecting either the dominance of substituent patterns (Tanimoto_Full) or core scaffold similarity (Tanimoto_Murcko). When this difference falls within \u0026plusmn;0.1, equal weights are applied. The resulting \u003cstrong\u003eCombined_Tanimoto\u003c/strong\u003e is then scaled by the MS/MS cosine similarity to produce the \u003cstrong\u003eAdjusted_Structural\u003c/strong\u003e score, which accounts for both molecular structure and spectral concordance. 
In parallel, an \u003cstrong\u003eInSilico_Combined\u003c/strong\u003e score is calculated as the arithmetic mean of confidence scores described above. The final Link Score integrates these two components through a user-tunable weighting parameter:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLink Score\u003c/strong\u003e = (1 - \u0026alpha;) \u0026times; \u003cstrong\u003eAdjusted_Structural\u003c/strong\u003e + \u0026alpha; \u0026times; \u003cstrong\u003eInSilico_Combined\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe parameter \u0026alpha; allows users to control the relative contributions of spectral-structural evidence versus \u003cem\u003ein silico\u003c/em\u003e prediction scores. Lower \u0026alpha; values (e.g., 0.3) prioritize structural and spectral similarity, while higher values give more weight to computational prediction confidence. For each feature pair in the MSS network, the workflow evaluates all candidate structures and selects the annotation with the highest Link Score. This process propagates iteratively from high-confidence seeds to their direct neighbors and subsequently through the entire connected network. Features that remain outside the MSS network are ranked using a simplified metric combining their best \u003cem\u003ein silico\u003c/em\u003e score with the maximum Tanimoto similarity to any annotated compound within the MSS network.\u003c/p\u003e\n\u003ch4\u003eMetadata Enrichment and Output Generation\u003c/h4\u003e\n\u003cp\u003eThe final annotated feature list is optionally enriched with chemical metadata using ClassyFire and NPClassifier ontologies, providing hierarchical chemical classifications (kingdom, superclass, class, subclass). 
Database identifiers for each annotated feature are retrieved using the Chemical Translation Service\u003csup\u003e23\u003c/sup\u003e, ensuring compatibility with downstream pathway enrichment tools and multi-omics integration platforms (Figure 1E).\u003c/p\u003e\n\u003cp\u003eMS-Net generates four primary outputs: (1) a comprehensive annotated feature table with confidence Levels, structural information, and chemical ontology; (2) a feature height/area table; (3) an MSS network edge table for visualizing spectral similarity relationships; and (4) a Tanimoto-based network edge table connecting structurally related compounds, including links between unknown features and their nearest annotated structural neighbors. This latter output enables users to infer structural motifs for unannotated features. Both networks are enriched with putative chemical reactions between two neighboring nodes using a delta mass match to a predefined list from Metanetter 2\u003csup\u003e24\u003c/sup\u003e. Optionally, features acquired in positive and negative ionization modes can be merged based on user-defined retention time and \u003cem\u003em/z\u003c/em\u003e tolerances.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eApplication to Cannabis Metabolomics Dataset\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eInflorescences of three medical-grade \u003cem\u003eCannabis sativa\u003c/em\u003e L. chemotypes were selected to evaluate MS-Net\u0026apos;s annotation capabilities. Cannabis represents an ideal model system for several reasons. First, the species exhibits remarkable chemical diversity, encompassing a wide array of cannabinoids, terpenes, and phenolic compounds (primarily flavonoids and hydroxycinnamic acids)\u003csup\u003e25\u0026ndash;27\u003c/sup\u003e. 
The phytochemistry of cannabis is well-documented, with numerous metabolomic studies demonstrating robust discrimination among cultivars and chemotypes\u003csup\u003e28\u0026ndash;30\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eThe traditional morphology-based classification (Indica \u003cem\u003evs.\u003c/em\u003e Sativa) has been superseded by a chemotype system based on the relative concentrations of \u0026Delta;9-tetrahydrocannabinol (THC) and cannabidiol (CBD). This framework defines three main chemotypes: Type I (THC-predominant, \u0026lt;0.5% CBD), Type II (balanced THC:CBD ratio), and Type III (CBD-predominant, \u0026lt;1% THC). Although expanded classifications include Type IV (cannabigerol-predominant) and Type V (cannabinoid-free), Types I\u0026ndash;III remain the most extensively characterized\u003csup\u003e31,32\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eCrucially for algorithm evaluation, cannabinoids exhibit exceptional structural diversity\u0026mdash;estimated at 120\u0026ndash;150 distinct phytocannabinoids\u0026mdash;while sharing highly similar core molecular scaffolds\u003csup\u003e33\u003c/sup\u003e. For instance, the major cannabinoids THC, CBD, and cannabichromene (CBC) are constitutional isomers (C₂₁H₃₀O₂) differing only in cyclization patterns: THC features a pyran ring, CBD a cyclohexene ring, and CBC a benzopyran structure. This structural similarity, combined with well-elucidated biosynthetic pathways, provides an ideal benchmark for evaluating network-based annotation algorithms. The biosynthetic pathway initiates with cannabigerolic acid (CBGA)\u0026mdash;formed by CBGA Synthase-catalyzed condensation of olivetolic acid and geranyl pyrophosphate\u0026mdash;which serves as the universal precursor for nearly all other cannabinoids\u003csup\u003e34\u003c/sup\u003e. 
Additionally, \u003cem\u003ein planta\u003c/em\u003e, cannabinoids exist predominantly as carboxylic acids (THCA, CBDA, CBCA), with decarboxylation to neutral forms occurring upon heating.\u003c/p\u003e\n\u003cp\u003eTo demonstrate MS-Net\u0026apos;s capabilities, we applied the workflow to a comprehensive untargeted metabolomics study of \u003cem\u003eCannabis sativa\u003c/em\u003e L., analyzing three distinct chemotypes: Bedrocan\u0026reg; (THC-dominant, Type I), Bedrolite\u0026reg; (CBD-dominant, Type III), and Bediol\u0026reg; (THC/CBD-balanced, Type II). Initial data processing detected 2,595 features across positive and negative ionization modes.\u003c/p\u003e\n\u003ch4\u003eFeature Filtering and Chemical Space Reduction\u003c/h4\u003e\n\u003cp\u003eWe applied a sequential filtering strategy to reduce dataset complexity while retaining chemically meaningful signals. First, Ion Identity Networking identified and collapsed redundant adducts and isotopes by selecting the most informative precursor ions ([M+H]⁺\u0026nbsp;or [M-H]⁻) from each ion cluster. Subsequently, MS-CleanR-based retention time clustering further consolidated co-eluting features (Figure 2B). This filtering reduced the dataset to 1,297 unique features while maintaining comprehensive chemical coverage.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eIn silico\u003c/em\u003e annotation using Sirius-CSI (top 50 candidates per feature) and MSNovelist (top 20 \u003cem\u003ede novo\u003c/em\u003e structures per feature) generated a putative chemical space encompassing more than 118,000 candidate structures (Figure 2A). 
This expansive search space highlights the challenge of confident structural assignment in untargeted metabolomics: without prioritization strategies, the likelihood of selecting incorrect annotations from such large candidate pools is substantial.\u003c/p\u003e\n\u003ch4\u003eNetwork-Based Annotation Prioritization\u003c/h4\u003e\n\u003cp\u003eWe seeded the MSS network using Level 1 (authentic standards), Level 2a (high-confidence spectral matches), and Level 3a annotations (taxonomically informed candidates from the Cannabis genus, Cannabaceae family, or cannabinoid chemical class). PubChem-based fingerprint was selected to calculate Tanimoto similarities. The Link Score algorithm was configured with \u0026alpha; = 0.3 to prioritize structural and spectral similarity over raw \u003cem\u003ein silico\u003c/em\u003e ranking scores.\u003c/p\u003e\n\u003cp\u003eTo evaluate the algorithm\u0026apos;s performance, we examined the agreement between MSS network topology (cosine similarity) and structural similarity (Tanimoto scores). Before annotation prioritization, the raw chemical space exhibited a mean absolute distance of 0.55 between these metrics, with the highest density occurring between 0.6 and 0.8 (Figure 2C). This discrepancy reflects that spectral similarity does not always correlate with structural similarity, particularly when \u003cem\u003ein silico\u003c/em\u003e tools generate diverse candidates. Restricting to only the top-ranked \u003cem\u003ein silico\u003c/em\u003e candidate (top 1) dramatically reduced the mean distance to 0.3, but at the cost of excluding potentially correct structures ranked lower. Expanding to the top 10 or top 20 candidates achieved better performance, with maximum density centered around 0.2, indicating strong agreement between spectral and structural similarities. 
Notably, incorporating the top 10 \u003cem\u003ede novo\u003c/em\u003e structures from MSNovelist further improved concordance, suggesting that machine learning-generated candidates can complement database-constrained searches for features representing novel or underrepresented chemical scaffolds. Finally, the top 50 \u003cem\u003ein silico\u003c/em\u003e and top 20 \u003cem\u003ede novo\u003c/em\u003e candidates per feature were selected for annotation prioritization.\u003c/p\u003e\n\u003ch4\u003eGlobal Dataset Annotation and Chemical Space Coverage\u003c/h4\u003e\n\u003cp\u003eMS-Net reduced the initial chemical space from more than 118,000 candidates to 1,275 confidently annotated compounds across 1,297 features (Figure 2D). Analysis of annotation rank distribution revealed that 47% of features were assigned their top-ranked \u003cem\u003ein silico\u003c/em\u003e candidate, indicating strong agreement between computational predictions and network-guided prioritization. An additional 30% of annotations fell within ranks 2\u0026ndash;20, demonstrating the algorithm\u0026apos;s ability to rescue correct structures initially ranked lower due to limitations in \u003cem\u003ein silico\u003c/em\u003e fragmentation models. The remaining 23% of annotations were ranked beyond position 20 (Figure 2F).\u003c/p\u003e\n\u003cp\u003eThe final annotation distribution by confidence Level showed: 9 Level 1 (authentic standards), 58 Level 2a (high-confidence spectral matches), 31 Level 2b (moderate-confidence spectral matches), 43 Level 3a (taxonomically informed \u003cem\u003ein silico\u003c/em\u003e annotations), 1,051 Level 3b \u003cem\u003ein silico\u003c/em\u003e and 71 Level 3b \u003cem\u003ede novo\u003c/em\u003e matches, 4 Level 4 (MS/MS analogs), and 26 Level 5 (unknown) (Figure 2E). 
This distribution reflects the typical annotation coverage achievable in specialized metabolomics studies, where experimental spectral libraries cover only a fraction of detected features, necessitating extensive \u003cem\u003ein silico\u003c/em\u003e inference.\u003c/p\u003e\n\u003ch2\u003eChemotype Discrimination\u003c/h2\u003e\n\u003cp\u003ePrincipal component analysis (PCA) of the annotated feature matrix (n = 18 samples, p = 1,297 features) revealed clear separation of the three cannabis chemotypes, with the first two principal components explaining 90% of total variance (Figure 2G). Sparse partial least squares discriminant analysis (sPLS-DA) identified 60 discriminant features that robustly distinguished the chemotypes (Figure 2H).\u003c/p\u003e\n\u003cp\u003eChemical ontology classification using NPClassifier revealed distinct natural product pathway enrichments for each chemotype (Figure 2I). Bedrocan\u0026reg; (THC-dominant) exhibited enrichment in phenylpropanoid and terpenoid pathways, consistent with high levels of THC and related cannabinoids. Bedrolite\u0026reg; (CBD-dominant) showed elevated levels of amino acid derivatives and shikimate pathways. Bediol\u0026reg; (balanced THC/CBD) displayed an intermediate metabolite profile with enrichment in polyketide derivatives.\u003c/p\u003e\n\u003ch4\u003eCase Study: Cannabinoid Subnetwork Annotation\u003c/h4\u003e\n\u003cp\u003eThe initial MSS network is seeded with Level 1, 2a, and 3a annotations (green dots, Figure 3A). The cannabinoid MSS subnetwork illustrates the algorithm\u0026apos;s discriminative power (Figure 3B). MS-Net successfully prioritized structurally coherent annotations, including close derivatives differing primarily in hydroxylation patterns, methyl substitutions, or double bond positions, chemical variations consistent with known cannabinoid biosynthetic pathways. 
Two \u003cem\u003ede novo\u003c/em\u003e structures from MSNovelist were also integrated, representing potential novel cannabinoid scaffolds warranting further investigation (Figure 3B). The MS-Net prioritization algorithm is illustrated for the pair formed by cannabichromenic acid (CBC-A, Level 1 authentic standard) and its close neighbor, a Level 3b annotation (Figure 3C). This neighbor displays a pseudomolecular ion at \u003cem\u003em/z\u003c/em\u003e 301.145 (molecular formula C₁₈H₂₂O₄, 0.9 ppm mass accuracy), which matched 70 putative \u003cem\u003ein silico\u003c/em\u003e candidates. Reliance solely on the top-ranked \u003cem\u003ein silico\u003c/em\u003e candidate from Sirius-CSI would have resulted in an incorrect structural assignment. However, the Link Score algorithm identified cannabiorcichromenic acid, originally ranked 50th, as the most probable annotation based on its high Tanimoto similarity to CBC-A, which is consistent with the MSS cosine similarity of 0.89.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExploitation of the Tanimoto structural similarity network through the study of the cannabinoid biosynthesis pathway\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFigure 4 illustrates how the MS-Net workflow efficiently clusters metabolites according to their structural similarities, providing a comprehensive view of the metabolic landscape. The upper panel (A) presents the complete Tanimoto structural similarity network. Within this network, each node represents a distinct metabolite, and each edge represents the Tanimoto similarity score between the two metabolites it connects. The organization of the network reveals well-defined clusters of metabolites corresponding to distinct biosynthetic routes, reflecting the algorithm\u0026rsquo;s capacity to capture pathway-level organization within complex metabolomic data. Overall, the similarity network is organized into three main subnetworks, five large clusters, and several smaller satellite clusters. 
An expanded sub-network (B) was isolated by selecting the cannabinoids and their precursors, as well as their close neighbors. This network highlights a local region of the network, illustrating how cannabinoids and precursors with related chemical structures or biosynthetic origins are tightly interconnected. The lower panel (C) focuses on the three main biosynthetic pathways that specifically yield cannabinoids: the olivetolic acid, orsellinic acid, and divarinic acid pathways. Each pathway is represented as a sequence of enzymatic and chemical reactions converting early intermediates into key cannabinoids. The three precursors associated with the three main biosynthetic pathways were detected within the extracts, along with the majority of metabolites produced via the olivetolic acid pathway and a subset originating from the divarinic and orsellinic acid pathways.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eMS-Net was designed to streamline the LC-MS-based untargeted metabolomic workflow from raw data acquisition to a complete annotated table enriched with metadata based on outputs from processing software commonly used in the field. Our results demonstrate that MS-Net not only provides confident structural annotations but also generates biologically interpretable outputs with enriched metadata suitable for downstream pathway enrichment analysis and multi-omics integration.\u003c/p\u003e\u003cp\u003eThe core algorithm of MS-Net leverages Tanimoto similarity metrics starting from confidently annotated nodes (Level 1, 2a, and 3a), which serve as seeds to propagate annotations into the MSS networks. Consequently, the results will be impacted by the quality of annotation used as input. Annotation Level 1 is dependent on the internal library and generally limited to synthetic standards. Annotation Level 2a, encompassing high experimental MS/MS similarity scores, will be influenced by the quality and coverage of the mass spectral library used. 
Initiatives such as FragHub\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e, MSnLib\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e, MassBank\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e, or GNPS\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e have dramatically improved matching results. However, metabolome coverage of mass spectral libraries is still limited compared to natural product catalogs\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e. To extend the set of putative seeds within the MSS network, we implemented InChIKey-based matching against the Coconut 2.0\u003csup\u003e22\u003c/sup\u003e chemical catalog or a user-selected chemical ontology, which increased the number of seeds to more than 90 compounds in the case of \u003cem\u003eCannabis\u003c/em\u003e extract profiling (Supplementary data: ResultsNEGPOS.xlsx). This well-studied plant\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e is well suited to the MS-Net approach, whereas annotation results for understudied organisms may be less accurate.\u003c/p\u003e\u003cp\u003eThe type of molecular fingerprint used strongly influences the Tanimoto-based similarity result\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e. 
Three contrasting types of fingerprints are available within MS-Net. Morgan fingerprints capture local circular substructures (radius of 2 by default) and provide high discriminative power for structural similarity; RDKit (Daylight-like) fingerprints encode molecular paths through the graph, providing a balanced general-purpose representation; and PubChem fingerprints use 881 predefined substructural features that tend to emphasize common functional groups and yield higher similarity scores between diverse molecules. PubChem fingerprints were used for the present study and, in our experience, Morgan or PubChem molecular fingerprints allowed better detection of subtle substructure changes than the RDKit type.\u003c/p\u003e\u003cp\u003eThe MS-Net approach is also highly dependent on the MSS network topology. A poorly clustered network, constructed with a low similarity threshold (e.g., 0.6), may yield spurious results. For the \u003cem\u003eCannabis\u003c/em\u003e dataset, the modified cosine score algorithm with a threshold of 0.7 was used, which provided a well-balanced, clustered MSS network (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). Still, changing the spectral clustering measure, for example to Bonanza as available in MS-DIAL, impacts the final results (data not shown).\u003c/p\u003e\u003cp\u003eFor the Cannabis dataset, the homogeneous distribution of compounds across biosynthetic pathways within the similarity network demonstrates that MS-Net is effective not only in establishing structural relationships among compounds but also in grouping them coherently (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA). Among the annotated compounds, cannabinoids represent a chemically important class. 
Cannabinoids are secondary metabolites biosynthesized in the glandular trichomes of female cannabis inflorescences through three well-characterized biosynthetic routes: the olivetolic acid pathway, the divarinic acid pathway, and the orsellinic acid pathway. The MSNet annotation workflow successfully identified the three precursors of these metabolic routes (olivetolic, divarinic, and orsellinic acids), along with a series of downstream cannabinoids. The olivetolic acid pathway, for example, was almost fully reconstructed, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC. Notably, compounds with nearly identical structures were accurately annotated, such as Δ⁹-THC and Δ⁸-THC\u0026mdash;two isomers with identical molecular masses. Three cannabinoids associated with the other two biosynthetic routes were annotated. It is important to emphasize that these results reflect the metabolic fingerprint of the polar phases of cannabis inflorescence extracts at the time of extraction. Consequently, cannabinoids that were not annotated (non-circled) may have been absent from the extracts due to degradation (either in the extracts or in the inflorescences) or may not have been biosynthesized by the plant. (\u003cem\u003eIn planta\u003c/em\u003e, cannabinoids are predominantly produced in their acidic forms and are subsequently converted into their neutral analogues through decarboxylation induced by light or heat.) Overall, these findings demonstrate that MSNet is an effective tool to capture known biosynthetic pathways among the initial annotation chemical space.\u003c/p\u003e"},{"header":"Material and methods","content":"\u003cp\u003e\u003cstrong\u003eSample preparation process\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThree cultivars of medicinal\u0026nbsp;\u003cem\u003eC. sativa\u003c/em\u003e L. 
dried female inflorescences were purchased from Bedrocan International (Veendam, Netherlands) including a THC-dominant type Bedrocan\u0026reg; variety (batch: 20C30EY20E13), a CBD-dominant type Bedrolite\u0026reg; variety (batch: 20I14FR20L02) and a THC/CBD-intermediate type Bediol\u0026reg; variety (batch: 19L16FB20K04) according to OMC (Office of Medicinal Cannabis, Netherlands) and ANSM (Agence Nationale de S\u0026eacute;curit\u0026eacute; du M\u0026eacute;dicament et des produits de sant\u0026eacute;, France) requirements and authorization. Samples were stored in airtight containers in the dark at 25 \u0026deg;C.\u0026nbsp;\u003cbr\u003e\u0026nbsp;For each variety, 500 mg of material were weighed, flash-frozen in liquid nitrogen, and finely ground using a pre-chilled mortar and pestle. The resulting frozen powder was extracted at 25 \u0026deg;C using a biphasic solvent system. Briefly, 1.5 mL of a methyl tert-butyl ether/methanol (75:25 v/v) mixture was added to the sample, followed by sonication in an ultrasonic bath for 3 min. Subsequently, 1.5 mL of a water/methanol (75:25 v/v) mixture was added, and the sample was sonicated again under identical conditions. The extracts were then centrifuged at 240 g for 3 min at 5 \u0026deg;C and the resulting supernatants were collected. Both the polar and apolar fractions were evaporated using a SpeedVac SPD vacuum concentrator (Thermo Fisher Scientific) at 35 \u0026deg;C and 50 mbar. The dried residues were subsequently reconstituted in acetonitrile/water (50:50 v/v) to a final concentration of 10 mg/mL. Extractions were performed in triplicate. 
The extracts were stored at -20 \u0026deg;C in amber glass bottles between uses.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStandards preparation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA Level 1 library was constructed by injecting analytical-grade cannabinoid standards. Commercially available solutions at 1 mg/mL of cannabigerolic acid (CBGA), cannabigerol (CBG), cannabidiolic acid (CBDA), cannabidiol (CBD), cannabichromenic acid (CBCA), cannabichromene (CBC), \u0026Delta;-9-tetrahydrocannabinolic acid (d9THCA), \u0026Delta;-9-tetrahydrocannabinol (d9THC), \u0026Delta;-8-tetrahydrocannabinol (d8THC), tetrahydrocannabivarin (THCV), cannabidivarinic acid (CBDVA), cannabidivarin (CBDV), and cannabicyclol (CBL) were purchased from Cerilliant (Austin, Texas, USA). Cannflavin A at 1 mg/mL was obtained from Sigma-Aldrich (Darmstadt, Germany). All standard solutions were diluted to a concentration of 12 \u0026mu;g\u0026middot;mL⁻\u0026sup1; in acetonitrile.\u003c/p\u003e\n\u003ch4\u003eUHPLC-HRMS/MS analysis\u0026nbsp;\u003c/h4\u003e\n\u003cp\u003eChromatographic separation was carried out on a Vanquish UHPLC system (Thermo Fisher Scientific, Waltham, MA, USA) equipped with a Luna Omega Polar C18 analytical column (150 \u0026times; 2.1 mm, 1.6 \u0026micro;m; Phenomenex, Torrance, CA, USA). The chromatographic system was coupled with a Vanquish Diode Array Detector (DAD), a Charged Aerosol Detector (CAD) and a Q Exactive Plus mass spectrometer (Thermo Fisher Scientific, Waltham, MA, USA). A dual-pump system was installed, where the first pump managed the separation at the column level, and the second pump generated a counter-gradient at the column outlet to maintain a constant mixture of 50% water and 50% acetonitrile. 
The separation was performed at a flow rate of 0.4 mL\u0026middot;min⁻\u0026sup1; and at a column oven temperature of 40 \u0026deg;C, using a mobile phase composed of H₂O + 0.05% formic acid (A) and acetonitrile + 0.05% formic acid (B). A 1 \u0026micro;L injection of extracts at 10 mg\u0026middot;mL⁻\u0026sup1; was introduced into the system. The applied gradient was as follows: 0 to 0.5 min, 2% B; 0.5 to 18 min, 98% B; 18 to 21 min, 98% B; 21 to 21.5 min, 2% B; and 21.4 to 24 min, 2% B.\u003c/p\u003e\n\u003cp\u003eMass spectrometry detection was performed using a Q Exactive Plus instrument equipped with a heated electrospray ionization source (HESI-II) operating in positive and negative ionization modes with a resolution of 35,000 (MS1) and 17,500 (MS2). The collision energy was set to 10, 20, and 40 eV in stepped mode for data-dependent MS/MS acquisition. The capillary temperature was 300\u0026deg;C, and the mass scan range was 100\u0026ndash;1,500 m/z for both MS1 and MS2. Data-dependent acquisition targeted the four most intense precursor ions per scan cycle.\u003c/p\u003e\n\u003ch4\u003eData Processing and Feature Detection\u003c/h4\u003e\n\u003cp\u003eRaw LC-MS data files were processed using MZmine (version 4.7, mzio GmbH). Mass detection was performed separately for MS1 and MS2 scans using the \u0026quot;Factor of lowest signal\u0026quot; algorithm with noise factors of 5.0 and 2.5, respectively. Chromatographic peaks were built using the ADAP chromatogram builder with the following parameters: minimum consecutive scans = 4, minimum intensity for consecutive scans = 10,000, minimum absolute height = 50,000, and \u003cem\u003em/z\u003c/em\u003e tolerance of 0.002 Da or 10 ppm (scan-to-scan). Chromatograms were smoothed using Savitzky-Golay filtering (5-scan window in the retention time dimension). 
Feature deconvolution was performed using the local minimum resolver with a minimum absolute height of 2.E6 and 1.E6 in positive and negative mode respectively, peak top-to-edge ratio of 1.8, peak duration range of 0.04\u0026ndash;3.0 min, and minimum of 5 data points per feature. MS/MS spectra were paired to features using a precursor \u003cem\u003em/z\u003c/em\u003e tolerance of 0.01 Da or 10 ppm, retention time tolerance matching feature edges, and minimum relative feature height of 0.25.\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eIsotope grouping was performed using the \u0026sup1;\u0026sup3;C isotope filter with \u003cem\u003em/z\u003c/em\u003e tolerance of 0.005 Da or 10 ppm, retention time tolerance of 0.05 min, monotonic shape validation, and maximum charge state of 2. Ion Identity Networking (IIN) was performed to identify adducts, in-source fragments, and neutral losses. Network refinement was applied with minimum network size of 1, deletion of networks lacking monomer ions, and removal of networks with fewer than 2 links. Retention time correction was performed using aligned features as reference points (\u003cem\u003em/z\u003c/em\u003e tolerance: 0.005 Da or 10 ppm; RT tolerance: 0.1 min; minimum standard intensity: 500,000). Feature lists were aligned using the join aligner with \u003cem\u003em/z\u003c/em\u003e tolerance of 0.0015 Da or 10 ppm (sample-to-sample), RT tolerance of 0.15 min, and \u003cem\u003em/z\u003c/em\u003e weight-to-RT weight ratio of 3:1. Alignment incorporated isotope pattern comparison (minimum score: 0.8) and MS/MS spectral similarity (weighted cosine \u0026ge; 0.7) when available. Gap filling was performed using the multithreaded peak finder with intensity tolerance of 20%,\u0026nbsp;\u003cem\u003em/z\u003c/em\u003e tolerance of 0.002 Da or 10 ppm, RT tolerance of 0.1 min, and minimum of 3 scans per feature. 
Duplicate features were filtered using an averaging mode with \u003cem\u003em/z\u003c/em\u003e tolerance of 0.005 Da or 10 ppm and RT tolerance of 0.1 min. Blank subtraction was applied requiring minimum detection in 2 blank samples, with features retained only if they showed \u0026ge;2-fold higher intensity in samples compared to blanks (based on maximum blank intensity). Final feature filtering retained only features meeting all of the following criteria: detected in at least 6 samples within any sample class (out of 6 replicates per chemotype), presence of at least 2 isotopic peaks with validated \u0026sup1;\u0026sup3;C pattern and presence of MS/MS spectrum. Spectral networking was performed using the modified cosine algorithm with the following parameters:\u0026nbsp;\u003cem\u003em/z\u003c/em\u003e tolerance = 0.003 Da or 10 ppm, maximum precursor \u003cem\u003em/z\u003c/em\u003e delta = 600 Da, minimum matched signals = 4, minimum cosine similarity = 0.6. Our in-house library of cannabinoid standards was searched using weighted cosine similarity (MassBank weighting) with the following parameters: minimum cosine similarity = 0.85, precursor \u003cem\u003em/z\u003c/em\u003e tolerance = 0.01 Da or 20 ppm, spectral \u003cem\u003em/z\u003c/em\u003e tolerance = 0.015 Da or 25 ppm, RT tolerance = 0.2 min, and minimum matched signals = 4. Before matching, precursor ions were removed, \u0026sup1;\u0026sup3;C deisotoping was applied (\u003cem\u003em/z\u003c/em\u003e tolerance: 0.001 Da or 10 ppm, monotonic shape, max charge: 2), and spectra were cropped to \u003cem\u003em/z\u003c/em\u003e overlap regions. The feature quantification table, mass spectral similarity network and annotation table were exported for further MS-Net processing. 
Processed MS/MS data were exported in .mgf format for further SIRIUS processing.\u003c/p\u003e\n\u003cp\u003eThe corresponding MZbatch files used for MZmine processing are available here: https://zenodo.org/records/17669288\u003c/p\u003e\n\u003ch4\u003eSirius-CSI annotation\u0026nbsp;\u003c/h4\u003e\n\u003cp\u003eData in MGF format were imported into SIRIUS-CSI (v6.3, Lehrstuhl Bioinformatik Jena). Molecular formula prediction was performed using pre-configured Orbitrap parameters with allowed elements limited to C, H, N, O, P, and S, using a \u0026quot;database search\u0026quot; strategy. CSI:FingerID structural prediction queried PubChem as the primary database with bio-databases enabled for further taxonomically enriched matches. FragHub spectral databases were included for experimental spectral matching and analog detection using a precursor \u003cem\u003em/z\u0026nbsp;\u003c/em\u003etolerance of 20 ppm. Only features with confidence scores below 0.2 in the initial annotation round were selected for subsequent \u003cem\u003ede novo\u0026nbsp;\u003c/em\u003estructure prediction using MSNovelist. Results were exported in CSV format with quoted strings, retaining the top 50 ranked candidates per feature for both \u003cem\u003ein silico\u003c/em\u003e and \u003cem\u003ede novo\u003c/em\u003e matches.\u0026nbsp;\u003c/p\u003e\n\u003ch4\u003eMS-Net parameters\u003c/h4\u003e\n\u003cp\u003eOutput files from MZmine (feature quantification table, MSS network edge list, spectral library annotations) and SIRIUS-CSI (top 50 \u003cem\u003ein silico\u003c/em\u003e candidates and top 20 \u003cem\u003ede novo\u003c/em\u003e structures from MSNovelist per feature) were imported into MS-Net (developed in KNIME Analytics Platform, v5.2). Features matching authentic standards (spectral cosine \u0026gt; 0.95, \u0026Delta;RT \u0026lt; 0.2 min) were assigned Level 1. 
Spectral library matches were stratified as Level 2a (cosine \u0026gt; 0.85) or 2b (0.7 \u0026lt; cosine \u0026le; 0.85). \u003cem\u003eIn silico\u003c/em\u003e candidates were enriched with taxonomic information by querying Coconut 2.0 (linked to NCBI taxonomy) using InChIKey-based matching. Among the top 10 \u003cem\u003ein silico\u003c/em\u003e candidates per feature, those matching taxonomic criteria (\u003cem\u003eCannabis\u003c/em\u003e genus, Cannabaceae family) or classified as cannabinoids by NPClassifier were elevated to Level 3a; remaining \u003cem\u003ein silico\u003c/em\u003e matches received Level 3b. Spectral library matches with cosine \u0026lt; 0.7 or significant precursor mass errors were classified as Level 4 (MS/MS analogs). Features were filtered using Ion Identity Network (IIN) results to remove redundant adducts and isotopes, followed by MS-CleanR-based RT clustering (\u0026Delta;RT \u0026le; 0.01 min). Within each cluster, the top 2 features by network degree and the top 2 by peak intensity were retained. The MSS network was constrained to edges with cosine similarity \u0026ge; 0.7 and \u0026Delta;RT \u0026le; 8 min between connected nodes. High-confidence annotations (Levels 1, 2a, 3a) seeded the MSS network for iterative annotation propagation. For each feature pair, candidate structures were ranked using the Link Score formula: \u003cem\u003eLink Score = (1 - \u0026alpha;) \u0026times; Adjusted_Structural + \u0026alpha; \u0026times; InSilico_Combined,\u0026nbsp;\u003c/em\u003ewhere \u003cem\u003eAdjusted_Structural\u0026nbsp;\u003c/em\u003eintegrates dynamically weighted Tanimoto similarities (full-molecule and Murcko scaffold fingerprints-PubChem) modulated by MS/MS cosine similarity, and \u003cem\u003eInSilico_Combined\u003c/em\u003e represents the mean of available \u003cem\u003ein silico\u003c/em\u003e confidence scores. 
The weighting parameter was set to \u0026alpha; = 0.3, prioritizing structural-spectral evidence (70%) over \u003cem\u003ein silico\u0026nbsp;\u003c/em\u003eranking (30%). The top 5 candidates by Link Score were retained per feature before iterating over the entire MSS network. Redundant annotations with identical InChIKey identifiers and a Pearson correlation across samples \u0026gt; 0.7 were consolidated by selecting the candidate with the highest mean peak height. Positive and negative mode feature lists were merged using \u0026Delta;RT \u0026le; 0.05 min, \u0026Delta;\u003cem\u003em/z\u0026nbsp;\u003c/em\u003e\u0026le; 0.002 Da, and a Pearson correlation \u0026ge; 0.6 across sample intensities. Final annotations were enriched with chemical ontology classifications from ClassyFire (kingdom, superclass, class, subclass) and NPClassifier (pathway, superclass, class). Database identifiers (PubChem CID, KEGG, HMDB, ChEBI) were retrieved using the Chemical Translation Service. Natural product-likeness scores were calculated using the NPlikeness calculator. Structural similarity networks were constructed using a Tanimoto coefficient \u0026ge; 0.8, retaining the top 2 nearest neighbors per feature.\u003c/p\u003e\n\u003ch4\u003eStatistical analysis\u003c/h4\u003e\n\u003cp\u003eMultivariate data analysis was performed using the mixOmics R package v6.3.2 through the MetaboStat_AgX Shiny interface (available at https://zenodo.org/records/17352817). Prior to analysis, missing values were imputed using the global minimum intensity value divided by 10. Feature intensities were normalized using total sum normalization (TSN), followed by unit variance scaling (autoscaling). Principal component analysis (PCA) was performed to assess overall variance structure and sample clustering. Partial least squares discriminant analysis (PLS-DA) models were validated using 10-fold cross-validation repeated 100 times to assess classification performance and guard against overfitting. 
Model quality was evaluated using the classification error rate. Sparse PLS-DA (sPLS-DA) was applied to identify discriminant features, selecting the top 30 features per component (60 total across two components) based on variable importance in projection (VIP) scores. Hierarchical clustering (Euclidean distance, Ward\u0026apos;s linkage) of the top 60 discriminant features generated the clustered heatmap visualization. Chemical ontology enrichment analysis was performed on features characteristic of each chemotype according to their coefficient score using NPClassifier pathway classifications. The results were visualized as pie charts showing pathway distribution per chemotype.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMS-Net workflow, tutorials and MZbatch files used for this study are available here: https://zenodo.org/records/17669288\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eRaw LC-MS data acquired in positive and negative mode are available here:\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003ehttps://zenodo.org/records/17671960\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBorges, R. M. \u003cem\u003eet al.\u003c/em\u003e Quantum Chemistry Calculations for Metabolomics: Focus Review. \u003cem\u003eChem. Rev. \u003c/em\u003e\u003cstrong\u003e121\u003c/strong\u003e, 5633\u0026ndash;5670 (2021). \u003c/li\u003e\n\u003cli\u003eSchmid, R. \u003cem\u003eet al.\u003c/em\u003e Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. \u003cem\u003eNat. Commun. \u003c/em\u003e\u003cstrong\u003e12\u003c/strong\u003e, 3832 (2021). \u003c/li\u003e\n\u003cli\u003eFraisier-Vannier, O. \u003cem\u003eet al.\u003c/em\u003e MS-CleanR: A Feature-Filtering Workflow for Untargeted LC\u0026ndash;MS Based Metabolomics. 
\u003cem\u003eAnal. Chem. \u003c/em\u003e\u003cstrong\u003e92\u003c/strong\u003e, 9971\u0026ndash;9981 (2020). \u003c/li\u003e\n\u003cli\u003eSumner, L. W. \u003cem\u003eet al.\u003c/em\u003e Proposed minimum reporting standards for chemical analysis: Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). \u003cem\u003eMetabolomics \u003c/em\u003e\u003cstrong\u003e3\u003c/strong\u003e, 211\u0026ndash;221 (2007). \u003c/li\u003e\n\u003cli\u003eCharbonnet, J. A. \u003cem\u003eet al.\u003c/em\u003e Communicating Confidence of Per- and Polyfluoroalkyl Substance Identification via High-Resolution Mass Spectrometry. \u003cem\u003eEnviron. Sci. Technol. Lett. \u003c/em\u003e\u003cstrong\u003e9\u003c/strong\u003e, 473\u0026ndash;481 (2022). \u003c/li\u003e\n\u003cli\u003eDablanc, A. \u003cem\u003eet al.\u003c/em\u003e FragHub: A Mass Spectral Library Data Integration Workflow. \u003cem\u003eAnal. Chem.\u003c/em\u003e acs.analchem.4c02219 (2024) doi:10.1021/acs.analchem.4c02219. \u003c/li\u003e\n\u003cli\u003eMedema, M. H., de Rond, T. \u0026amp; Moore, B. S. Mining genomes to illuminate the specialized chemistry of life. \u003cem\u003eNat. Rev. Genet. \u003c/em\u003e\u003cstrong\u003e22\u003c/strong\u003e, 553\u0026ndash;571 (2021). \u003c/li\u003e\n\u003cli\u003eTsugawa, H. \u003cem\u003eet al.\u003c/em\u003e Hydrogen Rearrangement Rules: Computational MS/MS Fragmentation and Structure Elucidation Using MS-FINDER Software. \u003cem\u003eAnal. Chem. \u003c/em\u003e\u003cstrong\u003e88\u003c/strong\u003e, 7946\u0026ndash;7958 (2016). \u003c/li\u003e\n\u003cli\u003eRuttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. \u0026amp; Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. \u003cem\u003eJ. Cheminformatics \u003c/em\u003e\u003cstrong\u003e8\u003c/strong\u003e, 3 (2016). \u003c/li\u003e\n\u003cli\u003eHoffmann, M. A. 
\u003cem\u003eet al.\u003c/em\u003e High-confidence structural annotation of metabolites absent from spectral libraries. \u003cem\u003eNat. Biotechnol. \u003c/em\u003e\u003cstrong\u003e40\u003c/strong\u003e, 411\u0026ndash;421 (2022). \u003c/li\u003e\n\u003cli\u003eCASMI 2022. \u003cem\u003eRevisiting CASMI \u003c/em\u003ehttps://fiehnlab.ucdavis.edu/casmi (2022). \u003c/li\u003e\n\u003cli\u003eZhu, B. \u003cem\u003eet al.\u003c/em\u003e Knowledge-based in silico fragmentation and annotation of mass spectra for natural products with MassKG. \u003cem\u003eComput. Struct. Biotechnol. J. \u003c/em\u003e\u003cstrong\u003e23\u003c/strong\u003e, 3327\u0026ndash;3341 (2024). \u003c/li\u003e\n\u003cli\u003eda Silva, R. R. \u003cem\u003eet al.\u003c/em\u003e Propagating annotations of molecular networks using in silico fragmentation. \u003cem\u003ePLOS Comput. Biol. \u003c/em\u003e\u003cstrong\u003e14\u003c/strong\u003e, e1006089 (2018). \u003c/li\u003e\n\u003cli\u003eRutz, A. \u003cem\u003eet al.\u003c/em\u003e Taxonomically Informed Scoring Enhances Confidence in Natural Products Annotation. \u003cem\u003eFront. Plant Sci. \u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, 1329 (2019). \u003c/li\u003e\n\u003cli\u003eQuinlan, Z. A. \u003cem\u003eet al.\u003c/em\u003e ConCISE: Consensus Annotation Propagation of Ion Features in Untargeted Tandem Mass Spectrometry Combining Molecular Networking and In Silico Metabolite Structure Prediction. \u003cem\u003eMetabolites \u003c/em\u003e\u003cstrong\u003e12\u003c/strong\u003e, 1275 (2022). \u003c/li\u003e\n\u003cli\u003eQuiros-Guerrero, L.-M. \u003cem\u003eet al.\u003c/em\u003e Inventa: A computational tool to discover structural novelty in natural extracts libraries. \u003cem\u003eFront. Mol. Biosci. \u003c/em\u003e\u003cstrong\u003e9\u003c/strong\u003e, 1028334 (2022). \u003c/li\u003e\n\u003cli\u003eMejri, Y. 
\u003cem\u003eet al.\u003c/em\u003e MS2DECIDE: Aggregating Multiannotated Tandem Mass Spectrometry Data with Decision Theory Enhances Natural Products Prioritization. \u003cem\u003eChemistry\u0026ndash;Methods\u003c/em\u003e 202400088 (2025) doi:10.1002/cmtd.202400088. \u003c/li\u003e\n\u003cli\u003eAllard, P.-M. \u003cem\u003eet al.\u003c/em\u003e Integration of Molecular Networking and \u003cem\u003eIn-Silico\u003c/em\u003e MS/MS Fragmentation for Natural Products Dereplication. \u003cem\u003eAnal. Chem. \u003c/em\u003e\u003cstrong\u003e88\u003c/strong\u003e, 3317\u0026ndash;3323 (2016). \u003c/li\u003e\n\u003cli\u003eZhou, Z. \u003cem\u003eet al.\u003c/em\u003e Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking. \u003cem\u003eNat. Commun. \u003c/em\u003e\u003cstrong\u003e13\u003c/strong\u003e, 6656 (2022). \u003c/li\u003e\n\u003cli\u003eL\u0026ecirc; Cao, K.-A. \u0026amp; Welham, Z. M. \u003cem\u003eMultivariate Data Integration Using R: Methods and Applications with the mixOmics Package\u003c/em\u003e. (Chapman and Hall/CRC, Boca Raton, 2021). doi:10.1201/9781003026860. \u003c/li\u003e\n\u003cli\u003eBerthold, M. R. \u003cem\u003eet al.\u003c/em\u003e KNIME-the Konstanz information miner: version 2.0 and beyond. \u003cem\u003eAcM SIGKDD Explor. Newsl. \u003c/em\u003e\u003cstrong\u003e11\u003c/strong\u003e, 26\u0026ndash;31 (2009). \u003c/li\u003e\n\u003cli\u003eChandrasekhar, V. \u003cem\u003eet al.\u003c/em\u003e COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. \u003cem\u003eNucleic Acids Res. \u003c/em\u003e\u003cstrong\u003e53\u003c/strong\u003e, D634\u0026ndash;D643 (2025). \u003c/li\u003e\n\u003cli\u003eWohlgemuth, G., Haldiya, P. K., Willighagen, E., Kind, T. \u0026amp; Fiehn, O. The Chemical Translation Service\u0026mdash;a web-based tool to improve standardization of metabolomic reports. 
\u003cem\u003eBioinformatics \u003c/em\u003e\u003cstrong\u003e26\u003c/strong\u003e, 2647\u0026ndash;2648 (2010). \u003c/li\u003e\n\u003cli\u003eBurgess, K. E. V., Borutzki, Y., Rankin, N., Daly, R. \u0026amp; Jourdan, F. MetaNetter 2: A Cytoscape plugin for ab initio network analysis and metabolite feature classification. \u003cem\u003eJ. Chromatogr. B \u003c/em\u003ehttps://doi.org/10.1016/j.jchromb.2017.08.015 (2017) doi:10.1016/j.jchromb.2017.08.015. \u003c/li\u003e\n\u003cli\u003eRadwan, M. M., Chandra, S., Gul, S. \u0026amp; ElSohly, M. A. Cannabinoids, phenolics, terpenes and alkaloids of cannabis. \u003cem\u003eMolecules \u003c/em\u003e\u003cstrong\u003e26\u003c/strong\u003e, 2774 (2021). \u003c/li\u003e\n\u003cli\u003eAndre, C. M., Hausman, J.-F. \u0026amp; Guerriero, G. Cannabis sativa: the plant of the thousand and one molecules. \u003cem\u003eFront. Plant Sci. \u003c/em\u003e\u003cstrong\u003e7\u003c/strong\u003e, 19 (2016). \u003c/li\u003e\n\u003cli\u003ePereira Francisco, V. \u003cem\u003eet al.\u003c/em\u003e Development of GC\u0026ndash;MS coupled to GC\u0026ndash;FID method for the quantification of cannabis terpenes and terpenoids: Application to the analysis of five commercial varieties of medicinal cannabis. (2024). \u003c/li\u003e\n\u003cli\u003eV\u0026aacute;squez-Ocm\u0026iacute;n, P. G. \u003cem\u003eet al.\u003c/em\u003e Cannabinoids vs. whole metabolome: Relevance of cannabinomics in analyzing Cannabis varieties. \u003cem\u003eAnal. Chim. Acta \u003c/em\u003e\u003cstrong\u003e1184\u003c/strong\u003e, 339020 (2021). \u003c/li\u003e\n\u003cli\u003eAliferis, K. A. \u0026amp; Bernard-Perron, D. Cannabinomics: Application of Metabolomics in Cannabis (Cannabis sativa L.) Research and Development. \u003cem\u003eFront. Plant Sci. \u003c/em\u003e\u003cstrong\u003e11\u003c/strong\u003e, (2020). \u003c/li\u003e\n\u003cli\u003eZandkarimi, F. 
\u003cem\u003eet al.\u003c/em\u003e Comparison of the cannabinoid and terpene profiles in commercial cannabis from natural and artificial cultivation. \u003cem\u003eMolecules \u003c/em\u003e\u003cstrong\u003e28\u003c/strong\u003e, 833 (2023). \u003c/li\u003e\n\u003cli\u003eDe Meijer, E. P. M. \u0026amp; Hammond, K. M. The inheritance of chemical phenotype in Cannabis sativa L. \u003cem\u003eEuphytica \u003c/em\u003e\u003cstrong\u003e145\u003c/strong\u003e, 189\u0026ndash;198 (2005). \u003c/li\u003e\n\u003cli\u003eHazekamp, A. \u0026amp; Fischedick, J. T. Cannabis‐from cultivar to chemovar. \u003cem\u003eDrug Test. Anal. \u003c/em\u003e\u003cstrong\u003e4\u003c/strong\u003e, 660\u0026ndash;667 (2012). \u003c/li\u003e\n\u003cli\u003eHanu\u0026scaron;, L. O., Meyer, S. M., Mu\u0026ntilde;oz, E., Taglialatela-Scafati, O. \u0026amp; Appendino, G. Phytocannabinoids: a unified critical inventory. \u003cem\u003eNat. Prod. Rep. \u003c/em\u003e\u003cstrong\u003e33\u003c/strong\u003e, 1357\u0026ndash;1392 (2016). \u003c/li\u003e\n\u003cli\u003eDe Ronne, M. \u0026amp; Torkamaneh, D. Discovery of major QTL and a massive haplotype associated with cannabinoid biosynthesis in drug‐type Cannabis. \u003cem\u003ePlant Genome \u003c/em\u003e\u003cstrong\u003e18\u003c/strong\u003e, e70031 (2025). \u003c/li\u003e\n\u003cli\u003eBrungs, C. \u003cem\u003eet al.\u003c/em\u003e MSnLib: efficient generation of open multi-stage fragmentation mass spectral libraries. \u003cem\u003eNat. Methods \u003c/em\u003e\u003cstrong\u003e22\u003c/strong\u003e, 2028\u0026ndash;2031 (2025). \u003c/li\u003e\n\u003cli\u003eNeumann, S. \u003cem\u003eet al.\u003c/em\u003e MassBank: an open and FAIR mass spectral data resource. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e gkaf1193 (2025) doi:10.1093/nar/gkaf1193. \u003c/li\u003e\n\u003cli\u003eMarti, G. Lessons from Mass Spectral Library Integration: Addressing Metadata Gaps and Expanding Chemodiversity. (2025). \u003c/li\u003e\n\u003cli\u003eWishart, D. S. 
\u003cem\u003eet al.\u003c/em\u003e Chemical Composition of Commercial Cannabis. \u003cem\u003eJ. Agric. Food Chem.\u003c/em\u003e acs.jafc.3c06616 (2024) doi:10.1021/acs.jafc.3c06616. \u003c/li\u003e\n\u003cli\u003eJulian Pollmann Florian Huber. Count your bits: more subtle similarity measures using larger radius count vectors. https://doi.org/10.1101/2025.06.16.659994. \u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[{"identity":"5da05b64-1169-4293-bde6-bd24516faa13","identifier":"10.13039/501100001665","name":"Agence Nationale de la Recherche","awardNumber":"11-INBS-0010","order_by":0}],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Federal University of Toulouse Midi-Pyrénées","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"metabolomics, mass spectrometry, annotation, molecular networking, Tanimoto similarity, natural products, KNIME, Cannabis, cannabinoids","lastPublishedDoi":"10.21203/rs.3.rs-8174529/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8174529/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eConfident metabolite annotation remains a critical bottleneck in untargeted LC-MS metabolomics, as experimental spectral libraries 
cover only 5\u0026ndash;20% of detected features. While \u003cem\u003ein silico\u003c/em\u003e tools generate extensive candidate lists, top-ranked predictions often fail to reflect true identities, resulting in high false annotation rates. We present MS-Net (Multi-Similarity Network-based annotation), an accessible workflow integrating mass spectral similarity networks, molecular structure similarity (Tanimoto metrics), and taxonomic knowledge to prioritize annotations within vast candidate spaces. MS-Net employs a composite Link Score combining full-molecule and scaffold Tanimoto similarities with MS/MS cosine similarity and \u003cem\u003ein silico\u003c/em\u003e confidence metrics. High-confidence annotations seed iterative propagation throughout the network. Applied to a \u003cem\u003eCannabis sativa\u003c/em\u003e dataset (2,595 initial features reduced to 1,297 after filtering, from 118,000 candidates), MS-Net resolved the annotation space to 1,275 confident assignments. Notably, 53% of final annotations were rescued from lower \u003cem\u003ein silico\u003c/em\u003e ranks (2\u0026ndash;50), demonstrating the algorithm's ability to correct ranking errors. 
The workflow enables reproducible, offline annotation prioritization suitable for systems biology integration.\u003c/p\u003e","manuscriptTitle":"MS-Net: Multi-Similarity based network annotation for untargeted metabolomics","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-27 09:32:58","doi":"10.21203/rs.3.rs-8174529/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"08f5f665-5984-4934-a809-1c8ebe1e6141","owner":[],"postedDate":"November 27th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":58401303,"name":"Structural Biology"},{"id":58401304,"name":"Analytical Biochemistry"},{"id":58401305,"name":"Computational Biology"}],"tags":[],"updatedAt":"2025-11-27T09:32:58+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-27 09:32:58","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8174529","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8174529","identity":"rs-8174529","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

