DeepSeMS: a large language model reveals hidden biosynthetic potential of the global ocean microbiome

doi:10.21203/rs.3.rs-6233440/v1

2025 · doi:10.21203/rs.3.rs-6233440/v1

Na Jiao, Tingjun Xu, Yuwei Yang, Ruixin Zhu, Weili Lin, Jixuan Li, and 4 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6233440/v1 This work is licensed under a CC BY 4.0 License. Status: Published. Journal Publication published 30 Apr, 2026 in Nature Computational Science. Version 1 posted; this is the latest preprint version.

Abstract

Microbial biosynthetic diversity holds immense potential for discovering natural products with therapeutic applications, yet a substantial quantity of natural products derived from uncultivated microorganisms remains uncharacterized.
The intricate nature of biosynthetic enzymes poses a major challenge for accurately predicting the chemical structures of secondary metabolites from genome sequences alone using current rule-based methods. Here, we present DeepSeMS, a large language model designed to predict the chemical structures of secondary metabolites from various microbial biosynthetic gene clusters. Built on the Transformer architecture, DeepSeMS identifies sequence features using functional domains of biosynthetic enzymes and incorporates feature-aligned chemical structure enumeration for training data augmentation. External evaluation shows that DeepSeMS predicts more accurate chemical structures of secondary metabolites, with a Tanimoto coefficient of up to 0.60 relative to the ground truth, significantly outperforming antiSMASH and PRISM, whose coefficients are only 0.14 and 0.45, respectively. Moreover, DeepSeMS successfully predicted secondary metabolites for 96.60% of cryptic biosynthetic gene clusters, surpassing existing methods, whose success rates are below 50%. Leveraging DeepSeMS, we characterized over 65,000 novel secondary metabolites from the global ocean microbiome with previously undocumented structural types, ecological distribution, and biomedical applications, especially as antibiotics. A login-free and user-friendly web server for DeepSeMS (https://biochemai.cstspace.cn/deepsems/) has been launched, featuring an integrated global ocean microbial secondary metabolites repository to expedite the discovery of novel natural products. Collectively, this study underscores the great capacity of a large language model-driven method to reveal the hidden biosynthetic potential of the global ocean microbiome.
Biological sciences/Computational biology and bioinformatics/Computational models
Biological sciences/Drug discovery/Drug screening/High-throughput screening

INTRODUCTION

Secondary metabolites (SMs), particularly those produced by microbes, are an essential class of natural compounds with diverse biological activities, including antimicrobial, anti-inflammatory, and anticancer properties, as well as therapeutic potential for treating metabolic diseases 1,2. These molecules are widely used as pharmaceutical agents, such as antibiotics, statins, and antitumor drugs 3,4. However, the majority of clinically used microbial-derived SMs were identified in cultured species, which represent less than 1% of the vast microbial diversity 5–7. Metagenomic sequencing has enabled the characterization of numerous genomes from uncultured and unknown species across diverse environments, uncovering vast numbers of biosynthetic gene clusters (BGCs) and their potential to produce novel SMs 8–19. In particular, the global ocean, as the largest ecosystem on Earth, harbors an extraordinary diversity of microbial resources that remain largely underexplored, positioning it as a valuable reservoir for the discovery of novel SMs 20,21. Nevertheless, the identification of novel SMs from microbial genomes remains challenging for existing methods such as antiSMASH 14 and PRISM 15. These rule-based methods often fail to generate the chemical structures of SMs produced by cryptic BGCs from metagenome-assembled genomes (MAGs) 18,19. This is mainly because the catalytic functions of homologous biosynthetic enzymes in BGCs are highly context-dependent 22–25. For example, cytochromes P450 (CYP450) can catalyze carbon hydroxylation, heteroatom oxygenation, dealkylation, epoxidation, aromatic hydroxylation, reduction, and dehalogenation, generating structurally distinct SMs 25.
Limited substrate-specific biosynthetic enzyme libraries or virtual tailoring reactions cannot cover the non-canonical arrangements and combinations of biosynthetic enzymes in cryptic BGCs. Advanced artificial intelligence (AI) technology, particularly large language models (LLMs), has exhibited great capabilities in understanding, generating, and manipulating sequence context 26–29. The generative capabilities inherent in LLMs offer significant potential to accurately identify the biosynthetic functions of enzymes encoded in BGCs from known examples. Additionally, these models can automatically assemble the complex chemical structures of various SMs in a manner analogous to natural language translation 30–32. However, significant differences exist between the processing of natural language sequences and biological sequences, particularly in terms of feature recognition, encoding, and decoding. These differences make it challenging to represent biological sequences effectively within LLMs. Furthermore, achieving high precision and generalization ability in LLMs typically requires large training datasets, yet known BGCs and experimentally verified SMs remain relatively scarce 31–33. While several factors contribute to this gap, sequence representation and the scarcity of comprehensive training data are considered the most critical ones. In this work, we trained a language model named DeepSeMS (deep language model for secondary metabolite structures prediction) to automatically generate the chemical sequence of an SM from an input BGC sequence. To solve the sequence representation problem, we represented each BGC sequence as the functional domains of the biosynthetic enzymes encoded in the BGC. Additionally, we employed a data augmentation strategy to construct a refined dataset of sufficient quantity and superior quality, thereby addressing the dataset gap.
Evaluations demonstrated that DeepSeMS significantly outperforms existing methods, generating chemical structures closer to real-world SMs and being applicable to more varied types of cryptic BGCs. We employed DeepSeMS for large-scale mining of SMs in the global ocean microbiome, successfully characterizing more than 65,000 novel SMs with previously undocumented structural types, geographical coverage, ocean diversities, and ecological distribution characteristics, and identifying various biomedical applications, including antibiotics, cell protectants, and innovative drug candidates. Finally, we developed a user-friendly web server, along with a built-in global ocean SMs repository (https://biochemai.cstspace.cn/deepsems/), for the convenience of researchers.

RESULTS

1. DeepSeMS algorithm

1.1 Model overview

The DeepSeMS model, based on the Transformer architecture, was trained on a dataset of known BGCs and SM structures processed by a data augmentation strategy 26–29. The model represents the features of an input BGC as a source sequence, which is tokenized and embedded before entering the Transformer neural network. Subsequently, a chemical sequence decoder converts the output target sequences into predicted SM structures (Fig. 1a). The Transformer neural network consisted of six encoder and decoder layers and eight attentional layers with an embedding dimension of 512, giving a total of approximately 100 million trainable parameters (Supplementary Fig. 1).

1.2 Sequence representation

One major challenge in predicting the chemical structures of SMs from BGCs was determining the most informative genomic input. Biological sequences can be represented at various levels, ranging from amino acids (basic building blocks) to functional domains (modular protein units) and enzymes (complete coding sequences). Among these, the functional domains of the biosynthetic enzymes have been shown to be the most informative BGC representation 17,34.
Functional domains are the fundamental units within BGCs that are responsible for substrate specificity, linear assembly, and tailoring reactions in SM biosynthesis 14,18,22. Additionally, their sequential arrangement enables the model to capture contextual relationships between domains and the chemical structures of SMs. We further evaluated the efficiency of various sequence representation strategies by training the DeepSeMS model with input sequences of amino acids, enzymes, and functional domains (Fig. 1b). Our experiment revealed that amino acid sequences were impractical due to their excessive length (up to 50,000 tokens), which exceeded the model’s capacity and required significant computational resources. Although enzyme sequences were shorter (up to 50 tokens), they suffered from substantial information loss, preventing the model from converging during training. In contrast, sequences of functional domains, identified as protein families and domains (Pfam) identifiers by the biosequence analysis tool HMMER, provided an optimal balance of block size and manageable length (up to 250 tokens). This representation enabled the neural network to efficiently extract key features from BGCs. For the output, we used SMILES (Simplified Molecular Input Line Entry System) strings to represent the chemical structures of SMs. SMILES is widely regarded as the standard format for describing small-molecule structures in chemical language models, making it well suited to capturing the structural diversity of SMs 35–37.

1.3 Data augmentation strategy

To address the training dataset gap, we first curated BGC sequences along with their corresponding SM structures from the MIBiG database, which is renowned for its large-scale collection of known BGC sequences annotated with experimentally verified SM structures. From the MIBiG database, we constructed an initial dataset (n = 3,029) for model training by data extraction and cleaning.
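The functional-domain representation described in Section 1.2 can be illustrated with a short sketch. This is not the authors' code: the special tokens, the helper names `build_vocab` and `encode`, and the example Pfam accessions are assumptions made for illustration; only the 250-token upper bound comes from the text.

```python
from typing import Dict, List

MAX_LEN = 250  # upper bound on domain-sequence length reported in the text
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"  # hypothetical special tokens


def build_vocab(clusters: List[List[str]]) -> Dict[str, int]:
    """Map every Pfam identifier seen in the training clusters to an integer id."""
    vocab = {tok: i for i, tok in enumerate([PAD, UNK, BOS, EOS])}
    for domains in clusters:
        for pfam_id in domains:
            vocab.setdefault(pfam_id, len(vocab))
    return vocab


def encode(domains: List[str], vocab: Dict[str, int], max_len: int = MAX_LEN) -> List[int]:
    """Turn an ordered list of Pfam identifiers into a fixed-length id sequence."""
    body = domains[: max_len - 2]  # leave room for <bos> and <eos>
    ids = [vocab[BOS]] + [vocab.get(d, vocab[UNK]) for d in body] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Illustrative cluster; domain order follows gene order in the BGC.
# The accessions are placeholders chosen for the example.
toy_cluster = ["PF00109", "PF02801", "PF00698", "PF00550"]
vocab = build_vocab([toy_cluster])

# A domain absent from the vocabulary (here "PF99999") falls back to <unk>.
ids = encode(toy_cluster + ["PF99999"], vocab)
```

Keeping the domains in gene order is what lets a sequence model pick up the contextual relationships between neighbouring domains that the text emphasizes.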
While the BGCs in the dataset represent a large biosynthetic diversity of known examples, the SM structures are so sparse relative to the vast chemical space of small molecules that the LLM may be unable to learn the syntax of SMILES strings. In such scenarios, data augmentation using a batch of chemically identical but syntactically different SMILES strings can greatly improve the performance of deep learning methods 37. A commonly used augmentation for SMILES strings represents a molecule as a 2D graph and derives linear SMILES notations from this graph by enumerating its nodes in a specific topological order, i.e., SMILES enumeration (Fig. 1c) 38. While it exposes representations of the same molecule from various views, randomized SMILES enumeration disorganizes the notations in SMILES strings, which may lead to structural feature disorder and thereby hinder model performance 31,39. Consequently, data augmentation in this study was performed on the target SMILES strings of the training dataset by structural features-aligned SMILES enumeration (Fig. 1c). The molecular scaffold, which constitutes the major structural feature of microbial SMs, largely dictates their biological activities and functions 1,3. Therefore, this procedure (see Methods) not only keeps the major structural feature (scaffold) of an SM aligned across SMILES strings, but also augments the feature blocks of chemical sequences, and should thereby enhance model performance. To validate the data augmentation strategy, we first divided the initial dataset randomly into base training (90%, n = 2,726) and internal validation (10%, n = 303) datasets. Next, random SMILES enumeration and structural features-aligned SMILES enumeration were each applied to the base training dataset to train the DeepSeMS model.
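The 90/10 split quoted above can be reproduced arithmetically with a small sketch. The function name `split_dataset`, the seed, and the integer stand-ins for BGC/SMILES pairs are assumptions; the text only states that the split was random and gives the resulting sizes. Splitting before augmentation, as the text describes, keeps enumerated variants of a validation molecule out of the training set.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(records: Sequence, val_fraction: float = 0.10,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle the records and hold out a validation fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in for the 3,029 BGC/SMILES pairs of the initial dataset.
records = list(range(3029))
train, val = split_dataset(records)
print(len(train), len(val))  # 2726 303
```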
Compared with the model trained on the dataset without data augmentation, amplifying the training dataset by randomized SMILES enumeration resulted in a significant increase in the validity of generated SMILES strings. Moreover, structural features-aligned SMILES enumeration brought significantly better performance in the mean structural similarity between the valid structures and the target structures (Supplementary Table 1). Specifically, for the model trained with structural features-aligned SMILES enumeration, approximately one-quarter of the generated structures were completely identical to the target structures (structure recovery), and roughly half had scaffolds completely identical to those of the target structures (scaffold recovery). This indicates that structural features-aligned SMILES enumeration is advantageous not only for learning the fundamental syntax of the chemical language in SMILES strings by enhancing sample diversity, but also for emphasizing the structural features of SMs by keeping samples aligned. Consequently, we trained the DeepSeMS model on a refined dataset (n = 55,903), obtained by applying structural features-aligned SMILES enumeration to the initial dataset, using ten-fold cross-validation (Supplementary Fig. 2). However, the performance of the model in each fold is determined by the split of the dataset, which introduces significant randomness. Therefore, the best-performing checkpoint of each fold was adopted in the application version of the DeepSeMS model, which generates a top-10 output of predicted SM structures for each input BGC sequence. As a result, the model achieved up to 85.71% validity of the generated SMILES strings and a structural similarity of up to 0.85 between the valid structures and the target structures in the ten-fold cross-validation (Supplementary Table 2).

2.
Evaluation of accuracy and generalization ability on external validation datasets

To evaluate the DeepSeMS model more comprehensively, we utilized two external validation datasets: ‘Known BGCs’ (n = 326), with chemical structures of experimentally verified SMs, for evaluating accuracy; and ‘Cryptic BGCs’ (n = 940), without chemical structures of SMs, for evaluating generalization ability. The known BGCs dataset was derived from the ‘gold standard’ BGCs manually curated by the PRISM 4 authors 18, a comprehensive dataset of prokaryotic BGCs linked to experimentally verified SMs with unambiguously assigned chemical structures. To form the known BGCs dataset, we excluded BGCs with a sequence similarity greater than 95% to those in the training dataset of the DeepSeMS model. On the other hand, in view of the vast biosynthetic diversity exhibited by microbes in the ocean 41, we believe that BGCs derived from the ocean microbiome are ideal material for testing the ability of DeepSeMS to explore previously unrecognized SMs, especially cryptic ones from bathypelagic habitats 21. Thus, we obtained the Malaspina Deep Metagenome-Assembled Genomes (MDeep-MAGs) 42, constructed from 58 metagenomes, and searched them for biosynthetic regions. This yielded a suitable number of BGCs to form the cryptic BGCs dataset, representing diverse biosynthetic pathways of the bathypelagic microbial communities. We also compared the performance of the DeepSeMS model with the two most widely used methods, namely antiSMASH 19 and PRISM 18 (SM structure prediction functions).

2.1 Prediction accuracy on the dataset of known BGCs

Evaluation results on the validation dataset of known BGCs are summarized in Table 1. The results demonstrated that DeepSeMS predicted more accurate SM structures for the known BGCs than the existing methods antiSMASH 7 and PRISM 4.
DeepSeMS successfully predicted at least one chemically valid SM structure for 318 of 326 BGCs (97.55%), and the generated structures are more similar to the ground truth across various BGC types (Supplementary Fig. 3). Notably, DeepSeMS predicted 134 (41.10%) structures chemically identical to the ground truth, and over half (53.68%) of the predicted structures have the same scaffolds as the real-world SMs, which indicates that DeepSeMS greatly improves the accuracy of SM structure prediction over the other two methods.

Table 1 Comparison of the DeepSeMS model with existing methods on the validation dataset of known BGCs (n = 326). The best result in each column is achieved by DeepSeMS.

Method        Success rate (1)   Structural similarity (2)   Scaffold similarity (3)   Structure recovery (4)   Scaffold recovery (5)
antiSMASH 7   63.50%             0.14                        0.03                      0.00%                    1.93%
PRISM 4       88.96%             0.45                        0.42                      8.11%                    16.87%
DeepSeMS      97.55%             0.60                        0.63                      41.10%                   53.68%

(1) The percentage of BGCs in the validation dataset for which each method predicted at least one chemically valid structure.
(2) The mean structural similarity between the valid structures and the ground truth.
(3) The mean scaffold similarity between the valid structures and the ground truth.
(4) The percentage of structures chemically identical to the ground truth.
(5) The percentage of scaffolds chemically identical to the ground truth.
Source data are provided in the Source Data file.

2.2 Generalization ability on the dataset of cryptic BGCs

A comparative study further confirms that DeepSeMS achieved a significant improvement in mining SMs from cryptic BGCs (Fig. 2, Supplementary Table 3). DeepSeMS successfully predicted at least one chemically valid SM structure for 908 of 940 BGCs (96.60%) in the validation dataset of cryptic BGCs. This represents an increase of approximately 80 percentage points over antiSMASH 7, which predicted 189 structures for 159 (16.91%) BGCs, and of at least 50 percentage points over PRISM 4, which predicted 455 structures for 203 (46.45%) BGCs (Fig. 2a).
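The structural similarity column of Table 1 is reported as a Tanimoto coefficient. As a minimal sketch, assuming each molecule is summarized by a fingerprint given as a set of "on" bit indices (the fingerprint type is not stated in this section, and the bit sets below are made up), the coefficient and two of the Table 1 percentages can be computed as follows.

```python
from typing import Iterable


def tanimoto(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(sa & sb) / len(sa | sb)


# Hypothetical on-bit indices for a predicted and a reference structure.
predicted = {1, 4, 7, 9, 15}
reference = {1, 4, 7, 12, 15, 20}
score = tanimoto(predicted, reference)  # 4 shared bits / 7 bits in the union

# Success rate and structure recovery from the counts given in the text (n = 326).
n_bgcs = 326
success_rate = 100 * 318 / n_bgcs        # about 97.55
structure_recovery = 100 * 134 / n_bgcs  # about 41.10
```

A coefficient of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so the mean values of 0.14, 0.45, and 0.60 in Table 1 sit on that scale.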
The total number of predicted chemically valid SM structures and the percentage of chemically unique structures (uniqueness) indicate that DeepSeMS (5,104 valid SM structures with 78.66% uniqueness) exhibits remarkably higher molecular novelty than the other methods (PRISM 4: 455 valid SM structures with 62.42% uniqueness; antiSMASH 7: 189 valid SM structures with 24.87% uniqueness) (Fig. 2b). The chemical space of the SM structures predicted by each method (Fig. 2c) shows that DeepSeMS can expand the chemical space of SMs from a relatively small amount of training data. Microbial SMs are primarily small molecules with molecular weights ranging from 300 to 500, and the distribution of molecular weights generated by DeepSeMS (Fig. 2d) aligns well with this range. The distributions of synthetic accessibility (Fig. 2e) and quantitative estimate of drug-likeness (QED) (Fig. 2f) of the SM structures predicted by DeepSeMS also correspond to the structural complexity of SMs in nature. This demonstrates the capability of the LLM-driven method to generate molecules with a variety of molecular scaffolds, functional groups, and ring systems that mimic the diverse and intricate structures observed in natural products. Collectively, the comparative analysis illustrates that DeepSeMS generated a wider array of complex structures on the dataset of cryptic BGCs than the other two methods. Furthermore, DeepSeMS improved the ability to predict SM structures from various BGC types (Supplementary Table 4). DeepSeMS successfully predicted at least one chemically valid SM structure for 38 of 39 BGC types, including the common types non-ribosomal peptide synthetases (NRPSs), polyketide synthases (PKSs), and terpenes, along with ribosomally encoded and posttranslationally modified peptides (RiPPs).
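The uniqueness figures quoted above for Fig. 2b are ratios of distinct to total valid structures. A minimal sketch, assuming the predicted structures have already been written as canonical SMILES so that string equality implies chemical identity (the helper name and the toy strings are illustrative):

```python
from typing import List


def uniqueness(smiles: List[str]) -> float:
    """Percentage of distinct strings among the predicted structures."""
    if not smiles:
        return 0.0
    return 100 * len(set(smiles)) / len(smiles)


# Toy predictions: one duplicate among four valid structures.
toy_predictions = ["CCO", "CCO", "c1ccccc1", "CC(=O)O"]
toy_uniqueness = uniqueness(toy_predictions)  # 75.0

# Rough number of unique DeepSeMS structures implied by the reported figures.
approx_unique = round(5104 * 78.66 / 100)  # about 4015
```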
The ‘hybrid’ BGC type 19, in which a single gene cluster produces a hybrid compound that combines two or more biosynthetic pathways, was also comprehensively predicted by DeepSeMS with a variety of SM structures. Notably, DeepSeMS successfully predicted SM structures for clusters containing biosynthetic regions that do not fit into currently known categories, which indicates the strong generalization ability of DeepSeMS on BGCs of undescribed families. To illustrate how DeepSeMS simulates the biosynthesis of microbial SMs to generate chemical structures, we inspected whether the predicted structures have structural features that are implicit in the BGCs. Sequence similarity analysis shows that the cluster ‘mp-deep_mag-0578_000009.region001’ from the cryptic BGCs dataset encodes four classes of homologous biosynthetic enzymes: dehydrogenase, phytoene synthase, α-glucosidase, and glycosyl transferase. The dehydrogenase and phytoene synthase were reported to be responsible for the biosynthesis of the carbon chain of phytoene 43, and the α-glucosidase and glycosyl transferase would catalyse glycosylation 44. The five SM structures generated by DeepSeMS (Supplementary Fig. 4) each have a phytoene-like scaffold, a long-chain, unsaturated, aliphatic hydrocarbon, with a terminal glucoside, which indicates that DeepSeMS generated the main structural features that are implicit in this cluster. Therefore, the case study supports the interpretability and practicability of DeepSeMS, which can provide new biological insights for biosynthetic pathway identification and chemical structure elucidation.

3. The hidden biosynthetic potential of the global ocean microbiome

The global ocean harbors an extraordinarily rich diversity of microbial resources, the vast majority of which remain under-explored, making it a vast reservoir for the discovery of novel SMs 20,21.
To tap into the biosynthetic potential of marine microbes, we obtained 27,139 MAGs from the most abundant available data resource, the Ocean Microbiomics Database (OMD) 41, as the global ocean microbiome; these MAGs were reconstructed from more than 1,000 seawater samples collected on a global ocean scale. A comprehensive search for biosynthetic regions within OMD yielded 46,786 BGCs, forming the ‘global ocean BGCs’ dataset. Leveraging DeepSeMS, we characterized 65,868 unique SM structures from this dataset to form the ‘global ocean SMs’. This dataset represents a substantial set of biosynthetic pathways and natural molecules yielded by the global ocean microbiome.

3.1 Molecular novelty and diversity of the global ocean SMs

The hidden biosynthetic potential of the global ocean microbiome was revealed as a vast array of novel SMs with previously undocumented structural types, geographical coverage, ocean diversities, and ecological distribution characteristics. To analyse and evaluate the molecular novelty of the generated global ocean SMs, we defined a ‘molecular novelty score’ (see Methods), which is the normalized percentage of the maximum similarity value to the structures of known SMs from the MIBiG database. The distribution of the molecular novelty score illustrated that the global ocean microbiome encodes a large number of novel SMs (n = 65,735) with structural differences from the known ones (Supplementary Fig. 4a), significantly expanding the chemical space of this resource library and providing more potential candidate molecules for natural drug discovery 45. Specifically, 97% of the global ocean SMs are novel structures, 69% have novel scaffolds, and 58% have novel shapes, which indicates the great structural novelty of the global ocean microbiome (Supplementary Fig. 4b and c).
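The molecular novelty score is only summarized above; its exact definition is in the Methods. One plausible reading, sketched here purely as an assumption, is 100 × (1 − maximum Tanimoto similarity to any known structure), so that a structure identical to a known SM scores 0 and one sharing nothing with the library scores 100. The fingerprints are again hypothetical sets of on-bit indices.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two sets of on-bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def novelty_score(query: Set[int], known: Iterable[Set[int]]) -> float:
    """Assumed form: 100 * (1 - max similarity to any known structure)."""
    max_sim = max((tanimoto(query, k) for k in known), default=0.0)
    return 100.0 * (1.0 - max_sim)


# Hypothetical fingerprints for a small library of known SMs and one query.
known_library = [{1, 2, 3, 4}, {2, 3, 5, 8}]
query = {1, 2, 9, 10}

# Similarities are 2/6 and 1/7, so the maximum is 1/3 and the score is about 66.7.
score = novelty_score(query, known_library)
```

Taking the maximum rather than the mean similarity makes the score conservative: a single close match in the known library is enough to mark a structure as not novel.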
Furthermore, from the view of the geographical coverage of the SMs (Fig. 3a), we found that microbes across the world's oceans exhibit high biosynthetic novelty (an average of over 96% novel structures); the Arctic Ocean contributes the largest number of SMs (22,426), and the North Atlantic Ocean has a slight advantage in molecular novelty in terms of the percentage of novel shapes (61%). However, significant differences emerge in the molecular diversity and uniqueness of the SMs when comparing the Arctic Ocean and the Southern Ocean (Fig. 3b). Specifically, the Arctic Ocean has the highest uniqueness (72%), i.e., SMs that are not found in other oceans, while the Southern Ocean has the highest SM diversity (63%). An analysis of the ecological distribution of the global ocean SMs (Fig. 3c) further confirms that the diverse marine environment cultivates more varied biosynthetic features in microbes, leading to a wide range of novel SM chemical structures. Notably, the SMs from abyssopelagic (>4500 m), low-oxygen (<100 µmol/kg), and medium-low temperature (5–15 °C) environments possess the highest molecular novelty and diversity. The oxygen (O), nitrogen (N), and carbon (C) contents of the global ocean SMs also vary with oceanic depth, oxygen, and temperature, which indicates that ocean microbes have evolved various SM structural types with differing element contents to adapt to diverse marine ecological environments. Specifically, we found that the prevalence of PKS BGC types in the deep ocean led to a high O content in the SM molecules. However, high oxygen and temperature in seawater are associated not with high O content but with low N content and high C content in the SM molecules, which corresponds to the low proportion of NRPS and the high proportion of terpene BGC types.

3.2 Biomedical application potential of the global ocean SMs

Specific novel SMs and biochemical pathways found in the global ocean microbiome help pave the way for innovative biomedical applications (Fig. 4). To identify the antibiotic potential of the global ocean SMs, we implemented a structure-based virtual screening focusing on SMs that incorporate functional groups known for their antibiotic properties, including β-lactams, aminoglycosides, tetracyclines, oxazolidinones, chloramphenicols, macrolides, ansamycins, and quinolones. This screening uncovered 8,783 unique SM structures featuring diverse antibiotic-associated functional groups (Fig. 4a). These structures are associated with various antibacterial mechanisms, including inhibition of bacterial cell-wall, protein, RNA, and DNA synthesis. Notably, these SMs possess novel side chains or substituents that differ from those of current antibiotics. Consequently, our findings uncovered the great antibiotic potential of the global ocean SMs, which cover a broad spectrum of pathogens, and which may be especially useful when the infectious agent is unknown or resistant to current antibiotics. Collectively, the global ocean SMs would be an ideal virtual library for exploring alternative drugs to target antibiotic-resistant bacteria, such as multidrug-resistant Gram-negative pathogens 46. We also found that the ocean-specific abundant BGC type ‘ectoine’, whose product can serve as a compatible solute, indicates widespread microbial adaptation to the bathypelagic environment to withstand extreme osmotic stress 47. We discovered 2,078 natural molecules with novel structural types derived from the ectoine biosynthetic pathways in the global ocean SMs (Fig. 4b). These ectoine molecules exhibit marked differences from the known SMs (average novelty score of 95.52) and could potentially serve as candidates for cell protectants in cosmetics, medicine, or biotechnology 48,49.
Notably, we characterized 645 unique SMs with novel molecular scaffolds and shapes from the global ocean BGCs containing biosynthetic regions that do not fit into any documented category 19. These novel structural types of SMs provide new insight into undocumented biosynthetic pathways that may lead to the discovery of novel bioactive compounds with potential therapeutic applications 10. Specifically, four of the top ten novel SM structures (Fig. 4c), namely n3, n6, n7, and n9, were predicted from the same BGC within the MAG ‘BGEO_SAMN07136520_METAG_FKHEEFFA’, derived from a seawater sample collected in the North Atlantic Ocean. This makes the bacterial host, identified as ‘UBA7446 sp002478685’, a promising candidate for the discovery of novel marine natural products.

4. AI-powered tools for accelerating novel SMs discovery

To facilitate analytical applications, we have deployed the DeepSeMS model as a web server (Fig. 5), freely accessible at https://biochemai.cstspace.cn/deepsems/. The ‘DeepSeMS web server’ enables users to submit prediction jobs for microbial genome mining of novel SMs by uploading BGC annotation files or by providing the antiSMASH job IDs of biosynthetic region searches. The web server generates comprehensive prediction results, including detailed biosynthetic features of the input BGCs, structural visualisations, prediction scores, and molecular properties of the predicted SM structures. Additionally, it offers integrated functionalities for comparing predicted SMs with known compounds to assess molecular novelty and analyze antibiotic potential, enhancing the interpretability of the results and supporting further research into novel bioactive compounds. We also provide sample input, tutorial, and example pages that explain the data formatting requirements for job submission and the results returned by the web server.
A job status page is provided at the time of submission, allowing the user to bookmark the job link or copy the job ID to access the results later. Furthermore, to allow the novel SMs discovered in this study to be explored more visually, we also deposited the dataset of the global ocean SMs as a built-in resource of the web server. The resource can be used to explore novel SMs by geographic location, marine environment, and BGC type, and to filter and visualize the biosynthetic pathways, molecular novelties, and antibiotic potentials of the cryptic population. For instance, searching the web server for cryptic NRPS BGCs from the Biogeotraces_GT15_GP13_TAN1109 sample set in the South Pacific Ocean returns five records. The first cluster was derived from the MAG of the bacterium ‘Arctic96AD-7 sp002082305’, which lives in the bathypelagic layer (1008 m) at low temperature (4.94 °C) and an oxygen content of 200.4 µmol/kg. Five novel SM structures of this BGC are displayed on the detail page, two of which are predicted to possess antibiotic potential as macrolides. Both the cluster and the result can be downloaded for further research.

DISCUSSION

In this work, we demonstrate that training an LLM to convert the biological sequences of BGCs into the chemical sequences of SMs facilitates accurate prediction of the chemical structures encoded within microbial genomes. More importantly, the model performs well at inferring the complexity and diversity of the biosynthetic pathways of novel SMs from the global ocean microbiome, unveiling the substantial biosynthetic potential of this yet-to-be-explored reservoir. Our investigation revealed that functional domains, as the most efficient representation of biosynthetic features in BGCs, empower DeepSeMS to achieve higher performance with a smaller model size and fewer computing resources.
Additionally, the structural features-aligned data augmentation strategy enables DeepSeMS to navigate the sparsely populated chemical space of known SMs and to generate novel SM structures from scarce training data. These capabilities make DeepSeMS a powerful AI-driven tool for characterizing the chemical structures of unidentified SMs via its web server, and provide new insights into microbial natural product discovery. Moreover, exploring the novel SMs from the global ocean microbiome as a built-in resource of the web server accelerates the discovery of innovative biomedical applications for marine natural products.

However, it is essential to note certain limitations of DeepSeMS. First, the identification of biosynthetic features may be incomplete owing to the coverage of the training dataset, i.e., one or more functional domains of a cluster may fail to be identified as biosynthetic features because of the sequence similarity threshold. As a result, the generated structure of an SM would be fragmentary. Furthermore, because reliable methods for defining the borders of BGCs in prokaryotic genomes from sequence data alone are lacking, the structures generated by DeepSeMS for unidentified BGC types may not represent the main structural features (i.e., scaffolds) of the expected SMs. Thus, these results can only serve as clues for exploring novel SMs and need further experimental validation to confirm the structures. To address these challenges, we propose a potential improvement to DeepSeMS: increasing the diversity of annotations for experimentally verified BGCs and SMs in the training dataset would augment the biosynthetic features and effectively improve the generalization ability of the LLM.

Despite these limitations, DeepSeMS outperforms the existing methods it was compared with and offers a paradigm shift in genome mining.
Traditional approaches struggle to identify the cryptic BGCs of microbes, which are either silent or expressed at very low levels under laboratory conditions. Our method leverages an LLM to automatically generate all possible structural types of SMs based on the biosynthetic features encoded within microbial genomes, showcasing the advantage of AI in directly linking genomic information to chemical output. Additionally, this study suggests a new direction: given the success of the LLM in predicting SM structures from BGCs, the sequence generation model could be reversed to design biosynthetic enzymes for specific SM structures, thereby leveraging AI and synthetic biology to explore the vast biosynthetic potential of microbiomes.

METHODS

Data preparation

The training dataset of the DeepSeMS model was collected from the MIBiG database (version 3.1, https://mibig.secondarymetabolites.org/) 33. BGC sequences and SMILES strings were paired by matching accession numbers in the MIBiG sequence files and annotation files. Structural issues in the SM structures of the dataset were identified with RDKit (version 2023.03.1, http://www.rdkit.org/) and addressed manually according to the references in the annotations. To reduce the complexity of the molecular generation model and ensure the validity of the generated SMILES sequences, canonical SMILES representations were generated using RDKit with the stereochemical information removed. We represented the SM structures using SMILES notation as sequences of target tokens, and identified 35 distinct structural features (unique SMILES notations) in the dataset to form the vocabulary of target tokens for the LLM 37.
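The canonicalisation step described above can be illustrated with a short sketch. It assumes RDKit is installed; the helper names `canonical_flat_smiles` and `tokenize` are hypothetical, and the regular expression is a commonly used SMILES token pattern chosen for illustration, not necessarily the scheme that produced the 35-token vocabulary reported here.

```python
import re
from rdkit import Chem

# Assumed token pattern (illustrative only): bracket atoms, two-letter halogens,
# two-digit ring-closure labels, then any single character.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def canonical_flat_smiles(smiles):
    """Return canonical SMILES without stereochemistry, or None if RDKit cannot parse it."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # isomericSmiles=False drops stereochemical (and isotopic) labels from the output.
    return Chem.MolToSmiles(mol, isomericSmiles=False)

def tokenize(smiles):
    """Split a SMILES string into tokens using the assumed pattern above."""
    return _TOKEN_RE.findall(smiles)

# Both enantiomers of alanine collapse to the same stereo-free string.
l_ala = canonical_flat_smiles("C[C@H](N)C(=O)O")
d_ala = canonical_flat_smiles("C[C@@H](N)C(=O)O")

# A toy target vocabulary built from the tokens that actually occur.
vocab = sorted({tok for s in (l_ala, d_ala) for tok in tokenize(s)})
```

Collapsing stereoisomers in this way shrinks the target vocabulary because chirality markers such as `@` never appear, at the cost of losing stereochemical detail in the predictions.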
Biopython (version 1.8.1, https://biopython.org/) and HMMER (version 3.4, http://www.hmmer.org/) were used to identify biosynthetic features (Pfam identifiers) as sequences of source tokens from the BGC sequences, by searching functional domains against the Pfam 51 database (version 36.0) with an e-value threshold of < 0.01. We identified 1,020 distinct biosynthetic features (unique Pfam identifiers) in the dataset by annotating the functional domains of the biosynthetic enzymes encoded within the BGCs, forming the vocabulary of source tokens for the LLM.

The source data of the known BGCs dataset were obtained from the curation by the PRISM 4 authors 18 (https://doi.org/10.5281/zenodo.3985982/), and data pairs were prepared using the same structural and biosynthetic feature annotation procedures as for the training dataset of the DeepSeMS model. We used BLAST 52 to exclude BGCs with a sequence identity greater than 95% to the BGCs in the training set, which yielded the dataset of known BGCs for evaluating model accuracy.

The source data of the cryptic BGCs dataset were obtained from the Malaspina Deep Metagenome-Assembled Genomes 42 (https://malaspina-public.gitlab.io/malaspina-deep-ocean-microbiome/). We searched the biosynthetic regions of the MAGs using antiSMASH 19 (version 7.0.0) with the ‘genefinding-tool’ option set to ‘prodigal’ and default parameters otherwise, which yielded the dataset of cryptic BGCs for evaluating the generalization ability of the model.

The source data of the global ocean microbiome were obtained from OMD 41 (https://microbiomics.io/ocean/). The biosynthetic region search on the global ocean microbiome followed the same procedure as for the cryptic BGCs dataset, forming the global ocean BGCs dataset for large-scale mining of novel SMs. Sample metadata of the global ocean microbiome were also obtained from OMD for the analysis of the geographical coverage, ocean diversity, and ecological distribution characteristics of the resulting SMs.
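As a rough illustration of how domain hits might become a source-token sequence, the sketch below filters hits by the e-value threshold given above and emits Pfam identifiers. It is not the DeepSeMS code: the `DomainHit` record, its field names, the ordering of tokens by position in the cluster, and the special tokens are all assumptions, and the hits are made-up values rather than real HMMER output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainHit:
    pfam_id: str    # Pfam accession of the matched domain
    start: int      # assumed field: start coordinate of the domain within the BGC
    evalue: float   # e-value reported by the domain search

def hits_to_source_tokens(hits, max_evalue=0.01):
    """Keep hits with e-value < max_evalue and return Pfam IDs in cluster order (assumed)."""
    kept = [h for h in hits if h.evalue < max_evalue]
    kept.sort(key=lambda h: h.start)
    return [h.pfam_id for h in kept]

def build_vocabulary(token_sequences, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Map every distinct token to an integer index; the special tokens are hypothetical."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

# Illustrative hits only (not taken from any real search result).
hits = [
    DomainHit("PF00550", 2100, 1e-12),
    DomainHit("PF00109", 150, 3e-40),
    DomainHit("PF13193", 900, 0.5),    # rejected: e-value above the threshold
    DomainHit("PF02801", 600, 2e-25),
]
tokens = hits_to_source_tokens(hits)
vocab = build_vocabulary([tokens])
```

Applied to every BGC in a dataset, the same vocabulary builder would enumerate the distinct Pfam identifiers, which is the kind of count reported above as 1,020 source tokens.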
Data augmentation

The data augmentation procedure was implemented using RDKit in Python 31 (version 3.10, https://www.python.org/). We used the ‘MolToSmiles’ function to generate randomized SMILES strings by setting the ‘doRandom’ parameter to ‘True’. The structural features-aligned data augmentation was implemented by generating the molecular scaffold of an input SMILES string, then randomly selecting a starting node and topological path to enumerate the molecular subgraphs outside the scaffold of the input molecule (the substituent groups), and combining the enumerated subgraphs with the scaffold subgraph into a new molecular graph. This yields molecular graphs for the input molecule that are chemically identical but differ in atom order, while the atom order of the scaffold stays consistent. From these new molecular graphs, we can then generate SMILES strings that differ in expression but are aligned in structural features.

Specifically, the chemical scaffold of an input SMILES string was generated by the ‘GetScaffoldForMol’ function using Murcko-type decomposition, and the atom indices of the scaffold were then matched by the ‘GetSubstructMatches’ function. The atom indices outside the scaffold were renumbered randomly and then combined with the atom indices of the scaffold to form a new atom ordering of the molecular graph. The structural features-aligned SMILES string was finally obtained from the new molecular graph by the ‘RenumberAtoms’ and ‘MolToSmiles’ functions. We used this data augmentation to generate up to 100 randomized and structural features-aligned SMILES strings for each molecule in the training dataset.

Model training

The DeepSeMS model was implemented as a sequence-to-sequence language model based on the Transformer architecture 26, 27. During model training, both source and target sequences in the training dataset were converted into embeddings in batches.
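Returning briefly to the data augmentation procedure described above, the following sketch shows one way the named RDKit calls could fit together. It is our reading of the description, assuming RDKit is installed, and not the authors' implementation: the function names `randomized_smiles` and `scaffold_aligned_smiles` are hypothetical, and placing the scaffold atoms first in the new atom order is an assumption about what "consistent" scaffold ordering means.

```python
import random
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def randomized_smiles(mol, n=5):
    """Randomized SMILES via MolToSmiles with doRandom=True, as named in the text."""
    return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]

def scaffold_aligned_smiles(smiles, rng):
    """Shuffle only the non-scaffold atoms, keeping scaffold atoms first in a fixed order."""
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)  # Murcko-type decomposition
    scaffold_idx = []
    if scaffold.GetNumAtoms() > 0:
        matches = mol.GetSubstructMatches(scaffold)
        if matches:
            scaffold_idx = list(matches[0])
    in_scaffold = set(scaffold_idx)
    rest = [a.GetIdx() for a in mol.GetAtoms() if a.GetIdx() not in in_scaffold]
    rng.shuffle(rest)                                  # random order for substituent atoms
    new_order = scaffold_idx + rest                    # new index i takes old atom new_order[i]
    renumbered = Chem.RenumberAtoms(mol, new_order)
    # canonical=False writes atoms following the new numbering instead of canonical ranks.
    return Chem.MolToSmiles(renumbered, canonical=False)

rng = random.Random(0)
smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used only as a small test molecule
aligned = [scaffold_aligned_smiles(smi, rng) for _ in range(5)]
shuffled = randomized_smiles(Chem.MolFromSmiles(smi), n=5)
```

Every string produced this way should parse back to the same molecule, so re-canonicalising a variant is a convenient sanity check before it enters the training set.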
These embeddings were then passed through a positional encoding layer to retain the order information of the sequence. The embedded input sequence was processed by the encoder to generate a context-rich representation. This representation was then used by the decoder, along with the embedded target sequence, to predict the next item in the sequence. Masks were used to prevent the model from accessing future tokens in the target sequence prematurely. The output of the decoder was transformed through a linear layer and a softmax to predict the probability of the next token in the target sequence. The predictions of the model were compared against the actual target sequence using a loss function, and the gradients of the loss were backpropagated through the model to update the weights. The model was trained with a dropout rate of 0.1 for regularization, AdamW as the optimizer, cross-entropy as the loss function, a learning rate of 0.0001, a batch size of 64, and default Transformer parameters otherwise. After each training epoch, the model state was validated on the validation dataset. We employed an early stopping strategy to avoid over-fitting, which stopped model training when the validation performance had not improved for 10 epochs. The model was implemented in PyTorch (version 2.1.0, https://pytorch.org/) in Python and was trained on up to eight ‘NVIDIA RTX 4090’ GPUs.

Model predicting

During model prediction, a target mask was generated to prevent the model from accessing future positions in the sequence. Starting from the start-of-sequence (SOS) token, the model generated the next token based on the input sequence of BGC features and the target mask. The token with the highest probability was selected and appended to the output sequence, which served as the target tokens for the next generation step. Prediction stopped when the predicted token was the end-of-sequence (EOS) token.
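The greedy decoding loop described above can be sketched in plain Python. This is a minimal illustration rather than the DeepSeMS implementation: `next_token_log_probs` is a hypothetical stand-in for the trained Transformer decoder, `toy_model` is a scripted fake used only to make the example runnable, and counting the generated tokens without EOS as the sequence length is an assumption. The summed log probability is kept because it is the numerator of the length-normalised prediction score (Equation 1, with the length penalty of 0.6 stated in this section).

```python
import math

SOS, EOS = "<sos>", "<eos>"

def greedy_decode(next_token_log_probs, source_tokens, max_len=256):
    """Pick the most probable token at each step until EOS (or max_len) is reached."""
    output = [SOS]
    total_logp = 0.0
    while len(output) < max_len:
        log_probs = next_token_log_probs(source_tokens, output)  # dict: token -> log prob
        token = max(log_probs, key=log_probs.get)
        total_logp += log_probs[token]
        if token == EOS:
            break
        output.append(token)
    return output[1:], total_logp

def prediction_score(total_logp, length, length_penalty=0.6):
    """Sum of log probabilities divided by length ** length_penalty."""
    return total_logp / (max(length, 1) ** length_penalty)

def toy_model(source_tokens, prefix):
    """Scripted stand-in for a decoder: always prefers C, C, O, then EOS."""
    script = ["C", "C", "O", EOS]
    target = script[len(prefix) - 1]
    candidates = ["C", "O", "N", EOS]
    return {t: math.log(0.9 if t == target else 0.1 / 3) for t in candidates}

tokens, total_logp = greedy_decode(toy_model, ["PF00109", "PF00550"])
score = prediction_score(total_logp, len(tokens))
```

Because only the single best token is followed at each step, greedy decoding returns one structure per input; the score then gives a length-adjusted measure of how confident the model was along that path.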
The final output sequence was decoded into the SMILES string of the predicted SM structure based on the vocabulary of target tokens. We defined a prediction score to evaluate the output sequences generated by the model:

$$\text{Prediction score}=\frac{\sum \log\left(\text{probabilities}\right)}{\left(\text{length of sequence}\right)^{\text{length penalty}}} \tag{1}$$

where ‘$\sum \log\left(\text{probabilities}\right)$’ is the sum of the log probabilities of the tokens selected during sequence generation, ‘length of sequence’ is the length of the generated sequence (the total number of generated tokens), and ‘length penalty’ is a factor, set to 0.6 in this study, that adjusts the score for the length of the generated sequence to balance the trade-off between sequence length and cumulative probability. Within each prediction, the prediction scores serve as indicators of the model's confidence in its output.

Model evaluation metrics

The following metrics for evaluating the performance of the DeepSeMS model and comparing it with existing methods were implemented using RDKit in Python:

Validity: the chemically valid SMILES strings, i.e., those that can be successfully processed by the ‘MolToSmiles’ function.

Structural similarity: the Tanimoto coefficient of the chemical fingerprints (Morgan fingerprints with a radius of 2 bonds) of two structures, calculated by the ‘TanimotoSimilarity’ and ‘GetMorganFingerprint’ functions 53, 54.

Molecular scaffold: the core structure or framework of a molecule, generated by the ‘GetScaffoldForMol’ function using Murcko-type decomposition.

Molecular shape: the generic framework of a molecular scaffold, generated by the ‘MakeScaffoldGeneric’ function.
Molecular weight, number of heavy atoms, and QED (quantitative estimate of drug-likeness): molecular properties calculated by the ‘MolWt’, ‘HeavyAtomCount’, and ‘qed’ functions, respectively.

Chemical space: the distribution of the Morgan fingerprints of the SM structures, plotted with Matplotlib and Seaborn.

Synthetic accessibility: the synthetic accessibility score of a molecule, estimated from molecular complexity and fragment contributions by SAscorer 55.

Genome mining and analysis

The large-scale mining of novel SMs from the global ocean microbiome was performed on the dataset of global ocean BGCs with the DeepSeMS model implemented in PyTorch in Python. The structural similarity of each generated global ocean SM structure to all known SMs in the MIBiG database was then calculated as the Tanimoto coefficient of Morgan fingerprints. We also defined a ‘molecular novelty score’ to evaluate the novelty of a generated SM structure:

$$\text{Molecular novelty score}=\left(1-\frac{Similarity-{Min}_{Similarity}}{{Max}_{Similarity}-{Min}_{Similarity}}\right)\times 100 \tag{2}$$

where ‘$Similarity$’ is the maximum Tanimoto coefficient of the generated structure to the structures of known SMs, ‘${Min}_{Similarity}$’ is the minimum among all the maximum Tanimoto coefficients between the structures of generated SMs and those of known SMs, and ‘${Max}_{Similarity}$’ is the maximum among these maximum Tanimoto coefficients. The molecular novelty score is thus a min-max-normalized percentage derived from the maximum similarity to known SM structures: the generated SM least similar to any known SM scores 100 and the most similar scores 0, allowing a more intuitive analysis and evaluation of the molecular novelty of the generated global ocean SMs. Molecular scaffolds and shapes of the global ocean SMs were generated by the ‘GetScaffoldForMol’ and ‘MakeScaffoldGeneric’ functions using Murcko-type decomposition with RDKit in Python.
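A small worked example may help make the molecular novelty score (Equation 2) concrete. The sketch below takes, for each generated structure, its maximum Tanimoto similarity to the known SMs and applies the min-max normalisation; the function name `novelty_scores` and the input values are hypothetical, and raising an error when all similarities are equal is our own choice, since the text does not say how that case is handled.

```python
def novelty_scores(max_similarities):
    """Apply Equation 2 to a list of per-structure maximum Tanimoto similarities."""
    lo, hi = min(max_similarities), max(max_similarities)
    if hi == lo:
        # Assumption: the normalisation is undefined when every value is identical.
        raise ValueError("need at least two distinct similarity values")
    return [(1 - (s - lo) / (hi - lo)) * 100 for s in max_similarities]

# Made-up similarities for three generated structures.
scores = novelty_scores([0.2, 0.5, 0.8])
```

With these inputs the least similar structure (0.2) receives a score of about 100, the middle one about 50, and the most similar (0.8) about 0, which shows that the score ranks structures relative to the rest of the dataset rather than on an absolute scale.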
The geographical coverage, ocean diversity, and ecological distribution characteristics of the global ocean SMs were analysed according to the metadata of the global ocean microbiome from the OMD database 41. The global ocean map was plotted in R (version 4.1.2) using Leaflet (version 2.2.2) with the ‘OceanBasemap’ from Esri. ‘Diversity’ was the percentage of unique SM structures within an ocean province, and ‘Uniqueness’ was the percentage of unique SM structures across the global oceans; uniqueness was calculated from the canonical SMILES of the structures generated by RDKit. O, N, and C contents were calculated as the molecular weight percentage of oxygen, nitrogen, and carbon atoms in the global ocean SM structures.

The structure-based virtual screening of the global ocean SMs for functional groups with known antibiotic activity used the following criteria:
1) a 2-azetidinone substructure in a bicyclic scaffold, as β-lactams;
2) one or more aminosugars, as aminoglycosides;
3) a tetracene scaffold, as tetracyclines;
4) a 2-oxazolidone scaffold, as oxazolidinones;
5) a dichloroacetamide substructure, as chloramphenicols;
6) a lactone substructure in a macro ring with 14 or more atoms, as macrolides;
7) an amide substructure and an aromatic moiety in a macro ring with 14 or more atoms, as ansamycins;
8) a 4-quinolone scaffold, as quinolones.

The virtual screening was also implemented using RDKit in Python by checking whether the structures contain the above functional groups with known antibiotic activity. The SM structures in Fig. 4 were visualized, and their stereochemical information was calculated, in ChemDraw (version 23.1.1).
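Two of the screening criteria above (β-lactams and macrolides) can be sketched with RDKit substructure matching. This is an illustration under our own assumptions, not the screening code used in the study: the SMARTS patterns, the reading of "bicyclic scaffold" as a lactam ring atom shared with a second ring, and the function names are all hypothetical choices. The remaining classes would follow the same pattern with their own substructure queries.

```python
from rdkit import Chem

# Assumed SMARTS patterns (our interpretation of the criteria, not the authors').
BETA_LACTAM = Chem.MolFromSmarts("O=C1CCN1")            # 2-azetidinone ring
RING_ESTER = Chem.MolFromSmarts("[#6;R](=O)[#8;R;X2]")  # ester C and O both in rings

def is_beta_lactam_like(mol):
    """2-azetidinone whose ring shares at least one atom with a second ring."""
    ring_info = mol.GetRingInfo()
    for match in mol.GetSubstructMatches(BETA_LACTAM):
        ring_atoms = match[1:]  # first query atom is the exocyclic carbonyl oxygen
        if any(ring_info.NumAtomRings(i) >= 2 for i in ring_atoms):
            return True
    return False

def is_macrolide_like(mol, min_ring_size=14):
    """Lactone whose carbonyl carbon and ring oxygen sit in a ring of >= min_ring_size atoms."""
    big_rings = [set(r) for r in mol.GetRingInfo().AtomRings() if len(r) >= min_ring_size]
    for c_idx, _, o_idx in mol.GetSubstructMatches(RING_ESTER):
        if any(c_idx in ring and o_idx in ring for ring in big_rings):
            return True
    return False

# Small test molecules (stereochemistry omitted).
penicillin_g = Chem.MolFromSmiles("CC1(C)SC2C(NC(=O)Cc3ccccc3)C(=O)N2C1C(=O)O")
azetidinone = Chem.MolFromSmiles("O=C1CCN1")                    # monocyclic lactam
macrolactone = Chem.MolFromSmiles("O=C1" + "C" * 12 + "O1")    # 14-membered lactone
valerolactone = Chem.MolFromSmiles("O=C1CCCCO1")                # 6-membered lactone
```

Substructure flags of this kind only indicate that a predicted structure carries a motif associated with an antibiotic class; they say nothing about measured activity.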
Web server implementation

The DeepSeMS web server was implemented with Django (version 4.2.6, https://www.djangoproject.com/) as the website framework, SQLite (version 3.41.2, https://www.sqlite.org/) as the database, Python for backend applications, and Docker (version 24.0.6, https://www.docker.com/) as the deployment environment. Web pages were developed with JavaScript, AJAX, jQuery, and Bootstrap (version 5.3.2, https://v5.bootcss.com/). We also used RDKit in Python for chemical structure visualization, molecular property calculation, comparison with known SMs, and antibiotic potential analysis.

Declarations

DATA AVAILABILITY

All the datasets used in this work were obtained from public data repositories and are specified in the Methods section. Source data of the figures and tables are provided in the Source Data file. The training dataset of DeepSeMS is available at the GitHub repository https://github.com/lab-of-biochemai/deepsems/data/. The dataset of the global ocean SMs is available as a Source Data file and at the DeepSeMS web server https://biochemai.cstspace.cn/deepsems/downloads/.

CODE AVAILABILITY

The DeepSeMS web server and the global ocean SMs resource are freely available with no login requirements at https://biochemai.cstspace.cn/deepsems/. The source code of DeepSeMS is available at the GitHub repository https://github.com/lab-of-biochemai/deepsems/.

SOURCE DATA

Source data 1: The source data of Table 1 and Figures 2-4.
Source data 2: The dataset of the global ocean SMs.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (92251307, 92451303, 32470098, 82170542), the National Key Research and Development Program of China (2023YFA0915501), and the Informatization Plan of Chinese Academy of Sciences (CAS-WX2021SF-0307). The authors acknowledge the use of resources provided by Beijing PARATERA Tech Corp., Ltd. and China Science & Technology Cloud.
The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

AUTHOR CONTRIBUTIONS

N.J., R.Z., G.Z. (Guoping Zhao) and G.Z. (Guoqing Zhang) conceived and designed the study. T.X. and W.Y. drafted the manuscript. R.Z., W.L., J.L., Y.Z., P.Z., G.Z. (Guoqing Zhang), G.Z. (Guoping Zhao) and N.J. reviewed and edited the manuscript. All authors read and approved the final manuscript.

COMPETING INTERESTS

The authors declare no competing interests.

References

1. Clardy, J. & Walsh, C. Lessons from natural molecules. Nature 432, 829–837 (2004).
2. Xu, T. et al. NPBS database: a chemical data resource with relational data between natural products and biological sources. Database 2020, baaa102 (2020).
3. Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 83, 770–803 (2020).
4. Koehn, F. E. & Carter, G. T. The evolving role of natural products in drug discovery. Nat. Rev. Drug Discov. 4, 206–220 (2005).
5. Rodrigues, T., Reker, D., Schneider, P. & Schneider, G. Counting on natural products for drug design. Nat. Chem. 8, 531–541 (2016).
6. Vanni, C. et al. Unifying the known and unknown microbial coding sequence space. eLife 11, e67667 (2022).
7. Wirbel, J., Bhatt, A. S. & Probst, A. J. The journey to understand previously unknown microbial genes. Nature 626, 267–269 (2024).
8. Doroghazi, J. R. et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nat. Chem. Biol. 10, 963–968 (2014).
9. Ziemert, N., Alanjary, M. & Weber, T. The evolution of genome mining in microbes – a review. Nat. Prod. Rep. 33, 988–1005 (2016).
10. Scherlach, K. & Hertweck, C. Mining and unearthing hidden biosynthetic potential. Nat. Commun. 12, 3864 (2021).
11. Walsh, C. T. & Fischbach, M. A. Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc. 132, 2469–2493 (2010).
12. Milshteyn, A., Schneider, J. S. & Brady, S. F. Mining the metabiome: identifying novel natural products from microbial communities. Chem. Biol. 21, 1211–1223 (2014).
13. Li, M. H. T., Ung, P. M. U., Zajkowski, J., Garneau-Tsodikova, S. & Sherman, D. H. Automated genome mining for natural products. BMC Bioinf. 10, 185 (2009).
14. Medema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–W346 (2011).
15. Skinnider, M. A. et al. Genomes to natural products PRediction Informatics for Secondary Metabolomes (PRISM). Nucleic Acids Res. 43, 9645–9662 (2015).
16. Zierep, P. F. et al. SeMPI: a genome-based secondary metabolite prediction and identification web server. Nucleic Acids Res. 45, W64–W71 (2017).
17. Hannigan, G. D. et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 47, e110 (2019).
18. Skinnider, M. A. et al. Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nat. Commun. 11, 6058 (2020).
19. Blin, K. et al. antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Res. 51, W46–W50 (2023).
20. Chen, J. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371–379 (2024).
21. Logares, R. Decoding populations in the ocean microbiome. Microbiome 12, 67 (2024).
22. Walsh, C. T. & Tang, Y. Natural Product Biosynthesis: Chemical Logic and Enzymatic Machinery Ch. 1 (Royal Society of Chemistry Publishing, 2017).
23. Donadio, S., Staver, M. J., McAlpine, J. B., Swanson, S. J. & Katz, L. Modular organization of genes required for complex polyketide biosynthesis. Science 252, 675–679 (1991).
24. Schwarzer, D., Mootz, H. D. & Marahiel, M. A. Exploring the impact of different thioesterase domains for the design of hybrid peptide synthetases. Chem. Biol. 8, 997–1010 (2001).
25. Bernhardt, R. Cytochromes P450 as versatile biocatalysts. J. Biotechnol. 124, 128–145 (2006).
26. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017), 5998–6008 (2017). https://arxiv.org/abs/1706.03762
27. Wolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 EMNLP (Systems Demonstrations), 38–45 (2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6
28. Xu, T. et al. Neural machine translation of chemical nomenclature between English and Chinese. J. Cheminform. 12, 50 (2020).
29. Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. Adv. Neural Inf. Process. Syst. 35, 30016–30030 (2022).
30. Saldívar-González, F. I., Aldas-Bulos, V. D., Medina-Franco, J. L. & Plisson, F. Natural product drug discovery in the artificial intelligence era. Chem. Sci. 13, 1526–1546 (2021).
31. Diao, Y. et al. Macrocyclization of linear molecules by deep learning to facilitate macrocyclic drug candidates discovery. Nat. Commun. 14, 4552 (2023).
32. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 1–113 (2023).
33. Terlouw, B. R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 51, D603–D610 (2023).
34. Outeiral, C. & Deane, C. M. Codon language embeddings provide strong signals for use in protein engineering. Nat. Mach. Intell. 6, 170–179 (2024).
35. Weininger, D. SMILES, a chemical language and information system. J. Chem. Inf. Model. 28, 31–36 (1988).
36. Schwaller, P. et al. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
37. Skinnider, M. A. et al. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3, 759–770 (2021).
38. Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).
39. Polykovskiy, D. et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol. 11, 565644 (2020).
40. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
41. Paoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111–118 (2022).
42. Acinas, S. G. et al. Deep ocean metagenomes provide insight into the metabolic architecture of bathypelagic microbial communities. Commun. Biol. 4, 604 (2021).
43. Becerril, A. et al. Uncovering production of specialized metabolites by Streptomyces argillaceus: Activation of cryptic biosynthesis gene clusters using nutritional and genetic approaches. PLoS One 13, e0198145 (2018).
44. Zheng, X. et al. Biosynthesis of the pyrrolidine protein synthesis inhibitor anisomycin involves novel gene ensemble and cryptic biosynthetic steps. Proc. Natl. Acad. Sci. U. S. A. 114, 4135–4140 (2017).
45. Wills, T. J. & Lipkus, A. H. Structural Approach to Assessing the Innovativeness of New Drugs Finds Accelerating Rate of Innovation. ACS Med. Chem. Lett. 11, 2114–2119 (2020).
46. Wong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177–185 (2024).
47. Sadeghi, A. et al. Diversity of the ectoines biosynthesis genes in the salt tolerant Streptomyces and evidence for inductive effect of ectoines on their accumulation. Microbiol. Res. 169, 699–708 (2014).
48. Pastor, J. M. et al. Ectoines in cell stress protection: uses and biotechnological production. Biotechnol. Adv. 28, 782–801 (2010).
49. Widderich, N. et al. Biochemical properties of ectoine hydroxylases from extremophiles and their wider taxonomic distribution among microorganisms. PLoS One 9, e93809 (2014).
50. Wishart, D. S. et al. NP-MRD: the Natural Products Magnetic Resonance Database. Nucleic Acids Res. 50, D665–D677 (2022).
51. Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
52. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinf. 10, 421 (2009).
53. Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 7, 20 (2015).
54. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
55. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).

Additional Declarations

There is NO Competing Interest.

Supplementary Files

SupplementaryInformation.docx: SUPPLEMENTARY INFORMATION, Supplementary Figs. 1-5, Tables 1-4.
Authors and affiliations

Na Jiao (corresponding author), School of Life Sciences, Fudan University. ORCID: https://orcid.org/0000-0003-3976-6313
Tingjun Xu. ORCID: https://orcid.org/0000-0002-5529-1875
Yuwei Yang, School of Life Sciences and Technology, Tongji University
Ruixin Zhu, School of Life Sciences and Technology, Tongji University
Weili Lin, School of Life Sciences and Technology, Tongji University
Jixuan Li, School of Life Sciences and Technology, Tongji University
Yan Zheng, School of Life Sciences, Fudan University
Peng Zhang, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences
Guoqing Zhang, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences
Guoping Zhao, School of Life Sciences, Fudan University

Preprint DOI: https://doi.org/10.21203/rs.3.rs-6233440/v1
Published version: https://doi.org/10.1038/s43588-026-00983-1

Figure 1. Overview of the modelling strategy of DeepSeMS. a, The DeepSeMS model was implemented based on Transformer architecture. The model identified biosynthetic features of input BGCs as source sequences by sequence representation, and was trained on dataset of known BGCs and SM structures processed by data augmentation strategy. b, Illustration of the sequence representation by amino acids, functional domains, and enzymes. c, Illustration of the data augmentation strategy by randomized and structural features-aligned SMILES enumeration.

Figure 2. Comparison of DeepSeMS with existing methods on validation dataset of cryptic BGCs (n=940). a, Number of BGCs predicted at least one chemically valid structure by each method, annotated with the percentage of at least one chemically valid structure predicted for each BGC by each method (Success rate). b, Number of chemically valid SM structures predicted by each method, annotated with the percentage of unique structures predicted by each method (Uniqueness). c, Chemical space of the predicted SM structures by each method. d, Distribution of molecular weight of predicted SM structures by each method. e, Distribution of synthetic accessibility (synthetic accessibility score) of predicted SM structures by each method. f, Distribution of QED (quantitative estimate of drug-likeness) of predicted SM structures by each method. Source data are provided in Source Data file.

Figure 3. Molecular novelty and diversity of the global ocean SMs. a, Geographical distribution of the global ocean SMs. Globally distributed sampling sites were annotated by provinces of ocean with number of SM structures, and percentage of known structures, novel structures, novel scaffolds and novel shapes (Mutually exclusive statistics). The global ocean map was plotted in R using Leaflet by ‘OceanBasemap’ from Esri. b, Molecular diversity and uniqueness distribution of the global ocean SMs. Diversity was percentage of unique SM structures in the ocean province, uniqueness was percentage of unique SM structures in the global ocean SMs. c, Ecological distribution of the global ocean SMs. Ocean depth: EPI, Epipelagic layer (<200 m); MES, Mesopelagic layer (200~1,000 m); BAT, Bathypelagic layer (1000~4500 m); ABY, Abyssopelagic layer (>4500 m). Oxygen: LO, Low oxygen (<100 µmol/kg); MLO, Medium-Low oxygen (100~200 µmol/kg); MHO, Medium-High oxygen (200 ~ 300 µmol/kg); HO, High oxygen (>300 µmol/kg). Temperature: LT, Low temperature (<5 ℃); MLT, Medium-Low temperature (5~15 ℃); MHT, Medium-High temperature (15~25 ℃); HT, High temperature (>25 ℃). O, N, C contents were calculated based on molecular weight percentage of oxygen, nitrogen, carbon atoms in the SM structures. Source data are provided in Source Data file.

Figure 4. The biomedical application potential of the global ocean SMs. a, Examples of antibiotic candidates from the global ocean SMs that contain known antibiotic activity of functional groups as β-lactams, aminoglycosides, tetracyclines, oxazolidinones, chloramphenicols, macrolides, ansamycins, and quinolones. The structures were annotated with identifiers and antibiotic mechanisms: Target bacterial cell-wall, protein, RNA, and DNA synthesis. b, Structures of known natural cell protectant Ectoine and top 10 novel candidates from biosynthetic pathways of ectoine in the global ocean SMs. The structures were annotated with identifiers and molecular novelty scores (NS). ect_A/B/C, ectoine synthases. c, Top 10 novel SMs predicted from undefined BGC families. The structures were annotated with identifiers and molecular novelty scores (NS). Biosyn_core, core biosynthetic enzymes. Biosyn_add, additional biosynthetic enzymes.
Source data are provided in Source Data file.\u003c/p\u003e","description":"","filename":"4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6233440/v1/a8022d1d76a2d6db21ff5109.jpg"},{"id":80723667,"identity":"45de910f-472f-4e90-b6a9-3dafc2734311","added_by":"auto","created_at":"2025-04-16 11:28:18","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":161891,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSchematic overview of the AI-powered tools for accelerating novel SMs discovery. \u003c/strong\u003eThe DeepSeMS web server predicts SM structures from microbial BGCs, enabling data mining of novel SMs from uncultivated microbes. The global ocean SMs repository, a built-in resource of the web server, is for exploring the novel SMs from the global ocean microbiome discovered in this study, revealing the hidden biosynthetic potential of the cryptic population.\u003c/p\u003e","description":"","filename":"5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6233440/v1/f80f48c2cd3c4210899dfff8.jpg"},{"id":108495242,"identity":"e4a13b1e-d8fb-4672-9b48-f3ad9b0e2295","added_by":"auto","created_at":"2026-05-05 10:09:32","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1095387,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6233440/v1/9e2f948c-4d8d-49ab-b495-4b00cf032e9a.pdf"},{"id":80724389,"identity":"63f4aa10-c757-4af4-a761-e38204e828ac","added_by":"auto","created_at":"2025-04-16 11:36:17","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":3625688,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSUPPLEMENTARY INFORMATION\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSupplementary Figs. 
1-5, Tables 1-4.\u003c/p\u003e","description":"","filename":"SupplementaryInformation.docx","url":"https://assets-eu.researchsquare.com/files/rs-6233440/v1/3d422085c333aae64cd3f54d.docx"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"DeepSeMS: a large language model reveals hidden biosynthetic potential of the global ocean microbiome","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eSecondary metabolites (SMs), particularly those produced by microbes, are an essential class of natural compounds with diverse biological activities, including antimicrobial, anti-inflammatory, and anticancer properties, as well as therapeutic potential for treating metabolic diseases\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. These molecules are widely used as pharmaceutical agents, such as antibiotics, statins, and antitumor drugs\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. However, the majority of clinically used microbial-derived SMs were identified in cultured species, which represent less than 1% of the vast microbial diversity\u003csup\u003e\u003cspan additionalcitationids=\"CR6\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. 
Metagenomic sequencing has enabled the characterization of numerous genomes from uncultured and unknown species across diverse environments, uncovering vast numbers of biosynthetic gene clusters (BGCs) and their potential to produce novel SMs\u003csup\u003e\u003cspan additionalcitationids=\"CR9 CR10 CR11 CR12 CR13 CR14 CR15 CR16 CR17 CR18\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. Particularly, the global ocean, as the largest ecosystem on Earth, harbors an extraordinary diversity of microbial resources that remain largely underexplored, positioning it as a valuable reservoir for the discovery of novel SMs\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eNevertheless, the identification of novel SMs from microbial genomes remains challenging for existing methods like antiSMASH\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e and PRISM\u003csup\u003e15\u003c/sup\u003e. These rule-based methods often fail to generate the chemical structures of SMs produced by cryptic BGCs from metagenome-assembled genomes (MAGs)\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. This is mainly because of the highly context-dependent catalytic functions of homologous biosynthetic enzymes in BGCs\u003csup\u003e\u003cspan additionalcitationids=\"CR23 CR24\" citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. 
For example, cytochromes P450 (CYP450) can catalyze biosynthetic reactions of carbon hydroxylation, heteroatom oxygenation, dealkylation, epoxidation, aromatic hydroxylation, reduction, and dehalogenation, generating structurally distinct SMs\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. Limited substrate-specific biosynthetic enzyme libraries or virtual tailoring reactions cannot cover the non-canonical arrangements and combinations of biosynthetic enzymes in cryptic BGCs.\u003c/p\u003e \u003cp\u003eAdvanced artificial intelligence (AI) technology, particularly large language models (LLMs), has exhibited great capabilities in understanding, generating, and manipulating sequence context\u003csup\u003e\u003cspan additionalcitationids=\"CR27 CR28\" citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. The exceptional generative capabilities inherent in LLMs provide significant potential to accurately identify biosynthetic functions of enzymes encoded in BGCs from known examples. Additionally, these models can automatically assemble complex chemical structures of various SMs, as in natural language translation\u003csup\u003e\u003cspan additionalcitationids=\"CR31\" citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. However, significant differences exist between the processing of natural language sequences and biological sequences, particularly in terms of their feature recognition, encoding, and decoding. These differences present challenges in representing biological sequences effectively within LLMs. Furthermore, achieving high precision and generalization ability in LLMs typically requires large datasets for training. 
Yet, the availability of known BGCs and experimentally verified SMs remains relatively scarce\u003csup\u003e\u003cspan additionalcitationids=\"CR32\" citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. While several factors contribute to this gap, the sequence representation and the scarcity of comprehensive training datasets are considered to be the most critical ones.\u003c/p\u003e \u003cp\u003eIn this work, we trained a language model named DeepSeMS (deep language model for secondary metabolite structures prediction) to automatically generate the chemical sequences of an SM from input BGC sequences. To solve the sequence representation problem, we represented each BGC sequence as the functional domains of the biosynthetic enzymes encoded in the BGC. Additionally, we employed a data augmentation strategy to construct a refined dataset with sufficient quantity and superior quality, thereby addressing the dataset gap. Evaluations demonstrated that DeepSeMS significantly outperforms existing methods, generating chemical structures closer to real-world SMs and being applicable to more varied types of cryptic BGCs. We employed DeepSeMS for large-scale mining of SMs in the global ocean microbiome, successfully characterized more than 65,000 novel SMs with previously undocumented structural types, geographical coverage, ocean diversities, and ecological distribution characteristics, and identified various biomedical applications, including antibiotics, cell protectants, and innovative drug candidates. 
Finally, we developed a user-friendly web server, along with a built-in global ocean SMs repository (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://biochemai.cstspace.cn/deepsems/\u003c/span\u003e\u003cspan address=\"https://biochemai.cstspace.cn/deepsems/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) for the convenience of researchers.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e1. DeepSeMS algorithm\u003c/h2\u003e \u003cdiv id=\"Sec4\" class=\"Section3\"\u003e \u003ch2\u003e1.1 Model overview\u003c/h2\u003e \u003cp\u003eDeepSeMS model, based on Transformer architecture, was trained on a dataset of known BGCs and SM structures processed by data augmentation strategy\u003csup\u003e\u003cspan additionalcitationids=\"CR27 CR28\" citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. This model identified the features of input BGCs as source sequences by sequence representation, tokenized and embedded into the Transformer neural network. Subsequently, a chemical sequence decoder was used to convert the output target sequences to predicted SM structures (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). The neural network of Transformer consisted of six encoder and decoder layers, and eight attentional layers with embedding dimension of 512, leading to a total of approximately 100\u0026nbsp;million trainable parameters (Supplementary Fig.\u0026nbsp;1).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003e1.2 Sequence representation\u003c/h3\u003e\n\u003cp\u003eOne major challenge in predicting the chemical structures of SMs from BGCs was determining the most informative genomic input. 
Biological sequences can be represented at various levels, ranging from amino acids (basic building blocks) to functional domains (modular protein units), and enzymes (complete coding sequences). Among these, functional domains of the biosynthetic enzymes have been shown to be the most informative BGC representation\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. Functional domains are the fundamental units within BGCs that are responsible for substrate specificity, linear assembly, and tailoring reactions in SM biosynthesis\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Additionally, their sequential arrangement enables the model to capture contextual relationships between domains and the chemical structures of SMs. We further evaluated the efficiency of various sequence representation strategies by training the DeepSeMS model with input sequences of amino acids, enzymes, and functional domains (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb). Our experiments revealed that amino acid sequences were impractical due to their excessive length (up to 50,000 tokens), which exceeded the model\u0026rsquo;s capacity and required significant computational resources. Although enzyme sequences were shorter (up to 50 tokens), they suffered from substantial information loss, preventing the model from achieving training convergence. In contrast, sequences derived from functional domains, identified as protein families and domains (Pfam) identifiers through the biosequence analysis tool HMMER, provided an optimal balance of block size and manageable length (up to 250 tokens). 
This representation enabled the neural network to efficiently extract key features from BGCs.\u003c/p\u003e \u003cp\u003eFor the output, we used SMILES (Simplified Molecular Input Line Entry System) strings to represent the chemical structures of SMs. SMILES is widely regarded as the standard format for describing small molecule structures in chemical language models, making it ideal for capturing the structural diversity of SMs\u003csup\u003e\u003cspan additionalcitationids=\"CR36\" citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\n\u003ch3\u003e1.3 Data augmentation strategy\u003c/h3\u003e\n\u003cp\u003eTo address the training dataset gap, we first curated BGC sequences along with their corresponding SM structures from the MIBiG database, which is renowned for its large-scale collection of known BGC sequences and annotation of experimentally verified SM structures. From MIBiG database, we constructed an initial dataset (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;3,029) by data extraction and cleaning for model training. While the BGCs in the dataset represent a large biosynthetic diversity of known examples, the structures of SMs are so inadequate for the vast chemical space of small molecules that the LLM may be unable to identify the syntax of SMILES strings. In specific application scenarios of this study, data augmentation by using a batch of chemically identical but syntactically different SMILES strings can greatly improve the performance of deep learning methods\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e. 
A commonly used data augmentation for SMILES strings represents a molecule as a 2D graph, from which linear SMILES notations are derived by enumerating its nodes in a specific topological order, i.e., SMILES enumeration (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec)\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e. Although it exposes representations of the same molecule from various views, randomized SMILES enumeration can disorganize the notations in SMILES strings, which may lead to structural feature disorder and thereby hinder model performance\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e,\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e. Consequently, data augmentation in this study was also performed for target SMILES strings of the training dataset by structural features-aligned SMILES enumeration (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). The molecular scaffold, which constitutes the major structural feature of microbial SMs, significantly dictates their biological activities and functions\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. Therefore, this procedure (see Methods) not only keeps the major structural feature (scaffold) of an SM aligned in SMILES strings, but also augments the feature blocks of chemical sequences, which should thereby enhance model performance.\u003c/p\u003e \u003cp\u003eTo validate the data augmentation strategy, we first divided the initial dataset randomly into base training (90%, \u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;2,726) and internal validation (10%, \u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;303) datasets. 
Next, randomized SMILES enumeration and structural features-aligned SMILES enumeration were each applied to the base training dataset to train the DeepSeMS model. Compared with the model trained on the dataset without data augmentation, amplifying the training dataset by randomized SMILES enumeration resulted in a significant increase in the validity of generated SMILES strings. Moreover, structural features-aligned SMILES enumeration yielded significantly better performance in the mean structural similarity between the valid structures and the target structures (Supplementary Table\u0026nbsp;1). Specifically, for the model trained on the dataset of structural features-aligned SMILES enumeration, approximately one-quarter of the generated structures were completely identical to the target structures (structure recovery), and roughly half had scaffolds completely identical to those of the target structures (scaffold recovery). This indicates that the data augmentation strategy of structural features-aligned SMILES enumeration is advantageous not only for learning the fundamental syntax of the chemical language in SMILES strings by enhancing sample diversity, but also for emphasizing the structural features of SMs by keeping sample alignment.\u003c/p\u003e \u003cp\u003eConsequently, we trained the DeepSeMS model on a refined dataset (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;55,903), obtained by applying the data augmentation strategy of structural features-aligned SMILES enumeration to the initial dataset, using ten-fold cross-validation (Supplementary Fig.\u0026nbsp;2). However, the performance of the model in each fold is determined by the split of the dataset, which introduces significant randomness. Therefore, the best performance checkpoint of each fold was adopted as the application version of the DeepSeMS model, aiming to generate top-10 outputs for the prediction of SM structures for each input BGC sequence. 
As a result, the model achieved up to 85.71% validity of the generated SMILES strings and 0.85 structural similarity between the valid structures and the target structures on the ten-fold cross-validation (Supplementary Table\u0026nbsp;2).\u003c/p\u003e\n\u003ch3\u003e2. Evaluation of accuracy and generalization ability on external validation datasets\u003c/h3\u003e\n\u003cp\u003eIn order to evaluate DeepSeMS model more comprehensively, we utilized two external validation datasets: \u0026lsquo;Known BGCs\u0026rsquo; (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;326) with chemical structures of experimentally verified SMs for evaluation of accuracy; \u0026lsquo;Cryptic BGCs\u0026rsquo; (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;940) without chemical structure of SMs for evaluation of generalization ability. The known BGCs dataset was derived from the \u0026lsquo;gold standard\u0026rsquo; BGCs manually curated by PRISM 4 authors\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e, which is a comprehensive dataset of prokaryotic BGCs linked to experimentally verified SMs with unambiguously assigned chemical structures. We excluded BGCs that have a sequence similarity greater than 95% to the ones in the training dataset of DeepSeMS model to form the known BGCs dataset. On the other hand, in view of the vast biosynthetic diversity exhibited by microbes in the ocean\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e, we believe the BGCs derived from ocean microbiome would be ideal material to test DeepSeMS model for exploring previously unrecognized SMs, especially those cryptic ones from bathypelagic habitats\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. 
Thus, we obtained the Malaspina Deep Metagenome-Assembled Genomes (MDeep-MAGs)\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e constructed from 58 metagenomes to search for biosynthetic regions, which yielded an appropriate quantity of BGCs to form the cryptic BGCs dataset representing diverse biosynthetic pathways of the bathypelagic microbial communities. We also evaluated the performance of the DeepSeMS model against the two most widely used methods, namely antiSMASH\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e and PRISM\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e (SM structure prediction functions).\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Prediction accuracy on the dataset of known BGCs\u003c/h2\u003e \u003cp\u003eEvaluation results on the validation dataset of known BGCs are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. The results demonstrated that DeepSeMS predicted more accurate structures of SMs for the known BGCs compared to existing methods, antiSMASH 7 and PRISM 4. DeepSeMS successfully predicted at least one chemically valid SM structure for 318 of 326 BGCs (97.55%), and the generated structures are more similar to the ground truth for various BGC types (Supplementary Fig.\u0026nbsp;3). 
Notably, DeepSeMS had predicted 134 (41.10%) chemically identical structures to the ground truth, and over half (53.68%) of the predicted structures have the same scaffolds to the real-world SMs, which indicate that DeepSeMS has greatly improved the accuracy of SMs structure prediction than the other two methods.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison of DeepSeMS model with existing methods on validation dataset of the known BGCs (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;326). The best results are bolded.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSuccess rate\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eStructural similarity\u003csup\u003e\u003cspan citationid=\"CR2\" 
class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eScaffold similarity\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eStructure recovery\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eScaffold recovery\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eantiSMASH 7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e63.50%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.00%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e1.93%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePRISM 4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e88.96%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.45\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.42\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e8.11%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c6\"\u003e \u003cp\u003e16.87%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeMS\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e97.55%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.60\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.63\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e41.10%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e53.68%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003csup\u003e \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e \u003c/sup\u003eThe percentage of at least one chemically valid structure predicted for each BGC in the validation dataset by each method. \u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003eThe mean structural similarity between the valid structures and the ground truth. \u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003eThe mean scaffold similarity between the valid structures and the ground truth. \u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003eThe percentage of chemically identical structures to the ground truth. \u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003eThe percentage of chemically identical scaffolds to the ground truth. 
Source data are provided in Source Data file.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003e2.2 Generalization ability on the dataset of cryptic BGCs\u003c/h3\u003e\n\u003cp\u003eA comparative study further confirmed that DeepSeMS achieved significant improvement in mining SMs from cryptic BGCs (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, Supplementary Table\u0026nbsp;3). DeepSeMS successfully predicted at least one chemically valid SM structure for 908 out of 940 BGCs (96.60%) in the validation dataset of cryptic BGCs. This represents an increase of approximately 80 percentage points over antiSMASH 7, which predicted 189 structures for 159 (16.91%) BGCs, and of at least 50 percentage points over PRISM 4, which predicted 455 structures for 203 (46.45%) BGCs (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea). The total number of predicted chemically valid SM structures and the percentage of chemically unique structures (uniqueness) indicate that DeepSeMS (5,104 valid SM structures with 78.66% uniqueness) exhibits remarkably higher molecular novelty than the other methods (PRISM 4: 455 valid SM structures with 62.42% uniqueness; antiSMASH 7: 189 valid SM structures with 24.87% uniqueness) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). The chemical space of the predicted SM structures by each method (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec) shows that DeepSeMS is able to expand the chemical space of SMs from a relatively small amount of training data. Microbial SMs are primarily small molecules with molecular weights ranging from 300 to 500, and the distribution of molecular weight (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed) generated by DeepSeMS aligns well with this range. 
The distributions of synthetic accessibility (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee) and quantitative estimate of drug-likeness (QED) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ef) of the SM structures predicted by DeepSeMS also correspond to the structural complexity of SMs in nature. This demonstrates the capability of the LLM-driven method to generate molecules with a variety of molecular scaffolds, functional groups, and ring systems that mimic the diverse and intricate structures observed in natural products. Collectively, the comparative analysis illustrates that DeepSeMS generated a wider array of complex structures on the dataset of cryptic BGCs than the other two methods.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFurthermore, DeepSeMS improved the ability to predict SM structures from various BGC types (Supplementary Table\u0026nbsp;4). DeepSeMS successfully predicted at least one chemically valid SM structure for 38 of 39 BGC types, including the common types non-ribosomal peptide synthetases (NRPSs), polyketide synthases (PKSs), and terpenes, along with ribosomally encoded and posttranslationally modified peptides (RiPPs). \u0026lsquo;Hybrid\u0026rsquo; BGCs\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e, in which a single gene cluster produces a hybrid compound that combines two or more biosynthetic pathways, were also comprehensively predicted by DeepSeMS with a variety of SM structures. 
Notably, DeepSeMS successfully predicted SM structures for clusters containing biosynthetic regions that do not fit into currently known categories, which indicates the strong generalization ability of DeepSeMS to BGCs of undescribed families.\u003c/p\u003e \u003cp\u003eTo illustrate how DeepSeMS simulates the biosynthesis of microbial SMs to generate chemical structures, we inspected whether the predicted structures have structural features that are implicit in the BGCs. Sequence similarity analysis shows that the cluster \u0026lsquo;mp-deep_mag-0578_000009.region001\u0026rsquo; from the cryptic BGCs dataset encodes four classes of homologous biosynthetic enzymes: dehydrogenase, phytoene synthase, α-glucosidase, and glycosyl transferase. The dehydrogenase and phytoene synthase were reported to be responsible for the biosynthesis of the carbon chain of phytoene\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, and the α-glucosidase and glycosyl transferase would catalyse the glycosylation reaction\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e. The five SM structures generated by DeepSeMS (Supplementary Fig.\u0026nbsp;4) each have a phytoene-like, long-chain, unsaturated, aliphatic hydrocarbon scaffold with a terminal glucoside, which indicates that DeepSeMS generated the main structural features that are implicit in this cluster. Therefore, the case study supports the interpretability and practicability of DeepSeMS, which can provide new biological insights for biosynthetic pathway identification and chemical structure elucidation.\u003c/p\u003e\n\u003ch3\u003e3. 
The hidden biosynthetic potential of the global ocean microbiome\u003c/h3\u003e\n\u003cp\u003eThe global ocean harbors an extraordinarily rich diversity of microbial resources, the vast majority of which remain largely under-explored, thus making it a vast reservoir for the discovery of novel SMs\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. To tap into the biosynthetic potential of marine microbes, we obtained 27,139 MAGs from the most comprehensive available data resource, the Ocean Microbiomics Database (OMD)\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e, as the global ocean microbiome; these MAGs were reconstructed from more than 1,000 seawater samples collected on a global ocean scale. A comprehensive search for biosynthetic regions within OMD yielded 46,786 BGCs, forming the \u0026lsquo;global ocean BGCs\u0026rsquo; dataset. Leveraging DeepSeMS, we characterized 65,868 unique SM structures from this dataset to form the \u0026lsquo;global ocean SMs\u0026rsquo;. This dataset represents a substantial collection of the biosynthetic pathways and natural molecules produced by the global ocean microbiome.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Molecular novelty and diversity of the global ocean SMs\u003c/h2\u003e \u003cp\u003eThe hidden biosynthetic potential of the global ocean microbiome was revealed as a vast array of novel SMs with previously undocumented structural types, geographical coverage, ocean diversities, and ecological distribution characteristics. To analyse and evaluate the molecular novelty of the generated global ocean SMs, we defined a \u0026lsquo;molecular novelty score\u0026rsquo; (see Methods), which is the normalized percentage of the maximum similarity value to the structures of known SMs from the MIBiG database. 
The distribution of molecular novelty scores of the global ocean SMs illustrated that the global ocean microbiome encodes a large number of novel SMs (\u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;65,735) with structural differences from the known ones (Supplementary Fig.\u0026nbsp;4a), significantly expanding the chemical space of this resource library and providing more potential candidate molecules for natural drug discovery\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. Specifically, 97% of the global ocean SMs are novel structures, 69% of them have novel scaffolds, and 58% of them have novel shapes, which indicates the great structural novelty of the global ocean microbiome (Supplementary Fig.\u0026nbsp;4b and c).\u003c/p\u003e \u003cp\u003eFurthermore, from the view of the geographical coverage of the SMs (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea), we found that microbes across the various oceans exhibit high biosynthetic novelty (on average, over 96% novel structures); the Arctic Ocean contributes the maximum number of SMs (22,426), and the North Atlantic Ocean has a slight advantage in molecular novelty considering the percentage of novel shapes (61%). However, significant differences emerge in the molecular diversity and uniqueness of the SMs when comparing the Arctic Ocean and the Southern Ocean (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). Specifically, the Arctic Ocean has the highest uniqueness (72%), that is, the highest share of SMs not found in other oceans, while the Southern Ocean has the highest SM diversity (63%). 
An analysis of the ecological distribution of the global ocean SMs (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec) further confirms that the diverse marine environment cultivates more varied biosynthetic features in microbes, leading to a wide range of novel chemical structures of SMs. Notably, the SMs from the abyssopelagic layer (\u0026gt;\u0026thinsp;4500 m), low-oxygen (\u0026lt;\u0026thinsp;100 \u0026micro;mol/kg), and medium-low temperature (5\u0026thinsp;~\u0026thinsp;15 ℃) environments possess the highest molecular novelty and diversity. The oxygen (O), nitrogen (N), and carbon (C) contents of the global ocean SMs also vary with oceanic depth, oxygen, and temperature, which indicates that the microbes in the ocean have evolved a variety of SM structural types with differing element contents to adapt to diverse marine ecological environments. Specifically, we found that the prevalence of PKS-type BGCs in the deep ocean leads to a high O content of the SM molecules. However, high oxygen and temperature in seawater are associated not with a high O content but with a low N content and a high C content of the SM molecules, which corresponds to the low proportion of NRPS and the high proportion of terpene BGC types.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Biomedical application potential of the global ocean SMs\u003c/h2\u003e \u003cp\u003eSpecific novel SMs and biochemical pathways found in the global ocean microbiome help pave the way for innovative biomedical applications (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). 
To identify the antibiotic potential of the global ocean SMs, we implemented a structure-based virtual screening focusing on SMs that incorporate functional groups known for their antibiotic properties, including β-lactams, aminoglycosides, tetracyclines, oxazolidinones, chloramphenicols, macrolides, ansamycins, and quinolones. This screening uncovered 8,783 unique structures of SMs featuring diverse antibiotic-associated functional groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea). These structures correspond to various antibacterial mechanisms, including the inhibition of bacterial cell-wall, protein, RNA and DNA synthesis. Notably, these SMs possess novel side chains or substituents different from those of current antibiotics. Consequently, our findings unearthed the great antibiotic potential of the global ocean SMs, which cover a broad spectrum of pathogens, especially when the infectious agent is unknown or resistant to current antibiotics. Collectively, the global ocean SMs would be an ideal virtual library for exploring drug alternatives that target antibiotic-resistant bacteria, such as multidrug-resistant Gram-negative pathogens\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe also found that the abundance of the ocean-specific BGC type \u0026lsquo;ectoine\u0026rsquo;, whose product can serve as a compatible solute, indicates widespread microbial adaptation to the bathypelagic environment for preventing extreme osmotic stresses\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e. We discovered 2,078 natural molecules with novel structural types derived from the ectoine biosynthetic pathways in the global ocean SMs (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb). 
These ectoine molecules exhibit marked differences from the known SMs (average novelty score of 95.52) and could potentially serve as candidates for cell protectants in cosmetics, medicine, or biotechnology\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e,\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eNotably, we characterized 645 unique SMs with novel molecular scaffolds and shapes from the global ocean BGCs containing biosynthetic regions that do not fit into any documented category\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. These novel structural types of SMs provide new insight into undocumented biosynthetic pathways that may lead to the discovery of novel bioactive compounds with potential therapeutic applications\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. Specifically, four of the top ten novel SM structures (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec), namely \u003cb\u003en3\u003c/b\u003e, \u003cb\u003en6\u003c/b\u003e, \u003cb\u003en7\u003c/b\u003e, and \u003cb\u003en9\u003c/b\u003e, were predicted from the same BGC within the MAG \u0026lsquo;BGEO_SAMN07136520_METAG_FKHEEFFA\u0026rsquo;, derived from a seawater sample collected in the North Atlantic Ocean. This makes the bacterial host, identified as \u0026lsquo;\u003cem\u003eUBA7446 sp002478685\u003c/em\u003e\u0026rsquo;, a promising candidate for the discovery of novel marine natural products.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4. 
AI-powered tools for accelerating novel SM discovery\u003c/h2\u003e \u003cp\u003eTo facilitate analytical applications, we have deployed the DeepSeMS model as a web server (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e), freely accessible at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://biochemai.cstspace.cn/deepsems/\u003c/span\u003e\u003cspan address=\"https://biochemai.cstspace.cn/deepsems/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. The \u0026lsquo;DeepSeMS web server\u0026rsquo; enables users to submit prediction jobs for microbial genome mining of novel SMs by uploading BGC annotation files or by providing the antiSMASH job IDs of biosynthetic region searches. The web server generates comprehensive prediction results, including detailed biosynthetic features of the input BGCs, structural visualisations, prediction scores and molecular properties of the predicted SM structures. Additionally, it offers integrated functionalities for comparing predicted SMs with known compounds to assess molecular novelty and analyze antibiotic potential, enhancing the interpretability of the results and supporting further research into novel bioactive compounds. We also provide sample input, tutorial, and example pages that explain the data formatting requirements for job submission and the results returned by the web server. A job status page is provided at the time of submission, allowing the user to bookmark the job link or copy the job ID to access the results later.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFurthermore, to make the novel SMs discovered from the global ocean microbiome in this study easier to explore visually, we also deposited the dataset of the global ocean SMs as a built-in resource of the web server. 
The resource can be used for exploring various novel SMs by geographic location, marine environment, and BGC type, and for filtering and visualizing the biosynthetic pathways, molecular novelties, and antibiotic potentials of the cryptic population. For instance, searching the web server for cryptic NRPS BGCs from the Biogeotraces_GT15_GP13_TAN1109 sample set in the South Pacific Ocean returns five records. The first cluster was derived from the MAG of the bacterium \u0026lsquo;\u003cem\u003eArctic96AD-7\u003c/em\u003e sp002082305\u0026rsquo;, which lives in the bathypelagic layer (1008 m) at low temperature (4.94\u0026deg;C) and an oxygen content of 200.4 \u0026micro;mol/kg. Five novel SM structures of this BGC are displayed on the detail page, two of which are predicted to possess antibiotic potential as macrolides. Both the cluster and the result can be downloaded for further research.\u003c/p\u003e \u003c/div\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eIn this work, we demonstrate that training an LLM to convert biological sequences of BGCs into chemical sequences of SMs facilitates accurate prediction of the chemical structures encoded within microbial genomes. More relevantly, the model performs admirably in inferring the complexity and diversity of the biosynthetic pathways of novel SMs from the global ocean microbiome, unveiling the substantial biosynthetic potential of this yet-to-be-explored reservoir. Our investigation revealed that functional domains, as the most efficient representation of biosynthetic features in BGCs, empower DeepSeMS to achieve higher performance with a smaller model size and fewer computing resources. Additionally, the structural feature-aligned data augmentation strategy enables DeepSeMS to navigate the sparsely populated chemical space of known SMs and to generate novel SM structures from scarce training data. 
These advanced capabilities make DeepSeMS a powerful AI-driven tool for characterizing the chemical structures of unidentified SMs via its web server, and provide new insights into microbial natural product discovery. Moreover, the ability to explore the novel SMs from the global ocean microbiome as a built-in resource of the web server can accelerate the discovery of innovative biomedical applications for marine natural products.\u003c/p\u003e \u003cp\u003eHowever, it is essential to note certain limitations of DeepSeMS. First, the identification of biosynthetic features may be incomplete due to the coverage of the training dataset, i.e., one or more functional domains of a cluster may fail to be identified as biosynthetic features because of the sequence similarity threshold. As a result, the generated structures of an SM may be fragmentary. Furthermore, because of the lack of reliable methods to define the borders of BGCs in prokaryotic genomes based solely on sequence data, the structures generated by DeepSeMS for unidentified BGC types may not represent the main structural features (i.e., scaffolds) of the expected SMs. Thus, these results can only serve as clues for exploring novel SMs, and further experimental validation is needed to confirm the structures. To address these challenges, we propose a potential improvement to DeepSeMS: increasing the diversity of annotations for experimentally verified BGCs and SMs in the training dataset would augment the biosynthetic features and effectively improve the generalization ability of the LLM.\u003c/p\u003e \u003cp\u003eDespite these limitations, DeepSeMS outperforms other methods and offers a paradigm shift in genome mining. Traditional approaches struggle to identify cryptic BGCs of the microbes, which are either silent or expressed at very low levels under laboratory conditions. 
Our method leverages an LLM to automatically generate all possible structural types of SMs based on the biosynthetic features encoded within microbial genomes, showcasing the advantages of AI in directly linking genomic information to chemical output. Additionally, this study provides us with a new insight: given the success of the LLM in predicting SM structures from BGCs, we can reverse the sequence generation model to design biosynthetic enzymes based on specific SM structures, thereby leveraging AI and synthetic biology to explore the vast biosynthetic potential of the microbiomes.\u003c/p\u003e "},{"header":"METHODS","content":"\u003ch2\u003eData preparation\u003c/h2\u003e\n\u003cp\u003eThe training dataset of the DeepSeMS model was collected from the MIBiG database (version 3.1, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://mibig.secondarymetabolites.org/\u003c/span\u003e\u003c/span\u003e)\u003csup\u003e33\u003c/sup\u003e. BGC sequences and SMILES strings were paired according to the same accession number in the MIBiG sequence files and annotation files. Structural issues of SM structures in the dataset were identified by RDKit (version 2023.03.1, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.rdkit.org/\u003c/span\u003e\u003c/span\u003e) and addressed manually according to the references in the annotations. To reduce the complexity of the molecular generation model and ensure the validity of the generated SMILES sequences, canonical SMILES representations were generated using RDKit with the stereochemical information removed. We represented the SM structures using SMILES notations as sequences of target tokens, and identified 35 distinct structural features (unique SMILES notations) in the dataset to form the vocabulary of target tokens for the LLM\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e. 
Biopython (version 1.81, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://biopython.org/\u003c/span\u003e\u003c/span\u003e) and HMMER (version 3.4, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.hmmer.org/\u003c/span\u003e\u003c/span\u003e) were used to identify biosynthetic features (Pfam identifiers) as sequences of source tokens from BGC sequences by searching functional domains against the Pfam\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e (version 36.0) database with a threshold of e-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01. We identified 1,020 distinct biosynthetic features (unique Pfam identifiers) in the dataset by annotating functional domains of biosynthetic enzymes encoded within BGCs to form the vocabulary of source tokens for the LLM.\u003c/p\u003e\n\u003cp\u003eThe source data of the known BGCs dataset was obtained from the dataset curated by the authors of PRISM 4\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.3985982/\u003c/span\u003e\u003c/span\u003e), and data pairs were prepared using the same structural and biosynthetic feature annotation procedures as for the training dataset of the DeepSeMS model. Using BLAST\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e, we excluded BGCs with a sequence identity greater than 95% to the BGCs in the training set, which resulted in the dataset of known BGCs used to evaluate model accuracy. 
The source data of the cryptic BGCs dataset was obtained from the Malaspina Deep Metagenome-Assembled Genomes\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://malaspina-public.gitlab.io/malaspina-deep-ocean-microbiome/\u003c/span\u003e\u003c/span\u003e). We searched biosynthetic regions of the MAGs using antiSMASH\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e (version 7.0.0) with the \u0026lsquo;genefinding-tool\u0026rsquo; option set to \u0026lsquo;prodigal\u0026rsquo; and otherwise default parameters, which resulted in the dataset of cryptic BGCs used to evaluate the generalization ability of the model.\u003c/p\u003e\n\u003cp\u003eThe source data of the global ocean microbiome was obtained from OMD\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://microbiomics.io/ocean/\u003c/span\u003e\u003c/span\u003e). Biosynthetic region searching on the global ocean microbiome was performed using the same procedure as for the cryptic BGCs dataset to form the global ocean BGCs dataset for large-scale mining of novel SMs. Sample metadata of the global ocean microbiome was also obtained from OMD for analysis of the geographical coverage, ocean diversities, and ecological distribution characteristics of the resulting SMs.\u003c/p\u003e\n\u003ch2\u003eData augmentation\u003c/h2\u003e\n\u003cp\u003eThe data augmentation procedure was implemented using RDKit in Python\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e (version 3.10, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.python.org/\u003c/span\u003e\u003c/span\u003e). 
We used the \u0026lsquo;MolToSmiles\u0026rsquo; function to generate the randomized SMILES strings by setting the \u0026lsquo;doRandom\u0026rsquo; parameter to \u0026lsquo;True\u0026rsquo;. The structural features-aligned data augmentation was implemented by generating the molecular scaffold of an input SMILES string, then randomly selecting a starting node and topological path to enumerate the molecular subgraphs other than the scaffold of the input molecule (substituent groups), and combining the enumerated molecular subgraphs with the subgraph of the scaffold as a new molecular graph. We thereby obtain molecular graphs that have a new atom ordering but are chemically identical to the input molecule, while the atom order of the scaffold remains consistent. Ultimately, we can generate SMILES strings from the new molecular graphs that differ in expression but are aligned in structural features. Specifically, the chemical scaffold of an input SMILES string was generated by the \u0026lsquo;GetScaffoldForMol\u0026rsquo; function using Murcko-type decomposition, and the atom indices of the scaffold were then matched by the function \u0026lsquo;GetSubstructMatches\u0026rsquo;. The atom indices outside the scaffold were renumbered randomly and then combined with the atom indices of the scaffold to form a new molecular graph of atomic numbers. The structural features-aligned SMILES string was finally obtained from the new molecular graph by the functions \u0026lsquo;RenumberAtoms\u0026rsquo; and \u0026lsquo;MolToSmiles\u0026rsquo;. 
We used the data augmentation to generate up to 100 randomized and structural features-aligned SMILES strings for each molecule in the dataset for model training.\u003c/p\u003e\n\u003ch2\u003eModel training\u003c/h2\u003e\n\u003cp\u003eThe DeepSeMS model was implemented as a sequence-to-sequence language model based on the Transformer architecture\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e,\u003cspan class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. In model training, both source and target sequences in the training dataset were converted into embeddings in batches. These embeddings were then passed through a positional encoding layer to retain the order information of the sequence. The embedded input sequence was processed by the encoder to generate a context-rich representation. This representation was then used by the decoder, along with the embedded target sequence, to predict the next item in the sequence. Masks were used to prevent the model from accessing future tokens in the target sequence prematurely. The output of the decoder was transformed through a linear layer and a softmax to predict the probability of the next token in the target sequence. The predictions of the model were compared against the actual target sequence using a loss function. The gradients from the loss were backpropagated through the model to update the weights.\u003c/p\u003e\n\u003cp\u003eThe model was trained using a dropout rate of 0.1 for regularization, AdamW as the optimizer, cross entropy as the loss function, a learning rate of 0.0001, a batch size of 64, and otherwise default Transformer parameters. After each training epoch, the model state was validated on the validation dataset. We employed an early stopping strategy to avoid over-fitting, which stopped model training when the validation performance had not improved for 10 epochs. 
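\u003c/p\u003e \u003cp\u003eThe early stopping rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepSeMS training code: the function name \u0026lsquo;early_stopping_epoch\u0026rsquo; is hypothetical, and the sketch assumes that validation performance is tracked as a loss where lower is better, which the text does not specify.\u003c/p\u003e

```python
def early_stopping_epoch(validation_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Assumption: lower loss means better validation performance.
    Training stops at the first epoch that is `patience` epochs past the
    last improvement; otherwise it runs to the final epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best validation loss: remember it and keep training.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return best_epoch, epoch
    # The patience budget was never exhausted.
    return best_epoch, len(validation_losses) - 1


if __name__ == "__main__":
    # Loss improves for three epochs, then plateaus at a worse value.
    losses = [1.0, 0.8, 0.7] + [0.9] * 20
    print(early_stopping_epoch(losses, patience=10))
```

\u003cp\u003eIn practice, the model weights saved at the best epoch would be the ones kept for prediction. \u003c/p\u003e \u003cp\u003e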
The model was implemented in PyTorch (version 2.1.0, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://pytorch.org/\u003c/span\u003e\u003c/span\u003e) in Python, and was trained on up to eight \u0026lsquo;NVIDIA RTX 4090\u0026rsquo; GPUs.\u003c/p\u003e\n\u003ch2\u003eModel prediction\u003c/h2\u003e\n\u003cp\u003eDuring model prediction, a target mask was generated to prevent the model from accessing future positions in the sequence. The model then generated the next token after the start-of-sequence (SOS) token based on the input sequence of BGC features and the target mask. The token with the highest probability was selected and appended to the output sequence, which served as the target tokens for generating the next token. The prediction stopped when the predicted token was an end-of-sequence (EOS) token. The final output sequence was decoded to the SMILES string of the predicted SM structure based on the vocabulary of target tokens. We defined a prediction score to evaluate the output sequences generated by the model:\u003c/p\u003e\n\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e$$\\text{Prediction score}=\\frac{\\sum\\log\\left(\\text{probabilities}\\right)}{\\left(\\text{length of sequence}\\right)^{\\text{length penalty}}}$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\n\u003c/div\u003e\n\u003cp\u003eWhere: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\sum\\log\\left(\\text{probabilities}\\right)\$\u003c/span\u003e\u003c/span\u003e is the sum of the log probabilities of each token selected during the sequence generation process, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\text{length of sequence}\$\u003c/span\u003e\u003c/span\u003e is the length of the generated sequence (total number of generated tokens), and \u003cspan 
class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\text{length penalty}\$\u003c/span\u003e\u003c/span\u003e is a factor set to 0.6 in this study to adjust the score based on the length of the generated sequence, penalizing longer sequences to balance the trade-off between sequence length and the cumulative probability. Within each prediction, the prediction scores served as indicators of the confidence of the neural network model in its output.\u003c/p\u003e\n\u003ch2\u003eModel evaluation metrics\u003c/h2\u003e\n\u003cp\u003eThe following metrics for evaluating the performance of the DeepSeMS model and comparing it with existing methods were implemented using RDKit in Python:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eValidity\u003c/strong\u003e, is whether a SMILES string is chemically valid, i.e., can be successfully parsed by the \u0026lsquo;MolToSmiles\u0026rsquo; function.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStructural similarity\u003c/strong\u003e, is the Tanimoto coefficient of the chemical fingerprints (Morgan fingerprints with a radius of 2 bonds) between two structures calculated by the functions \u0026lsquo;TanimotoSimilarity\u0026rsquo; and \u0026lsquo;GetMorganFingerprint\u0026rsquo; \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e53\u003c/span\u003e,\u003cspan class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMolecular scaffold\u003c/strong\u003e, is the core structure or framework of a molecule generated by the function \u0026lsquo;GetScaffoldForMol\u0026rsquo; using Murcko-type decomposition.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMolecular shape\u003c/strong\u003e, is the generic framework of a molecular scaffold generated by the function \u0026lsquo;MakeScaffoldGeneric\u0026rsquo;.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMolecular weight\u003c/strong\u003e, \u003cstrong\u003ethe number of heavy 
atoms\u003c/strong\u003e, and \u003cstrong\u003eQED\u003c/strong\u003e (quantitative estimate of drug-likeness), are molecular properties calculated by the functions \u0026lsquo;MolWt\u0026rsquo;, \u0026lsquo;HeavyAtomCount\u0026rsquo;, and \u0026lsquo;qed\u0026rsquo;, respectively.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eChemical space\u003c/strong\u003e, is the distribution of Morgan fingerprints of the SM structures plotted by Matplotlib and Seaborn.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSynthetic accessibility\u003c/strong\u003e, is the estimated synthetic accessibility score of a molecule based on molecular complexity and fragment contributions, calculated by SAscorer\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e55\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\n\u003ch2\u003eGenome mining and analysis\u003c/h2\u003e\n\u003cp\u003eThe large-scale mining of novel SMs from the global ocean microbiome was performed on the dataset of global ocean BGCs with the DeepSeMS model implemented in PyTorch in Python. The structural similarities between the generated global ocean SM structures and all the known SMs in the MIBiG database were then calculated as Tanimoto coefficients of Morgan fingerprints. 
We also defined a \u0026lsquo;molecular novelty score\u0026rsquo; to evaluate the novelty of a generated SM structure:\u003c/p\u003e\n\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e$$\\:\\text{Molecular novelty score}=\\left(1-\\frac{Similarity-{Min}_{Similarity}}{{{Max}_{Similarity}-Min}_{Similarity}}\\right)\\times\\:100$$\u003c/div\u003e\n \u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\n\u003c/div\u003e\n\u003cp\u003eWhere: \u0026lsquo;\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:Similarity\$\u003c/span\u003e\u003c/span\u003e\u0026rsquo; is the calculated maximum Tanimoto coefficient value to the structures of known SMs; \u0026lsquo;\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:{Min}_{Similarity}\$\u003c/span\u003e\u003c/span\u003e\u0026rsquo; is the minimum value among all the maximum Tanimoto coefficient values between the structures of generated SMs and those of known SMs; \u0026lsquo;\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:{Max}_{Similarity}\$\u003c/span\u003e\u003c/span\u003e\u0026rsquo; is the maximum value among these maximum Tanimoto coefficient values. The molecular novelty score is the normalized percentage of the maximum similarity value to the structures of known SMs, allowing for a more intuitive analysis and evaluation of the molecular novelty of the generated global ocean SMs.\u003c/p\u003e\n\u003cp\u003eMolecular scaffolds and shapes of the global ocean SMs were generated by the functions of \u0026lsquo;GetScaffoldForMol\u0026rsquo; and \u0026lsquo;MakeScaffoldGeneric\u0026rsquo; using Murcko-type decomposition using RDKit in Python. 
The geographical coverage, ocean diversities, and ecological distribution characteristics of the global ocean SMs were analysed according to the metadata of the global ocean microbiome from the OMD database\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e. The global ocean map was plotted in R (version 4.1.2) using Leaflet (version 2.2.2) with the \u0026lsquo;OceanBasemap\u0026rsquo; from Esri. \u0026lsquo;Diversity\u0026rsquo; was the percentage of unique SM structures in the ocean provinces and \u0026lsquo;Uniqueness\u0026rsquo; was the percentage of unique SM structures in the global oceans; uniqueness was calculated based on the canonical SMILES of the structures generated by RDKit. O, N and C contents were calculated as the molecular weight percentages of oxygen, nitrogen and carbon atoms in the global ocean SM structures.\u003c/p\u003e\n\u003cp\u003eThe structure-based virtual screening of the global ocean SMs for functional groups with known antibiotic activity applied the following criteria: 1) a 2-azetidinone substructure in a bicyclic scaffold, classified as \u0026beta;-lactams; 2) one or more aminosugars, classified as aminoglycosides; 3) a tetracene scaffold, classified as tetracyclines; 4) a 2-oxazolidone scaffold, classified as oxazolidinones; 5) a dichloroacetamide substructure, classified as chloramphenicols; 6) a lactone substructure in a macro ring with 14 or more atoms, classified as macrolides; 7) amide and aromatic moieties in a macro ring with 14 or more atoms, classified as ansamycins; 8) a 4-quinolone scaffold, classified as quinolones. The virtual screening was also implemented using RDKit in Python by checking whether the structures contain the above functional groups with known antibiotic activity. 
The SM structures in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e were visualized, and their stereochemical information was calculated, with ChemDraw (version 23.1.1).\u003c/p\u003e\n\u003ch2\u003eWeb server implementation\u003c/h2\u003e\n\u003cp\u003eThe DeepSeMS web server was implemented with Django (version 4.2.6, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.djangoproject.com/\u003c/span\u003e\u003c/span\u003e) for the web site framework, SQLite (version 3.41.2, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.sqlite.org/\u003c/span\u003e\u003c/span\u003e) for the database, Python for backend applications and Docker (version 24.0.6, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.docker.com/\u003c/span\u003e\u003c/span\u003e) for the implementation environment. Web pages were developed with JavaScript, AJAX, jQuery and Bootstrap (version 5.3.2, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://v5.bootcss.com/\u003c/span\u003e\u003c/span\u003e). We also applied RDKit in Python for chemical structure visualization, molecular property calculation, comparison with known SMs and antibiotic potential analysis.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eDATA AVAILABILITY\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll the datasets used in this work were obtained from public data repositories and are specified in the Methods section. Source data for the figures and tables are provided in the Source Data file. The training dataset of DeepSeMS is available in the GitHub repository https://github.com/lab-of-biochemai/deepsems/data/. 
The dataset of the global ocean SMs is available as Source Data file and at DeepSeMS web server https://biochemai.cstspace.cn/deepsems/downloads/.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCODE AVAILABILITY\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe DeepSeMS web server and the global ocean SMs resource are freely available with no login requirements at: https://biochemai.cstspace.cn/deepsems/. Source code of DeepSeMS is available at GitHub repository https://github.com/lab-of-biochemai/deepsems/.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSOURCE DATA\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSource data 1: The source data of Table 1 and Figures 2-4.\u003c/p\u003e\n\u003cp\u003eSource data 2: The dataset of the global ocean SMs.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENTS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by the National Natural Science Foundation of China (92251307, 92451303, 32470098, 82170542), the National Key Research and Development Program of China (2023YFA0915501), and the Informatization Plan of Chinese Academy of Sciences (CAS-WX2021SF-0307). The authors acknowledge the use of resources provided by Beijing PARATERA Tech Corp.,Ltd. and China Science \u0026amp; Technology Cloud. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAUTHOR CONTRIBUTIONS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eN.J., R.Z., G.Z. (Guoping Zhao) and G.Z. (Guoqing Zhang) conceived and designed the study. T.X. and W.Y. drafted the manuscript. R.Z., W.L., J.L., Y.Z., P.Z., G.Z. (Guoqing Zhang), G.Z. (Guoping Zhao) and N.J. reviewed and edited the manuscript. 
All authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCOMPETING INTERESTS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eClardy, J. \u0026amp; Walsh, C. Lessons from natural molecules. Nature 432, 829\u0026ndash;837 (2004).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, T. et al. NPBS database: a chemical data resource with relational data between natural products and biological sources. \u003cem\u003eDatabase\u003c/em\u003e 2020, baaa102 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNewman, D. J. \u0026amp; Cragg, G. M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 83, 770\u0026ndash;803 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoehn, F. E. \u0026amp; Carter, G. T. The evolving role of natural products in drug discovery. Nat. Rev. Drug Discov. 4, 206\u0026ndash;220 (2005).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRodrigues, T., Reker, D., Schneider, P. \u0026amp; Schneider, G. Counting on natural products for drug design. Nature Chem 8, 531\u0026ndash;541 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVanni, C. et al. Unifying the known and unknown microbial coding sequence space. Elife 11, e67667 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWirbel, J., Bhatt, A. S \u0026amp; Probst, A. J. The journey to understand previously unknown microbial genes. Nature 626, 267\u0026ndash;269 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDoroghazi, J. R. et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nat. Chem. Biol. 
10, 963\u0026ndash;968 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZiemert, N., Alanjary, M. \u0026amp; Weber, T. The evolution of genome mining in microbes\u0026mdash;a review. Nat. Prod. Rep. 33, 988\u0026ndash;1005 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eScherlach, K. \u0026amp; Hertweck, C. Mining and unearthing hidden biosynthetic potential. Nat. Commun. 12, 3864 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWalsh, C. T. \u0026amp; Fischbach, M. A. Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc. 132, 2469\u0026ndash;2493 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMilshteyn, A., Schneider, J. S., \u0026amp; Brady, S. F. Mining the metabiome: identifying novel natural products from microbial communities. Chem. Biol. 21, 1211\u0026ndash;1223 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, M. H. T., Ung, P. M. U., Zajkowski, J., Garneau-Tsodikova, S. \u0026amp; Sherman, D.H. Automated genome mining for natural products. BMC Bioinf. 10, 185 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMedema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339\u0026ndash;W346 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSkinnider, M. A. et al. Genomes to natural products PRediction Informatics for Secondary Metabolomes (PRISM). Nucleic Acids Res. 43, 9645\u0026ndash;9662 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZierep, P. F. et al. SeMPI: a genome-based secondary metabolite prediction and identification web server. Nucleic Acids Res. 45, W64\u0026ndash;W71 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHannigan, G. D. et al. 
A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 47, e110 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSkinnider, M. A. et al. Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nat. Commun. 11, 6058 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlin, K. et al. antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Res. 51, W46\u0026ndash;W50 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, J. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371\u0026ndash;379 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLogares R. Decoding populations in the ocean microbiome. Microbiome 12, 67 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWalsh C. T. \u0026amp; Tang Y. \u003cem\u003eNatural Product Biosynthesis: Chemical Logic and Enzymatic Machinery\u003c/em\u003e Ch. 1 (Royal Society of Chemistry Publishing, 2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDonadio, S., Staver, M. J., McAlpine, J. B., Swanson, S. J. \u0026amp; Katz, L. Modular organization of genes required for complex polyketide biosynthesis. Science 252, 675\u0026ndash;679 (1991).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchwarzer, D., Mootz, H. D. \u0026amp; Marahiel, M. A. Exploring the impact of different thioesterase domains for the design of hybrid peptide synthetases. Chem. Biol., 8, 997\u0026ndash;1010 (2001).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBernhardt, R. Cytochromes P450 as versatile biocatalysts. J. Biotechnol. 124, 128\u0026ndash;145 (2006).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani, A. et al. Attention is all you need. 
\u003cem\u003eAdvances in Neural Information Processing Systems 30 (Nips\u003c/em\u003e 2017) 2017, 5998\u0026ndash;6008 \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/1706.03762\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/1706.03762\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. \u003cem\u003eProceedings of the\u003c/em\u003e 2020 \u003cem\u003eEMNLP (Systems Demonstrations)\u003c/em\u003e 2020, 38\u0026ndash;45 \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.18653/v1/2020.emnlp-demos.6\u003c/span\u003e\u003cspan address=\"10.18653/v1/2020.emnlp-demos.6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, T. et al. Neural machine translation of chemical nomenclature between English and Chinese. J. Cheminform. 12, 50 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoffmann, J. et al. An empirical analysis of compute-optimal large language model training. Adv. Neural Inf. Process. Syst. 35, 30016\u0026ndash;30030 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSald\u0026iacute;var-Gonz\u0026aacute;lez, F. I., Aldas-Bulos, V. D., Medina-Franco, J. L. \u0026amp; Plisson, F. Natural product drug discovery in the artificial intelligence era. Chem. Sci. 13, 1526\u0026ndash;1546 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiao, Y. et al. Macrocyclization of linear molecules by deep learning to facilitate macrocyclic drug candidates discovery. Nat. Commun. 14, 4552 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 
24, 1\u0026ndash;113 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTerlouw, B. R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 51, D603\u0026ndash;D610 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOuteiral, C., \u0026amp; Deane, C. M. Codon language embeddings provide strong signals for use in protein engineering. Nat. Mach. Intell. 6, 170\u0026ndash;179 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeininger, D. SMILES, a chemical language and information system. J. Chem. Inf. Model., 28, 31\u0026ndash;36 (1988).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchwaller, P. et al. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 5, 1572\u0026ndash;1583 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSkinnider, M. A. et al. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3, 759\u0026ndash;770 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAr\u0026uacute;s-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePolykovskiy, D. et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol. 11, 565644 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBemis, G. W. \u0026amp; Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887\u0026ndash;2893 (1996)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111\u0026ndash;118 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAcinas, S. G. et al. 
Deep ocean metagenomes provide insight into the metabolic architecture of bathypelagic microbial communities. Commun. Biol. 4, 604 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBecerril, A. et al. Uncovering production of specialized metabolites by Streptomyces argillaceus: Activation of cryptic biosynthesis gene clusters using nutritional and genetic approaches. PloS one 13, e0198145 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZheng, X. et al. Biosynthesis of the pyrrolidine protein synthesis inhibitor anisomycin involves novel gene ensemble and cryptic biosynthetic steps. \u003cem\u003eProc. Natl. Acad. Sci. U. S. A.\u003c/em\u003e 114, 4135\u0026ndash;4140 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWills, T. J., \u0026amp; Lipkus, A. H. Structural Approach to Assessing the Innovativeness of New Drugs Finds Accelerating Rate of Innovation. ACS Med Chem Lett. 11, 2114\u0026ndash;2119 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177\u0026ndash;185 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSadeghi, A. et al. Diversity of the ectoines biosynthesis genes in the salt tolerant Streptomyces and evidence for inductive effect of ectoines on their accumulation. Microbiol. Res. 169, 699\u0026ndash;708 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePastor, J. M. et al. Ectoines in cell stress protection: uses and biotechnological production. Biotechnol Adv 28, 782\u0026ndash;801 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWidderich, N. et al. Biochemical properties of ectoine hydroxylases from extremophiles and their wider taxonomic distribution among microorganisms. PLoS One 9, e93809 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWishart, D.S. et al. 
NP-MRD: the Natural Products Magnetic Resonance Database. Nucleic Acids Res. 50, D665\u0026ndash;D677 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412\u0026ndash;D419 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCamacho, C. et al. BLAST+: architecture and applications. BMC Bioinf. 10, 421 (2009).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBajusz, D., R\u0026aacute;cz, A. \u0026amp; H\u0026eacute;berger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 7, 20 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRogers, D. \u0026amp; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742\u0026ndash;754 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eErtl, P. \u0026amp; Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 
1, 8 (2009).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6233440/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6233440/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eMicrobial biosynthetic diversity holds immense potential for discovering natural products with therapeutic applications, yet a substantial quantity of natural products derived from uncultivated microorganisms remains uncharacterized. The intricate nature of biosynthetic enzymes poses a major challenge in accurately predicting the chemical structures of secondary metabolites solely based on genome sequences using current rule-based methods. Here, we present DeepSeMS, a large language model designed to predict the chemical structures of secondary metabolites from various microbial biosynthetic gene clusters. 
Built on the Transformer architecture, DeepSeMS innovatively identifies sequence features using functional domains of biosynthetic enzymes, and incorporates feature-aligned chemical structure enumeration for training data augmentation. External evaluation results show that DeepSeMS predicts more accurate chemical structures of secondary metabolites with a Tanimoto coefficient up to 0.6 compared with the ground truth, significantly outperforming antiSMASH and PRISM with coefficients of only 0.14 and 0.45 respectively. Moreover, DeepSeMS successfully predicted secondary metabolites for 96.60% of cryptic biosynthetic gene clusters, surpassing existing methods with success rates less than 50%. Leveraging DeepSeMS, we characterized over 65,000 novel secondary metabolites from the global ocean microbiome with previously undocumented structural types, ecological distribution, and biomedical applications especially antibiotics. A login-free and user-friendly web server for DeepSeMS (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://biochemai.cstspace.cn/deepsems/\u003c/span\u003e\u003cspan address=\"https://biochemai.cstspace.cn/deepsems/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) has been launched, featuring an integrated global ocean microbial secondary metabolites repository to expedite the discovery of novel natural products. 
Collectively, this study underscores the great capacity of a large language model-driven method in revealing hidden biosynthetic potential of the global ocean microbiome.\u003c/p\u003e","manuscriptTitle":"DeepSeMS: a large language model reveals hidden biosynthetic potential of the global ocean microbiome","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-16 11:28:12","doi":"10.21203/rs.3.rs-6233440/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"nature-computational-science","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"natcomputsci","sideBox":"Learn more about [Nature Computational Science](http://www.nature.com/natcomputsci/)","snPcode":"","submissionUrl":"","title":"Nature Computational Science","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature Research","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"c11ef064-2f59-40dc-b1bc-1feebe48fbac","owner":[],"postedDate":"April 16th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":47204803,"name":"Biological sciences/Computational biology and bioinformatics/Computational models"},{"id":47204804,"name":"Biological sciences/Drug discovery/Drug screening/High-throughput screening"}],"tags":[],"updatedAt":"2026-05-05T09:56:37+00:00","versionOfRecord":{"articleIdentity":"rs-6233440","link":"https://doi.org/10.1038/s43588-026-00983-1","journal":{"identity":"nature-computational-science","isVorOnly":false,"title":"Nature Computational Science"},"publishedOn":"2026-04-30 04:00:00","publishedOnDateReadable":"April 30th, 2026"},"versionCreatedAt":"2025-04-16 
11:28:12","video":"","vorDoi":"10.1038/s43588-026-00983-1","vorDoiUrl":"https://doi.org/10.1038/s43588-026-00983-1","workflowStages":[]},"version":"v1","identity":"rs-6233440","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6233440","identity":"rs-6233440","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
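
The molecular novelty score defined in the Methods (Equation 2) is a min-max normalisation of each generated structure's maximum Tanimoto similarity to the known SMs, inverted and scaled to 0-100. The following is a minimal sketch of that arithmetic only, not the authors' implementation. The function name `molecular_novelty_scores` and the input list are hypothetical, and raising an error when all similarities are equal is an assumption, since the paper does not describe that case.

```python
from typing import List, Sequence


def molecular_novelty_scores(max_similarities: Sequence[float]) -> List[float]:
    """Convert per-structure maximum Tanimoto similarities into novelty scores.

    Each input value is assumed to be the highest Tanimoto coefficient between
    one generated structure and all known structures. The score follows
    (1 - (s - min) / (max - min)) * 100, so the least similar structure
    scores 100 and the most similar structure scores 0.
    """
    if not max_similarities:
        return []
    lowest = min(max_similarities)
    highest = max(max_similarities)
    if highest == lowest:
        # Assumption: the normalisation is undefined here, so fail loudly.
        raise ValueError("novelty score is undefined when all similarities are equal")
    span = highest - lowest
    return [(1.0 - (s - lowest) / span) * 100.0 for s in max_similarities]


# Hypothetical similarities for three generated structures.
example_scores = molecular_novelty_scores([0.25, 0.5, 0.75])
print(example_scores)
```

Because the score is normalised against the minimum and maximum of the analysed set, it is relative to that set: adding or removing structures can change every score, so values from different runs are not directly comparable.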
