Function-Driven Molecular Design Enabled by Instruction-Tuned Large Language Models

doi:10.21203/rs.3.rs-8819034/v1

2026 · doi:10.21203/rs.3.rs-8819034/v1

preprint OA: closed CC-BY-4.0

Qianfan Yang, Xurui Wang, Yanxi Wang, Ruizhao Zhu, Hailiang Li, and 7 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8819034/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted.

Abstract

Translating high-level functional design intent into concrete molecular structures remains a fundamental challenge in generative molecular discovery, particularly for biomolecular targets governed by non-pocket-like recognition. Here, we introduce SemantiChem, an instruction-tuned generative framework for function-driven molecular design that maps functional objectives expressed in natural language directly to chemically meaningful molecular structures, without relying on predefined geometric constraints, molecular scaffolds, or pocket-centric assumptions.
We apply this framework to G-quadruplexes (G4), a representative system characterized by diffuse and topology-driven molecular recognition, and experimentally validate model-generated candidates through assays of G4 stabilization, polymerase stalling, and cellular response. The same design pipeline is further evaluated on a structurally distinct RNA target and, for contrast, on a pocket-dominated protease target. Together, these results establish a function-level molecular design strategy with regime-dependent applicability, highlighting a complementary path for molecular discovery in biomolecular systems where conventional structure-centric paradigms are insufficient.

Physical sciences/Chemistry/Chemical biology/Small molecules
Biological sciences/Computational biology and bioinformatics/Machine learning
Biological sciences/Drug discovery/Drug screening/Virtual screening
Physical sciences/Chemistry/Chemical biology/Cheminformatics

Introduction

Generative artificial intelligence has substantially expanded the scope of molecular design by enabling the de novo generation of compounds with specified biological or chemical functions [1, 2]. Much of this progress to date has been achieved in targets that present compact and well-defined binding pockets [3–8], where molecular function can be closely associated with localized geometric features. However, advances in molecular biology have increasingly revealed biomolecular systems that involve extended, flexible, or delocalized interaction modes. These systems challenge the implicit assumption that molecular function can be inferred from localized geometric complementarity, and conventional structure-centric molecular design strategies become increasingly unreliable. This shift highlights the need for generative approaches that can link functional objectives to molecular structure in a general and transferable manner.

G-quadruplexes (G4) represent a class of biomolecular systems governed by non-pocket-like molecular recognition [9, 10]. Rather than presenting localized and geometrically constrained interaction sites, G4 structures expose relatively shallow and highly polymorphic interaction surfaces, with ligand recognition dominated by stacking interactions, electrostatics, and higher-order topology [11, 12] (Fig. 1). In practice, the design of G4-binding ligands has largely focused on functional outcomes, such as G4 stabilization, transcriptional interference, or selective cellular effects [12–15]. However, most existing structure-driven generative models are not designed to accommodate the function-oriented nature of G4 ligand design. In our evaluation, several representative approaches showed limited applicability when applied to G4 targets, failing to consistently produce chemically meaningful ligands (detailed in Fig. 5a and Supplementary Note 7). These challenges motivate the development of generative design strategies that can translate high-level functional intent directly into chemical structures, without relying on explicit pocket geometries or predefined molecular templates.

In this regard, large language models (LLMs) provide a fundamentally different design interface by operating directly on natural language descriptions of function, behavior, and constraint [16, 17]. To date, however, most applications of LLMs in chemistry have focused on tasks such as property prediction [18], molecular annotation [19, 20], or text-molecule alignment [21], rather than on the direct generation of novel chemical structures guided by task-level functional intent (ESI, Table S1). Consequently, the potential of language-based models as generative engines for function-driven molecular design remains largely unexplored.

In this work, we introduce SemantiChem, an instruction-tuned generative framework designed to translate high-level functional design intent into chemically meaningful molecular structures. By moving beyond predefined geometric constraints, molecular scaffolds, or pocket-centric assumptions, SemantiChem enables controllable, function-driven molecular generation from natural language descriptions. We apply this framework to G4, a representative system governed by non-pocket-like molecular recognition, and experimentally validate generated candidates through assays of G4 stabilization, polymerase stalling, and cellular response. The same design pipeline is further evaluated on a structurally distinct RNA target and on a pocket-dominated protease target for contrast. Collectively, these results establish SemantiChem as a function-level molecular design approach and illustrate its applicability to biomolecular targets that lack well-defined structural constraints.

Figure 1. Schematic comparison between conventional structure-driven molecular design and function-driven molecular generation. In structure-driven paradigms, molecular function is inferred from localized geometric complementarity within compact and well-defined binding pockets, as exemplified by pocket-dominated protein targets. In contrast, function-driven design addresses biomolecular systems governed by non-pocket-like recognition, where interaction surfaces are extended, flexible, and context-dependent, and molecular function emerges from distributed stacking, electrostatic, and topological interactions. G-quadruplexes (G4) are shown here as a representative system illustrating this non-pocket-like recognition regime. This conceptual distinction underlies the design strategy explored in this work.

Results

SemantiChem Pipeline

To enable instruction-driven molecular generation for biomolecular targets governed by non-pocket-like recognition, we developed SemantiChem, a modular framework that transforms general-purpose LLMs into chemically grounded generative systems aligned to functional design objectives (Fig. 2). The framework integrates model training, model-level evaluation, molecule generation, and downstream validation into a unified workflow, establishing a direct link between linguistic intent and functional molecules.

In the training stage, SemantiChem adapts a general-purpose LLM through two sequential procedures: molecular-syntax pretraining and target-specific instruction tuning (Fig. 2a). Each model progresses through three conceptual stages: (1) Base, which preserves the original linguistic capacity of the pretrained LLM; (2) SELFIES-pretrained, in which the model is exposed to 10,000 molecules from PubChem10M_SELFIES to develop fundamental chemical-syntax fluency; and (3) target-specific instruction-tuned, which aligns the learned molecular representations with task-specific functional design intent through supervised instruction tuning (see Supplementary Note 2). These stages provide the basis for generating chemically valid and task-relevant structures from language-based design instructions. Model variants are denoted as “base + stage”, for example Instruct-Base, Instruct-SELFIES, ChemE-G4, and ChemE-rRNA.

To characterize how supervision affects model behavior, SemantiChem incorporates a model-level evaluation module (Fig. 2b). This component examines three aspects of the trained model: its ability to interpret domain-related chemical language, its response to controlled prompt modifications, and the extent to which semantic cues correspond to changes in structural outputs. Since open-ended natural-language molecular generation has not been addressed in prior work, existing evaluation practices do not apply to this setting.
Therefore, this framework provides a systematic and generalizable basis for assessing such models. Detailed procedures and metrics are provided in Supplementary Note 3.

After training and evaluation, the model is applied to generate ligands for specific biomolecular targets with simple textual prompts (Fig. 2c). The model outputs candidate structures in SELFIES format, which are decoded, filtered for validity and uniqueness, and examined using standard chemical-space analyses. This workflow enables de novo molecular generation guided solely by functional descriptions expressed in natural language, without predefined scaffolds or geometric constraints.

All generated molecules were assessed in silico using classifier-based activity prediction and molecular docking. Candidates were selected for further wet-lab testing based on commercial availability and expert assessment (Fig. 2d). With these steps, SemantiChem establishes a direct route from language-based design instructions to chemically grounded molecular candidates.

Figure 2. (a) Training of SemantiChem. A general-purpose language model is adapted through molecular-syntax pretraining and target-specific instruction tuning to obtain models aligned with ligand design. (b) Model-level evaluation. The trained model is assessed for chemical language comprehension, responsiveness to prompt variations, and semantic-structural consistency. (c) Molecule generation. Simple textual prompts are used to generate SELFIES strings, which are decoded and analyzed in chemical space. (d) Downstream validation. Generated molecules are subjected to computational and experimental assays to identify candidate ligands.

SemantiChem for generating G4 ligands

Building on the SemantiChem framework described above, we first demonstrated its applicability using G4s, a representative biomolecular system governed by non-pocket-like molecular recognition.
Two base models with distinct pretraining backgrounds were explored: Meta-LLaMA-3-8B-Instruct (Instruct), a general-purpose model without chemical exposure, and LLaMA-3.1-ChemEinstein (ChemE), pretrained on SMILES data. Two generative variants, Instruct-G4 and ChemE-G4, were obtained by fine-tuning these base models using 4,442 reported G4 ligands from the G4 ligand database (G4LDB) [22–24] (Fig. 3a).

Figure 3. (a) G4 ligand generation models derived from two base LLMs. Meta-LLaMA-3-8B-Instruct and LLaMA-3.1-BestMix-ChemEinstein were adapted through SELFIES pretraining and subsequently fine-tuned on the G4LDB dataset to obtain Instruct-G4 and ChemE-G4. (b) Four types of question-and-answer (Q&A) pairs used for instruction tuning, with representative examples from each type. (c) Generation metrics across model stages and ablation settings. Validity, uniqueness, and novelty were evaluated for Base, SELFIES-pretrained, and G4-tuned models, as well as Type-1-only and four-type multitask tuning variants.

G4-specific fine-tuning was organized into four types of question-and-answer (Q&A) pairs (Fig. 3b). Type-1, Prompt-based Known Ligand Generation, promotes open-ended alignment between functional instructions and chemically plausible structures. Type-2, Fragment Completion, introduces scaffold-level constraints to enhance the model’s understanding of structural continuity. Type-3, Atom Completion, reinforces localized chemical intuition through atom-level reconstruction of masked regions. Type-4, Identity Confirmation, trains the models to recognize underspecified or invalid prompts, cultivating a domain-consistent expert persona.

Generation quality across training stages was evaluated by sampling 600 molecules from each checkpoint and assessing validity, uniqueness, and novelty (Fig. 3c). During the Base and SELFIES stages, both models exhibited limited chemically meaningful generation.
The Instruct-Base model produced incoherent text, whereas ChemE-Base generated malformed SMILES-like strings, with validity of 0% in both cases. SELFIES pretraining improved syntactic fluency, enabling generation of SELFIES strings with chemically relevant tokens and bracket structures. However, persistent errors in ring closures and branching logic kept validity below 1.2%. Substantial improvements emerged after full instruction tuning with the G4LDB dataset: the ChemE-G4 model achieved 99.8% validity, 94.0% uniqueness, and 67.3% novelty, demonstrating high structural fluency and molecular diversity.

To better understand the contribution of individual prompt types, ChemE-SELFIES was fine-tuned using only the Type-1 Q&A pair, which serves as the core task among the four types and was used here for ablation analysis of open-ended prompt-based generation. This simplified model (ChemE-G4 Type-1) retained high validity (99.2%) and uniqueness (92.2%) but exhibited reduced novelty (42.2%), indicating a narrower generative distribution. Compared with the single-task variant, ChemE-G4, which was fine-tuned on all four Q&A types, showed consistent improvements across all metrics, confirming that multi-task instruction tuning enhances chemical correctness while maintaining structural diversity. Interestingly, Instruct-G4 Type-1 produced higher novelty (82.0%) but lower validity (76.8%) than Instruct-G4, reflecting a trade-off between generative diversity and structural precision likely influenced by the inductive biases of the base models.

To further examine how variations in functional instructions affect model behavior, we applied the evaluation framework to both model series (see Supplementary Note 3). The analysis revealed consistent and interpretable shifts in molecular descriptors in response to prompt changes.
Moreover, flexibility-related features such as rotatable bonds varied systematically between stacking-oriented and topology-selective prompts, suggesting that the model adapts its outputs according to contextual functional cues rather than reproducing fixed biases.

Characterization of Generated G4 Ligands

Having trained the models for G4 ligand generation and examined the effects of prompt design, we next used a single, standardized function-oriented prompt, “Please show me a G-quadruplex ligand which hasn’t been reported. Be sure the structure is novel and unique.”, to produce candidate molecules (Fig. 4a). For each model, 2,600 molecules were generated for characterization and comparison with known G4 ligands.

To evaluate structural relevance, Tanimoto similarity was calculated for each set of generated molecules against existing G4 ligands in the training set, returning high similarity scores for both ChemE-G4 (mean = 0.88) and Instruct-G4 (mean = 0.87) models (Fig. 4b). We also found that novelty increased with sample size, rising from ~70% at 600 samples to over 75% at 2,600. While ~25% of generated molecules matched exactly with known ligands, the majority of generated molecules were structurally distinct, demonstrating the models’ capacity to recover active G4-binding motifs while exploring new regions of G4-relevant chemical space.

Figure 4. (a) Simple textual prompt used for G4 ligand generation and a representative SELFIES output produced by ChemE-G4. (b) Similarity distributions between generated molecules and known G4 ligands for Instruct-G4 and ChemE-G4. (c) Bemis-Murcko scaffold overlap among molecules generated by the two model variants and ligands in G4LDB. (d) TMAP-based classification of generated molecules into A-class (directly connected to known G4 ligands) and B-class (connected only to A-class nodes). (e) Proportions of molecules predicted as G4-active by the GLAM classifier for Instruct-G4, ChemE-G4, and a random drug-like control set.
(f) Docking score distributions (range from −15 to 0 kcal/mol) for ChemE-G4-generated ligands against human telomeric G4 (PDB: 6CCW) and two binding sites within the KRAS promoter G4 (PDB: 7X8M). Dashed lines indicate reference docking scores of the clinical-stage G4 ligand CX-5461 against these G4 receptors. The full distribution is shown in Fig. S4.2.

Scaffold-level analysis also revealed a balance between reuse and innovation. Each model produced over 1,400 unique Bemis-Murcko scaffolds, with 16–20% overlapping with scaffolds in G4LDB (Fig. 4c). To visualize structural diversity, we applied TMAP to project the generated molecules in latent space and classify them based on proximity to training ligands [25] (Fig. S4.1). Nodes directly connected to known G4 ligands were defined as A-class, while those connected only to A-class nodes but not directly to known ligands were defined as B-class. ChemE-G4 showed a near-even split between A- and B-class molecules (49.7% vs. 50.3%), while Instruct-G4 favored exploration of new chemical space, with 60.6% of molecules disconnected from known G4 ligands (Fig. 4d). Notably, the two models exhibit complementary generative behaviors shaped by their underlying model priors: ChemE-G4 tends to perform scaffold refinement, while Instruct-G4 favors broader chemotype exploration. This dual capacity reflects SemantiChem’s flexibility in generating ligands with strong structural consistency to known G4 binders while maintaining scaffold-level diversity.

We next evaluated whether structurally plausible molecules generated by our models also exhibit G4-relevant activities. To do so, we first established a G4 ligand classifier by training the reported GLAM [26] architecture using 3,569 G4 ligands as positives and 2,695 FDA-approved small-molecule drugs with no known G4-binding activity as negatives.
Based on this classifier, we found that 91.9% of Instruct-G4 and 91.7% of ChemE-G4 molecules were predicted as G4 ligands, whereas only 16.5% of random drug-like compounds extracted from PubChem were predicted to be positive (Fig. 4e and Supplementary Note 5.5). Docking 2,600 ChemE-G4 ligands to the human telomeric G4 (hTel, PDB 6CCW) and the KRAS promoter G4 (PDB 7X8M; two distinct binding sites) yielded valid poses for more than 1,440 molecules per site. Favorable binding (ΔG < 0 kcal/mol) was observed for 76.2% of ligands with hTel and for 99.5% and 96.0% with KRAS sites 1 and 2, respectively. Notably, 50.6% of hTel-bound ligands exceeded the affinity of the clinical-stage G4 ligand CX-5461 [27] (−7.2 kcal/mol), while 43 and 184 ligands outperformed reference scores at KRAS sites 1 (−9.4 kcal/mol) and 2 (−9.5 kcal/mol) (Fig. 4f). These results demonstrate that molecules generated by ChemE-G4 were highly enriched for binding across distinct G4 targets, despite the absence of explicit geometric constraints during generation.

Among the thousands of generated molecules, we identified a small subset overlapping with commercially available compounds, which were prioritized for experimental validation. Two model-generated molecules were selected as representative candidates for structural and functional characterization (Fig. 5a). Instruct-734 (commercially known as Hoechst 34580) is a double-stranded DNA staining dye with no previous record of G4 binding activity. It features an extended bis-benzimidazole scaffold linked through a flexible phenylene bridge, forming a curved, partially conjugated system well suited for groove binding. Structurally, it maintains the bis-benzimidazole framework typical of Hoechst dyes while diverging from canonical G4-binding motifs (maximal Tanimoto = 0.755).
ChemE-1876 (commercially known as A366) is a G9a/GLP methyltransferase inhibitor with a compact benzimidazole core substituted by an unusual fused cyclobutyl ring, giving rise to a non-planar, sterically constrained topology. This architecture differs markedly from the extended aromatic systems typical of G4 binders and shows limited similarity to any known G4 ligand (maximal Tanimoto = 0.412). The analyses indicate that the two model-generated molecules adopt distinct yet chemically reasonable architectures for G4 recognition, motivating experimental validation of their activities.

FRET melting experiments revealed that both Instruct-734 and ChemE-1876 stabilized the KRAS promoter G4, increasing the melting temperature (Tm) by 4.00 °C and 4.25 °C, respectively (Fig. 5b). Consistent with this stabilization of the G4 in the KRAS oncogene promoter, both molecules inhibited primer extension in the DNA polymerase stop assay in a dose-dependent manner. Near-complete inhibition was achieved for both at 80 µM, comparable to berberine, a well-characterized natural G4 stabilizer with reported anticancer activity [28] (Fig. 5c).

Having confirmed their binding to G4 sequences, we next evaluated cellular bioactivity using CCK-8 cytotoxicity assays. Instruct-734 showed broad cytotoxicity across different cell lines, whereas ChemE-1876 exhibited selective cytotoxicity toward A549 cells with mutation-induced KRAS overexpression (IC₅₀ = 1.65 ± 0.52 µM) compared with MCF-7 cells carrying wild-type KRAS (IC₅₀ > 30 µM) (Fig. 5d). These results indicate that our model can generate structurally novel and functionally selective G4 ligands with therapeutic potential. The collective results are summarized in Fig. 5e, which also includes positive and negative references tested under the same assays.
These wet-lab experiments demonstrate the ability of SemantiChem to generate chemically valid G4 ligands encompassing both canonical and previously unexplored binding architectures, supporting function-driven molecular design in non-pocket-like recognition regimes.

Figure 5. (a) Chemical structures of four ligands selected for experimental validation. (b) FRET melting assay curves for the KRAS promoter G4 in the presence and absence of each ligand. (c) Polymerase stop assay showing dose-dependent inhibition of DNA synthesis on the KRAS G4 template, with berberine (BER), a well-characterized natural G4 stabilizer, as a reference. (d) Cytotoxicity dose-response curves in A549 (lung carcinoma) and MCF-7 (breast cancer) cell lines. (e) Summary of experimental measurements.

Benchmarking SemantiChem against Current Generative Models

We benchmarked SemantiChem against seven representative generative models on G4 ligand generation (Fig. 6): three conventional SMILES-based generators (adversarial autoencoder (AAE), variational autoencoder (VAE), and character-level recurrent neural network (CharRNN)), one transformer-based generator (SPMM [29]), one geometry-conditioned graph neural network (Pocket2Mol [3]), and two LLM-based generators (ChemGPT [30] and ether0 [20]). For comparability, each model produced 2,600 molecules, which were evaluated using a unified analysis pipeline.

Figure 6. (a) Summary of generation performance for CharRNN, SPMM, Pocket2Mol, ChemGPT, and ChemE-G4, including metrics for structural quality, scaffold-level statistics, semantic similarity, and predicted functional relevance based on docking and GLAM classification. Red values indicate lower metric values within each category. (b) TMAP projections showing the chemical-space distributions of molecules generated by the five models alongside ligands from G4LDB. (c) Representative ligands generated by each model. Red-highlighted labels indicate selected attributes displayed for each molecule.
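
The structural-quality metrics used throughout this comparison (validity, uniqueness, and novelty) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The definitions are assumptions based on common practice in generative-model benchmarks: validity over all samples, uniqueness over valid samples, and novelty over unique valid samples relative to the training set. The `canonicalize` argument is a hypothetical hook for a real decoder and canonicalizer, and `toy_canonicalize` is a stand-in with no chemical meaning.

```python
from typing import Callable, Iterable, Optional


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> dict:
    """Return validity, uniqueness, and novelty as fractions in [0, 1].

    Assumed definitions (not confirmed by the paper):
      validity   = valid samples / all samples
      uniqueness = distinct valid samples / valid samples
      novelty    = distinct valid samples absent from training / distinct valid
    """
    generated = list(generated)
    # canonicalize() returns None for strings that cannot be parsed.
    valid = [c for c in map(canonicalize, generated) if c is not None]
    unique = set(valid)
    known = {c for c in map(canonicalize, training_set) if c is not None}
    novel = unique - known
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_canonicalize(s: str) -> Optional[str]:
    """Hypothetical stand-in for a chemistry parser: strips whitespace and
    rejects empty strings or strings containing '?'."""
    s = s.strip()
    return s if s and "?" not in s else None


if __name__ == "__main__":
    samples = ["CCO", "CCO ", "CCN", "C?C", "CCC"]
    print(generation_metrics(samples, ["CCO"], toy_canonicalize))
```

In practice the canonicalizer would decode each SELFIES string and return a canonical representation, so that two spellings of the same molecule count as duplicates; the choice of denominators above affects the reported numbers and should be checked against the definitions in the supplementary material.
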

AAE, VAE, and CharRNN were trained on the MOSES platform [31] with the same G4LDB-based training set. Both AAE and VAE failed to perform effective molecular generation, producing fewer than 2% chemically valid molecules (Table S6.1). Although CharRNN achieved higher validity (65.6%), most generated molecules reproduced existing G4 ligand scaffolds, resulting in substantially lower scaffold novelty compared with the SemantiChem model ChemE-G4 (24.3% vs. 74.5%). SPMM was applied in its seed-based mode using the canonical G4 ligand Pyridostatin (PDS) [32]. Its outputs exhibited 100% scaffold novelty but a lower predicted positive rate than ChemE-G4 (60.9% vs. 91.9%), indicating limited functional enrichment despite structural diversification.

Pocket2Mol is a classical geometry-conditioned model originally developed for protein-ligand generation. We applied it to generate molecules using a geometrically well-characterized telomeric G4 DNA structure (PDB ID: 6CCW) without additional training. The predicted positive rate was only 4.3%, lower than the random baseline (16.5%). Attempts to adapt the model for G4 generation failed even after code-level modifications (see Supplementary Note 7.1), reflecting the limited transferability of residue-centric geometric encodings to nucleic acid targets. Similar input restrictions for nucleic acid systems were also observed in other advanced protein-centric models, such as Delete [8].

Molecules were also generated using two LLM-based generators. ChemGPT, while primarily designed for molecular representation and translation tasks, is also capable of molecular generation. In our study, it accepted natural-language prompts for molecule generation, yielding a chemical validity of 62.9%. Although it exhibited high scaffold novelty (99.8%), the predicted positive rate for G4 ligands was 20.2%, only slightly above the baseline.
We also briefly tested a general-purpose free-text molecular generator, ether0, which produced SMILES outputs but very few molecules that were chemically valid or biologically relevant (see Supplementary Note 7.2).

Chemical-space analysis revealed systematic differences among models. Pocket-conditioned generators (SPMM, Pocket2Mol, ChemGPT) produced distributions that remained largely separated from the G4LDB manifold, whereas ChemE-G4 generated molecules interspersed across its fine-grained branches (Fig. 6b). At the property level, descriptor distributions also varied across models (Tables S6.2 and S6.3). ChemE-G4 produced molecules with physicochemical profiles closely aligned to G4LDB, whereas protein-centric approaches such as Pocket2Mol exhibited the largest divergence among all baselines. Representative molecules from each model are shown in Fig. 6c, illustrating characteristic trade-offs among structural novelty, semantic alignment, and predicted functional relevance across different generative paradigms.

Generalizability of SemantiChem to rRNA and Mpro

Having established the relative advantages of SemantiChem in G4 ligand generation, we next assessed how its function-driven design strategy performs across biomolecular targets with distinct recognition regimes. We applied the same training framework to representative RNA and protein targets, including E. coli ribosomal RNA (rRNA) and the SARS-CoV-2 main protease (Mpro). Using the same model architecture, we initialized from the ChemE-SELFIES checkpoint and fine-tuned two new variants, ChemE-rRNA and ChemE-Mpro, using task-specific prompt-ligand pairs. For ChemE-rRNA, we curated 5,501 active compounds from PubChem (BioAssay AID:720706) targeting rRNA. For ChemE-Mpro, we used 3,642 known small-molecule inhibitors of Mpro retrieved from ChEMBL.

As shown in Fig. 7a, SemantiChem generalizes effectively to rRNA.
The ChemE-rRNA model maintained high validity (99.9%) and uniqueness (94.6%), while achieving even higher novelty (91.8%) than in the original G4 task. A total of 2,096 unique Bemis-Murcko scaffolds were generated, with a scaffold novelty of 76.5%. Over 94.0% of generated molecules achieved docking scores better than the reference ligand chloramphenicol [33]. Collectively, these results indicate that SemantiChem retains strong performance for RNA targets governed by non-pocket-like recognition.

Figure 7. (a) Summary of performance on three targets, including G-quadruplex structures (G4), ribosomal RNA (rRNA), and the SARS-CoV-2 main protease (Mpro), represented by fine-tuned instances of the SemantiChem framework (ChemE-G4, ChemE-rRNA, and ChemE-Mpro). Metrics include structure quality, scaffold novelty, semantic alignment, and docking-based functional relevance. Red values indicate comparatively lower numerical values observed for the ChemE-Mpro model. (b) Representative ligands generated by the three target-specific models, shown together with their predicted binding poses.

In contrast, reduced performance was observed when applying SemantiChem to the pocket-dominated Mpro system. For the Mpro task, although validity remained high (99.6%), uniqueness dropped to 68.4%. Only 612 unique scaffolds were generated, corresponding to a scaffold novelty of 43.8%. While all generated molecules exhibited favorable docking energies (standard free energy ≤ 0 kcal/mol), only 27.0% outperformed the reference inhibitor nirmatrelvir [34]. Structural inspection revealed reduced diversity and weaker alignment with canonical Mpro-associated features. These trends were also reflected in physicochemical properties: ChemE-rRNA outputs closely tracked their training distribution across key descriptors, whereas ChemE-Mpro deviated significantly, particularly in descriptors of hydrophobicity (LogP) and molecular polarity (TPSA) (Table S8.1).
Notably, the representative molecules selected for G4, rRNA, and Mpro (Fig. 7b) all possess novel Murcko scaffolds relative to their training sets, confirming that SemantiChem does not rely on scaffold memorization but exhibits regime-dependent generative behavior.

Discussion and Conclusion

In this study, we developed SemantiChem, an instruction-tuned generative framework for function-driven molecular design that translates high-level functional intent into chemically meaningful small molecules from natural language descriptions. Biomolecular targets governed by non-pocket-like recognition, such as many nucleic acid systems, present diffuse, flexible, and electrostatically complex interaction surfaces, for which geometry-dependent design paradigms are often ineffective. By operating directly on function-level design intent rather than predefined geometric constraints, this work demonstrates a practical strategy for addressing such recognition regimes that remain challenging for conventional structure-centric molecular design.

We demonstrated this capability using two representative systems, G4 DNA and rRNA, which pose distinct biochemical recognition challenges yet share non-pocket-like interaction characteristics. For both targets, the generated molecules were chemically valid and structurally diverse, and experimental testing confirmed G4 stabilization and polymerase inhibition for selected candidates. In particular, ChemE-1876 showed selective activity toward the KRAS promoter G4 and cytotoxicity against KRAS-overexpressing A549 cells, illustrating that function-driven, language-guided generation can yield structurally novel and biologically specific ligands. Together, these results indicate that the proposed design strategy can generalize across related non-pocket-like recognition regimes, with natural language serving as an interpretable interface for specifying functional objectives.

At the same time, the results also delineate limitations of the approach.
Available ligand datasets for non-pocket-like targets, especially nucleic acids, remain limited in size and structural diversity, which may constrain the scope of learnable design patterns. Moreover, while many generated molecules were chemically valid and experimentally active, challenges related to synthetic accessibility and downstream experimental feasibility persist. The reduced performance observed for pocket-dominated protein targets further highlights that the effectiveness of function-driven generation may be regime-dependent rather than universal. Future extensions incorporating richer biochemical constraints, selectivity cues, or feasibility-aware filtering may help expand the practical applicability of this framework.

Overall, this work establishes a function-level molecular design strategy in which natural language serves as a flexible interface for translating human design intent into chemical space. By explicitly aligning generative modeling with functional objectives rather than geometric assumptions, SemantiChem offers a complementary path for molecular discovery in biomolecular systems that lie beyond the effective reach of conventional structure-centric approaches.

Methods

Model selection and initialization
Two large language models (LLMs), Meta-LLaMA-3-8B-Instruct and LLaMA-3.1-ChemEinstein, were selected to investigate their capabilities in ligand generation. Both models were obtained from Hugging Face (see Supplementary Note 2.1 for model links) and initialized from their publicly available checkpoints. All stages of model development, including pretraining and instruction tuning, were carried out using the LLaMA Factory framework version 0.9.0.

SELFIES-based domain pretraining
Both models were subjected to lightweight domain-specific pretraining using approximately 10,000 molecular structures encoded in the SELFIES format. The data were obtained from the PubChem10M_SELFIES dataset (see Supplementary Note 2.1 for repository link).
The first 10,000 entries were selected without additional filtering or augmentation. Pretraining was conducted for 2 epochs using an autoregressive language modeling objective under the LoRA framework, with LoRA applied to all transformer layers. The training used a learning rate of 1e-4, a per-device batch size of 8, gradient accumulation over 8 steps, and cosine learning rate scheduling with a warmup ratio of 0.1. Input sequences were truncated at 256 tokens, and training was performed with fp16 precision. No task-specific prompts or semantic supervision were applied during this stage.

Prompt-driven multi-task fine-tuning
All models were fine-tuned on nucleic acid- and protein-targeted ligand design tasks using a unified prompt-based multi-task instruction tuning framework. Target-specific datasets were prepared for three molecular targets: G-quadruplexes (G4), ribosomal RNA (rRNA), and the SARS-CoV-2 main protease (Mpro). For G4, 4,442 ligand structures and corresponding functional prompts were derived from the G4LDB database (www.G4LDB.com; accessed in 2023). For rRNA, 5,501 compounds labeled as active were extracted from PubChem assay AID_720706. For Mpro, 4,025 small-molecule ligands were retrieved from the ChEMBL database by querying “SARS-CoV-2 Mpro” and applying a compound type filter restricted to small molecules. No additional filtering or activity thresholding was applied. All fine-tuning datasets and the training scripts are accessible during peer review (see Data and Code Availability).

The instruction tuning framework comprised four complementary task types. Type 1 involved the generation of known ligands from functional prompts. To enhance prompt diversity and improve model generalization, three distinct prompt templates were created and randomly assigned to ligand entries during preprocessing. Type 2 introduced scaffold-level constraints through fragment completion.
Type 3 focused on atom-level reconstruction of masked regions to strengthen localized chemical reasoning. Type 4 aimed to reinforce expert-level identity grounding by prompting the model to assume the persona of a domain specialist (e.g., a G4 ligand discovery researcher). This was implemented by adapting identity conditioning examples from the LLaMA Factory corpus, with task-specific substitutions. Prompt templates and representative examples are accessible during peer review (see Data and Code Availability) and are listed in Table S2.3. All tasks were trained using positive-only supervision, without contrastive objectives or predefined negative samples.

Fine-tuning was conducted jointly across all tasks using a mixed sampling strategy, in which all instruction-response pairs were pooled and drawn uniformly during training. Models were trained for 10 epochs using the LoRA method applied to all transformer layers, with a learning rate of 1e-4, a per-device batch size of 2, and gradient accumulation over 8 steps. A cosine learning rate scheduler with a warmup ratio of 0.1 was used. Training was performed with bfloat16 precision, and input sequences were truncated at 1,024 tokens.

Molecule generation and evaluation in the SemantiChem framework
Molecular structures were generated within the SemantiChem framework, which integrates prompt-based inference and evaluation. For this study, model training and inference were implemented via the LLaMA Factory API. Prompts were written in free-text format to elicit novel, chemically plausible ligands. Molecules were drawn with nucleus sampling (top-p = 0.9, temperature = 1.0), with parameters chosen on the basis of prior optimization (see Table S2.4). For quality assessment, 600 molecules were generated per model at each developmental stage (Base, SELFIES-pretrained, and Q&A fine-tuned). For downstream structural and functional analyses, 2,600 molecules were generated from each fully fine-tuned model.
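The nucleus (top-p) sampling step named above can be illustrated with a small, framework-independent sketch. This is not the LLaMA Factory implementation: the function names `nucleus_filter` and `sample_token` and the toy logits are hypothetical, and only the parameter values (top-p = 0.9, temperature = 1.0) are taken from the text.

```python
import math
import random


def nucleus_filter(logits, top_p=0.9, temperature=1.0):
    """Return {token_index: renormalized probability} for the top-p nucleus.

    Hypothetical helper for illustration only.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize within the nucleus so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, top_p=0.9, temperature=1.0, rng=random):
    """Draw one token index from the nucleus distribution."""
    nucleus = nucleus_filter(logits, top_p, temperature)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution over four tokens: 0.50, 0.30, 0.15, 0.05.
toy_logits = [math.log(p) for p in (0.50, 0.30, 0.15, 0.05)]
nucleus = nucleus_filter(toy_logits)  # the 0.05 tail token is excluded
```

With temperature = 1.0 the model distribution is left unscaled, so the only effect of these settings is to cut off the low-probability tail that falls outside the 0.9 cumulative mass.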
Outputs were collected in SELFIES format and converted to SMILES with RDKit [35]. Generation quality was evaluated using three metrics:
Validity: proportion of molecules that can be successfully parsed by RDKit.
Uniqueness: proportion of non-duplicate valid molecules.
Novelty: proportion of valid molecules absent from the fine-tuning set.

Prompt variation experiments
For each prompt variant, 500 molecules were generated using the ChemE-G4 model. Molecular structures were standardized and converted to canonical SMILES. Standard physicochemical descriptors were computed using RDKit, with additional custom descriptors (e.g., Max_fused, LCCS) implemented in Python (see Data and Code Availability). To evaluate whether different prompts produced distinguishable molecular distributions, binary classifiers were trained on descriptor sets using logistic regression with five-fold cross-validation. Classifier performance was reported as mean AUC ± standard deviation across folds. Descriptor importance was quantified from model coefficients and interpreted alongside kernel density estimates (KDE) of the top-weighted features. Representative molecules were selected to illustrate alignment with prompt semantics.

Structure-Based Analysis Tools and Metrics
Structure-based analyses were performed using RDKit. Tanimoto similarity was computed on MACCS keys, and Bemis-Murcko scaffolds were extracted for all compounds. To visualize the structural distribution of generated and reference ligands, two-dimensional tree-based layouts were constructed using the TMAP algorithm [25], based on approximate nearest-neighbor relationships among molecular fingerprints. Nodes were color-coded by compound class and analyzed to compare scaffold reuse and structural exploration. These analytical procedures were applied to structural outputs from all models and target systems.
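The three generation-quality metrics defined earlier in this section (validity, uniqueness, novelty) can be written out explicitly. The sketch below assumes that each generated string has already been decoded and canonicalized upstream (for example with RDKit), with `None` marking a parse failure. The function name `generation_metrics` and the choice of denominators (uniqueness and novelty both taken over valid molecules) are one reading of the definitions in the text, not the authors' evaluation script.

```python
def generation_metrics(canonical_smiles, training_smiles):
    """Compute validity, uniqueness, and novelty as fractions in [0, 1].

    canonical_smiles: one entry per generated molecule; a canonical SMILES
        string if parsing succeeded, or None if it failed (assumed to be
        produced by an upstream parser such as RDKit).
    training_smiles: canonical SMILES of the fine-tuning set.
    """
    n_generated = len(canonical_smiles)
    valid = [s for s in canonical_smiles if s is not None]
    n_valid = len(valid)
    training = set(training_smiles)

    # Validity: parsed molecules over all generated strings.
    validity = n_valid / n_generated if n_generated else 0.0
    # Uniqueness: distinct valid molecules over all valid molecules.
    uniqueness = len(set(valid)) / n_valid if n_valid else 0.0
    # Novelty: valid molecules not present in the fine-tuning set.
    novelty = sum(s not in training for s in valid) / n_valid if n_valid else 0.0

    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}


# Toy example: five generated strings, one parse failure, one duplicate,
# and one molecule ("CCO") that also occurs in the fine-tuning set.
generated = ["CCO", "CCO", None, "c1ccccc1", "CCN"]
metrics = generation_metrics(generated, training_smiles=["CCO"])
# metrics == {"validity": 0.8, "uniqueness": 0.75, "novelty": 0.5}
```

Because duplicates are compared as strings, the result depends on canonicalization being applied consistently to both the generated molecules and the fine-tuning set.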
G4 Binding Prediction with Graph-Level Attention Model
A graph-level attention model (GLAM) was used to predict the G4-binding potential of generated molecules. The model architecture followed the GLAM framework as previously reported [26] and was retrained on curated ligands from the G4LDB. Molecular graphs were constructed from standardized SMILES using RDKit. Full training procedures, model parameters, and performance metrics are provided in Supplementary Note 5. As a comparative baseline, a background set of one million drug-like small molecules was randomly sampled from PubChem using its API, applying Lipinski’s rule-of-five filters. The trained model was then applied to both generated and background compounds to estimate their probability of G4 activity.

Molecular Docking against Human Telomeric G4 DNA
Molecular docking simulations were performed using AutoDock Vina [36] to evaluate the binding potential of generated ligands against three distinct biological targets. Ligand preparation involved conversion from SMILES to 3D structures using RDKit, with hydrogen addition and coordinate embedding via distance geometry. Ligands were energy-minimized using the MMFF94 or UFF force fields, depending on availability, and saved as SDF files. These were subsequently converted to PDBQT format using Open Babel [37]. Ligands failing any step of conversion or docking due to embedding errors, unsupported atoms, or file format issues were excluded. For receptor preparation, the human telomeric G4 (PDB ID: 6CCW), KRAS G4 (PDB ID: 7X8M), and the SARS-CoV-2 main protease (Mpro, PDB ID: 7BQY) structures were processed by removing bound ligands, solvent molecules, and metal ions. Hydrogen atoms and Gasteiger charges were added using AutoDockTools [38], and the receptors were saved in PDBQT format. For the bacterial ribosomal RNA target, a single RNA chain (chain V [auth BA], extracted from PDB ID: 4V7T) proximal to the chloramphenicol binding site was selected as the receptor.
Hydrogen atoms were added using MolProbity’s reduce tool (version 4.5.2) to optimize protonation states and hydrogen-bonding geometry [39]. The resulting structure was then further processed in AutoDockTools to assign Gasteiger charges and exported in PDBQT format for docking. Docking grid boxes were defined as follows: for the G4 DNA (6CCW and 7X8M), a grid box radius of 15 Å was centered on the known ligand-binding site encompassing the G-tetrad core; for Mpro (7BQY), a radius of 30 Å was centered on the active site; and for the rRNA target (4V7T), a radius of 24 Å was used around the chloramphenicol binding site. For each successfully docked ligand, the lowest (most favorable) predicted binding affinity (kcal/mol) was recorded. Reference compounds (CX-5461 for G4 DNA, nirmatrelvir for Mpro, and chloramphenicol for rRNA) were docked under identical conditions to provide benchmarks for comparative evaluation.

Reagents and Materials
Pyridostatin (PDS, Cat. No. S7444), ChemE-1876 (Cat. No. S7572), Instruct-734 (Cat. No. S0486), ChemE-1732 (Cat. No. S9032), and Instruct-2189 (Cat. No. S6830) were purchased from Selleck Chemicals (Shanghai, China), and berberine (BER, Cat. No. SB8130) was purchased from Solarbio (Beijing, China). All compounds were prepared as DMSO stock solutions. DNA oligonucleotides were synthesized and HPLC-purified by Sangon Biotech (Shanghai, China). Lyophilized DNA was reconstituted in ultrapure water to prepare 100 µM stock solutions. Concentrations were determined by UV absorbance using a BioTek microplate reader with a Take3 module, and adjusted accordingly. All stocks were stored at -20°C until use.

Fluorescence-Based Thermal Melting Assay
G4-binding activity was assessed using a fluorescence-based thermal melting assay. Two G4-forming DNA sequences were tested: a 22-mer human telomeric repeat (hTel) and a 22-mer KRAS promoter sequence. Both probes were labeled with a 5'-FAM fluorophore and a 3'-BHQ1 quencher.
Probe sequences were:
hTel: 5'-FAM-AGGGTTAGGGTTAGGGTTAGGG-BHQ1-3'
KRAS: 5'-FAM-AGGGCGGTGTGGGAAGAGGGAA-BHQ1-3'
A duplex DNA (dsDNA) control was used to evaluate binding selectivity. It consisted of two complementary strands:
FAM strand: 5'-FAM-AGGTTGGTGAGTGATTGGAGGTT-3'
BHQ1 strand: 3'-TCCAACCACTCACTAA-C5-BHQ1
Equimolar duplex strands were annealed by heating to 95°C for 5 min followed by slow cooling to room temperature. Thermal melting experiments were conducted in 10 mM PBK buffer (pH 8.5) containing 80 mM K⁺ and 0.05% Tween-20. Final concentrations of DNA and compound were fixed at 100 nM and 1 µM, respectively. Fluorescence signals were recorded over a temperature gradient, and ligand-induced changes in melting temperature (ΔTₘ) were determined by comparing melting curves in the presence and absence of test compounds.

DNA Polymerase Stop Assay
The DNA polymerase stop assay was performed as previously described [28], with minor modifications. A 5′-FAM-labeled primer was mixed with a KRAS G4-containing DNA template at a 1.2:1 molar ratio.
Primer: 5′-FAM-TAATACGACTCACTATAGCAATTGC
Template: 5′-TGAATCCTGAGGGCGGTGTGGGAAGAGGGAAGATAGCTGCAC AATTGCTATAGTGAGTCGTATTA-3′
The mixtures were annealed by heating to 95°C for 5 min, followed by slow cooling to room temperature. Compounds were added at the indicated concentrations and incubated with the annealed DNA in the presence of 100 mM K⁺ at room temperature for 3 hours. Primer extension was carried out for 30 min at 37°C in a 50 µL reaction containing 0.2 µM DNA complex and 1× Taq PCR Master Mix (Sangon, China). Reaction products were resolved by electrophoresis on a 12% denaturing polyacrylamide gel. FAM-labeled DNA fragments were visualized using a GenoSens 2200 imaging system (Clinx, China).

CCK-8 Cytotoxicity Assay
Cytotoxicity was assessed in A549 (lung adenocarcinoma) and MCF-7 (breast adenocarcinoma) cell lines using a standard CCK-8 colorimetric assay (Beyotime, China).
Cells were seeded at a density of 1×10⁴ cells/well in 96-well plates and incubated overnight before treatment. Test compounds were added at a range of concentrations and incubated for 24 hours. Cell viability was quantified by measuring absorbance at 450 nm following incubation with CCK-8 reagent, according to the manufacturer’s instructions. Viability values were normalized to vehicle-treated controls, and IC₅₀ values were calculated using nonlinear regression (four-parameter logistic model) in GraphPad Prism 9. Each experiment was performed in independent triplicate.

Baseline models for G4 ligand generation
Baseline models were applied to the G4 ligand generation task, each producing 2,600 molecules. Outputs were evaluated using the same pipeline as in this work.
MOSES baselines: Adversarial autoencoder (AAE), variational autoencoder (VAE), and character-level recurrent neural network (CharRNN) were implemented with the official MOSES benchmarking platform [31] and generated SMILES using default configurations.
SPMM: A seed-based generator applied without task-specific training, initialized with the reference ligand pyridostatin (PDS).
Pocket2Mol: Applied in a zero-shot setting to the human telomeric G4 DNA structure (PDB ID: 6CCW) after ligand removal. The default “Sampling for PDB pockets” mode was used, with minor adjustments for nucleic acid compatibility. From 5,000 generated candidates, the first 2,600 valid molecules were retained. We also explored whether Pocket2Mol could be fine-tuned on nucleic acid-specific data, but training failed due to structural incompatibilities; details are provided in Supplementary Note 7.1.
ChemGPT: ChemGPT was obtained in the safetensors format and used for SELFIES-based molecular generation. For prompt-based sampling, input strings were tokenized and, when possible, converted into SELFIES prefixes. Molecules were generated with the same prompt settings and sampling parameters (temperature = 1.0, top-p = 0.9) as in SemantiChem.
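For the dose-response analysis in the CCK-8 assay described earlier in the Methods, the standard four-parameter logistic (4PL) curve and the normalization to vehicle controls can be sketched as follows. This only evaluates the model; it does not reproduce the nonlinear regression performed in GraphPad Prism, whose exact parameterization may differ. The function names, the optional blank subtraction, and the example numbers are assumptions made for illustration.

```python
def normalize_viability(absorbance, vehicle_absorbance, blank_absorbance=0.0):
    """Express an A450 reading as percent of the vehicle-treated control.

    Blank subtraction is an assumption; the text only states that values
    were normalized to vehicle-treated controls.
    """
    reference = vehicle_absorbance - blank_absorbance
    if reference <= 0:
        raise ValueError("vehicle signal must exceed the blank")
    return 100.0 * (absorbance - blank_absorbance) / reference


def four_pl(concentration, bottom, top, ic50, hill):
    """Standard four-parameter logistic dose-response curve.

    Viability tends to `top` at low concentration and to `bottom` at high
    concentration (for hill > 0); at concentration == ic50 the response is
    exactly halfway between the two plateaus.
    """
    if concentration <= 0 or ic50 <= 0:
        raise ValueError("concentration and ic50 must be positive")
    return bottom + (top - bottom) / (1.0 + (concentration / ic50) ** hill)


# Illustrative values only (not data from the study).
half_response = four_pl(concentration=10.0, bottom=0.0, top=100.0,
                        ic50=10.0, hill=1.5)  # 50.0, the curve midpoint
percent_viable = normalize_viability(0.6, vehicle_absorbance=1.2)  # 50.0
```

In this parameterization the fitted IC₅₀ is the concentration at the midpoint between the fitted plateaus, which coincides with 50% viability only when the plateaus are at 100% and 0%.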
Declarations

Data and Code Availability
The code used in this study is available at: https://github.com/ADNLab-SCU/SemantiChem. This submission version contains only utility scripts used in the computational experiments. A fully documented pipeline and complete resources will be released upon publication. Supplementary datasets generated in this study have been deposited on Figshare (https://figshare.com/s/0fadc7628f71cb4600a3). Additional data supporting the findings of this work are available from the corresponding author upon reasonable request and will be made publicly accessible upon publication.

Acknowledgement
This work was supported by the National Natural Science Foundation of China [22077087, 22474082], the Sichuan Science and Technology Program [2025NSFJQ0019], and the Fundamental Research Funds for the Central Universities. The authors would like to thank the Analytical & Testing Center of Sichuan University.

References
1. Jayatunga, M. K. P., Xie, W., Ruder, L., Schulze, U. & Meier, C. AI in small-molecule drug discovery: a coming wave? Nat. Rev. Drug Discov. 21, 175–176 (2022).
2. Krishnan, A. et al. A generative deep learning approach to de novo antibiotic design. Cell 188, 5962–5979 e5922 (2025).
3. Peng, X. et al. Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets. In Proc. 39th International Conference on Machine Learning 17644–17655 (PMLR, 2022).
4. Munson, B. P. et al. De novo generation of multi-target compounds using deep generative chemistry. Nat. Commun. 15, 3636 (2024).
5. Ozawa, M., Nakamura, S., Yasuo, N. & Sekijima, M. IEV2Mol: Molecular Generative Model Considering Protein-Ligand Interaction Energy Vectors. J. Chem. Inf. Model. 64, 6969–6978 (2024).
6. Wu, K. et al. TamGen: drug design with target-aware molecule generation through a chemical language model. Nat. Commun. 15, 9360 (2024).
7. Feng, W. et al. Generation of 3D molecules in pockets via a language model. Nat. Mach. Intell. 6, 62–73 (2024).
8. Chen, S. et al. Deep lead optimization enveloped in protein pocket and its application in designing potent and selective ligands targeting LTK protein. Nat. Mach. Intell. 7, 448–458 (2025).
9. Varshney, D., Spiegel, J., Zyner, K., Tannahill, D. & Balasubramanian, S. The regulation and functions of DNA and RNA G-quadruplexes. Nat. Rev. Mol. Cell Biol. 21, 459–474 (2020).
10. Sato, K. et al. RNA transcripts regulate G-quadruplex landscapes through G-loop formation. Science 388, 1225–1231 (2025).
11. Kovachka, S. et al. Small molecule approaches to targeting RNA. Nat. Rev. Chem. 8, 120–135 (2024).
12. Neidle, S. Quadruplex nucleic acids as targets for anticancer therapeutics. Nat. Rev. Chem. 1, 0041 (2017).
13. Hänsel-Hertsch, R. et al. Landscape of G-quadruplex DNA structural regions in breast cancer. Nat. Genet. 52, 878–883 (2020).
14. Hänsel-Hertsch, R., Di Antonio, M. & Balasubramanian, S. DNA G-quadruplexes in the human genome: detection, functions and therapeutic potential. Nat. Rev. Mol. Cell Biol. 18, 279–284 (2017).
15. Sultan, M. et al. Targeting the G-quadruplex as a novel strategy for developing antibiotics against hypervirulent drug-resistant Staphylococcus aureus. J. Biomed. Sci. 32, 15 (2025).
16. Ouyang, L. et al. Training language models to follow instructions with human feedback. Preprint at https://arxiv.org/abs/2203.02155 (2022).
17. Chung, H. W. et al. Scaling Instruction-Finetuned Language Models. Preprint at https://arxiv.org/abs/2210.11416 (2022).
18. Zheng, Y. et al. Large language models for scientific discovery in molecular property prediction. Nat. Mach. Intell. 7, 438–447 (2025).
19. Zhang, D. et al. ChemLLM: A Chemical Large Language Model. Preprint at https://arxiv.org/abs/2402.06852 (2024).
20. Narayanan, S. M. et al. Training a Scientific Reasoning Model for Chemistry. Preprint at https://arxiv.org/abs/2506.17238 (2025).
21. Edwards, C. et al. Translation between Molecules and Natural Language. Preprint at https://arxiv.org/abs/2204.11817 (2022).
22. Li, Q. et al. G4LDB: a database for discovering and studying G-quadruplex ligands. Nucleic Acids Res. 41, D1115–1123 (2013).
23. Wang, Y. H. et al. G4LDB 2.2: a database for discovering and studying G-quadruplex and i-Motif ligands. Nucleic Acids Res. 50, D150–D160 (2022).
24. Yang, Q. F. et al. G4LDB 3.0: a database for discovering and studying G-quadruplex and i-motif ligands. Nucleic Acids Res. 53, D91–D98 (2025).
25. Probst, D. & Reymond, J. L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 12, 12 (2020).
26. Li, Y. et al. An adaptive graph learning method for automated molecular interactions and properties predictions. Nat. Mach. Intell. 4, 645–651 (2022).
27. Xu, H. et al. CX-5461 is a DNA G-quadruplex stabilizer with selective lethality in BRCA1/2 deficient tumours. Nat. Commun. 8, 14432 (2017).
28. Wang, K. B. et al. Structural insight into the bulge-containing KRAS oncogene promoter G-quadruplex bound to berberine and coptisine. Nat. Commun. 13, 6016 (2022).
29. Chang, J. & Ye, J. C. Bidirectional generation of structure and properties through a single molecular foundation model. Nat. Commun. 15, 2323 (2024).
30. Frey, N. C. et al. Neural scaling of deep chemical models. Nat. Mach. Intell. 5, 1297–1305 (2023).
31. Polykovskiy, D. et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol. 11, 565644 (2020).
32. Rodriguez, R. et al. A novel small molecule that alters shelterin integrity and triggers a DNA-damage response at telomeres. J. Am. Chem. Soc. 130, 15758–15759 (2008).
33. Xue, L., Spahn, C. M. T., Schacherl, M. & Mahamid, J. Structural insights into context-dependent inhibitory mechanisms of chloramphenicol in cells. Nat. Struct. Mol. Biol. 32, 257–267 (2025).
34. Owen, D. R. et al. An oral SARS-CoV-2 M(pro) inhibitor clinical candidate for the treatment of COVID-19. Science 374, 1586–1593 (2021).
35. Landrum, G. et al. RDKit: open-source cheminformatics software. GitHub repository, https://github.com/rdkit/rdkit (2016).
36. Eberhardt, J., Santos-Martins, D., Tillack, A. F. & Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 61, 3891–3898 (2021).
37. O'Boyle, N. M. et al. Open Babel: An open chemical toolbox. J. Cheminform. 3, 33 (2011).
38. Forli, S. et al. Computational protein-ligand docking and virtual drug screening with the AutoDock suite. Nat. Protoc. 11, 905–919 (2016).
39. Williams, C. J. et al. MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci. 27, 293–315 (2018).

Additional Declarations
There is no competing interest.

Supplementary Files
ESI20260208.docx: Supplementary information
20251209SupplementaryData.zip: Dataset 1
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8819034","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":589670857,"identity":"e3219cd3-c4b1-450e-bb23-02c213a9aab9","order_by":0,"name":"Qianfan Yang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAyklEQVRIiWNgGAWjYNACGxsIzUO8lrQ00rUcJkGLwfGzh1+8SThvzz8jgfHB2zYGeXOCWs7kpVnOSbidOONGArPh3DYGw50NhLQcyDEz5v1xO8FAIoFNmreNIcHgACEt59+YGfMknLMHamH/TZyWGznGj3kSDjBuANrCTJQWyRtvzBjnJCQnzjjzsFlyzjkJww2EtPCdzzH+8CbBzp6/PfnghzdlNvIEbVE4wMAmAYkOxgYgIUFAPRDINzAwfyAh0kfBKBgFo2AkAgDdXEBZPfXvKAAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0003-1830-1154","institution":"Sichuan University","correspondingAuthor":true,"prefix":"","firstName":"Qianfan","middleName":"","lastName":"Yang","suffix":""},{"id":589670858,"identity":"14a1e0e0-7b31-41c8-9987-7716cac54da3","order_by":1,"name":"Xurui Wang","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Xurui","middleName":"","lastName":"Wang","suffix":""},{"id":589670859,"identity":"edbf79b3-285b-4654-be42-169a664f1187","order_by":2,"name":"Yanxi Wang","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Yanxi","middleName":"","lastName":"Wang","suffix":""},{"id":589670860,"identity":"af26fff7-0314-4c3d-82fb-3a69256a8051","order_by":3,"name":"Ruizhao Zhu","email":"","orcid":"","institution":"Sichuan 
University","correspondingAuthor":false,"prefix":"","firstName":"Ruizhao","middleName":"","lastName":"Zhu","suffix":""},{"id":589670861,"identity":"47ed61d8-73f0-4220-aed7-ed4494965019","order_by":4,"name":"Hailiang Li","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Hailiang","middleName":"","lastName":"Li","suffix":""},{"id":589670862,"identity":"03912d12-8148-4d75-bee4-e06c0692e6c3","order_by":5,"name":"Xinghong Wu","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Xinghong","middleName":"","lastName":"Wu","suffix":""},{"id":589670863,"identity":"86270e3a-6c86-4a9d-84a1-5b5140be3d06","order_by":6,"name":"Xinyi Zhang","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Xinyi","middleName":"","lastName":"Zhang","suffix":""},{"id":589670864,"identity":"a91675f7-d4ab-41c8-b92b-688b97481ad4","order_by":7,"name":"Mingyuan Zhou","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Mingyuan","middleName":"","lastName":"Zhou","suffix":""},{"id":589670865,"identity":"913bc653-e930-4af6-b399-257f55d93f59","order_by":8,"name":"Huaiwen Pu","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Huaiwen","middleName":"","lastName":"Pu","suffix":""},{"id":589670866,"identity":"eb13343f-5629-4184-ac1d-21be5286cc74","order_by":9,"name":"Kaicong Cai","email":"","orcid":"","institution":"Fujian Normal University","correspondingAuthor":false,"prefix":"","firstName":"Kaicong","middleName":"","lastName":"Cai","suffix":""},{"id":589670867,"identity":"add787b7-4dde-4277-a964-b1ae708591bd","order_by":10,"name":"Yanan Tang","email":"","orcid":"https://orcid.org/0000-0003-0964-2352","institution":"Sichuan 
University","correspondingAuthor":false,"prefix":"","firstName":"Yanan","middleName":"","lastName":"Tang","suffix":""},{"id":589670868,"identity":"0b4dfc52-0225-4259-bea3-c68ce9d2de45","order_by":11,"name":"Feng Li","email":"","orcid":"","institution":"Sichuan University","correspondingAuthor":false,"prefix":"","firstName":"Feng","middleName":"","lastName":"Li","suffix":""}],"badges":[],"createdAt":"2026-02-08 04:10:28","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8819034/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8819034/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":102473628,"identity":"801fe67c-d80c-4104-b947-fa1b2a36d629","added_by":"auto","created_at":"2026-02-12 04:53:04","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":2408565,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eStructure-driven versus function-driven molecular design paradigms.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSchematic comparison between conventional structure-driven molecular design and function-driven molecular generation. In structure-driven paradigms, molecular function is inferred from localized geometric complementarity within compact and well-defined binding pockets, as exemplified by pocket-dominated protein targets. In contrast, function-driven design addresses biomolecular systems governed by non-pocket-like recognition, where interaction surfaces are extended, flexible, and context-dependent, and molecular function emerges from distributed stacking, electrostatic, and topological interactions. G-quadruplexes (G4) are shown here as a representative system illustrating this non-pocket-like recognition regime. 
This conceptual distinction underlies the design strategy explored in this work.\u003c/p\u003e","description":"","filename":"Fig120260208.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/91b5a29b0d0706e84262aa38.png"},{"id":102473634,"identity":"d38e378a-4d2a-4b99-b753-b2959ddb9bc5","added_by":"auto","created_at":"2026-02-12 04:53:04","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":24086039,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eWorkflow of the study.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(a) Training of SemantiChem. A general-purpose language model is adapted through molecular-syntax pretraining and target-specific instruction tuning to obtain models aligned with ligand design. (b) Model-level evaluation. The trained model is assessed for chemical language comprehension, responsiveness to prompt variations, and semantic-structural consistency. (c) Molecule generation. Simple textual prompts are used to generate SELFIES strings, which are decoded and analyzed in chemical space. (d) Downstream validation. Generated molecules are subjected to computational and experimental assays to identify candidate ligands.\u003c/p\u003e","description":"","filename":"Fig220251209.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/8e76cfc55e582b12219da181.png"},{"id":102473632,"identity":"e621b816-e93e-42bd-abd3-d4c89563ed5f","added_by":"auto","created_at":"2026-02-12 04:53:04","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":5052793,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eG4-specific model construction, instruction tuning, and generation performance.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(a) G4 ligand generation models derived from two base LLMs. 
\u003cem\u003eMeta-LLaMA-3-8B-Instruct\u003c/em\u003e and \u003cem\u003eLLaMA-3.1-BestMix-ChemEinstein\u003c/em\u003ewere adapted through SELFIES pretraining and subsequently fine-tuned on the G4LDB dataset to obtain \u003cem\u003eInstruct-G4\u003c/em\u003e and \u003cem\u003eChemE-G4\u003c/em\u003e. (b) Four types of question-and-answer (Q\u0026amp;A) pairs used for instruction tuning, with representative examples from each type. (c) Generation metrics across model stages and ablation settings. Validity, uniqueness, and novelty were evaluated for Base, SELFIES-pretrained, and G4-tuned models, as well as Type-1-only and four-type multitask tuning variants.\u003c/p\u003e","description":"","filename":"Fig320251117.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/3eb5b16eaa6165480fc829ab.png"},{"id":102473629,"identity":"183cd36d-92ce-4447-b827-66aa21df0879","added_by":"auto","created_at":"2026-02-12 04:53:04","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":10116130,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCharacterization and in silico evaluation of G4 ligands generated by SemantiChem.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(a) Simple textualprompt used for G4 ligand generation and a representative SELFIES output produced by \u003cem\u003eChemE-G4\u003c/em\u003e. (b) Similarity distributions between generated molecules and known G4 ligands for \u003cem\u003eInstruct-G4\u003c/em\u003e and \u003cem\u003eChemE-G4\u003c/em\u003e. (c) Bemis-Murcko scaffold overlap among molecules generated by the two model variants and ligands in G4LDB. (d) TMAP-based classification of generated molecules into A-class (directly connected to known G4 ligands) and B-class (connected only to A-class nodes). 
(e) Proportions of molecules predicted as G4-active by the GLAM classifier for \u003cem\u003eInstruct-G4\u003c/em\u003e, \u003cem\u003eChemE-G4\u003c/em\u003e, and a random drug-like control set. (f) Docking score distributions (range from -15 to 0 kcal/mol) for \u003cem\u003eChemE-G4\u003c/em\u003e-generated ligands against human telomeric G4 (PDB: 6CCW) and two binding sites within the KRAS promoter G4 (PDB: 7X8M). Dashed lines indicate reference docking scores of the clinical-stage G4 ligand CX-5461 against these G4 receptors. The full distribution is shown in Fig. S4.2.\u003c/p\u003e","description":"","filename":"Fig420251117.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/03f573d5ce782edfb43a5eb5.png"},{"id":102473633,"identity":"2758392f-feaf-4580-856b-2546692c4a63","added_by":"auto","created_at":"2026-02-12 04:53:04","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":13786242,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eStructural and functional evaluation of generated G4 ligands.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(a) Chemical structures of four ligands selected for experimental validation. (b) FRET melting assay curves for the KRAS promoter G4 in the presence and absence of each ligand. (c) Polymerase stop assay showing dose-dependent inhibition of DNA synthesis on the KRAS G4 template, with a well-characterized natural G4 stabilizer berberine (BER) as a reference. (d) Cytotoxicity dose-response curves in A549 (lung carcinoma) and MCF-7 (breast cancer) cell lines. 
(e) Summary of experimental measurements.\u003c/p\u003e","description":"","filename":"Fig520251118.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/1655b854678416e51817103b.png"},{"id":102745676,"identity":"ec5f7deb-7766-40b5-b460-87f8ad6f9805","added_by":"auto","created_at":"2026-02-16 08:53:15","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":10664169,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eComparative evaluation of baseline generative models on the G4 ligand design task.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(a) Summary of generation performance for CharRNN, SPMM, Pocket2Mol, ChemGPT, and \u003cem\u003eChemE-G4\u003c/em\u003e, including metrics for structural quality, scaffold-level statistics, semantic similarity, and predicted functional relevance based on docking and GLAM classification. Red values indicate lower metric values within each category. (b) TMAP projections showing the chemical-space distributions of molecules generated by the five models alongside ligands from G4LDB. (c) Representative ligands generated by each model. 
Red-highlighted labels indicate selected attributes displayed for each molecule.\u003c/p\u003e","description":"","filename":"Fig620251117.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/c4a83be7940bb4a8b162b0ce.png"},{"id":102746413,"identity":"c58fa689-058b-4812-8c08-b00edd2d6c66","added_by":"auto","created_at":"2026-02-16 08:57:33","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":4928544,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eGeneralization performance of SemantiChem across diverse biomolecular targets.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e(a) Summary of performance on three targets, including G-quadruplex structures (G4), ribosomal RNA (rRNA), and the SARS-CoV-2 main protease (Mpro), represented by fine-tuned instances of the SemantiChem framework (\u003cem\u003eChemE-G4\u003c/em\u003e, \u003cem\u003eChemE-rRNA\u003c/em\u003e and \u003cem\u003eChemE-Mpro\u003c/em\u003e). Metrics include structure quality, scaffold novelty, semantic alignment, and docking-based functional relevance. Red values indicate comparatively lower numerical values observed for the \u003cem\u003eChemE-Mpro\u003c/em\u003e model. 
(b) Representative ligands generated by the three target-specific models, shown together with their predicted binding poses.\u003c/p\u003e","description":"","filename":"Fig720251120.png","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/e352b6ed7889656840eaa98e.png"},{"id":105751739,"identity":"4e1e9549-19d3-464f-82bc-a6a14f96e67d","added_by":"auto","created_at":"2026-03-30 15:39:59","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":62190345,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/fe722381-8d0c-4dc0-b1d4-c391f5439963.pdf"},{"id":102473630,"identity":"e11a9836-9ef3-4d12-a252-8f1b362e4857","added_by":"auto","created_at":"2026-02-12 04:53:04","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":2099195,"visible":true,"origin":"","legend":"Supplementary information","description":"","filename":"ESI20260208.docx","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/b9a5222b57d61d4ca91273f3.docx"},{"id":102745881,"identity":"e660f65d-7c3b-467c-990e-922277bae9fb","added_by":"auto","created_at":"2026-02-16 08:54:31","extension":"zip","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":15097496,"visible":true,"origin":"","legend":"Dataset 1","description":"","filename":"20251209SupplementaryData.zip","url":"https://assets-eu.researchsquare.com/files/rs-8819034/v1/5c88b36d886554b4e5141583.zip"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Function-Driven Molecular Design Enabled by Instruction-Tuned Large Language Models","fulltext":[{"header":"Introduction","content":"\u003cp\u003eGenerative artificial intelligence has substantially expanded the scope of molecular design by enabling the \u003cem\u003ede novo\u003c/em\u003e generation of compounds with 
specified biological or chemical functions\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Much of this progress to date has been achieved in targets that present compact and well-defined binding pockets\u003csup\u003e\u003cspan additionalcitationids=\"CR4 CR5 CR6 CR7\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e, where molecular function can be closely associated with localized geometric features. However, advances in molecular biology have increasingly revealed biomolecular systems that involve extended, flexible, or delocalized interaction modes. These systems challenge the implicit assumption that molecular function can be inferred from localized geometric complementarity, and conventional structure-centric molecular design strategies become increasingly unreliable. This shift highlights the need for generative approaches that can link functional objectives to molecular structure in a general and transferable manner.\u003c/p\u003e \u003cp\u003eG-quadruplexes (G4) represent a class of biomolecular systems governed by non-pocket-like molecular recognition\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. 
Rather than presenting localized and geometrically constrained interaction sites, G4 structures expose relatively shallow and highly polymorphic interaction surfaces, with ligand recognition dominated by stacking interactions, electrostatics, and higher-order topology\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). In practice, the design of G4-binding ligands has largely focused on functional outcomes, such as G4 stabilization, transcriptional interference, or selective cellular effects\u003csup\u003e\u003cspan additionalcitationids=\"CR13 CR14\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. However, most existing structure-driven generative models are not designed to accommodate the function-oriented nature of G4 ligand design. In our evaluation, several representative approaches showed limited applicability when applied to G4 targets, failing to consistently produce chemically meaningful ligands (detailed in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea and Supplementary Note 7).\u003c/p\u003e \u003cp\u003eThese challenges motivate the development of generative design strategies that can translate high-level functional intent directly into chemical structures, without relying on explicit pocket geometries or predefined molecular templates. 
In this regard, large language models (LLMs) provide a fundamentally different design interface by operating directly on natural language descriptions of function, behavior, and constraint\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. To date, however, most applications of LLMs in chemistry have focused on tasks such as property prediction\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e, molecular annotation\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e, or text-molecule alignment\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e, rather than on the direct generation of novel chemical structures guided by task-level functional intent (ESI, Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). Consequently, the potential of language-based models as generative engines for function-driven molecular design remains largely unexplored.\u003c/p\u003e \u003cp\u003eIn this work, we introduce SemantiChem, an instruction-tuned generative framework designed to translate high-level functional design intent into chemically meaningful molecular structures. By moving beyond predefined geometric constraints, molecular scaffolds, or pocket-centric assumptions, SemantiChem enables controllable, function-driven molecular generation from natural language descriptions. We apply this framework to G4, a representative system governed by non-pocket-like molecular recognition, and experimentally validate generated candidates through assays of G4 stabilization, polymerase stalling, and cellular response. 
The same design pipeline is further evaluated on a structurally distinct RNA target and on a pocket-dominated protease target for contrast. Collectively, these results establish SemantiChem as a function-level molecular design approach and illustrate its applicability to biomolecular targets that lack well-defined structural constraints.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eSchematic comparison between conventional structure-driven molecular design and function-driven molecular generation. In structure-driven paradigms, molecular function is inferred from localized geometric complementarity within compact and well-defined binding pockets, as exemplified by pocket-dominated protein targets. In contrast, function-driven design addresses biomolecular systems governed by non-pocket-like recognition, where interaction surfaces are extended, flexible, and context-dependent, and molecular function emerges from distributed stacking, electrostatic, and topological interactions. G-quadruplexes (G4) are shown here as a representative system illustrating this non-pocket-like recognition regime. This conceptual distinction underlies the design strategy explored in this work.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eSemantiChem Pipeline\u003c/h2\u003e \u003cp\u003eTo enable instruction-driven molecular generation for biomolecular targets governed by non-pocket-like recognition, we developed SemantiChem, a modular framework that transforms general-purpose LLMs into chemically grounded generative systems aligned to functional design objectives (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
The framework integrates model training, model-level evaluation, molecule generation, and downstream validation into a unified workflow, establishing a direct link between linguistic intent and functional molecules.\u003c/p\u003e \u003cp\u003eIn the training stage, SemantiChem adapts a general-purpose LLM through two sequential procedures: molecular-syntax pretraining and target-specific instruction tuning (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea). Each model progresses through three conceptual stages: (1) \u003cb\u003eBase\u003c/b\u003e, which preserves the original linguistic capacity of the pretrained LLM; (2) \u003cb\u003eSELFIES-pretrained\u003c/b\u003e, in which the model is exposed to 10,000 molecules from PubChem10M_SELFIES to develop fundamental chemical-syntax fluency; and (3) \u003cb\u003etarget-specific instruction-tuned\u003c/b\u003e, which aligns the learned molecular representations with task-specific functional design intent through supervised instruction tuning (see Supplementary Note 2). These stages provide the basis for generating chemically valid and task-relevant structures from language-based design instructions. Model variants are denoted as “base + stage”, for example \u003cem\u003eInstruct-Base\u003c/em\u003e, \u003cem\u003eInstruct-SELFIES\u003c/em\u003e, \u003cem\u003eChemE-G4\u003c/em\u003e, and \u003cem\u003eChemE-rRNA\u003c/em\u003e.\u003c/p\u003e \u003cp\u003eTo characterize how supervision affects model behavior, SemantiChem incorporates a model-level evaluation module (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). This component examines three aspects of the trained model: its ability to interpret domain-related chemical language, its response to controlled prompt modifications, and the extent to which semantic cues correspond to changes in structural outputs. 
Since open-ended natural-language molecular generation has not been addressed in prior work, existing evaluation practices do not apply to this setting. Therefore, this framework provides a systematic and generalizable basis for assessing such models. Detailed procedures and metrics are provided in Supplementary Note 3.\u003c/p\u003e \u003cp\u003eAfter training and evaluation, the model is applied to generate ligands for specific biomolecular targets with simple textual prompts (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec). The model outputs candidate structures in SELFIES format, which are decoded, filtered for validity and uniqueness, and examined using standard chemical-space analyses. This workflow enables \u003cem\u003ede novo\u003c/em\u003e molecular generation guided solely by functional descriptions expressed in natural language, without predefined scaffolds or geometric constraints. All generated molecules were assessed \u003cem\u003ein silico\u003c/em\u003e using classifier-based activity prediction and molecular docking. Candidates were selected for further wet-lab testing based on commercial availability and expert assessment (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed). With these steps, SemantiChem establishes a direct route from language-based design instructions to chemically grounded molecular candidates.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e(a) Training of SemantiChem. A general-purpose language model is adapted through molecular-syntax pretraining and target-specific instruction tuning to obtain models aligned with ligand design. (b) Model-level evaluation. The trained model is assessed for chemical language comprehension, responsiveness to prompt variations, and semantic-structural consistency. (c) Molecule generation. Simple textual prompts are used to generate SELFIES strings, which are decoded and analyzed in chemical space. 
(d) Downstream validation. Generated molecules are subjected to computational and experimental assays to identify candidate ligands.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eSemantiChem for generating G4 ligands\u003c/h3\u003e\n\u003cp\u003eBuilding on the SemantiChem framework described above, we first demonstrated its applicability using G4s, a representative biomolecular system governed by non-pocket-like molecular recognition. Two base models with distinct pretraining backgrounds were explored: Meta-LLaMA-3-8B-Instruct (\u003cem\u003eInstruct\u003c/em\u003e), a general-purpose model without chemical exposure, and LLaMA-3.1-ChemEinstein (\u003cem\u003eChemE\u003c/em\u003e), pretrained on SMILES data. Two generative variants, \u003cem\u003eInstruct-G4\u003c/em\u003e and \u003cem\u003eChemE-G4\u003c/em\u003e, were obtained by fine-tuning these base models using 4,442 reported G4 ligands from the G4 ligand database (G4LDB)\u003csup\u003e\u003cspan additionalcitationids=\"CR23\" citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e–\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e(a) G4 ligand generation models derived from two base LLMs. \u003cem\u003eMeta-LLaMA-3-8B-Instruct\u003c/em\u003e and \u003cem\u003eLLaMA-3.1-BestMix-ChemEinstein\u003c/em\u003e were adapted through SELFIES pretraining and subsequently fine-tuned on the G4LDB dataset to obtain \u003cem\u003eInstruct-G4\u003c/em\u003e and \u003cem\u003eChemE-G4\u003c/em\u003e. (b) Four types of question-and-answer (Q\u0026amp;A) pairs used for instruction tuning, with representative examples from each type. (c) Generation metrics across model stages and ablation settings. 
Validity, uniqueness, and novelty were evaluated for Base, SELFIES-pretrained, and G4-tuned models, as well as Type-1-only and four-type multitask tuning variants.\u003c/p\u003e \u003cp\u003eG4-specific fine-tuning was organized into four types of question-and-answer (Q\u0026amp;A) pairs (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). \u003cb\u003eType-1\u003c/b\u003e, Prompt-based Known Ligand Generation, promotes open-ended alignment between functional instructions and chemically plausible structures. \u003cb\u003eType-2\u003c/b\u003e, Fragment Completion, introduces scaffold-level constraints to enhance the model’s understanding of structural continuity. \u003cb\u003eType-3\u003c/b\u003e, Atom Completion, reinforces localized chemical intuition through atom-level reconstruction of masked regions. \u003cb\u003eType-4\u003c/b\u003e, Identity Confirmation, trains the models to recognize underspecified or invalid prompts, cultivating a domain-consistent expert persona.\u003c/p\u003e \u003cp\u003eGeneration quality across training stages was evaluated by sampling 600 molecules from each checkpoint and assessing validity, uniqueness, and novelty (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec). During the \u003cem\u003eBase\u003c/em\u003e and \u003cem\u003eSELFIES\u003c/em\u003e stages, both models exhibited limited chemically meaningful generation. The \u003cem\u003eInstruct-base\u003c/em\u003e model produced incoherent text, whereas \u003cem\u003eChemE-base\u003c/em\u003e generated malformed SMILES-like strings, with validity of 0% in both cases. SELFIES pretraining improved syntactic fluency, enabling generation of SELFIES strings with chemically relevant tokens and bracket structures. However, persistent errors in ring closures and branching logic kept validity below 1.2%. 
Substantial improvements emerged after full instruction tuning with the G4LDB dataset: the \u003cem\u003eChemE-G4\u003c/em\u003e model achieved 99.8% validity, 94.0% uniqueness, and 67.3% novelty, demonstrating high structural fluency and molecular diversity.\u003c/p\u003e \u003cp\u003eTo better understand the contribution of individual prompt types, \u003cem\u003eChemE-SELFIES\u003c/em\u003e was fine-tuned using only the Type-1 Q\u0026amp;A pair, which serves as the core task among the four types and was used here for ablation analysis of open-ended prompt-based generation. This simplified model (\u003cem\u003eChemE-G4 Type-1\u003c/em\u003e) retained high validity (99.2%) and uniqueness (92.2%) but exhibited reduced novelty (42.2%), indicating a narrower generative distribution. Compared with the single-task variant, \u003cem\u003eChemE-G4\u003c/em\u003e, which was fine-tuned on all four Q\u0026amp;A types, showed consistent improvements across all metrics, confirming that multi-task instruction tuning enhances chemical correctness while maintaining structural diversity. Interestingly, \u003cem\u003eInstruct-G4 Type-1\u003c/em\u003e produced higher novelty (82.0%) but lower validity (76.8%) than \u003cem\u003eInstruct-G4\u003c/em\u003e, reflecting a trade-off between generative diversity and structural precision likely influenced by the inductive biases of the base models.\u003c/p\u003e \u003cp\u003eTo further examine how variations in functional instructions affect model behavior, we applied the evaluation framework to both model series (see Supplementary Note 3). The analysis revealed consistent and interpretable shifts in molecular descriptors in response to prompt changes. 
Moreover, flexibility-related features such as rotatable bonds varied systematically between stacking-oriented and topology-selective prompts, suggesting that the model adapts its outputs according to contextual functional cues rather than reproducing fixed biases.\u003c/p\u003e\n\u003ch3\u003eCharacterization of Generated G4 Ligands\u003c/h3\u003e\n\u003cp\u003eHaving trained the models for G4 ligand generation and examined the effects of prompt design, we next used a single, standardized function-oriented prompt “\u003cem\u003ePlease show me a G-quadruplex ligand which hasn’t been reported. Be sure the structure is novel and unique.\u003c/em\u003e” to produce candidate molecules (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea). For each model, 2,600 molecules were generated for characterization and comparison with known G4 ligands.\u003c/p\u003e \u003cp\u003eTo evaluate structural relevance, Tanimoto similarity was calculated for each set of generated molecules against existing G4 ligands in the training set, returning high similarity scores for both \u003cem\u003eChemE-G4\u003c/em\u003e (mean = 0.88) and \u003cem\u003eInstruct-G4\u003c/em\u003e (mean = 0.87) models (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb). We also found that novelty increased with sample size, rising from ~70% at 600 samples to over 75% at 2,600. While ~25% of generated molecules exactly matched known ligands, the majority were structurally distinct, demonstrating the models’ capacity to recover active G4-binding motifs while exploring new regions of G4-relevant chemical space.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e(a) Simple textual prompt used for G4 ligand generation and a representative SELFIES output produced by \u003cem\u003eChemE-G4\u003c/em\u003e. 
(b) Similarity distributions between generated molecules and known G4 ligands for \u003cem\u003eInstruct-G4\u003c/em\u003e and \u003cem\u003eChemE-G4\u003c/em\u003e. (c) Bemis-Murcko scaffold overlap among molecules generated by the two model variants and ligands in G4LDB. (d) TMAP-based classification of generated molecules into A-class (directly connected to known G4 ligands) and B-class (connected only to A-class nodes). (e) Proportions of molecules predicted as G4-active by the GLAM classifier for \u003cem\u003eInstruct-G4\u003c/em\u003e, \u003cem\u003eChemE-G4\u003c/em\u003e, and a random drug-like control set. (f) Docking score distributions (range from −15 to 0 kcal/mol) for \u003cem\u003eChemE-G4\u003c/em\u003e-generated ligands against human telomeric G4 (PDB: 6CCW) and two binding sites within the KRAS promoter G4 (PDB: 7X8M). Dashed lines indicate reference docking scores of the clinical-stage G4 ligand CX-5461 against these G4 receptors. The full distribution is shown in Fig. S4.2.\u003c/p\u003e \u003cp\u003eScaffold-level analysis also revealed a balance between reuse and innovation. Each model produced over 1,400 unique Bemis-Murcko scaffolds, with 16–20% overlapping with scaffolds in G4LDB (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec). To visualize structural diversity, we applied TMAP to project the generated molecules in latent space and classify them based on proximity to training ligands\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e (Fig. S4.1). Nodes directly connected to known G4 ligands were defined as A-class, while those connected only to A-class nodes and not directly to known ligands were defined as B-class. \u003cem\u003eChemE-G4\u003c/em\u003e showed a near-even split between A- and B-class molecules (49.7% vs. 
50.3%), while \u003cem\u003eInstruct-G4\u003c/em\u003e favored exploration of new chemical space, with 60.6% of molecules disconnected from known G4 ligands (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003eNotably, the two models exhibit complementary generative behaviors shaped by their underlying model priors: \u003cem\u003eChemE-G4\u003c/em\u003e tends to perform scaffold refinement, while \u003cem\u003eInstruct-G4\u003c/em\u003e favors broader chemotype exploration. This dual capacity reflects SemantiChem’s flexibility in generating ligands with strong structural consistency to known G4 binders while maintaining scaffold-level diversity.\u003c/p\u003e \u003cp\u003eWe next evaluated whether structurally plausible molecules generated by our models also exhibit G4-relevant activities. To do so, we first established a G4 ligand classifier by training the reported GLAM\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e architecture using 3,569 G4 ligands as positives and 2,695 FDA-approved small-molecule drugs with no known G4-binding activity as negatives. Based on this classifier, we found that 91.9% of \u003cem\u003eInstruct-G4\u003c/em\u003e and 91.7% of \u003cem\u003eChemE-G4\u003c/em\u003e molecules were predicted as G4 ligands, whereas only 16.5% of random drug-like compounds extracted from PubChem were predicted to be positive (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ee and Supplementary Note 5.5).\u003c/p\u003e \u003cp\u003eDocking 2,600 \u003cem\u003eChemE-G4\u003c/em\u003e ligands to the human telomeric G4 (hTel, PDB 6CCW) and the KRAS promoter G4 (PDB 7X8M; two distinct binding sites) yielded valid poses for more than 1,440 molecules per site. 
Favorable binding (ΔG \u0026lt; 0 kcal/mol) was observed for 76.2% of ligands with hTel and for 99.5% and 96.0% with KRAS sites 1 and 2, respectively. Notably, 50.6% of hTel-bound ligands exceeded the affinity of the clinical-stage G4 ligand CX-5461\u003csup\u003e27\u003c/sup\u003e (-7.2 kcal/mol), while 43 and 184 ligands outperformed reference scores at KRAS sites 1 (-9.4 kcal/mol) and 2 (-9.5 kcal/mol) (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ef). These results demonstrate that molecules generated by \u003cem\u003eChemE-G4\u003c/em\u003e were highly enriched for binding across distinct G4 targets, despite the absence of explicit geometric constraints during generation.\u003c/p\u003e \u003cp\u003eAmong the thousands of generated molecules, we identified a small subset overlapping with commercially available compounds, which were prioritized for experimental validation. Two model-generated molecules were selected as representative candidates for structural and functional characterization (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea).\u003c/p\u003e \u003cp\u003eInstruct-734 (commercially known as Hoechst 34580) is a double-stranded DNA staining dye with no previous record of G4 binding activity. It features an extended bis-benzimidazole scaffold linked through a flexible phenylene bridge, forming a curved, partially conjugated system well suited for groove binding. Structurally, it maintains the bis-benzimidazole framework typical of Hoechst dyes while diverging from canonical G4-binding motifs (maximal Tanimoto = 0.755). ChemE-1876 (commercially known as A366) is a G9a/GLP methyltransferase inhibitor with a compact benzimidazole core substituted by an unusual fused cyclobutyl ring, giving rise to a non-planar, sterically constrained topology. 
This architecture differs markedly from the extended aromatic systems typical of G4 binders and shows limited similarity to any known G4 ligand (maximal Tanimoto = 0.412). The analyses indicate that the two model-generated molecules adopt distinct yet chemically reasonable architectures for G4 recognition, motivating experimental validation of their activities.\u003c/p\u003e \u003cp\u003eFRET melting experiments revealed that both Instruct-734 and ChemE-1876 stabilized the KRAS promoter G4, increasing the melting temperature (Tm) by 4.00°C and 4.25°C, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). In the DNA polymerase stop assay, both molecules inhibited primer extension on the KRAS oncogene promoter G4 template in a dose-dependent manner, consistent with G4 stabilization. Near-complete inhibition was achieved for both at 80 µM, comparable to berberine, a well-characterized natural G4 stabilizer with reported anticancer activity\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec). Having confirmed their binding to G4 sequences, we next evaluated cellular bioactivity using CCK-8 cytotoxicity assays. Instruct-734 showed broad cytotoxicity across different cell lines, whereas ChemE-1876 exhibited selective cytotoxicity toward A549 cells with mutation-induced KRAS overexpression (IC₅₀ = 1.65 ± 0.52 µM) compared with MCF-7 cells carrying wild-type KRAS (IC₅₀ \u0026gt; 30 µM) (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed). 
These results indicate that our model can generate structurally novel and functionally selective G4 ligands with therapeutic potential.\u003c/p\u003e \u003cp\u003eThe collective results are summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ee, which also includes positive and negative references tested under the same assays. These wet lab experiments demonstrate the ability of SemantiChem to generate chemically valid G4 ligands encompassing both canonical and previously unexplored binding architectures, supporting function-driven molecular design in non-pocket-like recognition regimes.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e(a) Chemical structures of four ligands selected for experimental validation. (b) FRET melting assay curves for the KRAS promoter G4 in the presence and absence of each ligand. (c) Polymerase stop assay showing dose-dependent inhibition of DNA synthesis on the KRAS G4 template, with a well-characterized natural G4 stabilizer berberine (BER) as a reference. (d) Cytotoxicity dose-response curves in A549 (lung carcinoma) and MCF-7 (breast cancer) cell lines. 
(e) Summary of experimental measurements.\u003c/p\u003e\n\u003ch3\u003eBenchmarking SemantiChem against Current Generative Models\u003c/h3\u003e\n\u003cp\u003eWe benchmarked SemantiChem against seven representative generative models by evaluating their performance on G4 ligand generation (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e), including three conventional SMILES-based generators adversarial autoencoder (AAE), variational autoencoder (VAE) and character-level recurrent neural network (CharRNN), one transformer-based generator SPMM\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e, one geometry-conditioned graph neural network Pocket2Mol\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e, and two LLM-based generators ChemGPT\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e and ether0\u003csup\u003e20\u003c/sup\u003e. For comparability, each model produced 2,600 molecules, which were evaluated using a unified analysis pipeline.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e(a) Summary of generation performance for CharRNN, SPMM, Pocket2Mol, ChemGPT, and \u003cem\u003eChemE-G4\u003c/em\u003e, including metrics for structural quality, scaffold-level statistics, semantic similarity, and predicted functional relevance based on docking and GLAM classification. Red values indicate lower metric values within each category. (b) TMAP projections showing the chemical-space distributions of molecules generated by the five models alongside ligands from G4LDB. (c) Representative ligands generated by each model. 
Red-highlighted labels indicate selected attributes displayed for each molecule.\u003c/p\u003e \u003cp\u003eAAE, VAE, and CharRNN were trained on the MOSES platform\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e with the same G4LDB-based training set. Both AAE and VAE failed to perform effective molecular generation, producing fewer than 2% chemically valid molecules (Table S6.1). Although CharRNN achieved higher validity (65.6%), most generated molecules reproduced existing G4 ligand scaffolds, resulting in substantially lower scaffold novelty compared with the SemantiChem model \u003cem\u003eChemE-G4\u003c/em\u003e (24.3% vs. 74.5%).\u003c/p\u003e \u003cp\u003eSPMM was applied in its seed-based mode using the canonical G4 ligand Pyridostatin (PDS)\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. Its outputs exhibited 100% scaffold novelty but a lower predicted positive rate than \u003cem\u003eChemE-G4\u003c/em\u003e (60.9% vs. 91.9%), indicating limited functional enrichment despite structural diversification.\u003c/p\u003e \u003cp\u003ePocket2Mol is a classical geometry-conditioned model originally developed for protein-ligand generation. We applied it to generate molecules using a geometrically well-characterized telomeric G4 DNA structure (PDB ID: 6CCW) without additional training. The predicted positive rate was only 4.3%, lower than the random baseline (16.5%). Attempts to adapt the model for G4 generation failed even after code-level modifications (see Supplementary Note 7.1), reflecting the limited transferability of residue-centric geometric encodings to nucleic acid targets. 
Similar input restrictions for nucleic acid systems were also observed in other advanced protein-centric models, such as Delete\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eMolecules were also generated using two LLM-based generators. ChemGPT, while primarily designed for molecular representation and translation tasks, is also capable of molecular generation. In our study, it accepted natural-language prompts for molecule generation, yielding a chemical validity of 62.9%. Although it exhibited high scaffold novelty (99.8%), the predicted positive rate for G4 ligands was 20.2%, only slightly above the random baseline (16.5%). We also briefly tested a general-purpose free-text molecular generator, ether0, which produced SMILES outputs but very few molecules that were chemically valid or biologically relevant (see Supplementary Note 7.2).\u003c/p\u003e \u003cp\u003eChemical-space analysis revealed systematic differences among models. The baseline generators SPMM, Pocket2Mol, and ChemGPT produced distributions that remained largely separated from the G4LDB manifold, whereas \u003cem\u003eChemE-G4\u003c/em\u003e generated molecules interspersed across its fine-grained branches (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eb). At the property level, descriptor distributions also varied across models (Tables S6.2 and S6.3). \u003cem\u003eChemE-G4\u003c/em\u003e produced molecules with physicochemical profiles closely aligned to G4LDB, whereas protein-centric approaches such as Pocket2Mol exhibited the largest divergence among all baselines. 
Representative molecules from each model are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ec, illustrating characteristic trade-offs among structural novelty, semantic alignment, and predicted functional relevance across different generative paradigms.\u003c/p\u003e\n\u003ch3\u003eGeneralizability of SemantiChem to rRNA and Mpro\u003c/h3\u003e\n\u003cp\u003eHaving established the relative advantages of SemantiChem in G4 ligand generation, we next assessed how its function-driven design strategy performs across biomolecular targets with distinct recognition regimes. We applied the same training framework to representative RNA and protein targets, including \u003cem\u003eE. coli\u003c/em\u003e ribosomal RNA (rRNA) and the SARS-CoV-2 main protease (Mpro). Using the same model architecture, we initialized from the \u003cem\u003eChemE-SELFIES\u003c/em\u003e checkpoint and fine-tuned two new variants, \u003cem\u003eChemE-rRNA\u003c/em\u003e and \u003cem\u003eChemE-Mpro\u003c/em\u003e, using task-specific prompt-ligand pairs. For \u003cem\u003eChemE-rRNA\u003c/em\u003e, we curated 5,501 active compounds from PubChem (BioAssay AID 720706) targeting rRNA. For \u003cem\u003eChemE-Mpro\u003c/em\u003e, we used 3,642 known small-molecule inhibitors of Mpro retrieved from ChEMBL.\u003c/p\u003e \u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ea, SemantiChem generalizes effectively to rRNA. The \u003cem\u003eChemE-rRNA\u003c/em\u003e model maintained high validity (99.9%) and uniqueness (94.6%), while achieving even higher novelty (91.8%) than in the original G4 task. A total of 2,096 unique Bemis-Murcko scaffolds were generated, with a scaffold novelty of 76.5%. Over 94.0% of generated molecules achieved docking scores better than the reference ligand chloramphenicol\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. 
Collectively, these results indicate that SemantiChem retains strong performance for RNA targets governed by non-pocket-like recognition.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e(a) Summary of performance on three targets, including G-quadruplex structures (G4), ribosomal RNA (rRNA), and the SARS-CoV-2 main protease (Mpro), represented by fine-tuned instances of the SemantiChem framework (\u003cem\u003eChemE-G4\u003c/em\u003e, \u003cem\u003eChemE-rRNA\u003c/em\u003e and \u003cem\u003eChemE-Mpro\u003c/em\u003e). Metrics include structure quality, scaffold novelty, semantic alignment, and docking-based functional relevance. Red values indicate comparatively lower numerical values observed for the \u003cem\u003eChemE-Mpro\u003c/em\u003e model. (b) Representative ligands generated by the three target-specific models, shown together with their predicted binding poses.\u003c/p\u003e \u003cp\u003eIn contrast, reduced performance was observed when applying SemantiChem to the pocket-dominated Mpro system. For the Mpro task, although validity remained high (99.6%), uniqueness dropped to 68.4%. Only 612 unique scaffolds were generated, corresponding to a scaffold novelty of 43.8%. While all generated molecules exhibited favorable docking energies (standard free energy ≤ 0 kcal/mol), only 27.0% outperformed the reference inhibitor nirmatrelvir\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. 
Structural inspection revealed reduced diversity and weaker alignment with canonical Mpro-associated features.\u003c/p\u003e \u003cp\u003eThese trends were also reflected in physicochemical properties: \u003cem\u003eChemE-rRNA\u003c/em\u003e outputs closely tracked their training distribution across key descriptors, whereas \u003cem\u003eChemE-Mpro\u003c/em\u003e deviated significantly, particularly in descriptors of hydrophobicity (\u003cem\u003eLogP\u003c/em\u003e) and molecular polarity (\u003cem\u003eTPSA\u003c/em\u003e) (Table S8.1). Notably, the representative molecules selected for G4, rRNA, and Mpro (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eb) all possess novel Murcko scaffolds relative to their training sets, confirming that SemantiChem does not rely on scaffold memorization but exhibits regime-dependent generative behavior.\u003c/p\u003e "},{"header":"Discussion and Conclusion","content":"\u003cp\u003eIn this study, we developed SemantiChem, an instruction-tuned generative framework for function-driven molecular design that translates high-level functional intent into chemically meaningful small molecules from natural language descriptions. Biomolecular targets governed by non-pocket-like recognition, such as many nucleic acid systems, present diffuse, flexible, and electrostatically complex interaction surfaces, for which geometry-dependent design paradigms are often ineffective. By operating directly on function-level design intent rather than predefined geometric constraints, this work demonstrates a practical strategy for addressing such recognition regimes that remain challenging for conventional structure-centric molecular design.\u003c/p\u003e\u003cp\u003eWe demonstrated this capability using two representative systems, G4 DNA and rRNA, which pose distinct biochemical recognition challenges yet share non-pocket-like interaction characteristics. 
For both targets, the generated molecules were chemically valid and structurally diverse, and experimental testing confirmed G4 stabilization and polymerase inhibition for selected candidates. In particular, ChemE-1876 showed selective activity toward the KRAS promoter G4 and cytotoxicity against KRAS-overexpressing A549 cells, illustrating that function-driven, language-guided generation can yield structurally novel and biologically specific ligands. Together, these results indicate that the proposed design strategy can generalize across related non-pocket-like recognition regimes, with natural language serving as an interpretable interface for specifying functional objectives.\u003c/p\u003e\u003cp\u003eAt the same time, the results also delineate limitations of the approach. Available ligand datasets for non-pocket-like targets, especially nucleic acids, remain limited in size and structural diversity, which may constrain the scope of learnable design patterns. Moreover, while many generated molecules were chemically valid and experimentally active, challenges related to synthetic accessibility and downstream experimental feasibility persist. The reduced performance observed for pocket-dominated protein targets further highlights that the effectiveness of function-driven generation may be regime-dependent rather than universal. Future extensions incorporating richer biochemical constraints, selectivity cues, or feasibility-aware filtering may help expand the practical applicability of this framework.\u003c/p\u003e\u003cp\u003eOverall, this work establishes a function-level molecular design strategy in which natural language serves as a flexible interface for translating human design intent into chemical space. 
By explicitly aligning generative modeling with functional objectives rather than geometric assumptions, SemantiChem offers a complementary path for molecular discovery in biomolecular systems that lie beyond the effective reach of conventional structure-centric approaches.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\n \u003ch2\u003eModel selection and initialization\u003c/h2\u003e\n \u003cp\u003eTwo large language models (LLMs), Meta-LLaMA-3-8B-Instruct and LLaMA-3.1-ChemEinstein, were selected to investigate their capabilities in ligand generation. Both models were obtained from Hugging Face (see Supplementary Note 2.1 for model links) and initialized from their publicly available checkpoints. All stages of model development, including pretraining and instruction tuning, were carried out using the LLaMA Factory framework version 0.9.0.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003eSELFIES-based domain pretraining\u003c/h2\u003e\n \u003cp\u003eBoth models were subjected to lightweight domain-specific pretraining using approximately 10,000 molecular structures encoded in the SELFIES format. The data were obtained from the PubChem10M_SELFIES dataset (see Supplementary Note 2.1 for repository link). The first 10,000 entries were selected without additional filtering or augmentation. Pretraining was conducted for 2 epochs using an autoregressive language modeling objective under the LoRA framework, with LoRA applied to all transformer layers. The training used a learning rate of 1e-4, a per-device batch size of 8, gradient accumulation over 8 steps, and cosine learning rate scheduling with a warmup ratio of 0.1. Input sequences were truncated at 256 tokens, and training was performed with fp16 precision. 
No task-specific prompts or semantic supervision were applied during this stage.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n \u003ch2\u003ePrompt-driven multi-task fine-tuning\u003c/h2\u003e\n \u003cp\u003eAll models were fine-tuned on nucleic acid- and protein-targeted ligand design tasks using a unified prompt-based multi-task instruction tuning framework. Target-specific datasets were prepared for three molecular targets: G-quadruplexes (G4), ribosomal RNA (rRNA), and the SARS-CoV-2 main protease (Mpro). For G4, 4,442 ligand structures and corresponding functional prompts were derived from the G4LDB database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.G4LDB.com\u003c/span\u003e\u003c/span\u003e; accessed in 2023). For rRNA, 5,501 compounds labeled as active were extracted from PubChem assay AID_720706. For Mpro, 4,025 small-molecule ligands were retrieved from the ChEMBL database by querying \u0026ldquo;SARS-CoV-2 Mpro\u0026rdquo; and applying a compound type filter restricted to small molecules. No additional filtering or activity thresholding was applied. All datasets used for fine-tuning alongside the training scripts are accessible during peer review (see Data and Code Availability).\u003c/p\u003e\n \u003cp\u003eThe instruction tuning framework comprised four complementary task types. Type 1 involved the generation of known ligands from functional prompts. To enhance prompt diversity and improve model generalization, three distinct prompt templates were created and randomly assigned to ligand entries during preprocessing. Type 2 introduced scaffold-level constraints through fragment completion. Type 3 focused on atom-level reconstruction of masked regions to strengthen localized chemical reasoning. Type 4 aimed to reinforce expert-level identity grounding by prompting the model to assume the persona of a domain specialist (e.g., a G4 ligand discovery researcher). 
This was implemented by adapting identity conditioning examples from the LLaMA Factory corpus, with task-specific substitutions. Prompt templates and representative examples are listed in Table S2.3 and are accessible during peer review (see Data and Code Availability).\u003c/p\u003e\n \u003cp\u003eAll tasks were trained using positive-only supervision, without contrastive objectives or predefined negative samples. Fine-tuning was conducted jointly across all tasks using a mixed sampling strategy, in which all instruction-response pairs were pooled and drawn uniformly during training. Models were trained for 10 epochs using the LoRA method applied to all transformer layers, with a learning rate of 1e-4, a per-device batch size of 2, and gradient accumulation over 8 steps. A cosine learning rate scheduler with a warmup ratio of 0.1 was used. Training was performed with bfloat16 precision, and input sequences were truncated at 1,024 tokens.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003eMolecule generation and evaluation in the SemantiChem framework\u003c/h2\u003e\n \u003cp\u003eMolecular structures were generated within the SemantiChem framework, which integrates prompt-based inference and evaluation. For this study, model training and inference were implemented via the LLaMA Factory API. Prompts were written in free-text format to elicit novel, chemically plausible ligands. Molecules were sampled with nucleus sampling (top-p\u0026thinsp;=\u0026thinsp;0.9, temperature\u0026thinsp;=\u0026thinsp;1.0), with parameters chosen based on prior optimization (see Table S2.4).\u003c/p\u003e\n \u003cp\u003eFor quality assessment, 600 molecules were generated per model at each developmental stage (Base, SELFIES-pretrained, and Q\u0026amp;A fine-tuned). For downstream structural and functional analyses, 2,600 molecules were generated from each fully fine-tuned model. 
Outputs were collected in SELFIES format and converted to SMILES with RDKit\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e. Generation quality was evaluated using three metrics:\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eValidity\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eproportion of molecules that can be successfully parsed by RDKit.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eUniqueness\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eproportion of non-duplicate valid molecules.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eNovelty\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eproportion of valid molecules absent from the fine-tuning set.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n \u003ch2\u003ePrompt variation experiments\u003c/h2\u003e\n \u003cp\u003eFor each prompt variant, 500 molecules were generated using the ChemE-G4 model. Molecular structures were standardized and converted to canonical SMILES. Standard physicochemical descriptors were computed using RDKit, with additional custom descriptors (e.g., Max_fused, LCCS) implemented in Python (see Data and Code Availability).\u003c/p\u003e\n \u003cp\u003eTo evaluate whether different prompts produced distinguishable molecular distributions, binary classifiers were trained on descriptor sets using logistic regression with five-fold cross-validation. Classifier performance was reported as mean AUC\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation across folds. Descriptor importance was quantified from model coefficients and interpreted alongside kernel density estimates (KDE) of the top-weighted features. Representative molecules were selected to illustrate alignment with prompt semantics.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\n \u003ch2\u003eStructure-Based Analysis Tools and Metrics\u003c/h2\u003e\n \u003cp\u003eStructure-based analyses were performed using RDKit. 
Tanimoto similarity was computed on MACCS keys, and Bemis-Murcko scaffolds were extracted for all compounds. To visualize the structural distribution of generated and reference ligands, two-dimensional tree-based layouts were constructed using the TMAP algorithm\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e, based on approximate nearest-neighbor relationships among molecular fingerprints. Nodes were color-coded by compound class and analyzed to compare scaffold reuse and structural exploration. These analytical procedures were applied to structural outputs from all models and target systems.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\n \u003ch2\u003eG4 Binding Prediction with Graph-Level Attention Model\u003c/h2\u003e\n \u003cp\u003eA graph-level attention model (GLAM) was used to predict the G4-binding potential of generated molecules. The model architecture followed the GLAM framework as previously reported\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e and was retrained on curated ligands from the G4LDB. Molecular graphs were constructed from standardized SMILES using RDKit. Full training procedures, model parameters, and performance metrics are provided in Supplementary Note 5. As a comparative baseline, a background set of one million drug-like small molecules was randomly sampled from PubChem using its API, applying Lipinski\u0026rsquo;s rule-of-five filters. 
The trained model was then applied to both generated and background compounds to estimate their probability of G4 activity.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\n \u003ch2\u003eMolecular Docking against Human Telomeric G4 DNA\u003c/h2\u003e\n \u003cp\u003eMolecular docking simulations were performed using AutoDock Vina\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e to evaluate the binding potential of generated ligands against three distinct biological targets.\u003c/p\u003e\n \u003cp\u003eLigand preparation involved conversion from SMILES to 3D structures using RDKit, with hydrogen addition and coordinate embedding via distance geometry. Ligands were energy-minimized using the MMFF94 or UFF force fields, depending on availability, and saved as SDF files. These were subsequently converted to PDBQT format using Open Babel\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e. Ligands failing any step of conversion or docking due to embedding errors, unsupported atoms, or file format issues were excluded.\u003c/p\u003e\n \u003cp\u003eFor receptor preparation, the human telomeric G4 (PDB ID: 6CCW), KRAS G4 (PDB ID: 7X8M), and the SARS-CoV-2 main protease (Mpro, PDB ID: 7BQY) structures were processed by removing bound ligands, solvent molecules, and metal ions. Hydrogen atoms and Gasteiger charges were added using AutoDockTools\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e, and the receptors were saved in PDBQT format.\u003c/p\u003e\n \u003cp\u003eFor the bacterial ribosomal RNA target, a single RNA chain (chain V [auth BA], extracted from PDB ID: 4V7T) proximal to the chloramphenicol binding site was selected as the receptor. 
Hydrogen atoms were added using MolProbity\u0026rsquo;s reduce tool (version 4.5.2) to optimize protonation states and hydrogen-bonding geometry\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e. The resulting structure was then further processed in AutoDockTools to assign Gasteiger charges and exported in PDBQT format for docking.\u003c/p\u003e\n \u003cp\u003eDocking grid boxes were defined as follows: for the G4 DNA (6CCW and 7X8M), a grid box radius of 15 \u0026Aring; was centered on the known ligand-binding site encompassing the G-tetrad core; for Mpro (7BQY), a radius of 30 \u0026Aring; was centered on the active site; and for the rRNA target (4V7T), a radius of 24 \u0026Aring; was used around the chloramphenicol binding site.\u003c/p\u003e\n \u003cp\u003eFor each successfully docked ligand, the lowest predicted binding affinity (kcal/mol) was recorded. Reference compounds, CX-5461 for G4 DNA, Nirmatrelvir for Mpro, and Chloramphenicol for rRNA, were docked under identical conditions to provide benchmarks for comparative evaluation.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\n \u003ch2\u003eReagents and Materials\u003c/h2\u003e\n \u003cp\u003ePyridostatin (PDS, Cat. No. S7444), ChemE-1876 (Cat. No. S7572), Instruct-734 (Cat. No. S0486), ChemE-1732 (Cat. No. S9032), and Instruct-2189 (Cat. No. S6830) were purchased from Selleck Chemicals (Shanghai, China), and berberine (BER, Cat. No. SB8130) was purchased from Solarbio (Beijing, China). All compounds were prepared as DMSO stock solutions. DNA oligonucleotides were synthesized and HPLC-purified by Sangon Biotech (Shanghai, China). Lyophilized DNA was reconstituted in ultrapure water to prepare 100 \u0026micro;M stock solutions. Concentrations were determined by UV absorbance using a BioTek microplate reader with a Take3 module, and adjusted accordingly. 
All stocks were stored at -20\u0026deg;C until use.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\n \u003ch2\u003eFluorescence-Based Thermal Melting Assay\u003c/h2\u003e\n \u003cp\u003eG4-binding activity was assessed using a fluorescence-based thermal melting assay. Two G4-forming DNA sequences were tested: a 22-mer human telomeric repeat (hTel) and a 22-mer KRAS promoter sequence. Both probes were labeled with a 5\u0026apos;-FAM fluorophore and a 3\u0026apos;-BHQ1 quencher. Probe sequences were:\u003c/p\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003ehTel: 5\u0026apos;-FAM-AGGGTTAGGGTTAGGGTTAGGG-BHQ1-3\u0026apos;\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eKRAS: 5\u0026apos;-FAM-AGGGCGGTGTGGGAAGAGGGAA-BHQ1-3\u0026apos;\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n \u003cp\u003eA duplex DNA (dsDNA) control was used to evaluate binding selectivity. It consisted of two complementary strands:\u003c/p\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eFAM strand: 5\u0026apos;-FAM-AGGTTGGTGAGTGATTGGAGGTT-3\u0026apos;\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eBHQ1 strand: 3\u0026apos;-TCCAACCACTCACTAA-C5-BHQ1\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n \u003cp\u003eEquimolar duplex strands were annealed by heating to 95\u0026deg;C for 5 min followed by slow cooling to room temperature.\u003c/p\u003e\n \u003cp\u003eThermal melting experiments were conducted in 10 mM PBK buffer (pH 8.5) containing 80 mM K⁺ and 0.05% Tween-20. Final concentrations of DNA and compound were fixed at 100 nM and 1 \u0026micro;M, respectively. 
Fluorescence signals were recorded over a temperature gradient, and ligand-induced changes in melting temperature (\u0026Delta;Tₘ) were determined by comparing melting curves in the presence and absence of test compounds.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec21\" class=\"Section2\"\u003e\n \u003ch2\u003eDNA Polymerase Stop Assay\u003c/h2\u003e\n \u003cp\u003eThe DNA polymerase stop assay was performed as previously described\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, with minor modifications. A 5\u0026prime;-FAM-labeled primer was mixed with a KRAS G4-containing DNA template at a 1.2:1 molar ratio.\u003c/p\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003ePrimer: 5\u0026prime;-FAM-TAATACGACTCACTATAGCAATTGC\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eTemplate: 5\u0026prime;-TGAATCCTGAGGGCGGTGTGGGAAGAGGGAAGATAGCTGCACAATTGCTATAGTGAGTCGTATTA-3\u0026prime;\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\n \u003cp\u003eThe mixtures were annealed by heating to 95\u0026deg;C for 5 min, followed by slow cooling to room temperature. Compounds were added at the indicated concentrations and incubated with the annealed DNA in the presence of 100 mM K\u003csup\u003e+\u003c/sup\u003e at room temperature for 3 hours. Primer extension was carried out in a 50 \u0026micro;L reaction containing 0.2 \u0026micro;M DNA complex and 1X Taq PCR Master Mix (Sangon, China) for 30 min at 37\u0026deg;C. Reaction products were resolved by electrophoresis on a 12% denaturing polyacrylamide gel. 
FAM-labeled DNA fragments were visualized using a GenoSens 2200 imaging system (Clinx, China).\u003c/p\u003e\n \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e\n \u003ch2\u003eCCK-8 Cytotoxicity Assay\u003c/h2\u003e\n \u003cp\u003eCytotoxicity was assessed in A549 (lung adenocarcinoma) and MCF-7 (breast adenocarcinoma) cell lines using a standard CCK-8 colorimetric assay (Beyotime, China). Cells were seeded at a density of 1\u0026times;10\u003csup\u003e4\u003c/sup\u003e cells/well in 96-well plates and incubated overnight before treatment. Test compounds were added at a range of concentrations and incubated for 24 hours. Cell viability was quantified by measuring absorbance at 450 nm following incubation with CCK-8 reagent, according to the manufacturer\u0026rsquo;s instructions. Viability values were normalized to vehicle-treated controls, and IC₅₀ values were calculated using nonlinear regression (four-parameter logistic model) in GraphPad Prism 9. Each experiment was performed in independent triplicate.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\n \u003ch2\u003eBaseline models for G4 ligand generation\u003c/h2\u003e\n \u003cp\u003eBaseline models were applied to the G4 ligand generation task, each producing 2,600 molecules. 
Outputs were evaluated using the same pipeline as for SemantiChem.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eMOSES baselines\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eAdversarial autoencoder (AAE), variational autoencoder (VAE), and character-level recurrent neural network (CharRNN) were implemented with the official MOSES benchmarking platform\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e and generated SMILES using default configurations.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eSPMM\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eSPMM was applied as a seed-based generator without task-specific training, initialized with the reference ligand pyridostatin (PDS).\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003ePocket2Mol\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003ePocket2Mol was applied in a zero-shot setting to the human telomeric G4 DNA structure (PDB ID: 6CCW) after ligand removal. The default \u0026ldquo;Sampling for PDB pockets\u0026rdquo; mode was used, with minor adjustments for nucleic acid compatibility. From 5,000 generated candidates, the first 2,600 valid molecules were retained. We also explored whether Pocket2Mol could be fine-tuned on nucleic acid-specific data, but training failed due to structural incompatibilities; details are provided in Supplementary Note 7.1.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eChemGPT\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eChemGPT was obtained in the safetensors format and used for SELFIES-based molecular generation. For prompt-based sampling, input strings were tokenized and, when possible, converted into SELFIES prefixes. 
Molecules were generated with the same prompt settings and sampling parameters (temperature\u0026thinsp;=\u0026thinsp;1.0, top-p\u0026thinsp;=\u0026thinsp;0.9) as in SemantiChem.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData and Code Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code used in this study is available at: https://github.com/ADNLab-SCU/SemantiChem. This submission version contains only utility scripts used in the computational experiments. A fully documented pipeline and complete resources will be released upon publication.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSupplementary datasets generated in this study have been deposited on Figshare (https://figshare.com/s/0fadc7628f71cb4600a3). Additional data supporting the findings of this work are available from the corresponding author upon reasonable request and will be made publicly accessible upon publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by National Natural Science Foundation of China [22077087, 22474082], Sichuan Science and Technology Program [2025NSFJQ0019] and Fundamental Research Funds for the Central Universities. The authors would like to thank the Analytical \u0026amp; Testing Center of Sichuan University.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eJayatunga, M. K. P., Xie, W., Ruder, L., Schulze, U. \u0026amp; Meier, C. AI in small-molecule drug discovery: a coming wave? \u003cem\u003eNat. Rev. Drug Discov.\u003c/em\u003e \u003cstrong\u003e21\u003c/strong\u003e, 175\u0026ndash;176 (2022).\u003c/li\u003e\n\u003cli\u003eKrishnan, A. et al. A generative deep learning approach to de novo antibiotic design. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e188\u003c/strong\u003e, 5962\u0026ndash;5979 e5922 (2025).\u003c/li\u003e\n\u003cli\u003ePeng, X. et al. 
Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets. In Proc. 39th International Conference on Machine Learning 17644\u0026ndash;17655 (PMLR, 2022).\u003c/li\u003e\n\u003cli\u003eMunson, B. P. et al. De novo generation of multi-target compounds using deep generative chemistry. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 3636 (2024).\u003c/li\u003e\n\u003cli\u003eOzawa, M., Nakamura, S., Yasuo, N. \u0026amp; Sekijima, M. IEV2Mol: Molecular Generative Model Considering Protein-Ligand Interaction Energy Vectors. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e64\u003c/strong\u003e, 6969\u0026ndash;6978 (2024).\u003c/li\u003e\n\u003cli\u003eWu, K. et al. TamGen: drug design with target-aware molecule generation through a chemical language model. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 9360 (2024).\u003c/li\u003e\n\u003cli\u003eFeng, W. et al. Generation of 3D molecules in pockets via a language model. \u003cem\u003eNat. Mach. Intell.\u003c/em\u003e \u003cstrong\u003e6\u003c/strong\u003e, 62\u0026ndash;73 (2024).\u003c/li\u003e\n\u003cli\u003eChen, S. et al. Deep lead optimization enveloped in protein pocket and its application in designing potent and selective ligands targeting LTK protein. \u003cem\u003eNat. Mach. Intell.\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 448\u0026ndash;458 (2025).\u003c/li\u003e\n\u003cli\u003eVarshney, D., Spiegel, J., Zyner, K., Tannahill, D. \u0026amp; Balasubramanian, S. The regulation and functions of DNA and RNA G-quadruplexes. \u003cem\u003eNat. Rev. Mol. Cell Biol.\u003c/em\u003e \u003cstrong\u003e21\u003c/strong\u003e, 459\u0026ndash;474 (2020).\u003c/li\u003e\n\u003cli\u003eSato, K. et al. RNA transcripts regulate G-quadruplex landscapes through G-loop formation. 
\u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e388\u003c/strong\u003e, 1225\u0026ndash;1231 (2025).\u003c/li\u003e\n\u003cli\u003eKovachka, S. et al. Small molecule approaches to targeting RNA. \u003cem\u003eNat. Rev. Chem.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 120\u0026ndash;135 (2024).\u003c/li\u003e\n\u003cli\u003eNeidle, S. Quadruplex nucleic acids as targets for anticancer therapeutics. \u003cem\u003eNat. Rev. Chem.\u003c/em\u003e \u003cstrong\u003e1\u003c/strong\u003e, 0041 (2017).\u003c/li\u003e\n\u003cli\u003eH\u0026auml;nsel-Hertsch, R. et al. Landscape of G-quadruplex DNA structural regions in breast cancer. \u003cem\u003eNat. Genet.\u003c/em\u003e \u003cstrong\u003e52\u003c/strong\u003e, 878\u0026ndash;883 (2020).\u003c/li\u003e\n\u003cli\u003eH\u0026auml;nsel-Hertsch, R., Di Antonio, M. \u0026amp; Balasubramanian, S. DNA G-quadruplexes in the human genome: detection, functions and therapeutic potential. \u003cem\u003eNat. Rev. Mol. Cell Biol.\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, 279\u0026ndash;284 (2017).\u003c/li\u003e\n\u003cli\u003eSultan, M. et al. Targeting the G-quadruplex as a novel strategy for developing antibiotics against hypervirulent drug-resistant Staphylococcus aureus. \u003cem\u003eJ. Biomed. Sci.\u003c/em\u003e \u003cstrong\u003e32\u003c/strong\u003e, 15 (2025).\u003c/li\u003e\n\u003cli\u003eOuyang, L. et al. Training language models to follow instructions with human feedback. Preprint at https://arxiv.org/abs/2203.02155 (2022).\u003c/li\u003e\n\u003cli\u003eChung, H. W. et al. Scaling Instruction-Finetuned Language Models. Preprint at https://arxiv.org/abs/2210.11416 (2022).\u003c/li\u003e\n\u003cli\u003eZheng, Y. et al. Large language models for scientific discovery in molecular property prediction. \u003cem\u003eNat. Mach. Intell.\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 438\u0026ndash;447 (2025).\u003c/li\u003e\n\u003cli\u003eZhang, D. et al. 
ChemLLM: A Chemical Large Language Model. Preprint at https://arxiv.org/abs/2402.06852 (2024).\u003c/li\u003e\n\u003cli\u003eNarayanan, S. M. et al. Training a Scientific Reasoning Model for Chemistry. Preprint at https://arxiv.org/abs/2506.17238 (2025).\u003c/li\u003e\n\u003cli\u003eEdwards, C. et al. Translation between Molecules and Natural Language. Preprint at https://arxiv.org/abs/2204.11817 (2022).\u003c/li\u003e\n\u003cli\u003eLi, Q. et al. G4LDB: a database for discovering and studying G-quadruplex ligands. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e41\u003c/strong\u003e, D1115\u0026ndash;1123 (2013).\u003c/li\u003e\n\u003cli\u003eWang, Y. H. et al. G4LDB 2.2: a database for discovering and studying G-quadruplex and i-Motif ligands. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e50\u003c/strong\u003e, D150\u0026ndash;D160 (2022).\u003c/li\u003e\n\u003cli\u003eYang, Q. F. et al. G4LDB 3.0: a database for discovering and studying G-quadruplex and i-motif ligands. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e53\u003c/strong\u003e, D91\u0026ndash;D98 (2025).\u003c/li\u003e\n\u003cli\u003eProbst, D. \u0026amp; Reymond, J. L. Visualization of very large high-dimensional data sets as minimum spanning trees. \u003cem\u003eJ. Cheminform.\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 12 (2020).\u003c/li\u003e\n\u003cli\u003eLi, Y. et al. An adaptive graph learning method for automated molecular interactions and properties predictions. \u003cem\u003eNat. Mach. Intell.\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, 645\u0026ndash;651 (2022).\u003c/li\u003e\n\u003cli\u003eXu, H. et al. CX-5461 is a DNA G-quadruplex stabilizer with selective lethality in BRCA1/2 deficient tumours. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 14432 (2017).\u003c/li\u003e\n\u003cli\u003eWang, K. B. et al. 
Structural insight into the bulge-containing KRAS oncogene promoter G-quadruplex bound to berberine and coptisine. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 6016 (2022).\u003c/li\u003e\n\u003cli\u003eChang, J. \u0026amp; Ye, J. C. Bidirectional generation of structure and properties through a single molecular foundation model. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 2323 (2024).\u003c/li\u003e\n\u003cli\u003eFrey, N. C. et al. Neural scaling of deep chemical models. \u003cem\u003eNat. Mach. Intell.\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 1297\u0026ndash;1305 (2023).\u003c/li\u003e\n\u003cli\u003ePolykovskiy, D. et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. \u003cem\u003eFront. Pharmacol.\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 565644 (2020).\u003c/li\u003e\n\u003cli\u003eRodriguez, R. et al. A novel small molecule that alters shelterin integrity and triggers a DNA-damage response at telomeres. \u003cem\u003eJ. Am. Chem. Soc.\u003c/em\u003e \u003cstrong\u003e130\u003c/strong\u003e, 15758\u0026ndash;15759 (2008).\u003c/li\u003e\n\u003cli\u003eXue, L., Spahn, C. M. T., Schacherl, M. \u0026amp; Mahamid, J. Structural insights into context-dependent inhibitory mechanisms of chloramphenicol in cells. \u003cem\u003eNat. Struct. Mol. Biol.\u003c/em\u003e \u003cstrong\u003e32\u003c/strong\u003e, 257\u0026ndash;267 (2025).\u003c/li\u003e\n\u003cli\u003eOwen, D. R. et al. An oral SARS-CoV-2 M(pro) inhibitor clinical candidate for the treatment of COVID-19. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e374\u003c/strong\u003e, 1586\u0026ndash;1593 (2021).\u003c/li\u003e\n\u003cli\u003eLandrum, G. et al. RDKit: open-source cheminformatics software. GitHub repository, https://github.com/rdkit/rdkit (2016).\u003c/li\u003e\n\u003cli\u003eEberhardt, J., Santos-Martins, D., Tillack, A. F. \u0026amp; Forli, S. 
AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e61\u003c/strong\u003e, 3891\u0026ndash;3898 (2021).\u003c/li\u003e\n\u003cli\u003eO\u0026apos;Boyle, N. M. et al. Open Babel: An open chemical toolbox. \u003cem\u003eJ. Cheminform.\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, 33 (2011).\u003c/li\u003e\n\u003cli\u003eForli, S. et al. Computational protein-ligand docking and virtual drug screening with the AutoDock suite. \u003cem\u003eNat. Protoc.\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 905\u0026ndash;919 (2016).\u003c/li\u003e\n\u003cli\u003eWilliams, C. J. et al. MolProbity: More and better reference data for improved all-atom structure validation. \u003cem\u003eProtein Sci.\u003c/em\u003e \u003cstrong\u003e27\u003c/strong\u003e, 293\u0026ndash;315 (2018).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8819034/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8819034/v1","license":{"name":"CC BY 
4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eTranslating high-level functional design intent into concrete molecular structures remains a fundamental challenge in generative molecular discovery, particularly for biomolecular targets governed by non-pocket-like recognition. Here, we introduce SemantiChem, an instruction-tuned generative framework for function-driven molecular design that maps functional objectives expressed in natural language directly to chemically meaningful molecular structures, without relying on predefined geometric constraints, molecular scaffolds, or pocket-centric assumptions. We apply this framework to G-quadruplexes (G4), a representative system characterized by diffuse and topology-driven molecular recognition, and experimentally validate model-generated candidates through assays of G4 stabilization, polymerase stalling, and cellular response. The same design pipeline is further evaluated on a structurally distinct RNA target and, for contrast, on a pocket-dominated protease target. 
Together, these results establish a function-level molecular design strategy with regime-dependent applicability, highlighting a complementary path for molecular discovery in biomolecular systems where conventional structure-centric paradigms are insufficient.\u003c/p\u003e","manuscriptTitle":"Function-Driven Molecular Design Enabled by Instruction-Tuned Large Language Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-12 04:52:55","doi":"10.21203/rs.3.rs-8819034/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"9210b591-6f1e-4ca4-af21-94328a7ba0e4","owner":[],"postedDate":"February 12th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":62744343,"name":"Physical sciences/Chemistry/Chemical biology/Small molecules"},{"id":62744344,"name":"Biological sciences/Computational biology and bioinformatics/Machine learning"},{"id":62744345,"name":"Biological sciences/Drug discovery/Drug screening/Virtual screening"},{"id":62744346,"name":"Physical sciences/Chemistry/Chemical biology/Cheminformatics"}],"tags":[],"updatedAt":"2026-03-17T15:14:16+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-12 
04:52:55","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8819034","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8819034","identity":"rs-8819034","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}



License: CC-BY-4.0