Variantscape: Using Large Language Models to Build a Comprehensive Landscape of Cancer Variants for Precision Oncology

Marie Wosny, Maximilian Boesch, Tobias Peres, Thibault Niederhauser, and 3 more

Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6614711/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Precision oncology depends on accurate interpretation of molecular variants, yet novel insights are often buried in unstructured literature, described using heterogeneous nomenclature. To address this, we developed “Variantscape,” an automated, large-scale pipeline and open-access web tool that integrates natural language processing and large language models to explore variant-cancer-treatment co-associations.
Of over 2.7 million titles and abstracts processed, 7,524 mention all three entity types (variants, cancers, and treatments), spanning 4,029 unique variants, 98 cancer types, and 377 treatments. Co-occurrence and network analyses revealed 15,577 significant co-associations within a graph comprising 4,504 nodes and 48,470 edges. Canonical variants in common cancers, such as BRAF V600E, had high-confidence treatment associations, while some rare variants showed strong literature-derived signals. By automating discovery and co-association detection, “Variantscape” offers a systematic overview of the variant landscape in the literature, enabling scalable insight generation to support hypothesis generation, uncover underrecognized connections, reveal novel applications of existing therapies, and advance precision oncology.

Subjects: Biological sciences/Cancer/Cancer genetics; Biological sciences/Cancer/Cancer genomics; Biological sciences/Cancer/Cancer therapy; Biological sciences/Cancer/Tumour biomarkers; Biological sciences/Computational biology and bioinformatics/Computational models; Biological sciences/Computational biology and bioinformatics/Data processing; Biological sciences/Computational biology and bioinformatics/Databases; Biological sciences/Computational biology and bioinformatics/Gene ontology; Biological sciences/Computational biology and bioinformatics/Genome informatics; Biological sciences/Computational biology and bioinformatics/Literature mining; Biological sciences/Computational biology and bioinformatics/Machine learning; Biological sciences/Computational biology and bioinformatics/Predictive medicine; Biological sciences/Computational biology and bioinformatics/Programming language and code

Keywords: Precision oncology; large language models; molecular variants; next-generation sequencing; cancer variants; gene alterations; natural language processing; network-based graph analysis

Introduction

Precision oncology aims to tailor cancer
treatments to a tumor’s unique genetic, molecular, or cellular characteristics, thus enabling more effective and targeted therapeutic interventions than conventional one-size-fits-all approaches 1 . Advances in next-generation sequencing (NGS) have facilitated the detection of genetic alterations, providing robust platforms for identifying actionable variants that guide diagnostic classification, prognosis, and personalized treatment decisions 2 . Large-scale studies integrating NGS with clinical data have demonstrated the clinical relevance of various genetic alterations, including missense mutations, structural variants, and functional signatures such as microsatellite instability (MSI) and homologous recombination deficiency (HRD) 3 , 4 . Numerous clinically relevant alterations have been successfully targeted with precision therapies, such as BRAF V600E, an oncogenic variant occurring in approximately 6% of all cancers and 40–50% of melanomas, where it activates the MAPK/ERK pathway and responds to combined BRAF and MEK inhibition 5 . Similarly, common EGFR alterations in non-small cell lung cancer (NSCLC), including exon 19 deletions and L858R, are effectively treated with tyrosine kinase inhibitors (TKIs) 6 – 8 . Alterations in BRCA1/2, particularly in prostate cancer, have shown responsiveness to PARP inhibitors, offering clinical benefit despite their association with more aggressive disease 9 , 10 . However, many variants remain poorly understood, especially in rare cancers or underrepresented populations, limiting their clinical utility due to unclear biological and functional relevance 11 . Genomic knowledge bases, such as COSMIC 12 , CIViC 13 , ClinVar 14 , and OncoKB 15 , provide curated resources that support the functional interpretation of molecular variants, enabling clinicians and researchers to assess their clinical significance and inform personalized treatment strategies 16 .
Despite extensive efforts to curate and standardize variant interpretations, knowledge remains fragmented and inconsistently represented due to the overwhelming volume of medical literature, the high cost of manual expert curation, and the limited scalability of these efforts, ultimately restricting the integration and coverage of genomic variants across platforms 17 , 18 . Consequently, clinicians often need to consult multiple databases and review additional literature in parallel to interpret variants accurately, making the process time-consuming, inefficient, and non-standardized 19 . Additionally, keyword-based search methods are hindered by inconsistent naming conventions and varied descriptions of the same variant. These include variant annotations defined by the Human Genome Variation Society (HGVS), DNA changes, protein changes (and within these, one-letter vs. three-letter codes), identifiers, and single-nucleotide polymorphism (SNP) IDs, all of which may be used interchangeably, complicating variant interpretation across platforms 20 . Traditional information retrieval methods, including keyword-based searches, named entity recognition (NER) models such as SciSpaCy 21 and transformer-based models such as BERT 22 , have improved the extraction of biomedical entities from literature 23 . However, while these methods have improved basic entity extraction, they often struggle with the full complexity of variant mentions, particularly non-standardized nomenclature, contextual ambiguity, and rare event extraction, and show reduced performance on unseen datasets, limiting their ability to accurately extract novel or rare entities 24 , 25 . Recent breakthroughs in artificial intelligence (AI), particularly the development of large language models (LLMs), are transforming how biomedical text is processed 26 , 27 . 
LLMs offer advanced contextual understanding and semantic flexibility that allow them to capture nuanced variant expressions and resolve ambiguities that confound traditional models 28 . Despite the potential of LLMs, current variant information extraction research is still dominated by earlier machine learning or rule-based systems, which face limitations. For instance, tmVar applied a machine learning system to detect sequence variants from PubMed abstracts based on HGVS nomenclature, though it primarily captures isolated mentions without broader contextual understanding 29 . PubTator 3.0 expanded automated extraction of biomedical entities to more than a billion annotations across PubMed, but it does not apply LLMs directly to unstructured text for extracting ambiguous or novel variant mentions 30 . The Variomes tool improved retrieval of variant-related publications but relies on pre-annotated datasets and lacks dynamic extraction capabilities for novel variants 31 . More recently, fine-tuned language models combined with manual annotation have enhanced variant entity recognition, yet these methods mainly focus on labeling rather than deeper contextual interpretation or relationship extraction 26 . Together, these prior efforts highlight both the progress achieved and the persistent challenges in enabling scalable, context-aware extraction of heterogeneous and novel variant mentions from biomedical literature. Building on these foundations, we developed an automated pipeline that also leverages recent advances in LLM techniques to systematically extract, standardize, and contextualize variant information from unstructured biomedical literature. Our approach addresses persistent challenges such as inconsistent nomenclature, rare and novel variant detection, and gaps in existing databases. 
Clinicians have traditionally lacked scalable tools to uncover rare or complex variant-treatment-cancer relationships buried in the literature, a gap this pipeline is designed to close. By incorporating co-occurrence metrics and network-based analyses, the pipeline further infers and ranks potential variant, treatment, and cancer co-associations in a systematic and scalable manner. We show that our system is able to provide structured, context-rich variant insights to support evidence-driven scientific lead generation. This allows clinicians to rapidly identify such associations, including rare or underreported ones, directly from the literature without time-consuming manual searches. The framework supports researchers and clinicians in generating hypotheses, uncovering underrecognized connections, and identifying potential applications of existing therapies across both common and rare variants, reducing the need for manual literature review and streamlining access to clinically relevant insights.

Methods

All analyses (Supplementary information 1) were conducted using Python 3.11.5. The complete code is available in a GitHub repository, which can be accessed at https://github.com/hastingslab-org/Variantscape 32 , while the datasets are publicly available on Zenodo 33 .

Literature retrieval

Biomedical publications related to cancer research were retrieved from the OpenAlex database using its public API 34 . A structured, broad search query including “cancer” OR “tumor”/“tumour” OR “carcinoma” and related synonyms was applied to search titles and abstracts and identify relevant articles published in English between 2014 and 2024. Paper titles, abstracts, and additional metadata were extracted, including publication year, authors, referenced citations, and associated research concepts.

Dataset cleaning and preprocessing

The retrieved dataset was cleaned and normalized to ensure quality and consistency.
Duplicate records were removed based on unique paper identifiers and/or combinations of titles and authors. Articles missing titles, abstracts, or metadata were excluded, and an additional language detection step was applied to titles and abstracts to exclude articles in non-English languages. Supplementary materials, corrections, tables, figures, and artifacts were removed based on keywords or regular expression patterns. Moreover, unusually short or long texts were filtered out (i.e., titles with fewer than three or more than 60 words and abstracts with fewer than 80 or more than 1,000 words), alongside withdrawn articles and non-original research content such as editorials or letters. The text fields were cleaned to remove HTML artifacts, and both the titles and abstracts were normalized for consistent casing and structure. A temporal analysis of publication trends and patterns over the years was performed.

Information extraction

Transformer-based named entity recognition for gene extraction

A curated list of 161 cancer-relevant genes from the Oncomine Comprehensive Assay v3 NGS panel (Thermo Fisher) 35 was used as a reference set for identifying gene-relevant publications. Following preprocessing, the objective was to filter articles mentioning these cancer-associated genes to enable a focused downstream genomic variant analysis. Initial approaches involving rule-based string matching and NLP using SciSpaCy models 21 were evaluated but found inadequate, as they failed to capture a sufficient number of relevant gene mentions, showed elevated false-positive rates, and lacked contextual understanding. LLMs, while more robust than these approaches, did not outperform the BioBERT-based approach and were deemed too computationally inefficient and resource-intensive for this relatively constrained information extraction task 36 . Therefore, the transformer-based BioBERT model alvaroalon2/biobert_genetic_ner 37 was employed.
Genes from the Oncomine panel were programmatically expanded into synonyms via the MyGene.info API 38 , capturing official symbols, alternative names, and protein associations. Titles and abstracts were processed with BioBERT using a sliding window approach with a stride of 256 tokens to preserve context beyond the model’s 512-token limit. A multi-step normalization process addressed inconsistencies in gene nomenclature, species-specific differences, optical character recognition (OCR) errors, and typographic noise. This involved case-insensitive matching against a standardized reference list, punctuation removal, token-level comparison for formatting inconsistencies, and fuzzy matching 39 to unify lexical variants and biologically related terms. Detected genes were compiled into a binary matrix per article, where “1” indicated presence and “0” indicated absence, alongside a column counting total gene mentions. Articles without matched genes were excluded, yielding a filtered set of relevant publications.

Categorization of entities in articles

Various approaches were systematically evaluated for extracting and categorizing biomedical entities. These included transformer-based NER models and LLMs. The assessed transformer-based BERT models 22 were chosen for their specialized pretraining on biomedical corpora. Rule-based and statistical models from the SciSpaCy suite, particularly en_ner_bionlp13cg_md and en_ner_bc5cdr_md, were also evaluated 21 , 40 . Model selection was guided by empirical performance on a validation set, assessed using established evaluation metrics, including precision, recall, accuracy, and F1-score.

NER for automated cancer type identification

A multi-step NER pipeline was designed to extract, normalize, and categorize cancer types from unstructured text.
Cancer mentions were identified from titles and abstracts using two complementary SciSpaCy models: “en_ner_bionlp13cg_md,” optimized for cancer-specific terms, and “en_ner_bc5cdr_md,” trained on more general disease entities 40 . To ensure consistency, a reference list of cancer types and synonyms was constructed by integrating definitions from the CIViC knowledgebase, mapping aliases to Disease Ontology ID (DOID) 41 and Monarch Disease Ontology (MONDO) 42 terms and expanding synonym coverage via the MONDO API 43 . Extracted mentions were cleaned through rule-based preprocessing, and synonyms were mapped to canonical terms through direct and fuzzy matching against a CIViC-MONDO dictionary. Each mention was standardized and assigned to a specific cancer type. A binary matrix was created to encode the presence or absence of each cancer type per article, excluding articles without any mapped cancer types. Additionally, standardized terms were linked to their ontological parents using Disease Ontology metadata retrieved via the API 44 .

Ontology-guided NER for automated treatment identification

The treatment extraction process involved querying the CIViC API 45 to retrieve a curated, clinically validated list of cancer therapies, including treatment names, National Cancer Institute Thesaurus (NCIt) identifiers, and aliases. NCIt was dynamically queried using the API 46 to obtain hierarchical parent concepts and overarching treatment categories. Treatment terms were extracted from articles using a hybrid rule-based approach combining regular expressions and fuzzy matching to identify treatment names and aliases. Ambiguous or short aliases were systematically filtered, and each publication was mapped to a binary matrix indicating the presence or absence of treatment mentions.
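
The hybrid treatment-matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the small alias table, the minimum alias length of four characters, and the similarity cutoff of 0.9 are hypothetical, and Python's standard-library difflib stands in for whichever fuzzy matcher the pipeline actually uses.

```python
import re
from difflib import get_close_matches

# Hypothetical alias table (alias -> canonical treatment name). In the
# pipeline this information comes from CIViC and NCIt; here it is hard-coded.
TREATMENT_ALIASES = {
    "vemurafenib": "vemurafenib",
    "zelboraf": "vemurafenib",
    "plx4032": "vemurafenib",
    "osimertinib": "osimertinib",
    "azd9291": "osimertinib",
    "cetuximab": "cetuximab",
}
MIN_ALIAS_LEN = 4  # assumption: shorter aliases are treated as too ambiguous


def find_treatments(text, aliases=TREATMENT_ALIASES, cutoff=0.9):
    """Return canonical treatment names found by exact or fuzzy matching."""
    usable = {a: c for a, c in aliases.items() if len(a) >= MIN_ALIAS_LEN}
    lowered = text.lower()
    found = set()
    # Step 1: whole-word regular-expression match on each alias.
    for alias, canonical in usable.items():
        if re.search(rf"\b{re.escape(alias)}\b", lowered):
            found.add(canonical)
    # Step 2: fuzzy match on single tokens to catch misspellings.
    for token in set(re.findall(r"[a-z0-9\-]+", lowered)):
        if len(token) < MIN_ALIAS_LEN:
            continue
        match = get_close_matches(token, list(usable), n=1, cutoff=cutoff)
        if match:
            found.add(usable[match[0]])
    return found


def presence_matrix(articles, treatments):
    """Map each paper ID to a 0/1 row, one column per canonical treatment."""
    rows = {}
    for paper_id, text in articles.items():
        hits = find_treatments(text)
        rows[paper_id] = [int(t in hits) for t in treatments]
    return rows
```

A stricter cutoff reduces false positives from drug names that differ by only a few letters, at the cost of missing more misspellings; the appropriate trade-off would need to be tuned on annotated data.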
LLM-based classifier for automated categorization

To classify study types, an automated, LLM-based approach was implemented using Meta’s LLaMA3.3-70b-Instruct model via the General Classifier framework and the DeepInfra API 47 – 49 . Predefined categories from the Ontology of Biomedical Investigations (OBI) 50 were consolidated into eight final categories: “clinical study,” “observational/RWE study,” “case report study,” “in vitro study,” “in vivo/animal study,” “in silico study,” “systematic review study,” and “other.” Prompt engineering strategies were evaluated on a representative data subset, and the approach with the highest accuracy and labeling consistency was selected to classify the full dataset in batches. To analyze classification results, abstracts were embedded using the SciBERT allenai/scibert_scivocab_uncased model 51 , 52 , producing 768-dimensional vectors by averaging final-layer token embeddings. Uniform Manifold Approximation and Projection (UMAP) was applied to reduce these embeddings to two dimensions, and an interactive 2D embedding-based cluster map was generated, with points colored by study design category and hover data for titles and publication years, enabling descriptive, exploratory analysis 53 .

LLM-based molecular variant extraction

The main methodological innovation of this study was the use of LLMs to extract specific gene variants from biomedical text, as extraction with conventional NLP methods has been insufficient. Variant extraction was performed using Meta’s LLaMA3.3-70b-Instruct accessed via the cloud service DeepInfra 49 . This model was selected based on comparative evaluations against other state-of-the-art models, such as GPT-4o and DeepSeek, with LLaMA3.3-70b demonstrating superior accuracy in identifying and structuring variant-gene pairs from biomedical text 28 . The most effective prompt guided the model to detect HGVS notations, protein changes, and SNP IDs, while excluding vague or generic mentions.
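
To make the nomenclature problem concrete, the sketch below shows how one variant can surface in several notation families: BRAF V600E may appear as V600E, p.V600E, p.Val600Glu, c.1799T>A, or rs113488022. The regular expressions and function names are illustrative assumptions covering only simple substitutions; they are not the prompt or the post-processing rules used in the pipeline, which relied on the LLM precisely because such patterns do not generalize.

```python
import re

# Three-letter to one-letter amino acid codes ("Ter" denotes a stop codon).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative patterns for simple substitutions only (assumptions).
PROTEIN_ONE = re.compile(rf"^(?:p\.)?([{ONE}])(\d+)([{ONE}*])$")
PROTEIN_THREE = re.compile(r"^(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*)$")
CODING_DNA = re.compile(r"^c\.\d+[ACGT]>[ACGT]$")
SNP_ID = re.compile(r"^rs\d+$")


def canonical_protein_change(text):
    """Normalize a protein substitution to one-letter form, or return None."""
    token = text.strip()
    m = PROTEIN_ONE.match(token)
    if m:
        return "".join(m.groups())
    m = PROTEIN_THREE.match(token)
    if m:
        ref = THREE_TO_ONE.get(m.group(1))
        alt = "*" if m.group(3) == "*" else THREE_TO_ONE.get(m.group(3))
        if ref and alt:
            return f"{ref}{m.group(2)}{alt}"
    return None


def notation_family(text):
    """Classify a mention as SNP ID, coding-DNA change, or protein change."""
    token = text.strip()
    if SNP_ID.match(token):
        return "snp_id"
    if CODING_DNA.match(token):
        return "coding_dna"
    if canonical_protein_change(token):
        return "protein"
    return "unrecognized"
```

Mentions such as "exon 19 deletion" or "BRAF-mutant" fall outside these patterns, which illustrates why rule-based matching alone leaves gaps.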
Extraction was performed in batches of 50,000 articles, with the cumulative runtime recorded. Extracted variants were standardized through preprocessing steps such as prefix removal, amino acid canonicalization, exon mapping, and keyword filtering. API calls to external databases improved synonym coverage and provided validation, including CIViC for variant IDs, aliases, and assertions, ClinVar for SNP IDs and HGVS notations, and Ensembl Variant Recoder for normalization 54 . Variants successfully matched to reference databases were standardized and included as unique variant-gene pairs in a binary matrix. Variants not found in these databases were retained in their original extracted form, without normalization or merging, to ensure their inclusion in downstream analyses and to preserve potentially novel findings. Finally, all datasets were merged using paper IDs as a common identifier for co-association analysis.

Computational analysis of weighted co-occurrence networks and LLM-based categorization of relationships

Co-associations among variants, treatments, and cancer types were assessed through co-occurrence analysis. Weighted matrix multiplication generated adjacency tables capturing pairwise combinations across entity types, specifically treatment-variant, cancer-variant, and treatment-cancer relationships. Statistical significance of each combination was assessed by Fisher’s exact test with Benjamini-Hochberg False Discovery Rate (FDR) correction (adjusted p ≤ 0.05), enabling identification of high-confidence associations enriched in the literature. In parallel, an undirected, evidence-weighted co-occurrence network was constructed using the Python library NetworkX 55 , incorporating all observed co-occurrences, regardless of statistical significance.
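
The significance-testing step can be sketched as follows, assuming binary article-by-entity matrices as described earlier. This is a simplified illustration rather than the authors' code: the counts are unweighted (the weighting scheme is not specified here), the function names are hypothetical, and Fisher's exact test and the Benjamini-Hochberg adjustment are written out in pure Python so the logic is visible. In practice a statistics library would normally be used.

```python
from math import comb


def cooccurrence_counts(left, right):
    """Entity-by-entity co-occurrence counts, i.e. left^T @ right.

    `left` and `right` are article-by-entity 0/1 matrices (lists of rows)
    that share the same article order.
    """
    n_left, n_right = len(left[0]), len(right[0])
    return [
        [sum(row_l[i] * row_r[j] for row_l, row_r in zip(left, right))
         for j in range(n_right)]
        for i in range(n_left)
    ]


def contingency(left, right, i, j):
    """2x2 table for entity i vs entity j: both, i only, j only, neither."""
    a = b = c = d = 0
    for row_l, row_r in zip(left, right):
        if row_l[i] and row_r[j]:
            a += 1
        elif row_l[i]:
            b += 1
        elif row_r[j]:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test via the hypergeometric distribution."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    total = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy data: 6 articles, 2 variants, 1 treatment (made-up values).
variants = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 0], [0, 1]]
treatments = [[1], [1], [1], [0], [0], [0]]

counts = cooccurrence_counts(variants, treatments)
pvals = [fisher_exact_two_sided(*contingency(variants, treatments, i, 0))
         for i in range(2)]
qvals = benjamini_hochberg(pvals)
```

Pairs whose adjusted p-value is at or below 0.05 would be retained as significant; with only six toy articles, neither pair reaches that threshold, which mirrors how rare variants can be underpowered for significance testing.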
Including all co-occurrences in the network, rather than only the statistically significant ones, was intended to capture weaker or rare signals, particularly for under-characterized variants and cancers that may be underpowered for significance testing but may still reflect clinically relevant relationships. Nodes in the network represented variants, cancers, and treatments, while edges were weighted by corresponding matrix values. Network analyses, including degree centrality, clustering coefficients, and community detection, were applied to uncover hubs, patterns of connectivity, and modular structure within the literature-derived variant landscape. To refine the interpretation of variant-treatment co-occurrence, a secondary annotation layer using LLaMA3.3-70B was applied. Each pairwise variant and treatment co-occurrence was extracted and classified by the LLM into one of five relationship categories reflecting different ways that variants can impact treatments: “sensitive/effective,” “resistant,” “diagnostic,” “unrelated,” or “unknown.” As the same variant-treatment relationship was often discussed across multiple abstracts, potentially with differing classifications, a tiered consensus strategy was applied. Consensus for each variant-treatment pair was determined through strict agreement, ≥ 80% agreement, and rule-based fallback logic. This logic resolved remaining cases by analyzing label distributions and applying heuristics that assigned the most frequent weak label when only weak predictions were present, and favored clinically meaningful labels (e.g., “sensitive”, “resistant”, “diagnostic”) over weaker ones (e.g., “unknown”, “unrelated”). Pairs with conflicting or ambiguous distributions remained classified as “no consensus.” Querying the network for specific variants revealed closely related treatments and cancers through evidence-weighted scores.
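
The tiered consensus logic described above can be sketched as follows. The three tiers (strict agreement, at least 80% agreement, rule-based fallback) come from the text; the exact fallback rules are not specified there, so the tie handling and the rule that a single clinically meaningful label wins over weak labels are assumptions made for illustration.

```python
from collections import Counter

STRONG_LABELS = {"sensitive/effective", "resistant", "diagnostic"}
WEAK_LABELS = {"unknown", "unrelated"}


def consensus_label(labels, threshold=0.8):
    """Resolve per-abstract labels for one variant-treatment pair."""
    if not labels:
        return "no consensus"
    counts = Counter(labels)
    ranked = counts.most_common()
    top_label, top_count = ranked[0]

    # Tier 1: strict agreement across all abstracts.
    if top_count == len(labels):
        return top_label
    # Tier 2: at least `threshold` of abstracts agree.
    if top_count / len(labels) >= threshold:
        return top_label

    # Tier 3: rule-based fallback (assumed heuristics).
    strong_present = [label for label in counts if label in STRONG_LABELS]
    if not strong_present:
        # Only weak labels: take the most frequent one, unless tied.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return "no consensus"
        return top_label
    if len(strong_present) == 1:
        # One clinically meaningful label mixed with weak labels: favor it.
        return strong_present[0]
    # Several competing clinically meaningful labels: leave unresolved.
    return "no consensus"
```

For example, four "sensitive/effective" labels and one "unknown" resolve at the 80% tier, whereas a mix of "sensitive/effective" and "resistant" falls through to "no consensus".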
Three authors (MB, TP, MF) manually validated strong co-associations, leveraging their clinical and molecular expertise, as well as external data and knowledge bases such as DrugBank 56 , ClinicalTrials.gov 57 , PubMed 58 , and Google Scholar. A targeted literature search was performed to identify rare or under-characterized variants in selected cancer types. These variants were then queried in the network to retrieve associated treatments, and the resulting co-associations were subsequently verified through additional literature review. Based on expert review to ensure network robustness, confidence filtering was applied to retain only those edges falling above the 80th percentile of the evidence-weighted co-occurrence distribution. Finally, the Python data analysis scripts were integrated into a Flask application to generate interactive HTML pages. The web tool was deployed to a cloud-hosted WSGI server, making it publicly accessible.

Results

Performance summary of information extraction pipeline

Each component of the information extraction pipeline was evaluated against manually annotated ground truth to determine the best-performing method. Gene extraction using the BioBERT-based biobert_genetic_ner model achieved the strongest results (F1-score: 0.98, Recall: 1.00) 36 . Cancer type extraction was performed via the two SciSpaCy models en_ner_bionlp13cg_md and en_ner_bc5cdr_md combined with ontology-guided normalization (F1-score: 0.89, Recall: 0.80) (Supplementary information 2). Treatment extraction using a hybrid ontology-based and fuzzy matching approach demonstrated high reliability (F1-score: 0.95, Recall: 0.95) (Supplementary information 3). Variant extraction using Meta’s LLaMA3.3-70B model outperformed other evaluated LLMs in extracting molecular variants (F1-score: 0.95, Recall: 0.92) 28 . LLM-based classification of study types produced consistent results across categories (F1-score: 0.92, Recall: 0.92) (Supplementary information 4).
Finally, variant-treatment relationship classification using LLaMA3.3-70B achieved reliable consensus (F1-score: 0.85, Recall: 0.84) (Supplementary information 5).

Study selection and characteristics

A total of 2,775,913 biomedical articles related to cancer research and published between 2014 and 2024 were retrieved from the OpenAlex database. After systematic preprocessing, 647,595 records (23.3%) were excluded due to duplicates (6.0%), missing metadata (1.6%), non-English texts (1.6%), withdrawn articles (0.1%), nonsensical titles (0.1%), length anomalies (8.3%), non-original research (0.7%), normalization errors (0.1%), and artifacts or entries misclassified as standalone articles but actually representing supplementary materials (4.8%). The final cleaned dataset comprised 2,128,318 articles (Fig. 1). Temporal analysis revealed a steady increase in publication volume from 150,000 articles in 2014 to over 250,000 articles by 2022, with a slight decline in 2024. Monthly publication spikes could be observed, which likely correspond to conference or publication cycles, while consistent weekly peaks suggest routine publishing practices (Supplementary information 6).

Transformer-based gene extraction

From the preprocessed dataset of 2,128,318 publications, the BioBERT model identified 308,748 articles (14.51%) containing genes or gene-related products from the Oncomine assay in their titles and/or abstracts. Consequently, 1,819,570 articles (85.49%) were excluded. In the filtered dataset, TP53 was the most frequently mentioned gene in 50,138 articles (16.24%), followed by EGFR (43,816; 14.19%), AKT1 (34,337; 11.12%), MTOR (19,864; 6.43%), and KRAS (19,364; 6.27%), with some mentions potentially amplified by pathway-specific research focus (Fig. 2). The distribution of gene mentions per publication revealed that most articles focused on only one gene (189,309; 61.3%) or two genes (66,034; 21.4%), with the average number of genes per article being 1.79.
A steep decline was observed as the number of genes per article increased, with only a small fraction mentioning more than three genes (Supplementary information 7).

Categorization of biomedical entities

NER-based cancer extraction identified at least one specific cancer type in 199,726 (64.69%) of the 308,748 screened articles. A total of 225 distinct cancer types were identified, with the most frequently mentioned being breast cancer (40,002 articles; 20.03%), colon cancer (28,790; 14.41%), lung cancer (27,590; 13.81%), glioma (17,437; 8.73%), prostate cancer (16,208; 8.12%), liver cancer (15,093; 7.65%), pancreatic cancer (13,649; 6.83%), ovarian cancer (13,106; 6.56%) and melanoma (9,956; 4.98%), to some extent reflecting the high incidence/prevalence of these cancer types and/or their frontrunner status as regards targeted and immunotherapeutic drug development (cf. lung cancer, melanoma) (Fig. 3). Among these articles, the largest shares were classified by the LLM as in vitro studies (80,899; 40.50%) or clinical studies (52,769; 26.42%). Other common categories included in vivo/animal studies (25,165; 12.60%), in silico studies (12,212; 6.11%), and systematic reviews (12,158; 6.09%). Less frequent were case reports (9,753; 4.88%) and observational/RWE studies (5,955; 2.98%). A small subset of articles (815; 0.40%) was classified as “other”, either due to divergence from predefined design categories or insufficient information resulting in ambiguous classification (Fig. 4, Supplementary information 8, interactive cluster plot accessible at https://evidencedb.hastingslab.org/variantscape/studydesignclustermap 59 ). Out of the 308,748 gene-harboring articles, 126,195 (40.87%) mentioned specific treatments, while 182,553 articles (59.13%) did not include any treatment references that could be matched with CIViC or NCIt mappings.
The most frequently mentioned treatments included cisplatin (11,770 articles; 4.36%), sirolimus (6,874; 2.54%), paclitaxel (3,892; 1.44%), doxorubicin (3,700; 1.37%), erlotinib (3,608; 1.34%), gefitinib (3,554; 1.32%), and cetuximab (3,377; 1.25%) (Supplementary information 9).

LLM-based variant extraction

The LLM-based variant extraction pipeline processed 199,726 cancer-related articles in 2 days, 11 hours, and 49 minutes using LLaMA3.3-70b, at approximately 1.075 s per article. Variants were identified in the titles and/or abstracts of 16,508 articles (8.27%), while 183,218 articles (91.73%) did not contain variant mentions. This subset constituted less than 1% of all articles initially retrieved from OpenAlex. Following normalization, a total of 35,230 variant mentions across the dataset were identified, of which 11,950 variants were unique, indicating that many articles referenced multiple variants in their titles or abstracts. The most frequently mentioned variants included BRAF V600E in 3,859 articles (23.38%), KRAS G12D (1,489; 9.02%), EGFR T790M (1,196; 7.24%), EGFR L858R (1,105; 6.69%), KRAS G12C (743; 4.50%), and KRAS G12V (590; 3.57%) (Supplementary information 10). Further analysis showed that the majority of variant mentions (30,086; 85.40%) were associated with genes included in the Oncomine assay. Among the 161 Oncomine genes considered, 149 (92.55%) had at least one reported variant, while 12 genes (7.45%) had no variant mention. The remaining 5,144 variants (14.60%) were linked to 1,524 unique genes not included in the Oncomine panel (Supplementary information 11).

Co-occurrence analysis

Out of 308,748 articles (100%) analyzed, 199,726 (64.69%) mentioned at least one specific cancer type, 126,195 (40.87%) referenced a specific treatment, and 16,508 (5.35%) included a variant. Only 7,524 articles (2.44%) were at the intersection, mentioning all three entities, thus qualifying for statistical co-association analysis (Supplementary information 12).
This dataset included 377 unique treatments, 98 unique cancer types, and 4,029 unique variants. The most frequently mentioned specific treatments were vemurafenib (612; 5.82%), osimertinib (564; 5.35%), and trametinib (521; 4.95%) (Supplementary information 13). For cancer types, the most commonly reported were lung cancer (2,022; 24.16%), melanoma (1,109; 13.25%), colon cancer (1,097; 13.11%), breast cancer (1,035; 12.37%), pancreatic cancer (497; 5.94%), and thyroid cancer (454; 5.42%) (Supplementary information 14). The analysis of variant frequencies revealed that BRAF V600E (1,952; 28.91%), EGFR T790M (852; 12.62%), EGFR L858R (641; 9.49%), KRAS G12D (571; 8.46%), KRAS G12C (436; 6.46%), and KRAS G12V (267; 3.95%) were most prevalent (Fig. 5 ). Co-occurrence matrices illustrated the frequency of associations between treatments, variants, and cancer types. Among all variant-treatment pairs, the most common LLM-predicted association type was “sensitive,” accounting for 29.1% of cases (4,552 associations), followed by “resistant” predictions at 21.6% (3,368 associations). A much smaller proportion, 0.9% (144 associations), was classified as “diagnostic”. Predictions labeled as “unknown” made up 24.1% of the total (3,770 associations), while 23.5% (3,666 associations) were deemed “unrelated”, and only 0.8% of associations (124 associations) could not be resolved and were labeled as “no consensus.” The matrix was predominantly sparse (indicated by light color in Fig. 6 ), reflecting relatively few strong associations. This aligns well with the biological expectation that most targeted therapies are relevant only to specific cancer types and genetic alterations/molecular signatures. Nonetheless, distinct hotspots, highlighted by more saturated green and red coloring, revealed statistically significant co-associations for select variant-treatment pairs predicted to be related to treatment-sensitivity (i.e., green color) or potential resistance (i.e., red color). 
Among the top-ranked predicted associations, several reflect well-established clinical relationships. For instance, dabrafenib, trametinib, and their combination regimen showed strong associations with the BRAF V600E variant (76.74%, 65.90%, and 81.98% association, respectively), aligning with their status as clinically approved and widely used therapies for BRAF-mutant cancers. Similarly, the cetuximab/encorafenib regimen showed a strong predicted association with BRAF V600E (95.28% association), aligning with its approval for metastatic colorectal cancer harboring this variant. On the other hand, some other high-scoring associations, such as ganitumab and elimusertib, lack a known mechanistic or clinical link to BRAF V600E, and are likely non-specific or non-actionable in this mutational context. Moreover, several variant-treatment pairs in the heatmap represent plausible resistance mechanisms, including ALK G1202R, which is a well-established resistance mutation targeted by later-generation ALK inhibitors, such as ASP3026, alectinib, and ceritinib (-70.00%, -21.55%, and -21.23% association, respectively) 60. Additionally, as indicated by the matrix, some treatments exhibited broader associations, with multiple non-zero entries across several variants. These likely represent broad-spectrum treatments that are relevant to multiple genetic alterations or irrespective of genetic alterations. Conversely, isolated cells suggest high-specificity interactions, potentially reflecting precision oncology applications where therapies are tailored to target specific genetic alterations. The co-occurrence analysis between variants and treatments revealed 15,577 statistically significant co-associations (p ≤ 0.05).
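To illustrate how a co-association can be tested for significance at p ≤ 0.05, the sketch below applies a one-sided Fisher's exact test to a 2x2 table of article counts. This is an assumption for illustration: this section does not name the statistical test that was used, and the counts in the example are hypothetical rather than taken from the dataset.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    at least `a` co-mentions if the two entities were mentioned independently.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    upper = min(row1, col1)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical article counts for one variant-treatment pair:
# a = both mentioned, b = variant only, c = treatment only, d = neither.
p_value = fisher_exact_greater(a=12, b=8, c=30, d=950)
is_significant = p_value <= 0.05
```

With thousands of pairs tested, a multiple-testing correction would normally be applied before interpreting individual p-values; that step is omitted here for brevity.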
Notably, BRAF V600E emerged as one of the most recurrently associated variants, showing strong co-associations with multiple cancer types including histiocytoma (100%), clear cell sarcoma (100%), neurofibroma (100%), chordoma (100%), and melanoma (61.85%) (Supplementary information 15). On the treatment side, compelling co-occurrences such as the cetuximab/encorafenib regimen for colorectal cancer (95.3%) and endocrine therapies for breast cancer (86–94%) were observed, while abiraterone was tightly linked to prostate cancer (91.4%) (Supplementary information 16). Investigating each cancer type individually (with lung-, breast-, colon- and prostate cancer as well as melanoma serving as case examples) revealed that some variants appear across multiple cancers, while others are more cancer-specific (Fig. 7 ). For example, BRAF V600E is shared as one of the top associated variants in lung cancer, colon cancer, and melanoma, while KRAS G12C is found in both lung and colon cancers. However, most other top-ranking variants appear to be disease-specific. In lung cancer, the strongest co-associations were observed for EGFR T790M (35.16), EGFR L858R (27.65), KRAS G12C (8.92), BRAF V600E (7.96), and EGFR Exon19del (5.59). For breast cancer, the most prominent co-associations included ESR1 Y537S (17.13), ESR1 D538G (16.29), PIK3CA H1047R (14.34), PIK3CA E545K (10.62), and PIK3CA E542K (8.43). In colon cancer, BRAF V600E showed the strongest co-occurrence (45.47), followed by several KRAS variants: G12C (12.13), G12D (9.96), G12V (8.43), and G13D (7.18). The top five variants in prostate cancer were all related to the AR gene, including AR T878A (17.03), AR L702H (16.95), AR F876L (14.16), AR F877L (12.35), and AR H875Y (11.80). In melanoma, BRAF V600E showed the highest co-association score (61.85), followed by BRAF V600K (12.26), NRAS Q61R (2.48), NRAS Q61K (2.21), and BRAF V600R (2.10). 
These high-ranking variants are largely consistent with well-established and frequently reported alterations, supporting the validity of the extraction and ranking approach and indicating that the method reliably recovers clinically relevant findings.

Network analysis

The network graph comprised 4,504 nodes and 48,470 edges, offering a structured representation of the variant landscape captured in the literature (full network graph accessible at https://evidencedb.hastingslab.org/variantscape/networkgraph) 61. As an example, network analysis of treatment-variant relationships identified the EGFR L858R point mutation and EGFR T790M as central nodes in NSCLC (Table 1). For EGFR L858R, multiple treatments demonstrated strong, sensitive co-associations exceeding the 80th percentile confidence threshold (highlighted in green color), including osimertinib (score: 678), gefitinib (score: 474), erlotinib (score: 432), and afatinib (score: 333), while radiation therapy (score: 167) showed a co-association but did not meet the predefined confidence cutoff and was therefore deemed not relevant (grey color). No resistant co-associations for EGFR L858R surpassed the threshold (grey color), although several lower-confidence edges were present. In contrast, EGFR T790M was associated with a single high-confidence sensitivity edge, osimertinib (score: 871), and multiple strong resistance co-associations (highlighted in red), including gefitinib (score: 557), erlotinib (score: 505), afatinib (score: 375), and cisplatin (score: 182). Cross-cancer co-associations revealed that both variants were most frequently linked to colon cancer (highlighted in blue), suggesting a measurable, though relatively uncommon, presence of these variations outside the NSCLC setting (EGFR L858R score: 92, EGFR T790M score: 98).
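The idea of retaining only edges at or above a percentile cutoff can be sketched as follows. This is a toy illustration under stated assumptions: the node names and weights are placeholders, the nearest-rank percentile definition is one of several possible conventions, and the section does not state how the 80th percentile was computed or over which set of edges.

```python
def nearest_rank_percentile(values, pct: int):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def keep_strong_edges(edges, pct: int = 80):
    """Keep (source, target, weight) edges whose weight reaches the percentile cutoff."""
    cutoff = nearest_rank_percentile([w for _, _, w in edges], pct)
    return [(u, v, w) for u, v, w in edges if w >= cutoff]


# Placeholder edges (hypothetical names and weights, not values from the paper).
edges = [
    ("variant_A", "drug_1", 100), ("variant_A", "drug_2", 90),
    ("variant_A", "drug_3", 40), ("variant_B", "drug_1", 80),
    ("variant_B", "drug_4", 30), ("variant_B", "drug_5", 20),
    ("variant_C", "drug_2", 70), ("variant_C", "drug_6", 60),
    ("variant_C", "drug_7", 50), ("variant_C", "drug_8", 10),
]
strong = keep_strong_edges(edges, pct=80)
```

With these ten placeholder weights the cutoff is 80, so three edges are retained. A filter of this kind trades recall for robustness: sparsely reported associations fall below the cutoff even when they are correct.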
Rare variant identification and evaluation of treatments

In addition to the assessment of well-characterized alterations, literature-derived rare variants in NSCLC (BRAF G469V, EGFR S768I, EGFR L861Q, and EGFR L747P) were examined to evaluate their co-associations with both sensitive and resistant therapies, as well as their occurrence in other cancer types (Table 2). These co-associations were supported by prior evidence 62, 63, underscoring the method’s ability to recover clinically meaningful relationships for rare variants. For BRAF G469V, a class II mutation leading to RAS-independent dimeric activation, strong sensitivity co-associations were observed with osimertinib (score: 485) and gefitinib (score: 325), with no resistant associations detected 63. EGFR S768I showed high-confidence sensitivity co-associations with osimertinib (score: 503) and gefitinib (score: 337), with no resistant treatments surpassing the confidence threshold 62. EGFR L861Q demonstrated strong sensitivity co-associations with osimertinib (score: 507), gefitinib (score: 336), and erlotinib (score: 302), and no high-confidence resistance co-associations 62. EGFR L747P was uniquely characterized by a strong resistant co-association with osimertinib (score: 487), without any high-confidence sensitivity co-associations 62. Cross-cancer co-association analysis revealed that BRAF G469V (score: 85) and EGFR L861Q (score: 84) were frequently mentioned in the context of colorectal cancer, suggesting relevance beyond NSCLC 63.

Interactive web tool for querying variant-cancer-treatment co-associations

To translate our findings into a usable resource, we developed the publicly accessible “Variantscape” web tool that enables interactive exploration of our network analysis and variant, treatment, and cancer associations identified in this study.
The tool allows users to search by variant and associated gene, as well as cancer type, to retrieve co-associations with treatments and other cancer types linked to the respective variant. This interface enables exploratory navigation of literature-derived variant associations and may serve as a starting point for hypothesis generation and knowledge discovery in research settings. The tool is available at https://evidencedb.hastingslab.org/variantscape 64 (Fig. 8).

Discussion

This study presents a large-scale computational analysis of variant-treatment-cancer co-associations derived from biomedical literature, specifically titles and abstracts. We introduce a novel technical approach that combines LLM- and NER-driven entity extraction with co-association and network-based analyses to systematically mine and structure oncogenic insights from unstructured text. This method enabled the identification of co-associations involving both well-known and less frequently discussed variants, revealing patterns and underexplored relationships from the underlying literature that may inform future hypothesis generation and focused evidence retrieval about oncogenic mechanisms or therapeutic relevance. In doing so, it contributes to a more comprehensive representation of the literature-derived variant landscape. Many of these insights, particularly those involving less-characterized variants, reflect contextual relationships that are often absent from traditional variant databases, frequently leaving researchers and clinicians performing their own literature reviews when confronted with rare or atypical alterations. Importantly, these findings are not intended as clinical recommendations but rather represent literature-based leads that should support further exploration within the context of existing evidence and expert judgment, including potential directions for precision oncology and drug repurposing.
Technical contributions and methodological insights

The large number of gene and variant mentions extracted from the literature corpus demonstrates the effectiveness of the LLM-based approach in capturing both commonly reported and less canonical molecular alterations, highlighting the richness of unstructured biomedical text and the value of large-scale text mining for variant discovery. However, despite the vast initial scale of the literature corpus, only 1% of articles mentioned a molecular variant, showing how rarely explicit molecular variants appear in titles and abstracts. Crucially, this also points to a practical limitation for clinicians and researchers relying on traditional keyword-based searches, as variant nomenclature is often inconsistent and, in many cases, not mentioned at all in titles or abstracts, making manual discovery via platforms like PubMed extremely difficult. Similarly, traditional NLP methods, while highly effective for identifying more standardized biomedical entities such as cancer types and treatments, as shown in this study, struggle to capture the variability inherent to molecular variants. In contrast, our LLM-based approach retrieves and structures variant-level insights at scale, leveraging the ability of LLMs to interpret naming variability and accurately extract variant information from unstructured texts, providing access to information that might otherwise remain buried in the literature. Although the computational runtime reflects the current resource demands of large-scale LLM-based extraction, the study shows that the approach is nevertheless feasible and scalable for systematic biomedical literature mining.

Interpretation of findings and clinical relevance

Our approach revealed many plausible treatment-variant, variant-cancer, and treatment-cancer co-associations.
In general, accuracy appeared highest for canonical variant-cancer pairs (e.g., EGFR L858R in NSCLC or BRAF V600E in melanoma) and declined when applied beyond well-established indications, as limited literature representation reduces both model confidence and evidence strength. However, even in well-characterized cases, the therapeutic relevance of a given alteration can vary substantially across tumor types, reflecting a broader challenge in precision oncology and treatment repurposing approaches. Factors such as pathway dependencies, co-mutations, and the tumor microenvironment can significantly influence treatment response 65 . As a result, the presence of a known oncogenic variant does not ensure equivalent efficacy across cancers, underscoring the importance of tumor-context-aware interpretation of variant-based co-associations 66 . Moreover, variant-treatment co-associations may be modulated by co-occurring genetic alterations, which can influence pathway activity, resistance mechanisms, or synthetic lethality interactions, underscoring the need to consider broader mutational contexts when interpreting literature-derived co-associations. While molecular biomarkers offer a rationale for extending therapies across diseases and toward tumor-agnostic strategies, their clinical utility remains constrained by biological complexity, tissue-specific signaling, pharmacokinetics, and off-target effects 67 . The presence of a shared variant does not guarantee equal therapeutic efficacy, as shown by BRAF V600E, which responds well to BRAF inhibition in melanoma but requires combination with EGFR blockade in colorectal cancer to overcome compensatory signaling 68 . Similarly, EGFR mutations are highly actionable in NSCLC but less effective targets in colorectal cancer due to downstream activation of the Ras-Raf-MEK-ERK pathway 69 . 
The type of gene affected also shapes therapeutic potential: oncogenes (e.g., EGFR , BRAF , ALK , MET ) are more readily targeted with inhibitors, whereas tumor suppressor genes, often altered by loss-of-function mutations, remain pharmacologically challenging 70 . Although TP53 was the most frequently reported gene in our dataset, most clinically actionable variants were linked to oncogenes, reflecting both their therapeutic tractability and greater representation in the literature. Moreover, while targeted therapies and immunotherapies dominate variant-guided treatment decisions, certain molecular features also influence response to broader modalities; for instance, BRCA 1/2 mutations confer sensitivity to platinum-based chemotherapy and PARP inhibitors 71 , and MMR deficiency or high tumor mutational burden (TMB) predict response to immune checkpoint blockade 72 . Together, these examples highlight the importance as well as limitations of the molecular context in determining therapeutic relevance, reinforcing the value of systematic literature-derived variant mapping while acknowledging the challenges that remain beyond current modeling capabilities. Despite these advances, significant gaps remain for certain clinically important variants, particularly in understudied scenarios. One example is the “entity” of Cancer of Unknown Primary (CUP), a diagnostically and therapeutically complex condition with a dismal prognosis. Although molecular profiling, including DNA methylation analysis, is frequently used to identify the tissue of origin and guide therapy, evidence suggests that even treatments informed by advanced genomic analyses often fail to outperform standard chemotherapy 73 . Notably, while CUP was identified in the literature, no CUP-associated variants were captured in our analysis, reflecting a genuine lack of co-mentioned variant data rather than a limitation of the extraction approach. 
This gap presents a clear opportunity for future studies to investigate and define robust genomic markers that could better inform CUP diagnosis and treatment. Similarly, variants of uncertain significance (VUS), while frequently encountered in clinical testing, are rarely emphasized in the literature and thus underrepresented in LLM-driven text mining. Beyond biological considerations, a critical but often underappreciated challenge in precision oncology is the influence of clinical testing patterns and the time point of molecular data collection reported in published literature. For example, AR alterations in prostate cancer are frequently reported, yet this may primarily reflect the timing of NGS, which is commonly performed in the advanced, castration-resistant setting 74. Consequently, literature-derived variant frequencies may mirror clinical sampling patterns rather than true biological prevalence. This bias is not unique to prostate cancer; rather, it reflects a broader issue across various tumor types. When NGS is primarily performed on refractory or late-stage disease, the molecular landscape observed in both clinical practice and the literature may become disproportionately focused on aggressive or treatment-resistant phenotypes. Additionally, the structure of clinical studies, particularly randomized controlled trials, often necessitates inclusion of standard-of-care treatments, which may lead to inflated mention frequencies irrespective of actual therapeutic effectiveness.

Practical implications and use cases

In many rare or understudied malignancies, conventional trial-driven evidence is limited, and database curation often lags behind the pace of emerging literature. In addition, the importance of classical evidence generation through large trials may decline as improved molecular understanding of disease leads to patient sub-setting, refined stratification, predictive biomarkers, and n-equals-one medicine.
To address this gap, we developed an LLM-supported approach to extract and identify potential variant-treatment-cancer co-associations across tumor types, supporting hypothesis generation in settings where direct evidence is sparse. These associations, however, are based on patterns of co-occurrence in the literature and should be interpreted as preliminary signals rather than definitive therapeutic recommendations. Clinical interpretation and functional validation remain essential to determine the true relevance of these findings. To support real-world applications, we developed an interactive web-based tool that enables clinicians and researchers to explore these associations in real time. This resource could assist clinicians, researchers, molecular tumor boards, and curation teams by providing an early-stage evidence layer and screening tool to guide downstream analysis and decision-making.

Limitations of the study

While our findings offer a broad and structured view of literature-derived variant, treatment, and cancer co-associations, several limitations should be acknowledged when interpreting the results. First, our analysis was restricted to titles and abstracts, as full-text access is frequently limited by paywalls and licensing restrictions. Moreover, large-scale full-text analysis with LLMs requires additional infrastructure, such as section parsing and increased memory capacity. Consequently, variants mentioned exclusively in the full text, particularly within tables, figures, or supplementary materials, were missed. Nonetheless, titles and abstracts offer a consistent and accessible foundation for scalable literature mining and represent a practical starting point for systematic variant discovery. Second, our study analyzed publications from 2014 to 2024, a timeframe that, while comprehensive, may include outdated therapeutic associations given the rapid evolution of oncological research.
However, as the pipeline is fully automated, the analysis can be updated at any time to incorporate newly published data and maintain alignment with the latest evidence. Third, our findings are inherently shaped by publication bias, as well-studied genes and high-profile therapies dominate the literature, while rare variants or under-researched populations are comparatively absent. Fourth, while our pipeline captured a broad treatment space, it occasionally surfaced non-specific or supportive agents (e.g., “chemotherapy,” “ibuprofen,” “prednisone”). These inclusions stem not from model error but from the inherent ambiguity in source knowledge bases and literature, where therapeutic intent, preventive roles, and symptomatic management are often conflated. Although some of these treatments have shown potential preventive or adjunctive effects in early studies, they are not considered standard therapies for cancer. This underscores a broader challenge in biomedical informatics: the need for context-aware systems and dynamically curated ontologies to distinguish between treatment, prevention, and supportive care in a clinically meaningful way. Fifth, applying a confidence threshold helped reduce noise and improve specificity in variant-treatment classifications, but emerging or sparsely reported associations may have been excluded. In qualitative review, some known treatment-variant co-associations were successfully recovered, while others were incorrectly labeled as “sensitive” despite lacking clear mechanistic or clinical support. This reflects the method’s susceptibility to noise and underscores the importance of thresholding and post-processing to reduce inaccurate associations. Moreover, this binary classification scheme oversimplifies complex therapeutic dynamics. Drug response is often dose-dependent, temporally variable, and context-specific.
Acquired resistance, compensatory signaling, and co-mutational effects all play roles that are not captured in simple co-occurrence-based models. Sixth, while our pipeline effectively extracted point mutations and small indels, it did not explicitly detect gene fusions or tumor-level genomic signatures such as MSI, MMR deficiency, or high TMB. Although clinically significant, these alterations were not included in our variant extraction scheme and were not systematically queried via the LLM. Finally, human validation remains essential. While LLMs can scale variant extraction from the published literature, they cannot replace clinical expertise in contextualizing variant relevance. The associations presented here should be viewed as preliminary signals derived from textual patterns, not definitive therapeutic evidence.

Future directions

Looking ahead, the approach presented here holds potential for further integration with a range of data types and tools. Future work should incorporate full-text mining to capture deeper variant annotations, structured therapeutic evidence, and context-rich discussions inherently missing from abstracts. Expanding extraction capabilities to include gene fusions, functional tumor signatures, and compound biomarkers would enhance the biological breadth. Additionally, integrating our literature-derived insights with curated databases and clinical variant interpretation pipelines (e.g., CIViC, COSMIC, ClinVar) and real-world molecular datasets (e.g., TCGA) may enable more robust hypothesis testing and validation. Linking these associations to clinical outcomes could further help prioritize literature-derived scientific leads for translational investigation.
Our publicly accessible web tool serves as a foundation for interactive variant exploration, but future iterations may support real-time updates, potentially evolving into a semi-automated support system for molecular tumor boards, variant curation teams, or precision oncology researchers, especially in low-resource settings where manual review is infeasible. Lastly, broader progress will require not only technical improvements in LLM-based extraction but also a shift in publishing culture, encouraging structured, standardized variant reporting, greater open access availability, and metadata enrichment to ensure biomedical literature can be systematically leveraged for clinical insight and discovery.

Conclusion

This study demonstrates the feasibility and utility of using LLMs and NLP-based approaches to systematically extract, structure, and explore variant, treatment, and cancer co-associations from biomedical literature at scale. Automation is essential to scaling extraction efforts, ensuring accuracy, and keeping pace with the exponential growth of scientific discoveries. By leveraging automation and advanced machine learning techniques, our pipeline uncovered thousands of meaningful connections, including both canonical and rare variants, that are often underrepresented or difficult to retrieve through traditional database queries or manual literature review. While constrained to titles and abstracts, our method offers a scalable and accessible foundation for literature-based variant discovery and contributes to mapping the evolving landscape of molecular oncology as reflected in published evidence. The open-access, interactive web tool developed as part of this work translates these findings into a usable resource that may support clinicians and researchers in hypothesis generation, early exploration of literature-derived signals, and the identification of potential new applications of existing therapies, especially in the case of rare variants.
Despite known limitations and the need for clinical validation, this approach offers a powerful starting point for augmenting variant interpretation efforts and bridging gaps in curated knowledge. As biomedical literature continues to grow rapidly, automated extraction methods like “Variantscape” may become increasingly valuable in transforming unstructured evidence into structured insights that capture the full complexity of the variant landscape and advance precision oncology.

Table 1 Top five ranked “sensitive” and “resistant” treatment co-associations for variants EGFR L858R and EGFR T790M in non-small cell lung cancer (NSCLC), and co-association of these variants in other cancer types based on evidence-weighted network analysis.

Cancer type: Non-small cell lung cancer (NSCLC)

Variant: EGFR L858R
Top 5 associated “sensitive” treatments (weighted evidence score)*: Osimertinib: 678; Gefitinib: 474; Erlotinib: 432; Afatinib: 333; Radiation therapy: 167
Top 5 associated “resistant” treatments (weighted evidence score)*: Cisplatin: 180; Pembrolizumab: 41; Paclitaxel: 30; Docetaxel: 30; Nivolumab: 21
Other top 5 cancers associated with the variant (weighted evidence score)*: Colon cancer: 92; Breast cancer: 64; Squamous cell cancer: 60; Melanoma: 56; Pancreatic cancer: 41

Variant: EGFR T790M
Top 5 associated “sensitive” treatments (weighted evidence score)*: Osimertinib: 871; Radiation therapy: 178; Trametinib: 90; Rociletinib: 71; Cetuximab: 64
Top 5 associated “resistant” treatments (weighted evidence score)*: Gefitinib: 557; Erlotinib: 505; Afatinib: 375; Cisplatin: 182; Crizotinib: 167
Other top 5 cancers associated with the variant (weighted evidence score)*: Colon cancer: 98; Breast cancer: 68; Melanoma: 58; Squamous cell cancer: 57; Pancreatic cancer: 45

*Colored co-associations represent high-confidence scores (≥ 80th percentile), based on evidence-weighted co-occurrence in the literature and filtered to retain only the strongest edges for network robustness.
Table 2 Top five ranked treatment co-associations for rare variants identified in non-small cell lung cancer (NSCLC), along with additional cancer types where these variants might be observed more frequently.

Cancer type: Non-small cell lung cancer (NSCLC)

Variant: BRAF G469V
Top 5 associated “sensitive” treatments (weighted evidence score)*: Osimertinib: 485; Gefitinib: 325; Trametinib: 87; Vemurafenib: 43; Cetuximab: 39
Top 5 associated “resistant” treatments (weighted evidence score)*: Dabrafenib: 66; Gemcitabine: 10; -; -; -
Other cancers associated with the variant (weighted evidence score)*: Colon cancer: 85; Melanoma: 58; Breast cancer: 50; Thyroid cancer: 29; Ovarian cancer: 16

Variant: EGFR S768I
Top 5 associated “sensitive” treatments (weighted evidence score)*: Osimertinib: 503; Gefitinib: 337; Erlotinib: 301; Afatinib: 253; Crizotinib: 135
Top 5 associated “resistant” treatments (weighted evidence score)*: Icotinib: 54; Brigatinib: 23; Lazertinib: 6; -; -
Other cancers associated with the variant (weighted evidence score)*: Breast cancer: 50; Pancreatic cancer: 40; Liver cancer: 11; Glioblastoma: 11; Cholangiocarcinoma: 8

Variant: EGFR L861Q
Top 5 associated “sensitive” treatments (weighted evidence score)*: Osimertinib: 507; Gefitinib: 336; Erlotinib: 302; Afatinib: 253; Crizotinib: 136
Top 5 associated “resistant” treatments (weighted evidence score)*: Lazertinib: 6; -; -; -; -
Other cancers associated with the variant (weighted evidence score)*: Colon cancer: 84; Breast cancer: 51; Pancreatic cancer: 40; Squamous cell cancer: 39; Glioblastoma: 12

Variant: EGFR L747P
Top 5 associated “sensitive” treatments (weighted evidence score)*: Afatinib: 234; Cetuximab: 39; -; -; -
Top 5 associated “resistant” treatments (weighted evidence score)*: Osimertinib: 487; -; -; -; -
Other cancers associated with the variant (weighted evidence score)*: NA

*Colored co-associations represent high-confidence scores (≥ 80th percentile), based on evidence-weighted co-occurrence in the literature and filtered to retain only the strongest edges for network robustness.
Abbreviations

AI: artificial intelligence
API: application programming interface
CIViC: Clinical Interpretation of Variations in Cancer
CUP: cancer of unknown primary
DOID: disease ontology ID
FDR: false discovery rate
HGVS: Human Genome Variation Society
HRD: homologous recombination deficiency
LLMs: large language models
MMR: mismatch repair
MONDO: Monarch Disease Ontology
mCRPC: metastatic castration-resistant prostate cancer
MSI: microsatellite instability
NCIt: National Cancer Institute thesaurus
NER: named entity recognition
NGS: next-generation sequencing
NLP: natural language processing
NSCLC: non-small cell lung cancer
OBI: Ontology of Biomedical Investigations
OCR: optical character recognition
PC: prostate cancer
RAG: retrieval-augmented generation
RWE: real-world evidence
SNP: single-nucleotide polymorphism
TKIs: tyrosine kinase inhibitors
TMB: tumor mutational burden
UMAP: uniform manifold approximation and projection
VUS: variants of uncertain significance

Declarations

Author contributions

Conception and design: MW, JH. Data collection, analysis, and statistics: MW, TN, JH. Data interpretation: MW, MB, TP, MF, CR, JH. Wrote the first draft of the paper: MW. Wrote the final version of the paper: MW, MB, TP, TN, MF, CR, JH. Approved the paper for submission and publication: MW, MB, TP, TN, MF, CR, JH.

Acknowledgments

This study was funded by the School of Medicine at the University of St.Gallen in Switzerland under grant number 2300380.

Competing interests

TP has received travel support from Janssen and Bayer. MF has received compensation from Bristol-Myers Squibb, MSD, AstraZeneca, Boehringer Ingelheim, Roche, Takeda, Pfizer, Janssen, Daiichi-Sankyo, and PharmaMar (Advisory Board, institutional). CR has received compensation from Pfizer, Bristol-Myers Squibb, and MSD Oncology (Advisory Board, institutional). MW, MB, TN, and JH declare no competing interests relevant to this paper.
Data availability statement

The datasets generated and/or analyzed during the current study are available in a Zenodo repository and can be accessed via this link: https://zenodo.org/records/15268056.

Code availability statement

The underlying code and validation datasets for this study are available in a GitHub repository and can be accessed via this link: https://github.com/hastingslab-org/variantscape.

References

Malone, E. R., Oliva, M., Sabatini, P. J. B., Stockley, T. L. & Siu, L. L. Molecular profiling for precision cancer therapies. Genome Medicine 12 , 8 (2020). Gibbs, S. N. et al. Comprehensive review on the clinical impact of next-generation sequencing tests for the management of advanced cancer. JCO Precis Oncol 7 , e2200715 (2023). Mendiratta, G. et al. Cancer gene mutation frequencies for the U.S. population. Nat Commun 12 , 5961 (2021). Kris, M. G. et al. Using multiplexed assays of oncogenic drivers in lung cancers to select targeted drugs. JAMA 311 , 1998–2006 (2014). Dankner, M., Rose, A. A. N., Rajkumar, S., Siegel, P. M. & Watson, I. R. Classifying BRAF alterations in cancer: new rational therapeutic strategies for actionable mutations. Oncogene 37 , 3183–3199 (2018). Chevallier, M., Borgeaud, M., Addeo, A. & Friedlaender, A. Oncogenic driver mutations in non-small cell lung cancer: Past, present and future. World J Clin Oncol 12 , 217–237 (2021). Kumar, A. & Kumar, A. Non-small-cell lung cancer-associated gene mutations and inhibitors. Advances in Cancer Biology - Metastasis 6 , 100076 (2022). Soria, J.-C. et al. Osimertinib in untreated EGFR-mutated advanced non-small-cell lung cancer. N Engl J Med 378 , 113–125 (2018). Mateo, J. et al. Olaparib for the treatment of patients with metastatic castration-resistant prostate cancer and alterations in BRCA1 and/or BRCA2 in the PROfound trial. JCO 42 , 571–583 (2024). Liu, Q. et al. A novel BRCA2 mutation in prostate cancer sensitive to combined radiotherapy and androgen deprivation therapy.
Cancer Biol Ther 19 , 669–675 (2018). Schwartzberg, L., Kim, E. S., Liu, D. & Schrag, D. Precision Oncology: Who, How, What, When, and When Not? Am Soc Clin Oncol Educ Book 160–169 (2017) doi:10.1200/EDBK_174176. Sondka, Z. et al. COSMIC: a curated database of somatic variants and clinical data for cancer. Nucleic Acids Research 52 , D1210–D1217 (2024). Griffith, M. et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat Genet 49 , 170–174 (2017). Landrum, M. J. et al. ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42 , D980-985 (2014). Chakravarty, D. et al. OncoKB: A precision oncology knowledge base. JCO Precis Oncol 2017 , PO.17.00011 (2017). Gazola, A. A., Lautert-Dutra, W., Archangelo, L. F., Reis, R. B. dos & Squire, J. A. Precision oncology platforms: Practical strategies for genomic database utilization in cancer treatment. Molecular Cytogenetics 17 , 28 (2024). Allot, A. et al. Tracking genetic variants in the biomedical literature using LitVar 2.0. Nat Genet 55 , 901–903 (2023). Wagner, A. H. et al. A harmonized meta-knowledgebase of clinical interpretations of somatic genomic variants in cancer. Nat Genet 52 , 448–457 (2020). Howard, M. et al. VarStack: a web tool for data retrieval to interpret somatic variants in cancer. Database (Oxford) 2020 , baaa092 (2020). Lee, K., Wei, C.-H. & Lu, Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 22 , bbaa142 (2020). Neumann, M., King, D., Beltagy, I. & Ammar, W. SciSpaCy: Fast and robust models for biomedical natural language processing. in Proceedings of the 18th BioNLP Workshop and Shared Task (eds. Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 319–327 (Association for Computational Linguistics, Florence, Italy, 2019). doi:10.18653/v1/W19-5034. Devlin, J., Chang, M.-W., Lee, K. 
& Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019). Jolly, A., Pandey, V., Singh, I. & Sharma, N. Exploring biomedical named entity recognition via SciSpaCy and BioBERT models. TOBEJ 18, e18741207289680 (2024). Kühnel, L. & Fluck, J. We are not ready yet: Limitations of state-of-the-art disease named entity recognizers. Journal of Biomedical Semantics 13, 26 (2022). Alamro, H., Gojobori, T., Essack, M. & Gao, X. BioBBC: A multi-feature model that enhances the detection of biomedical entities. Sci Rep 14, 7697 (2024). Huang, D.-L. et al. A combined manual annotation and deep-learning natural language processing study on accurate entity extraction in hereditary disease related biomedical literature. Interdiscip Sci Comput Life Sci 16, 333–344 (2024). Doneva, S. E. et al. Large language models to process, analyze, and synthesize biomedical texts: A scoping review. Discov Artif Intell 4, 107 (2024). Wosny, M. & Hastings, J. Large language models for detection of genetic variants in biomedical literature. Studies in Health Technology and Informatics (preprint) (2025). Wei, C.-H., Harris, B. R., Kao, H.-Y. & Lu, Z. tmVar: A text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 29, 1433–1439 (2013). Wei, C.-H. et al. PubTator 3.0: An AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Research 52, W540–W546 (2024). Pasche, E. et al. Variomes: A high recall search engine to support the curation of genomic variants. Bioinformatics 38, 2595–2601 (2022). Wosny, M. & Hastings, J. Variantscape: Python Notebooks. (2025). Wosny, M. Variantscape datasets. https://doi.org/10.5281/zenodo.15268056. OpenAlex & OurResearch. OpenAlex API. (2025). Thermo Fisher Scientific. Oncomine TM Comprehensive Assay v3. 
Thermo Fisher Scientific https://www.thermofisher.com/uk/en/home/clinical/preclinical-companion-diagnostic-development/oncomine-oncology/oncomine-cancer-research-panel-workflow.html (2025). Wosny, M. & Hastings, J. Automated gene identification in oncology literature: A comparative evaluation of natural language processing approaches. Studies in Health Technology and Informatics (preprint) (2025). Alvaro Alonso Casero & librAIry Team. BioBERT Genetic NER Model alvaroalon2/biobert_genetic_ner at Hugging Face. (2025). MyGene.info documentation — MyGene.info 3.0 documentation. https://docs.mygene.info/en/latest/index.html. Cohen, A. & SeatGeek Inc. FuzzyWuzzy: Fuzzy string matching in Python. SeatGeek (2025). Allen Institute for AI. scispaCy: SpaCy models for biomedical text processing. (2025). Schriml, L. M. et al. Human Disease Ontology 2018 update: Classification, content and workflow expansion. Nucleic Acids Res 47, D955–D962 (2019). Vasilevsky, N. A. et al. Mondo: Unifying diseases for the world, by the world. 2022.04.13.22273750 Preprint at https://doi.org/10.1101/2022.04.13.22273750 (2022). European Bioinformatics Institute (EMBL-EBI). MONDO ontology API. (2025). Disease Ontology Consortium. Disease Ontology Knowledge Base (DO-KB) API. (2025). McDonnell Genome Institute, Washington University School of Medicine. CIViC API. (2025). National Cancer Institute (NCI). NCI Thesaurus (NCIt) via EVS Explore API. (2025). Dennstädt, F. General-classifier: Multi-topic text classification with LLMs. (2025). Dennstädt, F. et al. Application of a general LLM-based classification system to retrieve information about oncological trials. medRxiv 2024.12.03.24318390 (2024) doi:10.1101/2024.12.03.24318390. DeepInfra. DeepInfra API. (2025). Bandrowski, A. et al. The ontology for biomedical investigations. PLOS ONE 11, e0154556 (2016). Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. Ai2 (2025). Allen Institute for AI. 
SciBERT scivocab_uncased at Hugging Face. (2025). González-Márquez, R., Schmidt, L., Schmidt, B. M., Berens, P. & Kobak, D. The landscape of biomedical research. Patterns 5, 100968 (2024). Ensembl / European Bioinformatics Institute (EMBL-EBI). Ensembl variant recoder API. (2025). Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. in 11–15 (Pasadena, California, 2008). doi:10.25080/TCWV9851. DrugBank. DrugBank online. DrugBank https://go.drugbank.com/ (2025). U.S. National Library of Medicine. ClinicalTrials.gov. ClinicalTrials.gov https://clinicaltrials.gov/ (2025). National Center for Biotechnology Information (NCBI). PubMed. PubMed https://pubmed.ncbi.nlm.nih.gov/. Wosny, M. et al. Variantscape study design cluster map. https://evidencedb.hastingslab.org/variantscape/studydesignclustermap (2025). Fontana, D., Ceccon, M., Gambacorti-Passerini, C. & Mologni, L. Activity of second-generation ALK inhibitors against crizotinib-resistant mutants in an NPM-ALK model compared to EML4-ALK. Cancer Med 4, 953–965 (2015). Wosny, M. et al. Variantscape network graph. https://evidencedb.hastingslab.org/variantscape/networkgraph (2025). John, T. et al. Uncommon EGFR mutations in non-small-cell lung cancer: A systematic literature review of prevalence and clinical outcomes. Cancer Epidemiol 76, 102080 (2022). Wu, H., Feng, J., Lu, S. & Huang, J. A large-scale, multicenter characterization of BRAF G469V/A-mutant non-small cell lung cancer. Cancer Med 13, e7305 (2024). Wosny, M. et al. Variantscape web tool. https://evidencedb.hastingslab.org/variantscape (2025). Pich, O. et al. The translational challenges of precision oncology. Cancer Cell 40, 458–478 (2022). Blair, L. M. et al. Oncogenic context shapes the fitness landscape of tumor suppression. Nat Commun 14, 6422 (2023). Xia, Y., Sun, M., Huang, H. & Jin, W.-L. Drug repurposing for cancer therapy. Sig Transduct Target Ther 9, 1–33 (2024). Prahallad, A. 
et al. Unresponsiveness of colon cancer to BRAF(V600E) inhibition through feedback activation of EGFR. Nature 483, 100–103 (2012). Misale, S. et al. Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature 486, 532–536 (2012). Morris, L. G. T. & Chan, T. A. Therapeutic targeting of tumor suppressor genes. Cancer 121, 1357–1368 (2015). Robson, M. et al. Olaparib for metastatic breast cancer in patients with a germline BRCA mutation. New England Journal of Medicine 377, 523–533 (2017). Le, D. T. et al. PD-1 blockade in tumors with mismatch-repair deficiency. New England Journal of Medicine 372, 2509–2520 (2015). Greco, F. A., Labaki, C. & Rassy, E. Molecular diagnosis and site-specific therapy in cancer of unknown primary: An important milestone. The Lancet Oncology 25, 955–956 (2024). Ikeda, S., Elkin, S. K., Tomson, B. N., Carter, J. L. & Kurzrock, R. Next-generation sequencing of prostate cancer: Genomic and pathway alterations, potential actionability patterns, and relative rate of use of clinical-grade testing. Cancer Biol Ther 20, 219–226 (2018). Additional Declarations Competing interest reported. TP has received travel support from Janssen and Bayer. MF has received compensation from Bristol-Myers Squibb, MSD, AstraZeneca, Boehringer Ingelheim, Roche, Takeda, Pfizer, Janssen, Daiichi-Sankyo, and PharmaMar (Advisory Board, institutional). CR has received compensation from Pfizer, Bristol-Myers Squibb, and MSD Oncology (Advisory Board, institutional). MW, MB, TN, and JH declare no competing interests relevant to this paper. Supplementary Files SupplementarymaterialWosnyVariantscapeUsingLLMstoBuildaComprehensiveLandscapeofVariants.docx 
{"props":{"pageProps":{"initialData":{"identity":"rs-6614711","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":463294990,"identity":"a9764d36-d1f6-450a-9915-a28c78a077b3","order_by":0,"name":"Marie Wosny","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABIUlEQVRIie3QPUvEMBjA8acUessD51g4v4IQODgFy/WrNBTiEo8D99KpLnUvCPoVzuXALSWDS0/XgC63dO4kJxyHaX2hQquOgvkveQL5kRAAk+mPJgAhaO1tAUDqAX9NnOBHUtcmSD6GzqMH56u1qPa92fDyIq+sxPMPXf48ms+jaTxYiS4yKU5IniE7c5/uQ9dKGL3NTpejjMgwxlnQSQQDiShprDhxraUMiNIEiQjB/XzhV/JQgtxqcq34eKOJTxQvNYn6idK3gCYLxSf1LZYeHE3saT8pIU+R0RtNjuiO0UVRjo+RyMBB3vMwZlcvqUev9MNUVXg+uQvXj7iN/OGg6CRNVvo+ND+09/ZPNOk9X7dpb4aiWfxvhclkMv2nXgFp52Q5Kk5/TAAAAABJRU5ErkJggg==","orcid":"","institution":"School of Medicine, University of St.Gallen","correspondingAuthor":true,"prefix":"","firstName":"Marie","middleName":"","lastName":"Wosny","suffix":""},{"id":463294991,"identity":"63665e3d-73d5-4bab-937d-0b4b08569144","order_by":1,"name":"Maximilian Boesch","email":"","orcid":"","institution":"Department of Medical Oncology and Hematology, HOCH Health Ostschweiz, Kantonsspital St.Gallen, Universitäres Lehr- und 
Forschungsspital","correspondingAuthor":false,"prefix":"","firstName":"Maximilian","middleName":"","lastName":"Boesch","suffix":""},{"id":463294992,"identity":"e7077fc4-4368-4f41-aeaf-e68883b562bc","order_by":2,"name":"Tobias Peres","email":"","orcid":"","institution":"Department of Medical Oncology and Hematology, HOCH Health Ostschweiz, Kantonsspital St.Gallen, Universitäres Lehr- und Forschungsspital","correspondingAuthor":false,"prefix":"","firstName":"Tobias","middleName":"","lastName":"Peres","suffix":""},{"id":463294994,"identity":"90b6a7aa-b587-42c3-b73f-a0c0e1dcf006","order_by":3,"name":"Thibault Niederhauser","email":"","orcid":"","institution":"School of Medicine, University of St.Gallen","correspondingAuthor":false,"prefix":"","firstName":"Thibault","middleName":"","lastName":"Niederhauser","suffix":""},{"id":463294995,"identity":"555d8a1b-4907-4911-93f1-90fffb408ab6","order_by":4,"name":"Martin Früh","email":"","orcid":"","institution":"Department of Medical Oncology and Hematology, HOCH Health Ostschweiz, Kantonsspital St.Gallen, Universitäres Lehr- und Forschungsspital","correspondingAuthor":false,"prefix":"","firstName":"Martin","middleName":"","lastName":"Früh","suffix":""},{"id":463294996,"identity":"bdd70331-7a27-4b23-b1f2-1bb29510845d","order_by":5,"name":"Christian Rothermundt","email":"","orcid":"","institution":"Department of Medical Oncology and Cancer Center, Luzerner Kantonsspital (LUKS)","correspondingAuthor":false,"prefix":"","firstName":"Christian","middleName":"","lastName":"Rothermundt","suffix":""},{"id":463294997,"identity":"370eee95-9c1d-41b4-865f-c95395dd86f8","order_by":6,"name":"Janna Hastings","email":"","orcid":"","institution":"School of Medicine, University of St.Gallen","correspondingAuthor":false,"prefix":"","firstName":"Janna","middleName":"","lastName":"Hastings","suffix":""}],"badges":[],"createdAt":"2025-05-07 
19:38:17","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6614711/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6614711/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":83621077,"identity":"f02dd42a-8d36-4e6d-a171-5cf7ca70a772","added_by":"auto","created_at":"2025-05-29 15:08:47","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":81252,"visible":true,"origin":"","legend":"\u003cp\u003eWaterfall chart of the data preprocessing and variant extraction pipeline.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/1b5347a69809308369562813.png"},{"id":83621084,"identity":"2ea1467a-e50a-473a-960e-8c8a0b308609","added_by":"auto","created_at":"2025-05-29 15:08:49","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":91919,"visible":true,"origin":"","legend":"\u003cp\u003eTop 15 most frequently mentioned genes across extracted 308,748 biomedical cancer-related publications.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/6c7d368d74f6c3d8901a57f3.png"},{"id":83621079,"identity":"240e259c-516e-4056-b783-7ee9a985cd28","added_by":"auto","created_at":"2025-05-29 15:08:48","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":135488,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of the most frequently mentioned cancer types identified from 199,726 articles (64.68%) of the total 308,748 publications screened.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/7caa77d5ce06f9a2a12d16df.png"},{"id":83621095,"identity":"d053dc36-2e19-41e8-b120-7f09a04655ed","added_by":"auto","created_at":"2025-05-29 15:08:51","extension":"png","order_by":4,"title":"Figure 
4","display":"","copyAsset":false,"role":"figure","size":650983,"visible":true,"origin":"","legend":"\u003cp\u003eUMAP-based visualization of embeddings from 199,726 articles by study design category, classified by LLaMA3.3-70b.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/26c7b57333a6996e6c525d8e.png"},{"id":83621120,"identity":"7c752e05-928b-4aae-9471-a4da855cae46","added_by":"auto","created_at":"2025-05-29 15:08:52","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":130537,"visible":true,"origin":"","legend":"\u003cp\u003eTop 20 most frequent variants identified in analysis dataset of 7,524 articles.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/97b21976c208f196003387f5.png"},{"id":83621066,"identity":"838a607a-8feb-422a-a3f4-8727298347d3","added_by":"auto","created_at":"2025-05-29 15:08:46","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":197666,"visible":true,"origin":"","legend":"\u003cp\u003eHeatmap showing high-confidence variant-treatment co-associations across multiple cancer types. The 50 strongest positive and negative associations are selected based on evidence-weighted scores, supplemented with curated clinically relevant treatment-variant pairs. 
Green indicates LLM-predicted sensitivity and red indicates resistance.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/7b2d666a14daaebc479c3b1b.png"},{"id":83621123,"identity":"16aaf661-40ee-47b2-b546-ac47e28f35ab","added_by":"auto","created_at":"2025-05-29 15:08:52","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":90251,"visible":true,"origin":"","legend":"\u003cp\u003eDot plot representing top five variant co-associations by cancer type based on weighted co-occurrence score.\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/59a45db9d5800245441f2f12.png"},{"id":83621086,"identity":"f452a450-2f7f-40aa-a6b6-221567194b63","added_by":"auto","created_at":"2025-05-29 15:08:50","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":114224,"visible":true,"origin":"","legend":"\u003cp\u003eInteractive web interface for exploring literature-derived variant, treatment, and cancer co-associations.\u003c/p\u003e","description":"","filename":"8.png","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/dc5c4ed931a701769af93930.png"},{"id":85090894,"identity":"a25db3f4-bbcc-48a1-bf40-a0c0c8427bf3","added_by":"auto","created_at":"2025-06-21 00:16:25","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2644069,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/6124084e-c258-49de-8f88-f956dab8179a.pdf"},{"id":83621098,"identity":"ea504b7a-a026-4fe9-988c-41fbf6a07d9b","added_by":"auto","created_at":"2025-05-29 
15:08:51","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":6728445,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementarymaterialWosnyVariantscapeUsingLLMstoBuildaComprehensiveLandscapeofVariants.docx","url":"https://assets-eu.researchsquare.com/files/rs-6614711/v1/c7a9ac3c042bf6e2396aa598.docx"}],"financialInterests":"Competing interest reported. TP has received travel support from Janssen and Bayer. MF has received compensation from Bristol-Myers Squibb, MSD, AstraZeneca, Boehringer Ingelheim, Roche, Takeda, Pfizer, Janssen, Daiichi-Sankyo, and PharmaMar (Advisory Board, institutional). CR has received compensation from Pfizer, Bristol-Myers Squibb, and MSD Oncology (Advisory Board, institutional). MW, MB, TN, and JH declare no competing interests relevant to this paper.","formattedTitle":"Variantscape: Using Large Language Models to Build a Comprehensive Landscape of Cancer Variants for Precision Oncology","fulltext":[{"header":"Introduction","content":"\u003cp\u003ePrecision oncology aims to tailor cancer treatments for a tumor\u0026rsquo;s unique genetic, molecular, or cellular characteristics, thus enabling more effective and targeted therapeutic interventions than conventional one-size-fits-all approaches\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. Advances in next-generation sequencing (NGS) have facilitated the detection of genetic alterations, providing robust platforms for identifying actionable variants that guide diagnostic classification, prognosis, and personalized treatment decisions\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. 
Large-scale studies integrating NGS with clinical data have demonstrated the clinical relevance of various genetic alterations, including missense mutations, structural variants, and functional signatures such as microsatellite instability (MSI) and homologous recombination deficiency (HRD)\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. Numerous clinically relevant alterations have been successfully targeted with precision therapies, such as \u003cem\u003eBRAF\u003c/em\u003e V600E, an oncogenic variant occurring in approximately 6% of all cancers and 40\u0026ndash;50% of melanomas, where it activates the MAPK/ERK pathway and responds to combined BRAF and MEK inhibition\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. Similarly, common \u003cem\u003eEGFR\u003c/em\u003e alterations in non-small cell lung cancer (NSCLC), including exon 19 deletions and L858R, are effectively treated with tyrosine kinase inhibitors (TKIs)\u003csup\u003e\u003cspan additionalcitationids=\"CR7\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Alterations in \u003cem\u003eBRCA\u003c/em\u003e1/2, particularly in prostate cancer, have shown responsiveness to PARP inhibitors, offering clinical benefit despite their association with more aggressive disease\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. 
However, many variants remain poorly understood, especially in rare cancers or underrepresented populations, limiting their clinical utility due to unclear biological and functional relevance\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eGenomic knowledge bases, such as COSMIC\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e, CIViC\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e, ClinVar\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e, and OncoKB\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e, provide curated resources that support the functional interpretation of molecular variants, enabling clinicians and researchers to assess their clinical significance and inform personalized treatment strategies\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. Despite extensive efforts to curate and standardize variant interpretations, knowledge remains fragmented and inconsistently represented due to the overwhelming volume of medical literature, the high cost of manual expert curation, and the limited scalability of these efforts, ultimately restricting the integration and coverage of genomic variants across platforms\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. 
Consequently, clinicians often need to consult multiple databases and review additional literature in parallel to interpret variants accurately, making the process time-consuming, inefficient, and non-standardized\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. Additionally, keyword-based search methods are hindered by inconsistent naming conventions and varied descriptions of the same variant. These include variant annotations defined by the Human Genome Variation Society (HGVS), DNA changes, protein changes (and within these, one-letter vs. three-letter codes), identifiers, and single-nucleotide polymorphism (SNP) IDs, all of which may be used interchangeably, complicating variant interpretation across platforms\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eTraditional information retrieval methods, including keyword-based searches, named entity recognition (NER) models such as SciSpaCy\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e and transformer-based models such as BERT\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, have improved the extraction of biomedical entities from literature\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. 
However, while these methods have improved basic entity extraction, they often struggle with the full complexity of variant mentions, particularly non-standardized nomenclature, contextual ambiguity, and rare event extraction, and show reduced performance on unseen datasets, limiting their ability to accurately extract novel or rare entities\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. Recent breakthroughs in artificial intelligence (AI), particularly the development of large language models (LLMs), are transforming how biomedical text is processed\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e,\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. LLMs offer advanced contextual understanding and semantic flexibility that allow them to capture nuanced variant expressions and resolve ambiguities that confound traditional models\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. Despite the potential of LLMs, current variant information extraction research is still dominated by earlier machine learning or rule-based systems, which face limitations. For instance, tmVar applied a machine learning system to detect sequence variants from PubMed abstracts based on HGVS nomenclature, though it primarily captures isolated mentions without broader contextual understanding\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. 
PubTator 3.0 expanded automated extraction of biomedical entities to more than a billion annotations across PubMed, but it does not apply LLMs directly to unstructured text for extracting ambiguous or novel variant mentions\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e. The Variomes tool improved retrieval of variant-related publications but relies on pre-annotated datasets and lacks dynamic extraction capabilities for novel variants\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. More recently, fine-tuned language models combined with manual annotation have enhanced variant entity recognition, yet these methods mainly focus on labeling rather than deeper contextual interpretation or relationship extraction\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. Together, these prior efforts highlight both the progress achieved and the persistent challenges in enabling scalable, context-aware extraction of heterogeneous and novel variant mentions from biomedical literature.\u003c/p\u003e \u003cp\u003eBuilding on these foundations, we developed an automated pipeline that also leverages recent advances in LLM techniques to systematically extract, standardize, and contextualize variant information from unstructured biomedical literature. Our approach addresses persistent challenges such as inconsistent nomenclature, rare and novel variant detection, and gaps in existing databases. Clinicians have traditionally lacked scalable tools to uncover rare or complex variant-treatment-cancer relationships buried in the literature, a gap this pipeline is designed to close. By incorporating co-occurrence metrics and network-based analyses, the pipeline further infers and ranks potential variant, treatment, and cancer co-associations in a systematic and scalable manner. 
We show that our system is able to provide structured, context-rich variant insights to support evidence-driven scientific lead generation. This allows clinicians to rapidly identify such associations, including rare or underreported ones, directly from the literature without time-consuming manual searches. The framework supports researchers and clinicians in generating hypotheses, uncovering underrecognized connections, and identifying potential applications of existing therapies across both common and rare variants, reducing the need for manual literature review and streamlining access to clinically relevant insights.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eAll analyses (Supplementary information 1) were conducted using Python 3.11.5. The complete code is available on a GitHub repository, which can be accessed at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/hastingslab-org/Variantscape\u003c/span\u003e\u003cspan address=\"https://github.com/hastingslab-org/Variantscape\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003csup\u003e32\u003c/sup\u003e, while the datasets are publicly available on Zenodo\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eLiterature retrieval\u003c/h2\u003e \u003cp\u003eBiomedical publications related to cancer research were retrieved from the OpenAlex database using its public API\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. A structured, broad search query including \u0026ldquo;cancer\u0026rdquo; OR \u0026ldquo;tumor\u0026rdquo;/\u0026ldquo;tumour\u0026rdquo; OR \u0026ldquo;carcinoma\u0026rdquo; and related synonyms was applied to search titles and abstracts and identify relevant articles published in English between 2014 and 2024. 
Paper titles, abstracts, and additional metadata were extracted, including publication year, authors, referenced citations, and associated research concepts.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eDataset cleaning and preprocessing\u003c/h3\u003e\n\u003cp\u003eThe retrieved dataset was cleaned and normalized to ensure quality and consistency. Duplicate records were removed based on unique paper identifiers and/or combinations of titles and authors. Articles missing titles, abstracts, or metadata were excluded, and an additional language detection step was applied to titles and abstracts to exclude articles in non-English languages. Supplementary materials, corrections, tables, figures, and artifacts were removed based on keywords or regular expression patterns. Moreover, unusually short or long texts were filtered out (i.e., titles with fewer than three or more than 60 words and abstracts with fewer than 80 or more than 1,000 words) alongside withdrawn articles, and non-original research content, such as editorials or letters. The text fields were cleaned to remove HTML artifacts, and both the titles and abstracts were normalized for consistent casing and structures. A temporal analysis of publication trends and patterns over the years was performed.\u003c/p\u003e\n\u003ch3\u003eInformation extraction\u003c/h3\u003e\n\u003cp\u003eTransformer-based named entity recognition for gene extraction\u003c/p\u003e \u003cp\u003eA curated list of 161 cancer-relevant genes from the Oncomine Comprehensive Assay v3 NGS panel (Thermo Fisher)\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e was used as a reference set for identifying gene-relevant publications. Following preprocessing, the objective was to filter articles mentioning these cancer-associated genes to enable a focused downstream genomic variant analysis. 
Initial approaches involving rule-based string matching and NLP using SciSpaCy models\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e were evaluated but found inadequate, as they failed to capture a sufficient number of relevant gene mentions, showed elevated false-positive rates, and lacked contextual understanding. In contrast, while more robust, LLMs did not outperform the BioBERT-based approach and were deemed computationally inefficient and too resource-intensive for this relatively constrained information extraction task\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. Therefore, the transformer-based BioBERT alvaroalon2/biobert_genetic_ner\u003csup\u003e37\u003c/sup\u003e model was employed. Genes from the Oncomine panel were programmatically expanded into synonyms via the MyGene.info API\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e, capturing official symbols, alternative names, and protein associations. Titles and abstracts were processed with BioBERT using a sliding window approach with a stride of 256 tokens to preserve context beyond the model\u0026rsquo;s 512-token limit. A multi-step normalization process addressed inconsistencies in gene nomenclature, species-specific differences, optical character recognition (OCR) errors, or typographic noise. This involved case-insensitive matching against a standardized reference list, punctuation removal, token-level comparison for formatting inconsistencies, and fuzzy matching\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e to unify lexical variants and biologically related terms. 
Detected genes were compiled into a binary matrix per article, where \u0026ldquo;1\u0026rdquo; indicated presence and \u0026ldquo;0\u0026rdquo; indicated absence, alongside a column counting total gene mentions. Articles without matched genes were excluded, yielding a filtered set of relevant publications.\u003c/p\u003e \u003cp\u003eCategorization of entities in articles\u003c/p\u003e \u003cp\u003eVarious approaches were systematically evaluated for extracting and categorizing biomedical entities. These included transformer-based NER models and LLMs. The assessed transformer-based BERT models\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e were chosen for their specialized pretraining on biomedical corpora. Rule-based and statistical models from the SciSpaCy suite, particularly en_ner_bionlp13cg_md and en_ner_bc5cdr_md, were also evaluated\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. Model selection was guided by empirical performance on a validation set. Performance was assessed using established evaluation metrics, including precision, recall, accuracy, and F1-score.\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eNER for automated cancer type identification\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eA multi-step NER pipeline was designed to extract, normalize, and categorize cancer types from unstructured text. Cancer mentions were identified from titles and abstracts using two complementary SciSpaCy models: \u0026ldquo;en_ner_bionlp13cg_md,\u0026rdquo; optimized for cancer-specific terms, and \u0026ldquo;en_ner_bc5cdr_md,\u0026rdquo; trained on more general disease entities\u003csup\u003e\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. 
To ensure consistency, a reference list of cancer types and synonyms was constructed by integrating definitions from the CIViC knowledgebase, mapping aliases to disease ontology ID (DOID)\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e and Monarch Disease Ontology (MONDO)\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e terms and expanding synonym coverage via the MONDO API\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e. Extracted mentions were cleaned through rule-based preprocessing, and synonyms were mapped to canonical terms through direct and fuzzy matching against a CIViC-MONDO dictionary. Each mention was standardized and assigned to a specific cancer type. A binary matrix was created to encode the presence or absence of each cancer type per article, excluding articles without any mapped cancer types. Additionally, standardized terms were linked to their ontological parents using Disease Ontology metadata retrieved via the API\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eOntology-guided NER for automated treatment identification\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eThe treatment extraction process involved querying the CIViC API\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e to retrieve a curated, clinically validated list of cancer therapies, including treatment names, National Cancer Institute Thesaurus (NCIt) identifiers, and aliases. 
NCIt was dynamically queried using the API\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e to obtain hierarchical parent concepts and overarching treatment categories. Treatment terms were extracted from articles using a hybrid rule-based approach combining regular expressions and fuzzy matching to identify treatment names and aliases. Ambiguous or short aliases were systematically filtered, and each publication was mapped to a binary matrix indicating the presence or absence of treatment mentions.\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eLLM-based classifier for automated categorization\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eTo classify study types, an automated, LLM-based approach was implemented using Meta\u0026rsquo;s LLaMA3.3-70b-Instruct model via the General Classifier framework and DeepInfra API\u003csup\u003e\u003cspan additionalcitationids=\"CR48\" citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e. 
Predefined categories from the Ontology of Biomedical Investigations (OBI)\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e\u003c/sup\u003e were consolidated into eight final categories: \u0026ldquo;clinical study,\u0026rdquo; \u0026ldquo;observational/RWE study,\u0026rdquo; \u0026ldquo;case report study,\u0026rdquo; \u0026ldquo;\u003cem\u003ein vitro\u003c/em\u003e study,\u0026rdquo; \u0026ldquo;\u003cem\u003ein vivo\u003c/em\u003e/animal study,\u0026rdquo; \u0026ldquo;\u003cem\u003ein silico\u003c/em\u003e study,\u0026rdquo; \u0026ldquo;systematic review study,\u0026rdquo; and \u0026ldquo;other.\u0026rdquo; Prompt engineering strategies were evaluated on a representative data subset, and the approach with the highest accuracy and labeling consistency was selected to classify the full dataset in batches. To analyze classification results, abstracts were embedded using the SciBERT allenai/scibert_scivocab_uncased model\u003csup\u003e\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e,\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e, producing 768-dimensional vectors by averaging final-layer token embeddings. Uniform Manifold Approximation and Projection (UMAP) was applied to reduce these embeddings to two dimensions, and an interactive 2D embedding-based cluster map was generated, with points colored by study design category and hover data for titles and publication years, enabling descriptive, exploratory analysis\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eLLM-based molecular variant extraction\u003c/p\u003e \u003cp\u003eThe main methodological innovation of this study was to extract specific gene variants from biomedical text using LLMs, as extraction with conventional NLP methods has proven insufficient. 
Variant extraction was performed using Meta\u0026rsquo;s LLaMA3.3-70b-Instruct accessed via the cloud service DeepInfra\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e. This model was selected based on comparative evaluations against other state-of-the-art models, such as GPT-4o and DeepSeek, with LLaMA3.3-70b demonstrating superior accuracy in identifying and structuring variant-gene pairs from biomedical text\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. The most effective prompt guided the model to detect HGVS notations, protein changes, and SNP IDs, while excluding vague or generic mentions. Extraction was performed in batches of 50,000 articles, with the cumulative runtime recorded.\u003c/p\u003e \u003cp\u003eExtracted variants were standardized through preprocessing steps such as prefix removal, amino acid canonicalization, exon mapping, and keyword filtering. API calls to external databases improved synonym coverage and provided validation, including CIViC for variant IDs, aliases, and assertions, ClinVar for SNP IDs and HGVS notations, and Ensembl Variant Recoder for normalization\u003csup\u003e\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e. Variants successfully matched to reference databases were standardized and included as unique variant-gene pairs in a binary matrix. Variants not found in these databases were retained in their original extracted form, without normalization or merging, to ensure their inclusion in downstream analyses and to preserve potentially novel findings. 
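To make the prefix removal and amino acid canonicalization steps concrete, the following Python sketch converts three-letter HGVS-style protein changes (e.g., p.Val600Glu) into the short one-letter form (V600E). It is a minimal illustration under stated assumptions rather than the pipeline's implementation: the function name, the regular expression, and the restriction to simple substitutions are hypothetical choices, and inputs that do not match are returned without conversion, in the spirit of retaining unmatched variants.

```python
import re

# Standard three-letter to one-letter amino acid codes ("Ter" denotes a stop codon).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}

# Hypothetical pattern: three letters, a position, three letters (substitutions only).
_SUBSTITUTION = re.compile(r"^([A-Za-z]{3})(\d+)([A-Za-z]{3})$")


def canonicalize_protein_change(raw: str) -> str:
    """Return a one-letter protein change such as 'V600E' when possible."""
    text = raw.strip()
    if text.lower().startswith("p."):
        text = text[2:]          # prefix removal
    text = text.strip("()")      # drop HGVS prediction parentheses
    match = _SUBSTITUTION.match(text)
    if match:
        ref, position, alt = match.groups()
        ref_one = THREE_TO_ONE.get(ref.capitalize())
        alt_one = THREE_TO_ONE.get(alt.capitalize())
        if ref_one and alt_one:
            return f"{ref_one}{position}{alt_one}"
    return text  # not a recognized three-letter substitution: leave as-is
```

Mapping different spellings of the same change onto a single key in this way is what allows mentions from different articles to be counted as one variant-gene pair.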
Finally, all datasets were merged using paper IDs as a common identifier for co-association analysis.\u003c/p\u003e\n\u003ch3\u003eComputational analysis of weighted co-occurrence networks and LLM-based categorization of relationships\u003c/h3\u003e\n\u003cp\u003eCo-associations among variants, treatments, and cancer types were assessed through co-occurrence analysis. Weighted matrix multiplication generated adjacency tables capturing pairwise combinations across entity types, specifically treatment-variant, cancer-variant, and treatment-cancer relationships. Statistical significance of each combination was assessed by Fisher\u0026rsquo;s exact test with Benjamini-Hochberg False Discovery Rate (FDR) correction (adjusted \u003cem\u003ep\u0026thinsp;\u0026le;\u0026thinsp;0.05\u003c/em\u003e), enabling identification of high-confidence associations enriched in the literature.\u003c/p\u003e \u003cp\u003eIn parallel, an undirected, evidence-weighted co-occurrence network was constructed using the Python library NetworkX\u003csup\u003e\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e\u003c/sup\u003e, incorporating all observed co-occurrences, regardless of statistical significance. This broader inclusion was intended to capture weaker or rare signals, particularly for under-characterized variants and cancers that may be underpowered for significance testing but may still reflect clinically relevant relationships. Nodes in the network represented variants, cancers, and treatments, while edges were weighted by corresponding matrix values. Network analyses, including degree centrality, clustering coefficients, and community detection, were applied to uncover hubs, patterns of connectivity, and modular structure within the literature-derived variant landscape.\u003c/p\u003e \u003cp\u003eTo refine the interpretation of variant-treatment co-occurrence, a secondary annotation layer using LLaMA3.3-70B was applied. 
Each pairwise variant and treatment co-occurrence was extracted and classified by the LLM into one of five relationship categories reflecting different ways that variants can impact treatments: \u0026ldquo;sensitive/effective,\u0026rdquo; \u0026ldquo;resistant,\u0026rdquo; \u0026ldquo;diagnostic,\u0026rdquo; \u0026ldquo;unrelated,\u0026rdquo; or \u0026ldquo;unknown.\u0026rdquo; As the same variant-treatment relationship was often discussed across multiple abstracts, potentially with differing classifications, a tiered consensus strategy was applied. Consensus for each variant-treatment pair was determined through strict agreement, \u0026ge;\u0026thinsp;80% agreement, and rule-based fallback logic. This logic resolved remaining cases by analyzing label distributions and applying heuristics that assigned the most frequent weak label when only weak predictions were present, and favored clinically meaningful labels (e.g., \u0026ldquo;sensitive\u0026rdquo;, \u0026ldquo;resistant\u0026rdquo;, \u0026ldquo;diagnostic\u0026rdquo;) over weaker ones (e.g., \u0026ldquo;unknown\u0026rdquo;, \u0026ldquo;unrelated\u0026rdquo;). Pairs with conflicting or ambiguous distributions remained classified as \u0026ldquo;no consensus.\u0026rdquo;\u003c/p\u003e \u003cp\u003eQuerying the network for specific variants revealed closely related treatments and cancers through evidence‑weighted scores. Three authors (MB, TP, MF) manually validated strong co-associations, leveraging their clinical and molecular expertise, as well as external data and knowledge bases such as DrugBank\u003csup\u003e\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e\u003c/sup\u003e, ClinicalTrials.gov\u003csup\u003e\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e\u003c/sup\u003e, PubMed\u003csup\u003e\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e\u003c/sup\u003e, and Google Scholar. 
A targeted literature search was performed to identify rare or under-characterized variants in selected cancer types. These variants were then queried in the network to retrieve associated treatments, and the resulting co-associations were subsequently verified through additional literature review. Based on expert review to ensure network robustness, confidence filtering was applied to retain only those edges falling above the 80th percentile of the evidence‑weighted co‑occurrence distribution. Finally, the Python data analysis scripts were integrated into a Flask application to generate interactive HTML pages. The web tool was deployed to a cloud-hosted WSGI server, making it publicly accessible.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003ePerformance summary of information extraction pipeline\u003c/h2\u003e \u003cp\u003eEach component of the information extraction pipeline was evaluated against manually annotated ground truth to determine the best-performing method. Gene extraction using the BioBERT-based \u003cem\u003ebiobert_genetic_ner\u003c/em\u003e model achieved the strongest results (F1-score: 0.98, Recall: 1.00)\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. Cancer type extraction was performed via the two SciSpaCy models en_ner_bionlp13cg_md and en_ner_bc5cdr_md combined with ontology-guided normalization (F1-score: 0.89, Recall: 0.80) (Supplementary information 2). Treatment extraction using a hybrid ontology-based and fuzzy matching approach demonstrated high reliability (F1-score: 0.95, Recall: 0.95) (Supplementary information 3). Variant extraction using Meta\u0026rsquo;s LLaMA3.3-70B model outperformed other evaluated LLMs in extracting molecular variants (F1-score: 0.95, Recall: 0.92)\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. 
LLM-based classification of study types produced consistent results across categories (F1-score: 0.92, Recall: 0.92) (Supplementary information 4). Finally, variant-treatment relationship classification using LLaMA3.3-70B achieved reliable consensus (F1-score: 0.85, Recall: 0.84) (Supplementary information 5).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eStudy selection and characteristics\u003c/h3\u003e\n\u003cp\u003eA total of 2,775,913 biomedical articles related to cancer research and published between 2014 and 2024 were retrieved from the OpenAlex database. After systematic preprocessing, 647,595 records (23.3%) were excluded due to duplicates (6.0%), missing metadata (1.6%), non-English texts (1.6%), withdrawn articles (0.1%), nonsensical titles (0.1%), length anomalies (8.3%), non-original research (0.7%), normalization errors (0.1%), and artifacts or entries misclassified as standalone articles but actually representing supplementary materials (4.8%). The final cleaned dataset comprised 2,128,318 articles (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Temporal analysis revealed a steady increase in publication volume from 150,000 articles in 2014 to over 250,000 articles by 2022, with a slight decline in 2024. Monthly publication spikes could be observed, which likely correspond to conference or publication cycles, while consistent weekly peaks suggest routine publishing practices (Supplementary information 6).\u003c/p\u003e\n\u003ch3\u003eTransformer-based gene extraction\u003c/h3\u003e\n\u003cp\u003eFrom the preprocessed dataset of 2,128,318 publications, the BioBERT model identified 308,748 articles (14.51%) containing genes or gene-related products from the Oncomine assay in their titles and/or abstracts. Consequently, 1,819,570 articles (85.49%) were excluded. 
In the filtered dataset, \u003cem\u003eTP53\u003c/em\u003e was the most frequently mentioned gene in 50,138 articles (16.24%), followed by \u003cem\u003eEGFR\u003c/em\u003e (43,816; 14.19%), \u003cem\u003eAKT1\u003c/em\u003e (34,337; 11.12%), \u003cem\u003eMTOR\u003c/em\u003e (19,864; 6.43%), and \u003cem\u003eKRAS\u003c/em\u003e (19,364; 6.27%), with some mentions potentially amplified by pathway-specific research focus (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). The distribution of gene mentions per publication revealed that most articles focused on only one gene (189,309; 61.3%) or two genes (66,034; 21.4%), with an average of 1.79 genes per article. A steep decline was observed as the number of genes per article increased, with only a small fraction mentioning more than three genes (Supplementary information 7).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eCategorization of biomedical entities\u003c/h2\u003e \u003cp\u003eNER-based cancer extraction identified at least one specific cancer type in 199,726 (64.69%) of the 308,748 screened articles. A total of 225 distinct cancer types were identified, with the most frequently mentioned being breast cancer (40,002 articles; 20.03%), colon cancer (28,790; 14.41%), lung cancer (27,590; 13.81%), glioma (17,437; 8.73%), prostate cancer (16,208; 8.12%), liver cancer (15,093; 7.65%), pancreatic cancer (13,649; 6.83%), ovarian cancer (13,106; 6.56%) and melanoma (9,956; 4.98%), to some extent reflecting the high incidence/prevalence of these cancer types and/or their frontrunner status with regard to targeted and immunotherapeutic drug development (cf. 
lung cancer, melanoma) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAmong these, most articles were classified by the LLM as \u003cem\u003ein vitro\u003c/em\u003e study (80,899; 40.50%) or clinical study (52,769; 26.42%). Other common categories included \u003cem\u003ein vivo\u003c/em\u003e/animal studies (25,165; 12.60%), \u003cem\u003ein silico\u003c/em\u003e studies (12,212; 6.11%), and systematic reviews (12,158; 6.09%). Less frequent were case reports (9,753; 4.88%) and observational/RWE studies (5,955; 2.98%). A small subset of articles (815; 0.40%) was classified as \u0026ldquo;other\u0026rdquo;, either due to divergence from predefined design categories or insufficient information resulting in ambiguous classification (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, Supplementary information 8, interactive cluster plot accessible at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://evidencedb.hastingslab.org/variantscape/studydesignclustermap\u003c/span\u003e\u003cspan address=\"https://evidencedb.hastingslab.org/variantscape/studydesignclustermap\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003csup\u003e59\u003c/sup\u003e).\u003c/p\u003e \u003cp\u003eOut of the 308,748 gene-harboring articles, 126,195 (40.87%) mentioned specific treatments, while 182,553 articles (59.13%) did not include any treatment references that could be matched with CIViC or NCIt mappings. 
The most frequently mentioned treatments included cisplatin (11,770 articles; 4.36%), sirolimus (6,874; 2.54%), paclitaxel (3,892; 1.44%), doxorubicin (3,700; 1.37%), erlotinib (3,608; 1.34%), gefitinib (3,554; 1.32%), and cetuximab (3,377; 1.25%) (Supplementary information 9).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eLLM-based variant extraction\u003c/h2\u003e \u003cp\u003eThe LLM-based variant extraction pipeline processed 199,726 cancer-related articles in 2 days, 11 hours, and 49 minutes using LLaMA3.3-70b at a rate of 1.075 articles/s. Variants were identified in the titles and/or abstracts of 16,508 articles (8.27%), while 183,218 articles (91.73%) did not contain variant mentions. This subset constituted approximately 1% of all articles initially retrieved from OpenAlex. Following normalization, a total of 35,230 variant mentions across the dataset were identified, of which 11,950 variants were unique, indicating that many articles referenced multiple variants in their titles or abstracts. The most frequently mentioned variants included \u003cem\u003eBRAF\u003c/em\u003e V600E in 3,859 articles (23.38%), \u003cem\u003eKRAS\u003c/em\u003e G12D (1,489; 9.02%), \u003cem\u003eEGFR\u003c/em\u003e T790M (1,196; 7.24%), \u003cem\u003eEGFR\u003c/em\u003e L858R (1,105; 6.69%), \u003cem\u003eKRAS\u003c/em\u003e G12C (743; 4.50%), and \u003cem\u003eKRAS\u003c/em\u003e G12V (590; 3.57%) (Supplementary information 10).\u003c/p\u003e \u003cp\u003eFurther analysis showed that the majority of variant mentions (30,086; 85.40%) were associated with genes included in the Oncomine assay. Among the 161 Oncomine genes considered, 149 (92.55%) had at least one reported variant, while 12 genes (7.45%) had no variant mention. 
The remaining 5,144 variants (14.60%) were linked to 1,524 unique genes not included in the Oncomine panel (Supplementary information 11).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eCo-occurrence analysis\u003c/h2\u003e \u003cp\u003eOut of 308,748 articles (100%) analyzed, 199,726 (64.69%) mentioned at least one specific cancer type, 126,195 (40.87%) referenced a specific treatment, and 16,508 (5.35%) included a variant. Only 7,524 articles (2.44%) were at the intersection, mentioning all three entities, thus qualifying for statistical co-association analysis (Supplementary information 12). This dataset included 377 unique treatments, 98 unique cancer types, and 4,029 unique variants. The most frequently mentioned specific treatments were vemurafenib (612; 5.82%), osimertinib (564; 5.35%), and trametinib (521; 4.95%) (Supplementary information 13). For cancer types, the most commonly reported were lung cancer (2,022; 24.16%), melanoma (1,109; 13.25%), colon cancer (1,097; 13.11%), breast cancer (1,035; 12.37%), pancreatic cancer (497; 5.94%), and thyroid cancer (454; 5.42%) (Supplementary information 14). The analysis of variant frequencies revealed that \u003cem\u003eBRAF\u003c/em\u003e V600E (1,952; 28.91%), \u003cem\u003eEGFR\u003c/em\u003e T790M (852; 12.62%), \u003cem\u003eEGFR\u003c/em\u003e L858R (641; 9.49%), \u003cem\u003eKRAS\u003c/em\u003e G12D (571; 8.46%), \u003cem\u003eKRAS\u003c/em\u003e G12C (436; 6.46%), and \u003cem\u003eKRAS\u003c/em\u003e G12V (267; 3.95%) were most prevalent (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eCo-occurrence matrices illustrated the frequency of associations between treatments, variants, and cancer types. 
Among all variant-treatment pairs, the most common LLM-predicted association type was \u0026ldquo;sensitive,\u0026rdquo; accounting for 29.1% of cases (4,552 associations), followed by \u0026ldquo;resistant\u0026rdquo; predictions at 21.6% (3,368 associations). A much smaller proportion, 0.9% (144 associations), was classified as \u0026ldquo;diagnostic\u0026rdquo;. Predictions labeled as \u0026ldquo;unknown\u0026rdquo; made up 24.1% of the total (3,770 associations), while 23.5% (3,666 associations) were deemed \u0026ldquo;unrelated\u0026rdquo;, and only 0.8% (124 associations) could not be resolved and were labeled as \u0026ldquo;no consensus.\u0026rdquo;\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe matrix was predominantly sparse (indicated by light color in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e), reflecting relatively few strong associations. This aligns well with the biological expectation that most targeted therapies are relevant only to specific cancer types and genetic alterations/molecular signatures. Nonetheless, distinct hotspots, highlighted by more saturated green and red coloring, revealed statistically significant co-associations for select variant-treatment pairs predicted to be related to treatment-sensitivity (i.e., green color) or potential resistance (i.e., red color). Among the top-ranked predicted associations, several reflect well-established clinical relationships. For instance, dabrafenib, trametinib, and their combination regimen showed strong associations with the \u003cem\u003eBRAF\u003c/em\u003e V600E variant (76.74%, 65.90%, and 81.98% association, respectively), aligning with their status as clinically approved and widely used therapies for \u003cem\u003eBRAF\u003c/em\u003e-mutant cancers. 
Similarly, the cetuximab/encorafenib regimen shows a strong predicted association with \u003cem\u003eBRAF\u003c/em\u003e V600E (95.28% association), aligning with its approval for metastatic colorectal cancer harboring this variant. On the other hand, some other high-scoring associations, such as ganitumab and elimusertib, lack a known mechanistic or clinical link to \u003cem\u003eBRAF\u003c/em\u003e V600E, and are likely non-specific or non-actionable in this mutational context. Moreover, several variant-treatment pairs in the heatmap represent plausible resistance mechanisms, including \u003cem\u003eALK\u003c/em\u003e G1202R, which is a well-established resistance mutation targeted by later-generation \u003cem\u003eALK\u003c/em\u003e inhibitors, such as ASP3026, alectinib, and ceritinib (-70.00%, -21.55%, and -21.23% association, respectively)\u003csup\u003e\u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e\u003c/sup\u003e. Additionally, as indicated by the matrix, some treatments exhibited broader associations, with multiple non-zero entries across several variants. These likely represent broad-spectrum treatments with potential relevance to multiple genetic alterations or relevance irrespective of genetic alterations. Conversely, isolated cells suggest high-specificity interactions, potentially reflecting precision oncology applications where therapies are tailored to target specific genetic alterations.\u003c/p\u003e \u003cp\u003eThe co-occurrence analysis between variants and treatments revealed 15,577 statistically significant co-associations (\u003cem\u003ep\u0026thinsp;\u0026le;\u0026thinsp;0.05\u003c/em\u003e). 
Notably, \u003cem\u003eBRAF\u003c/em\u003e V600E emerged as one of the most recurrently associated variants, showing strong co-associations with multiple cancer types including histiocytoma (100%), clear cell sarcoma (100%), neurofibroma (100%), chordoma (100%), and melanoma (61.85%) (Supplementary information 15). On the treatment side, compelling co-occurrences such as the cetuximab/encorafenib regimen for colorectal cancer (95.3%) and endocrine therapies for breast cancer (86\u0026ndash;94%) were observed, while abiraterone was tightly linked to prostate cancer (91.4%) (Supplementary information 16).\u003c/p\u003e \u003cp\u003eInvestigating each cancer type individually (with lung-, breast-, colon- and prostate cancer as well as melanoma serving as case examples) revealed that some variants appear across multiple cancers, while others are more cancer-specific (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). For example, \u003cem\u003eBRAF\u003c/em\u003e V600E is shared as one of the top associated variants in lung cancer, colon cancer, and melanoma, while \u003cem\u003eKRAS\u003c/em\u003e G12C is found in both lung and colon cancers. However, most other top-ranking variants appear to be disease-specific. In lung cancer, the strongest co-associations were observed for \u003cem\u003eEGFR\u003c/em\u003e T790M (35.16), \u003cem\u003eEGFR\u003c/em\u003e L858R (27.65), \u003cem\u003eKRAS\u003c/em\u003e G12C (8.92), \u003cem\u003eBRAF\u003c/em\u003e V600E (7.96), and \u003cem\u003eEGFR\u003c/em\u003e Exon19del (5.59). For breast cancer, the most prominent co-associations included \u003cem\u003eESR1\u003c/em\u003e Y537S (17.13), \u003cem\u003eESR1\u003c/em\u003e D538G (16.29), \u003cem\u003ePIK3CA\u003c/em\u003e H1047R (14.34), \u003cem\u003ePIK3CA\u003c/em\u003e E545K (10.62), and \u003cem\u003ePIK3CA\u003c/em\u003e E542K (8.43). 
In colon cancer, \u003cem\u003eBRAF\u003c/em\u003e V600E showed the strongest co-occurrence (45.47), followed by several \u003cem\u003eKRAS\u003c/em\u003e variants: G12C (12.13), G12D (9.96), G12V (8.43), and G13D (7.18). The top five variants in prostate cancer were all related to the \u003cem\u003eAR\u003c/em\u003e gene, including \u003cem\u003eAR\u003c/em\u003e T878A (17.03), \u003cem\u003eAR\u003c/em\u003e L702H (16.95), \u003cem\u003eAR\u003c/em\u003e F876L (14.16), \u003cem\u003eAR\u003c/em\u003e F877L (12.35), and \u003cem\u003eAR\u003c/em\u003e H875Y (11.80). In melanoma, \u003cem\u003eBRAF\u003c/em\u003e V600E showed the highest co-association score (61.85), followed by \u003cem\u003eBRAF\u003c/em\u003e V600K (12.26), \u003cem\u003eNRAS\u003c/em\u003e Q61R (2.48), \u003cem\u003eNRAS\u003c/em\u003e Q61K (2.21), and \u003cem\u003eBRAF\u003c/em\u003e V600R (2.10). These high-ranking variants are largely consistent with well-established and frequently reported alterations, supporting the validity of the extraction and ranking approach and indicating that the method reliably recovers clinically relevant findings.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eNetwork analysis\u003c/h2\u003e \u003cp\u003eThe network graph comprised 4,504 nodes and 48,470 edges, offering a structured representation of the variant landscape captured in the literature (full network graph accessible at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://evidencedb.hastingslab.org/variantscape/networkgraph\u003c/span\u003e\u003cspan address=\"https://evidencedb.hastingslab.org/variantscape/networkgraph\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e)\u003csup\u003e61\u003c/sup\u003e. 
As an example, network analysis of treatment-variant relationships identified the \u003cem\u003eEGFR\u003c/em\u003e L858R point mutation and \u003cem\u003eEGFR\u003c/em\u003e T790M as central nodes in NSCLC (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). For \u003cem\u003eEGFR\u003c/em\u003e L858R, multiple treatments demonstrated strong, sensitive co-associations exceeding the 80th percentile confidence threshold (highlighted in green color), including osimertinib (score: 678), gefitinib (score: 474), erlotinib (score: 432), and afatinib (score: 333), while radiation therapy (score: 167) showed a co-association but did not meet the predefined confidence cutoff and was therefore deemed not relevant (grey color). No resistant co-associations for \u003cem\u003eEGFR\u003c/em\u003e L858R surpassed the threshold (grey color), although several lower-confidence edges were present. In contrast, \u003cem\u003eEGFR\u003c/em\u003e T790M was associated with a single high-confidence sensitivity edge, osimertinib (score: 871), and multiple strong resistance co-associations (highlighted in red), including gefitinib (score: 557), erlotinib (score: 505), afatinib (score: 375), and cisplatin (score: 182). 
Cross-cancer co-associations revealed that both variants were most frequently linked to colon cancer (highlighted in blue), suggesting a measurable, though relatively uncommon, presence of these variations outside the NSCLC setting (\u003cem\u003eEGFR\u003c/em\u003e L858R score: 92, \u003cem\u003eEGFR\u003c/em\u003e T790M score: 98).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eRare variant identification and evaluation of treatments\u003c/h2\u003e \u003cp\u003eIn addition to the assessment of well-characterized alterations, literature-derived rare variants in NSCLC (\u003cem\u003eBRAF\u003c/em\u003e G469V, \u003cem\u003eEGFR\u003c/em\u003e S768I, \u003cem\u003eEGFR\u003c/em\u003e L861Q, and \u003cem\u003eEGFR\u003c/em\u003e L747P) were examined to evaluate their co-associations with both sensitive and resistant therapies, as well as their occurrence in other cancer types (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). These co-associations were supported by prior evidence\u003csup\u003e\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e,\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e\u003c/sup\u003e, underscoring the method\u0026rsquo;s ability to recover clinically meaningful relationships for rare variants. For \u003cem\u003eBRAF\u003c/em\u003e G469V, a class II mutation leading to RAS-independent dimeric activation, strong sensitivity co-associations were observed with osimertinib (score: 485) and gefitinib (score: 325), with no resistant associations detected\u003csup\u003e\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e\u003c/sup\u003e. 
\u003cem\u003eEGFR\u003c/em\u003e S768I showed high-confidence sensitivity co-associations with osimertinib (score: 503) and gefitinib (score: 337), with no resistant treatments surpassing the confidence threshold\u003csup\u003e\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e\u003c/sup\u003e. \u003cem\u003eEGFR\u003c/em\u003e L861Q demonstrated strong sensitivity co-associations with osimertinib (score: 507), gefitinib (score: 336), and erlotinib (score: 302), and no high-confidence resistance co-associations\u003csup\u003e\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e\u003c/sup\u003e. \u003cem\u003eEGFR\u003c/em\u003e L747P was uniquely characterized by a strong resistant co-association with osimertinib (score: 487), without any high-confidence sensitivity co-associations\u003csup\u003e\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e\u003c/sup\u003e. Cross-cancer co-association analysis revealed that \u003cem\u003eBRAF\u003c/em\u003e G469V (score: 85) and \u003cem\u003eEGFR\u003c/em\u003e L861Q (score: 84) were frequently mentioned in the context of colorectal cancer, suggesting relevance beyond NSCLC\u003csup\u003e\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eInteractive web tool for querying variant-cancer-treatment co-associations\u003c/h2\u003e \u003cp\u003eTo translate our findings into a usable resource, we developed the publicly accessible \u0026ldquo;Variantscape\u0026rdquo; web tool, which enables interactive exploration of our network analysis and of the variant, treatment, and cancer associations identified in this study. The tool allows users to search by variant and associated gene, as well as by cancer type, to retrieve the treatments and other cancer types co-associated with the respective variant. 
This interface enables exploratory navigation of literature-derived variant associations and may serve as a starting point for hypothesis generation and knowledge discovery in research settings. The tool is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://evidencedb.hastingslab.org/variantscape\u003c/span\u003e\u003cspan address=\"https://evidencedb.hastingslab.org/variantscape\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003csup\u003e64\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study presents a large-scale computational analysis of variant-treatment-cancer co-associations derived from biomedical literature, specifically titles and abstracts. We introduce a novel technical approach that combines LLM- and NER-driven entity extraction with co-association and network-based analyses to systematically mine and structure oncogenic insights from unstructured text. This method enabled the identification of co-associations involving both well-known and less frequently discussed variants, revealing patterns and underexplored relationships in the underlying literature that may inform future hypothesis generation and focused evidence retrieval about oncogenic mechanisms or therapeutic relevance. In doing so, it contributes to a more comprehensive representation of the literature-derived variant landscape. Many of these insights, particularly those involving less-characterized variants, reflect contextual relationships that are often absent from traditional variant databases, frequently leaving researchers and clinicians to perform their own literature reviews when confronted with rare or atypical alterations. 
Importantly, these findings are not intended as clinical recommendations but rather represent literature-based leads that should support further exploration within the context of existing evidence and expert judgment, including potential directions for precision oncology and drug repurposing.\u003c/p\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eTechnical contributions and methodological insights\u003c/h2\u003e \u003cp\u003eThe large number of gene and variant mentions extracted from the literature corpus demonstrates the effectiveness of the LLM-based approach in capturing both commonly reported and less canonical molecular alterations, highlighting the richness of unstructured biomedical text and the value of large-scale text mining for variant discovery. However, despite the vast initial scale of the literature corpus, only 1% of articles mentioned a molecular variant, showing how rarely explicit molecular variants appear in titles and abstracts. Crucially, this also points to a practical limitation for clinicians and researchers relying on traditional keyword-based searches, as variant nomenclature is often inconsistent and, in many cases, not mentioned at all in titles or abstracts, making manual discovery via platforms like PubMed extremely difficult. Similarly, traditional NLP methods, while highly effective for identifying more standardized biomedical entities such as cancer types and treatments, as shown in this study, struggle to capture the variability inherent to molecular variants. In contrast, our LLM-based approach retrieves and structures variant-level insights at scale, leveraging the ability of LLMs to interpret naming variability and accurately extract variant information from unstructured texts, providing access to information that might otherwise remain buried in the literature. 
Although the computational runtime reflects the current resource demands of large-scale LLM-based extraction, the study shows that the approach is nevertheless feasible and scalable for systematic biomedical literature mining.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eInterpretation of findings and clinical relevance\u003c/h2\u003e \u003cp\u003eOur approach revealed many plausible treatment-variant, variant-cancer, and treatment-cancer co-associations. In general, accuracy appeared highest for canonical variant-cancer pairs (e.g., \u003cem\u003eEGFR\u003c/em\u003e L858R in NSCLC or \u003cem\u003eBRAF\u003c/em\u003e V600E in melanoma) and declined when applied beyond well-established indications, as limited literature representation reduces both model confidence and evidence strength. However, even in well-characterized cases, the therapeutic relevance of a given alteration can vary substantially across tumor types, reflecting a broader challenge in precision oncology and treatment repurposing approaches. Factors such as pathway dependencies, co-mutations, and the tumor microenvironment can significantly influence treatment response\u003csup\u003e\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e\u003c/sup\u003e. As a result, the presence of a known oncogenic variant does not ensure equivalent efficacy across cancers, underscoring the importance of tumor-context-aware interpretation of variant-based co-associations\u003csup\u003e\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e66\u003c/span\u003e\u003c/sup\u003e. 
Moreover, variant-treatment co-associations may be modulated by co-occurring genetic alterations, which can influence pathway activity, resistance mechanisms, or synthetic lethality interactions, underscoring the need to consider broader mutational contexts when interpreting literature-derived co-associations.\u003c/p\u003e \u003cp\u003eWhile molecular biomarkers offer a rationale for extending therapies across diseases and toward tumor-agnostic strategies, their clinical utility remains constrained by biological complexity, tissue-specific signaling, pharmacokinetics, and off-target effects\u003csup\u003e\u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e\u003c/sup\u003e. The presence of a shared variant does not guarantee equal therapeutic efficacy, as shown by \u003cem\u003eBRAF\u003c/em\u003e V600E, which responds well to \u003cem\u003eBRAF\u003c/em\u003e inhibition in melanoma but requires combination with EGFR blockade in colorectal cancer to overcome compensatory signaling\u003csup\u003e\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e\u003c/sup\u003e. Similarly, \u003cem\u003eEGFR\u003c/em\u003e mutations are highly actionable in NSCLC but less effective targets in colorectal cancer due to downstream activation of the Ras-Raf-MEK-ERK pathway\u003csup\u003e\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e\u003c/sup\u003e. The type of gene affected also shapes therapeutic potential: oncogenes (e.g., \u003cem\u003eEGFR\u003c/em\u003e, \u003cem\u003eBRAF\u003c/em\u003e, \u003cem\u003eALK\u003c/em\u003e, \u003cem\u003eMET\u003c/em\u003e) are more readily targeted with inhibitors, whereas tumor suppressor genes, often altered by loss-of-function mutations, remain pharmacologically challenging\u003csup\u003e\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e\u003c/sup\u003e. 
Although \u003cem\u003eTP53\u003c/em\u003e was the most frequently reported gene in our dataset, most clinically actionable variants were linked to oncogenes, reflecting both their therapeutic tractability and greater representation in the literature. Moreover, while targeted therapies and immunotherapies dominate variant-guided treatment decisions, certain molecular features also influence response to broader modalities; for instance, \u003cem\u003eBRCA\u003c/em\u003e1/2 mutations confer sensitivity to platinum-based chemotherapy and PARP inhibitors\u003csup\u003e\u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e71\u003c/span\u003e\u003c/sup\u003e, and MMR deficiency or high tumor mutational burden (TMB) predict response to immune checkpoint blockade\u003csup\u003e\u003cspan citationid=\"CR72\" class=\"CitationRef\"\u003e72\u003c/span\u003e\u003c/sup\u003e. Together, these examples highlight the importance as well as limitations of the molecular context in determining therapeutic relevance, reinforcing the value of systematic literature-derived variant mapping while acknowledging the challenges that remain beyond current modeling capabilities.\u003c/p\u003e \u003cp\u003eDespite these advances, significant gaps remain for certain clinically important variants, particularly in understudied scenarios. One example is the \u0026ldquo;entity\u0026rdquo; of Cancer of Unknown Primary (CUP), a diagnostically and therapeutically complex condition with a dismal prognosis. Although molecular profiling, including DNA methylation analysis, is frequently used to identify the tissue of origin and guide therapy, evidence suggests that even treatments informed by advanced genomic analyses often fail to outperform standard chemotherapy\u003csup\u003e\u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e73\u003c/span\u003e\u003c/sup\u003e. 
Notably, while CUP was identified in the literature, no CUP-associated variants were captured in our analysis, reflecting a genuine lack of co-mentioned variant data rather than a limitation of the extraction approach. This gap presents a clear opportunity for future studies to investigate and define robust genomic markers that could better inform CUP diagnosis and treatment. Similarly, variants of uncertain significance (VUS), while frequently encountered in clinical testing, are rarely emphasized in the literature and thus underrepresented in LLM-driven text mining.\u003c/p\u003e \u003cp\u003eBeyond biological considerations, a critical but often underappreciated challenge in precision oncology is the influence of clinical testing patterns and the time point of molecular data collection reported in published literature. For example, \u003cem\u003eAR\u003c/em\u003e alterations in prostate cancer are frequently reported, yet this may primarily reflect the timing of NGS, which is commonly performed in the advanced, castration-resistant setting\u003csup\u003e\u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e74\u003c/span\u003e\u003c/sup\u003e. Consequently, literature-derived variant frequencies may mirror clinical sampling patterns rather than true biological prevalence. This bias is not unique to prostate cancer; rather, it reflects a broader issue across various tumor types. When NGS is primarily performed on refractory or late-stage disease, the molecular landscape observed in both clinical practice and the literature may become disproportionately focused on aggressive or treatment-resistant phenotypes. 
Additionally, the structure of clinical studies, particularly randomized controlled trials, often necessitates inclusion of standard-of-care treatments, which may lead to inflated mention frequencies irrespective of actual therapeutic effectiveness.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003ePractical implications and use cases\u003c/h2\u003e \u003cp\u003eIn many rare or understudied malignancies, conventional trial-driven evidence is limited, and database curation often lags behind the pace of emerging literature. In addition, the importance of classical evidence generation using large trials may decline as improved molecular understanding of disease drives patient sub-setting, refined stratification, predictive biomarkers, and n-equals-one medicine. To address this gap, we developed an LLM-supported approach to extract and identify potential variant-treatment-cancer co-associations across tumor types, supporting hypothesis generation in settings where direct evidence is sparse. These associations, however, are based on patterns of co-occurrence in the literature and should be interpreted as preliminary signals rather than definitive therapeutic recommendations. Clinical interpretation and functional validation remain essential to determine the true relevance of these findings. To support real-world applications, we developed an interactive web-based tool that enables clinicians and researchers to explore these associations in real time. 
This resource could assist clinicians, researchers, molecular tumor boards, and curation teams by providing an early-stage evidence layer and screening tool to guide downstream analysis and decision-making.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eLimitations of the study\u003c/h2\u003e \u003cp\u003eWhile our findings offer a broad and structured view of literature-derived variant, treatment, and cancer co-associations, several limitations should be acknowledged when interpreting the results. First, our analysis was restricted to titles and abstracts, as full-text access is frequently limited by paywalls and licensing restrictions. Moreover, large-scale full-text analysis with LLMs requires additional infrastructure, such as section parsing and increased memory capacity. Consequently, variants mentioned exclusively in the full text, particularly within tables, figures, or supplementary materials, were missed. Nonetheless, titles and abstracts offer a consistent and accessible foundation for scalable literature mining and represent a practical starting point for systematic variant discovery.\u003c/p\u003e \u003cp\u003eSecond, our study analyzed publications from 2014 to 2024, a timeframe that, while comprehensive, may include outdated therapeutic associations given the rapid evolution of oncological research. 
However, as the pipeline is fully automated, the analysis can be updated at any time to incorporate newly published data and maintain alignment with the latest evidence.\u003c/p\u003e \u003cp\u003eThird, our findings are inherently shaped by publication bias, as well-studied genes and high-profile therapies dominate the literature, while rare variants or under-researched populations are comparatively underrepresented.\u003c/p\u003e \u003cp\u003eFourth, while our pipeline captured a broad treatment space, it occasionally surfaced non-specific or supportive agents (e.g., \u0026ldquo;chemotherapy,\u0026rdquo; \u0026ldquo;ibuprofen,\u0026rdquo; \u0026ldquo;prednisone\u0026rdquo;). These inclusions stem not from model error but from the inherent ambiguity in source knowledge bases and literature, where therapeutic intent, preventive roles, and symptomatic management are often conflated. Although some of these treatments have shown potential preventive or adjunctive effects in early studies, they are not considered standard therapies for cancer. This underscores a broader challenge in biomedical informatics: the need for context-aware systems and dynamically curated ontologies to distinguish between treatment, prevention, and supportive care in a clinically meaningful way.\u003c/p\u003e \u003cp\u003eFifth, applying a confidence threshold helped reduce noise and improve specificity in variant-treatment classifications, but emerging or sparsely reported associations may have been excluded. In qualitative review, some known treatment-variant co-associations were successfully recovered, while others were incorrectly labeled as \u0026ldquo;sensitive\u0026rdquo; despite lacking clear mechanistic or clinical support. This reflects the method\u0026rsquo;s susceptibility to noise and underscores the importance of thresholding and post-processing to reduce inaccurate associations. Moreover, this binary classification scheme oversimplifies complex therapeutic dynamics. 
Drug response is often dose-dependent, temporally variable, and context-specific. Acquired resistance, compensatory signaling, and co-mutational effects all play roles that are not captured in simple co-occurrence-based models.\u003c/p\u003e \u003cp\u003eSixth, while our pipeline effectively extracted point mutations and small indels, it did not explicitly detect gene fusions or tumor-level genomic signatures such as MSI, MMR deficiency, or high TMB. Although clinically significant, these alterations were not included in our variant extraction scheme and were not systematically queried via the LLM.\u003c/p\u003e \u003cp\u003eFinally, human validation remains essential. While LLMs can scale variant extraction from the published literature, they cannot replace clinical expertise in contextualizing variant relevance. The associations presented here should be viewed as preliminary signals derived from textual patterns, not definitive therapeutic evidence.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eFuture directions\u003c/h2\u003e \u003cp\u003eLooking ahead, the approach presented here holds potential for further integration with a range of data types and tools. Future work should incorporate full-text mining to capture deeper variant annotations, structured therapeutic evidence, and context-rich discussions inherently missing from abstracts. Expanding extraction capabilities to include gene fusions, functional tumor signatures, and compound biomarkers would enhance the biological breadth.\u003c/p\u003e \u003cp\u003eAdditionally, integrating our literature-derived insights with curated databases and clinical variant interpretation pipelines (e.g., CIViC, COSMIC, ClinVar) and real-world molecular datasets (e.g., TCGA) may enable more robust hypothesis testing and validation. 
Linking these associations to clinical outcomes could further help prioritize literature-derived scientific leads for translational investigation.\u003c/p\u003e \u003cp\u003eOur publicly accessible web tool serves as a foundation for interactive variant exploration, but future iterations may support real-time updates, potentially evolving into a semi-automated support system for molecular tumor boards, variant curation teams, or precision oncology researchers, especially in low-resource settings where manual review is infeasible.\u003c/p\u003e \u003cp\u003eLastly, broader progress will require not only technical improvements in LLM-based extraction but also a shift in publishing culture, encouraging structured, standardized variant reporting, greater open access availability, and metadata enrichment to ensure biomedical literature can be systematically leveraged for clinical insight and discovery.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study demonstrates the feasibility and utility of using LLMs and NLP-based approaches to systematically extract, structure, and explore variant, treatment, and cancer co-associations from biomedical literature at scale. Automation is essential to scaling extraction efforts, ensuring accuracy, and keeping pace with the exponential growth of scientific discoveries. By leveraging automation and advanced machine learning techniques, our pipeline uncovered thousands of meaningful connections, including both canonical and rare variants, that are often underrepresented or difficult to retrieve through traditional database queries or manual literature review. While constrained to titles and abstracts, our method offers a scalable and accessible foundation for literature-based variant discovery and contributes to mapping the evolving landscape of molecular oncology as reflected in published evidence. 
The open-access, interactive web tool developed as part of this work translates these findings into a usable resource that may support clinicians and researchers in hypothesis generation, early exploration of literature-derived signals, and the identification of potential new applications of existing therapies, especially in the case of rare variants. Despite known limitations and the need for clinical validation, this approach offers a powerful starting point for augmenting variant interpretation efforts and bridging gaps in curated knowledge. As biomedical literature continues to grow rapidly, automated extraction methods like \u0026ldquo;Variantscape\u0026rdquo; may become increasingly valuable in transforming unstructured evidence into structured insights that capture the full complexity of the variant landscape and advance precision oncology.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTop five ranked \u0026ldquo;sensitive\u0026rdquo; and \u0026ldquo;resistant\u0026rdquo; treatment co-associations for variants EGFR L858R and EGFR T790M in non-small cell lung cancer (NSCLC), and co-association of these variants in other cancer types based on evidence-weighted network analysis.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eCancer type: Non-small cell lung cancer (NSCLC)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eVariant: L858R \u003cem\u003eEGFR\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eVariant: T790M\u003c/b\u003e \u003cb\u003eEGFR\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eTop 5 \u0026ldquo;sensitive\u0026rdquo; treatment co-associations (weighted evidence score)*\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOsimertinib: 678\u003c/p\u003e \u003cp\u003eGefitinib: 474\u003c/p\u003e \u003cp\u003eErlotinib: 432\u003c/p\u003e \u003cp\u003eAfatinib: 333\u003c/p\u003e \u003cp\u003eRadiation therapy: 167\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eOsimertinib: 871\u003c/b\u003e\u003c/p\u003e \u003cp\u003eRadiation therapy: 178\u003c/p\u003e \u003cp\u003eTrametinib: 90\u003c/p\u003e \u003cp\u003eRociletinib: 71\u003c/p\u003e \u003cp\u003eCetuximab: 64\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eTop 5 \u0026ldquo;resistant\u0026rdquo; treatment co-associations (weighted evidence score)*\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCisplatin: 180\u003c/p\u003e \u003cp\u003ePembrolizumab: 41\u003c/p\u003e \u003cp\u003ePaclitaxel: 30\u003c/p\u003e \u003cp\u003eDocetaxel: 30\u003c/p\u003e \u003cp\u003eNivolumab: 21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eGefitinib: 557\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eErlotinib: 505\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eAfatinib: 375\u003c/b\u003e\u003c/p\u003e 
\u003cp\u003eCisplatin: 182\u003c/p\u003e \u003cp\u003eCrizotinib: 167\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eOther top 5 cancers associated with the respective variant (weighted evidence score)*\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eColon cancer: 92\u003c/p\u003e \u003cp\u003eBreast cancer: 64\u003c/p\u003e \u003cp\u003eSquamous cell cancer: 60\u003c/p\u003e \u003cp\u003eMelanoma: 56\u003c/p\u003e \u003cp\u003ePancreatic cancer: 41\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eColon cancer: 98\u003c/b\u003e\u003c/p\u003e \u003cp\u003eBreast cancer: 68\u003c/p\u003e \u003cp\u003eMelanoma: 58\u003c/p\u003e \u003cp\u003eSquamous cell cancer: 57\u003c/p\u003e \u003cp\u003ePancreatic cancer: 45\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003e*Colored co-associations represent high-confidence scores (\u0026ge;\u0026thinsp;80th percentile), based on evidence-weighted co-occurrence in the literature and filtered to retain only the strongest edges for network robustness.\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTop five ranked treatment co-associations for rare variants identified in non-small cell lung cancer (NSCLC), along with additional cancer types where these variants might be observed more frequently.\u003c/p\u003e 
\u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003eCancer type: Non-small cell lung cancer (NSCLC)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eBRAF\u003c/em\u003e G469V\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eEGFR\u003c/b\u003e \u003cb\u003eS768I\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003eEGFR\u003c/b\u003e \u003cb\u003eL861Q\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eEGFR\u003c/b\u003e \u003cb\u003eL747P\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eTop 5 \u0026ldquo;sensitive\u0026rdquo; treatment co-associations (weighted evidence score)*\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOsimertinib: 485\u003c/p\u003e \u003cp\u003eGefitinib: 325\u003c/p\u003e \u003cp\u003eTrametinib: 87\u003c/p\u003e \u003cp\u003eVemurafenib: 43\u003c/p\u003e \u003cp\u003eCetuximab: 39\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eOsimertinib: 503\u003c/b\u003e\u003c/p\u003e 
\u003cp\u003e\u003cb\u003eGefitinib: 337\u003c/b\u003e\u003c/p\u003e \u003cp\u003eErlotinib: 301\u003c/p\u003e \u003cp\u003eAfatinib: 253\u003c/p\u003e \u003cp\u003eCrizotinib: 135\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003eOsimertinib: 507\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eGefitinib: 336\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eErlotinib: 302\u003c/b\u003e\u003c/p\u003e \u003cp\u003eAfatinib: 253\u003c/p\u003e \u003cp\u003eCrizotinib: 136\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAfatinib: 234\u003c/p\u003e \u003cp\u003eCetuximab: 39\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eTop 5 \u0026ldquo;resistant\u0026rdquo; treatment co-associations (weighted evidence score)*\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDabrafenib: 66\u003c/p\u003e \u003cp\u003eGemcitabine: 10\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIcotinib: 54\u003c/p\u003e \u003cp\u003eBrigatinib: 23\u003c/p\u003e \u003cp\u003eLazertinib: 6\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLazertinib: 6\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eOsimertinib: 487\u003c/b\u003e\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e 
\u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eOther cancers associated with the respective variant (weighted evidence score)*\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eColon cancer: 85\u003c/p\u003e \u003cp\u003eMelanoma: 58\u003c/p\u003e \u003cp\u003eBreast cancer: 50\u003c/p\u003e \u003cp\u003eThyroid cancer: 29\u003c/p\u003e \u003cp\u003eOvarian cancer: 16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBreast cancer: 50\u003c/p\u003e \u003cp\u003ePancreatic cancer: 40\u003c/p\u003e \u003cp\u003eLiver cancer: 11\u003c/p\u003e \u003cp\u003eGlioblastoma: 11\u003c/p\u003e \u003cp\u003eCholangiocarcinoma: 8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003eColon cancer: 84\u003c/b\u003e\u003c/p\u003e \u003cp\u003eBreast cancer: 51\u003c/p\u003e \u003cp\u003ePancreatic cancer: 40\u003c/p\u003e \u003cp\u003eSquamous cell cancer: 39\u003c/p\u003e \u003cp\u003eGlioblastoma: 12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003e*Colored co-associations represent high-confidence scores (\u0026ge;\u0026thinsp;80th percentile), based on evidence-weighted co-occurrence in the literature and filtered to retain only the strongest edges for network robustness.\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAI: artificial intelligence\u003c/p\u003e\n\u003cp\u003eAPI: application programming 
interface\u003c/p\u003e\n\u003cp\u003eCIViC: Clinical Interpretation of Variations in Cancer\u003c/p\u003e\n\u003cp\u003eCUP: cancer of unknown primary\u003c/p\u003e\n\u003cp\u003eDOID: disease ontology ID\u003c/p\u003e\n\u003cp\u003eFDR: false discovery rate\u003c/p\u003e\n\u003cp\u003eHGVS: Human Genome Variation Society\u003c/p\u003e\n\u003cp\u003eHRD: homologous recombination deficiency\u003c/p\u003e\n\u003cp\u003eLLMs: large language models\u003c/p\u003e\n\u003cp\u003eMMR: mismatch repair\u003c/p\u003e\n\u003cp\u003eMONDO: Monarch Disease Ontology\u0026nbsp;\u003c/p\u003e\n\u003cp\u003emCRPC: metastatic castration-resistant prostate cancer\u003c/p\u003e\n\u003cp\u003eMSI: microsatellite instability\u003c/p\u003e\n\u003cp\u003eNCIt: National Cancer Institute thesaurus\u003c/p\u003e\n\u003cp\u003eNER: named entity recognition\u003c/p\u003e\n\u003cp\u003eNGS: next-generation sequencing\u003c/p\u003e\n\u003cp\u003eNLP: natural language processing\u003c/p\u003e\n\u003cp\u003eNSCLC: non-small cell lung cancer\u003c/p\u003e\n\u003cp\u003eOBI: Ontology of Biomedical Investigations\u003c/p\u003e\n\u003cp\u003eOCR: optical character recognition\u003c/p\u003e\n\u003cp\u003ePC: prostate cancer\u003c/p\u003e\n\u003cp\u003eRAG: retrieval-augmented generation\u003c/p\u003e\n\u003cp\u003eRWE: real-world evidence\u003c/p\u003e\n\u003cp\u003eSNP: single-nucleotide polymorphism\u003c/p\u003e\n\u003cp\u003eTKIs: tyrosine kinase inhibitors\u003c/p\u003e\n\u003cp\u003eTMB: tumor mutational burden\u003c/p\u003e\n\u003cp\u003eUMAP: uniform manifold approximation and projection\u003c/p\u003e\n\u003cp\u003eVUS: variants of uncertain significance\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eAuthor contributions\u003c/p\u003e\n\u003cp\u003eConception and design: MW, JH.\u003c/p\u003e\n\u003cp\u003eData collection, analysis, and statistics: MW, TN, JH.\u003c/p\u003e\n\u003cp\u003eData interpretation: MW, MB, TP, MF, CR, JH.\u003c/p\u003e\n\u003cp\u003eWrote the first 
draft of the paper: MW.\u003c/p\u003e\n\u003cp\u003eWrote the final version of the paper: MW, MB, TP, TN, MF, CR, JH.\u003c/p\u003e\n\u003cp\u003eApproved the paper for submission and publication: MW, MB, TP, TN, MF, CR, JH.\u003c/p\u003e\n\u003cp\u003eAcknowledgments\u003c/p\u003e\n\u003cp\u003eThis study was funded by the School of Medicine at the University of St.Gallen in Switzerland under grant number 2300380.\u003c/p\u003e\n\u003cp\u003eCompeting interests\u003c/p\u003e\n\u003cp\u003eTP has received travel support from Janssen and Bayer. MF has received compensation from Bristol-Myers Squibb, MSD, AstraZeneca, Boehringer Ingelheim, Roche, Takeda, Pfizer, Janssen, Daiichi-Sankyo, and PharmaMar (Advisory Board, institutional). CR has received compensation from Pfizer, Bristol-Myers Squibb, and MSD Oncology (Advisory Board, institutional). MW, MB, TN, and JH declare no competing interests relevant to this paper.\u003c/p\u003e\n\u003cp\u003eData availability statement\u003c/p\u003e\n\u003cp\u003eThe datasets generated and/or analyzed during the current study are available in a Zenodo repository and can be accessed via this link: https://zenodo.org/records/15268056.\u003c/p\u003e\n\u003cp\u003eCode availability statement\u003c/p\u003e\n\u003cp\u003eThe underlying code and validation datasets for this study are available in a GitHub repository and can be accessed via this link: https://github.com/hastingslab-org/variantscape.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eMalone, E. R., Oliva, M., Sabatini, P. J. B., Stockley, T. L. \u0026amp; Siu, L. L. Molecular profiling for precision cancer therapies. \u003cem\u003eGenome Medicine\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 8 (2020).\u003c/li\u003e\n\u003cli\u003eGibbs, S. N. \u003cem\u003eet al.\u003c/em\u003e Comprehensive review on the clinical impact of next-generation sequencing tests for the management of advanced cancer. 
\u003cem\u003eJCO Precis Oncol\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, e2200715 (2023).\u003c/li\u003e\n\u003cli\u003eMendiratta, G. \u003cem\u003eet al.\u003c/em\u003e Cancer gene mutation frequencies for the U.S. population. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 5961 (2021).\u003c/li\u003e\n\u003cli\u003eKris, M. G. \u003cem\u003eet al.\u003c/em\u003e Using multiplexed assays of oncogenic drivers in lung cancers to select targeted drugs. \u003cem\u003eJAMA\u003c/em\u003e \u003cstrong\u003e311\u003c/strong\u003e, 1998\u0026ndash;2006 (2014).\u003c/li\u003e\n\u003cli\u003eDankner, M., Rose, A. A. N., Rajkumar, S., Siegel, P. M. \u0026amp; Watson, I. R. Classifying BRAF alterations in cancer: new rational therapeutic strategies for actionable mutations. \u003cem\u003eOncogene\u003c/em\u003e \u003cstrong\u003e37\u003c/strong\u003e, 3183\u0026ndash;3199 (2018).\u003c/li\u003e\n\u003cli\u003eChevallier, M., Borgeaud, M., Addeo, A. \u0026amp; Friedlaender, A. Oncogenic driver mutations in non-small cell lung cancer: Past, present and future. \u003cem\u003eWorld J Clin Oncol\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 217\u0026ndash;237 (2021).\u003c/li\u003e\n\u003cli\u003eKumar, A. \u0026amp; Kumar, A. Non-small-cell lung cancer-associated gene mutations and inhibitors. \u003cem\u003eAdvances in Cancer Biology - Metastasis\u003c/em\u003e \u003cstrong\u003e6\u003c/strong\u003e, 100076 (2022).\u003c/li\u003e\n\u003cli\u003eSoria, J.-C. \u003cem\u003eet al.\u003c/em\u003e Osimertinib in untreated EGFR-mutated advanced non-small-cell lung cancer. \u003cem\u003eN Engl J Med\u003c/em\u003e \u003cstrong\u003e378\u003c/strong\u003e, 113\u0026ndash;125 (2018).\u003c/li\u003e\n\u003cli\u003eMateo, J. \u003cem\u003eet al.\u003c/em\u003e Olaparib for the treatment of patients with metastatic castration-resistant prostate cancer and alterations in BRCA1 and/or BRCA2 in the PROfound trial. 
\u003cem\u003eJCO\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 571\u0026ndash;583 (2024).\u003c/li\u003e\n\u003cli\u003eLiu, Q. \u003cem\u003eet al.\u003c/em\u003e A novel BRCA2 mutation in prostate cancer sensitive to combined radiotherapy and androgen deprivation therapy. \u003cem\u003eCancer Biol Ther\u003c/em\u003e \u003cstrong\u003e19\u003c/strong\u003e, 669\u0026ndash;675 (2018).\u003c/li\u003e\n\u003cli\u003eSchwartzberg, L., Kim, E. S., Liu, D. \u0026amp; Schrag, D. Precision Oncology: Who, How, What, When, and When Not? \u003cem\u003eAm Soc Clin Oncol Educ Book\u003c/em\u003e 160\u0026ndash;169 (2017) doi:10.1200/EDBK_174176.\u003c/li\u003e\n\u003cli\u003eSondka, Z. \u003cem\u003eet al.\u003c/em\u003e COSMIC: a curated database of somatic variants and clinical data for cancer. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e52\u003c/strong\u003e, D1210\u0026ndash;D1217 (2024).\u003c/li\u003e\n\u003cli\u003eGriffith, M. \u003cem\u003eet al.\u003c/em\u003e CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. \u003cem\u003eNat Genet\u003c/em\u003e \u003cstrong\u003e49\u003c/strong\u003e, 170\u0026ndash;174 (2017).\u003c/li\u003e\n\u003cli\u003eLandrum, M. J. \u003cem\u003eet al.\u003c/em\u003e ClinVar: Public archive of relationships among sequence variation and human phenotype. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, D980-985 (2014).\u003c/li\u003e\n\u003cli\u003eChakravarty, D. \u003cem\u003eet al.\u003c/em\u003e OncoKB: A precision oncology knowledge base. \u003cem\u003eJCO Precis Oncol\u003c/em\u003e \u003cstrong\u003e2017\u003c/strong\u003e, PO.17.00011 (2017).\u003c/li\u003e\n\u003cli\u003eGazola, A. A., Lautert-Dutra, W., Archangelo, L. F., Reis, R. B. dos \u0026amp; Squire, J. A. Precision oncology platforms: Practical strategies for genomic database utilization in cancer treatment. 
\u003cem\u003eMolecular Cytogenetics\u003c/em\u003e \u003cstrong\u003e17\u003c/strong\u003e, 28 (2024).\u003c/li\u003e\n\u003cli\u003eAllot, A. \u003cem\u003eet al.\u003c/em\u003e Tracking genetic variants in the biomedical literature using LitVar 2.0. \u003cem\u003eNat Genet\u003c/em\u003e \u003cstrong\u003e55\u003c/strong\u003e, 901\u0026ndash;903 (2023).\u003c/li\u003e\n\u003cli\u003eWagner, A. H. \u003cem\u003eet al.\u003c/em\u003e A harmonized meta-knowledgebase of clinical interpretations of somatic genomic variants in cancer. \u003cem\u003eNat Genet\u003c/em\u003e \u003cstrong\u003e52\u003c/strong\u003e, 448\u0026ndash;457 (2020).\u003c/li\u003e\n\u003cli\u003eHoward, M. \u003cem\u003eet al.\u003c/em\u003e VarStack: a web tool for data retrieval to interpret somatic variants in cancer. \u003cem\u003eDatabase (Oxford)\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e, baaa092 (2020).\u003c/li\u003e\n\u003cli\u003eLee, K., Wei, C.-H. \u0026amp; Lu, Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. \u003cem\u003eBrief Bioinform\u003c/em\u003e \u003cstrong\u003e22\u003c/strong\u003e, bbaa142 (2020).\u003c/li\u003e\n\u003cli\u003eNeumann, M., King, D., Beltagy, I. \u0026amp; Ammar, W. SciSpaCy: Fast and robust models for biomedical natural language processing. in \u003cem\u003eProceedings of the 18th BioNLP Workshop and Shared Task\u003c/em\u003e (eds. Demner-Fushman, D., Cohen, K. B., Ananiadou, S. \u0026amp; Tsujii, J.) 319\u0026ndash;327 (Association for Computational Linguistics, Florence, Italy, 2019). doi:10.18653/v1/W19-5034.\u003c/li\u003e\n\u003cli\u003eDevlin, J., Chang, M.-W., Lee, K. \u0026amp; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019).\u003c/li\u003e\n\u003cli\u003eJolly, A., Pandey, V., Singh, I. \u0026amp; Sharma, N. 
Exploring biomedical named entity recognition via SciSpaCy and BioBERT models. \u003cem\u003eTOBEJ\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, e18741207289680 (2024).\u003c/li\u003e\n\u003cli\u003eK\u0026uuml;hnel, L. \u0026amp; Fluck, J. We are not ready yet: Limitations of state-of-the-art disease named entity recognizers. \u003cem\u003eJournal of Biomedical Semantics\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 26 (2022).\u003c/li\u003e\n\u003cli\u003eAlamro, H., Gojobori, T., Essack, M. \u0026amp; Gao, X. BioBBC: A multi-feature model that enhances the detection of biomedical entities. \u003cem\u003eSci Rep\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 7697 (2024).\u003c/li\u003e\n\u003cli\u003eHuang, D.-L. \u003cem\u003eet al.\u003c/em\u003e A combined manual annotation and deep-learning natural language processing study on accurate entity extraction in hereditary disease related biomedical literature. \u003cem\u003eInterdiscip Sci Comput Life Sci\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 333\u0026ndash;344 (2024).\u003c/li\u003e\n\u003cli\u003eDoneva, S. E. \u003cem\u003eet al.\u003c/em\u003e Large language models to process, analyze, and synthesize biomedical texts: A scoping review. \u003cem\u003eDiscov Artif Intell\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, 107 (2024).\u003c/li\u003e\n\u003cli\u003eWosny, M. \u0026amp; Hastings, J. Large language models for detection of genetic variants in biomedical literature. \u003cem\u003eStudies in Health Technology and Informatics (preprint)\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eWei, C.-H., Harris, B. R., Kao, H.-Y. \u0026amp; Lu, Z. tmVar: A text mining approach for extracting sequence variants in biomedical literature. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e29\u003c/strong\u003e, 1433\u0026ndash;1439 (2013).\u003c/li\u003e\n\u003cli\u003eWei, C.-H. 
\u003cem\u003eet al.\u003c/em\u003e PubTator 3.0: An AI-powered literature resource for unlocking biomedical knowledge. \u003cem\u003eNucleic Acids Research\u003c/em\u003e \u003cstrong\u003e52\u003c/strong\u003e, W540\u0026ndash;W546 (2024).\u003c/li\u003e\n\u003cli\u003ePasche, E. \u003cem\u003eet al.\u003c/em\u003e Variomes: A high recall search engine to support the curation of genomic variants. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cstrong\u003e38\u003c/strong\u003e, 2595\u0026ndash;2601 (2022).\u003c/li\u003e\n\u003cli\u003eWosny, M. \u0026amp; Hastings, J. Variantscape: Python Notebooks. (2025).\u003c/li\u003e\n\u003cli\u003eWosny, M. Variantscape datasets. https://doi.org/10.5281/zenodo.15268056.\u003c/li\u003e\n\u003cli\u003eOpenAlex \u0026amp; OurResearch. OpenAlex API. (2025).\u003c/li\u003e\n\u003cli\u003eThermo Fisher Scientific. Oncomine\u003csup\u003eTM\u003c/sup\u003e Comprehensive Assay v3. \u003cem\u003eThermo Fisher Scientific\u003c/em\u003e https://www.thermofisher.com/uk/en/home/clinical/preclinical-companion-diagnostic-development/oncomine-oncology/oncomine-cancer-research-panel-workflow.html (2025).\u003c/li\u003e\n\u003cli\u003eWosny, M. \u0026amp; Hastings, J. Automated gene identification in oncology literature: A comparative evaluation of natural language processing approaches. \u003cem\u003eStudies in Health Technology and Informatics (preprint)\u003c/em\u003e (2025).\u003c/li\u003e\n\u003cli\u003eAlvaro Alonso Casero \u0026amp; librAIry Team. BioBERT Genetic NER Model alvaroalon2/biobert_genetic_ner at Hugging Face. (2025).\u003c/li\u003e\n\u003cli\u003eMyGene.info documentation \u0026mdash; MyGene.info 3.0 documentation. https://docs.mygene.info/en/latest/index.html.\u003c/li\u003e\n\u003cli\u003eCohen, A. \u0026amp; SeatGeek Inc. FuzzyWuzzy: Fuzzy string matching in Python. SeatGeek (2025).\u003c/li\u003e\n\u003cli\u003eAllen Institute for AI. scispaCy: SpaCy models for biomedical text processing. 
(2025).\u003c/li\u003e\n\u003cli\u003eSchriml, L. M. \u003cem\u003eet al.\u003c/em\u003e Human Disease Ontology 2018 update: Classification, content and workflow expansion. \u003cem\u003eNucleic Acids Res\u003c/em\u003e \u003cstrong\u003e47\u003c/strong\u003e, D955\u0026ndash;D962 (2019).\u003c/li\u003e\n\u003cli\u003eVasilevsky, N. A. \u003cem\u003eet al.\u003c/em\u003e Mondo: Unifying diseases for the world, by the world. 2022.04.13.22273750 Preprint at https://doi.org/10.1101/2022.04.13.22273750 (2022).\u003c/li\u003e\n\u003cli\u003eEuropean Bioinformatics Institute (EMBL-EBI). MONDO ontology API. (2025).\u003c/li\u003e\n\u003cli\u003eDisease Ontology Consortium. Disease Ontology Knowledge Base (DO-KB) API. (2025).\u003c/li\u003e\n\u003cli\u003eMcDonnell Genome Institute, Washington University School of Medicine. CIViC API. (2025).\u003c/li\u003e\n\u003cli\u003eNational Cancer Institute (NCI). NCI Thesaurus (NCIt) via EVS Explore API. (2025).\u003c/li\u003e\n\u003cli\u003eDennst\u0026auml;dt, F. General-classifier: Multi-topic text classification with LLMs. (2025).\u003c/li\u003e\n\u003cli\u003eDennst\u0026auml;dt, F. \u003cem\u003eet al.\u003c/em\u003e Application of a general LLM-based classification system to retrieve information about oncological trials. \u003cem\u003emedRxiv\u003c/em\u003e 2024.12.03.24318390 (2024) doi:10.1101/2024.12.03.24318390.\u003c/li\u003e\n\u003cli\u003eDeepInfra. DeepInfra API. (2025).\u003c/li\u003e\n\u003cli\u003eBandrowski, A. \u003cem\u003eet al.\u003c/em\u003e The ontology for biomedical investigations. \u003cem\u003ePLOS ONE\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, e0154556 (2016).\u003c/li\u003e\n\u003cli\u003eBeltagy, I., Lo, K. \u0026amp; Cohan, A. SciBERT: A pretrained language model for scientific text. Ai2 (2025).\u003c/li\u003e\n\u003cli\u003eAllen Institute for AI. SciBERT scivocab_uncased at Hugging Face. 
(2025).\u003c/li\u003e\n\u003cli\u003eGonz\u0026aacute;lez-M\u0026aacute;rquez, R., Schmidt, L., Schmidt, B. M., Berens, P. \u0026amp; Kobak, D. The landscape of biomedical research. \u003cem\u003ePatterns\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, 100968 (2024).\u003c/li\u003e\n\u003cli\u003eEnsembl / European Bioinformatics Institute (EMBL-EBI). Ensembl variant recoder API. (2025).\u003c/li\u003e\n\u003cli\u003eHagberg, A. A., Schult, D. A. \u0026amp; Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. in 11\u0026ndash;15 (Pasadena, California, 2008). doi:10.25080/TCWV9851.\u003c/li\u003e\n\u003cli\u003eDrugBank. DrugBank online. \u003cem\u003eDrugBank\u003c/em\u003e https://go.drugbank.com/ (2025).\u003c/li\u003e\n\u003cli\u003eU.S. National Library of Medicine. ClinicalTrials.gov. \u003cem\u003eClinicalTrials.gov\u003c/em\u003e https://clinicaltrials.gov/ (2025).\u003c/li\u003e\n\u003cli\u003eNational Center for Biotechnology Information (NCBI). PubMed. \u003cem\u003ePubMed\u003c/em\u003e https://pubmed.ncbi.nlm.nih.gov/.\u003c/li\u003e\n\u003cli\u003eWosny, M. \u003cem\u003eet al.\u003c/em\u003e Variantscape study design cluster map. https://evidencedb.hastingslab.org/variantscape/studydesignclustermap (2025).\u003c/li\u003e\n\u003cli\u003eFontana, D., Ceccon, M., Gambacorti-Passerini, C. \u0026amp; Mologni, L. Activity of second-generation ALK inhibitors against crizotinib-resistant mutants in an NPM-ALK model compared to EML4-ALK. \u003cem\u003eCancer Med\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, 953\u0026ndash;965 (2015).\u003c/li\u003e\n\u003cli\u003eWosny, M. \u003cem\u003eet al.\u003c/em\u003e Variantscape network graph. https://evidencedb.hastingslab.org/variantscape/networkgraph (2025).\u003c/li\u003e\n\u003cli\u003eJohn, T. \u003cem\u003eet al.\u003c/em\u003e Uncommon EGFR mutations in non-small-cell lung cancer: A systematic literature review of prevalence and clinical outcomes. 
\u003cem\u003eCancer Epidemiol\u003c/em\u003e \u003cstrong\u003e76\u003c/strong\u003e, 102080 (2022).\u003c/li\u003e\n\u003cli\u003eWu, H., Feng, J., Lu, S. \u0026amp; Huang, J. A large-scale, multicenter characterization of BRAF G469V/A-mutant non-small cell lung cancer. \u003cem\u003eCancer Med\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, e7305 (2024).\u003c/li\u003e\n\u003cli\u003eWosny, M. \u003cem\u003eet al.\u003c/em\u003e Variantscape web tool. https://evidencedb.hastingslab.org/variantscape (2025).\u003c/li\u003e\n\u003cli\u003ePich, O. \u003cem\u003eet al.\u003c/em\u003e The translational challenges of precision oncology. \u003cem\u003eCancer Cell\u003c/em\u003e \u003cstrong\u003e40\u003c/strong\u003e, 458\u0026ndash;478 (2022).\u003c/li\u003e\n\u003cli\u003eBlair, L. M. \u003cem\u003eet al.\u003c/em\u003e Oncogenic context shapes the fitness landscape of tumor suppression. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 6422 (2023).\u003c/li\u003e\n\u003cli\u003eXia, Y., Sun, M., Huang, H. \u0026amp; Jin, W.-L. Drug repurposing for cancer therapy. \u003cem\u003eSig Transduct Target Ther\u003c/em\u003e \u003cstrong\u003e9\u003c/strong\u003e, 1\u0026ndash;33 (2024).\u003c/li\u003e\n\u003cli\u003ePrahallad, A. \u003cem\u003eet al.\u003c/em\u003e Unresponsiveness of colon cancer to BRAF(V600E) inhibition through feedback activation of EGFR. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e483\u003c/strong\u003e, 100\u0026ndash;103 (2012).\u003c/li\u003e\n\u003cli\u003eMisale, S. \u003cem\u003eet al.\u003c/em\u003e Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e486\u003c/strong\u003e, 532\u0026ndash;536 (2012).\u003c/li\u003e\n\u003cli\u003eMorris, L. G. T. \u0026amp; Chan, T. A. Therapeutic targeting of tumor suppressor genes. 
\u003cem\u003eCancer\u003c/em\u003e \u003cstrong\u003e121\u003c/strong\u003e, 1357\u0026ndash;1368 (2015).\u003c/li\u003e\n\u003cli\u003eRobson, M. \u003cem\u003eet al.\u003c/em\u003e Olaparib for metastatic breast cancer in patients with a germline BRCA mutation. \u003cem\u003eNew England Journal of Medicine\u003c/em\u003e \u003cstrong\u003e377\u003c/strong\u003e, 523\u0026ndash;533 (2017).\u003c/li\u003e\n\u003cli\u003eLe, D. T. \u003cem\u003eet al.\u003c/em\u003e PD-1 blockade in tumors with mismatch-repair deficiency. \u003cem\u003eNew England Journal of Medicine\u003c/em\u003e \u003cstrong\u003e372\u003c/strong\u003e, 2509\u0026ndash;2520 (2015).\u003c/li\u003e\n\u003cli\u003eGreco, F. A., Labaki, C. \u0026amp; Rassy, E. Molecular diagnosis and site-specific therapy in cancer of unknown primary: An important milestone. \u003cem\u003eThe Lancet Oncology\u003c/em\u003e \u003cstrong\u003e25\u003c/strong\u003e, 955\u0026ndash;956 (2024).\u003c/li\u003e\n\u003cli\u003eIkeda, S., Elkin, S. K., Tomson, B. N., Carter, J. L. \u0026amp; Kurzrock, R. Next-generation sequencing of prostate cancer: Genomic and pathway alterations, potential actionability patterns, and relative rate of use of clinical-grade testing. 
\u003cem\u003eCancer Biol Ther\u003c/em\u003e \u003cstrong\u003e20\u003c/strong\u003e, 219\u0026ndash;226 (2018).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Precision oncology, large language models, molecular variants, next-generation sequencing, cancer variants, gene alterations, natural language processing, network-based graph analysis","lastPublishedDoi":"10.21203/rs.3.rs-6614711/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6614711/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003ePrecision oncology depends on accurate interpretation of molecular variants, yet novel insights are often buried in unstructured literature, described using heterogeneous nomenclature. To address this, we developed \u003cem\u003e\u0026ldquo;Variantscape\u003c/em\u003e,\u0026rdquo; an automated, large-scale pipeline and open-access web tool that integrates natural language processing and large language models to explore variant-cancer-treatment co-associations. 
Of over 2.7\u0026nbsp;million titles and abstracts processed, 7,524 mention all three entity types (variants, cancers, and treatments), spanning 4,029 unique variants, 98 cancer types, and 377 treatments. Co-occurrence and network analyses revealed 15,577 significant co-associations within a graph comprising 4,504 nodes and 48,470 edges. Canonical variants in common cancers, such as \u003cem\u003eBRAF\u003c/em\u003e V600E, had high-confidence treatment associations, while some rare variants showed strong literature-derived signals. By automating discovery and co-association detection, \u003cem\u003e\u0026ldquo;Variantscape\u0026rdquo;\u003c/em\u003e offers a systematic overview of the variant landscape in the literature, enabling scalable insight generation that supports hypothesis generation, uncovers underrecognized connections, reveals novel applications of existing therapies, and advances precision oncology.\u003c/p\u003e","manuscriptTitle":"Variantscape: Using Large Language Models to Build a Comprehensive Landscape of Cancer Variants for Precision Oncology","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-29 15:08:17","doi":"10.21203/rs.3.rs-6614711/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"d113db22-2122-465f-8288-203572a973bf","owner":[],"postedDate":"May 29th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":49195446,"name":"Biological sciences/Cancer/Cancer 
genetics"},{"id":49195447,"name":"Biological sciences/Cancer/Cancer genomics"},{"id":49195448,"name":"Biological sciences/Cancer/Cancer therapy"},{"id":49195449,"name":"Biological sciences/Cancer/Tumour biomarkers"},{"id":49195450,"name":"Biological sciences/Computational biology and bioinformatics/Computational models"},{"id":49195451,"name":"Biological sciences/Computational biology and bioinformatics/Data processing"},{"id":49195452,"name":"Biological sciences/Computational biology and bioinformatics/Databases"},{"id":49195453,"name":"Biological sciences/Computational biology and bioinformatics/Gene ontology"},{"id":49195454,"name":"Biological sciences/Computational biology and bioinformatics/Genome informatics"},{"id":49195455,"name":"Biological sciences/Computational biology and bioinformatics/Literature mining"},{"id":49195456,"name":"Biological sciences/Computational biology and bioinformatics/Machine learning"},{"id":49195457,"name":"Biological sciences/Computational biology and bioinformatics/Predictive medicine"},{"id":49195458,"name":"Biological sciences/Computational biology and bioinformatics/Programming language and code"}],"tags":[],"updatedAt":"2025-06-21T00:08:13+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-29 15:08:17","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6614711","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6614711","identity":"rs-6614711","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
