Automating Candidate Gene Prioritization with Large Language Models: From Naive Scoring to Literature-Grounded Validation

doi:10.1101/2025.09.17.676837

2025 · doi:10.1101/2025.09.17.676837

preprint OA: closed

Abstract

Background

Identifying promising therapeutic targets from thousands of genes in transcriptomic studies remains a major bottleneck in biomedical research. While large language models (LLMs) show potential for gene prioritization, they suffer from hallucination and lack systematic validation against expert knowledge.

Methods

We developed a two-stage computational framework that combines LLM-based screening with literature validation for systematic gene prioritization. Starting with 10,824 genes from the BloodGen3 repertoire, we applied multi-criteria evaluation for sepsis relevance, followed by retrieval-augmented generation (RAG) using 6,346 curated sepsis publications. A novel faithfulness evaluation system verified that LLM predictions aligned with retrieved literature evidence.
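
The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`stage1_screen`, `retrieve`, `faithfulness`), the mean-score threshold, the keyword retriever, and the keyword-overlap faithfulness rule are all assumptions made for the example. A real pipeline would call an LLM for scoring and a vector database for retrieval.

```python
from dataclasses import dataclass


@dataclass
class GeneScore:
    gene: str
    criteria: dict[str, float]  # hypothetical per-criterion scores, e.g. 0-10


def stage1_screen(scores, min_mean=5.0):
    """Stage 1 (assumed rule): keep genes whose mean criterion score meets a threshold."""
    kept = []
    for s in scores:
        mean = sum(s.criteria.values()) / len(s.criteria)
        if mean >= min_mean:
            kept.append(s.gene)
    return kept


def retrieve(gene, corpus, k=3):
    """Stage 2a (toy retriever): return up to k passages mentioning the gene symbol."""
    hits = [p for p in corpus if gene.lower() in p.lower()]
    return hits[:k]


def faithfulness(claims, passages):
    """Stage 2b (toy check): fraction of claims whose every keyword appears
    in at least one retrieved passage."""
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        words = claim.lower().split()
        if any(all(w in p.lower() for w in words) for p in passages):
            supported += 1
    return supported / len(claims)


# Illustrative inputs only; scores, passages, and claims are invented for the sketch.
scores = [
    GeneScore("IL10", {"biomarker": 8, "therapeutic": 7}),
    GeneScore("GENE_X", {"biomarker": 2, "therapeutic": 3}),
]
corpus = [
    "IL10 levels are elevated in sepsis patients.",
    "TREM1 amplifies inflammation.",
]
kept = stage1_screen(scores)
passages = retrieve("IL10", corpus)
claims = ["IL10 elevated sepsis", "IL10 protects kidney"]
print(kept, faithfulness(claims, passages))
```

The point of the second stage is that a claim counts only if the retrieved text supports it, so an unsupported statement lowers the score rather than passing through unchecked.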

Results

The framework identified 609 sepsis-relevant genes with >94% filtering efficiency, demonstrating strong enrichment for inflammatory pathways including TNF-α signaling, complement activation, and interferon responses. Literature validation yielded 30 ultra-high confidence therapeutic candidates, including both established sepsis genes (IL10, TREM1, S100A9, NLRP3) and novel targets warranting investigation. Benchmark validation against expert-curated databases achieved 71.2% recall, with systematic correlation between computational confidence and evidence quality. The final candidate set balanced discovery (11 novel genes) with validation (19 known genes), maintaining biological coherence throughout the filtering process.
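
The headline figures above can be cross-checked with simple arithmetic. The sketch below assumes that "filtering efficiency" means the fraction of the 10,824 starting genes removed by the first stage; the abstract does not define the term, so that reading is an assumption.

```python
total_genes = 10_824   # BloodGen3 starting set (from the abstract)
retained = 609         # sepsis-relevant genes after screening (from the abstract)

# Assumed definition: share of genes filtered out by stage 1.
removed_fraction = 1 - retained / total_genes
print(f"filtered out: {removed_fraction:.1%}")

novel, known = 11, 19  # composition of the final candidate set (from the abstract)
final_candidates = novel + known
print(f"final candidates: {final_candidates}, novel share: {novel / final_candidates:.1%}")
```

Under that reading, about 94.4% of genes are removed, consistent with the reported ">94%", and the novel and known counts sum to the 30 reported candidates.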

Conclusions

This framework demonstrates that rigorous methodology can transform unreliable LLM outputs into systematically validated biological insights. By combining computational efficiency with literature grounding, the approach provides a practical tool for prioritizing experimental validation efforts. The modular design enables adaptation to other diseases through knowledge base substitution, offering a systematic approach to literature-guided biomarker discovery.

Availability

Source code and implementation details are available at https://github.com/taushifkhan/llm-geneprioritization-framework, the vector database at https://doi.org/10.5281/zenodo.15802241, and an interactive demonstration at https://llm-geneprioritization.streamlit.app/

Competing Interest Statement

The authors have declared no competing interest.
