DeepGEOSearch: LLM-Powered Schemaless Retrieval for Biomedical Data Discovery

Preprint · OA: closed
ABSTRACT

Public biomedical repositories contain extensive, valuable datasets, yet identifying datasets that precisely match specific study requirements remains inefficient. Conventional keyword- and schema-based systems frequently fall short when queries encompass multiple biological and experimental facets. To address this, we developed DeepGEOSearch, an LLM-powered, schema-less retrieval system that interprets dataset text directly rather than relying solely on predefined metadata fields. The system extracts key study attributes, harmonizes terminology across sources, and ranks datasets by contextual alignment with natural-language queries, while providing verifiable evidence for each match. In this work, we applied DeepGEOSearch to the Gene Expression Omnibus (GEO). The framework is repository-agnostic and can integrate new metadata sources without manual relabeling. This design enables complex compositional queries that existing tools cannot support. In evaluations against strong baselines on a curated benchmark spanning a range of query complexities, DeepGEOSearch achieved more than 90% precision and the best recall, with the largest performance gains observed on complex, real-world queries. DeepGEOSearch consistently identifies relevant datasets overlooked by conventional search tools, accelerating dataset discovery and improving reuse of public biomedical data.

Competing Interest Statement

The authors have declared no competing interest.
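
The abstract outlines a flow of attribute extraction, terminology harmonization, ranking against a natural-language query, and evidence reporting. The sketch below illustrates only the shape of that flow and is not the authors' implementation: it replaces the LLM with plain substring matching, and the synonym table, function names, dataset identifiers (DS1 to DS3), and dataset descriptions are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical synonym table (an assumption for this sketch): maps surface
# forms found in free text to one canonical term per study attribute.
SYNONYMS = {
    "homo sapiens": "human",
    "human": "human",
    "mus musculus": "mouse",
    "mouse": "mouse",
    "rna-seq": "rna-seq",
    "rna sequencing": "rna-seq",
    "breast cancer": "breast cancer",
    "breast carcinoma": "breast cancer",
}


@dataclass
class Match:
    accession: str
    score: float    # fraction of query facets found in the dataset text
    evidence: dict  # canonical term -> sentence that supports the match


def split_sentences(text):
    """Split text into sentences at whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_attributes(text):
    """Return {canonical term: first sentence mentioning it}.

    Toy stand-in for LLM-based extraction: case-insensitive substring lookup.
    """
    found = {}
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for surface, canonical in SYNONYMS.items():
            if surface in lowered and canonical not in found:
                found[canonical] = sentence
    return found


def rank(query, datasets):
    """Rank datasets by the share of query facets they cover, best first."""
    wanted = set(extract_attributes(query))
    results = []
    for accession, text in datasets.items():
        attrs = extract_attributes(text)
        hits = {term: attrs[term] for term in wanted if term in attrs}
        score = len(hits) / len(wanted) if wanted else 0.0
        results.append(Match(accession, score, hits))
    # Ties are broken by accession so the ordering is deterministic.
    return sorted(results, key=lambda m: (-m.score, m.accession))


if __name__ == "__main__":
    # Made-up dataset descriptions, not real GEO records.
    datasets = {
        "DS1": "RNA sequencing of breast carcinoma biopsies. "
               "Samples were collected from Homo sapiens donors.",
        "DS2": "Microarray profiling of mouse liver.",
        "DS3": "RNA-seq of human lung tissue.",
    }
    for match in rank("human breast cancer RNA-seq", datasets):
        print(match.accession, round(match.score, 2), match.evidence)
```

In this toy version, the query facets "human", "breast cancer", and "RNA-seq" match DS1 even though its text says "Homo sapiens", "breast carcinoma", and "RNA sequencing", and each match carries the sentence it came from. A real system would need far richer extraction and scoring than a fixed lookup table.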

This is a recent paper (2025).

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00