Adding layers of information to scRNA-seq data using pre-trained language models

doi:10.1101/2025.08.23.671699

2025 · doi:10.1101/2025.08.23.671699

preprint OA: closed

Abstract

Pre-trained language models promise to enrich analyses of single-cell data with additional layers of information leveraging large text corpora. Yet, it is still unclear how to achieve optimal alignment with the primary quantitative single-cell data. To address this, we construct text-based training datasets from both scRNA-seq data and biomedical literature targeted to the experimental setting at hand. We then jointly train language models on both information sources to learn a common, literature-enriched representation. Our examples on functionality, disease associations, and temporal trajectories show the potential of knowledge-augmented embeddings as a generalizable and interpretable strategy for enriching single-cell analysis pipelines.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

Streamlining story line and revision of positioning within current literature. Updated Results section incorporating further results on integration of disease and temporal meta-data as well as updated corresponding Methods section.
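
The abstract describes building text-based training datasets from both scRNA-seq data and biomedical literature, but it does not say how expression profiles are turned into text. As a rough illustration only, the sketch below assumes one common convention: each cell becomes a "sentence" of gene names ordered by decreasing expression, and these sentences are pooled with literature snippets into one corpus. The function names `cell_to_sentence` and `build_corpus`, the `top_k` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def cell_to_sentence(expression: Dict[str, float], top_k: int = 5) -> str:
    """Turn one cell's expression profile into a space-separated gene 'sentence'.

    ASSUMPTION: rank-ordering gene names by expression is an illustrative
    encoding, not necessarily the one used by the authors.
    """
    # Keep only genes that are actually expressed in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending expression; break ties alphabetically for determinism.
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    return " ".join(gene for gene, _ in ranked[:top_k])


def build_corpus(
    cells: List[Dict[str, float]], literature: List[str], top_k: int = 5
) -> List[str]:
    """Pool cell sentences and literature snippets into one training corpus."""
    cell_sentences = [cell_to_sentence(cell, top_k) for cell in cells]
    return cell_sentences + list(literature)


if __name__ == "__main__":
    # Toy, made-up expression values purely for demonstration.
    toy_cells = [
        {"CD3E": 4.0, "CD8A": 2.5, "MS4A1": 0.0, "GZMB": 3.1},
        {"MS4A1": 5.2, "CD79A": 4.8, "CD3E": 0.0},
    ]
    toy_literature = ["CD8A marks cytotoxic T cells."]
    for line in build_corpus(toy_cells, toy_literature, top_k=3):
        print(line)
```

A corpus of this shape could then be passed to any standard language-model training routine, so that gene tokens and literature tokens share one vocabulary. Whether the authors pool the two sources this way or align them differently is not stated in the abstract.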
