When Task-Specific Learning Outperforms Transfer Learning: A Benchmark of Gene and Expression Encoding Strategies

preprint OA: closed
Abstract

Single-cell foundational models have emerged as a powerful tool for learning generalizable cellular representations from large-scale data. Most models in this domain use transformer backbones, which require careful engineering of gene and expression encoding strategies, yet there is no consensus on which encoding techniques are effective. While benchmarking efforts to date have focused on evaluating downstream applications using already pretrained models, we take a fundamentally different approach: we isolate different encoding paradigms and systematically compare them by training models from scratch under controlled conditions. Moreover, we scale pretraining to 10 million cells across 100 diverse datasets, a tenfold increase compared to similar studies. Through empirical experiments, we find that, contrary to common assumptions, pretrained embeddings from large protein models such as ESM-2 consistently underperform task-specific learned embeddings. Our work provides clear empirical guidance for model design decisions and establishes a systematic benchmark for evaluating encoding strategies in single-cell foundational models.

Competing Interest Statement

The authors have declared no competing interest.

Year: 2025

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00