Methodological pitfalls in plant pangenome gene family identification may lead to biased evolutionary inferences

preprint OA: closed
Abstract

Pangenome-level gene family identification often applies sequence similarity clustering without phylogenetic or synteny information, which risks biologically misleading evolutionary inferences. Using five transcription factor families (bHLH, MYB, NAC, WRKY, MADS-box) across 401 rice pangenome accessions, we compared clustering strategies: OrthoFinder alone, cd-hit alone, MMseqs2 alone, and OrthoFinder-informed refinement by cd-hit or MMseqs2. Methods solely based on sequence similarity merged distinct orthogroups and generated fewer orthogroups than approaches incorporating graph-based orthology. Conflicting cluster assignments, measured against OrthoFinder, varied strongly among families, from approximately 14% in MADS-box to approximately 57% in MYB, and were associated with protein length differences. Core, shell, and cloud gene classifications shifted substantially depending on the method, especially in MYB, NAC, and WRKY families. Critically, Ka/Ks distributions for core genes were highly method-sensitive, with orthology-aware methods yielding more convergent and less variable estimates of selective pressure, whereas noncore gene estimates remained robust. These findings demonstrate that neglecting graph-based orthogroup inference inflates methodological artifacts. We recommend a two-step strategy: initial graph-based orthogroup delineation followed by sequence similarity refinement to balance evolutionary accuracy and resolution in pangenome-scale gene family studies.

Competing Interest Statement

The authors have declared no competing interest.
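
The abstract does not define how conflicting assignments or core/shell/cloud categories were computed, so the following Python sketch is only one plausible reading, not the authors' pipeline. It assumes a gene is "conflicting" when its similarity-based cluster contains genes from more than one reference orthogroup (the merging behavior described above), and it uses hypothetical presence thresholds (`core_min`, `shell_min`) for the core/shell/cloud split. All function names, parameters, and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def conflict_fraction(reference, candidate):
    """Fraction of shared genes placed in a candidate cluster that mixes
    genes from more than one reference orthogroup.

    Both arguments map gene ID -> cluster label. This definition of
    "conflict" is an assumption; the source does not spell one out.
    """
    shared = [g for g in candidate if g in reference]
    if not shared:
        return 0.0
    # Collect the reference orthogroups represented in each candidate cluster.
    ref_groups = defaultdict(set)
    for gene in shared:
        ref_groups[candidate[gene]].add(reference[gene])
    conflicting = sum(1 for g in shared if len(ref_groups[candidate[g]]) > 1)
    return conflicting / len(shared)


def classify_family(n_present, n_accessions, core_min=1.0, shell_min=0.15):
    """Label a gene family by the fraction of accessions that carry it.

    core_min and shell_min are hypothetical cutoffs chosen for illustration.
    """
    freq = n_present / n_accessions
    if freq >= core_min:
        return "core"
    if freq >= shell_min:
        return "shell"
    return "cloud"


# Toy data: the similarity-only clustering merges OG1 and OG2 into c1.
reference = {"g1": "OG1", "g2": "OG1", "g3": "OG2", "g4": "OG3"}
candidate = {"g1": "c1", "g2": "c1", "g3": "c1", "g4": "c2"}

rate = conflict_fraction(reference, candidate)
labels = [classify_family(n, 401) for n in (401, 200, 3)]
```

In this toy case three of the four genes sit in a merged cluster, and the three families fall into core, shell, and cloud respectively. Changing the assumed thresholds shifts those labels, which is the kind of method sensitivity the abstract reports.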


Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2026) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00