Methodological pitfalls in plant pangenome gene family identification may lead to biased evolutionary inferences

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Pangenome-level gene family identification often applies sequence similarity clustering without phylogenetic or synteny information, which risks biologically misleading evolutionary inferences. Using five transcription factor families (bHLH, MYB, NAC, WRKY, MADS-box) across 401 rice pangenome accessions, we compared clustering strategies: OrthoFinder alone, cd-hit alone, MMseqs2 alone, and OrthoFinder-informed refinement by cd-hit or MMseqs2. Methods solely based on sequence similarity merged distinct orthogroups and generated fewer orthogroups than approaches incorporating graph-based orthology. Conflicting cluster assignments, measured against OrthoFinder, varied strongly among families, from approximately 14% in MADS-box to approximately 57% in MYB, and were associated with protein length differences. Core, shell, and cloud gene classifications shifted substantially depending on the method, especially in MYB, NAC, and WRKY families. Critically, Ka/Ks distributions for core genes were highly method-sensitive, with orthology-aware methods yielding more convergent and less variable estimates of selective pressure, whereas noncore gene estimates remained robust. These findings demonstrate that neglecting graph-based orthogroup inference inflates methodological artifacts. We recommend a two-step strategy: initial graph-based orthogroup delineation followed by sequence similarity refinement to balance evolutionary accuracy and resolution in pangenome-scale gene family studies.

Article activity feed