Methodological pitfalls in plant pangenome gene family identification may lead to biased evolutionary inferences

Shuotong Liu
Wei Zhang
Pei Yu

Abstract

Pangenome-level gene family identification often applies sequence similarity clustering without phylogenetic or synteny information, which risks biologically misleading evolutionary inferences. Using five transcription factor families (bHLH, MYB, NAC, WRKY, MADS-box) across 401 rice pangenome accessions, we compared five clustering strategies: OrthoFinder alone, cd-hit alone, MMseqs2 alone, and OrthoFinder-informed refinement by either cd-hit or MMseqs2. Methods based solely on sequence similarity merged distinct orthogroups and generated fewer orthogroups than approaches incorporating graph-based orthology. Conflicting cluster assignments, measured against OrthoFinder, varied strongly among families, from approximately 14% in MADS-box to approximately 57% in MYB, and were associated with protein length differences. Core, shell, and cloud gene classifications shifted substantially depending on the method, especially in the MYB, NAC, and WRKY families. Critically, Ka/Ks distributions for core genes were highly method-sensitive: orthology-aware methods yielded more convergent and less variable estimates of selective pressure, whereas estimates for noncore genes remained robust across methods. These findings demonstrate that neglecting graph-based orthogroup inference inflates methodological artifacts. We recommend a two-step strategy: initial graph-based orthogroup delineation followed by sequence similarity refinement to balance evolutionary accuracy and resolution in pangenome-scale gene family studies.
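
As an illustration only, the two comparisons described in the abstract could be sketched in Python as follows. The abstract does not give its formulas, so everything here is an assumption: a gene is counted as "conflicting" when its candidate cluster mixes genes from more than one reference (for example OrthoFinder) orthogroup, "core" means present in every accession, and "cloud" means present in at most `cloud_max` accessions. The function names, the input dictionaries, and the thresholds are hypothetical and not taken from the preprint.

```python
from collections import defaultdict


def conflict_rate(reference, candidate):
    """Fraction of genes whose candidate cluster mixes reference orthogroups.

    Both arguments map gene ID -> cluster ID. This definition of
    "conflict" is an assumption, not the preprint's own measure.
    """
    shared = set(reference) & set(candidate)
    if not shared:
        return 0.0
    # Collect the reference orthogroups observed inside each candidate cluster.
    refs_in_cluster = defaultdict(set)
    for gene in shared:
        refs_in_cluster[candidate[gene]].add(reference[gene])
    conflicting = sum(
        1 for gene in shared if len(refs_in_cluster[candidate[gene]]) > 1
    )
    return conflicting / len(shared)


def classify_clusters(presence, n_accessions, cloud_max=2):
    """Label each cluster as core, shell, or cloud.

    presence maps cluster ID -> set of accession IDs carrying that cluster.
    Thresholds are hypothetical: core = all accessions,
    cloud = at most cloud_max accessions, shell = everything in between.
    """
    labels = {}
    for cluster, accessions in presence.items():
        n = len(accessions)
        if n == n_accessions:
            labels[cluster] = "core"
        elif n <= cloud_max:
            labels[cluster] = "cloud"
        else:
            labels[cluster] = "shell"
    return labels


# Toy data: candidate cluster C1 merges reference orthogroups OG1 and OG2.
reference = {"g1": "OG1", "g2": "OG1", "g3": "OG2", "g4": "OG3"}
candidate = {"g1": "C1", "g2": "C1", "g3": "C1", "g4": "C2"}
rate = conflict_rate(reference, candidate)

presence = {"C1": {"a", "b", "c"}, "C2": {"a"}, "C3": {"a", "b"}}
labels = classify_clusters(presence, n_accessions=3, cloud_max=1)
```

In the toy data, three of the four genes sit in a cluster that merges two reference orthogroups, so the conflict rate is 0.75. Because the core, shell, and cloud labels are derived from cluster membership, a method that merges orthogroups also changes these labels, which is the kind of method dependence the abstract reports.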

Version published to 10.64898/2026.05.15.725319 on bioRxiv
May 18, 2026
