PyamilySeq: Exposing the fragility of conventional gene (re)clustering and pangenomic inference methods

Abstract

Pangenomics, the identification of shared genes across a taxonomic range, is essential for understanding microbial genetic diversity and functional capacities.

This study introduces PyamilySeq, a flexible and transparent framework designed to systematically identify challenges in gene clustering and pangenomic analysis and to support the development of practical solutions. By evaluating widely used gene clustering and pangenome tools, we show how clustering thresholds (often hardcoded or provided without justification) and paralog handling affect gene family composition. More strikingly, although such tools are typically run on the assumption that broadly equivalent parameters will yield consistent results, this study demonstrates that parameters unrelated to clustering thresholds, such as the decimal precision of a parameter value (0.8 vs. 0.80), output selection, and even CPU and memory allocation, can alter gene family assignments. Additionally, sequence clustering and pangenome tools often fail to report biologically meaningful or representative sequences for gene families, further complicating downstream analyses.
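One way to check whether two runs of a clustering tool produced the same gene families is to compare the families as sets of member genes, ignoring the arbitrary cluster labels. The sketch below is illustrative only and is not PyamilySeq's own method: the function names `families` and `compare_runs`, the gene and cluster identifiers, and the toy assignments are all hypothetical, and it assumes each run has already been parsed into a mapping from gene ID to cluster label.

```python
from collections import defaultdict


def families(assignments):
    """Group gene IDs by cluster label and return the set of gene families.

    Each family is a frozenset of gene IDs, so two runs can be compared
    even when they label the same cluster differently.
    """
    groups = defaultdict(set)
    for gene, cluster in assignments.items():
        groups[cluster].add(gene)
    return {frozenset(members) for members in groups.values()}


def compare_runs(run_a, run_b):
    """Count families shared by both runs and families unique to each run."""
    fam_a, fam_b = families(run_a), families(run_b)
    return {
        "shared": len(fam_a & fam_b),
        "only_a": len(fam_a - fam_b),
        "only_b": len(fam_b - fam_a),
    }


# Hypothetical toy assignments (gene ID -> cluster label) standing in for
# two runs that differ only in how a parameter was written (0.80 vs. 0.8).
run_080 = {"g1": "c1", "g2": "c1", "g3": "c2", "g4": "c3"}
run_08 = {"g1": "x1", "g2": "x1", "g3": "x2", "g4": "x2"}

result = compare_runs(run_080, run_08)
print(result)
```

In this toy case the two runs agree on one family (g1 and g2) and disagree on how g3 and g4 are grouped. A non-zero count in either "only" field indicates that gene family assignments changed between the runs.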

This work highlights key limitations in current gene clustering and pangenome methodologies, demonstrating their potential to influence biological interpretations in fundamental ways. To advance the field, we must prioritise adaptable and transparent approaches that refine gene clustering methodologies and move beyond rigid, one-size-fits-all tools and parameter choices.
