Diagnosing phenotypic signal before clustering: A simulation-based decision framework for agrobiodiversity studies

Abstract

Unsupervised clustering is widely applied to phenotypic data to explore population structure and guide decisions in agrobiodiversity research, particularly for neglected and underutilized species where genomic information is scarce. However, phenotypic datasets often exhibit weak differentiation, strong trait covariance, heteroscedasticity, and uneven sampling, raising fundamental questions about the reliability of clustering outcomes under such conditions. Here, we propose a signal-first diagnostic framework that evaluates the strength of phenotypic differentiation prior to clustering, rather than treating clustering as a default exploratory step. Using an empirically calibrated simulation design informed by trait distributions and covariance patterns observed in fonio (Digitaria exilis), we quantify clustering recoverability across a continuous gradient of phenotypic differentiation (Pst = 0.05–0.85) for eleven commonly used algorithms. Our results indicate that, under realistic trait architectures, meaningful recovery is not achievable below Pst ≈ 0.30 across the evaluated methods, and that internal validation metrics may provide misleading support for structure in low-signal regimes. The proposed framework offers a practical, transferable workflow for diagnosing when phenotypic clustering is informative, thereby supporting more robust interpretation of phenotypic diversity in data-constrained agrobiodiversity studies.
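
To make the signal-first idea concrete, the following is a minimal, illustrative sketch and not the authors' simulation design. It assumes the commonly used single-trait definition Pst = (c/h²)σ²_B / ((c/h²)σ²_B + 2σ²_W) with c/h² = 1, two balanced groups, unit within-group variance, a hand-written one-dimensional 2-means clusterer, and the adjusted Rand index as the recovery measure. None of these choices is stated in the abstract, and all function names (`pst`, `simulate_two_groups`, `two_means_1d`, `adjusted_rand_index`, `recovery_at`) are hypothetical.

```python
import math
import random
from collections import Counter


def pst(groups, c_over_h2=1.0):
    """Estimate Pst for one trait from balanced groups (one-way ANOVA components).

    Assumed definition: Pst = r*var_B / (r*var_B + 2*var_W), with r = c/h^2.
    """
    k, n = len(groups), len(groups[0])  # assumes equal group sizes
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    var_b = max((msb - msw) / n, 0.0)  # truncate negative estimates at zero
    num = c_over_h2 * var_b
    return num / (num + 2.0 * msw)


def simulate_two_groups(target_pst, n, rng):
    """Draw two groups with unit within-group variance and means at -d and +d.

    With c/h^2 = 1, var_B = 2*d^2 and var_W = 1, so d = sqrt(Pst / (1 - Pst)).
    """
    d = math.sqrt(target_pst / (1.0 - target_pst))
    return [[rng.gauss(-d, 1.0) for _ in range(n)],
            [rng.gauss(d, 1.0) for _ in range(n)]]


def two_means_1d(values, iters=50):
    """Plain Lloyd iterations for k = 2 on a single trait."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        new0, new1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (new0, new1) == (c0, c1):
            break
        c0, c1 = new0, new1
    return labels


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same items."""
    def pairs(x):
        return x * (x - 1) / 2.0
    sum_ij = sum(pairs(c) for c in Counter(zip(a, b)).values())
    sum_a = sum(pairs(c) for c in Counter(a).values())
    sum_b = sum(pairs(c) for c in Counter(b).values())
    expected = sum_a * sum_b / pairs(len(a))
    return (sum_ij - expected) / ((sum_a + sum_b) / 2.0 - expected)


def recovery_at(target_pst, n=200, seed=0):
    """Return (estimated Pst, ARI of 2-means against the true groups)."""
    rng = random.Random(seed)
    groups = simulate_two_groups(target_pst, n, rng)
    truth = [0] * n + [1] * n
    labels = two_means_1d(groups[0] + groups[1])
    return pst(groups), adjusted_rand_index(truth, labels)


if __name__ == "__main__":
    for target in (0.05, 0.30, 0.85):
        est, ari = recovery_at(target)
        print(f"target Pst={target:.2f}  estimated Pst={est:.2f}  ARI={ari:.2f}")
```

Running the script sweeps three points of the gradient reported in the abstract. This toy setup has one trait, no covariance, and equal variances, so it is far more favorable than the trait architectures the study describes. Its numbers should not be read as reproducing the reported Pst ≈ 0.30 threshold.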
