When to cluster phenotypic data? A simulation-based framework to guide decisions in agrobiodiversity research


Abstract

Phenotypic clustering is a cornerstone of population structure analysis in agrobiodiversity research, especially for neglected and underutilized species (NUS), for which genomic data are scarce. However, no formal method currently exists to determine whether a given dataset contains enough biological signal to justify clustering, so spurious patterns risk being overinterpreted. To address this, we introduce a signal-first diagnostic framework. The framework requires phenotypic differentiation to be assessed before any unsupervised classification and provides clear, data-driven thresholds for deciding whether clustering is statistically meaningful. We developed the framework through a large-scale, empirically grounded simulation study. Using realistic trait architectures calibrated on fonio (Digitaria exilis), we evaluated 11 clustering algorithms across a continuous gradient of phenotypic differentiation (Pst = 0.05–0.85). Our results establish quantitative detectability thresholds: under the calibrated trait architecture, clustering fails to recover meaningful structure below Pst ≈ 0.30, a range typical of many NUS. Even the best-performing algorithm required Pst > 0.47 to reach moderate accuracy. We further demonstrate that internal validation metrics (e.g., the Silhouette score) are unreliable under weak differentiation and often misleadingly suggest robust clusters. The proposed framework shifts the analytical paradigm from algorithm selection to signal assessment. We provide practical guidelines and an openly available simulation template to help researchers implement this workflow, thereby supporting more reliable diversity assessments, core collection design, and germplasm management decisions in data-scarce systems.
