The use of cross-validation has overestimated the value of genomic selection in plant breeding
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Genomic Selection (GS) is widely considered to be a transformative approach for plant breeding, and has been the subject of well over a thousand papers since its proposal 25 years ago. The reduced costs of marker genotyping and genome sequencing, the proliferation of powerful statistical methods, and innovative breeding schemes that leverage GS have promised a revolution in the speed, efficiency, and precision of plant breeding. However, clear evidence of dramatically improved breeding outcomes using GS is difficult to find in the literature. I argue here that the most commonly presented evidence of GS success, namely high estimated accuracies of Genomic Prediction (GP) models as evaluated by cross-validation, may be giving a highly misleading impression about the value of GS, at least in moderate-sized breeding programs. Estimating GP accuracy by cross-validation is only appropriate when GS is used to increase selection intensity, one of four key control parameters of the breeder's equation and usually the least cost-effective way to increase genetic gain. If GS is instead used to increase the accuracy of selection among a fixed set of candidates or used to speed up breeding cycles, cross-validation-based estimates can be dramatically inaccurate, in ways that differ among breeding populations and traits. Instead, I show that analytical expressions and computational simulations are more informative about the likelihood of success of GS than cross-validation, and can be more effectively employed to evaluate GS program design.
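For readers unfamiliar with the breeder's equation mentioned in the abstract, the sketch below uses its standard textbook form: gain per year = i × r × σ_A / L, where i is selection intensity, r is selection accuracy, σ_A is the additive genetic standard deviation, and L is the cycle length in years. The function name and every numeric value are hypothetical, chosen only to show how the four parameters enter. None of them come from the preprint.

```python
def genetic_gain_per_year(intensity, accuracy, sigma_a, cycle_years):
    """Expected genetic gain per year under the textbook breeder's equation.

    intensity   -- standardized selection intensity (i)
    accuracy    -- correlation between selection criterion and true breeding value (r)
    sigma_a     -- additive genetic standard deviation (sigma_A)
    cycle_years -- length of one breeding cycle in years (L)
    """
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return intensity * accuracy * sigma_a / cycle_years


# Hypothetical baseline scheme: higher accuracy, 4-year cycle.
baseline = genetic_gain_per_year(intensity=1.5, accuracy=0.6, sigma_a=1.0, cycle_years=4.0)

# Hypothetical rapid-cycling scheme: lower accuracy, 1-year cycle.
rapid = genetic_gain_per_year(intensity=1.5, accuracy=0.4, sigma_a=1.0, cycle_years=1.0)

print(f"baseline gain/year: {baseline:.3f}")
print(f"rapid-cycle gain/year: {rapid:.3f}")
```

The result is only as reliable as the accuracy value supplied, and that is the quantity the abstract argues cross-validation can misestimate when GS is used to raise accuracy or shorten cycles.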
Article activity feed
-
Instead, I show that analytical expressions and computational simulations are more informative about the likelihood of success of GS than cross-validation, and can be more effectively employed to evaluate GS program design.
I really enjoyed reading this preprint. I think similar lessons about what exactly CV evaluates, depending on your strategy, have been (re)learned several times across deep learning in biology.
-
predictand
typo
-
foundations plant
typo
-
Thus the gains in Figure 1 (particularly B and C) are likely optimistic. The low predicted gains from Genomic Prediction in small, diverse populations (second panel of Figure 1D) are particularly concerning, as this indicates that such Genomic Prediction models are likely to have very low accuracy.
There are also other secondary costs to aggressive recurrent GS, such as drift in traits for which building GP models is not feasible.
-
Genomic Prediction, except under limited narrow contexts, perhaps including CV2 and CV0 breeding scenarios where the goal is closer to phenotype prediction than breeding value prediction.
I understand the rationale for why the current study is set up to ignore CV0/CV2 scenarios. However, I would argue that CV2 is likely one of the more promising study designs for genomic prediction, at least for commercial breeding programs. By combining sparse testing with GP, one can test material across a broader environmental footprint for a fixed budget. The GP-based breeding values from such a study design are particularly valuable because they allow more relevant selection decisions, as commercial breeding programs often aim for broad-acreage products as the end goal.
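To make the sparse-testing idea concrete, here is a minimal sketch of how hold-out sets might differ between two schemes. It assumes the usage common in the genomic prediction literature: CV1 masks every record of some lines (untested lines), while CV2 masks only some line-by-environment cells, so each line is still observed somewhere. The preprint's exact definitions may differ, and the function names, line labels, and environment labels are all hypothetical.

```python
import random


def cv1_holdout(lines, envs, frac, seed=0):
    """CV1-style mask (assumed definition): hide whole lines in every environment."""
    rng = random.Random(seed)
    n_out = max(1, round(frac * len(lines)))
    held_lines = set(rng.sample(lines, n_out))
    return {(line, env) for line in held_lines for env in envs}


def cv2_holdout(lines, envs, frac, seed=0):
    """CV2-style mask (assumed definition): hide some environments per line,
    always leaving each line observed in at least one environment."""
    rng = random.Random(seed)
    n_out = min(len(envs) - 1, max(1, round(frac * len(envs))))
    held = set()
    for line in lines:
        for env in rng.sample(envs, n_out):
            held.add((line, env))
    return held


# Hypothetical trial layout: 10 lines, 4 environments.
lines = [f"line{i}" for i in range(10)]
envs = ["E1", "E2", "E3", "E4"]

cv1_cells = cv1_holdout(lines, envs, frac=0.2)
cv2_cells = cv2_holdout(lines, envs, frac=0.5)

print(len(cv1_cells), "cells masked under CV1")
print(len(cv2_cells), "cells masked under CV2")
```

Under the CV2-style mask every line keeps phenotypes in some environments, which is what lets a fixed plot budget cover a broader environmental footprint.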
-