Early Digital Selection: Machine Learning and XAI Approaches for Predicting Six-Month Body Weight in Hair Goats
Abstract
Early identification of animals with high growth potential is essential for improving efficiency and sustainability in goat breeding systems. This study developed a machine learning predictive framework to estimate six-month live weight (LW2) from early-life phenotypic and environmental data, using 52,938 Hair goat records collected between 2011 and 2024. Five models (Linear Regression, Ridge Regression, Decision Tree, Random Forest, and XGBoost) were compared within a common pipeline incorporating Winsorization, group-based cross-validation, and hyperparameter optimization. XGBoost achieved the highest performance (R² = 0.842; MAE = 1.96 kg), reducing prediction error by approximately 36% relative to the linear baseline (R² = 0.753; MAE = 2.54 kg). The 36% figure is consistent with error measured as unexplained variance (1 − R² falls from 0.247 to 0.158); the corresponding reduction in MAE is about 23%. The model demonstrated strong generalization capacity and a biologically consistent residual structure. Explainable AI (XAI) analyses confirmed that weaning weight (LW1) is the primary predictor of LW2, serving as an integrated proxy for genetic potential and early developmental conditions. Notably, weaning age emerged as a nonlinear determinant of growth efficiency despite a weak linear correlation with LW2, which underscores the advantage of tree-based ensembles in capturing maturation dynamics. A physiologically meaningful decision threshold (LW1 ≈ 21.7 kg) was identified, separating two growth trajectories whose predicted weights differ by approximately 9 kg. These findings demonstrate that routinely collected early-life data can be turned into a reliable tool for early selection. The proposed Early Digital Selection framework supports proactive, data-driven breeding decisions at weaning, with measurable improvements in prediction accuracy and resource allocation efficiency under heterogeneous field conditions.
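The two preprocessing and validation steps named in the abstract, Winsorization and group-based cross-validation, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The abstract does not state the Winsorization cut-offs or the grouping variable, so the percentile limits, the `herd_id` grouping, and the sample weights below are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clip values to the given percentiles (cut-offs are assumed, not from the study)."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def group_k_fold(groups, k=5):
    """Yield (train, test) index lists in which no group appears on both sides."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Place the largest groups first, each into the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[g])
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield train_idx, sorted(test_idx)


# Made-up weaning weights (kg) with one implausible record, and hypothetical herd labels.
weights = [18.2, 19.5, 20.1, 21.7, 22.4, 23.0, 24.8, 60.0]
herd_id = ["A", "A", "B", "B", "C", "C", "D", "D"]

clipped = winsorize(weights, lower_pct=5, upper_pct=95)
splits = list(group_k_fold(herd_id, k=2))
for train, test in splits:
    print("train herds:", sorted({herd_id[i] for i in train}),
          "| test herds:", sorted({herd_id[i] for i in test}))
```

Winsorizing pulls extreme records toward the bulk of the data rather than discarding them. Keeping each group entirely in either the training or the test fold prevents related records from inflating the validation score.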