A comparative ML approach to classify Lupinus species using VIS-NIR spectral data from entire seeds and various data transformation techniques and resampling methods
Abstract
Interest in cultivating and using Lupinus species is growing, driven by their nutritional value and potential for sustainable agriculture. This study evaluates five machine learning algorithms for classifying seven Lupinus species using visible and near-infrared (VIS-NIR) spectral data (reflectance and absorbance) from whole seeds of the official active collection at the CICYTEX Germplasm Bank, a dataset characterized by class imbalance. Both raw data and four hybrid data transformation techniques were analyzed. To address the class imbalance, six resampling methods were applied and compared against the original, non-resampled dataset. Two validation approaches were employed: a simple split (80% training, 20% testing) and stratified K-fold cross-validation (K=5). Random Forest and Support Vector Classification achieved the highest F1-score (>94%) and AUC (>97%) across all techniques. Logistic regression also performed well when combined with the hybrid transformation techniques. Cross-validation confirmed model robustness and generalization. These findings demonstrate that combining non-destructive spectral analysis with machine learning is effective for taxonomic identification and genetic resource management in germplasm collections. Furthermore, this approach may facilitate rapid, objective, and cost-effective selection of Lupinus ecotypes in breeding programs, and may improve traceability and conservation for sustainable agriculture. Increasing the representation of minority species and validating the models in external environments are recommended to maximize applicability.
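As an illustration of the stratified K-fold validation (K=5) mentioned in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It shows how samples can be dealt into folds so that each fold keeps the class proportions of an imbalanced dataset, and how an F1-score can be averaged per class. The function names `stratified_kfold` and `macro_f1`, the label names, and the class counts are hypothetical. Macro averaging is an assumption, because the abstract does not state which F1 averaging was used.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced label vector (class names and counts are made up).
labels = ["species_a"] * 50 + ["species_b"] * 20 + ["species_c"] * 10

for train_idx, test_idx in stratified_kfold(labels, k=5):
    # Each test fold should mirror the 50:20:10 class ratio of the full set.
    print(Counter(labels[i] for i in test_idx))
```

In a real pipeline, any resampling of minority classes would normally be applied to the training indices of each fold only, so that the test fold still reflects the original class distribution.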