Research on Crop Phenotype Prediction Methods Based on SNP-context and Whole-genome Features Embedding


Abstract

Modern agriculture demands precise genomic prediction to accelerate elite crop breeding, yet current methods predominantly analyze single nucleotide polymorphisms (SNPs) while neglecting the regulatory context of the surrounding sequences. To overcome this limitation, this study proposes two novel feature embedding modes (SNP-context and whole-genome) that capture both local and global genetic information. Both modes were implemented in a unified computational framework that combines the DNABERT-2 cross-species genomic foundation model with optimized dimensionality reduction techniques, namely Partial Least Squares and Principal Component Analysis, to mitigate noise in high-dimensional feature spaces. The performance of classical machine learning algorithms (Support Vector Regression, ridge regression Best Linear Unbiased Prediction, Random Forest, and Gradient Boosting Regression) was evaluated across different feature dimensions (4 to 768), dimensionality reduction methods, and SNP-context lengths (500 to 3000 bp) on three crop datasets: rice413, rice395, and maize301. The results demonstrate that, while both embedding modes have limitations in predicting traits with ultra-low heritability in rice395, the SNP-context feature embedding significantly outperforms SNP marker-based prediction for traits with low-to-medium heritability in rice413 and maize301, achieving higher accuracy and lower mean absolute error (MAE). For some traits, the accuracy of the SNP-context embedding mode even surpasses that of the whole-genome feature embedding. For traits with high heritability, the whole-genome feature embedding mode, combined with conventional machine learning algorithms, significantly improves prediction accuracy and reduces MAE compared with state-of-the-art deep learning models such as DNNGP and ResGS. These findings indicate that the embeddings can capture non-additive genetic effects often neglected by conventional SNP-based models, and they provide new insights into the genetic architecture of complex traits.
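The workflow described in the abstract (embedding features, then dimensionality reduction, then a classical regressor scored by accuracy and MAE) can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are random synthetic data standing in for DNABERT-2 output, plain ridge regression stands in for the evaluated regressors, Pearson correlation is assumed as the accuracy measure, and the function names (`pca_fit`, `ridge_fit`, `evaluate`) and all parameter values are hypothetical.

```python
import numpy as np


def pca_fit(X, k):
    """Return the training mean and the top-k principal axes of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def pca_transform(X, mean, components):
    """Project X onto the principal axes learned from the training set."""
    return (X - mean) @ components.T


def ridge_fit(Z, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    w = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Z.shape[1]), Zc.T @ (y - y_mean))
    return w, y_mean - z_mean @ w


def evaluate(X, y, k=16, alpha=1.0, n_folds=5, seed=0):
    """K-fold cross-validation; PCA is refit on each training fold only."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    corrs, maes = [], []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        mean, comps = pca_fit(X[train_idx], k)
        w, b = ridge_fit(pca_transform(X[train_idx], mean, comps), y[train_idx], alpha)
        pred = pca_transform(X[test_idx], mean, comps) @ w + b
        corrs.append(np.corrcoef(pred, y[test_idx])[0, 1])
        maes.append(np.mean(np.abs(pred - y[test_idx])))
    return float(np.mean(corrs)), float(np.mean(maes))


# Synthetic placeholder data (assumption): 300 accessions, 768-dimensional
# embeddings driven by 8 latent factors, and a trait that depends on them.
rng = np.random.default_rng(42)
n_samples, emb_dim, n_latent = 300, 768, 8
latent = rng.normal(size=(n_samples, n_latent))
X = latent @ rng.normal(size=(n_latent, emb_dim)) + 0.1 * rng.normal(size=(n_samples, emb_dim))
y = latent @ rng.normal(size=n_latent) + 0.5 * rng.normal(size=n_samples)

r, mae = evaluate(X, y)
print(f"mean Pearson r = {r:.3f}, mean MAE = {mae:.3f}")
```

Fitting the projection inside each training fold keeps test samples from influencing the reduced features. Varying `k` over a range such as 4 to 768 would mirror the dimension sweep the abstract mentions.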
