Research on crop phenotype prediction using SNP context and whole-genome feature embedding



Abstract

Modern agriculture demands precise genomic prediction to accelerate elite crop breeding, yet traditional genomic prediction approaches, such as genomic best linear unbiased prediction (GBLUP) and Bayesian methods, focus primarily on the cumulative effect of individual SNPs and thus neglect the concerted influence that the surrounding sequence context has on the phenotype. To overcome this limitation, we propose two novel feature embedding modes (SNP-context and whole-genome) based on DNABERT-2, a cross-species genomic foundation model that uses self-attention mechanisms and transfer learning to automatically identify conserved sequence features across diverse evolutionary lineages without prior biological assumptions. The whole-genome feature embedding aggregates genomic information at a global scale by pooling vectors from chunked sequences processed by DNABERT-2, whereas the SNP-context feature embedding captures local information by directly encoding variable-length (500–3000 bp) sequences centered on target SNPs. To reduce noise in the high-dimensional feature embeddings, we employed principal component analysis (PCA) and partial least squares (PLS) to project the features into a lower-dimensional space. We generated both kinds of feature embedding for three crop datasets (rice413, rice395, and maize301), investigated the impact of 500–3000 bp flanking SNP contexts on phenotypic prediction, and compared how prediction accuracy varies across algorithms at 4–768 feature dimensions under PCA, PLS, and no dimensionality reduction. The results demonstrate that machine learning (ML) algorithms operating under the SNP-context embedding mode achieve greater accuracy and lower mean absolute errors (MAEs) than traditional SNP features do at specific context lengths, particularly for traits with low-to-moderate heritability (h² ∈ (0.2, 0.7]).
In contrast, using whole-genome embeddings as input for ML can further improve the prediction accuracy for highly heritable traits (h² ∈ (0.7, 1.0]), even outperforming state-of-the-art deep learning models (such as DNNGP and ResGS) that rely on SNP markers. Our code is available at https://github.com/oliveSpring/Crop_DNA_Embedding.git
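
The two embedding modes described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_encode` is a hypothetical stand-in for a DNABERT-2 forward pass (it returns only base frequencies), mean pooling is an assumption (the abstract says only that chunk vectors are pooled), and all function names, the clipping behavior at sequence ends, and the chunk size are chosen purely for illustration.

```python
from typing import Callable, List

Vector = List[float]


def toy_encode(seq: str) -> Vector:
    """Hypothetical stand-in for a DNABERT-2 encoder: frequencies of A, C, G, T."""
    n = max(len(seq), 1)
    return [seq.count(base) / n for base in "ACGT"]


def snp_context(genome: str, pos: int, length: int) -> str:
    """Return the window of `length` bp around the 0-based SNP index `pos`.

    The window is clipped (not padded) where it runs past either end of
    the sequence; that choice is an assumption of this sketch.
    """
    start = pos - length // 2
    end = start + length
    return genome[max(start, 0):min(end, len(genome))]


def chunk(seq: str, size: int) -> List[str]:
    """Split a sequence into consecutive, non-overlapping chunks of `size` bp."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def context_embedding(genome: str, pos: int, length: int,
                      encode: Callable[[str], Vector]) -> Vector:
    """SNP-context mode: encode the local window around one SNP."""
    return encode(snp_context(genome, pos, length))


def whole_genome_embedding(genome: str, chunk_size: int,
                           encode: Callable[[str], Vector]) -> Vector:
    """Whole-genome mode: encode every chunk, then pool into one vector."""
    return mean_pool([encode(c) for c in chunk(genome, chunk_size)])
```

In practice `encode` would wrap the real foundation model, and the resulting vectors would be passed through PCA or PLS before being fed to an ML regressor; those steps are omitted here.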
