Research on crop phenotype prediction using SNP context and whole-genome feature embedding

Huan Li
Yunpeng Cui
Tan Sun
Ting Wang
Zhen Chen
Chao Wang
Wenbo Bian
Juan Liu
Mo Wang
Li Chen
Jinming Wu
Jie Huang

Abstract

Modern agriculture demands precise genomic prediction to accelerate elite crop breeding, yet traditional genomic prediction approaches, such as genomic best linear unbiased prediction (GBLUP) and Bayesian methods, focus primarily on the cumulative effect of individual SNPs, thus neglecting the concerted influence that the surrounding sequence context has on the phenotype. To overcome these limitations, we propose two novel feature embedding modes (SNP-context and whole-genome) based on DNABERT-2, a cross-species genomic foundation model that uses self-attention mechanisms and transfer learning to automatically identify conserved sequence features across diverse evolutionary lineages without prior biological assumptions. The whole-genome feature embedding aggregates genomic information at a global scale by pooling vectors from chunked sequences processed by DNABERT-2, whereas the context feature embedding captures local information by directly encoding variable-length (500–3000 bp) sequences centered on target SNPs. To reduce noise in the high-dimensional feature embeddings, we employed principal component analysis (PCA) and partial least squares (PLS) to project the features into a lower-dimensional space. We generated two kinds of feature embedding for three crop datasets (rice413, rice395, and maize301), investigated the impact of 500–3000 bp flanking SNP contexts on phenotypic prediction, and compared prediction accuracy variations across algorithms at 4–768 feature dimensions among the PCA, PLS, and no dimensionality reduction strategies. The results demonstrate that machine learning (ML) algorithms operating under the SNP-context embedding mode achieve greater accuracy and lower mean absolute errors (MAEs) than traditional SNP features do at specific context lengths, particularly for traits with low-to-moderate heritability (h² ∈ (0.2, 0.7]). In contrast, using whole-genome embeddings as input for ML can further improve the prediction accuracy for highly heritable traits (h² ∈ (0.7, 1.0]), even outperforming state-of-the-art deep learning models (such as DNNGP and ResGS) that rely on SNP markers. Our code is available on https://github.com/oliveSpring/Crop_DNA_Embedding.git
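
The two embedding modes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see their repository for that). The function names, the clipping of windows at sequence ends, the use of mean pooling, and the `toy_embed` stand-in (base frequencies in place of a DNABERT-2 forward pass) are all assumptions made for illustration. The abstract does not say how per-SNP context vectors are combined into one feature vector per sample, so the sketch returns them as a list.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def snp_context(genome: str, pos: int, length: int) -> str:
    """Return up to `length` bases around 0-based SNP position `pos`.

    Assumption: windows that run past either end of the sequence are
    clipped rather than padded.
    """
    half = length // 2
    start = max(0, pos - half)
    end = min(len(genome), start + length)
    return genome[start:end]


def chunk(genome: str, size: int) -> List[str]:
    """Split a sequence into consecutive, non-overlapping chunks."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a collection of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def toy_embed(seq: str) -> Vector:
    """Hypothetical stand-in for a DNABERT-2 embedding: A/C/G/T frequencies."""
    n = max(len(seq), 1)
    return [seq.count(base) / n for base in "ACGT"]


def whole_genome_embedding(
    genome: str, chunk_size: int, embed: Callable[[str], Vector] = toy_embed
) -> Vector:
    """Embed each chunk, then pool the chunk vectors into one global vector."""
    return mean_pool([embed(c) for c in chunk(genome, chunk_size)])


def snp_context_embeddings(
    genome: str,
    snp_positions: Sequence[int],
    length: int,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[Vector]:
    """Embed the window around each SNP; returns one vector per SNP."""
    return [embed(snp_context(genome, p, length)) for p in snp_positions]


if __name__ == "__main__":
    genome = "ACGT" * 4  # toy 16 bp sequence
    print(whole_genome_embedding(genome, chunk_size=8))
    print(snp_context_embeddings(genome, snp_positions=[0, 8, 15], length=4))
```

In practice `embed` would wrap a real model call, and the resulting vectors would then be reduced with PCA or PLS before being passed to the ML algorithm, as the abstract describes.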

Version published to 10.1101/2025.04.08.647762 on bioRxiv
Apr 14, 2025

Related articles

Combining genomic prediction and multi-trait indices through stochastic simulations: do index type and deployment order affect genetic gain?

This article has 6 authors:
1. Roberto Fritsche-Neto
2. Lorena Gabriela Coelho Queiroz
3. Jesimiel Viana
4. Kajal Gupta
5. Kashish Grover
6. Júlio César DoVale
This article has no evaluations
Latest version Dec 17, 2025

An Explainable Machine Learning Framework for Predicting Hybrid Maize Performance Using Genomic and Phenotypic Data

This article has 4 authors:
1. DanielRaj K
2. RobinsonJoel M
3. Japhynth J
4. Jasperline T
This article has no evaluations
Latest version Jan 16, 2026

Bayesian fine-mapping pinpoints candidate genes and pleiotropic loci of production traits from a chicken backcrossing scheme

This article has 8 authors:
1. Chi Mei Sun
2. Johannes Geibel
3. Henner Simianer
4. Björn Andersson
5. David Cavero
6. Rudolf Preisinger
7. Steffen Weigend
8. Christian Reimer
This article has no evaluations
Latest version Jan 13, 2026
