Identification of Sugar-Related Genes in Sugar Beet (Beta vulgaris) Through Comparative Genomics and Machine Learning Approaches

Sara Behnamian

Abstract

Despite sugar beet's importance as a major sucrose source, comprehensive identification of genes underlying sugar metabolism remains incomplete. We developed an integrated approach combining comparative genomics with advanced machine learning to systematically catalog sugar-related genes in the Beta vulgaris genome. Analysis of the EL10 reference genome identified 18,223 high-quality protein-coding genes, of which 91.6% showed orthology to Arabidopsis thaliana proteins. Traditional keyword-based screening identified 310 sugar-related genes, of which 286 exhibited high-confidence orthology (E-value < 1×10⁻⁵⁰) to Arabidopsis proteins. To overcome limitations of keyword approaches, we implemented zero-shot classification using transformer-based Sentence-BERT embeddings to identify genes through semantic similarity to sugar-related concepts, independent of explicit nomenclature. This machine learning strategy identified 1,999 candidate genes, including 1,736 novel candidates absent from keyword results—an 85% expansion of the sugar gene catalog. Despite this novelty, 84.8% of keyword-identified genes were also detected by machine learning, validating the approach. Multiple high-confidence predictions corresponded to experimentally validated genes in published studies. This framework establishes transformer-based semantic analysis as a powerful complement to traditional annotation, with broad applicability for functional gene discovery in crop genomics.
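
The zero-shot step described in the abstract can be illustrated with a minimal sketch: each gene annotation is embedded, compared by cosine similarity to a few sugar-related concept phrases, and kept if its best score clears a threshold. Everything below is an assumption made for illustration and is not the authors' pipeline. The gene IDs, annotations, concept phrases, and threshold are hypothetical, and `toy_embed` is a bag-of-words stand-in for a Sentence-BERT encoder, so it only captures shared words rather than true semantic similarity.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: sparse word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_candidates(annotations, concepts, threshold, embed=toy_embed):
    """Return {gene_id: best_score} for genes whose annotation is close
    to at least one concept phrase. No labelled training data is used."""
    concept_vecs = [embed(c) for c in concepts]
    hits = {}
    for gene_id, description in annotations.items():
        vec = embed(description)
        score = max(cosine(vec, c) for c in concept_vecs)
        if score >= threshold:
            hits[gene_id] = score
    return hits


# Hypothetical annotations; real input would be the genome's functional descriptions.
annotations = {
    "gene_a": "sucrose synthase",
    "gene_b": "sugar transporter protein",
    "gene_c": "ribosomal protein L3",
}
concepts = ["sucrose transport", "sugar metabolism"]  # assumed concept phrases

hits = zero_shot_candidates(annotations, concepts, threshold=0.3)
```

With a dense sentence encoder, `toy_embed` and `cosine` would both be replaced by their dense-vector equivalents; the thresholding logic stays the same. The threshold value itself is a tuning choice that trades recall of novel candidates against false positives.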

Version published to 10.21203/rs.3.rs-7950380/v1 on Research Square, Mar 21, 2026

Related articles

Protein language model embeddings enable proteome-wide discovery of plant defense gene networks across species

This article has 2 authors:
1. Sara Behnamian
2. Naghmeh Boyouk
This article has no evaluations. Latest version: Mar 31, 2026

Genome-wide identification of sORFs in indica rice and their comparative transcriptomic analysis under stress

This article has 5 authors:
1. SHEUE NI ONG
2. Boon Chin Tan
3. Kousuke Hanada
4. Hui Zhao
5. Chee How Teo
This article has no evaluations. Latest version: Mar 25, 2026

Genome-wide identification and transcriptome analysis revealed candidate genes controlling plant height in soybean using YHSBLP

This article has 8 authors:
1. Hongyan Yang
2. Song Xue
3. Wenqing Yu
4. Wenliang Yan
5. Qingyuan He
6. Huiquan Shen
7. Tuanjie Zhao
8. Yinghu Zhang
This article has no evaluations. Latest version: Mar 19, 2026
