Identification of Sugar-Related Genes in Sugar Beet (Beta vulgaris) Through Comparative Genomics and Machine Learning Approaches


Abstract

Despite sugar beet's importance as a major sucrose source, comprehensive identification of genes underlying sugar metabolism remains incomplete. We developed an integrated approach combining comparative genomics with advanced machine learning to systematically catalog sugar-related genes in the Beta vulgaris genome. Analysis of the EL10 reference genome identified 18,223 high-quality protein-coding genes, of which 91.6% showed orthology to Arabidopsis thaliana proteins. Traditional keyword-based screening identified 310 sugar-related genes, of which 286 exhibited high-confidence orthology (E-value < 1×10⁻⁵⁰) to Arabidopsis proteins. To overcome limitations of keyword approaches, we implemented zero-shot classification using transformer-based Sentence-BERT embeddings to identify genes through semantic similarity to sugar-related concepts, independent of explicit nomenclature. This machine learning strategy identified 1,999 candidate genes, including 1,736 novel candidates absent from keyword results—an 85% expansion of the sugar gene catalog. Despite this novelty, 84.8% of keyword-identified genes were also detected by machine learning, validating the approach. Multiple high-confidence predictions corresponded to experimentally validated genes in published studies. This framework establishes transformer-based semantic analysis as a powerful complement to traditional annotation, with broad applicability for functional gene discovery in crop genomics.
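The overlap figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported above. It assumes that every machine learning candidate not counted as novel is one of the 310 keyword-identified genes, which the abstract implies but does not state outright. The variable names are illustrative and do not come from the authors' pipeline.

```python
# Counts reported in the abstract.
keyword_genes = 310     # genes found by keyword-based screening
ml_candidates = 1999    # genes found by zero-shot classification
ml_novel = 1736         # ML candidates absent from the keyword results

# Assumption: every non-novel ML candidate is one of the keyword genes.
shared = ml_candidates - ml_novel

# Share of keyword-identified genes that the ML approach also detected.
keyword_recovered_pct = round(100 * shared / keyword_genes, 1)

# Size of the combined catalog and the share of it that is ML-only.
combined = keyword_genes + ml_novel
novel_share_pct = round(100 * ml_novel / combined, 1)

print(f"shared genes: {shared}")
print(f"keyword genes recovered by ML: {keyword_recovered_pct}%")
print(f"combined catalog size: {combined}")
print(f"ML-only share of combined catalog: {novel_share_pct}%")
```

Under that assumption, 263 genes are shared, which reproduces the reported 84.8% recovery of keyword-identified genes. The ML-only candidates make up roughly 85% of the combined catalog of 2,046 genes. This may be what the "85% expansion" figure refers to, though that reading is an inference rather than something the abstract confirms.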
