GeneCAD: Plant Genome Annotation with a DNA Foundation Model

Abstract

Accurate genome annotation remains a bottleneck in plants, where polyploidy and repeat-rich sequence confound homology- and RNA-based pipelines. We introduce GeneCAD, a sequence-only method that predicts complete plant gene models directly from DNA. GeneCAD couples representations from a plant DNA foundation model, PlantCAD2, with a lightweight ModernBERT encoder and a chromosome-wide conditional random field that enforces splice phase and feature order, and it applies a protein language-model screen to suppress repeat-driven open reading frames. To limit label noise, we rank and filter public annotations using a sequence-based masked-motif score and fine-tune on five phylogenetically diverse, high-quality references. Across five held-out angiosperms, including the allotetraploid Nicotiana tabacum, GeneCAD improves transcript-level F1 by 8–10% on average over Helixer and BRAKER3, increases the number of exact-match transcripts, and sharpens boundaries at start/stop codons and splice junctions. By removing dependence on species-matched RNA-seq or proteomics while retaining cross-species accuracy, GeneCAD provides an accurate, scalable route to biologically coherent plant gene models from DNA alone.
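The abstract reports transcript-level F1 and exact-match transcripts without defining them. As a rough illustration only, the sketch below assumes one common convention: a predicted transcript counts as a true positive when its sequence ID, strand, and full ordered list of coding-exon coordinates match a reference transcript exactly. The `Transcript` representation and the `transcript_f1` function are hypothetical names introduced here, not part of GeneCAD, and the preprint's own evaluation may differ.

```python
from typing import Iterable, Tuple

# Hypothetical representation (an assumption, not GeneCAD's data model):
# (sequence ID, strand, ordered tuple of (start, end) coding-exon intervals).
Transcript = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def transcript_f1(
    predicted: Iterable[Transcript], reference: Iterable[Transcript]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match transcript scoring."""
    pred = set(predicted)
    ref = set(reference)
    true_positives = len(pred & ref)  # transcripts identical in every exon
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: one of two predictions matches one of two reference transcripts.
reference = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1000),)),
]
predicted = [
    ("chr1", "+", ((100, 200), (300, 400))),  # exact match
    ("chr1", "-", ((905, 1000),)),            # start boundary off by 5 bp
]
precision, recall, f1 = transcript_f1(predicted, reference)
```

Under this strict definition, a single shifted exon boundary turns an otherwise correct transcript into both a false positive and a false negative, which is why sharper boundaries at start/stop codons and splice junctions would be expected to move transcript-level F1.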
