GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Abstract
Accurate genome annotation remains a bottleneck in plants, where polyploidy and repeat-rich sequence confound homology- and RNA-based pipelines. We introduce GeneCAD, a sequence-only method that predicts complete plant gene models directly from DNA. GeneCAD couples representations from a plant DNA foundation model, PlantCAD2, with a lightweight ModernBERT encoder and a chromosome-wide conditional random field that enforces splice phase and feature order, and applies a protein language-model screen to suppress repeat-driven open reading frames. To limit label noise, we rank and filter public annotations using a sequence-based masked-motif score and fine-tune on five phylogenetically diverse, high-quality references. Across five held-out angiosperms, including the allotetraploid Nicotiana tabacum, GeneCAD improves transcript-level F1 by 8–10% on average over Helixer and BRAKER3, increases the number of exact-match transcripts, and sharpens boundaries at start/stop codons and splice junctions. By removing dependence on species-matched RNA-seq or proteomics while retaining cross-species accuracy, GeneCAD provides an accurate, scalable route to biologically coherent plant gene models from DNA alone.
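To make the transcript-level metric concrete, the following is a minimal sketch of how an exact-match transcript F1 could be computed. It is not the preprint's evaluation code: the transcript representation (chromosome, strand, and set of coding intervals), the function name `transcript_f1`, and the toy coordinates are all assumptions made for illustration, and the actual benchmark may define a match differently.

```python
from typing import FrozenSet, Set, Tuple

# Hypothetical representation: (chromosome, strand, set of (start, end) coding intervals).
Transcript = Tuple[str, str, FrozenSet[Tuple[int, int]]]


def transcript_f1(
    predicted: Set[Transcript], reference: Set[Transcript]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact structural matches."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented coordinates, not from the preprint).
reference = {
    ("chr1", "+", frozenset({(100, 200), (300, 450)})),
    ("chr1", "-", frozenset({(900, 1100)})),
}
predicted = {
    ("chr1", "+", frozenset({(100, 200), (300, 450)})),  # every boundary agrees
    ("chr1", "-", frozenset({(900, 1090)})),             # one boundary off by 10 bp
}

precision, recall, f1 = transcript_f1(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Under this definition, a single misplaced boundary turns an otherwise correct transcript into both a false positive and a false negative, which is why sharper start/stop and splice-junction boundaries would feed directly into the transcript-level score.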