The promise of AlphaFold for gene structure annotation
Abstract
Background
As sequencing technology improves, more genomes become available. Most lack annotation; automated methods are error-prone, and few genomes are ever manually curated because of time and cost. Protein structure prediction software may provide new angles for assessing and improving gene models without requiring experimental data. In this paper, we explore whether scores from protein structure prediction can help assess gene model quality. We chose three species (Fusarium graminearum, Toxoplasma gondii, and Aspergillus fumigatus) from the VEuPathDB database, which have collectively undergone more than 1000 manual curation events. We modelled translations of the gene models with AlphaFold 3, before and after curation, and collected a range of scores. We then ran structure searches against the PDB with Foldseek and sequence-based domain identification with InterProScan. We profiled the scores produced by these methods to identify those best suited to gene model assessment.

Results
AlphaFold 3 scores strongly favoured manually curated gene models over their pre-curation versions, supporting 75% of manually curated changes in F. graminearum, 65% in T. gondii, and 84% in A. fumigatus. The lower percentage in T. gondii is attributed to a high level of disorder. Combining scores across multiple tools (AlphaFold 3, Foldseek, and InterProScan) further improved model scoring.

Conclusion
Overall, the most discriminative scores combined outputs of AlphaFold 3 and Foldseek. Our results therefore highlight the potential of scores derived from deep learning-based protein structure prediction for scoring gene models in the absence of experimental data. Future work should focus on intrinsically disordered regions and on developing integrated tools to apply this approach.
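The before-and-after comparison described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: that a curation event counts as "supported" when the curated model's score exceeds the pre-curation score, and that a single per-model score (for example, a mean pLDDT-style confidence value) is being compared. The names `CurationEvent` and `fraction_supported`, the `min_delta` parameter, and the toy values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CurationEvent:
    """One manual curation event with a score for each version of the gene model."""
    gene_id: str
    score_before: float  # assumed: structure-prediction score of the pre-curation model
    score_after: float   # assumed: the same score for the manually curated model


def fraction_supported(events, min_delta=0.0):
    """Return the fraction of events whose curated model outscores the original.

    An event counts as supported when score_after - score_before > min_delta.
    This definition is an assumption made for illustration only.
    """
    if not events:
        raise ValueError("no curation events supplied")
    supported = sum(1 for e in events if e.score_after - e.score_before > min_delta)
    return supported / len(events)


# Invented toy data, not results from the paper.
events = [
    CurationEvent("geneA", 62.1, 78.4),
    CurationEvent("geneB", 80.0, 79.2),
    CurationEvent("geneC", 45.3, 70.9),
    CurationEvent("geneD", 55.0, 55.0),
]

print(f"{fraction_supported(events):.0%} of toy curation events supported")
```

Raising `min_delta` makes the criterion stricter, which may be useful if small score differences are considered noise; the appropriate threshold would have to be chosen empirically.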