Fine-tuning sequence to function deep learning models on large-scale proteomic data improves the accuracy of variant effect prediction

Abstract

Fine-tuning sequence to function models has shown promise for variant effect prediction, but accuracy and generalization to unseen genes and unseen individuals remain a standing challenge. We fine-tuned Borzoi on 54,219 individuals and 2,923 circulating plasma proteins from the UK Biobank Plasma Proteomic Project. Across 150 single-gene models, for genes spanning a range of cis-heritability, the fine-tuned Borzoi model improved variant effect prediction for 86% of the genes compared to an Elastic Net baseline model. We demonstrated that the improved prediction stems from the increased sample size, which adds a large number of rare genetic variants (MAF < 0.01) to the training data. Masking rare and uncommon variants nullified the improved performance of fine-tuned Borzoi, and we showed that fine-tuned Borzoi highly weights rare variants (MAF < 0.01), whereas the Elastic Net model highly weights common variants (MAF > 0.05) that are enriched for regulatory regions. We evaluated generalizability by fine-tuning Borzoi models jointly on varying numbers of genes and observed that these models consistently outperform the pre-trained Borzoi model, although the single-gene models yield more accurate results. Together, this work demonstrates the importance of including larger sample sizes and rare variants in sequence to function models for variant effect prediction, and demonstrates the feasibility of highly accurate variant effect prediction with these models.
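
The masking experiment described above depends on classifying variants by minor allele frequency (MAF). The following is a minimal sketch of that idea, not the authors' pipeline: it assumes genotypes are stored as an individuals-by-variants matrix of alternate-allele counts (0, 1, or 2), and it assumes that "masking" means resetting a variant to the homozygous major-allele genotype. The function names `minor_allele_frequency` and `mask_rare_variants` and the toy data are hypothetical.

```python
import numpy as np


def minor_allele_frequency(genotypes: np.ndarray) -> np.ndarray:
    """Per-variant MAF from an (individuals x variants) matrix of 0/1/2 counts."""
    # Diploid assumption: each individual carries two alleles per variant.
    alt_freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def mask_rare_variants(genotypes: np.ndarray, maf_threshold: float = 0.01):
    """Reset variants with MAF below the threshold to the major-allele homozygote.

    Assumption: this is one plausible reading of "masking"; the abstract does
    not specify the exact procedure.
    """
    alt_freq = genotypes.mean(axis=0) / 2.0
    rare = minor_allele_frequency(genotypes) < maf_threshold
    # Homozygous major genotype is 0 when the alternate allele is the minor one.
    major_homozygote = np.where(alt_freq > 0.5, 2, 0)
    masked = genotypes.copy()
    masked[:, rare] = major_homozygote[rare]
    return masked, rare


# Toy data: 200 individuals, 3 variants with 1, 10, and 100 heterozygotes.
genotypes = np.zeros((200, 3), dtype=int)
genotypes[:1, 0] = 1    # alt frequency 1/400  = 0.0025 (rare, MAF < 0.01)
genotypes[:10, 1] = 1   # alt frequency 10/400 = 0.025  (between 0.01 and 0.05)
genotypes[:100, 2] = 1  # alt frequency 100/400 = 0.25  (common, MAF > 0.05)

maf = minor_allele_frequency(genotypes)
masked, rare = mask_rare_variants(genotypes, maf_threshold=0.01)
```

Raising `maf_threshold` to 0.05 would also mask the second toy variant, which corresponds to masking both rare and uncommon variants under the assumption that "uncommon" refers to the range between the two thresholds quoted in the abstract.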
