Fitness translocation: improving variant effect prediction with biologically-grounded data augmentation
Abstract
Predicting the functional effects of protein variants (variant effect prediction) is essential in protein engineering but remains challenging because fitness data for training prediction models are scarce. To address this limitation, we introduce a data augmentation strategy called fitness translocation, which leverages variant fitness data from homologous proteins to enhance prediction models for a target protein. Using embeddings from protein language models, our method computes the differences between a homolog's wild type and its variants, then applies these differences to the target wild type to generate synthetic target variants in the embedding space. We evaluate this approach on three protein families (IGPS, GFP, and SARS-CoV-2 spike proteins) under various prediction models and training data sizes. Fitness translocation consistently improves prediction accuracy, especially when training data are limited. Moreover, accuracy improves even between remote homologs with sequence identity as low as 35%. These results highlight the potential for data-efficient protein engineering by reusing fitness data previously accumulated in homologs. The code is available at https://github.com/adrienmialland/ProtFitTrans.
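The embedding-space step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that). The function name `translocate`, the array shapes, and the random placeholder embeddings are all hypothetical, and the choice to label each synthetic variant with the fitness of its source homolog variant is an assumption made here for illustration.

```python
import numpy as np


def translocate(homolog_wt, homolog_variants, target_wt):
    """Apply homolog variant offsets to the target wild type embedding.

    homolog_wt:       (d,) embedding of the homolog wild type
    homolog_variants: (n, d) embeddings of the homolog variants
    target_wt:        (d,) embedding of the target wild type
    Returns an (n, d) array of synthetic target variant embeddings.
    """
    deltas = homolog_variants - homolog_wt  # offset of each variant from its wild type
    return target_wt + deltas               # same offsets, re-centred on the target


# Placeholder data: random vectors stand in for protein language model embeddings.
rng = np.random.default_rng(0)
dim, n_homolog, n_target = 8, 5, 3
homolog_wt = rng.normal(size=dim)
homolog_variants = homolog_wt + 0.1 * rng.normal(size=(n_homolog, dim))
homolog_fitness = rng.normal(size=n_homolog)
target_wt = rng.normal(size=dim)
target_variants = target_wt + 0.1 * rng.normal(size=(n_target, dim))
target_fitness = rng.normal(size=n_target)

synthetic_variants = translocate(homolog_wt, homolog_variants, target_wt)

# Assumption: each synthetic variant inherits the fitness of its homolog variant.
augmented_X = np.vstack([target_variants, synthetic_variants])
augmented_y = np.concatenate([target_fitness, homolog_fitness])
```

The arrays `augmented_X` and `augmented_y` could then be passed to any regression model in place of the smaller target-only training set.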