Leveraging protein language models and scoring function for Indel characterisation and transfer learning
Abstract
Protein language models (PLMs) are increasingly used to assess the impact of genetic variation on proteins. Using sequence information alone, PLMs achieve high accuracy and can outperform traditional pathogenicity predictors that were specifically designed to identify disease-causing variants. PLMs can also perform zero-shot inference, making predictions without task-specific fine-tuning, which offers a simpler alternative to complex methods that is less prone to overfitting. However, studying in-frame insertions and deletions (indels) with PLMs remains challenging. Indels alter protein length, which complicates direct comparison between wildtype and mutant sequences. Additionally, indel pathogenicity is less studied than that of other genetic variants, such as single nucleotide variants, so annotated datasets are scarce. Despite these challenges, approaches that leverage PLMs through transfer learning have emerged, making it possible to capture the features needed for more accurate predictions. Still, the current approaches are limited in the organisms and indel lengths they support and in their interpretability. In this work, we devise a new scoring approach for indel pathogenicity prediction (IndeLLM) that accounts for the difference in protein length between wildtype and mutant. Our method uses only sequence information and zero-shot inference, requires a fraction of the computing time, and achieves performance similar to that of other indel pathogenicity predictors. Building on this scoring approach, we constructed a simple transfer learning setup based on a Siamese network, which outperformed all tested indel pathogenicity prediction methods (Matthews correlation coefficient = 0.77). Because PLMs are trained on diverse protein sequences, IndeLLM is applicable across species. To enhance accessibility, we designed a plug-and-play Google Colab notebook that allows easy use of IndeLLM and visualisation of the impact of indels on protein sequence and structure.
The tool is available on GitHub at https://github.com/OriolGraCar/IndeLLM and on Google Colab at https://colab.research.google.com/drive/1CgwprttaNFR_KeJGyFzP0a0C9Y wc4P.
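To illustrate why the difference in protein length matters for scoring, the following minimal Python sketch compares a wildtype and a mutant sequence by their mean per-residue log-probability, a quantity that stays comparable when an indel changes the number of residues. It also computes the Matthews correlation coefficient, the metric quoted in the abstract. The log-probabilities are made-up numbers standing in for PLM output, and the function names (`mean_log_likelihood`, `indel_score`, `mcc`) are hypothetical. The length normalisation shown here is an assumption chosen for illustration and is not necessarily the scoring function implemented in IndeLLM.

```python
import math
from typing import Sequence


def mean_log_likelihood(log_probs: Sequence[float]) -> float:
    """Average per-residue log-probability (independent of sequence length)."""
    if not log_probs:
        raise ValueError("log_probs must contain at least one residue")
    return sum(log_probs) / len(log_probs)


def indel_score(wt_log_probs: Sequence[float], mut_log_probs: Sequence[float]) -> float:
    """Mutant minus wildtype mean log-likelihood.

    Assumption for this sketch: a more negative score means the PLM finds
    the mutant sequence less plausible than the wildtype.
    """
    return mean_log_likelihood(mut_log_probs) - mean_log_likelihood(wt_log_probs)


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = pathogenic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical per-residue log-probabilities; a real pipeline would
    # obtain these from a PLM for each sequence.
    wildtype = [-0.5, -0.4, -0.6, -0.5]   # 4 residues
    mutant = [-0.5, -1.9, -1.6]           # 3 residues after a 1-residue deletion
    print("indel score:", indel_score(wildtype, mutant))

    # Hypothetical labels and thresholded predictions.
    labels = [1, 1, 0, 0]
    predictions = [1, 0, 0, 0]
    print("MCC:", mcc(labels, predictions))
```

Summing log-probabilities instead of averaging them would systematically favour the shorter sequence, which is one reason a length-aware comparison is needed before wildtype and mutant scores can be contrasted.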