IFF : Identifying key residues in intrinsically disordered regions of proteins using machine learning

Abstract

Conserved residues in protein homolog sequence alignments are structurally or functionally important. For intrinsically disordered proteins or proteins with intrinsically disordered regions (IDRs), however, alignment often fails because they lack a steric structure to constrain evolution. Although sequences vary, the physicochemical features of IDRs may be preserved in maintaining function. Therefore, a method to retrieve common IDR features may help identify functionally important residues. We applied unsupervised contrastive learning to train a model with self‐attention neuronal networks on human IDR orthologs. Parameters in the model were trained to match sequences in ortholog pairs but not in other IDRs. The trained model successfully identifies previously reported critical residues from experimental studies, especially those with an overall pattern (e.g., multiple aromatic residues or charged blocks) rather than short motifs. This predictive model can be used to identify potentially important residues in other proteins, improving our understanding of their functions. The trained model can be run directly from the Jupyter Notebook in the GitHub repository using Binder ( mybinder.org ). The only required input is the primary sequence. The training scripts are available on GitHub ( https://github.com/allmwh/IFF ). The training datasets have been deposited in an Open Science Framework repository ( https://osf.io/jk29b ).

This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/7631234.

This work addresses the difficulty of identifying conserved amino acid residues in intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) of proteins. While primary sequence alignment can provide insight into the evolution of structured proteins, little can be extracted about IDP/Rs because of their low sequence similarity and lack of structure and predicted function.

In this study, the authors attempt to address this gap by using machine learning to find highly conserved residues in human protein orthologs containing IDP/Rs that give rise to liquid-liquid phase separation (LLPS). The authors applied unsupervised contrastive ML to find the most highly conserved residues, which might indicate critical functional importance.
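To make the contrastive idea concrete for readers, the following is a minimal sketch of an InfoNCE-style objective in which embeddings of ortholog pairs are pulled together and embeddings of unrelated IDRs are pushed apart. It is not the authors' implementation: the loss form, the temperature value, the embedding size, and the names `info_nce_loss`, `human`, and `ortholog` are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the manuscript).

    anchors[i] and positives[i] are embeddings of one ortholog pair;
    every other row in the batch serves as a negative example.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct partner of row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy embeddings: each "ortholog" is a slightly perturbed copy of its partner.
rng = np.random.default_rng(0)
human = rng.normal(size=(4, 8))
ortholog = human + 0.05 * rng.normal(size=(4, 8))

matched_loss = info_nce_loss(human, ortholog)
# Reversing the row order pairs every sequence with the wrong partner.
mismatched_loss = info_nce_loss(human, ortholog[::-1].copy())
```

In this toy setting the loss is lower when each sequence is scored against its true partner than when the pairing is scrambled, which is the behaviour a contrastive objective of this kind is designed to reward.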

The authors found that cysteine and tryptophan residues overall were assigned the highest "attention paid" score by the ML algorithm while most other residues received broadly distributed attention scores indicating low importance to IDP/R function. This is consistent with previously reported experimental findings, which report that aromatic residues are critical to LLPS function.

This work is interesting, as there are few predictive tools that can provide insight into IDP/R function from primary sequence analysis. I can see its potential value not only to IDP/R researchers but also to the broader protein design/engineering community.

Major issues

The title is vague and would benefit from being more descriptive. I might suggest, "Identifying key residues that drive LLPS in…."

It is not clear how the sequences were "padded". Does this bias the model?
The concluding paragraph needs work. Reiterate the major findings and reframe the improvements as future directions. What's next for the model? Other IDP/R functional predictions?
Can the resolution of Figure 1 be improved?
The color scheme in Figure 2 makes it difficult to read.
Figure 2 is confusing. What do the arrows indicate? (This is stated in the legend but is not clear in the figure.) They also appear red, whereas the authors call them purple. Do the arrows highlight groups of amino acids with shared physicochemical properties, or are they meant to indicate individual residues? Please clarify.

Why are there illustrative figures in panel B but not the other panels?

What are the labels at the bottom of panel D indicating and why aren't they in all panels?

Can these questions be simply addressed in the figure legend?
Panel E in Figure 2 might be better as a separate figure entirely. Figure S5 might be used to replace it in the main text and remove panel E instead.
Combine the references in the Supplementary Material with the main-text references.
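The padding question raised above can be illustrated with a short sketch. One common way to keep padding from influencing a self-attention model is to mask padded positions before the softmax so that they receive zero attention weight. Whether the manuscript does this is not known to me; the function name `masked_attention_weights`, the array shapes, and the example lengths below are hypothetical.

```python
import numpy as np

def masked_attention_weights(scores, lengths):
    """Softmax over keys with padded key positions masked out.

    scores:  (batch, L, L) raw attention scores for padded sequences.
    lengths: true (unpadded) length of each sequence, each at least 1.
    """
    _, max_len, _ = scores.shape
    # True where the key position holds a real residue, False for padding.
    key_is_real = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]
    masked = np.where(key_is_real[:, None, :], scores, -np.inf)
    # Stable softmax; exp(-inf) evaluates to 0, so padding gets no weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
scores = rng.normal(size=(2, 5, 5))  # two sequences padded to length 5
weights = masked_attention_weights(scores, lengths=[5, 3])
```

If the authors confirm that a mask of this kind (or an equivalent) is applied, and that rows corresponding to padded query positions are excluded from the reported attention scores, the concern about padding bias would be largely addressed.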

Minor issues

Minor spelling errors: in Figure 1, "attension" should be "attention".

Competing interests

The author declares that they have no competing interests.
