AmyloDeep: pLM-based ensemble model for predicting amyloid propensity from the amino acid sequence

Abstract

Amyloids are predominantly β-sheet-rich, stable protein structures that can persist in the human body for years. Amyloid protein aggregates contribute to the development of several neurodegenerative diseases, such as Alzheimer’s, Parkinson’s, and Huntington’s, but they are also involved in vital functions, such as memory formation and immune system function. Here, we used machine learning and deep learning techniques to predict amyloid propensity from the amino acid sequence. First, we aggregated labeled amino acid sequence data from multiple sources, obtaining a roughly balanced dataset of 2366 sequences for binary classification. We used these data both to fine-tune the ESM2 model and to train new models on protein embeddings from ESM2 and UniRep. The predictions from these models were then combined in a single soft voting ensemble model, yielding robust and accurate results. We also built a tool that takes an amino acid sequence as input and returns the amyloid formation probabilities of different segments of that sequence. A light version of AmyloDeep is available through the web server at https://amylodeep.com/, and the full model is available as a Python package at https://pypi.org/project/amylodeep/.
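
As an illustration of the two ideas named in the abstract, soft voting and per-segment scoring, the following is a minimal sketch and not the AmyloDeep implementation. It assumes each model is a callable that maps a sequence to a probability; the names `soft_vote` and `segment_probabilities`, the unweighted average, and the window length of 6 are hypothetical choices made for this example, and none of them are taken from the `amylodeep` package.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any callable returning P(amyloid) in [0, 1]
# for a given amino acid sequence. Real models would wrap ESM2/UniRep-based
# classifiers; here the interface is purely illustrative.
Predictor = Callable[[str], float]


def soft_vote(seq: str, models: Sequence[Predictor]) -> float:
    """Unweighted soft voting: average the probabilities of all models."""
    if not models:
        raise ValueError("need at least one model")
    return sum(model(seq) for model in models) / len(models)


def segment_probabilities(
    seq: str,
    models: Sequence[Predictor],
    window: int = 6,  # hypothetical segment length, not from the source
    step: int = 1,
) -> List[Tuple[int, str, float]]:
    """Score every sliding window of `seq` with the ensemble.

    Returns (start_index, segment, probability) tuples. Sequences no longer
    than the window are scored once as a whole.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(seq) <= window:
        return [(0, seq, soft_vote(seq, models))]
    results = []
    for start in range(0, len(seq) - window + 1, step):
        segment = seq[start:start + window]
        results.append((start, segment, soft_vote(segment, models)))
    return results


if __name__ == "__main__":
    # Toy stand-in models (assumptions for the demo only).
    hydrophobic = set("AVILMFWY")
    toy_models = [
        lambda s: sum(c in hydrophobic for c in s) / len(s),
        lambda s: 0.5,
    ]
    for start, segment, prob in segment_probabilities("KLVFFAEDVG", toy_models):
        print(start, segment, round(prob, 3))
```

Soft voting averages probabilities rather than counting hard class votes, so a confident model can outweigh several uncertain ones; whether AmyloDeep weights its member models is not stated in the abstract.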
