Machine learning for optimal growth temperature prediction of prokaryotes using amino acid descriptors
Abstract
Motivation
The optimal growth temperature (OGT) of organisms is valuable for bioprospecting enzymes that work under extreme conditions. Existing OGT prediction models achieve high accuracy but mainly capture trends in groups that are overrepresented in the training set, including organisms that thrive at moderate temperatures and those from well-described taxa.
Results
In this study, we incorporated weighted scoring and phylogenetic splits to improve the generalizability of the prediction models. We first built a new growth temperature dataset comprising more than 21,000 species distributed over all three domains of life, with particular attention to including OGT and extreme-temperature data. We then trained machine learning models on the OGT data of 6,401 prokaryotes using proteome-averaged amino acid descriptors. The best-performing model was the multilayer perceptron (MLP), with a cross-validated RMSE of 5.07°C (± 0.24) and an R² of 0.89 (± 0.04). The most important proteome features were related to backbone flexibility, charged residues, and surface accessibility.
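The three ingredients named above (proteome-averaged descriptors, weighted scoring, and phylogenetic splits) can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes that a descriptor is a per-residue scale averaged over all residues of a proteome, that sample weights are the inverse of the number of samples in a temperature bin, and that a phylogenetic split keeps each taxonomic group within a single fold. The charged-residue scale is only an example of such a descriptor.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative descriptor scale (an assumption, not taken from the paper):
# 1.0 for residues usually treated as charged at neutral pH, 0.0 otherwise.
CHARGED = {aa: (1.0 if aa in "DEKR" else 0.0) for aa in AMINO_ACIDS}


def proteome_average(sequences, scale):
    """Mean of a per-residue scale over all standard residues in a proteome."""
    total, n = 0.0, 0
    for seq in sequences:
        for residue in seq.upper():
            if residue in scale:  # skip ambiguous codes such as X or B
                total += scale[residue]
                n += 1
    if n == 0:
        raise ValueError("no standard amino acids found")
    return total / n


def inverse_bin_weights(temps, bin_width=10.0):
    """Weight each sample by 1 / (number of samples in its temperature bin)."""
    bins = [math.floor(t / bin_width) for t in temps]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


def weighted_rmse(y_true, y_pred, weights):
    """Square root of the weighted mean squared error."""
    sq = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return math.sqrt(sq / sum(weights))


def group_folds(groups, n_folds):
    """Assign whole groups to folds so that no group spans two folds."""
    fold_of = {g: i % n_folds for i, g in enumerate(sorted(set(groups)))}
    return [fold_of[g] for g in groups]
```

Under these assumptions, rare temperature ranges such as hyperthermophiles contribute as much to the score as the crowded mesophilic range, and evaluating on held-out groups tests whether a model generalizes beyond well-described taxa.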
Availability and Implementation
The MLP model is integrated into the command-line tool OGTFinder, which is available under the MIT license at https://github.com/SC-Git1/OGTFinder.