Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets
Abstract
Predicting prokaryotic phenotypes (observable traits that govern functionality, adaptability, and interactions) holds significant potential for fields such as biotechnology, environmental sciences, and evolutionary biology. In this study, we use machine learning to explore the relationship between prokaryotic genotypes and phenotypes. Drawing on the highly standardized datasets in the BacDive database, we model eight physiological properties based on protein family inventories, evaluate model performance using multiple metrics, and examine the biological implications of our predictions. The high confidence values achieved underscore the importance of data quality and quantity for reliably inferring bacterial phenotypes. Our approach generates 50,396 new data points for 15,938 strains, now openly available in the BacDive database, thereby enriching existing phenotypic resources and enabling further research. The open-source software we provide can be readily applied to other datasets, such as those from metagenomic studies, and to various applications, including assessing the potential of soil bacteria for bioremediation.
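To make the modelling setup more concrete, the following is a minimal sketch of two steps the abstract mentions: turning protein family inventories into model features, and scoring binary trait predictions with more than one metric. It is not the authors' software. The strain names, the Pfam-style family identifiers, the helper names `encode` and `binary_metrics`, and the choice of precision, recall, F1, and Matthews correlation coefficient are all assumptions made for illustration; the abstract does not state which metrics or encoding were used.

```python
from math import sqrt

# Hypothetical inventories: strain -> set of protein family IDs.
# Names and identifiers are invented for illustration only.
inventories = {
    "strain_A": {"PF00001", "PF00002"},
    "strain_B": {"PF00002", "PF00003"},
    "strain_C": {"PF00001", "PF00003"},
}


def encode(inventories):
    """Encode inventories as a binary presence/absence matrix."""
    families = sorted(set().union(*inventories.values()))
    strains = sorted(inventories)
    matrix = [
        [1 if fam in inventories[strain] else 0 for fam in families]
        for strain in strains
    ]
    return strains, families, matrix


def binary_metrics(y_true, y_pred):
    """Compute several metrics for one binary trait (1 = trait present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


strains, families, matrix = encode(inventories)
# Toy labels and predictions for a single trait, not real results.
scores = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

Reporting several metrics side by side is useful when trait labels are imbalanced, since a single score such as accuracy can look high even when the rarer class is predicted poorly.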