Genotype-phenotype modeling of light ecotypes in Prochlorococcus reveals genomic signatures of ecotypic divergence
Abstract
Prochlorococcus is a photosynthetic cyanobacterial genus with remarkable genetic diversity. We analyze how Prochlorococcus genomes relate to adaptation to high- versus low-light environments, applying traditional comparative genomics and machine learning (ML) approaches to connect genotypes to phenotypes. We downloaded nearly 1,000 Prochlorococcus genomes from NCBI, together with information on their light-adaptation ecotype (high-light/low-light) and depth of isolation obtained from JGI metadata. Average nucleotide identity analysis and traditional pangenome generation tools struggle to capture the cyanobacterial core genome, but despite its scant conservation, we observe a sharp separation of the taxon by light utilization preference, that is, by light ecotype. A range of classical ML models trained to predict ecotype achieved exceptional binary classification accuracy even when predicting on partial genomes (Matthews correlation coefficient = 0.81–0.98), whereas regression models trained to predict depth of isolation performed poorly, with relatively high root mean square error values (40.8–45.3 m). For ecotype prediction, the top features of the best-performing models included photosynthesis-associated genes and pathways, as well as some novel markers of unknown function. Our findings recapitulate the extreme genetic versatility of Prochlorococcus and show that variable genetic markers allow excellent classification accuracy, and therefore ecotype prediction, even for incomplete metagenomic assemblies, underscoring the specialization and separation among these cyanobacterial ecotypes.
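For readers unfamiliar with the two evaluation metrics named in the abstract, the following minimal sketch shows how the Matthews correlation coefficient (for binary ecotype classification) and the root mean square error (for depth regression) are computed from their standard definitions. It is not the authors' pipeline. The function names, the label encoding (1 = high-light, 0 = low-light), and the toy values are all hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels; 1 = high-light, 0 = low-light (assumed encoding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the inputs (metres here)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data, invented for illustration only (not taken from the study).
ecotype_true = [1, 1, 1, 1, 0, 0, 0, 0]
ecotype_pred = [1, 1, 1, 0, 0, 0, 0, 0]
depth_true_m = [10.0, 50.0, 100.0, 150.0]
depth_pred_m = [40.0, 20.0, 140.0, 110.0]

mcc = matthews_corrcoef(ecotype_true, ecotype_pred)
depth_rmse = rmse(depth_true_m, depth_pred_m)
print(f"MCC = {mcc:.3f}, RMSE = {depth_rmse:.1f} m")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than raw accuracy. RMSE is expressed in the units of the target, so the reported 40.8–45.3 m can be read directly as a typical depth error.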