Machine learning-assisted discovery of growth decision elements by relating bacterial population dynamics to environmental diversity

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    Aida et al. use a combination of novel experimental measurements and data processing to wrangle the complexity of bacterial growth for different media conditions. This study is a clear example of tackling the complexity of biological systems from the side of environmental conditions (~13,000 growth curves were measured) that influence the growth of a single, well-defined bacterial species, with a reasonable first pass at data processing. The findings are ultimately simple (with essentially 3 conditions accounting for all variability in the system) and easily interpretable.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

This article has been Reviewed by the following groups

Abstract

Microorganisms growing in their habitat constitute a complex system. How the individual constituents of the environment contribute to microbial growth remains largely unknown. The present study focused on the contribution of environmental constituents to population dynamics via a high-throughput assay and data-driven analysis of a wild-type Escherichia coli strain. A large dataset comprising a total of 12,828 bacterial growth curves with 966 medium combinations, which were composed of 44 pure chemical compounds, was acquired. Machine learning analysis of the big data relating the growth parameters to the medium combinations revealed that the decision-making components for bacterial growth were distinct among various growth phases, e.g., glucose, sulfate, and serine for maximum growth, growth rate, and growth delay, respectively. Further analyses and simulations indicated that branched-chain amino acids functioned as global coordinators for population dynamics, as well as a survival strategy of risk diversification to prevent the bacterial population from undergoing extinction.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    The authors succeed at generating a large amount of data using a high-throughput platform to measure bacterial growth, analyzing its complexity and deriving some simple rules to model the system. The limited complexity of the system under consideration (with 3 nutrients quantitatively determining all dynamic parameters for this bacterium) suggests that very simple analysis tools would be enough to tackle this large amount of data. This study is a clear example of a clever combination of high-throughput data generation and machine learning.

    Parametrization of growth curves (with lag times, growth rates, and growth saturation plateau as all-encompassing parameters) is simple, accurate and ultimately addressable. Indeed, given the large number of combinations of growth conditions (varied amino acids, metal ions, etc.) at different concentrations, it is very satisfying that a simple growth model and 3 parameters are enough to capture the entire dynamic complexity of these bacterial growth curves in vitro.
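
    As a hedged illustration of this three-parameter description, the sketch below generates a synthetic curve from a logistic model with an explicit lag and then recovers the lag time, growth rate and carrying capacity with simple heuristics. The model form, the function names and all numerical values are assumptions made for illustration; the fitting procedure actually used in the manuscript is not described in this review.

```python
import math

def logistic_with_lag(t, n0, r, k, lag):
    """Population size at time t: flat at n0 during the lag, then logistic growth."""
    if t <= lag:
        return n0
    return k / (1.0 + (k - n0) / n0 * math.exp(-r * (t - lag)))

def estimate_parameters(times, values):
    """Estimate (lag, rate, capacity) from one growth curve with simple heuristics."""
    capacity = max(values)  # saturation plateau
    logs = [math.log(v) for v in values]
    # The steepest slope of ln(N) between adjacent points approximates the growth rate.
    best_slope, best_i = 0.0, 0
    for i in range(len(times) - 1):
        slope = (logs[i + 1] - logs[i]) / (times[i + 1] - times[i])
        if slope > best_slope:
            best_slope, best_i = slope, i
    # Lag: time at which the steepest tangent crosses the initial level ln(N0).
    lag = times[best_i] - (logs[best_i] - logs[0]) / best_slope
    return lag, best_slope, capacity

# Synthetic curve with hypothetical parameters (time in hours, size in arbitrary units).
times = [0.25 * i for i in range(161)]  # 0 to 40 h
curve = [logistic_with_lag(t, 0.001, 0.8, 1.2, 5.0) for t in times]
lag_est, rate_est, cap_est = estimate_parameters(times, curve)
print(f"lag={lag_est:.2f} h, rate={rate_est:.3f} 1/h, capacity={cap_est:.3f}")
```

    On noise-free data the heuristics recover the generating values closely; real optical-density readings would need smoothing or a proper curve fit.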

    Thank you for the careful reading and the positive evaluation. Your thoughtful comments helped us to improve our manuscript.

    The authors argue that the 3 dynamic parameters (lag time, growth rate, and carrying capacity) are essentially bimodal across all conditions (Fig. 2B). A closer inspection of the parameter K actually reveals 4 separable peaks (see also Fig. 7). Moreover, a simple PCA of the 3 dynamic parameters reveals only 4 separate clusters (while one could anticipate 2^3=8 clusters if the 3 parameters were truly bimodal and independent). The authors need to comment on the missing clusters, e.g. what rules forbid some combinations of parameters (cf. the correlation between parameters shown in Fig. 7).

    Thank you for the insightful comment. Fig. 2C showed that the 966 medium combinations could be roughly divided into four clusters. It is true that if the three growth parameters were independent, more than eight PCA clusters would theoretically be expected, because the distributions of all three growth parameters were multimodal. The absence of the remaining PCA clusters strongly suggested that the growth parameters were somehow dependent, which was further demonstrated in Fig. 7. The following sentences were added.

    (lines 92~95) “If the three parameters of τ, r and K, which all showed multimodal distributions, were independent, more than eight clusters would be anticipated. Only four separate clusters were identified, indicating that the growth parameters were somehow dependent.”

    (lines 209~211) “The correlations demonstrated that τ, r and K were highly dependent, which well explained why the multimodal distributions of the growth parameters led to only four PCA clusters (Figure 2).”
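
    The counting argument can be made concrete with a minimal sketch. It enumerates the mode combinations of three bimodal parameters and then applies one hypothetical coupling rule (a long lag always pairs with slow growth), which is an assumption chosen for illustration and not the relationship reported in Fig. 7.

```python
from itertools import product

MODES = ("low", "high")

# Independent case: every combination of (tau, r, K) modes can occur.
independent = set(product(MODES, repeat=3))

def allowed(tau, r, k):
    """Hypothetical dependency: the lag mode fully determines the rate mode."""
    return (tau == "high") == (r == "low")

# Dependent case: only combinations satisfying the coupling rule remain.
dependent = {combo for combo in independent if allowed(*combo)}

print(len(independent), "combinations if independent")
print(len(dependent), "combinations under the hypothetical coupling")
```

    Coupling two of the three parameters halves the number of reachable combinations, from 8 to 4, which is the kind of reduction the missing clusters point to.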

    Additionally, the use of the Machine Learning (ML) framework to analyze the data reads as over-complicated for a "simple" classification task: the authors need to explain better what insight was derived from the ML analysis compared to simpler/unsupervised approaches such as PCA.

    Thank you for the advice. The benefit of using the Machine Learning (ML) framework was additionally discussed by comparing it with a simpler and more common analytical approach. For the sake of interpretability (i.e., the quantitative contribution of individual chemicals to the three growth parameters), multiple regression was employed for the comparison. The results showed that multiple regression was less accurate than ML (Figure 3−figure supplement 1). Accordingly, the figures were revised and the corresponding description was added in the Discussion as follows (lines 289~297).

    “First, the representative ML models and a commonly used statistical model, multiple regression, were compared. Although multiple regression is known to have the highest interpretability, its predictive accuracy was likely to be worse than that of the ML models (Figure 3−figure supplement 1). The results well supported the common understanding that the ML approach was more suitable for studying complex systems, which in the present survey were the growing bacterial cells and the chemical media. Additionally, among the tested ML models, the best accuracy was acquired with the ensemble model; nevertheless, as it required the longest time for model training (Figure 3−figure supplement 2) and was uninterpretable, the GBDT model was finally employed.”
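
    To show why a tree-based model can outperform a linear fit on this kind of data, the toy sketch below compares ordinary least squares with gradient boosting over decision stumps on a single, saturating dose-response. Everything here is an assumption for illustration: the pure-Python implementation, the synthetic response and the hyperparameters are not the authors' models, data or settings.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_stump(xs, residuals):
    """Best single-threshold split of xs, minimising squared error on residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x < threshold else rmean

def fit_boosted_stumps(xs, ys, rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def response(x):
    """Hypothetical saturating response: rises with concentration, then plateaus."""
    return min(x, 5.0)

train_x = [float(i) for i in range(0, 21, 2)]  # concentrations 0, 2, ..., 20
test_x = [float(i) for i in range(1, 20, 2)]   # held-out concentrations 1, 3, ..., 19
train_y = [response(x) for x in train_x]
test_y = [response(x) for x in test_x]

linear = fit_line(train_x, train_y)
boosted = fit_boosted_stumps(train_x, train_y)
linear_mse = mse(linear, test_x, test_y)
boosted_mse = mse(boosted, test_x, test_y)
print(f"held-out MSE: linear={linear_mse:.3f}, boosted stumps={boosted_mse:.3f}")
```

    The linear model cannot represent the plateau, whereas the boosted stumps approximate it piecewise, so the held-out error is lower for the boosted model on this toy response. Whether the same holds for a given real dataset has to be checked empirically, as the authors did.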

    Overall, this study reads strong in its experimental implementation and insight. Additional analysis and easier interpretation will help the reader better assess the relevance of the findings.

    Thank you again for your supportive comments. We hope the revised manuscript addresses your concerns.

    Reviewer #2 (Public Review):

    This paper describes the analysis of a large data set collected from growth experiments on one strain of E. coli. The experimenters varied the growth media and used machine learning to try to deconstruct what was going on biologically. I have two major concerns with the methodology.

    1. The results of growth experiments are often severely affected by whether or not the strain has had time to adapt to the growth conditions tested. There is no time allowed for the different cultures to become adapted to these different growth media.
    2. All of these results are based on the concentration of chemical substances at t=0. As a culture grows it uses chemicals and releases other chemicals. That means the concentration of the different chemicals is changing as well as the ratio of different chemicals.

    Because of this, I have serious doubts about the specific biological claims.
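
    The second concern can be illustrated with a minimal batch-culture simulation in which a single limiting substrate is consumed as biomass forms. The Monod kinetics, the parameter values and the function name are assumptions chosen for illustration; they are not taken from the manuscript or from the review.

```python
def simulate_batch(n0, s0, mu_max, ks, yield_coeff, dt, hours):
    """Euler integration of Monod growth in a closed batch culture."""
    n, s = n0, s0
    trajectory = [(0.0, n, s)]
    steps = int(round(hours / dt))
    for step in range(1, steps + 1):
        growth = mu_max * s / (ks + s) * n * dt  # biomass formed in this step
        n += growth
        s = max(s - growth / yield_coeff, 0.0)   # substrate consumed in this step
        trajectory.append((step * dt, n, s))
    return trajectory

# Hypothetical values: biomass and substrate in g/L, rates per hour.
trajectory = simulate_batch(n0=0.01, s0=2.0, mu_max=1.0, ks=0.1,
                            yield_coeff=0.5, dt=0.01, hours=24.0)
_, final_n, final_s = trajectory[-1]
print(f"substrate: {trajectory[0][2]:.2f} -> {final_s:.2e} g/L")
print(f"biomass:   {trajectory[0][1]:.2f} -> {final_n:.2f} g/L")
```

    In this sketch the substrate concentration at t=0 sets the final biomass through the yield, but the concentration the cells actually experience falls continuously during growth, which is the distinction the reviewer raises.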

    Thank you for reviewing our paper and the valuable comments, which helped us to improve the manuscript to a large extent. Taking all the concerns into account, we performed the additional experiments and analyses, and intensively revised the manuscript.

    The concept of making ML methods less opaque and using them to tease apart specific biological processes is intriguing. This is also a very interesting and large data set that would be useful to others for developing algorithms. Readers who are interested in ML applications in biology would be interested in this paper.

    We do agree and sincerely hope the findings, datasets and analytical approaches provided in the present study are valuable for the readers of varied research backgrounds.

    Reviewer #3 (Public Review):

    In this manuscript, the authors define 966 different media combinations on which they run over 12,000 growth curves for E. coli. After fitting the growth curves to estimate classical growth parameters (e.g. lag, growth rate and carrying capacity), the authors evaluate different machine learning methods in their ability to predict growth parameters from media composition. They use the results of the modeling to determine which media components are more important in affecting a certain parameter. The authors use the findings to try to explain, from an ecological and evolutionary perspective, why distinct "decision-making" components are found to associate with each of the growth parameters.

    The experiment appears to be well executed. However, apart from making sure the 966 media combinations are well defined, this is running growth curves with E. coli, which has been established practice for many years. The machine learning modeling is not innovative. Better posed, the authors use off-the-shelf machine learning methods available from different Python packages to perform regression.

    Overall, the paper lacks motivation for why this work was done and what implications it has. Based on the regression analysis, the authors find that different growth medium components are more important in predicting (or associate specifically with) classical growth curve parameters, including growth rate, carrying capacity and lag time. That the amount of glucose in the media determines the carrying capacity has been known for several decades, and we do not need machine learning to tell us.

    Given that the authors use the most studied and genetically manipulatable model system in biology, and that they use growth curves as the experimental system, I would have expected some creative validation experiment to confirm the biological interpretation that they give to the data. After reading and evaluating the paper I cannot say I have learned anything new.

    Thank you for reviewing our paper and for the helpful comments. Accordingly, the manuscript was intensively revised, with additional results and newly provided figures. We hope the changes made to the paper address your concerns.

  2. Evaluation Summary:

    Aida et al. use a combination of novel experimental measurements and data processing to wrangle the complexity of bacterial growth for different media conditions. This study is a clear example of tackling the complexity of biological systems from the side of environmental conditions (~13,000 growth curves were measured) that influence the growth of a single, well-defined bacterial species, with a reasonable first pass at data processing. The findings are ultimately simple (with essentially 3 conditions accounting for all variability in the system) and easily interpretable.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    The authors succeed at generating a large amount of data using a high-throughput platform to measure bacterial growth, analyzing its complexity and deriving some simple rules to model the system. The limited complexity of the system under consideration (with 3 nutrients quantitatively determining all dynamic parameters for this bacterium) suggests that very simple analysis tools would be enough to tackle this large amount of data.

    This study is a clear example of a clever combination of high-throughput data generation and machine learning.

    Parametrization of growth curves (with lag times, growth rates, and growth saturation plateau as all-encompassing parameters) is simple, accurate and ultimately addressable. Indeed, given the large number of combinations of growth conditions (varied amino acids, metal ions, etc.) at different concentrations, it is very satisfying that a simple growth model and 3 parameters are enough to capture the entire dynamic complexity of these bacterial growth curves in vitro.

    The authors argue that the 3 dynamic parameters (lag time, growth rate, and carrying capacity) are essentially bimodal across all conditions (Fig. 2B). A closer inspection of the parameter K actually reveals 4 separable peaks (see also Fig. 7).

    Moreover, a simple PCA of the 3 dynamic parameters reveals only 4 separate clusters (while one could anticipate 2^3=8 clusters if the 3 parameters were truly bimodal and independent). The authors need to comment on the missing clusters, e.g. what rules forbid some combinations of parameters (cf. the correlation between parameters shown in Fig. 7). Additionally, the use of the Machine Learning (ML) framework to analyze the data reads as over-complicated for a "simple" classification task: the authors need to explain better what insight was derived from the ML analysis compared to simpler/unsupervised approaches such as PCA.

    Overall, this study reads strong in its experimental implementation and insight. Additional analysis and easier interpretation will help the reader better assess the relevance of the findings.

  4. Reviewer #2 (Public Review):

    This paper describes the analysis of a large data set collected from growth experiments on one strain of E. coli. The experimenters varied the growth media and used machine learning to try to deconstruct what was going on biologically. I have two major concerns with the methodology.

    1. The results of growth experiments are often severely affected by whether or not the strain has had time to adapt to the growth conditions tested. There is no time allowed for the different cultures to become adapted to these different growth media.

    2. All of these results are based on the concentration of chemical substances at t=0. As a culture grows it uses chemicals and releases other chemicals. That means the concentration of the different chemicals is changing as well as the ratio of different chemicals.

    Because of this, I have serious doubts about the specific biological claims.

    The concept of making ML methods less opaque and using them to tease apart specific biological processes is intriguing. This is also a very interesting and large data set that would be useful to others for developing algorithms. Readers who are interested in ML applications in biology would be interested in this paper.

  5. Reviewer #3 (Public Review):

    In this manuscript, the authors define 966 different media combinations on which they run over 12,000 growth curves for E. coli. After fitting the growth curves to estimate classical growth parameters (e.g. lag, growth rate and carrying capacity), the authors evaluate different machine learning methods in their ability to predict growth parameters from media composition. They use the results of the modeling to determine which media components are more important in affecting a certain parameter. The authors use the findings to try to explain, from an ecological and evolutionary perspective, why distinct "decision-making" components are found to associate with each of the growth parameters.

    The experiment appears to be well executed. However, apart from making sure the 966 media combinations are well defined, this is running growth curves with E. coli, which has been established practice for many years. The machine learning modeling is not innovative. Better posed, the authors use off-the-shelf machine learning methods available from different Python packages to perform regression. Overall, the paper lacks motivation for why this work was done and what implications it has. Based on the regression analysis, the authors find that different growth medium components are more important in predicting (or associate specifically with) classical growth curve parameters, including growth rate, carrying capacity and lag time. That the amount of glucose in the media determines the carrying capacity has been known for several decades, and we do not need machine learning to tell us.

    Given that the authors use the most studied and genetically manipulatable model system in biology, and that they use growth curves as the experimental system, I would have expected some creative validation experiment to confirm the biological interpretation that they give to the data. After reading and evaluating the paper I cannot say I have learned anything new.