A unified framework for measuring selection on cellular lineages and traits
Abstract
Intracellular states probed by gene expression profiles and metabolic activities are intrinsically noisy, causing phenotypic variations among cellular lineages. Understanding the adaptive and evolutionary roles of such variations requires clarifying their linkage to population growth rates. Extending a cell lineage statistics framework, here we show that a population’s growth rate can be expanded by fitness cumulants of any cell lineage trait. The expansion enables quantifying the contribution of each fitness cumulant, such as variance and skewness, to population growth. We introduce a function that contains all the essential information of cell lineage statistics, including mean lineage fitness and selection strength. We reveal a relation between fitness heterogeneity and population growth rate response to perturbation. We apply the framework to experimental cell lineage data from bacteria to mammalian cells, revealing that third- or higher-order cumulants’ contributions are negligible under constant growth conditions but could be significant in regrowing processes from growth-arrested conditions. Furthermore, we identify cellular populations in which selection leads to an increase of fitness variance among lineages. The framework assumes no particular growth models or environmental conditions, and is thus applicable to various biological phenomena for which phenotypic heterogeneity and cellular proliferation are important.
Article activity feed

Author Response
Reviewer #1 (Public Review):
This paper introduces a new statistical framework to study cellular lineages and traits. Several new measures are introduced to infer selection strength from individual lineages. The key observation is that one can simply relate cumulants of a fitness landscape to population growth, and all of this can be simply computed from one generating function, that can be inferred from data. This formalism is then applied to experimental cell lineage data.
I think this is a very interesting and clever paper. However, in its current form the paper is very hard to read, with very few explanations beyond the mathematical observations/definitions, which makes it almost unreadable for people outside of the field in my opinion. Some more intuitive explanations should be given for a broader audience, on all aspects : definitions of fitness « landscape », selection strength(s), connections between cumulants and other properties (including skewness) etc... There are many new definitions given with names reminiscent of classical concepts in evolutionary theory, but the connection is not always obvious. It would be great to better explain with very simple, intuitive examples, what they mean, beyond maths, possibly with simple examples. Some of this might be obvious to population geneticists, and in fact some explanations made in discussion are more illuminating, but earlier would be much better. I give more specific comments below.
We thank the reviewer for calling our attention to the lack of accessible explanations of the key terms and quantities in this framework. Following the suggestion in the comments below, we added Box 1, which provides intuitive, plain explanations of the terms fitness, fitness landscape, selection, selection strength, and cumulants. In each section, we explain the standard usage of these terms in evolutionary biology and clarify the similarities and differences in this framework. We also added a figure to Box 1 that gives a schematic explanation of the relationships among chronological and retrospective distributions, fitness landscapes, and selection strength. We believe that these explanations and the figure better clarify the meanings and functions of these quantities.
Major comments :
 The authors give names to several functions, for instance before equation (1) they mention « fitness landscape », then describe « net fitness », which allows the authors to define « fitness cumulants ». Later on, a « selection » is defined. Those terms might mean different things for different authors depending on the context, to the point they are sometimes almost confusing. For instance, why is h a « landscape »? For me, a landscape is kind of like a potential, and I really do not see how this is connected to h. « fitness cumulants » is particularly jargonic. There are also two kinds of selection strengths, which is very confusing. I would recommend that the authors make a glossary of the terms, explain intuitively what they mean and maybe connect them to standard definitions.
We appreciate the suggestion of making a glossary of the terms. Following the suggestion, we added Box 1 to provide intuitive and plain explanations of the terms used in this framework.
In Box 1, we explain why we called h(x) a fitness landscape, referring to its standard usage in evolutionary biology. In evolutionary biology, fitness landscapes (also called adaptive landscapes) are visual representations of relationships between reproductive abilities (fitness) and genotypes. The height of landscapes corresponds to fitness. Since constructing "genotype space" is usually difficult, fitness is often mapped on an allele frequency or phenotype (trait) space to depict a "landscape." Fitness landscapes introduced in our framework are analogous to those in evolutionary biology in that fitness differences are mapped on trait spaces. Although fitness landscapes in evolutionary biology are usually metaphorical or conceptual tools for understanding evolutionary processes, the landscapes in our framework are directly measurable from division count and trait dynamics on cellular lineages.
We also explain "selection" and "selection strength" in Box 1. As pointed out, we define three kinds of selection strength measures. These three measures share a similar property of reporting the overall correlations between traits and fitness. However, they also have critical differences regarding the additional selection effects they represent: $S_{\mathrm{KL}}^{(1)}$ for growth rate gain, $S_{\mathrm{KL}}^{(2)}$ for additional loss of growth rate under perturbations, and their difference $S_{\mathrm{KL}}^{(2)} - S_{\mathrm{KL}}^{(1)}$ for the effect of selection on fitness variance. We restructured the sections in Results and clarified these important meanings of the different selection strength measures.
We removed the term "fitness cumulants" because it is not in general use and might confuse readers. We have now rephrased it more precisely as "cumulants of a fitness landscape (with respect to chronological distribution)." In addition, we added a general explanation of "cumulants" to Box 1 and clarified what first-, second-, and third-order cumulants represent about distributions.
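To make the role of these cumulants concrete, the following is a short derivation sketch rather than a reproduction of the manuscript's equations. It assumes only the fitness landscape definition quoted in this response (Eq. 1), $h(x) := \tau\Lambda + \ln[Q_{\mathrm{rs}}(x)/Q_{\mathrm{cl}}(x)]$, and the normalization of the retrospective distribution. The symbols $K(\xi)$ and $\kappa_n$ are notation introduced here for illustration, and the series is assumed to converge at $\xi = 1$.

```latex
\begin{align}
1 = \sum_x Q_{\mathrm{rs}}(x)
  &= \sum_x Q_{\mathrm{cl}}(x)\, e^{h(x) - \tau\Lambda}
  \;\Longrightarrow\;
  \tau\Lambda = \ln \left\langle e^{h(X)} \right\rangle_{\mathrm{cl}}, \\
K(\xi) &:= \ln \left\langle e^{\xi h(X)} \right\rangle_{\mathrm{cl}}
        = \sum_{n \ge 1} \frac{\kappa_n}{n!}\, \xi^{n}, \\
\tau\Lambda &= K(1)
        = \kappa_1 + \frac{\kappa_2}{2} + \frac{\kappa_3}{6} + \cdots
\end{align}
```

Here $\kappa_1$ is the mean of $h(X)$ under the chronological distribution, $\kappa_2$ its variance, and $\kappa_3$ its third central moment (the unnormalized skewness), so each term can be read as the contribution of one feature of the fitness distribution to population growth.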
 Along the same line, it would be good to give more intuitive explanations of the different functions introduced. For instance I find (2) more intuitive than (1) to define h. I think some more intuition on what the authors call selection strengths would be super useful. In Table 1 selection strengths are related to Kullback-Leibler divergence (which does not seem to be defined), it would be good to better explain this.
In addition to Box 1, we included more intuitive explanations of fitness landscapes and selection strength where they first appear in the Theoretical background section. As pointed out, descriptions of the linkage between the selection strength measures and Kullback-Leibler divergence were only in the Supplemental Information in the original manuscript. We now explicitly show this linkage where we first define the selection strength.
Following this comment, we also changed the definition of a fitness landscape from the original one to $h(x) := \tau\Lambda + \ln\frac{Q_{\mathrm{rs}}(x)}{Q_{\mathrm{cl}}(x)}$ (Eq. 1), using the chronological and retrospective distributions introduced in the preceding paragraph. This definition is mathematically equivalent to the previous one, but we believe it is more intuitive.
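As an illustration of how Eq. 1 could be evaluated on lineage data, here is a minimal Python sketch. It is not the authors' code. It assumes, as in the cell lineage statistics literature this work extends, that each lineage with D divisions carries chronological weight 2^(-D)/N_0 and retrospective weight 1/N_tau, and that tau*Lambda = ln(N_tau/N_0); the function names and the toy tree are hypothetical.

```python
import math
from collections import defaultdict


def lineage_distributions(div_counts, traits, n0):
    """Return (tau_lambda, q_cl, q_rs) for a discrete (binned) lineage trait.

    div_counts[i]: number of divisions on lineage i during the observation window.
    traits[i]:     trait value (or bin label) of lineage i.
    n0:            number of ancestor cells at the start of the window.
    """
    n_tau = len(div_counts)            # one lineage per cell present at the end
    tau_lambda = math.log(n_tau / n0)  # tau * Lambda = ln(N_tau / N_0)
    q_cl = defaultdict(float)
    q_rs = defaultdict(float)
    for d, x in zip(div_counts, traits):
        q_cl[x] += 2.0 ** (-d) / n0    # ASSUMED chronological lineage weight
        q_rs[x] += 1.0 / n_tau         # ASSUMED retrospective lineage weight
    return tau_lambda, dict(q_cl), dict(q_rs)


def fitness_landscape(tau_lambda, q_cl, q_rs):
    """h(x) = tau*Lambda + ln(Q_rs(x) / Q_cl(x)), evaluated on observed values."""
    return {x: tau_lambda + math.log(q_rs[x] / q_cl[x]) for x in q_cl}


# Toy tree from one ancestor: four final lineages with 1, 2, 3 and 3 divisions.
div_counts = [1, 2, 3, 3]
tau_lambda, q_cl, q_rs = lineage_distributions(div_counts, div_counts, n0=1)
h = fitness_landscape(tau_lambda, q_cl, q_rs)
```

Using the division count itself as the trait is a useful sanity check: under the assumed weights the landscape reduces to h(D) = D ln 2, which the toy tree reproduces.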
 It seems to me the authors implicitly assume that, along a lineage, one would have almost stationary phenotypes (e.g. constant division rate). However, one could imagine very different situations, for instance the division rates could depend on interactions with other cells in the growing population, and thus change with time along a lineage. One could also have some strong random components of division rate over time. I am wondering how those more complex cases would impact the results and the discussion.
We thank the reviewer for pointing out our insufficient explanation of an essential feature of this framework. As we now explain in the "Examples of biological questions" section (L62-65) and Discussion (L492-493), this framework does not assume stationary phenotypes (traits) on cellular lineages. On the contrary, we developed this framework so that one can quantify fitness and selection strength even for non-stationary phenotypes (traits) due to factors such as non-constant environments and inherent stochasticity.
In fact, if traits are stationary in cellular lineages, this framework becomes essentially identical to the individual-based evolutionary biology framework (see ref. 26, for example). Our framework assumes a cell lineage as a unit of selection and any measurable quantities along cellular lineages as lineage traits, whether they are stationary or non-stationary. Therefore, our framework can evaluate fitness landscapes and selection strength without explicitly taking the environmental conditions around cells into account. This means that h(x) and S[X] in this framework extract the correlations between the traits of interest and division counts among various factors that could potentially influence division counts. On the other hand, this framework has a limitation due to this design: it cannot say anything about the influence of factors such as non-quantified traits and potential variations in environmental conditions. We now explain these important points explicitly in the revised manuscript (L493-496).
Likewise, stochasticity in division rate does affect division count distributions, and its influence appears as differences in the selection strength of division count S[D]. As stated in the text, S[D] sets the maximum bound for the selection strength of any lineage trait (L143-145). Therefore, $S_{\mathrm{rel}}[X] := S[X]/S[D]$ reports the relative strength of the correlation between the trait X and lineage fitness at a given level of S[D] in each condition.
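The ratio S[X]/S[D] can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes that S is the measure $S_{\mathrm{KL}}^{(1)}$ and that it equals the Kullback-Leibler divergence of the chronological from the retrospective distribution, with lineage weights 2^(-D)/N_0 and 1/N_tau. The function name, the toy tree, and the "low"/"high" trait labels are hypothetical.

```python
import math
from collections import defaultdict


def selection_strength(div_counts, trait_values, n0):
    """ASSUMED form S[X] = D_KL(Q_cl || Q_rs) for a discrete lineage trait X."""
    n_tau = len(div_counts)
    q_cl = defaultdict(float)
    q_rs = defaultdict(float)
    for d, x in zip(div_counts, trait_values):
        q_cl[x] += 2.0 ** (-d) / n0   # chronological weight of this lineage
        q_rs[x] += 1.0 / n_tau        # retrospective weight of this lineage
    # Every observed trait value has q_cl > 0 and q_rs > 0, so the log is defined.
    return sum(p * math.log(p / q_rs[x]) for x, p in q_cl.items())


# Toy tree from one ancestor; the trait is a coarse two-level label per lineage.
div_counts = [1, 2, 3, 3]
trait = ["low", "low", "high", "high"]

s_trait = selection_strength(div_counts, trait, n0=1)       # S[X]
s_div = selection_strength(div_counts, div_counts, n0=1)    # S[D]
s_rel = s_trait / s_div                                     # S_rel[X]
```

In this toy case the trait is a coarse-grained summary of the division count, so it captures only part of the division count heterogeneity and the ratio falls strictly between 0 and 1.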
To clarify the influence of stochasticity in division rate, we present a cell population model in which cells divide stochastically according to generation time (interdivision time) distributions in Appendix 2 (we moved this section from the Supplemental Information with modifications). We can confirm from this model that the shapes of generation time distributions influence the selection strength S[D]. Importantly, one can understand from this model that stochasticity in generation times constantly introduces selection to cell populations and modulates the growth rate and selection strength even in the long-term limit. We now clarify this important point in the Discussion (L519-526).
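The qualitative point can be explored with a small simulation. The sketch below is not the analytical model of Appendix 2; it is a hypothetical Monte Carlo stand-in that assumes uncorrelated gamma-distributed generation times, ancestors that all start at age zero, and S[D] computed as tau*Lambda minus the chronological mean of D ln 2 (with chronological lineage weights 2^(-D)/N_0). All function names and parameter values are illustrative.

```python
import math
import random


def simulate_division_counts(n0, tau, shape, mean_gen_time, rng):
    """Division counts of every lineage present at time tau.

    Cells divide after independent gamma-distributed generation times
    (ASSUMPTION: no mother-daughter correlation, ancestors born at t = 0).
    """
    scale = mean_gen_time / shape
    counts = []
    stack = [(0.0, 0)] * n0                    # (birth time, divisions so far)
    while stack:
        birth, d = stack.pop()
        t_div = birth + rng.gammavariate(shape, scale)
        if t_div > tau:
            counts.append(d)                   # lineage ends undivided at tau
        else:
            stack.append((t_div, d + 1))       # two daughters continue
            stack.append((t_div, d + 1))
    return counts


def selection_strength_division_count(counts, n0):
    """ASSUMED form S[D] = tau*Lambda - <D ln 2> under chronological weights."""
    tau_lambda = math.log(len(counts) / n0)
    mean_h_cl = sum(2.0 ** (-d) * d for d in counts) * math.log(2) / n0
    return tau_lambda - mean_h_cl


rng = random.Random(0)
narrow_counts = simulate_division_counts(100, 5.0, shape=50.0, mean_gen_time=1.0, rng=rng)
broad_counts = simulate_division_counts(100, 5.0, shape=2.0, mean_gen_time=1.0, rng=rng)
s_narrow = selection_strength_division_count(narrow_counts, 100)
s_broad = selection_strength_division_count(broad_counts, 100)
```

Both populations are homogeneous in the sense that every cell draws from the same distribution, yet the broader generation time distribution yields the larger S[D] in this sketch, in line with the statement that the shape of the distribution influences selection strength.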
 « Therefore, in contrast to a common assumption that selection necessarily decreases fitness variance, here we show that under certain conditions selection can increase fitness variance among cellular ». This is a super interesting statement, but there is such a lack of explanations and intuition here that it is obscure to me what actually happens here.
When a decrease in fitness variance by selection is mentioned in evolutionary biology, an upper bound and inheritance of fitness across the generations of individuals are usually assumed. In such circumstances, selection drives the fitness distribution toward the maximum value, and the selection eventually causes fitness variance to decrease. However, even in this process, a decrease is not assured for every step; whether selection reduces fitness variance at each step depends on the fitness distribution at that time.
In our argument, we compared fitness variances between chronological and retrospective distributions. We showed both theoretically and experimentally that there are cases where the variances of the retrospective distributions (distributions after selection) become larger than those of the chronological distributions (distributions before selection). The direction of variance change depends on the shape of chronological distributions, primarily on the skewness of the distributions (positive skew for increasing the variance and negative skew for decreasing the variance). The direction of variance changes can also be probed by the difference between the two selection strength measures $S_{\mathrm{KL}}^{(2)} - S_{\mathrm{KL}}^{(1)}$. Notably, we can demonstrate that there are cases where retrospective fitness variances are larger than chronological fitness variances even in the long-term limit, as shown by a cell population model in Appendix 2.
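A two-point toy example makes the dependence on skewness tangible. The sketch below is illustrative only and uses made-up distributions, not the experimental data. It assumes the division count D as the trait with h(D) = D ln 2, so that the retrospective distribution is the chronological one reweighted by 2^D, and it assumes $S_{\mathrm{KL}}^{(1)} = D_{\mathrm{KL}}(Q_{\mathrm{cl}} \| Q_{\mathrm{rs}})$ and $S_{\mathrm{KL}}^{(2)} = D_{\mathrm{KL}}(Q_{\mathrm{rs}} \| Q_{\mathrm{cl}})$.

```python
import math

LN2 = math.log(2.0)


def retrospective(q_cl):
    """Reweight a chronological division-count distribution by 2**D."""
    weights = {d: p * 2.0 ** d for d, p in q_cl.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def fitness_variance(q):
    """Variance of h(D) = D ln 2 under the distribution q."""
    mean = sum(p * d * LN2 for d, p in q.items())
    return sum(p * (d * LN2 - mean) ** 2 for d, p in q.items())


def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for dicts on the same support."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def summarize(q_cl):
    q_rs = retrospective(q_cl)
    return {
        "var_cl": fitness_variance(q_cl),
        "var_rs": fitness_variance(q_rs),
        "s2_minus_s1": kl(q_rs, q_cl) - kl(q_cl, q_rs),  # ASSUMED S2 - S1
    }


# Hypothetical chronological distributions over division counts {0, 4}.
right_skewed = summarize({0: 0.9, 4: 0.1})   # most lineages barely divide
left_skewed = summarize({0: 0.1, 4: 0.9})    # most lineages divide often
```

In the right-skewed case the retrospective variance exceeds the chronological one and the difference of the two measures is positive; in the left-skewed case both signs flip. These two cases only illustrate the stated trend and are not a general proof.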
We now explain what kind of situations are usually premised when reduction of fitness variance is mentioned and clarify that, in our framework, we compare the fitness variances between chronological and retrospective distributions (L542-548). We also explain that a selection effect on fitness variance generally depends on fitness distribution and that a larger fitness variance in retrospective distribution is possible even in the long-term limit (L548-557).
Reviewer #2 (Public Review):
The paper addresses a fundamental question: how do phenotypic variations among lineages relate to the growth rate of a population. A mathematical framework is presented which focuses on lineage traits, i.e. the value of a quantitative trait averaged over a cell lineage, thus defining a fitness landscape h(x). Several measures of selection strengths are introduced, whose relationships are clarified through the introduction of the cumulant generating function of h(x). These relationships are illustrated in analytical mathematical models and examined in the context of experimental data. It is found that higher than third order cumulants are negligible when cells are in early exponential phase but not when they are regrowing from a stationary phase.
The framework is elegant and its independence from mechanistic models appealing. The statistical approach is broadly applicable to lineage data, which are becoming increasingly available, and can for instance be used to identify the conditions under which specific traits are subject to selection.
We thank the reviewer for the positive evaluation. We reply to the specific comments below.
Reviewer #3 (Public Review):
In this work the authors have constructed a useful mathematical framework to delineate contributions leading to differences in lineages of populations of cells. In principle, the framework is widely applicable to exponentially growing populations. An attractive feature is that the framework is not tailored to particular growth models or environmental conditions. I expect it will be valuable for systems where contributions from phenotypic heterogeneity overwhelm contributions from intrinsic stochasticity in cellular dynamics.
I am generally very positive about this work. Nevertheless, a few specific concerns:
 In here, lineages are considered as fitter if they have more division events. But this consideration neglects inherent stochasticity in division events. Even in a completely homogeneous population, the number of division events for different lineages is different due to intrinsic stochasticity, but applying the methods discussed in this manuscript may lead to falsely assigning different fitness levels to different lineages. The reason why (despite having different number of division events) these lineages ought be assigned the same fitness level is that future generations of these cells will have identical statistics, in contrast with those of cells that are phenotypically different. Extending the idea to heterogeneous populations, the actual difference in fitness levels may be significantly different from what is obtained from the mathematical framework presented here, depending on the level of inherent stochasticity.
We thank the reviewer for the comment on a point that was insufficiently explained in the original manuscript. Intrinsic stochasticity in interdivision time (generation time) is, in fact, critical for selection. For example, if a cell divides with a generation time shorter than the average due to stochasticity, this cell is likely to have more descendant cells in the future population on average than the other cells born at the same time, even if the descendants follow identical statistics. Therefore, the properties of intrinsic stochasticity, including shapes of generation time distributions and transgenerational correlations, significantly affect the overall selection strength $S_{\mathrm{KL}}^{(1)}[D]$ (and also $S_{\mathrm{KL}}^{(2)}[D]$). We now explain this important point in the Results section, referring to the analytical model in Appendix 2 (L327-334), and also in Discussion (L519-524).
Importantly, even when cell division processes seem purely stochastic, different states in some traits might underlie these variations in generation times. In such cases, evaluating h(x) and $S_{\mathrm{rel}}[X]$ can still unravel the correlations between the trait values and fitness. In particular, the relative selection strength $S_{\mathrm{rel}}[X] := S_{\mathrm{KL}}^{(1)}[X] / S_{\mathrm{KL}}^{(1)}[D]$ extracts the correlation of the trait values at a given level of division count heterogeneity in each condition. We now clarify this important aspect of the framework in Discussion (L524-526).
When a cell population is composed of heterogeneous subpopulations each of which follows a distinct statistical rule, our framework evaluates the combined effects from the heterogeneous rules and the inherent stochasticity of each subpopulation. Untangling these two contributions is generally challenging unless we have appropriate markers for distinguishing the subpopulations. However, when the subpopulations follow significantly distinct statistics, the division count distribution should become skewed or multimodal, and the difference between the two selection strength measures $S_{\mathrm{KL}}^{(2)}[D] - S_{\mathrm{KL}}^{(1)}[D]$ can suggest the existence of such subpopulations. Therefore, detailed analyses using all the selection strength measures and the fitness landscapes can provide insights into cell populations’ internal structures and selection.
We now explain the effect of inherent stochasticity in generation times (L327-334 and L519-524) and discuss how we can probe the existence of subpopulations based on the selection strength measures (L508-512). Please also refer to our reply to comment 3 of Reviewer #1.
 In one of the sections the authors mention having performed analytical calculations for a cellular population in which cells divide with gamma distributed uncorrelated interdivision times. It's unclear if 1) within specific subpopulations, cells within the subpopulation divide with the same division time, and the distribution of division times is due to the diverse distribution of subpopulations; or 2) if there are no such subpopulations and all cells stochastically choose division time from the same distribution irrespective of their past lineage. If the latter, then I do not see the need for a lineage-based mathematical formulation when the problem can be dealt with in much simpler traditional ways which do not keep track of lineages.
We dealt with the situation of 2) in this model. As noted by the reviewer, we can calculate the chronological and retrospective mean fitness and the population growth rate by a simpler individual-based age-structured population model (see ref. 10, for example). However, applying this framework to this model can clarify the utility of the cumulant generating function, the meaning of the differences between these fitness measures, and the effect of statistical properties of intrinsic stochasticity on long-term growth rate and selection. Therefore, we kept this model in Appendix 2 (the section is moved from Supplemental Information) with additional clarification of our motivation for analysis and the implication of the results.
 The analytical calculations provided seem to be exact only for trajectories of almost infinite duration (or in practice, duration much greater than typical interdivision time). For example, if the observation time is of the order of division time, this would create significant artifacts / artificial bias in the weights of lineages depending on whether the cell was able to divide within the observation time or not. Thus, the results claiming that contributions of higher order cumulants become significant in the regrowth from a late stationary phase are questionable, especially since authors note that 90% of cells showed no divisions within the observation time.
We thank the reviewer for an insightful comment. It is true that the duration of observation influences the results. In the regrowing experiments with E. coli, we aimed to compare the two cell populations regrowing from different stages of the stationary phase. Therefore, it is appropriate to fix the time windows between the two conditions. Even though a significant fraction of cell lineages remains undivided, the regrowing cells already divide several times within this time window. Therefore, the results are valid if we compare and discuss the selection levels in this time scale. However, clarification of the selection in the longer time scales requires a more detailed characterization of lag time distributions under both conditions.
We now clarify the range of validity of the results and the limitations on prediction for the long-term selection without knowing the details of the lag time distributions in Discussion (L536-539).

Evaluation Summary:
The manuscript by Wakamoto and colleagues presents a general statistical framework to infer selection on a quantitative trait based on measurements of the values of this trait along related cell lineages. The manuscript provides both a detailed explanation of the mathematical underpinnings of the method and an illustration of its application to existing and new cell lineage datasets. The framework is widely applicable to general exponentially growing populations and is not tailored to particular growth models or environmental conditions.
(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #3 agreed to share their name with the authors.)

Reviewer #1 (Public Review):
This paper introduces a new statistical framework to study cellular lineages and traits. Several new measures are introduced to infer selection strength from individual lineages. The key observation is that one can simply relate cumulants of a fitness landscape to population growth, and all of this can be simply computed from one generating function, that can be inferred from data. This formalism is then applied to experimental cell lineage data.
I think this is a very interesting and clever paper. However, in its current form the paper is very hard to read, with very few explanations beyond the mathematical observations/definitions, which makes it almost unreadable for people outside of the field in my opinion. Some more intuitive explanations should be given for a broader audience, on all aspects : definitions of fitness « landscape », selection strength(s), connections between cumulants and other properties (including skewness) etc... There are many new definitions given with names reminiscent of classical concepts in evolutionary theory, but the connection is not always obvious. It would be great to better explain with very simple, intuitive examples, what they mean, beyond maths, possibly with simple examples. Some of this might be obvious to population geneticists, and in fact some explanations made in discussion are more illuminating, but earlier would be much better. I give more specific comments below.
Major comments :
 The authors give names to several functions, for instance before equation (1) they mention « fitness landscape », then describe « net fitness », which allows the authors to define « fitness cumulants ». Later on, a « selection » is defined. Those terms might mean different things for different authors depending on the context, to the point they are sometimes almost confusing. For instance, why is h a « landscape »? For me, a landscape is kind of like a potential, and I really do not see how this is connected to h. « fitness cumulants » is particularly jargonic. There are also two kinds of selection strengths, which is very confusing. I would recommend that the authors make a glossary of the terms, explain intuitively what they mean and maybe connect them to standard definitions.
 Along the same line, it would be good to give more intuitive explanations of the different functions introduced. For instance I find (2) more intuitive than (1) to define h. I think some more intuition on what the authors call selection strengths would be super useful. In Table 1 selection strengths are related to Kullback-Leibler divergence (which does not seem to be defined), it would be good to better explain this.
 It seems to me the authors implicitly assume that, along a lineage, one would have almost stationary phenotypes (e.g. constant division rate). However, one could imagine very different situations, for instance the division rates could depend on interactions with other cells in the growing population, and thus change with time along a lineage. One could also have some strong random components of division rate over time. I am wondering how those more complex cases would impact the results and the discussion.
 « Therefore, in contrast to a common assumption that selection necessarily decreases fitness variance, here we show that under certain conditions selection can increase fitness variance among cellular ». This is a super interesting statement, but there is such a lack of explanations and intuition here that it is obscure to me what actually happens here.

Reviewer #2 (Public Review):
The paper addresses a fundamental question: how do phenotypic variations among lineages relate to the growth rate of a population. A mathematical framework is presented which focuses on lineage traits, i.e. the value of a quantitative trait averaged over a cell lineage, thus defining a fitness landscape h(x). Several measures of selection strengths are introduced, whose relationships are clarified through the introduction of the cumulant generating function of h(x). These relationships are illustrated in analytical mathematical models and examined in the context of experimental data. It is found that higher than third order cumulants are negligible when cells are in early exponential phase but not when they are regrowing from a stationary phase.
The framework is elegant and its independence from mechanistic models appealing. The statistical approach is broadly applicable to lineage data, which are becoming increasingly available, and can for instance be used to identify the conditions under which specific traits are subject to selection.

Reviewer #3 (Public Review):
In this work the authors have constructed a useful mathematical framework to delineate contributions leading to differences in lineages of populations of cells. In principle, the framework is widely applicable to exponentially growing populations. An attractive feature is that the framework is not tailored to particular growth models or environmental conditions. I expect it will be valuable for systems where contributions from phenotypic heterogeneity overwhelm contributions from intrinsic stochasticity in cellular dynamics.
I am generally very positive about this work. Nevertheless, a few specific concerns:
In here, lineages are considered as fitter if they have more division events. But this consideration neglects inherent stochasticity in division events. Even in a completely homogeneous population, the number of division events for different lineages is different due to intrinsic stochasticity, but applying the methods discussed in this manuscript may lead to falsely assigning different fitness levels to different lineages. The reason why (despite having different number of division events) these lineages ought be assigned the same fitness level is that future generations of these cells will have identical statistics, in contrast with those of cells that are phenotypically different. Extending the idea to heterogeneous populations, the actual difference in fitness levels may be significantly different from what is obtained from the mathematical framework presented here, depending on the level of inherent stochasticity.
In one of the sections the authors mention having performed analytical calculations for a cellular population in which cells divide with gamma distributed uncorrelated interdivision times. It's unclear if 1) within specific subpopulations, cells within the subpopulation divide with the same division time, and the distribution of division times is due to the diverse distribution of subpopulations; or 2) if there are no such subpopulations and all cells stochastically choose division time from the same distribution irrespective of their past lineage. If the latter, then I do not see the need for a lineage-based mathematical formulation when the problem can be dealt with in much simpler traditional ways which do not keep track of lineages.
The analytical calculations provided seem to be exact only for trajectories of almost infinite duration (or in practice, duration much greater than typical interdivision time). For example, if the observation time is of the order of division time, this would create significant artifacts / artificial bias in the weights of lineages depending on whether the cell was able to divide within the observation time or not. Thus, the results claiming that contributions of higher order cumulants become significant in the regrowth from a late stationary phase are questionable, especially since authors note that 90% of cells showed no divisions within the observation time.
