Automated, high-dimensional evaluation of physiological aging and resilience in outbred mice

Curation statements for this article:
  • Curated by eLife


    Evaluation Summary:

    Chen et al. develop a comprehensive platform to score aging-dependent changes in mouse physiology and behavior using a multi-dimensional longitudinal phenotyping approach. Their thorough data collection and analysis reveal a diversity of trajectories in aging-related physiological and behavioral changes and help disentangle biological aging from chronological aging, providing a pioneering reference for future studies aimed at large-scale, multi-dimensional phenotyping of aging.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)


Abstract

Behavior and physiology are essential readouts in many studies but have not benefited from the high-dimensional data revolution that has transformed molecular and cellular phenotyping. To address this, we developed an approach that combines commercially available automated phenotyping hardware with a systems biology analysis pipeline to generate a high-dimensional readout of mouse behavior/physiology, as well as intuitive and health-relevant summary statistics (resilience and biological age). We used this platform to longitudinally evaluate aging in hundreds of outbred mice across an age range from 3 months to 3.4 years. In contrast to the assumption that aging can only be measured at the limits of animal ability via challenge-based tasks, we observed widespread physiological and behavioral aging starting in early life. Using network connectivity analysis, we found that organism-level resilience exhibited an accelerating decline with age that was distinct from the trajectory of individual phenotypes. We developed a method, Combined Aging and Survival Prediction of Aging Rate (CASPAR), for jointly predicting chronological age and survival time and showed that the resulting model is able to predict both variables simultaneously, a behavior that is not captured by separate age and mortality prediction models. This study provides a uniquely high-resolution view of physiological aging in mice and demonstrates that systems-level analysis of physiology provides insights not captured by individual phenotypes. The approach described here allows aging, and other processes that affect behavior and physiology, to be studied with improved throughput, resolution, and phenotypic scope.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    Chen et al. embark on a comprehensive analysis of physiological and behavioral aging in a mixed-bred (Diversity Outbred, or DO) mouse population. They aim to analyze spontaneous trajectories in mouse aging from longitudinal data acquisition, using commercially available monitoring cages able to detect a diversity of aging-related changes in the physiology and behavior of individual mice. This work has two major strengths: the extensive data generated and the analytical tour-de-force to extract relevant features from multi-dimensional aging data. Overall, the authors reached their goal and I congratulate them for the clarity and thoroughness of the analyses conducted.

    We thank the reviewer for this positive assessment of our work.

    The main question of this work is somewhat subordinate to their approach. If I were to summarize their main question, I would say "can we extract spontaneous aging trajectories/features from non-invasive behavioral monitoring in mix-bred mice"? Overall, the authors answer this question and discuss the implications of their findings. This work helps generate a clear separation between the concepts of chronological aging and biological aging (CASPAR approach), providing an integrated measure of both, and relating this measure to individual data sources. The authors further provide important insights into the concept of aging-related decline in resilience, which their multi-dimensional data integration convincingly supports. This work will likely have an important impact on future studies focused on integrated measures of physiological/behavioral aging. What is not entirely clear so far from this work is how future work by other groups will be able to benefit from these data and approaches, i.e. how accessible and scalable the analyses presented in this work are to different experimental designs, e.g. where more sparse data are obtained. The authors should make the data easily available/accessible to the public, as well as their code.

    We agree with the reviewer and have made all data and code available on GitHub in order to facilitate future work (https://github.com/calico/catnap).

    While this work is comprehensive and rather impressive, the way it is written so far does not focus on the results, but rather on the methodology.

    We agree with the reviewer that this study could be framed in multiple ways. We chose to frame the narrative around the methodology and analyses because we believe those will be more useful to the field than the particular set of physiological changes that we identify, though these are also interesting.

    Reviewer #2 (Public Review):

    In their study, Chen et al. consider a set of 415 genetically diverse, outbred mice. This population is assembled from eight distinct cohorts, each entering the study at a separate chronological age ranging from three to twenty-four months. By employing a commercially available automated-phenotyping system, the authors collected high-dimensional phenotyping data that quantifies both behavior and physiologic properties like oxygen consumption. Animals were placed in the phenotyping system for week-long measurement intervals, alternating with three-week intervals in more standard cages. In this way, the authors cleverly overcome challenges in longitudinal measurement by stitching together eight overlapping longitudinal time series into a single forty-week characterization of the entire murine lifespan.

    The authors found that many of their measurements covary at short timescales according to an individual's behavioral state: sleeping, eating, running, etc. To control for this effect, the authors developed a hidden Markov model that allowed them to automatically identify an animal's behavioral state, thus segmenting longitudinal measurements into distinct behavioral stages. This allowed the authors to more accurately study the long-term effects of aging by removing the confounding effects of short-term behavioral changes.

    The authors find that circadian rhythms changed with chronological age and that energy expenditure while resting declined. In fact, eighty percent of all metrics correlated significantly with chronological age.

    The authors genotyped each mouse using an array of SNP probes, allowing them to identify genotype-phenotype correlations. The authors observed a low heritability on average among all traits (median correlation = 0.22), but found that these heritable factors tended to affect multiple phenotypes simultaneously. Notably, the heritability of body mass was relatively high, in agreement with previous studies.

    Irrespective of genetics, 250 features clustered into 20 groups based on covariation over time. The authors identified a general increase in the covariation of traits between and within these clusters as animals aged. The authors refer to these increases in covariation as "decreases in resilience".

    Finally, the authors developed a model of aging that integrates phenotypic data and lifespan data. This model appears to draw implicitly from concepts developed by OO Aalen and James Vaupel under the name of "frailty" models, positing that each individual exhibits a characteristic rate of aging that contributes to differences in lifespan among peers. The authors fit their model using a maximum likelihood approach, implemented using gradient boosted decision trees, that allows them to estimate the relative rate of each individual's aging using longitudinal phenotypic data and compare this to inter-individual differences in lifespan. The authors' model produces rather unimpressive predictions of chronological age, with correlations ranging between 0.5 and 0.75 depending on model tuning. The model has more difficulty predicting an individual's remaining lifespan, only correlating between 0.25 and 0.425 depending on model tuning.

    Strengths

    The main strength of this manuscript is its thoughtful study design, which combines high-dimensional phenotyping, genotypic data, and large population size. An impressive effort went into collecting these measurements and the result seems likely to be useful for many future analyses. An additional strength of this manuscript is the HMM model. By subdividing time-series measurements into distinct short-term behavioral periods, long-term trends in behavior and physiology can be identified without the confounding influence of short-term behavioral states. Finally, the authors' "CASPAR" model seems like a thoughtful attempt to relate longitudinal phenotypic aging to lifespan, even if its performance is not yet so impressive.

    We thank the reviewer for their positive assessment of the study’s utility, including the experimental design, data generated, and analytical tools we developed.

    The performance of our model is comparable to the chronological age and time-to-death predictions of other models based on rodent physiological and behavioral data (1). Further, given that the field lacks a ground truth measurement of biological age, and that biological age is not perfectly captured by either chronological age or time-to-death, it is unclear exactly what “good model performance” looks like in this context; a higher R² is not necessarily better. For example, if the rate of aging between individuals varies by ~30%, a model that predicts chronological age to an R² of 0.95 is likely less useful than a model that predicts chronological age to an R² of 0.7 or lower. Similar observations have been made in the field of epigenetic aging, in which epigenetic clocks fit by optimizing prediction of chronological age were able to achieve high correlation with chronological age, but failed to capture variation in disease or mortality risk among similarly aged individuals. Instead, clock models optimized for predicting other proxies of physiological health did better at predicting various clinical outcomes of interest, at the expense of correlation with chronological age (2). We further note that in such (human) models, the average chronological age correlation with model prediction sits at R² = 0.53. We discuss the CASPAR model in more detail below.
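
    The point about R² can be illustrated with a small simulation. This is only a sketch under stated assumptions, none of which come from the study: ages are drawn uniformly between 3 and 36 months, individual aging rates are log-normal with roughly 30% spread, and "true" biological age is simply rate times chronological age. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumption: chronological ages spread uniformly over 3-36 months.
chron_age = rng.uniform(3.0, 36.0, size=n)

# Assumption: individual aging rates are log-normal with ~30% spread.
aging_rate = np.exp(rng.normal(0.0, 0.3, size=n))

# Hypothetical "true" biological age under a multiplicative rate model.
bio_age = aging_rate * chron_age

# Squared Pearson correlation between biological and chronological age:
# the R^2 that a perfect biological-age readout would show against
# chronological age in this simulated population.
r2_perfect = np.corrcoef(bio_age, chron_age)[0, 1] ** 2
print(f"R^2 of a perfect biological-age readout vs chronological age: {r2_perfect:.2f}")
```

    Under these assumptions the value falls well below 0.95, so a model reaching 0.95 against chronological age would have to be discarding much of the simulated between-individual variation in aging rate.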

    Weaknesses

    The manuscript is substantially weakened by a lack of clarity on several important conceptual points. First, the authors appear to assume that any change that occurs at month-long timescales must be "aging". The authors choose to discard the first day of measurements in a cage to account for behavioral adaptation, demonstrating their concern for distinguishing behavioral adaptations from aging phenomena. However, the authors' efforts to do this seem rather cursory, as mice surely learn and adapt over time-scales longer than twenty-four hours. The reader is left wondering to what extent this study measures the phenotypic consequences of aging, and to what extent it measures long-term adaptation of individuals to a four-week rotation schedule in and out of different cages.

    The reviewer raises an excellent point: in longitudinal studies of aging, it is important to distinguish training effects from biological aging. This is an issue for our study as well as for many behavior/physiological phenotyping protocols employed in other healthspan studies, e.g. strength tests, rotarod, fear conditioning, mazes, etc. One solution is to employ a continuous phenotyping platform that does not perturb the test subject, e.g. continuous home cage monitoring. We could have approximated that by keeping animals individually housed in the phenotyping cages at all times, but this would have 1) substantially reduced the number of animals we could study, and 2) introduced potentially confounding consequences of permanent isolation. Instead, we chose to employ monthly rotations (a ~4x increase in animal number vs continuous monitoring), and we controlled for long-term adaptation to the phenotyping cages as described in the “Features” subsection in the “Methods” section:

    “We controlled for exposure to the phenotyping cage as a confounder using ANOVA / multiple linear regression. After introducing a new variable "run number" for the number of times a mouse has been profiled in the phenotyping cages, we fit a regression model regressing out the effect of "run number" and interactions between "run number" and the HMM state on all measurements. This allowed us to learn a correction for exposure effects specific to each state for each measurement.”

    We were able to employ this strategy because, due to our staggered enrollment design, age and exposure to the metabolic cage are not perfectly confounded.
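
    A minimal sketch of this kind of exposure correction is given below, on simulated data. It is an illustration only, not the authors' code: the function name, the simulated effect sizes, and the decision to include age as a covariate in the regression are all assumptions made here. Eight enrollment ages are simulated so that age and run number are correlated but not perfectly confounded.

```python
import numpy as np

def correct_run_number(y, run, state, age, n_states):
    """Remove state-specific run-number effects from one measurement.

    Fits y ~ state + state:run + age by least squares and subtracts only
    the fitted state-specific run-number terms.
    """
    onehot = np.eye(n_states)[state]          # (n, K) state indicators
    run_by_state = onehot * run[:, None]      # (n, K) run-number x state interaction
    design = np.column_stack([onehot, run_by_state, age])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    run_slopes = coef[n_states:2 * n_states]  # one exposure slope per state
    corrected = y - run_by_state @ run_slopes
    return corrected, run_slopes

rng = np.random.default_rng(0)
n, n_states = 6000, 3
state = rng.integers(0, n_states, size=n)            # hypothetical HMM state labels
run = rng.integers(1, 11, size=n).astype(float)      # times profiled so far (1..10)

# Staggered enrollment: eight cohorts entering at 3 to 24 months.
enrollment_age = rng.choice(np.linspace(3.0, 24.0, 8), size=n)
age = enrollment_age + (run - 1.0)                   # roughly one month per rotation

true_slopes = np.array([0.5, -0.2, 0.0])             # simulated exposure effect per state
state_means = np.array([10.0, 20.0, 30.0])
y = (state_means[state] + true_slopes[state] * run
     - 0.1 * age + rng.normal(0.0, 1.0, size=n))

corrected, run_slopes = correct_run_number(y, run, state, age, n_states)
print("estimated run-number slopes per state:", np.round(run_slopes, 2))
```

    If every animal entered the study at the same age, the age column would be a linear combination of the intercept and run-number columns and the slopes could not be separated; the spread of enrollment ages is what makes the fit identifiable in this sketch.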

    Covariate correction aside, the question remains whether any bona fide change that occurs at month-long timescales in mice should be considered “aging”. This is a nearly philosophical question, and one that we are unable to definitively resolve in this study. However, we have previously described aging as a label for all of the biological changes that develop in a high proportion of individuals within a population over an average lifespan (3), and by that metric, all of the age-related changes we observe that occur over months/years would be considered part of “aging”.

    As a second conceptual issue, the authors adopt a rather shallow and limited practical definition of the term "resilience". Conceptually, they define resilience as "the ability of a system to maintain function in the face of change", which seems reasonable and corresponds with the general thinking about resilience. However, in practice, the authors define resilience as the inverse of correlation among traits: an animal is more "resilient" when its different phenotypic traits are less correlated. This practical definition lends itself well to measurement using the data in this study, but leads to an incongruity between conceptual and practical definitions of "resilience". Correlation of traits is not uniquely determined by an organism's resilience: there could be any number of reasons for traits to increase in covariance beyond a failure of resilience. Any change in the physiologic relationship between two traits will alter the causal structure of the traits' interactions and therefore alter the traits' covariance. Are the authors arguing that any change in physiology must inherently involve changes in resilience? A more convincing practical definition of resilience would involve a more direct test of the conceptual definition, given by the authors as "the ability of a system to maintain function in the face of change". For example, the authors might have provided some sort of physiologic challenge and measured animals' response to it: a physical stress test, a test of thermoregulation in response to changes in temperature, or the speed of adaptation to a novel environment. Given the data collected, the authors can measure many interesting aspects of aging, but they do not seem adequately justified in calling one of these aspects "resilience".

    We agree with the reviewer that resilience is a broad term. Like other broad terms such as “health” or “biological age”, it is unlikely to be perfectly captured by a single metric. We have added this caveat to the text (edits in purple): “Resilience refers to the ability of a system to maintain function in the face of change. This is a broad concept that is unlikely to be fully captured by a single number, and multiple approaches to measuring organism-level resilience have been proposed (4). Here, we propose a metric that is based on the relationship between physiological features.”

    There are several reasons why the metric we employ here is a valuable addition to the resilience toolbox. First, we believe our metric does capture the ability of the system to respond to change. Animals experience numerous environmental perturbations over the course of daily living: circadian transitions, metabolizing a meal, recovering from a bout of exercise, etc., and our network model incorporates the systemic response to each of these. These challenges are not as overt as a thermoregulation test, but with sufficiently sensitive readouts, such as the ones our platform employs, large perturbations are not necessary. This concept of measuring “micro recoveries” has been successfully applied in other fields, such as ecology, and has recently been used to quantify individual resilience (5,6).

    Second, unlike utilizing a specific perturbation (e.g. a treadmill test, thermoregulation, or novel environment), which likely impacts certain physiological systems more than others, our metric incorporates the system-wide response to a variety of small perturbations, thereby incorporating more dimensions of physiology into the final summary statistic.

    Third, only specific types of physiological changes will affect our measurement of resilience; we are quantifying the negative multivariate mutual information, which is effectively a measurement of overall network connectivity. Changes in physiology that result in no net change in network connectivity will have no effect on our resilience measurement, e.g. randomizing network edges will not affect the metric.
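
    To make the idea of a connectivity-based summary statistic concrete, the sketch below computes total correlation, one multivariate generalization of mutual information, under a Gaussian assumption, where it reduces to minus half the log-determinant of the feature correlation matrix. The estimator, the function names, and the simulated data are assumptions made for illustration; the response does not specify how the authors estimate multivariate mutual information.

```python
import numpy as np

def gaussian_total_correlation(features):
    """Total correlation (in nats) of the columns, assuming joint normality.

    For Gaussian variables this equals -0.5 * log det(R), where R is the
    correlation matrix; it is zero when all features are uncorrelated.
    """
    corr = np.corrcoef(features, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    return -0.5 * logdet

def resilience_score(features):
    """Hypothetical resilience summary: negative total correlation."""
    return -gaussian_total_correlation(features)

rng = np.random.default_rng(0)
n, p = 5000, 5

# Simulated "loosely coupled" physiology: independent features.
independent = rng.normal(size=(n, p))

# Simulated "tightly coupled" physiology: every feature loads on one shared factor.
shared = rng.normal(size=(n, 1))
coupled = 0.8 * shared + 0.6 * rng.normal(size=(n, p))

res_independent = resilience_score(independent)
res_coupled = resilience_score(coupled)
print(f"independent features: {res_independent:.3f}")
print(f"coupled features:     {res_coupled:.3f}")
```

    In this sketch the score depends only on the overall dependence among features, not on which feature carries which label, so reordering the columns leaves it unchanged.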

    The manuscript also raises technical concerns. First, it is unclear whether all analyses in the manuscript are performed using features normalized for body mass or whether only analyses in certain sections of the manuscript are performed using features normalized for body mass. The details here are crucial because improper normalization would undermine the main conclusions of the manuscript. Normalization of multiple features to any shared reference has the potential to introduce a correlation between normalized features and the shared normalization factor. In fact, many approaches for normalization to body mass will always introduce a correlation between normalized features and body mass, with the only exception being if the un-normalized features and body mass are perfectly correlated. If the authors normalize traits before performing their various correlation analyses, such normalization could introduce artefactual correlations between traits. Any normalized quantity will correlate with body mass and all traits correlated with body mass will in consequence correlate with each other. In summary, the authors must explain their normalization procedure in more detail to identify or exclude any improper normalization that could confound their analyses. Analyses at risk of being confounded include the heritability analysis, the network analysis of phenotypes during aging, and the CASPAR analyses.

    We apologize if this methodology was unclear. Gas measurements (VO2, VCO2, VH2O, EE) were corrected for body mass and the corrected values were used for all analyses (Figs 2-4). We chose to perform this correction because uncorrected gas measurements were strongly correlated with body mass (see Figure S1D), and as we already had body mass information, we were interested in any residual information contained in the gas measurements. Once we performed this correction, gas measurements were no longer significantly correlated with body mass (see Figure S1E). We describe this in the first section of the Results:

    “As we already had body mass information, we were primarily interested in changes to energy expenditure and related parameters that were body mass independent. Therefore, for all gas-derived measurements (VO2, VCO2, VH2O, and energy expenditure) except RQ, which is a ratio, we normalized for body mass via linear regression in all subsequent analysis steps. This effectively removed the positive correlation with body mass (Figure 1–Figure Supplement 1E).”

    We agree with the reviewer that, in principle, this could lead to spurious correlations between gas measurements; however, 1) gas measurements were already highly correlated with one another, 2) the correlations before and after body mass correction were similar (we have now included this analysis in Figure S1), and 3) gas measurements were present in a multitude of clusters, rather than forming one large cluster (see Table S2). Thus, we believe the benefit of this normalization outweighs the cost.
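
    The difference between normalization strategies can be sketched on simulated data. The example below contrasts dividing a gas measurement by body mass (used here only as an example of a normalization that can leave a mass dependence; the response does not say this was ever used) with taking residuals from a linear regression on body mass, the approach described above. All numbers and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated body mass (grams) and a gas measurement that rises with mass.
body_mass = rng.uniform(20.0, 60.0, size=n)
vo2 = 1.0 + 0.05 * body_mass + rng.normal(0.0, 0.1, size=n)  # arbitrary units

# Option A (assumed comparison case): ratio normalization.
vo2_per_gram = vo2 / body_mass

# Option B: regress on body mass and keep the residual.
slope, intercept = np.polyfit(body_mass, vo2, deg=1)  # highest power first
vo2_residual = vo2 - (intercept + slope * body_mass)

corr_raw = np.corrcoef(vo2, body_mass)[0, 1]
corr_ratio = np.corrcoef(vo2_per_gram, body_mass)[0, 1]
corr_residual = np.corrcoef(vo2_residual, body_mass)[0, 1]

print(f"raw vs mass:      {corr_raw:+.2f}")
print(f"ratio vs mass:    {corr_ratio:+.2f}")
print(f"residual vs mass: {corr_residual:+.2f}")
```

    Ordinary least-squares residuals are uncorrelated with the regressor by construction, which is consistent with the corrected gas measurements no longer correlating with body mass. It does not by itself rule out shared structure among the corrected features, which is what the before/after comparison described in the response addresses.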

    In the methods section, the "CASPAR" model is described clearly. However, the intuitive description provided in the main text invokes the concept of an "unavoidable tension" between chronological age and inter-individual heterogeneity in the aging rate. The reviewer finds this latter description unhelpful and potentially misleading. The sigma parameter can in some sense be considered a hyperparameter, because tuning it alters the model's behavior and performance. However, the sigma parameter is, more importantly, a potentially measurable property of the system being studied. Individuals within the population exhibited some amount of variability in their individual aging rates, which if measured would determine the value of an empirically-grounded sigma parameter. Unfortunately, the authors are currently unable to estimate this sigma empirically and so they can only speculate about its true value. The authors are correct that different assumptions regarding variability in individual aging rates will produce different model behavior and differential performance in predicting chronological age and aging-rate heterogeneity. However, the authors err in implying that any "tension" exists in some grander, theoretic sense. More simply, the authors cannot currently measure an important parameter of their model. Readers would benefit from a clearer description of this parameter and the challenges in statistical inference it highlights.

    We agree with the reviewer that a ground truth measure of biological age would allow us to empirically determine the aging rate of individuals, and thus the variability in aging rate between individuals. Unfortunately, such a ground truth is not available: this is more an issue of definition (i.e. there is no agreed-upon measure of biological age) than of collecting the appropriate data.

    The unavoidable tension we describe is not a tension between chronological age and individual aging rate, but rather between chronological age and time to death, neither of which is a perfect representation of aging rate (it is well-established that individual lifespan is not a perfect reflection of aging rate, hence increasing interest in healthspan). Chronological age and time to death are not well-correlated (Figure 4–Figure Supplement 6A), thus optimizing a model to predict one necessarily results in some loss of performance in the other: this is the tension we refer to. Although neither outcome variable (chronological age or time to death) is a perfect representation of biological age, both have a rationale for being a useful proxy, which is why models to predict one or the other are present in the literature (1,2,7). However, we know of no models that allow for titration of the relative weights between these two outcome variables, or that have explored model performance when both outcome variables are taken into account; our approach fills this gap. We regret that our description was unclear, and we have edited the text to clarify and further describe the meaning of the sigma_beta parameter:

    “To address this, we developed an aging rate regression model in which biological age is determined from a combination of predicted chronological age and predicted health status (in this case, predicted time to death, though other health proxies such as a frailty score could be used). The model includes a hyperparameter that allows for tuning of the relative weighting of chronological age and time to death, allowing us to generate models with different behaviors. More specifically, this hyperparameter (denoted sigma_beta) quantifies our belief that different individuals age at different rates. If a ground truth measurement of individual aging rates existed, this hyperparameter could be measured empirically. Unfortunately, there remains no agreed-upon definition of biological age and no such ground truth is available. Therefore, here we explore model behavior under several different values of sigma_beta. A low value of sigma_beta causes the model to assume that all individuals age at similar rates, meaning that the biological age of individuals of the same chronological age should be similar. In this case, model training heavily weights chronological age, and the resulting model approximates a standard age clock model. Conversely, a high value of sigma_beta causes the model to assume that individuals can age at different rates, and thus model training disregards chronological age, instead emphasizing health status (time to death), and the resultant model approximates a standard accelerated failure time model. Neither chronological age nor time to death is a perfect representation of aging rate, and they are not particularly well-correlated with one another (Figure 4–Figure Supplement 6A), thus optimizing the prediction of one necessarily reduces performance for the other, resulting in a tunable tension in model behavior and the ability to explore intermediate states that may avoid overfitting to either of these imperfect biological age surrogates. Because this framework utilizes both chronological age and survival time as outcome variables, we name this approach the "Combined Age and Survival Prediction of Aging Rate", or CASPAR.”
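
    The tunable trade-off can be illustrated with a deliberately simplified toy objective. This is not the CASPAR likelihood and not the authors' implementation (the review above mentions maximum likelihood fitting with gradient boosted decision trees); it is a linear least-squares stand-in in which the chronological-age term is weighted by 1/sigma_beta² against a time-to-death term. The simulated population, the reference lifespan, the two features, and the function name are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
REFERENCE_LIFESPAN = 36.0  # months; arbitrary value for the simulation

# Simulated population: multiplicative aging rates and the lifespans they imply.
n = 4000
chron_age = rng.uniform(3.0, 24.0, size=n)
aging_rate = np.exp(rng.normal(0.0, 0.3, size=n))
time_to_death = REFERENCE_LIFESPAN / aging_rate - chron_age
alive = time_to_death > 0                      # keep animals still alive when profiled
chron_age, aging_rate, time_to_death = (
    chron_age[alive], aging_rate[alive], time_to_death[alive])

# Two noisy hypothetical features: one tracks chronological age,
# the other tracks the simulated biological age (rate x age).
features = np.column_stack([
    chron_age + rng.normal(0.0, 1.0, size=chron_age.size),
    aging_rate * chron_age + rng.normal(0.0, 1.0, size=chron_age.size),
])

def fit_toy_model(features, chron_age, time_to_death, sigma_beta):
    """Minimize sum((f - age)^2) / sigma_beta^2 + sum((f - health_age)^2).

    For a linear predictor f this is equivalent to ordinary least squares
    against the weighted average of the two targets.
    """
    w_age = 1.0 / sigma_beta ** 2
    health_age = REFERENCE_LIFESPAN - time_to_death   # age implied by remaining lifespan
    target = (w_age * chron_age + health_age) / (w_age + 1.0)
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

results = {}
for label, sigma_beta in [("low", 0.01), ("mid", 1.0), ("high", 100.0)]:
    pred = fit_toy_model(features, chron_age, time_to_death, sigma_beta)
    results[label] = {
        "age_corr": np.corrcoef(pred, chron_age)[0, 1],
        "ttd_corr": np.corrcoef(pred, time_to_death)[0, 1],
    }
    print(label, {k: round(float(v), 2) for k, v in results[label].items()})
```

    In this toy setting, a small sigma_beta behaves like an age clock (the prediction tracks chronological age most closely), while a large sigma_beta favors tracking time to death at the cost of correlation with chronological age, mirroring the qualitative behavior described in the quoted text.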

    Though impressive, this study's data has two limitations that the authors already acknowledge: 1) an absence of lifespan data for all animals and 2) a limited population size. Despite such limitations, the current data represents an impressive effort that will likely support many additional analyses.

    We thank the reviewer for this positive assessment of our work and hope to address both of these limitations in future studies.

    Finally, the authors seem to neglect substantial prior experimental characterizations of phenotypic aging and methodological work in studying multi-dimensional phenotyping of aging. For example, in nematodes a similar characterization has already been performed (Martineau et al., PLoS Computational Biology 2020), and related analytic methods have already been developed that show similar performance (Zhang et al., Cell Systems 2016). If the authors wish to draw conclusions that generalize beyond their particular mouse model, they cannot focus myopically on only mouse experiments.

    We are fans of both of those studies. We did not apply their exact methodology in this work because most of their analyses require fully longitudinal data and a larger number of individuals than we have available, but we do appreciate that they explore a similar conceptual space. We have added the following to the introduction (edits in purple): “Our study builds upon preexisting literature from other model organisms, particularly nematodes, demonstrating that passive, automated monitoring can be used to quantify multi-dimensional, organism-level aging (8–10).”

    In summary, the manuscript describes a solid and commendable effort that has produced a valuable data set. However, in contextualizing and analyzing this data, the authors fall noticeably short of their self-proclaimed "sophistication and rigor".

    References

    1. Schultz, M. B. et al. Age and life expectancy clocks based on machine learning analysis of mouse frailty. Nat. Commun. 11, 4618 (2020).
    2. Levine, M. E. et al. An epigenetic biomarker of aging for lifespan and healthspan. Aging 10, 573–591 (2018).
    3. Freund, A. Untangling Aging Using Dynamic, Organism-Level Phenotypic Networks. Cell Syst. 8, 172–181 (2019).
    4. Huffman, D. M. et al. Evaluating Health Span in Preclinical Models of Aging and Disease: Guidelines, Challenges, and Opportunities for Geroscience. J. Gerontol. A. Biol. Sci. Med. Sci. 71, 1395–1406 (2016).
    5. Scheffer, M. et al. Quantifying resilience of humans and other animals. Proc. Natl. Acad. Sci. 115, 11883–11890 (2018).
    6. Pyrkov, T. V. et al. Longitudinal analysis of blood markers reveals progressive loss of resilience and predicts human lifespan limit. Nat. Commun. 12, 2765 (2021).
    7. Fahy, G. M. et al. Reversal of epigenetic aging and immunosenescent trends in humans. Aging Cell 18, e13028 (2019).
    8. Zhang, W. B. et al. Extended Twilight among Isogenic C. elegans Causes a Disproportionate Scaling between Lifespan and Health. Cell Syst. 3, 333-345.e4 (2016).
    9. Martineau, C. N., Brown, A. E. X. & Laurent, P. Multidimensional phenotyping predicts lifespan and quantifies health in C. elegans. PLOS Comput. Biol. 16, e1008002 (2020).
    10. Le, K. N. et al. An automated platform to monitor long-term behavior and healthspan in Caenorhabditis elegans under precise environmental control. Commun. Biol. 3, 1–13 (2020).
    11. Acosta-Rodríguez, V. A., de Groot, M. H. M., Rijo-Ferreira, F., Green, C. B. & Takahashi, J. S. Mice under Caloric Restriction Self-Impose a Temporal Restriction of Food Intake as Revealed by an Automated Feeder System. Cell Metab. 26, 267-277.e2 (2017).
    12. Yasumoto, Y., Nakao, R. & Oishi, K. Free Access to a Running-Wheel Advances the Phase of Behavioral and Physiological Cir
  2. Evaluation Summary:

    Chen et al. develop a comprehensive platform to score aging-dependent changes in mouse physiology and behavior using a multi-dimensional longitudinal phenotyping approach. Their thorough data collection and analysis reveals a diversity of trajectories in aging-related physiological and behavioral changes and helps disentangle biological aging from chronological aging, providing a reference pioneering work for future studies aimed at large-scale aging multi-dimensional phenotyping.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    Chen et al. embark into a comprehensive analysis of physiological and behavioral aging in a mixed-bred (Diversity Outbred, or DO) mouse population. They aim to analyze spontaneous trajectories in mouse aging from longitudinal data acquisition, using commercially-available monitoring cages, able to detect a diversity of aging-related changes in individual mice physiology and behavior.

    This work has two major strengths: the extensive data generated and the analytical tour-de-force to extract relevant features from multi-dimensional aging data.
    Overall, the authors reached their goal and I congratulate them for the clarity and thoroughness of the analyses conducted.

    The main question of this work is somehow subordinate to their approach. If I were to summarize their main question, I would say "can we extract spontaneous aging trajectories/features from non-invasive behavioral monitoring in mix-bred mice"? Overall, the authors answer this question and discuss the implications of their findings. This work helps generate a clear separation between the concepts of chronological aging from biological aging (CASPAR approach), providing an integrated measure of both, and relating this measure with individual data sources. The authors further provide important insights into the concept of aging-related decline in resilience, which their multi-dimensional data integration convincingly support. This work will likely have important impact on future studies focused on integrated measures of physiological/behavioral aging. What is not entirely clear so far from this work, is how future work by other groups will be able to benefit from these data and approaches, i.e. how accessible and scalable are the analyses presented in this work to different experimental designs, e.g. where more sparse data are obtained. The authors should make the data easily available/accessible to the public, as well as their code.

    While this work is comprehensive and rather impressive, the way it is written so far does not focus on the results, but rather on the methodology.

  4. Reviewer #2 (Public Review):

    In their study, Chen et al. consider a set of 415 genetically diverse, outbred mice. This population is assembled from eight distinct cohorts, each entering the study at a separate chronological age ranging from three to twenty-four months. By employing a commercially available automated phenotyping system, the authors collected high-dimensional phenotyping data that quantify both behavior and physiologic properties such as oxygen consumption. Animals were placed in the phenotyping system for week-long measurement intervals, alternating with three-week intervals in more standard cages. In this way, the authors cleverly overcome challenges in longitudinal measurement by stitching together eight overlapping longitudinal time series into a single forty-week characterization of the entire murine lifespan.

    The authors found that many of their measurements covary at short timescales according to an individual's behavioral state (sleeping, eating, running, etc.). To control for this effect, the authors developed a hidden Markov model that allowed them to automatically identify an animal's behavioral state, thus segmenting longitudinal measurements into distinct behavioral stages. This allowed the authors to study the long-term effects of aging more accurately by removing the confounding effects of short-term behavioral changes.
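    As a minimal sketch of this kind of state segmentation, the snippet below decodes the most likely state sequence of a two-state hidden Markov model with Gaussian emissions using the Viterbi algorithm. It is not the authors' model: the function name `viterbi_gaussian`, the two states ("resting" and "active"), and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, trans, start):
    """Most likely hidden-state path for a 1-D Gaussian-emission HMM."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    n_steps, n_states = len(obs), len(means)
    # Log-density of every observation under each state's Gaussian.
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds * np.sqrt(2 * np.pi)))
    log_trans = np.log(trans)
    score = np.log(start) + log_emit[0]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        cand = score[:, None] + log_trans  # cand[i, j]: move from state i to j
        back[t] = cand.argmax(axis=0)      # best predecessor of each state j
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(n_steps, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical activity trace: state 0 = resting, state 1 = active.
obs = [0.1, 0.0, 0.2, 2.1, 1.9, 2.2, 0.1, 0.0]
states = viterbi_gaussian(
    obs,
    means=[0.0, 2.0], stds=[0.3, 0.3],
    trans=np.array([[0.9, 0.1], [0.1, 0.9]]),
    start=np.array([0.5, 0.5]),
)
```

    Once each time point carries a state label, a measurement such as energy expenditure can be averaged within each state separately, so that age trends are compared like with like.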

    The authors find that circadian rhythms changed with chronological age and that energy expenditure while resting declined. In fact, eighty percent of all metrics correlated significantly with chronological age.

    The authors genotyped each mouse using an array of SNP probes, allowing them to identify genotype-phenotype correlations. The authors observed a low heritability on average among all traits (median correlation = 0.22), but found that these heritable factors tended to affect multiple phenotypes simultaneously. Notably, the heritability of body mass was relatively high, in agreement with previous studies.

    Irrespective of genetics, 250 features clustered into 20 groups based on covariation over time. The authors identified a general increase in the covariation of traits between and within these clusters as animals aged. The authors refer to these increases in covariation as "decreases in resilience".
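    To make the covariation measure concrete, the sketch below computes the mean absolute off-diagonal correlation among features for two synthetic groups. The data are simulated, the helper `mean_abs_correlation` is a hypothetical name, and the statistic is a simple stand-in, not the network connectivity analysis used in the manuscript.

```python
import numpy as np

def mean_abs_correlation(features):
    """Mean absolute off-diagonal Pearson correlation of a (samples x features) array."""
    corr = np.corrcoef(features, rowvar=False)
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[off_diag]).mean())

rng = np.random.default_rng(0)
n_mice, n_features = 200, 10

# "Young" group: features vary independently of one another.
young = rng.normal(size=(n_mice, n_features))

# "Old" group: a shared latent factor couples all features (an assumption
# made only to produce increased covariation in this toy example).
shared = rng.normal(size=(n_mice, 1))
old = rng.normal(size=(n_mice, n_features)) + 1.5 * shared

coupling_young = mean_abs_correlation(young)
coupling_old = mean_abs_correlation(old)
```

    In this toy setting the coupled group shows much higher average covariation, yet the number alone does not say why the features became coupled.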

    Finally, the authors developed a model of aging that integrates phenotypic data and lifespan data. This model appears to draw implicitly from concepts developed by OO Aalen and James Vaupel under the name of "frailty" models, positing that each individual exhibits a characteristic rate of aging that contributes to differences in lifespan among peers. The authors fit their model using a maximum likelihood approach, implemented using gradient boosted decision trees, that allows them to estimate the relative rate of each individual's aging from longitudinal phenotypic data and compare this to inter-individual differences in lifespan. The authors' model produces rather unimpressive predictions of chronological age, with correlations ranging between 0.5 and 0.75 depending on model tuning. The model has more difficulty predicting an individual's remaining lifespan, with correlations only between 0.25 and 0.425 depending on model tuning.

    *Strengths*

    The main strength of this manuscript is its thoughtful study design, which combines high-dimensional phenotyping, genotypic data, and a large population size. An impressive effort went into collecting these measurements and the result seems likely to be useful for many future analyses. An additional strength of this manuscript is the HMM. By subdividing time-series measurements into distinct short-term behavioral periods, long-term trends in behavior and physiology can be identified without the confounding influence of short-term behavioral states. Finally, the authors' "CASPAR" model seems like a thoughtful attempt to relate longitudinal phenotypic aging to lifespan, even if its performance is not yet so impressive.

    *Weaknesses*

    The manuscript is substantially weakened by a lack of clarity on several important conceptual points. First, the authors appear to assume that any change that occurs at month-long timescales must be "aging". The authors choose to discard the first day of measurements in a cage to account for behavioral adaptation, demonstrating their concern for distinguishing behavioral adaptations from aging phenomena. However, the authors' efforts to do this seem rather cursory, as mice surely learn and adapt over timescales longer than twenty-four hours. The reader is left wondering to what extent this study measures the phenotypic consequences of aging, and to what extent it measures the long-term adaptation of individuals to a four-week rotation schedule in and out of different cages.

    As a second conceptual issue, the authors adopt a rather shallow and limited practical definition of the term "resilience". Conceptually, they define resilience as "the ability of a system to maintain function in the face of change", which seems reasonable and corresponds with the general thinking about resilience. However, in practice, the authors define resilience as the inverse of correlation among traits: an animal is more "resilient" when its different phenotypic traits are less correlated. This practical definition lends itself well to measurement using the data in this study, but leads to an incongruity between the conceptual and practical definitions of "resilience". Correlation of traits is not uniquely determined by an organism's resilience; there could be any number of reasons for traits to increase in covariance beyond a failure of resilience. Any change in the physiologic relationship between two traits will alter the causal structure of the traits' interactions and therefore alter the traits' covariance. Are the authors arguing that any change in physiology must inherently involve changes in resilience? A more convincing practical definition of resilience would involve a more direct test of the conceptual definition, given by the authors as "the ability of a system to maintain function in the face of change". For example, the authors might have provided some sort of physiologic challenge and measured animals' response to it: a physical stress test, a test of thermoregulation in response to changes in temperature, or the speed of adaptation to a novel environment. Given the data collected, the authors can measure many interesting aspects of aging, but they do not seem adequately justified in calling one of these aspects "resilience".

    The manuscript also raises technical concerns. First, it is unclear whether all analyses in the manuscript are performed using features normalized to body mass or whether only analyses in certain sections of the manuscript are performed using features normalized to body mass. The details here are crucial because improper normalization would undermine the main conclusions of the manuscript. Normalization of multiple features to any shared reference has the potential to introduce a correlation between normalized features and the shared normalization factor. In fact, many approaches for normalization to body mass will always introduce a correlation between normalized features and body mass, with the only exception being if the un-normalized features and body mass are perfectly correlated. If the authors normalize traits before performing their various correlation analyses, such normalization could introduce artefactual correlations between traits. Any normalized quantity will correlate with body mass and all traits correlated with body mass will in consequence correlate with each other. In summary, the authors must explain their normalization procedure in more detail to identify or exclude any improper normalization that could confound their analyses. Analyses at risk of being confounded include the heritability analysis, the network analysis of phenotypes during aging, and the CASPAR analyses.
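    The concern can be illustrated with a small simulation, under the assumption that normalization means dividing each trait by body mass (the manuscript's actual procedure is not specified here). Two traits are drawn independently of each other and of body mass; all names and distribution parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Three mutually independent quantities (hypothetical units and scales).
body_mass = rng.normal(30.0, 5.0, size=n)
trait_a = rng.normal(10.0, 1.0, size=n)
trait_b = rng.normal(20.0, 2.0, size=n)

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Before normalization the traits are essentially uncorrelated.
raw_r = pearson(trait_a, trait_b)

# Ratio normalization to the shared reference (body mass).
norm_a = trait_a / body_mass
norm_b = trait_b / body_mass

# After normalization the traits correlate with each other and with body mass,
# purely because they share the same denominator.
norm_r = pearson(norm_a, norm_b)
mass_r = pearson(norm_a, body_mass)
```

    Here the correlation between the normalized traits arises entirely from the shared denominator, which is the artefact that a detailed description of the normalization procedure would allow readers to rule out.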

    In the methods section, the "CASPAR" model is described clearly. However, the intuitive description provided in the main text invokes the concept of an "unavoidable tension" between chronological age and inter-individual heterogeneity in the aging rate. The reviewer finds this latter description unhelpful and potentially misleading. The sigma parameter can in some sense be considered a hyperparameter, because tuning it alters the model's behavior and performance. However, the sigma parameter is, more importantly, a potentially measurable property of the system being studied. Individuals within the population exhibited some amount of variability in their individual aging rates, which if measured would determine the value of an empirically grounded sigma parameter. Unfortunately, the authors are currently unable to estimate this sigma empirically and so they can only speculate about its true value. The authors are correct that different assumptions regarding variability in individual aging rates will produce different model behavior and differential performance in predicting chronological age and aging-rate heterogeneity. However, the authors err in implying that any "tension" exists in some grander, theoretic sense. Put simply, the authors cannot currently measure an important parameter of their model. Readers would benefit from a clearer description of this parameter and the challenges in statistical inference it highlights.
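    A toy simulation can show why the assumed variability in aging rates matters for age prediction. This is not the CASPAR model: it assumes, purely for illustration, that biological age equals chronological age multiplied by an individual rate drawn from a log-normal distribution with spread `sigma`, and that a predictor recovers biological age perfectly.

```python
import numpy as np

def age_correlation(sigma, n=5000, seed=0):
    """Correlation between chronological age and a perfectly known biological age."""
    rng = np.random.default_rng(seed)
    chronological = rng.uniform(3.0, 36.0, size=n)   # months, hypothetical range
    rate = np.exp(rng.normal(0.0, sigma, size=n))    # individual aging rate
    biological = rate * chronological
    return float(np.corrcoef(chronological, biological)[0, 1])

# Larger assumed heterogeneity lowers the achievable correlation with
# chronological age, even for a predictor with no measurement error.
corr_by_sigma = {sigma: age_correlation(sigma) for sigma in (0.0, 0.1, 0.5)}
```

    Under these assumptions, how well a biological-age predictor can track chronological age depends on the true spread of aging rates in the population, which is an empirical quantity.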

    Though impressive, this study's data has two limitations that the authors already acknowledge: 1) an absence of lifespan data for all animals and 2) a limited population size. Despite such limitations, the current data represents an impressive effort that will likely support many additional analyses.

    Finally, the authors seem to neglect substantial prior experimental characterizations of phenotypic aging and methodological work on multi-dimensional phenotyping of aging. For example, a similar characterization has already been performed in nematodes (CN Martineau et al., PLoS Computational Biology 2020), and related analytic methods that show similar performance have already been developed (Zhang et al., Cell Systems 2016). If the authors wish to draw conclusions that generalize beyond their particular mouse model, they cannot focus myopically on only mouse experiments.

    In summary, the manuscript describes a solid and commendable effort that has produced a valuable data set. However, in contextualizing and analyzing this data, the authors fall noticeably short of their self-proclaimed "sophistication and rigor".