Risk factors relate to the variability of health outcomes as well as the mean: A GAMLSS tutorial
Curation statements for this article:
Curated by eLife
Evaluation Summary:
Using data from the 1970 British Birth Cohort study, the authors demonstrate the utility of Generalized Additive Models for Location, Scale and Shape (GAMLSS) for investigating the association of three risk factors (sex, socioeconomic circumstances, and physical inactivity) with body mass index and mental wellbeing. This work provides empirical evidence for why we should consider how risk factors influence the variability, and not just the mean, of outcomes. From the perspective of developing personalized medicine, the first step is to know whether responses to an intervention are heterogeneous. If such heterogeneity is identified, the next step is to identify the factors associated with it (or the people who will benefit from the intervention). This study therefore contributes to the first step by investigating the possibility of response heterogeneity.
(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #3 agreed to share their name with the authors.)
Abstract
Risk factors or interventions may affect the variability as well as the mean of health outcomes. Understanding this can aid aetiological understanding and public health translation, in that interventions which shift the outcome mean and reduce variability are typically preferable to those which affect only the mean. However, most commonly used statistical tools do not test for differences in variability. Tools that do have few epidemiological applications to date, and fewer applications still have attempted to explain their resulting findings. We thus provide a tutorial for investigating this using GAMLSS (Generalised Additive Models for Location, Scale and Shape).
Methods:
The 1970 British birth cohort study was used, with body mass index (BMI; N = 6007) and mental wellbeing (Warwick-Edinburgh Mental Wellbeing Scale; N = 7104) measured in midlife (42–46 years) as outcomes. We used GAMLSS to investigate how multiple risk factors (sex, childhood social class, and midlife physical inactivity) related to differences in health outcome mean and variability.
Results:
Risk factors were related to sizable differences in outcome variability—for example males had marginally higher mean BMI yet 28% lower variability; lower social class and physical inactivity were each associated with higher mean and higher variability (6.1% and 13.5% higher variability, respectively). For mental wellbeing, gender was not associated with the mean while males had lower variability (–3.9%); lower social class and physical inactivity were each associated with lower mean yet higher variability (7.2% and 10.9% higher variability, respectively).
Conclusions:
The results highlight how GAMLSS can be used to investigate how risk factors or interventions may influence the variability in health outcomes. This underutilised approach to the analysis of continuously distributed outcomes may have broader utility in epidemiologic, medical, and psychological sciences. A tutorial and replication syntax are provided online to facilitate this ( https://osf.io/5tvz6/ ).
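The percentage differences in variability quoted in the Results can be related to regression coefficients with a short calculation. The sketch below assumes that the scale parameter is modelled on the log scale, so that a coefficient beta corresponds to a 100 × (exp(beta) − 1) percent difference in variability. That link function is an assumption here, not something stated in the abstract, and the function names are hypothetical. The coefficients printed are back-calculated from the reported percentages purely for illustration; they are not estimates taken from the paper.

```python
import math


def pct_diff_from_log_coef(beta: float) -> float:
    """Percent difference in the scale parameter implied by a log-scale coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)


def log_coef_from_pct_diff(pct: float) -> float:
    """Inverse mapping: the log-scale coefficient implied by a percent difference."""
    return math.log(1.0 + pct / 100.0)


# Percent differences in BMI variability as reported in the abstract.
reported = [
    ("male sex", -28.0),
    ("lower social class", 6.1),
    ("physical inactivity", 13.5),
]

for label, pct in reported:
    beta = log_coef_from_pct_diff(pct)  # back-calculated, illustrative only
    print(f"{label}: {pct:+.1f}% variability <-> log-scale coefficient {beta:+.3f}")
```

Under this assumption, effects on variability are multiplicative: a negative coefficient means a narrower outcome distribution in that group, and a positive one a wider distribution.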
Funding:
DB is supported by the Economic and Social Research Council (grant number ES/M001660/1), The Academy of Medical Sciences / Wellcome Trust (“Springboard Health of the Public in 2040” award: HOP001/1025); DB and LW are supported by the Medical Research Council (MR/V002147/1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Article activity feed


Author Response:
Reviewer #1:
Regression models are a widespread statistical technique used in epidemiological studies. Most commonly used regression models do not explicitly parameterize the relationship between the independent variables and the variance or skewness of the dependent variable. Generalized Additive Models for Location, Scale and Shape (GAMLSS) is a regression technique that provides the flexibility to estimate parameters of the dependent variable distribution (mean, median, variance, etc.) as a function of independent variables. This manuscript uses data from the 1970 British birth cohort study to showcase the use of GAMLSS in epidemiological studies and further compares the results to quantile regressions.
The primary concern with this manuscript is its overall goal. In its current form, it is hard to assess whether the manuscript is meant to be a tutorial on how to fit GAMLSS and interpret its output in an epidemiological context, or a research report investigating the association of three risk factors (sex, social class, physical activity) with two outcomes (BMI and mental wellbeing).
We have edited the manuscript to form a tutorial. We have also provided additional detail to rationalise the risk factors and outcomes used. That these associations are of substantive interest helps to motivate the use of GAMLSS.
The modelling choices in the manuscript are suitable only if it is intended as a tutorial. For example, the stated rationale for the choice of outcomes (BMI and mental wellbeing) is that they are often measured on a continuous scale. Similarly, the authors interpret only the unadjusted estimates because they were similar to those from an adjusted model. Although these are acceptable choices for a tutorial, if the manuscript's goal were to estimate the true association between these variables, it would have several shortcomings: i) the disadvantages of dichotomizing a continuous independent variable are well known (1); ii) it is recommended to choose potential confounders based on a Directed Acyclic Graph (DAG) to ensure the estimates are unbiased (2); iii) a clear rationale for estimating this effect, and what is already known in the literature about the association, should be given in the introduction.
Yet the interpretations provided in the results section and parts of the discussion imply that the estimates are to be taken as estimates of a true association. For example: i) estimates for the variable sex are contrasted with those for social class (Page 7, Line 204); ii) an argument compares the results with previous studies on BMI and invokes the use of a nationally representative sample (Page 9, Lines 252 to 258); iii) the use of GAMLSS and British Birth Cohort data is reported as a strength of the manuscript (Page 10, Line 309); iv) limitations on making causal claims from the estimates, and other data complexities, are discussed (Page 11, Lines 319 to 334).
We agree that understanding causality is challenging. Binary risk factors were used to aid interpretation of the potentially complex GAMLSS results; findings do not differ when the categorical form is used (see appendix tables). We now address the potential for confounding and reverse causality in the discussion.
“The study also has limitations. As in all observational studies, causal inference is challenging despite the use of longitudinal data. Associations of social class at birth with outcomes, for example, could be explained by unmeasured confounding, which may include factors such as parental mental health. This is challenging to falsify empirically owing to a lack of such data collected before birth. In contrast, sex is randomly assigned at conception, and thus its associations with outcomes are unlikely to be confounded. However, sex differences in reporting may bias associations with mental wellbeing. Physical activity and mental wellbeing were ascertained at broadly the same age, so that associations between the two could be explained by reverse causality; existing evidence appears to suggest bidirectionality of links between physical activity and both outcomes (32, 51). Finally, attrition led to lower power to precisely estimate smaller effect sizes (e.g., gender differences in mental wellbeing) or confirm null effects. Such attrition could potentially bias associations: those in worse health and adverse socioeconomic circumstances are disproportionately lost to follow-up (52, 53). The focus of principled approaches to handle missing data in epidemiology has been on the main parameter of interest (typically beta coefficients in linear regression models), and further empirical work is required to investigate the potential implications of (non-random) missingness for the variability and other moments of the outcome distribution.”

Reviewer #1 (Public Review):
Regression models are a widespread statistical technique used in epidemiological studies. Most commonly used regression models do not explicitly parameterize the relationship between the independent variables and the variance or skewness of the dependent variable. Generalized Additive Models for Location, Scale and Shape (GAMLSS) is a regression technique that provides the flexibility to estimate parameters of the dependent variable distribution (mean, median, variance, etc.) as a function of independent variables. This manuscript uses data from the 1970 British birth cohort study to showcase the use of GAMLSS in epidemiological studies and further compares the results to quantile regressions.
The primary concern with this manuscript is its overall goal. In its current form, it is hard to assess whether the manuscript is meant to be a tutorial on how to fit GAMLSS and interpret its output in an epidemiological context, or a research report investigating the association of three risk factors (sex, social class, physical activity) with two outcomes (BMI and mental wellbeing).
The modelling choices in the manuscript are suitable only if it is intended as a tutorial. For example, the stated rationale for the choice of outcomes (BMI and mental wellbeing) is that they are often measured on a continuous scale. Similarly, the authors interpret only the unadjusted estimates because they were similar to those from an adjusted model. Although these are acceptable choices for a tutorial, if the manuscript's goal were to estimate the true association between these variables, it would have several shortcomings: i) the disadvantages of dichotomizing a continuous independent variable are well known (1); ii) it is recommended to choose potential confounders based on a Directed Acyclic Graph (DAG) to ensure the estimates are unbiased (2); iii) a clear rationale for estimating this effect, and what is already known in the literature about the association, should be given in the introduction.
Yet the interpretations provided in the results section and parts of the discussion imply that the estimates are to be taken as estimates of a true association. For example: i) estimates for the variable sex are contrasted with those for social class (Page 7, Line 204); ii) an argument compares the results with previous studies on BMI and invokes the use of a nationally representative sample (Page 9, Lines 252 to 258); iii) the use of GAMLSS and British Birth Cohort data is reported as a strength of the manuscript (Page 10, Line 309); iv) limitations on making causal claims from the estimates, and other data complexities, are discussed (Page 11, Lines 319 to 334).
The data generating process in epidemiological studies, especially in observational designs, is complex and needs to be taken into consideration when conducting statistical analysis. Statistical models are often oversimplified mathematical representations of this real-world data generating process. Often in practice, this simplification (e.g. a mean-only model) and strong assumptions (e.g. homoscedasticity) are chosen to aid in estimating quantities that are easy to interpret and that have high clinical or public health utility. The popularity of logistic regression in epidemiology, compared to other fields, is a clear example of this practice. Along the same lines, complex models should not be adopted at the expense of the interpretability or utility of the model outputs. Users of such complex models should clearly explain the interpretation of the parameter estimates and describe their clinical and/or public health utility, to avoid misinterpretation of the outputs by potential future users of the model and by policymakers. For example, GAMLSS provides considerable flexibility in modelling choices compared to more standard techniques (e.g. GLMs and GAMs). Tutorials that clearly explain how to fit GAMLSS and interpret its output, along with its utility in an epidemiological context, are needed. There are some shortcomings to the current manuscript if assessed as a tutorial: i) it would be pertinent to provide a comparison to linear regression, which models only the mean of the outcome, and to elaborate how and why the more complex model helps interpret the observed relationship or lack of it; ii) there is no clear lay interpretation of the effect measures on the SD, coefficient of variation, and skewness; iii) guidelines are needed for choosing the outcome distribution type (e.g., Normal distribution vs the Box-Cox Cole and Green family) in an epidemiological context.
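The contrast drawn above between a mean-only linear regression and a model that also describes variability can be sketched with simulated data. The example below is not the GAMLSS software or the authors' analysis. It is a minimal stand-in that assumes a normal outcome (rather than a Box-Cox Cole and Green distribution), a single hypothetical binary risk factor, invented parameter values, and a log link for the standard deviation, fitted by maximum likelihood with SciPy.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data: all values below are invented for illustration.
rng = np.random.default_rng(0)
n = 5000
x = rng.integers(0, 2, size=n).astype(float)    # hypothetical binary risk factor
true_mu = 27.0 + 0.5 * x                        # exposure shifts the mean
true_sigma = np.exp(np.log(5.0) + 0.15 * x)     # exposure also widens the spread
y = rng.normal(true_mu, true_sigma)

X = np.column_stack([np.ones(n), x])            # intercept + risk factor

# Mean-only model: ordinary least squares, which says nothing about variability.
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def neg_log_lik(params: np.ndarray) -> float:
    """Average negative log-likelihood of a normal location-scale model.

    mean = X @ b, log(sd) = X @ g; the additive constant is dropped.
    """
    b, g = params[:2], params[2:]
    log_sigma = X @ g
    z = (y - X @ b) / np.exp(log_sigma)
    return float(np.mean(log_sigma + 0.5 * z**2))


start = np.array([y.mean(), 0.0, np.log(y.std()), 0.0])
fit = minimize(neg_log_lik, start, method="BFGS")
b_hat, g_hat = fit.x[:2], fit.x[2:]

# Lay interpretation of the scale coefficient: percent difference in SD.
pct_sd = 100.0 * (np.exp(g_hat[1]) - 1.0)

print(f"OLS mean difference:            {ols_coef[1]:.2f}")
print(f"Location-scale mean difference: {b_hat[1]:.2f}")
print(f"Difference in SD (exposed vs unexposed): {pct_sd:+.1f}%")
```

In this sketch both models recover essentially the same mean difference, but only the location-scale model reports that the exposed group is also more variable. That additional quantity is what a mean-only regression cannot show.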

Reviewer #2 (Public Review):
The authors demonstrated the utility of GAMLSS for investigating the association of risk factors with both the variability and the central tendency of outcomes. They used BMI and mental wellbeing as outcomes and three risk factors (sex, socioeconomic circumstances, and physical inactivity). The strength of this study is that it successfully demonstrates the utility of the approach using a large empirical data set. The limitation of the study is that the data are from an observational study, so causal inference was not feasible.
Most clinical studies have focused on differences in the mean rather than in other characteristics of distributions, such as the variance. However, recent studies have demonstrated that intervention effects are heterogeneous (some people benefit from the intervention, but others do not). From the perspective of developing personalized medicine, the first step is to know whether responses to an intervention are heterogeneous. If such heterogeneity is identified, the next step is to identify the factors associated with it (or the people who will benefit from the intervention). This study therefore contributes to the first step by investigating the possibility of response heterogeneity.

Reviewer #3 (Public Review):
The manuscript by Bann and Cole assesses how risk factors such as sex, socioeconomic status, and physical inactivity affect the variability and skewness of outcomes such as BMI and mental wellbeing. GAMLSS was used to determine how these risk factors influence both the mean and the variability of the outcomes. Additionally, the authors investigated how these risk factors influence the quantile function of the outcomes.
The authors provide an important contribution to epidemiologic studies by examining how risk factors influence the variability and skewness of outcomes. Their discussions provide topics for other researchers to consider when examining intervention effects. While most current approaches focus on how interventions influence the mean function, the authors provide adequate discussion of the importance of also considering how interventions influence the variance. In addition to providing empirical evidence for why more authors should consider how risk factors influence the variability and not just the mean of outcomes, the authors provide strong justification for such considerations through the use of important large cohort data.
