Reporting and misreporting of sex differences in the biological sciences

Curation statements for this article:
  • Curated by eLife



Abstract

As part of an initiative to improve rigor and reproducibility in biomedical research, the U.S. National Institutes of Health now requires the consideration of sex as a biological variable in preclinical studies. This new policy has been interpreted by some as a call to compare males and females with each other. Researchers testing for sex differences may not be trained to do so, however, increasing risk for misinterpretation of results. Using a list of recently published articles curated by Woitowich et al. (eLife, 2020; 9:e56344), we examined reports of sex differences and non-differences across nine biological disciplines. Sex differences were claimed in the majority of the 147 articles we analyzed; however, statistical evidence supporting those differences was often missing. For example, when a sex-specific effect of a manipulation was claimed, authors usually had not tested statistically whether females and males responded differently. Thus, sex-specific effects may be over-reported. In contrast, we also encountered practices that could mask sex differences, such as pooling the sexes without first testing for a difference. Our findings support the need for continuing efforts to train researchers how to test for and report sex differences in order to promote rigor and reproducibility in biomedical research.
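
The statistical point at issue here is that a claim of a sex-specific effect should rest on a test of the sex-by-treatment interaction, not on separate within-sex tests. A minimal sketch of such a test, using simulated (hypothetical) data and an ordinary least squares model in Python, might look like the following:

```python
# Minimal sketch with simulated data: the sex-by-treatment interaction,
# not a pair of within-sex tests, is the evidence that a treatment
# affects females and males differently.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 40  # hypothetical subjects per sex-by-treatment cell
df = pd.DataFrame({
    "sex": np.repeat(["F", "M"], 2 * n),
    "treatment": np.tile(np.repeat(["control", "treated"], n), 2),
})
# Simulate an outcome in which the treatment effect is larger in males.
effect = {"F": 0.3, "M": 0.8}
df["outcome"] = [
    rng.normal(effect[s] if t == "treated" else 0.0, 1.0)
    for s, t in zip(df["sex"], df["treatment"])
]

# The interaction coefficient tests whether the treatment effect
# differs between females and males.
model = smf.ols("outcome ~ C(sex) * C(treatment)", data=df).fit()
print(model.summary().tables[1])
```

A treatment effect that reaches significance in one sex but not the other does not, by itself, establish a sex difference; the interaction term above carries that comparison.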

Article activity feed

  1. Author Response:

    We thank the three reviewers for their feedback and insightful comments. We share Reviewer 2’s opinion that NIH’s policy on “sex as a biological variable” leaves largely open how that variable should be treated statistically, and this concern was in fact the main impetus for this study.

    In response to the concern of R1 and R3 that all of the articles were coded by just one author, we have expanded the description of how coding was done. All articles were read by both authors, and ~25% of the articles were discussed between them during the coding process. Our coding system was checked by having both authors independently read a subset (~20%) of the articles; interrater reliability exceeded 90%.
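
    As an illustration of how interrater reliability between two coders can be quantified, here is a minimal sketch using hypothetical category codes; the specific metric behind the reported "exceeded 90%" is not stated, so simple percent agreement and Cohen's kappa are both shown as assumptions.

    ```python
    # Minimal sketch with hypothetical codes from two coders; neither the
    # categories nor the metric are taken from the article.
    from sklearn.metrics import cohen_kappa_score

    coder_a = ["interaction", "pooled", "none", "interaction", "pooled", "none"]
    coder_b = ["interaction", "pooled", "none", "pooled", "pooled", "none"]

    # Simple percent agreement and chance-corrected agreement (Cohen's kappa).
    agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
    kappa = cohen_kappa_score(coder_a, coder_b)
    print(f"percent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
    ```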

    Regarding the fact that most, if not all, of the articles we analyzed contained multiple studies, we used a hierarchical coding method (described in the paper). Our goal was to illuminate cases in which the statistical methods potentially led to unsupported conclusions; therefore, an article that arrived at sound conclusions for one study but questionable conclusions for another was coded into the “questionable” category.

    We apologize that Reviewer 2 could not find the information in our paper on the percentage of articles that “did it right”. We have revised the text to make this information clearer.

    Regarding Reviewer 3’s concern that our method of assigning the articles into disciplines was not clear, we have now emphasized this information more in the paper. We did not assign the articles to disciplines ourselves; the assignment was done originally by Beery & Zucker (2011) on the basis of the journals in which the articles were published, and Woitowich et al. (2020) used the same categorizations, which we followed.

  2. Evaluation Summary:

    This manuscript presents a descriptive audit of the statistical treatment, reporting, and interpretation of the effects of sex as a biological variable (SABV) on the studied outcomes in articles published across nine scholarly disciplines. The manuscript highlights and provides data on the prevalence of several inconsistencies and inaccuracies in the literature regarding the treatment of SABV as an important moderator of the effects of an intervention on a considered outcome, and how such inconsistencies could lead to biased conclusions regarding the effects of SABV. As such, the manuscript may inform not only funding agencies and grant reviewers, but also researchers in most scientific disciplines, regarding the importance of adhering to rigorous methodological standards when examining the effects of SABV.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their names with the authors.)

  3. Reviewer #1 (Public Review):

    Garcia-Sifuentes and Maney provide a descriptive analysis, using a set of articles previously screened by Woitowich et al. (2020), of how sex differences are reported and evaluated in biological articles. Beyond the inclusion of both males and females in research, it is crucial that authors employ appropriate design and analysis strategies to be able to identify sex-specific biological effects and realize the potential of studying sex as a biological variable. On the one hand, Garcia-Sifuentes and Maney reassuringly show that researchers are often including both sexes in their studies. On the other hand, concerningly, appropriate statistical tests are often not used to correctly test for sex-specific differences. In addition, their results suggest that authors often pool data from both sexes without appropriate statistical tests, or even do so in the presence of evidence for differences.
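
    To make concrete what pooling the sexes "without appropriate statistical tests" refers to, a minimal sketch with simulated (hypothetical) data is shown below; the test and threshold are illustrative rather than the procedure used in the audited articles, and a non-significant result is at best weak justification for pooling.

    ```python
    # Minimal sketch with simulated data: check for a sex difference before
    # deciding whether to pool females and males.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    females = rng.normal(10.0, 2.0, size=30)  # hypothetical measurements
    males = rng.normal(10.5, 2.0, size=30)

    t_stat, p_value = stats.ttest_ind(females, males)
    if p_value < 0.05:
        print(f"possible sex difference (p = {p_value:.3f}); model sex rather than pool")
    else:
        pooled = np.concatenate([females, males])
        print(f"no detected sex difference (p = {p_value:.3f}); pooled n = {pooled.size}")
    ```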

    The article is well written and organized around four questions: 1) whether sex differences were reported by authors of the included studies, 2) whether the studies used a factorial design and whether an appropriate statistical comparison was used to conclude a sex-specific effect, 3) whether data from males and females were pooled, and 4) the terminology used (gender vs. sex) to describe non-human animals. Though the analysis is not systematic in nature, it uses a convenience sample from Woitowich et al. (2020), who surveyed 34 journals across nine disciplines. The subset included in this work is limited within some of these journals and disciplines, as the authors note, and as such future surveys may be useful to quantify within-discipline practices more rigorously.

    I note only one potential weakness:

    Per line 374, it is noted that there was a single coder for all extractions. Given the complexity of some classifications, this may be of some concern. As the authors noted in places, the language used to describe statistical tests, results, and interpretations can vary considerably, and having a second set of eyes review each paper would have reduced the potential for systematic misclassification. It is also noted that "A subset of the articles was independently coded by YGS and any discrepancies discussed between the authors until agreement was reached." More information about this process would be helpful.

    As a point of interpretation, the authors selected the article as the unit of analysis, which is reasonable. Some articles, however, may contain many experiments, and the results might therefore differ (though probably not qualitatively) if considered at the level of the experiment. Within articles, is there reason to think there is much variability on the questions addressed herein?

  4. Reviewer #2 (Public Review):

    The authors performed an additional analysis of articles collected for another study on the use of sex as a biological variable. In the present study, they examined the subset of articles that did include both male and female subjects to see how sex differences were reported and tested.

    One criticism I have of the phrase "sex as a biological variable" is that it does not immediately fit into a traditional statistical framework. Does it mean that sex should be treated as a possible confounder or mediator in the scope of causal inference? One or both of those? This gets to the crux of what a "sex difference" is: is it a between-groups difference in an independent variable of interest (e.g., treatment), the outcome variable, or a moderating effect on the relationship between a treatment and outcome? Should sex be controlled for, stratified by, or used in an interaction term? These distinctions are important, as the different questions the authors pose of the sample of articles correspond to different kinds of sex differences. It is a strength of this study that all of these different types of sex differences are considered, though the ways in which these fit into different parts of an analysis plan are not thoroughly discussed.
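
    The distinctions raised here correspond to different model specifications. A minimal sketch with simulated (hypothetical) data and illustrative formulas, not drawn from the article under review:

    ```python
    # Minimal sketch: three ways sex can enter an analysis.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    df = pd.DataFrame({
        "sex": rng.choice(["F", "M"], size=120),
        "treatment": rng.choice(["control", "treated"], size=120),
    })
    df["outcome"] = rng.normal(size=120) + 0.5 * (df["treatment"] == "treated")

    # 1) Sex "controlled for": a covariate that adjusts the treatment
    #    estimate without asking whether the effect differs by sex.
    adjusted = smf.ols("outcome ~ C(treatment) + C(sex)", data=df).fit()

    # 2) Stratified: separate models per sex; the two treatment estimates
    #    are not statistically compared with each other.
    by_sex = {sex: smf.ols("outcome ~ C(treatment)", data=sub).fit()
              for sex, sub in df.groupby("sex")}

    # 3) Moderator: the sex-by-treatment interaction directly tests whether
    #    the treatment effect differs between the sexes.
    moderated = smf.ols("outcome ~ C(treatment) * C(sex)", data=df).fit()
    print(moderated.params)
    ```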

    The authors were very thorough in their coding of the articles for how different aspects of the reported analyses were presented. For example, there were nine possible results for how sex as a moderating variable was addressed. The river plots were very useful for mapping out some of those differences. However, this thoroughness means that they sometimes missed the forest for the trees, as they did not report overall percentages of articles that "did it right" (e.g., the original authors appropriately reported on and correctly interpreted the results of an interaction). There was a focus on whether the results in the articles were positive or negative and on which fields the articles came from, but the small sample sizes for these comparisons made it difficult at times to see the big picture. The discussion of errors in analysis and reporting by the original authors was very thorough, though.

    The broader picture the authors paint with their findings is one in which the analysis and reporting of sex differences appear to have substantial room for improvement in the surveyed fields. I agree with this study's conclusion that more statistical training is needed for scientists in both clinical and basic science research. Unfortunately, this is not a new problem, as evidenced by the wealth of prior literature the authors cite on that issue.

  5. Reviewer #3 (Public Review):

    Garcia-Sifuentes and Maney have conducted a comprehensive descriptive audit of the statistical treatment, reporting, and interpretation of the effects of SABV on the studied outcomes in articles published across nine scholarly disciplines. In this manuscript, the authors present the proportion of studies examining and reporting sex differences and the proportion of studies that correctly treated, reported, and interpreted SABV (i.e., modeling the main effects and the interaction of the examined treatment(s) and sex), as well as the less-than-ideal or inaccurate methods employed by researchers to examine the effects of sex. The authors also present the prevalence of inconsistencies in reporting (e.g., reporting a sex difference when a difference was not tested for or does not exist). Furthermore, the authors report data on the use of the term "gender" in place of "sex". Overall, the manuscript provides a valuable, data-driven summary of the statistical treatment of SABV in recently published articles across the nine reviewed disciplines, while providing data suggesting that the practices followed by a majority of researchers to examine the effects of SABV are less than ideal, if not inaccurate.

    The manuscript is generally well written and most conclusions are supported by the data. The use of river plots to summarize the data is commendable. However, a few minor limitations of the manuscript are noteworthy.

    1. The manuscript presents overall summary statistics of the responses to each question the authors asked about the reviewed articles, as well as discipline-wise summaries. However, it is not very clear how the authors classified the articles into these disciplines (especially given that some articles may have covered the scope of more than one discipline). This limits the ability to make inferences about the treatment of SABV in the considered disciplines.

    2. The authors also acknowledge that the coding was based on their interpretation of the data presentation and wording. As such, there is a possibility of bias due to the subjectivity of at least some of the decisions that had to be made. It would have been ideal if all articles had been coded by at least two independent coders, with a third coder adjudicating decisions on which there was disagreement. However, only one author (author 2) made all the coding decisions, and the validity of these decisions has not been independently checked. As such, the validity of the reported proportions and percentages remains questionable.