Use of the p-value as a size-dependent function to address practical differences when analyzing large datasets

Abstract

Biomedical research has come to rely on p-values as a deterministic measure for data-driven decision making. In the widely used null-hypothesis significance testing (NHST) framework for identifying statistically significant differences among groups of observations, a single p-value computed from sample data is routinely compared with a threshold, commonly set to 0.05, to assess the evidence against the null hypothesis of no difference among the groups. Because the estimated p-value tends to decrease as the sample size increases, applying this methodology to large datasets tends to result in rejection of the null hypothesis, making it not directly applicable in this setting. Herein, we propose a systematic and easy-to-follow method to detect differences based on the dependence of the p-value on the sample size. The proposed method introduces new descriptive parameters that overcome the effect of sample size on the interpretation of the p-value in the framework of large datasets, reducing the uncertainty in the decision about the existence of biological/clinical differences between the compared experiments. This methodology enables both graphical and quantitative characterization of the differences between the compared experiments, guiding researchers in the decision process. An in-depth study of the proposed methodology is carried out using both simulated and experimentally obtained data. Simulations show that, under controlled data, our assumptions about the dependence of the p-value on the sample size hold. The results of our analysis of the experimental datasets reflect the broad scope of this approach and its interpretability in terms of common decision-making and data characterization tasks. For both simulated and real data, the results are robust to sampling variations within the dataset.

Article activity feed

  1. ###Reviewer #2:

    The authors describe the dependence of the p-value on sample size (which is true by definition) and offer a solution, using simulated data and an applied example.

    I'm not sure that the introduction successfully motivates the paper. It is unclear whether this is due to the authors misunderstanding some key points, or rather a matter of awkward communication, such that the authors' intentions are not accurately conveyed.

    The authors note the link between the p-value and sample size. In particular, the authors suggest that statistical significance can be achieved by using a sufficiently large sample size, and they call this 'p-hacking'. I certainly don't recognise use of a large sample size as an example of p-hacking. Instead, this term refers to analytical behaviours which cause the p-value to lose its advertised properties (advertised type 1 error rate). Examples would include taking repeated looks at data without making any appropriate adjustment, trying tests on different groupings of data (and selecting results on the basis of significance), or trying different definitions of an outcome measure. The key point is that, when these actions are performed, reported p-values are no longer valid p-values - they do not behave as they are supposed to. So straight away the authors' argument becomes confusing. Are they criticising the behaviour of the valid p-value? Or are they trying to criticise behaviours that cause the p-value to lose its stated properties? This point remains very unclear. I believe the authors are attempting the former, but wrongly describe this as an example of p-hacking.

    But other statements in the introduction invite further confusion. The authors say "even when comparing the mean value of two groups with identical distribution, statistically significant differences among the groups can always be found as long as a sufficiently large number of observations is available using any of the conventional statistical tests (i.e., Mann Whitney U-test (Mann and Whitney, 1947), Rank Sum test (Wilcoxon, 1945), Student's t-test (Student, 1908)) (Bruns and Ioannidis, 2016)." Again, it is unclear what the authors are trying to say here, and the statement is clearly false under the most obvious interpretation. If the authors are saying that significance will always be found when the null is true and model assumptions are correct provided that the sample size is large, then this is clearly false. In this case, the test will reject the null 5% of the time, using a significance threshold of 5%. The authors can easily confirm this for themselves with a simple simulation (a sketch along these lines is included at the end of this review). Are the authors trying to make the point that the error rate is conditional not only on the null, but also on the test assumptions (and so, when they are violated, the test may reject erroneously)? They certainly do not state this, and the fact that they refer to 'identical distribution' suggests otherwise. Another way the test assumptions could be violated is if actual p-hacking (see examples above) were present, such that the reported p-values were no longer valid. Again, the authors do not tell us that this is what they mean, if they in fact do, and this would be a criticism of p-hacking behaviours rather than of the p-value.

    When they write "big data can make insignificance seemingly significant by means of the classical p-value" they might be thinking of confusion between statistical and practical significance, which is a common misinterpretation made in the presence of large data size, but again, if this is what the authors are thinking of they should say it. The discussion by Greenland (Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values, especially section 4.3) seems to address the concerns raised by the authors fairly decisively. For a given parameter size, increasing sample size should produce stronger evidence against the null. The p-value does not tell you about the size of the parameter directly - it measures the discrepancy between the data and the null - interpreted correctly, there is no problem.

    So, with apologies to the authors, I don't think they are successful in convincing the reader that there is a problem to be solved, and the manner of presentation (which may just be an issue of communicating the authors' intentions) is such that it causes doubt about the authors' handling of the relevant concepts. Throughout the text, there are other confusing presentations around fundamental concepts. E.g. the authors write things like "Hence, we claim that whenever there exist real statistically significant differences between two samples..." I know what a real difference is, but what is a real statistically significant difference? There are no statistically significant differences in nature. Are the authors trying to refer to instances where the null is false and is rejected? Or, are they trying to say that a 'real significant difference' is where the difference exceeds some magnitude?

    For example, the authors write things such as "When 𝑁(0,1) is compared with 𝑁(0,1), 𝑁(0.01,1) and 𝑁(0.1,1), θ is null; so those distributions are assumed to be equal. In the remaining comparisons though, θ = 1, thus there exist differences between 𝑁(0,1) and 𝑁(μ,1) for μ ∈ [0.25,3]", highlighting the fact that perhaps the authors really want to address the practical significance vs statistical significance issue (although again, this is not explicitly stated). If the authors are interested in size of effect/difference, then it is not clear that this proposal offers any advantage in that regard over the p-value (which, as noted, does not tell us about the size of a parameter). If interest is in size, then it is unclear why the authors do not direct the reader to consider the estimate and confidence interval, so that they may consider this explicitly in terms of magnitude and precision.

    With apologies to the authors, who have clearly spent a large amount of time on this - I would think that the best way forward here would be to post this as a preprint and to try to invite as much feedback as possible. The authors have lofty ambitions with this work. Maybe there is a good underlying idea here, obscured by the presentation? Unfortunately, it is difficult to assess this at present.
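
    For readers who want to check the points above about valid p-values and p-hacking for themselves, here is a minimal simulation sketch (my own illustration in Python with NumPy/SciPy, not code from the manuscript). It confirms that a single pre-specified test rejects a true null at roughly the nominal 5% rate even with a large sample, and that selecting the smallest p-value across several ad hoc analyses (a simple form of p-hacking) inflates that rate.

    ```python
    # Illustration only: behaviour of a *valid* p-value under a true null at large n,
    # versus the inflated error rate produced by a simple p-hacking strategy.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_sims, alpha = 10_000, 2_000, 0.05

    naive_rejections = 0
    hacked_rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n)
        y = rng.normal(0.0, 1.0, n)  # identical distributions: the null is true

        # Single pre-specified test: rejects ~5% of the time, regardless of n.
        naive_rejections += stats.ttest_ind(x, y).pvalue < alpha

        # "p-hacking": try several analyses (a different test, an ad hoc outlier
        # rule, an interim look) and keep the smallest p-value; the reported
        # value is then no longer a valid p-value.
        candidates = [
            stats.ttest_ind(x, y).pvalue,
            stats.mannwhitneyu(x, y).pvalue,
            stats.ttest_ind(x[x > -2], y[y > -2]).pvalue,      # drop "outliers"
            stats.ttest_ind(x[: n // 2], y[: n // 2]).pvalue,  # early look at half the data
        ]
        hacked_rejections += min(candidates) < alpha

    print(f"valid test rejection rate:   {naive_rejections / n_sims:.3f}")   # close to 0.05
    print(f"min-over-analyses rejection: {hacked_rejections / n_sims:.3f}")  # noticeably above 0.05
    ```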

  2. ###Reviewer #1:

    The paper sets out to confront p-hacking and to address the dependence of the p-value on the sample size. It lays out the motivation behind the problem and then proposes a solution using three examples.

    I have a major problem with this work in that I do not understand the motivation and hence cannot judge the value of the proposed solution.

    The authors need to set out some definitions, which might help them frame the context. I outline below what I understand the context to be and hence why I do not understand how their proposal will address the problem.

    Firstly, 'p-hacking' is the term usually reserved for when researchers do not follow a pre-specified protocol on how a research question will be answered through the statistical analysis of a resource, single study or experiment, but instead analyse the data in many ways. Maybe they use slightly different assumptions, adjust the definition of an outlier or of who is eligible for inclusion, or switch to a different outcome variable. In this manner they select and report the analysis that gives the smallest p-value (Ioannidis referred to some of this as vibration effects). This is a major problem in science, but it is not solely a problem of the size of the data available, although the bigger the dataset, the more subgroups can be analysed. The main problem here is that we do not know how many ways the data have been analysed; we only know what the researchers have selected to report. The manuscript does not address this problem at all.

    The p-value is defined as the probability of observing a result as or more extreme than the one obtained when the null hypothesis is true. In most settings the 'null' is that there are no differences between two or more groups, for example that all the means are equal. Often this translates into the statement that we expect the distribution of p-values under the null to be uniform on [0,1]. This can be demonstrated or checked by simulation (see the first sketch at the end of this review). In the hypothesis testing framework we usually power our studies so that we will be able to detect a (true) difference between two groups with some high probability. The specific difference we are interested in would be called the alternative hypothesis. Hence the p-value is used to reject the null, but under the alternative hypothesis the p-value will not be uniform on [0,1]. It is well known that the larger your sample size, the more precise the estimates you will obtain and the smaller the differences you will be able to detect. Sample size calculations require a specific alternative to be stated (e.g. a difference in means of 0.5 of a standard deviation); a sample size that guarantees a specific power at the specified type 1 error can then be calculated.

    This manuscript confuses the properties of the p-value when there are no differences between the two groups with its properties when there are minimal differences. I think the authors are trying to make the point that a statistically significant result is not necessarily a clinically or biologically meaningful result. They have done some simulations to show the distribution of the p-value when the true difference between the two means is 0.01. This is an example of an 'unimportant' difference, but it is not the null. This problem is best addressed by reporting effect sizes and 95% confidence intervals for the quantities of interest rather than by trying to adjust p-values in some way (see the second sketch at the end of this review). Obviously, when we have access to large datasets we may have a much larger sample than we needed to detect a meaningful effect, and so we may find small p-values even for unimportant differences. Adjusting the p-values will not really help, as it is the effect sizes that are of interest.

    I feel the manuscript needs to be redrafted to be clearer about the problem the authors are trying to fix.
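
    As a concrete check of the statements above about the null distribution of the p-value and about power calculations, here is a minimal sketch (my own illustration in Python, assuming scipy and statsmodels are available; it is not code from the manuscript).

    ```python
    # Illustration: (i) under a true null, p-values from a correct test are ~Uniform[0,1];
    # (ii) a sample size calculation fixes n for a stated alternative, alpha and power.
    import numpy as np
    from scipy import stats
    from statsmodels.stats.power import TTestIndPower

    rng = np.random.default_rng(1)

    # (i) p-values under the null (two groups drawn from the same distribution)
    p_null = np.array([
        stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
        for _ in range(5_000)
    ])
    print("fraction of p < 0.05 under the null:", (p_null < 0.05).mean())  # ~0.05
    print("deciles of p under the null:",
          np.round(np.quantile(p_null, np.linspace(0.1, 0.9, 9)), 2))      # ~0.1, 0.2, ..., 0.9

    # (ii) sample size for a stated alternative: a difference in means of 0.5 SD,
    #      two-sided alpha = 0.05, power = 0.8
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print("required n per group:", round(n_per_group))  # ~64
    ```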
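
    And to make the point about effect sizes and confidence intervals concrete, the sketch below (again my own Python illustration, not the authors' analysis) compares two groups whose true means differ by only 0.01 standard deviations: with a very large sample the p-value is tiny, but the estimate and its 95% confidence interval show immediately that the difference is negligible.

    ```python
    # Illustration: a huge sample turns a negligible true difference (0.01 SD) into a
    # tiny p-value, while the effect estimate and 95% CI reveal how small it really is.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 2_000_000
    x = rng.normal(0.00, 1.0, n)
    y = rng.normal(0.01, 1.0, n)

    p_value = stats.ttest_ind(x, y).pvalue
    diff = y.mean() - x.mean()
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / n)
    ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

    print(f"p-value: {p_value:.2e}")                     # 'statistically significant'
    print(f"estimated difference in means: {diff:.4f}")  # about 0.01
    print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")      # narrow, centred near 0.01
    ```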

  3. ##Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript.

    ###Summary:

    The authors describe the dependence of the p-value on sample size (which is true by definition) and offer a solution, using simulated data and an applied example. Unfortunately, both reviewers found it difficult to understand the motivation for the work and hence both had difficulty judging the value of the proposed solution. Detailed comments and suggestions are provided below.