What it takes for a p-value to be meaningful.

2014 September 3
by Daniel Lakeland

Frequentist statistics often relies on p values as summaries of whether a particular dataset implies an important property about a population (often that the average is different from 0).

In a comment thread on Gelman’s blog (complete with a little controversy) I discussed some of the realistic problems with that, which I’ll repeat and elaborate here:

When we do some study in which we collect data $$d$$ and then calculate a $$p$$ value to see if it has some particular property, we calculate the following:

$$p = 1 - P(s_i(d))$$

where $$P$$ is a functional form for a cumulative distribution function, and $$s_i$$ are sample statistics of the data $$d$$.

A typical case might be $$1-p_t(\sqrt{n(d)}\,\bar d / s(d),n(d)-1)$$ where $$\bar d$$ is the sample average of the data, $$s(d)$$ is the sample standard deviation, $$n(d)$$ is the number of data points, and $$p_t$$ is the standard t distribution CDF with $$n(d)-1$$ degrees of freedom.
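As a rough illustration of the t-based case, here is a minimal Python sketch using only the standard library. It assumes the usual one-sample t statistic, $$\sqrt{n(d)}\,\bar d / s(d)$$, and approximates the t CDF by numerically integrating the density. The helper names `t_cdf` and `one_sided_t_p_value` and the data values are made up for the example; in practice a statistics library would supply the CDF.

```python
import math
import statistics

def t_cdf(t, nu, steps=2000):
    """Student t CDF with nu degrees of freedom, via Simpson's rule on the density.

    Illustrative only: a statistics library would normally provide this.
    """
    # Log of the density's normalizing constant (lgamma avoids overflow).
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))

    def pdf(x):
        return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

    if t == 0:
        return 0.5
    h = t / steps  # signed step, so negative t is handled too
    total = pdf(0.0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    # By symmetry the CDF at 0 is 0.5; add the integral from 0 to t.
    return 0.5 + total * h / 3

def one_sided_t_p_value(d):
    """p = 1 - p_t(sqrt(n) * mean(d) / s(d), n - 1)."""
    n = len(d)
    t_stat = math.sqrt(n) * statistics.mean(d) / statistics.stdev(d)
    return 1.0 - t_cdf(t_stat, n - 1)

example_data = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data, not from any study
p_example = one_sided_t_p_value(example_data)
print(p_example)
```

Note that `statistics.stdev` uses the $$n-1$$ denominator, which matches the sample standard deviation in the t statistic.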

The basic idea is this: you have a finite population of things, you can sample those things, and measure them to get values $$d$$.  You do that for some particular sample, and then want to know whether future samples will have similar outcomes. In order for the $$p$$ value to be a meaningful way to think about those future samples you need:

  • Representativeness of the sample. If your sample covers a small range of the population’s total variability, then obviously future samples will not necessarily look like your current sample.
  • Stability of the measurements in time. If the population’s values are changing on the timescale between now and the next time you have a sample, then the p value is meaningless for the future sample.
  • Knowledge of a good functional form for $$p$$. When we can rely on things like central limit theorems, and certain summary statistics therefore have sampling distributions that are somewhat independent of the underlying population distribution, we will get a more robust and reliable summary from our p values. This is one reason why the t-test is so popular.
  • Belief that there is only one, or at least a small number, of possible analyses that could have been done, and that the choice of sample statistics and functional form is not influenced by information about the data. $$p_q=1-P_q(s_{iq}(d))$$ represents, in essence, a population of possible p values from analyses indexed by $$q$$. When there is a wide variety of possible values for $$q$$, the fact that one particular p value was reported with “statistical significance” only indicates to the reader that it was possible to find some $$q$$ that gave the required small $$p_q$$.

The “Garden of Forking Paths” that Gelman has been discussing is really about the size of the set of possible $$q$$, independent of the number of values that the researcher actually looked at. It’s also about the fact that, having seen your data, it is plausibly easier to choose an analysis that produces a small $$p_q$$, even without looking at a large number of $$q$$ values, when there is a large plausible set of potential $$q$$.
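A crude way to see why the size of the set of possible $$q$$ matters is to simulate pure-noise studies and report the smallest $$p_q$$ among the available analyses. The sketch below is only a caricature under stated assumptions: each analysis $$q$$ is a one-sided z-test with known unit variance, the analyses are statistically independent (real alternative analyses of one dataset are correlated, so the inflation would be smaller), and the researcher explicitly picks the minimum, whereas the forking-paths argument says no such explicit search is needed. The function names and parameter values are invented for the example.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF, playing the role of P_q for every analysis q."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smallest_p(rng, n_analyses, n_points=30):
    """Smallest p_q = 1 - P_q(s_q(d)) over n_analyses analyses of pure noise."""
    best = 1.0
    for _ in range(n_analyses):
        # Assumption: each analysis sees independent noise with true mean 0.
        d = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        z = math.sqrt(n_points) * sum(d) / n_points  # sample statistic s_q(d)
        best = min(best, 1.0 - normal_cdf(z))
    return best

def false_positive_rate(n_analyses, n_studies=1000, alpha=0.05, seed=0):
    """Fraction of null studies whose best available analysis has p_q < alpha."""
    rng = random.Random(seed)
    hits = sum(smallest_p(rng, n_analyses) < alpha for _ in range(n_studies))
    return hits / n_studies

rate_one = false_positive_rate(1)    # a single pre-specified analysis
rate_many = false_positive_rate(20)  # best of 20 plausible analyses
print(rate_one, rate_many)
```

With one analysis the rate should sit near the nominal 0.05; with 20 independent analyses it should approach $$1-0.95^{20}\approx 0.64$$, even though there is no effect anywhere in the simulated population.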

Gelman has commented on all of these, but there’s been a fair amount of hoo-ha about his “Forking Paths” argument. I think the symbolification of it here makes things a little clearer: if there is a huge number of $$q$$ values which could plausibly have been accepted by the reader, and the particular $$q$$ value chosen (the analysis) was not pre-registered, then there is no way to know whether $$p$$ is a meaningful summary about future samples representative of the whole population of things.

What problems are solved by a Bayesian viewpoint?

Representativeness of the sample is still important, but if we have knowledge of the data collection process, and background knowledge about the general population, we can build that knowledge into our choice of data model and prior. We can, at least partially, account for our uncertainty in representativeness.

Stability in time: A Bayesian analysis can give us reasonable estimates of model parameters for a model of the population at the given point in time, and can use probability to do this, even though there is no possibility to go back in time and make repeated measurements at the same time point. Frequentist sampling theory often confuses things by implicitly assuming time-independent values, though I should mention it is possible to explicitly include time in frequentist analyses.

Knowledge of a good functional form: Bayesian analysis does not rely on the concept of repeated sampling for its conception of a distribution. A Bayesian data distribution does not need to reproduce the actual unobserved histogram of values “out there” in the world in order to be accurate. What it does need to do is encode true facts about the world which make it sensitive to the questions of interest. See my example problem on orange juice, for instance.

Possible Alternative Analysis: In general, Bayesian analyses are rarely summarized by p values, so the idea that the $$p$$ values themselves are random variables and we have a lot to choose from is less relevant. Furthermore, Bayesian analysis is always explicitly conditional on the model, and the model is generally something with some scientific content. One of the huge advantages of Bayesian models is that they leave the description of the data to the modeler in a very general way. So a Bayesian model essentially says: “if you believe my model for how data $$d$$ arises, then the parameter values that are reasonable are $$a,b,c\ldots$$”. Most Frequentist results can be summarized by “if you believe the data arise by some kind of simple boring process, then you would be surprised to see my data”. That’s not at all the same thing!
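To make the “if you believe my model, these parameter values are reasonable” statement concrete, here is a minimal sketch of about the simplest possible Bayesian model. Everything in it is an assumption chosen for illustration: the data are modeled as Normal with unknown mean $$\mu$$ and a known standard deviation `sigma`, the prior on $$\mu$$ is Normal, and the data values are made up. Under those assumptions the posterior for $$\mu$$ is also Normal, with the standard conjugate update used below.

```python
import math
import statistics

def normal_posterior(d, sigma, prior_mean, prior_sd):
    """Posterior (mean, sd) for mu, assuming d_i ~ Normal(mu, sigma) with
    sigma known and a Normal(prior_mean, prior_sd) prior on mu."""
    n = len(d)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * statistics.mean(d))
    return post_mean, math.sqrt(post_var)

example_data = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data
post_mean, post_sd = normal_posterior(
    example_data, sigma=2.0, prior_mean=0.0, prior_sd=10.0)
# Central interval holding roughly 95% of the posterior probability for mu.
interval = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
print(post_mean, post_sd, interval)
```

The output is a range of plausible values for $$\mu$$ conditional on the stated model, rather than a statement of surprise at the data under a null process.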

