# Understanding when to use p-values

The p value is perhaps one of the most misused concepts in statistics. In fact, many researchers in science who are not statistics experts seem to believe that statistics is really the study of how to define and calculate p values. I’d say this attitude is especially prevalent in biology, medicine, and some areas of the social sciences.

The truth is, to a Bayesian such as myself, p values are largely irrelevant, but they DO have one area where they make sense. First, let’s understand what a p value is.

### p value meaning

The p value in a test is the probability that a random number generator of a certain type would output data whose test statistic is as extreme or more extreme than the actually observed value of that test statistic.

$$P_{H_0}(t(d) \ge t(D))$$

Procedurally, imagine that your data $$D$$ comes from a random number generator which has boring properties that you don’t care about. If your data came from this random number generator, it would, by definition, be an uninteresting process that you’d stop studying. Call this random number generator $$H_0$$ and its output $$d$$ (little d). Now consider some function $$t$$ which maps your data to a real number: $$t(D)$$ for your actual data, or $$t(d)$$ for the random generator’s output. Generally the function measures, in some sense, how far away your data falls from an uninteresting value of $$t$$ (often $$t=0$$). Now, how often would your specifically chosen boring random number generator produce fake data $$d$$ whose $$t$$ value is as extreme or more extreme than the $$t$$ value of your actual data $$D$$? This is what the formula above describes.
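To make the procedure concrete, here is a minimal sketch in Python that estimates the p value by brute force: it runs the boring random number generator many times and counts how often $$t(d)$$ is at least as large as $$t(D)$$. Everything named here is an assumption chosen for illustration: the standard normal generator playing the role of $$H_0$$, the absolute sample mean as $$t$$, the made-up data `D`, and the helper names `monte_carlo_p_value`, `boring_generator`, and `abs_mean`.

```python
import random


def monte_carlo_p_value(observed, null_sampler, statistic, n_sims=10_000, seed=0):
    """Estimate P_{H0}(t(d) >= t(D)) by simulating the null generator."""
    rng = random.Random(seed)
    t_obs = statistic(observed)
    hits = 0
    for _ in range(n_sims):
        fake = null_sampler(rng, len(observed))  # one draw of "little d"
        if statistic(fake) >= t_obs:
            hits += 1
    return hits / n_sims


def boring_generator(rng, n):
    """Hypothetical H0: n independent standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def abs_mean(xs):
    """Hypothetical test statistic t: distance of the sample mean from 0."""
    return abs(sum(xs) / len(xs))


# Made-up data D, purely for illustration.
D = [0.9, 1.4, 0.3, 1.1, 0.7, 1.6, 0.2, 1.0]
p = monte_carlo_p_value(D, boring_generator, abs_mean)
print(f"estimated p value: {p:.4f}")
```

The estimate is only as good as the number of simulations, and it says nothing beyond “this particular boring generator rarely produces a $$t$$ this large”.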

### p value use

So, the above description seems rather sterile. Here is an example of “proper” use of a p value: filtering data.

You have a large sample of 1-second-long audio recordings of the ambient noise around the area of a surveillance camera. You want to detect when the overall loudness of the noise is “unusual” so that you can tag and save the audio and video recordings for 30 minutes on either side of the “unusual” event. These events will be saved indefinitely, and other time periods will be deleted after 7 days to reduce data storage requirements. You calculate an overall amplitude of the sound recording $$s(t)$$ using this formula: $$A_{t_0} = \int_{t_0}^{t_0+1} s(q)^2 \, dq$$ This is a real number, and its calculation from the data does not require generating random numbers. The formula is therefore a deterministic function that maps your data (a sequence of voltages) to a real number, and so it qualifies as a “test statistic”.

Next you manually identify a set of 1000 time intervals during which “nothing happened” on your recording, and you calculate the $$A$$ values for each of these “uninteresting” intervals. Now, if you have an $$A$$ value which is greater than 99% of all the “uninteresting” $$A$$ values, then you know that the $$A$$ value is unusually large under the assumption that it was generated by the “nothing happened” random number generator. In this case, the p value for the amplitude to come from a “nothing happened” time period is $$p = 0.01$$, because 99% of “nothing happened” samples have amplitude less than this given amplitude.
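The filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real surveillance pipeline: the clips are simulated Gaussian noise rather than recorded voltages, the sample rate of 200 and the noise levels are invented, the integral is approximated by a plain sum, and the names `amplitude`, `empirical_p_value`, `should_keep`, and `fake_clip` are hypothetical.

```python
import random


def amplitude(samples, sample_rate):
    """Discrete approximation of A = integral of s(q)^2 dq over one clip."""
    dt = 1.0 / sample_rate
    return sum(v * v for v in samples) * dt


def empirical_p_value(a_obs, reference_amplitudes):
    """Fraction of 'nothing happened' amplitudes at least as large as a_obs."""
    exceed = sum(1 for a in reference_amplitudes if a >= a_obs)
    return exceed / len(reference_amplitudes)


def should_keep(a_obs, reference_amplitudes, alpha=0.01):
    """Flag a clip for indefinite storage when its p value is at most alpha."""
    return empirical_p_value(a_obs, reference_amplitudes) <= alpha


rng = random.Random(42)
SAMPLE_RATE = 200  # samples per 1-second clip; an assumed value


def fake_clip(noise_sd):
    """Simulated 1-second clip of voltages (stand-in for real audio)."""
    return [rng.gauss(0.0, noise_sd) for _ in range(SAMPLE_RATE)]


# 1000 hand-picked "nothing happened" intervals, simulated here as quiet noise.
reference = [amplitude(fake_clip(0.1), SAMPLE_RATE) for _ in range(1000)]

loud = amplitude(fake_clip(0.5), SAMPLE_RATE)   # something noisy happened
quiet = amplitude(fake_clip(0.1), SAMPLE_RATE)  # an ordinary second

print("loud :", empirical_p_value(loud, reference), should_keep(loud, reference))
print("quiet:", empirical_p_value(quiet, reference), should_keep(quiet, reference))
```

Note that the reference distribution here is just the list of 1000 observed amplitudes, so no distributional formula is needed, and the smallest nonzero p value the filter can report is 1/1000.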

Note that this does not mean in any way that “a crime happened”. Perhaps a cat knocked over a trash can, or a window washer came and bumped the camera, or a jet airplane flew over, or a Harley Davidson drove by, or whatever. Taking the fact that the audio was louder than most of the samples of “nothing happened” as evidence that “a crime was committed” is seriously WRONG, in just the way that taking the fact that your psychology experiment produced measurements that are different from some simple “null hypothesis” as evidence that “my explanatory mechanism is true” is also seriously WRONG.

### The real value of p: filtering

So, we see the real value of “p” values: filters. We have lots of things that we probably shouldn’t pay attention to: chemicals synthesized at random in a pharma lab, noise produced by unimportant events near our surveillance camera, time periods on a seismometer during which no earthquake waves are being received, psychology experiments that we shouldn’t pay attention to. The p value gives us a way to say “something is worth considering here”. The big problem comes when we make the unjustified leap from “something is worth considering here” to “my pet theory is true!”.

As a Bayesian, I appreciate the use of p values in filtering down the data to stuff I probably shouldn’t ignore. It’s a first step. But the next step is always going to be “let’s build a model of what’s going on, and then find out what the data tells us about the unobserved explanatory variables within the model”. That’s where the real science occurs!

PS: often, the assumption that there is a single stationary distribution from which all the data $$d$$ come means that the random samples of $$t(d)$$ have some common and well-known frequency distribution (like normal, chi-squared, chi, gamma, or whatever). The central limit theorem often guarantees that the frequency distribution is approximately normal under some mild assumptions. This is why the “null hypothesis” is often not really about the data-generating histogram, but rather about the histogram of $$t(d)$$. In this view, it doesn’t so much matter what your data turn out to be, but rather what $$t(D)$$ is relative to a reference distribution for the test statistic (like the “student-t”, the normal, the chi-squared, etc.).
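As a small sketch of this “reference distribution” view, the Python snippet below never simulates the data generator at all. It computes a standardized mean and looks it up in a standard normal reference distribution. The assumptions are mine: a two-sided test against a null mean of 0, a normal reference justified only by the central limit theorem for reasonably large samples (for small samples a Student-t reference would be the usual choice), made-up data, and the hypothetical helper name `z_test_p_value`.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_test_p_value(data, null_mean=0.0):
    """Two-sided p value for the standardized mean against a normal reference."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)
    z = (mean(data) - null_mean) / standard_error  # this is t(D)
    # Probability that a standard normal t(d) is at least as far from 0 as t(D).
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Made-up data, purely for illustration.
centered = [-1.0, 1.0, -2.0, 2.0]
shifted = [1.0, 1.1, 0.9, 1.2, 0.8]

print(z_test_p_value(centered))  # sample mean is exactly 0, so p is 1.0
print(z_test_p_value(shifted))   # sample mean is far from 0 in standard-error units
```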

I think this is in some sense how Fisher originally intended they be used. I like the intuition of “a measure of surprise” [given no effect].

FDR-controlling machinery might be of interest to self-identified “Bayesians” in this kind of context, too.