Andrew Gelman has discussed and written about his concept of "The Garden of Forking Paths" in NHST (null hypothesis significance testing) analysis of scientific data.

"Laplace", whose insights I respect a lot, has ridiculed the idea, and when it is put into the terms he uses, I agree with him. However, I don't think that Gelman's point is quite the same as the one Laplace ridicules. So, thinking about it, here's how I'd like to proceed to an understanding.

For simplicity we'll analyze the situation in which a researcher collects data $D$, and then does a test $T$ to determine whether the two subsets $A^+(D)$ and $A^-(D)$ differ in some way that is detectable by the test through a sample statistic $S$.

First off, consider the various options available to the researcher:

$$T \in \{T_1, T_2, \ldots, T_I\}$$

and

$$S \in \{S_1, S_2, \ldots, S_J\}$$

and

$$A \in \{A_1, A_2, \ldots, A_K\}$$

That is, we can choose which test to use, which statistic to test, and how to subset and exclude certain portions of the data to form the partition (the function $A$ partitions and excludes the data, so that there are two groups). Below, i, j, k denote the indices of the chosen test, statistic, and partition.
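To make the size of the choice space concrete, here is a minimal Python sketch that enumerates every (i, j, k) combination. The menu entries are hypothetical placeholders made up for illustration; only the counts I, J, K matter.

```python
from itertools import product

# Hypothetical menus of choices (illustrative names, not from any real study).
tests = ["z_test", "rank_sum_test", "permutation_test"]        # T_1 .. T_I
statistics = ["mean", "median", "trimmed_mean"]                 # S_1 .. S_J
partitions = ["by_sex", "by_age_over_40", "by_site",
              "drop_outliers_then_by_sex"]                      # A_1 .. A_K

# Every analysis is one index triple (i, j, k) into the three menus.
analyses = list(product(range(len(tests)),
                        range(len(statistics)),
                        range(len(partitions))))
n_paths = len(analyses)  # I * J * K

print(f"{n_paths} distinct analyses from {len(tests)} tests, "
      f"{len(statistics)} statistics, {len(partitions)} partitions")
```

Even with these tiny menus the count multiplies quickly, and real menus (covariates, exclusion rules, transformations) are far longer.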

Now, what is the Bayesian probability that p < 0.05 given our knowledge N (I use N because I've already used K)?

Suppose in the first case that N contains the information "i,j,k were preregistered choices, and D was collected after i,j,k were specified and is independent of i,j,k". Then $P(i,j,k|N) = 1$, and $P(p < 0.05 | N)$ is determined entirely by our knowledge in N of the appropriateness of the test and the p values that it outputs.

So, we're still left with all the problems of the use of p values, but we're at least not left with the problems described below.
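As a rough check on the preregistered case, here is a small simulation sketch. It assumes a deliberately simple setup that is not from the discussion above: pure-null standard normal data, a fixed split into two groups of 50, and a two-sided z test with known variance. The function names are hypothetical.

```python
import math
import random

def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p value for a difference in means, assuming known sigma."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    standard_error = sigma * math.sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (mean_a - mean_b) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

def preregistered_rejection_rate(n_datasets=2000, n_per_group=50, seed=0):
    """Fraction of pure-null datasets giving p < 0.05 when the analysis is fixed in advance."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_datasets):
        # The analysis (which groups, which test) never depends on the data.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p_value(group_a, group_b) < 0.05:
            rejections += 1
    return rejections / n_datasets

rate = preregistered_rejection_rate()
print(f"preregistered rejection rate under the null: {rate:.3f}")
```

Under these assumptions the rate should land close to the nominal 0.05, which is all the preregistered p value promises.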

In the case that N contains the information "I,J,K are all large integers, i,j,k were chosen after seeing D, and the researcher is motivated to get p < 0.05 and probably at least looked at the data, produced some informal graphs, and discussed which analysis to do with colleagues", we're left with the assumption that i,j,k were chosen from among those analyses which seemed, via informal data "peeking", likely to give p < 0.05. So the Bayesian is left with:

$$P(p < 0.05 | D, N) = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} P(p < 0.05 | i,j,k,D,N) \, P(i,j,k | D,N)$$

Now, due to our pre-analysis choice peeking, we can safely assume

$$P(p < 0.05 | i,j,k,D,N) \sim 1$$

Sure, it might not be exactly 1, but it's much, much bigger than 0.05, like maybe 0.5 or 0.77 or 0.93, and this is FOR ALL i,j,k that would actually be chosen. So

$$P(p < 0.05 | D, N) \approx \sum_{(i,j,k) \in G} P(p < 0.05 | i,j,k,D,N) \, P(i,j,k | D,N) \sim 1$$

where G is the reachable subset of the $I \times J \times K$ space, called "the garden of forking paths": the subset from which any typical researcher would find themselves choosing i,j,k, and which leads to analyses where $P(p < 0.05 | i,j,k,D,N) \sim 1$.

So, how much information does $p < 0.05$ give the Bayesian about the process of interest? In the preregistered case, it at least tells you something like "it is unlikely that a random number generator of the type specified in the null hypothesis test would have generated the data" (not that we usually care, but this could be relevant some of the time).

In the GOFP case, it tells us "these researchers know how to pick analyses that will get them into the GOFP subset so they can get their desired p < 0.05 even without first doing the explicit calculations of p values."
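The effect of informal peeking can be sketched with a small simulation. Everything here is an illustrative assumption rather than a model of any real study: the data are pure null (standard normal draws from Python's `random` module, which uses the Mersenne Twister), the "garden" is just 20 random balanced splits, and "peeking" means eyeballing which split shows the biggest gap in group means. Only one formal test is ever run per dataset.

```python
import math
import random

def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p value for a difference in means, assuming known sigma."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    standard_error = sigma * math.sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (mean_a - mean_b) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

def peek_then_test(outcome, n_splits, rng):
    """Look informally at n_splits candidate partitions, then formally test only one."""
    indices = list(range(len(outcome)))
    half = len(outcome) // 2
    best_gap, best_groups = -1.0, None
    for _ in range(n_splits):
        rng.shuffle(indices)  # a random balanced partition stands in for "split by covariate"
        group_a = [outcome[i] for i in indices[:half]]
        group_b = [outcome[i] for i in indices[half:]]
        gap = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if gap > best_gap:  # the "informal graph" that looks most promising
            best_gap, best_groups = gap, (group_a, group_b)
    return z_test_p_value(*best_groups)

def forking_paths_rejection_rate(n_datasets=300, n_points=100, n_splits=20, seed=0):
    """Fraction of pure-null datasets where the single chosen analysis gives p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        outcome = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        if peek_then_test(outcome, n_splits, rng) < 0.05:
            hits += 1
    return hits / n_datasets

rate = forking_paths_rejection_rate()
print(f"forking-paths rejection rate under the null: {rate:.2f}")
```

If the 20 looks were independent, the rate would be about $1 - 0.95^{20} \approx 0.64$, far above the nominal 0.05 even though no multiple testing was formally carried out.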

So, using this formalism, we arrive at the idea that it's not so much that GOFP invalidates the p value, it's that it alters the evidentiary value of the p value to a Bayesian.
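One way to put numbers on "evidentiary value" is a two-hypothesis Bayes' rule calculation. This is only a sketch: the prior of 0.5, the 0.80 power in the preregistered case, and the 0.95 and 0.77 rates in the GOFP case are all assumed for illustration (0.77 echoes one of the example values above), and `posterior_probability` is a hypothetical helper.

```python
def posterior_probability(prior, p_sig_given_effect, p_sig_given_null):
    """Posterior P(real effect | p < 0.05) for a binary effect/null hypothesis."""
    numerator = prior * p_sig_given_effect
    denominator = numerator + (1 - prior) * p_sig_given_null
    return numerator / denominator

prior = 0.5  # assumed prior probability that the effect is real

# Preregistered: p < 0.05 happens 80% of the time with an effect, 5% without.
preregistered = posterior_probability(prior, 0.80, 0.05)

# GOFP: p < 0.05 is reached almost regardless of whether there is an effect.
forking_paths = posterior_probability(prior, 0.95, 0.77)

print(f"preregistered posterior: {preregistered:.2f}")
print(f"forking-paths posterior: {forking_paths:.2f}")
```

With these assumed inputs the preregistered result moves the posterior to about 0.94, while the GOFP result moves it only to about 0.55. The p value is the same number in both cases; what changes is how strongly it discriminates between the hypotheses.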

One Response
1. April 7, 2017

Put another way, the p value is yet another thing calculated from the data, and so is itself a kind of data. What the data means to a Bayesian depends on the Bayesian's understanding of the data collection and production process. The existence of an "adversary" who is "trying to fool you" by choosing data analysis procedures in a utility maximization scheme (i.e. trying to get grants/publications) *should* alter your opinion of what the p < 0.05 means.

In the preregistered case, you can at least say "this particular null hypothesis has been shown to be probably false". In the GOFP case, with $I \times J \times K$ being enormous, the idea is that there's always some way to slice and dice the data to get p < 0.05, and that it is easy to do by just looking at a few graphs and discussing ideas with your colleagues, so it doesn't require formal multiple hypothesis testing to implement. In that context, even if the data set comes out of a pure RNG null hypothesis procedure (some joker literally ran the Mersenne Twister algorithm and generated it!), you'd find the adversary pointing out things that "must be true" because they found p < 0.05 through an informal search procedure.

It's still the case that the null hypothesis RNG actually chosen rarely outputs data as extreme as the data that was seen, but with the large cardinality of the potential search space, it's also the case that there's bound to be some way to get p < 0.05 even if everything *is* "pure null". So, if you had a specific reason to care about the specific null hypothesis tested by the actual researchers, then rejoice: it seems like this data likely invalidates it. But that's pretty rarely the case in the kinds of research that GOFP criticism applies to.