Andrew Gelman has discussed and written on his concept of "The Garden Of Forking Paths" in NHST analysis of scientific data.

"Laplace", whose insights I respect a lot, has ridiculed the idea, and when it is put in the terms he uses, I agree with him. However, I don't think that Gelman's point is quite the same as the one Laplace ridicules. So, thinking about it, here's how I'd like to proceed to an understanding.

For simplicity we'll analyze the situation in which a researcher collects data $D$, and then does a test $T$ to determine whether the two subsets $A^+(D)$ and $A^-(D)$ differ in some way that is detectable by the test by use of a sample statistic $S$.

First off, consider what options are available to the researcher:

$$T \in \{T_i\}, \quad i = 1 \ldots I$$

and

$$S \in \{S_j\}, \quad j = 1 \ldots J$$

and

$$A \in \{A_k\}, \quad k = 1 \ldots K$$

That is, we can choose which test to use, which statistic to test, and how to subset and exclude certain portions of the data to form the partition (the function A partitions and excludes the data, so that there are two groups).

Now, what is the Bayesian probability that p < 0.05 given our knowledge N (I use N because I've already used K)?

Suppose in the first case that N contains the information "i,j,k were preregistered choices and D was collected after i,j,k were specified and is independent of i,j,k". Then $P(i,j,k|N) = 1$, and $P(p < 0.05 | N)$ is determined entirely by our knowledge in N of the appropriateness of the test and the p values that it outputs.

So, we're still left with all the problems of the use of p values, but we're at least not left with the problems described below.

In the case that N contains the information "I,J,K are all large integers, i,j,k were chosen after seeing D, and the researcher is motivated to get p < 0.05 and probably at least looked at the data, produced some informal graphs, and discussed which analysis to do with colleagues", we're left with the assumption that i,j,k were chosen from among those analyses which seemed, via informal data "peeking", likely to give p < 0.05. So the Bayesian is left with:

$$P(p < 0.05 | D,N) = \sum_{i,j,k} P(p < 0.05 | i,j,k,D,N) \, P(i,j,k | D,N)$$

Now, due to our pre-analysis choice peeking, we can safely assume

$$P(p < 0.05 | i,j,k,D,N) \sim 1$$

Sure, it might not be exactly 1, but it's much much bigger than 0.05, like maybe 0.5 or 0.77 or 0.93, and this is FOR ALL i,j,k that would actually be chosen, so that

$$P(p < 0.05 | D,N) \approx \sum_{(i,j,k) \in G} P(p < 0.05 | i,j,k,D,N) \, P(i,j,k | D,N) \sim 1$$

where G is the reachable subset of the $I \times J \times K$ space called "the garden of forking paths", such that any typical researcher would find themselves choosing i,j,k out of that subset, which leads to analyses where $P(p < 0.05 | i,j,k,D,N) \sim 1$.

So, how much information does $p < 0.05$ give the Bayesian about the process of interest? In the preregistered case, it at least tells you something like "it is unlikely that a random number generator of the type specified in the null hypothesis test would have generated the data" (not that we usually care, but this could be relevant some of the time).

In the GOFP case, it tells us "these researchers know how to pick analyses that will get them into the GOFP subset so they can get their desired p < 0.05 even without first doing the explicit calculations of p values."
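To make the contrast concrete, here is a minimal simulation sketch, not anything from Gelman or Laplace. Everything in it is an assumption made for illustration: the data are pure-null standard normal outcomes, the "forking paths" are just the full sample plus subgroups defined by a handful of random binary covariates, the test is a two-sided z-test with known unit variance, and the "peeking" researcher (the hypothetical function `forking_p`) simply picks whichever partition shows the largest raw gap in means and then runs exactly one test on it.

```python
import math
import random


def two_sided_p(treated, control):
    """Two-sided z-test p value for a difference in means (unit variance assumed)."""
    diff = sum(treated) / len(treated) - sum(control) / len(control)
    se = math.sqrt(1.0 / len(treated) + 1.0 / len(control))
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def simulate_dataset(rng, n=200, n_covariates=8):
    """Pure-null data: the outcome y is unrelated to treatment and to every covariate."""
    return [
        {
            "treated": idx % 2 == 0,
            "y": rng.gauss(0.0, 1.0),
            "covs": [rng.random() < 0.5 for _ in range(n_covariates)],
        }
        for idx in range(n)
    ]


def candidate_partitions(rows):
    """Yield (treated, control) outcomes for the full sample, then for each subgroup."""
    def split(subset):
        return (
            [r["y"] for r in subset if r["treated"]],
            [r["y"] for r in subset if not r["treated"]],
        )

    yield split(rows)
    for c in range(len(rows[0]["covs"])):
        for value in (False, True):
            yield split([r for r in rows if r["covs"][c] == value])


def preregistered_p(rows):
    """The analysis was fixed in advance: always test the full sample."""
    return two_sided_p(*next(candidate_partitions(rows)))


def forking_p(rows):
    """Peek informally: pick the partition with the biggest raw gap, then test only it."""
    usable = [(t, c) for t, c in candidate_partitions(rows) if len(t) >= 2 and len(c) >= 2]

    def apparent_gap(pair):
        t, c = pair
        return abs(sum(t) / len(t) - sum(c) / len(c))

    return two_sided_p(*max(usable, key=apparent_gap))


def rejection_rates(n_sims=1000, seed=0):
    """Fraction of pure-null data sets with p < 0.05 under each kind of researcher."""
    rng = random.Random(seed)
    prereg_hits = 0
    forking_hits = 0
    for _ in range(n_sims):
        rows = simulate_dataset(rng)
        prereg_hits += preregistered_p(rows) < 0.05
        forking_hits += forking_p(rows) < 0.05
    return prereg_hits / n_sims, forking_hits / n_sims


if __name__ == "__main__":
    prereg, forking = rejection_rates()
    print(f"preregistered: {prereg:.3f}  forking paths: {forking:.3f}")
```

Under these assumptions the preregistered rate should sit near the nominal 0.05, while the forking rate should come out well above it, even though only one p value is ever computed per data set. The garden here has only 17 paths; a richer space of tests, statistics, and exclusions would presumably push the rate further toward 1.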

So, using this formalism, we arrive at the idea that it's not so much that GOFP invalidates the p value, it's that it alters the evidentiary value of the p value to a Bayesian.
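One way to write down "evidentiary value" is the odds form of Bayes' rule. This is a sketch in my own notation, not something from Gelman or Laplace: $H_0$ stands for the null RNG, $H_1$ for whatever alternative we care about, and the 0.8 used below is an assumed power chosen purely for illustration.

$$\frac{P(H_1 | p < 0.05, N)}{P(H_0 | p < 0.05, N)} = \frac{P(p < 0.05 | H_1, N)}{P(p < 0.05 | H_0, N)} \times \frac{P(H_1 | N)}{P(H_0 | N)}$$

In the preregistered case $P(p < 0.05 | H_0, N) = 0.05$ by construction, so with an assumed $P(p < 0.05 | H_1, N) = 0.8$ the likelihood ratio is $0.8 / 0.05 = 16$ and the odds move a lot. In the GOFP case both $P(p < 0.05 | H_1, N) \sim 1$ and $P(p < 0.05 | H_0, N) \sim 1$, so the likelihood ratio is roughly 1 and the odds barely move.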

4 Responses
1. April 7, 2017

Put another way, the p value is yet another thing calculated from the data, and so is itself a kind of data, and what the data means to a Bayesian is dependent on the Bayesian's understanding of the data collection and production process. The existence of an "adversary" who is "trying to fool you" by choosing data analysis procedures in a utility maximization scheme (i.e. trying to get grants/publications) *should* alter your opinion of what the p < 0.05 means.

In the preregistered case, you can at least say "this particular null hypothesis has been shown to be probably false". In the GOFP case, with I,J,K being enormous numbers, the idea is that there's always some way to slice and dice the data to get p < 0.05, and that it is easy to do by just looking at a few graphs and discussing ideas with your colleagues, so it doesn't require formal multiple-hypothesis-testing to implement. In that context, even if the data set comes out of a pure RNG null hypothesis procedure (some joker literally ran the Mersenne-Twister algorithm and generated it!) you'd find that the adversary would be pointing out things that "must be true" because they found p < 0.05 through an informal search procedure.

It's still the case that the null hypothesis RNG actually chosen rarely outputs data as extreme as the data that was seen... but with the large cardinality of the potential search space... it's also the case that there's bound to be some way to get p < 0.05 even if everything *is* "pure null". So, if you had a specific reason to care about the specific null hypothesis tested by the actual researchers, then rejoice, it seems like this data likely invalidates it. But that's pretty rarely the case in the kinds of research that GOFP criticism applies to.

April 7, 2017

The probability “P(A|K)” roughly answers the question:

Question 1: "consider all possibilities compatible with K; what percentage of them deliver A?"

P(A|K) is in a sense the best we can do given K. The only way this “fails” is if we get K wrong, or fail to translate K into P(A|K) correctly. The latter two mistakes tend to occur more if we’re looking at this post-data but that's because it's easier to "cheat" in that situation. It has nothing inherently to do with it being post-data. Don’t “cheat” and you’re fine. The structure of DNA was discovered post-data.

An entirely separate question is:

Question 2: "if I consider a large number of questions A_i, and decide in favor of A_i whenever P(A_i|K) is greater than .95 and only report those cases, then what percentage of reported cases will be right?"

This latter question has a subtle answer. More subtle than most statisticians realize. But unfortunately almost all statisticians want to use answers from Question 1 to answer Question 2. Specifically, they think something very bad happens if the answer to Question 2 isn’t numerically equal to P(A_i|K) for the A_i reported.

Now, this doesn’t happen much in practice. So they invent all kinds of stories about “garden of forked paths” or p-hacking to explain away the discrepancy between what they think the answer should be and what it really is.

Inventing irrelevant fantasies is one option.

Another option is to understand what probabilities really are, and that P(A|K) is not even close to being an estimate, either conceptually or numerically, for the answer to Question 2. Once you do that, you can go back to the pre-Frequentist days when scientists did real science without any “garden of forked paths” insanity.

Indeed, if you take this second route, you can use P(A|K) successfully, even if other questions might have been asked, and even if you would have asked different questions in a universe different from the one we live in.

• April 7, 2017

I agree with your assessment of probabilities in general. But I think the question here is what to make of the quantity p in a particular NHST test. That is, how should we update our probabilities related to broader scientific questions after seeing some p < 0.05 result published somewhere? When we see this result in the context of a preregistered study, and the hypothesis being tested is of inherent interest, we update our Bayesian probabilities in a different way than if we just see some typical non-preregistered study with p < 0.05 type results, and the reason is that our model of the "generating process" of producing p < 0.05 type results is different in the two cases.

• April 7, 2017

Another way of putting it is... p values are almost always the wrong way to approach a question, but given that someone has done some research and published a p value, what should we as Bayesians who know better do about our prior over the broader scientific hypothesis being studied (i.e. say "eating vegetable fats is healthier than animal fats" or "people's political beliefs may vary broadly based on hormone cycles" or whatever garbage people study these days)?

In some cases, the appropriate response is to say "there is no real evidentiary value to this study" and in other cases it would be something like "we can rule out the idea that responses are as good as iid normal with mean zero" (which isn't much for a \$1M grant to find out). And part of the evidence is "whether the multiplicity over plausible sub-hypotheses that could have been chosen for testing is large enough to guarantee that a brief wave of the magic wand will get you the p < 0.05 you need to publish". It's not really directly a question of what p values mean or whatever, it's more like a Bayesian meta-model of "researcher as game-theoretic adversary trying to dupe us into giving money to the lab and believing their results".