Bayesian models are also (sometimes) Non-Reproducible
Shravan at Gelman's blog writes "I moved away from frequentist to Bayesian modeling in psycholinguistics some years ago (2012?), and my students and I have published maybe a dozen or more papers since then using Bayesian methods (Stan or JAGS). We still have the problem that we cannot replicate most of our and most of other people’s results."
So, Bayesian modeling isn't a panacea. Somehow I think this has been the take-away message for many people from my recent discussions of the philosophy of what Bayes vs Frequentist methods are about. "Switch to Bayes and your models will stop being un-reproducible!"
So, I feel like I've failed (so far, in part), because that certainly isn't my intention. The only thing that will make your science better is to figure out how the world works. This means figuring out the details of the mechanism that connects some observables to other observables using some parameter values which are unobservable. If you're doing science, you're looking for mechanisms, causality.
But, I believe that Bayesian analysis helps us detect when we have a problem, and by doing Bayesian analysis, we can force a kind of knowledge of our real uncertainty onto the problem, that we can't do with Frequentist procedures!
That detection of un-reproducibility, and the ability to force realistic evaluations of uncertainty, both help us be less wrong.
Some Bayesian Advantages
So, what are some of the things that make Bayesian models helpful for detecting scientific hypotheses that are unreproducible or keep our models more realistic?
Consistency is Detectible:
First off, there's a huge flexibility of modeling possibilities, with a single consistent evaluation procedure (Bayesian probability theory).
For example, you can easily distinguish between the following cases:
- There is a single approximately fixed parameter in the world which describes all of the instances of the data. (ie. charge of an electron, or vapor pressure of water at temperature T and atmospheric pressure P, or the maximum stack depth of linguistic states required for a pushdown automaton to parse a given sentence)
- There are parameters that describe individual sets of data and all of these parameters are within a certain range of each other (ie. basal metabolic rate of individual atheletes, fuel efficiency of different cars on a test track).
- There is no stable parameter value that describes a given physical process (ie. basal metabolic rate of couch potatoes who first start training for a marathon, fuel efficiency of a car with an intermittent sensor problem, or if you believe that Tracy and Beall paper Andrew Gelman likes to use as an example, the shirt color preferences of women through time).
When Shravan says that his Bayesian models are unreproducible, what does that indicate?
Most likely it indicates that he's created a model like (1) where he hypothesizes a universal facet of language processing, and then when he fits it to data and finds his parameter value, if he fits the model to different data, the parameter value isn't similar to the first fit. That is, the universality fails to hold. If, in fact, he were to use the posterior distribution of the first fit as the prior for the second, he'd find that the posterior widened or maybe shifted somewhere that the posterior from the first analysis wouldn't predict when the additional data was taken into account.
Or, maybe he's got a model more like (2) where he hypothesizes that at least there's some stable parameter for each person but that all the people are clustered in some region of space. But, when he measures the same person multiple times he gets different parameter values for each trial, or when he measures different populations he winds up with different ranges of parameter values.
Or, maybe he has a model like (3) where he expects changes in time, but the kinds of changes he sees are so broad and have so little structure to them that they aren't really predictive of anything at all.
In each case, probability as extended logic gives you a clear mechanism to detect that your hypothesis is problematic. When you collect more data and it fails to concentrate your parameter estimates, or it causes the parameters to shift outside the region they had previously been constrained to, it indicates that the model is inconsistent and can't explain the data sufficiently.
Bayesian models can be structured to be valid even when the data isn't an IID sample:
IID sampling is just way more powerful mathematically than the reality of running real experiments, so that assumption implicit in a Frequentist test, will fool you into failing to realize the range of possibilities. In other words, Frequentism relies on you throwing away knowledge of alternatives that might happen under other circumstances ("let the data speak for themselves" which is sometimes attributed to Ronald Fisher but may not be something he really said)
Suppose instead you are analyzing performance of a linguistic test in terms of a Frequency distribution of events. The basic claim is that each event is an IID sample from a distribution where x is the measured value, and a,b are some fixed but unknown parameters. Now, you do a test to see if and you reject this hypothesis, so you assume and you publish your result. First off, the assumption of a fixed and IID sampling is usually a huge huge assumption. And, with that assumption comes a "surety" about the past, the future, about other places in the world, other people, etc. In essence what that assumption means is "no matter where I go or when I take the measurements all the measurements will look like samples from F and every moderate size dataset will fill up F". How is this assumption different from "no matter where I go all I know about the data is that they won't be unusual events under distribution P" which is the Bayesian interpretation?
Consider the following specific example: P = normal(0,10), and I go and I get a sample of 25 data points, and I find they are all within the range -1,1. A Frequentist rejects the model of normal(0,10). Any random sample from normal(0,10) would contain values in the ranges say -15,-1 and 1,15 but this sample doesn't.
Does the Bayesian reject the model? No, not necessarily. Because the Bayesians know that a) the probability is in their head based on the knowledge they have, whereas the Frequency is in the world based on the laws of physics. And, b) in many many cases there is no sense in which the data is a representative random sample of all the things that can happen in the world. So, even if Bayesians really do believe that is a stable Frequency distribution across the full range of conditions, there's nothing to say that the current conditions couldn't be all within a small subset of what's possible. In other words, the Bayesian's knowledge may tell them only about the full range of possibilities, but not about the particulars of how this sample might vary from the full range of possibilities. Since Bayesians know that their samples aren't IID samples from they need not be concerned that it fails to fill up all of . The only time they need to be concerned is if they see samples that are far outside the range of possibilities considered. Like say x=100 in this example. That would never happen under normal(0,10);
To the Bayesian, x ~ normal(0,s); s ~ prior_for_s(constants) is very different than x ~ normal(0,constant); In the first case the scale s is unknown and has a kind of range of possibilities which we can do inference on. The assumption is implicitly that the data is informative about the range of s values. In the second case, the constant scale comes from theoretical principles and we're not interested in restricting the range down to some other value s based on the data.
This can be important, because it lets the Bayesian impose his knowledge on the model, forcing the model to consider a wide range of possibilities for data values even when the data doesn't "fill up" the range. Furthermore, it's possible to be less dogmatic for the Bayesian, they can provide informative priors that are not as informative as the delta-function (ie. a fixed constant) but also will not immediately mold themselves to the particular sample. For example s ~ gamma(100,100.0/10) says that the scale is close to 10 and this information is "worth" about 100 additional samples, so don't take the 25 samples we've got as fully indicative of the full range.
The assumption of random sampling that comes out of Frequentist theory says that there's no way in hell that you could fail to fill up the Frequency distribution if your samples were 25 random ones. This means implicitly that the current data and any future data look "alike" for sufficiently large sample sizes. That's a very strong assumption which only holds when you are doing random sampling from a finite population using an RNG. But all of the Frequentist tests rely on that assumption. Without that assumption you simply can't interpret the p values as meaningful at all. If you know that your sample is not a random sample from a finite population but just "some stuff that happened through time where people were recruited to take part in various experiments in various labs" then assuming that they are random samples from a fixed population says automatically "when n is more than a smallish number, like 25 to 50, there is no way in hell we've failed to sample any part of the possible outcomes"
Since the p values you get out of Frequentist tests are typically only valid when the data really IS an IID sample of the possibilities. You're left with either being fooled by statistics, or being a Bayesian and imposing your knowledge onto your inferences.