NOTE: No, damn it, this is NOT an April Fools joke!

While I'm running some massive SQL queries for my wife, I'm reading blog posts in which various people complain about the idea of Bayesians changing their priors after seeing the results of the Bayesian machinery. What they seem to ignore is that rarely do we have any real certainty about what the right model is for our data.

Suppose you have some situation where some data has been collected. You think about the science involved. You come up with some initial model. You should interpret this as:

"From the set of all models I might ever consider {M1,M2,M3...} I chose the one I thought had highest prior probability, and I wrote it down."

Interpret this as p(M1) > p(Mn) for all n > 1.

Now you should also imagine that you have a kind of informal probability function over the results of the analysis: "The results of fitting this model (the data on fit) shouldn't be off by very much if the choice of model was correct." Interpret that as a probability over goodness of fit, p(fit_of_model | M1).

Now, you write the model down in Stan or whatever, you plug in your data, you do a fit, and you see how well it works using some plots, model checks, etc. Imagine you can collapse the concept of "fit_of_model" down to a single number trading off all the different types of ill-fit. Plug this informal "fit of model" into the function:

p(fit_of_model | M1) (in other words, squint at the diagnostic plots) and see if the fit exceeds your expectations for badness. If the fit is way out in the tail of your expectations, then the "data" on goodness of fit causes you to downweight the plausibility of model M1, and you should consult your new posterior distribution over models. That is:

p(M1 | fit_of_model) \propto p(fit_of_model | M1) p(M1)

If after seeing the fit, your probability over M1 declines dramatically, then go back to your models {M2...Mn} and see which one has highest probability given what you now know about the fit. p(M2 | M1_fit_poorly) etc.
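To make the informal update concrete, here is a minimal sketch with three candidate models. Every number in it is made up for illustration, and `posterior_over_models` is a hypothetical helper, not anything from Stan. Each likelihood entry stands for "how plausible is it that M1 would fit this poorly, if model m were the right one?"

```python
def posterior_over_models(prior, p_fit_given_model):
    """Return p(M | fit), proportional to p(fit | M) * p(M), normalized to sum to 1."""
    unnormalized = {m: p_fit_given_model[m] * prior[m] for m in prior}
    total = sum(unnormalized.values())
    return {m: weight / total for m, weight in unnormalized.items()}

# Hypothetical prior: M1 was the favorite before fitting anything.
prior = {"M1": 0.6, "M2": 0.3, "M3": 0.1}

# Hypothetical informal likelihoods of the observed (poor) fit of M1,
# under the assumption that each model in turn is the right one.
# The fit we saw would be very surprising if M1 were correct.
p_fit_given_model = {"M1": 0.01, "M2": 0.5, "M3": 0.3}

posterior = posterior_over_models(prior, p_fit_given_model)
best = max(posterior, key=posterior.get)

for name in sorted(posterior):
    print(f"p({name} | fit) = {posterior[name]:.3f}")
print("next model to try:", best)
```

With these assumed numbers, M1 drops from favorite to least plausible, and the next model to write down is whichever one now has the highest posterior weight.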

Do we have formal machinery for this? No, it's too hard to write down every model you might EVER consider fitting. (Note, though, that in practice it's potentially a very large but finite set! For example, the set of all Stan codes that compile when given your data and that can be written down by a human in less than a human lifetime. Let's suppose we can write maybe 2 gigabytes of code in our lifetime. That's finite!) Instead, what we do is put as much as we think we need into model M1 and see how well it fits. If it fits poorly, we go back and use our new information to put more useful stuff into a new model M2.

Sometimes there might be a very limited set of models, M1...M5 for example, and you could actually do this formally by writing all 5 models into one Stan code and formalizing the goodness of fit stuff. But most of the time, we just do our best, do the Bayesian fit on a model, and see how it worked out. If it didn't work out well, it's back to the drawing board. It's a kind of Monte Carlo method over models.
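The "fit it, check it, back to the drawing board" loop can be sketched roughly as follows. This is only an illustration under loud assumptions: least squares stands in for a real Bayesian fit, the root-mean-square residual stands in for the single "fit_of_model" number, the data and the `max_rms` threshold are made up, and all function names are hypothetical.

```python
import math

def fit_constant(xs, ys):
    """Least-squares constant model: predict the mean of y everywhere."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x, using the closed-form solution."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def rms_residual(predict, xs, ys):
    """Collapse 'fit_of_model' to one number: the root-mean-square residual."""
    return math.sqrt(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))

def first_acceptable(candidates, xs, ys, max_rms):
    """Try models in order of decreasing prior plausibility and stop at the
    first one whose fit is not surprisingly bad."""
    tried = []
    for name, fitter in candidates:
        score = rms_residual(fitter(xs, ys), xs, ys)
        tried.append((name, score))
        if score <= max_rms:
            return name, tried
    return None, tried  # nothing acceptable: time to invent a new model

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]  # made-up data, roughly y = 1 + 2x

candidates = [("M1_constant", fit_constant), ("M2_line", fit_line)]
chosen, tried = first_acceptable(candidates, xs, ys, max_rms=0.5)
print("chosen:", chosen)
for name, score in tried:
    print(f"{name}: rms residual = {score:.3f}")
```

The sketch only rejects fits that are too bad. It does not penalize fits that are suspiciously good, which a realistic p(fit_of_model | M) would also do.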

Since the model is the choice of parameters, the choice of priors, and the choice of likelihood all put together, sometimes the updated model just has a different prior and the same likelihood etc. Voila, Bayes theorem actually (informally) tells us we should change our prior after fitting our model, provided that the new prior is one that we might have considered to have some degree of reasonableness before taking the shortcut and writing down M1 to test it first.

5 Responses
1. April 1, 2016

I think for this informal model revision process to be safe (in the sense of avoiding inadvertent data double-use or, let us say rather, inadvertent incorrect conditioning), the model revision ought to satisfy some condition along the lines of, say: the typical set of the prior predictive distribution under M2 needs to both cover and be wider than the one under M1.

• April 1, 2016

You're saying start with the most specific model you can think of and then go broader if it doesn't fit? I'm pretty sure that disagrees with Gelman's approach, which is to start with a broad model and add additional predictors and things as seems necessary to make the goodness of fit reasonable enough.

Perhaps what you meant was that if M2 is M1 with a different prior, then the prior in M2 (the revised model) should be "bigger" than the one in M1? I.e., you can detect when you're being overly sure about your prior information and back off on it?

I think that's probably a good strategy, but I don't think it needs to be the only one.

For example, when fitting a curve through 5 data points, traditionally people have shied away from fitting a curve with 5 parameters, so that the curve goes through all the data. Why? Informally, it's because their likelihood on goodness of fit is such that they think there's zero chance that their model should go through all the data. There should be "some" error, but neither too much nor too little. Kind of like a gamma(2,1) distribution puts mass around 2 and away from both 0 and large values like 6.
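To see what that gamma(2,1) shape looks like in numbers, here is a small sketch. It assumes "gamma(2,1)" means shape 2 and rate 1 (the Stan parameterization), for which the density is x * exp(-x), the mean is 2, and the mode is 1. The function names are just for illustration.

```python
import math

def gamma21_pdf(x):
    """Density of a gamma distribution with shape 2 and rate 1: x * exp(-x)."""
    return x * math.exp(-x) if x >= 0 else 0.0

def gamma21_cdf(x):
    """Closed-form CDF for shape 2, rate 1: 1 - (1 + x) * exp(-x)."""
    return 1.0 - (1.0 + x) * math.exp(-x) if x >= 0 else 0.0

# Density at "no error at all", small error, moderate error, and large error.
for x in (0.0, 0.1, 2.0, 6.0):
    print(f"x = {x:>4}: density = {gamma21_pdf(x):.4f}")

# Probability that the error size lands in a middling range.
mass_middle = gamma21_cdf(4.0) - gamma21_cdf(0.5)
print(f"P(0.5 < error < 4) = {mass_middle:.3f}")
```

The density is exactly zero at zero error and small out at 6, with most of the mass in between, which is the "some error, but neither too much nor too little" shape.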

Similarly, you might say to yourself something like "I'll model the error in this measurement procedure as normally distributed." Then you fit your Bayesian model and realize, based on some goodness-of-fit graphs, that sometimes the errors are really big. So you go back and change your model to include longer tails or, what's kind of the same thing, you change your prior on the degrees of freedom in a t distribution. It used to be that the prior specified dof > 20, and now it's dof ~ 5 to 10: totally disjoint priors, but still required by this informal process.
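A quick simulation shows why lowering the degrees of freedom amounts to "longer tails". This is only a sketch: the cutoff of 4, the sample size, the seed, and the specific dof values 5 and 30 are arbitrary choices for illustration, and the t draws are built by hand from a normal and a chi-square draw using only the standard library.

```python
import math
import random

def student_t_draw(rng, dof):
    """One Student-t draw: a standard normal divided by sqrt(chi-square / dof)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-square with `dof` degrees of freedom is a gamma with shape dof/2, scale 2.
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return z / math.sqrt(chi2 / dof)

def tail_fraction(draws, cutoff):
    """Fraction of draws whose absolute value exceeds the cutoff."""
    return sum(abs(d) > cutoff for d in draws) / len(draws)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
cutoff = 4.0

tail_normal = tail_fraction([rng.gauss(0.0, 1.0) for _ in range(n)], cutoff)
tail_t30 = tail_fraction([student_t_draw(rng, 30) for _ in range(n)], cutoff)
tail_t5 = tail_fraction([student_t_draw(rng, 5) for _ in range(n)], cutoff)

print(f"normal:   fraction beyond {cutoff} = {tail_normal:.5f}")
print(f"t, dof 30: fraction beyond {cutoff} = {tail_t30:.5f}")
print(f"t, dof 5:  fraction beyond {cutoff} = {tail_t5:.5f}")
```

With dof around 30 the t distribution behaves almost like a normal, while at dof 5 "really big" errors become far more common, so the two priors on dof encode quite different beliefs about the measurement process.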

• April 1, 2016

I think the thing that needs to be consistent in order for this to work correctly, is that you have to have a consistent world-view about how well your model OUGHT to fit, if it were the right kind of model. That is, p(fit_of_model | ModelIsCorrect) should be realistic even if it's informal. Since fit_of_model is really a multi-dimensional thing, you're talking about having some reasonably consistent function on a multi-dimensional space. Not always that easy.

If you think it ought to fit perfectly, you're likely to chase the noise and "overfit". But if you believe that you have good insight into how things work, and just hadn't taken into account certain facts that you discover only after seeing the data, then you will correctly modify your model, and it will look like "going back and changing your prior" or whatever (changing your likelihood, changing your model in some way).

Where this works best, most likely, is where the things you're changing are in some sense "meta" to the thing you're trying to find out. It's sometimes a mistake to try to estimate X using a prior on X and some data, then see the data and go back and change your prior on X just because you think your model should fit better. But not always. You might have unintentionally biased your estimate away from reality and, like you say, you should go back and make your prior vaguer so that it's less biased and stops using information that you only thought you had.

But it's not necessarily so bad to use a prior on Q which is a nuisance parameter that is vague, then fit your model, see that it doesn't fit X well, and go back and change what you think about Q to be something more specific, so that your fit on X makes more sense. One reason is if you put a vague prior on Q initially to be conservative, but you find that the "structure" of your model is working very well, it just really needs to know more about Q to get things right.

For example, suppose Q is the speed of sound in some material in your model. You put a prior on it being somewhere close to 3000 m/s and you fit your model. You find that there are certain Q values which cause the fit of the model to be very good except in a certain un-modeled condition that you actually don't care about (say transmission of sound across an interface that is highly mismatched in impedance). So you go in and put a tighter prior on Q and a longer tail on the likelihood to account for the conditions that you don't expect to be well modeled.... that makes good Bayesian sense.

• April 4, 2016

"But it's not necessarily so bad to use a prior on Q which is a nuisance parameter that is vague, then fit your model, see that it doesn't fit X well, and go back and change what you think about Q to be something more specific, so that your fit on X makes more sense. One reason is if you put a vague prior on Q initially to be conservative, but you find that the "structure" of your model is working very well, it just really needs to know more about Q to get things right."

Well, I'm not going to disagree with this, since I have done this exact thing for this exact reason. In my case the joint prior on the interest parameters and the noise process was sufficiently broad that some of the interest parameters were being estimated with apparent near-perfect precision (or actually, the Gibbs sampler was getting caught in sharp local maxima; I could have worked around this technical issue by using a parameter-expanded Gibbs sampler, but the nonsensically sharp local maxima would still have been present in the posterior). It was precisely the case that the model was failing to reflect the fact that '[t]here should be "some" error, but neither too much nor too little.' I put a prior on the noise process that pushed estimated measurement noise away from zero and that fixed both the model and the Gibbs sampler.

No, what I had in mind was not overfitting caused by an overly flexible model but rather straight-up lack of fit.

• April 3, 2016

Another thought I had while doing dishes...

Suppose you initially create a model for some physical process and you leave g, the gravitational acceleration near the earth, as a parameter, with a prior say something like

normal(9.81,0.05);

Now you run your code, get your fit, and realize that in this model, g and some other parameter, maybe the density of a fluid or something, are together unidentifiable.

So you go back and look up some data on g near your location. Say it's Los Angeles, and you can find a table where someone else has done careful experiments on gravitational anomalies due to the local rock structure etc., so you find out that, say, g = 9.793 ± 0.001. So you just eliminate g as a parameter and stick it in as a constant 9.793. Now you can do inference on your fluid density.

You've essentially started with a "broad" prior on g and gone back and put a delta function informed by additional information because of the problem with the fit that you detect after running the model.
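Here is a toy version of that situation. The physical setup is purely an assumption for illustration: I'm pretending the data are hydrostatic pressures, P = rho * g * depth, so that only the product rho * g ever touches the data. The measurements are made up, and a least-squares estimate stands in for the full Bayesian fit.

```python
import math

def predicted_pressure(rho, g, depth):
    """Hypothetical forward model: hydrostatic pressure rho * g * depth (Pa)."""
    return rho * g * depth

depths = [0.5, 1.0, 1.5, 2.0]  # metres, made up

# Two different (rho, g) pairs with the same product rho * g.
pair_a = (1000.0, 9.81)
pair_b = (1000.0 * 9.81 / 9.793, 9.793)

# They predict the same data, so the data alone cannot tell them apart.
same_predictions = all(
    math.isclose(predicted_pressure(*pair_a, d), predicted_pressure(*pair_b, d))
    for d in depths
)
print("identical predictions:", same_predictions)

# Now pin g to the externally measured constant and estimate rho alone.
G_LOCAL = 9.793
pressures = [4890.0, 9770.0, 14665.0, 19540.0]  # made-up measurements (Pa)

# Least squares through the origin for P = rho * (g * depth).
design = [G_LOCAL * d for d in depths]
rho_hat = sum(p * x for p, x in zip(pressures, design)) / sum(x * x for x in design)
print(f"estimated density with g fixed: {rho_hat:.1f} kg/m^3")
```

Once g is a constant, the density has a single best-fitting value. With g left free, every pair sharing the same product would fit exactly as well, and only the prior would separate them.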

This seems totally intelligent and consistent with Bayesian reasoning in the presence of model uncertainty. Admittedly, you brought more external information into the analysis. But the basic concept makes sense: you alter your model after seeing how well it fits, based on a kind of Bayesian meta-probability distribution over the way in which your model should fit (the kind of errors it should make).