Bayesian Probability Distributions, Frequencies, Random Variables, and Model Fitting
Whoo, the title is a mouthful! Again, in discussions with Joseph over at the Entsophy blog some interesting issues in the philosophy of statistics were raised. It has me thinking about the following issue:
Suppose that you have some data $d_i$, some measured covariates $x_i$, and some brainpower. You would like to understand the process that produces the $d_i$ in terms of things you know about the state of the world and things you assume about the state of the world, your model $G$ (I'm using $G$ because we'll want to distinguish between probabilities and frequencies in ensembles). So you are going to use the following formulation:

$d_i = G(x_i; \theta) + \epsilon_i$
Each $\epsilon_i$ is unknown and has a Bayesian probability distribution associated to it. Actually, technically there's a joint distribution over all the $\epsilon_i$. For the moment pretend we haven't thought about it yet.
So, you build your functional form for $G$, and it has some unknown parameters $\theta$ about which you have some general knowledge, say order-of-magnitude estimates, from which you can build priors that approximate that knowledge. Now you need to do some inference on $\theta$ to find out what reasonable values are for those unknown parameters. Using Bayes' theorem, you write down:

$P(\theta \mid D, X) \propto P(D \mid \theta, X)\, P(\theta)$
[Edited: changed this to condition on the observed covariates $X$, and will let the measurement error in $X$ enter via the parameters, with the parameters $\theta$ including the unobservable measurement errors in the measurements of $X$.]
And you're off to the races doing MCMC, right? Not so fast! What the heck is $P(D \mid \theta, X)$? The usual jargon is "the likelihood", and, well, the predictions from $G$ given $\theta$ are well known, so the only thing we're not sure about is how likely we are to see certain differences between the data and the predictions, that is, the $\epsilon_i$. In other words, we need to specify the Bayesian probability distribution on the $\epsilon_i$ values.
Well, first let's consider the "usual" answer. The usual answer is something like "the data are treated as IID draws from distribution 'foo'" (often normal, or binomial, or Poisson, or something else fairly convenient). The next step after that is often some kind of heteroskedastic version, where the $\epsilon_i / \sigma_i$ are all IID and $\sigma_i$ is some kind of known measurement-error scale that comes from instrument calibration data.
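To make the "usual" answer concrete, here's a minimal sketch (the toy model, numbers, and names are all mine, not from the discussion above): a linear model $d_i = a x_i + b + \epsilon_i$ with known calibration scales $\sigma_i$, the heteroskedastic normal likelihood on $\epsilon_i/\sigma_i$, broad order-of-magnitude priors, and a random-walk Metropolis sampler standing in for "off to the races doing MCMC":

```python
import math
import random

random.seed(1)

# Hypothetical toy setup: d_i = a*x_i + b + eps_i, with known per-point
# scales sigma_i (the "instrument calibration" scales).
a_true, b_true = 2.0, 1.0
xs = [i / 10 for i in range(50)]
sigmas = [0.1 + 0.02 * x for x in xs]
ds = [a_true * x + b_true + random.gauss(0.0, s) for x, s in zip(xs, sigmas)]

def log_likelihood(a, b):
    """The 'usual answer': eps_i / sigma_i are IID standard normal."""
    total = 0.0
    for x, d, s in zip(xs, ds, sigmas):
        eps = d - (a * x + b)  # discrepancy between data and the model G
        total += -0.5 * (eps / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
    return total

def log_prior(a, b):
    """Order-of-magnitude knowledge encoded as broad normal priors."""
    return -0.5 * (a / 10) ** 2 - 0.5 * (b / 10) ** 2

def log_post(a, b):
    return log_likelihood(a, b) + log_prior(a, b)

# Random-walk Metropolis over (a, b)
a, b = 0.0, 0.0
lp = log_post(a, b)
samples = []
for step in range(20000):
    a_new, b_new = a + random.gauss(0, 0.05), b + random.gauss(0, 0.05)
    lp_new = log_post(a_new, b_new)
    if math.log(random.random()) < lp_new - lp:
        a, b, lp = a_new, b_new, lp_new
    if step > 5000:  # discard burn-in
        samples.append((a, b))

a_mean = sum(s[0] for s in samples) / len(samples)
b_mean = sum(s[1] for s in samples) / len(samples)
print(round(a_mean, 2), round(b_mean, 2))
```

Nothing in this sketch questions the normal likelihood; it just takes the "usual answer" at face value, which is exactly the move examined below.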
But, from a Bayesian perspective, the joint distribution over the $\epsilon_i$ values is really a belief about how closely the prediction equation will match the data in each circumstance, and it could be pretty much any probability distribution. The "frequentist" notion that there is a "data generating process" like IID normal draws is backwards; the actual direction of implication seems to be (Knowledge about model performance and measurement properties) → (Belief about errors when parameter values are "reasonable", i.e. a mathematical specification of the likelihood) → (Belief about numerical values for "reasonable" parameters, typically obtained through MCMC procedures) → (Ensemble of measured errors under high-probability parameter values; this ensemble will be constrained to look like draws from some single distribution, typically mean 0 and maybe having other Maximum-Entropy-like constraints such as standard deviation, quantiles, or other structure).
The "frequentist influenced" Bayesian method goes around more in a circle: first assume a data generating mechanism (the final distribution above), then use that to produce an "automatic" likelihood (back to the likelihood at the beginning of the above chain), then use the prior and the likelihood to get the high-probability parameter values via MCMC or whatever (you could do the "full frequentist" version of this by leaving off the priors and simply doing maximum likelihood, for example).
In other words, in the fully Bayesian world, depending on what we think the model and measurement devices ought to do (how we specify the likelihood), we will get different distributions on the parameters. Then, when we treat the parameters as random variables and observe the discrepancies, we will get an ensemble of discrepancies which look like IID draws (or dependent draws, if you want to specify some dependency structure) from some distribution, but the distribution they look like draws from need not be the one we used to define the likelihood. We can often use a simple IID "frequency" model as the Bayesian probability model for defining the likelihood and find in the end that the ensemble of errors looks like it came from the IID distribution we assumed; but this is in essence a special case, and that special case is exactly the one that Joseph is complaining about in his "jumped the shark" post.

In other words, there is no "Correct" likelihood that corresponds to actual facts about the world. The "Data Generating Process" doesn't really exist, so we can't be right or wrong about it, and confidence intervals derived from assumptions about the "Data Generating Process" need not have real coverage properties, because the world need not behave as if the "Data Generating Process" assumed by a frequentist statistician were actually correct.
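Here's one way to see the mismatch numerically, in a hypothetical example of my own construction: the analyst assumes IID normal errors (so with a flat prior, the high-probability parameter values are just ordinary least squares), but the world supplies uniform errors. The fitted parameters come out fine, yet the resulting ensemble of discrepancies looks like uniform draws, not normal ones, so the distribution the errors "look like draws from" is not the one used to define the likelihood:

```python
import random

random.seed(2)

# Hypothetical toy: the analyst ASSUMES IID normal errors, but the
# world supplies uniform errors on [-1, 1] instead.
a_true, b_true = 2.0, 1.0
n = 2000
xs = [random.uniform(0, 10) for _ in range(n)]
ds = [a_true * x + b_true + random.uniform(-1, 1) for x in xs]

# Closed-form least squares = maximum of the (misspecified) normal likelihood
x_bar = sum(xs) / n
d_bar = sum(ds) / n
a_hat = sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds)) / \
        sum((x - x_bar) ** 2 for x in xs)
b_hat = d_bar - a_hat * x_bar

# The ensemble of discrepancies under the fitted parameter values
resid = [d - (a_hat * x + b_hat) for x, d in zip(xs, ds)]

# Excess kurtosis: about 0 for normal draws, about -1.2 for uniform draws
m = sum(resid) / n
s2 = sum((r - m) ** 2 for r in resid) / n
kurt = sum((r - m) ** 4 for r in resid) / (n * s2 ** 2) - 3

print(round(a_hat, 2), round(b_hat, 2), round(kurt, 2))
```

The strongly negative excess kurtosis flags the residual ensemble as flat-topped like a uniform distribution, even though every step of the inference was carried out under a normal likelihood.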