# Understanding Data Dependent Priors

There's been a big discussion over at Andrew's blog about data dependent priors. This is something I've been skeptical of in the past, but I was happy to have a chance to revisit it. Here's a summary of what I discovered while arguing for an interpretation of what they mean.

#### What is a data dependent prior?

When you create a Bayesian model you're writing down a joint distribution for the data and the parameters. Typically you write it down in the form p(data | parameters, K) p(parameters | K), where "K" is just a stand-in for the knowledge you have about the way things work. Then you condition on the observed data, sample the parameters, and find out something about them.

The part where you write down p(parameters | K) is often called "the prior", and philosophically there's a tendency to view the whole process as starting with the prior, adding in the data, and seeing what the posterior distribution is. But, mathematically, the joint distribution is really a simultaneous summary of everything you know; there's no before and after. Or, if anything, the K is what you know before, and the joint distribution summarizes what you know after.

So, imagine you have a parameter "s" and you're trying to write down p(s | K). It might often be the case that the parameter s is something that's reasonably approximated by a sample statistic. In this case let's use the standard deviation for concreteness, but it could be anything. You might be tempted to write (in Stan notation):

s ~ normal(sd(data), some_error_here);

And this is a data dependent prior. In other words, you're writing p(s) in terms of a sample statistic sd(data).

#### What's wrong with it?

Well, my conclusion is that nothing is actually wrong with it per se. It's possible to create Bayesian models with a clear, valid interpretation when constructing them this way. But we need an interpretation. The usual interpretation is that the "prior" represents pre-data uncertainty, and the posterior (or, in my terminology here, the joint distribution) represents uncertainty after seeing the data. Since in this case the prior is built using the data, it can't have that interpretation. So under a philosophy where you must have this separation, it's *wrong to mix the data into the prior.*

#### So, when is it OK to construct a data dependent prior?

My conclusion is that it's always OK to do this, provided the model you build uses valid information. When we construct a data dependent prior such as this one, the first thing we need to realize is that the sample statistic is not giving us the actual value of the parameter. So it's wrong to make s a transformed parameter and say s <- sd(data), which has the interpretation s ~ delta_function(sd(data)).

Instead, we need to use our knowledge K to figure out how sd(data) relates to the actual parameter of interest s. Suppose we have data, and it's a large sample drawn with a random number generator from a very large finite population of stuff. Then we're in the easiest case to interpret. We know something about the sampling distribution of the standard deviation for certain analytical distributions of data. Or, we could guess some kind of distribution for the data, sample from it, and approximate the error in s to get a realistic error estimate. In general, the sampling error in the standard deviation of the data decreases like 1/sqrt(N), which you can verify by numerical calcs.
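
As a sanity check on that 1/sqrt(N) claim, here's a quick plain-Python simulation (the sample sizes and the true sd of 2.0 are made up for the demo): each time N goes up by a factor of 4, the spread of sd(data) across repeated samples should drop by roughly a factor of 2.

```python
import math
import random
import statistics

random.seed(0)

true_sd = 2.0  # known population sd, chosen arbitrarily for this check

# For each sample size N, draw many samples and see how much sd(data)
# itself bounces around; the spread should shrink roughly like 1/sqrt(N).
spreads = []
for n in [100, 400, 1600]:
    sds = [statistics.stdev(random.gauss(0.0, true_sd) for _ in range(n))
           for _ in range(1000)]
    spreads.append(statistics.stdev(sds))
    print(f"N={n:5d}  spread of sd(data) ~ {spreads[-1]:.3f}")
```

Each quadrupling of N roughly halves the spread, which is the 1/sqrt(N) behavior.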

Now, suppose we're in a less clear case, like data is a biased convenience sample of the population. Can we come up with some information about how big the bias is likely to be, and in what direction? If so, we can connect the sample standard deviation to the parameter in a different way, typically with a lot more uncertainty. In this case, our model will be different from the model under a random-number-generated sample.

#### Summary

The goal in Bayesian models is to write down a joint distribution over the data and the parameters using our knowledge. Sometimes the easiest way to do that is to write down what we know about how a parameter relates to a sample statistic. But when we do that, we need a model for the sampling process that produced that sample statistic. And when we do this, our model will have the interpretation

p(data | parameters) p(parameters | Knowledge, ValuesOfSampleStatistics)

And our Knowledge will include information about how the sampling works, and how it connects the sample statistics to the parameters.

#### When would we want to do this?

The usual case would be when a parameter is on a fairly arbitrary scale, or when we know a lot more about the sampling process than we do about the parameter's value. It also probably works best when the parameter is a nuisance parameter rather than one you're primarily interested in. One example I can think of is normalization of DNA microarrays. You get measurements of brightness on some arbitrary scale. The more DNA you put on the array, the bigger the general brightness is. There is a lot of noise here, and so the median brightness is a good, though not perfect, proxy for "how much stuff" you have on the array. But because the scale is arbitrary, it would be nearly impossible to have a meaningful prior on how much stuff you have. So it makes sense to say that there's an amount of stuff related to the median, with some wiggle room.

Q ~ normal(median(brightness),median(brightness)/sqrt(length(brightness)));

and now you can normalize by Q and have a realistic uncertainty on your normalizing factor.
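
Here's a tiny sketch of what the numbers going into that prior look like, with simulated brightness data (the lognormal noise model, the true scale factor of 37, and the array size of 500 are all invented for the demo):

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical array: overall brightness is set by an unknown scale
# factor Q; the true value is used only to generate the fake data.
true_Q = 37.0
brightness = [true_Q * random.lognormvariate(0.0, 0.5) for _ in range(500)]

m = statistics.median(brightness)
wiggle = m / math.sqrt(len(brightness))  # the "wiggle room" from the text

# These are the numbers that go into Q ~ normal(median, median/sqrt(N))
print(f"prior for Q: normal({m:.1f}, {wiggle:.2f})")
```

The median lands near the true scale factor, and the wiggle term gives a few percent of uncertainty rather than pretending the normalization is known exactly.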

Similar things arise in other circumstances. It's especially useful when the parameter is more or less a nuisance parameter; in other words, when you're not actually interested in the value of Q per se but rather just need it in order to make the model work so you can get estimates of other, more important unknown quantities.

The interpretation is clear, and what needs to be justified is the model for the uncertainty in the parameter, not whether it comes *before* or *after* the data. You should however be clear that your model is now conditional on your knowledge of the sampling process. In Bayesian models, that is not always the case. This is more or less a case where Frequentist ideas are special cases of Bayesian ones.

PS: in later discussion over at Gelman's blog we came to realize that it is fairly easy to accidentally forget to condition your DATA MODEL on whatever your knowledge is. The WHOLE model needs to be conditional on whatever sample statistic you extracted.

So, in further comments on Gelman's blog, I came to the following example, which suggests that it may always be possible to take a data dependent prior and re-parameterize the model in terms of a parameter that's independent of the data.

Consider the two following priors:

s ~ exponential(1/1000.0); /* I ask my knowledge K what the value of s is, and it says "I have no idea, maybe up to a couple thousand?" */

vs

e ~ normal(1,0.75/sqrt(20));

s <- e*sd(data);

/* I ask my knowledge K what is s, and it says I'm not sure, but based on some simulations of sampling, I can tell you what the multiplicative factor e is such that s = e * sd(data) */

vs.

s ~ normal(sd(data),sd(data)*0.75/sqrt(20));

where the last two models are totally equivalent, but the final one has a "data dependent prior".
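
You can check that equivalence numerically. Here's a plain-Python Monte Carlo sketch using the numbers from the example above, with sd(data) fixed at an arbitrary 3.2 (the value doesn't matter; once computed from the data it's just a constant):

```python
import math
import random
import statistics

random.seed(2)

sd_data = 3.2                 # the sample statistic, fixed once computed
sigma = 0.75 / math.sqrt(20)  # the spread from the example

# Parameterization 1: data-free prior on e, then s = e * sd(data)
s1 = [random.gauss(1.0, sigma) * sd_data for _ in range(50000)]

# Parameterization 2: the "data dependent" prior on s directly
s2 = [random.gauss(sd_data, sd_data * sigma) for _ in range(50000)]

# Both should show mean ~ 3.2 and sd ~ 0.54, up to Monte Carlo noise
print(statistics.mean(s1), statistics.stdev(s1))
print(statistics.mean(s2), statistics.stdev(s2))
```

Both parameterizations induce exactly the same distribution on s, because multiplying a normal(1, sigma) variable by the constant sd(data) gives a normal(sd(data), sd(data)*sigma) variable.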

I suspect there's some proof using techniques from computational linguistics that might let us always replace

a ~ distribution(data,stuff);

with

FOO ~ distribution(other stuff);

b <- Function(FOO,data);

to prove that data dependent priors are always equivalent to models with data independent priors and different or additional parameters. But I'll leave this up to the computational linguists in the crowd 🙂 (@Bob Carpenter?)

Or, at the very least, Bayesians have to consider these cases, where data dependent priors can be converted by reparameterization to be data-free in the priors, as "philosophically legitimate". They may not agree with the substance of the model, but they can't arbitrarily disagree with the form.