"Using the data twice" vs "Incorrect Conditioning"

2016 March 30
by Daniel Lakeland

The advice that "you shouldn't use data-dependent models because they use the data twice," which until a few days ago I mostly believed, is not particularly helpful. Intuitively it comes from the basic problem of estimating something like the mean of a normal distribution with known standard deviation = 1 (thanks to Carlos Ungil on Gelman's blog for making this super clear). Suppose you initially know nothing at all, so you have an improper prior on mu. Then you get one observation. You calculate the mean of this one observation. You know that this one observation should be close to the mean, so you say

mu ~ normal(mean(obs),1); /* this is my "prior"*/

Now you naively write down the likelihood, because that's the usual thing to do:

obs ~ normal(mu,1);

BUT, the model is conditional on knowing mean(obs), and mean(obs) ALWAYS equals obs. So the actual "likelihood" is p(obs | mu, mean(obs)) = 1: we know for sure what obs will be, because we know mean(obs) and there's only 1 value!
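
To see numerically what goes wrong, here is a minimal Python sketch (not from the original discussion; the value of obs and the helper name normal_posterior are made up for illustration). It applies the standard conjugate normal update, treating the data-dependent "prior" and the naive likelihood as if they were two independent pieces of information, and compares that with the properly conditioned result, where the likelihood is constant in mu and the posterior is just normal(mean(obs), 1).

```python
import math

def normal_posterior(prior_mean, prior_sd, obs, obs_sd):
    """Conjugate normal update for one observation with known sd."""
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_sd ** 2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * obs) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

obs = 0.7  # hypothetical single observation, so mean(obs) == obs

# Naive: "prior" normal(mean(obs), 1), then the usual likelihood on top of it.
naive_mean, naive_sd = normal_posterior(prior_mean=obs, prior_sd=1.0,
                                        obs=obs, obs_sd=1.0)

# Properly conditioned: p(obs | mu, mean(obs)) = 1 carries no information
# about mu, so the posterior is just the "prior" normal(mean(obs), 1).
correct_mean, correct_sd = obs, 1.0

print(naive_mean, naive_sd)      # sd shrinks to sqrt(1/2)
print(correct_mean, correct_sd)  # sd stays at 1
```

The naive version reports a standard deviation of sqrt(1/2), about 0.71, instead of 1. In other words, it claims more certainty about mu than a single observation can give.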

So, this mistake has the flavor of "using the data twice," but a better way to think about it is as a failure to condition on what you know. Probability theory doesn't lead you astray when you condition properly. This way of thinking is better because it gives you a far better guideline for handling more complex situations correctly.

Suppose you have a physicsy example: there's, say, a computer simulation where you look at a few thousand molecules for a window of time. You calculate the time it takes for each molecule to do something during this window. You have a model in mind for the high-energy molecules:

t[i] = f(some_global_params, some_observables[i], energy[i]) + err

when energy > 2 on some scale. But you also know that the form of err can vary widely depending on the mean energy of all the molecules. And, by the way, the person who ran the computational experiment didn't tell you what they set the mean energy to.

So, should you put in a "prior model" of what the mean energy is, based on what you know about the experimenter's psychology and what might motivate them to write different numbers on the command line when they start the simulation? Or should you just compute mean(energy) and worry about "using the data twice"? Here, "using the data twice" is not a useful way to think about it. But "conditioning on knowing the mean energy" IS.

And, you know based on your physics model that

err ~ particular_form(mean(energy), particular_constants);

So you write down a correct conditional probability of the t values given what you know:

for(i in 1:N){
  if(energy[i] > 2){
    t[i] ~ particular_form(f(some_global_params, some_observables[i], energy[i]), mean(energy), particular_constants);
  }
}
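
The same conditioning can be sketched as a plain Python log-likelihood. Everything specific here is an assumption made only so the example runs: f is a made-up stand-in for the physics model, particular_form is taken to be a normal distribution, and err_scale (error scale growing with the square root of the mean energy) is hypothetical. The point is only the structure: mean(energy) is computed from all the molecules, and then enters the density of each high-energy t[i].

```python
import math
from statistics import mean

def f(global_param, observable, energy):
    # Hypothetical stand-in for the physics model f(...); not from the post.
    return global_param * observable / energy

def err_scale(mean_energy, constant):
    # Hypothetical: the error scale depends on the mean energy of ALL molecules.
    return constant * math.sqrt(mean_energy)

def normal_logpdf(x, loc, scale):
    z = (x - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)

def log_likelihood(global_param, t, observables, energy,
                   constant=0.5, threshold=2.0):
    """Sum of log p(t[i] | params, mean(energy)) over high-energy molecules."""
    scale = err_scale(mean(energy), constant)  # conditioned on mean(energy)
    total = 0.0
    for t_i, obs_i, e_i in zip(t, observables, energy):
        if e_i > threshold:  # only high-energy molecules are modeled
            total += normal_logpdf(t_i, f(global_param, obs_i, e_i), scale)
    return total

# Made-up data: one low-energy molecule and two high-energy ones.
energy = [1.0, 3.0, 4.0]
observables = [1.0, 2.0, 2.0]
t = [9.9, 1.0, 0.8]
print(log_likelihood(1.5, t, observables, energy))
```

Note that the low-energy molecule never gets a likelihood term of its own, but its energy still matters, because it changes mean(energy) and therefore the error model for everyone else.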

Are you "using the data twice"? Who cares; that's only a heuristic based on naive stats-101-type problems. The real question is: are you conditioning correctly on the things that you really know? And the answer is that, although you've used the full data to calculate mean(energy), you have written down a different p(t[i] | ...) than you would have otherwise, and this is the correct thing to do. You've conditioned properly. I think this is legitimate, even though it's a data-dependent model.

Similar reasoning applies to problems like the normalization of microarray data, which I've discussed, where you know that median(brightness) on an array is related to the amount of material you put on the array as follows:

median(brightness) = some_constant * mols_of_stuff + some_small_error

and you'd like to divide all your brightnesses by (some_constant*mols_of_stuff)

so you can do (for the i'th array)

normalizer[i] ~ normal(median(brightness[i]), some_small_error);

and later do

brightness[i][j] / normalizer[i] ~ my_model_for_whats_going_on(...);
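
Here is a rough Python sketch of those two steps for a couple of arrays. The brightness values and the value used for some_small_error are invented, and the function names normalizer_log_prior and normalized are mine, not part of any library. It only shows the data-dependent prior density for the normalizer and the division that feeds the downstream model; my_model_for_whats_going_on itself is left out.

```python
import math
from statistics import median

def normalizer_log_prior(normalizer, brightness_i, small_error):
    """Log density of normalizer[i] ~ normal(median(brightness[i]), small_error)."""
    z = (normalizer - median(brightness_i)) / small_error
    return -0.5 * z * z - math.log(small_error) - 0.5 * math.log(2.0 * math.pi)

def normalized(brightness_i, normalizer):
    """Brightnesses of one array divided by that array's normalizer."""
    return [b / normalizer for b in brightness_i]

# Made-up brightness values for two arrays.
brightness = [[120.0, 80.0, 100.0], [210.0, 190.0, 200.0]]
small_error = 2.0  # assumed scale of the error in the median relation

for brightness_i in brightness:
    candidate = median(brightness_i)  # one plausible value of normalizer[i]
    print(normalizer_log_prior(candidate, brightness_i, small_error),
          normalized(brightness_i, candidate))
```

In a full model the normalizer would be a parameter that gets sampled, with this prior keeping it close to the observed median, and the normalized values would go into whatever likelihood you wrote down for normalized data.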

Now, have you "used the data twice"? Again, who cares; that's more of a heuristic. Have you conditioned properly on what you know? The fact is that my_model_for_whats_going_on assumes that you have properly normalized everything, and you'd have to write down a totally different model if you hadn't. Therefore, you've conditioned both the prior for the normalizer and the likelihood properly (at least as well as you can given your knowledge; it's ALWAYS possible for your model to be wrong, so you do have to check your models!).
