Many thanks to Carlos Ungil, who in comments at Andrew's blog got me to realize where we have to be careful when using data dependence within a model.

First, is your model dependent on the data? From a practical perspective, if you are using Stan and either passing in some summary statistics as if they were extra data or calculating a summary statistic somewhere in the code, your model is data dependent. And that is true whether or not your prior is data dependent. For example,

s ~ normal(sd(data),sd(data)*0.75/sqrt(N));

is a data dependent prior on a parameter s.

e ~ normal(1,1*0.75/sqrt(N));

s <- e*sd(data)

Removes the data dependence from the prior, but NOT from the model. Later we're going to do

data ~ something(s);

and so now our likelihood is data dependent. If we wrote it down as if the data didn't depend on sd(data), that was a mistake: we wrote down the wrong likelihood.
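As a quick check of this point, here's a small Monte Carlo sketch (the value of N and the data-generating process are made up for illustration) showing that the reparameterized prior on e induces exactly the same distribution on s as the directly data-dependent prior, so the data dependence has only been relocated, not removed:

```python
import numpy as np

# Compare the two parameterizations:
#   (a) s ~ normal(sd(data), sd(data) * 0.75/sqrt(N))      -- directly data dependent
#   (b) e ~ normal(1, 0.75/sqrt(N));  s = e * sd(data)     -- "data-free" prior on e
# By the scaling property of the normal, (a) and (b) give the same
# distribution on s, so the model still depends on sd(data).

rng = np.random.default_rng(0)
N = 50
data = rng.normal(10.0, 3.0, size=N)   # hypothetical data set
sd_hat = data.std(ddof=1)

n_draws = 200_000
# (a) direct data-dependent prior on s
s_direct = rng.normal(sd_hat, sd_hat * 0.75 / np.sqrt(N), size=n_draws)
# (b) prior on e, then the data-dependent transform
e = rng.normal(1.0, 0.75 / np.sqrt(N), size=n_draws)
s_transformed = e * sd_hat

# The two samples agree in mean and spread up to Monte Carlo error.
print(s_direct.mean(), s_transformed.mean())
print(s_direct.std(), s_transformed.std())
```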

If we have data dependence in our model we are effectively writing down a joint distribution

p(data, params | Knowledge, statistics_of(data));
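To spell out the bookkeeping, write T(data) for the statistic and K for the background knowledge; the joint distribution then factors as

```latex
p(\mathrm{data}, \mathrm{params} \mid K, T(\mathrm{data}))
  = p(\mathrm{data} \mid \mathrm{params}, K, T(\mathrm{data}))\;
    p(\mathrm{params} \mid K, T(\mathrm{data}))
```

The second factor is the data-dependent prior. The first factor is the likelihood, and it has to be a distribution only over data sets consistent with the observed value of T(data), which is exactly where an ordinary IID likelihood goes wrong.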

Ungil's example shows where you can go wrong. You initially know nothing about the mean of a normal RNG except its standard deviation. You get one data point. You calculate the mean of this one data point. You build your model:

merr ~ normal(0,1);

m <- mean(data) + merr;

data ~ ????

What's the probability of the data given the mean of one value? It's 1! The likelihood is 1, so just leave it out:

/* data ~ ??? nothing goes here... comment it out */

In these simpler cases you're probably better off not having data dependence at all. But in this case you could "fix" the model by simply leaving one data point out of the likelihood.
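Here's a small simulation of that one-data-point example (the specific numbers are made up) showing that the implied "posterior" for m is just the merr prior shifted to the datum, with the known standard deviation of the RNG playing no role at all:

```python
import numpy as np

# The RNG has known sd = 5 and unknown mean; we observe one draw.
# With m = mean(data) + merr and merr ~ normal(0, 1), the likelihood
# p(data | m, mean(data)) is identically 1 (the single datum *is* the
# statistic we conditioned on), so the "posterior" for m is just the
# merr prior recentered at the datum -- the known sd of 5 never enters.

rng = np.random.default_rng(1)
true_mean, known_sd = 7.0, 5.0
datum = rng.normal(true_mean, known_sd)      # the one data point

merr = rng.normal(0.0, 1.0, size=100_000)
m_draws = datum + merr                       # implied "posterior" for m

# Spread is 1 (the prior's), not the 5 that the correct posterior
# N(datum, 5) would have under a flat prior on m.
print(m_draws.mean(), m_draws.std())
```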

But data dependence can be a necessity, and in fact I was so focused on the kinds of situations I was thinking of using this in that I missed that point. We MUST keep our conditional probability consistent. Thanks Carlos.

So, what cases was I thinking of? Primarily things where we really do have some kind of simulation of what is going on, and if we find out what case/regime we're in we can put very useful real scientific knowledge to work.

EDIT: Below is the original content of this post, but the point is better described in this later post.

Imagine you have a molecular dynamics simulation, and you run it over an extremely wide set of conditions. In each condition there is some quantity you might like to condition on, perhaps the total energy. And you have some parameter Q, maybe something like the fraction of cases in which you initiate a fracture at a particular point.

Now, a person performs an experiment and hands you the data. They can set the total energy to anything they'd like, from a mouse sneezing to a ton of TNT going off... Without knowing the total energy, you can't know what to put for a distribution on Q.

But, with the total energy you get to use real useful information based on hard-won computations. So you do

Q ~ fitted_distribution(sum(energy));

Now your model is data dependent. But it's definitely a better model than if you did something like guess at the total energy in this experiment based on what you think the experimenter did to the dials this morning:

toten ~ exponential(1/1e15);

Q ~ fitted_distribution(toten);

Do you have to make a "correction" in this case? Well, if you were thinking of using a likelihood based on IID sampling of the energy, then yes, because conditional on sum(energy) the energy values are NOT IID.

energy ~ normal(energy_parameter, error_parameter)

would be wrong. But suppose that's not what you're doing; instead, you're fitting a model conditionally to events at different energies:

observed_result[i] ~ some_distribution(some_model_function(Q, energy[i], some_predictors[i], etc));

here, you're assuming

p(observed_result[i] | Q, energy[i], ...) is known and is conditional on Q, but NOT directly on the total energy given Q, even though it is conditional on the individual energy and some other predictors. It's wise to think carefully about this. And, clearly, it's wise to keep the conditional notation in your models clear.
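To make the non-IID point concrete, here's a sketch (hypothetical numbers throughout) that draws normal vectors exactly conditioned on their sum and checks that the components come out negatively correlated, with pairwise covariance -sigma^2/N, rather than independent:

```python
import numpy as np

# For IID normal(mu, sigma) values conditioned on their sum S, the
# conditional distribution is multivariate normal with each component
# having mean S/N and pairwise covariance -sigma^2/N. We get exact
# conditional draws by projecting IID draws onto the hyperplane
# sum(x) = S (valid because, for Gaussians, the residual after removing
# the sum is independent of the sum).

rng = np.random.default_rng(2)
N, mu, sigma = 10, 100.0, 4.0
S = N * mu                      # the observed total energy

reps = 200_000
z = rng.normal(mu, sigma, size=(reps, N))
x = z - (z.sum(axis=1, keepdims=True) - S) / N   # exact draws given sum = S

cov = np.cov(x[:, 0], x[:, 1])[0, 1]  # covariance between two components
print(cov, -sigma**2 / N)             # both near -1.6, not 0
```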