So, why should we believe that it works to put probability distributions on functions of both data and parameters, in a non-generative way? (see my previous example)

Well, one reason is the success of Approximate Bayesian Computation (ABC), in that scheme typically you have a physical simulator, like for example a PDE or an ODE solver, or an agent-based model or something, and if you put in the parameters, the simulator will return you a prediction for the data..... and then, you have data. And the data never matches the predictions exactly, and there's no arguments you can make about IID samples, since typically every measurement in space is tied together by the whole simulation. So what do you do?

The typical method is to calculate several statistics of the simulation and data, and see if they fit together close enough. For example you might calculate the total kinetic energy of some ensemble, calculate the mean and stddev of reflectivity of infra-red over your domain, calculate the density of predators in a certain region... whatever is appropriate for your model, you get some kind of summary statistic (or several of them), and you compare it to the data, and you accept the parameters if the difference between the two is in the high probability region of some distribution (could be just if you're within epsilon, in other words, a uniform distribution).

This method works, in lots of examples, and it is a generalized version of what I did in my previous example. Its success gives confidence to the idea that writing "non generative" models whose whole structure simply puts high probability on parameters when some function of the data an parameters is in the high probability region of a particular distribution you've chosen.

This is a pretty powerful generalization from the typical IID likelihood principle!

7 Responses leave one →
1. August 26, 2015

// it's NOT GENERATIVE, what does it mean?

Well yes, exactly. What is the implied prior on true_len? As you say, the canonical ABC algorithm is: generate parameters from the prior, simulate data, retain corresponding parameters if the simulated data is approximately equal to the observed data (for some operationalization of "approximately equal"). If I wanted to implement that algorithm using your non-generative specification, how would I go about it? I can generate deviates from the priors on the wavelengths easily enough; if I had an explicit prior on true_len I could carry out the data simulations required for ABC in almost exactly the same way that you generate observed data given the hidden true parameters (the "almost" here referring to the t vs. normal issue). But what exactly am I supposed to do with your non-generative component to get myself to simulated data?

2. August 26, 2015

Yes, in the absence of any other specification I think Stan puts a uniform prior on the unconstrained space it uses internally. Let's pretend instead I have an explicit prior on true_len, say uniform(0,1e7) in nanometers (so up to a cm long! definitely covers the real value).

So, you'd generate from true_len prior, generate from rounding error priors, generate from the wavelength priors, etc etc and then calculate the thing on the left hand side and accept it with probability somehow proportional to the right hand side of the various statements

`  (l1[i]-roundl1[i])*wl1/true_len ~ normal(1,sd_scaled);`
• August 26, 2015

There is an inaccuracy in your description of ABC: there would be no sampling of rounding errors from their prior. I find it puzzling that you'd describe generating simulated data for ABC this way, because you do all of the steps correctly when you simulate "observed" data for the stan model.

Specifically, once parameter values from the true_len and wavelength priors have been generated, one can back out the RMS error in distance units and sample the latent non-rounded data; in your stan model this is the real.length.nm* (1+ rt(10,df=dof.t)*tscale piece, although for ABC I guess one would use the normal instead of the t. With samples of wavelengths from their priors, one can then round the latent data appropriately; in your stan model, this is the round(simulated_latent_observations*4*/real.wls)/4 piece. Nowhere in this process are rounding errors sampled directly from a prior distribution.

And now that I've sorted that out, the declarative model statements of the form

(l2[i]-roundl2[i])*wl2/true_len ~ normal(1,sd_scaled);

become clear to me; I confess they were rather opaque to me until I could see how the generative process goes.

• August 27, 2015

Ah yes, I was being too quick to gloss over the details. The underlying non-rounded variate and the observed rounded value determines the roundoff error deterministically, so you could just simulate from the normal (I was using t in the simulation just to point out that in reality our normal assumption never really holds, and that's ok). Alternatively, you could simulate from the roundoff error prior, subtract the roundoff error from the data, and get your simulated underlying normal variate... The process is completely symmetric. Actually, the advantage of doing it that way, is that sampling the latent non-rounded data will require doing a LOT of rejections, at least if you do it the dumb way and say underlying ~ normal(foo,bar) since the variate HAS to land within epsilon of the data according to the roundoff ... for an MCMC approach like Stan uses the inference isn't necessarily a problem, but for an ABC type rejection approach... you'd want to sample from a truncated normal truncated to the roundoff region around the observed.

• August 27, 2015

No, that can't be right. (Or it could be right, but not useful in the cases where ABC is needed, to wit, models with intractable likelihoods that can be simulated forward.) If we start with the observed data and simulate back up the prior hierarchy, then we won't have *any* rejections -- the "simulated" data are identical to the observed data by construction.

• August 27, 2015

The analogy with ABC is just to show that you can get good inference when you put information into the analysis in the form of distributions on functions of the data and the priors. After all, if you calculate the mean brightness in the infra-red over a simulated domain of plant growth... you're calculating a big function of the priors (since putting in the priors gets you a forward simulated data field). Accepting and rejecting on the match for that function with the data you have from a sattelite, is basically the same as saying:

(meas+roundoff)*wavelen/trueval ~ normal(1,0);

in both cases you're calculating a function of data and parameters, and putting a distribution on it. Usually the ABC distribution is just uniform in an interval maybe, but it's still a distribution.

3. August 27, 2015

As for simulating back up the tree from the data... you're right, that's not applicable when you have only a forward process...

Suppose you're trying to do ABC on the laser example.

You choose from the prior for true_len, and wavelength.
You simulate a latent normal(true_len,sigma_given*true_len);
You reject if observed - round(latent*4/wavelength)/4 isn't 0.

That's obviously terribly inefficient.

But it is putting a kroneker-delta-function distribution on the function of observed values and parameters such as true_len, and wavelength when projected onto a grid of rounded possibilities.