# Why put distributions on functions/expressions?

Amid all this declarative vs generative modeling discussion, you might well ask: why would you want to put distributions on expressions involving data and parameters?

And the answer I'd give to that is that sometimes it's more natural. For example, with the roundoff error in the laser example (data rounded to the nearest 1/4 wavelength), the alternative is to write a likelihood on a discrete grid of rounded results. If you know that x is rounded to the nearest 0.25, then its underlying value fell somewhere between -0.125 and +0.125 relative to its reported value. In Stan:

```stan
increment_log_prob(log(normal_cdf(x + 0.125, mu, sigma)
                       - normal_cdf(x - 0.125, mu, sigma)));
```

But what if you knew not just that it was rounded, but, for example, you had some second measurement which, with noise, gives you additional information about the underlying value for some of the measurements? Now, to handle this in the above model you have to calculate a weighted integral of the normal_pdf instead of an unweighted one (which is what the normal_cdf difference is).

Alternatively, you could put a prior on the roundoff error, and express your probability over the result of subtracting the roundoff error. For example, suppose that in this particular case the reported value overshot the underlying value by about 0.08 (but no more than 0.125):

```stan
real<lower=0, upper=0.125> err;

err ~ exponential(1 / 0.08);
(x - err) ~ normal(mu, sigma);
```

If you marginalize away the variable err, by calculating

p(x | mu, sigma) = integral of p(err) * normal_pdf(x - err | mu, sigma) d err,

then with a uniform prior on err over [-0.125, 0.125] you recover (up to a constant) the above model with the normal_cdf. But if you use the informed prior, you get a model which is, marginally, a weighted integral of the normal_pdf over the range [0, 0.125]. Good luck calculating that in the likelihood; it will require doing numerical integration inside the increment_log_prob statement!
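A quick numerical sanity check of that claim, sketched in Python (x, mu, sigma are made-up values; the truncated exponential plays the role of the informed prior above):

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def marginal(x, mu, sigma, prior, lo, hi, n=20000):
    # midpoint-rule integral of prior(err) * normal_pdf(x - err) over [lo, hi]
    h = (hi - lo) / n
    return h * sum(prior(lo + (i + 0.5) * h)
                   * normal_pdf(x - (lo + (i + 0.5) * h), mu, sigma)
                   for i in range(n))

mu, sigma, x = 0.0, 1.0, 0.3

# Uniform prior on err in [-0.125, 0.125]: the marginal is the cdf-difference
# likelihood, up to the constant prior density 1/0.25.
unif = marginal(x, mu, sigma, lambda e: 1.0 / 0.25, -0.125, 0.125)
cdf_diff = normal_cdf(x + 0.125, mu, sigma) - normal_cdf(x - 0.125, mu, sigma)

# Informed (truncated exponential) prior on err in [0, 0.125]: the marginal is a
# weighted integral of the pdf, with no closed form in terms of normal_cdf.
rate = 1.0 / 0.08
Z = 1.0 - math.exp(-rate * 0.125)  # truncation normalizer
weighted = marginal(x, mu, sigma,
                    lambda e: rate * math.exp(-rate * e) / Z, 0.0, 0.125)
```

With the uniform prior, `unif * 0.25` matches `cdf_diff` to numerical precision; with the informed prior, `weighted` is a genuinely different number that only quadrature (or a latent variable) will get you.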

But why else would you want to do this? Another reason I can think of is that often, at least in the physical sciences, we can decompose the error into a variety of sources. For example, in the laser molecular measurement example we had two sources: molecular vibrations and roundoff. But in a moderately famous neutrino experiment from a few years back, you had what people thought were faster-than-light neutrinos. It turned out, if I remember correctly, that this was due to an error in the model for the errors! Specifically, if you have a signal traveling through the earth at the speed of light, and then hitting a detector, and the detector is connected to an amplifier, and the amplifier is connected to a signal processor, and the signal processor is connected to a computer network, and the leg-bone is connected to the hip-bone... Then at each stage there is some kind of systematic error in the signal which you can model via various functions and prior information. But at the end, you're expecting (but not requiring!) that the signal couldn't have gotten there faster than light.

```stan
// a little remaining statistical noise
(signal_speed - my_complicated_error_model) / speed_of_light ~ normal(1, 0.00001);
```

Or suppose that, from previous experiments, you know the noise is a little skewed or something:

```stan
// a bit of non-normal noise
(signal_speed - my_complicated_error_model) / speed_of_light ~ gamma(appropriate, constants);
```

The salient fact is that after removing the systematic errors, the remaining result should be just about exactly the speed of light. If, however, you use a highly informed prior for some portion of the error model, you can wind up overwhelming this distribution, and getting samples that predict faster-than-light travel. If your error model is correct, then you've discovered something! And if you are less certain about your error model and express that uncertainty appropriately, then the knowledge that the signal's speed is almost exactly the speed of light will help you discover the true value for that component of the error.

Of course, in most cases you could write these models in a generative way. But we've seen in the above example the generative model might very well require doing weighted numerical integration in the likelihood function. The distribution that is implied by the above statement on the signal speed could well be extremely strange, especially if my_complicated_error_model has all sorts of informed priors on multiple parameters involved in it.

Remember, for independent a and b, the distribution of a+b is the convolution of the distribution for a and the distribution for b.

Suppose your error model has multiple terms, each of which is an informed prior:

```stan
a - (b + c + d + e + f + g) ~ gamma(q, p);
```

This means that the appropriate convolution of all those informed distributions for b, c, d, e, f, g with the distribution for a is a gamma distribution with some constants q, p you've gotten from previous experiments:

convolution_of_a_b_c_d_e_f_g = gamma_pdf(q,p)

What does that mean for the probability of a? Good luck writing that down in a simple likelihood. OK, so perhaps you can figure it out; the point is that if the only reason you aren't willing to put a distribution on the left-hand side is that it's not generative... then you're wasting a lot of time on a requirement that isn't a real requirement! Much better to simply state the fact that you know, and let Stan find the samples for a.
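Generatively, though, the declarative statement is easy to sample from. Here is a minimal Monte Carlo sketch in Python (three hypothetical components b, c, d instead of six, and every constant made up) showing that drawing each informed component plus a gamma residual yields samples of a whose density is exactly that awkward convolution:

```python
import random

random.seed(1)
q, p = 2.0, 3.0  # hypothetical gamma shape and rate "from previous experiments"

def draw_a():
    # informed priors on the error components (all constants invented)
    b = random.gauss(0.5, 0.1)          # mean 0.5
    c = random.expovariate(4.0)         # mean 0.25
    d = random.uniform(0.0, 0.2)        # mean 0.10
    resid = random.gammavariate(q, 1.0 / p)  # a - (b + c + d) ~ gamma(q, p)
    return resid + b + c + d

samples = [draw_a() for _ in range(100_000)]
mean_a = sum(samples) / len(samples)

# The mean of the convolution is just the sum of the means...
expected_mean = q / p + 0.5 + 0.25 + 0.10
# ...but the full density of a is the convolution of four different families,
# which you would not want to write down as an explicit likelihood.
```

Sampling forward like this is trivial; the point is that the implied marginal density of a, which an explicit likelihood would need, is not.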

"But we've seen in the above example the generative model might very well require doing weighted numerical integration in the likelihood function."

Nah. Once we've got the generative model down, we can manipulate it to find probabilistically equivalent models however we please. For instance, we can pretty much always replace those integrated likelihood functions with some two-stage hierarchical model.
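For instance, the integrated roundoff likelihood from earlier can be replaced by explicitly sampling the latent error. A hedged Python sketch (hypothetical constants, reusing the truncated-exponential prior from above) showing the two-stage version agrees with the integrated version:

```python
import math
import random

random.seed(0)
mu, sigma, x = 0.0, 1.0, 0.3
rate, hi = 12.5, 0.125  # truncated-exponential prior on the roundoff error

def std_normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def draw_err():
    # rejection-sample the exponential truncated to [0, hi]
    while True:
        e = random.expovariate(rate)
        if e <= hi:
            return e

# Two-stage (hierarchical) version: draw the latent err, evaluate the data density.
mc = sum(std_normal_pdf((x - draw_err() - mu) / sigma) / sigma
         for _ in range(100_000)) / 100_000

# Directly integrated version of the same likelihood, by midpoint quadrature.
Z = 1.0 - math.exp(-rate * hi)  # truncation normalizer
n = 10_000
h = hi / n
quad = h * sum((rate * math.exp(-rate * (i + 0.5) * h) / Z)
               * std_normal_pdf((x - (i + 0.5) * h - mu) / sigma) / sigma
               for i in range(n))
```

The Monte Carlo average over the latent error and the explicit integral land on the same number, which is the sense in which the two-stage model replaces the integrated likelihood.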

So implementation details aren't really the issue; the question is whether it's safe to go straight to the declarative model form or not. Maybe you're more practiced at it, or just smarter than me, but when I've tried it I've often found myself writing down inconsistent models because I'd over-determined some model component.

(In the neutrino experiment, systematic errors were being generated by a loose cable and a mis-calibrated clock.)

Well, I'm not willing to say I'm smarter than you, and I value your perspective a lot here. My perspective is this:

1) Declarative models seem natural to me. This probably comes from my background.

2) People who I respect find them confusing. Andrew even called putting a prior on roundoff errors "a hack" and a bad idea, referencing vague "statistical principles". Andrew and Michael Betancourt even went so far as to say that it wouldn't work unless there was actually uniformly distributed noise added to the signal (their intuition was flat-out WRONG, but nevertheless, apparently this kind of method is confusing).

3) My approach turned out to be correct, but the resistance made me worry that it was only accidentally correct, and that declarative models are in general somehow "dangerous" because they can be more likely to be seriously wrong.

So I set out to figure out what I'm doing in a way that would either convince me that it made sense, or convince me that there was a kind of special case where I happened to be right but in general it was a bad idea.

I think I've convinced myself that if you believe in Cox's theorem as the justification of Bayesian probability, then it makes perfect sense to do algebra on your equations until you solve for something you have knowledge about, and to declare that knowledge in the form of a probability distribution. I think the success of ABC isn't accidental, and ABC isn't a hack: ABC uses declarative statements about summary statistics of the simulation and the actual data to determine whether parameters are probable, downweighting parameters that don't fit the summary statistics. It's inherently declarative, since the summary statistics of the output are linear or nonlinear functions of the parameters and data.

If you need to generate fake data, you wind up generating intermediate values, and doing the algebra or root finding or whatever to back out the data. So long as the data doesn't appear too nonlinearly in your function

f(data,params) ~ distribution(other,params)

this isn't really too hard: you just generate f values and solve for the data given the params.

(f(data,params) = distrib_variate) ----grind-grind-grind---> data
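As a concrete sketch of that grind (hypothetical f and constants, in Python): draw the distributional variate, then bisect to back out the data value. This works whenever f is monotone in the data:

```python
import random

random.seed(42)

# Hypothetical declarative statement:  f(x) = x + 0.1 * x**3  ~  normal(mu, sigma)
def f(x):
    return x + 0.1 * x ** 3

def invert(target, lo=-100.0, hi=100.0, tol=1e-10):
    # f is strictly increasing (f' = 1 + 0.3 x^2 > 0), so bisection
    # finds the unique x with f(x) = target: grind-grind-grind.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu, sigma = 1.0, 0.5
# generate f values, then solve for the fake data given the params
fake = [invert(random.gauss(mu, sigma)) for _ in range(1000)]
```

By construction, pushing the fake data back through f recovers normal(mu, sigma) variates, which is the declarative statement we started from.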

Finally, I think this method works well with dimensionless models. In the laser example, what's going on is that I'm using the mean molecular length as the length unit: (data + roundoff) * wavelength is our estimate for the length in meters, and dividing by the molecular mean length puts it in dimensionless form. I can then say that, measured in those units, the length is a normal(1, observed_sd) variate. The constant 1 is there as part of the definition of what the proper denominator true_len is.

This makes good sense if you had molecular dynamics simulations that gave you the distribution of the molecular lengths in abstract dimensionless quantities, like units of, say, atomic interaction radii. You could use those simulations to fit a new distribution (in my example it'd be the t distribution) and it would carry through directly.

"...the summary statistics of the output are... functions of the parameters and data."

But in ABC, the summary statistics are functions of only the data, not the parameters. Otherwise it would be impossible to compute the closeness of simulated statistics to observed statistics.

My perspective on ABC is the opposite of yours: ABC is all about generative models, and is useful precisely when the likelihood is intractable and it would be impossible to make use of declarative claims.

It is definitely the case that I'm not used to thinking about dimensionless quantities. I was trained to use them in chemical engineering undergrad, but that was a long time ago and I haven't used them since.

If you put in a set of parameters, the forward model will deterministically give you a set of simulated data; therefore those values are functions of the parameters. You calculate summary statistics of the simulated data, and compare these summary statistics to the summary of the real data. It's this "distance" between the simulated summary and the actual summary that you are putting a distribution on when you do your accept/reject.

Another way to say it: in ABC you are making the following declarative statement when you do a rejection:

```
distance(summary(fakedata(parameters)), summary(real_data)) ~ uniform(0, epsilon)
```
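To make that concrete, here's a toy ABC rejection sampler in Python (all the numbers are invented: a normal mean with known sd, a flat prior, and the sample mean as summary statistic). The accept step is literally the uniform(0, epsilon) statement above:

```python
import random

random.seed(0)

# Hypothetical setup: infer the mean of a normal with known sd = 1.
true_mu = 2.0
real_data = [random.gauss(true_mu, 1.0) for _ in range(50)]

def summary(data):
    # sample mean as the summary statistic
    return sum(data) / len(data)

def fakedata(mu):
    # the forward (generative) model
    return [random.gauss(mu, 1.0) for _ in range(50)]

epsilon = 0.2
accepted = []
while len(accepted) < 100:
    mu = random.uniform(0.0, 4.0)  # prior draw
    # reject unless distance(summary(fake), summary(real)) falls in [0, epsilon)
    if abs(summary(fakedata(mu)) - summary(real_data)) < epsilon:
        accepted.append(mu)

posterior_mean = sum(accepted) / len(accepted)
```

The accepted draws approximate the posterior for mu; shrinking epsilon tightens the approximation at the cost of more rejections.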

Yeeeesssss, but this is not how one usually speaks of functions...

The summary function is

f:data_space -> summary_space

and not

f: (data_space x parameter_space) -> summary_space

It's true that the data we put in is generated via

g:(random_seed_space x parameter_space) -> data_space

but the composition h = f(g(.))

h:(random_seed_space x parameter_space) -> summary_space

is conceptually distinct from f. So you see how I could get confused...