# Nonlinear functions on the left hand side and all that jazz

In my posts about assigning distributions to functions of data and parameters, I mentioned the tried and true example of trying to apply a distribution to a nonlinear function of a parameter:

log(foo) ~ normal(0,1);

In Stan at least, this does NOT imply that samples of foo have a lognormal frequency distribution, for that you have to take into account the differential compression that the log function applies to the dfoo intervals. This is always true when you transform a particular set of n parameters through nonlinear transforms into n other parameters (linear transforms imply a constant multiplier correction which is then normalized out) the differential volumes transform according to the determinant of the Jacobian of the transform in the vicinity of the volume interval.

But, why is the same thing not true when we do:

lfoo ~ normal(0,1); foo <- exp(lfoo)

which, if you write those statements in the appropriate places (model, and transformed parameters) in a Stan program, will give you a sample foo from a lognormal distribution? (alternatively you could sample lfoo, and exp transform in R after you grab the sample).

And, when you make a statement like

f(data,parameter_vector) ~ distribution(other_params)

The parameter_vector may have many elements, and therefore, there is no Jacobian of the transform (specifically the Jacobian isn't a square matrix and so has no determinant). But even if you are using only one parameter, and could calculate the distortion of the differentials, would you want to?

For the first question, about the samples of lfoo and their exponential, the answer is that we're transforming fixed numbers that Stan spits out which have the proper normal(0,1) frequency distribution thanks to Stan ensuring that fact in the lfoo space. If we specify instead that foo is our parameter, and log(foo) ~ normal(0,1) Stan is trying to enforce a frequency in the foo space, it calculates p(log(foo)) dfoo and says this is the frequency of samples in the region dfoo. If this is what you meant, then FINE, but if you meant for foo to have a lognormal distribution, you need to match the space in which you're calculating the density to the space in which Stan is calculating the probability: so you need to specify log(foo) ~ p(log(foo))dlog(foo)/dfoo

For the second question, about whether you'd do a transformation when you say f(data,params) ~ distribution(other,params), the answer is no, you don't want to transform anything, and the reason comes from the mathematics. The statement about f(data, parameters) is a conditional probability:

p(f(data,params) | params) = distribution(other_params)

This isn't a statement about the frequency of "params" in your Stan sample, since "params" is a given vector (it's on the right hand side of the vertical bar), it's a statement about where the f function values should be if the params are correct... and since the parameters are the only things that change during the simulation process, not the data, it's ultimately a statement about which params values result in transformed data -> f values that you consider probable.

Like a likelihood that is generated by an iid sampling assumption, the best way to think about this statement is that it's actually a weighting function for the prior distribution over the parameters. We start with a basket of possible parameter values called the prior, and we reject any values (downweight the density) of the parameters which result in f(data,params) being in the low density region of the given distribution. In the same light, we could write

data ~ distribution(param1,param2)

which is a generative statement of the likelihood. This is a special case, it corresponds to:

id(data) ~ distribution(param1,param2)

where "id" is the identity function and says that we should downweight the priors over param1 and param2 whenever the data values themselves are in a region of "distribution" which is improbable given the parameters.

Restricting yourself to generative models is a lot like restricting yourself to separable equations.

solve for ... sorry no can do. The best we can do is invent a name for a function which is the solution to this equation, and come up with a way to calculate it. Incidentally, the function is called the "Lambert W" and .

Similarly, if we have a mathematical model in which we can specify the probability distribution of parameterized transformations of the data, we can either write down that fact:

roundoff ~ uniform(-0.5,0.5) (x+roundoff) ~ normal(mu,sigma);

or we can invent a special likelihood to put it in a generative (separated) form:

increment_log_prob(normal_cdf(x+0.5,0,1)-normal_cdf(x-0.5,0,1))

But in many ways enforcing the "generative" style will induce you to work with more limited sets of models, because sometimes the likelihood you need to invent is going to be very complicated, and that's particularly true when the left hand side involves several parameters. For example if we know a little more about some of the roundoff errors, we'll need to do a weighted integral of normal_pdf weighted by the information we know about roundoff. There is no hard reason why you need to specify your model in this way. In my opinion, instead of trying to work your model into a form where you can say

data ~ my_distribution(other,parameters)

You should think hard about what you know, and make appropriate probability statements. If that results in

f(data,params) ~ my_distribution(other,parameters)

then so be it. Just be aware that your "fact" needs to be both a true (approximate, probabilistic) fact about the world, and informative. If there are a LOT of different parameter values that can make the left hand side be in the high probability region of my_distribution, then you won't be downweighting much of the parameter space... and you won't find out much about your parameters. The ultimate version of this is if you choose a function f(data,params)=constant. Then "params" will be totally uninformed by this statement.