Once more about Stan and Jacobian warnings

2017 May 30
by Daniel Lakeland

In Stan, if you have an expression that looks like

foo ~ distrib(a,b,c);

Where foo is either a function, or a transformed parameter variable, and distrib is some distribution, Stan will complain about you needing to add a log-Jacobian to the target.

DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If so, you need to call target += with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:

Now, if foo is directly a transformed single parameter, then in fact, Stan is correct. But if foo is the result of applying a function to several parameters together, collapsing an N dimensional space down to 1, then there does not exist a Jacobian

for example:

exp(q/r) ~ distrib(a,b,c)

where q and r are both parameters. There is one function, but two parameters, q,r. So the Jacobian matrix is 1x2 and has no determinant. How can you understand what this means?

Consider the prior on q,r perhaps you've said
q ~ normal(qguess, qerr);
r ~ normal(rguess, rerr);

and you thought to yourself "Ok there's the prior for q and r", but then you wrote down:

exp(q/r) ~ distrib(a,b,c)

and thought to yourself what? Something like "there's the prior for exp(q/r)" which is kind of true. But exp(q/r) is not a parameter, it's a deterministic function of two parameters q,r. On the other hand, we're multiplying them all together, so the actual prior for q,r is

normal_pdf(q,qguess,qerr) * normal_pdf(r,rguess,rerr) * distrib_pdf(exp(q/r),a,b,c)/Z

that is, "exp(q/r) ~ distrib(a,b,c)" declares a weighting function on values of q,r which further downweights those regions of q,r space that cause exp(q/r) to fall outside the high density region of distrib(a,b,c) (and you need the Z to re-normalize after the reweighting, but note that in MCMC you typically don't need the normalization constant so you're ok there).

This "declarative" technique produces an intentional "tilt" on the prior for q,r that helps you create a more complex and nuanced prior. In particular, when there are various ways to get your function to be near the high probability region for distrib, you'll wind up with a non-independent prior over the parameters q,r you might have a ridge along some curve, or several modes or whatever. This is good when you really do have information that q,r should lead your function to be in the high probability region of distrib.

To generate from this model, you'd need to first generate according to the simple prior, and then reject on the basis of the weighting function, so for example

q = normal_rng(qguess,qerr);
r = normal_rng(rguess,rerr);
f = exp(q/r)
w = distrib_pdf(f,a,b,c)/distrib_pdf_max_pdf_value
p = runif(0,1);
if(p < w) accept else reject

2 Responses leave one →
  1. Daniel Lakeland
    May 30, 2017

    Note that if you have a proper prior on each of q,r then when you use the weighting function approach and the density used for weighting is bounded then you are guaranteed to still have a proper prior.

    However, if you have an improper prior on q,r (such as you just didn't declare anything for q,r and use Stan's default) then adding the weighting function is not guaranteed to give you a proper prior. Consider

    (q+r) ~ normal(0,1);

    with q,r having the Stan default flat prior. as long as q=-r we have q+r =0 which is in the high probability region of normal(0,1) and so we have an infinite swath along this line that satisfies this weighting function.

    As soon as you put a proper prior on q and on r, then eventually they individually head outside their own little region of plausibility and you wind up with a normalizable density that has a ridge along the line q=-r

    This technique is extremely useful, and can help you set up good priors. It's particularly useful for priors on the coefficients in basis expansion of functions. There you typically know nothing about the individual coefficients, what you know about is the *properties of the function*. So you can put very broad proper priors on the individual coefficients, and then calculate the properties of the function and weight them in order to form an informed prior on the coefficients.

  2. Daniel Lakeland
    May 30, 2017

    Also, it's perfectly reasonable to use other kinds of weighting functions. Any bounded positive function can be multiplied into the prior (or the log of any bounded positive function can be added to the target). For example you could say

    target += log(inv_logit(q/r-1))

    which since inv_logit(x) = 1/(1+exp(-x)) is the same as a slightly more numerically stable:

    target += - log1p_exp(-(q/r-1))

    to say that the region of q,r space in which q is bigger than r are upweighted, and where q is less than r are downweighted. Again, provided q,r have a proper prior, this weighting function simply describes a form of dependence between the q,r and gives more nuanced information about the q,r space.

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS