# Informative Priors from "masking"

There was a discussion on Andrew Gelman's blog about creating informative priors using "quantities of interest", which are basically functions of the parameters.

His example included some code that sets a prior over the parameters a and b by combining several functions of a and b:

```stan
model {
  target += normal_lpdf(y | a + b * x, sigma);   // data model
  target += normal_lpdf(a | 0, 10);              // weak prior information on a
  target += normal_lpdf(b | 0, 10);              // weak prior information on b
  target += normal_lpdf(a + 5 * b | 4.5, 0.2);   // strong prior information on a + 5*b
}
```

Andrew then went on to say: "You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan."

I disagreed with this idea of "fixing the Jacobian". Deep in the comments I discussed how this works, and how to understand it versus how to deal with "Jacobian corrections" when describing priors in terms of probability measures on a target space. Whether you need a Jacobian is determined by the role that the information plays: whether you have absolute knowledge of a probability measure, or relative knowledge of how much an underlying probability density should be modified (i.e. "masked", if you're familiar with using masks in Photoshop). I thought I'd post it here so I can refer to it easily:

The first thing you have to understand is what a Jacobian correction is for.

Essentially a Jacobian “correction” allows you to sample in one space A in such a way that you induce a particular given known density on B when B is a known function of A.

If B = F(A) is a one-to-one and onto (invertible) mapping and you know what density you want B to have, pb(B), then there is only one density you can define for A which will cause A to be sampled so that B has the right density. Most people might work out that since B = F(A), the density should be pb(F(A))… but instead…

To figure this out, we can use nonstandard analysis, in which dA and dB are infinitesimal numbers, and we can do algebra on them. We will do algebra on *probability values*. Every density must be multiplied by an infinitesimal increment of the value over which the density is defined in order to keep the “units” of probability (densities are probability per unit something).

We want to define a density pa(A) such that any small set of width dA at any given point A* has total probability pb(F(A*)) abs(dB*), where dB* is the width of the image of that set under F.

That is, we have the infinitesimal equation:

pa(A) dA = pb(F(A)) abs(dB)

Solving for pa(A):

pa(A) = pb(F(A)) abs(dB/dA)

If we said pa(A) = pb(F(A)), we’d be wrong by a factor of the derivative dB/dA = dF/dA evaluated at A, which is itself a function of A. The absolute value ensures that everything remains positive.

abs(dB/dA) is the Jacobian “correction” to the pb(F(A)) we derived naively at first.
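As a rough numerical check of this formula, here is a minimal sketch. The transform F(A) = exp(A) and the Exponential(1) density for pb are assumptions picked only for illustration, and `integrate` is a hypothetical helper, not anything from the original discussion. It compares the probability assigned to an interval of A with the probability of the corresponding interval of B.

```python
import math

def F(a):
    # Assumed example transform B = F(A) = exp(A): one-to-one from the reals onto B > 0.
    return math.exp(a)

def dF_dA(a):
    # Derivative dB/dA of the assumed transform.
    return math.exp(a)

def pb(b):
    # Assumed known density on B: Exponential(1).
    return math.exp(-b) if b > 0 else 0.0

def pa_naive(a):
    # Naive guess: plug F(A) into pb and stop there.
    return pb(F(a))

def pa_corrected(a):
    # pa(A) = pb(F(A)) abs(dB/dA)
    return pb(F(a)) * abs(dF_dA(a))

def integrate(f, lo, hi, n=20000):
    # Midpoint-rule approximation of the integral of f over [lo, hi].
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

a_lo, a_hi = -1.0, 0.5
prob_b = integrate(pb, F(a_lo), F(a_hi))            # probability that B is in [F(a_lo), F(a_hi)]
prob_a_corrected = integrate(pa_corrected, a_lo, a_hi)
prob_a_naive = integrate(pa_naive, a_lo, a_hi)

# Total mass over a wide range of A (wide enough to hold nearly all of the corrected density).
total_corrected = integrate(pa_corrected, -20.0, 5.0)
total_naive = integrate(pa_naive, -20.0, 5.0)

print(prob_b, prob_a_corrected, prob_a_naive)
print(total_corrected, total_naive)
```

Under these assumptions the corrected density gives the interval of A the same probability that pb gives its image in B, and it integrates to about 1. The naive version does neither.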

---

So a Jacobian correction applies when

1) you know the density in one space

2) you want to sample in a different space

3) there is a straightforward transformation between the two spaces

In Andrew’s example, this isn’t the case. We are trying to decide on a probability density *in the space where we’re sampling* A, and we’re not basing it on a known probability density in another space. Instead we’re basing it on

1) Information we have about the plausibility of values of A, based on examining the value of A itself: p(A)

2) Information about the relative plausibility of a given A value after we calculate some function of A: pi(F(A)). As I said, this is a kind of “mask”.

Now we’re defining the full density P(A) by taking some “base” density, little p, p(A), multiplying it by a masking function pi(F(A)), and then dividing by Z, where Z is a normalization factor (the integral of p(A) pi(F(A)) over A). So the density on space A is *defined* as P(A) = p(A) pi(F(A)) / Z

Notice how if p(A) is a properly normalized density, then pi(F(A)) K is a perfectly good mask function for any positive constant K, because the K value is eliminated by the normalization constant Z, which changes with K. In other words, pi tells us only *relatively* how “good” a given A value is in terms of its F(A) value. It need not tell us any “absolute” goodness quantity.
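A small numerical sketch of this point follows. The base density (Normal(0, 1)), the function F(A) = A**3, the bump-shaped mask, and the helper name `masked_density` are all assumptions made up for illustration. The sketch normalizes p(A) pi(F(A)) K on a grid for two values of K and compares the results.

```python
import math

def base(a):
    # Assumed base density p(A): Normal(0, 1).
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)

def F(a):
    # Assumed quantity of interest.
    return a ** 3

def mask(f, K=1.0):
    # Assumed mask pi(F) times an arbitrary constant K: a bump around F = 1.
    return K * math.exp(-0.5 * ((f - 1.0) / 0.5) ** 2)

def masked_density(K, lo=-6.0, hi=6.0, n=4001):
    # Grid approximation of P(A) = p(A) * K * pi(F(A)) / Z; returns grid, P, and Z.
    h = (hi - lo) / (n - 1)
    grid = [lo + i * h for i in range(n)]
    unnorm = [base(a) * mask(F(a), K) for a in grid]
    Z = h * sum(unnorm)   # Riemann-sum approximation of the normalization factor
    return grid, [u / Z for u in unnorm], Z

grid, P1, Z1 = masked_density(K=1.0)
_, P2, Z2 = masked_density(K=1000.0)

max_diff = max(abs(p - q) for p, q in zip(P1, P2))
print(Z1, Z2, max_diff)
```

Z scales with K, and the normalized density P(A) is the same for both values of K up to floating-point rounding.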

Should “how much we like A” depend on how many different possible A values converge to the region F(A)? I think this seems wrong. If you have familiarity with Photoshop, think of it like this: you want to mask away a “green screen”. Should whether you mask a given pixel depend on “how green that pixel is” or on “how many total green pixels there are”? The *first* notion is the masking notion I’m talking about: it’s local information about what is going on in the vicinity of A. The second notion is the probability notion: how many total spatial locations get mapped to “green”, that is, the “probability of being green”.

For example pi(F(A)) = 1 is a perfectly good mask, it says “all values of F(A) are equally good” (it doesn’t matter what color the pixel is). Clearly pi is not a probability density, since you can’t normalize that. You could also say “A is totally implausible if it results in negative F(A)” (if the pixel is green don’t include it) so then pi(F(A)) = 1 for all F(A) >= 0 and 0 otherwise is a good mask function as well. It’s also not normalizable in general.
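Both of these masks can be tried on a grid. The following is a minimal sketch in which the base density (Normal(0, 1)), the function F(A) = A**3 - A, and the helper names are assumptions chosen for illustration only.

```python
import math

def base(a):
    # Assumed base density p(A): Normal(0, 1).
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)

def F(a):
    # Assumed quantity of interest; negative for a < -1 and for 0 < a < 1.
    return a ** 3 - a

def mask_constant(f):
    # "All values of F(A) are equally good."
    return 1.0

def mask_nonnegative(f):
    # "A is totally implausible if it results in negative F(A)."
    return 1.0 if f >= 0 else 0.0

def masked_density(mask, lo=-6.0, hi=6.0, n=4001):
    # Normalize p(A) * pi(F(A)) on a grid; returns the grid, P(A), and Z.
    h = (hi - lo) / (n - 1)
    grid = [lo + i * h for i in range(n)]
    unnorm = [base(a) * mask(F(a)) for a in grid]
    Z = h * sum(unnorm)
    return grid, [u / Z for u in unnorm], Z

grid, P_constant, Z_constant = masked_density(mask_constant)
_, P_nonneg, Z_nonneg = masked_density(mask_nonnegative)
print(Z_constant, Z_nonneg)
```

Neither mask is a normalizable density in F, yet both give a proper density on A. The constant mask leaves the base density unchanged, and with these assumed choices the indicator mask keeps roughly half of the base mass and renormalizes it.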

If you start including “Jacobian corrections” in your mask function, then your mask function isn’t telling you information like “mask this based on how green it is”. It’s telling you some mixed-up information instead, involving how rapidly the “greenness measurement” varies in the vicinity of this level of greenness. This isn't the nature of the information you have, so you shouldn't blindly think that because you're doing a nonlinear transform of a parameter, you're obligated to start taking derivatives and calculating Jacobians.
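To see that the two choices really encode different information, here is a minimal sketch comparing a masked density with and without an extra abs(dF/dA) factor. The base density (a Normal(1, 1) shape restricted to 0 < A < 4), the function F(A) = A**2, and the mask are assumptions chosen purely for illustration.

```python
import math

def base(a):
    # Assumed base density shape p(A): Normal(1, 1), used only on 0 < A < 4.
    return math.exp(-0.5 * (a - 1.0) ** 2) / math.sqrt(2 * math.pi)

def F(a):
    # Assumed nonlinear quantity of interest.
    return a * a

def dF_dA(a):
    # Derivative of the assumed F.
    return 2.0 * a

def mask(f):
    # Assumed mask: F(A) values near 1 are good; deliberately left unnormalized.
    return math.exp(-0.5 * ((f - 1.0) / 0.5) ** 2)

def normalized(weights, h):
    # Divide by the grid approximation of the normalization factor Z.
    Z = h * sum(weights)
    return [w / Z for w in weights]

n = 4000
h = 4.0 / n
grid = [(i + 0.5) * h for i in range(n)]   # midpoints covering 0 < A < 4

P_mask = normalized([base(a) * mask(F(a)) for a in grid], h)
P_jacobian = normalized([base(a) * mask(F(a)) * abs(dF_dA(a)) for a in grid], h)

mean_mask = h * sum(a * p for a, p in zip(grid, P_mask))
mean_jacobian = h * sum(a * p for a, p in zip(grid, P_jacobian))
print(mean_mask, mean_jacobian)
```

In this sketch the version with the derivative factor shifts mass toward larger A, where F changes faster. That shift comes from the shape of F, not from any statement about which F values are plausible.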

This is very interesting, and a great way to design complex informative priors when the parameters are not independent.

This seems so easy that I had to convince myself with an example, where I set:

- a / (sqrt(2)/2) ~ N(1, 0.5)

- b / (sqrt(2)/2) ~ N(1, 0.5)

- sqrt(a^2 + b^2) ~ N(1, 0.1)

Basically, in (a, b) space, the first two priors are centered around (1, pi/4) in polar coordinates and the last prior acts as a mask to take a ring of radius normally distributed around 1.

FYI, it works, but I was wondering how you would even adjust for the Jacobian in this context.

Like, if you consider the transformation (a, b) -> (a, sqrt(a^2 + b^2)) (we can assume a and b are positive for simplicity), you would do the change of variables for the first and third priors, but you would still need to define a prior for b.

Alternatively, if you go to polar coordinates with a = r * cos(phi) and b = r * sin(phi), you could define the third prior for r, but you would have trouble defining the first and second priors, unless you define something like tan(phi) ~ N(1, 0.5) / N(1, 0.5).

Do you have any thoughts on this?

No Jacobian is required when you express the prior in the space where the sampler samples. If you express the prior like this, and then calculate transformed parameters and use them, you are home free.
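As a minimal sketch of this, the ring example from the question can be written directly in (a, b) space and evaluated on a grid. This sketch assumes the first two statements mean a divided by sqrt(2)/2 and b divided by sqrt(2)/2, and that the second argument of N is a standard deviation. The grid limits and the helper names are also assumptions.

```python
import math

S = math.sqrt(2) / 2   # assumed scale: first two statements read as a / (sqrt(2)/2), b / (sqrt(2)/2)

def normal_logpdf(x, mu, sigma):
    # Log density of Normal(mu, sigma), with sigma a standard deviation.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_prior(a, b):
    # Unnormalized log prior, written directly in the (a, b) space being sampled.
    r = math.hypot(a, b)                       # transformed parameter
    return (normal_logpdf(a / S, 1.0, 0.5)
            + normal_logpdf(b / S, 1.0, 0.5)
            + normal_logpdf(r, 1.0, 0.1))      # mask on r; no Jacobian term added

# Square grid in (a, b); the limits are an assumption chosen to cover the mass.
n, lo, hi = 241, -1.0, 2.0
h = (hi - lo) / (n - 1)
axis = [lo + i * h for i in range(n)]

total = sum_r = sum_phi = 0.0
for a in axis:
    for b in axis:
        w = math.exp(log_prior(a, b))
        total += w
        sum_r += w * math.hypot(a, b)
        sum_phi += w * math.atan2(b, a)

mean_r = sum_r / total       # average radius under the masked prior
mean_phi = sum_phi / total   # average angle under the masked prior
print(mean_r, mean_phi, math.pi / 4)
```

Under these assumptions the mass sits on a ring of radius close to 1, centered near the angle pi/4. The implied marginal on r need not be exactly N(1, 0.1), which is consistent with the third statement acting as a mask and not as a claim about the marginal density of r.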