Followup on Implicit Function Theorem / Likelihoods

2017 September 25
by Daniel Lakeland

I think it's important to understand conceptually what is going on in these cases where we have an implicit relationship that data and parameters are supposed to follow, and to know when we need to do some kind of Jacobian correction.

A Jacobian correction is required when you have a GIVEN probability distribution on space A, a transformation from A to B, call it B = F(A), and you want to express a probability distribution on the B space which is *equivalent in every way* to the GIVEN distribution on space A. The distribution on B is called the *push forward* distribution on B. The mnemonic here is that if you take a small neighborhood in A and "push it forward" through the F function into the B space, it produces a small neighborhood in B, and if you want the two descriptions to be equivalent in every way, then the measure of the neighborhood in A is forced to equal the measure of the pushed-forward neighborhood in B.

GIVEN: A ~ DistA(values)

GIVEN: B = F(A)

DERIVE: B ~ DistB(values)

This process requires using DistA, the inverse transform Finv(B) and a Jacobian correction.
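
Written out in one dimension, with F invertible and differentiable, the derived density is just DistA evaluated at the pulled-back point times the Jacobian of the inverse map:

DistB(b) = DistA(Finv(b)) \left|\frac{d\,Finv(b)}{db}\right|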

Compare this to:

UNKNOWN: A ~ UnknownDistro(Avalues)

GIVEN: B = F(A)

GIVEN: B ~ GivenDistB(Bvalues)

Here we don't know what measure we have on the A space, but we know (by a modeling assumption) what measure we have on the B space. If F is an invertible function, then this situation is entirely symmetric with the one above; it's just that *which space the distributional information is given in* is different.

Now, let's add some semantics to all of this.

In the above problem let A be a space in which your data measurements live. Let DistA(values) then be p(A | values), a likelihood factor in your model. Here you know what the distribution is on your data, so just use it. But if you insist on transforming your data to some other space, say by taking the log of your data, then in order to leave your model unchanged by that transformation you will have to find a DistB which is the push-forward measure of DistA through the transformation B = log(A).
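
For instance, here's a minimal Python sketch, assuming the data happen to be lognormally distributed so that the push-forward of DistA through B = log(A) should come out exactly normal:

import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 1.0, 0.5
DistA = lognorm(s=sigma, scale=np.exp(mu))   # GIVEN distribution of the data A

def DistB_pdf(b):
    # push-forward density: DistA at the pulled-back point Finv(b) = exp(b),
    # times the Jacobian |d exp(b)/db| = exp(b)
    return DistA.pdf(np.exp(b)) * np.exp(b)

b = np.linspace(-1.0, 3.0, 201)
print(np.allclose(DistB_pdf(b), norm(mu, sigma).pdf(b)))   # True: the push-forward is Normal(mu, sigma)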

Now, suppose you don't know what likelihood to give for your data values, but you know that if you calculate some complicated function B = F(A) you would be willing to model the results, in the B space, as having a distribution p(B | Parameters) = DistB(Parameters).

Now, if you want to know what measure this implies in the data space, you will have to do the whole change-of-variables rigamarole with Jacobians. The important thing to understand is *what is given vs what is derived*.

Now, let's imagine a situation where you have a non-separable relationship between various data and parameters which equals a constant plus error, a typical situation where the implicit function theorem applies. Here x,y are data and a,b,c are parameters in your model, and we'll assume F is a "nice" function of the kind you're likely to write down as part of a modeling exercise, not something really weird that is nowhere differentiable in any of its inputs or the like. Our model says that there is a relationship between x,y,a,b,c which is a constant plus noise. This relationship will be written:

F(x,y,a,b,c) = 0 + \epsilon

And let's say \epsilon \sim De(C) has GIVEN distribution De(C) where C are some constants (easiest case).

Now suppose that a,b,c have given values, and x,y are measured. Then the quantity on the left of this equation is a number, F(x,y,a,b,c) = 3.310 for example. And so 3.310 = \epsilon is data (derived data to be sure, but data nonetheless); for a given a,b,c and measured x,y there is no uncertainty left, it's just a number. By MODELING ASSUMPTION, the probability that this \epsilon would be calculated to be within d\epsilon of 3.310, if the true values of a,b,c were the ones given by the sampler, is De(3.310 | C) d\epsilon, where De is a given function.

And so the distribution De(C) is of the form p(\epsilon | a,b,c); it is a *given* likelihood in "epsilon space". Note that x,y are needed to get \epsilon, but they are known data values and stay constant throughout the sampling process. So this is really a function L(a,b,c), where a,b,c are the only things that change while you're sampling. Given the data x,y, the post-data distribution on a,b,c is

L(a,b,c) prior(a,b,c)/Z da db dc

Where Z is a normalization factor: Z = \int L(a,b,c) prior(a,b,c) da db dc
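
To make that concrete, here is a minimal Python sketch of the unnormalized log posterior; the particular F, the data values, the normal choice for De, and the priors are placeholders I'm assuming purely for illustration:

import numpy as np
from scipy.stats import norm

# measured data, held fixed throughout sampling (illustrative values only)
x = np.array([0.5, 1.0, 1.5, 2.0])
y = np.array([1.2, 1.9, 3.1, 4.2])

def F(x, y, a, b, c):
    # placeholder non-separable relationship; any "nice" differentiable F would do
    return y - a * np.exp(b * x) - c

sigma_eps = 0.1   # the constants C in the GIVEN distribution De, here taken to be Normal(0, sigma_eps)

def log_posterior(a, b, c):
    eps = F(x, y, a, b, c)                               # derived data: just numbers once a,b,c are given
    log_L = norm(0.0, sigma_eps).logpdf(eps).sum()       # the given likelihood in epsilon space, L(a,b,c)
    log_prior = norm(0.0, 10.0).logpdf([a, b, c]).sum()  # assumed weak priors
    return log_L + log_prior                             # unnormalized; Z never needs to be computed for MCMC

print(log_posterior(1.0, 0.7, 0.2))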

Now, if you have this given likelihood in epsilon space and you want to see what the equivalent likelihood is over, say, y space, where we think of y as data we'd like to predict, x as covariates, and a,b,c as parameter values:

p(y | a,b,c) dy = p(\epsilon(x,y) | a,b,c) \left|\frac{d\epsilon(y)}{dy}\right| dy

This holds under the assumption that F is sufficiently well behaved that the implicit function theorem gives us a unique differentiable transform from y to \epsilon for given x,a,b,c, and |d\epsilon(y)/dy| is the "Jacobian correction". Now divide both sides by dy and we have our answer for the density of y (I'm using nonstandard analysis; dy is an infinitesimal number).
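
And since \epsilon(x,y) is just F evaluated at the fixed x and the current a,b,c, that Jacobian factor is the partial derivative of F with respect to y, so the derived density can be written out explicitly as

p(y | a,b,c) = De(F(x,y,a,b,c) | C) \left|\frac{\partial F}{\partial y}(x,y,a,b,c)\right|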

The point is, the likelihood is strictly implied to be the push-forward measure of the GIVEN distribution over \epsilon. But the truth is, we don't know the transformation y = f(x,a,b,c,\epsilon) or its inverse in closed form. The typical way we'd do predictions would be to set \epsilon_n to be a parameter with the epsilon distribution and sample it, and then take the sampled \epsilon_n values and use an iterative numerical solver to get y values. And so now we have a computational criterion for deciding if F is sufficiently nice: it produces a unique answer (you might be able to extend this to a countable number of possible alternative answers) under iterative numerical solution for y from a given x,a,b,c,\epsilon_n.
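
Here is a minimal sketch of that prediction recipe in Python, reusing the placeholder F from the earlier sketch and scipy's fsolve as the iterative numerical solver; the parameter values, noise scale, and initial guess are arbitrary assumptions of mine, and for this particular F you could solve for y in closed form, but the recipe is the same for a genuinely non-separable F:

import numpy as np
from scipy.optimize import fsolve

def F(x, y, a, b, c):
    return y - a * np.exp(b * x) - c   # same placeholder F as in the earlier sketch

rng = np.random.default_rng(0)
x_new = 1.2                  # covariate value we want a prediction for
a, b, c = 1.0, 0.7, 0.2      # e.g. a single posterior draw of the parameters

# draw epsilon_n from its GIVEN distribution, then numerically solve
# F(x_new, y, a, b, c) = epsilon_n for y, one solve per draw
eps_n = rng.normal(0.0, 0.1, size=1000)
y_pred = np.array([fsolve(lambda y: F(x_new, y, a, b, c) - e, x0=1.0)[0] for e in eps_n])

print(y_pred.mean(), y_pred.std())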

 
