Posterior probability density ratios

2015 September 3
by Daniel Lakeland

Continuous probability densities have dimensions of 1/[x], where [x] is whatever the dimensions of the input are. For example, if $$x$$ is a length, then $$p(x)$$ is a probability per unit length.

Mathematical models should be built from dimensionless quantities. So if we’re looking to compare different data points under a model, it seems to me it’s worth considering the quantity:

\[p(x)/p(x^*)\]

where $$x^*$$ is the value of $$x$$ at which $$p$$ takes on its maximum, so for the unit normal $$x^*=0$$, for example.
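
As a quick numerical check of that unit-normal example, here is a small sketch in Python (my own illustration, using scipy) showing how fast the ratio falls off away from the mode:

```python
from scipy import stats

# Density ratio p(x) / p(x*) for the unit normal, whose mode is x* = 0.
x_star = 0.0
for x in (0.5, 2.0, 3.5):
    ratio = stats.norm.pdf(x) / stats.norm.pdf(x_star)
    print(f"x = {x}: p(x)/p(x*) = {ratio:.4f}")
```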

In particular, I have in mind using this alternative quantity to filter data in a similar way to the filtering by “p value” described in a previous post.

So, if you’re trying to detect unusual situations (in the previous post’s example it was unusually loud noise), you fit your model $$p(x)$$, and then for each data value $$x_i$$ you calculate $$\log(p(x_i)/p(x^*))$$ and select the data points whose log probability density ratio is sufficiently negative. In practice, since the parameters are uncertain, you’d calculate the average of this quantity over the posterior distribution of the parameters.
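
Here is a minimal sketch of that calculation, assuming a normal data model and assuming you already have posterior draws of the parameters (the draws, data values, and threshold below are just stand-ins to make the snippet run):

```python
import numpy as np
from scipy import stats

# Stand-in posterior draws for a normal data model (in practice these would
# come from an actual fit, e.g. exported from Stan).
rng = np.random.default_rng(0)
mu_draws = rng.normal(0.0, 0.05, size=4000)
sigma_draws = np.abs(rng.normal(1.0, 0.05, size=4000))

x = np.array([0.1, -0.4, 3.2, 0.7, -5.0])  # observed data values

# For each x_i, average log(p(x_i | theta) / p(x* | theta)) over the posterior
# draws; for a normal the density peaks at x* = mu.
log_ratio = np.mean(
    stats.norm.logpdf(x[:, None], mu_draws, sigma_draws)
    - stats.norm.logpdf(mu_draws, mu_draws, sigma_draws),
    axis=1,
)

threshold = -4.0  # placeholder; choose based on the problem
print(x[log_ratio < threshold])  # the "unusual" points, here 3.2 and -5.0
```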

This isn’t quite the same as the likelihood ratio, since of course it’s averaged over the posterior of the parameters, but it is obviously related. (EDIT: you can use these ratios both for likelihoods and for posterior densities of parameter values, if you’re trying to compare alternative parameter values, but with parameters you probably want marginal density values, and if all you have is samples, you probably need kernel density estimates or something similar…)
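
For that parameter case, a rough sketch of the kernel-density version, using synthetic stand-in samples for the marginal posterior of a single parameter:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for posterior samples of a single parameter theta.
rng = np.random.default_rng(0)
theta_draws = rng.normal(2.0, 0.3, size=4000)

# Kernel density estimate of the marginal posterior for theta.
kde = stats.gaussian_kde(theta_draws)

# Approximate the posterior mode by the highest-density sample, then compare
# alternative parameter values to it via the log density ratio.
theta_star = theta_draws[np.argmax(kde(theta_draws))]
for theta in (2.1, 3.5):
    log_ratio = np.log(kde(theta)[0] / kde(theta_star)[0])
    print(f"theta = {theta}: log density ratio vs. mode = {log_ratio:.2f}")
```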

This is a case where it makes sense to use the Bayesian machinery to actually fit the frequency histogram of the data points, a perfectly legit thing to do, even though it isn’t always required.

Ways in which this is superior to “p values” include being able to tell that something is unusual even when it’s not in the tail of a distribution, for example with multi-modal distributions. It is also actually motivated by the science (in so far as you put science into your data model) and by the actual values of the data: it doesn’t involve the probability, under the model, of observing theoretical data “as extreme or more extreme” than the data point; it’s just the relative probability of observing data in the immediate neighbourhood of the data point compared to the immediate neighbourhood of the most common point. In particular, you can think of $$p(x)\,dx / (p(x^*)\,dx)$$ as a ratio of probabilities for any infinitesimal quantity $$dx$$, which, thanks to the ratio, cancels out.
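
To make the multi-modal point concrete, here is a small illustration of my own: with an equal mixture of two well-separated normals, a point halfway between the modes is nowhere near either tail, yet its log density ratio is strongly negative:

```python
import numpy as np
from scipy import stats

def mixture_pdf(x):
    # Equal mixture of N(-4, 1) and N(4, 1); the modes sit at (approximately) x = -4 and x = 4.
    return 0.5 * stats.norm.pdf(x, -4, 1) + 0.5 * stats.norm.pdf(x, 4, 1)

x_star = 4.0      # (approximate) location of a mode
x_between = 0.0   # the median of the mixture: not in any tail, but unusual

log_ratio = np.log(mixture_pdf(x_between) / mixture_pdf(x_star))
print(f"log(p(0)/p(x*)) = {log_ratio:.2f}")  # strongly negative, so x = 0 gets flagged
```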

I used this exact method recently with data my wife gave me. It was a kind of screening of certain chemicals for their ability to induce certain kinds of fluorescent markers in cells. The image data resulted in a bunch of noisy “typical” values, and then a number of outliers. Each plate was different since it had different chemicals, different cells, different markers, etc. In Stan I fit a t distribution to the image quantification data from each plate, and looked for data points that were in the tails of the fitted t distribution based on this quantity. It seemed to work quite well, and got her what she needed.
