Differential Entropy and nonstandard analysis

2016 January 14

For reasons discussed previously, I believe that every scientific measurement lives on a finite sample set. But it is tiresome to work with enormous explicit finite sample sets, such as the actual values that a 64 bit IEEE floating point number can take on (they're not even evenly spaced). What we tend to do is deal with discrete sample spaces with explicit values when the set is small enough (2 or 10 or 256 or something like that), and deal with "continuous" distributions as approximations when there are lots of values and the values in the finite set are close enough together. For example, a voltage measured by a 24 bit A/D converter in which the range 0-1V is represented by the numbers 0-16777215 has an interval between sample values of about 0.06 microvolts, which corresponds to 0.06 microamps flowing for a microsecond into a microfarad capacitor, or around 374000 electrons.
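The A/D converter arithmetic can be checked in a few lines. This is just a sketch: the 1 microfarad capacitor and the 0-1V range come from the example above, and the variable names are mine.

```python
# Rough arithmetic for the 24 bit A/D example (round-number assumptions from the text).
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs per electron (exact SI value)

levels = 2 ** 24                     # codes 0..16777215
step_volts = 1.0 / (levels - 1)      # spacing between adjacent codes over 0-1V
capacitance = 1e-6                   # farads (the microfarad capacitor)
charge = capacitance * step_volts    # Q = C * V, in coulombs
electrons = charge / ELEMENTARY_CHARGE

print(f"step   = {step_volts * 1e6:.4f} microvolts")
print(f"charge = {charge:.3e} C, about {electrons:,.0f} electrons")
```

With the unrounded step the count comes out near 372000 electrons; rounding the step up to 0.06 microvolts first gives the roughly 374000 quoted above. Either way, one grid step is a few hundred thousand electrons.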

Because of this, the nonstandard number system of Internal Set Theory (IST) corresponds pretty well to what we're typically doing. Suppose for example x ~ normal(0,1) in a statistical model. We can pick a large enough number, like 10, and a small enough number, like 10^{-6}, and grid out all the individual values between -10 and +10 in steps of 0.000001, and very rarely is anyone going to have a problem with this discrete distribution in place of the normal one. Anyone who does have a problem should remember that we're free to choose a finer grid, and that their normal RNG might be giving them single precision floating point numbers with 24 bit mantissas anyway... IST formalizes this with some machinery (axioms, lemmas, etc.) that proves the existence, in IST, of an infinitesimal number that is so small no "standard" math could distinguish it from zero, and yet it isn't zero.
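To see how little is lost by gridding, here is a minimal sketch that puts a normal(0,1) on a grid over [-10, 10] and checks the total probability, mean, and variance. I'm assuming a step of 10^{-3} rather than 10^{-6} only so that plain Python finishes quickly; the function name `grid_normal` is made up for this illustration.

```python
import math

def grid_normal(half_width=10.0, dx=1e-3):
    """Grid points x_i and probabilities p_i proportional to pdf(x_i) * dx."""
    n = int(round(2 * half_width / dx))
    xs = [-half_width + i * dx for i in range(n + 1)]
    weights = [math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi) * dx for x in xs]
    total = sum(weights)                  # already very close to 1
    probs = [w / total for w in weights]  # renormalize so the p_i sum to exactly 1
    return xs, probs, total

xs, probs, raw_total = grid_normal()
mean = sum(x * p for x, p in zip(xs, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))
print(f"raw total = {raw_total:.12f}, mean = {mean:.2e}, variance = {var:.9f}")
```

Even at this coarser step the discrete distribution matches the normal's total mass, mean, and variance to many digits, so a step of 10^{-6} is more than fine for practical purposes.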

So now suppose we have the problem of picking a distribution to represent some data, and we know only that the data has mean 0 and standard deviation 1. We appeal to the idea that we'd like to maximize a measure of uncertainty subject to the constraints of mean 0 and standard deviation 1. For discrete outcomes there's an obvious choice of uncertainty metric: one of the entropies

E = -\sum_{i=1}^{N}p_i\log(p_i)

Here the free choice of the logarithm's base is equivalent to a free choice of scale constant, which is why I say "entropies" above. Informally, since the log of a number between 0 and 1 (a probability) is always negative, the negative of the log is positive. The smaller you make each of the p values, the bigger you make each of the -\log(p) values. So maximizing the entropy is like pushing down on all the probabilities. The fact that the total probability stays equal to 1 limits how hard you can push down, so that in the end the total probability is spread out over more and more of the possible outcomes. If there are no other constraints, all the probabilities become equal (the uniform distribution). Other constraints limit how hard you can push down in certain areas (i.e. if you want a mean of 0 you probably can't push the whole range around 0 down too hard), so you wind up with more "lumpy" distributions or whatever, depending on your constraints.
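A tiny example of the "spreading out" intuition, using natural logs and three made-up distributions on four outcomes:

```python
import math

def entropy(probs):
    """Discrete entropy -sum(p_i * log(p_i)); zero-probability outcomes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # probability spread as evenly as possible
lumpy = [0.7, 0.1, 0.1, 0.1]        # most of the mass on one outcome
certain = [1.0, 0.0, 0.0, 0.0]      # no uncertainty at all

for name, dist in (("uniform", uniform), ("lumpy", lumpy), ("certain", certain)):
    print(f"{name:8s} E = {entropy(dist):.4f}")
```

The uniform case reaches log(4), the largest value available with four outcomes, and the entropy falls as the probability gets more concentrated.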

The procedure for maximizing this sum subject to the constraints is detailed elsewhere. The basic technique is to take a derivative with respect to each of the p_i values and set all the derivatives equal to 0. To add the constraints, you use the method of Lagrange multipliers. The result is that each p_i = \exp(-Z-k(x_i-\mu)^2), where k depends on \sigma (\sigma=1 in our case) and Z is chosen to normalize the total probability to 1.
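For readers who want the step spelled out, here is a sketch of that calculation. The multipliers \lambda_0, \lambda_1, \lambda_2 are my notation, one for each constraint (normalization, mean, variance):

```latex
\begin{aligned}
\mathcal{L} &= -\sum_{i=1}^{N} p_i \log(p_i)
  - \lambda_0\Big(\sum_{i=1}^{N} p_i - 1\Big)
  - \lambda_1\Big(\sum_{i=1}^{N} p_i x_i - \mu\Big)
  - \lambda_2\Big(\sum_{i=1}^{N} p_i (x_i-\mu)^2 - \sigma^2\Big) \\
\frac{\partial \mathcal{L}}{\partial p_i} &= -\log(p_i) - 1 - \lambda_0 - \lambda_1 x_i - \lambda_2 (x_i-\mu)^2 = 0 \\
p_i &= \exp\!\big(-(1+\lambda_0) - \lambda_1 x_i - \lambda_2 (x_i-\mu)^2\big)
\end{aligned}
```

Assuming the x_i are placed symmetrically about \mu, the mean constraint is satisfied with \lambda_1 = 0, and the expression matches the form above with Z = 1+\lambda_0 and k = \lambda_2. On a fine grid, k should come out close to the continuous value 1/(2\sigma^2).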

Now, suppose you want to work with a "continuous" variable. In nonstandard analysis we can say that our model is that the possible outcomes are on an infinitesimal grid with grid size dx, constrained to lie in the interval [-N,N] for N a nonstandard integer. So the possible values are x_i = -N + i\,dx for all the i values between 0 and M = 2N/dx. We define a nonstandard probability density function p(x) to be constant over each interval of length dx, and the probability of landing at the grid point in the center (or left side, or some fixed part) of the interval is p(x)dx.

Now we calculate the nonstandard entropy

E = -\sum_{i=0}^{M}\log(p(x_i)dx) p(x_i)dx

Now clearly the argument p(x_i)dx of the logarithm is infinitesimal, since p(x_i) is limited and dx is infinitesimal, so -\log(p(x_i)dx) is nonstandard (very very large and positive). But it's a perfectly good number. There is a finite number of terms in the sum, so the sum is well defined. The value of the sum is of course a nonstandard number, but we can still ask how to set the p(x_i) values such that the sum achieves its largest (nonstandard) value. Clearly p(x) is going to be the same kind of expression as before, because we're doing the same calculation (hand waving goes here; feel free to formalize this in the comments), so we're going to wind up with:

p^*(x) = \exp(-Z- k (x-\mu)^2)

Here p^*(x) refers to the nonstandard function which is constant over each interval. The standardization of this p^*(x) is going to be the usual normal distribution.
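One way to make part of the hand waving a bit more concrete (a sketch, not a rigorous IST argument) is to split the logarithm in the nonstandard entropy and use the normalization \sum_i p(x_i)dx = 1:

```latex
\begin{aligned}
E &= -\sum_{i=0}^{M} \log(p(x_i)\,dx)\, p(x_i)\,dx \\
  &= -\sum_{i=0}^{M} p(x_i)\log(p(x_i))\,dx \;-\; \log(dx)\sum_{i=0}^{M} p(x_i)\,dx \\
  &= -\sum_{i=0}^{M} p(x_i)\log(p(x_i))\,dx \;-\; \log(dx)
\end{aligned}
```

The -\log(dx) term is the unlimited part, but it is the same constant for every candidate p, so it has no say in which p wins. The remaining sum is the grid version of the usual differential entropy, assuming p is well enough behaved for that sum to be limited.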

The point is, just because the entropy is nonstandard doesn't mean it doesn't have a maximum. So long as the maximum occurs for some function of x whose standardization exists, we can take that standard probability density as the maximum entropy result we should use. This procedure is justified in large part because of the way that the continuous function is being used to approximate a grid of points anyway!
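A finite-grid illustration of this point: the sketch below compares the grid entropy of three candidate densities, each with mean 0 and (approximately, after discretization) variance 1, at several grid sizes. The Laplace and uniform candidates and the grid sizes are my choices for illustration, and of course no finite dx is actually infinitesimal.

```python
import math

def discretize(density, half_width=10.0, dx=0.01):
    """Probabilities p_i proportional to density(x_i) * dx on the grid over [-half_width, half_width]."""
    n = int(round(2 * half_width / dx))
    xs = [-half_width + i * dx for i in range(n + 1)]
    weights = [density(x) * dx for x in xs]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Discrete entropy -sum(p_i * log(p_i)), skipping zero-probability points."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def normal(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def laplace(x):
    b = 1 / math.sqrt(2)  # scale chosen so the variance 2*b^2 equals 1
    return math.exp(-abs(x) / b) / (2 * b)

def uniform(x):
    a = math.sqrt(3)      # half-width chosen so the variance a^2/3 equals 1
    return 1 / (2 * a) if abs(x) <= a else 0.0

candidates = (("normal", normal), ("laplace", laplace), ("uniform", uniform))
results = {}
for dx in (0.1, 0.01, 0.001):
    results[dx] = {name: entropy(discretize(f, dx=dx)) for name, f in candidates}
    row = ", ".join(f"{name} = {value:.4f}" for name, value in results[dx].items())
    print(f"dx = {dx}: {row}")
```

Every entropy grows by about log(10) each time dx shrinks by a factor of 10, so the values head off to infinity, yet the normal stays on top at every grid size and the gaps between candidates barely move.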

If you don't like this result, you could always use the relative entropy (i.e. replace the logarithm expression with \log(p(x)dx/q(x)dx)) relative to a nonstandard uniform distribution whose height is q(x) = \frac{1}{2N} across the whole domain [-N,N]. This seems to be the concept referred to by Jaynes as the limiting density of discrete points. Then the dx values in the logarithm cancel, and the entropy value itself isn't nonstandard, but the distribution q(x) is, so it's still a nonstandard construct. Since q(x) is just a constant anyway, it's basically just saying that by rescaling the original entropy via a nonstandard constant, we can recover a standard entropy to be maximized. But... and this is key, we are never USING the numerical entropy value itself, except as a means to pick out a probability density which turns out to have a perfectly well defined standardization, namely the normal distribution.
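Here is a small numerical sketch of the dx cancellation, again on finite grids with N = 10 standing in for the nonstandard N. The function name is hypothetical, and the uniform reference uses the height q(x) = 1/(2N) from the paragraph above.

```python
import math

def relative_entropy_vs_uniform(dx, half_width=10.0):
    """-sum p_i * log(p(x_i)dx / (q dx)) for a gridded normal(0,1), with q = 1/(2*half_width)."""
    n = int(round(2 * half_width / dx))
    xs = [-half_width + i * dx for i in range(n + 1)]
    weights = [math.exp(-0.5 * x * x) for x in xs]
    total = sum(weights)
    probs = [w / total for w in weights]  # p_i = p(x_i) * dx
    q = 1.0 / (2 * half_width)            # height of the uniform reference density
    # The dx cancels inside the log: p(x_i)dx / (q dx) = (p_i / dx) / q
    return -sum(p * math.log((p / dx) / q) for p in probs if p > 0)

values = {dx: relative_entropy_vs_uniform(dx) for dx in (0.1, 0.01, 0.001)}
expected = 0.5 * math.log(2 * math.pi * math.e) - math.log(20.0)
for dx, value in values.items():
    print(f"dx = {dx}: {value:.6f}")
print(f"differential entropy of normal(0,1) minus log(2N): {expected:.6f}")
```

The value no longer depends on dx. It is just the normal's differential entropy shifted by the constant log(2N), which is the rescaling by a constant described above.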


3 Responses
  1. February 29, 2016

    You've seen this, right?

    • Daniel Lakeland
      February 29, 2016

      Yes, I have. It's been a while since I looked at it. I remember liking it, but thinking he needed to spend more time developing some of the ideas. If I remember, it was published as a sort of technical note specifically so one of his grad students could cite it in a PhD Thesis? The main issue for me is it's kind of a "mathematical statistics" thing, rather than examples of using the tools to develop models in applied statistics. But, it's still a great little publication.

    • Daniel Lakeland
      February 29, 2016

      Reading through Geyer's PDF, it seems that there may be some technical issues which a mathematician should investigate in my construct here. For example, it may be possible that for n nonstandard and less than N the construct here doesn't fully constrain the tail behavior. In other words, way out in the tail where the density is infinitesimal you might have non-uniqueness (you could have different infinitesimal distributions with the same nonstandard entropy); for example, higher moments might not be fully constrained. So maybe you need an additional criterion to constrain the tail behavior in order to get the standard normal distribution out of this construct.

      It doesn't keep me up at night, because I suspect that everywhere the density is appreciable it will converge to the normal distribution via the usual arguments about calculus of variations, and we're only really interested in areas where there is appreciable density. (If the distribution we pick is basically zero at x=300 but not quite the same kind of zero as exp(-300^2), it will have no effect on actual statistics, but it could be an interesting problem for a math grad student or something.)
