# Thoughts on MaxEnt distributions and demographics and computation

When working with social data it's pretty frequent that you get binned information. Like for example some survey might tell you the quartiles or deciles of the age distribution for males in the US, or the percentage of people whose age is between certain fixed values. If you're lucky you might also find out something about the people within each quartile (such as the mean age).

Suppose you'd like to do some prediction of some quantity based on this kind of information. For example, you might have percentage of people with a given educational attainment born before each year, as a graph, and you might have population quartiles of the current population, and you have some predictive equation for say income based on educational attainment and age, and you'd like to calculate the average income for males between age 20 and 45 today and for males between 20 and 45 years of age 20 years ago.

This is a made up example, but typical of the kind of thing I'm thinking of. In particular, you might like to do something like take panel data and infer trajectories through time for individuals, even though you don't have repeated measures. So for example you might generate virtual people born in 1940 and then have them go through earnings trajectories which put together replicate the panel data in 1960, 1970, 1980, 1990, 2000 and estimate something like what the distribution of household wealth would have been if some kind of policy were different (and you have say some simple causal model for what the savings would have been if the policy were different).

The answer of course is that you use maximum entropy. But the maximum entropy distribution of interest is a complicated one, and you might like to do numerical maximization of the entropy.

If you want to do something like this in Stan, where you're simultaneously doing inference on parameters using Bayesian methods, and finding the parameters for a distribution that maximize some measure of entropy for some prior... how do you go about it? I don't think there is an easy answer. It might be good to come up with a more simple and tractable example problem. So for example, suppose you know that the quartiles of age in some population are 23, 44, and 70 years of age, that the average age between 0 and 23 is 9, between 23 and 44 is 31, between 44 and 70 is 62, and over 70 is 81.

Suppose also that we have some function Q(x), and we want, in Stan, to approximately identify the gaussian mixture model with 3 components (8 degrees of freedom, the mean and SD of each mixture component and the weights of the mixture) that maximizes entropy subject to those 7 constraints, and calculate in the generated quantities the mean value of Q(x) from a sample of 1000 points drawn from the maxent distribution. I'll even add in some leeway as if these numbers above are rounded off, so your maxent only needs to satisfy the constraints to within +- 0.5% and +- 0.5 years.

One thought I had was to set up some kind of really broad priors on the parameters of the mixture model, then numerically calculate an approximation to the entropy, then use a kind of "pseudo-likelihood" on the calculated entropy which increases super-exponentially with increasing entropy. Also, you'd numerically calculate the constraints, and include them in the psuedo-likelihood so that they have to be pretty near their constrained values. Then, the hope then is that HMC algorithm sends the mixture parameters running around until it runs up the pseudo-likelihood as high as it can and this exponentially high density region defines something close to the high entropy prior we're looking for.

Then of course, you define say 1000 parameters each of which is distributed according to this mixture model, and you in the generated quantities calculate the average of Q(x) over the 1000 samples.

I have no idea how well that could work.

Would it be sufficient to calculate the entropy as the average of negative the log probability of the gaussian mixture density over the 1000 parameters, and then define something like

meanent < - mean(loggausdens(x)) increment_log_prob(exp(meanent))

seems hairy.

It might make sense to calculate the entropy from some smaller set of parameters that were used purely for calculating the entropy. So you might calculate say 20 parameters whose distribution was the GMM, and calculate the entropy from that. And then 1000 parameters that were for doing the inference on the rest of the model, so that the entropy and the inference on other parameters are independently calculated things.

Thinking about things I can see that these strategies won't work, because there is nothing keeping the sample of values distributed according to the GMM distribution. In particular the sample points from which you're calculating the entropy will be induced by the pseudo-likelihood to fly down to the low probability density regions of the function because that's where the negative log prob will be largest....

so back to the drawing board.

Perhaps simply using a fine-grained histogram, where within each tabulated bin you create 4 or 5 sub-bins from which you sample uniformly, then you do inference on the height of those bins by providing a Dirichlet prior and a likelihood based on uniform independent sampling within the bins....

I like this idea. It has problems in higher dimensions, but for a single dimension or two or three independent dimensions it could work fine. Since each cell has a uniform distribution the log density is a constant and you only need one point in each cell and you should simply make those points the mid-point of the cell due to the mid-point rule being a good integration rule.

On the other hand, as dimension grows, the number of points involved grows exponentially, so it only works for a small number of dimensions.

Perhaps for a gaussian mixture model and an unconstrained space you could use a closed form expression for the entropy, and do a two step procedure. First maximize the entropy, then using the parameters you found, write a second Stan model to do the inference given the fixed GMM.

For the toy problem, here's what I'm thinking:

- set the prior on the parameters to be flat on the appropriate support

- sample N deviates from the mixture distribution for some for large value of N

- compute MC estimates of the constrained quantities and the entropy

- use your declarative modelling strategy to create penalties on the MC estimates in the target

- target += entropy

- optimize

- ???

- profit

Yeah, I was kind of thinking along the same lines, but one issue is that "sample N deviates from the mixture distribution" could be problematic in the more general case where you need N deviates from a general "model distribution" that you can't sample independently from.

What about something like

Define a parameterized class of distributions (mixtures for example)

Put very broad priors on the distributional parameters.

Define a set of parameters as independent samples from the distribution above, to be sampled by Stan.

Calculate the estimates of the observed statistics from the sample parameters, and use declarative modeling to create a likelihood that compares the estimates to the observed data.

Ignore entropy.

Do further inference on your model using your sample from your parameterized class of distributions to infer what might have gone on in the detailed data that you only have a table for.

This is essentially Bayesian probability inference over frequency distributions, and it might make more sense in this context.

I've been trying to fit a 3 element GMM to my toy problem using a simple uniform grid numerical integration scheme and finding that it's REALLY not wanting to work. Interesting, but I don't know why.

One thought is that with 8 DOF for the GMM to fit 7 constraints, the typical set is sort of 1D, which might be hard to find and move around in. Perhaps an extra mixture component might make more sense.

Ok, looks like it was bugs in the model that made it so hard to fit, but adding an extra GMM component anyway since it makes sense that it would work better.

Well, here's the interpretation:

.25 is the observed fraction of people under age 23, and 9 is the observed average age for those people... both of those are data points telling us about a function which happens to have the properties of a pdf, namely f(x) parameterized by our unknown parameters.

so we can put this data on the left, and some function of the parameters on the right of a typical ~ statement in Stan..

.25 ~ normal(integrate(f(x),0,25),SOMETHING)

and

9 ~ normal(integrate(x*f(x),0,25),SOMETHING2)

where SOMETHING and SOMETHING2 are expressions of how big the modeling error and the measurement error and the integration error all put together could be.

Of course, there is no "integrate" function, so instead we need to do some numerical stuff and store the result in a variable, and change our Stan expression appropriately.

We can then do inference on the parameters that control f(x) (namely the GMM parameters) based on this totally legit likelihood.

However, getting it to actually *fit in Stan* is a different story.

----------------------

Update: found a few bugs. I was calculating the averages incorrectly, not dividing by the appropriate normalization constant. Anyway, now it runs pretty well, but if you enforce the constraints very tightly, it has trouble with divergent transitions and slow sampling.

i'll stick the code up and some plots monday.

I think I must be confused as to the purpose of the exercise. Either you're doing analytically tractable MaxEnt in which the forms of the constraints determines the parameter space (after which setting the parameter values to satisfy the constraint data is equivalent to maximum likelihood), or you're doing numerical MaxEnt in which you have to choose a numerical-computationally tractable space of densities in which to perform the optimization which then satisfies the constraints directly if at all possible, or you're doing Bayesian statistics in which you pick your parameter space, prior, and likelihood (including declarative likelihoods -- you've convinced me that that's kosher) -- and if you want predictions, a sampling distribution against which you can marginalize the posterior distribution to get the posterior predictive. Stan can't help you with the first, can probably be contorted into accomplishing the second, and is designed for the third.

(Incidentally, the toy problem has an analytical solution: the log-density is piecewise linear with no continuity conditions imposed at the quartile boundaries. The only reason I can see to pick a parametric form like your mixture model is to (in effect) impose a smoothness constraint on the solution. Add enough mixture components and you'll get rapid changes on the quartile boundaries anyway.)

The purpose of the exercise is to figure out a way to infer a continuous smooth density from some tables of numbers you typically get from Census reports and stuff like that.

The goal of doing that would be to use the smooth density as a starting point for inferring some other stuff from a causal model of some kind of human or social process.

I think i've come to the conclusion that the "right" way to do this is via Bayesian inference on the parameters that describe the frequency distribution (the GMM for example). Thereby arriving finally at a probability distribution over the frequency distribution.

I'm glad to hear I've convinced you of the kosherness of declarative likelihoods ðŸ˜‰

One major reason to impose smoothness constraints is so that Stan can do HMC with parameters defined to have the inferred distribution.

So, as it is, things work ok provided I do numerical integration on a fixed set of sample points.

It would be nice to do the numerical integration via MC estimates using parameters. The problem is that when parameters cross the boundaries of the sub-regions the integrals are discontinuous and the HMC would probably be unhappy.

If you could figure out a way to use parameters, it'd be great because it'd scale to higher dimensions, and you could infer things like multi-dimensional distributions: age, individual income, household income, age of spouse, educational attainment, quantity of meat consumed per year, average duration of commute.

Without a way to calculate how well the distribution fits the observed statistics from a sample of parameters distributed according to the model distribution pdf.... you can't scale up to such multi-dimensional things.