# Dirichlet Processes as Probability over Histograms

Suppose you are sampling from a large population, trying to find out about how that population is distributed. Suppose it's a large but finite population at a fixed time, like say 1 Million balances in checking accounts. You can't take a full sample, so you need to take a much smaller sample, say 100 balances. Yet, you want to be able to use your information to come up with plausible models for the frequency distribution of the balance so that you could for example run some simulations of what might happen if you offered people a chance to buy something at various prices or something like that (some calculation that requires simulating from the full frequency distribution, not just knowing its mean and sd or the like).

Most of us have worked with histograms in the past. In the simple version, you take your sample, break up the range into bins of equal size, and count up how many samples fall in each bin. The value for each bin gives an estimate of the value

where is thought of as the "real" frequency density, and when the population is large this is taken to be basically a continuous function (hopefully the nonstandard analysis analogy is already coming into your head).

Well, each of the bins is basically a "categorical" variable, and the are counts in each category. The simplest categorical variable is the binomial one, where there are two possible outcomes or "true vs false" or "success vs failure" or "heads vs tails". A more complex categorical variable analogy is drawing balls from an urn. Suppose there is a bag full of balls, and there are 5 different colors. If you draw a sample of 100 balls, you can provide a vector of the counts of red,orange,yellow,green,blue for example. Similarly by breaking up the range of a variable into different bins you can take your sample and turn it into a count of the number of times a data point fell in each bin.

Now, we're familiar with the idea that if you take a large sample, the histogram should be close to the real frequency distribution whereas with a small sample, the histogram might be very different from .

Many people are also familiar with the Beta-Binomial conjugate family. If you check 100 out of a large number of items with only two colors, and 40 are red and 60 are blue, then with a prior of beta(1,1) your posterior over the frequency of reds is beta(41,61) for example. A similar distribution is conjugate to the multinomial/categorical distribution with k categories. This distribution is called the "Dirichlet" distribution, and it's a distribution over the vector of relative frequencies with which each category appears in random sampling from a giant urn with k colors of balls, or the process of creating histograms with k bins...

In R, if you load the "gtools" package, you can draw samples from a Dirichlet distribution:

Is a function that returns a vector which is in essence a frequency histogram from a conjugate model under the assumption that you've observed a vector of counts where the values are actually the sum of the counts observed plus the "effective prior counts".

In the absence of strong prior information and the presence of a large sample we can think of the as just observed counts in a histogram process, and the as plausible values for the true frequency with which values from a very large sample would fall into histogram bin for all the possible .

Unsurprisingly, the larger is, the closer different draws of will be to a single consistent value . In other words, all the vectors will look similar. Whereas, when the is small, the histograms will vary a lot and tend to look the histograms of only a small number of observations... noisy. Note, there's no requirement to make each of the values in the be integers even though we're understanding them in terms of "effective counts".

If you want to simulate values that are plausible for the full population using only your prior guess at the frequency histogram and the counts of observed data points, one way to do it is to use the rdirichlet function to generate an , then generate data consistent with that histogram. Each time you do this you'll generate a plausible full population which you can then use to calculate whatever complex thing you want as if you had the whole population dataset. Doing this over and over shows you how much your uncertainty in the whole population affects whatever your calculated statistic is.

There's nothing to say that the bins have to be equally spaced, or even have finite ends, though you'll have to have a model for the tail of the distribution if you include an infinite bin. Also, there's no reason you can't use an informative prior over the histograms, and the more informative you want the prior to be, the bigger the "count" of the fake data. If is a vector of prior guesses for the histogram bin frequencies, then is in essence a prior over counts "worth" data points. If you have an additional data points that have observed relative frequencies then is the parameter vector you should give to rdirichlet to get a sample from the posterior over histograms.

Once you have a vector of histogram frequencies you can generate a plausible population of N values by selecting a histogram bin at random and generating a value uniformly within that bin. But probably the real is not a piecewise constant function, so you can maybe do better by a little smoothing. Adding a small amount of normal noise to each simulated value will produce a draw from a smoothed histogram a piecewise constant function convolved with a normal curve. If you simply use the midpoint of each interval and then smooth with the appropriate normally distributed perturbation you have a "nonparametric" gaussian mixture model. It's nonparametric in the sense that it is actually a family of models, as you can always increase the number of bins to fit the size of your data.

Basically, this scheme allows you to put Bayesian probabilities over the unknown frequency histogram of a large but finite population from which you have taken a random sample. If you have models for bias in your sampling, you can include that model through appropriate perturbations to the parameter vector of the Dirichlet.