The Dirichlet process via NSA?

2013 May 15
by Daniel Lakeland

Andrew Gelman links to a paper on priors for Bayesian nonparametric statistics (and it has a wonderful in-joke in the title for those who know late-'70s and early-'80s music).

The paper itself is about nonparametric Bayesian models, which really means models that have a huge number of parameters. One thing that comes up a lot in these nonparametric problems is the Dirichlet process. Unfortunately, the Wikipedia article is inscrutable to someone who doesn't already have an intuition about the Dirichlet process (i.e. it's maybe a good reference, but a terrible way to learn).

I am not really sure what a Dirichlet Process is, but I'm going to take a stab at it via nonstandard analysis and see if someone can interpret what I have to say and tell me if it makes sense.

First, let's talk about what a Dirichlet distribution is. Imagine you have a measurement process that can produce several different discrete outcomes. For example, you have a survey form where people can mark their "race" as one of 8 options. Such a process is a categorical random variable with 8 categories. When there are only 2 categories this is a Bernoulli random variable (i.e. its result is either 0 or 1), but in general there could be N categories. Each of the N categories has a probability associated with it, and every outcome MUST fall into one of the categories (for example, you have 7 words describing "common" races and then "other", so that there's an option for everyone).

The parameters of the N-dimensional categorical distribution are then a vector of N probabilities. These probabilities must sum to 1 (because of the "other" category, every outcome falls into one of the categories).
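To make that concrete, here's a tiny sketch in numpy (the 8 category probabilities are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up probabilities for 8 survey categories; they must sum to 1.
p = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.08, 0.05, 0.02])
assert np.isclose(p.sum(), 1.0)

# Each draw from the categorical distribution picks one of the 8 categories.
draws = rng.choice(8, size=5, p=p)
print(draws)  # e.g. [0 2 1 0 5]
```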

The Dirichlet distribution is a distribution over vectors of N numbers that lie in the interval [0,1] and sum to 1. Since knowing N-1 of these numbers determines the last one exactly (i.e. p_N = 1-\sum_{i=1}^{N-1} p_i), the probability density is zero in regions where the sum is greater or less than one. So the support is a kind of (N-1)-dimensional hyper-surface embedded in the N-dimensional hyper-cube. If you draw a vector of N different p values from the Dirichlet distribution, you'll get a vector that can be interpreted as the parameters of a categorical distribution over N categories.
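Again as a sketch, a single Dirichlet draw in numpy looks like this (the all-ones concentration vector is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Concentration parameters for an 8-category Dirichlet (all ones is arbitrary).
alpha = np.ones(8)

# One draw: 8 numbers in [0,1] that sum to 1...
p = rng.dirichlet(alpha)
print(p.sum())  # 1.0 (up to floating point)

# ...which can serve directly as the parameters of a categorical distribution.
outcome = rng.choice(8, p=p)
```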

Fine and dandy so far (relatively speaking). Now suppose that N is a nonstandard number, so there are a nonstandard number of categories. Well, we could interpret a set of N different p values, where N is nonstandard, as a function f(x) = Np_i for i/N \le x < (i+1)/N. Since the p_i values add to 1, if we take dx = 1/N and integrate, \int f(x) dx = \sum_{i=1}^{N} N p_i \, dx = \sum_{i=1}^{N} p_i = 1; the N values cancel, and we see that the integral is guaranteed to be 1. So the function f is a nonstandard probability density function on [0,1]. It's a very weird function though: nothing guarantees it's s-continuous anywhere, and any two adjacent p_i values could differ by a significant amount. The graph of this function doesn't exist as a standard thing (because it requires a nonstandard number of discrete steps), but it could be approximated by some kind of very spiky, noisy graph using a large but standard N (say 1000).
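Here's a sketch of that approximation with N = 1000, a large but standard number as suggested above; the step function f really is the spiky graph described:

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate the nonstandard construction with a large but standard N.
N = 1000
p = rng.dirichlet(np.ones(N))  # N probabilities summing to 1

# f(x) = N * p_i on [i/N, (i+1)/N): a step-function density on [0,1].
x = np.arange(N) / N
f = N * p

# With dx = 1/N, the integral of f is sum(N * p_i * dx) = sum(p_i) = 1.
print(np.sum(f / N))  # 1.0 (up to floating point)

# Adjacent steps can differ by a lot -- nothing forces f to be smooth.
print(np.max(np.abs(np.diff(f))))
```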

The typical way this is used, as far as I can tell, is to define mixture models. So you have some set of N PDFs that define different ways a data point can be generated (for example, N different Gaussians with different mean values). Each of the N PDFs has a probability of being used to generate a given data point. You draw a single pdf from the Dirichlet process, and then you use this pdf to select which of the Gaussians you will use to explain the ith observation. Since in practice we always work with a finite standard number of things, you can think of the Dirichlet process with length N as a way to get a finite standard number n of discrete values, by sampling f(i/n) for i from 0 to n-1 where n is standard, or something like that.
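A minimal finite sketch of that mixture story (the component count, the means, and the all-ones concentration vector are all arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite Dirichlet mixture of Gaussians; all sizes/values are illustrative.
N = 10                               # number of Gaussian components
mu = rng.normal(0.0, 5.0, size=N)    # N different mean values
w = rng.dirichlet(np.ones(N))        # one draw of mixture weights

# For each observation: pick a component by weight, then sample from it.
z = rng.choice(N, size=1000, p=w)    # which Gaussian explains observation i
y = rng.normal(mu[z], 1.0)           # the observations themselves
```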

Well, that sort of helped me; I wonder if it's more or less correct :-).

EDIT: OK, reading up on the Wikipedia entry, it seems there are basically two "parameters" to the DP itself: the first is a pdf which describes the marginal probability of a given p_i, and the other is a "concentration parameter" which describes how much the values of the DP concentrate about a finite set of values. So the function f(x) is going to look like some kind of step function, because it will frequently take on values from some discrete set. The concentration parameter also has something to do with how spread out those discrete values are.
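One standard way to make the concentration story concrete is the stick-breaking construction of a (truncated) DP draw; this isn't from the post above, just the usual textbook construction, and the base distribution (a standard normal) and alpha = 1.0 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stick-breaking sketch of a truncated DP draw. The base distribution
# (standard normal) and alpha = 1.0 are arbitrary illustrative choices.
alpha, K = 1.0, 1000
v = rng.beta(1.0, alpha, size=K)          # stick-breaking proportions
w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
atoms = rng.normal(0.0, 1.0, size=K)      # atom locations from the base pdf

# Small alpha: mass concentrates on a few atoms (a chunky step function);
# large alpha: mass spreads thinly, and the draw looks more like the base pdf.
print(np.sort(w)[::-1][:5])               # the few largest weights
```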
