# Mommy? Where do likelihoods come from?

I’ve been thinking about this problem for a long time, especially with respect to problems involving physical models with uncertainty. Let me try to give you an example, and then maybe you’ll have some insight that I don’t.

Suppose you have a model involving a deterministic dynamical system, but with unknown parameters. To be concrete, suppose it's something like a model for the population of bacteria in an infected organ (I'm making this up, I don't actually have such a model). So you have something like

$$\begin{align}dP/dt &= DP(C,P,I)\\ dC/dt &= DC(C,CB)\\ dCB/dt &= DCB(t)\\ dI/dt &= DI(I,P,CB)\end{align}$$

The meaning of these symbols is: P is the population of bacteria, C is the concentration of some drug in the affected organ, CB is the concentration in the blood, and I is the intensity of the immune response. The functions DP, DC, DCB, DI are something we have hypothesized and are specified as our model, and they have parameters, like rate coefficients and so forth, which are unknown. Now we have some observations: say, a small number of cases where we track through time some laboratory indicators of P, C, CB, I for a few patients; and then, for many more patients, an initial treatment time (where C and CB = 0), an initial estimate of P and I, a treatment protocol (DCB as a function of time, basically the dosing protocol), and a final time and final resolution (i.e., at final time T_i we either observed complete recovery for patient i or we switched to some other, more aggressive treatment method). We'd like to estimate the coefficients of the model from the observations, so we write down Bayes' theorem:
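To make the setup concrete, here's a minimal sketch of such a model in code. The right-hand sides and rate constants below are entirely made up (hypothetical stand-ins for DP, DC, DCB, DI), and a crude forward-Euler step is used just to keep the sketch self-contained:

```python
# Hypothetical stand-ins for DP, DC, DCB, DI with made-up rate constants;
# a crude forward-Euler integrator keeps the sketch self-contained.

def step(state, coefs, dose_rate, dt):
    P, C, CB, I = state
    r, K, a, e, u, c, d, m = coefs
    dP  = r * P * (1 - P / K) - a * C * P - e * I * P  # DP: growth minus drug and immune kill
    dC  = u * (CB - C)                                 # DC: organ concentration tracks blood
    dCB = dose_rate - c * CB                           # DCB: dosing protocol minus clearance
    dI  = d * P - m * I                                # DI: immune response driven by P
    return (P + dP * dt, C + dC * dt, CB + dCB * dt, I + dI * dt)

def simulate(coefs, dose, state0, T, dt=0.01):
    n = int(round(T / dt))
    state, path = state0, [state0]
    for i in range(n):
        state = step(state, coefs, dose(i * dt), dt)
        path.append(state)
    return path
```

Everything interesting below is about what to do with the output of something like `simulate`, given data.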

$$P({\rm Coefs} | {\rm Obs}) = P({\rm Obs} | {\rm Coefs}) P({\rm Coefs}) / Z$$

where Z is the normalizing constant, which we'll ignore and handle via some kind of sampling procedure. Now, because I've built the model, I may have some idea about the prior on the coefficients $$P({\rm Coefs})$$: perhaps one rate is expected to be large, another might be near zero, and so on. I could come up with some mildly informative priors. This actually seems like the easy part.
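For concreteness, the sampling procedure that lets us ignore Z could be as simple as a random-walk Metropolis sampler. Here's a minimal sketch; the one-dimensional `log_post` in the usage below is a hypothetical stand-in for log-likelihood plus log-prior:

```python
# Minimal random-walk Metropolis sketch: samples from a density known only
# up to the normalizing constant Z (Z cancels in the acceptance ratio).
import math, random

def metropolis(log_post, x0, n_samples, step=0.5, seed=0):
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)       # symmetric random-walk proposal
        lp_prop = log_post(prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# Hypothetical usage with a standard normal log-posterior:
# samples = metropolis(lambda x: -0.5 * x * x, 0.0, 20000)
```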

What I am really confused about is how to come up with a likelihood $$P({\rm Obs} | {\rm Coefs})$$, and the reason is that the observations are a fairly complicated thing. Let's consider the various classes of observations:

- Time series: We could hypothesize some kind of measurement error for our instruments, and then construct $$P({\rm Obs} | {\rm Coefs})$$ by taking the given coefficients, running the model to the various time points, looking at the differences between the measurements and the model predictions, and applying the measurement error model (perhaps Gaussian, or perhaps skewed, discretized, or censored, depending on how we make our measurements). So here we're saying that the likelihood of a particular model outcome derives primarily from the measurement device.
- Time to outcome data: We take the dynamic model with the given coefficients, run it forward in time to the final observation point, and ask what the probability is that, given those coefficients, we would observe the given outcome (recovery, defined as P approximately 0, or non-recovery, defined as P significantly bigger than 0). However, given the deterministic nature of the model, the outcome is deterministic, so the probability is either 1 or 0, and we wind up with our prior multiplied by a square-wave type function. I suppose we could also come up with a measurement error model here: if we say the patient has recovered, then the actual population P is somewhere near zero with, say, an exponential distribution whose mean is a hyperparameter with a prior concentrated on small values; and if they haven't recovered, then there's, say, a lognormal distribution for P with unknown hyperparameters. We'd need to supply informative priors over the measurement error models given the observation (because it's not plausible for our estimation procedure to give us estimates that P is large if we gave the patient an "all clear" at the final time).
- Time to outcome data with dynamic noise: We could also hypothesize some dynamic noise driving the model so that the outcome is not deterministic (and there could also be uncertainty in the initial conditions, but the initial conditions are perhaps hyperparameters themselves, so that in a given sample run they are given). Even with the initial conditions fixed, there can be dynamic variability for a fixed sample of hyperparameters: perhaps one hyperparameter describes the distribution of drug quantity in pills, so that doses fluctuate at the 10% level each time a pill is taken; perhaps growth also depends on temperature, and the fever depends on the variable I and other random factors, so that we have a random forcing in the growth equation. Now, even given all the coefficient values, the outcome at the final time depends on the entire random path, so to get a likelihood for the observation given the parameters via a sampling approach, we need to run the model many times and estimate the probability distribution for the outcomes at the final measurement time. How good do our estimates from a finite number of sample paths have to be in order to get convergence of some kind of MCMC procedure? There doesn't seem to be an obvious way to get these "numerical likelihoods" except in the limit of very large amounts of computing time.
- Random forcing and random measurement error: If we run the model with random forcing as above, and our measurement device also has random error, how do we determine the likelihood $$P({\rm Obs} | {\rm Coefs})$$? Do we need to compare the distribution of model outcomes to the distribution of errors given the measurement? Is this basically a goodness-of-fit test? Some kind of Bayesian model selection procedure? I am confused. To be clear, suppose we have the random temperatures and pill concentrations, so that for a given set of coefficients we could run the dynamic model, say, 100 times and come up with 100 final populations, from which we can estimate a final population density function. But in addition we have the final measurement data and a measurement error model. So we want a probability that is essentially a function of two *distributions*. EDIT: I thought about this, and essentially it looks like we want $$P({\rm Obs} | {\rm Coefs}) = \sum_i P({\rm Obs} | {\rm Model Outcome}_i) P({\rm Model Outcome}_i|{\rm Coefs})$$ where $$P({\rm Model Outcome}_i | {\rm Coefs}) = 1/N$$ if we are drawing evenly weighted samples of model outcomes, or something else if we are doing an importance-sampling type of deal.
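The items above can be put together in a toy sketch: a Gaussian measurement-error density, a random-forcing simulation standing in for the dynamic model, and the evenly weighted mixture from the EDIT. Everything here, the toy dynamics and the numbers alike, is hypothetical:

```python
import math, random

def measurement_density(obs, outcome, sigma):
    # Gaussian measurement-error model: density of observing `obs` when the
    # true final population is `outcome`.
    return (math.exp(-(obs - outcome) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def sample_outcome(coefs, rng, n_steps=100, dt=0.1):
    # Toy stand-in for the dynamic model with random forcing: log-population
    # follows a drifting random walk.
    drift, noise = coefs
    logP = 0.0
    for _ in range(n_steps):
        logP += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return math.exp(logP)

def monte_carlo_likelihood(obs, coefs, sigma, n_paths=500, seed=0):
    # P(Obs | Coefs) ~ (1/N) * sum_i P(Obs | ModelOutcome_i), with evenly
    # weighted sample paths as in the EDIT above.
    rng = random.Random(seed)
    outcomes = [sample_outcome(coefs, rng) for _ in range(n_paths)]
    return sum(measurement_density(obs, o, sigma) for o in outcomes) / n_paths
```

Coefficients whose typical simulated outcome is near the observation score a higher likelihood than coefficients whose outcomes land far away, which is exactly the behavior the mixture formula is supposed to deliver.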

It gets even more complicated when the likelihood involves, say, a partial differential equation for 3D fluid flow, such as a model for the dispersal of radioactive pollution from the Fukushima power plant, where our data are pretty much a randomly sampled set of measurements at a small set of locations, but the ocean is big and turbulent! How many times do I have to run an entire ocean fluid flow model in order to get an estimate of the likelihood for *one* MCMC jump in the parameter estimation procedure?

Another way to introduce uncertainty is to create a micro-state model. In this case for example you might have each cell in a given location and concentrations which vary by location as well. Variables like P and C will be functions of the micro-state. Knowing measured values for these variables, even at different points in time, will not be enough to determine the micro-state. So there will be a residual uncertainty from that.

The differential equations will actually involve expected values (kind of like Ehrenfest's theorem in quantum mechanics). For example, instead of using P and dP/dt, use E[P] and dE[P]/dt (expectations here are computed over some probability distribution on the set of all micro-states). You could then get a likelihood by considering, say, the variance Var[P]. If these variances are always very small, then you get a situation like (2) in your post. If the variances are large, it will appear as if there is added noise or possibly a random forcing element.

Again, this uncertainty arises because of your ignorance of the micro-state not because of measurement error. You could include the measurement error as well, but it’s probably much smaller than your uncertainty about the micro-state.

The biggest problem you’ll likely have is that the micro-state model doesn’t naturally lead to relations between the expected values of interest, like E[P] and E[C], that can be expressed as nice differential equations. Maybe that’s a feature more than a bug though. It will force you to think harder about which variables you want to measure and the kind of interrelations they might have.

The microstate approach is one approach that leads to something like number 4 in my post. Random forcing comes from the instantaneous difference between the micro-state dynamics and the bulk averaged dynamics, and spatially varying parameters can also come out of that sort of interpretation if you have a PDE due to spatial variability in the microstate, but it still doesn’t tell us how to get a likelihood exactly.

What I mean by that is: even if we hypothesize some distribution of random stochastic parameters that account for micro-state variability, the distribution of predicted values at the various time points in general has no closed form, so P(Obs | Coefs) is only calculable via a numerical method. But none of the numerical outcomes will exactly match your observations, so you need some way to say what the likelihood of observing the measured values is, one that ultimately incorporates both the random variability in the process and the difference between the (distribution of) predictions and the actual measurements.

Daniel,

Yes, in some cases you can interpret the microstate approach as an example of random forcing. And you do run into the problem you mentioned if you take that view of it.

There is another way to go with the microstate approach though. Use the measured values as constraints:

P_measured = E[P]

Where the expected value is computed using some unknown probability distribution over the microstates. Then find this unknown probability distribution by maximizing the Entropy functional subject to the above constraints.
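A minimal numerical sketch of that recipe, for a finite set of hypothetical microstate values: maximize the entropy of p subject to sum(p) = 1 and E[P] = P_measured. The constrained maximization gives the Gibbs form p_i proportional to exp(-lam * x_i), and the multiplier lam can be found by bisection (the bracket [-50, 50] below is an assumption for illustration):

```python
import math

def maxent_distribution(xs, target_mean):
    """Max-entropy distribution on microstate values xs with E[X] = target_mean.

    The solution has the Gibbs form p_i ~ exp(-lam * x_i); since the mean
    is decreasing in lam, we can find the multiplier by bisection.
    """
    def mean_for(lam):
        ws = [math.exp(-lam * x) for x in xs]
        return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

    lo, hi = -50.0, 50.0  # assumed bracket for the multiplier
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_for(mid) > target_mean:
            lo = mid  # mean too big: increase lam
        else:
            hi = mid
    lam = (lo + hi) / 2
    ws = [math.exp(-lam * x) for x in xs]
    z = sum(ws)
    return [w / z for w in ws]
```

For example, on microstate values 0, 1, 2, 3, demanding a mean of 1.0 (below the uniform mean of 1.5) tilts the distribution toward the small values, exactly as the exponential form dictates.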

It's a bit of a mystery, and a bit controversial, why this works, but the fact is that it works in both equilibrium and non-equilibrium statistical mechanics, and most of the well-known probability distributions in regular statistics can be derived the same way.

So I guess the short answer to “where do likelihood functions come from?” is the same as it always has been: from maximizing entropy.
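As a concrete instance of that last claim: maximizing the entropy of a density on $$x \ge 0$$ subject only to a known mean $$\mu$$ reproduces the exponential distribution. The problem is

$$\max_p \left[-\int_0^\infty p(x)\ln p(x)\,dx\right] \quad \text{subject to} \quad \int_0^\infty p(x)\,dx = 1, \quad \int_0^\infty x\,p(x)\,dx = \mu$$

Stationarity of the Lagrangian gives $$\ln p(x) = -1 - \lambda_0 - \lambda_1 x$$, so $$p(x) \propto e^{-\lambda_1 x}$$, and imposing the two constraints fixes $$p(x) = \mu^{-1} e^{-x/\mu}$$.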

Yes, the maximum entropy principle seems to be a powerful one, even if it is a bit mysterious, especially with respect to continuous random variables. But it doesn't solve all the problems.

For one thing, it will work best when the measurements really are some kind of average over the microstates, so that the P_measured = E[P] assumption is good. If instead, for example, we take a tiny biopsy from an organ, then we're really just sampling one small region of tissue, so the bulk average over the entire organ could be significantly different from the measurement. In stat mech, a typical pressure measurement averages over huge numbers of molecular collisions (10^20 maybe).

Also, having the measurements as constraints is all well and good, but it doesn't respect the dynamics unless we somehow add the dynamics as constraints too. That is, for the probability density function at time t, let's maximize the entropy subject to the constraint that the current average value is P_measured, but also that the current state arose from a particular sample of the microstates at the previously measured time point VIA the dynamics described by the equations. That seems like a hard problem in general, and it will usually wind up involving some sort of numerical scheme, which will produce a discrete distribution that you can interpolate or density-estimate from, but then we're back to the original problem.

Sometimes it's not that important what the dynamics are, because, for example, they may lead to some kind of attractor (say, a Gaussian distribution). I think that's one reason why maximum-entropy-type approaches work well in statistical mechanics: Hamiltonian dynamics of trillions of molecules often leads to similar distributions of results, regardless of the starting point, after a relatively short period of time. That's not necessarily true for all examples.

The micro-state and entropy approach are good tools, I agree, but they aren’t the last word.

BTW: I really appreciate your input, and I’m glad to see that this topic is generating some interest 😉

It’s a very good problem which few people seem to know anything about and fewer still really get.

You're very right about what types of functions to use as constraints. Ones that represent some kind of total over all elements of the microstate are generally going to work best. In stat mech you start out by using the total energy of a gas (or equivalently "average energy per molecule," if you like), not the "x-component of momentum of the 457th particle." Note though that it is not actually wrong to use a variable of the latter type. You don't have to worry about whether the "P_measured = E[P] assumption is good". It's not an assumption in that sense, but merely a convenience used to generate a probability distribution which encodes the knowledge of P_measured.

The real problem with using variables of the latter type as constraints is that they don't contain much information. That is to say, they don't provide much of a constraint on the probability distribution, and you'd typically have to know so many of them as to be impractical. The kinds of variables you mentioned in your initial example are naturally "totals," though (for example, P can be thought of as the total number of bacteria in the microstate), and would probably work fine.

Your discussion of how to include the dynamics really hits the nail on the head. I believe there is actually room for a great deal of work here. There are generally two extreme cases. In one, we know the dynamics precisely, as in classical mechanics. Here you have to use the equations of motion to propagate any constraints back to a single point in time, then find the distribution on microstates for that point in time, and finally evolve that distribution forward. Both steps are mathematically horrendous even for simple cases.

In the other extreme you ignore the dynamics altogether. You may know the value of P over an entire time interval, either because you measured it or because it was constrained to take on certain values. Then you maximize the entropy at each point in time, with the constraints P(t)_measured = E[P(t)]. Here you don't really care how the microstate evolved, and the maximum entropy distribution will only change in time if the values P(t)_measured change in time.

An intermediate case between these two extremes is Brownian motion (or variously "diffusion", the "Fokker-Planck equation"). Brownian motion can be viewed as a maximum entropy problem that includes partial information about the dynamics.

Note that the same physical system can be treated in more than one way, depending on the question being asked. For example, consider a box half filled with gas, the other half a vacuum. Suppose the partition between them is then removed and the gas is allowed to fill the box.

If you wanted to know the details of how the gas sloshed about and filled up the box, you'd have to do the horrendous math of using the equations of motion to evolve the initial maximum entropy distribution forward in time.

If, on the other hand, all you want to know is the final pressure after a long time, you can get that answer much more simply by maximizing the entropy given what you know about the final state.

This flexibility introduces an interesting possibility. Suppose your initial model for the bacterial infection includes everything you know to be true and is mathematically horrendous. You can then think carefully about the question you’re really trying to answer. Then start removing some of the constraints. Each time you get rid of something, the problem becomes more mathematically tractable. The maximum entropy probability distributions you get after leaving out some constraints will likely be useless for most purposes, but as long as you can still estimate the one thing you care about (with a small variance) you’re golden.

Remember that leaving some constraints out of your model doesn't make it "wrong". It's still "correct" in the sense that all of the remaining constraints are valid. It's just that now the probability distributions will be more spread out and there will be more uncertainty than before. But again, as long as the one question you care about can be answered with a small variance, you don't really care about anything else.

That's the beauty of equilibrium thermodynamics. It can't answer the vast majority of questions you might have about an ideal gas, but the questions it can answer can be answered with a minuscule fraction of the effort needed to solve the equations of motion for 10^23 particles.

I've been thinking about carrying all this out explicitly for the predator-prey model. This is a good test example because it is simple, intuitive, and well known. It is also known by specialists to be pretty inaccurate for real-world examples.

So Joseph pointed me here in this discussion: http://www.entsophy.net/blog/?p=188, where he is pushing a microstate approach that ignores physical dynamics. The more standard approach (which is what the post here is asking about) involves specifying a model of physical dynamics while ignoring microstate. The likelihoods constitute the core of the model, which typically involves “random” physical dynamics of macrostate variables and “random” mappings from those to observables.

Here I put “random” in quotes because different people will disagree about how to interpret the word – some will take it to mean “nondeterministic”, but I prefer to interpret it as “not perfectly predictable from the quantities on which we are conditioning”. In the latter interpretation you can think of the likelihood as describing the best prediction (given as a probability density function) the modeller can come up with. Sometimes it’s pretty obvious what this should look like (which often makes the problem boring), but sometimes it is not – this is where the “art” (and the fun) of modelling comes in.