Andrew Gelman has been doing an all-statistics blogging stint. He brought up a point that I myself have thought about with respect to Statistical modeling and I figured I’d egg him on by commenting on it here. It’s really a conglomeration of several issues. Procedural vs Model based statistics, the classic “where does the prior come from” response which is “the same place as the likelihood (out of thin air?)”, and soforth.

The main issue is that there is a fundamental difference between using statistical models to inform mechanistic models, vs using purely statistical models to predict some process. This goes a little deeper than the distinction between sampling distribution and likelihood, but is related.

For example, suppose that I have a model of a dynamic process which is expressed in terms of an ODE, and the ODE has some unknown components, such as thermal conductivity or chemical reaction rates, or whatever. If we observe some data about the process, then we should be able to make some inference about the process based on our data, and of course that inference will have some uncertainty. Gelman himself supposedly got his start at doing statistical inference by inferring temperature inside some silicon apparatus from exterior measurements and finite element analysis (see his autobiography blog post).

Now if we’re measuring the diameter of bearing balls over and over, we can pretend that what we care about is inference about the average, and that our finite sample average is different from the true average according to the sampling distribution of the mean. This is straightforward pure probability theory. If we’re doing something more complicated like our ODE, then we don’t have observations of the thing we’re interested in (we’re not directly measuring the say heat conductivity) instead we have measurements of something that the heat conductivity affects (such as say the time series of a certain temperature). In order to make good quality inferences about the hidden variable, we need a good quality model of both how the hidden variable affects the measured variables, as well as how the hidden variable might be expected to vary from one experiment to another. Suppose that our ODE is well derived so that when we know the hidden variables very well, its predictions are very close to the measurements. Now how can we make inferences about the unknown variable from observations of the process? Here are some alternative methods:

1. Use the value of the hidden variable which minimizes the squared error of the time series measurements from the predictions. (Least Squares Procedure).
2. Use a similar procedure as above, but use a loss function or penalty procedure such as generalized M estimator procedure for robust classical type estimation.
3. Describe a model of the way in which the hidden variable is expected to be distributed, and a model of how the measurement machine’s errors are distributed, and use the value of the hidden variable which maximizes the likelihood of observing the various time series which were actually observed.
4. Add to the previous method a prior on the values of the various parameters, and calculate the posterior distribution of the parameters given the actual time series observed.
5. Do something else, there are a million other good ideas that can work.

Now with the ball bearing problem, generating a model for the sampling distribution of the population mean is a matter of using the central limit theorem, but with the various problems above, there is less guidance when it comes to talking about the “sampling distribution”. Indeed, the sampling distribution of what? If I give you a distribution on the unknown parameter can you give me a sampling distribution over the time series functions? What about when you include measurement noise? In practice, the time series is measured at a finite set of sample points, so we’re talking about a finite dimensional vector of output values, but we could in theory extend that to the infinite dimensional case. But we’re practical people here at the Models of Reality blog, so let’s just consider a discretely sampled timeseries. We’re asking what is the sampling distribution of a 1000 or more dimensional vector?

How about a likelihood or Bayesian procedure? If I know the form of the distribution of the unknown variable, then I can simulate the process many times and find a simulation based approximation of the distribution of the outcome variable, but I don’t know much about the distribution of the unknown variable, so maybe I create some hyperparameters, such as the coefficients of a polynomial chaos expansion, or the parameters of a gaussian mixture model or something. But there are layers upon layers here, all of the following things affect my measurements: the distribution of the unknown parameter, the distribution of the measurement errors, the auto-correlation of the measurement errors, and any error that exists in the specification of the mechanical model. Just deciding what you mean by a statistical model in this case is pretty complicated. In these kinds of cases, the most common thing is to stick with a simple procedure like least squares, but in that procedure we have some implicit model, such as that the unknown parameters vary in a gaussian way around some average from experiment to experiment, and the individual measurement errors in the time series are iid gaussians as well. This could be a serious mistake, as although measurement errors are almost “designed” to be gaussian (that is we build measurement devices so that their errors are quite small and unbiased and soforth) the variability in the underlying parameter has no reason to be any particular distribution at all.

Sometimes doing a good job of modeling is just too hard, or not worth the effort, but it does concern me here that there is not an obvious way to even know theoretically or philosophically what a good statistical model would be. Interestingly, when we have these multi-degree-of freedom systems, there are some probabilistic arguments related to theoretical notions called Concentration of Measure which tell us that some types of variation are essentially irrelevant. That is, the outcome of certain high dimensional functions of many random variables is basically going to be near to some average regardless of which values the various many random variables take on. I’m not sure this always helps though, it can mean that for example the likelihood or the mean squared error is basically independent of the individual values of certain variables, and therefore inference can be a hard problem (many different initial values will wind up all very near a similar final observation).