# Tips on fitting Bayesian MCMC based models

So at the moment I'm working on an interesting problem involving Bayesian inference on statistics that result from detailed simulations of molecular ensembles. The idea is to derive a bulk scale description of a certain phenomenon. I'm not going to discuss the simulations themselves, but have found a lot of difficulty in fitting the statistical models and now that they appear to be working better I thought I'd give some tips on what I've discovered.

One problem I have is that I have observed values that correspond to rho*A*dx namely the number of molecules in a certain slice. But I don't have measurements of rho the density or of A the effective cross sectional area. What I have is measurements of rho at a single end-time. So I can assume that the rho at a particular earlier time is distributed around the rho at the end time and try to infer the rho values. But the measurements of rho are not perfect, they count the number of molecules in a certain volume, but the volume is not guaranteed to be filled with molecules, it might bulge out of the material in some places, so when the volume is filled it's an accurate measurement of rho, but there are outliers. Technically the outliers should also skew left (toward smaller density) but I didn't use that information (yet).

Fitting rho as a normal distribution around the measured value is therefore problematic, something with a heavier tail makes more sense, and in fact when I decided to use a t distribution and allow the simulation to infer the degrees of freedom, it results in a better fit with a fairly small number of degrees of freedom (like 1-3). The fit is better for two reasons, the first is that the scale parameter becomes smaller, with the outliers a normal distribution can only fit the data by expanding the variance, but the variance is quite small among those measurements that are of high quality so a normal distribution expects too many values at intermediate distances. The second reason the t distribution fits better is that its location parameter is less sensitive to the outliers, so with the normal distribution it biases the location if you have a few large outliers.

The t distribution for measured density helps me infer an overall average density for all simulations. Using that overall average density, I then infer a time-averaged density for each simulation using a normal distribution. Using the inferred average density for the simulation I then try to infer the cross sectional area by calculating rho*A*dx/(rhoavg * dx) and averaging the resulting A values. The numerator is measured exactly, and dx is known exactly, but rhoavg has noticeable variance. Because of the variance in the denominator this appears to have heavy tailed errors as well. It's not a cauchy distribution because the denominator is not centered at zero, but it is heavy tailed. Again I decided to use a t distribution to model the average over multiple A values and it seems like although the simulation now converges more slowly, the fit is more reasonable.

Finally another thing I noticed is that it's easy to make stupid mistakes in complicated models. These mistakes range from the fact that JAGS (the MCMC sampler) likes scale parameters in terms of precision (1/variance) and it's easy to do something like 1/sd if you have a parameter describing the standard deviation whereas you should do 1/sd^2, or that the exponential distribution is parameterized in terms of rate, instead of scale exp(-r*x) vs exp(-x/s). To typos, to a couple of situations in which I had parameters that were nearly redundant.

For example I had in my model something like (1-a) * (B-c)*c where a,B,c are all parameters. B is generally big, but a and c are of similar small size. Asymptotically for large B/c the expression becomes (1-a)*B*c and therefore a and c are basically redundant except for a tiny second order c^2 term. The simulation wouldn't converge because it had a large flat region where as long as (1-a) and c varied inversely the posterior distribution was nearly exactly the same. Setting a to a fixed value based on physical principles and letting c vary allowed the model to converge.

In the end I will have an average A for each simulation, and an average rho for each simulation. Also I will have some parameters that describe how energy in the system depends on the size of the system (as determined by the cross sectional area A which is different for the different simulations). This is ultimately what I want, but all the intermediate inferences are "nuisance parameters" that must be fit in order to get an accurate picture of the energy parameters, but are not really that interesting in and of themselves.

Interestingly the symptoms of bad fit were

- The variance parameters became unreasonable and then densities were inferred that varied far too much from measured values
- Convergence took a long time, or maybe never converged if there were nearly redundant parameters.
- Changing the model to use more precise prior information didn't really help, the likelihood dominated because of the mis-fit and I couldn't really get convergence even with far too precise priors.

So if you have similar experiences, consider some of these ideas.