Joseph over at Entsophy has posted about some theoretical physics, namely the problem of what constitutes a Newtonian inertial reference frame. I have some comments over there that aren't terribly well developed, but I thought I'd ruminate here on this topic with a little bit of Nonstandard Analysis (NSA) flavor as well.

Suppose that we are in a spaceship somewhere in the solar system. Do we experience centrifugal forces? The answer seems to depend on our state of rotation, but rotation with respect to what? Newton more or less says to consider the distant stars as fixed, create a coordinate system relative to them, and determine if we are rotating relative to that coordinate system. In practice this produces a frame in which if we are not rotating, we experience essentially no centrifugal force (or so small as to be undetectable).

But this is a global procedure: it requires us to see the photons from extremely far-off stars, use those photons to determine where the stars appear to be relative to us, and then construct our coordinate system. Is it essential to the procedure that the apparent positions of far-off masses are involved?

Suppose instead we want to construct a coordinate system that is local to us, that doesn't involve masses that are enormous distances away. Can we do it? How?

Imagine we are conducting an experiment to observe some masses and determine whether they experience centrifugal (and/or Coriolis) forces relative to our coordinate system. Let's first define a time scale for our experiment: $t^*$ is the time it takes by our lab clock to conduct the whole experiment. To make this concrete, suppose we're interested in human-scale objects and processes, so $t^*$ is somewhere between, say, 1 second and 1 day. In any case, it isn't something like a billion years, over which we're supposed to observe galactic events and in which the far-off stars actually clearly move.

Next we will define a length scale. There is one obvious length scale, which is $r^*=ct^*$. This defines a communication horizon: motions we carry out after the start of the experiment cannot be felt by objects farther away than this until after the experiment is done. But we may be able to define another length scale which could be of use. Define the force applied by any object $i$ on our objects of interest $j$ as $F_{ij}$. Now define $r^+$ as the smallest radius such that $\max_i F_{ij} \cong 0$ and $\sum_{k \in K} F_{kj} \cong 0$ for every subset $K$ of objects beyond distance $r^+$, where $i$ likewise ranges over objects farther away than $r^+$. Of course, in reality physical measurements are always representable as standard numbers, so the use of infinitesimal nonstandard numbers must be interpreted in terms of quantities too small to measure. The threshold is different for different measuring instruments, but we take it for granted that for any given laboratory there is some unspecified level below which forces will be too small to measure. For example, good luck measuring the gravitational attraction of two electrons in a lab here on earth.
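As a crude numerical illustration (all numbers and the measurability threshold are my own made-up values, and this sketch checks only the single-body condition, not the sum over subsets):

```r
# Toy sketch of the r+ construction: gravitational force per kg of test
# mass from a few bodies, and the largest distance at which any single
# body still exceeds an (assumed) instrument floor.
G <- 6.674e-11                              # gravitational constant, SI
m <- c(5.97e24, 7.35e22, 1.99e30, 2.2e30)   # earth, moon, sun, alpha Cen (kg)
r <- c(6.4e6, 3.8e8, 1.5e11, 4.1e16)        # distance from the lab (m)
force_per_kg <- G * m / r^2                 # N per kg of test mass
threshold <- 1e-12                          # assumed instrument floor (N/kg)
r_plus <- max(r[force_per_kg > threshold])  # ~1.5e11 m: the sun is inside
                                            # r+, the nearest star is not
```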

So we have a length scale that defines a horizon beyond which no individual object produces any appreciable force, and the sum of any group of objects produces no appreciable net force. This is, I think, a stronger horizon than $r^*$. Why? Well, for example, I might have some delicate apparatus here on earth that measures the motion of tiny, slightly charged droplets of water in still air. The whole experiment might take, say, 1 minute, but I could imagine being able to just barely detect the tidal effect of the sun in this experiment even though the sun is 8 light-minutes away. That is, the retarded field of the sun is still relevant even if the sun doesn't feel the wiggle of my droplet's motion until another 7 minutes go by.

Nevertheless, this radius is distinctly local, in that it's much smaller than the distance to the nearest galaxy for example. In fact, if we can define some cosmic radius $R$ as encompassing essentially all the visible mass in the universe, then $r^+/R \cong 0$ for our purposes.

Now, we can observe our objects doing their dynamic dance, whatever they are; perhaps we're in a spaceship and we are spinning a bucket of water, or watching charged particles interact, or playing catch-the-stick with a quadrocopter (see link for YouTube example). Overall, we'd like to determine from what point of view the dynamics occur without any superfluous centrifugal or Coriolis accelerations. To figure this out, we can make reference to the positions of any masses within our horizon (but not to the "distant fixed stars"). The important thing here is that we need a notion of "superfluous". Clearly, if we're here on earth and want to study a spinning bucket of water, we will say that there are centrifugal forces involved. But if we're doing it while seated on a merry-go-round, there will be additional rotational effects involved; it would be simpler to view the whole thing from the ground rather than from the merry-go-round itself. If we're watching the weather, it might be simpler to describe the forces from the point of view of a coordinate system that doesn't rotate with the earth around its axis of spin. For example, a coordinate system centered at the center of mass of the earth, with one axis pointed from the center of mass of the earth toward the center of mass of the sun, another being the axis of the earth's rotation, and the third perpendicular to the other two. (Yes, I know the earth also orbits the sun, but for the sake of argument perhaps that effect is infinitesimal on our time scale.)

So, clearly, the notion of a "preferred" or "inertial" reference frame is related to the complexity of the forces involved. We can fix a coordinate system in our lab, and then determine a rotating coordinate system relative to the lab coordinate system by specifying a vector $\Omega$ around which our inertial coordinate system rotates, with the length of this vector giving the rate of rotation. If we orient our measurement instrument along that axis and rotate it at the appropriate angular velocity, all the dynamics will be "simplest" in terms of rotational forces. Perhaps we can define the net rotational force as something like $\sum_i |C_i|$, where the $C_i$ are the individual centrifugal forces.

The next question is: in this coordinate system, what is the distribution of angular velocities of all masses within our local shell? If this procedure essentially constructs a zero-mean angular velocity coordinate system for the objects that have appreciable effects on our observed experiment, then when you expand your shell to the whole universe it's not surprising that the distant stars (which make up the vast majority of the universe) are "fixed".

For the moment, that's as much time as I have to devote to this topic, but perhaps I'll come back or get some interesting comments here.

I had to postpone Jury Duty with the county of Los Angeles this week. Things worked out fine and now it won't interfere with family reunions or my wife's biology conference. But in thinking about Jury Duty and having to take the online orientation, it became clear that we need to fix the economics of this system.

Here in LA County, jurors are paid $15 each day after the first day that they are on duty. Now, this is clearly a slap in the face. Jurors are selected at random from a large pool of people. The average productivity of a given person is going to be at least on the order of the GDP/capita/working day of the US. It's going to be slightly higher in LA than in other parts of the country because the cost of living is higher, so there's a bias there, but order of magnitude that's going to be pretty much correct. How much is that? According to Google, GDP/capita of the US in 2012 is about $50,000.

There are more or less (365*5/7 - 5*3) working days in the year. So the figure we're talking about is $200/day or so. $15/day is about 7.5% of the societal cost of compelling jury service.
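A quick sanity check of that arithmetic (a sketch, using the rough figures above):

```r
# Rough working days per year: weekdays minus ~15 holidays
working_days <- 365 * 5/7 - 5 * 3       # ~246 days
gdp_per_capita <- 50000                 # US GDP/capita, 2012, in dollars
day_rate <- gdp_per_capita / working_days
day_rate                                # ~ $203/day
15 / day_rate                           # ~ 0.074, i.e. about 7.5%
```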

Now, there's a definite benefit to jury service, in that we have laws that are enforced and are able to avoid all sorts of societal problems, but the jury costs are borne by either the jurors or their employers, whereas the benefits accrue to everyone.

Furthermore, if you're going to have a 5-day trial with 12 jurors and 3 alternates, you're going to eat up about $15,000 in productivity, whereas if the trial is a civil trial, maybe there's only about $9,000 worth of unpaid rent or something at stake. Society loses on net just on the lost productivity of jurors.

So, my proposal is this: courts pay jurors GDP/capita/working day for any day they are called to the court (including the first day). Courts raise this money by charging court fees to the losing parties in civil lawsuits in courts other than small claims court. This means it doesn't pay to go to court over amounts that aren't significantly larger than this basic cost, reducing abuse of the court system. And small claims courts would have their maximum dollar amounts raised to 15 × 5 × GDP/capita/working day (about $15,000) to handle disputes that aren't worth a jury trial. The fees would have to be high enough that they also cover the jury costs of criminal trials, where the losing party likely isn't going to pay, though fines could be assessed as well in those cases.

I think under this system, far fewer people would duck jury duty with contrived excuses, and the quality of juries would probably rise as well.

The vast majority of middle-class residences have at least one WiFi router these days, usually connected to cable or DSL internet service. These services work a lot better and cheaper than doing everything over your phone's data connection, but they are inconvenient without wireless distribution throughout your house. Many people get a WiFi router with their cable/DSL modem setup, and a few buy an after-market one and set it up themselves. Most of these devices come with some default settings, and enough people just keep those defaults that you will frequently see network IDs (ESSIDs) that look like "linksys" or "NETGEAR" or some other default name. (In the following I assume mostly US-based rules and regulations.)

Unfortunately, WiFi doesn't have much in the way of non-interference standards. For example, routers out of the box don't periodically search for the "best" channel and switch to it. Doing so would cause a brief interruption of service, but it wouldn't be that hard to take samples of channel usage throughout the day and, whenever a very low-traffic moment arrives, switch to the "best" channel. Few people would notice a 1-3 second downtime at 2am, for example. There's the possibility of some interesting and strange iterative dynamics with such a scheme, I suppose. As it is, I see routers tuned to all sorts of channels. For 2.4 GHz there are channels 1-11 in the US, and I can see 1, 3, 4, 6, 7, 9, and 11 near my house. Few everyday non-tech-geeks know that there are actually only 3 independent channels. These are (if you're a geek, say it with me) ... 1, 6, and 11 (this is different for the newer 5 GHz band, where channels range from 36 up to 165 and some are forbidden). Channel 1 interferes with 1-5 but not 6-11, channel 6 interferes with 2-10 but not 1 or 11, and channel 11 interferes with 7-11. Some other countries allow extra channels 12-14 (channel 14 only in Japan, and only for 802.11b), which makes a fourth non-interfering channel possible. In some circumstances where there's sufficient separation between the stations, 1, 5, 9, 11 can be used with only minimal interference, and the added channel capacity makes up for it. That kind of deployment requires that you have control over all the stations, like inside a hotel or shopping center, since other people are not likely to choose the right channels.
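Here's a toy helper (my own sketch, not anything from a WiFi standard) encoding the overlap rule above: 2.4 GHz channel centers sit 5 MHz apart and each ~20 MHz-wide signal spills roughly 2 channels to either side, so channels fewer than 5 apart interfere.

```r
# TRUE if two 2.4 GHz channels overlap (centers 5 MHz apart,
# each signal ~20 MHz wide, so spacing < 5 means overlap)
channels_overlap <- function(c1, c2) abs(c1 - c2) < 5

channels_overlap(1, 6)   # FALSE: safe to use together
channels_overlap(6, 9)   # TRUE: they interfere
channels_overlap(9, 11)  # TRUE: even 9 and 11 overlap somewhat
```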

Another thing access points don't usually do is automatically scale back their power output. In fact, manufacturers tout their high power outputs, and they typically come off the shelf configured to max out the power. High power is useful in a point-to-point link where two wireless devices have directional antennas pointed at each other and are trying to send data over long distances, from a few hundred meters to kilometers in some cases. But in a typical single family home or apartment complex, high power output is a nightmare that doesn't improve the range, but does dramatically increase interference. In one example, a wireless networking conference found that their hotel WiFi was terrible, and the biggest fix was to go around turning off access points and turning down power output until the interference was eliminated.

The same thing goes for home WiFi. I was recently trying to improve my WiFi performance and found that I could turn my access point down to 13 dBm and still get just as good performance throughout my house as when I had it cranked to the default of 18 dBm. That's about 1/3 of the maximum power output. Sure, it meant that my signal wasn't FULL BARS at the street in front of my house, but that's actually a huge advantage for everyone. I don't need full-bar streaming at my curb, and across the street from me, they don't need my WiFi router interfering with their laptop. In fact, my across-the-street neighbors have their power cranked up in such a way that on my front porch their signal is roughly -83 dBm, which is a respectable signal strength. They might be able to sit on my porch and stream their Netflix if their laptop's WiFi were powerful enough.
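For reference, dBm is a logarithmic scale, so the 5 dB reduction above is roughly a factor of 3 in transmitted power. A quick conversion sketch:

```r
# dBm to milliwatts: P_mW = 10^(dBm/10)
dbm_to_mw <- function(dbm) 10^(dbm / 10)
dbm_to_mw(18)                   # ~63 mW (the default setting)
dbm_to_mw(13)                   # ~20 mW (the turned-down setting)
dbm_to_mw(13) / dbm_to_mw(18)   # ~0.32, about 1/3 of the power
```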

Perhaps consider discussing this article politely with your closest neighbors if they aren't using channels 1,6,11 or are using a default high power output and leaking a lot of signal into your dwelling.

I submitted my paper Grain Settlement and Fluid Flow Cause the Earthquake Liquefaction of Sand to Proceedings of the Royal Society A and just received reviewer comments today. This preprint summarizes what I've found in my PhD studies with regard to how soil liquefaction works. (It also has some more details on the derivation.)

The key finding is that, contrary to every textbook definition of liquefaction, fluid flow is not ignorable; in fact, fluid flow dynamics is a key component of how liquefaction occurs. The standard description of liquefaction involves "undrained" contraction of soil grains during shaking causing an increase of fluid pressure, and a transfer of total stress from the grain-grain contacts to the water pressure. If the water pressure gets high enough, the grains carry no stress and hence cannot transfer shear stresses. At this point the slurry flows like a liquid.

Although it's been known that "void rearrangement" could occur (in which grains move one way and water moves the other), it was treated as a special case of soil liquefaction behavior. However, centrifuge-scale experiments and specialized tabletop experiments performed in the last decade have called this into question, and the standard triaxial tabletop experiments involving 10-20 cm sized samples of soil require enormous deformations, unrealizable in situ, before water pressure exceeds total vertical stress. Why is soil so hard to liquefy in a triaxial machine?

Tabletop triaxial experiments involve sand and water wrapped in a completely impermeable rubber membrane, so that the total mass of water inside the membrane remains constant. In the soil, water can conduct from one place to another through the inter-grain spaces. The standard description assumes that such conduction is slow compared to the duration of an earthquake, and this assumption is so strong that some researchers have actually tried to compensate for the bulging of the membrane in these tabletop experiments under the assumption that even the slightest conduction of water causing the membrane to bulge would be more than the actual conduction in-situ.

The truth may be hard for the geotech community to swallow. My analysis shows that for typical loose sands, water pressure transmission via fluid flow occurs over tens of meters on timescales substantially less than a single earthquake loading cycle. In the paper I derive a dimensionless equation that describes the phenomenon in the top ~10 meters of surface soil, and I show that the equation can be solved for equilibrium conditions to determine the water pressure field approximately. We can use an equilibrium solution rather than a dynamic time-varying one because equilibration occurs so quickly compared to the earthquake duration. I then solve the equation for a variety of special cases that are relevant to actual physical experiments and show how the equation predicts the qualitative results of those experiments.

Suppose we'd like to design some system to perform in an "optimal" way. One intuition I've often had is that "optimal" is rarely worthwhile in real situations. It is difficult to specify a criterion exactly enough that the "optimal point" under your criterion is actually "optimal" in any meaningful sense. Instead, it's better to settle for "good" or, if you're lucky, "pretty darn good". In other words, once you've climbed most of the way up the mountain, it really doesn't make much difference whether you are standing on the very top, especially if your surveys can't really tell you which of the little peaks is the highest one.

With that in mind, suppose we have a Bayesian model for the probability that some particular tradeoff function is truly our "favorite" tradeoff. To make it semi-concrete, suppose we have to choose amounts to consume between two quantities, say Beer (B) and Nachos (N). We have a certain amount of money, all of which we are going to spend. We can spend it in such a way that $P_B B + P_N N = D$, so that there is a line in $B,N$ space that defines the reachable region.

Now in simple micro-economics, the solution is to specify a utility function $U(B,N)$ and maximize $U$ subject to the constraint on $B,N$.

But, in reality, we don't know how to trade-off $B,N$, there is no utility function. At best, we can sort of guess at how happy we'd be with any given position on the $B,N$ line.

Instead, suppose we specify some probability distribution over different $U$, $p[U]$. This is a distribution over functions that specifies the region of $U$ space (say coordinates in a Hilbert space) where we think our true preferences lie. But it's now impossible to say with certainty what exact values of $(B,N)$ are best. We could try to maximize expected utility, we could maximize the utility under the most-probable $U$, we could do a lot of things, but another way to think about this is that we could choose a range of $U$ which are in the high probability region, and find a range of $(B,N)$ points that are all "good" relative to all the $U$ in that region, and then any of those $(B,N)$ points are going to be good choices.
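As a toy illustration of "ranges of good points" (all numbers and the utility family here are my own made-up choices, not anything canonical): with Cobb-Douglas utilities $U(B,N) = B^a N^{1-a}$, the optimum on the budget line spends a fraction $a$ of the budget on beer, so a prior over $a$ maps directly to a range of defensible purchases.

```r
# A sketch: uncertainty over the utility tradeoff exponent a translates
# into an interval of "good" beer purchases rather than one optimum.
set.seed(1)
P_B <- 5; P_N <- 3; D <- 60          # made-up prices and budget
a <- rbeta(10000, 5, 5)              # prior belief about the tradeoff a

# Cobb-Douglas optimum: spend fraction a of the budget on B
B_opt <- a * D / P_B
quantile(B_opt, c(0.1, 0.5, 0.9))    # the whole middle range is "good"
```

Any $B$ in that central interval is nearly optimal under some high-probability $U$, which is the sense in which hunting for the single optimum stops being worthwhile.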

My intuition is, generally in real problems, this version of the problem is going to be easier to solve than finding the one-true-optimum point of a well specified $U$. There are two issues.

Often in applied mathematics, the utility function is chosen to make things relatively easy to solve. Such a utility function doesn't necessarily incorporate all the things we know about the real problem. Once we do incorporate all those things, we'll often have a non-smooth, weird utility function (think of making investment decisions under accurate tax-code rules, where there are weird break-points and eligibility requirements and so forth).

On the other hand, with probability in the mix, we can't be sure of any "one true value", but we can now compare "good values" across "good models". There are two scales of error: the difference between a "good" value and "the optimal" value for a given utility function, and the difference between "the optimal" value under one utility function and "the optimal" value under another. It makes sense to stop trying to find "the optimum" of a given utility once the scale of your search progress gets below the scale of the differences between different utility functions. I think often solving a bunch of related optimizations in a very approximate way will give you a lot more information about what the good values are than fooling yourself into believing you've specified a real utility function and then solving that (wrong) optimization to perfection.

Whoo, the title is a mouthful! Again, in discussions with Joseph over at the Entsophy blog some interesting issues in the philosophy of statistics were raised. It has me thinking about the following issue:

Suppose that you have some data $D_i$ and some measured covariates $x_i$ and some brainpower. You would like to understand the process that produces $D$ in terms of things you know about the state of the world $x$ and things you assume about the state of the world, your model $G(x)$ (I'm using G because we'll want to distinguish between probabilities $p(D)$ and frequencies in ensembles $f(D)$). So you are going to use the following formulation:

$$D_i = G(x_i) + \epsilon_i$$

where $\epsilon_i$ has a Bayesian probability distribution associated to it. Actually, technically there's a joint distribution over all the epsilons, $p(\epsilon_1,\epsilon_2,\ldots,\epsilon_N)$. For the moment, pretend we haven't thought about it yet.

So, you build your functional form for $G$ and it has some unknown parameters $q_i$ about which you have some general knowledge, say order-of-magnitude estimates, from which you can build priors that approximate that knowledge, $p(q)$. Now you need to do some inference on $q$ to find out what reasonable values are for those unknown parameters. Using Bayes' theorem, you write down:

$$p(q|D,x) \propto p(D|x,q)\,p(x|q)\,p(q)$$

[ Edited: changed this to condition on the observed covariates $x$, and will let the measurement error in $x$ enter as $p(x|q)$ with the parameters $q$ including the unobservable measurement errors in the measurements of $x$]

And you're off to the races doing MCMC, right? Not so fast! What the heck is $p(D|x,q)$? The usual jargon is "the likelihood", and, well, the predictions from $G$ given $q$ are well known, so the only thing we're not sure about is how likely we are to see certain differences $\epsilon_i$ between the data $D_i$ and the predictions $G(x_i)$. In other words, we need to specify the Bayesian probability distribution on the $\epsilon$ values.

Well, first let's consider the "usual" answer. The usual answer is something like "the data are treated as IID draws from distribution 'foo'" (often normal, binomial, Poisson, or something else fairly convenient). The next thing after that is often some kind of heteroskedastic version, where the $\epsilon_i/\sigma_i$ are all IID and $\sigma_i$ is some kind of known measurement error scale that comes from instrument calibration data.
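As a minimal sketch (my own illustration, with made-up residuals) of what those two "usual" likelihoods look like computationally:

```r
# Log likelihood of residuals eps = D - G(x) under the two usual choices
loglik_iid    <- function(eps, sigma)   sum(dnorm(eps, 0, sigma,   log = TRUE))
loglik_hetero <- function(eps, sigma_i) sum(dnorm(eps, 0, sigma_i, log = TRUE))

eps <- c(0.3, -1.1, 0.4)          # made-up residuals
loglik_iid(eps, 1)                # one common error scale
loglik_hetero(eps, c(0.5, 2, 1))  # per-point calibration scales
```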

But, from a Bayesian perspective, the joint distribution over the $\epsilon$ values is really a belief about how closely the prediction equation will match the data in each circumstance, and it could be pretty much any probability distribution. The "frequentist" notion that there is a "data generating process" like IID normal draws is backwards; the actual direction of implication seems to be: (knowledge about model performance and measurement properties) $\rightarrow$ (belief about errors when $q$ values are "reasonable", i.e. a mathematical specification of the likelihood) $\rightarrow$ (belief about numerical values for "reasonable values" of parameters $q$, typically through MCMC procedures) $\rightarrow$ (ensemble of measured errors under high-probability $q$ values; this ensemble will be constrained to look like draws from some single distribution, typically mean 0 and maybe having other Maximum Entropy-like constraints such as standard deviation, quantiles, or other structure).

The "frequentist influenced" Bayesian method goes more around in a circle: first assume a data generating mechanism (the final distribution above) then use that to produce an "automatic" likelihood (back to the likelihood at the beginning of the above chain), then use the prior and the likelihood to get the high probability parameter values $q$ via MCMC or whatever (you could do the "full frequentist" version of this by leaving off the priors and simply doing maximum likelihood for example).

In other words, in the fully Bayesian world, depending on what we think the model and measurement devices ought to do (how we specify the likelihood), we will get different distributions on the parameters. Then, when we treat the parameters as random variables and observe the discrepancies, we will get an ensemble of discrepancies which look like IID draws (or dependent draws, if you specify some dependency structure) from some distribution, but the distribution they look like draws from need not be the one we used to define the likelihood. It is often the case that we can use a simple IID "frequency" model as the Bayesian probability model for defining the likelihood and get, in the end, an ensemble of errors that looks like it came from that IID distribution we assumed. This is in essence a special case, but that special case is exactly the one that Joseph is complaining about in his "jumped the shark" post. In other words, there is no "correct" likelihood that corresponds to actual facts about the world. The "data generating process" doesn't really exist, so we can't be right or wrong about it, and confidence intervals derived from assumptions about the "data generating process" need not have real coverage properties, because the world need not behave as if the "data generating process" assumed by a frequentist statistician were actually correct.

I was reading and commenting on the Entsophy Blog today and I thought up a little question about the Gaussian distribution in relation to his theory of what a probability distribution means.

Suppose we have some measurement device, and we know it has a resolution. Say a wooden ruler whose smallest mark is 1 mm. If we read 32 mm as the length of something, then we know that the real length is more or less $32 \pm \delta$ mm. We also have an intuition about $\delta$: it should be around, say, 1.

We could model the real length as uniformly distributed between 31 and 33 mm, or maybe between 30 and 34 mm or maybe... We don't really want to completely exclude any given distance. Since we don't want to exclude any distance, we maybe choose a gaussian distribution with $\sigma = 1$.

Let's interpret the unit gaussian distribution as a mixture of different uniform distributions centered on 0. So we have a half-width $w$, and the height of a uniform distribution on $[-w,w]$ is $1/(2w)$. Then, if we're using a gaussian to express the idea that we'd like to put a fixed bound on how far from zero something can be, but we just don't know exactly what that fixed bound should be, we're talking about a mixture of uniforms. At any given value $x$, the density of the mixture should be $\exp(-x^2/2)/\sqrt{2\pi}$. Since uniforms only contribute if their half-width is at least $x$, we have (assuming $x > 0$, and writing $a(w)$ for the mixing density over half-widths):

$$\int_x^\infty \frac{a(w)}{2w}\,dw = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$$

So we want to find out what this equation means about the mixing distribution $a(w)$. Playing around in Maxima results in the fact that $a(w) \propto w^2 e^{-w^2/2}$, which when normalized happens to be a chi distribution with 3 degrees of freedom. A chi-distributed random variable with 3 degrees of freedom is always positive and has its high-probability region in the range of around $[0.5,2.5]$ or so. In other words, we don't know exactly how wide our finite uniform interval should be, but if we let it be an unknown positive (chi-distributed) quantity which won't often be much bigger than 2 or 3, we can recover the normal distribution as a model for measurement uncertainty. In this sense, the normal distribution has a kind of "bounded and O(1)" quality to it even though it's not actually bounded.
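A quick numerical check of this claim (a sketch in R; the chi-3 density is $\sqrt{2/\pi}\,w^2 e^{-w^2/2}$):

```r
# Mixture density at x from uniforms on [-w,w] with chi(3)-distributed
# half-width w; should reproduce the standard normal density dnorm(x).
mix_density <- function(x) {
  integrand <- function(w) sqrt(2/pi) * w^2 * exp(-w^2/2) / (2 * w)
  integrate(integrand, lower = abs(x), upper = Inf)$value
}
xs <- seq(-3, 3, by = 0.5)
max(abs(sapply(xs, mix_density) - dnorm(xs)))   # ~1e-9: they agree
```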


I've been meaning to update my blog several times now in the last month, but I've been having mysterious connectivity issues. I have a mixed ipv4/ipv6 network at home, and by default it connects via ipv6. The blog has an ipv6 address, and so far most of the time it just doesn't respond properly to requests originating from my home network. I can see my blog when I'm away, but from home I just can't see it. This has been causing problems for me, hopefully not for you. Suddenly today I have access again. I am going to try something in the next few days to see if it affects anything. After that if I have solved the problem hopefully I can get back to regularly scheduled blogging.

So, I'm trying to fit my statistical model. I have real data, and I have an ODE that predicts the results of the real data. Both have strong periodic components. The ODE is fixed via dimensional analysis to have period 2, or maybe $2+\epsilon$ for some small epsilon; I haven't thought about whether the perturbation terms perturb the frequency, but in any case it's very close to 2. There's a statistical parameter, a timescale for each data series, by which I rescale the real data's time points to try to match this fixed period; this tells me how long a certain meaningful process takes in my real data. In MCMC, the period will inevitably not be dead spot-on. Whenever you subtract two periodic signals from each other, you wind up with a signal oscillating at close to the average frequency, modulated by a slow envelope whose frequency is set by the difference of the two frequencies.
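For reference, this is the standard beat identity. For two unit sinusoids,

$$\sin(\omega_1 t) - \sin(\omega_2 t) = 2\cos\!\left(\frac{\omega_1+\omega_2}{2}\,t\right)\sin\!\left(\frac{\omega_1-\omega_2}{2}\,t\right),$$

a fast carrier at the average frequency times a slow envelope. When $\omega_1 \approx \omega_2$ the envelope period is enormous compared to the carrier period, which is why even a tiny period mismatch produces a visible low-frequency error signal.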

I've been trying to model the errors as gaussian processes to form a likelihood, but I've been ignoring this periodic component. If the periods are exactly aligned, the signal goes away, but even a tiny fraction of a percent different and the signal shows up. I should really allow the errors to have this periodic component but constrain its size to be small. I'm working on that now.

One thing that's important is that near the "peaks" the model should fit much better, whereas near the "troughs" I expect it to fit less well, but it's not very important. So I need to anchor the gaussian processes to have small total variance near times 0,2,4... while maintaining the periodic signal and some high frequency noise. Gaussian processes are pretty flexible fortunately.
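Here's a sketch (my own construction, not from any package) of a covariance function with that shape: a periodic kernel with period 2, plus a short-length-scale term for the high-frequency noise, multiplied by an envelope $\sigma(t)$ that shrinks the variance near $t = 0, 2, 4, \ldots$:

```r
# Envelope: small near the "peaks" t = 0, 2, 4, ... where the model fits well
sigma_env <- function(t, s_min = 0.1, s_max = 1) {
  s_min + (s_max - s_min) * sin(pi * t / 2)^2
}

# Covariance: periodic kernel (period 2) plus a high-frequency noise term,
# scaled by the envelope at both time points
k <- function(t1, t2, ell = 0.5) {
  per   <- exp(-2 * sin(pi * abs(t1 - t2) / 2)^2 / ell^2)
  noise <- exp(-(t1 - t2)^2 / (2 * 0.05^2))
  sigma_env(t1) * sigma_env(t2) * (per + 0.1 * noise)
}

t <- seq(0, 6, by = 0.1)
K <- outer(t, t, k)   # covariance matrix for the error process
```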

I have a model where a multidimensional ODE predicts the statistical dynamics of a very high dimensional ODE. In essence I'm predicting the dissipation of wave energy in a molecular system. In order to test my theory I want to fit the coefficients of my ODE to actual data statistically sampled from big computational runs of the molecular dynamics.

The behavior I want to constrain is the proper dissipation of the total kinetic energy carried by the wave. Once I have that, I want to show that the mechanism behaves appropriately by comparing the time history of the velocity of a single degree of freedom. So what I need is to fit the coefficients of my model to match the energy decay. I have informative priors for the size of the coefficients from theoretical analysis.

The biggest problem is how to fit this timeseries. As with almost all timeseries, if two timepoints are nearby then they have similar statistical errors. If my model predicts too much energy at time $t$ then it will probably predict too much at $t+\epsilon$ for "small" $\epsilon$.

The "obvious" way to deal with this is via a gaussian process, in other words we impose a covariance structure on the errors based on a covariance function which compares how close in time two measurements are. The problem with the straightforward application of gaussian processes to the errors, is that in this situation there are definitely some complex processes in the molecular dynamics which are not modeled by my ODE. This leads to outliers (for example in one run the molecular dynamics became unstable right before the end of the run, and energy shot off to orders of magnitude more than it really should have been. In other runs the initial conditions probably caused melting or recrystalization in the first few percent of the timeseries resulting in a dramatic loss of kinetic energy). I am not interested in these modeling errors for my purposes, I know they exist, I simply want to show that in many cases they aren't important and focus on the cases where my simple model works. One way to do this is to specify that the errors have fat tails, like a t distribution. After an off-topic discussion on Andrew Gelman's blog with a helpful commenter, the current error model looks like this.

$$E_i = PE_i - AE_i$$

where $PE$ is predicted energy and $AE$ is actual measured energy.

$$u_i = \mathrm{pt}(E_i/s,\,d)$$

where $s$ is an error scale and $d$ is the degrees of freedom ($\mathrm{pt}$ being the t distribution's CDF, in R notation); both $s,d$ are hyperparameters of my bayesian statistical model, and the $u_i$ are now assumed to be uniform.

$$x_i = \mathrm{qnorm}(u_i)$$

If the $u_i$ really are uniform (i.e. if the t distribution fits the marginal errors well), then the $x_i$ are gaussian with unit standard deviation and mean zero. I now specify that the $x_i$ have a gaussian process structure; that is, they are related by how close in time they are, via the $t_i$ values associated to the data points.

To get the likelihood of some observed $E_i$ we, as usual, want to specify the log-density of the vector of $E_i$ values. To do this, we need to go through the CDF. The CDF of a vector is a weird function: it's a multi-argument function (one argument for each dimension of the vector), and it specifies the probability that the random variable is less than the observed value in each dimension. The dependency structure can be thought of in terms of a copula $C(u_1,u_2,u_3,\dots,u_n)$. The $u_i$ values are all marginally uniform, but the joint distribution is not uniform over the hypercube. In fact $C(u_1,u_2,u_3,\dots,u_n) = \mathrm{pmvnorm}(x_1,x_2,x_3,\dots,x_n; \sigma)$ with $\sigma$ coming from the gaussian process covariance function. Hence, this $C$ could be called a "Gaussian Process Copula" and the process is called a gaussian copula process.

To get the likelihood of a particular vector of $E_i$ we can differentiate

$$F(E_1,\dots,E_n) = \mathrm{pmvnorm}(\mathrm{qnorm}(\mathrm{pt}(E_1)),\dots,\mathrm{qnorm}(\mathrm{pt}(E_n));\,\sigma)$$

When we differentiate this thing we get $\mathrm{dmvnorm}(x_1,\dots,x_n; \sigma) \prod_i \frac{\partial x_i}{\partial E_i}$, and since $x_i = \mathrm{qnorm}(\mathrm{pt}(E_i))$ we have $\frac{\partial x_i}{\partial E_i} = \frac{\mathrm{dt}(E_i)}{\mathrm{dnorm}(x_i)}$. Using the mvtnorm package in R, I can calculate the log of all this mess and hand it to the mcmc package to do Bayesian inference. Running last night, it seems to have produced a significantly better-fitting model that can move through parameter space much more easily, apparently because the error dependency makes the landscape less rough.
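For concreteness, here is a sketch of that log likelihood in R (my own reconstruction of the above, making the scale $s$ explicit inside pt/dt, with an assumed squared-exponential covariance for the gaussian process):

```r
library(mvtnorm)

# Gaussian-copula-process log likelihood for errors E observed at times t:
# marginals are t-distributed with d degrees of freedom and scale s;
# dependence comes from a squared-exponential correlation over time.
loglik_gcp <- function(E, t, s, d, ell, nugget = 1e-6) {
  Sigma <- outer(t, t, function(a, b) exp(-(a - b)^2 / (2 * ell^2))) +
           diag(nugget, length(t))
  u <- pt(E / s, df = d)    # marginally uniform if the t margin fits
  x <- qnorm(u)             # marginally standard normal
  # copula density term plus the Jacobian prod_i [dt(E_i/s)/s] / dnorm(x_i)
  dmvnorm(x, sigma = Sigma, log = TRUE) +
    sum(dt(E / s, df = d, log = TRUE) - log(s) - dnorm(x, log = TRUE))
}
```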

Many thanks to "Corey" on Gelman's blog for the suggestion; any inaccuracies in the above are entirely due to my attempting to figure out the details on my own.