Teaching Bayesian thinking

2014 March 28
by Daniel Lakeland

I have a friend who is a Philosophy professor. He recently posted to his Facebook page that he was offering his students extra credit if they could come up with false positive and false negative rates for the drug tests that used to be required by Florida law in order to receive welfare benefits. His students were having a heck of a time finding anything on these rates.

The only data he had was that about 2.4% of people who took the test got positive tests (reported in the news I guess).

So, I did a little quick googling, and came up with a WebMD article that had an approximate range for these tests.

I wrote him the following email, which describes how to use this range to construct a simple prior over the false positive and false negative frequencies for these tests, and then how to use a single 6-sided die to generate a sample from this prior and calculate the range of frequencies with which a person who tests positive is an actual drug user.

Since it's a Philosophy course, I put in some stuff at the end about the distinction between frequencies and probabilities within the Bayesian framework. If you're a teacher, or a student just seeing these ideas for the first time, it might be helpful to read this description.
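
For readers who want to try the arithmetic themselves, here is a minimal R sketch of the kind of calculation the email walks through. The 2.4% positive rate is the only number taken from above; the false positive and false negative ranges are placeholders I made up for illustration, and many random draws stand in for the single 6-sided die.

## Hypothetical sketch: only the 2.4% positive-test rate comes from the post;
## the false positive/negative ranges below are made-up placeholders.
set.seed(1)
n <- 10000                          # draws from the prior (instead of one die roll)
fpos <- runif(n, 0.005, 0.02)       # prior over the false positive rate (assumed range)
fneg <- runif(n, 0.05, 0.15)        # prior over the false negative rate (assumed range)
ppos <- 0.024                       # observed fraction of positive tests

## If a fraction u of applicants are users, then
## ppos = (1 - fneg)*u + fpos*(1 - u), so solve for u:
u <- (ppos - fpos) / (1 - fneg - fpos)
ok <- u > 0 & u < 1                 # keep only draws consistent with the data
p_user_given_pos <- (1 - fneg[ok]) * u[ok] / ppos
quantile(p_user_given_pos, c(0.05, 0.5, 0.95))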


What I've been doing in between the important blogging on Bayesian philosophy

2014 March 21
by Daniel Lakeland

If you follow my blog you probably know that I was a PhD student at USC. I graduated in December, so you can read my dissertation which discusses both the mechanics of liquefaction and the thermodynamics and mechanics of wave dissipation due to microscopic vibrations in a molecular solid, along with some Bayesian fitting of that ODE model.

The liquefaction research was recently published in Proceedings of the Royal Society A. My version of the preprint can be downloaded here: Grain Settlement and Fluid Flow Cause the Earthquake Liquefaction of Sand, and my postprint version will be available after 12 months (though I understand the definitive version will become available from the journal after a year or so too). The supplemental materials are available as well.

The article in Proceedings A was picked up by two different news sources so far:

ABC News Australia interviewed me for their article from a week or two ago, and in the last few days ScienceNOW interviewed me and one of my advisers Amy Rechenmacher for an article that came out today.

I was particularly happy to see Ross Boulanger quoted with a favorable comment. He was a professor of mine when I did my second undergrad degree at Davis, and he definitely motivated me to think about the problems with standard liquefaction explanations.

In the meantime, I am now working on building up a consulting practice. My company is Lakeland Applied Sciences LLC. My emphasis has been on building mathematical models of all sorts of phenomena, from things like liquefaction and wave vibration to scaling laws for the healing time of bones and cartilage, to economic decision making, to biochemical explanations for behavior. You can see examples of projects I've worked on in the past on the company web site.

 

The Bayesian Approach to Frequentist Sampling Theory

2014 March 21
by Daniel Lakeland

So I've been thinking a lot recently about the issue of sampling from a finite population and doing Bayesian analysis. Pretty clearly, when we have a finite population of items in the world, and we use a random number generator to select a subset of them, and we then analyze the data from our subset and try to extrapolate to the full set, we are in a situation where the "urn full of balls" interpretation of probability has some intuitive meaning. But the "urn full of balls" interpretation of probability is not directly the same as a Bayesian probability. In fact, the Bayesian conception of probability is as a measure of what it is reasonable to believe about things we haven't, and usually can't, measure.

So to make my example concrete, we'll suppose we have a crate full of 100 nominally 2 liter bottles of orange juice. Our interest is to find out how much total orange juice is in this crate. Further, suppose we know that the filling machine recently broke and has just been repaired, so it might be producing problematic results (and we can't use the information that, for example, historically this machine reliably produced bottles with 2.0 \pm 0.05 liters). You can extrapolate this example to estimating total votes in Florida (where the population is a finite set), or the effect of preschool education on academic performance in high school (where the set of students is some finite set), or whatever...

First of all, why is the information about the repair important? It's important because it breaks a symmetry: the historical record prior to the repair is distinguishable from the current results. In technical terms, the historical data is not exchangeable with the current data given our state of information (that the repair might have caused miscalibration). So let's start by constructing a Bayesian prior for the average volume of orange juice in a bottle. Supposedly, this is the big bad hard task that Frequentists complain about, though in this case I think it's going to be pretty easy.

The Prior on average volume in a jug

We know for sure that the average volume of a container is not going to be bigger than the absolute maximum volume that any container could hold. The bottles are nominally 2L in volume; most likely they have some headspace designed into them, and a little variation in manufacturing, yet I'm very sure that none of the bottles holds 3L: that's just an unreasonable amount of manufacturing variation. So a simple prior for the average number of liters of OJ is U(0,3) (that is, uniform on the interval (0,3)). This is a valid prior, and it encodes the possibility of having empty containers or full containers, even containers that we're pretty sure are overfull by much more than is really possible. There is no realistic outcome excluded from this prior; however, there are some unrealistic ones included (namely 2.9L and so on).

Because we know there are unrealistic values included in this prior, it is not the most informative prior we could come up with. I am fairly sure that the volume of a typical 2L OJ bottle is less than some number in the vicinity of 2.1 or 2.2 liters; perhaps I decide that the headspace is on the order of 0.25L at most. I could then use U(0,2.25) as a reasonable prior with slightly more information than the above. The maximum entropy distribution for a positive number whose average is known exactly is an exponential distribution, so another reasonable prior for the average number of liters of orange juice is \exp(1/2.25), that is, exponential with a rate of 1/2.25 or an average of 2.25. Finally, another reasonable prior for the average OJ is one which is uniform on [0,2] and falls off exponentially past that with a rate of 1/0.25, indicating that the headspace is on the order of about 0.25L. This is probably a good prior, but not easy to encode in typical software.
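
For concreteness, here is a small R sketch (my own coding, not part of the original post) that plots these candidate priors side by side, including a normalized version of the flat-with-exponential-tail prior that is awkward to write down in standard software:

## Sketch of the candidate prior densities for the average liters per jug.
curve(dunif(x, 0, 3), 0, 3, ylim = c(0, 0.6), lty = 1,
      xlab = "average liters per jug", ylab = "prior density")
curve(dunif(x, 0, 2.25), add = TRUE, lty = 2)
curve(dexp(x, rate = 1/2.25), add = TRUE, lty = 3)

## Uniform on [0,2] with an exponential tail of rate 1/0.25 past 2:
## total mass = 2*h + 0.25*h, so a height h = 1/2.25 normalizes it.
dflat_tail <- function(x) {
  h <- 1 / 2.25
  ifelse(x < 0, 0, ifelse(x <= 2, h, h * exp(-(x - 2) / 0.25)))
}
curve(dflat_tail(x), add = TRUE, lty = 4)
legend("topright", lty = 1:4,
       legend = c("U(0,3)", "U(0,2.25)", "exp(rate 1/2.25)", "flat + exp tail"))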

So, that's it: we think about what's reasonable to believe about the average amount of OJ in the containers, we encode that information into a distribution which puts high probability on the regions we think are reasonable and low probability elsewhere, and we're done. There's absolutely no notion of Frequency anywhere in this process. In fact, although I would not be surprised by the average turning out to be somewhere in [0,2.5], I would be very surprised if, on repeated sampling of different crates, the average OJ per bottle in each crate had a histogram that looked anything like the above priors!

Collecting data: sampling from the crate

So, now to the sampling itself. How do we understand sampling in a Bayesian framework of probability? First off, we could just haphazardly sample: say, take a few from the top of the crate, then remove some and set them aside and take a few from the middle of the crate... Let's say we're going to look at 10.

What's the problem with haphazard sampling? The biggest problem is unintentionally restricting the chosen set to a subset of the whole which has very different properties from the whole. Andrew Gelman gives the classroom example of weighing candies where it's just much easier to get ahold of the big candy bars, and the small candies fall to the bottom of the bag and are under-represented in typical samples. Similar things happen when you ask engineers to look at a bunch of potentially damaged structures. Engineers find damage interesting and their eye and attention will be drawn to it. In the case of our OJ jugs perhaps when we find what we think are lightweight jugs we are more likely to include them than leave them out.

So our task is to get a "representative" sample, and, a bit circularly, "representative" just means "has similar properties to the whole collection". Is there a way to guarantee a representative sample? The answer is yes, if you're willing to sample 100% of all the jugs; that sample will tell you exactly what the average weight is. Any smaller sample will contain less information and could be biased. So if there's no guarantee, is there at least a way to ensure that we almost always get a representative sample? Here, if you're familiar with computer science, we can take a cue from "randomized algorithms" such as quicksort with a random pivot. The idea is that even if some "adversary" is trying to make your sample non-representative (the worst case), unless they know what the output of your random number generator will be, choosing the samples with a random number generator makes it nearly impossible for them to succeed.

To see why this is, we just need a simple counting argument. We have 100 jugs and we're going to choose 10. The number of different sets of 10 we could choose is C(100,10), or "100 choose 10", which is calculated as

\frac{100!}{10! (100-10)!} \approx 1.73 \times 10^{13}

and a random number generator is going to give us each of those sets with equal frequency (in the context of a well designed random number generator, Frequentist conceptions of probability are exactly true by mathematical construction). This number of possibilities is simply too large for any significant fraction of them to be non-representative. For example, if you put all the light jugs in the first set, then they can't all be in each one of the 10^{13} other sets too. In fact, suppose that there are, say, 10 jugs which are quite different from the rest. How many sets of 10 will have 0, 1, or 2, but no more, of these jugs? A counting argument works again: we have 10 special jugs, we'll choose 0 or 1 or 2 of the weird ones, and the rest will be the usual ones. The total number of ways to make these choices is:

C(10,0)\times C(90,10) + C(10,1)\times C(90,9) + C(10,2)\times C(90,8) \approx 1.63 \times 10^{13}

which means \approx 94\% of the possible sets will have 0, 1, or 2 of the weird jugs but no more. So 94% of the time we use an RNG sampling method we'll get a sample in which the fraction of "weird" jugs is between 0 and 20% of the sample, when the true fraction is 10/100 = 10%. Not bad. If you like, you can fiddle with the numbers to see how this changes with different sample sizes (in R the choose function is just called "choose").
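
For example, a quick check of the counting above in R looks like this:

total <- choose(100, 10)                ## all possible samples of 10 from 100, about 1.73e13
good  <- choose(10, 0) * choose(90, 10) +
         choose(10, 1) * choose(90, 9)  +
         choose(10, 2) * choose(90, 8)  ## samples with 0, 1, or 2 of the 10 weird jugs
good / total                            ## about 0.94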

So the simplest way to avoid getting a non-representative sample from a crate is to number all the jugs and select 10 of them randomly using a computer RNG. It's not that other methods are invalid; it's that we don't know how valid other methods are, especially in the context of a potential adversary (such as settling a regulatory issue, a contract dispute, a lawsuit, etc.), but also in the context of some accidental pattern in the choice of data: for example, it might be harder to get to the bottom of the crate, but the first jugs off the line went into the bottom, and they were the ones affected by some start-up problem with the machine. Fortunately, for a Bayesian, we can potentially adjust our model of the data to account for our knowledge that a sample was chosen non-randomly.

Analyzing the data: Bayesian probability calculations

Now we'd like to figure out from our random sample what the overall average weight \mu is (and hence what the total is, since the total is just the average times the number of jugs). Suppose we weigh the jugs on a scale with negligible measurement error; perhaps it gives readings in tenths of a gram. Also suppose we know the density of OJ and the weight of the jug exactly, so once we've weighed the jugs we know how much OJ was in them to within on the order of 1 part in 10,000. The only meaningful uncertainty, then, is the uncertainty due to sampling. Another way to say this is: we know the weight of 10 jugs exactly, but we don't know the weight of the other 90 jugs. We have missing data, and each missing data point could be considered an unknown parameter over which we could place a Bayesian probability distribution. We already have a prior distribution for the average \mu; now we want to know the probability of \mu given the data we've seen. We'll call the data w_i where i goes 1\ldots 10.

Now, thinking about Bayes' theorem: p(\mu|D) = p(D|\mu) p(\mu) / Z, where Z=p(D) is a number chosen to normalize the right hand side so that it integrates to 1 over all \mu values. Note that when we use a continuous probability distribution and then observe particular data, the probability of observing exactly that data is a slightly problematic concept that I don't want to get into too much at this point. We've already given our prior p(\mu); we now need a "likelihood" p(D|\mu).

All I really know about the process is that \mu is a positive number, and I have a prior over it encoded in the first section. What do I know about which possible values of the individual measurements are more likely versus less likely? How about I simply say that the individual measurements will be consistent with draws from a distribution which maximizes the entropy (a measure of the uncertainty in a distribution) subject to the constraint that the mean is some particular value \mu? Since I have symmetric information about each data point, the joint distribution is just a product of this distribution over and over again. In other words, symmetry of information = IID joint distribution.

The maximum entropy distribution for a positive random variable with a given average is the exponential distribution, so I'm going to say that the likelihood of seeing the data is \prod_{i=1}^{10}\frac{1}{\mu}\exp(-w_i/\mu), with \mu itself being a random variable from my prior.
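
Spelling out the algebra (my own consolidation of the pieces above, not something stated in the original post), the posterior for \mu under the U(0,2.25) prior is

p(\mu | w_1,\ldots,w_{10}) \propto \mu^{-10} \exp\left(-\frac{\sum_{i=1}^{10} w_i}{\mu}\right)

for 0 < \mu < 2.25, which is what the Stan model below is set up to sample from.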

Note that this is VERY different from anything that you'd get from a Frequentist intuition, and it's also different from what you'd get from standard sampling theory or anything like that. It's also not the highest "power" model you could come up with; in fact, in some sense it's the least informative, because the exponential distribution is maximum entropy. Nevertheless, it is valid in a Bayesian sense. How does it work? To see, I'm going to try it out on a simulation in R using Stan.

The Simulation

Here's some R code that carries out the above model, followed by the graph that's produced. With this random seed, the average for the 100 jugs is 1.71.

library(rstan);
library(ggplot2);
set.seed(1)
## The actual weights of the 100 jugs are chosen as IID draws from a
## uniform distribution between 1.4 and 2.0 liters. This is the real
## frequency distribution from which we get the population of 100 jug
## weights; these 100 weights are the actual population of jugs.
wall <- runif(100,1.4,2.0);
wsamp <- sample(wall,10)
stancode <- "
data {
  real<lower=0> wsamp[10];
}
parameters {
  real<lower=0,upper=2.25> mu;
}
model {
  wsamp ~ exponential(1/mu);
}
";
fit <- stan(model_code=stancode,data=list(wsamp=wsamp),chains=2,iter=1000);
mean(wall); ## the actual average weight of the 100 jugs
samps <- extract(fit,permuted=TRUE)
qplot(samps$mu,geom="density",xlab="mu",main="Posterior Density of mu");

The resulting graph is as follows:

[Figure: sampleOJ, the posterior density of mu]

Clearly, without relying on the frequency of anything, and by using a very conservative model, I've discovered that the overall average OJ quantity is about 1.6 (near the peak of the density) and that I can be 90% sure it's between about 1.11 and 2.14 (from the 0.05 and 0.95 quantiles of the Stan samples), whereas I started out believing it was between 0 and 2.25 uniformly.
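
For reference, that 90% interval can be pulled directly from the samples, assuming the samps object from the code above:

quantile(samps$mu, probs = c(0.05, 0.95))  ## the 0.05 and 0.95 quantiles quoted above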

The procedure worked even though my likelihood function is maximally spread out, my prior weights all the credible values uniformly, and I have only seen 10 out of 100 jugs. The REAL average for the 100 jugs using this random seed was 1.71, so my inference is working correctly. If I used the posterior mode of 1.6 I'd estimate 160 liters of OJ for the crate, and the real amount would be 171, so I'd be off by 11 liters, which is about 6% error: pretty good considering how little effort I put in, weighing only 10 jugs. Of course, I'm only sure that I'm in the range of 111 to 214 liters total; this range contains the true value and is at most off by about 35%, which isn't great, but we do know a lot more about the machine than we did before we looked at the 10 jugs.

Finally, there's no reason we couldn't use a much more informative model with a likelihood that incorporates the central limit theorem and uses normal approximations etc and we'd probably get a smaller interval than this model. The flexibility in modeling is in fact one of the things that makes the Bayesian viewpoint very practical and helpful in real world problems.

 

Bayesian Science vs Frequentist Science (and both of them vs Mathematics of Probability)

2014 March 14
by Daniel Lakeland

In the last post, we saw that probability theory as a mathematical idea is, unsurprisingly, about purely mathematical objects (like numbers). No one need ever perform an experiment to see whether 2 is greater or less than 1, because 2 and 1 are purely logical entities. This is not true when we use probability as a model to describe a scientific question. Specifically, we can make a mathematical model and ask what it logically implies, but we can't make the claim that such a model corresponds to scientific reality without something else.

That something else is a little bit of philosophy and a lot of observed data. So first I'll describe my best take on the philosophy:

Bayesian Philosophy:

This one I feel I am better qualified to describe, because it ultimately is what I am convinced by. Bayesian philosophy in essence comes down to allowing scientific claims to have degrees of plausibility instead of only a true/false dichotomy. If you're going to allow claims to have degrees of plausibility, then you'll want to pay attention to Cox's axioms and Cox's theorem, and you'll most likely arrive at the conclusion that the algebra of calculation with plausibility is the same as the algebra that is used in probability theory. In the previous post, I then showed that there's a correspondence between working with plausibilities in terms of density functions, and working with big bags of numbers ("random sequences"). So mathematically, whenever we want to do calculations on plausibility we can always turn them into calculations on frequencies within a big enough sequence of numbers. This is exactly what we do when we use MCMC methods in applied Bayesian statistics. However, this mathematical isomorphism is not also a scientific isomorphism. Obviously, when we, say, weigh a person on a scale, that person has one particular weight (at least to 4 or 5 decimal places, ignoring respiration, perspiration, dandruff...), so there's no sense in talking about the "frequency with which that person's weight is between 120 and 120.5 pounds", even though there is a sense in which the plausibility of that person's weight being within that range can be turned into a question about the frequency with which a random number generator of a particular type would produce numbers between 120 and 120.5.

So, in Bayesian philosophy, scientific questions have an answer of sorts: that answer is more or less a distribution that describes which of the possible truths are credible and which are not. There is no frequency in the real world, only frequency in a computational device (a random number generator) which allows us to compute plausibilities.

Frequentist Philosophy:

I'm not fully qualified to give the Frequentist philosophy, but my best stab at it is this: when we repeatedly perform experiments which are similar enough, the outcome of each experiment naturally varies in some way. If there are some stable conditions which cause the future to look like the past, then future observations will also vary in the same way (note that these are pretty strong assumptions about the way the world works physically). If we alter conditions in some way, we can ask the question: does the data in the altered condition look like a sample from a random number generator that replicates the shape of the histogram of the data in the unaltered (control) condition? Because of certain mathematical attractors (the Central Limit Theorem etc.), it's easier to analyze summaries of data such as averages than it is to analyze raw data, so often we aggregate information from individual experiments and wind up comparing random number generators which have the simpler distribution that is predicted for the output of the aggregation of random numbers.

The Frequentist philosophy requires that we analyze only repeatable experiments, things which could in theory be replicated a large number of times. So, it's meaningless to ask the question "what is the probability-as-frequency that this person has a weight between 120 and 120.5 pounds" since the person has some given weight which isn't subject to "randomness" but it is meaningful to ask the question "what is the probability-as-frequency with which a given scale will spit out measurements between 120 and 120.5 pounds when we place the same person on it several times in a row" since the measurement process "is subject to randomness" (ie. it doesn't always give the same answer). For the moment, pretend that the scale has some variability on the level of at least 0.1 lb for re-weighing things in the range of 120 lbs.

In essence, the Frequentist philosophy says that scientific experiments should be analyzed as if they were the output of specialized random number generators. When does this make philosophical sense? Well, one situation of interest is when we actually DO use a random number generator to select a subset of some finite set of possible objects to measure. This is like sampling widgets from a carton, or sampling people from the population for surveys. In this sense, the mathematics of sampling from a distribution makes sense because we have imposed that mathematical structure on the problem. You will, in fact, see many Frequentist statistics texts that talk about how we can only apply statistics to randomly selected samples of items. And then, occasionally, you will see people argue that as long as the sample is at least "representative" it could be OK to do statistics in this case as well, though you will need some real argument for why it is representative. Frequentist purists will say no random assignment = no statistical methods are valid.

However, there's more to it. The typical answer that comes from a Frequentist analysis is to the question: "does our data look plausibly like it might have been just a sample from a random number generator which chooses things of the 'control' or 'uninteresting' or 'null' type?"

To provide more than that, the Frequentist procedure must make more assumptions. In particular, often the method is to use maximum likelihood. We hypothesize that the actual data is as if it were from some other random number generator, and then, holding the data constant, we find values of the knobs (parameters) for the random number generator which make the actual data seem least surprising (or maximally "likely"). This procedure is, in essence, a special case of Bayesian inference in which the likelihood comes from IID samples of some distribution and the prior is nonstandard and uniform. But more than that, this procedure is calculating the plausibility of some values of the parameters based on fixed data and a particular model (IID likelihood); it is ultimately a special case of Bayesian philosophy. The Frequentist simply refuses to interpret the likelihood as a probability (since there is no repeated sampling of the parameters) but in all other respects uses the likelihood function as a plausibility measure. This becomes especially true when the Hessian of the likelihood is used to get "standard errors", in which case we are just fitting a Gaussian around the peak of the special-case Bayesian posterior.
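
A small illustration of that last point (my own toy example with made-up numbers, not anything from the post): fit the mean of an exponential by maximum likelihood with R's optim, and the inverse Hessian of the negative log likelihood gives the same Gaussian that a Laplace approximation would place around the peak of the flat-prior posterior.

## Sketch: MLE plus Hessian-based standard error as a Gaussian approximation
## around the peak of the (flat-prior) posterior. Illustrative numbers only.
set.seed(2)
w <- rexp(50, rate = 1/1.7)                      # fake data with true mean 1.7

negloglik <- function(mu) sum(log(mu) + w / mu)  # -log likelihood for exponential(mean mu)
fit <- optim(1, negloglik, method = "L-BFGS-B", lower = 1e-6, hessian = TRUE)

mu_hat <- fit$par                                # the MLE (here just mean(w))
se_hat <- sqrt(1 / fit$hessian[1, 1])            # "standard error" from the Hessian
c(mu_hat, se_hat)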

The Science, and my interpretation:

From a scientific standpoint, in general I start with a parameterized description of a physical process and some data and want to find out which particular parameter values are consistent with the data. So I choose the Bayesian viewpoint on probability, and Cox's theorem makes this mathematically allowable. However, there are instances where the "sampling from an urn" interpretation of what I'm doing makes sense. For example, I have some well defined finite population (say a set of 100 houses in a housing development) and I want to find out something about the degree to which settlement has caused damage to the foundations without seeing all the houses. In this situation Bayesian reasoning and Frequentist reasoning will likely coincide. A sample will be taken using a random number generator. In the absence of additional information, this imposes a symmetry of our information on all the houses. This symmetry basically is equivalent to IID draws from a distribution. Although we don't know the distribution, under the assumption that it has an average we can say something about the observed averages of the data... etc. It is, however, worth keeping in mind that the Bayesian analysis is working with the plausibility measure, and the application of the random number generator essentially makes it implausible to believe that the distribution of damages in unseen houses is very different from the distribution of seen houses. Without the random number generator, there is the distinct possibility that we only saw houses which were subtly similar (say only ones where there was a stay-at-home adult who would answer the door on the Tuesday-Thursday inspection days, and these houses tend to be cheaper and smaller because families with two working adults generally choose to buy the larger and more expensive ones... or whatever).

 

Probability vs Frequency, representations and reality

2014 March 6
by Daniel Lakeland

One of the themes of this blog is the relationships between mathematical models and the actual things in the world they are intended to represent. A theme of the Entsophy blog on the other hand is the difference between probability in the Bayesian sense (as a measure of how credible a particular value of an unknown physical quantity is) and frequencies in random trials. For the most part, I agree with Joseph about what a Bayesian probability is, and I agree with him that the world is not a random process that spits out independent and identical trials. But, there is a sense in which sequences of "random numbers" are meaningful to probability. What is that sense?

First, let's consider how we'd study the mathematics of probability. Could we talk about the "credibility of scientific claims" in a mathematical definition? Certainly not. Whether a claim is credible or not is a scientific question, a question that requires observation of the actual facts of the world. Mathematical objects need to be purely logical things. We can't have the truth of a theorem dependent on whether or not we observe a particular object somewhere in the universe.

So probability theory (i.e. the algebra of "random numbers") must be about something else. In particular, there are a variety of different definitions. But I'm going to try out one here using nonstandard analysis (internal set theory) and ideas from Per Martin-Löf's paper which I've blogged previously. I'm not going to be too concerned about strange mathematical objects such as probability measures on function spaces etc., but we'll talk about some very basic ideas, like univariate continuous random variables.

A probability density p(x) is a function which assigns to each x value a non-negative density p(x) and whose integral over all of x is equal to 1. Any function which satisfies this is a valid density, and we can define some non-standard densities like delta function densities etc as well thanks to our use of nonstandard analysis.

But functions are difficult to compute with. Sure a few functions are easy to specify symbolically and/or evaluate numerically, and we can approximate functions using things like Chebyshev series or other orthogonal or even non-orthogonal bases, but ultimately they are not very easily computable things. It's useful then to have an alternative representation for the density p(x).

A sequence \{x_i\}_{i=1}^{N}, where N is a nonstandard integer, is an unlimited sample from the density p(x) iff whenever \delta x is a standard interval, |\{x_i \in \delta x\}| / N \cong \int_{\delta x} p(x)dx. A nonstandard sample is a representation of the density in the sense that we can compute the probability of any standard region of x space by counting the number of samples in the region and dividing by the total number of samples (to within an infinitesimal). Determining whether a number is in a region, counting, and dividing are all very simple computational tasks. There is then a very natural mathematical correspondence between a density and a very big bag of numbers. But this mathematical correspondence is not a scientific fact, it's a mathematical definition.
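
The finite, standard-world analogue of that correspondence is easy to see numerically; here is a sketch in R with N merely large rather than nonstandard (my own example):

## A big (but finite) bag of numbers standing in for a density: the fraction
## of samples landing in an interval approximates the integral of the density.
set.seed(3)
x <- rnorm(1e6)            # a large sample representing p(x) = standard normal
a <- 0.5; b <- 1.5         # a "standard interval"

mean(x >= a & x <= b)      # the counting version
pnorm(b) - pnorm(a)        # the integral of the density, about 0.2417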

So far we haven't talked about "independence." Following Per Martin-Löf's paper (which I've blogged previously), essentially "random is as random does". That is, we define random sequences by reference to whether a test based on basic standard probability theory concepts passes the sequence.

The mathematical definition of independence says that variables A and B are independent if p(a,b) = p(a) p(b). We can take two disjoint subsequences of our big sequence and ask if they define independent random variables by essentially demanding that the counting version of the above equation holds.

\frac{|\{i : (a_i,b_i)\in \delta_a \times \delta_b\}|}{N/2} \cong \frac{|\{i : a_i\in \delta_a\}|}{N/2} \times \frac{|\{i : b_i\in \delta_b\}|}{N/2}

where we choose half of the x_i to represent a and the others represent b and then \delta_a and \delta_b are any arbitrary standard regions of the domain of x.

Finally, we could say that \{x_i\} are independent and identically distributed random samples from the distribution p(x) if almost every standard set of disjoint subsequences has this property. Obviously, we won't get this property if, say, A is all the positive values and B is all the negative values, but there are 2^{(N-1)} different ways to make the sequences A and B (we divide by 2 because of the possibility of simple label-switching). Since N is nonstandard, that's really a LOT of sequences, so if we require that the vast majority of them satisfy the independence criterion, then we're "safe". And what counts as the "vast majority" we could set via an \alpha level for our test. But beyond that, we want it to be true for all standard integer numbers of subsequences, so this would include tests for mutual independence of 1,2,3,4,5,6,7,8,... random subsequences selected in almost any way.
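
In the same finite spirit as before, the counting test for independence can be carried out numerically; a sketch (my own, with the split into the a and b subsequences taken to be simply first half versus second half):

## Counting check of p(a,b) ~= p(a) p(b) on a large IID sample,
## split into two disjoint subsequences.
set.seed(4)
x <- rnorm(1e6)
a <- x[1:5e5]
b <- x[(5e5 + 1):1e6]

in_da <- a > 0 & a < 1              # delta_a = (0, 1)
in_db <- b > 1 & b < 2              # delta_b = (1, 2)

mean(in_da & in_db)                 # joint counting frequency
mean(in_da) * mean(in_db)           # product of the marginal frequencies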

I hope that I haven't overstepped. I mean, I haven't actually done any proofs that these things are equivalent to other more standard definitions, but I'm working on intuition of what this means here and I suspect it will work out fine. The point I want to make, though, is that independence isn't a physical thing; it's more a property of "coming from all the different parts of the space". When things are independent, they won't often come together in small regions of space (notice how the right hand side of the above equation is divided by (N/2)^2). When they're independent, any old subsequence will do a good job of summarizing the density. When things are independent we get good coverage and exploration of the space, and that's it; independence is not a physical thing.

All this is to say, big bags of numbers are nothing more than an equivalent but computationally more useful way to represent density functions. They aren't a philosophical way to define probability; they're a mathematical isomorphism.

So when it comes to philosophy, we want to distinguish between this mathematical convenience, and the use of probability in answering scientific questions. I guess this post is damn long now, so I'll have to get to the Bayesian philosophy later.

Our connection to the distant stars?

2013 December 2
by Daniel Lakeland

Joseph over at Entsophy has posted about some theoretical physics, namely the problem of what constitutes a Newtonian inertial reference frame. I have some comments over there that aren't terribly well developed, but I thought I'd ruminate here on this topic with a little bit of Nonstandard Analysis (NSA) flavor as well.

Suppose that we are in a spaceship somewhere in the solar system. Do we experience centrifugal forces? The answer seems to depend on our state of rotation, but rotation with respect to what? Newton more or less says to consider the distant stars as fixed, create a coordinate system relative to them, and determine if we are rotating relative to that coordinate system. In practice this produces a frame in which if we are not rotating, we experience essentially no centrifugal force (or so small as to be undetectable).

But this is a global procedure, it requires us to see the photons from extremely far off stars and by using those photons determine where those stars appear to be relative to us, and then construct our coordinate system. Is that essential to the procedure, that the apparent position of far off masses is involved?

Suppose instead we want to construct a coordinate system that is local to us, that doesn't involve masses that are enormous distances away. Can we do it? How?

Imagine we are conducting an experiment to observe some masses and determine whether they experience centrifugal (and/or coriolis) forces relative to our coordinate system. Let's first define a time-scale for our experiment: t^* is the time it takes by our lab clock to conduct the whole experiment. To make this concrete, suppose we're interested in human scale objects and processes, so t^* is somewhere between say 1 second and 1 day. In any case, it isn't something like a billion years over which we're supposed to observe galactic events and in which the far off stars actually clearly move.

Next we will define a length scale. So, there is one obvious length scale, which is r^*=ct^*. This defines a communication horizon: motions we carry out after the start of the experiment cannot be felt by objects farther away than this until after the experiment is done. But we may be able to define another length scale which could be of use. Define the force applied by any object i on our objects of interest j as F_{ij}. Now define a length scale r^+ as the smallest radius such that \max_i F_{ij} \cong 0 and \sum_k F_{kj} \cong 0 for every subset \{k\}, where i ranges over objects farther away than r^+ and k ranges over any subset of objects beyond distance r^+. Ok, in reality physical measurements are always representable as standard numbers, so the use of infinitesimal nonstandard numbers must be interpreted in terms of quantities too small to measure; the threshold is different for different measuring instruments, but we take it for granted that for any given laboratory there is some unspecified level below which forces will be so small as to be unmeasurable. For example, good luck measuring the gravitational attraction of two electrons in a lab here on earth.

So we have a length scale that defines a horizon beyond which no individual object produces any appreciable force, and the sum of any group of objects produces no appreciable net force. This is I think a stronger horizon. Why? Well, for example I might have some delicate apparatus here on earth, and it measures the motion of some tiny slightly charged droplets of water in still air for example. The whole experiment might take say 1 minute, but I could imagine being able to just barely detect the tidal effect of the sun in this experiment even though the sun is 8 light minutes away. That is, the retarded field of the sun is still relevant even if the sun doesn't feel the wiggle of my droplet's motion until another 7 minutes go by.

Nevertheless, this radius is distinctly local, in that it's much smaller than the distance to the nearest galaxy for example. In fact, if we can define some cosmic radius R as encompassing essentially all the visible mass in the universe, then r^+/R \cong 0 for our purposes.

Now, we can observe our objects doing their dynamic dance, whatever they are; perhaps we're in a spaceship and we are spinning a bucket of water, or watching charged particles interact, or playing catch the stick with a quadrocopter (see link for YouTube example). Overall we'd like to determine: from what point of view do the dynamics occur without any superfluous centrifugal or coriolis accelerations? To figure this out, we can make reference to the positions of any masses within our horizon (but not to the "distant fixed stars"). The important thing here is that we need a notion of "superfluous". Clearly, if we're here on earth and want to study a spinning bucket of water we will say that there are centrifugal forces involved. But if we're doing it while seated on a merry go round there will be additional rotational effects involved. It would be simpler to view the whole thing from the ground rather than the merry go round itself. If we're watching the weather, it might be simpler to describe the forces from the point of view of a coordinate system that doesn't rotate with the earth around its axis of spin: for example, a coordinate system centered at the center of mass of the earth, with one axis pointed between the center of mass of the earth and the center of mass of the sun, another being the axis of the earth's rotation, and the third being perpendicular to the other two. (Yes, I know the earth also orbits the sun, but for the sake of argument perhaps that effect is infinitesimal on our time scale.)

So, clearly the notion of a "preferred" or "inertial" reference frame is related to the complexity of the forces involved. So we can fix a coordinate system in our lab, and then determine a rotating coordinate system relative to the lab coordinate system by specifying a vector \Omega around which our inertial coordinate system is rotating and the length of this vector is the rate of rotation. If we orient our measurement instrument along that axis and rotate it at the appropriate angular velocity all the dynamics will be "simplest" in terms of rotational forces. Perhaps we can define the net rotational force as something like \sum_i |C_i| where C_i are the individual centrifugal forces.

The next question is: in this coordinate system, what is the distribution of angular velocities of all masses within our local shell? If this procedure essentially constructs a zero mean angular velocity coordinate system for the objects that have appreciable effects on our observed experiment, then when you expand your shell to the universe it's not surprising that the distant stars (which make up the vast majority of the universe) are "fixed".

For the moment, that's as much time as I have to devote to this topic, but perhaps I'll come back or get some interesting comments here.

Improving the Jury Duty system

2013 November 15
by Daniel Lakeland

I had to postpone Jury Duty with the county of Los Angeles this week. Things worked out fine and now it won't interfere with family reunions or my wife's biology conference. But in thinking about Jury Duty and having to take the online orientation, it became clear that we need to fix the economics of this system.

Here in LA county Jurors are paid $15 each day after the first day that they are on duty. Now, this is clearly a slap in the face. Jurors are selected at random from a large pool of people. The average productivity of a given person is going to be at least on the order of the GDP/capita/working day of the US. It's going to be slightly higher in LA than in other parts of the country because cost of living is higher so there's a bias there, but order of magnitude that's going to be pretty much correct. How much is that?

According to Google, GDP/capita of US in 2012 is about $50,000

There are more or less (365*5/7-5*3) working days in the year. So the figure we're talking about is: $200/day or so. $15/day is about 7.5% of the societal cost of compelling jury service.
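
Spelling out that back-of-the-envelope arithmetic (same figures as above):

gdp_per_capita <- 50000
working_days   <- 365 * 5/7 - 5 * 3               # roughly 246 working days per year
per_day        <- gdp_per_capita / working_days   # roughly $203 per working day
15 / per_day                                      # about 0.074, i.e. the ~7.5% above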

Now, there's a definite benefit to jury service, in that we have laws that are enforced and are able to avoid all sorts of societal problems, but the jury costs are borne by either the jurors or their employers, whereas the benefits accrue to everyone.

Furthermore, if you're going to have a 5 day trial with 12 jurors and 3 alternates you're going to eat up about $15,000 in productivity, whereas if the trial is a Civil trial, maybe it's only about $9000 worth of unpaid rent or something. Society loses on the net just on the lost productivity of jurors.

So, my proposal is this: courts pay Jurors GDP/capita/working day for any day they are called to the court (including the first day). Courts raise this money by charging court fees to the losing parties in Civil lawsuits in courts other than small claims court. This means it doesn't pay to go to court over amounts that aren't significantly larger than this basic cost, reducing the abuse of the court system. And small claims courts have their maximum dollar amounts raised to 15*5*GDP/capita/working day to handle disputes that aren't worth a jury trial. The fees would have to be high enough that they also cover the jury costs of criminal trials, where the losing party likely isn't going to pay, though fines could be assessed as well in those cases.

I think under this system, far fewer people would duck jury duty with contrived excuses, and the quality of juries would probably rise as well.

Radio Frequency Engineering with WiFi, or how to stream in comfort and politeness

2013 October 28
by Daniel Lakeland

The vast majority of middle class residences have at least one WiFi router these days, usually connected to cable or DSL internet service. These internet services work a lot better and cheaper than doing everything over your phone's data connection, but they are inconvenient without wireless distribution throughout your house. Many people get a WiFi router with their cable/DSL modem setup, and a few buy an after-market one and set it up themselves. Most of these devices come with some default settings, and enough people just keep those defaults that you will frequently see network IDs (ESSIDs) that look like "linksys" or "NETGEAR" or some other default name. (In the following I assume mostly US based rules and regulations.)

Unfortunately, WiFi doesn't have much in the way of non-interference standards. So, for example, routers out of the box don't periodically search for the "best" channel and switch to it. Doing so would cause a brief interruption of service, but it wouldn't be that hard to take samples of channel usage throughout the day and, whenever a very low traffic moment arrives, switch to the "best" channel. Few people would notice a 1-3 second downtime at 2am, for example. There's the possibility of some interesting and strange iterative dynamics with such a scheme, I suppose. As it is, I see routers tuned to all sorts of channels. For 2.4GHz there are channels 1-11, and I can see 1, 3, 4, 6, 7, 9, 11 near my house. Few everyday non-tech-geeks know that there are actually only 3 independent channels. These are (if you're a geek, say it with me)... 1, 6, and 11 (this is different for the newer 5GHz band, where channels range from 36 up to 165 and some are forbidden). Channel 1 interferes with 1-5 but not 6-11, channel 6 interferes with 2-10 but not 11, and channel 11 interferes with 7-11. In Europe and some other countries there may be an extra set of channels 12-14, so you can get 4 non-interfering ones: 1, 6, 11, 14. In some circumstances where there's sufficient separation between the stations, 1, 5, 9, 11 can be used with only minimal interference but added channel capacity that makes up for it. That kind of deployment requires that you have control over all the stations, like inside a hotel or shopping center, since other people are not likely to choose the right channels.

Another thing access points don't usually do is automatically scale back their power output. In fact, manufacturers tout their high power outputs, and they typically come off the shelf configured to max out the power. High power is useful in a point-to-point link where two wireless devices have directional antennas pointed at each other and are trying to send data over long distances, from a few hundred meters to kilometers in some cases. But in a typical single family home or apartment complex, high power output is a nightmare that doesn't improve the range, but does dramatically increase interference. In one example, a wireless networking conference found that their hotel WiFi was terrible, and the biggest fix was to go around turning off access points and turning down power output until the interference was eliminated.

As an analogy, imagine the following scenario, you are in a parking lot with a megaphone. You are trying to have a conversation with your wife (husband) about what items to buy in the store. Your partner has no megaphone. When you're sitting across from each other in your car, the fact that you are pointing a megaphone at your wife and blasting her in the face with 100dB of sound is probably not going to improve your chances of communicating clearly (try it some time!). And, when she's all the way across the parking lot at the entrance to the store, blasting in your megaphone "Did you get the eggs?" doesn't really help, because without a megaphone on her end, you just can't hear her reply. In the mean time, everyone else in the neighboring cars suffers through hearing every detail of your shopping list when you could be quietly talking inside your car without the megaphone.

The same thing goes for WiFi. I was recently trying to improve my WiFi performance and found that I could turn my access point down to 13 dBm and still get just as good performance throughout my house as when I had it cranked to the default of 18 dBm. That's about 1/3 of the maximum power output. Sure, it meant that my signal wasn't FULL BARS at the street in front of my house, but that's actually a huge advantage for everyone. I don't need full-bar streaming at my curb, and across the street from me, they don't need my WiFi router to be interfering with their laptop. Whereas in fact my across-the-street neighbors have their power cranked up in such a way that on my front porch their signal is roughly -83 dBm, which is a respectable signal strength. They might be able to sit on my porch and stream their Netflix if they had a high enough power laptop WiFi radio.
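
For anyone who wants to check the dB arithmetic, power in milliwatts is 10^(dBm/10); a quick sketch:

dbm_to_mw <- function(dbm) 10^(dbm / 10)
dbm_to_mw(18)                  ## about 63 mW
dbm_to_mw(13)                  ## about 20 mW
dbm_to_mw(13) / dbm_to_mw(18)  ## about 0.32, i.e. roughly 1/3 of the power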

So what is the upshot of it all? Please go to your wireless router, select a channel from 1, 6, or 11, and turn down your power output to about 12 to 17 dBm (for most people this will work). You can check your wireless signal strength using Wifi Analyzer, the Android phone app. Walk around and make sure your coverage works well inside your living area and in important outside areas (on your deck or whatever), but not across the street or in the apartment 4 doors down the hall. Finally, if you need to extend your range, consider, rather than turning up your power output, adding another station connected via wired ethernet to your main router, on a different channel, with low power, placed in the middle of the "worst" signal area. This is the same as bringing your wife closer to you in the parking lot instead of turning on the megaphone; your neighbors will thank you. Of course sometimes running the wired line is a pain, in which case try repositioning your main router, and if that doesn't work, turn the signal up slowly until you get coverage.

Perhaps consider discussing this article politely with your closest neighbors if they aren't using channels 1,6,11 or are using a default high power output and leaking a lot of signal into your dwelling.

What's wrong with our understanding of Soil Liquefaction?

2013 September 4
by Daniel Lakeland

I submitted my paper Grain Settlement and Fluid Flow Cause the Earthquake Liquefaction of Sand to Proceedings of the Royal Society A and just received reviewer comments today. This preprint summarizes what I've found in my PhD studies with regard to how soil liquefaction works. (Also some more details on the derivation.)

The key finding is that, contrary to every textbook definition of liquefaction, fluid flow is not ignorable; in fact, fluid flow dynamics are a key component of how liquefaction occurs. The standard description of liquefaction involves "undrained" contraction of soil grains during shaking causing an increase of fluid pressure, and a transfer of total stress from the grain-grain contacts to the water pressure. If the water pressure gets high enough, the grains carry no stress and hence cannot transfer shear stresses. At this point the slurry flows like a liquid.

Although it's been known that "void rearrangement" could occur (in which grains move one way and water moves the other), it was treated as a special case of soil liquefaction behavior. However, centrifuge scale-experiments and specialized tabletop experiments performed in the last decade have called this into question, and the standard triaxial tabletop experiments involving 10 to 20 cm sized samples of soil require enormous deformations, unrealizable in-situ, before water pressure exceeds total vertical stress. Why is soil so hard to liquefy in a triaxial machine?

Tabletop triaxial experiments involve sand and water wrapped in a completely impermeable rubber membrane, so that the total mass of water inside the membrane remains constant. In the soil, water can conduct from one place to another through the inter-grain spaces. The standard description assumes that such conduction is slow compared to the duration of an earthquake, and this assumption is so strong that some researchers have actually tried to compensate for the bulging of the membrane in these tabletop experiments under the assumption that even the slightest conduction of water causing the membrane to bulge would be more than the actual conduction in-situ.

The truth may be hard for the Geotech community to swallow. My analysis shows that for typical loose sands, water pressure transmission via fluid flow occurs over tens of meters on timescales substantially less than a single earthquake loading cycle. In the paper I derive a dimensionless equation that describes the phenomenon in the top ~10 meters of surface soil, and I show that the equation can be solved for equilibrium conditions to determine approximately the water pressure field. We can use an equilibrium solution rather than a dynamic time-varying one because equilibration occurs so quickly compared to the earthquake duration. I then solve the equation for a variety of special cases that are relevant to actual physical experiments and show how the equation predicts the qualitative results of those experiments.

 

Bayesian Uncertainty About Objective Functions in Optimization

2013 August 30
by Daniel Lakeland

Suppose we'd like to design some system to perform in an "optimal" way. One intuition I've often had is that "Optimal" is rarely worthwhile in real situations. It is difficult to specify a criterion in an exact enough way which makes the "optimal point" under your criterion actually "optimal" in any sense. Instead, it's better to settle for "Good" or if you're lucky "Pretty Darn Good". In other words, once you've climbed most of the way up the mountain, it really doesn't make much difference if you are standing on the very top, especially if your surveys can't really tell you which of the little peaks really is the highest one.

With that in mind, suppose we have a Bayesian model for the probability that some particular tradeoff function is truly our "favorite" tradeoff. To make it semi-concrete, suppose we have to choose amounts to consume of two quantities, say Beer (B) and Nachos (N). We have a certain amount of money, all of which we are going to spend. We can spend it in such a way that P_B B + P_N N = D, so there is a line in (B,N) space that defines the reachable region.

Now in simple micro-economics, the solution is to specify a utility function U(B,N) and maximize U subject to the constraint on B,N.

But, in reality, we don't know how to trade-off B,N, there is no utility function. At best, we can sort of guess at how happy we'd be with any given position on the B,N line.

Instead, suppose we specify some probability distribution over different U, p[U]. This is a distribution over functions that specifies the region of U space (say, coordinates in a Hilbert space) where we think our true preferences lie. But it's now impossible to say with certainty what exact values of (B,N) are best. We could try to maximize expected utility, or maximize the utility under the most probable U, or do a lot of other things; but another way to think about this is that we could choose a range of U in the high probability region, find a range of (B,N) points that are all "good" relative to all the U in that region, and then any of those (B,N) points is going to be a good choice.
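
As a toy version of this idea (my own construction: the Cobb-Douglas form and all the numbers are assumptions, not anything from the post), we can draw a handful of plausible utility functions, optimize each one along the budget line, and look at how spread out the resulting "optimal" points are; any bundle in that spread is a "good" choice.

## Toy sketch: uncertainty about U(B,N) turns into a *range* of good (B,N)
## points along the budget line P_B*B + P_N*N = D.
set.seed(5)
P_B <- 5; P_N <- 8; D <- 40          # assumed prices and budget
alpha <- rbeta(20, 5, 5)             # 20 draws of a Cobb-Douglas exponent

## For U(B,N) = B^a * N^(1-a) on the budget line, the maximizer spends a
## fraction a of the budget on B:  B = a*D/P_B,  N = (1-a)*D/P_N.
B_star <- alpha * D / P_B
N_star <- (1 - alpha) * D / P_N

range(B_star)                        # spread of "optimal" beer purchases
range(N_star)                        # spread of "optimal" nacho purchases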

My intuition is, generally in real problems, this version of the problem is going to be easier to solve than finding the one-true-optimum point of a well specified U. There are two issues.

Often in applied mathematics, the utility function is chosen to make things relatively easily solvable. Such a utility function doesn't necessarily incorporate all the things we know about the real problem. Once we do incorporate all those things, we'll often have a non-smooth, weird utility function (think of making investment decisions under accurate tax-code rules, where there are weird break-points and eligibility requirements and so forth).

On the other hand, putting probability in the mix, we can't be sure of any "one true value"; instead we can compare "good values" across "good models". There are two scales for error: one is the difference between a "good" value and "the optimal" value for a given utility function, and the other is the difference between "the optimal" value under one utility function and "the optimal" value under another. It makes sense to stop trying to find "the optimum" of a given utility once the scale of your search progress gets down below the scale of the differences between different utility functions. I think often solving a bunch of related optimizations in a very approximate way will give you a lot more information about what the good values are than misleading yourself into believing you've specified a real utility function and then solving that (wrong) optimization to perfection.