# The Bayesian Approach to Frequentist Sampling theory

So I’ve been thinking a lot recently about the issue of sampling from a finite population and doing Bayesian analysis. Pretty clearly, when we have a finite population of items in the world, and we use a random number generator to select a subset of them, and we then analyze the data from our subset and try to extrapolate to the full set, we are in a situation where the “urn full of balls” interpretation of probability has some intuitive meaning. But the “urn full of balls” interpretation of probability is not directly the same as a Bayesian probability. In fact, the Bayesian conception of probability is as a measure of what it is reasonable to believe about things we haven’t, and usually can’t measure.

So to make my example concrete, we’ll suppose we have a crate full of 100 nominally 2 liter bottles of orange juice. Our interest is to find out how much total orange juice is in this crate. Further, suppose we know that the filling machine broke recently and has been recently repaired so that it might be producing problematic results (and we can’t use information that for example historically this machine produced bottles with $$2.0 \pm .05$$ liters reliably). You can extrapolate this example to estimating total votes in Florida (where the population is a finite set), or the effect of preschool education on academic performance in high school (where the set of students is some finite set) or whatever…

First of all, why is the information about the repair important? It’s important because it breaks a symmetry, the historical record prior to the repair is distinguishable from the current results, in technical terms, the historical data is not exchangeable with the current data given our state of information (that the repair might have caused miscalibration). So let’s start by constructing a Bayesian prior for the average volume of orange juice in a bottle. Supposedly, this is the big bad hard task that Frequentists complain about, though in this case I think it’s going to be pretty easy.

### The Prior on average volume in a jug

We know for sure that the average volume of a container is not going to be bigger than the absolute maximum volume that any container could hold. The bottles are nominally 2L volume, most likely they have some headspace designed into them, and a little variation in manufacturing, yet I’m very sure that none of the bottles holds 3L, that’s just an unreasonable amount of manufacturing variation. So a simple prior for the average number of liters of OJ is U(0,3) (that is, uniform on the interval (0,3)). This is a valid prior, and it encodes the possibility of having empty containers or full containers, even containers that we’re pretty sure are overfull by much more than is really possible. There is no realistic outcome which is excluded from this prior, however there are some unrealistic ones included (namely 2.9L etc). Because we know there are unrealistic values included in this prior, it is not the most informative prior we could come up with. I am fairly sure that the volume of typical 2L OJ bottles is probably less than some number in the vicinity of 2.1 or 2.2 liters, perhaps I decide that the headspace is on the order of 0.25L at most. I could use U(0,2.25) as a reasonable prior with slightly more information than the above. The maximum entropy distribution for a positive number whose average is known exactly is an exponential distribution. Another reasonable prior for the average number of liters of orange juice is then $$\exp(1/2.25)$$, that is exponential with a rate of 1/2.25 or an average of 2.25. Finally, another reasonable prior for the average OJ is one which is uniform on [0,2], and falls off exponentially past that with a rate of $$1/0.25$$ indicating that the headspace is on the order of about 0.25L. This is probably a good prior, but not easy to encode in typical software.

So, that’s it, we think about what’s reasonable to believe about the average amount of OJ in the containers, we encode that information into a distribution which puts high probability in the regions we think could be reasonable, and low probability in other regions, and we’re done. There’s absolutely no notion of Frequency anywhere here in this process. In fact, although I would not be surprised by the average turning out to be somewhere in [0,2.5] I would be very surprised if on repeated sampling of different crates I found out that the average OJ per bottle in each crate did have a histogram that looked anything like the above priors!

### Collecting data: sampling from the crate

So, now to the sampling itself. How do we understand sampling in a Bayesian framework of probability. First off, we could just haphazardly sample, say take a few from the top of the crate, and then remove some and set them aside, and a few from the middle of the crate… Let’s say we’re going to look at 10.

What’s the problem with haphazard sampling? The biggest problem is unintentionally restricting the chosen set to a subset of the whole which has very different properties from the whole. Andrew Gelman gives the classroom example of weighing candies where it’s just much easier to get ahold of the big candy bars, and the small candies fall to the bottom of the bag and are under-represented in typical samples. Similar things happen when you ask engineers to look at a bunch of potentially damaged structures. Engineers find damage interesting and their eye and attention will be drawn to it. In the case of our OJ jugs perhaps when we find what we think are lightweight jugs we are more likely to include them than leave them out.

So our task is to get a “representative” sample, and a bit circularly “representative” just means “has similar properties to the whole collection”. Is there a way to guarantee a representative sample? The answer is, yes if you’re willing to sample 100% of all the jugs. That sample will tell you exactly what the average weight is. Any smaller sample will contain less information and could be biased. So if there’s no guarantee, is there at least a way to ensure that we almost always get a representative sample? Here, if you’re familiar with computer science, we can take a cue from “randomized algorithms” such as quicksort with a random pivot. The idea is that even if some “adversary” is trying to make your sample non-representative (the worst case) unless they can know what the output of your random number generator is, choosing the samples with a random number generator will make it nearly impossible to do.

To see why this is we just need a simple counting argument. We have 100 jugs and we’re going to choose 10. The number of different sets of 10 we could choose is $$C(100,10)$$ or “100 choose 10” which is calculated as

\[\frac{100!}{10! (100-10)!} \approx 1.73 \times 10^{13}\]

and a random number generator is going to give us each of those sets with equal frequency (in the context of a well designed random number generator, Frequentist conceptions of probability are exactly true by mathematical construction). This number of possibilities is simply too large to make any significant fraction of them be non-representative. For example, if you put all the light jugs in the first set, then they can’t all be in each one of the $$10^{13}$$ other sets too. In fact, suppose that there are say 10 jugs which are quite different from the rest. How many sets of 10 will have 0,1,or 2 but nor more of these jugs? A counting argument works again, we have 10 special jugs, we’ll choose 0 or 1 or 2 of the weird ones, and the rest will be the usual ones. The total number of ways to do these choices is:

\[C(10,0)\times C(90,10) + C(10,1)\times C(90,9) + C(10,2)\times C(90,8) \approx 1.63 \times 10^{13}\] which is $$\approx 94\%$$ of the possible sets will have either 0, 1 or 2 of the weird jugs but no more. So 94% of the time we do an RNG sampling method we’ll get a sample with the fraction of “weird” jugs between 0 and 20% of the sample, when the proper fraction is 10/100 = 10%. Not bad. If you like you can fiddle with the numbers to see how this changes with different sample sizes (in R the choose function is just called “choose”).

So the simplest way to avoid getting a non-representative sample from a crate is to number all the jugs and select 10 of them randomly using a computer RNG. It’s not that other methods are invalid, it’s just that we don’t know how valid other methods are, especially in the context of a potential adversary (such as settling a regulatory issue or a contract dispute or lawsuit etc) but even in the context of some other accidental pattern in the choice of data, such as it being harder to get to the bottom of the crate, but the first jugs off the line went into the bottom and they were the ones that were affected by some start-up problem with the machine. Fortunately, for a Bayesian, we can potentially adjust our model of our data to account for our knowledge that a sample is chosen non-randomly.

### Analyzing the data: Bayesian probability calculations

Now we’d like to figure out from our random sample what the overall average weight $$\mu$$ is, (and hence what the total is, since the total is just the average times the number of jugs). Suppose we weigh the jugs on a scale with negligible measurement error, perhaps it gives readings in tenths of a gram. Also suppose we know the density of OJ and the weight of jug exactly so once we’ve weighed the jugs we know how much OJ was in them to within on the order of 1 part in 10,000. The only meaningful uncertainty then is the uncertainty due to sampling. Another way to say this is, we know the weight of 10 jugs exactly, but we don’t know the weight of 90 jugs. We have *missing data, *and each missing data point could be considered an unknown parameter over which we could place a Bayesian probability distribution. We already have a prior distribution for the average $$\mu$$, now we want to know what is the probability of $$\mu$$ given the data we’ve seen. We’ll call the data $$w_i$$ where $$i$$ goes $$1\ldots 10$$

Now, thinking about Bayes theorem: $$p(\mu|D) = p(D|\mu) p(\mu) / Z$$ where $$Z=p(D)$$ is a number chosen to normalize the right hand side so that the integral over all $$\mu$$ values integrates to 1. Note that when we use a continuous probability distribution and then observe particular data, the probability of observing exactly that data is a slightly problematic concept that I don’t want to get into too much at this point. We’ve already given our prior for $$p(\mu)$$ we now need a “likelihood” $$p(D|\mu)$$.

All I really know about the process is that the $$\mu$$ is a positive number and I have a prior over it encoded in the first section. What do I know about which possible values of individual measurements is more likely vs less likely? How about I simply say that the individual measurements will be consistent with draws from a distribution which maximizes the entropy (a measurement of the uncertainty in a distribution) subject to the constraint that the mean is some particular value $$\mu$$? Since I have symmetric information about each data point, the joint distribution is just a product of this distribution over and over again. In other words, symmetry of information = IID joint distribution.

The maximum entropy distribution for a positive random variable with a given average is the exponential distribution, so I’m going to say that the likelihood of seeing the data is $$\propto \prod_{i=1..10}\exp(-w_i/\mu)$$ with $$\mu$$ itself being a random variable from my prior.

Note that this is VERY different from anything that you’d get from a Frequentist intuition, and it’s also different from what you’d get from standard sampling theory or anything like that. It’s also not the highest “power” model you could come up with. In fact in some sense it’s the least informative because the exponential distribution is maximum entropy. Nevertheless, it is valid in a Bayesian sense and how does it work? To see how it works, I’m going to try it out on a simulation in R using Stan.

### The Simulation

Here’s some R code that carries out the above model, followed by the graph that’s produced. With this random seed, the average for the 100 jugs is 1.71

library(rstan); library(ggplot2); set.seed(1) wall <- runif(100,1.4,2.0); ## the actual weights of the 100 jugs will ## be chosen as IID random numbers from a ## uniform distribution between 1.4 and ## 2.0 liters. This is the real frequency ## distribution from which we're going to ## get a population of 100 jug ## weights. the 100 jug weights are the ## actual population of jugs wsamp <- sample(wall,10) stancode <- " data{ real<lower=0> wsamp[10]; } parameters{ real<lower=0,upper=2.25> mu; } model{ wsamp ~ exponential(1/mu); } "; fit <- stan(model_code=stancode,data=list(wsamp=wsamp),chains=2,iter=1000); mean(wall); ## the actual average weight of the 100 jugs samps <- extract(fit,permuted=TRUE) qplot(samps$mu,geom="density",xlab="mu",main="Posterior Density of mu");

The resulting graph is as follows:

Clearly, without relying on the frequency of anything, and by using a very conservative model, I’ve discovered that the overall average OJ quantity is about 1.6 (near the peak of the density) and that I can be 90% sure it’s between about 1.11 and 2.14 (from the 0.05 and 0.95 quantiles of the Stan samples), whereas I started out believing it was between 0 and 2.25 uniformly.

The procedure worked even though my likelihood function is maximally spread out, and my prior weights all the credible values uniformly, and I only have seen 10 out of 100 jugs. The REAL average for the 100 jugs using this random seed was 1.71 so my inference is working correctly. If I used the posterior mode of 1.6 I’d estimate 160 liters of OJ for the crate and the real amount would be 171, so I’d be off by 11 liters which is about 6% error, pretty good considering how little effort I put in with weighing only 10 jugs. Of course, I’m only sure that I’m in the range of 111 to 214 liters total, this contains the true value, and is at most off by 35% which isn’t great, but we do know a lot more about the machine than we did before we looked at the 10 jugs.

Finally, there’s no reason we couldn’t use a much more informative model with a likelihood that incorporates the central limit theorem and uses normal approximations etc and we’d probably get a smaller interval than this model. The flexibility in modeling is in fact one of the things that makes the Bayesian viewpoint very practical and helpful in real world problems.

If I use your code to generate wall I get a mean of 1.69 with sd .21. Okay not surprising, because the data were generated already from the normal distribution. But here’s the part that I’m interested in: suppose you had a practice of regularly sampling 10 jugs from each batch. You have a large set of these data in your factory logs (say 1k samples of 10 jugs from different lots), and you want to estimate from a single sample whether the “fix” resulted in defective or inconsistent jugging of OJ. That’s problem 1. Its a common type of problem in biology. Not just to estimate the magnitude and spread of the data as you have done, but to ask directly given limited opportunity or resources, how different is this from another group and should I care about that difference?

Then the second problem is (related to the first), suppose we’ve decided (as I have in the case of a problem I’m working on) that a frequentist approach is warranted because I have this data with 1k samples from different lots of the same era in machinery. Suppose further you have to control for 100 other QC measures of the milk, and some of them are related to human health (we don’t want people to get sick after all) but we don’t know a priori which of these parameters are likely to come up. Well if we’re frequentists, bonferroni comes along and says “no problem! none of these are significant, the milk is perfectly safe!” Sound familiar?

Dennis: I’m a little confused because given that I set the RNG seed in the code, you should probably get the same thing I do for wall, which is 1.71 mean and 0.16 sd, but let’s just chalk it up to maybe different versions of R or something. Also, I generate the data with a uniform distribution, not a normal distribution. That’s because I wanted to illustrate how the long run frequencies are uniform, the likelihood is based on a maximum entropy exponential distribution, and yet I get an inference that ultimately is peaked about approximately the true value and makes good sense without needing to invoke anything like central limit theorems or sampling distributions or normal approximations etc.

Anyway, to your further question problem 1. Suppose you’ve got 1000 crates you’ve sampled 10 each from in a timeseries, and the last one is “after” the fix. I’d probably set up a regression model where the mean quantity of OJ in a sample of 10 is distributed as say exponential(1/(mu)) for i in 1..999 and exponential(1/(mu+eps)) for i=1000 and then say eps ~ normal(0,2); representing the fact that the effect of fixing the machine has shifted the average either up or down by an amount which is a small multiple of 2 L, you could really give a more informative prior, maybe normal(0,1) if you wanted to, since you know that the jugs aren’t really holding more than 2.25 L anyway, you can’t shift the average by more than about 2L. You might have to fiddle with this model a little to avoid getting negative values or zero for mu+eps.

Note that the above model stays in line with my basic thread of using a maximum entropy based likelihood distribution, but you could also use a normal distribution for the likelihood which is more informative and therefore higher power.

Now as to problem 2: A Frequentist interpretation of statistics with hypothesis testing and corrections for multiple-comparisons is not in my opinion valid for that type of inference, but using the samples to describe your information about the state of the “pre” machine is a valid way to set up a probability distribution.

You have a problem where you’ve got a single sample, like 10 jugs from a crate after the fix, and you’ve also got a bunch of other samples from other time points (or in your case, other portions of the genome), and instead of measuring one thing about the sample, like weight, you’re measuring as you say maybe 100 different aspects of the OJ in your sample of 10 jugs… let’s say it’s a test for 100 different types of bacterial contaminants.

You want to find out if there’s any reason to believe that the 10 jugs after the fix have been contaminated in any way that is “different” from the way it was before the fix. If even ONE type of microbe is now present at higher quantities than before the fix, that’s really important. Also, you might get several types of contamination together, or you might increase one kind of contamination and decrease other kinds… lots of stuff could happen.

Consider your sample of 100 measurements of 999 “pre” cartons not as 99900 measurements, but as 999 measurements of a vector of 100 items. Since you know very little about the process that generated the pre measurements, your information about the “pre” condition is pretty much that the kinds of things you would get were points in this 100 dimensional space which “looked like” the samples you have.

From a Bayesian perspective, you’re saying something like: “given that everything I know about the pre condition is just that it seemed to produce samples like the 999 ‘pre’ samples I have, is the 1 extra sample of 100 measurements in some ‘not very common’ region of my Bayesian probability distribution for the ‘pre’ condition”

You could set up some kind of maximum entropy distribution for the “pre” samples, get a closed form formula which is consistent with those samples, and then see if the “new” sample is in the low probability region of this maximum entropy distribution. That’s more or less a “functional” approach to computing. But you could also use the *samples* themselves to estimate the probability of “pre” samples being in the region of the one “post” sample. For example, suppose you’re working with contamination measurements scaled so that the “pre” samples are on average 1. You could decide that any sample in the hypercube around the “post” sample of width .1 in any dimension is “similar”, and then the fraction of the “pre” samples in that range represents the approximate “Bayesian probability of getting a sample similar to the ‘post’ sample under the state of information which consists entirely of the pre samples”

I should maybe take this as a challenge to blog up an example.

Just to clarify, the frequentist approach is warranted in my second example because I can easily calculate a probability of observing my data from the 1000 prior samples. AND its OJ we’re talking about, not milk. 🙂 Since Bonferroni said it, I’ll blame him.

So, suppose you find out that the “post” sample is different from the “pre” samples in the sense of being in a low probability region of the distribution that is consistent with the pre samples. Does that prove that “your fix caused contamination?” or “your fix reduced contamination” or “increased contamination from these microbes, and decreased it for these other microbes” or whatever?

No, it shows more or less that *something* has caused our sample to be inconsistent with what used to be happening before the fix. But it could easily be say that while you were fixing the machine, someone else was changing the supplier of jugs, or doing some plumbing on the water supply, or whatever.

You knew that, and in your case, it means you need to follow up with biological experiments to see if the “unusual” measurements are also causally related due to measureable changes in transcription or whatever, but it’s good to think about these things in the simpler OJ context.

The mean I reported was for wsam, not wall. I wanted to compare the frequentist stat to the bayesian estimate. Its not important for the rest of my comment though.

Is it possible to ask a question as follows (regarding the OJ sample). I can calculate easily a posterior likelihood from the most recent (post-fix) sample, given my historical factory data logs. Can I go back and calculate a similar likelihood from an individual log entry, or a random sample of entries, and obtain information about how likely my post- sample is (as a whole). If an unusual number of the parameters are in a low probability region of the n-dimensional data set, (but the majority aren’t), would that potentially be reflected there? Is that even an interesting question, i.e. does it get at this question of whether the new set is truly abnormal in a historical context?