No, it shows more or less that *something* has caused our sample to be inconsistent with what used to be happening before the fix. But it could easily be, say, that while you were fixing the machine, someone else was changing the supplier of jugs, or doing some plumbing on the water supply, or whatever.

You knew that, and in your case it means you need to follow up with biological experiments to see if the “unusual” measurements are also causally related, through measurable changes in transcription or whatever, but it’s good to think about these things in the simpler OJ context.

Anyway, to your further question, problem 1. Suppose you’ve got 1000 crates, you’ve sampled 10 jugs from each in a time series, and the last one is “after” the fix. I’d probably set up a regression model where the mean quantity of OJ in a sample of 10 is distributed as, say, exponential(1/mu) for i in 1..999 and exponential(1/(mu+eps)) for i=1000, and then say eps ~ normal(0,2), representing the fact that the effect of fixing the machine has shifted the average either up or down by an amount which is a small multiple of 2 L. You could really give a more informative prior, maybe normal(0,1) if you wanted to: since you know that the jugs aren’t really holding more than 2.25 L anyway, you can’t shift the average by more than about 2 L. You might have to fiddle with this model a little to avoid getting negative values or zero for mu+eps.
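Here is a rough sketch of how that model could be evaluated on a grid, just to make the pieces concrete. The function name `eps_posterior`, the flat prior on mu over (0.05, 4) litres, and the grid ranges are my own assumptions for illustration; only the exponential likelihoods and the normal(0,2) prior on eps come from the description above. Grid points where mu+eps is not positive are simply given zero probability, which is one way of doing the "fiddling" mentioned.

```python
import numpy as np

def eps_posterior(pre_means, post_mean, eps_sd=2.0, n_grid=400):
    """Grid approximation to the marginal posterior of eps.

    pre_means : the 999 "pre" sample means (litres)
    post_mean : the single "post" sample mean (litres)
    Assumes a flat prior on mu over the grid range (an illustrative choice).
    """
    pre_means = np.asarray(pre_means, dtype=float)
    mu = np.linspace(0.05, 4.0, n_grid)    # assumed plausible range for mu
    eps = np.linspace(-4.0, 4.0, n_grid)   # shift caused by the fix
    MU, EPS = np.meshgrid(mu, eps, indexing="ij")

    # Exponential log-density with mean m: -log(m) - y/m
    n = pre_means.size
    loglik_pre = -n * np.log(MU) - pre_means.sum() / MU

    shifted = MU + EPS
    valid = shifted > 0                     # mu + eps must stay positive
    safe = np.where(valid, shifted, 1.0)    # placeholder avoids log of <= 0
    loglik_post = -np.log(safe) - post_mean / safe

    logprior_eps = -0.5 * (EPS / eps_sd) ** 2   # eps ~ normal(0, eps_sd)

    logp = np.where(valid, loglik_pre + loglik_post + logprior_eps, -np.inf)
    p = np.exp(logp - logp.max())           # subtract max for numerical stability
    p /= p.sum()
    return eps, p.sum(axis=0)               # sum over mu gives the eps marginal
```

Calling `eps, p = eps_posterior(pre_means, post_mean)` returns the grid of eps values and the posterior mass at each one, so something like `p[eps > 0].sum()` gives the posterior probability that the fix shifted the mean upward.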

Note that the above model stays in line with my basic thread of using a maximum entropy based likelihood distribution, but you could also use a normal distribution for the likelihood which is more informative and therefore higher power.

Now as to problem 2: a Frequentist interpretation of statistics, with hypothesis testing and corrections for multiple comparisons, is not in my opinion valid for that type of inference, but using the samples to describe your information about the state of the “pre” machine is a valid way to set up a probability distribution.

You have a problem where you’ve got a single sample, like 10 jugs from a crate after the fix, and you’ve also got a bunch of other samples from other time points (or in your case, other portions of the genome), and instead of measuring one thing about the sample, like weight, you’re measuring as you say maybe 100 different aspects of the OJ in your sample of 10 jugs… let’s say it’s a test for 100 different types of bacterial contaminants.

You want to find out if there’s any reason to believe that the 10 jugs after the fix have been contaminated in any way that is “different” from the way it was before the fix. If even ONE type of microbe is now present at higher quantities than before the fix, that’s really important. Also, you might get several types of contamination together, or you might increase one kind of contamination and decrease other kinds… lots of stuff could happen.

Consider your sample of 100 measurements on each of 999 “pre” cartons not as 99900 measurements, but as 999 measurements of a vector of 100 items. Since you know very little about the process that generated the pre measurements, your information about the “pre” condition is pretty much that the kinds of things you would get were points in this 100 dimensional space which “looked like” the samples you have.

From a Bayesian perspective, you’re saying something like: “given that everything I know about the pre condition is just that it seemed to produce samples like the 999 ‘pre’ samples I have, is the 1 extra sample of 100 measurements in some ‘not very common’ region of my Bayesian probability distribution for the ‘pre’ condition?”

You could set up some kind of maximum entropy distribution for the “pre” samples, get a closed form formula which is consistent with those samples, and then see if the “new” sample is in the low probability region of this maximum entropy distribution. That’s more or less a “functional” approach to computing.

But you could also use the *samples* themselves to estimate the probability of “pre” samples being in the region of the one “post” sample. For example, suppose you’re working with contamination measurements scaled so that the “pre” samples are on average 1. You could decide that any sample in the hypercube around the “post” sample of width .1 in any dimension is “similar”, and then the fraction of the “pre” samples in that range represents the approximate “Bayesian probability of getting a sample similar to the ‘post’ sample under the state of information which consists entirely of the pre samples”.
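The sample-based version is simple enough to sketch directly. This is only an illustration: the function name `fraction_similar` is made up, and I’m reading “width .1” as the full side length of the hypercube, so a “pre” sample counts as similar when it is within .05 of the “post” sample in every dimension.

```python
import numpy as np

def fraction_similar(pre, post, width=0.1):
    """Fraction of "pre" samples inside a hypercube centred on the "post" sample.

    pre   : array of shape (n_pre, n_dims), e.g. (999, 100), scaled to average 1
    post  : array of shape (n_dims,), the single "post" sample
    width : full side length of the hypercube (assumed reading of "width .1")
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    # A row is "similar" only if every coordinate is within width/2 of post
    inside = np.all(np.abs(pre - post) <= width / 2.0, axis=1)
    return inside.mean()
```

A small fraction says the “post” sample sits in a region the “pre” process rarely visited. With 100 dimensions and only 999 samples the count inside a narrow cube can easily be zero, so the width is something you’d have to choose with the data in mind.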

I should maybe take this as a challenge to blog up an example.

Then the second problem is (related to the first): suppose we’ve decided (as I have in the case of a problem I’m working on) that a frequentist approach is warranted because I have this data with 1k samples from different lots of the same era in machinery. Suppose further you have to control for 100 other QC measures of the milk, and some of them are related to human health (we don’t want people to get sick after all), but we don’t know a priori which of these parameters are likely to come up. Well, if we’re frequentists, Bonferroni comes along and says “no problem! none of these are significant, the milk is perfectly safe!” Sound familiar?
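To put a number on that complaint, here is a minimal sketch of the Bonferroni rule with 100 measures. The function name `bonferroni_reject` and the choice of alpha = 0.05 are assumptions for illustration, not something taken from the actual analysis.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return True for each test that survives a Bonferroni correction."""
    m = len(p_values)
    threshold = alpha / m            # with m = 100 and alpha = 0.05 this is 0.0005
    return [p <= threshold for p in p_values]
```

So with 100 QC measures, a single measure with p = 0.001 is still declared “not significant”, which is exactly the situation being described.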