Once again on Using the Data Twice

2017 April 6
by Daniel Lakeland

So there's an argument in the blogtwitterosphere about updating your Bayesian model with the data twice.

On the one hand, we have EJ Wagenmakers saying:

And on the other hand, we have "Laplace" saying that updating with the data twice is totally fine.

Now, Laplace's math is absolutely correct. But, it's also subtle, because it's purely symbolic. When we build a model, we need to write down specific mathematical expressions for p(Foo | Bar) for all the foos and bars.

Let's see Laplace's equation in detail:

P(A|B,B) = P(B,B|A) P(A) / P(B,B) = P(B|B,A) P(B|A) P(A) / (P(B|B) P(B))

Now, P(B|B,A) = 1 because given B, B has to be true, and the same goes for P(B|B). When you plug those in, you get P(A|B,B) = P(B|A) P(A) / P(B) = P(A|B).
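As a quick sanity check of that identity, here is a minimal sketch on a toy sample space. The die roll, the events A and B, and the helper names `prob` and `cond` are all made up for illustration; nothing here comes from Laplace's post.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
omega = set(range(1, 7))
A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is at least 4"

def prob(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event & omega), len(omega))

def cond(event, *given):
    """P(event | given[0], given[1], ...): condition on the intersection."""
    g = set(omega)
    for e in given:
        g &= e
    return prob(event & g) / prob(g)

p_once = cond(A, B)       # P(A|B)
p_twice = cond(A, B, B)   # P(A|B,B): B intersected with itself is still B

print(p_once, p_twice)
```

Conditioning on B a second time leaves the conditioning set unchanged, so both calls return the same number.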

BUT: when you build a Bayesian mathematical model, you CAN make mistakes, just like you can when you program a numerical integration routine. Suppose instead of P(A|B,B) we calculate P(A|B1,B2), where B2 is a deterministic copy of the data in B1.

Now, if we *remember this fact* correctly, we'll get P(B2|B1) = 1 and P(B2,B1) = P(B1) and we'll get the results above. But, if we forget this fact and pretend that B2 is new independent data, we will get the same results as if we had collected 2x as much data as we really did collect and treated it all as separate information. The mistake is as simple as doing something like

for(i in 1:(2*N)){
  data[i] ~ normal(foo,bar);
}

instead of

for(i in 1:N){
  data[i] ~ normal(foo,bar);
}

The second one is correct: the second copy of the data adds no information to the posterior, because the probability of each data value past the Nth is 1 given that we already know data values 1..N.
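To see the size of the error, here is a minimal sketch using the textbook conjugate update for a normal mean with known standard deviation. The data values, `sigma`, the prior settings, and the function name `normal_posterior` are all assumptions chosen for illustration.

```python
import math

def normal_posterior(data, sigma, prior_mean, prior_sd):
    """Conjugate update for the mean of a normal with known sigma.

    Returns (posterior mean, posterior sd) of the unknown mean.
    """
    prior_prec = 1.0 / prior_sd**2
    data_prec = len(data) / sigma**2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + sum(data) / sigma**2) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]   # made-up measurements
sigma = 0.5                             # assumed known measurement sd

# Correct: each real observation enters the likelihood once.
correct = normal_posterior(data, sigma, prior_mean=0.0, prior_sd=100.0)

# Buggy: the copy is treated as N new independent observations.
doubled = normal_posterior(data + data, sigma, prior_mean=0.0, prior_sd=100.0)

print("correct:", correct)
print("doubled:", doubled)
print("sd ratio:", correct[1] / doubled[1])
```

With a wide prior like this one, the posterior mean barely moves, but the posterior standard deviation shrinks by a factor of about sqrt(2). The model reports more certainty than the data supports.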

It's *this mistake* which is a bug, is common, and leads to statements along the lines of "only use the data once". The statement "only use the data once" is like the statement "don't use global variables to pass arguments to functions": it's useful advice to reduce the chance of committing an error. It's not mathematical truth.

 

One Response
  1. Daniel Lakeland
    April 6, 2017

    Note, this problem can get even more subtle. Suppose you have a survey dataset regarding health and weight. After the year 2010, each person was weighed 3 times at 3 successive weekly appointments. Before the year 2010, each person was weighed once, but the data value from the first weighing is copied into the 2nd and 3rd place just to preserve a standard data table format and to avoid NULL values...

    Now, if you correctly treat the pre-2010 2nd and 3rd weight values as copies, with P(W2|W1) = 1 and P(W3|W1) = 1, you will get one result. But if you treat all the values as repeated independent measurements of the weight, you will find that before 2010 there was a LOT less variability in each patient's repeated weighings.

    Again, updating your posterior using the correct mathematical expression for the probability will result in the correct answer, but updating your posterior using a "default" expression for the probability that fails to take into account the method of data collection and data tabulation results in the wrong answer because you "used some data multiple times". The real reason is "you didn't condition correctly on your knowledge of the data tabulation process".
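The survey example can be sketched with simulated numbers. Everything below is hypothetical: the weights, the week-to-week noise level, the table layout, and the helper `mean_within_sd` are invented only to show how padded copies distort a naive summary.

```python
import random
import statistics

random.seed(1)

def weigh(true_weight, noise_sd=1.0):
    """One noisy weighing of a person (hypothetical measurement model)."""
    return random.gauss(true_weight, noise_sd)

true_weights = [random.gauss(75.0, 12.0) for _ in range(200)]

# Post-2010 rows: three real weekly weighings per person.
post_2010 = [[weigh(w), weigh(w), weigh(w)] for w in true_weights]

# Pre-2010 rows: one real weighing, copied into slots 2 and 3 as padding.
pre_2010 = []
for w in true_weights:
    first = weigh(w)
    pre_2010.append([first, first, first])

def mean_within_sd(rows):
    """Average per-person sd, treating all three slots as independent."""
    return statistics.mean(statistics.stdev(row) for row in rows)

naive_pre = mean_within_sd(pre_2010)    # the copies make this collapse to zero
naive_post = mean_within_sd(post_2010)  # roughly the real week-to-week noise

print("naive within-person sd, pre-2010: ", naive_pre)
print("naive within-person sd, post-2010:", naive_post)
```

Read naively, the pre-2010 rows show no week-to-week variation at all. A model that knows slots 2 and 3 are copies would use only the first weighing from those rows and would learn nothing about week-to-week noise from them.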
