Combining information from independent experiments: further ideas

2017 April 19
by Daniel Lakeland

It's an interesting question connected to something posted at Andrew's blog. I posted one possible "solution" using Stan previously, and Corey and I did some exploration of the issues in the comments.

Here are some new thoughts I've had, but first the set up. There are N labs each of which investigate a phenomenon in which there is a fundamental ratio R = X/Y and X and Y are individually measured with error. The labs each publish some posterior distribution for X and Y. In the example problem we have

Xtrue[i] ~ normal(Xbar[i],sx[i]); // published by lab i

Ytrue[i] ~ normal(Ybar[i],sy[i]); //similar

for each lab i

Now we'd like to do inference from this on the ultimate quantity of interest R. We assume that there is an underlying Rtrue of interest, and that in each lab, due to peculiarities of their apparatus etc an underlying R[i] is at work so that Xtrue[i]/Ytrue[i] = R[i] and R[i] is "close to" Rtrue.

R[i] ~ normal(Rtrue,deltaR);

with deltaR "small" and Rtrue a parameter in our model.

How can we do inference on Rtrue?

The thing that I recently came to is that in this situation, we don't get the individual data points from the labs, we get as *data* only Xbar[i] and Ybar[i] the published values that describe the lab's posterior distributions over Xtrue[i] and Ytrue[i], so we can do something like as follows:

Rtrue ~ normal(OurRtrueGuess, someSpread);// our prior for Rtrue

R[i] ~ normal(Rtrue, deltaR);// our prior for R[i] given Rtrue, expresses the "closeness" of the individual experiments to the real R value

Ytrue[i] ~ normal(Ybar[i], sy[i]); // the published posterior probability distribution over Ytrue[i] from lab i

Xbar[i] ~ normal(R[i]*Ytrue[i],sx[i]); 
/* the likelihood of having Xbar be published
 as the estimate of Xtrue by lab i given 
Ytrue[i], R[i] and the lab published sx[i] inference error. 
That is: p(Xbar[i] | R[i],Ytrue[i])

So now what we've got is essentially p(R) p(R[i]|R) p(Ytrue[i] | Data[i]) p(Xbar[i] | R[i], Ytrue[i], Data[i])

the narrow width of p(R[i]|R) is what describes the fact that all the labs are trying to set up their apparatus to reproduce carefully a single ratio R. The Ytrue[i] distribution is as given from the lab, and finally the likelihood over the Xbar value is potentially either the published distribution from the lab, or if we have additional information, it could be some modified thing. When Xbar is a sufficient statistic and we're using a normal distribution, it can be as good as getting the whole dataset from the lab.

What's interesting is, in these cases, the output of each labs inference process becomes *observed data* that we use in a likelihood type factor in our Bayesian model for combining the data.


Under Stan-ding the Hot Hand

2017 April 10
by Daniel Lakeland

Summary: It is hard to detect the "hot hand" in basketball with hit/miss shot data. In general in science, we often see people claiming that a thing doesn't exist because "they looked carefully and didn't find it". Well, the problem is if your method of looking can't find the thing, the fact that you didn't find it tells us nothing, yet this is taken as evidence that the thing doesn't exist. It's not just a triviality about basketball, it's a common problem in other important areas of science.

Imagine a player shoots baskets, and their probability of success is 0.5 + a sinusoidal function of shot. We don't know the phase or period exactly, we want to look at a sequence of hits and try to infer the amplitude of the variation and its frequency. In my case I'm using period = 40 and phase = 0.

## the hot hand, a player takes shots at a basket with a sinusoidal
## underlying prob of success, looking at N successive shots, if we
## know the period and phase perfectly can we detect the magnitude for
## different N = 20, 50, 100, 200, 500

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

hitmiss = sapply(1:500,function(x){return(rbinom(1,1,prob=0.5 + 0.1 * sin(2*pi*x/40)))})

stancode =  "
real binp(real A,int i,real f,real ph) { return(0.5 + A*sin(2*pi()*(i+ph)/f));}
data{int hitmiss[500];}

parameters{real A20;real A50; real A100;
 real A200; real A500;
vector [5] f;
vector [5] ph;
A20 ~ normal(0,.5);
A50 ~ normal(0,.5);
A100 ~ normal(0,.5);
A200 ~ normal(0,.5);
A500 ~ normal(0,.5);

f ~ gamma(2,2.0/50);
ph ~ normal(0,20);

for(i in 1:500){
if(i < 21){hitmiss[i] ~ binomial(1,binp(A20,i,f[1],ph[1]));
if(i < 51){hitmiss[i] ~ binomial(1,binp(A50,i,f[2],ph[2]));
if(i < 101){hitmiss[i] ~ binomial(1,binp(A100,i,f[3],ph[3]));
if(i < 201){hitmiss[i] ~ binomial(1,binp(A200,i,f[4],ph[4]));
if(i < 501){hitmiss[i] ~ binomial(1,binp(A500,i,f[5],ph[5]));

samps <- stan(model_code=stancode,data=list(hitmiss=hitmiss),


Running this model gives the following results:

Inference for Stan model: c8293fb08b00b7823bcc1fc716df0ef9.
4 chains, each with iter=10000; warmup=5000; thin=10; 
post-warmup draws per chain=500, total post-warmup draws=2000.

         mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
A20      0.15    0.01  0.11    0.01    0.06    0.12    0.21    0.39   422 1.00
A50      0.11    0.00  0.08    0.01    0.04    0.10    0.17    0.30   410 1.00
A100     0.09    0.00  0.06    0.00    0.04    0.09    0.14    0.24   569 1.00
A200     0.09    0.00  0.06    0.00    0.04    0.09    0.13    0.20   288 1.01
A500     0.13    0.00  0.03    0.06    0.12    0.13    0.16    0.20   270 1.01
f[1]    45.95    1.09 35.87    5.22   19.06   37.12   62.08  138.20  1090 1.00
f[2]    39.30    1.46 34.07    3.89   12.61   29.49   54.20  132.16   541 1.00
f[3]    37.54    1.66 29.00    5.82   17.11   32.19   47.29  112.63   305 1.01
f[4]    38.49    1.50 27.51    8.49   19.08   35.79   45.11  108.88   336 1.01
f[5]    40.39    0.28  2.02   39.52   40.31   40.62   40.98   41.68    54 1.08
ph[1]   -0.97    0.95 18.74  -34.41  -13.40   -3.96   11.86   38.11   387 1.02
ph[2]    1.18    0.97 20.88  -40.78  -11.06    0.30   15.52   42.05   465 1.00
ph[3]   -2.06    2.08 20.75  -43.41  -14.11   -1.80   12.54   36.89    99 1.03
ph[4]    1.48    1.35 20.94  -39.44  -10.69    1.07   14.84   42.38   240 1.02
ph[5]    5.01    0.28  4.57   -2.00    2.57    4.69    7.45   12.91   267 1.00
lp__  -581.57    0.33  4.48 -591.30 -584.43 -581.32 -578.54 -573.58   185 1.02

Suppose we decide that we've "Detected" the effect if we have an expected amplitude / sd(amplitude) of about 2, then it takes upwards of 200 shots, because at 200 shots E(A)/sd(A) = .09/.06 ~ 1.5. Also note how few effective samples I get (4 cores, 10000 samples per core, thinning by a factor of 10 so I have 2000 total samples but typical parameters have around a couple hundred effective samples. The model is hard to fit because the whole thing is noisy and messy, traceplots reflect this).

Also, this is with the benefit of a precisely repeated periodic signal and informative priors on the size, period, and phase of the effect.

If the effect varies like a smooth gaussian process with a ~ 40 shot scale but no periodicity and no informative priors, it'd be like trying estimate simultaneously 20 or 30 fourier coefficients or something... you'd need even more data and long runs to overcome the noise and get good effective sample size.

The model is ill posed, the measurement is not very informative, the Hot Hand is hard to detect from hit/miss data.


Inference on a ratio from two separate experiments (a case study in Stan awesomeness)

2017 April 9
by Daniel Lakeland

Andrew Gelman's blog has a post about a case where you do two experiments, one you measure X,Y with measurement error, and want to find out the true ratio of the parameters Xbar/Ybar, and same for a second experiment with Pbar/Qbar. Here is the R code to do the complete experiment:

note that we want to do p(R | Xtrue,Ytrue,Ptrue,Qtrue) = p(R|Xtrue,Ytrue) P(R|Ptrue,Qtrue) and each of these is a delta function (that is, if you tell me Xtrue,Ytrue I find out R exactly)... so we have to approximate this delta function with a sufficiently tightly peaked normal distribution. Basically we tell Stan to stay in the region where both experiments agree to within epsilon = 0.001

## inference on a ratio via two separate experiments


dataset <- data.frame(X=rnorm(10,1,.1),Y=rnorm(10,2,.3),P=rnorm(10,30,6),Q=rnorm(10,60,4));

## now the true value of R = 1/2 = 30/60 but we have two experiments
## where we measure X,Y or we measure P,Q with measurement error that
## is of known size (we know the .1 and .3 and 6 and 4 used as
## standard deviations, assume we have a well calibrated instrument).

stancode = "data { vector[10] X; vector[10] Y; vector[10] P;vector[10] Q;}

real<lower=0> Xtrue;
real<lower=0> Ytrue;
real<lower=0> Ptrue;
real<lower=0> Qtrue;
real<lower=0> Rat;

// we assume on physical principals that these are positive and have an upper bound order of magnitude estimate
Xtrue ~ uniform(0,1000);
Ytrue ~ uniform(0,1000);
Ptrue ~ uniform(0,10000);
Qtrue ~ uniform(0,10000);

X ~ normal(Xtrue,.1);
Y ~ normal(Ytrue,.3);
P ~ normal(Ptrue,6);
Q ~ normal(Qtrue,4);

// constrain the ratios to agree to within a small epsilon
Rat ~ normal(Xtrue/Ytrue,.001);
Rat ~ normal(Ptrue/Qtrue,.001);


samps <- stan(model_code=stancode,data=dataset);

When I run it. I get the following output:

Inference for Stan model: 5100ead2741217b889326da6ff0f0419.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
Xtrue  1.02    0.00 0.03   0.97  1.01  1.02  1.04  1.08  2105    1
Ytrue  2.06    0.00 0.08   1.91  2.00  2.06  2.11  2.22  1568    1
Ptrue 30.24    0.04 1.31  27.71 29.34 30.23 31.15 32.87  1410    1
Qtrue 60.64    0.03 1.22  58.20 59.85 60.66 61.43 63.06  2172    1
Rat    0.50    0.00 0.02   0.46  0.48  0.50  0.51  0.54   927    1
lp__  -6.28    0.04 1.59 -10.39 -7.05 -5.95 -5.11 -4.21  1588    1

Samples were drawn using NUTS(diag_e) at Sun Apr  9 19:37:53 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Note that with 10 observations in each experiment, we determine that our expected value of Rat is 0.5 to within 0.00 standard error, and almost all the samples of Rat are between about 0.46 and 0.54

What to do with Census outliers?

2017 April 9
by Daniel Lakeland

Consider the following records from the Census ACS for Pasadena CA:

11570 2013  2  1030   NA  53000
11835 2013  3  1210   NA  29000
12945 2013  2    NA 2200 200200
16433 2014  2    NA 3100 181000
18641 2015  2  2080   NA 128500
20854 2015  6   260   NA  57000

Apparently there is a family of 6 living in Pasadena, making $57,000/yr (HINCP) and paying $260 a month in rent (GRNTP).

Now, if you told me $2600 I wouldn't blink an eye, and as you can see that would be in line with the $2080 for a family of 2, or $3100 in mortgage for a family of 2. But $260 could only occur if you were renting from your uncle or you are acting as an apartment complex manager and part of your compensation is reduced rent, or some other similar situation.

So, given that some of these records do go through optical scanners and could come out with a dropped decimal place or the like, as well as some people have a situation like the apartment complex manager who has secret income in the form of subsidized rent... How should one think about handling this kind of situation if the goal is to estimate a function: "minimum market rent for a family of N people in each region".

One idea would be to simply trim the data, ignore anything that is say less than 20% of the observed mean or more than 5x the observed statewide mean, and this would catch most situations of both dropped or added decimal places (which would cause a factor of 10 error). Another idea would be to do a model where there's some probability to have outliers due to a separate process, and we simply learn about which records to ignore by virtue of their high posterior probability of being "outliers". But this requires us to carry an extra parameter for each observation. A third possibility is to marginalize out the outlier parameter, and learn only about the frequency of outliers and treat observed data as coming from a mixture model where we learn the mixture weight as equal to the marginalized frequency of outliers.

I'm sure there are other ways, but one issue is this: I already have a lot of parameters due to the large number of public use microdata areas I'm doing inference for, and runs are slow, so it'd be good to avoid complex models just for computational reasons.

Any thoughts?


PUMAs - Public Use Microdata Areas of the Census

2017 April 8
by Daniel Lakeland

The Census ACS and similar microdata datasets use PUMAs which are sort of regions that contain about 100k people. In 2010 the PUMAs were revised to be simpler, and completely contained within a state. But I gotta say, the PUMA is the wrong way to do things.

For the ACS and similar microdata, each household record should have a Latitude and Longitude associated. These values should be equal to the actual Lat/Long of the actual household plus a random uniformly distributed perturbation. The record should give the size of this perturbation in each dimension, and the size should be determined by the local population density such that the patch contains approximately 100k people. The uniform distribution should be used because it's bounded so you can be sure you're never more than a certain distance away from the true location.

The record still has the State, so if you go across a state boundary you'd have some issues which I'm sure the census is capable of working out, but other than that, it makes way more sense to think of things in terms of fuzzing out the location using a continuous perturbation than the complexity of redrawing boundary lines every few years and completely re-doing your identification scheme and screwing up any chance you had of working with spatial-timeseries models...

Just sayin.


A Bayesian Understanding of the "Garden of Forking Paths"

2017 April 7
by Daniel Lakeland

Andrew Gelman has discussed and written on his concept of "The Garden Of Forking Paths" in NHST analysis of scientific data.

"Laplace" whose insights I respect a lot has ridiculed the idea, and when put into the terms he uses, I agree with him. However, I don't think that Gelman's point is quite the same as the one Laplace ridicules. So thinking about it, here's how I'd like to proceed to an understanding.

For simplicity we'll analyze the situation in which a research collects data D, and then does a test T to determine if the two subsets A^+(D) and A^-(D) differ in some way that is detectable by the test by use of a sample statistic S.

First off, consider what the various options are available to the researcher:

T_i, i \in [1,.I]


S_j, j \in [1..J]


A_k, k \in [1..K]

That is, we can choose which test to use, which statistic to test, and how to subset and exclude certain portions of the data to form the partition (the function A partitions and excludes the data, so that there are two groups).

Now, what is the Bayesian probability that p < 0.05 given our knowledge N (I use N because I've already used K).

P(p < 0.05 | D,N,i,j,k) P(D,i,j,k | N)

Suppose in the first case that N contains the information "i,j,k were preregistered choices and D was collected after i,j,k were specified and is independent of the i,j,k". Then P(i,j,k|N) = 1, and P(p < 0.05 | N) is determined entirely by our knowledge in N of the appropriateness of the test and the p values that it outputs.

P(p < 0.05 | D,N) P(D|N)

So, we're still left with all the problems of the use of p values, but we're at least not left with the problems described below.

In the case that N contains the information "I,J,K are all large integers and were chosen after seeing D, and the researcher is motivated to get p < 0.05 and probably at least looked at the data, produced some informal graphs, and discussed which analysis to do with colleagues" we're left with the assumption that i,j,k were chosen from among those analyses which seemed via informal data "peeking" to be likely to give p < 0.05 so the Bayesian is left with:

P(p < 0.05 | i,j,k,D,N) P(i,j,k|D,N) P(D|N)

Now, due to our pre-analysis choice peeking, we can safely assume

P(p < 0.05 | i,j,k,D,N) \sim 1

sure it might not be exactly 1, but it's much much bigger than 0.05 like maybe 0.5 or 0.77 or 0.93 and this is FOR ALL i,j,k that would actually be chosen.

P(i,j,k|D,N,\{i,j,k\}\in G) = O(1)

where G is the reachable subset of the I \times J \times K space called "the garden of forking paths" such that any typical researcher would find themselves choosing i,j,k out of that subset such that it leads to analyses where P(p < 0.05 | i,j,k,D,N) \sim 1

So, how much information does p < 0.05 give the Bayesian about the process of interest? In the preregistered case, it at least tells you something like "it is unlikely that a random number generator of the type specified in the null hypothesis test would have generated the data" (not that we usually care, but this could be relevant some of the time).

In the GOFP case, it tells us "these researchers know how to pick analyses that will get them into the GOFP subset so they can get their desired p < 0.05 even without first doing the explicit calculations of p values."

So, using this formalism, we arrive at the idea that it's not so much that GOFP invalidates the p value, it's that it alters the evidentiary value of the p value to a Bayesian.


Once again on Using the Data Twice

2017 April 6
by Daniel Lakeland

So there's an argument in the blogtwitterosphere about updating your bayesian model with the data twice.

On the one hand, we have EJ Wagenmakers saying:

And on the other hand, we have "Laplace" saying updating with the data twice is totally fine

Now, Laplace's math is absolutely correct. But, it's also subtle, because it's purely symbolic. When we build a model, we need to write down specific mathematical expressions for p(Foo | Bar) for all the foos and bars.

Let's see Laplace's equation in detail:

P(A|B,B) = P(B,B|A) P(A)/P(B,B) = P(B|B,A) p(B|A) P(A) / (P(B|B) P(B))

Now, P(B|B,A) = 1 because given B, B has to be true, same for P(B|B) and when you plug those in, you get P(B|A)P(A)/P(B) = P(A|B) = P(A|B,B)

BUT: when you build a Bayesian mathematical model, you CAN make mistakes, just like when you program a numerical integration routine, you can make mistakes. Suppose instead of P(A|B,B) we calculate P(A|B1,B2) where B2 is a deterministic copy of the data in B1.

Now, if we *remember this fact* correctly, we'll get P(B2|B1) = 1 and P(B2,B1) = P(B1) and we'll get the results above. But, if we forget this fact and pretend that B2 is new independent data, we will get the same results as if we had collected 2x as much data as we really did collect and treated it all as separate information. The mistake is as simple as doing something like

for(i in 1:(2*N)){ 

data[i] ~ normal(foo,bar);


instead of

for(i in 1:N){

data[i] ~ normal(foo,bar)


The second one is correct, because the second copy of data adds no information to the posterior as the probability of each data value past the Nth value is 1 given that we already know data values 1..N.

It's *this mistake* which is a bug, is common, and leads to the statements along the line "only use the data once". The statement "only use the data once" is like the statement "don't use global variables to pass arguments to functions" it's useful advice to reduce the chance of committing an error. It's not mathematical truth.


Suicide Rates in the US

2017 April 4
by Daniel Lakeland

I got this data from CDC Wonder, and I let them do the population rate calculation, so I don't know if they did it right, but let's assume they did. Here are suicide rates for Males and Females ages 15 to 65 across the whole US, by race.

Notice that White rates are higher than Black or Asian rates, and that they've been trending upwards steadily since 2000.


A new term

2017 March 27
by Daniel Lakeland

Wansinking (verb, gerund): To do research about an essentially unimportant topic in a sloppy and unprincipled manner, possibly even inventing data, while attracting enormous amounts of credulous popular press coverage and corporate sponsorship for years and years, and dodging criticism by acting or being essentially clueless. cf. Brian Wansink.


Call for Book Recommendations on Game Theory

2017 March 9
by Daniel Lakeland

I'd like a book on game theory that is readable and interesting, has some examples that are somewhat real-world, but isn't afraid to use math. (ie. not something written for people who read Malcom Gladwell etc)

I dislike Definition, Theorem, Proof math books. I mean, not entirely, I like that stuff fine when I'm reading about abstract math, but generally find them tedious when the topic should have applied content. What interests me is how the formalism maps to the real-world, not excruciating details on what the content of the formalism is.

I loved this book on set theory, and Barenblatt's book Scaling, and Practical Applied Mathematics, and I'd like something at that level if possible. Also helps if it's less than $50 or so.

Hoping someone has something to recommend.

Of particular interest: issues in mechanism design, dynamic / repeated games, games with no stable equilibrium, rent-seeking, games where greedy algorithms fail, etc. I've been thinking a lot about economics problems, and I'd like to get familiar enough with the basic stuff to be able to talk about why certain policies put together result in lousy outcomes without appearing foolish for missing some very basic known results etc.

Also, any thoughts specifically on Steven Tadelis' book?