# Inference on a ratio from two separate experiments (a case study in Stan awesomeness)

Andrew Gelman's blog has a post about a case where you do two experiments: in one you measure X and Y with measurement error and want to find the true ratio of the parameters Xbar/Ybar, and in a second experiment you do the same for Pbar/Qbar. Here is the R code to run the complete experiment:

Note that we want p(R | Xtrue, Ytrue, Ptrue, Qtrue) = p(R | Xtrue, Ytrue) p(R | Ptrue, Qtrue), and each of these is a delta function (that is, if you tell me Xtrue and Ytrue, I know R exactly), so we have to approximate this delta function with a sufficiently tightly peaked normal distribution. Basically we tell Stan to stay in the region where both experiments agree to within epsilon = 0.001.
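The "agree to within epsilon" idea can be sketched numerically outside of Stan. Here is a toy rejection-sampling version in Python; the means and standard deviations are illustrative stand-ins for the two experiments' marginal posteriors, not output of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Illustrative stand-in draws for Xtrue, Ytrue, Ptrue, Qtrue
# (roughly on the scales used in the example; purely hypothetical).
x = rng.normal(1.0, 0.03, n)
y = rng.normal(2.0, 0.08, n)
p = rng.normal(30.0, 1.3, n)
q = rng.normal(60.0, 1.2, n)

r1, r2 = x / y, p / q

# Keep only draws where the two experiments' ratios agree to within eps
eps = 0.005
keep = np.abs(r1 - r2) < eps
rat = 0.5 * (r1[keep] + r2[keep])

print(rat.mean(), rat.std())
```

Tightening eps concentrates `rat` further; that is the role the 0.001 plays in the Stan model.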

```r
## inference on a ratio via two separate experiments
library(rstan)

dataset <- data.frame(X = rnorm(10, 1, .1),
                      Y = rnorm(10, 2, .3),
                      P = rnorm(10, 30, 6),
                      Q = rnorm(10, 60, 4))

## now the true value of R = 1/2 = 30/60, but we have two experiments
## where we measure X,Y or we measure P,Q with measurement error that
## is of known size (we know the .1 and .3 and 6 and 4 used as
## standard deviations; assume we have a well calibrated instrument).

stancode <- "
data {
  vector[10] X;
  vector[10] Y;
  vector[10] P;
  vector[10] Q;
}
parameters {
  real<lower=0> Xtrue;
  real<lower=0> Ytrue;
  real<lower=0> Ptrue;
  real<lower=0> Qtrue;
  real<lower=0> Rat;
}
model {
  // we assume on physical principles that these are positive and have
  // an upper-bound order-of-magnitude estimate
  Xtrue ~ uniform(0, 1000);
  Ytrue ~ uniform(0, 1000);
  Ptrue ~ uniform(0, 10000);
  Qtrue ~ uniform(0, 10000);

  X ~ normal(Xtrue, .1);
  Y ~ normal(Ytrue, .3);
  P ~ normal(Ptrue, 6);
  Q ~ normal(Qtrue, 4);

  // constrain the ratios to agree to within a small epsilon
  Rat ~ normal(Xtrue / Ytrue, .001);
  Rat ~ normal(Ptrue / Qtrue, .001);
}
"

samps <- stan(model_code = stancode, data = dataset)
samps
```

When I run it, I get the following output:

```
Inference for Stan model: 5100ead2741217b889326da6ff0f0419.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
Xtrue  1.02    0.00 0.03   0.97  1.01  1.02  1.04  1.08  2105    1
Ytrue  2.06    0.00 0.08   1.91  2.00  2.06  2.11  2.22  1568    1
Ptrue 30.24    0.04 1.31  27.71 29.34 30.23 31.15 32.87  1410    1
Qtrue 60.64    0.03 1.22  58.20 59.85 60.66 61.43 63.06  2172    1
Rat    0.50    0.00 0.02   0.46  0.48  0.50  0.51  0.54   927    1
lp__  -6.28    0.04 1.59 -10.39 -7.05 -5.95 -5.11 -4.21  1588    1

Samples were drawn using NUTS(diag_e) at Sun Apr 9 19:37:53 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

Note that with 10 observations in each experiment, the posterior mean of Rat is 0.50 with negligible Monte Carlo standard error, and almost all the samples of Rat fall between about 0.46 and 0.54.

How about:

```stan
parameters {
  real logYtrue;
  real logQtrue;
  real logRat;
}
transformed parameters {
  real Ytrue = exp(logYtrue);
  real Xtrue = exp(logYtrue + logRat);
  real Qtrue = exp(logQtrue);
  real Ptrue = exp(logQtrue + logRat);
}
```

Yes, then you drop the approximate normal constraints Rat ~ normal(..., 0.001) and it's exact, right? Good old algebra.

There's a folk theorem here somewhere: "wherever you want a delta function, you can replace it with some clever algebra"

I think you have to be careful about the implied priors. If you have logYtrue and so forth as your parameters and you don't declare explicit priors, that implies uniform priors on the whole log space, which in turn implies really weird divergent priors on the Ytrue, Qtrue, and Rat scales.

I don't think the log transform is really required; you can just do multiplication, right?

```stan
parameters {
  real Ytrue;
  real Qtrue;
  real Rat;
}
transformed parameters {
  real Xtrue;
  real Ptrue;
  Xtrue = Rat * Ytrue;
  Ptrue = Rat * Qtrue;
}
model {
  // priors on Ytrue, Qtrue, Rat go here
  X ~ normal(Xtrue, foo);
  Y ~ normal(Ytrue, bar);
  P ~ normal(Ptrue, baz);
  Q ~ normal(Qtrue, quux);
}
```

That was deliberate. It seemed to me that they're really scale parameters -- not statistically speaking, but in terms of whatever scientific context it is that makes them positive and makes their ratios meaningful. As such, they get uniform priors on the log scale, at least to start off with.

Actually that will immediately lead to an improper posterior, won't it?

Turns out this problem is entirely analytically tractable -- there's no need to mess about with approximate "likelihoods" that aren't actually likelihoods.

I didn't think much about it but I typically use informative proper priors in everything.

Would be happy to see your analytic version. The premise that there's something broken about Bayesian combining of experiments felt like a real gauntlet throw down...

This problem has some surprising subtleties relating to the requirement that the marginal priors on the individual numerator and denominator estimands be independently uniformly distributed. For example, the prior on Xtrue, Ytrue, Ptrue, and Qtrue implied by

```stan
Xtrue ~ uniform(0, 1000);
Ytrue ~ uniform(0, 1000);
Ptrue ~ uniform(0, 10000);
Qtrue ~ uniform(0, 10000);
Rat ~ normal(Xtrue / Ytrue, .001);
Rat ~ normal(Ptrue / Qtrue, .001);
```

does not factor per independence (nor is it uniform). Furthermore, the improper prior

```stan
parameters {
  real Xtrue;
  real Ytrue;
  real Qtrue;
}
transformed parameters {
  real Rat;
  real Ptrue;
  Rat = Xtrue / Ytrue;
  Ptrue = Rat * Qtrue;
}
```

is not correct in the sense that the limit of a sequence of proper priors doesn't converge there.

I started looking at this problem analytically. I didn't get that far, but it seems that if someone published normal posteriors for x and y, you can immediately get a density on R by integration along the hyperbola x/y = R. Same for p and q. It seems like there is really no inference involved here, only integration.
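For a single experiment, this "integration along the hyperbola" is the standard ratio-of-random-variables change of variables; a sketch, assuming positive support and posterior independence of x and y:

```latex
p(R = r \mid D_1) \;=\; \int_0^\infty y \, p_x(r y \mid D_1)\, p_y(y \mid D_1)\, \mathrm{d}y
```

The factor of y is the Jacobian of the map (x, y) → (r, y).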

What I wrote above isn't right. The prior specification I gave can arise as a limit of proper priors -- the problem is that it's not unique.

Here's what I was thinking: let's follow Jaynes and make the posterior arise as a limit where the corresponding limiting process on the prior drives it to be improper uniform. Suppose

```
Xtrue ~ Uniform(0, x_0)
Ytrue ~ Uniform(0, y_0)
```

What does that imply about the prior for Rat = Xtrue/Ytrue? This is a standard textbook problem; the solution is the uniform ratio distribution (scaled by x_0/y_0). We want to set up the joint prior so that (at least considered marginal of the other variables) Qtrue and Ptrue = Rat*Qtrue also have independent uniform priors, so that symmetry exists between (Xtrue, Ytrue) and (Ptrue, Qtrue). This implies that the form of the conditional prior (Qtrue | Rat) should be the same as that of (Ytrue | Rat).
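The uniform-ratio claim is easy to check numerically. Here is a quick Monte Carlo sketch in Python with x_0 = y_0 = 1 (illustrative, not part of the original code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 1_000_000)
y = rng.uniform(0, 1, 1_000_000)
r = x / y

# With x_0 = y_0 = 1 the ratio density is 1/2 on (0, 1] and
# 1/(2 r^2) for r > 1, so the CDF is r/2 below 1 and 1 - 1/(2 r) above.
print(np.mean(r <= 0.5))  # analytic CDF value: 0.25
print(np.mean(r <= 2.0))  # analytic CDF value: 0.75
```

Note that the density's functional form changes at r = x_0/y_0, which is exactly the switch between binding constraints discussed next.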

Consider the form of the prior on (Ytrue, Rat). It satisfies the constraints 0 < y_true < y_0 and 0 < r * y_true < x_0. Within those constraints it's proportional to y. Ytrue and Rat aren't independent, but the dependence arises from the boundaries of the support, not the functional form within the support. The conditional prior of (Ytrue | Rat) shows this clearly: you get it by fixing a value of Rat and normalizing, and the support of the distribution depends on which constraint binds at the chosen value of Rat. Note that if Rat < x_0/y_0 then the constraint 0 < y_true < y_0 binds and the conditional prior is free of r, but if x_0/y_0 < Rat then the constraint 0 < y_true < x_0 / r binds and r^2 shows up in the normalizing constant.

When the conditional prior on (Ytrue | Rat) is multiplied by the marginal prior on Rat, the r^2 term cancels and the resulting functional form is, as noted above, free of r. We can then take the limit as x_0 and y_0 go to infinity and cover the entire positive quadrant (and if we like, we can reflect the prior to cover the other quadrants too).

Now let's bring Qtrue into the picture. The joint prior is

p(r, y_true, q_true) = p(r) p(y_true | r) p(q_true | r)

and there's only enough marginal Rat to cancel out the Rat-dependence of one of the two conditional priors. This means the functional form of the posterior will change at r = x_0 / y_0, and this dependence on the limiting process doesn't go away as x_0 and y_0 go to infinity.

I think dependence on the limiting process will disappear if I use conjugate priors, i.e., Gaussians with vanishing precisions instead of uniforms with vanishing precisions. The marginal prior of Rat is (scaled) Cauchy and the conditional prior of a denominator is the square root of an exponential distributed random variable with mean inversely proportional to 1 + r^2.
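The Cauchy marginal is easy to sanity-check: the ratio of two independent zero-mean normals with equal scales is standard Cauchy, whose quartiles sit at ±1. A quick Python check, not tied to the specific conjugate construction above:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1_000_000)
y = rng.normal(0, 1, 1_000_000)
r = x / y

# A standard Cauchy has quartiles at -1 and +1: half the mass in [-1, 1].
print(np.mean(np.abs(r) <= 1.0))  # should be close to 0.5
```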

Corey: suppose instead of data you just get normal posterior distributions for x, y, p, q.

Doesn't this immediately imply a density on R?

P(R)dR = Integrate(P(Ry,y,Rq,q) dy R dy dq R dq, y=0..inf, q=0..Inf)

Which has something wrong about it, because when you divide by dR you get something infinitesimal. I'm on my phone and traveling, so I'll have to figure out what's wrong later, but perhaps you can see it right away.

You have to somehow apply the constraint that the two ratios are identical. Your approach of multiplying the log-posterior by a term that enforces near-equality is a good one, but there's a Borel-Kolmogorov paradox lurking there and that makes me leery.

Consider just two mean parameters with flat priors and observations with normal likelihoods (with non-equal but known precisions) for each. The resulting posteriors are, for example,

```r
x <- rnorm(1e7, 2, 1/sqrt(3/4))
y <- rnorm(1e7, 4, 2)  # 2 = 1/sqrt(1/4)
```

If we select sufficiently close values of x and y,

```r
idx1 <- abs(x - y) < 1e-3
```

then the result is just the same as if we did two updates on a single mean parameter. But the process does not commute under non-linear monotonic transformations.

```r
idx2 <- abs(asinh(x) - asinh(y)) < 1e-3

plot(ecdf(asinh(x[idx1])), xlim = c(-1, 3))
plot(ecdf(x[idx2]), xlim = c(-1, 3))
```

(Of course, the same difficulty arises with Bayes's theorem itself; the resolution is to recognize that continuous distributions for measured values are approximations to discrete ones.)

That's why I want to use a parameterization in which the ratio estimand is unique from the start; if someone hands me normal posteriors I'm going to have to back out the likelihoods (not too difficult when the priors are flat) and combine them with my own priors. Unfortunately that leaves me with problems with a different limiting process. Maybe they're secretly the same process...

So: someone gives you p(x|D1), p(y|D1), p(p|D2), p(q|D2), with independent measurement errors on everything, where D1 is data from experiment 1 and D2 is data from experiment 2 (but the data are not revealed, only the posteriors). You know that xtrue/ytrue = ptrue/qtrue = Rtrue.

```
p(R | D1, D2) = integrate( p(x = R y, p = R q | y, q) p(y | D1) p(q | D2) dy dq )
              = integrate( Delta(x/y - R) Delta(p/q - R) p(y | D1) p(q | D2) dy dq )
```

Now, in general it's not possible to define multiplication on generalized distributions such as the delta function. But you can use IST (internal set theory) or, what amounts to the same thing, some particular limiting process to define this.

Define the nonstandard family of pre-delta functions as normal_pdf(x/y - R, ds1), where ds1 is an infinitesimal standard deviation, and normal_pdf(p/q - R, ds2) for infinitesimal ds2.

Now, basically we can see that the result we get depends on the relative size of the two slacks, i.e., the ratio ds1/ds2, which is a free parameter and can be any positive number. In other words: how much relative precision does experiment D1 give us compared to D2? We need to weight the evidence of these two experiments.

Of course we might get a different result if we use a different pre-delta function, but the essential fact is not that we could use Gaussians vs. uniforms vs. triangle distributions vs. whatever; it's that we have this totally free parameter ds1/ds2 that we need to specify.
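The role of ds1/ds2 as an evidence weight can be made concrete: a product of two normal "pre-delta" factors in R is proportional to a normal whose mean is the precision-weighted average of the two ratio estimates. A toy Python sketch, where r1 and r2 are made-up ratio estimates from the two experiments:

```python
def combine(r1, r2, ds1, ds2):
    """Mean of the normal proportional to
    normal_pdf(R - r1, ds1) * normal_pdf(R - r2, ds2):
    the precision-weighted average of r1 and r2."""
    w1, w2 = 1.0 / ds1**2, 1.0 / ds2**2
    return (w1 * r1 + w2 * r2) / (w1 + w2)

r1, r2 = 0.48, 0.52  # hypothetical ratio estimates
print(combine(r1, r2, 1e-3, 1e-3))  # equal slacks: exactly 0.5
print(combine(r1, r2, 1e-3, 1e-6))  # ds1/ds2 = 1000: essentially r2
```

Nothing in the setup pins down the ratio of the two slacks, yet the answer moves with it.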

The alternative is to acknowledge that each experiment has some kind of bias in it: something that would make the R ultimately inferred from that experiment, in the infinite-data limit, different from the "unbiased true R".

Then we can define R_i, the ratio for each experiment. We can say that by doing this experiment in multiple ways at multiple labs we can, at least on average across the labs, produce something that has net zero bias, and do something like

R_i ~ normal(R_ultimate,sigmaBias)

where R_ultimate is the real thing we're doing inference on. Then, without any logical difficulties involving defining multiplication on generalized distributions, and/or free parameters that are the limiting ratio of two infinitesimals...

p(R_i | y_i) p(y_i) = p_xi(R_i*y_i) p(y_i)

and all the R_i pool through the hyperparameters R_ultimate and sigmaBias,

and I think we're going to get a well-posed problem, where the role played by the mysterious "slack" in the previous version is now played by the distribution over the collection of R values, which you get by acknowledging the small biases introduced by doing the experiment in different labs with different apparatus, different data collectors, different times, different measurement instruments, and whatnot.

The issue isn't just that the infinitesimal elements in normal_pdf(x/y - R, ds1) and normal_pdf(p/q - R, ds2) don't have a well-defined ratio. It's that for an arbitrary non-linear monotonic function f it's not clear why we should prefer normal_pdf(x/y - R, ds1) to normal_pdf(f(x/y) - f(R), ds1). For example, if f(u) = u * y (so f(x/y) = x), then we get normal_pdf(x - R*y, ds1), which looks about as reasonable to me as normal_pdf(x/y - R, ds1).

Right, that makes sense. However, I don't think it poses a problem for the case where there is one R for each experiment and we use p(x_i = R_i y_i | y_i) p(y_i | D_i) for each i, with a hyper-distribution on the R_i.

I think it does -- you end up having to marginalize out the experiment-level R parameters and that leaves you pretty much where you started.

It's worth noting that none of this depends on the definition of R as the ratio of normals. You get into very similar issues with a difference of normal means as the estimand of interest:

Delta = Xtrue - Ytrue = Ptrue - Qtrue

instead of

R = Xtrue/Ytrue = Ptrue/Qtrue

where again Xtrue and Ytrue are prior-independent marginal of the other variables and Ptrue and Qtrue are prior-independent marginal of the other variables by construction. That is,

p(delta, y_true, q_true) = p(delta) p(y_true | delta) p(q_true | delta)

and there's only enough marginal Delta to cancel out the delta-dependence of one of the two conditional priors.

Aren't these basically the same issue? Suppose we use normal(x, M, ds1) with ds1 infinitesimal as a delta function for x in the infinitesimal vicinity of M. If you apply a nonlinear monotonic function f, then for x in the infinitesimal halo of M it is a linear function f(M) + f'(M) (x - M), so the probability associated with dx becomes a probability associated with df = f'(M) dx. To match probabilities you need a delta function in f-space that is wider by a factor f'(M); you'll match things probability-wise by expanding to normal(f(x), f(M), ds1 * f'(M)). So in some sense the non-symmetry under continuous transformation is closely related to the lack of a specific informed quantity for the infinitesimal ratio ds1/ds2.
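The local-linearization argument here can be checked numerically: a slab of width eps around M in x-space carries (almost) the same probability as a slab of width eps * f'(M) around f(M) in f-space. A small Python check with f = exp, an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 0.5, 2_000_000)
M, eps = 1.0, 1e-2

# f(x) = exp(x), so f'(M) = exp(M); near M, f is effectively linear.
p_x = np.mean(np.abs(x - M) < eps)
p_f = np.mean(np.abs(np.exp(x) - np.exp(M)) < eps * np.exp(M))
print(p_x, p_f)  # nearly equal
```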

The indeterminacy of ds1/ds2 goes away because we require that the resulting density be normalized. Compare the scaling property of the delta function to its behaviour under function composition.
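For reference, the two standard delta-function identities being contrasted here (scaling vs. composition) are:

```latex
\delta(a u) = \frac{1}{|a|}\,\delta(u),
\qquad
\delta\big(g(u)\big) = \sum_i \frac{\delta(u - u_i)}{|g'(u_i)|},
\quad \text{where } g(u_i) = 0,\; g'(u_i) \neq 0.
```

The 1/|g'| factor is what restores normalization after a monotonic transformation.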

Sure, so long as you define composition in terms of division by the derivative, you are able to enforce an invariance. But in the examples you're giving, you're saying that you are logically indifferent between a constraint like, say, abs(x - y) < 0.001 and abs(f(x) - f(y)) < 0.001. Once you've decided which you think is the important "rule", you can keep things invariant by defining a transformation property of the delta function. But until you come up with a principled decision you have this indeterminate ratio to contend with: abs(x - y) < 0.001 implies, roughly, abs(f(x)/f'(x) - f(y)/f'(y)) < 0.001, which, since x and y are also close, says more or less that abs(f(x) - f(y)) < f'(x) * 0.001. And since f is a free variable, you can make f'(x) any number at all, which is morally the same thing as saying ds1/ds2 is any positive quantity.

I keep wanting to work on this in some kind of detail but it will have to wait a few more days.

P(R | R1...RN) Product(P(Ri|Yi)P(Yi|Datai))

Looks well constructed at first glance

Here the generative model approach makes things clear. You have the top-level R and N slightly perturbed versions of it; the joint prior is

P(R)Product{P(R_i|R)}

The rest of the model consists of the conditional distributions that ensure that for a given i and marginal of the rest of the variables, Xtrue_i and Ytrue_i are independent in the prior.

P(R)Product{P(R_i|R)P(Ytrue_i | R_i)P(X_i, Y_i | R_i, Ytrue_i)}

(I don't think having Xtrue_i and Ytrue_i independent in the prior is necessarily desirable for Bayesians or anything -- it's just the condition that ensures that credible regions (marginal of the other variables) are also exact confidence regions and I want to see what enforcing that condition does for the confidence coverage of the credible regions of R (when likelihoods and priors aren't being approximated and mangled). In one sense the indeterminacy of the prior is an advantage because one could tune it or even make it data-dependent to get good coverage -- presumably that's what a confidence distribution theorist would do -- but I want to be as Jaynesian as possible here.)