On Incorporating Assertions in Bayesian Models

2016 August 23
by Daniel Lakeland

I've been thinking a lot recently about how to specify a prior over functions. One way is to specify a Gaussian Process, but this can be quite computationally heavy, and when the function in question is fairly smooth and simple, it's also kind of overkill.


The typical method for specifying a function is via a basis expansion. For example in the R function lm you can specify a third order polynomial model for y as y \sim x + x^2 + x^3. This means y = a + bx + cx^2 + dx^3 + \epsilon and uses linear least squares to fit.

Well, when fitting this sort of thing in a full Bayesian model, you should provide priors over a,b,c,d. Typically we can specify some very simple thing about each, such as maybe

a ~ normal(0,100) if you know that the size of the coefficients is typical a small multiple of 100.

But usually we have some information not about the individual coefficients themselves, but about some property of the function. For example, we might know that the function is "near" to some constant value, doesn't change too rapidly, and doesn't wiggle too rapidly:

\int_0^1 f(x) dx \sim \mu

\sqrt{\int_0^1 (\frac{d}{dx} f(x))^2 dx} \sim S

\sqrt{\int_0^1 (\frac{d^2}{dx^2}f(x))^2 dx}\sim Q

From the perspective of MCMC, we can calculate numerical quantities either using symbolic methods or by numerical integration, and then place probabilities on these outcomes. Note, these quantities are in general nonlinear functionals of the f(x) function. The parameter space in general will be high dimensional (let's say for example 15 coefficients in a polynomial fit) and the number of conditions for which we have prior information are less than 15, so there is no one-to-one mapping from coefficients to \mu,S,Q. As such, the prior you're going to place here is in part determined by the "multiplicity" of the outcome. Some \mu values can occur in more ways than other \mu values... This is just a fact of life about a mismatch between our state of information and the dimensionality of the parameters.

As an example, here's a complete R script using rstan and ggplot2 to put a distribution on a 3rd order polynomial on the region x \in [0,1]. To calculate the mean value, RMS slope, and RMS curvature I use a super-simple 3 point midpoint integration rule. I specify that the mean value is normally distributed around 100, the RMS slope is of typical size 10, and the RMS second derivative is on order 1000. Run the script and you'll get 20 random curves plotted, you'll see that they are "flattish" with some mild wiggles on the scale of a function whose values are around 100.

Add some data generated by your favorite function with some random measurement error, and add a normal likelihood, and see what happens. Add some additional parameters aa,bb,cc,dd which have uniform(-1e6,1e6) priors and the same normal likelihood (essentially the least-squares solution). Compare the posterior fits especially with only a few data points when you provide smoothness information compared to the unconstrained least squares solution.

Does the very low quality numerical integration routine matter? What happens if you add some additional higher order terms? If you have N data points and N coefficients, you can fit the curve directly through the points. That will inevitably be the least squares solution. Does that happen when you provide this kind of smoothness prior?

Note: it's problematic to do basis expansion in the 1,x,x^2,x^3... basis, usually you'd work with an orthogonal polynomial, and deal with numerical evaluation of the polynomial using special techniques for stability and to prevent cancellation problems etc. But this is just an example, and the technique should work well even for other types of basis expansions, such as radial basis functions which are often a good idea for strangely shaped domains (2D or 3D functions for example).

Above are pairs of scatterplots for samples of a,b,c,d, showing how the effective prior looks.


stanmodl <- "

real f(real x,real a, real b, real c, real d){
return a + b*x + c*x^2 + d*x^3;

real fp(real x, real a, real b, real c, real d){
return(b + 2*c*x+3*d*x^2);

real fp2(real x, real a, real b, real c, real d){
return 2*c+6*d*x;

real favg(real a, real b, real c, real d){
return((f(1.0/6,a,b,c,d) + f(0.5,a,b,c,d) + f(1-1.0/6,a,b,c,d))/3);

real slfun(real a, real b, real c, real d){
return(sqrt((fp(1.0/6,a,b,c,d)^2 + fp(0.5,a,b,c,d)^2 + fp(1-1.0/6,a,b,c,d)^2)/3));

real qfun(real a, real b, real c, real d){
return sqrt((fp2(1.0/6,a,b,c,d)^2 + fp2(.5,a,b,c,d)^2 + fp2(1-1.0/6,a,b,c,d)^2)/3.0);


real a;
real b;
real c;
real d;

transformed parameters{

real fbar;
real q;
real slavg;

 fbar = favg(a,b,c,d); 
 q = qfun(a,b,c,d);
 slavg = slfun(a,b,c,d);
fbar ~ normal(100.0,10);
slavg ~ exponential(1/10.0);
q ~ exponential(1/1000.0);


samps <- stan(model_code=stanmodl);
vals <- extract(samps);

plt <- ggplot(data=data.frame(x=0),aes(x=x))+xlim(0,1)
funs <- list();
for(i in sample(1:2000,20)){
    a <- vals$a[i];
    b <- vals$b[i];
    c <- vals$c[i];
    d <- vals$d[i];
    funs <- c(funs,    (function(a,b,c,d){a <- a; b <- b; c <- c; d <- d;  function(s){return(a+b*s+c*s^2+d*s^3)}})(a,b,c,d))

for(i in 1:length(funs)){
    plt <- plt+stat_function(fun=funs[[i]])



Feynman on Physical Law (apropos of Cox's Theorem)

2016 August 21
by Daniel Lakeland

I know, I know, everyone either loves Feynman or finds him pompous... but I do think this clip has something useful to say about what makes science different from all the other ways we might try to make sense of the world. And, in doing so, it helps me make the point about what part Cox's theorem and Bayesian probability theory play in science.

Guess is more or less his starting point. From a formal perspective this means somehow we specify some formal mathematical model that predicts the outcomes of experiments. We might also specify several competing models. We need to write down these models in a formal system, in modern terms, we need to program these models unambiguously into a computer. The basis of our formal systems is Set Theory.

Compute the consequences more or less this is straightforward if you are given some numerical values of certain quantities, the heavy lifting here is done by the theory of functions, computation, calculus, differential equations, numerical analysis, etc. But, we need those numerical values of the quantities. The Bayesian solution is to specify a state of knowledge about what those quantities might be, a prior distribution, where some values have higher probability than other values because we think those values are more reasonable (not more common under repetition, just more reasonable in this case)

Compare Predictions to experiment to a Bayesian this means see which values of the certain quantities are needed to make the outcomes of the experiment be "close" to the predictions, and not only that, but also see how high in that probability distribution around the prediction they will be (how precisely the best version of the model predicts reality). When there are several models, the ones that put the outcomes in a high probability region are going to get higher post-data probability, that is, we are automatically going to put probabilities over our several models based on how well they predict.

What parts are special to science? Certainly guessing is not unique to science, the history of Greek, Norse, Native American, and other mythologies shows that we like to guess a lot.

Computing the Consequences isn't unique to science either, Acupuncture / Chinese Medicine is largely guessing explanations in terms of Qi and hot vs cold foods and meridian lines and so forth... and then computing which spices or herbs or special tinctures or needle locations are recommended by the model.

Compare Predictions to Experiment: is really the essence of what makes science special, and in order to do this step, we need a meaning for "compare". The Bayesian solution is to force the model builder to use some information to specify how well the model should predict. In other words, what's the "best case"? Specifically the quantity p(Data | Params) should be taken to be a specification of how probable it would be to observe the Data as the outcomes of experiments, if you knew *precisely* the correct quantities to plug into Params. The fact that we don't put delta-functions over the Data values reflects our knowledge that we *don't* expect our models to predict the output of our instruments *exactly*.

So, what does Cox's axioms teach us? It's really just that if you want to use a real number to describe a degree of plausibility about anything you don't know, and you want it to agree with the boolean logic of YES vs NO in the limit, you should do your specifications of degrees of plausibility using probability theory.

Cox doesn't tell you anything much about how to Guess, or how to Compute The Consequences (except that you should also compute the probability consequences in a certain way), but it does have a lot to say about how you should Compare Predictions to Experiment.


Some Comments on Cox's Theorem

2016 August 16
by Daniel Lakeland

Apparently Cox's original paper has some subtle issues, some assumptions that were not explicitly mentioned or even realized by Cox. In the interim many years, some of these technicalities have been patched up. Kevin S Van Horn has a great "guide" to Cox's theorem which I like a lot. He has some other interesting related stuff as well.

In his presentation of Cox's theorem, he relies on 5 requirements

  • R1: plausibility is a real number
  • R2-(1-4): plausibility agrees in the limiting case with propositional calculus (where everything is either known to be true, or known to be false).
  • R3: plausibility of a proposition and of its negation are functionally related.
  • R4: Universality: In essence plausibility values take on a dense subset of an interval of real numbers, and we can construct propositions that have certain plausibilities associated with them.
  • R5: Conjunction: The conjunction function is a continuous strictly increasing function F such that p(A and B | X) = F(p(A|B,X),p(B|X)).

Kevin motivates each of these and gives some objections and discussion about why each one is required. For me, the point of the exercise is to generalize propositional calculus to real-number plausibility, so R1,R2,R3 are obvious to me, though I realize some people have objected (for example, there are 2-dimensional alternatives, where a proposition and its negation have different plausibilities unrelated by a functional equation)

I'm just not interested in such things, so when I look at Kevin's requirements, Universality and Conjunction are the ones I think you could attack. To me, the dense set of values, and that there must be more than one value (the endpoints of the interval are not the same) seem obvious.

The full version of R4 is:

R4: There exists a nonempty set of real number P_0 with the following two properties:

  • P_0 is a dense subset of (F,T).
  • For every y_1,y_2,y_3 \in P_0 there exists some consistent X with a basis of at least three atomic propositions A_1,A_2,A_3 such that (A_1|X) = y_1,(A_2|A_1,X)=y_2, (A_3|A_2,A_1,X) = y_3

This is perhaps the least understandable part. It says basically that for any three plausibility values, there exists some consistent state of knowledge that assigns those plausibility values to certain propositions. A consistent state of knowledge is one where we can't prove a contradiction (ie. A and Not A).

Note that this isn't a claim about the actual world. We aren't restricted to "states of knowlege" that are say consistent with the laws of physics, or consistent with a historically accurate account of the Norman Conquest, etc. This is a mathematical statement about whether we can *assign* certain plausibilities to certain statements without causing a contradiction. So "A1 = Unicorns keep an extra horn at home for emergencies" and "A2 = The second sun of the planet Gorlab rises every thursday" and soforth are acceptable statements whose truth value we could assign certain plausibilities to to meet these logical requirements. This is called "universality" because it expresses the desire to be able to assign plausibilities at will to at least 3 statements regardless of what those statements are about (whether that's unicorns, or the position of fictional suns, or it's the efficacy of a real drug for treatment of diabetes, or the albedo of a real planet circling a remote star).

Van Horn points out that without R4 we can construct a counterexample, so something possibly weaker might be allowable, but we can't omit it entirely. Van Horn also points out that Halpern requires that the set of pairs of propositions be infinite. But note, there's nothing that requires us to restrict Cox's theorem to any one area of study. So, for example, while you may be using probability as logic to make Bayesian decision theory decisions about a finite set of objects whose measurements come from finite sets, someone else is using it to make decisions about continuous parameters over infinite domains. This is Van Horn's argument, basically that we're looking for a single universal system that works for all propositions, not just some finite set of propositions about your favorite subject. He goes further to point out that what Halpern describes as (A|B) really should be (A|B,X) where X is a state of information about the statements. So long as you allow an infinite set of possible states of information, you can still apply Cox's theorem within a finite domain of finite objects.

Van Horn's article is worth a read, and he discusses many issues worth considering.

A summary of some discussions on data-dredging, hiding data, private models, and regulatory approval

2016 August 4
by Daniel Lakeland

We discussed a bunch of stuff at Andrew's blog on data dredging, hiding data, selecting your favorite analysis, etc. Here's a summary of my position:

In Bayesian analyses, the only thing that matters is the data and the model you actually have. That is, essentially p(Data | Parameters, Knowledge) p(Parameters | Knowledge).

The question arises as to how it affects things if you do some potentially questionable practices. So let's look at some of them:

  1. Cherry-picking data: you collect N data points, and then select n of them, submit these n as if they were all the data you have, together with a model... THIS IS BAD. The reason it's bad is that p(Data | Parameters, Knowledge) is now the wrong formula, because "Knowledge" doesn't include knowledge of the REAL data collection process. The rules by which you selected the n data points from the N data points are missing. Since the likelihood is a model of the whole data collection process, not a god-given fact about the "true frequency distribution in repeated sampling" (whatever that is) it strongly depends on the whole process by which the data eventually makes its way into your data file.
  2. Cherry-picking outcomes: You collect an NxM matrix of data, and select one column to analyze (say N people tested for M different outcomes). Provided you give me the full matrix, and you explain the logic behind the choice of analyzing your column, and I agree with it, the existence of this additional data need not concern me. However, if information about some of that other data could inform my model of the one outcome of interest, then I may legitimately require you to include the other columns in the analysis. For example, if you analyze 3 different kinds of immune cells, and a self-reported allergy outcome, and you want to analyze the self-reported allergy outcome without looking at the immune system outcomes... I may legitimately  conclude that I don't agree with the model. When the extra data doesn't alter what the regulator should think the likelihood (or priors) should be, it's irrelevant, but if it would... then it's not ok to hide it. Only the regulator can make that decision.
  3. Cherry-picking models for a single outcome. If you collect N data points, one outcome column, and you are analyzing the one outcome column, and you've privately thought up 400 different ways to slice and dice this data and just reported your favorite model. Does this invalidate the inference? It depends. If the result of your slicing and dicing is that the likelihood you're using is extremely specific and would not be a likelihood that a person who hadn't seen the data already would choose... then YES it invalidates the analysis, but that's because it invalidates the choice of the likelihood. If the likelihood winds up as something not too specific that others who hadn't seen the data would agree with based on the background info they have, then no, it's not invalidated, it's a fine analysis. Whether or not you looked at hundreds of alternative models, or just this one model, a regulator needs to scrutinize the model to ensure that it conforms to a knowledgebase that is shared by other people in the world (regulators, third parties, etc). The number of other hypotheses you thought about, does not matter, just as it doesn't matter that you might have anxieties, or be wearing a lucky rabbits foot, or have prayed to a deity, or any other private thoughts you might have had. Bayesian models are more or less mathematicized thoughts.
    1. Note, if you try to argue for a single final model, and regulators don't agree with it, it's always possible to set up a mixture model of several explanations with priors over the models, and do Bayesian model selection. Either a single model will dominate at the end, or several will have nontrivial probability. Either way, we can continue with the analysis using the posterior distribution over the mixture.

In the end, the thing to remember is that a Bayesian model has a CHOICE of likelihood. This isn't a hard shared fact about the external world, it's a fact about what you assume or know. If you send a regulator data and the method by which you actually collected, filtered and handed over that data differs from the method that your trial protocol describes, or the likelihood doesn't correspond to something people other than you can get behind, then there is no reason why the world needs to believe in your analysis, (and you may need to get some jail time).

However, if you have privately got all sorts of worries, fears, and have looked at hundreds of alternative models, provided the one model you submit is agreeable to independent parties in terms of the logic, and the data collection process is as you said it would be, and the likelihood expresses it correctly, and the priors are not unreasonable to third parties, then Bayesian logic suggests that everyone involved needs to agree with the posterior independent of how many other ways you might have sliced the data. That's ensured by the consistency requirements of Cox's axioms.

Note however, that Frequentist testing-based inference doesn't seems to have the same property. In particular the idea behind a test of a hypothesis is that you will filter out say 95% of false hypotheses. But, how many hypotheses did you test, or could you have tested easily? The expected number of non-filtered false hypotheses you could easily get your hands on is \sum_{i=1}^N p_i where N is the number of false hypotheses you can easily get your hands on, and p_i is the p value each one gets. So, it matters how many other hypotheses you might have tested... which is ridiculous, and shows how by failing to meet Cox's axioms, Frequentist testing based inference goes wrong.


Some summaries of Bayesian Decision Theory

2016 July 29
by Daniel Lakeland

The problem of how to manage fisheries in Norway (note additional details in comments I left there) is apparently getting a lot of attention, and a lot of burden is being placed on biologists, because the existing rules make it a big deal what the biologists report for their "percent of salmon that are hatchery escaped" numbers. Typically they do a survey of 60 fish or so, and then report how many were tagged/identified as hatchery escaped, then if their numbers indicate less than 5% then it's all good for the farms, and greater than 5%, it's Billions of dollars in costs imposed on the farms... Sigh...

So, the Biologists ask Andrew what the heck? How do we take into account the fact that seeing 5 fish in 60 caught really doesn't tell us very exactly what the overall fraction is in the river?

Fortunately, there's a smart way to do this, and it's been known for a long time... unfortunately the people who know about it are pretty much unknown to the people who have the power to make the rules...

First, let's consider how uncertainty should affect our decisions. Suppose that there is some vector of numbers that describes the state of the world, and we don't know what it is. For example N,F to describe N the number of total salmon in the river, and F the fraction that are hatchery escapees. Now, we collect some data, n,f (lower-case) which is relevant for making an inference about N,F. So, based on some model M we can place a Bayesian posterior p(N,F | n,f,M). Now, we have some control variables, say d_i for the dollars we are going to spend on project i. And our model tells us what N_1,F_1 are likely to be, next year, given what we know about N,F this year and how much we plan to spend on each program d_i.

So, we'd like to make a decision about what the d_i values should be that is in some sense "the best we can do given our knowledge".

First off, we need to know how good or bad will it be if N_1,F_1 are any of the possible values. So traditionally we think in terms of cost, and we need a cost function c(N_1,F_1,d_i) that incorporates "how bad it is that we spent all this money on the d_i values and we wound up getting N_1,F_1 to be whatever their actual values are". Let's suppose that through some kind of negotiations including hopefully some kind of suggestions from multiple parties, and some score-voting, we arrive at such a function.

How do we make the decision? First off, note that we want the decision to depend on information about every possible outcome. Is it possible we'll have 5000 fish (up from 3000) and 0% hatchery next year? Then that's good, is it possible we'll have 40 fish and 10% hatchery next year? That's bad. Ignoring any of the possibilities in our decision has got to be sub-optimal. Since all the options have to enter into the decision, the decision will need to be made based on some kind of integral over all the possibilities. Furthermore, if we're given plausibilities p(N_1,F_1) and a particular value of N_1,F_1 becomes 2x more plausible after seeing some data then it should intuitively "count" 2x as much in our decision as before. More generally, the decision should be made based on a functional that is a linear mapping of p(N_1,F_1) to the real numbers (so we have an ordering of better vs worse). The expectation operation is what we're looking for

 E(\mathrm{Cost}|\mathrm{Model,Data}) = \int_{N_1,F_1} \mathrm{Cost}(N_1,F_1,d_i) p(N_1,F_1 | d_i,n_0,f_0,\mathrm{Model})dN_1dF_1

So, let's find the d_i values that make this expected value of cost as low as possible (within the feasible computer time and search algorithm abilities we have) and then implement the policy where we actually do all the stuff implied by the d_i values.

This is called the Bayesian Decision Rule, and in this case, since it's trying to control real world outcomes, another way to describe this is as Bayesian Optimal Control.

Some features:

  1. If the Cost function is continuous with respect to the outcomes, and the outcome predictions are continuous functions of the data, then the d_i choices are continuous functions of the data too. That is, small changes in inputs produce small changes in outputs, there's no "gee we caught 1 more fish, send out the bill for $1 Billion Dollars"
  2. Every possibility is considered and its importance for the decision is dependent on both how bad that possibility is, and how plausible it is that it will occur. The results are linear in the plausibility values.
  3. The cost function can be anything finite, so no matter how you feel about various outcomes, you can incorporate that information in your decision. If multiple people are involved they can negotiate the use of a cost function that expresses some mixture of their opinions.
  4. As pointed out by Dale Lehman in the discussion at Andrew's blog, the cost function need not, and some would argue should not necessarily be the same as the kind of thing Economists call cost functions (that is somehow, accurate dollar prices based on actual willingness to pay). In particular, you're only going to pay dollars for the cost of carrying out the chosen policy d_i, so stuff that has very high "cost" associated to it, and is very unlikely under the model, can affect the choice of the decision variables in a way that causes you to actually pay a very moderate amount. Willingness to pay for the decisions is more important than willingness to pay the amount of money cost(N,F) for some hypothetical really bad N,F. If you choose a cost function and it causes you to make decisions that pretty much nobody agrees with, it means your cost function was "wrong" (though this isn't necessarily the case if you just piss off some fraction of the population, you can't please everyone all the time).
  5. The class of Bayesian Decision Solutions is an "essentially complete class", that is, every decision rule that minimizes the expected cost under a Frequentist sampling theory of N_i,F_i given some "true" parameters x either has the same mathematical form as above, or there is a Bayesian rule that will produce as good or better decisions regardless of what the unknown parameter x is. In essence, you have to form a function q(x|Data) \propto p(Data|x)p(x) to get a rule that minimizes the Frequentist risk (this is called Walds Essentially Complete Class Theorem). This is the case in part because no one knows the "true" parameter of the random number generator so we can't actually calculate the d_i that minimize the Frequentist risk under the "true" parameter value (and people like me deny that the Frequentist model of sampling from an RNG even applies in many/most cases).

So, whether you're a Frequentist who believes in God's dice, or a Bayesian who assigns numerical plausibility numbers based on some state of knowledge, the right way to make decisions that include all the information you have is to do the math that is equivalent to being a Bayesian who assigns numerical plausibilities.


Everyone loves a good NHST joke

2016 July 26
by Daniel Lakeland


Dirichlet Processes as Probability over Histograms

2016 July 22
by Daniel Lakeland

Suppose you are sampling from a large population, trying to find out about how that population is distributed. Suppose it's a large but finite population at a fixed time, like say 1 Million balances in checking accounts. You can't take a full sample, so you need to take a much smaller sample, say 100 balances. Yet, you want to be able to use your information to come up with plausible models for the frequency distribution of the balance so that you could for example run some simulations of what might happen if you offered people a chance to buy something at various prices or something like that (some calculation that requires simulating from the full frequency distribution, not just knowing its mean and sd or the like).

Most of us have worked with histograms in the past. In the simple version, you take your sample, break up the range into bins of equal size, and count up how many samples fall in each bin. The value n_i/N for each bin i gives an estimate of the value

\int_{x_i}^{x_{i+1}} F(s)ds

where F is thought of as the "real" frequency density, and when the population is large this is taken to be basically a continuous function (hopefully the nonstandard analysis analogy is already coming into your head).

Well, each of the bins is basically a "categorical" variable, and the n_i are counts in each category. The simplest categorical variable is the binomial one, where there are two possible outcomes {1,0} or "true vs false" or "success vs failure" or "heads vs tails". A more complex categorical variable analogy is drawing balls from an urn. Suppose there is a bag full of balls, and there are 5 different colors. If you draw a sample of 100 balls, you can provide a vector of the counts of red,orange,yellow,green,blue for example. Similarly by breaking up the range of a variable into different bins you can take your sample and turn it into a count of the number of times a data point fell in each bin.

Now, we're familiar with the idea that if you take a large sample, the histogram should be close to the real frequency distribution F(x) whereas with a small sample, the histogram might be very different from F(x).

Many people are also familiar with the Beta-Binomial conjugate family. If you check 100 out of a large number of items with only two colors, and 40 are red and 60 are blue, then with a prior of beta(1,1) your posterior over the frequency of reds is beta(41,61) for example. A similar distribution is conjugate to the multinomial/categorical distribution with k categories. This distribution is called the "Dirichlet" distribution, and it's a distribution over the vector of relative frequencies with which each category appears in random sampling from a giant urn with k colors of balls, or the process of creating histograms with k bins...

In R, if you load the "gtools" package, you can draw samples from a Dirichlet distribution:

\vec f = \mathrm{rdirichlet}(1,\vec N)

Is a function that returns a vector f which is in essence a frequency histogram from a conjugate model under the assumption that you've observed a vector of counts \vec N where the N_i values are actually the sum of the counts observed plus the "effective prior counts".

In the absence of strong prior information and the presence of a large sample we can think of the \vec N as just observed counts in a histogram process, and the f_i as plausible values for the true frequency with which values from a very large sample would fall into histogram bin i for all the possible i.

Unsurprisingly, the larger N_{tot}=\sum N_i is, the closer different draws of f_i will be to a single consistent value N_i/N_{tot}. In other words, all the f vectors will look similar. Whereas, when the N_{tot} is small, the histograms will vary a lot and tend to look the histograms of only a small number of observations... noisy. Note, there's no requirement to make each of the values in the \vec N be integers even though we're understanding them in terms of "effective counts".

If you want to simulate values that are plausible for the full population using only your prior guess at the frequency histogram and the counts of observed data points, one way to do it is to use the rdirichlet function to generate an f, then generate data consistent with that histogram. Each time you do this you'll generate a plausible full population which you can then use to calculate whatever complex thing you want as if you had the whole population dataset. Doing this over and over shows you how much your uncertainty in the whole population affects whatever your calculated statistic is.

There's nothing to say that the bins have to be equally spaced, or even have finite ends, though you'll have to have a model for the tail of the distribution if you include an infinite bin. Also, there's no reason you can't use an informative prior over the histograms, and the more informative you want the prior to be, the bigger the "count" of the fake data. If f_i is a vector of prior guesses for the histogram bin frequencies, then N f_i is in essence a prior over counts "worth" N data points. If you have an additional K data points that have observed relative frequencies F_i then Nf_i + KF_i is the parameter vector you should give to rdirichlet to get a sample from the posterior over histograms.

Once you have a vector of histogram frequencies f you can generate a plausible population of N values by selecting a histogram bin at random and generating a value uniformly within that bin. But probably the real F is not a piecewise constant function, so you can maybe do better by a little smoothing. Adding a small amount of normal noise to each simulated value will produce a draw from a smoothed histogram F_s a piecewise constant function convolved with a normal curve. If you simply use the midpoint of each interval and then smooth with the appropriate normally distributed perturbation you have a "nonparametric" gaussian mixture model. It's nonparametric in the sense that it is actually a family of models, as you can always increase the number of bins to fit the size of your data.

Basically, this scheme allows you to put Bayesian probabilities over the unknown frequency histogram of a large but finite population from which you have taken a random sample. If you have models for bias in your sampling, you can include that model through appropriate perturbations to the parameter vector of the Dirichlet.


Floating point arithmetic and probability

2016 July 18
by Daniel Lakeland

Kit Joyce asks about how floating point arithmetic interacts with random number generation

I thought about this for a while, and I think this is a good place to show the utility of concepts from nonstandard analysis and Internal Set Theory in particular. Here's how my reasoning goes.

First off, rather than using measure theory, we'll imagine the process of choosing a random number uniformly in the interval [0,1] as choosing one of a finite but nonstandard number of rational numbers, and then taking the standard part. So, we'll divide up the interval [0,1] into a bunch of possible outcomes r_i as follows:

r_i = 0 + i dx

where dx = 1/N for some nonstandard N, and so we are essentially choosing a random integer between 0 and N and then mapping that process into the interval [0,1]. To choose a standard number you take the standard part of r_i.

Now, what's the probability of getting a value between a,b where a < b and both are standard numbers in [0,1]? It's N(a,b)/N where N(a,b) is the number of possible outcomes that result in a standard number between a and b. Since there is one different outcome every interval dx then N(a,b) \ge (b-a)/dx. If you select a nonstandard number outside the interval [a,b] that is infinitesimally close to a or infinitesimally close to b you will also "round off" via the standard part to a or to b, so hence the \ge symbol. Still N(a,b)/N = (b-a)/dx/N + K/N \cong b-a because the number of outcomes in the halo of a or b outside the interval [a,b] is what I'm calling K and it is infinitesimal relative to N, so K/N \cong 0.

Now, I think this maps pretty directly to the idea of generating a random number say a 64 bit integer which maps then into rational numbers between 0,1 that are dx=1/2^{64} apart, and then the "standard part" operation is modeled by the "roundoff to nearest floating point" operation. And, so long as N(a,b)/K is big, the roundoff error doesn't introduce any problems, and that's true whether or not the floating point numbers between a,b are evenly distributed or unevenly distributed.

So, when do we have a problem? Mainly when we're trying to figure out the probability of landing in some region that is small relative to machine epsilon of the floating point numbers. For example, suppose you're trying to model the extreme tail of the normal distribution, by generating floats in the range 0.99999999999 to 1.0 and then transforming them by the inverse cumulative distribution to their normal quantiles... you are probably doing something wrong! If you want to generate normal variates in the extreme tail, better to do something like rejection sampling where you're generating the deviates directly in the outcome space and don't have big gaps caused by the roundoff errors in the process of generating numbers between 0.99999999999 and 1.0


On U shapes and NHST...

2016 June 28
by Daniel Lakeland

Gelman posts about a paper in which social scientists run a quadratic regression and discover that it produces a U shape!!! Amazing. The story is that too much talent on your sports team will lead to poorer performance. Further evidence that understanding function approximation and basis expansion is seriously flawed among many researchers.

Uri and Leif weren't able to get the data, but were able to get the original researchers to run a test of a hypothesis. The idea is they take their quadratic regression, find the maximum or minimum point, and split the data at that point, running two linear regressions, and testing whether the sign of the slope is different. This is a bit better than assuming the quadratic means there really is a U shape, but basically still wrong. The problem is NHST of a slope against zero isn't the right way to figure out whether you have a U shape. In fact NHST isn't the right way to do much of anything (except test the output of PRNG algorithms).

I hear you asking: Ok Dan, well then, how should we do this right?

First off, Uri and Leif have some data on competition vs patents and claim that this is a good testbed for their method. I disagree because I don't think number of patents is a good measure of "creativity" and there are serious problems with that analysis even before you have the data. The sports data on the other hand is pretty good conceptually, but the authors didn't provide the data so I can't just show you the kinds of stuff you ought to do...

So I'll have to resort to just telling you. The first thing is that you need a system which is sufficiently flexible to fit whatever nonlinear shape your data "really" has. With the sports data the x axis is logically limited to the range [0,1] (0 to 100% of the team is high-talent). This is a bounded interval, and polynomials are a complete basis for continuous functions on a bounded interval. That means, with sufficiently high order polynomials we can fit any nonlinear shape. However, with noisy measurements, we also have a problem of potentially over-fitting. Furthermore, we'll do better for mathematical reasons if we use orthogonal polynomials.

But, a complete orthogonal basis isn't necessarily the best way to go, especially in a Bayesian analysis. For example, we have the idea from sports fans who haven't seen the data that we should get a diminishing returns, horizontal asymptote type result. So, perhaps we should include such a family in our analysis. Unfortunately, I couldn't get Chebyshev polynomials into Stan because the Stan authors didn't really understand them even though someone DID do the work to include them. So, let's set up a model for y vs x using regular old non-orthogonal polynomial regression.

 y = a_0 + \frac{h}{(1+\exp(-r (x-x_0)))} + \sum_{i=1}^5 a_ix^i

We're going to expect some correlations between the various coefficients, because we don't have an orthogonal system, but Stan is pretty good at working that out, in the end all we care about are the function shapes, not the coefficients anyway. The second term represents a sigmoid function which has horizontal asymptotes at 0 and h and which has a transition between them in the region of x_0. The polynomial represents a general purpose correction to this basic shape. It would be even better to have Chebyshev polynomials and to give strong priors to the coefficients a_i for reasons I won't go into (but they do involve a less-correlated posterior distribution for the coefficients, easier to sample from). With the sigmoid function doing a reasonable job of representing our core expectation, we can at least put priors on the a_i that are zero-mean, as we wouldn't be too surprised to find that large polynomial corrections aren't necessary.

Ok, so now, pry the data out of the researchers fingers and fit the model in Stan. Then, take a sample of 100 of the coefficient vectors and plot the curves in a spaghetti plot. Now, calculate the derivative of this function (if you haven't done calculus in a while that's ok, use Maxima or ask Wolfram Alpha or something similar). In a separate plot, spaghetti plot the derivative. It'll be noisier, as the derivative is less constrained by the data, but we've regularized our result by truncating at 5th order polynomials and we can regularize further by putting strong priors on the a_i. Spaghetti plotting the derivative, does it almost always have a positive value on the left and then dive down to a negative value on the right?

If you have those plots of the curve and its derivative, and they really do produce a fairly consistent change in the sign of the derivative, THEN you have reasonable evidence for a transition to negative slope. The posterior probability of a curve shape can be directly assessed in these plots. There's no need to "test" your hypothesis using an NHST test that is appropriate for assessing whether the coefficients are the output of an RNG. But, if you do find that out of the 100 spaghetti derivative plots all but 1 to 5 of them have entirely negative values at the high end, then, yes you have relatively strong evidence for your decreasing performance at the high end.

The fact is, the posterior distribution is what summarizes our uncertainty. A binary decision about rejecting a linear regression on the upper end is NOT the same thing for the same reasons that Uri and Leif give about why the quadratic isn't good as an indicator of a U shape.

What should the analysis of acupuncture for allergic rhinitis have looked like

2016 June 14
by Daniel Lakeland

Ok, so I'm down on the NHST analysis of allergic rhinitis. And, I probably am not going to get ahold of the data to re-do the analysis, but what kinds of things SHOULD be done if the data were actually available?

First off, let's acknowledge that the goal of all of this is to make people feel and function better. There is therefore a latent "how good do I feel" variable which we can only measure imperfectly which is our real subject of interest. There are two ways they tried to measure this variable. The first is the RQLQ (symptom) score which says how bad the person perceives their symptoms, and the second is the RMS (medicine usage) score which tells how much the person used medication to combat the symptoms. However, the RMS score also is a measure of a kind of forcing function to reduce symptoms. The medications available were Cetirizine and oral steroids, and a diary was kept by each patient so we actually know how much medicine of what types was used on every day of the study..

So, let's imagine a function S(t) for "symptoms". We have some causal process we think is going on in which S increases through time when exposure to pollen occurs but eventually saturates (fortunately pollen rarely actually makes your whole body combust). We have some treatments thought to reduce the symptoms, including medications which are taken according to the patients own choice, based on a tolerance for symptoms, and in addition, sham acupuncture, or real acupuncture. Real acupuncture is given either in period t \in [0,8] or t \in [8,16] weeks and symptoms assessed at t=0, t=8, t=16, t=60. Furthermore, we have individual patient beliefs about whether sham or real acupuncture was being applied, which were assessed after their 3rd session with needles.

The model should therefore be that S(t) is observed with errors using the RQLQ assessment, and furthermore that the choice to take various medications at time t is based on a combination of S(t) and individual tolerance for the symptoms (so that M(t) gives us information about the patient's S(t)). If M(t) is the medication dosage function (dose per day) then M(t) = M(S(t), Tol) and also dS/dt =F(M(t), Ac(t), Sa(t), B, P(t)) for each patient separately, where M is medicine dose, Ac is acupuncture dose, Sa is sham acupuncture dose B is belief in whether they are getting sham or real acupuncture, and P(t) is pollen exposure which we might imagine as a constant for each clinical center.

Put all of those ideas together with some simplifying mathematical assumptions, analyze with a Bayesian computational software program like Stan, and discover how strongly Ac, Sa, and M affect symptoms. Compare the effect Ac/M and Sa/M averaged across people to get an answer to the question "what is the average effect of Acupuncture measured relative to the average effect of taking Cetirizine on patient Symptoms, and how does it compare to sham acupuncture?"