Thinking about gamma priors

2016 September 21
by Daniel Lakeland

Suppose you've got a parameter that's a positive number whose order of magnitude you know. Take my height, for example: you don't know what it is, but if I asked you whether it was negative you'd say with 100% logical certainty that it wasn't, if I asked whether it was 3 inches you'd be damn sure it wasn't, if I asked whether it was 5 ft you might think that's reasonable, and 20 feet is certainly unreasonable... A typical average height for an adult male is something like 5 ft 10 in (178 cm). So it would be safe to say something like

h ~ exponential(1.0/178)

for an adult male height in cm. But the peak density of this distribution is at h=0 and it extends well out to 3 or 4 times the average height. That seems problematic.

Here's where the gamma distribution comes in. The average of a gamma(k,1/x) random variable is kx. In fact, if a and b are both exponential(1/x) then their sum is gamma(2,1/x), and the sum of n exponential(1/x) variables is gamma(n,1/x) for integer n. But n doesn't need to be an integer: the shape parameter is continuous and "acts like" a sample size. That is, as n increases while the average is held constant, the gamma(n,n/x) distribution becomes more and more delta-function-like around the average value x.

If all you know is that a variable is positive and has a typical value x, you can use exponential(1/x) as the maximum entropy distribution for that average. If you also know the distribution is more concentrated near the average (and away from both zero and infinity), you can use gamma(n,n/x) as a continuously "less than maximum entropy" distribution, where n acts like an "effective sample size" and so helps you decide how concentrated to make it. So if you think you have about as much information about my height as if you'd measured, say, 12 randomly selected adult males from the US, you can use gamma(12,12.0/178) as a prior for my height.

What does that distribution look like? Its 95% interval is about 92 to 292 cm (3 to 9.6 ft), which is probably pretty reasonable for a height considering that, for all you know, I could be a dwarf or a basketball player...
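As a quick check (a minimal sketch in R; the 178 cm scale and n = 12 come from the discussion above, the other n values are just for comparison), you can compute these quantiles directly and watch the interval tighten as the effective sample size n grows:

## 95% intervals for gamma(n, n/178) priors: larger n concentrates around 178 cm
for (n in c(1, 12, 100)) {
  q <- qgamma(c(0.025, 0.975), shape = n, rate = n / 178)
  cat("n =", n, ": 95% interval", round(q[1]), "to", round(q[2]), "cm\n")
}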

The gamma distribution is in fact a maximum entropy distribution. Specifically, based on the explanation on wikipedia:

The gamma distribution is the maximum entropy probability distribution for a random variable X for which E[X] is fixed and greater than zero, and E[ln(X)] is fixed. [edited for clarity]

That is, you know what the average is, and you know what the average logarithm is (think of the average logarithm as a limit on the order of magnitude).
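A quick sketch of why (just the standard Lagrange multiplier calculation): maximize the entropy -\int_0^\infty p(x) \ln p(x) dx subject to \int p(x) dx = 1, \int x p(x) dx = m and \int \ln(x) p(x) dx = l. Stationarity gives \ln p(x) = \lambda_0 + \lambda_1 x + \lambda_2 \ln(x), so p(x) \propto x^{\lambda_2} e^{\lambda_1 x}, which is exactly the gamma family with shape \lambda_2 + 1 and rate -\lambda_1 (the constraints force \lambda_1 < 0 so the density is normalizable).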

 

The WiFi experience a week out

2016 September 12
by Daniel Lakeland

So, I've had my 3 station home WiFi network up and running for about a week, and the question is, does it make any real difference? Specifically does it make a difference with respect to 2.4 GHz which is the more congested band.

First off, you should understand the situation at my house. There are 4 Amazon Fire tablets, a Fire TV stick, two cell phones, and my wife's Mac laptop; if friends or family are visiting, add at least a tablet and a cell phone; and our home phone system runs two incoming lines via VOIP. Worst case, with my mother visiting the kids, we could have 4 Netflix streams and two phone calls all going on simultaneously. The cable internet connection is 60 Mbps downstream and not quite 4 Mbps upstream, verified via several online speed tests from my wired desktop machine.

The first thing to say is that in fact the three stations are all being used. My kids sitting in the front room playing Minecraft on a tablet computer or watching a video will wind up on the station in the front room, whereas if I'm in my office or bedroom I'll wind up on a different station. This does in fact seem to help when talking on my phone over a VOIP connection. People haven't been complaining about stuttering and that makes sense because there won't be contention for the wireless connection, and if the packets can get to my router in time, my QoS script at the router will prioritize the VOIP streams.

Another thing I was concerned about was what happens when I walk around the house while on the phone? Hand-offs between the different stations, if they're occurring, do not seem to be causing any problems.

How about coverage? Inside our house was always pretty good, but a phone conversation in the back yard was out of the question. The new setup seems to let me talk from anywhere on my property, and because the xmit strength is turned down and I chose my channels to avoid interference, by the time I get across the street or into the neighbors' yards I'm not bothering anyone.

My wife needed to pull some 10 GB of video data off our server (videos she took for her lab work) and throughput to her Mac laptop was sufficiently high that she didn't notice the file copy (i.e. by the time she'd checked Facebook or Reddit or whatever, it was done).

So, overall, it certainly didn't hurt: it has extended coverage and reduced latency/contention among the up to maybe 10 devices in our house being used simultaneously, without interfering with our neighbors.

 

Multi-Station Home WiFi with WDS 5GHz backbone

2016 September 7
by Daniel Lakeland

You might be a computer nerd if you run OpenWRT on your home wifi router. If you do, you might want to take advantage of the following setup.

My main router used to be the only access point I had, it ran OpenWRT on a Buffalo WZR-600DHP which is a dual band 802.11n router. Pretty good kit for not much money. It has QoS settings, IPv6, special firewall settings, an OpenVPN set-up and lots of good stuff going.

Well, I run CSipSimple on my cell phone and take calls via VOIP over WiFi, and the coverage in my house (old-school concrete-based plaster walls) was just OK. People would complain if my kids streamed videos while I was on the phone, etc. Once the packets hit the router it's possible to QoS them upstream to the ISP, but it's harder to share just one radio channel.

A while back I grabbed two TP-Link WDR3600 routers which are somewhat less fancy but dual band and work well with OpenWRT. My plan was to pull Cat5e under my house to the front room, and the back room and wire them in so I could use channels 1, 6, and 11 to get good coverage. Of course, I'd be nice to neighbors, so I'd turn DOWN the power output to just the required level.

Well, running the wires never really happened, and I was thinking to myself "Gee I have these two routers, can I use them without the wires?"

The answer, of course, is YES: specifically, you can use the 5GHz link in WDS mode to distribute the local network to each of the routers, and this still lets you have a 2.4GHz access point in multiple locations. Since 5GHz can effectively be set up with 40MHz channel width and generally has less interference, it's actually perfect for distributing wireless WITHIN your house, while 2.4GHz is great for getting coverage on your back deck, front yard, various bedrooms, and throughout your house for devices that only have 2.4GHz (which is still a lot of devices).

Here's how the system works:

  1. On the "central" router, set up the 5GHz link in Access Point (WDS) mode (you might make this an "additional" network, with its own "backbone" ESSID if you like, but then if you do, the clients need to use that ESSID, and you need to bridge it on the central router)
  2. On each satellite router delete the WAN network, and set up the 5GHz wifi in Client (WDS) mode with same ESSID as in (1).
  3. Add an additional wifi network to the 5GHz link in "Access Point" mode (if you use just one ESSID set that, otherwise set up the "normal" client ESSID)
  4. Set up the satellite 2.4GHz network in Access Point mode (using the "normal" client ESSID)
  5. Make sure all the WiFi networks on the satellite device have the same WPA2 settings and pre-shared-key.
  6. Set up the satellite station's LAN network to bridge the 5GHz Client, 5GHz AP, 2.4GHz AP, and wired ethernet into one LAN.
  7. Set up the satellite station LAN network to be in DHCP client mode (you'll have to do an "are you sure" type acknowledgement of the change).
  8. On the central router add static DHCP assignments for your satellite devices so you will be able to access them for maintenance.
  9. Reboot the satellite devices, they should come up with a 5GHz client link to the central router which gives them LAN connectivity, and both 5GHz and 2.4GHz access points for local stations to use.
  10. Lay out your channel usage on 2.4GHz so you're interfering as little as possible with neighbors, and turn down the xmit power on the stations. Since you have more stations, each one should cover its own smaller area at low power; they don't all need to SHOUT.

 

On Incorporating Assertions in Bayesian Models

2016 August 23
by Daniel Lakeland

I've been thinking a lot recently about how to specify a prior over functions. One way is to specify a Gaussian Process, but this can be quite computationally heavy, and when the function in question is fairly smooth and simple, it's also kind of overkill.

[Figure: "funcs" — sample curves drawn from the prior described below]

The typical method for specifying a function is via a basis expansion. For example, in the R function lm you can specify a third order polynomial model for y as y \sim x + I(x^2) + I(x^3) (in an R formula a bare x^2 means interaction crossing rather than squaring, so you need I() or poly(x,3)). This means y = a + bx + cx^2 + dx^3 + \epsilon and uses linear least squares to fit.
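A minimal illustration in R (the data here is made up just for the example):

set.seed(1)
x <- runif(50)
y <- 2 + 3*x - 5*x^2 + 4*x^3 + rnorm(50, 0, 0.1)  # made-up cubic plus noise
fit <- lm(y ~ x + I(x^2) + I(x^3))                # least-squares estimates of a,b,c,d
coef(fit)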

Well, when fitting this sort of thing in a full Bayesian model, you should provide priors over a,b,c,d. Typically we can specify some very simple thing about each, such as maybe

a ~ normal(0,100) if you know that the size of the coefficients is typically a small multiple of 100.

But usually we have some information not about the individual coefficients themselves, but about some property of the function. For example, we might know that the function is "near" to some constant value, doesn't change too rapidly, and doesn't wiggle too rapidly:

\int_0^1 f(x) dx \sim \mu

\sqrt{\int_0^1 (\frac{d}{dx} f(x))^2 dx} \sim S

\sqrt{\int_0^1 (\frac{d^2}{dx^2}f(x))^2 dx}\sim Q

From the perspective of MCMC, we can calculate these quantities either using symbolic methods or by numerical integration, and then place probabilities on the outcomes. Note that these quantities are in general nonlinear functionals of the function f(x). The parameter space will in general be high dimensional (say, 15 coefficients in a polynomial fit) and the number of conditions for which we have prior information is less than 15, so there is no one-to-one mapping from coefficients to \mu,S,Q. As such, the prior you're going to place here is in part determined by the "multiplicity" of the outcome: some \mu values can occur in more ways than other \mu values... This is just a fact of life about the mismatch between our state of information and the dimensionality of the parameters.

As an example, here's a complete R script using rstan and ggplot2 to put a distribution on a 3rd order polynomial on the region x \in [0,1]. To calculate the mean value, RMS slope, and RMS curvature I use a super-simple 3 point midpoint integration rule. I specify that the mean value is normally distributed around 100, the RMS slope is of typical size 10, and the RMS second derivative is on order 1000. Run the script and you'll get 20 random curves plotted, you'll see that they are "flattish" with some mild wiggles on the scale of a function whose values are around 100.

Add some data generated by your favorite function with some random measurement error, and add a normal likelihood, and see what happens. Add some additional parameters aa,bb,cc,dd which have uniform(-1e6,1e6) priors and the same normal likelihood (essentially the least-squares solution). Compare the posterior fits especially with only a few data points when you provide smoothness information compared to the unconstrained least squares solution.

Does the very low quality numerical integration routine matter? What happens if you add some additional higher order terms? If you have N data points and N coefficients, you can fit the curve directly through the points. That will inevitably be the least squares solution. Does that happen when you provide this kind of smoothness prior?

Note: it's problematic to do basis expansion in the 1,x,x^2,x^3... basis, usually you'd work with an orthogonal polynomial, and deal with numerical evaluation of the polynomial using special techniques for stability and to prevent cancellation problems etc. But this is just an example, and the technique should work well even for other types of basis expansions, such as radial basis functions which are often a good idea for strangely shaped domains (2D or 3D functions for example).

Above are pairs of scatterplots for samples of a,b,c,d, showing how the effective prior looks.

library(rstan)
library(ggplot2)


stanmodl <- "
functions{

real f(real x,real a, real b, real c, real d){
return a + b*x + c*x^2 + d*x^3;
}

real fp(real x, real a, real b, real c, real d){
return(b + 2*c*x+3*d*x^2);
}

real fp2(real x, real a, real b, real c, real d){
return 2*c+6*d*x;
}

real favg(real a, real b, real c, real d){
return((f(1.0/6,a,b,c,d) + f(0.5,a,b,c,d) + f(1-1.0/6,a,b,c,d))/3);
}

real slfun(real a, real b, real c, real d){
return(sqrt((fp(1.0/6,a,b,c,d)^2 + fp(0.5,a,b,c,d)^2 + fp(1-1.0/6,a,b,c,d)^2)/3));
}

real qfun(real a, real b, real c, real d){
return sqrt((fp2(1.0/6,a,b,c,d)^2 + fp2(.5,a,b,c,d)^2 + fp2(1-1.0/6,a,b,c,d)^2)/3.0);
}

}
parameters{

real a;
real b;
real c;
real d;

}
transformed parameters{

real fbar;
real q;
real slavg;


 fbar = favg(a,b,c,d); 
 q = qfun(a,b,c,d);
 slavg = slfun(a,b,c,d);
}
model{
fbar ~ normal(100.0,10);
slavg ~ exponential(1/10.0);
q ~ exponential(1/1000.0);

}
"

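# Note: there's no data block, so stan() below just samples (a,b,c,d) from the
# prior implied by the soft constraints on fbar, slavg and q. Stan may warn about
# sampling statements on transformed parameters (no Jacobian adjustment); that is
# expected here.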
samps <- stan(model_code=stanmodl);
vals <- extract(samps);


plt <- ggplot(data=data.frame(x=0),aes(x=x))+xlim(0,1)
funs <- list();
for(i in sample(1:2000,20)){
    a <- vals$a[i];
    b <- vals$b[i];
    c <- vals$c[i];
    d <- vals$d[i];
    print(a);
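    # The immediately-invoked anonymous function below forces a, b, c, d so that
    # each stored closure keeps this iteration's coefficients; with R's lazy
    # evaluation, an unforced closure would see only the final loop values.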
    funs <- c(funs,    (function(a,b,c,d){a <- a; b <- b; c <- c; d <- d;  function(s){return(a+b*s+c*s^2+d*s^3)}})(a,b,c,d))
}

print(funs);
for(i in 1:length(funs)){
    plt <- plt+stat_function(fun=funs[[i]])
}

print(plt);
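
As a sketch of the exercise suggested above, here is one way to add made-up data and a normal likelihood to the same prior. The data block, the particular fake data, and treating sigma as known are my illustrative choices, not part of the original script; the flat-prior comparison is indicated in a comment.

stanmodl2 <- "
functions{
real f(real x, real a, real b, real c, real d){
return a + b*x + c*x^2 + d*x^3;
}
real fp(real x, real a, real b, real c, real d){
return b + 2*c*x + 3*d*x^2;
}
real fp2(real x, real a, real b, real c, real d){
return 2*c + 6*d*x;
}
real favg(real a, real b, real c, real d){
return (f(1.0/6,a,b,c,d) + f(0.5,a,b,c,d) + f(1-1.0/6,a,b,c,d))/3;
}
real slfun(real a, real b, real c, real d){
return sqrt((fp(1.0/6,a,b,c,d)^2 + fp(0.5,a,b,c,d)^2 + fp(1-1.0/6,a,b,c,d)^2)/3);
}
real qfun(real a, real b, real c, real d){
return sqrt((fp2(1.0/6,a,b,c,d)^2 + fp2(0.5,a,b,c,d)^2 + fp2(1-1.0/6,a,b,c,d)^2)/3);
}
}
data{
int<lower=1> N;
vector[N] xobs;
vector[N] y;
real<lower=0> sigma;   // measurement error treated as known, for simplicity
}
parameters{
real a;
real b;
real c;
real d;
}
transformed parameters{
real fbar;
real q;
real slavg;
fbar = favg(a,b,c,d);
q = qfun(a,b,c,d);
slavg = slfun(a,b,c,d);
}
model{
// same smoothness prior as before
fbar ~ normal(100.0,10);
slavg ~ exponential(1/10.0);
q ~ exponential(1/1000.0);
// likelihood: observations scatter around the polynomial
for(i in 1:N)
  y[i] ~ normal(f(xobs[i],a,b,c,d), sigma);
// for the flat-prior (least-squares-like) comparison, drop the three prior
// statements above and declare a,b,c,d with <lower=-1e6,upper=1e6> bounds
}
"

## made-up data from a smooth function with values near 100
N <- 8
xobs <- seq(0, 1, length.out = N)
yobs <- 100 + 3*xobs - 2*xobs^2 + rnorm(N, 0, 0.5)
samps2 <- stan(model_code = stanmodl2,
               data = list(N = N, xobs = xobs, y = yobs, sigma = 0.5))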

 

Feynman on Physical Law (apropos of Cox's Theorem)

2016 August 21
by Daniel Lakeland

I know, I know, everyone either loves Feynman or finds him pompous... but I do think this clip has something useful to say about what makes science different from all the other ways we might try to make sense of the world. And, in doing so, it helps me make the point about what part Cox's theorem and Bayesian probability theory play in science.

Guess: this is more or less his starting point. From a formal perspective it means we somehow specify a formal mathematical model that predicts the outcomes of experiments. We might also specify several competing models. We need to write down these models in a formal system; in modern terms, we need to program these models unambiguously into a computer. The basis of our formal systems is Set Theory.

Compute the consequences: more or less, this is straightforward if you are given the numerical values of certain quantities; the heavy lifting here is done by the theory of functions, computation, calculus, differential equations, numerical analysis, etc. But we need those numerical values of the quantities. The Bayesian solution is to specify a state of knowledge about what those quantities might be, a prior distribution, where some values have higher probability than others because we think those values are more reasonable (not more common under repetition, just more reasonable in this case).

Compare predictions to experiment: to a Bayesian this means seeing which values of those quantities are needed to make the outcomes of the experiment "close" to the predictions, and not only that, but also seeing how high in the probability distribution around the prediction they land (how precisely the best version of the model predicts reality). When there are several models, the ones that put the outcomes in a high probability region are going to get higher post-data probability; that is, we automatically put probabilities over our several models based on how well they predict.

What parts are special to science? Certainly guessing is not unique to science, the history of Greek, Norse, Native American, and other mythologies shows that we like to guess a lot.

Computing the Consequences isn't unique to science either, Acupuncture / Chinese Medicine is largely guessing explanations in terms of Qi and hot vs cold foods and meridian lines and so forth... and then computing which spices or herbs or special tinctures or needle locations are recommended by the model.

Compare Predictions to Experiment: this is really the essence of what makes science special, and in order to do this step we need a meaning for "compare". The Bayesian solution is to force the model builder to use some information to specify how well the model should predict. In other words, what's the "best case"? Specifically, the quantity p(Data | Params) should be taken to be a specification of how probable it would be to observe the Data as the outcomes of experiments, if you knew *precisely* the correct quantities to plug into Params. The fact that we don't put delta-functions over the Data values reflects our knowledge that we *don't* expect our models to predict the output of our instruments *exactly*.

So, what do Cox's axioms teach us? It's really just that if you want to use a real number to describe a degree of plausibility about anything you don't know, and you want it to agree with the boolean logic of YES vs NO in the limit, you should do your specifications of degrees of plausibility using probability theory.

Cox doesn't tell you anything much about how to Guess, or how to Compute The Consequences (except that you should also compute the probability consequences in a certain way), but it does have a lot to say about how you should Compare Predictions to Experiment.

 

Some Comments on Cox's Theorem

2016 August 16
by Daniel Lakeland

Apparently Cox's original paper has some subtle issues, some assumptions that were not explicitly mentioned or even realized by Cox. In the intervening years, some of these technicalities have been patched up. Kevin S Van Horn has a great "guide" to Cox's theorem which I like a lot. He has some other interesting related stuff as well.

In his presentation of Cox's theorem, he relies on 5 requirements

  • R1: plausibility is a real number
  • R2-(1-4): plausibility agrees in the limiting case with propositional calculus (where everything is either known to be true, or known to be false).
  • R3: plausibility of a proposition and of its negation are functionally related.
  • R4: Universality: In essence plausibility values take on a dense subset of an interval of real numbers, and we can construct propositions that have certain plausibilities associated with them.
  • R5: Conjunction: The conjunction function is a continuous strictly increasing function F such that p(A and B | X) = F(p(A|B,X),p(B|X)).

Kevin motivates each of these and gives some objections and discussion about why each one is required. For me, the point of the exercise is to generalize propositional calculus to real-number plausibility, so R1,R2,R3 are obvious to me, though I realize some people have objected (for example, there are 2-dimensional alternatives, where a proposition and its negation have different plausibilities unrelated by a functional equation)

I'm just not interested in such things, so when I look at Kevin's requirements, Universality and Conjunction are the ones I think you could attack. To me, the dense set of values, and that there must be more than one value (the endpoints of the interval are not the same) seem obvious.

The full version of R4 is:

R4: There exists a nonempty set of real numbers P_0 with the following two properties:

  • P_0 is a dense subset of (F,T).
  • For every y_1,y_2,y_3 \in P_0 there exists some consistent X with a basis of at least three atomic propositions A_1,A_2,A_3 such that (A_1|X) = y_1,(A_2|A_1,X)=y_2, (A_3|A_2,A_1,X) = y_3

This is perhaps the least understandable part. It says basically that for any three plausibility values, there exists some consistent state of knowledge that assigns those plausibility values to certain propositions. A consistent state of knowledge is one where we can't prove a contradiction (ie. A and Not A).

Note that this isn't a claim about the actual world. We aren't restricted to "states of knowledge" that are, say, consistent with the laws of physics, or consistent with a historically accurate account of the Norman Conquest, etc. This is a mathematical statement about whether we can *assign* certain plausibilities to certain statements without causing a contradiction. So "A1 = Unicorns keep an extra horn at home for emergencies" and "A2 = The second sun of the planet Gorlab rises every Thursday" and so forth are acceptable statements whose truth value we could assign certain plausibilities to in order to meet these logical requirements. This is called "universality" because it expresses the desire to be able to assign plausibilities at will to at least 3 statements regardless of what those statements are about (whether that's unicorns, or the position of fictional suns, or the efficacy of a real drug for treatment of diabetes, or the albedo of a real planet circling a remote star).

Van Horn points out that without R4 we can construct a counterexample, so something possibly weaker might be allowable, but we can't omit it entirely. Van Horn also points out that Halpern requires that the set of pairs of propositions be infinite. But note, there's nothing that requires us to restrict Cox's theorem to any one area of study. So, for example, while you may be using probability as logic to make Bayesian decision theory decisions about a finite set of objects whose measurements come from finite sets, someone else is using it to make decisions about continuous parameters over infinite domains. This is Van Horn's argument, basically that we're looking for a single universal system that works for all propositions, not just some finite set of propositions about your favorite subject. He goes further to point out that what Halpern describes as (A|B) really should be (A|B,X) where X is a state of information about the statements. So long as you allow an infinite set of possible states of information, you can still apply Cox's theorem within a finite domain of finite objects.

Van Horn's article is worth a read, and he discusses many issues worth considering.

A summary of some discussions on data-dredging, hiding data, private models, and regulatory approval

2016 August 4
by Daniel Lakeland

We discussed a bunch of stuff at Andrew's blog on data dredging, hiding data, selecting your favorite analysis, etc. Here's a summary of my position:

In Bayesian analyses, the only thing that matters is the data and the model you actually have. That is, essentially p(Data | Parameters, Knowledge) p(Parameters | Knowledge).

The question arises as to how it affects things if you do some potentially questionable practices. So let's look at some of them:

  1. Cherry-picking data: you collect N data points, and then select n of them, submit these n as if they were all the data you have, together with a model... THIS IS BAD. The reason it's bad is that p(Data | Parameters, Knowledge) is now the wrong formula, because "Knowledge" doesn't include knowledge of the REAL data collection process. The rules by which you selected the n data points from the N data points are missing. Since the likelihood is a model of the whole data collection process, not a god-given fact about the "true frequency distribution in repeated sampling" (whatever that is) it strongly depends on the whole process by which the data eventually makes its way into your data file.
  2. Cherry-picking outcomes: You collect an NxM matrix of data, and select one column to analyze (say N people tested for M different outcomes). Provided you give me the full matrix, and you explain the logic behind the choice of analyzing your column, and I agree with it, the existence of this additional data need not concern me. However, if information about some of that other data could inform my model of the one outcome of interest, then I may legitimately require you to include the other columns in the analysis. For example, if you analyze 3 different kinds of immune cells, and a self-reported allergy outcome, and you want to analyze the self-reported allergy outcome without looking at the immune system outcomes... I may legitimately  conclude that I don't agree with the model. When the extra data doesn't alter what the regulator should think the likelihood (or priors) should be, it's irrelevant, but if it would... then it's not ok to hide it. Only the regulator can make that decision.
  3. Cherry-picking models for a single outcome: suppose you collect N data points with one outcome column, you analyze that one outcome column, and you've privately thought up 400 different ways to slice and dice this data and just reported your favorite model. Does this invalidate the inference? It depends. If the result of your slicing and dicing is that the likelihood you're using is extremely specific and would not be a likelihood that a person who hadn't already seen the data would choose... then YES, it invalidates the analysis, but that's because it invalidates the choice of the likelihood. If the likelihood winds up as something not too specific that others who hadn't seen the data would agree with based on the background info they have, then no, it's not invalidated, it's a fine analysis. Whether you looked at hundreds of alternative models or just this one model, a regulator needs to scrutinize the model to ensure that it conforms to a knowledge base that is shared by other people in the world (regulators, third parties, etc). The number of other hypotheses you thought about does not matter, just as it doesn't matter that you might have anxieties, or be wearing a lucky rabbit's foot, or have prayed to a deity, or have had any other private thoughts. Bayesian models are more or less mathematicized thoughts.
    1. Note, if you try to argue for a single final model, and regulators don't agree with it, it's always possible to set up a mixture model of several explanations with priors over the models, and do Bayesian model selection. Either a single model will dominate at the end, or several will have nontrivial probability. Either way, we can continue with the analysis using the posterior distribution over the mixture.

In the end, the thing to remember is that a Bayesian model has a CHOICE of likelihood. This isn't a hard shared fact about the external world, it's a fact about what you assume or know. If you send a regulator data and the method by which you actually collected, filtered and handed over that data differs from the method that your trial protocol describes, or the likelihood doesn't correspond to something people other than you can get behind, then there is no reason why the world needs to believe in your analysis, (and you may need to get some jail time).

However, if you have privately got all sorts of worries, fears, and have looked at hundreds of alternative models, provided the one model you submit is agreeable to independent parties in terms of the logic, and the data collection process is as you said it would be, and the likelihood expresses it correctly, and the priors are not unreasonable to third parties, then Bayesian logic suggests that everyone involved needs to agree with the posterior independent of how many other ways you might have sliced the data. That's ensured by the consistency requirements of Cox's axioms.

Note however that Frequentist testing-based inference doesn't seem to have the same property. In particular, the idea behind a test of a hypothesis is that you will filter out, say, 95% of false hypotheses. But how many hypotheses did you test, or could you have tested easily? The expected number of non-filtered false hypotheses you could easily get your hands on is \sum_{i=1}^N p_i where N is the number of false hypotheses you can easily get your hands on, and p_i is the p value each one gets. So it matters how many other hypotheses you might have tested... which is ridiculous, and shows how, by failing to meet Cox's axioms, Frequentist testing-based inference goes wrong.

 

Some summaries of Bayesian Decision Theory

2016 July 29
by Daniel Lakeland

The problem of how to manage fisheries in Norway (note additional details in comments I left there) is apparently getting a lot of attention, and a lot of burden is being placed on biologists, because the existing rules make it a big deal what the biologists report for their "percent of salmon that are hatchery escaped" numbers. Typically they do a survey of 60 fish or so and report how many were tagged/identified as hatchery escaped. If their numbers indicate less than 5%, it's all good for the farms; if greater than 5%, it's billions of dollars in costs imposed on the farms... Sigh...

So, the Biologists ask Andrew: what the heck? How do we take into account the fact that seeing 5 fish in 60 caught really doesn't tell us very precisely what the overall fraction in the river is?

Fortunately, there's a smart way to do this, and it's been known for a long time... unfortunately the people who know about it are pretty much unknown to the people who have the power to make the rules...

First, let's consider how uncertainty should affect our decisions. Suppose that there is some vector of numbers that describes the state of the world, and we don't know what it is. For example N,F, where N is the total number of salmon in the river and F is the fraction that are hatchery escapees. Now, we collect some data, n,f (lower-case), which is relevant for making an inference about N,F. So, based on some model M we can place a Bayesian posterior p(N,F | n,f,M). We also have some control variables, say d_i for the dollars we are going to spend on project i, and our model tells us what N_1,F_1 are likely to be next year, given what we know about N,F this year and how much we plan to spend on each program d_i.

So, we'd like to make a decision about what the d_i values should be that is in some sense "the best we can do given our knowledge".

First off, we need to know how good or bad it will be if N_1,F_1 take any of the possible values. So traditionally we think in terms of cost, and we need a cost function c(N_1,F_1,d_i) that incorporates "how bad it is that we spent all this money on the d_i values and we wound up getting N_1,F_1 to be whatever their actual values are". Let's suppose that through some kind of negotiations, including hopefully some kind of suggestions from multiple parties and some score-voting, we arrive at such a function.

How do we make the decision? First off, note that we want the decision to depend on information about every possible outcome. Is it possible we'll have 5000 fish (up from 3000) and 0% hatchery next year? Then that's good. Is it possible we'll have 40 fish and 10% hatchery next year? That's bad. Ignoring any of the possibilities in our decision has got to be sub-optimal. Since all the options have to enter into the decision, the decision will need to be made based on some kind of integral over all the possibilities. Furthermore, if we're given plausibilities p(N_1,F_1) and a particular value of N_1,F_1 becomes 2x more plausible after seeing some data, then it should intuitively "count" 2x as much in our decision as before. More generally, the decision should be made based on a functional that is a linear mapping from p(N_1,F_1) to the real numbers (so we have an ordering of better vs worse). The expectation operation is what we're looking for:

 E(\mathrm{Cost}|\mathrm{Model,Data}) = \int_{N_1,F_1} \mathrm{Cost}(N_1,F_1,d_i) p(N_1,F_1 | d_i,n_0,f_0,\mathrm{Model})dN_1dF_1

So, let's find the d_i values that make this expected value of cost as low as possible (within the feasible computer time and search algorithm abilities we have) and then implement the policy where we actually do all the stuff implied by the d_i values.

This is called the Bayesian Decision Rule, and in this case, since it's trying to control real world outcomes, another way to describe this is as Bayesian Optimal Control.
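As a toy sketch of the computation (the cost function, the predictive simulation, and the single spending variable d are all made up purely for illustration; this is not the actual fisheries model), you draw samples of next year's state for each candidate spending level, average the cost over those samples, and pick the spending level with the lowest expected cost:

## Toy Bayesian decision rule: choose spending d to minimize expected cost,
## where the expectation is a Monte Carlo average over draws of (N1, F1).

set.seed(123)
M <- 5000   # Monte Carlo draws per candidate decision

# hypothetical stand-in for draws from p(N1, F1 | d, data, Model)
sim_next_year <- function(d, M) {
  N1 <- rpois(M, lambda = 3000 + d / 1000)            # more spending, a few more fish
  F1 <- rbeta(M, shape1 = 2, shape2 = 20 + d / 1e4)   # more spending, lower hatchery fraction
  data.frame(N1 = N1, F1 = F1)
}

# hypothetical cost: dollars spent plus penalties for high hatchery fraction and few fish
cost <- function(N1, F1, d) d + 1e7 * pmax(0, F1 - 0.05) + 2000 * pmax(0, 3000 - N1)

expected_cost <- function(d) {
  s <- sim_next_year(d, M)
  mean(cost(s$N1, s$F1, d))   # E[Cost | d] by Monte Carlo
}

d_grid <- seq(0, 5e5, by = 2.5e4)    # candidate spending levels in dollars
ec <- sapply(d_grid, expected_cost)
d_grid[which.min(ec)]                # the decision: spending level with lowest expected cost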

Some features:

  1. If the Cost function is continuous with respect to the outcomes, and the outcome predictions are continuous functions of the data, then the d_i choices are continuous functions of the data too. That is, small changes in inputs produce small changes in outputs, there's no "gee we caught 1 more fish, send out the bill for $1 Billion Dollars"
  2. Every possibility is considered and its importance for the decision is dependent on both how bad that possibility is, and how plausible it is that it will occur. The results are linear in the plausibility values.
  3. The cost function can be anything finite, so no matter how you feel about various outcomes, you can incorporate that information in your decision. If multiple people are involved they can negotiate the use of a cost function that expresses some mixture of their opinions.
  4. As pointed out by Dale Lehman in the discussion at Andrew's blog, the cost function need not, and some would argue should not necessarily be the same as the kind of thing Economists call cost functions (that is somehow, accurate dollar prices based on actual willingness to pay). In particular, you're only going to pay dollars for the cost of carrying out the chosen policy d_i, so stuff that has very high "cost" associated to it, and is very unlikely under the model, can affect the choice of the decision variables in a way that causes you to actually pay a very moderate amount. Willingness to pay for the decisions is more important than willingness to pay the amount of money cost(N,F) for some hypothetical really bad N,F. If you choose a cost function and it causes you to make decisions that pretty much nobody agrees with, it means your cost function was "wrong" (though this isn't necessarily the case if you just piss off some fraction of the population, you can't please everyone all the time).
  5. The class of Bayesian Decision Solutions is an "essentially complete class", that is, every decision rule that minimizes the expected cost under a Frequentist sampling theory of N_1,F_1 given some "true" parameter x either has the same mathematical form as above, or there is a Bayesian rule that will produce as good or better decisions regardless of what the unknown parameter x is. In essence, you have to form a function q(x|Data) \propto p(Data|x)p(x) to get a rule that minimizes the Frequentist risk (this is called Wald's essentially complete class theorem). This is the case in part because no one knows the "true" parameter of the random number generator, so we can't actually calculate the d_i that minimize the Frequentist risk under the "true" parameter value (and people like me deny that the Frequentist model of sampling from an RNG even applies in many/most cases).

So, whether you're a Frequentist who believes in God's dice, or a Bayesian who assigns numerical plausibility numbers based on some state of knowledge, the right way to make decisions that include all the information you have is to do the math that is equivalent to being a Bayesian who assigns numerical plausibilities.

 

Everyone loves a good NHST joke

2016 July 26
by Daniel Lakeland

So...

Dirichlet Processes as Probability over Histograms

2016 July 22
by Daniel Lakeland

Suppose you are sampling from a large population, trying to find out about how that population is distributed. Suppose it's a large but finite population at a fixed time, like say 1 Million balances in checking accounts. You can't take a full sample, so you need to take a much smaller sample, say 100 balances. Yet, you want to be able to use your information to come up with plausible models for the frequency distribution of the balance so that you could for example run some simulations of what might happen if you offered people a chance to buy something at various prices or something like that (some calculation that requires simulating from the full frequency distribution, not just knowing its mean and sd or the like).

Most of us have worked with histograms in the past. In the simple version, you take your sample, break up the range into bins of equal size, and count up how many samples fall in each bin. The value n_i/N for each bin i gives an estimate of the value

\int_{x_i}^{x_{i+1}} F(s)ds

where F is thought of as the "real" frequency density, and when the population is large this is taken to be basically a continuous function (hopefully the nonstandard analysis analogy is already coming into your head).

Well, each of the bins is basically a "categorical" variable, and the n_i are counts in each category. The simplest categorical variable is the binomial one, where there are two possible outcomes {1,0} or "true vs false" or "success vs failure" or "heads vs tails". A more complex categorical variable analogy is drawing balls from an urn. Suppose there is a bag full of balls, and there are 5 different colors. If you draw a sample of 100 balls, you can provide a vector of the counts of red,orange,yellow,green,blue for example. Similarly by breaking up the range of a variable into different bins you can take your sample and turn it into a count of the number of times a data point fell in each bin.

Now, we're familiar with the idea that if you take a large sample, the histogram should be close to the real frequency distribution F(x) whereas with a small sample, the histogram might be very different from F(x).

Many people are also familiar with the Beta-Binomial conjugate family. If you check 100 out of a large number of items with only two colors, and 40 are red and 60 are blue, then with a prior of beta(1,1) your posterior over the frequency of reds is beta(41,61) for example. A similar distribution is conjugate to the multinomial/categorical distribution with k categories. This distribution is called the "Dirichlet" distribution, and it's a distribution over the vector of relative frequencies with which each category appears in random sampling from a giant urn with k colors of balls, or the process of creating histograms with k bins...

In R, if you load the "gtools" package, you can draw samples from a Dirichlet distribution:

\vec f = \mathrm{rdirichlet}(1,\vec N)

is a function call that returns a vector f which is in essence a frequency histogram drawn from the conjugate model, under the assumption that you've observed a vector of counts \vec N where the N_i values are the sum of the observed counts plus the "effective prior counts".

In the absence of strong prior information and the presence of a large sample we can think of the \vec N as just observed counts in a histogram process, and the f_i as plausible values for the true frequency with which values from a very large sample would fall into histogram bin i for all the possible i.

Unsurprisingly, the larger N_{tot}=\sum N_i is, the closer different draws of f_i will be to a single consistent value N_i/N_{tot}. In other words, all the f vectors will look similar. Whereas, when N_{tot} is small, the histograms will vary a lot and tend to look like histograms of only a small number of observations... noisy. Note, there's no requirement that the values in \vec N be integers, even though we're understanding them in terms of "effective counts".

If you want to simulate values that are plausible for the full population using only your prior guess at the frequency histogram and the counts of observed data points, one way to do it is to use the rdirichlet function to generate an f, then generate data consistent with that histogram. Each time you do this you'll generate a plausible full population which you can then use to calculate whatever complex thing you want as if you had the whole population dataset. Doing this over and over shows you how much your uncertainty in the whole population affects whatever your calculated statistic is.

There's nothing to say that the bins have to be equally spaced, or even have finite ends, though you'll have to have a model for the tail of the distribution if you include an infinite bin. Also, there's no reason you can't use an informative prior over the histograms, and the more informative you want the prior to be, the bigger the "count" of the fake data. If f_i is a vector of prior guesses for the histogram bin frequencies, then N f_i is in essence a prior over counts "worth" N data points. If you have an additional K data points that have observed relative frequencies F_i then Nf_i + KF_i is the parameter vector you should give to rdirichlet to get a sample from the posterior over histograms.

Once you have a vector of histogram frequencies f you can generate a plausible population of N values by selecting a histogram bin at random (with probabilities f) and generating a value uniformly within that bin. But probably the real F is not a piecewise constant function, so you can maybe do better with a little smoothing. Adding a small amount of normal noise to each simulated value will produce a draw from a smoothed histogram F_s, a piecewise constant function convolved with a normal curve. If you simply use the midpoint of each interval and then smooth with the appropriate normally distributed perturbation, you have a "nonparametric" Gaussian mixture model. It's nonparametric in the sense that it is actually a family of models, as you can always increase the number of bins to fit the size of your data.
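A minimal sketch of the whole procedure in R (the bin edges, prior counts, observed counts, and smoothing sd are made-up numbers for illustration):

library(gtools)   # for rdirichlet

set.seed(42)
edges <- seq(0, 5000, by = 500)               # made-up bin edges for account balances
k <- length(edges) - 1                        # number of bins
mids <- (edges[-1] + edges[-(k + 1)]) / 2     # bin midpoints

prior_f <- rep(1 / k, k)                      # flat prior guess at bin frequencies
Nprior <- 5                                   # prior "worth" 5 data points
obs_counts <- c(30, 25, 15, 10, 8, 5, 3, 2, 1, 1)   # made-up sample of 100 balances

## one draw from the posterior over histograms: prior counts + observed counts
f <- as.vector(rdirichlet(1, Nprior * prior_f + obs_counts))

## simulate a plausible full population of 1e6 balances consistent with this histogram
pop_size <- 1e6
bins <- sample(1:k, pop_size, replace = TRUE, prob = f)
pop <- runif(pop_size, min = edges[bins], max = edges[bins + 1])

## smoothed alternative: bin midpoints plus normal jitter (a Gaussian mixture)
pop_smooth <- rnorm(pop_size, mean = mids[bins], sd = 250)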

Basically, this scheme allows you to put Bayesian probabilities over the unknown frequency histogram of a large but finite population from which you have taken a random sample. If you have models for bias in your sampling, you can include that model through appropriate perturbations to the parameter vector of the Dirichlet.