Some summaries of Bayesian Decision Theory

2016 July 29
by Daniel Lakeland

The problem of how to manage fisheries in Norway (note additional details in comments I left there) is apparently getting a lot of attention, and a lot of burden is being placed on biologists, because the existing rules make it a big deal what the biologists report for their “percent of salmon that are hatchery escaped” numbers. Typically they do a survey of 60 fish or so, and then report how many were tagged/identified as hatchery escaped, then if their numbers indicate less than 5% then it’s all good for the farms, and greater than 5%, it’s Billions of dollars in costs imposed on the farms… Sigh…

So, the Biologists ask Andrew what the heck? How do we take into account the fact that seeing 5 fish in 60 caught really doesn’t tell us very exactly what the overall fraction is in the river?

Fortunately, there’s a smart way to do this, and it’s been known for a long time… unfortunately the people who know about it are pretty much unknown to the people who have the power to make the rules…

First, let’s consider how uncertainty should affect our decisions. Suppose that there is some vector of numbers that describes the state of the world, and we don’t know what it is. For example N,F to describe N the number of total salmon in the river, and F the fraction that are hatchery escapees. Now, we collect some data, n,f (lower-case) which is relevant for making an inference about N,F. So, based on some model M we can place a Bayesian posterior p(N,F | n,f,M). Now, we have some control variables, say d_i for the dollars we are going to spend on project i. And our model tells us what N_1,F_1 are likely to be, next year, given what we know about N,F this year and how much we plan to spend on each program d_i.

So, we’d like to make a decision about what the d_i values should be that is in some sense “the best we can do given our knowledge”.

First off, we need to know how good or bad will it be if N_1,F_1 are any of the possible values. So traditionally we think in terms of cost, and we need a cost function c(N_1,F_1,d_i) that incorporates “how bad it is that we spent all this money on the d_i values and we wound up getting N_1,F_1 to be whatever their actual values are”. Let’s suppose that through some kind of negotiations including hopefully some kind of suggestions from multiple parties, and some score-voting, we arrive at such a function.

How do we make the decision? First off, note that we want the decision to depend on information about every possible outcome. Is it possible we’ll have 5000 fish (up from 3000) and 0% hatchery next year? Then that’s good, is it possible we’ll have 40 fish and 10% hatchery next year? That’s bad. Ignoring any of the possibilities in our decision has got to be sub-optimal. Since all the options have to enter into the decision, the decision will need to be made based on some kind of integral over all the possibilities. Furthermore, if we’re given plausibilities p(N_1,F_1) and a particular value of N_1,F_1 becomes 2x more plausible after seeing some data then it should intuitively “count” 2x as much in our decision as before. More generally, the decision should be made based on a functional that is a linear mapping of p(N_1,F_1) to the real numbers (so we have an ordering of better vs worse). The expectation operation is what we’re looking for

\[ E(\mathrm{Cost}|\mathrm{Model,Data}) = \int_{N_1,F_1} \mathrm{Cost}(N_1,F_1,d_i) p(N_1,F_1 | d_i,n_0,f_0,\mathrm{Model})dN_1dF_1 \]

So, let’s find the d_i values that make this expected value of cost as low as possible (within the feasible computer time and search algorithm abilities we have) and then implement the policy where we actually do all the stuff implied by the d_i values.

This is called the Bayesian Decision Rule, and in this case, since it’s trying to control real world outcomes, another way to describe this is as Bayesian Optimal Control.

Some features:

  1. If the Cost function is continuous with respect to the outcomes, and the outcome predictions are continuous functions of the data, then the d_i choices are continuous functions of the data too. That is, small changes in inputs produce small changes in outputs, there’s no “gee we caught 1 more fish, send out the bill for $1 Billion Dollars”
  2. Every possibility is considered and its importance for the decision is dependent on both how bad that possibility is, and how plausible it is that it will occur. The results are linear in the plausibility values.
  3. The cost function can be anything finite, so no matter how you feel about various outcomes, you can incorporate that information in your decision. If multiple people are involved they can negotiate the use of a cost function that expresses some mixture of their opinions.
  4. As pointed out by Dale Lehman in the discussion at Andrew’s blog, the cost function need not, and some would argue should not necessarily be the same as the kind of thing Economists call cost functions (that is somehow, accurate dollar prices based on actual willingness to pay). In particular, you’re only going to pay dollars for the cost of carrying out the chosen policy d_i, so stuff that has very high “cost” associated to it, and is very unlikely under the model, can affect the choice of the decision variables in a way that causes you to actually pay a very moderate amount. Willingness to pay for the decisions is more important than willingness to pay the amount of money cost(N,F) for some hypothetical really bad N,F. If you choose a cost function and it causes you to make decisions that pretty much nobody agrees with, it means your cost function was “wrong” (though this isn’t necessarily the case if you just piss off some fraction of the population, you can’t please everyone all the time).
  5. The class of Bayesian Decision Solutions is an “essentially complete class”, that is, every decision rule that minimizes the expected cost under a Frequentist sampling theory of $$N_i,F_i$$ given some “true” parameters $$x$$ either has the same mathematical form as above, or there is a Bayesian rule that will produce as good or better decisions regardless of what the unknown parameter $$x$$ is. In essence, you have to form a function $$q(x|Data) \propto p(Data|x)p(x)$$ to get a rule that minimizes the Frequentist risk (this is called Walds Essentially Complete Class Theorem). This is the case in part because no one knows the “true” parameter of the random number generator so we can’t actually calculate the d_i that minimize the Frequentist risk under the “true” parameter value (and people like me deny that the Frequentist model of sampling from an RNG even applies in many/most cases).

So, whether you’re a Frequentist who believes in God’s dice, or a Bayesian who assigns numerical plausibility numbers based on some state of knowledge, the right way to make decisions that include all the information you have is to do the math that is equivalent to being a Bayesian who assigns numerical plausibilities.


No comments yet

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS