On Morality of Real World Decisions, and Frequentist Principles

2017 June 27
by Daniel Lakeland

Suppose you want to make decisions, such as choosing a particular point estimate of a parameter, or deciding whether to give a drug, or whatever, and you want your decision making rule to have the *Frequency related* (Frequentist) property that under repeated application it will on average have small “badness” (or large “goodness”). Then you should look for your procedure within the class of procedures mathematically proven to have unbeatable frequency of badness properties. This class is the Bayesian decision rules (see Wald’s Theorem and the Wiki). The boundary of the class is the Bayesian decision rules with flat priors, but we know more: we know that the frequency properties of Rule A will be better than Rule B whenever a region around the real parameter is higher in the prior probability distribution under A than under B. In that case Rule A gives more weight to the actual correct value, and so its decision is based more on what will turn out to happen.

Unfortunately we don’t know the correct value exactly. But when we use flat priors, they are equivalent to a prior that places 100% probability on the value being infinitely large.

Now, if you, as a person who cares about the frequency properties of procedures, agree that your parameter *is* a real number, and you have any idea at all what the magnitude of that number is, say its base-10 logarithm rounded to the nearest integer is N, then by choosing a proper prior such as normal(0,10^(N+100)) you will have lower Frequency risk of making “bad” decisions than if you used a flat prior. Of course, you can do better: normal(0,10^(N+1)) will do better…
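To make that comparison concrete, here is a minimal sketch under assumptions that are not in the post: a single observation x ~ normal(theta, sigma), squared error as the badness, the posterior mean as the decision, and the second argument of normal(0, s) read as a standard deviation. Under those assumptions the Frequentist risk has a closed form. Exact rational arithmetic is used because floating point cannot tell a prior scale of 10^(N+100) apart from a flat prior. All function names are illustrative.

```python
from fractions import Fraction


def risk_flat_prior(sigma):
    # With a flat prior the posterior mean is just x, so the risk is the noise variance.
    return sigma ** 2


def risk_normal_prior(theta, tau, sigma):
    # With prior normal(0, tau) the posterior mean is shrink * x.
    shrink = tau ** 2 / (tau ** 2 + sigma ** 2)
    variance = shrink ** 2 * sigma ** 2
    bias_squared = ((1 - shrink) * theta) ** 2
    # Frequentist risk (mean squared error) at the true value theta.
    return variance + bias_squared


N = 3                                  # hypothetical order of magnitude
theta = Fraction(10) ** N              # hypothetical true parameter value
sigma = Fraction(1)                    # hypothetical noise scale

flat = risk_flat_prior(sigma)
wide = risk_normal_prior(theta, Fraction(10) ** (N + 100), sigma)
tight = risk_normal_prior(theta, Fraction(10) ** (N + 1), sigma)

print(tight < wide < flat)
```

In this toy setup the proper prior beats the flat prior whenever theta^2 < 2*tau^2 + sigma^2, which is why even the absurdly wide normal(0,10^(N+100)) comes out ahead, and the tighter normal(0,10^(N+1)) does better still.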

Now, decision making and morality are inevitably intertwined. Consider the existence of the “trolley problem” in moral philosophy: it’s all about making choices each of which has bad consequences, but we have to make a choice, including the choice of “do nothing”, which also has bad consequences. On the other hand, if you have no choice, there is no morality associated with your action. Getting hit by a drunk driver who crashes through the wall of your bedroom while you’re sleeping is not a moral failing on your part, for example.

But, if you have a choice of how to make real important, *CLINICAL* decisions about people’s lives, and health, and societal health through engineering civil structures and the like, and you care about the frequency with which you do things that have bad consequences that you can’t foresee exactly, and you *don’t* make your decision by choosing a method that is better than your likelihood + flat prior + point estimate based on Mean Squared Error because you refuse to use a prior on some kind of principle, or you refuse to consider real world consequences other than mean squared error on some kind of principle, then in my opinion your principle is immoral, in the same way as prescribing a toxic drug on the principle that “I get a cut of the proceeds” is immoral.

If you make the decision because you don’t know any better… then you’re like the guy in the bed who gets hit by the car. But if you write books on statistics from a Frequentist perspective, and you fail to teach the Complete Class result, and you fail to emphasize the fact that you have a choice in what measure you will use in making your decisions (such as the choice between Mean Squared Error in your estimate of the parameter value vs Quality Adjusted Life Years Saved of your clinical decision) then I think you’re doing evil work in the same way that a person who teaches a Civil Engineering design rule that has been proven to be wrong and risks people’s lives is doing evil work.

So, I do get a little worked up over this issue. Remember I have a background in Civil Engineering and I work with my wife, who is a research biologist at a medical school. None of this is abstract to me, it’s all real world “you shouldn’t do that because it’s wrong/bad for society/evil/it hurts people more often”.

To back up a bit though: I don’t think it’s evil to care about the frequency properties of your procedures. I think it’s evil to *fail to care* about the real world consequences of your decisions.

From the perspective of making decisions about things, such as point estimates of parameters, or clinical decisions about treatments, being a Frequentist (meaning, trying to reduce the average badness of outcomes under repeated trials) actually *entails* doing Bayesian Decision Theory. The Frequentist principle “try to reduce the average severity of bad outcomes” implies “Do Bayesian Decision Theory”.
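As a minimal sketch of what “doing Bayesian Decision Theory” means mechanically: take a posterior over the unknown quantity, write down a badness function, and pick the decision with the smallest posterior expected badness. The posterior, the candidate decisions, and both badness functions below are made up for illustration; the point is only that the decision changes when the badness measure changes from squared error to something asymmetric.

```python
def expected_loss(decision, values, probs, loss):
    # Posterior expected badness of one candidate decision.
    return sum(p * loss(v, decision) for v, p in zip(values, probs))


def bayes_decision(candidates, values, probs, loss):
    # The Bayes decision is the candidate with the smallest expected badness.
    return min(candidates, key=lambda d: expected_loss(d, values, probs, loss))


def squared_error(true_value, decision):
    return (true_value - decision) ** 2


def asymmetric_badness(true_value, decision):
    # Hypothetical: falling short is 20 times as bad as overshooting.
    gap = true_value - decision
    return 20.0 * gap if gap > 0 else -gap


# Hypothetical discrete posterior over the unknown quantity.
values = [1, 2, 3, 4, 5]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
candidates = values

mse_choice = bayes_decision(candidates, values, probs, squared_error)
asymmetric_choice = bayes_decision(candidates, values, probs, asymmetric_badness)
print(mse_choice, asymmetric_choice)
```

With squared error the rule picks the posterior mean; with the asymmetric badness it moves toward the upper end of the posterior, because undershooting is penalized so much more heavily.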


5 Responses
  1. Corey
    June 27, 2017

    Wald’s complete class theorem does have some technical assumptions, and personally my familiarity with the mathematical language in which they are expressed is poor enough that it’s not generally obvious to me if they’re satisfied in any given decision problem.

    • Daniel Lakeland
      June 27, 2017

      That’s fine, I’ll admit to being ignorant of the precise technical details as well. Still, the James-Stein estimator already shows you that even for MSE what is done (i.e. use ML estimation) doesn’t meet the criterion for least Frequentist risk.
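A small simulation illustrates the James-Stein point. This is a sketch under assumed conditions that are not in the comment: a 10-dimensional mean vector, unit-variance normal noise, one observation per trial, and the plain (not positive-part) James-Stein estimator shrinking toward the origin. The true vector, trial count, and seed are arbitrary.

```python
import random


def james_stein(x):
    # Plain James-Stein shrinkage toward the origin, assuming unit-variance noise.
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm_sq
    return [factor * v for v in x]


def squared_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))


def simulate_risks(theta, trials=5000, seed=0):
    # Average total squared error of the ML estimate (x itself) and of James-Stein.
    rng = random.Random(seed)
    mle_total = 0.0
    js_total = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]
        mle_total += squared_error(x, theta)
        js_total += squared_error(james_stein(x), theta)
    return mle_total / trials, js_total / trials


theta = [1.0] * 10  # hypothetical true parameter vector
mle_risk, js_risk = simulate_risks(theta)
print(mle_risk, js_risk)
```

The ML estimate has risk equal to the dimension (10 here), and the simulated James-Stein risk comes out clearly lower, which is the sense in which ML estimation fails the least-risk criterion even under MSE.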

      There are basically 3 ways to do stuff:

      1) Do some stuff, just whatever works for you. I can’t comment on this really, but lots of stuff is done this way.

      2) Come up with a principled way to choose what stuff to do: there are basically two *principled* views of statistics, Frequentist principles, and Bayesian principles. Then just do whatever someone told you.

      3) Come up with some principles, and actually follow them.

      Almost all of what I’ve seen done in the “based on Frequentist principles” camp is done by people who are somewhere in camp (2): they simply do what they were taught the principles were in their textbook. They’re like the person in the bed who gets hit by the car… not really morally involved in any of this, they just did the best they knew based on what they were taught in the textbooks… Note that *lots* of actual Civil engineers design bridges and things just following the design specifications in the code or the textbook or whatever…

      On the other hand, where the principles ought to be… namely in the textbook… the principles that should be taught are that you should analyze a problem in a way that produces least average badness for society or the like. You have choices about how to do stuff!! They are explicit, and they have moral content.

      So, please pull out the big pile of textbooks on standard statistical methods that describe in detail early on, the principles required to get least average badness given your uncertainty?

      On my shelf I have various things I bought back in the day: Heiberger and Holland, Venables and Ripley, Fox Applied Regression…

      Heiberger and Holland seems like a bog standard Masters degree stats text. I got it because it was basically the only thing Cody’s had (a venerable high quality book store on Telegraph Ave in Berkeley, closed about 15 years ago but used to be THE source for academics in the Bay Area).

      Here’s what it says in the chapter “Statistics Concepts” under the heading “Estimation: Criteria for Point Estimators”

      There are a number of criteria for what constitutes “good” point estimators. Here is a heuristic description of some of these.

      small variance….

      where “…” elides a description of what those mean

      And I’m sorry, but that seems to me to be morally outrageous considering that tons of people will go off to become biostats masters grads and then start running clinical trials on things that actually have the potential to kill people, like cancer drugs or whatnot.

      Pretty much EVERYTHING you find from there on will be examples of plugging and chugging to find ML estimates or least squares, or do various hypothesis tests or whatever. Not ONCE will you find an example problem that looks like:

      Dr. Margulin knows that his patient is dying of an overdose of a certain drug. Dr. Margulin knows the patient’s height, weight, age, sex and which drug the patient took, but has only a very noisy estimate of the dose that the patient took. If Dr. Margulin gives not enough of the antidote the patient will die in the next few hours; if he gives too much of the antidote the patient’s kidneys and liver will fail and he will die after a few days. Fortunately previous studies of this drug in monkeys have given us a lot of data on dosing. We can approximate the badness of the outcome by B(Dose/PatientMass,AntidoteDose/PatientMass), a given function. Given the dataset below, how do you design a chart that will inform Dr. Margulin of the best dose to give to his patient?

      Yet it seems to me this or something like it should be a textbook problem in every stats text. You simply can’t do principled statistics from a Frequentist perspective by sticking to mean squared error and maximum likelihood…. without committing accidental atrocities.

      Mean Squared Error and Maximum Likelihood are taught *as if they were the meaningful principles* but the meaningful principle is “choose something that minimizes/maximizes something else”

    • Daniel Lakeland
      June 27, 2017

      The biggest assumptions in Wald’s proof are a bounded parameter space, compactness of the decision space, and uniqueness of the decision rule that attains the minimum risk value for a given weighting function.

      The uniqueness thing, I admit I don’t know, because I don’t know what a “space of abstract decision rules” looks like. The space of Bayesian rules is easier to understand (it involves basically just choosing priors).

      The boundedness of the parameter space can be forced by transforming the region [0,1] through a function that goes to -inf at 0 and +inf at 1.
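One function with the property described above is the logit, shown here with its inverse. The choice of logit is an assumption for illustration, since the comment does not name a specific function; any monotone map that goes to -inf at 0 and +inf at 1 would serve. The idea is to run the analysis on the bounded coordinate u in (0,1) while the original parameter is theta = logit(u).

```python
import math


def logit(u):
    # Maps the open interval (0, 1) onto the whole real line.
    return math.log(u) - math.log1p(-u)


def sigmoid(theta):
    # Inverse of logit: maps any real number back into (0, 1).
    # Written in two branches so that exp never overflows.
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)


# An unbounded parameter value and its bounded coordinate.
theta = 12.5
u = sigmoid(theta)
print(u, logit(u))
```

In floating point the bounded coordinate rounds to exactly 0.0 or 1.0 once |theta| gets large enough, so this is a conceptual device more than a numerical recipe.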

      The compactness of the decision rule space… in real practical terms, let’s just allow only decision rules that can be programmed into a computer in less than 10 million bytes of computer code in the language of your choice. If your reason for not using Bayes is that you want to sit down and write a 300 million line computer function that is your decision rule… you probably need a punch in the nose.

      • Corey Yanofsky
        June 29, 2017

        I don’t disagree with you about the moral weight of getting this stuff right. I’m talking mostly about the claim that if frequentists (those who care about the risk functions of their procedures anyway — not all do) follow the math to its logical conclusion they will arrive at a class of Bayes decision rules. Jaynes makes this claim, for example. I’d like to too, but I’m wary in particular of the condition that says the loss function must be a continuous function of the state of the world. That one seems like an easy one to violate. The loss function that leads to HPD regions violates it for example, although in that particular case you can get arbitrarily close to it with a sequence of continuous loss functions.

        • Daniel Lakeland
          June 29, 2017

          Hmm. I can’t imagine a meaningful *essentially* discontinuous loss function, something like the stupid function that people discuss with the Lebesgue integral: the indicator function on the irrational reals.

          What I can imagine is that people take what is essentially a continuous thing with a very rapid change, and approximate it as a step function. Here, undoing the approximation leads to a more real but less convenient to represent function.

          Think of it this way, take any discontinuous loss function at all that people might actually use, which is expressed on dimensionless variables that are O(1), and convolve it with normal(0,1e-37). It is now an infinitely smooth function. Does it represent the real-world badness less well?

          Even if you say something like there is a computer program which decides whether you get some good thing or not, and it has a precise step function in it… so that the convolved version really is problematic right at the transition (say between “you get 0” and “you win the lottery”), the convolved version is just a device inside an integral that allows you to get a decision, so does it lead to a meaningfully different decision? I can’t see it. The integral already gives the transition region basically infinitesimal weight, having that infinitesimal weight spread out by 1e-37 in any real world problem seems unlikely to matter. And if it does matter, what the heck kind of decision is that?
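The smoothing argument can be checked numerically in a toy case. This sketch assumes a unit step loss at a threshold and a normal(mean, sd) distribution over the state of the world, neither of which comes from the comment. A step convolved with normal(0, width) is a normal CDF with scale width, and its expectation under a normal distribution has a closed form, so the two expected losses can be compared directly.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_step_loss(mean, sd, threshold):
    # E[ 1{X > threshold} ] for X ~ normal(mean, sd).
    return normal_cdf((mean - threshold) / sd)


def expected_smoothed_loss(mean, sd, threshold, width):
    # The step convolved with normal(0, width) is normal_cdf((x - threshold) / width).
    # Its expectation under X ~ normal(mean, sd) is a normal CDF with the
    # combined scale sqrt(sd^2 + width^2).
    return normal_cdf((mean - threshold) / math.hypot(sd, width))


# Hypothetical O(1) problem: state ~ normal(0.3, 1), step at 0.
sharp = expected_step_loss(0.3, 1.0, 0.0)
slightly_smooth = expected_smoothed_loss(0.3, 1.0, 0.0, 1e-3)
barely_smooth = expected_smoothed_loss(0.3, 1.0, 0.0, 1e-37)
print(abs(sharp - slightly_smooth), abs(sharp - barely_smooth))
```

Even a smoothing width of 1e-3 moves the expected loss by less than one part in a million here, and at 1e-37 the change is below floating point resolution, so any decision based on comparing expected losses is unaffected.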
