Confusion of definitions Bayesian vs Frequentist (vs yet a third category)

2016 May 9
by Daniel Lakeland

Generally I consider it unhelpful to argue about definitions of words. When that arises I think it’s important to make your definitions clear, and then move on to substantive arguments about things that matter (like the logic behind doing one thing vs another).

So, what makes a calculation Bayesian vs Frequentist and is there any other category to consider? Here are my definitions:

Bayesian Inference: The calculation begins from a position where probability defines how much you know about an aspect of the world, and proceeds to calculate the consequences of what you know together with what you observed (your data) using the sum and product rules of probability to arrive at a new explicit description of a state of knowledge. An analogy: in Prolog you use Aristotelian logic to compute true statements from the facts you supply, and in Stan you use Bayesian logic to compute samples from output distributions, given your data and your input distributions.
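As a rough illustration of "sum and product rules in, new state of knowledge out", here is a minimal Python sketch on a grid. The coin-bias setting, the flat prior, and the counts of heads and tails are all made-up assumptions for the example, not anything from a real analysis.

```python
def grid_posterior(prior, likelihood):
    """Product rule: joint = prior * likelihood.
    Sum rule: p(data) is the joint summed over all parameter values."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical problem: what do we know about a coin's bias theta?
thetas = [i / 100 for i in range(1, 100)]      # candidate values 0.01 .. 0.99
prior = [1 / len(thetas)] * len(thetas)        # assumed flat state of knowledge
heads, tails = 7, 3                            # made-up observations
likelihood = [t ** heads * (1 - t) ** tails for t in thetas]

posterior = grid_posterior(prior, likelihood)
post_mean = sum(t * p for t, p in zip(thetas, posterior))
print(round(post_mean, 3))
```

The output is a full distribution over theta, which is the "explicit description of a state of knowledge"; the mean is just one summary of it.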

Frequentist Inference: The calculation begins with a description of the data collection or generation process as repeated sampling from a collection of possibilities, and proceeds to calculate how often something might happen if you do more sampling in the future, and how that relates to some unknown aspect of the collection. (Wikipedia definition seems to agree qualitatively)
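To make the repeated-sampling idea concrete, here is a small Python sketch. It assumes a made-up finite population (square footage of hypothetical homes) and a normal-approximation 95% interval; it then counts how often the interval would contain the population mean if the sampling were repeated many times.

```python
import random
import statistics

random.seed(1)
# Made-up finite population: square footage of 10,000 hypothetical homes.
population = [random.gauss(1800, 500) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def interval_covers(sample, target, z=1.96):
    """Normal-approximation 95% interval for the mean: does it contain target?"""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - z * se <= target <= m + z * se

def coverage(n=50, repeats=2000):
    """Fraction of repeated samples whose interval contains the population mean."""
    hits = sum(interval_covers(random.sample(population, n), true_mean)
               for _ in range(repeats))
    return hits / repeats

print(coverage())
```

The quantity being computed is a property of the sampling procedure (how often it works), not a statement of knowledge about any one sample.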

Those are the definitions I’m using. You may well believe these are not good definitions. But I think these are principled definitions, in that they capture the essence of the metaphysical ideas (metaphysics: “a traditional branch of philosophy concerned with explaining the fundamental nature of being and the world that encompasses it” wikipedia)

The problem arises in that a lot of classical statistics is kind of “neither” of these, or maybe “a mishmash of both”. My take on that is more or less as follows:

  1. There are Frequentist procedures which are very clearly Frequentist. For example the chi-square, Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilk, and similar tests. For a hard-core set of tests we can look at the Dieharder tests for Random Number Generators, which explicitly try to find ways in which RNGs deviate from uniform and IID. These things really do answer the question “does this sample look like a sample from an RNG with given distribution?”
  2. There are Bayesian models, in which people choose a model based on what they know about a process (physics, chemistry, biology, economics, technological design of measurement instruments, etc) and then turn that into a joint distribution over the data and some parameters that describe the process, typically via two factors described as a likelihood and a prior. In this case, the question of whether we can reject the idea that the data came from a particular distribution on frequency grounds does not enter into the calculation.
  3. There are a large number of people who do things like what’s done in this paper, where they construct a likelihood in what I would call a Bayesian way, based on their knowledge of a process. In this case it’s a point process in time: detection of gram-negative bacteria in a hospital test. They know that the average is not necessarily the same as the variance, so they choose a default likelihood function for the data, a negative binomial distribution, which has two parameters, instead of a Poisson distribution, which has only one. But there emphatically is NOT some well defined finite collection of negative-binomially distributed events that they are sampling from. They then take this likelihood function and treat it as if it were describing a bag of data from which the actual data were drawn, and do Frequentist type testing to determine whether they can reject the idea that certain explanatory parameters they are thinking of putting into the model could be dropped out (what you might call a “Frequentist tuning procedure”). In the end they choose a simplified model and maximize the likelihood (a Bayesian calculation).
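For category 1, the question “does this sample look like a sample from an RNG with given distribution?” can be sketched in a few lines of Python. This is only a toy Pearson chi-square check against a uniform distribution, nowhere near what a real RNG test battery does; the bin count and sample size are arbitrary choices.

```python
import random

def chi_square_uniform(values, bins=10):
    """Pearson chi-square statistic for values in [0, 1) against a uniform distribution."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    expected = len(values) / bins
    return sum((c - expected) ** 2 / expected for c in counts)

random.seed(0)
sample = [random.random() for _ in range(10_000)]
stat = chi_square_uniform(sample)

# 16.92 is the 95th percentile of a chi-square distribution with 9 degrees of freedom.
verdict = "reject uniformity" if stat > 16.92 else "no evidence against uniformity"
print(stat, verdict)
```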

Since 1 and 2 are pretty straightforward, the real question is how to understand 3. Probably the best way to understand 3 is that 3 is what’s done when you’ve been classically trained and therefore have no real critical idea of the principles behind what you’re doing. That is, you know about “what is done” but not “why”. Note, that doesn’t make it “wrong” necessarily, it’s just that the people who do it rarely have a guiding principle other than “this is what I know how to do based on my education”.

On the other hand, I think it’s possible to argue that there is a principle behind 3, I just think that most people who do 3 aren’t much aware of that principle. The basic principle might be “data compression”. These kinds of models start with a Bayesian model in which a likelihood is picked based not on sampling from some known population but instead “what is known about where the data might lie” and then this model is tuned with the goal of giving a sufficiently short bit-length approximation to a Bayesian model that has good enough Frequency properties to reduce your bit cost of transmitting the data.

Then, if you collect additional data in the future, and it continues to look a lot like the past, awesome, you can send that data to your friends with low bandwidth costs. For example, instead of say a 16 bit integer per observation, you might be able to send the data using only on average 9 bits per observation, with no approximation or loss of information relative to the 16 bit version.
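Here is a hedged Python sketch of where a number like "9 bits instead of 16" could come from. It assumes the counts follow a negative binomial model with illustrative, made-up parameters (mean 200, size 5), and computes the average ideal code length, which is the best a lossless code built from that model could achieve.

```python
import math

def neg_binomial_pmf(k, r, p):
    """P(K = k) for a negative binomial with size r and success probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def mean_bits(pmf_values):
    """Average ideal code length, the sum of -f * log2(f), in bits per observation."""
    return sum(-f * math.log2(f) for f in pmf_values if f > 0)

# Assumed, illustrative parameters: mean count 200 with overdispersion.
r, mean = 5.0, 200.0
p = r / (r + mean)
pmf = [neg_binomial_pmf(k, r, p) for k in range(65536)]   # every 16-bit value

bits = mean_bits(pmf)
print(round(bits, 2), "bits on average, versus 16 for a raw integer")
```

The saving only materializes if future counts really do occur with roughly the frequencies the model assigns.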

In this sense, (3) isn’t really an inference procedure at all, it’s a tuning procedure for a channel encoding. If you do some Frequentist tests, and decide to drop factor X from your model, it’s not a commitment to “factor X = 0 in the real world” it’s more an approximation to “Factor X doesn’t save me much on data transmission costs if I had to send a lot more of this kind of data”
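A small Python sketch of what "Factor X doesn’t save me much" might look like in bits. The counts, the two wards, and the choice of a Poisson model are all made up for illustration, and the cost of transmitting the extra parameter is ignored.

```python
import math

def poisson_bits(counts, rate):
    """Total ideal code length, in bits, of counts under a Poisson(rate) model."""
    return sum(-(k * math.log(rate) - rate - math.lgamma(k + 1)) / math.log(2)
               for k in counts)

# Made-up daily counts for two wards; "factor X" is ward membership.
ward_a = [3, 5, 4, 6, 2, 4, 5, 3]
ward_b = [4, 4, 6, 3, 5, 5, 2, 4]
pooled = ward_a + ward_b

# Model without factor X: one shared rate. Model with factor X: one rate per ward.
without_x = poisson_bits(pooled, sum(pooled) / len(pooled))
with_x = (poisson_bits(ward_a, sum(ward_a) / len(ward_a))
          + poisson_bits(ward_b, sum(ward_b) / len(ward_b)))

saving_per_obs = (without_x - with_x) / len(pooled)
print(saving_per_obs)   # bits saved per observation by keeping factor X
```

A tiny saving is a reason to drop X from the encoding. It says nothing direct about whether the wards differ in the real world.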

Although there are a certain number of “information theoretic” statisticians who actually think of this explicitly as a guiding principle, I don’t think this is the position you’d get if you asked the authors of that paper what their motivation was for the statistical procedures they used.

Is 3 Bayesian inference? Is 3 Frequentist inference? The answer is no and no. 3 is not inference at all, it’s communications engineering. But, it does use Bayesian ideas in the development of the likelihood, and it does use Frequentist ideas in the evaluation of the channel bandwidth. I think some of those principled information theoretic statisticians may have an argument that it is some kind of inference, in other words, that the word inference should include this type of stuff, but I don’t know much about those arguments and I don’t really care to argue about the definitions of words. Communications Engineering seems to be a perfectly good description of this process. The question is, does it accomplish what you need to accomplish? In most scientific questions, I doubt it.

Again, thanks to ojm who keeps pushing back and making me think about things in more depth.

PS: the careful, principled version of this stuff is called “Minimum Message Length” modeling and Wikipedia has a nice intro. In essence it takes a Bayesian model, breaks it down to a discretized version, making decisions about the precision of the discretization which are ultimately based on a loss function related to Frequentist ideas (since that’s what’s relevant to transmission of a message). The version performed by many classically trained statisticians such as the ones in the article from PLoS One is more or less an ad-hoc version.

8 Responses
  1. Chris Wilson
    May 12, 2016

    Hey Daniel, nice write-up. I think I follow most of your reasoning here, although the CS analogy is a bit over my head. However, I’m wondering more specifically about what you might call “likelihoodism”. Now I grant that every max likelihood procedure *can* be considered as approximate Bayes (or equivalent to Bayes with flat priors, whatever), i.e.:
    p(theta|data) prop. to p(data|theta)*p(theta) where p(theta) is some constant and drops out, so we allow the equivalency
    p(theta|data) ~ p(data|theta)
    So, I grant that this is all potentially “Bayesian” in that it’s based in inverse probability or something. But it does seem that there is a frequentist justification/approach to both doing and interpreting likelihood calculations. In practice that seems to be: 1) first you construct an estimator, usually by differentiating the log likelihood with respect to the parameter of interest and setting the derivative to zero, and then 2) you compute the variance of that estimate (this is the critical part for me), by
    E[x^2] – (E[x])^2 where x is your parameter of interest. Then, standard error is root of that function. Of course, this has all the usual frequentist problems in moving from point estimate to uncertainty interval (i.e. will need to rely on asymptotics), but let’s side-step that for the moment.

    The two parts that seem salient to me are, 1) this math can be done purely with respect to the likelihood surface (no prior), and 2) I don’t know how to interpret the variance of the estimate (as yielded by that approach) without reference to some frequentist idea about hypothetical long-run sampling. Furthermore, that seems like a legitimate evaluation to me. I don’t agree that it is the superior way to look at the world (I’m definitely more of a Jaynes/Gelman Bayesian), but it seems like a consistent and coherent approach to doing likelihood calculations. So to me it seems like there IS a legitimate frequentist flavor of “likelihoodism”.


    • Daniel Lakeland
      May 12, 2016

      It’s interesting because where you say “I don’t know how to interpret the variance of the estimate (as yielded by that approach) without reference to some frequentist idea about hypothetical long-run sampling.”

      I’m much closer to saying “I don’t know how to interpret the variance of the estimate, except as the variance of the posterior probability distribution for x”

      So, I think we need to be very careful what we mean by expectation. If you calculate the posterior distribution for x as

      $$p(x | D) \propto p(D | x) p(x)$$

      where p(x) is uniform between -N and N, two nonstandard numbers (ie. a “flat” prior). And then you compute the variance of x as:

      $$\int (x-\bar{x})^2 \, p(x | D) \, dx$$

      this is clearly a Bayesian variance of the posterior probability distribution for x.

      If, on the other hand, you compute the variance *of the estimation procedure* under resampling:

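      xbar <- numeric(n_rep)                  # one ML estimate per resample
      for (i in 1:n_rep) {
        D_star <- resample(D)                 # placeholder: draw a new dataset from D with replacement
        xbar[i] <- ML_estimate_x(D_star)      # placeholder: re-run the ML estimation procedure
      }
      xvar <- var(xbar)                       # variance of the procedure under resampling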

      then you’re pretty much committing to D being a “truly random” (or let’s say “high quality, representative”) sample from a population so that resampling D is “like” re-running the experiment and you’re calculating the variance *of the procedure* under resampling.

      Sometimes you can believe that sampling from a parametric form with the ML estimate plugged in could be about-as-good as the resampling the data itself procedure. Then you’re really committing to the parametric form for sampling, ie. you’re committing to the idea that the future will sample like the past and both sampled like the particular parametric form.

      But, what evidence do you have for that? If you aren’t actually sampling from a well defined finite population, then you are doing the “compression engineering version” where you don’t actually have any guarantees about the world, you only have guarantees about how well your model will compress data *if* the future looks identical to the past.

      Commitments to sampling across time having a stable frequency distribution are MUCH STRONGER commitments to a model of the world than commitments to a short-duration sampling procedure from a well defined finite population being stable if you re-sampled a different sample of that finite population a very short time later.

      • Chris Wilson
        May 12, 2016

        Hi Daniel, yes I think I’m tracking what you’re saying. In particular, this: “…you’re pretty much committing to D being a “truly random” (or let’s say “high quality, representative”) sample from a population so that resampling D is “like” re-running the experiment and you’re calculating the variance *of the procedure* under resampling.”, and this “Then you’re really committing to the parametric form for sampling, ie. you’re committing to the idea that the future will sample like the past and both sampled like the particular parametric form.”

        is what I meant to imply in my original post, but admittedly was not super clear about. My understanding of frequentism is that they view data as random samples from certain distributions, or, as you like to say, from an RNG with certain true (fixed) parameter values. Thus, one is indeed constrained to estimating variance and uncertainty in the procedure rather than the estimate. So, if you do the math the way I described (your second definition), isn’t that a legitimate way to interpret that estimation variance? (geometrically, it maps to width of likelihood profile around ML estimate). I suppose all I’m saying is that this is a real frequentist justification for a certain max likelihood procedure that is used all the time in practice (i.e. lme4 package in R). It has all kinds of practical and philosophical flaws, but is still frequentist in spirit, no?

        In practice, I’m not sure any of this matters much to me at least 🙂 I’m content to view output of models from e.g. lme4 as “approximate Bayes” and leave it at that, but the idea that it could be derived as a sort of “frequentist-likelihoodism” also doesn’t bother me. I guess the key question in practice is probably when N is limiting, and estimation variance is high, the extent to which regularizing priors and/or fully informative priors can stabilize and improve inference can represent an important disagreement about how we converge on the truth.

        Anyhow, thanks for your writings, I’ve found several of your pieces useful and thought-provoking.



        • Daniel Lakeland
          May 12, 2016

          “Anyhow, thanks for your writings, I’ve found several of your pieces useful and thought-provoking.”

          Thanks Chris, I really like to get that kind of feedback because although I can see that people do come to my blog and read it, it’s hard to know what other people think is useful vs what is just yammering on

        • Daniel Lakeland
          May 12, 2016

          When you say “geometrically, it maps to width of likelihood profile around ML estimate” I think what is meant is that:

          1) After doing an essentially Bayesian calculation (ie. form the posterior using a flat prior), if the posterior is sharply peaked, then when you plug in the single maximum-likelihood value, we will be throwing away very little uncertainty, due to the sharp peak.

          2) After plugging in the maxlike estimate, resampling from the assumed data distribution will produce new datasets which have other maximum likelihood estimates which are close to the original one, so that the variance in the procedure is about the same width as the sharp peak of the Bayesian posterior.

          But this seems to break down when the Bayesian posterior is not sharply peaked. Then, the uncertainty about what value is “true” can be much bigger than the uncertainty you get when you resample using just the one fixed maxlike value.
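A rough Python sketch of that breakdown, under assumptions chosen purely for illustration: an exponential likelihood, a flat prior on the mean over (0, 50], and a sample mean of 1.6. It compares the posterior standard deviation (on a grid) with the standard deviation of the ML estimate when resampling from the model with the ML value plugged in.

```python
import math
import random
import statistics

def flat_prior_posterior_sd(n, xbar, upper=50.0, steps=20_000):
    """Grid posterior sd for the mean mu of an exponential, flat prior on (0, upper]."""
    mus = [upper * (i + 1) / steps for i in range(steps)]
    # Log-likelihood of n iid exponential observations with sample mean xbar.
    logs = [-n * math.log(mu) - n * xbar / mu for mu in mus]
    top = max(logs)
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    mean = sum(m * w for m, w in zip(mus, weights)) / total
    var = sum((m - mean) ** 2 * w for m, w in zip(mus, weights)) / total
    return math.sqrt(var)

def plug_in_resampling_sd(n, xbar, repeats=2000, seed=0):
    """Sd of the ML estimate (the sample mean) when resampling from Exp(mean = xbar)."""
    rng = random.Random(seed)
    estimates = [statistics.fmean([rng.expovariate(1 / xbar) for _ in range(n)])
                 for _ in range(repeats)]
    return statistics.stdev(estimates)

for n in (5, 200):
    print(n, flat_prior_posterior_sd(n, 1.6), plug_in_resampling_sd(n, 1.6))
```

With many observations the two widths should roughly agree; with only a handful, the posterior in this setup is noticeably wider than the plug-in resampling spread.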

  2. Daniel Lakeland
    May 12, 2016

    Chris, since you say the comp-sci version is a little outside your comfort area, here’s maybe a restatement of the idea.

    When we’re sampling from a finite real population of things (like all single family homes in LA county) the population has time-stable real properties. Sure, people demolish homes and rebuild, but only a small fraction of them in any given year. Each house has a reasonably well defined square footage for example.

    But when we sample from something like “ask each patron at our restaurant, how much did you enjoy your dessert today?” it’s a totally different story. Each night there will be perhaps a few regulars who you see several times a month, and then a few birthday parties, a few dates, people from out of town, people from a convention of Masons, whatever… there is no reason to believe that a month or a year from now the mix of people “looks like” the sample you got today, or that even if it’s the same people, they will continue to be satisfied in similar ways, perhaps your competition comes in and makes them expect more!

    But, if it DID always look like the sample you got today, then the frequencies assigned by your parametric model, when the model has Maximum Likelihood values plugged in for its parameters, would be good ones to use in choosing how to encode the data. If you encode the most common thing that happens as “0”, the second most common as “1”, the third as “10”, the next as “11”, and so on, then over the long run, sending a lot of these things, say N, you’ll use only N times the average number of bits used to encode a message.

    If you think of “f” as “how common” an outcome is, then 1/f is “how rare” the outcome is, and you therefore assign something proportional to log(1/f)/log(2.0) = – log2(f) bits to a message that describes an event that has frequency f. This was proved by Shannon to be optimal. So you’re going to send something like O(N * mean(log2(1/f))) bits down this channel.

    But, if future frequencies don’t correspond to your predicted ones from your parametric model, then you’ll send a lot more bits, because you’ll be assigning the bit length of each data point based on bad predictions for f. In other words, there is NO frequency guarantee in this type of scenario, so by fitting your model to the observed frequencies, you are potentially easily fooled.
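A minimal Python sketch of the two cases, with made-up frequencies: code lengths are assigned as -log2(f) from the model's predicted frequencies, and the average cost per message is then computed under the frequencies that actually occur.

```python
import math

def code_length_bits(f):
    """Ideal (Shannon) code length for an outcome with predicted frequency f."""
    return -math.log2(f)

def mean_bits(actual, predicted):
    """Average bits per message when outcomes occur with `actual` frequencies
    but code lengths were assigned from `predicted` frequencies."""
    return sum(a * code_length_bits(p) for a, p in zip(actual, predicted) if a > 0)

predicted = [0.5, 0.25, 0.125, 0.125]      # made-up frequencies from a fitted model
future_same = predicted                    # the future looks like the past
future_shifted = [0.125, 0.125, 0.25, 0.5] # the future looks different

print(mean_bits(future_same, predicted))      # 1.75 bits per message
print(mean_bits(future_shifted, predicted))   # 2.625 bits per message
```

The extra bits in the second case are the price of having tuned the code to frequencies that did not persist.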

    So, a lot of this stuff basically starts with choosing a distribution based on some intuition or science or guesswork which has NO guarantees associated with it at all, to represent a potentially “infinite” string of events which are NOT samples out of a finite population where each element of the population has a definite value, and then you use some frequency based tests to see how well your tuned version of this thing fits the actual data you DO have. Then once it’s tuned well enough, you use the maximum likelihood estimates to describe your model which is really just a projection of your guess about how things work onto the *frequencies* of future events that haven’t happened, and have no guarantee of being like that at all.

    How is this different from Bayesian models? The Bayesian model makes NO assumptions about the sampling distribution of the data, it uses “how much you know” about the data outcomes to figure out “how much you know” about the unobservable quantities you need to make predictions (your parameters).

    If the frequency distribution of prediction errors doesn’t look like the probability you assigned in your likelihood, it’s not *necessarily* a modeling error. It is if you’re specifically trying to fit the frequency distribution, but often what you’re trying to figure out is something else. For example, you might be trying to figure out the total weight of crates of orange juice, and your likelihood chosen gives probability to regions of space that you KNOW are physically impossible, but you’re being a bit lazy, and you still get good estimates of the average (see my article that uses the orange juice example).

  3. Daniel Lakeland
    May 12, 2016

    So, looking at the Orange Juice example:

    We see that 1.6 is the maximum a posteriori estimate for the average. The likelihood was based on iid sampling from an exponential distribution.

    So, plugging in the 1.6 estimate, and knowing that the mean of the sample is the maximum likelihood estimate we can see how well resampling from the exp(1/1.6) does as an estimate of the standard error:

    > foo <- replicate(10000,mean(rexp(100,1/1.6)))
    > sd(foo)
    [1] 0.1595771

    whoops, I misread the previous post and thought we were sampling 100, we’re only sampling 10

    > foo <- replicate(10000,mean(rexp(10,1/1.6)))
    > sd(foo)
    [1] 0.5083227

    Which means the frequentist confidence interval is maybe 0.58 to 2.62 (it was 1.28 to 1.92 with the mistaken sample size of 100), whereas the Bayesian interval is 1.11 to 2.14. This shows that the sampling distribution of the mean, obtained by resampling from an exponential with rate 1/1.6, is not the same as the Bayesian posterior. With the mistaken sample size it looked like a smaller, tighter interval that is unjustified.

    In this case, once I got the sample size right, it’s actually a much larger interval for the frequentist version, and that’s because the Bayesian prior puts in info that the maximum volume of a jug is just a little over 2.0 L, whereas the exponential likelihood has density out to infinity.

    As soon as you start sampling from the posterior and then resampling from the exponential… you know you’ve lost all sight of what’s going on!

    • Daniel Lakeland
      May 13, 2016

      Doing the pure data-resampling version:

      > foo <- replicate(1000,mean(sample(wsamp,10,replace=TRUE)))
      > mean(foo)
      [1] 1.695048
      > sd(foo)
      [1] 0.06195261

      Which is a lot tighter than the version where we use the exponential model to resample from. That is as expected, since the exponential model was chosen *on purpose* to show that Bayesian inference is not conditional on you having a likelihood that is correct frequency-wise.

      So, to summarize:

      The Bayesian model asks how well does a particular parameter put the data into the high probability region of the data distribution, and does not depend on fitting the data frequency-wise, it depends more on the connection between the parameters and the prediction distribution (ie. the goodness of the model)

      The Bayesian Maximum Likelihood calculation is just a special case of the general Bayesian calculation but with a flat nonstandard prior. It works for the same reason as above.

      The Frequentist Maximum Likelihood version can ask questions about the frequencies with which sampling data and then calculating maximum likelihood would get you a certain distance from the “true value” but it can do it two ways:

      1) Using the parametric form of the likelihood to approximate the frequencies, which will FAIL even with a likelihood that succeeds for Bayesian inference. You need to reasonably approximate the actual frequencies of the data for this to work, and in particular it has to be *meaningful* to say that there *are* stable frequencies for the data, which is not even close to guaranteed in most real-world cases, like the bacterial hospital culturing example.

      2) Using resampling with replacement of the data, which will fail to predict the present when the data is not a random sample of a finite population, and fail to predict the future when the data itself is from a process that isn’t time-stable (ie. the restaurant review process, not the sampling square-footage of houses in LA process) but will at least give good results when the sample is actually a RNG generated sample from a real finite population.
