Confusion of definitions Bayesian vs Frequentist (vs yet a third category)
Generally I consider it unhelpful to argue about definitions of words. When that arises I think it's important to make your definitions clear, and then move on to substantive arguments about things that matter (like the logic behind doing one thing vs another).
So, what makes a calculation Bayesian vs Frequentist and is there any other category to consider? Here are my definitions:
Bayesian Inference: The calculation begins from a position where probability defines how much you know about an aspect of the world, and proceeds to calculate the consequences of what you know together with what you observed (your data) using the sum and product rule of probability to arrive at a new explicit description of a state of knowledge. Analogy is that in Prolog you use Aristotelian logic to compute true statements from facts you supply, and in Stan you use Bayesian logic to compute samples from output distributions from data and your input distributions.
Frequentist Inference: The calculation begins with a description of the data collection or generation process as repeated sampling from a collection of possibilities, and proceeds to calculate how often something might happen if you do more sampling in the future, and how that relates to some unknown aspect of the collection. (Wikipedia definition seems to agree qualitatively)
Those are the definitions I'm using. You may well believe these are not good definitions. But I think these are principled definitions, in that they capture the essence of the metaphysical ideas (metaphysics: "a traditional branch of philosophy concerned with explaining the fundamental nature of being and the world that encompasses it" wikipedia)
The problem arises in that a lot of classical statistics is kind of "neither" of these, or maybe "a mishmash of both". My take on that is more or less as follows:
- There are Frequentist procedures which are very clearly Frequentist. For example the chi-square, Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilks, and similar tests. For a hard-core set of tests we can look at the Die Harder tests for Random Number Generators which explicitly try to find ways in which RNGs deviate from uniform and IID. These things really do answer the question "does this sample look like a sample from an RNG with given distribution?"
- There are Bayesian models, in which people choose a model based on what they know about a process (physics, chemistry, biology, economics, technological design of measurement instruments, etc) and then turn that into a joint distribution over the data and some parameters that describe the process, typically via two factors described as a likelihood and a prior. In this case, the question of whether we can reject the idea that the data came from a particular distribution on frequency grounds does not enter into the calculation.
- There are a large number of people who do things like what's done in this paper where they construct a likelihood in what I would call a Bayesian way based on their knowledge of a process (in this case, it's a point process in time, detection of gram-negative bacteria in a hospital test, they know that the average is not necessarily the same as the variance, so they choose a default likelihood function for the data, a negative binomial distribution which has two parameters, instead of a Poisson distribution with only one but there emphatically is NOT some well defined finite collection of negative-binomially distributed events that they are sampling from). They then take this likelihood function and treat it as if it were describing a bag of data from which the actual data were drawn, and do Frequentist type testing to determine whether they can reject the idea that certain explanatory parameters they are thinking of putting into the model could be dropped out (what you might call a "Frequentist tuning procedure"). In the end they choose a simplified model and maximize the likelihood (a Bayesian calculation).
Since 1 and 2 are pretty straightforward, the real question is how to understand 3? And, probably the best way to understand 3 is that 3 is what's done when you've been classically trained and therefore have no real critical idea of the principles behind what you're doing. That is, you know about "what is done" but not "why". Note, that doesn't make it "wrong" necessarily, it's just that the people who do it rarely have a guiding principal other than "this is what I know how to do based on my education".
On the other hand, I think it's possible to argue that there is a principle behind 3, I just think that most people who do 3 aren't much aware of that principle. The basic principle might be "data compression". These kinds of models start with a Bayesian model in which a likelihood is picked based not on sampling from some known population but instead "what is known about where the data might lie" and then this model is tuned with the goal of giving a sufficiently short bit-length approximation to a Bayesian model that has good enough Frequency properties to reduce your bit cost of transmitting the data.
Then, if you collect additional data in the future, and it continues to look a lot like the past, awesome, you can send that data to your friends with low bandwidth costs. For example, instead of say a 16 bit integer per observation, you might be able to send the data using only on average 9 bits per observation, with no approximation or loss of information relative to the 16 bit version.
In this sense, (3) isn't really an inference procedure at all, it's a tuning procedure for a channel encoding. If you do some Frequentist tests, and decide to drop factor X from your model, it's not a commitment to "factor X = 0 in the real world" it's more an approximation to "Factor X doesn't save me much on data transmission costs if I had to send a lot more of this kind of data"
Although there are a certain number of "information theoretic" statisticians who actually think of this explicitly as a guiding principle, I don't think this is the position you'd get if you asked the authors of that paper what their motivation was for the statistical procedures they used.
Is 3 Bayesian inference? is 3 Frequentist inference? the answer is no and no. 3 is not inference at all, it's communications engineering. But, it does use Bayesian ideas in the development of the likelihood, and it does use Frequentist ideas in the evaluation of the channel bandwidth. I think some of those principled information theoretic statisticians may have an argument that it is some kind of inference, in other words, that the word inference should include this type of stuff, but I don't know much about those arguments and I don't really care to argue about the definitions of words. Communications Engineering seems to be a perfectly good description of this process, the question is, does it accomplish what you need to accomplish? In most scientific questions, I doubt it.
Again, thanks to ojm who keeps pushing back and making me think about things in more depth.
PS: the careful, principled version of this stuff is called "Minimum Message Length" modeling and Wikipedia has a nice intro. In essence it takes a Bayesian model, breaks it down to a discretized version, making decisions about the precision of the discretization which are ultimately based on a loss function related to Frequentist ideas (since that's what's relevant to transmission of a message). The version performed by many classically trained statisticians such as the ones in the article from PLoS One is more or less an ad-hoc version.