Gelman posts about a paper in which social scientists run a quadratic regression and discover that it produces a U shape!!! Amazing. The story is that too much talent on your sports team will lead to poorer performance. Further evidence that understanding function approximation and basis expansion is seriously flawed among many researchers.
Uri and Leif weren't able to get the data, but were able to get the original researchers to run a test of a hypothesis. The idea is they take their quadratic regression, find the maximum or minimum point, and split the data at that point, running two linear regressions, and testing whether the sign of the slope is different. This is a bit better than assuming the quadratic means there really is a U shape, but basically still wrong. The problem is NHST of a slope against zero isn't the right way to figure out whether you have a U shape. In fact NHST isn't the right way to do much of anything (except test the output of PRNG algorithms).
I hear you asking: Ok Dan, well then, how should we do this right?
First off, Uri and Leif have some data on competition vs patents and claim that this is a good testbed for their method. I disagree because I don't think number of patents is a good measure of "creativity" and there are serious problems with that analysis even before you have the data. The sports data on the other hand is pretty good conceptually, but the authors didn't provide the data so I can't just show you the kinds of stuff you ought to do...
So I'll have to resort to just telling you. The first thing is that you need a system which is sufficiently flexible to fit whatever nonlinear shape your data "really" has. With the sports data the x axis is logically limited to the range [0,1] (0 to 100% of the team is high-talent). This is a bounded interval, and polynomials are a complete basis for continuous functions on a bounded interval. That means, with sufficiently high order polynomials we can fit any nonlinear shape. However, with noisy measurements, we also have a problem of potentially over-fitting. Furthermore, we'll do better for mathematical reasons if we use orthogonal polynomials.
But, a complete orthogonal basis isn't necessarily the best way to go, especially in a Bayesian analysis. For example, we have the idea from sports fans who haven't seen the data that we should get a diminishing returns, horizontal asymptote type result. So, perhaps we should include such a family in our analysis. Unfortunately, I couldn't get Chebyshev polynomials into Stan because the Stan authors didn't really understand them even though someone DID do the work to include them. So, let's set up a model for y vs x using regular old non-orthogonal polynomial regression.
We're going to expect some correlations between the various coefficients, because we don't have an orthogonal system, but Stan is pretty good at working that out, in the end all we care about are the function shapes, not the coefficients anyway. The second term represents a sigmoid function which has horizontal asymptotes at 0 and h and which has a transition between them in the region of . The polynomial represents a general purpose correction to this basic shape. It would be even better to have Chebyshev polynomials and to give strong priors to the coefficients for reasons I won't go into (but they do involve a less-correlated posterior distribution for the coefficients, easier to sample from). With the sigmoid function doing a reasonable job of representing our core expectation, we can at least put priors on the that are zero-mean, as we wouldn't be too surprised to find that large polynomial corrections aren't necessary.
Ok, so now, pry the data out of the researchers fingers and fit the model in Stan. Then, take a sample of 100 of the coefficient vectors and plot the curves in a spaghetti plot. Now, calculate the derivative of this function (if you haven't done calculus in a while that's ok, use Maxima or ask Wolfram Alpha or something similar). In a separate plot, spaghetti plot the derivative. It'll be noisier, as the derivative is less constrained by the data, but we've regularized our result by truncating at 5th order polynomials and we can regularize further by putting strong priors on the . Spaghetti plotting the derivative, does it almost always have a positive value on the left and then dive down to a negative value on the right?
If you have those plots of the curve and its derivative, and they really do produce a fairly consistent change in the sign of the derivative, THEN you have reasonable evidence for a transition to negative slope. The posterior probability of a curve shape can be directly assessed in these plots. There's no need to "test" your hypothesis using an NHST test that is appropriate for assessing whether the coefficients are the output of an RNG. But, if you do find that out of the 100 spaghetti derivative plots all but 1 to 5 of them have entirely negative values at the high end, then, yes you have relatively strong evidence for your decreasing performance at the high end.
The fact is, the posterior distribution is what summarizes our uncertainty. A binary decision about rejecting a linear regression on the upper end is NOT the same thing for the same reasons that Uri and Leif give about why the quadratic isn't good as an indicator of a U shape.
Ok, so I'm down on the NHST analysis of allergic rhinitis. And, I probably am not going to get ahold of the data to re-do the analysis, but what kinds of things SHOULD be done if the data were actually available?
First off, let's acknowledge that the goal of all of this is to make people feel and function better. There is therefore a latent "how good do I feel" variable which we can only measure imperfectly which is our real subject of interest. There are two ways they tried to measure this variable. The first is the RQLQ (symptom) score which says how bad the person perceives their symptoms, and the second is the RMS (medicine usage) score which tells how much the person used medication to combat the symptoms. However, the RMS score also is a measure of a kind of forcing function to reduce symptoms. The medications available were Cetirizine and oral steroids, and a diary was kept by each patient so we actually know how much medicine of what types was used on every day of the study..
So, let's imagine a function S(t) for "symptoms". We have some causal process we think is going on in which S increases through time when exposure to pollen occurs but eventually saturates (fortunately pollen rarely actually makes your whole body combust). We have some treatments thought to reduce the symptoms, including medications which are taken according to the patients own choice, based on a tolerance for symptoms, and in addition, sham acupuncture, or real acupuncture. Real acupuncture is given either in period or weeks and symptoms assessed at t=0, t=8, t=16, t=60. Furthermore, we have individual patient beliefs about whether sham or real acupuncture was being applied, which were assessed after their 3rd session with needles.
The model should therefore be that S(t) is observed with errors using the RQLQ assessment, and furthermore that the choice to take various medications at time t is based on a combination of S(t) and individual tolerance for the symptoms (so that M(t) gives us information about the patient's S(t)). If M(t) is the medication dosage function (dose per day) then and also for each patient separately, where M is medicine dose, Ac is acupuncture dose, Sa is sham acupuncture dose B is belief in whether they are getting sham or real acupuncture, and P(t) is pollen exposure which we might imagine as a constant for each clinical center.
Put all of those ideas together with some simplifying mathematical assumptions, analyze with a Bayesian computational software program like Stan, and discover how strongly Ac, Sa, and M affect symptoms. Compare the effect Ac/M and Sa/M averaged across people to get an answer to the question "what is the average effect of Acupuncture measured relative to the average effect of taking Cetirizine on patient Symptoms, and how does it compare to sham acupuncture?"
From this paper on a Randomized Controlled Trial of acupuncture for allergic rhinitis
We therefore started with noninferiority tests of change in RQLQ score and concluded that real acupuncture was noninferior if the left limit of the covariance-based, 2-sided 95% CI surrounding the between-group difference between real and sham acupuncture was greater than the noninferiority margin of ⫺0.5 point (12). If we showed noninferiority, we repeated the analysis for the RMS outcome using a noninferiority margin of ⫺1.5 points. This threshold was chosen on the basis of a review that we performed of unpublished RQLQ and RMS data suggesting a rough equivalence of scales at a ratio of 1:3, so the RQLQ threshold of 0.5 point translated to an RMS threshold of 1.5 points. If this procedure also showed noninferiority, we tested for superiority and concluded that real acupuncture was superior to sham acupuncture if at least one of the Bonferroni-adjusted, analyses of covariance– based, 2-sided 97.5% CIs surrounding the between-group difference in RQLQ score and RMS was completely greater than zero.
This is how NHST turns people's brains to mush. In this study both the acupuncture group and the "medication only" group got acupuncture, it just happened at different times. They have assessment at 4 time points: t=0 initial baseline, t=8 weeks (only acupuncture group has had acupuncture), t=16 weeks (acupuncture group had acupuncture between week 0 and 8, medication group had acupuncture between t=8 and t=16), and again at week 60 (no treatment between t=16 and t=60 weeks). Clearly, what's needed is a time-series analysis for score as a function of what treatments happened when, but instead, we get "noninferiority and superiority testing at a predefined threshold of ..." bullshit.
Unfortunately, their reproducible research statement says "Data set: Certain portions of the analytic data set are available to approved individuals through written agreements with the author or research sponsor." which means getting the data is a lot more complicated than just downloading it off an open public website.
In this comment at Gelman's Blog "Ney" asks "Does the 5000:1 mean that a team like Leicester would be expected to win the English championship only once within 5000 years?"
My answer to that is no. At least if a Bayesian gives a 5000:1 odds on something like a team winning a championship or a particular earthquake occurring or a particular financial event, it need not have any frequency of occurrence in the long run interpretation. But what is the interpretation?
Bayesian probabilities under Cox/Jaynes probability just says that different things have different degrees of credibility or plausibility or believability or whatever. In this system, Probability is like Energy, if we write energy in units where all the energy in the universe is 1 unit, then since energy is conserved, we can account for what fraction of the universes energy there is in any one object. Same idea for Bayesian probability, what fraction of our credibility is associated with a particular value or small range of values?
So, we could imagine a whole series of events, say each week, there are some soccer games played, and then someone wins, and that means that different matches happen next week, and then someone wins, and etc etc etc. By the end of the season there is some enormous combinatorial explosion of different possible "paths" through the season. A 5000:1 odds for a Bayesian roughly means that of these N possible paths through the season 5000/5001 N of them have Leicester losing and 1/5001 N of them have Leicester winning. Now, it's not quite that simple, because there's no reason why each of the N possible paths have to have equal plausibility, so some paths might "count more" than others, but it's clear that we're not talking about what will happen in 5000 future seasons, we're talking about the weight of plausibility among all the different detailed ways in which the outcome LEICESTER WINS THIS SEASON could occur.
In philosophical discussions it's often the case that we use a kind of shorthand which can be confusing. In many cases, we "personify" particular assumptions. That is, we sort of imagine a person who holds those assumptions as true and talk about what such a person would also logically believe. For example:
"A Frequentist has to believe that a moderate sized sample will fill up the high probability region of the sampling distribution."
"A Bayesian believes that if a moderate sized sample doesn't fill the data distribution it's not a problem for the model"
These and other statements like them are really shorthand for more complicated sentences such as:
"Once you assume that your distribution is the long run frequency distribution of IID repeated trials, this also implies that a moderate sized sample will have data points throughout the high probability region as can be seen by simulations"
"The logic of Bayesian inference does not require that we believe the data distribution is the frequency distribution, so if there is a portion of the high probability region which isn't occupied by any data in a moderate sized sample it doesn't invalidate the basic assumptions of the model"
Real people are free of course to have additional assumptions beyond the basics (for example a person doing a Bayesian analysis might actually want to fit a frequency distribution, so the model with gaps in the sampling would be considered wrong), or to acknowledge that their choice of distribution is intentionally approximate or regularized in a way that doesn't fully satisfy the basic assumption. But it's a useful construct to think about a person who takes their model at literal face value and then see what else that model logically implies they should believe. It helps to detect when additional assumptions are needed, and what they might be, or it also helps to detect when the basic assumptions of a model are contrary to reality.
As such, when talking about "Frequentists" vs "Bayesians" I'm really talking about what the math associated to pure forms of those kinds of analyses implies, not what particular real people actually think.
So much of Statistics is taught as formulas, procedures, heuristics, calculation methods, techniques etc that many people (myself included up to a few years ago) never really see an overarching organizing principle. Discussing those organizing principles explicitly can be very useful to help people with their future analyses.
Shravan at Gelman's blog writes "I moved away from frequentist to Bayesian modeling in psycholinguistics some years ago (2012?), and my students and I have published maybe a dozen or more papers since then using Bayesian methods (Stan or JAGS). We still have the problem that we cannot replicate most of our and most of other people’s results."
So, Bayesian modeling isn't a panacea. Somehow I think this has been the take-away message for many people from my recent discussions of the philosophy of what Bayes vs Frequentist methods are about. "Switch to Bayes and your models will stop being un-reproducible!"
So, I feel like I've failed (so far, in part), because that certainly isn't my intention. The only thing that will make your science better is to figure out how the world works. This means figuring out the details of the mechanism that connects some observables to other observables using some parameter values which are unobservable. If you're doing science, you're looking for mechanisms, causality.
But, I believe that Bayesian analysis helps us detect when we have a problem, and by doing Bayesian analysis, we can force a kind of knowledge of our real uncertainty onto the problem, that we can't do with Frequentist procedures!
That detection of un-reproducibility, and the ability to force realistic evaluations of uncertainty, both help us be less wrong.
Some Bayesian Advantages
So, what are some of the things that make Bayesian models helpful for detecting scientific hypotheses that are unreproducible or keep our models more realistic?
Consistency is Detectible:
First off, there's a huge flexibility of modeling possibilities, with a single consistent evaluation procedure (Bayesian probability theory).
For example, you can easily distinguish between the following cases:
- There is a single approximately fixed parameter in the world which describes all of the instances of the data. (ie. charge of an electron, or vapor pressure of water at temperature T and atmospheric pressure P, or the maximum stack depth of linguistic states required for a pushdown automaton to parse a given sentence)
- There are parameters that describe individual sets of data and all of these parameters are within a certain range of each other (ie. basal metabolic rate of individual atheletes, fuel efficiency of different cars on a test track).
- There is no stable parameter value that describes a given physical process (ie. basal metabolic rate of couch potatoes who first start training for a marathon, fuel efficiency of a car with an intermittent sensor problem, or if you believe that Tracy and Beall paper Andrew Gelman likes to use as an example, the shirt color preferences of women through time).
When Shravan says that his Bayesian models are unreproducible, what does that indicate?
Most likely it indicates that he's created a model like (1) where he hypothesizes a universal facet of language processing, and then when he fits it to data and finds his parameter value, if he fits the model to different data, the parameter value isn't similar to the first fit. That is, the universality fails to hold. If, in fact, he were to use the posterior distribution of the first fit as the prior for the second, he'd find that the posterior widened or maybe shifted somewhere that the posterior from the first analysis wouldn't predict when the additional data was taken into account.
Or, maybe he's got a model more like (2) where he hypothesizes that at least there's some stable parameter for each person but that all the people are clustered in some region of space. But, when he measures the same person multiple times he gets different parameter values for each trial, or when he measures different populations he winds up with different ranges of parameter values.
Or, maybe he has a model like (3) where he expects changes in time, but the kinds of changes he sees are so broad and have so little structure to them that they aren't really predictive of anything at all.
In each case, probability as extended logic gives you a clear mechanism to detect that your hypothesis is problematic. When you collect more data and it fails to concentrate your parameter estimates, or it causes the parameters to shift outside the region they had previously been constrained to, it indicates that the model is inconsistent and can't explain the data sufficiently.
Bayesian models can be structured to be valid even when the data isn't an IID sample:
IID sampling is just way more powerful mathematically than the reality of running real experiments, so that assumption implicit in a Frequentist test, will fool you into failing to realize the range of possibilities. In other words, Frequentism relies on you throwing away knowledge of alternatives that might happen under other circumstances ("let the data speak for themselves" which is sometimes attributed to Ronald Fisher but may not be something he really said)
Suppose instead you are analyzing performance of a linguistic test in terms of a Frequency distribution of events. The basic claim is that each event is an IID sample from a distribution where x is the measured value, and a,b are some fixed but unknown parameters. Now, you do a test to see if and you reject this hypothesis, so you assume and you publish your result. First off, the assumption of a fixed and IID sampling is usually a huge huge assumption. And, with that assumption comes a "surety" about the past, the future, about other places in the world, other people, etc. In essence what that assumption means is "no matter where I go or when I take the measurements all the measurements will look like samples from F and every moderate size dataset will fill up F". How is this assumption different from "no matter where I go all I know about the data is that they won't be unusual events under distribution P" which is the Bayesian interpretation?
Consider the following specific example: P = normal(0,10), and I go and I get a sample of 25 data points, and I find they are all within the range -1,1. A Frequentist rejects the model of normal(0,10). Any random sample from normal(0,10) would contain values in the ranges say -15,-1 and 1,15 but this sample doesn't.
Does the Bayesian reject the model? No, not necessarily. Because the Bayesians know that a) the probability is in their head based on the knowledge they have, whereas the Frequency is in the world based on the laws of physics. And, b) in many many cases there is no sense in which the data is a representative random sample of all the things that can happen in the world. So, even if Bayesians really do believe that is a stable Frequency distribution across the full range of conditions, there's nothing to say that the current conditions couldn't be all within a small subset of what's possible. In other words, the Bayesian's knowledge may tell them only about the full range of possibilities, but not about the particulars of how this sample might vary from the full range of possibilities. Since Bayesians know that their samples aren't IID samples from they need not be concerned that it fails to fill up all of . The only time they need to be concerned is if they see samples that are far outside the range of possibilities considered. Like say x=100 in this example. That would never happen under normal(0,10);
To the Bayesian, x ~ normal(0,s); s ~ prior_for_s(constants) is very different than x ~ normal(0,constant); In the first case the scale s is unknown and has a kind of range of possibilities which we can do inference on. The assumption is implicitly that the data is informative about the range of s values. In the second case, the constant scale comes from theoretical principles and we're not interested in restricting the range down to some other value s based on the data.
This can be important, because it lets the Bayesian impose his knowledge on the model, forcing the model to consider a wide range of possibilities for data values even when the data doesn't "fill up" the range. Furthermore, it's possible to be less dogmatic for the Bayesian, they can provide informative priors that are not as informative as the delta-function (ie. a fixed constant) but also will not immediately mold themselves to the particular sample. For example s ~ gamma(100,100.0/10) says that the scale is close to 10 and this information is "worth" about 100 additional samples, so don't take the 25 samples we've got as fully indicative of the full range.
The assumption of random sampling that comes out of Frequentist theory says that there's no way in hell that you could fail to fill up the Frequency distribution if your samples were 25 random ones. This means implicitly that the current data and any future data look "alike" for sufficiently large sample sizes. That's a very strong assumption which only holds when you are doing random sampling from a finite population using an RNG. But all of the Frequentist tests rely on that assumption. Without that assumption you simply can't interpret the p values as meaningful at all. If you know that your sample is not a random sample from a finite population but just "some stuff that happened through time where people were recruited to take part in various experiments in various labs" then assuming that they are random samples from a fixed population says automatically "when n is more than a smallish number, like 25 to 50, there is no way in hell we've failed to sample any part of the possible outcomes"
Since the p values you get out of Frequentist tests are typically only valid when the data really IS an IID sample of the possibilities. You're left with either being fooled by statistics, or being a Bayesian and imposing your knowledge onto your inferences.
My wife had a problem where she had measurements of the expression level of thousands of genes under two different genotypes and two different time points of development. Subsetting these genes into those she was interested in, she still had the problem of understanding the differences in a multi-thousand-dimensional vector. I invented a plot where the genetic expression profile was described in terms of a function of the rank of the expression under a reference condition. It's easier to understand if you've seen the data, so here's an example plot with the specifics of the labels removed (click the image to get a zoomed version)
The idea is that we're displaying the logarithm of normalized expression levels on the y axis, with the genes shown in a particular order. The order they're shown in is always the rank-order under the reference condition +/- (heterozygous). That rank is preserved for all of the plots at both time points (early and late). So when you plot the alternate condition on the same graph, you get simultaneously a view of how much expression there is typically for this gene, how much variation there is on average between genotypes, and how much of a difference there is for the particular gene compared to the reference genotype. Finally, plotting the later time point shows you that BOTH genotypes diverge from the original expression profile, but in a noisy way, there is still an overall curving upward trend which is somewhat similar to the original.
Now, suppose you have a high dimensional parameter vector in a Bayesian model. You could choose the rank of each parameter within the average parameter vector as the x values, and then plot that average parameter vector as the reference condition. Then, over that, you can spaghetti plot particular samples, which will show you how much variation there is as well as to some extent correlations in the variation, for example when parameter number 50 is higher than average, maybe parameter number 70 is often lower than average, so that the curve will have an upward blip at 50 and a downward blip at 70 for many of the spaghetti curves.
This would be particularly useful if the parameter values were fairly well constrained, if they have a huge amount of variation, then the spaghetti plot could get really really messy, even messier than the "late" curves in this plot, for example.
Another thing to note is that this plot will work a LOT better if you have dimensionless parameters which are scaled to be O(1), none of this "parameter 1 is in ft/s and parameter 2 is in dollars/hour and parameter 3 is in Trillions of Dollars per Country per Decade" stuff.
Suppose you want to hold a lottery, and you want it to be such that no one has a good way to influence the outcome of the lottery. What you need is a cryptographic seed for a strong random number generator.
A strong RNG can be made by encrypting a counter using a block cipher in CBC mode starting at a randomly chosen initial number with a randomly chosen seed. Or by using a strong stream cipher (a cipher that outputs one bit-at-a-time basically)
But how to get the initial seed? There are many ways, but one of the things you want is that no one person can fiddle around with it too much. And also, you want transparency, you aren't going to rely on some compiled function to do what you need because that requires people to audit the function something most people can't do.
If your audience understands cryptography. You can simply execute on the Linux command line dd if=/dev/random bs=1024 count=10 | sha256sum
and read off the resulting hash value. Otherwise reasonably analog ways to create initial vectors for valuable lotteries would be to roll 10 different twenty-sided dice 10 times and read them off into a text file, then cat the file into sha256sum, if the dice are perfectly symmetrical they'd generate around 400 bits of entropy which would be compressed down to the 256 you need. Or, you could roll regular 6 sided dice, say 10 at a time about 10 times, to get around 258 bits of entropy which still fills up the SHA256 hash value.
Once you've done this, you can then set about randomly choosing grants to fund, a step up vs the BS we currently do 😉
Generally I consider it unhelpful to argue about definitions of words. When that arises I think it's important to make your definitions clear, and then move on to substantive arguments about things that matter (like the logic behind doing one thing vs another).
So, what makes a calculation Bayesian vs Frequentist and is there any other category to consider? Here are my definitions:
Bayesian Inference: The calculation begins from a position where probability defines how much you know about an aspect of the world, and proceeds to calculate the consequences of what you know together with what you observed (your data) using the sum and product rule of probability to arrive at a new explicit description of a state of knowledge. Analogy is that in Prolog you use Aristotelian logic to compute true statements from facts you supply, and in Stan you use Bayesian logic to compute samples from output distributions from data and your input distributions.
Frequentist Inference: The calculation begins with a description of the data collection or generation process as repeated sampling from a collection of possibilities, and proceeds to calculate how often something might happen if you do more sampling in the future, and how that relates to some unknown aspect of the collection. (Wikipedia definition seems to agree qualitatively)
Those are the definitions I'm using. You may well believe these are not good definitions. But I think these are principled definitions, in that they capture the essence of the metaphysical ideas (metaphysics: "a traditional branch of philosophy concerned with explaining the fundamental nature of being and the world that encompasses it" wikipedia)
The problem arises in that a lot of classical statistics is kind of "neither" of these, or maybe "a mishmash of both". My take on that is more or less as follows:
- There are Frequentist procedures which are very clearly Frequentist. For example the chi-square, Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilks, and similar tests. For a hard-core set of tests we can look at the Die Harder tests for Random Number Generators which explicitly try to find ways in which RNGs deviate from uniform and IID. These things really do answer the question "does this sample look like a sample from an RNG with given distribution?"
- There are Bayesian models, in which people choose a model based on what they know about a process (physics, chemistry, biology, economics, technological design of measurement instruments, etc) and then turn that into a joint distribution over the data and some parameters that describe the process, typically via two factors described as a likelihood and a prior. In this case, the question of whether we can reject the idea that the data came from a particular distribution on frequency grounds does not enter into the calculation.
- There are a large number of people who do things like what's done in this paper where they construct a likelihood in what I would call a Bayesian way based on their knowledge of a process (in this case, it's a point process in time, detection of gram-negative bacteria in a hospital test, they know that the average is not necessarily the same as the variance, so they choose a default likelihood function for the data, a negative binomial distribution which has two parameters, instead of a Poisson distribution with only one but there emphatically is NOT some well defined finite collection of negative-binomially distributed events that they are sampling from). They then take this likelihood function and treat it as if it were describing a bag of data from which the actual data were drawn, and do Frequentist type testing to determine whether they can reject the idea that certain explanatory parameters they are thinking of putting into the model could be dropped out (what you might call a "Frequentist tuning procedure"). In the end they choose a simplified model and maximize the likelihood (a Bayesian calculation).
Since 1 and 2 are pretty straightforward, the real question is how to understand 3? And, probably the best way to understand 3 is that 3 is what's done when you've been classically trained and therefore have no real critical idea of the principles behind what you're doing. That is, you know about "what is done" but not "why". Note, that doesn't make it "wrong" necessarily, it's just that the people who do it rarely have a guiding principal other than "this is what I know how to do based on my education".
On the other hand, I think it's possible to argue that there is a principle behind 3, I just think that most people who do 3 aren't much aware of that principle. The basic principle might be "data compression". These kinds of models start with a Bayesian model in which a likelihood is picked based not on sampling from some known population but instead "what is known about where the data might lie" and then this model is tuned with the goal of giving a sufficiently short bit-length approximation to a Bayesian model that has good enough Frequency properties to reduce your bit cost of transmitting the data.
Then, if you collect additional data in the future, and it continues to look a lot like the past, awesome, you can send that data to your friends with low bandwidth costs. For example, instead of say a 16 bit integer per observation, you might be able to send the data using only on average 9 bits per observation, with no approximation or loss of information relative to the 16 bit version.
In this sense, (3) isn't really an inference procedure at all, it's a tuning procedure for a channel encoding. If you do some Frequentist tests, and decide to drop factor X from your model, it's not a commitment to "factor X = 0 in the real world" it's more an approximation to "Factor X doesn't save me much on data transmission costs if I had to send a lot more of this kind of data"
Although there are a certain number of "information theoretic" statisticians who actually think of this explicitly as a guiding principle, I don't think this is the position you'd get if you asked the authors of that paper what their motivation was for the statistical procedures they used.
Is 3 Bayesian inference? is 3 Frequentist inference? the answer is no and no. 3 is not inference at all, it's communications engineering. But, it does use Bayesian ideas in the development of the likelihood, and it does use Frequentist ideas in the evaluation of the channel bandwidth. I think some of those principled information theoretic statisticians may have an argument that it is some kind of inference, in other words, that the word inference should include this type of stuff, but I don't know much about those arguments and I don't really care to argue about the definitions of words. Communications Engineering seems to be a perfectly good description of this process, the question is, does it accomplish what you need to accomplish? In most scientific questions, I doubt it.
Again, thanks to ojm who keeps pushing back and making me think about things in more depth.
PS: the careful, principled version of this stuff is called "Minimum Message Length" modeling and Wikipedia has a nice intro. In essence it takes a Bayesian model, breaks it down to a discretized version, making decisions about the precision of the discretization which are ultimately based on a loss function related to Frequentist ideas (since that's what's relevant to transmission of a message). The version performed by many classically trained statisticians such as the ones in the article from PLoS One is more or less an ad-hoc version.
How can we map back and forth between (nonstandard) density functions, and random number sequences?
Let be a finite sequence of numbers of length with a nonstandard integer. Let be a grid of values on the real line with a nonstandard integer and so that there are grid point values.
Let be nonstandard (L is ULTRA ULTRA ULTRA BIG, so that there are roughly "a nonstandard number of values for each grid point").
Consider a "randomly chosen" permutation of the integers from where we intuitively understand somehow how this is done (ie. we pick one of the permutations intuitively). Call this sequence (for choice) then imagine choosing at random from the sequence by choosing the integers in order calculating and choosing the appropriate . Assert without proof that is an IID random number generator for some distribution when is chosen "well enough". What distribution?
For each value, determine which grid point it falls in, and increment a counter for that grid point until we get to the end of the sequence. Call this counter . Construct the nonstandard function this is the nonstandard probability density associated with . When the standardization of this function exists, use it. Otherwise, perhaps you have something like a delta-function, where all the values are infinitesimally close to a single value, or some such thing. Those are useful too so we don't necessarily require a standard density. The fact that we include nonstandard densities means we are able to actually represent arbitrary probability measures, where the measure of an open set is
To reverse this process, start with the function , multiply the function by then for each grid point append a string of values equal to on our sequence. Once we have the full sequence we can "sample from it" by choosing a permutation of .
Now, let's canonicalize our sequences, namely if we have a sequence we can sort it in ascending order. Now for every function we can find a canonical of length .
How many different RNG sequences of length L are there for a given density function? There are one for each permutation, or . How big is that? Remember Stirling's approximation: which is already quite accurate at . So there are about different RNGs corresponding to the same distribution. We could perhaps require only permutations that leave no element in place, called the derangements, there are about of those so that'd put us at
Are all of these really "random number generators" that will satisfy Per Martin-Lof? Probably not. But a big fraction of them will be. Imagine that an oracle tells us Martin-Lof's "most powerful test" for a binary random number generator. Take the subset of the permutations which cause the sequence to pass that test at the level, when is the sequence where the first half of the sequence is 0 and the second half 1. Constrain our choice of to be one of these sequences, it's still going to be a big big number of sequences (asserted without proof).
So, it's always possible to take a probability measure (a nonstandard density) and convert it into many different sequences of IID samples from that probability measure (a random number generator) and back and forth.
Mathematically, this means in essence that "you can sample from any probability distribution". But this doesn't mean that placing a probability distribution over real world data implies that you think is a sample from that distribution. As a Bayesian, you are free to reject that. The probability involved is in your head or at least assumed for the purpose of calculation (ie. in the head of the Jaynesian Robot, or of MC Stan). As Jaynes says: "It is therefore highly illogical to speak of 'verifying' (3.8 [the Bernoulli urn equation]) by performing experiments with the urn; that would be like trying to verify a boy's love for his dog by performing experiments on the dog." - E.T. Jaynes, Probability Theory