Whoo, the title is a mouthful! Again, in discussions with Joseph over at the Entsophy blog some interesting issues in the philosophy of statistics were raised. It has me thinking about the following issue:

Suppose that you have some data $D_i$, some measured covariates $x_i$, and some brainpower. You would like to understand the process that produces $D$ in terms of things you know about the state of the world $x$ and things you assume about the state of the world, your model $G(x)$ (I'm using $G$ because we'll want to distinguish between probabilities $p(D)$ and frequencies in ensembles $f(D)$). So you are going to use the following formulation:

$$D_i = G(x_i) + \epsilon_i$$

and $\epsilon_i$ has a Bayesian probability distribution associated to it. Actually, technically there's a joint distribution over all the epsilons $p(\epsilon_1,\epsilon_2,\ldots,\epsilon_N)$. For the moment pretend we haven't thought about it yet.
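To make the formulation concrete, here is a minimal simulation sketch (the linear $G$, the parameter values, and the normal errors are illustrative assumptions, not part of the argument):

```python
import random

random.seed(0)

# Hypothetical model G(x; q) with parameters q = (a, b); a simple linear form.
def G(x, a, b):
    return a + b * x

# Simulate D_i = G(x_i) + eps_i, here with eps_i ~ Normal(0, 1) as one
# possible (not privileged) choice of error distribution.
a_true, b_true = 2.0, 0.5
xs = [float(i) for i in range(20)]
D = [G(x, a_true, b_true) + random.gauss(0.0, 1.0) for x in xs]

# The epsilons are just the discrepancies between data and prediction.
eps = [d - G(x, a_true, b_true) for d, x in zip(D, xs)]
```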

So, you build your functional form for $G$ and it has some unknown parameters $q_i$ about which you have some general knowledge, say order-of-magnitude estimates, from which you can build priors that approximate that knowledge, $p(q)$. Now you need to do some inference on $q$ to find out what reasonable values are for those unknown parameters. Using Bayes' Theorem, you write down:

$$p(q|D,x) \propto p(D|q,x)\,p(x|q)\,p(q)$$

[Edited: changed this to condition on the observed covariates $x$, and will let the measurement error in $x$ enter as $p(x|q)$, with the parameters $q$ including the unobservable measurement errors in the measurements of $x$.]

And you're off to the races doing MCMC, right? Not so fast! What the heck is $p(D|q)$? The usual jargon is "the likelihood", and well, the predictions from $G$ given $q$ are well known, so the only thing we're not sure about is how likely we are to see certain differences $\epsilon_i$ between the data $D_i$ and the predictions $G(x_i)$. In other words, we need to specify the Bayesian probability distribution on the $\epsilon$ values.

Well, first let's consider the "usual" answer. The usual answer is something like "the data are treated as IID draws from distribution 'foo'" (often normal, binomial, Poisson, or something else fairly convenient). The next thing after that is often some kind of heteroskedastic version, where the $\epsilon_i/\sigma_i$ are all IID and $\sigma_i$ is some known measurement-error scale that comes from instrument calibration data.
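A sketch of that heteroskedastic likelihood (the normal form and the particular data and $\sigma_i$ values are illustrative):

```python
import math

def normal_logpdf(z, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(D, preds, sigmas):
    # Independent normal errors: eps_i = D_i - G(x_i), each with its own
    # known calibration scale sigma_i, i.e. eps_i / sigma_i IID standard normal.
    return sum(normal_logpdf(d - g, 0.0, s) for d, g, s in zip(D, preds, sigmas))

# Toy numbers: perfect predictions score higher than imperfect ones.
D = [1.0, 2.0, 3.0]
sigmas = [0.1, 0.2, 0.3]
ll_perfect = log_likelihood(D, [1.0, 2.0, 3.0], sigmas)
ll_off = log_likelihood(D, [1.1, 2.2, 3.3], sigmas)
```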

But, from a Bayesian perspective the joint distribution over the $\epsilon$ values is really a belief about how closely the prediction equation will match the data in each circumstance, and it could be pretty much any probability distribution. The "frequentist" notion that there is a "data generating process" like IID normal draws is backwards; the actual direction of implication seems to be:

(Knowledge about model performance and measurement properties) $\rightarrow$ (Belief about errors when $q$ values are "reasonable", i.e. a mathematical specification of the likelihood) $\rightarrow$ (Belief about numerical values for "reasonable" values of parameters $q$, typically through MCMC procedures) $\rightarrow$ (Ensemble of measured errors under high-probability $q$ values; this ensemble will be constrained to look like draws from some single distribution, typically mean 0 and maybe having other Maximum Entropy-like constraints such as standard deviation, quantiles, or other structure).

The "frequentist influenced" Bayesian method goes more around in a circle: first assume a data generating mechanism (the final distribution above) then use that to produce an "automatic" likelihood (back to the likelihood at the beginning of the above chain), then use the prior and the likelihood to get the high probability parameter values $q$ via MCMC or whatever (you could do the "full frequentist" version of this by leaving off the priors and simply doing maximum likelihood for example).
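The "prior + likelihood, then MCMC for high probability parameter values" step can be sketched with a bare-bones random-walk Metropolis sampler (the data, prior, and likelihood here are all hypothetical):

```python
import math
import random

random.seed(5)

# Hypothetical data: observations scattered around an unknown location q.
D = [4.8, 5.1, 5.3, 4.9, 5.2]

def log_post(q):
    # Normal(0, 10) prior on q, IID Normal(q, 0.5) likelihood (up to constants).
    lp = -q * q / (2 * 10.0 ** 2)
    lp += sum(-(d - q) ** 2 / (2 * 0.5 ** 2) for d in D)
    return lp

# Random-walk Metropolis: propose a move, accept with probability
# min(1, posterior ratio); the chain's samples trace out p(q | D, x).
q, samples = 0.0, []
for _ in range(5000):
    prop = q + random.gauss(0.0, 0.5)
    if math.log(random.random()) < log_post(prop) - log_post(q):
        q = prop
    samples.append(q)

post_mean = sum(samples[1000:]) / len(samples[1000:])
```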

In other words, in the fully Bayesian world, depending on what we think the model and measurement devices ought to do (how we specify the likelihood), we will get different distributions on the parameters, and then when we treat the parameters as random variables and observe the discrepancies, we will get an ensemble of discrepancies which look like IID draws (or dependent draws if you want to specify some dependency structure) from some distribution, but the distribution they look like draws from need not be the one we use to define the likelihood.

It is more or less the case that we can often use a simple IID "frequency" model as the Bayesian probability model for defining the likelihood, and we will get in the end that the ensemble of errors looks like it came from that IID distribution we assumed. This is in essence a special case, but that special case is exactly the one that Joseph is complaining about in his "jumped the shark" post.

In other words, there is no "correct" likelihood that corresponds to actual facts about the world. The "Data Generating Process" doesn't really exist, so we can't be right or wrong about it, and confidence intervals derived from assumptions about the "Data Generating Process" need not have real coverage properties, because the world need not behave as if the "Data Generating Process" assumed by a frequentist statistician is actually correct.
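A toy demonstration of that point: assume an IID normal likelihood to locate a parameter while the actual discrepancies are uniform. The inference works fine, but the resulting ensemble of errors is bounded, so it cannot look like a sample from the normal we assumed:

```python
import random

random.seed(1)

# True process: D_i = q_true + e_i with e_i uniform on [-1, 1]
# (bounded and decidedly non-normal).
q_true = 5.0
D = [q_true + random.uniform(-1.0, 1.0) for _ in range(200)]

# Under an assumed IID normal likelihood with known scale (and a flat
# prior), the posterior for q centers on the sample mean.
q_hat = sum(D) / len(D)

# The fitted error ensemble stays bounded near [-1, 1]: it cannot be a
# sample from the normal distribution the likelihood assumed.
errors = [d - q_hat for d in D]
```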

12 Responses
1. August 22, 2013

Nice description. I have a minor(?) quibble and two more substantial comments:

1) You seem to be making a conceptual distinction between data and observed covariates (not including the latter in D) - what is this distinction? You also use the frequentist terminology distinguishing between random variables and parameters. I see no clear way to distinguish between different types of variables, except for those that are observed (which makes them data and makes me want to include them in D) vs those that are not (which should not be included in D).

2) I don't see how to translate this argument into the sort of models I am more interested in, where the data-generation model does not make sharp predictions of what will be observed (e.g. a discrete-state Markov chain parameterized by a transition probability matrix) and hence there is no natural notion of "error". A framework requiring notions like "measurement error" or "prediction error" is not sufficiently general, because those only address problems relating to imprecise measurement or prediction, which does not come close to covering the range of situations where we need to deal with uncertainty.

3) Even in your (and Joseph's) example where we do have a notion of error, I would argue that a data-generating process is still the most useful modeling approach. The point you make is just that it is possible to make overly strong assumptions about the data generating process (e.g. that it belongs to a specific parametric family) and then get reasonable answers (to many but not all questions) for the wrong reasons. E.g., assuming a normal data-generating process often gives good answers because the correct predictive distribution given what we know is normal, even when the data-generating process is not (i.e. the ensemble of errors will not be normal). But we could also make less restrictive (and more realistic) assumptions about the data-generating process and still obtain a normal predictive distribution.

2. August 22, 2013

1) The only conceptual distinction between $D$ and covariates $x$ is what we're primarily interested in predicting or understanding. We measure $x$ because we want to use it to understand how $D$ occurs (causally) or to predict $D$ in new circumstances without necessarily a causal model, but where information about $x$ will be available. In particular, $x$ has its own measurement uncertainty of course, and this should enter into the model. Parameters are generally things that I think of as not directly measurable. The only distinction I would make between "parameters" and "random variables" is that the second thing is a particular way of describing the uncertainty of a parameter. We can have parameters in a purely deterministic model too, where there is no notion of a random variable.

2) A discrete-state Markov chain model has output states. Don't you ultimately want to connect these states to some kind of data $D$? As soon as you do, you've got a "prediction" from the model and you've got data $D$ and you can compare the two. You may not be able to "subtract" the two like you would with say length measurements in meters, but you can compare them. For example if you have 4 categories, you can look at the frequency with which your model predicts the data, vs the frequency with which your model predicts other categories. Joseph's point seems to be that we can do Bayesian analysis of this "error frequency" just like we can for other things. If your model never gets compared to data at any point, then it's a purely mathematical probability-theory construct, not actually statistics (those are interesting too, but just quite different).

3a) In my example, the (model of the) data generating process is $G$, which could be its own pretty complicated thing, like a state-space ODE for ecological populations, or a compartment model of nitrogen build-up for the purposes of predicting decompression sickness, or an agent-based model of economic actors' choices. The Bayesian probability distribution we put on the errors (whether they be categorical or measurement errors) is a good distribution only when the real errors are in the high probability region, and those regions could look like anything: if you think a model is likely to branch off in one of two directions, you could *assign* bimodal errors, for example. This is in the *likelihood*. If you want to call that "the data generating process" then I guess you could, but I think it's a little bit misleading as terminology. First of all, there is no "the": there are many possible models. Second of all, it doesn't generate the data; it more or less describes what we think is reasonable to believe about the error. The data is generated by "stuff that happens", and the model output is generated by the model structure.

3b) "The point you make is just that it is possible to make overly strong assumptions about the data generating process (e.g. that it belongs to a specific parametric family) and then get reasonable answers (to many but not all questions) for the wrong reasons."

I think maybe you misunderstood this as backwards from what I wanted to say. We could get an ensemble of errors which looks like a finite sample from a normal distribution, but we need not use that normal distribution in defining the likelihood of our data. We can for example assign probabilities to individual prediction errors which are dependent on each other, and non-normal, and still wind up with a model which ultimately gives an ensemble consistent with that normal.
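One way to sketch this: build the errors as a dependent moving average of uniform shocks (neither IID nor assembled from normal pieces), and the pooled ensemble still passes casual checks for normality:

```python
import random
import statistics

random.seed(2)

# Dependent, non-normal construction: each error averages a window of k
# uniform shocks, so neighboring errors share shocks and are correlated.
k, n = 12, 5000
u = [random.uniform(-1.0, 1.0) for _ in range(n + k)]
# Var(uniform(-1,1)) = 1/3, so dividing by sqrt(k/3) gives variance ~1.
eps = [sum(u[i:i + k]) / (k / 3.0) ** 0.5 for i in range(n)]

mean = statistics.mean(eps)
stdev = statistics.pstdev(eps)
# Lag-1 autocorrelation is about (k-1)/k: these are strongly dependent draws,
# yet the pooled histogram of eps looks close to a standard normal.
lag1 = sum((eps[i] - mean) * (eps[i + 1] - mean) for i in range(n - 1)) / ((n - 1) * stdev ** 2)
```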

3. August 22, 2013

Daniel mentioned that my point was to show the unity of all distributions. I believe all distributions P(x) are a way to describe how well we can locate one-off facts x_true. Detecting signals in the presence of errors is just an important special case, which I wanted to highlight because it's usually thought of as the Frequentists' home turf.

Most of the stuff I spend my days on has no natural notion of error either (actually that's not strictly true, but that's another story). One example is finance, where we're trying to predict price sequences which can be thought of as a continuous or discrete path {Y(t): a<t<b}. The general formulation for time series is that there is a true path Y_true(t) that we need to predict, and a distribution (functional) on paths P[Y] is a way of describing how well we can locate the one-off fact Y_true(t). That was the point of the post "Law of Trading Edges", which I'm not sure anyone really got.

Markov assumptions play the same role on paths as do IID, they represent a kind of low information/high entropy special case. Naturally, having a clearer, more general, but also simpler understanding of what's really happening is only sometimes going to be useful. In some cases you're not practically going to be able to do any better than the high entropy/low information special cases that Frequentists are driven towards.

In general, time series type problems seem to be an exception, at least for me. Whether it's finance or military stuff related to Afghanistan, time series seems to be an area where I can stretch my Bayesian legs. Classical time series is severely crippled by the need to justify everything in frequentist terms. That leads to less general, but more difficult, mathematics. In essence, the range of time series models that can be given a frequentist interpretation is cramped and limited. The range of models (and possible informational states which imply those models) using the idea mentioned above, a Bayesian P[Y] that locates Y_true(t), is both simpler and far more general.

4. August 23, 2013

Daniel:

1) The distinction between D and x becomes relevant during inference when one calculates a posterior distribution for unobserved quantities conditioned on observed quantities. If you're conditioning on observed quantities, we're in agreement on everything important. But if you really meant to exclude the observed x_i when saying that p(q|D) is the posterior of q, we are in disagreement.

I don't understand what you mean by a random variable being a particular way to describe the uncertainty of a parameter. You seem to suggest that a deterministic model is the opposite of a probabilistic model - I can't go along with that: probabilistic refers to incomplete information, not non-determinism. Even in a purely deterministic model, you still have probability distributions on variables, with the distributions changing as a function of which observations you're conditioning on.

2) Suppose I have a 2-state Markov chain with states H and T and the transition probabilities from H are [1/3, 2/3]. Given that the chain is in state H at time t_0, this gives me a probability distribution for where it will be at t_1. Next, I observe that the state at time t_1 is T. At this point (one observation) we can already compare the model to data by calculating the likelihood, but we do not have a sensible notion of error (what was the error of our one observation?).
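The likelihood computation in a case like this is direct even without a subtraction-style error (the T-row transition frequencies below are arbitrary placeholders, since the example only specifies the H row):

```python
# Two-state Markov chain; P[cur][nxt] is the transition frequency.
# The H row matches the example; the T row is an arbitrary placeholder.
P = {"H": {"H": 1 / 3, "T": 2 / 3},
     "T": {"H": 0.5, "T": 0.5}}

def sequence_likelihood(path, P):
    """Probability of an observed path, conditional on its starting state."""
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        prob *= P[cur][nxt]
    return prob

# In H at t0, observed T at t1: the likelihood of that single observation.
lik = sequence_likelihood(["H", "T"], P)
```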

Sure, if I have a large number of time points I can define a notion of error (say the KL divergence between the observed frequency histogram and the distribution predicted by the model - though even for this there will be disagreement on which definition I should choose) - but the whole point is that we are discussing problems where we do _not_ have a large number of time points. Your notion of error was defined on a single observation.

3a) I do think of the likelihood function and the data-generating process as roughly synonymous: whenever you specify a likelihood function (with accompanying prior, if it contains unknown parameters), you automatically specify a procedure for generating data from your model. I agree there are many possible models - each of them is associated with a specific probability (or probability density, for continuous data) of generating the data set you've actually got. None of them really generated the actual data, because real-world data are not generated by models. Instead, we are asking "_if_ the data really _were_ generated by a model, which model would it be?" That's why I prefer to think of Bayesian models as counterfactual hypotheticals, e.g. my comment near the bottom of this thread: http://andrewgelman.com/2013/07/25/bayes-respecting-experimental-design-and-other-things/

3b) Let's see if I can summarize: both of us have in mind a non-Gaussian data-generating process which generates a finite data set that appears consistent with a Gaussian model, even though with a sufficiently large data set this consistency would disappear. I was additionally trying to point out that (a) we may have a Gaussian posterior predictive distribution identical to that produced by a Gaussian model even when our model is not Gaussian, (b) it can be wrong to assume a Gaussian model in such situations.

5. August 23, 2013

Entsophy: I'm not sure if we're in disagreement on anything here. In general I am more interested in the low information/high entropy case, because in high information/low entropy cases all vaguely sensible methodologies tend to give the same answers. It's in low information settings where it becomes important to make optimal use of the available information.

I work on problems where information is low enough that prediction is not usually feasible (in the sense that posterior predictive distributions are very diffuse - all or most possible observations have non-negligible probability), and instead we are interested in learning about the way in which real-world mechanisms produce the data (e.g. detecting sites in a gene where selective pressure is unusually strong). So it is natural to think in terms of constructing a model that mirrors the real-world mechanism.

Markov assumptions are then a way of encoding information we have about the way the world really works: in our current understanding of physics, the time-evolution of every system is Markov - the only way to break the Markov assumption is to adopt a notion of "state" that does not incorporate all relevant information about the system. This is different from IID assumptions, which are often made for practical reasons even when we know the real-world process is not IID.

6. August 23, 2013

Yes, I see what you're saying, I guess I was being sloppy and should have written $p(q|D,x) \propto p(D|q,x)p(q|x)p(x) = p(D|q,x)p(q)p(x)$ or something like that. It's a little confusing: what is $p(x)$, for example? Is that 1 because we definitely observed $x$? Anyway, we should definitely condition on $x$.

Not everyone is a Bayesian, and not every Bayesian needs to treat everything as having significant uncertainty. The term "parameter" is used by people who don't bother to include uncertainty in their models (not that there isn't uncertainty, just that it's not acknowledged in the model). If you're willing to let a delta function describe the uncertainty in a parameter, then I'm willing to say that all parameters are random variables, some of which have delta function priors. I think deterministic vs. non-deterministic was perhaps a poor choice of words, due to the confusion it invites.

2) I think it depends on what you mean by observation. Evidently if you have a Markov chain, then you have a series of states, and the entire series could be considered a single observation, or the individual states could each be observations. As far as I can see in your example, the error is either 1/3 or 2/3.

3a) I'm happy with your terminology, so long as it's made clear as counterfactual hypothetical. I think Joseph's point is that Frequentists give the "data generating process" a kind of pseudo-physical status, as if it really were what actually went on. And then if you check it, it turns out to be wrong, and then you can't explain why things work well, for example.

3b) I was thinking of a marginally-non-Gaussian data generating process which could nevertheless generate pretty close to Gaussian results for an ensemble. For a perverse example, imagine that we generate our time-series data by incrementing an integer from 1 to 100 and then back to 1... etc., and then choosing a random number uniformly from an interval defined by bin number n. If we arrange our intervals correctly the marginal distribution will look pretty darn Gaussian, but the individual values are nothing like Gaussian: they're uniform in some intervals. I think we're basically in agreement: the fact that an ensemble of values looks Gaussian does not mean that we should assume the individual values are each Gaussian from the ensemble distribution. It especially means that we shouldn't feel like it would be WRONG to assume some non-Gaussian distribution for each individual data point.
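Here is a sketch of that perverse example, with one hypothetical arrangement of the intervals (bin n gets the standard normal quantile slice from (n-1)/100 to n/100, with the infinite tails clipped at +/- 4):

```python
import random
import statistics

random.seed(3)

# Bin n covers the standard normal quantile slice [(n-1)/100, n/100];
# the tails are clipped at +/- 4 so the end bins stay bounded.
nd = statistics.NormalDist()
edges = [-4.0] + [nd.inv_cdf(p / 100) for p in range(1, 100)] + [4.0]

def ramp_bins(steps):
    """Bin index ramps from 1 up to 100 and back down to 1, repeatedly."""
    up = list(range(1, 101))
    cycle = up + up[-2:0:-1]
    return [cycle[t % len(cycle)] for t in range(steps)]

# Each value is uniform within its bin's interval: locally nothing like a
# normal, but the pooled marginal comes out close to standard normal.
data = [random.uniform(edges[n - 1], edges[n]) for n in ramp_bins(19800)]
```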

7. August 24, 2013

Looks like we're more or less in agreement. Just some small points:

1: The equation is p(q|D,x) = p(D|q,x)p(q|x)/p(D|x), so there's no p(x). This justifies your conceptual distinction: if your likelihood function is in the form p(D|q,x) you can condition on x throughout the analysis.

"all parameters are random variables, some of which have delta function priors" - yes, that's exactly how I'd put it.

2: It doesn't really matter how you count the observations - even with a whole sequence you could have enough states so that the sequence only contains a single transition out of a given state of interest. Saying that the error is 1/3 or 2/3 is different from the way error was defined in the original example - there it was the difference between prediction and observation, not the difference between 1 and the predicted probability of the observation. When we predict something with 80% certainty and our prediction comes true, it doesn't really feel right to say we had a prediction error of 20%.

8. August 24, 2013

In the case of finance, the change in perspective from frequentist to Bayesian is huge. It really changes the nature of the problem. To give just one example, most people in finance think P[Y] is a statement about frequencies. Since it's very unlikely individual investors are really estimating frequency distributions, it fundamentally doesn't make sense to suppose each individual investor really has a P[Y] when they trade. If, on the other hand, P[Y] is a Bayesian distribution which, through its high probability manifold, describes where Y_true(t) is thought to be, then it's very easy to imagine each investor has an implied P[Y] in mind. This is just one example. In general the change in perspective radically changes the nature of the modeling problem, or at least it changes it far more than I expected at first.

But taking a step back, there are three possible uses for statistics:

(1) to take what we know and make best guesses
(2) to learn new information (or new constraints) from failed best guesses
(3) to determine what state of knowledge it would take to predict quantities of interest with low uncertainty.

Statistical modeling as typically taught and practiced heavily leans towards (1) (although frequentists wouldn't state it that way). Real examples of (2) and (3) seem to come from more obscure applications like physics, which are pretty far outside of anything you'll see in a graduate stats course. For example, statistical mechanics basically shows that for an ideal gas you can predict volume accurately given the temperature, pressure, and number of atoms. I don't know of a single example of a formula like PV=nRT which has been derived from (Bayesian) statistics and a microstate model in either biology or economics, except for a few special cases where you can intuit the answer, thereby bypassing the "statistical inference" part of it. And you need to bypass the statistical inference, because if you're confusing probabilities and frequencies, as almost everyone does, you'll never get it right.

I strongly believe that (2) and (3) are far more useful than (1) even if (1) is perfected as far as it can go. Part of the story is in the post "Data Science is inherently limited", but that's only part of the story. So while my blogging has been focused on (1) by getting a clearer picture of what we're really doing with traditional stats modeling, I believe all the real action for science occurs in (2) and (3).

9. August 25, 2013

I think I might prefer the factorization $p(q|D,x) \propto p(D|q,x) p(x|q)p(q)$ and then treat the "measurement errors" (or the true quantities being measured) as part of the parameters $q$. In all cases, measurement errors enter; some cases just have negligible uncertainty surrounding those errors (again like our delta-function priors).

I think you are conflating frequencies and probabilities in the way Joseph was discussing. In your Markov model, we have an observed frequency for transitions between two states, and we have a model frequency for transitions between two states. We can model those frequencies using Bayesian probabilities, i.e. we could put a probability over the correct value of the transition frequency. Also, we can talk about the error between the observed value and the model parameter, and we can put a probability distribution over that error as well.

10. August 25, 2013

To expand, in your Markov example suppose 100% of the observed transitions from special state A were to special state B (for example). This is a *frequency* of 1. The Markov model had transition frequency say 0.75 (but really, we should say the probability of frequency 0.75 was highest, i.e. the maximum a priori value), so that the difference between the model frequency and the observed frequency was 0.25.

I don't think that is an objectionable notion of error, but we'd most likely want to collect more data before stating that the error was *definitely* exactly 0.25, in other words, in reasonable models we're going to have a posterior distribution over the frequency error that has non-negligible width until we've observed a lot more of these transitions.
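A conjugate sketch of that point (the Beta prior here is an illustrative choice, not something from the thread): with a handful of observed transitions the posterior over the transition frequency stays wide, and it narrows as transitions accumulate:

```python
import math

def beta_posterior(alpha, beta, n_to_b, n_to_other):
    """Posterior mean and sd of a transition frequency under a Beta prior,
    after observing counts of A->B versus A->elsewhere transitions."""
    a, b = alpha + n_to_b, beta + n_to_other
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

# 4 observed transitions from A, all to B (observed frequency 1.0), with
# a uniform Beta(1, 1) prior: the posterior is still quite wide.
mean_small, sd_small = beta_posterior(1, 1, 4, 0)
# 400 such transitions: the posterior sharpens around the observed frequency.
mean_big, sd_big = beta_posterior(1, 1, 400, 0)
```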

11. August 26, 2013

I agree that the model parameters in my example are frequencies, and that it is misleading to call them "transition probabilities" in this context - that's their common name, but it's not technically correct to call them Bayesian probabilities without specifying the information they are conditioned on - we can say that they are the predictive probabilities under complete information about the model, though.

I do not agree that it is useful to define error that way. For one thing, we do not in general expect the observed frequencies to be a good estimate of the model parameter - model parameters are inferred, not measured directly.

12. August 26, 2013

I am not sure I disagree exactly. Here's what I'd say: there's a prior distribution over the model transition frequencies $f$, and there's the observed number of transitions that we get from some data $D$. To get inference on transition frequencies, we need to update the distribution over the model frequencies based on the observed data. To do that, we'll need to specify a likelihood of observing the particular transitions under our model assumptions. Then we can do $P(f|D) \propto P(D|f) p(f)$.

If you are assuming a markov process for $D$ then you should be able to get the likelihood of observing $D$ under different $f$ values either by simulation, or using some closed-form expression if you happen to have it, and that will help you compute posterior distributions for $f$.
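As a sketch of the simulation route (the symmetric two-state switching model and the observed path are hypothetical), we can check a Monte Carlo estimate of $P(D|f)$ against the closed form:

```python
import random

random.seed(4)

observed = "ABBA"  # hypothetical observed state sequence

def switch_likelihood(f, path):
    # Closed form for a symmetric two-state chain: each step switches
    # states with frequency f or stays with frequency 1 - f.
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        prob *= f if cur != nxt else 1 - f
    return prob

def simulated_likelihood(f, path, trials=50000):
    # Monte Carlo estimate of P(D | f): simulate chains from the same
    # starting state and count exact reproductions of the observed path.
    hits = 0
    for _ in range(trials):
        state, ok = path[0], True
        for nxt in path[1:]:
            if random.random() < f:
                state = "B" if state == "A" else "A"
            if state != nxt:
                ok = False
                break
        if ok:
            hits += 1
    return hits / trials

exact = switch_likelihood(0.4, observed)        # 0.4 * 0.6 * 0.4
approx = simulated_likelihood(0.4, observed)
```

Either route feeds the same posterior update $P(f|D) \propto P(D|f)p(f)$, for example over a grid of $f$ values.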

However, with the Markovian assumptions, this can be written in terms of a deviation between observed frequencies $F$ and model frequencies $f$, but you don't have to do that if it doesn't help your modeling effort.

So, if you're objecting to me saying that there is exactly one error 0.25, then that was really just a shortcut for doing the whole inference implied by the above stuff. There's really a probability distribution over the errors, that depends on the probability distribution over $f$ and the structure of the Markov model.

The essential point though, going back to Joseph's (Entsophy's) original posts, is that we can put probabilities over frequencies, and if we keep probabilities and frequencies straight in our mind, even though many people conflate them, we will be able to think more clearly about what we're modeling.