Summary of ideas on Convenience Samples
We had a long discussion over at Gelman's Blog on using convenience samples in studies. I had some strong opinions which I thought I'd try to summarize here.
First up, you have to ask yourself what it is you want to find out from your study. Typically, it's not just what would happen if you did your treatment again on a similar convenience sample. It's more likely to be something like "what would happen if you did your treatment on a more broadly defined population P?"
If this is the kind of question you have, then you now have to figure out what kind of statistics you are willing to do. The spectrum looks something like this:
- Frequentist statistics: A probability means how often something would happen if you repeated it a lot of times. For example, if you use a random number generator to sample 100 measurements out of a population of 100,000 people, then the sampling distribution of the average guarantees that the average of your sample will be within a certain distance of the average over all 100,000 people for almost any output of your RNG (this is what you are calculating with, for example, a 95% confidence interval).
- Likelihoodist statistics: where we in effect do a Bayesian calculation with a nonstandard, flat prior. This comes in the two varieties described in the next two items.
- Frequentist Likelihoodist statistics: where we assume that the data arises as if from a random number generator as IID samples, with a validated frequency distribution (we test this distribution to make sure the frequencies approximately match the assumptions). We then write down the likelihood of seeing the data, and use it to get a maximum likelihood value of the parameter.
- Default Bayesian Likelihoodist statistics: We write down some IID likelihood based on some default choice of distribution without doing any tests on the data distribution, we multiply by a constant (flat) prior, and then usually look for the maximum likelihood value. This is what's done in "ordinary least squares" in the vast majority of cases. Few people test the "normality" assumptions, and when they do, they often find that the assumptions are not met, but they stick with the calculation anyway, either because they don't know what else to do or because they don't believe in full Bayesian statistics for some reason.
- Full Bayesian statistics: We write down a full joint distribution of the data and the parameters, providing some kind of prior that isn't a nonstandard construct (i.e. not flat on the whole real line), and we don't necessarily restrict ourselves to likelihoods that are IID. For example, we might model a whole timeseries as a single observation from a Gaussian process, where each data point has a complex covariance structure with every other data point. We reject the idea that distributions must represent frequencies under repeated sampling, and instead use them as measurements of plausibility conditional on some scientific knowledge, and we're aware of this fact (usually unlike the Default Bayesian Likelihoodist).
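To make the Frequentist item above concrete, here is a minimal simulation sketch of the RNG-sampling claim. The lognormal population, the normal-approximation interval, and the function names (`ci95`, `coverage`) are all assumptions chosen for illustration, not anything from the original discussion. It repeatedly draws 100 values at random from a fixed population of 100,000 and counts how often the 95% interval contains the true population average.

```python
import random
import statistics

def ci95(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

def coverage(population, n=100, reps=2000, seed=0):
    """Fraction of repeated RNG samples whose interval contains the population mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    hits = 0
    for _ in range(reps):
        lo, hi = ci95(rng.sample(population, n))  # simple random sample, no replacement
        hits += lo <= true_mean <= hi
    return hits / reps

# Hypothetical, mildly skewed population of 100,000 values (illustration only).
_rng = random.Random(42)
population = [_rng.lognormvariate(0.0, 0.5) for _ in range(100_000)]

cov = coverage(population)
print(f"fraction of intervals containing the population mean: {cov:.3f}")
```

Under these assumptions the printed fraction should come out close to 0.95, even though the population is not normal. That is the sense in which the statement holds for almost any output of the RNG.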
Now, we've done a study on a convenience sample. What are the challenges?
For a Frequentist, we imagine we're interested in some function of the population, f(P), and a given data set S, which is a sample from the population P, has a value f(S). If we understand the sampling methodology and the distribution of the values in P that go into calculating f(P), then we can get a sampling distribution for f(S) and see how it relates to f(P). This automatically allows us to make extrapolations to the population P which are not exact, but which hold to a certain approximation a certain percentage of the time. The only problem with this approach is that it requires us either to know, at least approximately, the frequency distribution of the values in P, or to be able to calculate the sampling distribution of f(S) approximately independently of the frequency distribution of P (such as when there's a mathematical attractor involved, like the Central Limit Theorem).
But, in the absence of a specific sampling model, when all we know is that we're sub-sampling P in some definitely biased but unknown way, there is no way to extrapolate back to the population without making a specific guess for what the population values P look like, AND how the sampling works. And since a Frequentist will not put probabilities on things like "the distribution of the P distributions" (which a Bayesian might accomplish by doing something like a Dirichlet Process or a Gaussian Mixture Model), there is no way to get probabilistic statements here. You can at best do something like "worst case" among all the sensitivity analyses you tried.
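As a rough illustration of what goes wrong, the sketch below applies the usual 95% interval to a sample that is not drawn by an RNG. The selection mechanism here is an assumption made up for the example: each unit is included with probability proportional to its own value. In a real convenience sample the mechanism is unknown, which is exactly the problem. The population and the names `ci95` and `biased_coverage` are hypothetical as well.

```python
import itertools
import random
import statistics

def ci95(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

def biased_coverage(population, n=100, reps=1000, seed=0):
    """Coverage of the naive interval when inclusion is proportional to the value."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    # Cumulative weights for value-proportional selection (assumed mechanism).
    cum = list(itertools.accumulate(population))
    hits = 0
    for _ in range(reps):
        sample = rng.choices(population, cum_weights=cum, k=n)
        lo, hi = ci95(sample)
        hits += lo <= true_mean <= hi
    return hits / reps

# Hypothetical population of 100,000 positive values (illustration only).
_rng = random.Random(42)
population = [_rng.lognormvariate(0.0, 0.5) for _ in range(100_000)]

biased_cov = biased_coverage(population)
print(f"coverage of the naive interval under biased selection: {biased_cov:.3f}")
```

With this particular mechanism the naive interval should miss the population average most of the time. Correcting it requires a specific guess for both the population and the selection mechanism, and a different guess gives a different answer.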
What would be the way forward for a Bayesian? Well, first of all, there are quite a few of them. We can build a wide variety of models. But in the basic case, what we'd do is assume that we have some covariates we can measure and that we can get a functional form f for predicting the outcome from the covariates, so that for each s in the sample S, we have outcome(s) = f(s) + error. Now we have to assume that this relationship still holds, or holds with some modification over which we have a prior, so that outcome(s') = f*(s') = f(s') + error + error2(s') for any s' in the whole population P. Here we say that we at least have some probabilistic information about error2, the out-of-sample extrapolation error.
Next, we assume something about what P looks like, and since we're Bayesian, we can put probabilities over the frequency distribution for P. So we can do things like set up Gaussian Mixture Models, or Dirichlet Process Models, or other models so that we can describe what we think we know about the full range of the population. This could be a "pure" prior guess, or it could be based on some data from an alternative source (like Census data, or patient summary data from other hospitals, or operating characteristics of fighter jets published by military intelligence gatherers, or aerial photos of forests from weather satellites, whatever your subject matter is). Finally, we can generate a series of samples from the assumed population P and apply our predictor f* to each of them to get predictions extrapolated back to our model of the population P.
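The procedure in the last two paragraphs can be sketched as a small Monte Carlo calculation. Everything specific here is an assumption invented for the example: the fake convenience sample, the linear form of f, the two-component Gaussian mixture for the population's covariate, the priors on its weight and centers, and the form of error2 (a slope discrepancy `delta` that grows with distance from the sample). The least-squares fit is also only a plug-in stand-in for a proper posterior over f.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical convenience sample: covariate x in a narrow range, outcome y.
xs = [rng.uniform(20, 40) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]

# Fit f(x) = a + b*x by least squares (plug-in simplification, not a full posterior).
mx, my = statistics.fmean(xs), statistics.fmean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
resid_sd = statistics.stdev([y - (a + b * x) for x, y in zip(xs, ys)])

def draw_population_mean(n_people=300):
    """Average outcome in one simulated population, under one draw from the priors."""
    # Prior over the population's covariate distribution (assumed mixture model).
    w = rng.betavariate(6, 4)      # uncertain share of the first component
    mu_a = rng.gauss(30, 2)        # uncertain center of the first component
    mu_b = rng.gauss(60, 3)        # uncertain center of the second component
    # Prior over error2: the slope may be off by delta outside the sampled range.
    delta = rng.gauss(0, 0.05)
    total = 0.0
    for _ in range(n_people):
        x = rng.gauss(mu_a, 5) if rng.random() < w else rng.gauss(mu_b, 5)
        error = rng.gauss(0, resid_sd)
        error2 = delta * (x - mx)
        total += a + b * x + error + error2   # f*(s') = f(s') + error + error2(s')
    return total / n_people

draws = sorted(draw_population_mean() for _ in range(1000))
lo, hi = draws[25], draws[974]   # central 95% of the simulated draws
print(f"95% plausibility interval for the population average: ({lo:.1f}, {hi:.1f})")
```

The width of the resulting interval mostly reflects how unsure we said we were about the population and about error2. Change those priors and the interval changes with them, which is the point: the probabilities describe our state of knowledge, not a sampling process.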
So, in this case, the Bayesian can make assumptions which give probabilistic statements about the extrapolated results, but these probabilities are not Frequencies in any sense (there's no sense in which the real population P is somehow part of an ensemble of repeated populations P_i whose frequency distribution is given by the Gaussian Mixture Model of the Bayesian or whatever). The true frequency distribution of the real population P is a fixed thing at a given point in time. Like, the true frequency distribution of the weight in lbs of people between the ages of 18 and 60 in the US on April 28 2016 at 3:45PM Pacific Time. Nevertheless, although there's a very definite distribution for those weights, we DO NOT know what it is, and the Bayesian distribution over that frequency distribution is NOT describing an infinite set of worlds where our true population P is just one population out of many...
The Frequentist, by contrast, can only give an infinite set of statements of the form "if P looks like P_i and the sampling method looks like S_j, then the confidence interval for f(P) is f(S) +/- F_(i,j)".