A summary of some discussions on data-dredging, hiding data, private models, and regulatory approval

2016 August 4
by Daniel Lakeland

We discussed a bunch of stuff at Andrew's blog on data dredging, hiding data, selecting your favorite analysis, etc. Here's a summary of my position:

In Bayesian analyses, the only thing that matters is the data and the model you actually have. That is, essentially p(Data | Parameters, Knowledge) p(Parameters | Knowledge).

The question arises as to how it affects things if you do some potentially questionable practices. So let's look at some of them:

  1. Cherry-picking data: you collect N data points, and then select n of them, submit these n as if they were all the data you have, together with a model... THIS IS BAD. The reason it's bad is that p(Data | Parameters, Knowledge) is now the wrong formula, because "Knowledge" doesn't include knowledge of the REAL data collection process. The rules by which you selected the n data points from the N data points are missing. Since the likelihood is a model of the whole data collection process, not a god-given fact about the "true frequency distribution in repeated sampling" (whatever that is) it strongly depends on the whole process by which the data eventually makes its way into your data file.
  2. Cherry-picking outcomes: You collect an NxM matrix of data, and select one column to analyze (say N people tested for M different outcomes). Provided you give me the full matrix, and you explain the logic behind the choice of analyzing your column, and I agree with it, the existence of this additional data need not concern me. However, if information about some of that other data could inform my model of the one outcome of interest, then I may legitimately require you to include the other columns in the analysis. For example, if you analyze 3 different kinds of immune cells, and a self-reported allergy outcome, and you want to analyze the self-reported allergy outcome without looking at the immune system outcomes... I may legitimately  conclude that I don't agree with the model. When the extra data doesn't alter what the regulator should think the likelihood (or priors) should be, it's irrelevant, but if it would... then it's not ok to hide it. Only the regulator can make that decision.
  3. Cherry-picking models for a single outcome. If you collect N data points, one outcome column, and you are analyzing the one outcome column, and you've privately thought up 400 different ways to slice and dice this data and just reported your favorite model. Does this invalidate the inference? It depends. If the result of your slicing and dicing is that the likelihood you're using is extremely specific and would not be a likelihood that a person who hadn't seen the data already would choose... then YES it invalidates the analysis, but that's because it invalidates the choice of the likelihood. If the likelihood winds up as something not too specific that others who hadn't seen the data would agree with based on the background info they have, then no, it's not invalidated, it's a fine analysis. Whether or not you looked at hundreds of alternative models, or just this one model, a regulator needs to scrutinize the model to ensure that it conforms to a knowledgebase that is shared by other people in the world (regulators, third parties, etc). The number of other hypotheses you thought about, does not matter, just as it doesn't matter that you might have anxieties, or be wearing a lucky rabbits foot, or have prayed to a deity, or any other private thoughts you might have had. Bayesian models are more or less mathematicized thoughts.
    1. Note, if you try to argue for a single final model, and regulators don't agree with it, it's always possible to set up a mixture model of several explanations with priors over the models, and do Bayesian model selection. Either a single model will dominate at the end, or several will have nontrivial probability. Either way, we can continue with the analysis using the posterior distribution over the mixture.

In the end, the thing to remember is that a Bayesian model has a CHOICE of likelihood. This isn't a hard shared fact about the external world, it's a fact about what you assume or know. If you send a regulator data and the method by which you actually collected, filtered and handed over that data differs from the method that your trial protocol describes, or the likelihood doesn't correspond to something people other than you can get behind, then there is no reason why the world needs to believe in your analysis, (and you may need to get some jail time).

However, if you have privately got all sorts of worries, fears, and have looked at hundreds of alternative models, provided the one model you submit is agreeable to independent parties in terms of the logic, and the data collection process is as you said it would be, and the likelihood expresses it correctly, and the priors are not unreasonable to third parties, then Bayesian logic suggests that everyone involved needs to agree with the posterior independent of how many other ways you might have sliced the data. That's ensured by the consistency requirements of Cox's axioms.

Note however, that Frequentist testing-based inference doesn't seems to have the same property. In particular the idea behind a test of a hypothesis is that you will filter out say 95% of false hypotheses. But, how many hypotheses did you test, or could you have tested easily? The expected number of non-filtered false hypotheses you could easily get your hands on is \sum_{i=1}^N p_i where N is the number of false hypotheses you can easily get your hands on, and p_i is the p value each one gets. So, it matters how many other hypotheses you might have tested... which is ridiculous, and shows how by failing to meet Cox's axioms, Frequentist testing based inference goes wrong.


No comments yet

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS