# On models of the stopping process, informativeness and uninformativeness

I had a great conversation with Carlos Ungil over at Andrew Gelman’s blog where we delved deep into some confusions about stopping rules and their relevance to Bayesian Inference. So here I’m going to try to lay out what it is I discovered through that conversation, and I’m going to do it in the context of Cox/Jaynes probability with explicit background knowledge, translating from what I consider the much more nebulous terms like “deterministic” or “random”.

### Standard Textbook Stopping Problem:

First off, let’s talk about the “standard textbook” type of stopping rule: “Flip a Bernoulli coin with a constant p until you see 3 heads in a row and then stop.” Suppose you get 8 observations, HTTHTHHH. Now, the rule is completely known to you, and so you can determine immediately by looking at the data whether a person following the rule will stop or not. If someone hands you this data and says “given what you know about the rule, will they stop?” you will say P(STOP_HERE | Data, Rule) = 1. *For you* the rule is deterministic thanks to the textbook description. Under this scenario, knowing that they stopped at N=8 does not add anything that you didn’t already know. Therefore it can’t add any information to the inference. Basically:

P(Parameters | Data, Rule, STOP) = P(Parameters | Data,Rule)
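As a minimal sketch of why the fact of stopping is redundant here (in Python; the function name `stops_here` and the string encoding of the flips are my own illustrative choices, not part of the textbook problem): the rule is a plain function of the data, so anyone who knows the rule can compute the stopping decision for themselves.

```python
def stops_here(flips, run_length=3):
    """Return True if the rule 'stop at the first run of `run_length` heads'
    fires exactly at the end of `flips`, and not before."""
    target = "H" * run_length
    # The rule fires at the first position where the target run is complete.
    first_hit = flips.find(target)
    return first_hit != -1 and first_hit + run_length == len(flips)


data = "HTTHTHHH"
# Knowing the rule, P(STOP_HERE | Data, Rule) is either 0 or 1.
p_stop = 1.0 if stops_here(data) else 0.0
print(p_stop)
```

Being told "they stopped" only repeats what `stops_here(data)` already returned, which is why it drops out of the inference.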

### Standard Real World Stopping problem:

In the real world, the reason why people stop is rarely so clearly known to you. The experimenter tells you “Our collaborators ran a few trials and read the protocols from another lab, and tried this thing out, and it seemed to work, and so we tried it in our lab, and after collecting 8 samples we saw that the results were consistent with what we’d been told to expect, and so we stopped.”

Or a survey sample of a population comes with a dataset of answers from 120 people, and a description of the protocol: “we ran a small preliminary study to test out our questions, and found in the preliminary study that the phrasing of the questions didn’t seem to matter, and so we used the observed variability in this previous study to determine that a sample of 120 people should give a cost effective dataset for answering the questions we asked, we then sampled a fixed 120 people”. But because the preliminary study was based on a mixture of different questions and so forth, it’s not included in the data set. Here, the information gleaned in the earlier trial study is expressed only through the choice of 120 people. The “fixed deterministic” rule “sample 120 people” is informative for a bigger model in which you include the earlier steps.

Or a slightly more sophisticated version: “we sampled until our Bayesian model’s probability for parameter q being in the range 0-0.1 was less than 0.01, i.e. p(q < 0.1) < 0.01”. To the people running the study, a Bayesian posterior is a deterministic function of the data and the model. Everyone who has the same model always calculates the same posterior from the given data. But note, *you* don’t know what their Bayesian model was, either priors or likelihood.

### Deterministic vs Random stopping rule vs Cox/Jaynes restatement

In the literature, a stopping rule is called “informative” if it is “a random stopping rule that is probabilistically dependent on a parameter of interest”. I personally think this is a terrible definition, because “random” is usually a meaningless word. It gave me a lot of trouble in my conversation with Carlos, because when I think random I pretty much exclusively use that in the context of generated with a random number generator… but that’s not what is meant here. So let’s rephrase it in Cox/Jaynes probability terminology.

**A stopping rule is informative to you if, given your background knowledge and the data up to the stopping point, you cannot determine with certainty that the experiment would stop, and there is some parameter in your model which would cause you to assign different probabilities to stopping at this point for different values of that parameter.**

Under this scenario, the *fact of stopping* is itself data (since it can’t be inferred with 100% accuracy just from the other data).

Now in particular, notice that the informativeness of a stopping rule depends on the background data. This should be no surprise. To a person who doesn’t know the seed of a random number generator runif(1) returns a “random” number, whereas to a person who knows the seed, the entire future stream of calls to runif is known with 100% certainty. It’s best to not use the term random and replace it with “uncertain” or better yet “uncertain given the knowledge we have”.
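A small sketch of the seed point, using Python's `random.Random` as a stand-in for R's `runif` (the seed value is hypothetical): the same stream is "uncertain" to one observer and fully predictable to another, and the only difference is what each of them knows.

```python
import random

seed = 12345  # hypothetical seed, known to one observer only

# The experimenter draws three "random" numbers.
experimenter = random.Random(seed)
draws = [experimenter.random() for _ in range(3)]

# An observer who knows the seed reproduces the stream exactly,
# so for them each draw was predictable with certainty.
informed_observer = random.Random(seed)
predictions = [informed_observer.random() for _ in range(3)]

print(draws == predictions)
```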

### What can you usually infer from the stopping rule?

If you’re in a situation where the stopping rule is uncertain to you, then if you believe it is helpful for your modeling purposes, you can add into your model of the data a model for the stopping rule (or a model for the choice of the “fixed” sample size). This is particularly of interest in real world rules along the lines of “my biologist friends told me what I should be getting from this experiment so I tried about 8 experiments and they kind of seemed to be consistent with what I was told, so I stopped”. The actual rule for whether the experimenter would stop is very nebulous, but the fact of stopping might tell you something you could use to describe distributions over relevant real-world parameters. For example, suppose there’s an issue with the way a drug is delivered that can cause toxicity if you do it “wrong”. The fact that the biologist stopped after 8 experiments suggests that they believe p(DetectToxicity | DoingItWrong) is near 1, so that if you haven’t seen it in 8 tries then you are virtually certain you are doing it “right”.
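To make the toxicity example concrete, here is a rough sketch in Python. Everything numeric in it is an assumption for illustration: a 50/50 prior on doing it wrong, independent experiments, and toxicity never appearing when the delivery is done right. It shows that 8 clean experiments only make you "virtually certain" when p(DetectToxicity | DoingItWrong) is high, which is what the decision to stop at 8 hints at.

```python
def posterior_doing_it_wrong(p_detect, n_clean, prior_wrong=0.5):
    """P(DoingItWrong | n_clean experiments with no toxicity seen).

    Assumes toxicity never shows up when the delivery is done right,
    and that the experiments are independent.
    """
    like_wrong = (1.0 - p_detect) ** n_clean  # all n_clean tries missed it
    like_right = 1.0
    numerator = like_wrong * prior_wrong
    return numerator / (numerator + like_right * (1.0 - prior_wrong))


for p_detect in (0.1, 0.5, 0.9):
    print(p_detect, posterior_doing_it_wrong(p_detect, n_clean=8))
```

With a low detection probability, a substantial chance of "doing it wrong" survives 8 clean tries, so an experimenter who is satisfied at 8 is revealing something about the detection probability they have in mind.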

So, eliciting information about the stopping rule is very useful because it can show you that there are potentially parameters you need to include in your model for which the fact of stopping informs those parameters, and particularly, parameters *that describe the nebulous uncertain rule*.

In the example above about sampling until a Bayesian posterior distribution excluded the range 0-0.1 with 99% probability, if someone tells you exactly the description of the Bayesian model, then if you plug in the data, you will immediately know whether the rule said stop or continue. But, if you know what the likelihood function was, but not the prior, then you could potentially infer something about the prior that was used from the fact that the sampling stopped at this point. If you think that prior was based on real background information, this lets you infer some of that real background info.

## Summary:

To a Cox/Jaynes Bayesian there is no such thing as “random”, only “uncertain given the background information”. A stopping rule can teach you something about your parameters precisely when your model doesn’t predict the fact of stopping with probability = 1 *and* your model has a parameter which affects the probability of stopping conditional on the data. In other words, p(STOP_HERE | Data_so_Far, Background, Params) is a non-constant function of Params.

Then, given the data, the fact that the experiment stopped is additional information about some of the Params.
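Written out as a worked equation (a sketch in my own notation, where $\theta$ stands for Params, $D$ for Data_so_Far and $K$ for Background), Bayes' rule with the fact of stopping treated as data reads:

$$
p(\theta \mid D, \mathrm{STOP}, K) = \frac{p(\mathrm{STOP} \mid D, \theta, K)\, p(D \mid \theta, K)\, p(\theta \mid K)}{\int p(\mathrm{STOP} \mid D, \theta', K)\, p(D \mid \theta', K)\, p(\theta' \mid K)\, d\theta'}
$$

If $p(\mathrm{STOP} \mid D, \theta, K)$ takes the same value for every $\theta$ (for instance 1, as in the textbook rule), it cancels between numerator and denominator and the posterior is just $p(\theta \mid D, K)$. Only a stopping probability that varies with $\theta$ survives as an extra likelihood factor.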

Stopping rules are not irrelevant to Bayesian models, but they are only relevant in those circumstances. If you feel that the stopping rule seems vague, or the reasons for the choice of the sample size seem based on some information that you’re not privy to, then you might want to invent some parameters that help you explain the connection between the fact of stopping and the things you want to know, such as “hidden” prior probabilities in use by the experimenter that inform their stopping.

The thing that becomes obvious in the restatement is that the fact of stopping is data just like any other data. And it’s informative precisely when the data is not redundant with other knowledge and your model, whatever it is, predicts different probabilities for that data in different regions of parameter space.

Your model might even predict 100% probability to stop for some regions of parameter space. Then the stopping rule is deterministic given a parameter value in such a region, but stopping has perhaps only 10% probability given a different value. There are no mysterious random powers inherent in the rule, there’s only stuff you know for certain and stuff you’re uncertain about.

When people talk about the relevance of stopping rules to Bayesian inference, the issue is whether the fact that data collection was stopped at some particular point does provide any additional information which is missed if the stopping rule is ignored and a fixed size is assumed. When the stopping rule used is indeed informative, it affects inference by changing the likelihood.

Many of the questions you discuss are about how you could (but not necessarily should!) use the assumptions implied by someone else’s stopping rule (or their experimental design in general) to establish the prior for your model.

Interesting as that may be, that’s a completely unrelated issue and it doesn’t even require that a stopping rule is used at all. If the original design included a stopping rule which was finally not put in place for some reason, you can still use it to shape your prior. You can also use the assumptions implied by the stopping rules used in other experiments or by alternative stopping rules that may have been discussed.

> Now in particular, notice that the informativeness of a stopping rule depends on the background data. This should be no surprise. To a person who doesn’t know the seed of a random number generator runif(1) returns a “random” number, whereas to a person who knows the seed, the entire future stream of calls to runif is known with 100% certainty.

However, both will agree on the non-informativeness of the stopping rule.

I don’t know about what “people” do, but I know where I am now. I have a logically consistent view of when I can expect to use information about the stopping mechanism that works within a conception of probability as a real valued logical quantity instead of some kind of mysterious “inherent randomness” (ie. an objective fact of randomness of the stopping rule about which all people will agree).

It is not the case that everyone will agree on the non-informativeness of the stopping rule.

You collect data until your Bayesian model excludes some region of the parameter space. You use a likelihood for your data that I’m aware of, but a prior that I’m not. Your posterior distribution is a deterministic function of the data and your model. Your stopping rule is uninformative *to you* (ie. you can’t learn anything from the fact that you stopped).

In order to analyze this data, I can build a model in which your prior is represented by an additional parameter or two. My model (likelihood *and* prior) now has a new parameter, and the information that you stopped at a certain point is itself informative in my model whereas in your model it’s totally uninformative. Why? because your background knowledge already has that prior as a 100% certain thing, you’re the one who chose it.

There are vast seas of application for this concept. If in my model there is a relationship between your choice of prior and the parameter that describes the actual data (for example, I know that you’ve collected similar data in the past, and you have informed reasons to choose your prior), then this process is not only informative to me about what your prior was, but I can also put correlations between your choice of prior for stopping and my parameter for the data. Something like:

myPriorLocation ~ normal(myPriorGuess,myPriorUncertainty);

yourPriorLocation ~ normal(myGuessForYou,myUncertainty);

myDataLocation ~ normal((yourPriorLocation+myPriorLocation)/2,myUncertainty);

StoppingN ~ someDistribution(yourPriorLocation);

Data ~ someDataDistribution(myDataLocation,…);

By inferring your prior from the stopping rule, it influences the high probability region of my estimate for the data location parameter, a kind of partial pooling between what I originally thought, and what I discern from your stopping. Plus, it adds a parameter. The likelihood in my model now has an extra StoppingN ~ someDistribution() factor, and it is a function of 2 parameters instead of 1. You’re right, this is a choice, it’s not something that I necessarily SHOULD do (esp. if I think you used a very poorly informed prior). But if your prior is based on something like a preliminary study, or studies in related animals, related regions of the world, related chemicals… whatever, then it’s a useful thing to do for me.

My likelihood is different even though your stopping rule is not only uninformative to you, but actually deterministic to you.

The literature is full of confusion. Some people are pure Frequentists when it comes to probability, some people are Cox/Jaynes Bayesians, and the vast majority of the Bayesian literature seems to be full of people who use the mathematics of Bayes but the probabilistic conception of Frequentism. To this latter group, whether you know the stopping rule or not is irrelevant; what matters is the “objective” fact of whether it had a “random” component or not, and whether the random component was or was not “objectively” dependent on the parameter that describes the “true” distribution of the data…

To that group, if you say “I used a Bayesian model and stopped when it excluded a certain parameter region at a certain probability” they can look at this and say “objectively deterministic” even though they have no idea what prior you used, and then they can magically say “the true data distribution is normal with unknown mean” (Frequentist conception of the data generating process) and the stopping rule is “objectively deterministic” (Frequentist conception of what it means to be deterministic), and so I am required to analyze it with the likelihood

data ~ normal(location,scale);

That’s all nonsense in my opinion. Bayes is all about making choices based on your background knowledge. If when you look at the data, your background knowledge doesn’t give you the ability to determine whether the data collection process would stop, but you can think of a way to create a parameter that you can partially infer from the fact of stopping (ie. it would put different predictive probabilities of stopping for different parameter values), and you want to use that parameter in your model, then the stopping rule is informative to you.

Note, the truth is even I found it hard to escape the trap of “if you plug in the data you will always get the same answer, therefore the stopping rule is deterministic”. Look how it seduces you, and yet it’s clearly a frequentist conception (ie. repeated applications of the rule with the same data, by you, produce the same result).

Thinking in terms of uncertainty and logic: “If you told me you had collected all this data, but you didn’t tell me whether there was more to come or not, given what I know about your stopping rule, can I determine with certainty that you will or won’t stop?”

That’s the interpretation that is consistent with Cox.

> It is not the case that everyone will agree on the non-informativeness of the stopping rule.

Everyone will agree in your example: the stopping rule is deterministic to the person who knows the internal state of the pseudo-random number generator behind runif(1), and it is non-deterministic to the person who doesn’t know, but both will find the stopping rule non-informative.

> You collect data until your Bayesian model excludes some region of the parameter space. You use a likelihood for your data that I’m aware of, but a prior that I’m not. Your posterior distribution is a deterministic function of the data and your model. Your stopping rule is uninformative *to you* (ie. you can’t learn anything from the fact that you stopped).

The stopping rule is also uninformative to you because it is deterministic. If you know that it is deterministic, you know that it is non-informative. And if you don’t know that it is deterministic, it’s still non-informative given the data (which is the same for you and me). I don’t see how you’re going to extract much information from the stopping rule if your model doesn’t let you see it’s deterministic. At least this is what I think, but I would be interested in seeing a detailed example of your methods.

Let’s say I perform a coin-tossing experiment (trying to infer the value of theta) and get two heads.

I think the likelihood is L(theta)=theta^2, regardless of my stopping rule (as long as it is a deterministic function of the data).

If I understand what you wrote, you will consider the stopping rule non-informative if you know the details of the stopping rule but it will be informative if you know only partially the details of the stopping rule. This is remarkable, the less you know about the stopping rule the more informative it will be!

Maybe you could give an example, using whatever assumptions you see fit for the sake of the illustration, of what the likelihood function L(theta) will be and how the inference will change in different scenarios:

(0) you don’t know if N was fixed or there was a different stopping rule of any kind

(1) you know that N was fixed (=2)

(2) you know that there was a deterministic stopping rule different from N fixed, but not the exact details

(3) you know that the stopping rule was to stop after two heads

From what I have understood, your likelihood will be L(theta)=theta^2 in scenarios (1) and (3).
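The claim for scenarios (1) and (3) can be checked numerically. This is a sketch with illustrative function names: it compares the probability of the observed flips under "N fixed in advance" with the probability under "keep flipping until the r-th head", and shows that the two differ at most by a factor that does not involve theta.

```python
import math


def fixed_n_likelihood(theta, heads, n):
    """Binomial probability of `heads` heads when n flips were fixed in advance."""
    return math.comb(n, heads) * theta**heads * (1.0 - theta) ** (n - heads)


def stop_at_rth_head_likelihood(theta, r, n):
    """Probability that the r-th head arrives exactly on flip n
    (rule: keep flipping until r heads have been seen)."""
    return math.comb(n - 1, r - 1) * theta**r * (1.0 - theta) ** (n - r)


# Two heads in two flips: both rules give theta**2.
theta = 0.3
print(fixed_n_likelihood(theta, heads=2, n=2), stop_at_rth_head_likelihood(theta, r=2, n=2))

# More generally the two differ only by a factor that does not involve theta.
ratios = [
    fixed_n_likelihood(t, heads=3, n=10) / stop_at_rth_head_likelihood(t, r=3, n=10)
    for t in (0.2, 0.5, 0.8)
]
print(ratios)
```

A constant factor cancels when the posterior is normalized, so both fully known rules lead to the same inference about theta.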

Carlos: this is the problem I was having with the terminology “deterministic” which is why I felt really confused about the definition. This “deterministic” is a frequentist concept and is much more limited than the Bayesian probability as numerical plausibility or credence.

It makes perfect sense that information about a thing you know a lot about is uninformative. It’s like if you know you’re wearing blue jeans and I tell you “hey you’re wearing blue jeans” this is uninformative. It’s only if you put your pants on in the dark and haven’t looked at them yet that you will find “hey you’re wearing blue jeans” to be informative. The less you know about something, the more data about it is informative, that seems obvious to me once you look at it in this way. Of course, data is uninformative if you don’t know what it means. If I say “Hey grobnatzig freeblongis” unless you have some idea what those words might mean… still uninformative.

Now, if there are no parameters in my model which cause me to assign different probabilities to your stopping rule, then I also can’t determine much from the data. It’s still uninformative in the way “grobnatzig freeblongis” is.

I have already given an example in which the deterministic-to-you stopping rule is informative to me, the case where you stop after your bayesian model excludes a region of your parameter space, but I don’t know your Bayesian prior. The information I gain is something about your Bayesian prior, and I can also use it to infer something about the parameter if I think your Bayesian prior is informative about the parameter.

In your cases (just for simplicity let’s imagine we’re sampling measurements of length from something my background says might be normalish distributed, it’s maybe easier for me to describe than a bernoulli coin. One parameter of interest then is the mean of the normal)

0) I have no information to assign probabilities about stopping with, and I have no data about whether you did in fact stop sampling. I have only the data so far. In this case, I have a model p(Data | Params, Knowledge) p(Params | Knowledge) just based on whatever information I do have. Let’s say I suspect data comes clustered around a central value with a standard deviation of 2, just for illustration, so Data ~ normal(mu,2); mu ~ my_prior_for_mu().

1) I know N was chosen to be 2, but I have no knowledge about how. I have no basis to infer anything from the fact that N = 2 so I have the same model as 0.

2) To make this little bit of extra knowledge helpful, I need to assume that I have some partial knowledge of how N was chosen (that is, that I can interpret the fact of stopping in some way; I’m not in the “grobnatzig freeblongis” type case). I create a model with partial information about how you selected N, with new parameters describing my understanding of that process, and I fit this new model. In this model “I Stopped at N=2” is data. And assuming I have some model over this data, then ParStop is the parameters of the stopping model. If I in addition think what I can infer about ParStop is also informative about the parameter Mean, then I should use these new parameters in my likelihood. p(Data | Mean) p(Mean | ParStop) p(ParStop | “Stop at N=2”, My Stopping Model) p(ParStop, Mean | Background)

3) In this case, if you tell me N=2 I can immediately infer HH without data. Or, if you tell me HH I can infer N without data, so my model is the same as 0 if I treat HH as the data. My model is independent of all data except N if I treat N as the data.

The interesting case is I think (2) because it’s the case where there’s some hidden information in the process of sampling that is partially revealed by the choice to get a certain N, *even if that choice would always be made the same way in repeated trials that got the same data*

In other words, the fact that the stopping rule repeats perfectly means its Frequentist probability is 1 but the fact that I don’t know everything about that rule means that its Bayesian Probability given my background is NOT 1.
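In symbols (a sketch in my own notation: $\phi$ stands for whatever details of the rule I don't know, $D$ for the data and $K$ for my background knowledge), a rule that is a fixed function of the data has $p(\mathrm{STOP} \mid D, \phi) \in \{0, 1\}$ for each fully specified $\phi$, yet my probability of stopping is the average over what I don't know:

$$
p(\mathrm{STOP} \mid D, K) = \int p(\mathrm{STOP} \mid D, \phi)\, p(\phi \mid K)\, d\phi
$$

This can be anything between 0 and 1. It equals 0 or 1 only when $p(\phi \mid K)$ puts all its mass on values of $\phi$ that agree about stopping.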

Here’s an example where we can plug in some numbers and express the different inferences using Stan notation (ack… trying to do this a little too quickly with kids begging for popsicles in the heat…. let me come back to it, because you could argue that all I’m doing is “changing priors” here, but I can make it have a “likelihood” type term in which N acts as data for an inference… or you can fill in the blanks if you see where I’m going).

I know you’re collecting data on the length of some object, and I get from you the data set {1.04, 0.97, STOP After 2}

I also know that you have a lot of experience with this measurement apparatus, and that you’re motivated to measure this object with reasonably tight precision. (some way to partially interpret the stopping rule)

Also I have reason to believe that the size of this object is O(1) so that I could in the absence of any data put say a gamma(2,2) prior on the length.

Now, if I don’t use the information about the stopping rule, I get the inference

mu ~ gamma(2,2);

sigma ~ exponential(1.0/0.5); // I guess the measurement instrument isn't too inaccurate

Data ~ normal(mu,sigma);

If I do use the fact that I know you have knowledge of the measurement instrument, then together with the data N, I get the inference:

mu ~ gamma(2,2);

yourPosteriorSigma ~ exponential(1.0/.02); // information about how much error I know you're willing to tolerate

yourMeasurementSigma ~ gamma(10.0,10.0/(yourPosteriorSigma*sqrt(N))); // inference about what you must think the measurement error in the machine is

`Data ~ normal(mu,yourMeasurementSigma);`

The fundamental quantity is the joint distribution, which here depends on data N and in the first case doesn’t.
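To see how N enters, here is a rough forward simulation in Python of the prior each model puts on the measurement error. It is a sketch, not a fit of either model: it only draws from the priors, the seed and number of draws are arbitrary, and the snake_case names are mine. One detail to watch: Stan's `exponential` and `gamma` take a rate, while Python's `random.gammavariate` takes a scale, so the rate is inverted below.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 2                   # reported stopping point
draws = 50_000

# Model 1: sigma ~ exponential(rate = 1/0.5), no use of N.
sigma_without_n = [rng.expovariate(1.0 / 0.5) for _ in range(draws)]

# Model 2: yourPosteriorSigma ~ exponential(rate = 1/0.02), then
# yourMeasurementSigma ~ gamma(shape = 10, rate = 10 / (yourPosteriorSigma * sqrt(N))).
sigma_with_n = []
for _ in range(draws):
    your_posterior_sigma = max(rng.expovariate(1.0 / 0.02), 1e-12)  # guard against an exact 0
    scale = your_posterior_sigma * math.sqrt(N) / 10.0  # Python wants scale = 1/rate
    sigma_with_n.append(rng.gammavariate(10.0, scale))

mean_without_n = sum(sigma_without_n) / draws
mean_with_n = sum(sigma_with_n) / draws
print(mean_without_n, mean_with_n)
```

The second prior is centered near 0.02 * sqrt(N), far tighter than the first, and it moves if a different N is reported. That is the sense in which N acts as data in the joint distribution.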

I think you could make it more like what you expect to see in a prior*likelihood situation by

yourMeasurementSigma ~ MyPriorOverYourMeasurementSigma();

sqrt(N) ~ gamma(10.0,10.0/(yourMeasurementSigma/yourPosteriorSigma));

or something like that, where sqrt(N) is on the left-hand side of a sampling statement because it’s data. This is obviously an approximation, because N is discrete.

Again, just checking in here quickly to give hints between other Sunday duties.

> The interesting case is I think (2)

Let’s stay with the other cases for a second then. If I understand correctly, you always get the same likelihood for theta. You said that your model in (3) and (1) is the same as in (0). Do you mean that you get the same posterior distribution for theta in all these cases?

Yes as far as I can see you get the same posterior for the parameter in 0 1 and 3.

Case 2 is where, because you have partial background information that can be used once you know a stopping rule was triggered, you get to infer something.

In probability as logic this 2 is the equivalent of what random means to a frequentist. And the partial information allows you to invent a parameter which is then probabilistically related to the fact of stopping. It has the same “random and related to the parameters of the model” but the meaning of random is different for Bayesian interpretation. In Bayesian interpretation random means you can’t be sure stopping would happen without being told it did happen.

When the Bayesian knows that the stopping rule is deterministic in the sense of always coming out the same conditional on the data, then the likelihood for the inference about the parameter that informs stopping looks like 1 when given the parameter and the data the Bayesian predicts stopping, and 0 when given the parameter and the data the Bayesian predicts no-stopping. So, after being told that the data collection stopped, the posterior for the parameter that the Bayesian uses to predict stopping is the prior, restricted to the region where the value is consistent with stopping at the given N.
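Here is a minimal sketch of that 0/1 likelihood in Python. All specifics are assumptions made for illustration: normal data with known noise sd 1, an experimenter whose prior on the mean is normal with sd 1 and a center I don't know, a rule "stop once your posterior P(mu < 0) drops below 0.01", and three made-up measurements. For each candidate prior center I ask whether the rule would have stopped exactly where it did.

```python
import math


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def rule_fires(data, prior_mean, prior_sd=1.0, noise_sd=1.0, level=0.01):
    """Experimenter's rule: stop once their posterior P(mu < 0) drops below `level`.
    Conjugate normal model with known noise_sd."""
    precision = 1.0 / prior_sd**2 + len(data) / noise_sd**2
    post_mean = (prior_mean / prior_sd**2 + sum(data) / noise_sd**2) / precision
    post_sd = math.sqrt(1.0 / precision)
    return normal_cdf(0.0, post_mean, post_sd) < level


def stop_likelihood(data, prior_mean):
    """p(stopped exactly at len(data) | data, prior_mean): 1 or 0."""
    fired_earlier = any(rule_fires(data[:k], prior_mean) for k in range(1, len(data)))
    return 1.0 if (not fired_earlier and rule_fires(data, prior_mean)) else 0.0


data = [1.2, 0.8, 1.5]                        # hypothetical measurements, stopped after 3
grid = [i / 10.0 for i in range(-30, 31)]     # my candidate values for their prior mean
consistent = [m for m in grid if stop_likelihood(data, m) == 1.0]
print(min(consistent), max(consistent))
```

Only a band of prior centers is compatible with stopping at the third measurement: lower values would not have stopped yet, higher ones would have stopped sooner. My posterior for the experimenter's prior center is my own prior over it, cut down to that band.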

When there’s something else going on (like for example, the Bayesian doesn’t know enough about the rule to predict with certainty) then you’d get intermediate probabilities for stopping and a continuous likelihood.

(I apologize for being repetitive, but I want to be sure I remain on track)

If you don’t have any information about a stopping rule, you get some posterior for theta.

If you have complete information about the stopping rule, you get the same posterior.

However, if you have partial information about the stopping rule you might be able to extract additional information from the data and get a more precise posterior distribution for theta.

Don’t you find that slightly disturbing? If you know precisely the stopping rule, should you ignore some of that knowledge to improve your inference about theta?


Carlos, looks like we’re out of Reply depth. Here I am replying to your comment: http://models.street-artists.org/2017/06/24/on-models-of-the-stopping-process-informativeness-and-uninformativeness/#comment-217212

Thanks, you’ve pointed out a situation in which we are ambiguously talking about different situations.

Let’s reiterate your situations, and add another one which is directly related to my number 2:

your scenarios:

(0) you don’t know if N was fixed or there was a different stopping rule of any kind

(1) you know that N was fixed (=2)

(2) you know that there was a deterministic stopping rule different from N fixed, but not the exact details

(3) you know that the stopping rule was to stop after two heads

From what I have understood, your likelihood will be L(theta)=theta^2 in scenarios (1) and (3).

In scenario 0 I have my simple posterior from above using just my prior on the mean and the measurement error size.

In scenario 1 the fact of stopping tells me nothing and I have the same posterior as 0.

In scenario 2 I assumed I have some partial information about the stopping rule, for example that it relates somehow to instrument precision/measurement error information that I think you have. A new parameter is born describing what I know about what you know about the measurement error, and the fact of stopping informs me about what you know about the measurement error. That also informs my posterior for the length; it would probably make it more concentrated, since you are assumed to know something real about the instrument error.

In scenario 3 to map to my measurement error scenario, the rule was say to stop after two heads of a separate coinflip… this doesn’t help me infer anything, except what the coinflips are, a fact I don’t care about, but I do in fact learn about them if I care to.

In scenario 4) you tell me that you stopped after plugging in the data to your Bayesian model, and you give me the explicit prior and likelihood that you used. Now, rather than inferring partial information about what you know about the measurement error, I have already been handed complete information about what you know about the measurement error. So my posterior distribution is an even more concentrated version of (2), since I use your very good information about the measurement error exactly. In fact, unless I have some reason to add something else to my model (like for example that I think there’s a slight consistent bias to your measurements or whatever) then my posterior becomes your posterior.

If you now pointed out that I think in scenario 4 that I should throw out the email you sent me with the exact measurement error prior, and go back to scenario (2) then that really would be something weird. However, I don’t think that. In scenario 4 I use your prior because you’re the one who knows the instrument.

Does that help us?

Ok, so instead of analysing the data using your model and your prior if I tell you my model and my prior you might like them better and use them instead, getting a different (maybe better) posterior distribution for the parameter of interest.

And if you know quite a lot but not everything about my experimental design you can use the fact that there was a stopping event to infer a bit more and reverse engineer my model to some extent and use that to improve your model so you get a different (maybe better) posterior. I don’t really think that’s feasible in the real world but I guess it’s not logically impossible.

Yes!!! we’ve arrived at the same point. Thank you again Carlos.

Feasibility in the real world is of course very dependent on the problem. To me, knowing how to analyze the stopping rule in a fully Bayesian context thanks to your help with clarifications at least gives me the tool I need to argue to people whether or not the stopping has meaning.

There are enough cases that I am faced with where the stopping rule was clearly pretty nebulous that simply being able to tell people “hey, here are the things I need to know about the experimental design to determine whether we need to think about the stopping” is really helpful.

Even if every time we say “looks like we should just ignore the stopping rule” at least we’ve actually done it based on some logically consistent Bayesian principle.