# Estimating parameters from truncated samples

If you've got a bunch of things out in the world, and you send some people out to check up on them, and they only bother to look at and report the ones that are bigger than some value, then you wind up with a truncated random sample. For example, suppose you drive around a housing development looking for driveways with large cracks in them, indicating a problem with the quality of the concrete. Suppose you only stop at driveways where the cracks are noticeable in size. You don't know what "noticeable" really is (an unknown cutoff), but you only get reports of how wide the cracks are when they can be seen from the street on a drive-by. (This is not a real problem I'm having, just a plausible example.)

Let's just model this in the simplest case: approximately normally distributed crack widths with unknown cutoff.

    vals <- rnorm(1000, 10, 1)   # 1000 cracks: mean width 10, sd of width 1
    actualcutoff <- 10
    N <- 100
    # keep N of the cracks that are wider than the cutoff
    seenvals <- sample(vals[vals > actualcutoff], N)

Now we want to estimate the population parameters, which are mu=10, sd=1 in this case, and we need to simultaneously estimate the cutoff, which we know logically needs to be positive and less than min(seenvals) (a data-dependent prior!).

Stan code:

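    data {
      int<lower=1> N;        // number of reported cracks (needed to size seenvals)
      vector[N] seenvals;
    }
    parameters {
      real<lower=0, upper=min(seenvals)> cutoff;
      // named mu and sigma because mean and sd are built-in Stan function names
      real<lower=0> mu;
      real<lower=0> sigma;
    }
    model {
      mu ~ exponential(1.0 / 100);      // we have a vague idea about mean and sd
      sigma ~ exponential(1.0 / 100);
      seenvals ~ normal(mu, sigma);
      // renormalize the distribution based on cutoff
      // (newer Stan versions write: target += -N * normal_lccdf(cutoff | mu, sigma);)
      increment_log_prob(-N * normal_ccdf_log(cutoff, mu, sigma));
    }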
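To see what that last term is doing, here is a sketch of the log-likelihood the model is meant to encode. It assumes each reported width $x_i$ is a draw from a normal distribution with mean $\mu$ and standard deviation $\sigma$, truncated below at the cutoff $c$. The symbols $x_i$, $\mu$, $\sigma$, $c$, $\phi$ (standard normal density) and $\Phi$ (standard normal CDF) are notation introduced here, not names from the code.

$$
p(x_i \mid \mu, \sigma, c) = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{x_i - \mu}{\sigma}\right)}{1 - \Phi\!\left(\frac{c - \mu}{\sigma}\right)}, \qquad x_i > c
$$

$$
\log L(\mu, \sigma, c) = \sum_{i=1}^{N} \log\!\left[\frac{1}{\sigma}\,\phi\!\left(\frac{x_i - \mu}{\sigma}\right)\right] - N \log\!\left[1 - \Phi\!\left(\frac{c - \mu}{\sigma}\right)\right]
$$

The sum is what the normal sampling statement for `seenvals` contributes, and the second term is the `normal_ccdf_log` increment: one copy of the log upper-tail probability is subtracted for each of the $N$ observations.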

What I discovered in doing this was that for truncation points near or above the mean, the estimates were biased towards lower means and higher standard deviations, and this held regardless of whether I used a flat or exponential prior. The more observations you saw to the left of the mean, the more reliable and accurate your estimates were. As sample size increased things got better.

I am not sure why this bias occurs, but it is interesting and might inform some analyses. There is definitely a tendency in practical problems to have situations where we can't even detect some smaller values, but get a full complement of the larger ones.

That's not a data-dependent prior -- the I(cutoff < minimum data val) term enters through the likelihood. It's just that multiplication by a given indicator function is idempotent.

Sometimes it's hard to really separate what's a prior and what's a likelihood. These two things are mental constructs on top of the fundamental thing, which is the joint distribution. Clearly, I'm specifying my knowledge about the cutoff using a function of the data, and so in that sense it is a data-dependent distribution (notice I didn't say prior) for the cutoff.

You can *define* "likelihood" to be any term in the joint distribution that includes the data, for example, and then you're right by definition.

As is well known you can use the posterior from one analysis as the prior for another analysis. Is this "not a prior" because it's actually the "true prior" times the likelihood of "earlier data"?

But, your point is taken, you *can* interpret my model as a uniform prior on the positive real line times a likelihood that has an indicator function involving min(data).

One method I use to suss this out is to ask myself what function I'd maximize to get the MLE.
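One way to make this exchange concrete is to write down the function in question. This is a sketch, assuming a normal population truncated below at $c$, with notation introduced here: $x_i$ for the reported widths, $\mu$ and $\sigma$ for the population mean and standard deviation, $\Phi$ for the standard normal CDF, and $\mathbb{1}[\cdot]$ for the indicator function.

$$
L(\mu, \sigma, c) = \prod_{i=1}^{N} \frac{\mathrm{Normal}(x_i \mid \mu, \sigma)\,\mathbb{1}[x_i > c]}{1 - \Phi\!\left(\frac{c - \mu}{\sigma}\right)} = \mathbb{1}\!\left[c < \min_i x_i\right] \prod_{i=1}^{N} \frac{\mathrm{Normal}(x_i \mid \mu, \sigma)}{1 - \Phi\!\left(\frac{c - \mu}{\sigma}\right)}
$$

The per-observation indicators collapse into a single constraint on $c$, which is why the upper bound at min(data) can be read as part of the likelihood. Multiplying by that same indicator a second time, as a bound on the parameter, changes nothing. For fixed $\mu$ and $\sigma$ the denominator shrinks as $c$ grows, so the likelihood is increasing in $c$ all the way up to $\min_i x_i$, and a maximum-likelihood estimate would push the cutoff right up against the smallest observation.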

The posterior from a previous analysis is a legitimate prior and is not an illegitimate data-you're-about-to-update-on-dependent prior. 😉