So, here's a graph I generated after writing my previous post. The idea is to approximate the Dirichlet Process with a step function derived from a finite dimensional Dirichlet Distribution.
Loading up the gtools package in R I was able to generate 10 random vectors of length 100 from a Dirichlet Distribution. The function takes 2 arguments, the number of draws, and the "shape parameters". In the nonstandard case, the shape parameters are a function of x (red) and then each draw is a function of x as well (black-blue). I used ggplot2 to plot these as step functions. Clearly they're pretty rough functions.
I constructed the shape parameters by taking the function
which is a probability density on
, sampling it on 100 evenly spaced points, and multiplying it by a value
. Roughly, by analogy to the beta distribution, the
parameter is like how many data points you collected. The more data points you collect, the tighter your sample paths will be around the red curve. It's a sort of "information" parameter, though the Wiki calls it a "concentration" parameter. Both are accurate descriptions.
Now imagine you want to construct a model for something else. Let's say it's a financial timeseries. You'd like to model this financial timeseries as a series of independent increments but with the standard deviation of these increments being something variable. Perhaps you're willing to constrain them to be between .01 and 20 on some scale. Also, suppose you have some information about how frequently different size increments occur. You could draw on paper a function on [0,1] that approximates how often you think the standard deviation is .01 vs 20 (you'll interpret
as
and
as
). You can approximate this function numerically, and put it in a computer. Using
a not very large number, you'll be telling your model that you aren't very sure about the exact shape of this function. Then you can model the timeseries as follows:
First draw a vector of 100 values from a Dirichlet distribution and interpret them as a stepwise PDF on [0,1]. Then for each time increment, draw a value from this PDF and scale it to the range
, then draw a mean zero gaussian random number with this standard deviation and increment the timeseries. You'll wind up with a timeseries path whose increments are mean zero and have a time-varying random standard deviation with the size of the standard deviation coming from this perturbed version of your original graph that you get from the DP.
If you think of the DP as a prior over functions, then you can do this whole thing in an MCMC algorithm and do inference both on the particular mixture function, and the particular increments that happened during your actual data series. It's clearly a very flexible way of providing some prior data about mixture distributions and let MCMC to do its thing.
Andrew Gelman links to a paper on priors for Bayesian nonparametric statistics (an has a wonderful in joke in the title for those who know late 70's and early 80's music).
The paper itself is something about nonparametric bayesian models, which really means models that have a huge number of parameters. One thing that comes up a lot in these nonparametric problems is the Dirichlet Process. Unfortunately that wikipedia article is inscrutable to someone who doesn't already have an intuition about the Dirichlet Process (ie. it's maybe a good reference, but a terrible way to learn).
I am not really sure what a Dirichlet Process is, but I'm going to take a stab at it via nonstandard analysis and see if someone can interpret what I have to say and tell me if it makes sense.
First, let's talk about what a Dirichlet distribution is. Imagine you have a measurement process that can produce several different discrete outcomes. For example you have a survey form and people can put their "race" as one of 8 options. Such a process is a categorical random variable with 8 categories. When there are only 2 categories this is a bernoulli random variable (ie. its result is either 0 or 1), but in general it could be N categories. For each of the N categories there is a probability associated with that category. Every outcome MUST fall into one of the categories (for example you have 7 words describing "common" races and then "other" so that there's an option for everyone).
The parameters of the N dimensional categorical distribution is then a vector of N probabilities. These probabilities must add to 1 (because of the "other" category, you always fall into one of the categories).
The Dirichlet distribution is a distribution over vectors of N numbers that are in the interval [0,1] and add to 1. Since if you know N-1 of these numbers the last one is automatically exactly known (ie.
) the probability density in regions where the sum is greater or less than one is zero. So it's a kind of N-1 dimensional hyper-surface embedded in the N dimensional hyper-cube. If you draw a vector of N different p values from the Dirichlet distribution you'll get a vector that can be interpreted as the parameters of a categorical distribution over N categories.
Fine and dandy so far (relatively speaking). Now suppose that N is a nonstandard number, there are a nonstandard number of categories. Well, we could interpret a set of N different p values where N is nonstandard as a function
. Since the
values add to 1, if we take
and integrate
the
values cancel, and we see that the integral is guaranteed to be 1. So the function
is a nonstandard probability density function on [0,1]. It's a very weird function though, there's nothing that guarantees it's s-continuous anywhere, any two adjacent
values could be different by a significant amount. The graph of this function doesn't exist as a standard thing (because it requires a nonstandard number of discrete steps) , but could be approximated by some kind of very spikey noisy graph using N a large but standard number (like say 1000).
The typical way that this is used, as far as I can tell, is to define mixture models. So you have some set N of PDFs that define different ways in which a data point can be generated (for example
different gaussians with different mean values). Each of the N PDFs has a probability of being used to generate a given dataset. You draw a single pdf from the Dirichlet process, and then you use this pdf to select which of the different gaussians you will use to explain the ith observation. Since in practice we always work with a finite standard number of things, you think of the Dirichlet process with length
as a way to get a finite standard number
of discrete values, by sampling
for i from 0 to n-1 where n is standard, or something like that.
Well, that sort of helped me, I wonder if it's more or less correct
.
EDIT: ok, reading up on the Wikipedia entry, it seems like there are basically two "parameters" to the DP itself, the first is a pdf which describes the marginal probability of a given
and the other is a "concentration parameter" which describes how much the values of the DP concentrate about a finite set of values. So the function
is going to look like some kind of step function because it will frequently take on some discrete set of values. Also the concentration parameter has something to do with how spread out the discrete values are.
I was at the La Brea Tar Pits museum with my kids monday. It's nice the first time, ok the second time, but a little bit too small to go very often. They could really use some updates to the exhibits with some scientific information about what they've discovered.
One of the exhibits, right at the entrance, is where they have buckets of asphalt each bucket containing two cylinders. One about the size of a deer's foot (about 4 cm in diam?) and one about the size of a baby mammoth, or a bear (15 cm or so?). Attached to these is a handle and you can try to "pull your foot out" of the asphalt.
There are two things to be noticed here, the first is that if you pull lightly it will come out slowly but steadily. I doubt asphalt has a single viscosity, more likely it's a shear-rate dependent fluid. At very high shear rates, like impacts, it's elastic. It's also highly temperature dependent which doesn't enter into this issue because the temperature is controlled in the museum.
The other thing you obviously notice is that if you try your hardest to pull out the cylinders, the narrow one comes out much easier. This isn't too surprising, the thicker cylinder mobilizes much more of the asphalt. But one thing they fail at in this exhibit is to make the exhibit dimensionless. The relevant question is how much force does it take to pull your foot out, as a fraction of the force that an animal that size could generate? Since this is a rate dependent issue, you'd also like to relate it to how quickly the animal normally moves its leg. I imagine it could easily happen that an animal like a deer gets trapped, and it has a fast-twitch muscle and it tries to just pluck its foot out. This rapid application of force causes the asphalt to lock up like a solid and trips the animal leading to it falling down and becoming stuck. A large mammoth lumbers along normally, when it gets its foot stuck it pulls steadily with a mighty muscle and slowly frees its foot. It's also much more stable, so less likely to fall over. So even though we perceive that the deer, with its 4cm diam leg might have had an easier time based on our experiment, it may have been the other way around.
One way to address this would be to plot force necessary to remove the leg at a rate appropriate to the animal's normal leg movement rate and as a fraction of the animals weight vs diameter of the leg as a fraction of the diameter of a large mammoth leg. The x axis would go from 0 to 1 and the y axis would show you who had the easiest time removing their leg. To do this you could start by solving equations of motion for the asphalt flow axisymmetrically around the cylinder. This is a 1D model in the radial direction and should be not too hard. As a first approximation a Newtonian tar could be used. That's relevant to high temperature asphalt but as I mentioned above at cold temperatures you'd need to take rate dependence into account.
Christine Shoemaker came and talked about a method she has that essentially is one way to answer the question I posed recently "Where do Likelihoods Come From?" For her problems the model is a very expensive computational thing, like time-stepping a PDE for the concentration of pollutants underground. The application I'm interested in is likelihood functions in MCMC, though most of her talk was about optimization of mean sum of squared errors, she did talk towards the end about applications to MCMC.
Her technique is to evaluate the model prediction using the expensive simulation function, let's call it
where
is the set of parameters, then compare it to data
to get a posterior evaluation
. You do this for some set of points, and construct an approximate function
using Radial Basis Functions. Then you identify a high probability region of the parameter space
using optimization techniques on this
function, and you jump around evaluating
for
values in this high probability region to improve your
RBF approximation until you're happy (you're not doing MCMC yet, you're just trying to get a good interpolation of the posterior density) once you are happy with the
approximation you can start at some point in the high probability manifold, and rapidly sample
in detail using MCMC without running an expensive numerical simulation. Some of these numerical simulations might take 30 minutes to 10 hours for each simulation, so it's clearly impossible to do MCMC very well on the raw numerical model.
This will give you the approximate distribution of the posterior parameter values of interest, and finally if you're interested in prediction you can sample from these posterior parameter values, and run your expensive model a few times on the high probability parameter values to get predictions about the future.
She found speedups of around 20 to 60 times fewer evaluations of the full numerical simulation for her method, and I would say this is one of the most important talks I've heard in a long time since it makes it practical to do Bayesian MCMC on the kinds of models that Civil and Environmental engineers are really interested in, things like timeseries for pollution transport, flooding, building earthquake response, etc. Since the RBF function is easily differentiable it would also be very easy to do HMC on such an approximation, whereas not so much for raw evaluations of the numerical simulations.
For those who want to play around with the falling ball experiment and fit their own models or reproduce mine, here is the Ball Analysis R code, with some comments, to analyze the drops and balls data I posted earlier. WordPress won't let me upload a file with extension ".R" so I changed it to ".txt" just rename the file when you download it. (note, I know I can edit the wordpress installation to allow ".R" but I'm lazy)
Well, the results are in, and we have a nice set of preliminary predictions for the falling ball problem (a more massive MCMC run would be in order next). They also show some very nice advantages to a Bayesian analysis of the problem. In the above graph are measured fall times (black points) predicted fall times without drag (black curve), a linear fit to the raw data (grey line/area), corrected fall times including imputed timing bias (red points) and predicted fall times from the ODE model for the initial conditions and the aerodynamic radius. Bayesian uncertainty not shown (I used the mean values from the MCMC runs).
One of the nice aspects of this analysis is how prior information helped determine which sources of uncertainty were important. For example, we could obviously fit the original data pretty well if we had no drag and a little extra gravity (imagine the black curve a little lower). In fact fitting the model
> linmod1 <- lm(ts ~ 0+I(sqrt(2*h0cm/100)),data=drops); > 1/coef(linmod1)^2 I(sqrt(2 * h0cm/100)) 12.15669
But gravitational acceleration of 12.2 m/s^2 is completely out of the question, we know thanks to others work that g in LA is within about 0.01 of 9.796 and we know from dropping something denser than a paper ball that the paper ball takes longer due to drag, maybe around .1 or .2 seconds longer for these heights. We could include this timing bias into the model easily:
> linmod2 <- lm(ts ~ I(sqrt(2*h0cm/100)),data=drops); > 1/coef(linmod2)[2]^2; I(sqrt(2 * h0cm/100)) 7.101245 > coef(linmod2)[1] (Intercept) -0.1371857
But although a timing bias of 0.14 seconds is not unreasonable (the Bayesian method predicts 0.275 s), g=7.1 is also completely out of the question. This reflects the fact that the shape of the point cloud isn't like a
curve because of drag (it's more linear because we're transitioning to a constant terminal velocity by the time we hit the ground).
To include the ODE predictions in the least squares type model we need to first run the ODE predictor for each drop. To do that, we need some value for the aerodynamic radius, and we need to know the initial conditions, neither of which do we actually know. Oh sure, we could assume
is pretty close to half the average measured diameter for example, but the linear model needs just one r value and we are not comfortable saying that r is any definite quantity based on the variability in size and shape, so we're going to need to do "sensitivity analysis" on r. Also, we know that there's some kind of uncertainty in the initial height, oh sure I tried to hold it very close to the height requested, but the shape isn't a sphere with a precisely defined edge, it's a crumpled ball of paper. Plus there's the
and
values for air, fixed during the experiments, but not known exactly due to temperature and humidity and soforth. So again, we are going to have to run "sensitivity" analysis on all of these things. How will we choose the heights and radii (and air properties) to use in our analysis? Will we, as Gelman rails against, invert a hypothesis test, and accept any r,h pair which we can not reject? That will probably be a pretty big range considering that our above analysis is willing to put g between 7.1 and 12!
The Bayesian analysis goes one step further than the frequentist analysis in allowing us to specify that some values are more reasonable than others, namely g tightly coupled around 9.796 and
somewhat less tightly coupled around their nominal values, so that we expect more uncertainty in
. Furthermore, if we like we could couple these values together. A ball with a more irregular shape will be harder to hold at an exact height for example.
So how about calibration? One of the complaints from Frequentists is that predictions from Bayesian analysis do not necessarily have well calibrated probabilities for reproducible experiments. For example if I took my fitted model and predicted fall times for each ball for a new fall height, and then ran the new experiment a bunch of times to get a lot of data, would the distribution of measured values be pretty close to the distribution of predictions from the Bayesian model? Well, it's hard to say without running the experiment. But my intuition says that after all the sensitivity analysis and inversion of hypothesis tests and soforth, any Frequentist confidence intervals would be pretty large, what with having to include different g values and r values and soforth. And if you are including different values for
but not assigning them prior probabilities, what do you do for confidence intervals? You'll get one for every different value of
so it's not clear which one to evaluate for advertised coverage.
All this is to say, I use Bayesian methods because they answer the question that I tend to want answered. Namely, if I can make some reasonable assumption (to me based on my general knowledge) then Bayes theorem will give me the resulting weights to be placed on different values for quantities that I do not know exactly (things like aerodynamic radius, fall height, air density, timing bias, timing variance etc).
One question that I don't know how to answer is how to deal with calibration issues in the absence of an enormous validation sample. But this is a problem I have no matter what. When the sample size is enormous we can be pretty sure about our predictions. When I have 30 drops for 10 balls and I predict the future from that, I'm only going to know how well calibrated my predicted fall time distribution is after I go and do a whole bunch more drops. Of course then I'd want to update my model with those additional data points, and I'd have to collect even more data to see how calibrated I am... ad nauseum.
Above is a plot of the height vs time for a sphere falling through air both predicted (black) and drag free (red). As you can see by drawing a horizontal line at h = 0.0 there should be a significant effect, 50% longer fall time or so. t=1 corresponds to 0.534 seconds (shown in the title). On the other hand, the actual data shows nothing like these fall times. In fact thae data shows t' values < 1 which suggests maybe the Android based stopwatch + user reaction time / anticipation is a big issue. Uncertainty abounds, but it's not really that interesting uncertainty (we know the time measurement method is crude). If we want to find out about drag we may very well need much more accurate time measurements, or drops from higher heights.
Rama suggested using an x-box kinect sensor to get fairly precise distance vs time data which would help you because velocity is more strongly affected. In any case I will reconsider the experiment a little. The results as-is are still interesting when it comes to philosophical aspects of uncertainty. For example,
is less than 1 for most experiments. This either means that
was significantly less than
, that
or that there was a bias towards stopping the stopwatch earlier. Bayesians will have no trouble putting prior distributions over the various options. Since the tape measure and gravitational constant are relatively well calibrated, I certainly suspect that the timer was biased low. This might be caused by the timer program itself (attempting to account for a delay in response it might subtract a fixed delay estimate for example), or by the experimenter pressing the start button late or the stop button early or both.
So, we'll have to see what happens with this data. perhaps it's overwhelmed by time measurement noise, or perhaps we can discover some effective radii even with the measurement noise.
Attached data: drops and balls which contain the timing data for 30 drops, and the mass and size data for 10 balls.
From my previous article proposing a simple experiment of dropping a ball that will experience drag and then doing inference about the unknown quantities based on a mechanical model that will induce a likelihood: Here's an update on the model and experimental protocol. I ran the experiment today and the data is posted as text files above.
We started with Newton's Laws for a falling sphere with drag:


and the initial conditions
where
is the measured initial height and
accounts for the difference between measured and actual.
It makes sense to nondimensionalize every physical equation. We will define new variables:
where
is the nominal "standard" value for g, defined as
. Substituting for the variables
and solving for the new equations in
gives us:


These are dimensionless equations (choose a system of units, such as SI, and calculate the units of any term, and you will see they all cancel) and it emphasizes that we don't need to know exactly each constant. Ultimately we need 3 dimensionless ratios, some of which are exact and for a frequentist are not random variables at all (they are fixed in every experiment):
. We are modeling
as
and we have that
is a dynamic variable that our deterministic model will predict. The uncertainty comes in the variables
which is 6 unknown variables, however we only need to infer 4 quantities thanks to our nondimensionalization
where
is called the "kinematic viscosity".
Next, we need to consider the differences between our predictions from the ODE model and the measured fall times. From the ODE model we have that the fall time is
and given values for the constants, we can estimate this essentially exactly (ie. 8 or 10 decimal places, which is enormously precise compared to other aspects of our problem). However our measured time will definitely be different from this predicted time. The dimensionless version of the measured time
will differ from the predicted amount by a "reaction time" error
. If an experimenter flubs the measurement entirely (forgets to press start, misses the end time and lets the timer run for several extra seconds, etc), they will redo the experiment rather than recording it, so they will only accept values that are reasonably tightly clustered around something "true" and hence a normal random variable is a reasonable model for
. Alternatively if they might record several minutes of fall time due to letting the stopwatch go on accidentally, we might prefer to use a complicated mixture model or something with fat tails, like a t distribution. However even if we use the normal model, the bias and variance in this procedure are unknown. Having approximated the experiment a few times we know that the fall time will be on the order of 0.5 seconds, and that the fastest we can double-click the stopwatch is on the order of 0.1 seconds, so we might expect that the bias is some number whose size is on the order of 0.1 seconds and might be either positive or negative, and the variance in the measurement error should be on the order of 0.2^2 or less. For a Bayesian, we can define hyperparameters
and
, and a reasonable prior for them is something like
and
where here the 0.1 in the Gamma distribution is the scale parameter of the Gamma, (sometimes people parameterize in terms of a rate = 1/scale).
Now for more specifics on the data collection: create two sheets of paper, one with headings:
Ball Mass D1 D2 D3
Where you record info about the balls, namely the ball number, the mass in g based on weighing on a scale, and three approximate diameters (lay it on top of your tape measure and read off something that seems reasonable, record it in cm). On the other sheet of paper we have:
Ball h0 t
Where we record which ball we dropped, the nominal height we tried to drop it from (generated randomly in a range that you can easily read without standing on something or stretching too much). Record these values in cm and seconds. For analysis we'll convert distances to meters to get consistent units.
At the beginning of this post I have links to data files that I generated thanks to Rama and Evangelia for helping me collect the data. BTW the temperature in the room was about 72 C 72 F plus or minus a degree or so, that's useful for looking up the kinematic viscosity of air to get a prior for that parameter.
Truthfulness, nondeterminism, epistemic and aleatoric randomness and Bayesian models (part 1)
Two interesting posts from opposite directions led me to write about the structure of Bayesian models of mechanistic processes, and the nature of uncertainty. The first was Josephs blog post about the "truthfulness" and "information" criteria in Bayesian models, and the second was Mayo's post about statistical ontology. I think it's important to understand what it means to be a good Bayesian model, and how that connects with frequentist concepts of statistics, as well as what it means to be a random variable, and how some random variables are different from others.
To illustrate the issue, I'm going to posit a very simple physical experiment that anyone could actually do, and then talk about an analysis of this repeated experiment using a mechanistic Bayesian model, and the role that randomness plays in the model, and what it means for this to be a good model.
The experiment goes as follows, get a pad of note paper, a 2 meter stick, a stopwatch, and a partner. Tear off a sheet of paper, crumple it into a ball. Generate a random number between 1 and 2 using R. Hold the ball of paper with its bottom at that height on the 2 meter stick. Count 1,2,3 and let go of the paper. The partner will start the stopwatch on 3, and stop it when the paper hits the ground. Record the height and the time. Repeat the experiment say 20 times. Weigh the 20 papers together on a sensitive scale and record the total weight. Measure the temperature and the barometric pressure in the room if you like, or get the pressure from a weather site online and measure just the temperature. Or whatever you can accomplish, note that a room thermostat may be available for temperature measurement.
Now we'd like to build a model of what happened that takes into account what we know: First off, Newton's laws including a decent model for drag:
and the initial conditions,
and
.
We can't integrate this directly via a nonlinear numerical solver (like say runge-kutta-fehlberg) because we don't know
exactly for our current location, the drag coefficient function
, the actual mass of any one piece of paper or the appropriate dimensions
of the crumpled ball, or the density
or viscosity
of air. However, some people have provided us with scientific background information. We can look up a table of information about air. From this, and the ideal gas law, and a range of reasonable estimates for the water content, we could build a model for the density of air. A reasonable model would be to put in what we think of as the best guess values for water content, and then allow a normal distribution around this value with some variance. This is a Bayesian probability distribution as it expresses an unknown aspect of a fixed value. If we're indoors and do the experiments in close time succession, it's reasonable to expect that the density doesn't change between experiments.
Similarly we can get the viscosity of air using the calculator that wikipedia suggests, and since we don't know the exact water content we may need to include some unknown correction, again a Bayesian model for a fixed but unknown value.
Most people will not even consider the variability in
based on location, unless they've taken courses in geophysics or geodetics, but it's not entirely trivial. Where I am near Los Angeles, Wikipedia's table says it's about 9.796 but values vary from about 9.76 to 9.83 so it would be reasonable to give a prior distribution for
of
if you're in Los Angeles. It might be reasonable to expand the variance to say 0.04^2 if you didn't know much about your location, altitude or other factors.
Finally, we have
the drag coefficient function, and
that describe the shape and size of the ball of paper, both of which are unknown, and both of which actually do vary from experiment to experiment. In this sense, they have frequentist type frequency distributions. A crumpled ball of paper does not have a single
value in fact. We might like to define
and
together. Let
be the drag coefficient function for an idealized sphere of the same weight as the piece of paper, then let
be the value which predicts the same outcome for the travel time of the ideal sphere as the actual travel time of the real paper given
. Note that this definition of
will depend slightly on the height from which the paper is dropped, an effect we will include in "model error". Furthermore, the function
for an idealized sphere is itself not perfectly well constrained by data, and the ball of paper is not a sphere. These are additional modeling errors. Perhaps we would like to add a perturbation function so that
and we express
as a series with random coefficients. Perhaps we would like to express our knowledge that the experiment in this range of values should be pretty well approximated by a sphere by placing prior constraints on the coefficients of
.
Finally, there is the measurement process. Aligning the ball with the correct height, starting the stopwatch, and stopping the stopwatch all have fluctuations associated with them and these are real "frequency" type fluctuations. It's not just that the error is the same every time but unknown, it's both unknown, and different every time. We would like to create a statistical model of these measurements, and a typical one to use is a normal distribution around the measured value with mean 0 and variance something or other, possibly we may use hyperparameters for the variance if we are Bayesian. Then we have Bayesian probability distributions over the statistics of a frequency distribution.
Ideally, from a Bayesian perspective, I would now like to build a computer model of this experiment which will give me an approximate likelihood function. Namely,
where
is the vector of measured initial heights and fall times and
is the actual heights, and ... indicates that we may possibly have other parameters such as the hyperparameters on the measurement error model. To do this we will take the values as given, integrate the equations of motion through time until we predict that the paper hits the ground, get a predicted time, and then use the measurement error model to figure out the probability of the discrepancy. Then, we can use MCMC to do Bayesian inference on the values of
etc.
This blog post is long enough now that I will come back to the topic soon and see if we can make progress on understanding the various roles that randomness plays, and how to deal with likelihoods induced by numerical simulations.
(NOTE: having performed this experiment a few times it seems that the fall time will be in the range of about 1/2 a second for most cases, it is therefore useful to have a stopwatch that reads to the nearest 0.01 seconds, not because your reaction time is that good, but more because the roundoff error aspect is less important compared to a more continuous reaction time component. Apparently Stopwatch by Chronus is an Android stopwatch that promises 1/1000 second accuracy.)
(NOTE2: it seems prudent, for identifiability issues, to drop each balled paper from say 3 different heights, this lets us get information about the drag function since the shape and size will be constant for the given piece of paper).
Sometimes I feel like the prototypical Kuhnian paradigm shifted scientist when reading or responding to Mayo's blog on Error Statistics. Recently there was a bunch of stuff about the likelihood principle and invariance under changes to the data collection protocol, and I got into a long involved discussion there. One of the major "talking past each other" issues seemed to be that apparently Mayo maybe talks with Bayesians who can't believe that their likelihood could ever be wrong (or something, I really and truly am not sure). My argument was that model building is about deciding how to describe the process that generates the data, and I didn't see how it was relevant to analyze a bunch of data using a Bayesian model that doesn't take into account the fact that partway through the experiment you changed your mind about how you were going to collect the data.
So I'd like to give an example here where obviously data collection matters, and how you might in broad terms, go about modeling the whole thing as a Bayesian. This example is informed by my work in Forensic Engineering but doesn't have any actual real data content (I'm going to make the numbers up for example).
Imagine some lawyers have decided to sue a window manufacturer and window installer for defective windows which leak during a standardized window test. There are say 200 homes built in one development, and each one has 15 windows. Somehow the lawyers recruit 100 of these homes to participate in the lawsuit, and the plaintiff experts recruit 20 of these to participate in the intrusive process of testing the windows. On the first day the testers choose, via some unknown process 6 houses to test, and among those 6 houses they choose via some unknown process 3 windows each. Of the 12 windows tested on the first day they find 4 that leak during a 15 minute intensive spray test. On the next day they choose 6 more houses and test 12 more windows chosen by some unknown process and find 5 that leaks, and then on the third day they decide to call off testing for some unknown reason.
Summary statistics: Out of 200 homes built, 100 agreed to participate in the lawsuit, out of 100 homes in the lawsuit 20 agreed to be tested, out of 20 houses to be tested 12 houses actually tested, 180 windows might potentially have been selected among those 12 houses, 24 actually selected, 9 leak, 15 do not.
The stupid analysis of this is to say that the probability of leakage is a constant across all windows, and maybe using a beta(1,1) prior you decide that the posterior probability of leakage is beta(10,16) which has a mode at 9/24 = 0.375
But there is no reason to believe that the probability of leakage among the windows tested is necessarily the same as the probability of leakage among all windows on the property. The testers have many degrees of freedom in choosing the windows, and they only tested windows among people who were already dissatisfied enough to participate in the lawsuit and even dissatisfied enough to volunteer for the testing procedure. A more reasonable model would be to say that the probability of leakage is related to some observable facts about the windows, and about the houses, ie. which exposure they have, whether there is cracking in the stucco outside the windows, whether there is evidence of water staining on the interior wall, whether the windows are installed straight or have a skew or warp in the frame, etc.
Then, based on classifying the tested windows, we can fit maybe a Bayesian logistic regression based on a score generated from the covariates, and finally we can demand to visually inspect a computer-generated random sample of windows among all the 100 houses and generate scores for them. Based on our additional data, we can now fit a fuller Bayesian model which predicts the overall probability of leakage given a model for the distribution of the covariate score, and the logistic regression for the probability of leakage given the covariate score.
Both models are Bayesian, both obey the likelihood principle in that they use their likelihood function and some priors to generate posterior distributions, and they don't use any other type of inference. One model is insensitive to the data collection process, and one model knowingly models what are thought to be important aspects of the data collection process. Finally one model incorporates more data because the modeler did not believe that the originally collected data were informative due to the unknown bias in the sampling method.
Is the second model "non Bayesian"? Does it "fail to respect the likelihood principle?" no. it's just doing the good old fashioned hard work necessary to really and truly model the real process.


