So, I've been discussing declarative vs generative models. Here's a case where I think the declarative model makes good sense, how would you model this?

y is a vector of 1000 timeseries samples from some audio system. On one end is a highly sensitive microphone that puts out a voltage signal proportional to the instantaneous pressure gradient. The voltage is in the range [0,0.01] volts and is known to have a slew-rate limit of at most 0.005 v per sample and a bias of 0.005 +- 0.0006. A pre-amplifier increases this signal by pretty close to 100x but has a response function$f_{pre}(v) = A \mathrm{atan}(x/0.01)/\mathrm{atan}(1)$ which has a known shape over the input range [0,0.01], is not perfectly linear, and this shape is multiplied by a number A slightly less than 100 to get the actual transfer function.  The pre-amp output picks up radio frequency noise, which can be modeled as normal(0,0.015), and the result is converted via 10 bit A/D into the range 0-1023, with 1v in to the A/D implying 1024. This digital signal is sent down a noisy channel, so that only some of the samples are received in digital form. Additionally, as a compromise to improve reliability and due to lack of equipment, the signal is re-converted to analog via a 12 bit D/A outputting in the range 0-10v, which has a known noisy bias signal due to a failing diode that can be modeled as independent exponentially distributed voltages with mean value close to 0.05. The result is sent over a long cable of around 1000 feet, where additional normally distributed white noise is added to the signal whose RMS amplitude is less than 0.01 v. The long cable has about 1 ohm resistance per 100 feet +- 5 % as well as a resistance of less than 10 ohms at both the input and output connections. The whole thing is put into a 16 bit A/D operating on inputs in the range 0-10v with an unfortunately low input impedance of close to 1000 ohms +- 1 %. The 16 bit A/D is sufficiently high resolution that y is given as a floating point number equal to the digital value on 0-(2^16-1) / 2^16 and assumed to be continuous with an additional conversion noise and bias which is approximately gamma distributed with k=3 and rate = 1/0.003

Reconstruct the original signal y_true out of the microphone as best you can using as much of the information available as possible, while also giving best estimates for the length of the wire, the total resistance of the cable, the amplification factor for the pre-amp, and the input impedance of the final A/D.

The "generative" picture here is this (more or less, we're ignoring low-pass filters that are always present on the d/a and a/d steps):

y_meas = round(((round(1024*A*f_pre(y_true)+epsilon_rf)+da_bias_noise)*10*
((rperlen*Len+conn_imp)/(rperlen*Len+inpt_imped+conn_imp))+cable_noise)/10*2^16)/2^16+ad16_noise


And, in addition, we have some of the digital values round(1024*A*f_pre(y_true)+epsilon_rf) let's call the ones we have y_digit, which is a sparse vector and t_digit which tells us which of the y_meas the y_digit correspond to.

So, if you wanted to write this in a generative form, with only data or only parameters on the left hand side of the probabilistic statements, in Stan, what would you use to define the likelihood function for y_meas or y_digit given all the parameters, y_true, epsilon_rf, da_bias_noise... etc?

For the full analog / digital / analog / digital / analog path, You could always invent some extra roundoff error parameters, and write

y_meas - (....)  ~ gamma(3,1/0.003);


where (...) = the whole mess where round(x) is replace with x+roundoff_i and roundoff_i is given the appropriate prior. But then you'd be writing a declarative model with a nonlinear function of the data and the parameters (consider the resistances formula) on the left hand side.

Stan doesn't have a shifted gamma, but if it did, you could use the shifted gamma distribution and place the round function in the location parameter, however, you will probably induce bad behavior due to the non-differentiable jumps.

The point of the declarative version is that if you were given the actual values of the parameters other than ad16_noise, then subtracting y_meas - (...) would give you an expression equal to the noise sample, and the likelihood for seeing that sample conditional on all the other parameters, is the gamma distribution.

If you have the wrong values of the parameters, you won't get something that's gamma distributed, so you'll be in the low probability region of the gamma distribution function, so you'll downweight that portion of the parameter space... that's all you need to get inference on the parameters, that you find reasons to reject some of the prior space because it doesn't fit the facts you know.

Suppose you are in a Kindergarten class, and you have a bag containing 100 mixed nuts, Brazil, Walnut, Almond, Hazelnut, etc.

Now one of the children reaches in, pulls out a nut, and tells you what kind it is. How much does it weigh?

Let's make the assumption that the nuts remaining are not radically different from the one chosen, at least for that type.

Consider the bag of nuts as a prior over the weight of the nuts. It's the range of weights that we consider reasonable. Our data is the kind of nut...

Repeatedly reach into the bag of nuts, and pull out one nut. If the nut is the wrong type, then you know it's not like the data, so throw it back. This is the likelihood, it down-weights parameter values that don't fit the data. If it's the right type, weigh it (this is a sample of the parameter, the unobserved weight) and write down the weight. Once you've done this so that you have measured your type of nut many times, then build a histogram of the weights you wrote down.

The nut your child is holding probably has a weight that is near the peak in your histogram, and the weight probably isn't outside the range of 95% of the weights in your histogram. This is the Bayesian way of thinking in a nutshell.

In my posts about assigning distributions to functions of data and parameters, I mentioned the tried and true example of trying to apply a distribution to a nonlinear function of a parameter:

log(foo) ~ normal(0,1);

In Stan at least, this does NOT imply that samples of foo have a lognormal frequency distribution, for that you have to take into account the differential compression that the log function applies to the dfoo intervals. This is always true when you transform a particular set of n parameters through nonlinear transforms into n other parameters (linear transforms imply a constant multiplier correction which is then normalized out) the differential volumes transform according to the determinant of the Jacobian of the transform in the vicinity of the volume interval.

But, why is the same thing not true when we do:

lfoo ~ normal(0,1);
foo <- exp(lfoo)

which, if you write those statements in the appropriate places (model, and transformed parameters) in a Stan program, will give you a sample foo from a lognormal distribution? (alternatively you could sample lfoo, and exp transform in R after you grab the sample).

And, when you make a statement like

f(data,parameter_vector) ~ distribution(other_params)

The parameter_vector may have many elements, and therefore, there is no Jacobian of the transform (specifically the Jacobian isn't a square matrix and so has no determinant). But even if you are using only one parameter, and could calculate the distortion of the differentials, would you want to?

For the first question, about the samples of lfoo and their exponential, the answer is that we're transforming fixed numbers that Stan spits out which have the proper normal(0,1) frequency distribution thanks to Stan ensuring that fact in the lfoo space. If we specify instead that foo is our parameter, and log(foo) ~ normal(0,1) Stan is trying to enforce a frequency in the foo space, it calculates p(log(foo)) dfoo and says this is the frequency of samples in the region dfoo. If this is what you meant, then FINE, but if you meant for foo to have a lognormal distribution, you need to match the space in which you're calculating the density to the space in which Stan is calculating the probability: so you need to specify log(foo) ~ p(log(foo))dlog(foo)/dfoo

For the second question, about whether you'd do a transformation when you say f(data,params) ~ distribution(other,params), the answer is no, you don't want to transform anything, and the reason comes from the mathematics. The statement about f(data, parameters) is a conditional probability:

p(f(data,params) | params) = distribution(other_params)

This isn't a statement about the frequency of "params" in your Stan sample, since "params" is a given vector (it's on the right hand side of the vertical bar), it's a statement about where the f function values should be if the params are correct... and since the parameters are the only things that change during the simulation process, not the data, it's ultimately a statement about which params values result in transformed data -> f values that you consider probable.

Like a likelihood that is generated by an iid sampling assumption, the best way to think about this statement is that it's actually a weighting function for the prior distribution over the parameters. We start with a basket of possible parameter values called the prior, and we reject any values (downweight the density) of the parameters which result in f(data,params) being in the low density region of the given distribution. In the same light, we could write

data ~ distribution(param1,param2)

which is a generative statement of the likelihood. This is a special case, it corresponds to:

id(data) ~ distribution(param1,param2)

where "id" is the identity function $f(x) = x$ and says that we should downweight the priors over param1 and param2 whenever the data values themselves are in a region of "distribution" which is improbable given the parameters.

Restricting yourself to generative models is a lot like restricting yourself to separable equations.

solve for $x$... sorry no can do. The best we can do is invent a name for a function which is the solution to this equation, and come up with a way to calculate it. Incidentally, the function is called the "Lambert W" and $W(y) = x$.

Similarly, if we have a mathematical model in which we can specify the probability distribution of parameterized transformations of the data, we can either write down that fact:

roundoff ~ uniform(-0.5,0.5)
(x+roundoff) ~ normal(mu,sigma);


or we can invent a special likelihood to put it in a generative (separated) form:

increment_log_prob(normal_cdf(x+0.5,0,1)-normal_cdf(x-0.5,0,1))


But in many ways enforcing the "generative" style will induce you to work with more limited sets of models, because sometimes the likelihood you need to invent is going to be very complicated, and that's particularly true when the left hand side involves several parameters. For example if we know a little more about some of the roundoff errors, we'll need to do a weighted integral of normal_pdf weighted by the information we know about roundoff. There is no hard reason why you need to specify your model in this way. In my opinion, instead of trying to work your model into a form where you can say

data ~ my_distribution(other,parameters)


You should think hard about what you know, and make appropriate probability statements. If that results in

f(data,params) ~ my_distribution(other,parameters)


then so be it. Just be aware that your "fact" needs to be both a true (approximate, probabilistic) fact about the world, and informative. If there are a LOT of different parameter values that can make the left hand side be in the high probability region of my_distribution, then you won't be downweighting much of the parameter space... and you won't find out much about your parameters. The ultimate version of this is if you choose a function f(data,params)=constant. Then "params" will be totally uninformed by this statement.

You might well ask, in all this declarative vs generative modeling discussion, why would you want to put distributions on expressions involving data and parameters?

And the answer I'd give to that is that sometimes it's more natural. For example, with the roundoff error in the laser example (data rounded to nearest 1/4 wavelength) the alternative is to write a likelihood on a discrete grid of rounded results. for example if you know that x is rounded to the nearest 0.25, then its underlying value fell somewhere between -0.125 and +0.125 relative to its reported value, in Stan:

increment_log_prob(log(normal_cdf(x+0.125,mu,sigma)-normal_cdf(x-0.125,mu,sigma)));

But, what if you knew not just that it was rounded, but for example you had some second measurement which, with noise, gives you some additional information about the underlying value for some of the measurements? Now, to do this above model you have to calculate a weighted integral of the normal_pdf instead of an unweighted one (which is what the normal_cdf is).

Alternatively, you could put a prior on the roundoff error, and express your probability over the result of subtracting the roundoff error. For example suppose in this particular case, the underlying value was above by about 0.08 (but no more than 0.125)

real<lower=0,upper=0.125> err;
err ~ exponential(1/0.08);
(x-err) ~ normal(mu,sigma);


If you marginalize away the variable err, by calculating

when $p(\mathrm{err})=\mathrm{uniform(-0.125,0.125)}$ you get the above model with the normal_cdf, but if you use the informed prior, you get a model which is, marginally, a weighted integral of the normal_pdf over the range [0,0.125]. Good luck calculating that in the likelihood, it will require doing numerical integration in the increment_log_prob statement!

But, why else would you want to do this? Another reason that I can think of is that often, at least in the physical sciences, we can decompose the error in to a variety of sources. For example, in the laser molecular measurement example we had two sources: molecular vibrations, and roundoff. But in a moderately famous neutrino experiment from a few years back, you had what people thought were faster-than-light neutrinos. It turned out if I remember correctly that this was due to an error in the model for the errors! Specifically, if you have a signal traveling through the earth at the speed of light, and then hitting a detector, and the detector is connected to an amplifier, and the amplifier is connected to a signal processor, and the signal processor is connected to a computer network, and the leg-bone is connected to the hip-bone... Then at each stage there is some kind of systematic error in the signal which you can model via various functions and prior information. But at the end, you're expecting (but not requiring!) that the signal couldn't have gotten there faster than light.

(signal_speed - my_complicated_error_model)/speed_of_light ~ normal(1,0.00001); // a little remaining statistical noise

or suppose that from previous experiments, you know the noise is a little skewed or something

(signal_speed - my_complicated_error_model)/speed_of_light ~ gamma(appropriate,constants); // a bit non-normal noise

The fact that is salient, is that after removing the systematic errors, the remaining result should be just about exactly the speed of light. If, however, you use a highly informed prior for some portion of the error model, you can wind up overwhelming this distribution, and getting samples that predict faster than light. If your error model is correct, then you've discovered something! and if you are less certain about your error model and express that uncertainty appropriately, then the knowledge you have about the speed of the signal being almost exactly the speed of light will help you discover the true value for that component of the error.

Of course, in most cases you could write these models in a generative way. But we've seen in the above example the generative model might very well require doing weighted numerical integration in the likelihood function. The distribution that is implied by the above statement on the signal speed could well be extremely strange, especially if my_complicated_error_model has all sorts of informed priors on multiple parameters involved in it.

Remember, the distribution of a+b is the convolution of the distribution for a and the distribution for b.

Suppose your error model has multiple terms, each of which is an informed prior:

a-(b+c+d+e+f+g) ~ gamma(q,p);

This means that the appropriate convolution of all those informed distributions for b,c,d,e,f,g with the distribution for a, is a gamma distribution with some constants q,p you've gotten from previous experiments:

convolution_of_a_b_c_d_e_f_g = gamma_pdf(q,p)

What does that mean for the probability of a? Good luck writing that down in a simple likelihood. Ok, so perhaps you can figure it out, the point is that if the only reason you aren't willing to put a distribution on the left hand side is that it's not generative... then you're wasting a lot of time for a requirement that isn't a real requirement! Much better to simply state the fact that you know, and let Stan find the samples for a.

In programming, there are a variety of "paradigms" which languages fall into. The most well known is the "imperative" style of language, typified by C/C++, Matlab, Java, Python, etc.

There are also "functional" languages, like Scheme, Lisp, Haskell, and soforth.

Another genre is "declarative" languages, like Prolog, SQL, and a few others. Other than SQL, these languages are somewhat more rare for the average programmer to have experience with.

Some languages are Object Oriented, though typically within OO languages they are usually imperative in style.

Finally, many languages have aspects of all of these, for example R, Python, and to some extent C++ all support many aspects of functional paradigms, Common Lisp supports functional, object oriented, imperative, and declarative (through certain macro packages, where Prolog is built inside Common Lisp).

The distinction I'm trying to make here is between Imperative, and Declarative languages, and how that relates to building statistical models in a Generative vs Declarative style (related to my previous posts on declarative models) so it's worthwhile to think about how they are different.

In an imperative language, you tell the computer what to do so for example, you might tell it to loop over an index i, and add 1 to all the elements of an array, in R:

for(i in 1:10) {
my_array[i] <- my_array[i]+1
}

In a declarative language, you tell the computer what the result you want is, and it figures out how to achieve it, in SQL:

select my_value + 1 from my_table where my_value > 0;

There is a proof (i'm told) that prolog, a declarative language, can calculate anything that C or another imperative language can, with at most logarithmic extra space. So, while SQL is a very limited declarative language, the declarative paradigm is fully general (Turing Complete).

Now, it's my opinion that this distinction carries over just about 1-1 when talking about building Bayesian statistical models. There are two basic paradigms:

Generative models:

In a generative model, the model is created as if the data and the parameters were the outputs of certain random number generators (RNGs). Michael Betancourt put it well in a discussion on the stan-user mailing list:

The generative picture is
1) generate truth, myfunction(x,a,b)
2) generate noise, epsilon ~ normal(0, sigma)
3) generate measurement, y = myfunction(x,a,b) + epsilon
This is a very "imperative" view of how you might specify a model. You can imagine coding it up as "sample the parameters a,b from the prior", then "calculate myfunction using the parameters" then generate random perturbation from the normal distribution, then say that "that's how y was created" (note, this last bit is almost always false, unless you're trying to experiment with your model based on simulated data).
This paradigm has some advantages, one of which is that you can easily create fake data and see how well your model fits the real data.

But, it's not the only way to think about Bayesian modeling. The alternative is more declarative, or since Bayesian models are a generalized logic, you can think like a Prolog "logic programmer"

Declarative models:

Lay out all the facts that you know:

a is between a_low and a_high with high probability, and is most probably around a_typ;

a ~ normal(a_typ, (a_high-a_low)/4); // or something similar

b is a positive number less than 10;

b ~ uniform(0,10);

The difference between the data and my predictions using myfunction is less than about 2 when the correct a, and b are used:

y-myfunction(x,a,b) ~ normal(0,2);

you can of course, elaborate, as I did in my model for simultaneously measuring a molecular length, and calibrating a set of lasers.

You're free to encode approximate knowledge of anything you like so long as its actually true knowledge (it's no good saying x ~ normal(0,3) when x is almost always > 50 in reality, unless you can get enough data to undo your incorrect assumption), and the Bayesian machinery (Stan!) will give you the approximate logical consequences of those statements.

I should say though, that it's very possible to go wrong by putting in facts that either are wrong, or by stating them in a way which is in-compatible with what you mean. For example, the well known problem with attempting to say that a parameter a has a lognormal distribution:

log(a) ~ normal(0,log(1.2));

It's well known that taking the logarithm compresses intervals of a into much smaller intervals of log(a). The degree of compression changes with the different values of a. To figure out what the distribution of a is (density per unit a), given that b=log(a) has a normal distribution per unit b, you can do:

normal(b,sigma) db = normal(log(a),sigma) db = normal(log(a),sigma) db * abs(da/db) to get a density in units of a

and $da/db = 1/(db/da) = 1/(dlog(a)/da) = 1/(1/a) = a$

Why doesn't this same type of caveat apply in the case of a declaration about a function of data and parameters? To answer that will take a little more thought and care, and will be left for next time.

So, why should we believe that it works to put probability distributions on functions of both data and parameters, in a non-generative way? (see my previous example)

Well, one reason is the success of Approximate Bayesian Computation (ABC), in that scheme typically you have a physical simulator, like for example a PDE or an ODE solver, or an agent-based model or something, and if you put in the parameters, the simulator will return you a prediction for the data..... and then, you have data. And the data never matches the predictions exactly, and there's no arguments you can make about IID samples, since typically every measurement in space is tied together by the whole simulation. So what do you do?

The typical method is to calculate several statistics of the simulation and data, and see if they fit together close enough. For example you might calculate the total kinetic energy of some ensemble, calculate the mean and stddev of reflectivity of infra-red over your domain, calculate the density of predators in a certain region... whatever is appropriate for your model, you get some kind of summary statistic (or several of them), and you compare it to the data, and you accept the parameters if the difference between the two is in the high probability region of some distribution (could be just if you're within epsilon, in other words, a uniform distribution).

This method works, in lots of examples, and it is a generalized version of what I did in my previous example. Its success gives confidence to the idea that writing "non generative" models whose whole structure simply puts high probability on parameters when some function of the data an parameters is in the high probability region of a particular distribution you've chosen.

This is a pretty powerful generalization from the typical IID likelihood principle!

Suppose there are physics labs around the world that want to find out the length of a special unobtanium molecule. They have a variety of special laser molecular traps which can read out the length of the molecule to the nearest 1/4 wavelength of the laser light. Unfortunately, they all have different lasers with different wavelengths that aren't known exactly. Fortunately for us, one lab has extra money, and has a carefully measured and tuned laser, but unfortunately they can only give us a few samples of the molecular length (otherwise, we could just as well ignore everyone else as far as obtaining the avg length goes)

One of the reasons to do this experiment is to find out the length of the unobtanium, but the other reason is to calibrate the other labs measurement apparati for future use. So we'd like to do both!

Furthermore, suppose one of the labs, in initially getting things set up, did a lot of measurements of the molecule with a highly out of tune laser (very uncertain what the actual wavelength is) and found that the molecule vibrates in such a way that any given measurement of the length is distributed around the average length by a root-mean-squared error of 0.1294795 when measured as a fraction of the mean length (they have no calibration data on their laser, so they can't give a measurement in meters!). Unfortunately, they didn't think to keep the measurements, they just calculated the mean mu and the standard deviation, sigma, and reported sigma/mu only.

What the facilities do know is an interval where their laser's wavelength lies, perhaps by some cheap but imprecise measurement apparatus, except for the one which is quite precise (without this, there is no hope to connect things to the definition of a meter).

See the examp_laser_molecule R/Rstan code (rename it to .R instead of .txt)

Note, we know that the laser measurements are rounded to the nearest quarter wavelength. This means that for every measurement, there is some number between -0.125 and 0.125 which we could subtract from the rounded data, to get the "real" data. Since we know nothing about the actual value, we give a maxent prior of uniform on that interval for this parameter.

Similarly, the labs have reported a range over which they are SURE their laser's wavelength lies. So we specify maxent priors as uniform on that range for the wavelength.

Since we know the relative RMS error in the length (caused by molecular vibration!) we specify a quantity which we know is distributed around 1 with sigma = sd_scaled = 0.1294795.

However, secretly, when we simulate the process we use a t distribution with 13 degrees of freedom, rescaled to have the appropriate sigma. In other words, the normal distribution we're using is maxent for something which has a given RMS error, but the actual underlying doesn't satisfy the normal assumption exactly.

The part that generated concern (pun intended) with the "generative modeling" crowd was this:

(l1[i]-roundl1[i])*wl1/true_len ~ normal(1,sd_scaled);

And similar lines related to the data from other labs, l2, l3, l4, l5.

I read this line as: given the rounding error and wavelength, and the true_len, if I calculate the left hand side it will look like a sample from a normal(1,sd_scaled) distribution.

To an HMC monte-carlo simulator like Stan, you could read this as: Stay in the region of parameter space where the left hand sides (all of them) look like samples from the distribution on the right hand side.

In mathematics of conditional probability:

p((l1[i]-roundl1[i])*wl1/true_len | roundl1[i], wl1, true_len, My_Knowledge) = normal_pdf(1,sd_scaled);

to get a full joint distribution, we'll need to multiply by

p(roundl1[i])*p(wl1)*p(true_len), since those parameters all have independent priors. This distribution implies various weird distributions on the l1 data, and the roundl1 parameters etc, but it DOES express a true fact about the world, and it seems to me that is all that should matter in terms of getting correct facts out from the posterior distribution. (Note: I have no "proof" of that, but from intuition on logic, if Bayesian logic is a generalization of binary logic, then if you make true statements in the Bayesian logic, it shouldn't imply false statements out the other end right?)

Note, this type of method may not be maximally informative, efficient, unbiased, or whatnot, and if you're not careful with your choice of function on the left, it may actually be very uninformative (for example, if everything you put in comes out like a normal deviate, then it can't tell you much about which parameter values you should concentrate on). If you want to do this, you need to express facts which aren't true when the parameters have the "wrong" values and are true when the parameters have the "right" values. It may be that such things are more easily understandable in say physics than in social sciences or some other areas of applications.

So, recently I was both asking questions on the stan-user mailing list, and trying to be helpful with others who were asking questions too. A topic came up where no-one was answering the question, and I thought I had a reasonable approach, it was about rounding in data reporting.

Now, my approach, which Bob Carpenter made more computationally efficient, involved creating parameters to represent the roundoff errors, and then putting uniform interval priors on them (the maximum entropy prior for a finite interval) and then running the model and seeing both what the parameters were for the underlying normal, and the parameters for the individual roundoff errors. The logical way to code this turned out to be to place a distribution over a function of data and parameters

data - rounding_errors ~ normal(mu,sigma);

this generated a lot of resistance on vague principles. Bob Carpenter later proved that the model was identical to the recommended model (marginally for mu,sigma at least), which involved writing out a likelihood using the normal CDF.

The strong resistance to this approach made me wonder, am I doing something that has pitfalls, or have I just drunk different Kool-Aid than the "generative" modeling crew?

So, I'm going to put up an example model where this kind of thing makes very good sense to me, and "generative" models seem irritatingly obfuscated... and see what the results are.

Helping the internet:

I got into a situation on Debian, doing an upgrade from before systemd where procps wouldn't install due to an error where procps was trying to start/stop itself and systemctl was returning "Failed to wait for response: Bad message" so procps wouldn't install, so udev wouldn't install so systemd wouldn't install so systemd would give the error about failed to wait for response....

solution: install sytemd via dpkg telling it to ignore dependencies on udev.... then apt-get -f install and continue my dist-upgrade.

dpkg --ignore-depends=udev --install /var/cache/...systemd... etc

Ok, the above stuff is all true as far as it goes. But do yourself a favor, just get a new Debian netinstall CD and install a brand new up to date system with systemd. Trust me, it will work better and take a lot less time and effort.

The p value is perhaps one of the most mis-used concepts in statistics. In fact, many researchers in science who are not statistics experts seem to believe that statistics is really the study of how to define and calculate p values. I'd say this attitude is prevalent especially in biology, medicine, and some areas of social sciences.

The truth is, to a Bayesian such as myself, p values are largely irrelevant, but they DO have one area where they make sense. First, let's understand what a p value is.

### p value meaning

The p value in a test is the probability that a random number generator of a certain type would output test-static data that is as extreme or more extreme than the actually observed value of that test statistic.

$P_{H0}(t(d) > t(D))$

Procedurally, imagine that your data $D$ comes from a random number generator which has boring properties that you don't care about. If your data comes from this random number generator, it would by definition, be an uninteresting process that you'd stop studying. Call this random number generator $H_0$ and its output $d$ (little d). Now consider some function $t$ which maps your data to a real number: $t(D)$ or for random generator output $t(d)$. Generally the function measures in some sense how far away your data falls from an uninteresting value of $t$, (often t=0). Now, how often would your specifically chosen boring random number generator produce fake data $d$ whose $t$ value is more extreme than the $t$ value of your actual data $D$? This is what the formula above describes.

### p value use

So, that above description seems rather sterile, here is an examples of "proper" use of a p value: filtering data

You have a large sample of 1 second long audio recordings of the ambient noise around the area of a surveillance camera. You want to detect when the overall loudness of the noise is "unusual" so that you can tag and save the audio and video recordings for 30 minutes on either side of the "unusual" event. These events will be saved indefinitely, and other time periods will be deleted after 7 days to reduce data storage requirements. You calculate an overall amplitude of the sound recording $s(t)$ using this formula for an amplitude: $A_{t_0} = \int_{t=t_0}^{t=(t_0+1)} s(q)^2 dq$ this is a real number, and its calculation from the data does not require generating random numbers, and therefore the formula is a deterministic function that maps your data (a sequence of voltages) to a real number, and qualifies as a "test statistic". Next you manually identify a set of 1000 time intervals during which "nothing happened" on your recording, and you calculate the $A$ values for each of these "uninteresting" intervals. Now, if you have an $A$ value which is greater than 99% of all the "uninteresting" $A$ values, then you know that the "A" value is unusually large under the assumption that your "A" value was generated by the "nothing happened" random number generator, in this case, the p value for the amplitude to come from a "nothing happened" time period is $p = 0.01$ because 99% of "nothing happened" samples have amplitude less than this given amplitude.

Note that this does not mean in any way that "a crime happened" perhaps a cat knocked over a trash-can, or a window washer came and bumped the camera, or a jet airplane flew over, or a Harley Davidson drove by, or whatever. Taking the fact that the audio was louder than most of the samples of "nothing happened" as evidence that "a crime was committed" is seriously WRONG, in just the way that taking the fact that your psychology experiment produced measurements that are different from some simple "null hypothesis" as evidence that "my explanatory mechanism is true" is also seriously WRONG.

### The real value of p: filtering

So, we see the real value of "p" values: filters. We have lots of things that we probably shouldn't pay attention to: chemicals synthesized at random in a pharma lab, noise produced by unimportant events near our surveillance camera, time periods on a seismometer during which no earthquake waves are being received, psychology experiments that we shouldn't pay attention to. The p value gives us a way to say "something is worth considering here". The big problem comes when we make the unjustified leap from "something is worth considering here" to "my pet theory is true!".

As a Bayesian, I appreciate the use of p values in filtering down the data to stuff I probably shouldn't ignore, it's a first step. But the next step is always going to be "let's build a model of what's going on, and then find out what the data tells us about the unobserved explanatory variables within the model" That's where the real science occurs!

PS: often under the assumption that there is a single stationary distribution from which all the data d come means that the random samples of t(d) have some common and well known frequency distribution (like normal, or chi-squared, or chi, or gamma or whatever). The central limit theorem often guarantees the frequency distribution is normal under some mild assumptions, this is why the "null hypothesis" is often not really about the data generating histogram, but rather the histogram of t(d). In this view, it doesn't so much matter what your data turn out to be, but rather what the t(D) is relative to a reference distribution for the test statistic (like the "student-t" or the normal, chi-squared, etc).