Another way to think about the "declarative vs generative" dichotomy that I have recently blogged on, is that you have free choice about what you are going to model. Informally, the "likelihood" principle says that you should model the output of your data collection instruments, typically as IID draws from some distribution. In simple cases, this lets you write down

p(data | parameters)

And this formalism is typically fine for people who work with data that is pretty darn noisy and about which little is known regarding the wheels and gears and other things that go into producing the data... like survey responses, or even some economics, or even some simple physical processes.

But lots of people have detailed models of their data production process. Weather forecasters, for example, may not know everything they'd like to know, but conditional on the initial conditions, they have PDEs that let them evolve their weather forward in time to tomorrow or next week (or next century, for some of the climate change researchers).

So, we could look at today's weather measurements and compare them to the predictions based on yesterday's weather measurements, and try to get the parameters of the model (initial and boundary conditions yesterday, forcing functions) perfectly set up so that we'll predict tomorrow's weather. But we absolutely SHOULD NOT replace the weather model with a random number generator.

If we are given a vast vector of temperature, pressure, and wind speed measurements all put together into a vector "data", and a set of parameters which are more or less the "correct" initial and boundary conditions, then what kind of

p(data | parameters, K)

should we use? The answer is, we don't have one, and, luckily, we're NOT forced by the logic of Cox/Jaynes probability modeling to write one down.

In particular, there's nothing forcing us to model our data; we can model anything, and one thing we can model is, for example, the relative reasonableness of certain degrees of deviation between our model for the weather and today's actual weather data. This is what ABC does:

p(distance(summary(predicted_weather(parameters)), summary(measured_weather)) | parameters, K) = uniform(0, epsilon)

This is a function of both data and parameters which we choose to model, we simply ASSIGN based on our knowledge K the given probability distribution, and it means "this is about how far off we could be if the parameters were correct".

Doing that lets us update the parameters. Let's call all that distance(...) stuff "dist" for short.

Then we have some knowledge about how well weather prediction works, which lets us write down something we can have some confidence in:

p(all_params | dist) \propto p(dist | all_params) p(all_params)

Is p(dist | all_params) a likelihood? Well, not if you want likelihood to mean p(data | params), but it is an assigned probability, which is a valid modeling step for a Cox/Jaynes model, and it's assigned on an expression that USES the data. It has a perfectly fine logical basis. We may not know what the reasonable values are for the very complicated data vector, but we do know the reasonable values for some kinds of summaries, and we can collapse our data down via a parameterized function and model that collapsed quantity.
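To make this concrete, here is a minimal rejection-ABC sketch in plain Python. The "weather model" is a made-up toy stand-in for a PDE solver; all the names and constants here are hypothetical illustrations, not a real forecast code:

```python
import random
import statistics

random.seed(1)

def predicted_weather(param, n=50):
    # toy stand-in for the PDE solver: "param" is a forcing we want to infer
    return [param + random.gauss(0, 1) for _ in range(n)]

def summary(series):
    return statistics.mean(series)

def distance(a, b):
    return abs(a - b)

# "measured" weather, generated here with true forcing 3.0
s_measured = summary(predicted_weather(3.0))

epsilon = 0.3
accepted = []
for _ in range(2000):
    param = random.uniform(0, 10)  # draw from the prior
    d = distance(summary(predicted_weather(param)), s_measured)
    if d < epsilon:  # p(dist | param, K) = uniform(0, epsilon)
        accepted.append(param)

print(len(accepted), statistics.mean(accepted))
```

The accepted parameter values are (approximate) posterior draws: parameter settings whose predicted-weather summary lands within epsilon of the measured summary.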

After all, the output of data collection equipment is really just a collapsing down of other quantities we could call data: for example a gauge points to a number on a dial, but that's a collapse of a certain current and a certain spring constant, and a certain magnetization, and some geometry... a digital thermometer outputs a temperature, but that's a collapse of a certain charge in a certain capacitor of a certain size and a certain resistance, and a certain response function for an A/D converter and etc etc.

And it's not just complicated black-box PDE solvers that we can do this for; we can always model something other than the pure output of our measurement instruments. So long as we are actually USING the output of the measurement instruments somewhere in the model, the measurements will provide some kind of information. The more we collapse them, the less information we may get, so more collapsed models are possibly not quite as informative as a full model for the entire data vector would be... but then, we don't have such a model!

Statistical testing answers the question: how strange is the sequence of data $d_i$ under the assumption that it is a sample from distribution $D$?

When would a Bayesian care about the answer to such a question? There are basically two cases:

1. When summarizing the posterior distribution using a sampling mechanism such as MCMC.
2. When attempting to fit a distribution to data so as to replace the physical process with a sampling process.

Case 1 is a computational technique. We replace a function of the parameter space with a finite sequence of numbers which act as if they were a sample from a distribution. The main thing this does is let us replace expectation integrals with finite sums. Per Martin-Löf gave us a way to define random sequences as mathematical objects in terms of sequences that pass some "most powerful test" for which we have a non-constructive proof of existence. Practically speaking, this means we can determine whether a sampling technique is working by testing to see if we can reject the sample.
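Practically speaking, "testing the sample" can be done with an ordinary statistic computed from scratch; here is a one-sample Kolmogorov-Smirnov distance against uniform(0,1), a simple sketch standing in for the non-constructive "most powerful test":

```python
def ks_uniform(sample):
    """One-sample KS distance between the empirical CDF and Uniform(0,1)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d

n = 1000
good = [(i + 0.5) / n for i in range(n)]  # behaves like an ideal uniform sample
bad = [0.9] * n                           # a "stuck" sampler

print(ks_uniform(good), ks_uniform(bad))
```

A healthy sampler targeting uniform(0,1) gives a small distance; the stuck chain is rejected immediately.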

Case 2 is a modeling technique. Whereas the first case is about computing with a sequence of samples from the posterior for the parameters, the second technique is about approximating some scientific real-world process by sampling. The second case is always intentionally wrong, and this is what Frequentist statistics often gets wrong. We know that the way data "is generated" is more or less as the result of physical processes, quantum or classical; whatever the case, some molecules push on other molecules etc. However, from a convenience-in-modeling perspective, sometimes we can replace all that detailed picture with an approximate picture, which is that the net result of all those causes tends to fall into some certain range of outcomes with some relative frequencies of occurrence. This is always always always an approximation, one which need not continue to hold and which should be checked. The way to check the adequacy of this process for describing at least the historical data is to use some kind of statistical test.

Continuous probability densities have dimensions of 1/[x], where [x] is whatever the dimensions of the input are. For example, probability per unit length for $p(x)$ with $x$ a length.

Mathematical models should be made based on dimensionless quantities. If we're looking to compare different data points under a model, it seems to me it's worth considering the quantity:

$\log(p(x)/p(x^*))$

Where $x^*$ is the x where p takes on its maximum value, so for the unit normal $x^*=0$ for example.

In particular, I have in mind using this alternative quantity to filter data in a similar way to the filtering by "p value" described in a previous post.

So, if you're trying to detect unusual situations (in the previous post example it was unusual loudness of noise) you fit your model $p(x)$ and then for each data value $x_i$ you calculate $\log(p(x_i)/p(x^*))$ and select those data points which produce a negative enough log probability density ratio. In particular, you'd actually calculate, say, the average of this quantity over the posterior distribution for the parameters.

This isn't quite the same as the likelihood ratio, since of course it's averaged over the posterior of the parameters, but it is obviously related. (EDIT: you can obviously use these ratios both for likelihood and for posterior density of parameter values, if you're trying to compare alternative parameters, but with parameters you probably want marginal density values, and if you have samples, you probably need kernel density estimates or something...)

This is a case where it makes sense to use the Bayesian machinery to actually fit the frequency histogram of the data points, a perfectly legit thing to do, even though it is sometimes not required.

Ways in which this is superior to "p values" include being able to distinguish when something is unusual even when it's not in the tail of a distribution, for example in multi-modal distributions. It is also actually motivated by the science (in so far as you put science into your data model) and the actual values of the data: it doesn't involve any probability, under the model, of observing theoretical data "as extreme or more extreme" than the data point; it's just the relative probability of observing data in the immediate neighbourhood of the data point compared to the immediate neighbourhood of the most common point. In particular you can think of $p(x) dx / p(x^*) dx$ as a ratio of probabilities for any infinitesimal quantity $dx$, which thanks to the ratio happens to cancel out.
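Here is a minimal sketch of that filter for a normal $p(x)$, with made-up data and hand-picked mu, sigma standing in for a posterior fit; for the normal, $\log(p(x)/p(x^*))$ reduces to $-(x-\mu)^2/(2\sigma^2)$:

```python
def log_density_ratio(x, mu, sigma):
    # log(p(x)/p(x*)) for a normal: x* = mu, so this is -(x - mu)^2 / (2 sigma^2)
    return -((x - mu) ** 2) / (2 * sigma ** 2)

data = [9.8, 10.1, 10.0, 10.2, 9.9, 14.5]  # one clear outlier
mu, sigma = 10.0, 0.2                       # pretend these were fit (e.g. in Stan)

threshold = -5.0                            # "negative enough" cutoff, chosen by hand
flagged = [x for x in data if log_density_ratio(x, mu, sigma) < threshold]
print(flagged)
```

With real data you'd average this quantity over posterior draws of mu and sigma rather than plugging in point values.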

I used this exact method recently with data my wife gave me. It was a kind of screening of certain chemicals for their ability to induce certain kinds of fluorescent markers in cells. The image data resulted in a bunch of noisy "typical" values, and then a number of outliers. Each plate was different since it had different chemicals, different cells, different markers etc. In Stan I fit a t distribution to the image quantification data from each plate, and looked for data points that were in the tails of the fit t distribution based on this quantity. It seemed to work quite well, and got her what she needed.

The above graph summarizes the situation that Andrew Gelman has been railing about: chasing the noise with p values leads to a great career and top publications!

Published recently in Psychological Science, this study has Open Science Data. So I pulled the Study1 data and quickly plotted it in R. On the x axis is a measure of how manipulable the participant is times the manipulation that was done. More sad is to the right. On the y axis the relative blue-yellow accuracy as a fraction of red-green accuracy.

You may notice the two outstanding data points on the left at the top? Yeah, I think that's their effect.  Basically, two people who were easy to manipulate and happened to get the amusement manipulation were outstanding blue-yellow discriminators. All the rest... well, look for yourself.

From a comment I left on Andrew Gelman's blog:

In science, the data never come from random number generators with given frequency distributions. They always come from some set of atoms and sub-atomic particles interacting to alter the position of some other particles. It's pretty much always position, or occasionally velocity, too.

Position of a needle on a gauge caused by velocity of electrons flowing through a wire causing a magnetic field which provides a torque to counteract the torque applied by a spring.

Position of electrons in a capacitor that produces a voltage for an A/D converter.

Position of graphite molecules on a paper survey answered by an enormous bundle of atoms we call a Psychology Student who uses the position of photons striking her retina to interpret the meaning of the position of the ink molecules on the page which alters the position of sodium ions in the neurons of her brain.

Position of electrons in a bank computer which cause a money balance to be shown on an account which causes a bundle of molecules called a consumer to be able to alter the position of a bundle of molecules called a product from one location on the earth called a store to another location on the earth called a home without another bundle of molecules called a police officer confining the consumer bundle into a small space by strong molecular repulsion.

It’s never Platonic infinite sequences of bits coming from a particular abstract mathematical sequence whose Godel number is some particular value and which satisfies the Kolmogorov axioms for a particular distribution function to within a given epsilon when calculated by the non-constructive uniformly most powerful computable test.

So, I've been discussing declarative vs generative models. Here's a case where I think the declarative model makes good sense, how would you model this?

y is a vector of 1000 timeseries samples from some audio system. On one end is a highly sensitive microphone that puts out a voltage signal proportional to the instantaneous pressure gradient. The voltage is in the range [0,0.01] volts and is known to have a slew-rate limit of at most 0.005 v per sample and a bias of 0.005 +- 0.0006. A pre-amplifier increases this signal by pretty close to 100x but has a response function $f_{pre}(v) = A \mathrm{atan}(v/0.01)/\mathrm{atan}(1)$ which has a known shape over the input range [0,0.01], is not perfectly linear, and this shape is multiplied by a number A slightly less than 100 to get the actual transfer function. The pre-amp output picks up radio frequency noise, which can be modeled as normal(0,0.015), and the result is converted via a 10 bit A/D into the range 0-1023, with 1v in to the A/D implying 1024. This digital signal is sent down a noisy channel, so that only some of the samples are received in digital form. Additionally, as a compromise to improve reliability and due to lack of equipment, the signal is re-converted to analog via a 12 bit D/A outputting in the range 0-10v, which has a known noisy bias signal due to a failing diode that can be modeled as independent exponentially distributed voltages with mean value close to 0.05. The result is sent over a long cable of around 1000 feet, where additional normally distributed white noise is added to the signal whose RMS amplitude is less than 0.01 v. The long cable has about 1 ohm resistance per 100 feet +- 5% as well as a resistance of less than 10 ohms at both the input and output connections. The whole thing is put into a 16 bit A/D operating on inputs in the range 0-10v with an unfortunately low input impedance of close to 1000 ohms +- 1%.
The 16 bit A/D is sufficiently high resolution that y is given as a floating point number equal to the digital value on 0-(2^16-1), divided by 2^16, and assumed to be continuous with an additional conversion noise and bias which is approximately gamma distributed with k=3 and rate = 1/0.003.

Reconstruct the original signal y_true out of the microphone as best you can using as much of the information available as possible, while also giving best estimates for the length of the wire, the total resistance of the cable, the amplification factor for the pre-amp, and the input impedance of the final A/D.

The "generative" picture here is this (more or less, we're ignoring low-pass filters that are always present on the d/a and a/d steps):

y_meas = round(((round(1024*A*f_pre(y_true)+epsilon_rf)+da_bias_noise)*10*


And, in addition, we have some of the digital values round(1024*A*f_pre(y_true)+epsilon_rf) let's call the ones we have y_digit, which is a sparse vector and t_digit which tells us which of the y_meas the y_digit correspond to.

So, if you wanted to write this in a generative form, with only data or only parameters on the left hand side of the probabilistic statements, in Stan, what would you use to define the likelihood function for y_meas or y_digit given all the parameters, y_true, epsilon_rf, da_bias_noise... etc?

For the full analog / digital / analog / digital / analog path, you could always invent some extra roundoff error parameters, and write

y_meas - (...) ~ gamma(3, 1/0.003);


where (...) = the whole mess where round(x) is replaced with x+roundoff_i and roundoff_i is given the appropriate prior. But then you'd be writing a declarative model with a nonlinear function of the data and the parameters (consider the resistances formula) on the left hand side.

Stan doesn't have a shifted gamma, but if it did, you could use the shifted gamma distribution and place the round function in the location parameter, however, you will probably induce bad behavior due to the non-differentiable jumps.

The point of the declarative version is that if you were given the actual values of the parameters other than ad16_noise, then subtracting y_meas - (...) would give you an expression equal to the noise sample, and the likelihood for seeing that sample conditional on all the other parameters, is the gamma distribution.

If you have the wrong values of the parameters, you won't get something that's gamma distributed, so you'll be in the low probability region of the gamma distribution function, so you'll downweight that portion of the parameter space... that's all you need to get inference on the parameters, that you find reasons to reject some of the prior space because it doesn't fit the facts you know.
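That downweighting can be seen in miniature: plug a "right" and a "wrong" residual into the gamma density from the conversion-noise model above (the residual values here are just hypothetical illustrations):

```python
import math

K, RATE = 3, 1 / 0.003  # the gamma constants from the conversion-noise model

def gamma_logpdf(x, k=K, rate=RATE):
    # log density of a gamma(k, rate) distribution
    return k * math.log(rate) - math.lgamma(k) + (k - 1) * math.log(x) - rate * x

# residual y_meas - (...) when the parameters are right vs. badly wrong
residual_right = 0.006  # near the gamma mode, (k - 1)/rate = 0.006
residual_wrong = 0.2    # the kind of residual a bad parameter guess leaves behind

print(gamma_logpdf(residual_right), gamma_logpdf(residual_wrong))
```

The wrong parameters leave a residual far out in the tail of the gamma, so their log probability is enormously lower, and that region of parameter space gets downweighted.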

Suppose you are in a Kindergarten class, and you have a bag containing 100 mixed nuts: Brazil, walnut, almond, hazelnut, etc.

Now one of the children reaches in, pulls out a nut, and tells you what kind it is. How much does it weigh?

Let's make the assumption that the nuts remaining are not radically different from the one chosen, at least for that type.

Consider the bag of nuts as a prior over the weight of the nuts. It's the range of weights that we consider reasonable. Our data is the kind of nut...

Repeatedly reach into the bag of nuts, and pull out one nut. If the nut is the wrong type, then you know it's not like the data, so throw it back. This is the likelihood, it down-weights parameter values that don't fit the data. If it's the right type, weigh it (this is a sample of the parameter, the unobserved weight) and write down the weight. Once you've done this so that you have measured your type of nut many times, then build a histogram of the weights you wrote down.

The nut your child is holding probably has a weight that is near the peak in your histogram, and the weight probably isn't outside the range of 95% of the weights in your histogram. This is the Bayesian way of thinking in a nutshell.
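Here is the kindergarten procedure as literal code (the nut types and weights are made up):

```python
import random
import statistics

random.seed(42)

types = ["brazil", "walnut", "almond", "hazelnut"]
typical_grams = {"brazil": 4.0, "walnut": 3.0, "almond": 1.2, "hazelnut": 2.2}

# the bag is our prior over (type, weight) pairs
bag = [(t, random.gauss(typical_grams[t], 0.2))
       for t in random.choices(types, k=100)]

observed_type = "almond"  # the data: the child tells us the kind of nut

# the likelihood as rejection: throw back any nut of the wrong type
kept = [w for (t, w) in bag if t == observed_type]

print(len(kept), statistics.mean(kept))
```

The kept weights are draws from the posterior over the unobserved weight; their histogram peak and spread answer "how much does it weigh?"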

In my posts about assigning distributions to functions of data and parameters, I mentioned the tried and true example of trying to apply a distribution to a nonlinear function of a parameter:

log(foo) ~ normal(0,1);

In Stan at least, this does NOT imply that samples of foo have a lognormal frequency distribution; for that you have to take into account the differential compression that the log function applies to the dfoo intervals. This is always true when you transform a particular set of n parameters through nonlinear transforms into n other parameters (linear transforms imply a constant multiplier correction, which is then normalized out): the differential volumes transform according to the determinant of the Jacobian of the transform in the vicinity of the volume interval.

But, why is the same thing not true when we do:

lfoo ~ normal(0,1);
foo <- exp(lfoo)

which, if you write those statements in the appropriate places (model, and transformed parameters) in a Stan program, will give you a sample foo from a lognormal distribution? (alternatively you could sample lfoo, and exp transform in R after you grab the sample).
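You can mimic what Stan does here with ordinary Monte Carlo draws (plain Python standing in for the sampler):

```python
import math
import random
import statistics

random.seed(0)

# "lfoo ~ normal(0,1); foo <- exp(lfoo)": transform the draws after sampling
lfoo = [random.gauss(0, 1) for _ in range(20000)]
foo = [math.exp(x) for x in lfoo]

# the median of exp(X) is exp(median(X)), so it should sit near exp(0) = 1,
# exactly as for a lognormal(0,1) frequency distribution
print(statistics.median(foo))
```

The draws of foo behave like a lognormal(0,1) sample because exp is applied to numbers that already have the right frequency distribution in lfoo space; no Jacobian correction is needed.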

And, when you make a statement like

f(data,parameter_vector) ~ distribution(other_params)

The parameter_vector may have many elements, and therefore, there is no Jacobian of the transform (specifically the Jacobian isn't a square matrix and so has no determinant). But even if you are using only one parameter, and could calculate the distortion of the differentials, would you want to?

For the first question, about the samples of lfoo and their exponential: the answer is that we're transforming fixed numbers that Stan spits out, which have the proper normal(0,1) frequency distribution thanks to Stan ensuring that fact in the lfoo space. If we instead specify that foo is our parameter and write log(foo) ~ normal(0,1), Stan is trying to enforce a frequency in the foo space: it calculates p(log(foo)) dfoo and says this is the frequency of samples in the region dfoo. If this is what you meant, then FINE, but if you meant for foo to have a lognormal distribution, you need to match the space in which you're calculating the density to the space in which Stan is calculating the probability: you need to weight the density by the Jacobian factor, specifying that foo has density p(log(foo)) dlog(foo)/dfoo.

For the second question, about whether you'd do a transformation when you say f(data,params) ~ distribution(other,params), the answer is no, you don't want to transform anything, and the reason comes from the mathematics. The statement about f(data, parameters) is a conditional probability:

p(f(data,params) | params) = distribution(other_params)

This isn't a statement about the frequency of "params" in your Stan sample, since "params" is a given vector (it's on the right hand side of the vertical bar), it's a statement about where the f function values should be if the params are correct... and since the parameters are the only things that change during the simulation process, not the data, it's ultimately a statement about which params values result in transformed data -> f values that you consider probable.

Like a likelihood that is generated by an iid sampling assumption, the best way to think about this statement is that it's actually a weighting function for the prior distribution over the parameters. We start with a basket of possible parameter values called the prior, and we reject any values (downweight the density) of the parameters which result in f(data,params) being in the low density region of the given distribution. In the same light, we could write

data ~ distribution(param1,param2)

which is a generative statement of the likelihood. This is a special case, it corresponds to:

id(data) ~ distribution(param1,param2)

where "id" is the identity function $f(x) = x$ and says that we should downweight the priors over param1 and param2 whenever the data values themselves are in a region of "distribution" which is improbable given the parameters.
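The weighting-function reading is easy to make concrete on a grid (a toy sketch with made-up data and the special case id(data) ~ normal(mu, 1)):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

data = [1.9, 2.1, 2.0]
grid = [k / 10 for k in range(41)]  # a flat prior over mu in [0, 4]

# weight each prior value of mu by how probable the data are if that mu is correct
weights = []
for mu in grid:
    w = 1.0
    for y in data:
        w *= normal_pdf(y, mu, 1.0)  # id(data) ~ normal(mu, 1)
    weights.append(w)

total = sum(weights)
posterior = [w / total for w in weights]
best = grid[posterior.index(max(posterior))]
print(best)
```

The prior weight at each grid value of mu gets multiplied by how probable the data are if that mu is correct; normalizing gives the posterior, which here peaks at the sample mean.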

Restricting yourself to generative models is a lot like restricting yourself to separable equations.

x e^x = y

solve for $x$... sorry, no can do. The best we can do is invent a name for a function which is the solution to this equation, and come up with a way to calculate it. Incidentally, the function is called the "Lambert W" and $W(y) = x$.
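There being no separated closed form doesn't make $W$ hard to compute; here's a bisection sketch for positive $y$:

```python
import math

def lambert_w(y, tol=1e-12):
    """Solve x * exp(x) = y for x (y > 0) by bisection."""
    lo, hi = 0.0, 1.0
    while hi * math.exp(hi) < y:  # grow the bracket until it contains the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * math.exp(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = lambert_w(2.0)
print(x, x * math.exp(x))  # x * exp(x) recovers 2.0
```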

Similarly, if we have a mathematical model in which we can specify the probability distribution of parameterized transformations of the data, we can either write down that fact:

roundoff ~ uniform(-0.5,0.5)
(x+roundoff) ~ normal(mu,sigma);


or we can invent a special likelihood to put it in a generative (separated) form:

increment_log_prob(log(normal_cdf(x+0.5,0,1)-normal_cdf(x-0.5,0,1)))


But in many ways enforcing the "generative" style will induce you to work with more limited sets of models, because sometimes the likelihood you need to invent is going to be very complicated, and that's particularly true when the left hand side involves several parameters. For example if we know a little more about some of the roundoff errors, we'll need to do a weighted integral of normal_pdf weighted by the information we know about roundoff. There is no hard reason why you need to specify your model in this way. In my opinion, instead of trying to work your model into a form where you can say

data ~ my_distribution(other,parameters)


You should think hard about what you know, and make appropriate probability statements. If that results in

f(data,params) ~ my_distribution(other,parameters)


then so be it. Just be aware that your "fact" needs to be both a true (approximate, probabilistic) fact about the world, and informative. If there are a LOT of different parameter values that can make the left hand side be in the high probability region of my_distribution, then you won't be downweighting much of the parameter space... and you won't find out much about your parameters. The ultimate version of this is if you choose a function f(data,params)=constant. Then "params" will be totally uninformed by this statement.

You might well ask, in all this declarative vs generative modeling discussion, why would you want to put distributions on expressions involving data and parameters?

And the answer I'd give to that is that sometimes it's more natural. For example, with the roundoff error in the laser example (data rounded to nearest 1/4 wavelength) the alternative is to write a likelihood on a discrete grid of rounded results. For example, if you know that x is rounded to the nearest 0.25, then its underlying value fell somewhere between -0.125 and +0.125 relative to its reported value; in Stan:

increment_log_prob(log(normal_cdf(x+0.125,mu,sigma)-normal_cdf(x-0.125,mu,sigma)));

But, what if you knew not just that it was rounded, but, for example, you had some second measurement which, with noise, gives you some additional information about the underlying value for some of the measurements? Now, to do the above model you have to calculate a weighted integral of the normal_pdf instead of an unweighted one (which is what the normal_cdf is).

Alternatively, you could put a prior on the roundoff error, and express your probability over the result of subtracting the roundoff error. For example, suppose in this particular case the reported value was above the underlying value by about 0.08 (but no more than 0.125):

real<lower=0,upper=0.125> err;
err ~ exponential(1/0.08);
(x-err) ~ normal(mu,sigma);


If you marginalize away the variable err, by calculating

p(x | mu, sigma) = integral of normal_pdf(x - err, mu, sigma) p(err) derr

when $p(\mathrm{err})=\mathrm{uniform(-0.125,0.125)}$ you get the above model with the normal_cdf, but if you use the informed prior, you get a model which is, marginally, a weighted integral of the normal_pdf over the range [0,0.125]. Good luck calculating that in the likelihood; it will require doing numerical integration in the increment_log_prob statement!
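For the uniform(-0.125, 0.125) prior, the equivalence can be checked numerically: trapezoid-integrate the marginal and compare with the normal_cdf form (the values of x, mu, sigma here are made up):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

x, mu, sigma = 0.75, 0.6, 0.3

# marginal of (x - err) ~ normal(mu, sigma) with err ~ uniform(-0.125, 0.125),
# computed by trapezoid integration over err
n = 2000
lo, hi = -0.125, 0.125
h = (hi - lo) / n
density = 1 / (hi - lo)  # the uniform prior density for err
integral = 0.0
for i in range(n + 1):
    err = lo + i * h
    w = 0.5 if i in (0, n) else 1.0  # trapezoid endpoint weights
    integral += w * normal_pdf(x - err, mu, sigma) * density
integral *= h

cdf_form = (normal_cdf(x + 0.125, mu, sigma) - normal_cdf(x - 0.125, mu, sigma)) / (hi - lo)
print(integral, cdf_form)
```

The two agree: the cdf difference is exactly the unweighted integral of the pdf, which is why an informed (non-uniform) prior on err forces you into genuine numerical integration instead.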

But, why else would you want to do this? Another reason that I can think of is that often, at least in the physical sciences, we can decompose the error into a variety of sources. For example, in the laser molecular measurement example we had two sources: molecular vibrations, and roundoff. But in a moderately famous neutrino experiment from a few years back, you had what people thought were faster-than-light neutrinos. It turned out, if I remember correctly, that this was due to an error in the model for the errors! Specifically, if you have a signal traveling through the earth at the speed of light, and then hitting a detector, and the detector is connected to an amplifier, and the amplifier is connected to a signal processor, and the signal processor is connected to a computer network, and the leg-bone is connected to the hip-bone... Then at each stage there is some kind of systematic error in the signal which you can model via various functions and prior information. But at the end, you're expecting (but not requiring!) that the signal couldn't have gotten there faster than light.

(signal_speed - my_complicated_error_model)/speed_of_light ~ normal(1,0.00001); // a little remaining statistical noise

or suppose that from previous experiments, you know the noise is a little skewed or something

(signal_speed - my_complicated_error_model)/speed_of_light ~ gamma(appropriate,constants); // a bit non-normal noise

The fact that is salient, is that after removing the systematic errors, the remaining result should be just about exactly the speed of light. If, however, you use a highly informed prior for some portion of the error model, you can wind up overwhelming this distribution, and getting samples that predict faster than light. If your error model is correct, then you've discovered something! and if you are less certain about your error model and express that uncertainty appropriately, then the knowledge you have about the speed of the signal being almost exactly the speed of light will help you discover the true value for that component of the error.

Of course, in most cases you could write these models in a generative way. But we've seen in the above example the generative model might very well require doing weighted numerical integration in the likelihood function. The distribution that is implied by the above statement on the signal speed could well be extremely strange, especially if my_complicated_error_model has all sorts of informed priors on multiple parameters involved in it.

Remember, the distribution of a+b is the convolution of the distribution for a and the distribution for b.

Suppose your error model has multiple terms, each of which is an informed prior:

a-(b+c+d+e+f+g) ~ gamma(q,p);

This means that the appropriate convolution of all those informed distributions for b,c,d,e,f,g with the distribution for a, is a gamma distribution with some constants q,p you've gotten from previous experiments:

convolution_of_a_b_c_d_e_f_g = gamma_pdf(q,p)

What does that mean for the probability of a? Good luck writing that down in a simple likelihood. Ok, so perhaps you can figure it out; the point is that if the only reason you aren't willing to put a distribution on the left hand side is that it's not generative... then you're wasting a lot of time on a requirement that isn't a real requirement! Much better to simply state the fact that you know, and let Stan find the samples for a.
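The convolution fact itself is easy to verify with discrete distributions, for example two fair dice:

```python
def convolve(p, q):
    """Distribution of a + b when a ~ p and b ~ q (dicts of value -> probability)."""
    out = {}
    for va, pa in p.items():
        for vb, pb in q.items():
            out[va + vb] = out.get(va + vb, 0.0) + pa * pb
    return out

die = {v: 1 / 6 for v in range(1, 7)}
two_dice = convolve(die, die)

print(two_dice[7], two_dice[2])  # 6/36 and 1/36
```

Already with two uniform components the sum has a different (triangular) shape; with six or seven informed components the analytic form is hopeless, which is the point.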

In programming, there are a variety of "paradigms" which languages fall into. The most well known is the "imperative" style of language, typified by C/C++, Matlab, Java, Python, etc.

There are also "functional" languages, like Scheme, Lisp, Haskell, and so forth.

Another genre is "declarative" languages, like Prolog, SQL, and a few others. Other than SQL, these languages are somewhat rarer for the average programmer to have experience with.

Some languages are Object Oriented, though OO languages are typically imperative in style.

Finally, many languages have aspects of all of these. For example R, Python, and to some extent C++ all support many aspects of functional paradigms, and Common Lisp supports functional, object oriented, imperative, and declarative styles (through certain macro packages in which a Prolog is built inside Common Lisp).

The distinction I'm trying to make here is between Imperative, and Declarative languages, and how that relates to building statistical models in a Generative vs Declarative style (related to my previous posts on declarative models) so it's worthwhile to think about how they are different.

In an imperative language, you tell the computer what to do. So, for example, you might tell it to loop over an index i and add 1 to all the elements of an array; in R:

for(i in 1:10) {
my_array[i] <- my_array[i]+1
}

In a declarative language, you tell the computer what the result you want is, and it figures out how to achieve it, in SQL:

select my_value + 1 from my_table where my_value > 0;

There is a proof (I'm told) that Prolog, a declarative language, can calculate anything that C or another imperative language can, with at most logarithmic extra space. So, while SQL is a very limited declarative language, the declarative paradigm is fully general (Turing complete).

Now, it's my opinion that this distinction carries over just about 1-1 when talking about building Bayesian statistical models. There are two basic paradigms:

Generative models:

In a generative model, the model is created as if the data and the parameters were the outputs of certain random number generators (RNGs). Michael Betancourt put it well in a discussion on the stan-user mailing list:

The generative picture is
1) generate truth, myfunction(x,a,b)
2) generate noise, epsilon ~ normal(0, sigma)
3) generate measurement, y = myfunction(x,a,b) + epsilon
This is a very "imperative" view of how you might specify a model. You can imagine coding it up as "sample the parameters a,b from the prior", then "calculate myfunction using the parameters" then generate random perturbation from the normal distribution, then say that "that's how y was created" (note, this last bit is almost always false, unless you're trying to experiment with your model based on simulated data).
This paradigm has some advantages, one of which is that you can easily create fake data and see how well your model fits the real data.

But, it's not the only way to think about Bayesian modeling. The alternative is more declarative; or, since Bayesian models are a generalized logic, you can think like a Prolog "logic programmer".

Declarative models:

Lay out all the facts that you know:

a is between a_low and a_high with high probability, and is most probably around a_typ;

a ~ normal(a_typ, (a_high-a_low)/4); // or something similar

b is a positive number less than 10;

b ~ uniform(0,10);

The difference between the data and my predictions using myfunction is less than about 2 when the correct a, and b are used:

y-myfunction(x,a,b) ~ normal(0,2);

you can of course, elaborate, as I did in my model for simultaneously measuring a molecular length, and calibrating a set of lasers.

You're free to encode approximate knowledge of anything you like so long as it's actually true knowledge (it's no good saying x ~ normal(0,3) when x is almost always > 50 in reality, unless you can get enough data to undo your incorrect assumption), and the Bayesian machinery (Stan!) will give you the approximate logical consequences of those statements.

I should say though, that it's very possible to go wrong by putting in facts that either are wrong, or by stating them in a way which is incompatible with what you mean. For example, the well known problem with attempting to say that a parameter a has a lognormal distribution:

log(a) ~ normal(0,log(1.2));

It's well known that taking the logarithm compresses intervals of a into much smaller intervals of log(a). The degree of compression changes with the different values of a. To figure out what the distribution of a is (density per unit a), given that b=log(a) has a normal distribution per unit b, you can do:

normal(b,sigma) db = normal(log(a),sigma) abs(db/da) da, so the density in units of a is normal(log(a),sigma) * abs(db/da)

and $db/da = d\log(a)/da = 1/a$, so the density of a is normal(log(a),sigma)/a: exactly the lognormal density.
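A quick numerical check: with the 1/a factor included, the density per unit a integrates to 1 (trapezoid integration with sigma = log(1.2) as above):

```python
import math

sigma = math.log(1.2)

def normal_pdf(x, mu, s):
    return math.exp(-((x - mu) ** 2) / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def lognormal_pdf(a):
    # density per unit a: normal density of log(a) times |db/da| = 1/a
    return normal_pdf(math.log(a), 0.0, sigma) / a

# integrate over essentially all the mass (log(a) within +-8 sigma of 0)
lo, hi = math.exp(-8 * sigma), math.exp(8 * sigma)
n = 100000
h = (hi - lo) / n
total = 0.5 * (lognormal_pdf(lo) + lognormal_pdf(hi)) * h
for i in range(1, n):
    total += lognormal_pdf(lo + i * h) * h

print(total)  # very close to 1
```

Dropping the 1/a and integrating normal(log(a),sigma) per unit a instead does not give 1, which is exactly the mistake the naive log(a) ~ normal(0, log(1.2)) statement makes.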

Why doesn't this same type of caveat apply in the case of a declaration about a function of data and parameters? To answer that will take a little more thought and care, and will be left for next time.