# Generative vs Declarative models <-> Imperative vs Declarative programs

In programming, there are a variety of "paradigms" which languages fall into. The most well known is the "imperative" style of language, typified by C/C++, Matlab, Java, Python, etc.

There are also "functional" languages, like Scheme, Lisp, Haskell, and so forth.

Another genre is "declarative" languages, like Prolog, SQL, and a few others. Other than SQL, the average programmer is unlikely to have much experience with these.

Some languages are object oriented, though OO languages are usually imperative in style.

Finally, many languages have aspects of all of these. For example, R, Python, and to some extent C++ all support many aspects of the functional paradigm, and Common Lisp supports functional, object oriented, imperative, and declarative styles (the last through certain macro packages in which Prolog is built inside Common Lisp).

The distinction I'm trying to make here is between imperative and declarative languages, and how that relates to building statistical models in a generative vs declarative style (related to my previous posts on declarative models), so it's worthwhile to think about how they differ.

In an imperative language, you tell the computer *what to do*. For example, you might tell it to loop over an index i and add 1 to each element of an array. In R:

```r
for (i in 1:10) { my_array[i] <- my_array[i] + 1 }
```

In a declarative language, you tell the computer what result you want, and it figures out how to achieve it. In SQL:

```sql
select my_value + 1 from my_table where my_value > 0;
```

There is a proof (I'm told) that Prolog, a declarative language, can compute anything that C or another imperative language can, with at most a logarithmic slowdown. So, while SQL is a very limited declarative language, the declarative paradigm is fully general (Turing complete).

Now, it's my opinion that this distinction carries over just about 1-1 when talking about building Bayesian statistical models. There are two basic paradigms:

## Generative models

In a generative model, the model is created as if the data and the parameters were the outputs of certain random number generators (RNGs). Michael Betancourt put it well in a discussion on the stan-user mailing list:

> The generative picture is:
>
> 1) generate truth, myfunction(x,a,b)
> 2) generate noise, epsilon ~ normal(0, sigma)
> 3) generate measurement, y = myfunction(x,a,b) + epsilon
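The three steps above translate directly into a fake-data simulator. Here is a minimal Python sketch, where myfunction, the true parameter values, and sigma are all hypothetical stand-ins:

```python
import random

random.seed(1)

# Hypothetical stand-in for myfunction(x, a, b)
def myfunction(x, a, b):
    return a * x + b

a_true, b_true, sigma = 2.0, 1.0, 0.5
xs = [0.1 * i for i in range(100)]

# 1) generate truth, 2) generate noise, 3) generate measurement
ys = [myfunction(x, a_true, b_true) + random.gauss(0.0, sigma) for x in xs]
```

This is the "imperative" flavor of a generative model: every such model doubles as a recipe for simulating data.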

But it's not the only way to think about Bayesian modeling. The alternative is more declarative; or, since Bayesian models are a generalized logic, you can think like a Prolog "logic programmer".

## Declarative models

Lay out all the facts that you know:

a is between a_low and a_high with high probability, and is most probably around a_typ:

```stan
a ~ normal(a_typ, (a_high - a_low) / 4); // or something similar
```

b is a positive number less than 10:

```stan
b ~ uniform(0, 10);
```

The difference between the data and my predictions using myfunction is less than about 2 when the correct a and b are used:

```stan
y - myfunction(x, a, b) ~ normal(0, 2);
```
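Translated out of Stan notation, those declared facts just add up to an unnormalized log posterior. Here is a Python sketch of that sum; myfunction, the constants a_low, a_high, a_typ, and the toy data are all made-up placeholders:

```python
import math

def log_normal_pdf(z, mu, sigma):
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# Made-up placeholders for illustration
a_low, a_high, a_typ = 0.0, 4.0, 2.0
xs = [0.0, 1.0, 2.0]
ys = [1.1, 3.0, 5.2]

def myfunction(x, a, b):
    return a * x + b

def log_posterior(a, b):
    lp = 0.0
    # fact: a is between a_low and a_high, most probably around a_typ
    lp += log_normal_pdf(a, a_typ, (a_high - a_low) / 4)
    # fact: b is a positive number less than 10
    if not (0.0 < b < 10.0):
        return float("-inf")
    lp += -math.log(10.0)
    # fact: data minus prediction is within about 2 of zero
    for x, y in zip(xs, ys):
        lp += log_normal_pdf(y - myfunction(x, a, b), 0.0, 2.0)
    return lp
```

Each "~" statement is just another additive term in the log density; a sampler like Stan then explores this same function.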

You can, of course, elaborate, as I did in my model for simultaneously measuring a molecular length and calibrating a set of lasers.

You're free to encode approximate knowledge of anything you like, so long as it's actually true knowledge (it's no good saying x ~ normal(0,3) when x is almost always > 50 in reality, unless you can get enough data to undo your incorrect assumption), and the Bayesian machinery (Stan!) will give you the approximate logical consequences of those statements.

I should say, though, that it's very possible to go wrong by putting in facts that are wrong, or by stating them in a way that is incompatible with what you mean. For example, there is the well known problem with attempting to say that a parameter a has a lognormal distribution:

```stan
log(a) ~ normal(0, log(1.2));
```

It's well known that taking the logarithm compresses intervals of a into much smaller intervals of log(a), and the degree of compression changes with the value of a. To figure out the distribution of a (density per unit a), given that b = log(a) has a normal distribution per unit b, you apply the change-of-variables formula:

p_a(a) da = normal(log(a), sigma) db = normal(log(a), sigma) * |db/da| da

so the density per unit a picks up the Jacobian factor |db/da| = 1/a:

p_a(a) = normal(log(a), sigma) / a

The Stan statement above, taken alone, leaves out that factor, so it does not give a a lognormal distribution.
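A quick numerical check of that change of variables in Python: with the 1/a Jacobian factor the density of a integrates to 1, and without it it does not. The sigma here matches the log(1.2) in the example:

```python
import math

sigma = math.log(1.2)

def normal_pdf(z, mu, s):
    return math.exp(-0.5 * ((z - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# density per unit a, including the Jacobian factor |db/da| = 1/a
def density_with_jacobian(a):
    return normal_pdf(math.log(a), 0.0, sigma) / a

# the "wrong" version: the normal density of log(a), with no Jacobian
def density_without_jacobian(a):
    return normal_pdf(math.log(a), 0.0, sigma)

# midpoint-rule integration over a range covering essentially all the mass
def integrate(f, lo, hi, n=100000):
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

good = integrate(density_with_jacobian, 0.3, 3.0)
bad = integrate(density_without_jacobian, 0.3, 3.0)
```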

Why doesn't this same type of caveat apply in the case of a declaration about a function of data and parameters? To answer that will take a little more thought and care, and will be left for next time.

The proof for Prolog's asymptotic complexity for arbitrary programs is really simple. You code up a random access array using binary trees and use binary search for access --- it's where the log term comes from. In practice, though, things are much worse in that some algorithms are nearly impossible to convert to even moderately efficient Prolog (all that theory ignores important little issues like memory locality and garbage collection overhead).

I don't think languages like Prolog or SQL are declarative in practice. In Prolog, changing the order of two clauses means the difference between an infinite loop and an efficient program. And you really, really need to know that it's using depth-first evaluation. You could do breadth-first evaluation, which would be complete, but horribly inefficient as a programming language. The depth-first backtracking eval is what lets you write even remotely efficient Prolog code (see O'Keefe's book, The Craft of Prolog).

I was expecting to see the declarative analogy be the generative model!

For a language like Stan, the only thing that matters is the definition of the density (up to issues such as arithmetic stability and efficiency).

Bob, thanks for the outline of the Prolog proof. From the perspective of practical programming, I agree with you. When I was writing Prolog, I used cut and the like all the time to enforce depth-first with pruning. I spent many hours with The Craft of Prolog. But I haven't done any Prolog in a long time; I think the last Prolog program I wrote solved some kind of brain-teaser pencil-and-paper puzzle, probably over 5 years ago. I still love Prolog, though; it's perfect for lots of things that we often see imperative programs doing instead. For example, it's perfect for routing calls in an IP PBX, though I don't think anyone has cottoned on to that and actually grafted Prolog into a SIP PBX server.

The point of the declarative/generative analogy, though, is more or less this:

A generative statement of a model is a statement in which data is treated as if it came directly out of some particular RNG. You therefore have a straightforward "imperative" way to generate fake data: just take samples from this RNG.

A declarative statement of a model is one in which you explicitly "declare" probabilistic facts about the world, whether or not those are facts in the data space. A generative model is actually also declarative (it states probabilistic facts about the data directly), but it has a flavor of "imperative" because you can think of generating this and then generating that and then generating the other thing in a sequence.

In both cases, you wind up with a density over the parameters. But in the generative case you wind up with a well-defined "likelihood", since the model assumes data comes IID out of the surrogate RNG with a given distribution. In the declarative case you wind up instead with a generalized "weighting function" which plays the same role as the likelihood (it updates the prior by multiplication) but need not have any direct connection to data as samples from a distribution. Sampling from a distribution is just a special case in the Cox/Bayes formalism.
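One way to see that a generative model is also declarative: treating y as a draw from an RNG centered on the prediction, and declaring that the residual y - f(x, a) is normal around zero, yield exactly the same weighting function. A small Python check, where f, the data, and the noise scale are all made up for illustration:

```python
import math

def normal_pdf(z, mu, s):
    return math.exp(-0.5 * ((z - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# toy data and a made-up prediction function
xs = [0.0, 1.0, 2.0]
ys = [0.9, 2.1, 2.9]

def f(x, a):
    return a * x + 1.0

def generative_likelihood(a):
    # data treated as IID draws from an RNG centered on the prediction
    p = 1.0
    for x, y in zip(xs, ys):
        p *= normal_pdf(y, f(x, a), 0.5)
    return p

def declarative_weight(a):
    # a declared probabilistic fact about the residual y - f(x, a)
    p = 1.0
    for x, y in zip(xs, ys):
        p *= normal_pdf(y - f(x, a), 0.0, 0.5)
    return p
```

The two coincide here because the normal is a location family; a declarative statement about a more general function of data and parameters has no such direct RNG reading.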

In some sense, you can treat Stan like a generalized system for computing with either kind of model. A statement

```stan
data ~ my_distribution(my_parameter);
```

is a straightforward generative model, while a statement like

```stan
function(data, parameters) ~ my_distribution(my_other_parameters);
```

is a relatively clear statement of probabilistic fact about the outcome space of function(data,parameters).

The reason you usually don't want a Jacobian transform here is that the statement is directly a statement of fact about the probability of values in the outcome space; it is, in essence, already multiplied by "dfunction". That's because you're modeling the function's outcomes, and so you're typically thinking in that space when you choose the right-hand side.

Model choice comes first, and the math follows.