Apparently Cox’s original paper has some subtle issues: some assumptions were not explicitly mentioned, or even realized, by Cox. In the many years since, some of these technicalities have been patched up. Kevin S. Van Horn has a great “guide” to Cox’s theorem which I like a lot. He has some other interesting related material as well.

In his presentation of Cox’s theorem, he relies on 5 requirements:

• R1: plausibility is a real number
• R2-(1-4): plausibility agrees in the limiting case with propositional calculus (where everything is either known to be true, or known to be false).
• R3: plausibility of a proposition and of its negation are functionally related.
• R4: Universality: In essence plausibility values take on a dense subset of an interval of real numbers, and we can construct propositions that have certain plausibilities associated with them.
• R5: Conjunction: The conjunction function is a continuous strictly increasing function F such that p(A and B | X) = F(p(A|B,X),p(B|X)).

Kevin motivates each of these and gives some objections and discussion about why each one is required. For me, the point of the exercise is to generalize propositional calculus to real-number plausibility, so R1, R2 and R3 are obvious to me, though I realize some people have objected (for example, there are 2-dimensional alternatives, where a proposition and its negation have different plausibilities unrelated by a functional equation).
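For reference, the conclusion these requirements are meant to deliver can be written roughly as below. This is my paraphrase of the standard statement, not a quotation from Van Horn; the symbol $$g$$ for the rescaling function is my own notation, with $$g(F) = 0$$ and $$g(T) = 1$$.

```latex
\begin{align*}
p(A \mid X) &= g\big((A \mid X)\big), \qquad g \text{ continuous and strictly increasing},\\
p(A \wedge B \mid X) &= p(A \mid B, X)\, p(B \mid X),\\
p(\neg A \mid X) &= 1 - p(A \mid X).
\end{align*}
```

In words: any plausibility scale satisfying the requirements can be relabeled so that conjunction follows the product rule and negation follows the sum rule.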

I’m just not interested in such things, so when I look at Kevin’s requirements, Universality and Conjunction are the ones I think you could attack. To me, the dense set of values, and that there must be more than one value (the endpoints of the interval are not the same) seem obvious.

The full version of R4 is:

R4: There exists a nonempty set of real numbers $$P_0$$ with the following two properties:

• $$P_0$$ is a dense subset of (F,T).
• For every $$y_1,y_2,y_3 \in P_0$$ there exists some consistent X with a basis of at least three atomic propositions $$A_1,A_2,A_3$$ such that $$(A_1|X) = y_1,(A_2|A_1,X)=y_2, (A_3|A_2,A_1,X) = y_3$$

This is perhaps the least understandable part. It says basically that for any three plausibility values in $$P_0$$, there exists some consistent state of knowledge that assigns those plausibility values to certain propositions. A consistent state of knowledge is one where we can’t prove a contradiction (i.e. A and Not A).
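To see that ordinary probability can at least satisfy this kind of requirement, here is a minimal sketch that builds a joint distribution over three atomic propositions from any three values strictly between 0 and 1. The function names and the `filler` value are my own illustrative choices, not anything from Van Horn's paper.

```python
from itertools import product

def joint_from_chain(y1, y2, y3, filler=0.5):
    """Build a joint distribution over (A1, A2, A3) via the chain rule.

    y1 = p(A1), y2 = p(A2 | A1), y3 = p(A3 | A2, A1).
    Conditionals that R4 does not mention (e.g. p(A2 | not A1)) are set to
    `filler`, an arbitrary choice made only for this illustration.
    """
    joint = {}
    for a1, a2, a3 in product([True, False], repeat=3):
        p1 = y1 if a1 else 1 - y1
        p2_true = y2 if a1 else filler
        p2 = p2_true if a2 else 1 - p2_true
        p3_true = y3 if (a1 and a2) else filler
        p3 = p3_true if a3 else 1 - p3_true
        joint[(a1, a2, a3)] = p1 * p2 * p3
    return joint

def prob(joint, event, given=lambda w: True):
    """p(event | given), computed by summing over the eight possible worlds."""
    denom = sum(p for w, p in joint.items() if given(w))
    numer = sum(p for w, p in joint.items() if given(w) and event(w))
    return numer / denom

joint = joint_from_chain(0.2, 0.7, 0.9)
p_a1 = prob(joint, lambda w: w[0])
p_a2_given_a1 = prob(joint, lambda w: w[1], given=lambda w: w[0])
p_a3_given_a2_a1 = prob(joint, lambda w: w[2], given=lambda w: w[0] and w[1])
```

The three recovered values match the three inputs, and nothing in the construction cares what the propositions are about.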

Note that this isn’t a claim about the actual world. We aren’t restricted to “states of knowledge” that are, say, consistent with the laws of physics, or consistent with a historically accurate account of the Norman Conquest, etc. This is a mathematical statement about whether we can *assign* certain plausibilities to certain statements without causing a contradiction. So “A1 = Unicorns keep an extra horn at home for emergencies” and “A2 = The second sun of the planet Gorlab rises every Thursday” and so forth are acceptable statements to whose truth values we could assign certain plausibilities to meet these logical requirements. This is called “universality” because it expresses the desire to be able to assign plausibilities at will to at least 3 statements regardless of what those statements are about (whether that’s unicorns, or the position of fictional suns, or the efficacy of a real drug for treatment of diabetes, or the albedo of a real planet circling a remote star).

Van Horn points out that without R4 we can construct a counterexample, so something possibly weaker might be allowable, but we can’t omit it entirely. Van Horn also points out that Halpern requires that the set of pairs of propositions be infinite. But note, there’s nothing that requires us to restrict Cox’s theorem to any one area of study. So, for example, while you may be using probability as logic to make Bayesian decision theory decisions about a finite set of objects whose measurements come from finite sets, someone else is using it to make decisions about continuous parameters over infinite domains. This is Van Horn’s argument, basically that we’re looking for a single universal system that works for all propositions, not just some finite set of propositions about your favorite subject. He goes further to point out that what Halpern describes as $$(A|B)$$ really should be $$(A|B,X)$$ where $$X$$ is a state of information about the statements. So long as you allow an infinite set of possible states of information, you can still apply Cox’s theorem within a finite domain of finite objects.

Van Horn’s article is worth a read, and he discusses many issues worth considering.

52 Responses
August 16, 2016

Hi Daniel,

Quick comment for now. To me the key sentence is

“In contrast to previous treatments of Cox’s theorem, but following common working practice among Bayesians, we condition the plausibility of a proposition on a state of information, rather than on another proposition”

I think this would generally be accepted by a ‘Jaynesian’. In fact this seems to be one of the central features mentioned by Jaynesians, though not always making the distinction explicit as in Van Horn.

My argument would be (I will try to fill in the details at some point) that

a) I agree (at least tentatively) that the sentence above captures a valuable distinction
b) This in fact undermines the usual Bayesian account based only on simple propositions (e.g. as in I believe Jaynes’ book and Cox’s original work, perhaps excluding Jaynes’ ‘Ap distribution’ stuff, but Van Horn has some critical comments on this elsewhere)
c) This position opens up the compatible possibilities to include pure likelihood theory as well as forms of frequentism.

In fact, I would argue that these alternatives are in some ways more compatible with Van Horn’s sentence than standard Bayes, as they naturally make the distinction explicit.

2. August 16, 2016

Jaynes’ original book certainly explicitly mentions the “background information”, which corresponds to the state of information that leads you to choose a particular sampling distribution, for example. So I think Jaynes is fully compatible theoretically, but was not as explicit notationally (except perhaps in the Ap distribution stuff). In particular, I think we can consider a “state of information” as basically a set of propositions which are considered to be true, some of which might be propositions of the sort “The probability p(Foo) = 0.459057” or “p(X in [x,x+dx] | mu, sigma) = normal(mu,sigma)dx”, which is I think a generalization of the Ap distribution stuff.

I believe pure likelihood theory is fully compatible with Bayesian reasoning, but is a limitation that is unnecessary. I find it hard to believe that frequentism of the sort “test whether mu = 0, and if we reject then treat mu as if it essentially equals mu*, the sample mu; if we do not reject, treat mu as if it equals 0”, which is the typical actual daily application of frequentism, is incompatible.

Frequentism of the sort “observe some data under a ‘null’ condition, fit a distributional form to the data, and then filter the incoming data so that we observe only those with p < 0.05 and do some special inference on them” is probably compatible with this Cox/Bayesian approach. We can make this formal by looking at a mixture model for the likelihood in which we have one distribution for “nothing is going on” and another distribution for “something is going on”. Whenever the probability of the data under the “nothing is going on” model is low, the probability under the “something is going on” model is high, so we can treat the probability of the data as

p(Data) = p(Data | Something is going on) p(Something is going on) + p(Data | Nothing is going on) p(Nothing is going on) ~ p(Data | Something is going on) p(Something is going on)

truncating the null model out of the calculation as an approximation, due to the small probability under the null.
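A minimal numerical sketch of that truncation, assuming a standard normal “nothing is going on” model, a much wider normal “something is going on” model, and a 10% prior weight on the latter. All three choices are hypothetical and made only for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_terms(x, p_something=0.1):
    """Return the two terms of p(Data) for a single observation x."""
    # Hypothetical null model: normal(0, 1).
    null_term = normal_pdf(x, 0.0, 1.0) * (1 - p_something)
    # Hypothetical alternative model: normal(0, 10), i.e. much more spread out.
    alt_term = normal_pdf(x, 0.0, 10.0) * p_something
    return null_term, alt_term

# An observation far out in the tail of the null model.
null_term, alt_term = mixture_terms(5.0)
full = null_term + alt_term
approx = alt_term                      # null term truncated away
relative_error = (full - approx) / full
```

For an observation this far into the tail, dropping the null term changes p(Data) by well under a tenth of a percent; for an observation near zero the null term dominates instead and the truncation would not be justified.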

• August 16, 2016

Also note, pure likelihood theory, that is, an insistence on an iid likelihood, treating the data “as if” from an RNG with some parameters:

p(Data | Model, params) = Product_i p(Data_i | Model, params)

Can in practical problems lead to serious issues, because sometimes it’s just not true that the data are like an IID sample. See recent discussion at Gelman’s blog under “Bootstrapping your posterior” http://andrewgelman.com/2016/08/09/29625/#comment-292761
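As a rough sketch of why the factorization matters, the following compares the iid product form with a likelihood in which each point depends on the previous one. The AR(1) model and the data values are my own made-up stand-ins, not anything from the linked discussion.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal(mu, sigma) distribution at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def iid_loglik(data, mu, sigma):
    """log of Product_i p(Data_i | Model, params) for an iid normal model."""
    return sum(normal_logpdf(x, mu, sigma) for x in data)

def ar1_loglik(data, rho, sigma):
    """Log-likelihood of a stationary zero-mean AR(1) model.

    Each point is normal around rho times the previous point, so the joint
    density does not factor into identical independent terms unless rho == 0.
    """
    stationary_sd = sigma / math.sqrt(1 - rho ** 2)
    total = normal_logpdf(data[0], 0.0, stationary_sd)
    for prev, cur in zip(data, data[1:]):
        total += normal_logpdf(cur, rho * prev, sigma)
    return total

data = [0.1, 0.4, 0.9, 1.3, 1.1, 0.8, 0.2, -0.3]  # made-up, smoothly drifting series
iid_value = iid_loglik(data, 0.0, 1.0)
ar1_no_dependence = ar1_loglik(data, 0.0, 1.0)    # rho = 0 recovers the iid form
ar1_value = ar1_loglik(data, 0.8, 0.6)
```

With `rho = 0` the two functions agree exactly, while for this drifting series the dependent model assigns the data a higher likelihood than the iid one.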

The freedom to use “background information” to choose a likelihood that represents realistic scientific processes is one of the great things of the “full Bayes” kool aid.

August 16, 2016

Hi Daniel,

The distinction I am drawing is admittedly subtle, but is not based on whether or not Jaynes admits ‘background information’ but in how it is formalised. Van Horn formalises it differently to Jaynes and Cox – as he states in the paper e.g. in the sentence I mention.

My argument (not made explicit here yet) is that this distinction is important and leads to different conclusions than one that does not make the distinction explicit.

And I would certainly not equate pure likelihood theory with iid assumptions, nor with the neglecting of background information. I would recommend Edwards’ book and Pawitan’s book for more on this approach.

August 16, 2016

Does Corey read your blog? I wonder if he might guess the argument I’m only hinting at for the moment?

• August 16, 2016

Well, he does comment on occasion, but I don’t know how often he checks it. I’ll email him and see if he wants to contribute. He’s got lots of background here, and I think he sent me the link to the Van Horn paper originally.

• August 16, 2016

I think Cox’s theorem is compatible with an approach that is essentially just write down the joint distribution unfactored, in particular don’t factor out p(Parameters) as some kind of “prior”. If that’s isomorphic to the Edwards/Pawitan concepts then this is included in Cox’s theorem. But, this approach doesn’t take advantage of any of the “algebra” of probability theory. It’s a little like saying f(x)=0, x= 0.4595401 by inspection…

August 17, 2016

Howdy ojm! I’m not really sure what you’re driving at. My understanding of Van Horn’s treatment of states of information as axiomatically continuously changeable (and thus not logically equivalent to fixed finite sets of propositions) is that it provides an explicit justification for introducing the (otherwise perhaps obscurely-motivated) constraint/axiom/desideratum that excludes Halpern’s counter-example.

I think you’re alluding to h-likelihood, which is just one of a variety of attempts to make the Bayesian omelet without cracking the Bayesian eggs. It still posits unknown fixed parameters that do not get priors. Humbug, I say.

August 16, 2016

Here is an analogy which is a bit outside my domain knowledge but seems like a good one. The wave function of quantum mechanics might be said to represent a state of information. This enables us to use it to calculate the probabilities of observables, so in this sense it indexes a probability distribution.

We do not, however, have a probability distribution over the wave function itself (at least in standard approaches).

If you were to confuse the probability distribution implied by a wave function with a probability over a wave function I think you would be missing something important. Now, replace ‘wave function’ with ‘parameter’. Likelihoodists like Pawitan distinguish P(A|B) and P(A;B) and even use eg P(A|B;C).

I find this an important distinction and it is fully compatible with the Van Horn approach of distinguishing propositions and states of information. Furthermore I believe it clarifies issues of eg improper priors and marginalisation paradoxes. For example a prior likelihood may naturally represent what an improper prior probability distribution struggles with, and blocks the marginalisation paradox (you can’t in *general* marginalise over a likelihood and expect the right answer).

4. August 17, 2016

So, the Marginalization Paradox, I read up on it last night but couldn’t find a great description of how it works, and didn’t chase down the original paper. I did however find Van Horn had something to say about it:

http://ksvanhorn.com/bayes/jaynes/node17.html

I’m interested to look into whether the marginalization paradox remains when we start with a nonstandard construct where we have a uniform prior on a nonstandard interval, marginalize, and then take the standardization of the resulting pdf. Also, perhaps it depends on the form of the nonstandard prior: for example, do we get different results when we use, say, a flat prior on [-N,N] with N nonstandard, vs a normal(0,N) with N nonstandard, etc.

So, I guess I’d better look into the marginalization stuff.

5. August 17, 2016

Corey, ojm. It seems to me that either h-likelihood etc satisfies the Cox axioms Van Horn gives, in which case there’s no “you can’t marginalize over a likelihood”, or it doesn’t, for example it might “posit unknown fixed parameters that don’t get priors” in which case it is maybe Cox probability extended with additional objects (h-likelihoods being different from pdfs or whatever).

I’m pretty much in the humbug camp on that, and for the most part, I’m also in the humbug camp on improper priors. I mean, got a length? Certainly it’s smaller than the known diameter of the universe. Trying to do inference on the diameter of the universe? Put a uniform prior on 10^1000 times as large as your best guess. Got an energy? It’s gotta be smaller than mc^2 for m the mass of the universe. Got a level of pain reported by a patient? It’s probably less than the level reported by someone being tortured on several medieval torture devices simultaneously.

But I am interested in the math of improper priors, for similar reasons to why I’m interested in the math of maxent using nonstandard analysis: if we can get beyond the weirdness of certain limits and get a reasonable theory of how to use them, it can help avoid confusion.

August 17, 2016

Hi both,

Just another quick comment/question for now.

As far as I can tell nowhere in the article does a state of information X appear on the lhs of the ‘conditioning’, always the rhs.

Would either of you be willing to e.g. use Bayes’ theorem to move X to the other side? Why/why not? If not, how would you ‘update’ or change X itself? Is there any relation between different states of information X and Y?

• August 17, 2016

My initial thought is that moving the state of information to the left hand side is very problematic. In the Van Horn treatment, he says we condition everything on that state of information, but we don’t put bayesian probability over the *whole* state of information, we only put bayesian probability over specific sub-elements of the state (ie. what’s the value of mu, what’s the value of sigma, what’s the value of alpha… in some model)

As to the question “is there any relation between different states of information X and Y”, certainly.

Let Y be the state of information “X plus some data D”, then in some sense, Bayes theorem lets us calculate the consequences of adding D to X for other quantities about which we had some information in X.

p(Parameters | D, X) = p(D | Parameters, X) * p(Parameters | X) / p(D | X) = p(Parameters | Y)

which is why it doesn’t really make sense to move Y to the left hand side, instead we just do inference on the sub-parts of X or Y that have explicit assignments of probabilities associated.
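A minimal sketch of that update on a grid, assuming a coin-flip model where Parameters is a single head probability, X supplies a flat prior over the grid, and D is 7 heads and 3 tails. The model and the data are hypothetical choices for illustration only.

```python
def grid_posterior(prior, likelihood):
    """Return p(Parameters | D, X) on a grid, plus the evidence p(D | X).

    prior[i]      = p(Parameters_i | X)
    likelihood[i] = p(D | Parameters_i, X)
    """
    unnormalized = [lk * pr for lk, pr in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # p(D | X), the normalizing constant
    return [u / evidence for u in unnormalized], evidence

# Grid of candidate head probabilities 0.01, 0.02, ..., 0.99.
thetas = [i / 100 for i in range(1, 100)]
prior = [1 / len(thetas)] * len(thetas)          # flat prior under X
heads, tails = 7, 3                              # made-up data D
likelihood = [t ** heads * (1 - t) ** tails for t in thetas]

posterior, evidence = grid_posterior(prior, likelihood)
posterior_mode = thetas[posterior.index(max(posterior))]
```

The state of information X never appears on the left of the bar here; it only fixes the grid, the prior, and the form of the likelihood, and the result is the set of credences that belongs to Y.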

Cox’s theorem in essence says “if X assigns credences to quantities, then the proper algebraic rules for manipulating those credences is standard probability theory”

The fact that “X assigns credences” means in some sense there is no moving it to the left hand side of the conditioning bar. In some sense p(X) = 1 by definition of “X is your state of information”.

August 17, 2016

> In the Van Horn treatment, he says we condition everything on that state of information, but we don’t put bayesian probability over the *whole* state of information, we only put bayesian probability over specific sub-elements of the state (ie. what’s the value of mu, what’s the value of sigma, what’s the value of alpha… in some model)

Exactly, and that’s really my point. To follow Gelman you might say that Bayes works *within* a model but offers no general theory of what goes on at the boundaries/’outside’ the model.

So you might reason ‘perfectly’ within a model (state of information) but could quite well be bested by someone who just makes up another model (state of information) that has no relation to yours but happens to better approximate reality. See eg Wolpert and ‘no free lunch’ ideas.

As stated in this particular formal development there is no theory or real guidance on how to specify a state of information or model. You might get into maxent etc etc, but the point is I see no compelling reason why Cox’s theorem should be taken as implying you should use Bayes. It does say you can use Bayes *within* a model, but even Frequentists do this. They differ on model assessment itself. See also Box/Gelman on model checking.

August 17, 2016

Also you effectively say define Y = X and A. But there is nothing in the main formal development that says this operation has to be defined for a ‘state of information’ and a proposition. Everything would seem to work the same if ‘state of information’ just meant ‘that which allows me to assign probabilities to propositions’. There is no strict reason as far as I can see that this *must* be taken to be a list of simple propositions. The history of logical atomism would perhaps suggest to me (I’m no expert though) that this would in fact be a bad idea.

• August 17, 2016

I agree it doesn’t need to be a finite list of simple propositions. But it has to be possible to add the information contained in a proposition to the state of information when that proposition is of the form “the probability of Foo is Bar”, since that obviously is relevant to “allowing you to assign probabilities to propositions”.

August 17, 2016

Again my point is that there is no theory of updating states of information here – only of updating probabilities of propositions given states of information.

That is to say, there is no *inductive* theory here. Hence (I assume) why Gelman has said he is convinced by Popper’s argument for deductive falsification (at least as a general scheme).

• August 17, 2016

Also, I totally agree with you that some people will have “bad” states of information, and some people will have “good” ones. For example, if you’re the guy who programmed up the computerized lottery, you can pretty easily cheat whereas other people’s state of information won’t allow them to cheat.

But, once you’ve got some state of information in which some quantities are to have real-number credences assigned to them, then if you buy into Cox’s axioms the only way to update *those credences* is through the algebra of probability theory.

So, you’re fine rejecting the idea of a real number credence, or rejecting the idea of negation, or rejecting the idea of conjunction, etc, but if you buy into the idea of assigning real number credences, and that they obey negation, conjunction etc as laid out by Van Horn, then you’re committed to Bayesian updating of the credences.

If you are in a state of information where you have several possible models and you can assign credences to them, then you need to use Bayesian updating to get new credences over the models. If you don’t want credences over models or credences over parameters or credences of any kind, then you are free to ignore Bayes by just denying Van Horn’s R1
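A minimal sketch of updating credences over models, assuming two hypothetical coin models (“fair” with head probability 0.5 and “biased” with 0.8), equal initial credences, and a made-up sequence of flips.

```python
def update_model_credences(credences, likelihoods):
    """One Bayesian update of credences over models.

    credences[m]   = p(model m | current information)
    likelihoods[m] = p(new observation | model m)
    """
    unnormalized = {m: credences[m] * likelihoods[m] for m in credences}
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}

# Hypothetical models: each one fixes the probability of heads.
head_prob = {"fair": 0.5, "biased": 0.8}
credences = {"fair": 0.5, "biased": 0.5}

flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # made-up data: 8 heads, 2 tails
for flip in flips:
    likelihoods = {m: (p if flip else 1 - p) for m, p in head_prob.items()}
    credences = update_model_credences(credences, likelihoods)
```

After these ten flips the credence has shifted toward the “biased” model. Note that the sketch only redistributes credence among the models that were listed to begin with; it says nothing about models outside that list.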

• August 17, 2016

@ojm: See definition 1 in Van Horn, “A state of information X summarizes the information we have about some set of atomic propositions A, called the basis of X, and their relationship to each other. The domain of X is the logical closure of A”

So, it’s not necessarily a finite set, but it is the logical closure of “some set” of atomic propositions. So the proposition p(Params | Data) = foo(Params) is a proposition that can be added to the set after doing the right algebraic updating according to Cox/Bayesian updating.

August 17, 2016

X itself is not, and need not, be a set of propositions, finite or not. It just needs to allow you to make assignments to propositions.

Again, think of the wave function. It is not a set of propositions itself, but allows you to assign probabilities to observable propositions.

• August 17, 2016

ojm: replied below because of reply limitations.

August 17, 2016

I’d be happy to split propositions off from X and move them across the conditioning bar.

• August 17, 2016

Right, that’s more or less what my longer-winded version meant.

• August 17, 2016

I guess the “theory of updating states of information” is here, it’s basically this:

K is some set of atomic propositions, and X(K) is some summary of that set that lets us assign probabilities. The Cox axioms allow us to prove new propositions of the form A = {p(Params | Data) = Some_Function(Params)}. These new statements can be added to K using regular set operations, and then we can get a new summary X(K union A).
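One way to picture this is the toy sketch below, in which a state is a set of accepted propositions together with the credences it assigns over a grid of head probabilities, and each update returns a new state. The dictionary layout and the coin model are my own hypothetical choices, not Van Horn's formalism.

```python
def update_state(state, observation, thetas):
    """Return a new state: K union {observation}, with credences updated by Bayes' rule."""
    likelihood = [t if observation == "H" else 1 - t for t in thetas]
    unnormalized = [lk * c for lk, c in zip(likelihood, state["credences"])]
    total = sum(unnormalized)
    new_proposition = f"observation {len(state['propositions']) + 1} was {observation}"
    return {
        # Set union: the old propositions plus the newly accepted one.
        "propositions": state["propositions"] | {new_proposition},
        # The summary X(K union A): credences over the parameter grid.
        "credences": [u / total for u in unnormalized],
    }

thetas = [0.1 * i for i in range(1, 10)]  # candidate head probabilities
initial = {
    "propositions": frozenset(),
    "credences": [1 / len(thetas)] * len(thetas),
}

after_ht = update_state(update_state(initial, "H", thetas), "T", thetas)
after_th = update_state(update_state(initial, "T", thetas), "H", thetas)
```

In this toy setting the resulting credences do not depend on the order in which the two observations are added, which is the kind of consistency one would want from a rule for growing K.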

7. August 17, 2016

Consider the difference between the standard inference: p(Param|Data, KS) = p(Data | Param,KS) p(Param|KS)/Z1

where p(Param|KS) is built on the “standard” state of knowledge “Param is somewhere on the real line”

and the inference:

p(Param|Data,KNS) = ST(p(Data | Param,KNS) p(Param|KNS)/Z2)

where KNS is the nonstandard state of knowledge “Param is a limited real number”

In the first case, there does not exist a p(Param | KS), we must work with something like p(Param) = 1 which is improper.

In the second case, p(Param|KNS) = uniform(-N,N) is nonstandard, exists, and is properly normalized. I wonder if I can use this kind of situation to draw some useful distinctions in the Marginalization Paradox issue.

Also note, saying x ~ uniform(-N,N) is a way of saying “There’s a really big number beyond which we know the parameter doesn’t lie, but we don’t know what the really big number is” which is a kind of uncertainty…. hmmm
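As a rough numerical stand-in for the nonstandard construction, the sketch below uses ordinary finite values of N and watches what happens to the posterior as N grows. It assumes a single normal observation with known sigma, which is my own simplification, and it does not touch the marginalization paradox itself.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def posterior_mean_uniform_prior(xbar, sigma, N):
    """Posterior mean of Param given one normal observation xbar (known sigma)
    and a proper uniform(-N, N) prior; the posterior is a truncated normal."""
    a = (-N - xbar) / sigma
    b = (N - xbar) / sigma
    mass = std_normal_cdf(b) - std_normal_cdf(a)
    return xbar + sigma * (std_normal_pdf(a) - std_normal_pdf(b)) / mass

# Finite N values standing in for "really big"; xbar = 2 and sigma = 1 are made up.
means = {N: posterior_mean_uniform_prior(2.0, 1.0, N) for N in (3, 10, 100, 1000)}
```

For small N the prior bounds pull the posterior mean below the observation, while for large N the mean settles on the value an improper flat prior would give, with the prior staying properly normalized at every step.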

August 17, 2016

Moved here.

X itself is not, and need not, be a set of propositions, finite or not. It just needs to allow you to make assignments to propositions.
Again, think of the wave function. It is not a set of propositions itself, but allows you to assign probabilities to observable propositions.

August 17, 2016

Also note that a set of propositions in your scheme does not uniquely determine a summary, and there is no theory of how the probabilities determined by the summary of the union of a set of propositions and another are related to the probabilities determined by the summary of the original set of propositions.

• August 17, 2016

X is not a set of propositions, but X is a function of a set of propositions. So for example, if the propositions include “My shoes are brown” and this is not relevant, then it won’t be part of the summary that determines the equations. But, my understanding of what Van Horn is doing is basically that he’s saying that if a set of propositions is consistent, then you can use the algebra of probability and some data to generate new true and consistent propositions about probabilities which then can be unioned into your original set, and the union will STILL BE CONSISTENT. So, in that sense, it IS a theory of updating states of information.

August 17, 2016

Hi Daniel,

Thanks for the responses, this has been a useful exercise for me 🙂

I think our chances of convincing each other have become slim now, however!

At this point I’m satisfied that Van Horn’s presentation of Cox’s theorem does not in fact constitute an argument for the use of Bayes in the sense it’s usually taken (e.g. as a theory of inductive inference, or as a theory for updating states of information) and you are (I think) essentially satisfied that it is (or at least that it constitutes an argument for Bayes in the sense you take it, and against some other approaches that differ from Bayes in particular respects).

So perhaps we leave it here (for now!).

But again, thanks for pointing to the article and having this discussion. It helped me clarify my own thoughts on the topic.

• August 17, 2016

Well, if you can think on it for a while and come up with some specific concerns, I’m happy to entertain them and see where they lead. I just don’t quite understand where you’re going. It seems evident to me that from some set of propositions we can assign probabilities, then if we add some propositions like the content of a big dataset, we can carry out algebra, and arrive at new propositions that assign new probabilities (conditional probabilities on some data), and the point of it is to update the state of information with these new assignments so we can maybe collect more data and repeat the process.

If you’re arguing that we haven’t proven the consistency of probability theory (ie. that propositions derived by algebraic manipulations via the rules of probability theory do not produce inconsistencies) then that might be true, I think Cox’s theorem just shows that plausible reasoning of this type is isomorphic to probability theory, it’s not a consistency proof. But perhaps consistency has been proven elsewhere? Or more likely, it’s been proven that if ZFC is consistent then probability theory is consistent. I’d assume. I mean Kolmogorov theory I’ve always assumed is as consistent as ZFC (is there a consistency proof for ZFC?) I’m no logician.

So, like I say, take your time, if you can come up with something you think we should discuss, please do!

August 17, 2016

Sure, will do 🙂

• August 17, 2016

As an aside, Godel’s second incompleteness theorem implies you can’t prove consistency of ZFC in ZFC unless ZFC is actually inconsistent… so I guess proofs would have to be of the form “assuming ZFC is consistent, then Probability theory is consistent” and my guess is that somewhere that’s been proven but I’m not certain.

August 18, 2016

August 17, 2016

Conveniently it looks like someone else made essentially the same point as me in blog-friendly style here (had a quick google around):

http://meaningness.com/probability-and-logic

(Though I might dispute even the claims of the power of predicate logic somewhat!)

• August 18, 2016

If you’re heading that direction (and I’ve actually seen that blog post and TL;DR previously) then I think provisionally we can agree on something like the following:

Cox’s axioms tell us real number credences *are* probabilities, and that probabilities are consistent with ZFC has been proven, so we can update these probabilities according to the algebra of probability theory. This allows us to build up knowledge about facts.

But, knowledge about facts (ie. the speed of light is very close to 2.998e8 m/s) is not sufficient, we also need to somehow *come up with the models* we use to assign the probabilities, and that will involve reasoning in other forms (formalized in predicate calculus, or maybe extending ZFC with the axioms of IST and doing nonstandard analysis for example) probability theory doesn’t give us that for free.

I totally 100% agree with that. Developing models, and checking that we have correctly specified our models needs to happen at least in part outside probability theory.

• August 18, 2016

The reason I don’t think this gets in the way of being hard-assed about the use of probability theory in science, is that in science one of the most fundamental things we want to do is assess which assertions about the world are true through the collection of data. And while I’ll agree that probability doesn’t extend *all of logic* it does specifically extend the part you need to filter true facts about the world from false ones. Specifically it lets you put probabilities on the truth values of facts which then can converge towards 0 or 1 with enough information.

So, “Probability theory generalizes logic” is incorrect, but “probability theory generalizes inference about which facts are true or false to real-number degrees of credence” is correct, and that’s a really important part of science.

August 18, 2016

To me the upshot of a lot of this is that the things ‘doing the work’ in scientific inference etc are almost always things lying outside of propositional calculus. Eg states of information, predicates, models etc.

It is not clear to me that ‘Bayesian inference’ as a theory of inductive inference, a general form of reasoning etc etc, as opposed to calculations within probability theory that everyone agrees with, actually adds anything.

My current personal conclusion is that Cox’s theorem doesn’t say anything much about scientific inference itself. In fact I think it is entirely possible for various forms of non-Bayesian statistical inference to be compatible with Cox’s axioms as applied to probability theory (not that I would unequivocally endorse any of these other paradigms either).

As a side point, for all the grief Jaynes gives Fisher in his book, my impression is that Fisher at least intuitively understood some of these points and tried to tackle them, whether or not he succeeded (of course I also have my issues with Fisher’s approach). So I think, despite being interesting and provocative, Jaynes is misleading on both the topic of logic and on the nature of the problem(s) of inductive inference.

• August 18, 2016

ojm: Well, the heavy-lifting of accounting isn’t really being done by addition and subtraction, but it doesn’t mean we shouldn’t use consistent rules of algebra when working with bank balances.

To me, that’s the status of Cox’s theorem, it tells us that the “bank balances” (credences) should be updated by the sum and product rules. You certainly can’t do physics without calculus and real algebra and matrix algebra and complex algebra and a set of equations for the forces, and a large database of facts about the constants associated with the force equations (Gravitational constant, acceleration constant at the surface of the earth, permittivity of free space, masses of particles, speed of light, etc).

In some sense Newton’s law F=ma is better written as “for all particles with mass m and for all situations where the total force is F, d^2x/dt^2 = F/m”. We can take this as given, for example, and still want to use some data x(t) measured imprecisely, some mass measurements m measured imprecisely, and some force measurements F(t) measured imprecisely, and infer as much as possible about what the m value is. We do that by putting credences on different m values, which are the result of updating our pre-measurement state of information: we add the measurements to our state of information and do the algebra required to get new credences conditional on our measurements.
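A minimal sketch of that kind of inference, under strong simplifying assumptions of my own: the force is treated as known exactly, the acceleration measurements have independent normal errors with a known standard deviation, and the prior on m is flat over a grid from 0.5 to 1.5. The measurement values are made up.

```python
import math

def mass_posterior(accels, force, accel_sd, masses):
    """Grid posterior over m, taking a = F/m as given and a flat prior on the grid."""
    log_post = []
    for m in masses:
        predicted = force / m  # acceleration implied by this candidate mass
        log_post.append(sum(-0.5 * ((a - predicted) / accel_sd) ** 2 for a in accels))
    peak = max(log_post)       # subtract the peak before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

masses = [0.5 + 0.001 * i for i in range(1001)]   # candidate masses from 0.5 to 1.5
accels = [2.05, 1.96, 2.01, 1.98, 2.00]           # made-up noisy acceleration measurements
posterior = mass_posterior(accels, force=2.0, accel_sd=0.05, masses=masses)
posterior_mean = sum(m * p for m, p in zip(masses, posterior))
```

The law itself is asserted outside the probability calculation; the credences over m are what the measurements update.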

The “heavy lifting” is being done by calculus, and models developed over many years of testing, and a little predicate calculus asserting the F=ma law, but the inference on the m is being done by probability theory and that seems to be necessary as soon as you want the inference to satisfy R1-R5.

• August 18, 2016

Which is why I think probably the easiest way to shut me down is just deny R1. You don’t want credences over real-world quantities… and then we can agree that Cox’s theorem doesn’t apply. But then from a philosophy of science perspective, what is it that you want? And what system are you going to choose to get that thing, and why should anyone believe that that thing is a good thing to have?

I’m pretty happy if I can find out that the posterior distribution over m = normal(1,0.05) on some scale, and now I can do some calculations predicting some trajectory, and know that my accelerations will be within around 5% error. That’s hugely better than “I know m is a real number” or even “I know m is a real number between 0 and 10^36” or some such thing.
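A quick Monte Carlo sketch of that claim, assuming a posterior of normal(1, 0.05) for m and a force fixed at 1 in the same units. The force value, sample size, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

force = 1.0     # hypothetical known force
# Draws from the assumed posterior over the mass.
mass_draws = [random.gauss(1.0, 0.05) for _ in range(20000)]
# Push each draw through a = F/m to get the predictive spread of the acceleration.
accel_draws = [force / m for m in mass_draws]

accel_mean = statistics.fmean(accel_draws)
relative_spread = statistics.stdev(accel_draws) / accel_mean
```

The relative spread of the predicted acceleration comes out at roughly 5%, matching the relative uncertainty in m.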

August 18, 2016

But in your analogy everyone uses the same addition and subtraction rule! The question is which accounting framework, each of which respects addition and subtraction etc, to use to address the business-world problem of accounting, not the problem of arithmetic.

Cox’s theorem gives the impression of a rigorous formal demonstration of the necessity of Bayesian thinking, but I claim it is not showing what you think it is showing. It is not incorrect, it just addresses a different question. That blog post I mentioned above also links to this by Shalizi:

http://bactra.org/weblog/569.html

And that’s really the point. There is (to me) a big gulf between the actual formal content of Cox’s theorem and your informal interpretations of it. It shows one thing, people read it, think it shows something else and add additional informal content to make up the deficit.

But other systems can equally well respect probability theory as related to propositional logic while using additional layers of theory for the distinct problem of scientific inference.

The weird thing I find with Cox/Jaynes discussions is people start with ‘look, here a rigorous formal justification of why Bayes is necessary for scientific inference’, someone points out a problem based on taking it seriously as a formal argument for the conclusions people draw from it and then an ad-hoc or informal fix or wave of the hands is offered back. It starts to feel like Freudian theory or something – it can always be patched up informally.

I have nothing against informal frameworks and intuitive ideas – and Bayes in practice a la Gelman is often great, I’ve used it and will probably continue to use it myself – but as Gelman said ‘most of the standard philosophy of Bayes is wrong’. Or, most of the formal justifications offered are not valid formal justifications of Bayesian epistemology etc etc, or of the necessity of using Bayesian statistical inference etc etc, despite informal Bayesian statistical tools being useful (alongside many others).

• August 18, 2016

You seem to be arguing about what “people” do. I’m just saying that I really do WANT real number credences over quantities, and that’s why I use Bayesian reasoning over Frequentist testing.

I’ve just said that you can deny Bayes by denying R1 (denying that you’re a 6 legged creature). But, then, there’s no reason why you should think that you can go back and do probability theory calculations using Frequentism and be better off.

Frequentist testing doesn’t give credences over quantities, but it pretends that it sort of does, and it doesn’t align with Boolean logic, but it pretends that it doesn’t matter.

So, we need to distinguish two issues:

1) How to go about doing the science stuff where real number credences don’t apply (that is, inventing calculus, deciding to consider the universal gravitation law as something worth investigating, and choosing which models of election outcomes make sense)

2) Given a set of hypothesized statements coming from (1) how to go about finding out what the proper quantities are of unknowns, and if we give credence to several explanations, how to determine which explanation has the highest credence given the data.

You seem to think that “people” claim Cox’s theorem implies Bayesian inference is all you need for both (1) and (2), and I agree with you that it’s not, but I think almost everyone doing any calculation with probability is trying to answer (2) in some way, and those doing it the Cox way get results that agree with boolean logic, and others don’t.

When I think “scientific inference” I think (2), but you say “the separate problem of scientific inference” which means to me that you don’t think “scientific inference” is (2) which means perhaps we’re just arguing over the scope of some words, which doesn’t seem productive.

When, in the recent Gelman post that ignited this discussion I said people need to deny Cox or stop using p values in the way they are typically used, I still think that’s valid. A small p value gives people a license to act as if the null were false, without actually implying that a real-number credence assigned via their actual state of information and the rules of probability theory is actually a small value. When people are told what p values mean technically it’s routine to discover that they instead believed that they were in essence Bayesian credences. To me, this implies that most scientists using statistics WANT credences. So R1-R3 are going to be uncontroversial to them as well. If you explain R4 and R5 most people will wind up being OK with those as well.

In the end, I think my claim that p values don’t do what people want and they need to stop using them and start using Bayes to get what they want is correct. Cox’s theorem just tells us that there’s a unique way to do those calcs.

August 18, 2016

I don’t deny that probability theory can be related to assigning real numbers to simple propositions.

> In fact, Cox pointed this out in his 1961 book The Algebra of Probable Inference, quoting Boole in Footnote 5, p. 101. In this passage, Boole not only makes the connection between the frequentist and logical interpretations of probability, he suggests that it is necessary—which is the point of Cox’s Theorem

What I deny is that assigning real numbers to simple propositions makes much if any progress towards a theory of scientific inference.

In particular, I deny that simple propositions, as used in propositional logic, have adequate descriptive power for a theory of scientific inference, and I deny that adding real number assignments to simple propositions fixes this.

The need of Van Horn to introduce a separate concept of a ‘state of information’, distinct from a proposition or set of propositions, is but one sign of this difficulty. This concept is also left quite informal in his paper.

• August 18, 2016

I guess in order to know whether I agree with you or not I need to know what you mean by “a theory of scientific inference”. And my guess is that you’re going to include a bunch of stuff in it, and then probably I’ll have to agree with you that a theory of all of that stuff put together isn’t given by Cox and Bayes. But, to me that might be like saying “gee, wheeled vehicles don’t give a complete theory of transportation” sure fine, but I’m not going to give up using the wheel for what it does do, and I’m not going to put up with people with square wheels pretending they’re just as good as circular ones… anyway, I agree with you that examination of the Van Horn article somewhat carefully is helpful, it does sort of draw the lines around what probability theory does and does not do.

11. August 18, 2016

Also, ojm, you keep saying that the “state of information” is “separate from a set of propositions”, but in fact Van Horn’s definition of a state of information is:

A state of information X summarizes the information we have about some set of atomic propositions A, called the basis of X, and their relationships to each other. The domain of X is the logical closure of A, that is, the union of A and all compound propositions that involve only atomic propositions from A.

I take this to mean, we start with a bunch of propositions A, and then everything that A implies gets computed (theoretically), and then the stuff that is relevant to assigning probabilities is distilled out of it and that’s the state of information used to assign the probabilities. So it’s not the same as a set of simple propositions, but it is basically “everything that the atomic propositions imply logically about probabilities of stuff on the left side of the conditioning bar”

So, Cox/Bayes probability is the unique real-number system that satisfies R1-R5 for starting with some propositions about probabilities, and some data, and winding up with some new propositions about probabilities.

August 18, 2016

That’s not a real definition of a state of information, that’s hand waving.

August 18, 2016

If you presented two different people with the same set of propositions, would they ‘summarise’ this as the same state of information? Given two arbitrary states of information X and Y, what is the relation between them, etc etc?

• August 19, 2016

Well, you may be right for a “general theory of scientific inference” whatever that turns out to be, but for the purposes of probability theory, you never need to compare states of information between people (and certainly we don’t want “what people do” to be an essential part of a formal system, since there is no such thing really), and all you need is that the summarize operation returns answers to questions like “what’s the probability of Foo” so that p(Foo | X(A)) is a well defined number.

I think the goal of “formalizing all scientific reasoning” is a non-starter (thanks, Gödel!), so for me Cox’s axioms work to say “when you can give probabilities for base objects, and you add some data, here’s how you should give new conditional probabilities, and they will be as consistent as possible” (i.e. as consistent as ZFC, thanks to the bit Corey pointed out above). It’s a specification of the algebra you should use to think about real-number credence/plausibility/probability assignments. It’s like telling all the accountants “use the rules of arithmetic”. Sure, it’s still the case that we need “good” accounting practices on top of that, but it’s no good if some accountants just don’t follow the arithmetic rules.

I think we’ve established that it’s not logically required that you MUST accept R1-R5 either. So, basically the contents of Cox’s axioms are “if you’re going to work with probabilities to help you do the accounting of what is and isn’t likely to be true out of a big database of facts… then you need to use the sum and product rules”

That’s not a theory of everything, but it’s not nothing either!
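To make “use the sum and product rules” concrete, here is a small sketch on a made-up joint distribution over two hypotheses and two data outcomes. The table values are purely illustrative assumptions:

```python
import numpy as np

# Hypothetical joint p(H, D | X): rows are hypotheses H0, H1,
# columns are data outcomes D0, D1. The numbers are invented.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

# Sum rule: marginalize over the variable you don't care about.
p_H = joint.sum(axis=1)
p_D = joint.sum(axis=0)

# Product rule: p(H, D | X) = p(D | H, X) p(H | X).
p_D_given_H = joint / p_H[:, None]

# Combining the two gives the updated credences after observing D1.
posterior = p_D_given_H[:, 1] * p_H / p_D[1]
print(posterior)
```

With these numbers the credence in H1 moves from 0.6 before the data to 0.8 after observing D1.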

12. August 19, 2016

See now look what you’ve done… I went out and bought a kindle copy of Pinter’s A Book of Set Theory.

• August 20, 2016

Ok, so having skimmed through Pinter’s book, he discusses towards the end a result from Model theory, called the Completeness Theorem, which says that a theory T is consistent if and only if it has a model (an example set of objects and example formal operations on that set). So Van Horn pointing to Kolmogorov probability theory is basically saying “Bayesian Probability is consistent because it’s isomorphic to this model”. He actually says explicitly that you take “a state of information” as a probability distribution over the uncertain quantities, basically your large set of propositions needs to get you the form of a joint distribution over parameters and data. Then, the Bayesian updating is a consistent way to update because it’s isomorphic to the Kolmogorov model. So, ojm you’re right in saying that there’s heavy lifting going on outside probability theory at the level of the “state of information” and its mapping to probability distributions. But, I don’t think that was ever something that Jaynes or I or most Bayesians deny. We need a way to say “gee, it seems to me like this mathematical model of the world is probably something I should pay attention to, and it implies a joint distribution p(Data,Params)” before you can start doing the Bayes dance.

The fact that priors are not universal, that is, that people can come up with different priors from a similar state of information is well known, so we still need some “scientific reasoning” outside probability theory, which I think everyone admits (all true Scotsmen at least 🙂 ). But Godel gives us a reason to think that no theory is complete in and of itself which means we shouldn’t be surprised by this fact, and reading Pinter’s book, it’s just taken for granted in modern set theory that if we want to make correspondences between mathematical statements and the world around us, that occurs outside our formal system, because a formal system is just a very limited language about symbolic formulas. So in some sense, the point of science is to figure out which models allow you to make logical deductions which then also turn out to be valid truths about the universe (ie. the model is good at predicting what happens in the world). The model formulation, and decisions about what to measure, and how to measure them etc occur outside probability theory, but if you want to account (with a single real number) for how much the weight of all your knowledge is for or against a particular statement about the world being true, you should use Bayesian updating to get a consistent system.

August 20, 2016

I think we’re getting closer to being on the same page.

Now, given a single probability distribution (state of information) we can make inferences ‘within’ this single probability distribution. Eg calculate conditional probabilities given a single global probability distribution.

So, what would lead you to reject the single overall joint distribution as ‘not good enough’? How would you compare two competing full joint distributions or compare either to reality?

• August 20, 2016

So, I think this depends. First off if you have multiple competing predictive models, you should analyze them together within a joint distribution where each model gets some prior probability associated rather than separately. So:

p(Data | Model1) p(Model1) + p(Data | Model2) p(Model2) + …

This is a consistent way to analyze them and compare them together. The question is, what if none of the models are “adequate”? For example, none of them have any portion of the parameter space where they’re particularly predictive. I think to deal with this you have to remember that in pretty much every situation, you’re working with a truncated version of some ideal scientific process. You’re always working with a small finite set of predictor variables, and a small finite model specification, and limited computing time, etc. Sometimes it just makes sense to go back to the drawing board and seek out new models.
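As a sketch of the sum above, take a toy coin-flip example with two hypothetical models: Model1 fixes the heads probability at 0.5, and Model2 puts a uniform prior on it. The models, the data (9 heads in 10 flips), and the equal prior model probabilities are all assumptions chosen so the marginal likelihoods have closed forms:

```python
from math import comb

def marginal_fixed(k, n, p=0.5):
    # p(Data | Model1): probability of one specific sequence
    # with k heads in n flips when the heads probability is fixed at p.
    return p ** k * (1 - p) ** (n - k)

def marginal_uniform(k, n):
    # p(Data | Model2): integral of q^k (1-q)^(n-k) over q in [0, 1],
    # which equals 1 / ((n + 1) * C(n, k)).
    return 1.0 / ((n + 1) * comb(n, k))

def posterior_model_probs(marginals, priors):
    # Each term is p(Data | Model_j) p(Model_j); their sum is p(Data).
    weighted = [m * q for m, q in zip(marginals, priors)]
    evidence = sum(weighted)
    return [w / evidence for w in weighted]

k, n = 9, 10
probs = posterior_model_probs(
    [marginal_fixed(k, n), marginal_uniform(k, n)],
    [0.5, 0.5],
)
print(probs)
```

Note that these probabilities are only relative to the models included in the sum. If both models predict badly, one of them still ends up with most of the posterior probability.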

One question is how do you detect this situation? In the absence of a specific model you can include in the Bayesian calculations, there’s no way to just put in a generic “something else”. I think Gelman has some useful ideas about using generative models with high-posterior-probability parameter values to compare fake-data with real-data, and so forth. Those techniques *do* help. Graphical techniques for model-data comparison definitely help. I definitely don’t have a complete theory of model checking to offer. In many cases I’ve just wound up with a model that isn’t terrible and work with it until something better occurs to me. In part, it depends on what your needs are, how the model is getting used.

So, in this sense, model comparison can be done within Bayes, but model “adequacy” really has to be done outside Bayes.

• August 21, 2016

Thinking about a generic “something else”. I was imagining a model in which I hypothesize how well a “reasonably good” model should perform, and then I use a pseudo-model that takes the actual data, generates a random perturbation about the size of the error in a “good” model, and then generates a likelihood

Product(p(Data_i | Perturbed_Data_i, model_error_size), i)

say using a normal model for the “model error”

This model could then be included in a mixture model in which the probability associated with it is a parameter, whose hyper-prior assigns relatively low probability, and if it winds up having high posterior probability it’s an indication that your predictive models are not very good compared to how good you think they should be.
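A rough sketch of this idea, with several simplifying assumptions that are not in the comment above: the predictive model has no free parameters, the mixture is over whole-dataset models, and the hyper-prior on the pseudo-model’s weight is collapsed to a fixed prior probability. The data, the deliberately poor model, and all names are invented for illustration:

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def something_else_posterior(data, pred, model_error_size, prior_weight, rng):
    # Pseudo-model: perturb the actual data by about the error a
    # "reasonably good" model would make, then score the data against it.
    perturbed = data + rng.normal(0.0, model_error_size, size=data.size)
    ll_pseudo = normal_logpdf(data, perturbed, model_error_size).sum()
    # Real predictive model, scored with the same error scale.
    ll_model = normal_logpdf(data, pred, model_error_size).sum()
    # Posterior probability of the pseudo-model in the two-component mixture.
    a = np.log(prior_weight) + ll_pseudo
    b = np.log1p(-prior_weight) + ll_model
    return float(np.exp(a - np.logaddexp(a, b)))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
data = 3.0 * x ** 2 + rng.normal(0.0, 0.05, size=x.size)  # hypothetical data

poor_pred = 3.0 * x  # a deliberately misspecified straight-line model
post = something_else_posterior(data, poor_pred, 0.05, 0.01, rng)
print(post)
```

Here the straight-line model misses by far more than the assumed error size, so the pseudo-model takes nearly all the posterior probability despite its low prior weight. Because the pseudo-model is built from the data itself, it works as a benchmark for how well a good model should do rather than as a real generative model.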