Andrew Gelman discusses John Cook's post about order-of-magnitude estimates and goes on to state that is a "hopelessly innumerate" estimate of the probability of a decisive vote in a large election.
What I have to say about that is that a probability of a decisive vote in a particular election is very different than the long-run frequency of decisive votes in national elections. I'll venture to say that the long run frequency of decisive votes in elections involving more than 10 million voters will be zero exactly. There will be a finite number of these elections before the end of the human race, and as in the Gore/Bush case, there are too many ways to fiddle with the vote counts for a decisive vote to ever really occur, any election where the vote is down to a few hundred people will be decided by committee, even if that committee is the people deciding which ballots are invalidated. Committees invalidating ballots will always find a way to invalidate enough that the difference isn't down to 1 vote (a prediction, but not unreasonable I think).
But, whether an estimate like is terrible in a given election is down to what our state of knowledge is about that election. Consider the following graph:
This binomial model puts a moderate amount of probability on an exact 1M vote outcome in a 2M vote election if you put exactly p=0.50000... but if you vary from this p by even +- 0.001 the probability plummets to "hopelessly innumerate" levels. But what does this even mean in our case?
In real world situations, we have the following uncertain variables:
- How many votes will be cast.
- How many votes will be allowed, and from which districts.
- What will be the total count of the allowed votes (assume a yes/no vote on a ballot measure for simplicity).
Note that there is no "p" that corresponds to the binomial probability formula. The usual intuition on such formulas is that p is the long run frequency that will be observed in infinitely repeated trials. Such a parameter is meaningful for an algorithmic random number generator, but that interpretation is meaningless for a single election. But a binomial distribution is a reasonable model for counts of yes/no sequence outcomes where we know nothing about which individual sequence we will get except that sequences with more or less counts are more or less likely in a certain sense (in the sense of the parameter p indexing the highest probability count).
So, if we're in a state where we are quite certain that the highest probability count is a little different from 1000000/2000000 it is very reasonable to call out the chance of a given election as , the fact is though, that a prior over the hyperparameter p (the maximum probability count) rarely is strongly peaked around any given value (ie. peaked around 0.5001 +- 0.00005). Much more likely a probability distribution for the "highest probability count" (ie. a prior over p) would be broad at the level of 0.5 +- 0.02...
I think the average researcher views statisticians as a kind of "Gatekeeper" of publication. Do the right incantations, appease the worries about distributional approximations, or robustness of estimators, get the p < 0.05 or you can't publish. In this view, the statistician doesn't add to the researcher's substantive hypothesis, more keeping the researcher from getting into an accident, like a kind of research seat-belt.
The alternative version is what I like to think of as the Keymaster role. A researcher, having a vague substantive hypothesis and an idea of technically how to go about collecting some data that would be relevant, can come to a good statistician, or better yet mathematical-modeler (which encompasses a little more than just applied probability, sampling theory etc) who will help make a vague notion into a fairly precise and quantitative statement about the world. This process will get you thinking about the relationships between your quantities of interest, and identify some substantive but unknown parameters that describe the system you are studying. That model structure will then give you a sense of what data will best inform you about these precise quantities, and then ultimately when the Keymaster analyzes the collected data, he or she can extract the meaningful internal unobserved quantities that you really care about (but didn't know about) originally.
This explains why I think it's a big mistake to go out and collect some data first and then show up and expect a statistican to help you make sense of it.
And, I mean really, who wouldn't want to be Rick Moranis??
From G.I. Barenblatt's recently released book: "Flow, Deformation, and Fracture" published by Cambridge, footnote 3, pg 4:
There is nowadays considerable discussion concerning the subject of applied mathematics. In fact, its proper understanding is clarified if we remember the famous saying of J. W. Gibbs: "Mathematics is also a language." If so, then on the one hand pure mathematicians can be identified with linguists, who study the laws of the formation, history, and evolution of languages. On the other hand applied mathematicians, building models of phenomena, are analogous to writers and poets who use language to create novels, poetry, critical essays etc. Applied mathematics is the art of constructing mathematical models of phenomena in nature, engineering, and society
A novelist needs a facility with words, and an understanding of what makes good insight into the human condition to create a great novel. An intense study of the family tree of indonesian and southeast asian ancient languages tracing the evolution of words and writing systems would generally not help. At the same time, indigenous peoples in a rainforest may have intense knowledge of ecological interconnections, but without a facility for written language they will never communicate it to the rest of the world.
On the other hand, Applied Mathematics isn't like say science journalism, in which writers write about what others have done, it's more like perhaps investigative journalism, in which journalists uncover and put together the facts, organize the ideas, and explain them to the world.
Luckily I keep good backups. But recently I started Firefox and it said something like "you haven't used firefox in a while, would you like to start over with a new profile to take advantage of our new features?" So since I mostly use Chromium, I said, "yes". Days later I tried to start Zotero and it had no data directory (because by default it uses the one in .firefox/<profile directory>/zotero.
Thanks to rdiff-backup I was able to recover my zotero directory and put it in .zotero/zotero, but this would have been a BIG deal, and Firefox should have done something like move the old directory to a backup location, not nuke it entirely.
To follow up on my discussion of the Ebola uncertainty. Let's take a look at some very basic differential equations that we can use to get an idea of the factors that go into making up an epidemic.
First, we'll model a population as having infected and uninfected . Let's also measure these populations as a fraction of the total population. So initially and and is small (like maybe or ). Now, how does the infected population grow?
The assumption here is that in a short unit of time, each person becomes in contact with a certain number of people, and for the initial stages at least, this drives the infection. Note that in later stages, population will begin to be reduced as they die off, and there is more going on. We're interested mainly in the initial stages because we'd like to avoid a major epidemic killing off a few percent of the worlds population etc.
Now, and are unitless (they are the ratios of counts of people), and has units of time, so has units of "per time". It represents the rate at which infected people mix with uninfected people, times the fraction of these mixings which result in transmission. In theory, the fraction of mixing that results in transmission is the definition of from my previous post (EDIT: not quite, R_0 is actually the fraction of mixings that result in infection, times the average number of mixings throughout a total epidemic... but we could imagine that's constant...)... so we can replace with where is the rate of mixing.
starts out at near zero, and we're interested in how the infection grows, hopefully we will do something to squash it before it reaches more than 0.005 or 1/2 a percent of the population, so we can assume initially, that is for small .
This is the equation for exponential growth, we can make it dimensionless by choosing to be the unit of time, and we get:
So all epidemics are similar at some time scale, and controlled by , this reassures only the naivest of mathematicians, because the assumption is only valid for . In a situation in which the mixing time is small, this could mean we have only say a few days before at which point we have a SERIOUS problem (2% of the population actively has Ebola, and that would be devastating). The point is, the equation has to change before gets too big in dimensionless time.
So is useful as an index of how infective the virus is, but NOT how quickly it will spread, since there is also the mixing time to be considered. In western countries we'd have to imagine that the mixing time could be much lower than in West Africa, and so effective response would have to be much faster.
In addition, another dimensionless group is important, namely , where is the time it takes to effectively institute response measures and is the mixing time. The larger this is (the longer the response takes in dimensionless time) the bigger will be the problem.
Fortunately, we also have maybe some suggestion that would be smaller in the US, in West Africa many tribal groups wash and prepare their dead, then kiss the bodies to say goodbye... not a good idea with Ebola. Also, there have been attacks on healthcare workers as ignorant people believe Ebola is either a hoax or spread by the government or whatever. Those things probably won't happen in the US.
All this is to say, there is a lot of uncertainty, with mixing time and infectivity both having different values in Western countries than in West Africa. So the actual number of days or weeks we will have to effectively respond, and change the equation of growth of the infected population is unknown. One thing we DO know though, is the faster the better. And this is where the CDC and other officials are not driving a lot of confidence in the US population. The general population's cry of "we need to do something about this NOW" is well justified. Given that Ebola has been around for decades, there should be an established plan and some contingencies that have already been thought out. That this doesn't seem to be the case is not confidence inspiring.
or (the basic reproduction number) is a parameter used in mathematical models of infection. It's in theory the time integrated average number of people who will be infected by each new case. An suggests the infection will die out, and greater than 1 suggests it will spread. But is a tricky thing to calculate. Wikipedia gives references to how it's calculated, and in fact it seems to be that these different methods of calculation give different results even with a given infection, and likely comparison across diseases is not indicative of something that can really be compared accurately.
But beyond the difficulty of actually calculating such a parameter, there's the uncertainty involved when an epidemic moves from one environment, where you've got a lot of data (say West African Ebola), to another environment which has very different social dynamics and where you have very little data (Say Ebola in International Airline Travel). Bayesian methods can be used to help give a sense of the uncertainty in the parameter once you've got enough cases to do calculations... But I'm going to hope we will have to rely primarily on prior data in the Ebola outbreak. Unfortunately, we are going to have to put a wide prior on in the global case, because we just don't know how highly mobile and interacting societies compare to West African villages in the spread of this disease.
It's a well known phenomenon in granular materials that if you fill up a tube full of sand and then you tap the tube repeatedly, the sand will settle down to a certain stable height in the tube. Typically the variability between the "least dense" and "most dense" states is a few percent of the height. So for example you might start with 10cm of sand, tap it for a while and wind up with 9cm of sand. Note that it's also possible though difficult to get your sand into a state where it actually expands as you tap it, but generally doing so requires you to crush the sand into the tube initially, when poured into the tube the sand will generally be less than or about equal to equilibrium density.
During my PhD I spent a lot of time thinking about how to model this process. One of the key issues is that we have essentially no information about the sand. For example the position, orientation, shape, and material properties (elasticity, surface/friction properties, etc) of the individual grains. It's tempting to say that this is similar to the situation in the ideal gas where we have no idea where, how fast, or in what direction any of the atoms are. That's true, in so far as it goes. But whereas in the ideal gas we have no interactions between the gas molecules, in the static sand condition we have essentially nothing but interactions between the sand grains. At first glance it seems hopeless to predict what will happen when what will happen is caused by interactions, and we have virtually no information about those interactions.
However, it does also depend on what you want to predict, and for someone interested in say soil liquefaction, the main thing to predict is how some disturbance such as a shear wave will affect the density of the soil, and in particular when that soil is saturated with water.
So consider a sand tapping experiment. We have a short-ish column of sand at uniform porosity (the fraction of the volume taken up by voids), and we tap this tube of sand with a blow from a hammer having kinetic energy which is small compared to the total gravitational potential of the deposit relative to the bottom of the tube (you won't be lifting the whole tube off the table and putting it into near earth orbit), but large compared to the gravitational potential of a single grain sitting at the top of the tube (you may very well bounce the grains sitting at the surface up a few millimeters), and given this energy, the sand grains bounce around a bit. Most of the sand grains will move not-very-far, you won't have a grain go from the bottom of the tube to the top for example. The average center-of-mass distance traveled is likely to be considerably less than a typical grain diameter. However, the orientations of the grains may change by larger fractions, it wouldn't be completely unheard of for a grain to rotate 180 degrees around some axis.
This tapping process is in many ways like the process of a random "proposal" in MCMC. It moves the grains to a nearby state, one in which the total energy is within about of the initial energy. It makes sense to ask the question: "Given that the final state is somewhere in a very high dimensional state space which has energy within about of the initial energy, what is the that we're likely to observe?"
It is, in general, hopeless to try to compute this from first principles for realistic sands, you might get somewhere doing it for idealized spherical beads or something like that, but it isn't hopeless to try to observe what actually happens for some sample of sand, and then describe some kind of predictive model. In particular it seems like what we'd want is a kind of transition kernel:
at least for in a certain range.
So, while I didn't get around to doing it in my PhD dissertation, I may very well need to go out and buy a bag of sand, a clear plastic tube, some kind of small hammer, and a bit of other hardware and have a go at collecting some data and seeing what I get.
I've been sick a lot recently, in part thanks to having small children. In any case, one thing I've been doing is revisiting Chess. I honestly am pretty clumsy at Chess but it's one of those things I always felt I should probably do. When I was younger most of my friends played stronger games than me, and it was hard to enjoy when you were getting beaten all the time. Now, thanks to Moore's law and clever programming, even the very very very top players are useless against a 4 core laptop computer running Stockfish.
So we can all agree now that it's no fun getting blasted out of the water every time, but also we can use computers to make things better and more interesting for humans, since that's what they're for right?
There are lots of proposals for randomized or alternative starting position Chess games. For example Chess 960 (Fischer random chess) is a variant with 960 possible starting positions. The idea is to avoid making Chess a game where a big advantage comes from memorizing opening moves in some opening database. I'm more or less for this in my play. I enjoy playing Chess well enough, but I have absolutely NO interest in poring over variation after variation in a big book of opening theory. I think some people like this stuff, so for them, they can of course continue to play regular chess.
On the other hand, for people like me, consider the following method of starting the game:
- Set up the board in standard starting position.
- Using a computer, play N random legal pairs of moves (turns). possibly with or without capture.
- Using a chess program on the computer, find the "score" for this position.
- Accept the position if the chess program decides that the score is within of (where positive is good for white, negative is good for black, this is standard output from chess engines), otherwise go to step 1.
- Assign the color randomly to the human (or to one of the humans if you're not playing against a computer).
- Start the game by allowing white to move.
Note, this variation also can be used to handicap games by accepting a starting position if it is within of some handicap value , and then assigning the weaker player to the color who has the advantage. It's also possible to play random moves and then allow the computer to move moves until the score evens out properly if you can get support from the chess engine. Finally, it's also possible to search a large database of games for one in which after moves the position evaluates to within of the appropriate handicap value, rather than generating random moves.
I suspect to would be the appropriate number of moves to use.
Now, who will implement this in SCID or the like?
Frequentist statistics often relies on p values as summaries of whether a particular dataset implies an important property about a population (often that the average is different from 0).
In a comment thread on Gelman's blog (complete with a little controversy) I discussed some of the realistic problems with that, which I'll repeat and elaborate here:
When we do some study in which we collect data and then calculate a value to see if it has some particular property, we calculate the following:
Where is a functional form for a cumulative distribution function, and are sample statistics of the data .
A typical case might be where is the sample average of the data and is the sample standard deviation, is the number of data points, and is the standard t distribution CDF with degrees of freedom.
The basic idea is this: you have a finite population of things, you can sample those things, and measure them to get values . You do that for some particular sample, and then want to know whether future samples will have similar outcomes. In order for the value to be a meaningful way to think about those future samples you need:
- Representativeness of the sample. If your sample covers a small range of the population's total variability, then obviously future samples will not necessarily look like your current sample.
- Stability of the measurements in time. If the population's values are changing on the timescale between now and the next time you have a sample, then the p value is meaningless for the future sample.
- Knowledge of a good functional form for . When we can rely on things like central limit theorems, and certain summary statistics therefore have sampling distributions that are somewhat independent of the underlying population distribution, we will get a more robust and reliable summary from our p values. This is one reason why the t-test is so popular.
- Belief that there is only one, or at least a small number of possible analyses that could have been done, and that the choice of sample statistics and functional form are not influenced by information about the data: represents in essence a population of possible p values from analyses indexed by , when there are a wide variety of possible values for , the fact that one particular p value was reported with "statistical significance" only indicates to the reader that it was possible to find a given that gave the required small .
The "Garden of Forking Paths" that Gelman has been discussing is really about the size of the set independent of the number of values that the researcher actually looked at. It's also about the fact that having seen your data, it is plausibly easier to choose a given analysis which produces small values even without looking at a large number of values when there is a large plausible set of potential .
Gelman has commented on all of these, but there's been a fair amount of hoo-ha about his "Forking Paths" argument. I think the symbolification of it here makes things a little clearer, if there are a huge number of values which could plausibly have been accepted by the reader, and the particular value chosen (the analysis) was not pre-registered, then there is no way to know whether is a meaningful summary about future samples representative of the whole population of things.
What problems are solved by a Bayesian viewpoint?
Representativeness of the sample is still important, but if we have knowledge of the data collection process, and background knowledge about the general population, we can build in that knowledge to our choice of data model and prior. We can, at least partially, account for our uncertainty in representativeness.
Stability in time: A Bayesian analysis can give us reasonable estimates of model parameters for a model of the population at the given point in time, and can use probability to do this, even though there is no possibility to go back in time and make repeated measurements at the same time point. Frequentist sampling theory often confuses things by implicitly assuming time-independent values, though I should mention it is possible to explicitly include time in frequentist analyses.
Knowledge of a good functional form: Bayesian analysis does not rely on the concept of repeated sampling for its conception of a distribution. A Bayesian data distribution does not need to reproduce the actual unobserved histogram of values "out there" in the world in order to be accurate. What it does need to do is encode true facts about the world which make it sensitive to the questions of interest. see my example problem on orange juice for instance.
Possible Alternative Analysis: In general, Bayesian analyses are rarely summarized by p values, so the idea that the values themselves are random variables and we have a lot to choose from is less relevant. Furthermore, Bayesian analysis is always explicitly conditional on the model, and the model is generally something with some scientific content. One of the huge advantages of Bayesian models is that they leave the description of the data to the modeler in a very general way. So a Bayesian model essentially says: "if you believe my model for how data arises, then the parameter values that are reasonable are ". Most Frequentist results can be summarized by "if you believe the data arise by some kind of simple boring process, then you would be surprised to see my data". That's not at all the same thing!