"99.999999999% durability of objects over a given year"

2015 May 27
by Daniel Lakeland

Google Cloud storage claims "11 nines" of reliability in their storage. The earth is 4 billion years old, and has experienced about 5 mass extinction events. Google believes that if I put a picture of my cat in their cloud storage today, and then wait another 4 billion years, after 5 more mass extinctions, there's a 96% chance I'll still be able to access my cat picture?

Perhaps not.


What's more important, the data or the method?

2015 May 24
by Daniel Lakeland

Suppose for a moment that you are involved in a psychology experiment. A grad student brings you into a room with a table, on which is a bag full of coins, and a small pile of pennies. The student tells you that the pennies were sampled from the bag, there are 6 of them on the table. Using a scale you can determine that the 6 pennies weight 0.5 grams, and that the bag weighs 200 g.

Seeing this set-up you are then asked to write what you can determine about the bag?

What can you write? Well, the answer is "not much". In particular, from the set up, you can write "the bag used to contain at least 6 pennies" and very little else, however, it seems plausible that given this set up, many people would naturally be inclined to extrapolate the money in the bag, saying that there are about 200 * 6/0.5 pennies total in the bag. And, if we extend this example to more realistic situations, such as determining in a lawsuit how much damage occurred in a particular building, the likelihood that the "natural" extrapolation would be employed by a typical participant becomes very high (in my experience). The set up, with a bunch of seemingly precise data in a textbook type example leads you along the garden path towards the "desired answer".

To counteract this, I came up with the bag example, and an explanation about the missing information: "how were the pennies sampled?"

That this question is crucial becomes obvious when we consider the following different options:

  1. Reach into the bag, stir it around, and pinch a bunch of coins, pull them all out and put them on the table.
  2. Reach into the bag, pull out coins one at a time. If you get anything other than a penny, flip it, and if it comes up heads, throw it back. Pull coins until you get 6 total.
  3. Reach into the bag, pull out one coin at a time, and throw back anything that isn't a penny. Continue until you get 6 pennies.
  4. Reach into the bag, pull out the first coin, flip it, if it comes up heads, use protocol (1) above to pull a sample of coins, if it comes up tails, use protocol (2) above...
  5. ........ there are an enormous set of possible methods to sample coins .....

In litigation situations, people with expertise in the type of damages often are invoked to look carefully at some set of samples and determine in detail the degree or types of damages involved in those samples. Statisticians are very familiar with the scientist who shows up with a pile of detailed expensive data, and in the words of John Tukey "an aching desire for an answer". Unfortunately, detailed data about some totally un-known sampling process produces... totally unknown answers.


Quality of Service (QOS), VOIP, and Netflix

2015 March 28
by Daniel Lakeland

I have a fairly extensive SIP based VOIP system. Recently, after buying a FireTV Stick for Christmas, I've been having lots of complaints about "breaking up" when the kids are streaming high def content, like Sesame Street or nature shows about ants or whatever.

I thought this was fairly odd, because I run an OpenWRT router with QOS settings that I THOUGHT would take care of prioritizing my upstream voice (RTP packets).

So, to make a long story short, the standard QOS scripts prioritize TCP SYN and TCP ACK packets, and apparently the way that Netflix works is to every 10 seconds or so, open up a gazillion HTTP requests to grab blocks of video, and these got prioritized highest priority due to the SYN and ACKs or some such thing.

Eliminating this reclassifying behavior makes my voice nice and steady even while streaming.

Even Bayesians can be confused about frequencies

2014 December 28
by Daniel Lakeland

Andrew Gelman discusses John Cook's post about order-of-magnitude estimates and goes on to state that 10^{-90} is a "hopelessly innumerate" estimate of the probability of a decisive vote in a large election.

What I have to say about that is that a probability of a decisive vote in a particular election is very different than the long-run frequency of decisive votes in national elections. I'll venture to say that the long run frequency of decisive votes in elections involving more than 10 million voters will be zero exactly. There will be a finite number of these elections before the end of the human race, and as in the Gore/Bush case, there are too many ways to fiddle with the vote counts for a decisive vote to ever really occur, any election where the vote is down to a few hundred people will be decided by committee, even if that committee is the people deciding which ballots are invalidated. Committees invalidating ballots will always find a way to invalidate enough that the difference isn't down to 1 vote (a prediction, but not unreasonable I think).

But, whether an estimate like 10^{-90} is terrible in a given election is down to what our state of knowledge is about that election. Consider the following graph:

The binomial density for 1,000,000 yes votes in a 2,000,000 vote election as function of the probability p

The binomial "density" for 1,000,000 yes votes in a 2,000,000 vote election as function of the probability p

This binomial model puts a moderate amount of probability O(10^{-4}) on an exact 1M vote outcome in a 2M vote election if you put exactly p=0.50000... but if you vary from this p by even +- 0.001 the probability plummets to "hopelessly innumerate" levels. But what does this even mean in our case?

In real world situations, we have the following uncertain variables:

  1. How many votes will be cast.
  2. How many votes will be allowed, and from which districts.
  3. What will be the total count of the allowed votes (assume a yes/no vote on a ballot measure for simplicity).

Note that there is no "p" that corresponds to the binomial probability formula. The usual intuition on such formulas is that p is the long run frequency that will be observed in infinitely repeated trials. Such a parameter is meaningful for an algorithmic random number generator, but that interpretation is meaningless for a single election. But a binomial distribution is a reasonable model for counts of yes/no sequence outcomes where we know nothing about which individual sequence we will get except that sequences with more or less counts are more or less likely in a certain sense (in the sense of the parameter p indexing the highest probability count).

So, if we're in a state where we are quite certain that the highest probability count is a little different from 1000000/2000000 it is very reasonable to call out the chance of a given election as 10^{-90}, the fact is though, that a prior over the hyperparameter p (the maximum probability count) rarely is strongly peaked around any given value (ie. peaked around 0.5001 +- 0.00005). Much more likely a probability distribution for the "highest probability count" (ie. a prior over p) would be broad at the level of 0.5 +- 0.02...


Gatekeeper vs Keymaster: the role of statistician / mathematical modeler in science

2014 December 22
by Daniel Lakeland

The Gatekeeper vs the Keymaster


I think the average researcher views statisticians as a kind of "Gatekeeper" of publication. Do the right incantations, appease the worries about distributional approximations, or robustness of estimators, get the p < 0.05 or you can't publish. In this view, the statistician doesn't add to the researcher's substantive hypothesis, more keeping the researcher from getting into an accident, like a kind of research seat-belt.

The alternative version is what I like to think of as the Keymaster role. A researcher, having a vague substantive hypothesis and an idea of technically how to go about collecting some data that would be relevant, can come to a good statistician, or better yet mathematical-modeler (which encompasses a little more than just applied probability, sampling theory etc) who will help make a vague notion into a fairly precise and quantitative statement about the world. This process will get you thinking about the relationships between your quantities of interest, and identify some substantive but unknown parameters that describe the system you are studying. That model structure will then give you a sense of what data will best inform you about these precise quantities, and then ultimately when the Keymaster analyzes the collected data, he or she can extract the meaningful internal unobserved quantities that you really care about (but didn't know about) originally.

This explains why I think it's a big mistake to go out and collect some data first and then show up and expect a statistican to help you make sense of it.

And, I mean really, who wouldn't want to be Rick Moranis??


One of the best books in Applied Mathematics and Mechanics ever:

2014 December 2
by Daniel Lakeland

From G.I. Barenblatt's recently released book: "Flow, Deformation, and Fracture" published by Cambridge, footnote 3, pg 4:

There is nowadays considerable discussion concerning the subject of applied mathematics. In fact, its proper understanding is clarified if we remember the famous saying of J. W. Gibbs: "Mathematics is also a language." If so, then on the one hand pure mathematicians can be identified with linguists, who study the laws of the formation, history, and evolution of languages. On the other hand applied mathematicians, building models of phenomena, are analogous to writers and poets who use language to create novels, poetry, critical essays etc. Applied mathematics is the art of constructing mathematical models of phenomena in nature, engineering, and society

A novelist needs a facility with words, and an understanding of what makes good insight into the human condition to create a great novel. An intense study of the family tree of indonesian and southeast asian ancient languages tracing the evolution of words and writing systems would generally not help. At the same time, indigenous peoples in a rainforest may have intense knowledge of ecological interconnections, but without a facility for written language they will never communicate it to the rest of the world.

On the other hand, Applied Mathematics isn't like say science journalism, in which writers write about what others have done, it's more like perhaps investigative journalism, in which journalists uncover and put together the facts, organize the ideas, and explain them to the world.

Dear Firefox, you just nuked my entire 5 years of Zotero archives!

2014 October 21
by Daniel Lakeland

Luckily I keep good backups. But recently I started Firefox and it said something like "you haven't used firefox in a while, would you like to start over with a new profile to take advantage of our new features?" So since I mostly use Chromium, I said, "yes". Days later I tried to start Zotero and it had no data directory (because by default it uses the one in .firefox/<profile directory>/zotero.

Thanks to rdiff-backup I was able to recover my zotero directory and put it in .zotero/zotero, but this would have been a BIG deal, and Firefox should have done something like move the old directory to a backup location, not nuke it entirely.

Dimensional analysis and the Ebola epidemic

2014 October 18
by Daniel Lakeland

To follow up on my discussion of the Ebola uncertainty. Let's take a look at some very basic differential equations that we can use to get an idea of the factors that go into making up an epidemic.

First, we'll model a population as having infected I and uninfected U. Let's also measure these populations as a fraction of the total population. So initially I=\epsilon and U=1-\epsilon and \epsilon is small (like maybe 10^{-6} or 10^{-8}). Now, how does the infected population grow?

\frac{dI}{dt} = k_{IU} I U = k_{IU} I (1-I)

The assumption here is that in a short unit of time, each I person becomes in contact with a certain number of U people, and for the initial stages at least, this drives the infection. Note that in later stages, I population will begin to be reduced as they die off, and there is more going on. We're interested mainly in the initial stages because we'd like to avoid a major epidemic killing off a few percent of the worlds population etc.

Now, I and U are unitless (they are the ratios of counts of people), and t has units of time, so k_{IU} has units of "per time". It represents the rate at which infected people mix with uninfected people, times the fraction of these mixings which result in transmission. In theory, the fraction of mixing that results in transmission is the definition of R_0 from my previous post (EDIT: not quite, R_0 is actually the fraction of mixings that result in infection, times the average number of mixings throughout a total epidemic... but we could imagine that's constant...)... so we can replace k_{IU} with R_0 r_{IU} where r_{IU} is the rate of mixing.

\frac{dI}{dt} = R_0 r_{IU}I(1-I)

I starts out at near zero, and we're interested in how the infection grows, hopefully we will do something to squash it before it reaches more than 0.005 or 1/2 a percent of the population, so we can assume (1-I) \approx 1 initially, that is for small t.

\frac{dI}{dt} \approx R_0 r_{IU}I

This is the equation for exponential growth, we can make it dimensionless by choosing t_0 = 1/r_{IU} to be the unit of time, and we get:

\frac{dI}{dt'} \approx R_0I : t' = O(1)

So all epidemics are similar at some time scale, and controlled by R_0, this reassures only the naivest of mathematicians, because the assumption is only valid for t' = O(1). In a situation in which the mixing time t_0 is small, this could mean we have only say a few days before I \approx 0.02 at which point we have a SERIOUS problem (2% of the population actively has Ebola, and that would be devastating). The point is, the equation has to change before t gets too big in dimensionless time.

So  R_0 is useful as an index of how infective the virus is, but NOT how quickly it will spread, since there is also the mixing time to be considered. In western countries we'd have to imagine that the mixing time could be much lower than in West Africa, and so effective response would have to be much faster.

In addition, another dimensionless group is important, namely t_rR_0/t_0, where t_r is the time it takes to effectively institute response measures and t_0 is the mixing time. The larger this is (the longer the response takes in dimensionless time) the bigger will be the problem.

Fortunately, we also have maybe some suggestion that R_0 would be smaller in the US, in West Africa many tribal groups wash and prepare their dead, then kiss the bodies to say goodbye... not a good idea with Ebola. Also, there have been attacks on healthcare workers as ignorant people believe Ebola is either a hoax or spread by the government or whatever. Those things probably won't happen in the US.

All this is to say, there is a lot of uncertainty, with mixing time t_0 and infectivity R_0 both having different values in Western countries than in West Africa. So the actual number of days or weeks we will have to effectively respond, and change the equation of growth of the infected population is unknown. One thing we DO know though, is the faster the better. And this is where the CDC and other officials are not driving a lot of confidence in the US population. The general population's cry of "we need to do something about this NOW" is well justified. Given that Ebola has been around for decades, there should be an established plan and some contingencies that have already been thought out. That this doesn't seem to be the case is not confidence inspiring.




The myth of R_0 and Ebola infectiousness

2014 October 17
by Daniel Lakeland

R_0 or (the basic reproduction number) is a parameter used in mathematical models of infection. It's in theory the time integrated average number of people who will be infected by each new case. An R_0 < 1 suggests the infection will die out, and greater than 1 suggests it will spread. But R_0 is a tricky thing to calculate. Wikipedia gives references to how it's calculated, and in fact it seems to be that these different methods of calculation give different results even with a given infection, and likely comparison across diseases is not indicative of something that can really be compared accurately.

But beyond the difficulty of actually calculating such a parameter, there's the uncertainty involved when an epidemic moves from one environment, where you've got a lot of data (say West African Ebola), to another environment which has very different social dynamics and where you have very little data (Say Ebola in International Airline Travel). Bayesian methods can be used to help give a sense of the uncertainty in the parameter once you've got enough cases to do calculations... But I'm going to hope we will have to rely primarily on prior data in the Ebola outbreak. Unfortunately, we are going to have to put a wide prior on R_0 in the global case, because we just don't know how highly mobile and interacting societies compare to West African villages in the spread of this disease.


Sand tapping experiments and MCMC

2014 September 26
by Daniel Lakeland

It's a well known phenomenon in granular materials that if you fill up a tube full of sand and then you tap the tube repeatedly, the sand will settle down to a certain stable height in the tube. Typically the variability between the "least dense" and "most dense" states is a few percent of the height. So for example you might start with 10cm of sand, tap it for a while and wind up with 9cm of sand. Note that it's also possible though difficult to get your sand into a state where it actually expands as you tap it, but generally doing so requires you to crush the sand into the tube initially, when poured into the tube the sand will generally be less than or about equal to equilibrium density.

During my PhD I spent a lot of time thinking about how to model this process. One of the key issues is that we have essentially no information about the sand. For example the position, orientation, shape, and material properties (elasticity, surface/friction properties, etc) of the individual grains. It's tempting to say that this is similar to the situation in the ideal gas where we have no idea where, how fast, or in what direction any of the atoms are. That's true, in so far as it goes. But whereas in the ideal gas we have no interactions between the gas molecules, in the static sand condition we have essentially nothing but interactions between the sand grains. At first glance it seems hopeless to predict what will happen when what will happen is caused by interactions, and we have virtually no information about those interactions.

However, it does also depend on what you want to predict, and for someone interested in say soil liquefaction, the main thing to predict is how some disturbance such as a shear wave will affect the density of the soil, and in particular when that soil is saturated with water.

So consider a sand tapping experiment. We have a short-ish column of sand at uniform porosity \phi (the fraction of the volume taken up by voids), and we tap this tube of sand with a blow from a hammer having kinetic energy dE which is small compared to the total gravitational potential of the deposit relative to the bottom of the tube (you won't be lifting the whole tube off the table and putting it into near earth orbit), but large compared to the gravitational potential of a single grain sitting at the top of the tube (you may very well bounce the grains sitting at the surface up a few millimeters), and given this energy, the sand grains bounce around a bit. Most of the sand grains will move not-very-far, you won't have a grain go from the bottom of the tube to the top for example. The average center-of-mass distance traveled is likely to be considerably less than a typical grain diameter. However, the orientations of the grains may change by larger fractions, it wouldn't be completely unheard of for a grain to rotate 180 degrees around some axis.

This tapping process is in many ways like the process of a random "proposal" in MCMC. It moves the grains to a nearby state, one in which the total energy is within about dE of the initial energy. It makes sense to ask the question: "Given that the final state is somewhere in a very high dimensional state space which has energy within about dE of the initial energy, what is the d\phi that we're likely to observe?"

It is, in general, hopeless to try to compute this from first principles for realistic sands, you might get somewhere doing it for idealized spherical beads or something like that, but it isn't hopeless to try to observe what actually happens for some sample of sand, and then describe some kind of predictive model. In particular it seems like what we'd want is a kind of transition kernel:

P(\phi_1 | \phi_0, dE)

at least for \phi_0,dE in a certain range.

So, while I didn't get around to doing it in my PhD dissertation, I may very well need to go out and buy a bag of sand, a clear plastic tube, some kind of small hammer, and a bit of other hardware and have a go at collecting some data and seeing what I get.