Luckily I keep good backups. But recently I started Firefox and it said something like "you haven't used Firefox in a while, would you like to start over with a new profile to take advantage of our new features?" Since I mostly use Chromium, I said "yes". Days later I tried to start Zotero and it had no data directory (because by default it uses the one in .firefox/<profile directory>/zotero).
Thanks to rdiff-backup I was able to recover my zotero directory and put it in .zotero/zotero, but this would have been a BIG deal, and Firefox should have done something like move the old directory to a backup location, not nuke it entirely.
To follow up on my discussion of the Ebola uncertainty, let's take a look at some very basic differential equations that we can use to get an idea of the factors that go into making up an epidemic.
First, we'll model a population as having infected $I$ and uninfected $U$. Let's also measure these populations as a fraction of the total population. So initially $I = \epsilon$ and $U = 1 - \epsilon$, and $\epsilon$ is small. Now, how does the infected population grow?
$$\frac{dI}{dt} = k I U$$
The assumption here is that in a short unit of time, each person comes into contact with a certain number of people, and for the initial stages at least, this drives the infection. Note that in later stages, the population will begin to be reduced as people die off, and there is more going on. We're interested mainly in the initial stages because we'd like to avoid a major epidemic killing off a few percent of the world's population.
Now, $I$ and $U$ are unitless (they are the ratios of counts of people), and $dt$ has units of time, so $k$ has units of "per time". It represents the rate at which infected people mix with uninfected people, times the fraction of these mixings which result in transmission. In theory, the fraction of mixing that results in transmission is the definition of $R_0$ from my previous post (EDIT: not quite, $R_0$ is actually the fraction of mixings that result in infection, times the average number of mixings throughout a total epidemic... but we could imagine that's constant...), so we can replace $k$ with $R_0 r$, where $r$ is the rate of mixing.
$I$ starts out near zero, and we're interested in how the infection grows. Hopefully we will do something to squash it before it reaches more than 0.005, or 1/2 a percent of the population, so we can assume initially that $U \approx 1$, that is, $dI/dt = R_0 r I$ for small $I$.
This is the equation for exponential growth. We can make it dimensionless by choosing $1/r$ to be the unit of time (writing $t' = r t$), and we get:
$$\frac{dI}{dt'} = R_0 I$$
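As a rough numerical check on the small-$I$ approximation, here is a minimal sketch comparing the time to reach a given infected fraction under pure exponential growth with the time under the full mixing model, assuming the uninfected fraction is simply $1 - I$ (no deaths or recoveries). The values of `r0`, `mixing_rate`, and `i0` are made-up illustrations, not estimates for Ebola.

```python
import math

def exponential_time_to_fraction(i0, target, r0, mixing_rate):
    """Time for I to grow from i0 to target under dI/dt = r0 * mixing_rate * I."""
    return math.log(target / i0) / (r0 * mixing_rate)

def mixing_time_to_fraction(i0, target, r0, mixing_rate, dt=1e-3):
    """Same question for dI/dt = r0 * mixing_rate * I * (1 - I), using Euler steps."""
    k = r0 * mixing_rate
    i, t = i0, 0.0
    while i < target:
        i += k * i * (1.0 - i) * dt
        t += dt
    return t

# Hypothetical illustration values (assumptions, not data)
i0, target = 1e-6, 0.005        # initial and "squash it before here" infected fractions
r0, mixing_rate = 2.0, 0.1      # mixing_rate is in units of "per day"

t_exp = exponential_time_to_fraction(i0, target, r0, mixing_rate)
t_mix = mixing_time_to_fraction(i0, target, r0, mixing_rate)
print(f"exponential approximation: {t_exp:.2f} days")
print(f"full mixing model:         {t_mix:.2f} days")
```

With numbers like these the two times should agree closely, which is the sense in which the exponential equation is adequate while $I$ is still small.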
So all epidemics are similar at some time scale, and controlled by $R_0$. This reassures only the naivest of mathematicians, because the assumption is only valid for $I \ll 1$. In a situation in which the mixing time is small, this could mean we have only, say, a few days before the infected fraction reaches $I = 0.02$, at which point we have a SERIOUS problem (2% of the population actively has Ebola, and that would be devastating). The point is, the equation has to change before $I$ gets too big in dimensionless time.
So $R_0$ is useful as an index of how infective the virus is, but NOT how quickly it will spread, since there is also the mixing time to be considered. In western countries we'd have to imagine that the mixing time could be much lower than in West Africa, and so an effective response would have to be much faster.
In addition, another dimensionless group is important, namely $r t_r$, where $t_r$ is the time it takes to effectively institute response measures and $1/r$ is the mixing time. The larger this is (the longer the response takes in dimensionless time), the bigger the problem will be.
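To make the role of this group concrete, here is the small-infection solution written out, under the assumption that growth stays exponential right up until the response takes effect. The symbols $\epsilon$ (initial infected fraction), $r$ (mixing rate, so $1/r$ is the mixing time) and $t_r$ (response time) are labels chosen for this illustration.

```latex
I(t) \approx \epsilon \, e^{R_0 r t}
\quad\Longrightarrow\quad
I(t_r) \approx \epsilon \, \exp\!\left( R_0 \, r \, t_r \right)
```

So, in this sketch, the infected fraction at the moment the response kicks in depends exponentially on the product of $R_0$ and the response time measured in mixing times.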
Fortunately, we also have maybe some suggestion that $R_0$ would be smaller in the US. In West Africa many tribal groups wash and prepare their dead, then kiss the bodies to say goodbye... not a good idea with Ebola. Also, there have been attacks on healthcare workers as ignorant people believe Ebola is either a hoax or spread by the government or whatever. Those things probably won't happen in the US.
All this is to say, there is a lot of uncertainty, with mixing time and infectivity both having different values in Western countries than in West Africa. So the actual number of days or weeks we will have to effectively respond, and change the equation of growth of the infected population, is unknown. One thing we DO know, though, is the faster the better. And this is where the CDC and other officials are not inspiring a lot of confidence in the US population. The general population's cry of "we need to do something about this NOW" is well justified. Given that Ebola has been around for decades, there should be an established plan and some contingencies that have already been thought out. That this doesn't seem to be the case is not confidence inspiring.
$R_0$ (the basic reproduction number) is a parameter used in mathematical models of infection. It's in theory the time integrated average number of people who will be infected by each new case. An $R_0 < 1$ suggests the infection will die out, and greater than 1 suggests it will spread. But $R_0$ is a tricky thing to calculate. Wikipedia gives references to how it's calculated, and in fact it seems that these different methods of calculation give different results even for a given infection, so comparing values across diseases is likely not an accurate comparison.
But beyond the difficulty of actually calculating such a parameter, there's the uncertainty involved when an epidemic moves from one environment, where you've got a lot of data (say West African Ebola), to another environment which has very different social dynamics and where you have very little data (say Ebola in international airline travel). Bayesian methods can be used to help give a sense of the uncertainty in the parameter once you've got enough cases to do calculations... But I'm going to hope we will have to rely primarily on prior data in the Ebola outbreak. Unfortunately, we are going to have to put a wide prior on $R_0$ in the global case, because we just don't know how highly mobile and interacting societies compare to West African villages in the spread of this disease.
It's a well known phenomenon in granular materials that if you fill up a tube full of sand and then you tap the tube repeatedly, the sand will settle down to a certain stable height in the tube. Typically the variability between the "least dense" and "most dense" states is a few percent of the height. So for example you might start with 10 cm of sand, tap it for a while and wind up with 9 cm of sand. Note that it's also possible, though difficult, to get your sand into a state where it actually expands as you tap it, but generally doing so requires you to crush the sand into the tube initially. When poured into the tube, the sand will generally be at less than or about equal to equilibrium density.
During my PhD I spent a lot of time thinking about how to model this process. One of the key issues is that we have essentially no information about the sand: for example, the position, orientation, shape, and material properties (elasticity, surface/friction properties, etc) of the individual grains. It's tempting to say that this is similar to the situation in the ideal gas where we have no idea where, how fast, or in what direction any of the atoms are. That's true, as far as it goes. But whereas in the ideal gas we have no interactions between the gas molecules, in the static sand condition we have essentially nothing but interactions between the sand grains. At first glance it seems hopeless to predict what will happen when what will happen is caused by interactions, and we have virtually no information about those interactions.
However, it does also depend on what you want to predict, and for someone interested in say soil liquefaction, the main thing to predict is how some disturbance such as a shear wave will affect the density of the soil, and in particular when that soil is saturated with water.
So consider a sand tapping experiment. We have a short-ish column of sand at uniform porosity $\phi$ (the fraction of the volume taken up by voids), and we tap this tube of sand with a blow from a hammer having kinetic energy $\Delta E$. This energy is small compared to the total gravitational potential of the deposit relative to the bottom of the tube (you won't be lifting the whole tube off the table and putting it into near earth orbit), but large compared to the gravitational potential of a single grain sitting at the top of the tube (you may very well bounce the grains sitting at the surface up a few millimeters). Given this energy, the sand grains bounce around a bit. Most of the sand grains will move not-very-far; you won't have a grain go from the bottom of the tube to the top, for example. The average center-of-mass distance traveled is likely to be considerably less than a typical grain diameter. However, the orientations of the grains may change by larger fractions; it wouldn't be completely unheard of for a grain to rotate 180 degrees around some axis.
This tapping process is in many ways like the process of a random "proposal" in MCMC. It moves the grains to a nearby state, one in which the total energy is within about $\Delta E$ of the initial energy. It makes sense to ask the question: "Given that the final state is somewhere in a very high dimensional state space which has energy within about $\Delta E$ of the initial energy, what is the change in porosity $\Delta\phi$ that we're likely to observe?"
It is, in general, hopeless to try to compute this from first principles for realistic sands. You might get somewhere doing it for idealized spherical beads or something like that. But it isn't hopeless to try to observe what actually happens for some sample of sand, and then describe some kind of predictive model. In particular it seems like what we'd want is a kind of transition kernel:
$$p(\Delta\phi \mid \phi, \Delta E)$$
at least for $\Delta E$ in a certain range.
So, while I didn't get around to doing it in my PhD dissertation, I may very well need to go out and buy a bag of sand, a clear plastic tube, some kind of small hammer, and a bit of other hardware and have a go at collecting some data and seeing what I get.
I've been sick a lot recently, in part thanks to having small children. In any case, one thing I've been doing is revisiting Chess. I honestly am pretty clumsy at Chess but it's one of those things I always felt I should probably do. When I was younger most of my friends played stronger games than me, and it was hard to enjoy when you were getting beaten all the time. Now, thanks to Moore's law and clever programming, even the very very very top players are useless against a 4 core laptop computer running Stockfish.
So we can all agree now that it's no fun getting blasted out of the water every time, but also we can use computers to make things better and more interesting for humans, since that's what they're for right?
There are lots of proposals for randomized or alternative starting position Chess games. For example Chess 960 (Fischer random chess) is a variant with 960 possible starting positions. The idea is to avoid making Chess a game where a big advantage comes from memorizing opening moves in some opening database. I'm more or less for this in my play. I enjoy playing Chess well enough, but I have absolutely NO interest in poring over variation after variation in a big book of opening theory. I think some people like this stuff, so for them, they can of course continue to play regular chess.
On the other hand, for people like me, consider the following method of starting the game:
1. Set up the board in standard starting position.
2. Using a computer, play $N$ random legal pairs of moves (turns), possibly with or without captures.
3. Using a chess program on the computer, find the "score" for this position.
4. Accept the position if the chess program decides that the score is within $\epsilon$ of 0 (where positive is good for white, negative is good for black; this is standard output from chess engines), otherwise go to step 1.
5. Assign the color randomly to the human (or to one of the humans if you're not playing against a computer).
6. Start the game by allowing white to move.
Note, this variation can also be used to handicap games by accepting a starting position if its score is within $\epsilon$ of some handicap value $h$, and then assigning the weaker player to the color that has the advantage. It's also possible to play $N$ random moves and then allow the computer to play further moves until the score evens out properly, if you can get support from the chess engine. Finally, it's also possible to search a large database of games for one in which after $N$ moves the position evaluates to within $\epsilon$ of the appropriate handicap value, rather than generating random moves.
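The procedure above is a rejection sampler, and a minimal sketch of it might look like the following. The function name `random_start` and its three callables (`legal_moves`, `apply_move`, `evaluate`) are hypothetical placeholders; in practice they would be thin wrappers around a chess library and an engine such as Stockfish. The toy "game" at the bottom is only there to exercise the loop.

```python
import random

def random_start(initial, legal_moves, apply_move, evaluate,
                 n_turns, eps, target=0.0, rng=None, max_tries=10_000):
    """Rejection-sample a start position whose score is within eps of target.

    legal_moves(pos) -> list of moves, apply_move(pos, move) -> new position,
    evaluate(pos) -> score (positive favors white). All three are supplied by
    the caller. target=0.0 gives an even game; a nonzero target is a handicap.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        pos = initial                      # step 1: standard starting position
        ended_early = False
        for _ in range(2 * n_turns):       # step 2: one turn = white move + black move
            moves = legal_moves(pos)
            if not moves:                  # mate or stalemate reached: discard this try
                ended_early = True
                break
            pos = apply_move(pos, rng.choice(moves))
        # steps 3 and 4: score the position and accept only if it is close to target
        if not ended_early and abs(evaluate(pos) - target) <= eps:
            human_color = rng.choice(["white", "black"])   # step 5
            return pos, human_color
    raise RuntimeError("no acceptable position found")

# Toy stand-in for a chess engine: a "position" is an integer,
# every move adds +1 or -1, and the "score" is the integer itself.
pos, color = random_start(
    initial=0,
    legal_moves=lambda p: [1, -1],
    apply_move=lambda p, m: p + m,
    evaluate=lambda p: float(p),
    n_turns=4, eps=0.5, rng=random.Random(0),
)
print(pos, color)
```

For the handicap variant, one would pass a nonzero `target` and replace the random color assignment with giving the weaker player the favored color.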
I suspect to would be the appropriate number of moves to use.
Now, who will implement this in SCID or the like?
Frequentist statistics often relies on p values as summaries of whether a particular dataset implies an important property about a population (often that the average is different from 0).
In a comment thread on Gelman's blog (complete with a little controversy) I discussed some of the realistic problems with that, which I'll repeat and elaborate here:
When we do some study in which we collect data and then calculate a $p$ value to see if it has some particular property, we calculate the following:
$$p_i = 1 - F_i(s_i(D))$$
Where $F_i$ is a functional form for a cumulative distribution function, and $s_i(D)$ are sample statistics of the data $D$.
A typical case might be $s_i(D) = \bar{x}\sqrt{n}/s$, where $\bar{x}$ is the sample average of the data and $s$ is the sample standard deviation, $n$ is the number of data points, and $F_i$ is the standard t distribution CDF with $n-1$ degrees of freedom.
The basic idea is this: you have a finite population of things, you can sample those things, and measure them to get values $D$. You do that for some particular sample, and then want to know whether future samples will have similar outcomes. In order for the $p$ value to be a meaningful way to think about those future samples you need:
- Representativeness of the sample. If your sample covers a small range of the population's total variability, then obviously future samples will not necessarily look like your current sample.
- Stability of the measurements in time. If the population's values are changing on the timescale between now and the next time you have a sample, then the $p$ value is meaningless for the future sample.
- Knowledge of a good functional form for the distribution function $F$. When we can rely on things like central limit theorems, and certain summary statistics therefore have sampling distributions that are somewhat independent of the underlying population distribution, we will get a more robust and reliable summary from our $p$ values. This is one reason why the t-test is so popular.
- Belief that there is only one, or at least a small number of possible analyses that could have been done, and that the choice of sample statistics and functional form are not influenced by information about the data. The $p$ value for analysis $i$, call it $p_i$, represents in essence one member of a population of possible $p$ values from analyses indexed by $i$. When there are a wide variety of possible values for $i$, the fact that one particular $p$ value was reported with "statistical significance" only indicates to the reader that it was possible to find a given $i$ that gave the required small $p_i$.
The "Garden of Forking Paths" that Gelman has been discussing is really about the size of the set of possible analyses $i$, independent of the number of $i$ values that the researcher actually looked at. It's also about the fact that having seen your data, it is plausibly easier to choose a given analysis which produces small $p$ values, even without looking at a large number of $i$ values, when there is a large plausible set of potential $i$.
Gelman has commented on all of these, but there's been a fair amount of hoo-ha about his "Forking Paths" argument. I think the symbolification of it here makes things a little clearer: if there are a huge number of analyses $i$ which could plausibly have been accepted by the reader, and the particular value chosen (the analysis) was not pre-registered, then there is no way to know whether the reported $p$ value is a meaningful summary about future samples representative of the whole population of things.
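A small simulation can illustrate why the size of the set of possible analyses matters. This is a sketch under loud assumptions: the data are pure noise with true mean 0, each "analysis" is a hypothetical choice of a random half of the data as a subgroup, and the $p$ value uses a normal CDF with known unit variance (a z-test swapped in for the t-test, so that only the standard library is needed).

```python
import random
from statistics import NormalDist, fmean

def p_value(sample):
    """One-sided p = 1 - F(s(D)), with s(D) = mean * sqrt(n) and F the standard
    normal CDF (a z-test assuming known unit variance, not a t-test)."""
    z = fmean(sample) * len(sample) ** 0.5
    return 1.0 - NormalDist().cdf(z)

def simulate(n_datasets=2000, n_points=40, n_analyses=20, seed=1):
    rng = random.Random(seed)
    fixed_hits = 0   # the single pre-chosen analysis (i = 0) came out "significant"
    any_hits = 0     # at least one of the available analyses came out "significant"
    for _ in range(n_datasets):
        # Null data: there is no real effect to find
        data = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        # Each analysis i examines a different random half of the data
        p_values = [p_value(rng.sample(data, n_points // 2))
                    for _ in range(n_analyses)]
        fixed_hits += int(p_values[0] < 0.05)
        any_hits += int(min(p_values) < 0.05)
    return fixed_hits / n_datasets, any_hits / n_datasets

fixed_rate, any_rate = simulate()
print(f"pre-chosen analysis significant:     {fixed_rate:.3f}")
print(f"some available analysis significant: {any_rate:.3f}")
```

The pre-chosen analysis should come out "significant" at roughly the nominal 5% rate, while the chance that some analysis in the set does so is considerably larger, even though nothing is going on in the data.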
What problems are solved by a Bayesian viewpoint?
Representativeness of the sample is still important, but if we have knowledge of the data collection process, and background knowledge about the general population, we can build in that knowledge to our choice of data model and prior. We can, at least partially, account for our uncertainty in representativeness.
Stability in time: A Bayesian analysis can give us reasonable estimates of model parameters for a model of the population at the given point in time, and can use probability to do this, even though there is no possibility to go back in time and make repeated measurements at the same time point. Frequentist sampling theory often confuses things by implicitly assuming time-independent values, though I should mention it is possible to explicitly include time in frequentist analyses.
Knowledge of a good functional form: Bayesian analysis does not rely on the concept of repeated sampling for its conception of a distribution. A Bayesian data distribution does not need to reproduce the actual unobserved histogram of values "out there" in the world in order to be accurate. What it does need to do is encode true facts about the world which make it sensitive to the questions of interest. See my example problem on orange juice for instance.
Possible Alternative Analysis: In general, Bayesian analyses are rarely summarized by $p$ values, so the idea that the $p$ values themselves are random variables and we have a lot to choose from is less relevant. Furthermore, Bayesian analysis is always explicitly conditional on the model, and the model is generally something with some scientific content. One of the huge advantages of Bayesian models is that they leave the description of the data to the modeler in a very general way. So a Bayesian model essentially says: "if you believe my model for how data arises, then these are the parameter values that are reasonable". Most Frequentist results can be summarized by "if you believe the data arise by some kind of simple boring process, then you would be surprised to see my data". That's not at all the same thing!
Boobies. There I had to say it. This is a post about boobies, and math, and consulting with experts before making too many claims.
In this click bait article that I found somehow searching on Google News for unrelated topics, I see that some "Medical Anthropologists" are claiming that Bras seem to cause breast cancer (not a new claim, their book came out in 1995, but their push against the scientific establishment is reignited I guess). At least part of this conclusion seems to be based on the observation from their PDF
Dressed To Kill described our 1991-93 Bra and Breast Cancer Study, examining the bra wearing habits and attitudes of about 4,700 American women, nearly half of whom had had breast cancer. The study results showed that wearing a bra over 12 hours daily dramatically increases breast cancer incidence. Bra-free women were shown to have about the same incidence of breast cancer as men, while those who wear a bra 18-24 hours daily have over 100 times greater incidence of breast cancer than do bra-free women. This link was 3-4 times greater than that between cigarettes and lung cancer!
They further claim "bras are the leading cause of breast cancer."
That's pretty shocking data! I mean really? Now, according to http://seer.cancer.gov/statfacts/html/breast.html there are about 2 million women in the US living with breast cancer, and 12% overall will be diagnosed throughout their lives. There are around 150M women in the US overall. So $P(\text{BC}) \approx 2/150 \approx 0.013$.
However, in our sample $P(\text{BC}) \approx 0.5$. That's 50 times the background rate (ok, 37.5 if you do the math precisely).
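The arithmetic behind that comparison, as a quick sketch. It uses only the round figures quoted above, and takes "nearly half" of the study sample to mean 0.5, which is an approximation.

```python
# Figures quoted in the text: ~2 million women living with breast cancer,
# ~150 million women in the US overall.
living_with_bc = 2e6
women_in_us = 150e6
background_rate = living_with_bc / women_in_us   # P(BC) in the general population

# "Nearly half" of the study sample had had breast cancer; 0.5 is a round figure.
sample_rate = 0.5

ratio = sample_rate / background_rate
print(f"background P(BC) = {background_rate:.4f}")   # 0.0133
print(f"over-representation factor = {ratio:.1f}")   # 37.5
```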
Doesn't it maybe seem plausible that in winnowing through the 1% of women who are living with breast cancer, or even the 5 or 6 percent who have been diagnosed in the past but are still alive (figure half of the women alive today who will at some point be diagnosed have already been diagnosed), that maybe, just maybe, they could have introduced a bias in whether or not their sample wears bras?
So "looking for cancer patients causes us to find bra wearing women" is actually maybe the more likely story here? Perhaps "cancer patients who were non bra wearers were overwhelmingly more likely to have died from their breast cancer, and so we couldn't find any of them?" That's somehow not as reassuring to the non-bra-wearers in the audience I think.
Symbolically: pretend BC and Bra are independent. We would then conclude that not wearing a bra reduces your chance of surviving by a factor of 10 or so if $P(\text{Bra}) \sim 0.9$. Put on those bras, ladies! The exact opposite of their conclusion!
I personally suspect something else spurious in their research. But nothing in their PDF convinces me that they know what they are doing.
Note that wikipedia has some discussion of their book.