Regression Discontinuity fails again

2020 July 5
by Daniel Lakeland

Regression Discontinuity analysis is not a failure as an idea; it’s just a failure as a practical way to learn about the world in most cases. The problem is that most situations in which it’s being applied are noisy human / social studies where the effects are much smaller than the level of noise.

Andrew Gelman picks apart another one of these here.

I’ve been teaching myself to use Julia for all my future data analysis projects. It’s just a fabulous language. So here’s the graphs I came up with:

Years lived post election against percentage point margin.
With LOESS fits using bandwidths of 0.2, 0.1, and 0.05

What this shows is … NOTHING. As you decrease the bandwidth you can detect more rapid variation in the function value at the expense of more noise in the estimate. By the time you’re down to bandwidth 0.05 you’re using only 5% of the data to fit any given location in the fit. Right at margin = 0 you can see using the orange or red curves that the estimates are extremely noisy, and certainly nowhere near a 5-10 year bump in longevity moving from left of 0 to right of 0.
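To make the bandwidth tradeoff concrete, here is a minimal sketch of a LOESS-style local linear fit in plain Python. It is not the Julia code behind the graphs above: the function name `loess_at`, the tricube weights, and the synthetic pure-noise data are all assumptions made for illustration. `span` plays the role of the bandwidth, i.e. the fraction of the points used for each local fit.

```python
import random
import statistics

def loess_at(x0, xs, ys, span):
    """Local linear fit at x0 using the nearest `span` fraction of the points."""
    k = max(2, round(span * len(xs)))
    # Keep the k points closest to x0.
    nearest = sorted((abs(x - x0), x, y) for x, y in zip(xs, ys))[:k]
    dmax = nearest[-1][0] * 1.001 or 1.0
    # Tricube weights: near 1 close to x0, falling toward 0 at the window edge.
    w = [(1 - (d / dmax) ** 3) ** 3 for d, _, _ in nearest]
    sw = sum(w)
    mx = sum(wi * x for wi, (_, x, _) in zip(w, nearest)) / sw
    my = sum(wi * y for wi, (_, _, y) in zip(w, nearest)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, (_, x, _) in zip(w, nearest))
    sxy = sum(wi * (x - mx) * (y - my) for wi, (_, x, y) in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    # Weighted least-squares line evaluated at x0.
    return my + slope * (x0 - mx)

# Synthetic stand-in data (assumed): pure noise with no signal at all.
random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [random.gauss(0, 10) for _ in xs]
grid = [i / 20 - 1 for i in range(41)]

for span in (0.2, 0.1, 0.05):
    fits = [loess_at(g, xs, ys, span) for g in grid]
    # Spread of the fitted curve around the true value (zero everywhere).
    print(span, round(statistics.pstdev(fits), 2))
```

On pure noise like this, the fitted curve should typically wander more as `span` shrinks, since each local fit sees fewer points.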

Science is broken. Here is Erik defending his study.

As we can clearly see in the raw data, there is no discernible signal, none. So whatever signal is supposedly there, if it’s there it just happens to be *exactly* hidden by the offsetting effect of whatever covariates he’s adjusting for. It just so happens basically that people who won elections didn’t live longer on average than people who lost elections, but if they hadn’t won those elections we have somehow strong evidence that they would have died earlier because they were all sicker people than the losers which we can determine from their covariates…

Whatever. Here’s what it looks like when you have an actual signal… Here I’ve used the same x coords, generated random Normal(0,s) noise for the y coordinate, and then added a signal to it… Three different kinds of signals. The first graph has a step function that steps up by 5 units as we pass x = 0. The second one adds a little wavelet that decreases right before zero and increases right after zero (the negative of the derivative of a Gaussian), and the last one is the same as the second one, except confined to x > 0.
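The three signals can be sketched roughly as follows. This is an illustration, not the code behind the graphs: the wavelet width and amplitude, the noise scale, and the uniform stand-in x values are assumptions, and the wavelet is shown on its own rather than on top of the step.

```python
import math
import random

def step(x, height=5.0):
    """Step function: 0 to the left of zero, `height` to the right."""
    return height if x > 0 else 0.0

def wavelet(x, width=0.1, amp=5.0):
    """Negative derivative of a Gaussian, scaled to peak at +/- amp.

    It dips just left of zero and rises just right of zero.
    """
    return amp * (x / width) * math.exp(0.5 - x ** 2 / (2 * width ** 2))

def half_wavelet(x, width=0.1, amp=5.0):
    """Same wavelet, but confined to x > 0."""
    return wavelet(x, width, amp) if x > 0 else 0.0

# Stand-in x coordinates and noise scale (both assumed).
random.seed(2)
xs = [random.uniform(-1, 1) for _ in range(500)]
noise_sd = 10.0

datasets = {
    name: [f(x) + random.gauss(0, noise_sd) for x in xs]
    for name, f in [("step", step), ("wavelet", wavelet), ("half", half_wavelet)]
}
for name, ys in datasets.items():
    print(name, len(ys))
```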

Updated state by state graphs with bug fix

2020 May 17
by Daniel Lakeland

I discovered that the method I was using to smooth the cases per day had a bug.

The shapes of the case-per-day function were right, but the overall scale was reduced. Basically what I was doing was convolving by the derivative of a smoothing kernel… But when you calculate the derivative, it’s (f(x+dx)-f(x-dx))/(2dx) that you’re trying to calculate, so when you’re averaging across multiple sizes of dx you need to take that into account… fixed. Now the grey points are the raw data, the black line is the short-term smoothed data, and the blue line is the ggplot smoother.
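A minimal sketch of the scaling issue, assuming a simple unweighted average of central differences (the actual smoothing kernel isn't shown in this post, and `smoothed_rate` is a made-up name): each difference has to be divided by its own 2*dx before averaging.

```python
def smoothed_rate(cum, i, max_dx=3):
    """Average the central-difference slopes at index i over dx = 1..max_dx."""
    slopes = []
    for dx in range(1, max_dx + 1):
        if i - dx >= 0 and i + dx < len(cum):
            # Each span gets its own 2*dx denominator.
            slopes.append((cum[i + dx] - cum[i - dx]) / (2 * dx))
    return sum(slopes) / len(slopes) if slopes else None

# Cumulative cases growing by exactly 7 per day (made-up data).
cum = [7 * t for t in range(20)]
print(smoothed_rate(cum, 10))
```

With a steady 7 new cases per day, every dx gives a slope of 7, so the average is 7 as well. Using one shared denominator for all the dx values would keep the shape of the curve but get its scale wrong.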

I wasn’t using this method for doing deaths per day, though maybe I should be now… in any case here’s the current versions.

Updated state by state graphs

2020 May 7
by Daniel Lakeland

Here’s the current status…

State by state graphs of COVID-19 data (from

2020 April 25
by Daniel Lakeland

I’ve got a script that grabs data from and generates several pdfs that give an overview of the pandemic situation one graph per state… I’ll try to update the graphs about weekly. But here are the ones as of today.

Cryptographically Distributed COVID contact tracing through WiFi ad-hoc networking

2020 April 10
by Daniel Lakeland

This is a quick note to try to sketch out an idea that I thought up about how to have people cooperatively determine if they have come in contact with a COVID patient. Here’s the basic idea.

Every Android or iPhone generates a random UUID. Then, when walking around, the phones periodically beacon out a peer-to-peer SSID on their WiFi radios called something like COVID-CONTACT-{UUID}. Everyone’s phone scans the surroundings for stations, and records the UUID of whatever stations it hears.

Now… at the end of the day, each phone uploads a one-way cryptographic hash of its own UUID, and hashes of the UUIDs that it contacted today.

If a person tests COVID positive, they upload a record of their cryptographic hash, and the fact that they tested positive.

Now, every day you look at all the contacts you’ve contacted in the last ~ 10 days, you hash those, and you see if any of those hashes report being COVID positive. Also, you look up all the COVID positives in the last 10 days, and you see if they report having contacted YOUR hash…
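Here is a rough sketch of the two lookups just described. Everything specific in it is an assumption for illustration: SHA-256 as the one-way hash, the in-memory `Server` class standing in for wherever the hashes get uploaded, and the names `upload_day`, `report_positive`, and `exposed`. The 10-day window is left out to keep it short.

```python
import hashlib
import uuid

def h(u):
    # SHA-256 is an assumption; the idea only needs some one-way hash.
    return hashlib.sha256(u.bytes).hexdigest()

class Server:
    """Hypothetical public bulletin board: it only ever sees hashes."""
    def __init__(self):
        self.positives = set()   # hashes of people who reported positive
        self.contacts = {}       # hash(own UUID) -> hashes of UUIDs heard

    def upload_day(self, own, heard):
        self.contacts.setdefault(h(own), set()).update(h(u) for u in heard)

    def report_positive(self, own):
        self.positives.add(h(own))

def exposed(server, own, heard):
    # Check 1: did I hear anyone who later reported positive?
    i_heard_them = any(h(u) in server.positives for u in heard)
    # Check 2: did any positive person report hearing me?
    they_heard_me = any(h(own) in server.contacts.get(p, set())
                        for p in server.positives)
    return i_heard_them or they_heard_me

alice, bob, carol, dave = (uuid.uuid4() for _ in range(4))
# Raw UUIDs each phone heard; these stay on the phone.
heard = {alice: {bob}, bob: {dave}, carol: set(), dave: set()}

server = Server()
for phone, others in heard.items():
    server.upload_day(phone, others)
server.report_positive(bob)

for name, phone in [("alice", alice), ("carol", carol), ("dave", dave)]:
    print(name, exposed(server, phone, heard[phone]))
```

In this toy run Alice is flagged because she heard Bob, Dave is flagged because Bob heard him, and Carol is not flagged.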

Now, there’s probably some subtlety to this which requires working out by people who are more crypto nerdy than I am, but the dataset is such that you can’t determine whether A contacted B unless you know the UUID of *both* parties. Since the UUID itself is stored internal to the app and never sent to anyone else, basically the UUID itself is a secret, and it’s only possible to determine if A contacted B or vice versa if you are in fact either A or B.

Of course, you could just try EVERY UUID that’s possible… Good luck with that, since there are ~ 2^128 = 340282366920938463463374607431768211456 of them. If you tried 1 Million per second, it’d take 10^25 years to try them all.
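The brute-force arithmetic is easy to check. The only assumption here is a 365.25-day year.

```python
n_uuids = 2 ** 128                      # size of the UUID space used above
rate_per_second = 1_000_000             # guesses per second
seconds_per_year = 365.25 * 24 * 3600   # assumed year length

years = n_uuids / rate_per_second / seconds_per_year
print(n_uuids)
print(f"{years:.2e} years")
```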

So, is this a viable non-invasive contact tracing strategy? What am I missing?

Grocery handling, good bad or ugly?

2020 April 1
by Daniel Lakeland

Apparently this guy’s video is controversial:

I’m going to come right out and say this is a great video; it shows people how to handle objects in a way that minimizes transmission of virus from surfaces. Apparently the controversial part, though, is where he dumps his oranges in soapy water? Are you kidding me? Everyone should be washing their produce at all times, people! Have you ever heard of E. coli?

A frequently heard thing in the “anti” group is something along the lines of “there is zero evidence that xyz”, such as “there is zero evidence that food packaging is a significant source of infection” or “there is zero evidence that washing your food in soapy water is good for you” or whatever. This is typical “Null Hypothesis Significance Testing” type logic… Until we have collected a bunch of data rejecting the “null hypothesis” that “everything is just fine” then we should just “act as if everything is just fine”. Another way to put this is “until enough people have died, you shouldn’t take precautions to protect yourself”. Put that way it’s clearly UTTERLY irresponsible to “debunk” this video using that logic.

What we KNOW is that viruses are particles, essentially complex chemicals, which sit in droplets, which can be viable after floating in the air for 3 hours, which can settle out onto cardboard and be viable for 24 hours, and which can be viable for 3 days on plastic and steel. Guess what your groceries come in? Plastic bags, cardboard boxes, steel cans, plastic jars…

The assay used in the NIH study that established those timelines was to actually elute (wash) the virus off the surface and then infect cells in a dish with it and see how many were infected. It wasn’t just detecting the virus was there, but actually showing that it was active and viable.

So, there’s your evidence. There is *direct* laboratory evidence that the virus *can* be transmitted off the surfaces into cells and infect them.

Whether this is a significant source of infection or not is more or less irrelevant. How do you make a decision as to whether you should spend ~ 1hr every 2 weeks cleaning all your groceries?

Here’s the Bayesian Decision Theory:

Suppose two actions are possible: 1) do nothing, or 2) handle your groceries carefully and wash your fruits and vegetables in dish-soapy water

Costs of (1): probability p0 of getting infected from contaminated surface. We don’t know what p0 is, but leave it as a symbolic quantity for the moment. Let’s just use 0.5% chance of dying if you’re infected as the dominant problem, and a “statistical value of a life” as on the order of 10M dollars… so p0*.005*10000000 = 50000*p0

Cost of (2): the probability of getting infected from a contaminated surface is reduced to perhaps p0/100000, with the same 0.5% chance of dying if you’re infected, plus 1 hr of cleaning time. So the cost is 0.5*p0 + w*1, where w is an “hourly wage”. Suppose you are willing to work for a median-type wage, 50k/yr. At about 2000 working hours per year this is $25/hr. So, what does the probability p0 need to be to “break even”? Ignoring the negligible 0.5*p0 term, we have 50000*p0 = 25, so p0 = .0005. If you think there’s something like a .0005 chance you could transmit virus from your grocery items to your face by “doing nothing”, then YOU SHOULD BE CAREFUL and wash your items. For me, I’ll spend some time quarantining my groceries and washing my produce… I also find it keeps the produce from spoiling, so it lasts longer in storage, and that should go into the “plus” side as well.
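The same break-even calculation as a small sketch, using only the numbers above. The constant and function names are made up for illustration.

```python
VALUE_OF_LIFE = 10_000_000    # dollars, "statistical value of a life"
P_DEATH_IF_INFECTED = 0.005   # 0.5% chance of dying if infected
RISK_REDUCTION = 100_000      # careful handling cuts p0 by this factor
HOURLY_WAGE = 25.0            # dollars per hour (50k/yr over ~2000 hours)
HOURS_CLEANING = 1.0

def cost_do_nothing(p0):
    return p0 * P_DEATH_IF_INFECTED * VALUE_OF_LIFE

def cost_careful(p0):
    infection = (p0 / RISK_REDUCTION) * P_DEATH_IF_INFECTED * VALUE_OF_LIFE
    return infection + HOURLY_WAGE * HOURS_CLEANING

def break_even_p0():
    # Solve cost_do_nothing(p0) == cost_careful(p0) for p0,
    # keeping the small 0.5*p0 term instead of dropping it.
    per_unit = P_DEATH_IF_INFECTED * VALUE_OF_LIFE
    return HOURLY_WAGE * HOURS_CLEANING / (per_unit * (1 - 1 / RISK_REDUCTION))

print(break_even_p0())
for p0 in (0.0001, 0.0005, 0.001):
    print(p0, cost_do_nothing(p0), cost_careful(p0))
```

Keeping the 0.5*p0 term barely moves the answer, which is why it’s safe to ignore it.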

As to what to wash your produce with: I’m using sudsy water from dye and fragrance free dish soap (main ingredients: Water, Sodium Lauryl Sulfate…). I’m washing my fruit and veg, and then rinsing it thoroughly. The quantity of soap I’m ingesting is substantially the same as if I hand washed a glass, rinsed it, and then filled it with water and drank it… It’s substantially less than you get from brushing your teeth with a typical toothpaste. If you are afraid of washing your dishes with soap, or of brushing your teeth, then by all means don’t wash your fruit with soap either… For the rest of us, do a good job rinsing just like you’d rinse your glasses or bowls before putting food in them.

Confusion about coronavirus testing and the role of testing capacity

2020 March 30
by Daniel Lakeland

Here’s some code to simulate a process whereby we saturate testing capacity… First the graphs:

Confirmed cases (blue) follows the real cases (red) so long as the cases per day are below the maximum… once we saturate, the green line increases linearly, and so does the blue line…
Green line (tests) parallels the blue (positive tests), as we saturate

library(ggplot2)

## cumulative real cases grow exponentially over 40 days
t = seq(1,40)
realcases = 100*exp(t/4)
## new real cases per day
realincrement = diff(c(0,realcases))

## about 4 people (sd 0.25) seek a test for every new real case
testseekers = rnorm(NROW(realincrement),4,.25)*realincrement

maxtests = 20000

## now assume that you test *up to* 20k people. if more people are
## seeking tests, you test a random subset of the seekers
## getting a binomial count of positives for the given frequency

ntests = rep(0,NROW(t))
ntests[1] = 100
confinc = rep(0,NROW(t))
confinc[1] = 100
for(i in 2:NROW(t)){
    if(testseekers[i] < maxtests){
        ## below capacity: every seeker is tested, every real case is confirmed
        confinc[i] = realincrement[i]
        ntests[i] = testseekers[i]
    } else {
        ## at capacity: positives are a binomial draw from maxtests tests,
        ## with probability equal to the fraction of seekers who are real cases
        confinc[i] = min(realincrement[i],rbinom(1,maxtests,realincrement[i]/testseekers[i]))
        ntests[i] = maxtests
    }
}

cumconf = cumsum(confinc)
cumtests = cumsum(ntests)

sim = data.frame(t=t,conf=cumconf,nt=cumtests,real=realcases)

## linear scale
ggplot(sim) + geom_line(aes(t,conf),color="blue") + geom_line(aes(t,nt),color="green") + geom_line(aes(t,real),color="red") + coord_cartesian(xlim=c(0,35),ylim=c(0,400000))

## log scale
ggplot(sim) + geom_line(aes(t,log(conf)),color="blue") + geom_line(aes(t,log(nt)),color="green") + geom_line(aes(t,log(real)),color="red") + coord_cartesian(xlim=c(0,30),ylim=c(0,log(400000)))

The longer term outlook…

2020 March 10
by Daniel Lakeland

Coming out the other end of this whole COVID-19 thing… how do we do a good job of sustaining social distancing, and then returning sanely to productivity? The “flatten the curve” idea extends the amount of time one needs to be in “lockdown” but ultimately reduces deaths and severe morbidity… That’s good, but it starts to run into the “how long can we hole up?” question. If things go crazy through the roof, like in China, the duration is shorter. Data here shows from “oh shit” to relatively small per day caseload was about 20 days in China.
That’s a bad thing, because that represents the really “peaked” shape that overwhelms healthcare facilities. Many people died who otherwise might not have…
But if we make that slower, then the peak also occurs later and the duration is longer; we might need, say, 80 days of rather intense social distancing to make that happen. If we figure lockdowns are going to start now and build up through the next 10 days (it’s already something WaPo, The Atlantic, etc. are saying)… And then we need 80 days after that… you’re talking 90 days, which is 3 months, and puts us starting to return to work around June 1.

Now let’s talk food supply. Unlike in China, this virus is spreading country-wide. It’s not contained to a particular place. So mobilizing the national guard to bring food from the midwest to WA because people in the midwest are ok… is not a possibility. How do we feed our country for 80 days without people having to be in contact with each other? We need food delivery systems.

Fortunately, as people get the virus and then recover, they should be immune for at least some period of time. Recovery to the point that they’re not shedding the virus is however probably 30 days? Just a guess, we’ll have to see with serology and PCR combo tests (to test that someone had the virus at some point, and doesn’t shed it now).

This doesn’t help us a lot. We have to do 90 days of relative isolation, and during the first 30 days people are getting the thing and then over the next 30 days those early people are recovering… by the time we hit 90 days, if you haven’t gotten it, you’re running pretty lean on food and things even if you’re well stocked now (and most people really aren’t). Obviously we’ll need to distribute food throughout the 90 days. This is going to require coordination from govt I believe, otherwise we’ll have sick people out there handling food… not good.

Everything you need to know about what to do about Coronavirus

2020 March 9
by Daniel Lakeland

You need to stop interacting with people. And I’m not joking about this.

Here are the facts out of Italy: about 10% of tested positive cases require ICU ventilation. The death rate for people under age 65 is probably only ~ 1% **if you get the ventilators to the 10% needing ventilation**… If you overwhelm the hospitals, the death rate will go to ~10%, which is on the order of 10x as bad as pandemic influenza in 1918.

The current trending idea is #flattenthecurve to describe to people HOW IMPORTANT it is to start *NOW* avoiding the spread of the disease. This avoidance of overloading the infrastructure is a core idea in Civil Engineering (my PhD is in CE).

Reducing the spread of the disease is not important just because fewer people will eventually get it (though that is probably true) but because the peak number of people who need ventilators and other intensive type care will be lower, so that fatality rates can stay low. If all the ICU beds are full, and 300 patients show up needing ICU today… all 300 patients will die. Since 10% of cases may need ventilators, it’s a serious situation.

Does social distancing, closing schools, etc work? Evidence out of 1918 says HELL YES: Unfortunately servers are getting swamped, so the best way for me to link you to this info is via twitter, who will probably stand up to the pounding.

So, what do you need to do? TODAY make plans to not be at work by the end of the week. Why? Because the virus is doubling the number of symptomatic verified cases outside China about every 2-4 days, let’s call it 3 days. And, btw it takes 5 days to onset of symptoms and for many people ~ 10 or 15 days before they say “hey I need to go to the hospital” (though for the elderly… it can be like 1hr after onset of fever). So, whatever’s going on in a hospital near you… it’s maybe what was the case 3 or 4 doubling periods ago, so today it’s on the order of ~ 10x worse than that. 10 days from now, it will be 100x worse already, but that will show up at the hospital about 20 days from now.
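The doubling arithmetic behind those “10x” and “100x” figures, as a quick sketch. The 3-day doubling time is the round number from above, and `growth_factor` is just an illustrative helper.

```python
def growth_factor(days, doubling_time=3.0):
    """How many times larger the case count is after `days` of steady doubling."""
    return 2 ** (days / doubling_time)

# 3 or 4 doubling periods of lag between infection and the hospital:
print(growth_factor(9), growth_factor(12))
# 10 days is roughly 10x, and 20 days roughly 100x:
print(round(growth_factor(10), 1), round(growth_factor(20), 1))
```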

Early, proactive and significant reduction in interaction with other people WORKS and is one of the only things we can do. So we WILL be doing it. If we wait, we’ll be doing it AND have a massive tragedy. If we start now, we’ll be doing it but have less of a massive tragedy. The boulder is rolling down the hill, we can start walking off the path now, or get hit.

Back of the Envelope Cost-Benefit on pulling your kids from school

2020 March 4
by Daniel Lakeland

It is clear that COVID is spreading in communities in Northern California, and Washington. The time until it is confirmed to be spreading in SoCal is probably a few days. It will always be confirmed *after the fact*, which means it is probably spreading in the SoCal community at the moment, though in the early stages. Outside China cases are increasing exponentially with a doubling time of about 5 days +- which you can read off the graph at several web-sites such as the linked map site (click logarithmic graph on the lower right graph, read off the yellow dots for outside China spread).

I personally view it as inevitable that PUSD will decide to close schools. I don’t know what their timeline will be, but as these are typically committee decisions and there is risk either way (too early vs too late) I expect them to be delayed until the choice becomes obvious. On a doubling every 5 days trajectory, that probably means somewhere in the 10 to 15 to 20 days from now (which would mean somewhere around 800 to 3000 cases in the US). Spring break being Mar 30, I could imagine they’ll try to stay open til the 25th or so, and then not reopen after spring break. Though more pro-active decision making might mean closure in the next 5-10 days or so now that Pasadena has declared state of emergency. All this is more or less my own opinion based on reading the growth charts, and seeing the responses from large organizations canceling conferences and things.

Now, at what point is it actually logical to pull your kids from school? I’m going to do this just for a family with a stay at home parent, because the calculation for lost days of work is much harder and depends on a lot of factors. We can back of the envelope calculate this as follows: the cost of lost days of education is on the order of a couple hundred dollars a day. Let’s say $20/hr x 6hr/day = $120/day. If the stay at home parent can provide some of this education, the cost might drop to say $50/day…

Now, what are the costs associated with sickness? Let’s just do the calculation for the case where one parent gets seriously ill and dies. For a child in elementary school let’s just put this around say $10M.

Now, what’s the chance of death if you have definite exposure? It’ll be something like 100% chance of getting sick and 0.5% chance of death (assuming the parent doesn’t have underlying conditions and isn’t unusually old)… So the expected cost is $10M * 0.005 = $50,000… At $50/day of lost education, that buys 1000 days, so by this logic you should be willing to pull your kids from school about 1000 days early. Of course, it’s way too late to be 1000 days early, so basically you should pull your kids from school TODAY.

Now, suppose you have a job making $100k/yr, and you just get cut off from that job. That’s about $385 per working day (which you don’t take home all of, but whatever). So if you add $50/day to that for educational loss, the cost is $435/day, and $50,000 / $435 is about 115, so you should be willing to pull your kids about 115 days early. It’s also too late for that… So again, pull your kids TODAY.
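Both versions of the calculation fit in a few lines. The 260 working days per year is an assumption that reproduces the $385/day figure, and the names are illustrative.

```python
COST_OF_PARENT_DEATH = 10_000_000   # dollars, the rough figure used above
P_DEATH_IF_EXPOSED = 0.005          # 100% chance of getting sick x 0.5% chance of death
EDUCATION_LOSS_PER_DAY = 50.0       # dollars/day with a parent teaching at home
WORKING_DAYS_PER_YEAR = 260         # assumed: 52 weeks x 5 days

expected_loss = COST_OF_PARENT_DEATH * P_DEATH_IF_EXPOSED

def days_worth_staying_home(daily_cost):
    """Days of staying home whose total cost equals the expected loss."""
    return expected_loss / daily_cost

# Stay at home parent: only the educational loss counts.
print(days_worth_staying_home(EDUCATION_LOSS_PER_DAY))

# Working parent losing a $100k/yr income on top of the educational loss.
income_per_day = 100_000 / WORKING_DAYS_PER_YEAR
print(round(income_per_day), round(days_worth_staying_home(income_per_day + EDUCATION_LOSS_PER_DAY)))
```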

Any way I back of the envelope this, it’s time to pull your kids from school… I don’t see a big enough flaw in all these calculations that would lead to waiting another 20 days.