A (much better) grant evaluation algorithm

2015 October 19
by Daniel Lakeland

There are lots of problems with the way that grants are evaluated in NIH or NSF study sections. A few (typically 3) people read each grant and give it a score; if the score is high enough, the grant is discussed with the rest of the group, the group votes on the results of that discussion, the scores are added up into a total, the totals are ranked, and the top 5-15% are funded based on available funds (or some approximation of this fairy-tale).

To make it past the first round (i.e., into discussion) you need to impress all 3 of the randomly selected people, including people who might be your competitors, who might not know much about your field, who might hold a grudge against you... And then, you need those people to be good advocates for you in the discussion... It's a disaster of unintended potential biases. Furthermore, the system tends to favor "hot" topics, and spends too little time searching the wider space of potentially good projects.

Here is an alternative that I think is far, far better (a rough code sketch of the whole procedure follows the list):

  1. A study section consists of N > 5 people with expertise in the general field (as it does now).
  2. Each grant submitted by the deadline is given a sequential number.
  3. Take the Unix time of the grant deadline expressed as a decimal number, append the last names of all authors on grant submissions (upper-case ASCII, in ascending alphabetical order), and compute the SHA-512 hash (or another secure cryptographic hash) of this entire string. Then, using AES or another secure block cipher in counter mode, with the first 128 bits of the hash as the key, encrypt the sequence x, x+1, x+2, x+3, ... starting at x = the value of the remaining bits of the hash. This defines a repeatable and difficult-to-manipulate random number sequence for a random number generator.
  4. Each grant is reviewed by 5 people chosen at random. (In sequential order, choose a grant number, then choose 5 people at random with replacement to review it... repeat with the next grant.)
  5. Allow each reviewer to score the grant on the usual criteria (feasibility, innovation, blablabla) with equal weight put on the various criteria. For each grant add up the total score for each of the 5 scorers.
  6. For each grant, take the median of the 5 scores it was assigned. This prevents your friends or foes or clueless people who don't understand the grant, or whatever from having too much influence.
  7. Divide each grant's score by the maximum possible score.
  8. Add 1 to the score.
  9. Divide the score by 2. You now have a score between 0.5 and 1: 50% of it is influenced by the reading of the grant, and 50% is constant under the assumption that most of the grants are of similar quality. This prevents too much emphasis on the current hot topic.
  10. While there is still money left: select a grant at random with probability proportional to the overall scores of the remaining grants. Deduct the grant's budget from the total budget and fund that grant, removing it from the pool. Repeat with the next randomly chosen grant until all the money is spent.
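
Here is a minimal sketch of the whole procedure in Python. It is a sketch under assumptions, not a spec: the grant and reviewer structures, the maximum total score, and the handling of a grant whose budget exceeds the remaining money are all hypothetical, and a SHA-512 hash-counter stream stands in for the AES counter-mode stream described in step 3.

    import hashlib
    import struct

    def seed_digest(deadline_unix, author_last_names):
        # Step 3: deterministic, auditable seed material -- deadline time plus
        # the sorted, upper-cased author last names, hashed with SHA-512.
        names = "".join(sorted(name.upper() for name in author_last_names))
        return hashlib.sha512((str(deadline_unix) + names).encode("ascii")).digest()

    class CounterStream:
        # Keyed counter stream. The post specifies AES in counter-style mode;
        # hashing key + counter with SHA-512 is used here as a stand-in.
        def __init__(self, digest):
            self.key = digest[:16]                                # first 128 bits as the key
            self.counter = int.from_bytes(digest[16:24], "big")   # start x from the rest of the hash
            self.buf = b""

        def random(self):
            # Uniform float in [0, 1).
            if len(self.buf) < 8:
                self.buf += hashlib.sha512(self.key + struct.pack(">Q", self.counter % 2**64)).digest()
                self.counter += 1
            chunk, self.buf = self.buf[:8], self.buf[8:]
            return int.from_bytes(chunk, "big") / 2**64

        def randint(self, n):
            return int(self.random() * n)

    def review_and_fund(grants, reviewers, rng, total_budget, max_total_score=45):
        # grants: list of dicts with "id", "budget", and a "score" callback (hypothetical structure).
        # max_total_score=45 would correspond to, say, 5 criteria scored up to 9 each.
        scored = []
        for g in grants:                                          # steps 4-9
            panel = [reviewers[rng.randint(len(reviewers))] for _ in range(5)]  # 5 at random, with replacement
            median = sorted(g["score"](r) for r in panel)[2]      # median damps friends, foes, and the clueless
            scored.append([g, (median / max_total_score + 1.0) / 2.0])  # rescale into [0.5, 1]
        funded, budget = [], total_budget
        while scored and budget > 0:                              # step 10: lottery proportional to score
            u, cum = rng.random() * sum(s for _, s in scored), 0.0
            for i, (g, s) in enumerate(scored):
                cum += s
                if u <= cum:
                    break
            g, _ = scored.pop(i)
            if g["budget"] <= budget:                             # fund it if the remaining money covers it
                funded.append(g["id"])
                budget -= g["budget"]
        return funded

Dropping an unaffordable grant from the pool without funding it is just one way to handle the end-of-budget edge case; the list above doesn't pin that detail down.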

Why is this a good idea? So much of grant scoring is influenced by things other than the science: whether the person writing the grant has published in this field a lot, whether they are well known and liked by the committee members, whether they have been funded in the past, whether they are working on a hot topic, whether they're a new investigator, how MANY papers they've published (not so much how good those papers were), whether they have a sexy new technique to be applied, etc. etc.

But the truth is most grants are probably similar, mostly lousy-quality projects. It's hard to do science; very few experiments are going to be pivotal, revolutionary, or open new fields of research. There's going to be a lot of mediocre stuff that's very similar in quality, and so ape-politics is going to have a big influence on the total score across the 5 reviewers.

But, the review process does offer SOME information. When at least 3 of 5 randomly chosen reviewers recognize that a grant is seriously misguided, that's information you should use. Taking the median of 5 scores, and using it as 50% of the decision making criterion seems like a good balance between a prior belief that most grants are of similar quality, and specific knowledge that the 5 reviewers are going to bring to the table after reading it.

Dos Equis

2015 October 19
by Daniel Lakeland


Some order of magnitude estimates on Universal Basic Income

2015 October 16
by Daniel Lakeland

Ok, so I've advocated with some of my friends for a Universal Basic Income (UBI). The basic idea is this: if you are an adult citizen of the US, you have a social security number, you register a bank account, and you get a monthly pre-tax direct deposit from the government, a flat amount that everyone receives just for being a citizen. The goal here is to vastly simplify the requirements for providing a basic social safety net, as well as to eliminate the complexity of programs like the progressive income tax with millions of specialty deductions, etc.

The UBI eliminates the fear of being without income on the very short term (days, weeks, months), and lets people take risks, be entrepreneurial, take care of families, weather bad events better, etc. It also takes care of pretty much everything that a progressive tax rate structure is supposed to do (help poorer people who spend a lot of their income on basic necessities). So once a UBI is in place, you can VASTLY simplify the administration of an income tax, and you can eliminate all sorts of specialized subsidies that currently require a lot of administrative overhead (checking that people qualify, running housing projects, providing specialty healthcare programs, etc.).

The UBI doesn't work for the mentally ill, so they will continue to need specialty help in addition, but for everyone else, it's a very efficient way to do what we're currently doing in very inefficient ways.

But, this isn't a post about the merits, it's a post about order of magnitude estimates for the quantities of money involved.

According to Google there are 3.2 \times 10^8 people in the US.

The federal budget is currently about 3.7 \times 10^{12} dollars, with about 0.63 \times 10^{12} in defense and the rest in various social services and interest on debt.

Let's take, as an order of magnitude estimate of a good choice for UBI, the 2015 federal poverty guideline. That's about 12\times 10^3 dollars per year, or about $1k / mo.

So, if we just started shipping out cash to everyone at the rate of 12k/yr how big is that as a fraction of the federal after-defense budget?

 12\times 10^3 \times 3.2\times 10^8 / ((3.7-0.6) \times 10^{12}) \approx 1.2
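
As a sanity check, the same arithmetic in a few lines of Python (numbers copied from the estimates above):

    population = 3.2e8            # people in the US
    ubi_per_person = 12e3         # ~2015 federal poverty guideline, dollars per year
    federal_budget = 3.7e12       # total federal budget, dollars per year
    defense = 0.63e12             # defense portion, dollars per year

    ubi_total = ubi_per_person * population
    print(ubi_total / (federal_budget - defense))   # ~1.25, i.e. roughly 1.2 to first order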

So, to first order, the entire non-defense budget is about the same as the amount of money you'd need to spend on a UBI. But the UBI can replace a *lot* of other government programs. Social security, medicare, housing and human services, a big majority of what we're spending this budget on is basically doing an inefficient job of helping people.

I don't advocate gutting all of the government programs and replacing them with a UBI, but I imagine I could easily get on board with gutting 60 or 70% of them and replacing with a UBI.

Besides reducing the overhead of government, you'd need to increase revenue. The UBI would drive sales, and a flat federal sales tax would be a very simple way to take care of this extra need for revenue. A sales tax is also a consumption-based tax, which has good economic consequences (it encourages saving and investing, whereas income taxes discourage earning and encourage consumption!)

So, our order of magnitude estimates show this is a feasible plan. It's not something that would be easy to transition to in a blink, but it could be done a lot more easily than setting up a universal medicare system, for example. A UBI accomplishes things that both the liberal and conservative groups in politics want: helping people, while being efficient, and encouraging growth and entrepreneurialism. It's an idea whose time has come:

(see this WaPo article on how a UBI-like program helped Native American populations for some empirical information)

WD-40 and motor bearings

2015 October 12
by Daniel Lakeland

It's been terribly hot in the Pasadena area, and recently our 30 year old AC system started making a horrendous "wubba-wubba" kind of noise. It was pretty obviously the direct drive fan motor for the heat exchanger, which pulls the air through the fins and exhausts it upwards away from the unit. Now, this motor hasn't been lubed in forever, and it's exposed to the elements. It has two plastic oil ports with little caps. Adding some light 3-in-one oil helped for a day or two but it would intermittently make this horrendous noise, which was especially bad when the system kicked in at night and had the potential to wake neighbors several doors away. Clearly, we were in for a very expensive replacement...

I bought some of the heavy duty SAE20 oil that 3-in-one makes especially for these heavier duty motors, but adding it didn't really help. Partly, the heavier oil didn't seem like it was likely to get into the bearing right away, if at all. Finally, with 104F weather and two small kids at home, I HAD to run the system, so I decided out of desperation to hit it with WD-40 to try to clean out the system. After shooting the WD-40 directly into the bottom of the oil port on each side, I filled the oil port with the SAE 20 heavy duty electric motor oil.

Sure enough, about 30 seconds after I turned it back on, it stopped making the noise and hasn't made it since (about 4 days now). That's a big improvement over making the noise almost every time it turned on. Now, if you look on the internet you'll read all sorts of stuff about how WD-40 is a terrible product for motor bearings. Well, here's my more nuanced take on that:

WD-40 is basically a mixture of very light petroleum products with a small portion that is very heavy, in fact heavy enough to "sink" in water and thereby displace water from the surface of metal parts. So, WD-40 is a good cleaner (due to the light fraction which works like mineral spirits / paint thinner) and corrosion protector (due to the heavy fraction displacing the water) but it is a terrible lubricant, because the light fraction evaporates away, and/or takes away any other lubricant, and the heavy fraction is subject to gumming up like asphalt. This then becomes sticky and binds up dirt in your bearing. Your bearing is now operating with basically no lube, or even an abrasive gum.

So, is there any reason to use WD-40 on motors or bearings? I think yes. Specifically, it's going to penetrate, take away dirt, and wet surfaces. It can act like a carrier to help other more appropriate lubricants penetrate. So, I believe the reason my last-ditch effort worked was that it cleaned the crap out of the bearings, brought the appropriate SAE20 lube oil into the system by reducing the surface energy, and allowed the bearing to take up the lube it needed.

By itself, WD-40 is NOT a bearing lube, and it's not necessarily safe for all kinds of wire insulators and things, some of which it might dissolve. But if you have a bearing that is old and dirty and worn and sounds like it's about to die, it could make good sense to try cleaning the crap out using WD-40 and then adding in the appropriate light or heavy duty motor oil, which will more easily wet and penetrate, and will remain behind to lube the bearing properly after the light WD-40 fraction evaporates or runs out. Just be sure to add plenty of the bearing oil after the treatment. In my case I think I put 15 drops into each oil port, which pretty much filled it to the brim.

Lubrication, friction, and surface physics are really interesting areas of study.

The Choice of What to Model

2015 September 25
by Daniel Lakeland

Another way to think about the "declarative vs generative" dichotomy that I have recently blogged on is that you have free choice about what you are going to model. Informally, the "likelihood" principle says that you should model the output of your data collection instruments, typically as IID draws from some distribution. In simple cases, this lets you write down

p(data|parameters) = \ldots

And this formalism is typically fine for people who work with data that is pretty darn noisy, where little is known about the wheels and gears and other things that go into producing the data... like survey responses, or even some economics, or even some simple physical processes.

But lots of people have detailed models of their data production process, like weather forecasters: they may not know everything they'd like to know, but conditional on the initial conditions, they have PDEs that let them evolve their weather forward in time to tomorrow or next week (or next century for some of the climate change researchers).

So, we could look at today's weather measurements and compare them to the predictions based on yesterday's weather measurements, and try to get the parameters of the model (initial and boundary conditions yesterday, forcing functions) perfectly set up so that we'll predict tomorrow's weather. But we absolutely SHOULD NOT replace the weather model with a random number generator.

If we are given a vast vector of temperature, pressure, and wind speed measurements all put together into a vector "data", and a set of parameters which are more or less the "correct" initial and boundary conditions, then what kind of

p(data | parameters) = ...

should we use? The answer is, we don't have one, and, luckily, we're NOT forced by the logic of Cox/Jaynes probability modeling to write one down.

In particular, there's nothing forcing us to model our data; we can model anything, and one thing we can model is, for example, the relative reasonableness of certain degrees of deviation between our model for the weather and today's actual weather data. This is what ABC does:

p(distance(summary(predicted_weather(parameters)), summary(measured_weather)) | parameters, K) = uniform(0, epsilon)

This is a function of both data and parameters which we choose to model; we simply ASSIGN, based on our knowledge K, the given probability distribution, and it means "this is about how far off we could be if the parameters were correct".

Doing that lets us update the parameters. Let's call all that distance(...) stuff "dist" for short.

Then we have some knowledge about how well weather prediction works, which lets us write down something we can have some confidence in:

p(all_params | dist) \propto p(dist | all_params) p(all_params)

Is p(dist | all_params) a likelihood? Well, not if you want likelihood to mean p(data | params), but it is an assigned probability, which is a valid modeling step for a Cox/Jaynes model, and it's assigned on an expression that USES the data. It has a perfectly fine logical basis. We may not know what the reasonable values are for the very complicated data vector, but we know the reasonable values for some kinds of summaries, and we can collapse our data down via a parameterized function and model that collapsed quantity.
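
To make the ABC-style move concrete, here is a minimal rejection-sampling sketch in Python; the forward model, the summary, the priors, and the tolerance epsilon are all stand-ins for whatever the real problem (say, the weather model) supplies.

    import numpy as np

    rng = np.random.default_rng(0)

    def forward_model(params, n=200):
        # Stand-in for the expensive simulator (the weather PDE in the post).
        mu, sigma = params
        return rng.normal(mu, sigma, size=n)

    def summary(x):
        # Collapse the big data vector down to a few numbers we know how to reason about.
        return np.array([x.mean(), x.std()])

    def abc_rejection(observed, prior_sampler, epsilon, n_draws=20000):
        # Keep parameter draws whose simulated summaries land within epsilon of the
        # observed summaries -- i.e. p(dist | params) is taken to be uniform(0, epsilon).
        obs_summary = summary(observed)
        kept = []
        for _ in range(n_draws):
            theta = prior_sampler()
            dist = np.linalg.norm(summary(forward_model(theta)) - obs_summary)
            if dist < epsilon:
                kept.append(theta)
        return np.array(kept)

    # Hypothetical usage: fake "observed" data, vague priors, and a tolerance we believe in.
    observed = rng.normal(2.0, 1.5, size=200)
    prior = lambda: (rng.uniform(-5, 5), rng.uniform(0.1, 5))
    draws = abc_rejection(observed, prior, epsilon=0.3)
    print(len(draws), draws.mean(axis=0))   # rough posterior sample size and means for (mu, sigma)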

After all, the output of data collection equipment is really just a collapsing down of other quantities we could call data: for example a gauge points to a number on a dial, but that's a collapse of a certain current and a certain spring constant, and a certain magnetization, and some geometry... a digital thermometer outputs a temperature, but that's a collapse of a certain charge in a certain capacitor of a certain size and a certain resistance, and a certain response function for an A/D converter and etc etc.

And it's not just complicated black-box PDE solvers that we can do this for; we can always model something other than the pure output of our measurement instruments. So long as we are actually USING the output of the measurement instruments somewhere in the model, the measurements will provide some kind of information. The more we collapse them, perhaps the less information we can get, so more collapsed models are possibly not quite as informative as a full model for the entire data vector would be, but then, we don't have such a model!

Statistical tests: when would a Bayesian care?

2015 September 20
by Daniel Lakeland

Statistical testing answers the question: how strange is the sequence of data d_i under the assumption that it is a sample from distribution D?

When would a Bayesian care about the answer to such a question? There are basically two cases:

  1. When summarizing the posterior distribution using a sampling mechanism such as MCMC.
  2. When attempting to fit a distribution to data so as to replace the physical process with a sampling process.

Case 1 is a computational technique. We replace a function of the parameter space with a finite sequence of numbers which act as if they were a sample from a distribution. The main thing this does is let us replace expectation integrals with finite sums. Per Martin-Löf gave us a way to define random sequences as mathematical objects in terms of sequences that pass some "most powerful test" for which we have a non-constructive proof of existence. Practically speaking, this means we can determine whether a sampling technique is working by testing to see if we can reject the sample.

Case 2 is a modeling technique. Whereas the first case is about computing with a sequence of samples from the posterior for the parameters, the second technique is about approximating some scientific real-world process by sampling. The second case is always intentionally wrong, and this is what Frequentist statistics often gets wrong. We know that the way data "is generated" is more or less as the result of physical processes, quantum or classical; whatever the case, some molecules push on other molecules, etc. However, from a convenience-in-modeling perspective, sometimes we can replace all that detailed picture with an approximate one: the net result of all those causes tends to fall into some certain range of outcomes with some relative frequencies of occurrence. This is always always always an approximation, one which need not continue to hold and which should be checked. The way to check the adequacy of this process for describing at least the historical data is to use some kind of statistical test.
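
A minimal sketch of the Case 2 check, assuming numpy/scipy and made-up data: fit a candidate distribution to the historical data, then ask a goodness-of-fit test whether that data can reject the replacement.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Historical data actually produced by some messy physical process (stand-in here).
    data = rng.gamma(shape=4.0, scale=2.0, size=500)

    # The modeling move: pretend the process is "IID draws from a lognormal" and fit it.
    shape, loc, scale = stats.lognorm.fit(data, floc=0)
    fitted = stats.lognorm(shape, loc=loc, scale=scale)

    # Check the adequacy of that replacement with a statistical test (Kolmogorov-Smirnov here).
    # Rejection means the sampling approximation doesn't even describe the historical data.
    stat, p = stats.kstest(data, fitted.cdf)
    print(f"KS statistic {stat:.3f}, p-value {p:.3f}")
    # Caveat: the p-value is only approximate because the parameters were fit to the same data.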


Posterior probability density ratios

2015 September 3
by Daniel Lakeland

Continuous probability densities have dimensions of 1/[x], where [x] is whatever the dimensions of the input are: for example, probability per unit length for p(x) when x is a length.

Mathematical models should be made based on dimensionless quantities. If we're looking to compare different data points under a model, it seems to me it's worth considering the quantity:

\log(p(x_i)/p(x^*))

where x^* is the x where p takes on its maximum value, so for the unit normal x^*=0, for example.

In particular, I have in mind using this alternative quantity to filter data in a similar way to the filtering by "p value" described in a previous post.

So, if you're trying to detect unusual situations (in the previous post's example it was unusual loudness of noise), you fit your model p(x) and then for each data value x_i you calculate \log(p(x_i)/p(x^*)) and select those data points which produce a negative enough log probability density ratio. In practice you'd actually calculate, say, the average of this quantity over the posterior distribution for the parameters.
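
As a sketch of that screening step (using a maximum-likelihood t fit in Python rather than a full posterior average, with made-up data, and a threshold that is purely a judgment call):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Made-up plate data: a noisy "typical" blob plus a few genuine hits.
    x = np.concatenate([rng.normal(100, 10, size=400), [160.0, 175.0, 42.0]])

    # Fit a t distribution to the plate (the post does this in Stan; a point fit stands in here).
    df, loc, scale = stats.t.fit(x)
    fitted = stats.t(df, loc=loc, scale=scale)

    # log(p(x_i) / p(x*)): for a t distribution the density is maximized at its location, so x* = loc.
    log_ratio = fitted.logpdf(x) - fitted.logpdf(loc)

    threshold = -6.0                     # "negative enough" cutoff, chosen by judgment
    unusual = np.where(log_ratio < threshold)[0]
    print(unusual, log_ratio[unusual])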

This isn't quite the same as the likelihood ratio, since of course it's averaged over the posterior of the parameters, but it is obviously related. (EDIT: you can obviously use these ratios both for likelihood and for posterior density of parameter values, if you're trying to compare alternative parameters, but with parameters you probably want marginal density values, and if you have samples, you probably need kernel density estimates or something...)

This is a case where it makes sense to use the Bayesian machinery to actually fit the frequency histogram of the data points, a perfectly legit thing to do, even though it is sometimes not required.

Ways in which this is superior to "p values" include being able to distinguish when something is unusual even when it's not in the tail of a distribution (for example, in multi-modal distributions), as well as being actually motivated by the science (in so far as you put science into your data model) and by the actual values of the data. It doesn't involve any probability, under the model, of observing theoretical data "as extreme or more extreme" than the data point; it's just the relative probability of observing data in the immediate neighbourhood of the data point compared to the immediate neighbourhood of the most common point. In particular, you can think of p(x) dx / (p(x^*) dx) as a ratio of probabilities for infinitesimal quantities dx which, thanks to the ratio, happen to cancel out.

I used this exact method recently with data my wife gave me. It was a kind of screening of certain chemicals for their ability to induce certain kinds of fluorescent markers in cells. The image data resulted in a bunch of noisy "typical" values, and then a number of outliers. Each plate was different since it had different chemicals, different cells, different markers etc. In Stan I fit a t distribution to the image quantification data from each plate, and looked for data points that were in the tails of the fit t distribution based on this quantity. It seemed to work quite well, and got her what she needed.

I'm blue about this being published

2015 September 2
by Daniel Lakeland


The above graph summarizes the situation that Andrew Gelman has been railing on about how chasing the noise with p values leads to a great career and top publications!

Published recently in Psychological Science, this study has Open Science Data. So I pulled the Study1 data and quickly plotted it in R. On the x axis is a measure of how manipulable the participant is times the manipulation that was done. More sad is to the right. On the y axis the relative blue-yellow accuracy as a fraction of red-green accuracy.

You may notice the two outstanding data points on the left at the top? Yeah, I think that's their effect.  Basically, two people who were easy to manipulate and happened to get the amusement manipulation were outstanding blue-yellow discriminators. All the rest... well, look for yourself.

The Data Generating Mechanism

2015 September 1
by Daniel Lakeland

From a comment I left on Andrew Gelman's blog:

In science, the data never come from random number generators with given frequency distributions. They always come from some set of atoms and sub-atomic particles interacting to alter the position of some other particles. It's pretty much always position, or occasionally velocity, too.

Position of a needle on a gauge caused by velocity of electrons flowing through a wire causing a magnetic field which provides a torque to counteract the torque applied by a spring.

Position of electrons in a capacitor that produces a voltage for an A/D converter.

Position of graphite molecules on a paper survey answered by an enormous bundle of atoms we call a Psychology Student who uses the position of photons striking her retina to interpret the meaning of the position of the ink molecules on the page which alters the position of sodium ions in the neurons of her brain.

Position of electrons in a bank computer which cause a money balance to be shown on an account which causes a bundle of molecules called a consumer to be able to alter the position of a bundle of molecules called a product from one location on the earth called a store to another location on the earth called a home without another bundle of molecules called a police officer confining the consumer bundle into a small space by strong molecular repulsion.

It’s never Platonic infinite sequences of bits coming from a particular abstract mathematical sequence whose Godel number is some particular value and which satisfies the Kolmogorov axioms for a particular distribution function to within a given epsilon when calculated by the non-constructive uniformly most powerful computable test.

A Generative Modeling Challenge

2015 August 28
by Daniel Lakeland

So, I've been discussing declarative vs generative models. Here's a case where I think the declarative model makes good sense, how would you model this?

y is a vector of 1000 time-series samples from some audio system with the following signal chain:

  1. On one end is a highly sensitive microphone that puts out a voltage signal proportional to the instantaneous pressure gradient. The voltage is in the range [0, 0.01] volts and is known to have a slew-rate limit of at most 0.005 v per sample and a bias of 0.005 +- 0.0006.
  2. A pre-amplifier increases this signal by pretty close to 100x but has a response function f_{pre}(v) = A \mathrm{atan}(v/0.01)/\mathrm{atan}(1), which has a known shape over the input range [0, 0.01] and is not perfectly linear; this shape is multiplied by a number A slightly less than 100 to get the actual transfer function.
  3. The pre-amp output picks up radio frequency noise, which can be modeled as normal(0, 0.015), and the result is converted via a 10 bit A/D into the range 0-1023, with 1 v into the A/D implying 1024.
  4. This digital signal is sent down a noisy channel, so that only some of the samples are received in digital form.
  5. Additionally, as a compromise to improve reliability and due to lack of equipment, the signal is re-converted to analog via a 12 bit D/A outputting in the range 0-10 v, which has a known noisy bias signal due to a failing diode that can be modeled as independent exponentially distributed voltages with mean value close to 0.05.
  6. The result is sent over a long cable of around 1000 feet, where additional normally distributed white noise is added to the signal, whose RMS amplitude is less than 0.01 v. The long cable has about 1 ohm of resistance per 100 feet +- 5%, as well as a resistance of less than 10 ohms at both the input and output connections.
  7. The whole thing is put into a 16 bit A/D operating on inputs in the range 0-10 v with an unfortunately low input impedance of close to 1000 ohms +- 1%. The 16 bit A/D is of sufficiently high resolution that y is given as a floating point number equal to the digital value on 0-(2^16-1) divided by 2^16, assumed to be continuous, with an additional conversion noise and bias which is approximately gamma distributed with k = 3 and rate = 1/0.003.

Reconstruct the original signal y_true out of the microphone as best you can using as much of the information available as possible, while also giving best estimates for the length of the wire, the total resistance of the cable, the amplification factor for the pre-amp, and the input impedance of the final A/D.

The "generative" picture here is this (more or less, we're ignoring low-pass filters that are always present on the d/a and a/d steps):

y_meas = round(((round(1024*A*f_pre(y_true)+epsilon_rf)+da_bias_noise)*10*

And, in addition, we have some of the digital values round(1024*A*f_pre(y_true)+epsilon_rf); let's call the ones we have y_digit, which is a sparse vector, along with t_digit, which tells us which of the y_meas the y_digit correspond to.

So, if you wanted to write this in a generative form, with only data or only parameters on the left hand side of the probabilistic statements, in Stan, what would you use to define the likelihood function for y_meas or y_digit given all the parameters, y_true, epsilon_rf, da_bias_noise... etc?

For the full analog / digital / analog / digital / analog path, you could always invent some extra roundoff error parameters, and write

y_meas - (....)  ~ gamma(3,1/0.003);

where (...) = the whole mess with round(x) replaced by x + roundoff_i, and roundoff_i is given the appropriate prior. But then you'd be writing a declarative model with a nonlinear function of the data and the parameters (consider the resistances formula) on the left hand side.

Stan doesn't have a shifted gamma, but if it did, you could use the shifted gamma distribution and place the round function in the location parameter; however, you would probably induce bad behavior due to the non-differentiable jumps.

The point of the declarative version is that if you were given the actual values of the parameters other than ad16_noise, then subtracting y_meas - (...) would give you an expression equal to the noise sample, and the likelihood of seeing that sample, conditional on all the other parameters, is the gamma distribution.

If you have the wrong values of the parameters, you won't get something that's gamma distributed, so you'll be in the low probability region of the gamma distribution function, and you'll downweight that portion of the parameter space... that's all you need to get inference on the parameters: finding reasons to reject some of the prior space because it doesn't fit the facts you know.
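
As a stripped-down sketch of that declarative move (in Python rather than Stan, with the whole analog/digital chain above collapsed into a placeholder forward model and made-up priors; only the shape of the argument matters):

    import numpy as np
    from scipy import stats

    def forward(params, y_true):
        # Placeholder for the measurement chain: just the pre-amp nonlinearity times gain and a
        # cable attenuation factor. The real chain has the A/D, D/A, and roundoff stages as well.
        A, atten = params
        return atten * A * np.arctan(y_true / 0.01) / np.arctan(1.0)

    def log_posterior(params, y_true, y_meas):
        # Declarative statement: y_meas - forward(...) is asserted to look like the 16-bit
        # converter's gamma(k=3, rate=1/0.003) conversion noise, conditional on the parameters.
        resid = y_meas - forward(params, y_true)
        if np.any(resid <= 0):
            return -np.inf                                # outside the gamma's support
        loglik = stats.gamma.logpdf(resid, a=3, scale=0.003).sum()
        logprior = (stats.norm.logpdf(params[0], loc=100, scale=2) +       # A slightly less than 100 (stand-in prior)
                    stats.uniform.logpdf(params[1], loc=0.9, scale=0.1))   # attenuation near 1 (stand-in prior)
        return loglik + logprior

With the wrong parameter values the residuals drift out of the range the gamma finds plausible, the log posterior drops, and that region of parameter space gets downweighted, which is the whole inferential content of the declarative statement.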