Ok, so I've advocated with some of my friends for a Universal Basic Income (UBI). The basic idea is this, if you are an adult citizen of the US, you get a social security number, and you register a bank account, and you get a monthly direct deposit pre-tax from the government. A flat amount that everyone receives just for being a citizen. The goal here is to simplify vastly the requirements to provide a basic social safety net as well as eliminating the complexity of programs like the progressive income tax with millions of specialty deductions etc.
The UBI eliminates the fear of being without income on the very short term (days, weeks, months), and lets people take risks, be entrepreneurial, take care of families, weather bad events better, etc. It also takes care of pretty much everything that a progressive tax rate structure is supposed to do (help poorer people who spend a lot of their income on basic necessities). So once a UBI is in place, you can VASTLY simplify the administration of an income tax, and you can eliminate all sorts of specialized subsidies that current require a lot of administrative overhead (checking that people qualify, running housing projects, providing specialty healthcare programs etc).
The UBI doesn't work for the mentally ill, so they will continue to need specialty help in addition, but for everyone else, it's a very efficient way to do what we're currently doing in very inefficient ways.
But, this isn't a post about the merits, it's a post about order of magnitude estimates for the quantities of money involved.
According to Google there are people in the US.
The federal Budget is currently about dollars, with about in defence, the rest in various social services and interest on debt.
Let's take as an order of magnitude estimate of a good choice for UBI as the 2015 federal poverty guidelines. That's about per year, or about $1k / mo.
So, if we just started shipping out cash to everyone at the rate of 12k/yr how big is that as a fraction of the federal after-defense budget?
So, to first order, the entire non-defense budget is about the same as the amount of money you'd need to spend on a UBI. But the UBI can replace a *lot* of other government programs. Social security, medicare, housing and human services, a big majority of what we're spending this budget on is basically doing an inefficient job of helping people.
I don't advocate gutting all of the government programs and replacing them with a UBI, but I imagine I could easily get on board with gutting 60 or 70% of them and replacing with a UBI.
Besides reducing the overhead of government, you'd need to increase revenue. The UBI would drive sales, and a flat federal sales tax would be a very simple way to take care of this extra need for income. A sales tax would be also a consumption based tax, which has good economic consequences (it encourages saving and investing vs income taxes which discourage earning and encourage consumption!)
So, our order of magnitude estimates show, this is a feasible plan. It's not something that would be easy to transition to in a blink, but it could be done a lot easier than setting up a universal medicare system for example. A UBI accomplishes things that both the liberal and conservative groups in politics wants: helping people, while being efficient, and encouraging growth and entrepreneurialism. It's an idea whose time has come:
(see this WaPo article on how a UBI like thing helped native american populations for some empirical information)
It's been terribly hot in the Pasadena area, and recently our 30 year old AC system started making a horrendous "wubba-wubba" kind of noise. It was pretty obviously the direct drive fan motor for the heat exchanger, which pulls the air through the fins and exhausts it upwards away from the unit. Now, this motor hasn't been lubed in forever, and it's exposed to the elements. It has two plastic oil ports with little caps. Adding some light 3-in-one oil helped for a day or two but it would intermittently make this horrendous noise, which was especially bad when the system kicked in at night and had the potential to wake neighbors several doors away. Clearly, we were in for a very expensive replacement...
I bought some of the heavy duty SAE20 oil that 3-in-one makes especially for these heavier duty motors, but adding it didn't really help. Partly, the heavier oil didn't seem like it was likely to get into the bearing right away, if at all. Finally, with 104F weather and two small kids at home, I HAD to run the system, so I decided out of desperation to hit it with WD-40 to try to clean out the system. After shooting the WD-40 directly into the bottom of the oil port on each side, I filled the oil port with the SAE 20 heavy duty electric motor oil.
Sure enough, about 30 seconds after I turned it back on, it stopped making the noise and hasn't made it since (about 4 days now). That's a big improvement over making the noise almost every time it turned on. Now, if you look on the internet you'll read all sorts of stuff about how WD-40 is a terrible product for motor bearings. Well, here's my more nuanced take on that:
WD-40 is basically a mixture of very light petroleum products with a small portion that is very heavy, in fact heavy enough to "sink" in water and thereby displace water from the surface of metal parts. So, WD-40 is a good cleaner (due to the light fraction which works like mineral spirits / paint thinner) and corrosion protector (due to the heavy fraction displacing the water) but it is a terrible lubricant, because the light fraction evaporates away, and/or takes away any other lubricant, and the heavy fraction is subject to gumming up like asphalt. This then becomes sticky and binds up dirt in your bearing. Your bearing is now operating with basically no lube, or even an abrasive gum.
So, is there any reason to use WD-40 on motors or bearings? I think yes. Specifically, it's going to penetrate, take away dirt, and wet surfaces. It can act like a carrier to help other more appropriate lubricants penetrate. So, I believe the reason my last-ditch effort worked was that it cleaned the crap out of the bearings, brought the appropriate SAE20 lube oil into the system by reducing the surface energy, and allowed the bearing to take up the lube it needed.
By itself, WD-40 is NOT a bearing lube, and it's not necessarily safe for all kinds of wire insulators and things, some of which it might dissolve, but if you have a bearing that is old and dirty and worn and sounds like it's about to die, it could make good sense to try cleaning the crap out using WD-40 and then adding in the appropriate light or heavy duty motor oil which will more easily wet and, penetrate, and will remain behind after the light WD-40 fraction evaporates or runs out, to lube the bearing properly. Just be sure to add plenty of the bearing oil after the treatment. In my case I think I put 15 drops into each oil port, which pretty much filled it to the brim.
Lubrication, friction, and surface physics are really interesting areas of study.
Another way to think about the "declarative vs generative" dichotomy that I have recently blogged on, is that you have free choice about what you are going to model. Informally, the "likelihood" principle says that you should model the output of your data collection instruments, typically as IID draws from some distribution. In simple cases, this let's you write down
And this formalism is typically fine for people who work with data that is pretty darn noisy and about which there is little known about the wheels and gears and other things that go into producing the data... like survey responses, or even some economics, or even some simple physical processes.
But lots of people have detailed models of their data production process, like weather predictors, they may not know everything they'd like to know, but conditional on the initial conditions, they have PDEs that let them evolve their weather forward in time to tomorrow or next week (or next century for some the climate change researchers).
So, we could look at todays weather measurements and compare them to the predictions based on yesterday's weather measurements, and try to get the parameters of the model (initial and boundary conditions yesterday, forcing functions) perfectly set up so that we'll predict tomorrow's weather. But we absolutely SHOULD NOT replace the weather model with a random number generator.
If we are given a vast vector of temperature, pressure, and wind speed measurements all put together into a vector "data", and a set of parameters which are more or less the "correct" initial and boundary conditions, then what kind of
should we use? The answer is, we don't have one, and, luckily, we're NOT forced by the logic of Cox/Jaynes probability modeling to write one down.
In particular, there's nothing forcing us to model our data, we can model anything, and one thing we can model is for example the relative reasonableness of certain degrees of devation between our model for the weather and today's actual weather data. This is what ABC does
p(distance(summary(predicted_weather(parameters),summary(measured_weather))) | parameters, K) = uniform(0,epsilon)
This is a function of both data and parameters which we choose to model, we simply ASSIGN based on our knowledge K the given probability distribution, and it means "this is about how far off we could be if the parameters were correct".
Doing that lets us update the parameters, let's call all that distance(...) stuff "dist" for short:
Then we have some knowledge about how well weather prediction works, which lets us write down something we can have some confidence in in terms of:
p(all_params | dist) \propt p(dist | all_params) p(all_params)
Is p(dist | all_params) a likelihood? well, not if you want likelihood to mean p(data | params) but it is an assigned probability which is a valid modeling step for a Cox/Jaynes model, and it's assigned on an expression that USES the data. It has perfectly fine logical basis. We may not know what the reasonable values are for the very complicated data vector, but we know the reasonable values for some kinds of summaries and we can collapse our data down via a parameterized function and model that collapsed quantity.
After all, the output of data collection equipment is really just a collapsing down of other quantities we could call data: for example a gauge points to a number on a dial, but that's a collapse of a certain current and a certain spring constant, and a certain magnetization, and some geometry... a digital thermometer outputs a temperature, but that's a collapse of a certain charge in a certain capacitor of a certain size and a certain resistance, and a certain response function for an A/D converter and etc etc.
And it's not just complicated black-box PDE solvers that we can do this for, we can always model something other than the pure output of our measurement instruments. So long as we are actually USING the output of the measurement instruments somewhere in the model, the measurement will provide some kind of information, the more we collapse them, perhaps the less information we can get, and so more collapsed models are possibly not quite as informative as if we had a full model for the entire data vector, but then, we don't have such a model!
Statistical testing answers the question: how strange is the sequence of data under the assumption that it is a sample from distribution ?
When would a Bayesian care about the answer to such a question? There are basically two cases:
- When summarizing the posterior distribution using a sampling mechanism such as MCMC.
- When attempting to fit a distribution to data so as to replace the physical process with a sampling process.
Case 1 is a computational technique. We replace a function of the parameter space with a finite sequence of numbers which act as if they were a sample from a distribution. The main thing this does is let us replace expectation integrals with finite sums. Per Martin-Löf gave us a way to define random sequences as mathematical objects in terms of sequences that pass some "most powerful test" for which we have a non-constructive proof of existence. Practically speaking, this means we can determine whether a sampling technique is working by testing to see if we can reject the sample.
Case 2 is a modeling technique. Whereas the first case is about computing with a sequences of samples from the posterior for the parameters, the second technique is about approximating some scientific real-world process by sampling. The second case is always intentionally wrong, and this is what Frequentist statistics often gets wrong. We know that the way data "is generated" is more or less as the result of physical processes, quantum or classical, whatever the case some molecules push on other molecules etc. However, from a convenience in modeling perspective sometimes we can replace all that detailed picture with an approximate picture which is that the net result of all those causes tends to fall into some certain range of outcomes with some relative frequencies of occurrence. This is always always always an approximation, and one which need not continue to hold and which should be checked. The way to check the adequacy of this process for describing at least the historical data, is to use some kind of statistical test.
Continuous probability densities are in dimensions of 1/[x] where [x] is whatever the dimensions of the input are. For example, probability per unit length for on a length.
Mathematical models should be made based on dimensionless quantities. If we're looking to compare different data points under a model, it seems to me it's worth considering the quantity:
Where is the x where p takes on its maximum value, so for the unit normal for example.
In particular, I have in mind using this alternative quantity to filter data in a similar way to the filtering by "p value" described in a previous post.
So, if you're trying to detect unusual situations (in the previous post example it was unusual loudness of noise) you fit your model and then for each data value you calculate and select those data points which produce a negative enough log probability density ratio. In particular, you'd actually calculate say the average over the posterior distribution for the parameters of this quantity.
This isn't quite the same as the likelihood ratio, since of course it's averaged over the posterior of the parameters, but it is obviously related. (EDIT: you can obviously use these ratios both for likelihood and for posterior density of parameter values, if you're trying to compare alternative parameters, but with parameters you probably want marginal density values, and if you have samples, you probably need kernel density estimates or something...)
This is a case where it makes sense to use the Bayesian machinery to actually fit the frequency histogram of the data points, a perfectly legit thing to do, even though it is sometimes not required.
Ways in which this is superior to "p values" include being able to distinguish when something is unusual even when it's not on the tail of a distribution. For example in multi-modal distributions, as well as being actually motivated by the science (in so far as you put science into your data model) and the actual values of the data, it doesn't involve any probability under the model to observe theoretical data 'as extreme or more extreme" than the data point for example, it's just the relative probability to observe data in the immediate neighbourhood of the data point compared to in the immediate neighbourhood of the most common point. In particular you can think of as a ratio of probabilities for any infinitesimal quantities which thanks to the ratio happen to cancel out.
I used this exact method recently with data my wife gave me. It was a kind of screening of certain chemicals for their ability to induce certain kinds of fluorescent markers in cells. The image data resulted in a bunch of noisy "typical" values, and then a number of outliers. Each plate was different since it had different chemicals, different cells, different markers etc. In Stan I fit a t distribution to the image quantification data from each plate, and looked for data points that were in the tails of the fit t distribution based on this quantity. It seemed to work quite well, and got her what she needed.
From a comment I left on Andrew Gelman's blog:
In science, the data never come from random number generators with given frequency distributions. They always come from some set of atoms and sub-atomic particles interacting to alter the position of some other particles. it’s pretty much always position, or occasionally velocity, too.
Position of a needle on a gauge caused by velocity of electrons flowing through a wire causing a magnetic field which provides a torque to counteract the torque applied by a spring.
Position of electrons in a capacitor that produces a voltage for an A/D converter.
Position of graphite molecules on a paper survey answered by an enormous bundle of atoms we call a Psychology Student who uses the position of photons striking her retina to interpret the meaning of the position of the ink molecules on the page which alters the position of sodium ions in the neurons of her brain.
Position of electrons in a bank computer which cause a money balance to be shown on an account which causes a bundle of molecules called a consumer to be able to alter the position of a bundle of molecules called a product from one location on the earth called a store to another location on the earth called a home without another bundle of molecules called a police officer confining the consumer bundle into a small space by strong molecular repulsion.
It’s never Platonic infinite sequences of bits coming from a particular abstract mathematical sequence whose Godel number is some particular value and which satisfies the Kolmogorov axioms for a particular distribution function to within a given epsilon when calculated by the non-constructive uniformly most powerful computable test.
So, I've been discussing declarative vs generative models. Here's a case where I think the declarative model makes good sense, how would you model this?
y is a vector of 1000 timeseries samples from some audio system. On one end is a highly sensitive microphone that puts out a voltage signal proportional to the instantaneous pressure gradient. The voltage is in the range [0,0.01] volts and is known to have a slew-rate limit of at most 0.005 v per sample and a bias of 0.005 +- 0.0006. A pre-amplifier increases this signal by pretty close to 100x but has a response function which has a known shape over the input range [0,0.01], is not perfectly linear, and this shape is multiplied by a number A slightly less than 100 to get the actual transfer function. The pre-amp output picks up radio frequency noise, which can be modeled as normal(0,0.015), and the result is converted via 10 bit A/D into the range 0-1023, with 1v in to the A/D implying 1024. This digital signal is sent down a noisy channel, so that only some of the samples are received in digital form. Additionally, as a compromise to improve reliability and due to lack of equipment, the signal is re-converted to analog via a 12 bit D/A outputting in the range 0-10v, which has a known noisy bias signal due to a failing diode that can be modeled as independent exponentially distributed voltages with mean value close to 0.05. The result is sent over a long cable of around 1000 feet, where additional normally distributed white noise is added to the signal whose RMS amplitude is less than 0.01 v. The long cable has about 1 ohm resistance per 100 feet +- 5 % as well as a resistance of less than 10 ohms at both the input and output connections. The whole thing is put into a 16 bit A/D operating on inputs in the range 0-10v with an unfortunately low input impedance of close to 1000 ohms +- 1 %. The 16 bit A/D is sufficiently high resolution that y is given as a floating point number equal to the digital value on 0-(2^16-1) / 2^16 and assumed to be continuous with an additional conversion noise and bias which is approximately gamma distributed with k=3 and rate = 1/0.003
Reconstruct the original signal y_true out of the microphone as best you can using as much of the information available as possible, while also giving best estimates for the length of the wire, the total resistance of the cable, the amplification factor for the pre-amp, and the input impedance of the final A/D.
The "generative" picture here is this (more or less, we're ignoring low-pass filters that are always present on the d/a and a/d steps):
y_meas = round(((round(1024*A*f_pre(y_true)+epsilon_rf)+da_bias_noise)*10* ((rperlen*Len+conn_imp)/(rperlen*Len+inpt_imped+conn_imp))+cable_noise)/10*2^16)/2^16+ad16_noise
And, in addition, we have some of the digital values round(1024*A*f_pre(y_true)+epsilon_rf) let's call the ones we have y_digit, which is a sparse vector and t_digit which tells us which of the y_meas the y_digit correspond to.
So, if you wanted to write this in a generative form, with only data or only parameters on the left hand side of the probabilistic statements, in Stan, what would you use to define the likelihood function for y_meas or y_digit given all the parameters, y_true, epsilon_rf, da_bias_noise... etc?
For the full analog / digital / analog / digital / analog path, You could always invent some extra roundoff error parameters, and write
y_meas - (....) ~ gamma(3,1/0.003);
where (...) = the whole mess where round(x) is replace with x+roundoff_i and roundoff_i is given the appropriate prior. But then you'd be writing a declarative model with a nonlinear function of the data and the parameters (consider the resistances formula) on the left hand side.
Stan doesn't have a shifted gamma, but if it did, you could use the shifted gamma distribution and place the round function in the location parameter, however, you will probably induce bad behavior due to the non-differentiable jumps.
The point of the declarative version is that if you were given the actual values of the parameters other than ad16_noise, then subtracting y_meas - (...) would give you an expression equal to the noise sample, and the likelihood for seeing that sample conditional on all the other parameters, is the gamma distribution.
If you have the wrong values of the parameters, you won't get something that's gamma distributed, so you'll be in the low probability region of the gamma distribution function, so you'll downweight that portion of the parameter space... that's all you need to get inference on the parameters, that you find reasons to reject some of the prior space because it doesn't fit the facts you know.