A Bayesian Understanding of the "Garden of Forking Paths"

2017 April 7
by Daniel Lakeland

Andrew Gelman has discussed and written on his concept of "The Garden Of Forking Paths" in NHST analysis of scientific data.

"Laplace" whose insights I respect a lot has ridiculed the idea, and when put into the terms he uses, I agree with him. However, I don't think that Gelman's point is quite the same as the one Laplace ridicules. So thinking about it, here's how I'd like to proceed to an understanding.

For simplicity we'll analyze the situation in which a researcher collects data D, and then runs a test T to determine whether two subsets A^+(D) and A^-(D) differ in some way that is detectable by the test via a sample statistic S.

First off, consider the various options available to the researcher:

T_i, i \in [1..I]

and

S_j, j \in [1..J]

and

A_k, k \in [1..K]

That is, we can choose which test to use, which statistic to test, and how to subset and exclude certain portions of the data to form the partition (the function A partitions and excludes the data, so that there are two groups).

Now, what is the Bayesian probability that p < 0.05, given our knowledge N? (I use N because I've already used K.) We can factor the joint probability of seeing p < 0.05 together with the data and the analysis choices as:

P(p < 0.05 | D,N,i,j,k) P(D,i,j,k | N)

Suppose in the first case that N contains the information "i,j,k were preregistered choices and D was collected after i,j,k were specified and is independent of the i,j,k". Then P(i,j,k|N) = 1, and P(p < 0.05 | N) is determined entirely by our knowledge in N of the appropriateness of the test and the p values that it outputs, so the expression reduces to:

P(p < 0.05 | D,N) P(D|N)

So, we're still left with all the problems of the use of p values, but we're at least not left with the problems described below.

In the case where N contains the information "I,J,K are all large integers, i,j,k were chosen after seeing D, and the researcher is motivated to get p < 0.05 and probably at least looked at the data, produced some informal graphs, and discussed which analysis to do with colleagues", we're left with the assumption that i,j,k were chosen from among those analyses which seemed, via informal data "peeking", likely to give p < 0.05. So the Bayesian is left with:

P(p < 0.05 | i,j,k,D,N) P(i,j,k|D,N) P(D|N)

Now, due to the peeking that informed the pre-analysis choices, we can safely assume

P(p < 0.05 | i,j,k,D,N) \sim 1

Sure, it might not be exactly 1, but it's much, much bigger than 0.05 (maybe 0.5 or 0.77 or 0.93), and this holds FOR ALL i,j,k that would actually be chosen.

P(i,j,k|D,N,\{i,j,k\}\in G) = O(1)

where G is the reachable subset of the I \times J \times K space called "the garden of forking paths": the subset that a typical researcher would find themselves choosing i,j,k from, and which leads to analyses where P(p < 0.05 | i,j,k,D,N) \sim 1.

So, how much information does p < 0.05 give the Bayesian about the process of interest? In the preregistered case, it at least tells you something like "it is unlikely that a random number generator of the type specified in the null hypothesis test would have generated the data" (not that we usually care, but this could be relevant some of the time).

In the GOFP case, it tells us "these researchers know how to pick analyses that will get them into the GOFP subset so they can get their desired p < 0.05 even without first doing the explicit calculations of p values."

So, using this formalism, we arrive at the idea that it's not so much that GOFP invalidates the p value, it's that it alters the evidentiary value of the p value to a Bayesian.
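Here's a minimal simulation sketch of that point (my own toy example in Python with numpy/scipy, not anything from Gelman's writing): the data are pure noise, but a researcher free to pick among a handful of subsetting rules A_k and tests T_i after seeing the results will report p < 0.05 far more often than 5% of the time. A real garden (informal peeking, flexible exclusions, many more statistics) is far larger than this toy one, which is why I take the probability to be ~1 above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, hits = 2000, 40, 0

for _ in range(n_sims):
    x = rng.normal(size=n)   # covariate used to form the partitions A_k
    y = rng.normal(size=n)   # outcome, independent of x by construction
    pvals = []
    for cut in (np.median(x), np.quantile(x, 0.4), np.quantile(x, 0.6)):  # three choices of A_k
        a, b = y[x > cut], y[x <= cut]
        pvals.append(stats.ttest_ind(a, b).pvalue)      # one choice of test/statistic
        pvals.append(stats.mannwhitneyu(a, b).pvalue)   # another choice
    hits += min(pvals) < 0.05   # the reported path is the one that "worked"

print("fraction of null datasets where some path gives p < 0.05:", hits / n_sims)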

 

Once again on Using the Data Twice

2017 April 6
by Daniel Lakeland

So there's an argument in the blogtwitterosphere about updating your Bayesian model with the data twice.

On the one hand, we have EJ Wagenmakers saying that you must not use the data twice.

And on the other hand, we have "Laplace" saying updating with the data twice is totally fine.

Now, Laplace's math is absolutely correct. But, it's also subtle, because it's purely symbolic. When we build a model, we need to write down specific mathematical expressions for p(Foo | Bar) for all the foos and bars.

Let's see Laplace's equation in detail:

P(A|B,B) = P(B,B|A) P(A) / P(B,B) = P(B|B,A) P(B|A) P(A) / (P(B|B) P(B))

Now, P(B|B,A) = 1 because given B, B has to be true, and the same goes for P(B|B). When you plug those in, you get P(B|A)P(A)/P(B) = P(A|B), so P(A|B,B) = P(A|B).

BUT: when you build a Bayesian mathematical model, you CAN make mistakes, just as you can when you program a numerical integration routine. Suppose instead of P(A|B,B) we calculate P(A|B1,B2), where B2 is a deterministic copy of the data in B1.

Now, if we *remember this fact* correctly, we'll get P(B2|B1) = 1 and P(B2,B1) = P(B1), and we'll get the results above. But if we forget this fact and pretend that B2 is new independent data, we will get the same result as if we had collected twice as much data as we really did and treated it all as separate information. The mistake is as simple as writing something like (in Stan-like pseudocode)

for (i in 1:(2*N)) {
  data[i] ~ normal(foo, bar);
}

instead of

for (i in 1:N) {
  data[i] ~ normal(foo, bar);
}

The second one is correct: the second copy of the data adds no information to the posterior, because the probability of each data value past the Nth is 1 given that we already know data values 1..N.
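To see the difference numerically, here is a small sketch (my own toy conjugate normal-mean model with known noise sd, in Python; the numbers are made up, nothing here is from the argument above): the correct analysis uses the N observations once, while the buggy version that treats a duplicated copy as fresh data reports a posterior standard deviation that is too small by roughly a factor of sqrt(2).

import numpy as np

def posterior(data, mu0=0.0, tau=10.0, sigma=1.0):
    # Conjugate update for mu with prior normal(mu0, tau) and data ~ normal(mu, sigma)
    prec = 1.0 / tau**2 + len(data) / sigma**2
    mean = (mu0 / tau**2 + np.sum(data) / sigma**2) / prec
    return mean, 1.0 / np.sqrt(prec)   # posterior mean and sd

rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=50)

print("correct (use data once):    ", posterior(y))
print("bug (duplicate as new data):", posterior(np.concatenate([y, y])))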

It's *this mistake* which is a bug, is common, and leads to statements along the lines of "only use the data once". The statement "only use the data once" is like the statement "don't use global variables to pass arguments to functions": it's useful advice that reduces the chance of committing an error. It's not a mathematical truth.

 

Suicide Rates in the US

2017 April 4
by Daniel Lakeland

I got this data from CDC Wonder, and I let them do the population rate calculation, so I don't know if they did it right, but let's assume they did. Here are suicide rates for Males and Females ages 15 to 65 across the whole US, by race.

Notice that White rates are higher than Black or Asian rates, and that they've been trending upwards steadily since 2000.

 

A new term

2017 March 27
by Daniel Lakeland

Wansinking (verb, gerund): To do research about an essentially unimportant topic in a sloppy and unprincipled manner, possibly even inventing data, while attracting enormous amounts of credulous popular press coverage and corporate sponsorship for years and years, and dodging criticism by acting or being essentially clueless. cf. Brian Wansink.

 

Call for Book Recommendations on Game Theory

2017 March 9
by Daniel Lakeland

I'd like a book on game theory that is readable and interesting, has some examples that are somewhat real-world, but isn't afraid to use math (i.e. not something written for people who read Malcolm Gladwell, etc.).

I dislike Definition-Theorem-Proof math books. Well, not entirely: I like that stuff fine when I'm reading about abstract math, but I generally find it tedious when the topic should have applied content. What interests me is how the formalism maps to the real world, not excruciating details of what the content of the formalism is.

I loved this book on set theory, and Barenblatt's book Scaling, and Practical Applied Mathematics, and I'd like something at that level if possible. Also helps if it's less than $50 or so.

Hoping someone has something to recommend.

Of particular interest: issues in mechanism design, dynamic / repeated games, games with no stable equilibrium, rent-seeking, games where greedy algorithms fail, etc. I've been thinking a lot about economics problems, and I'd like to get familiar enough with the basic stuff to be able to talk about why certain policies put together result in lousy outcomes without appearing foolish for missing some very basic known results etc.

Also, any thoughts specifically on Steven Tadelis' book?

Bet-Proofness as a property of Confidence Interval Construction Procedures (Not realized intervals)

2017 March 8
by Daniel Lakeland

There was a bunch of discussion over at Andrew Gelman's blog about "bet proof" interpretations of confidence intervals. The relevant paper is here.

Below, in grey, is what I originally wrote. Nope. I was misreading. The actual definition of bet-proofness is also weird. Here it is in essence:

A confidence interval construction procedure is bet proof if for every betting scheme b(x) there exists some value of the parameter \Theta such that this scheme will lose in the long run.

This is kind of the converse of what I was thinking before. But here's my example of why that makes no sense (quoted here from what I wrote on Andrew's blog after I finally figured out what was going on):

"Now, with the above reworded Definition 1, I can see how the game is about revealing X and not revealing Theta. But, I don’t see how it is interesting. Let me give you an example:

Our bet is to sample a random sample of every adult who lives within 2 blocks of me. We will measure their heights X, then you’ll construct a confidence interval CI(X), and I will bet whether after we measure all the rest of them, the population mean Theta will be in the CI.

Now, being a good Bayesian, I use Wald’s theorem and realize that any strategy to decide what my bet will be that doesn’t use a prior and a loss function will be dominated by one that does…. So I’ll go out to Wikipedia and google up info about the population of the US and their heights, and I’ll construct a real world prior and then I’ll place my bet.

Now bet proofness says that because it’s the case that if the actual height of people in a 2 block radius is 1 Million Feet (ie. THERE EXISTS A THETA), my prior will bias my bets in the wrong direction and I will not be able to make money…. that this CI is all good, it’s bet proof.

And how is that relevant to any bet we will actually place?"

The basic principle of bet-proofness was essentially that if a sample of data X comes from an RNG with known distribution D(\Theta) that has some parameter \Theta, then even if you know \Theta exactly, so long as you don't know what the X values will be, you can't make money betting on whether the constructed CI will contain \Theta (the paper writes this in terms of f(\Theta), but the principle is the same since f is a deterministic function).

The part that confused me was that this was then taken to be a property of the individual realized interval: in essence, "because an interval came from a bet-proof procedure, it is a bet-proof realized interval". But this defines a new term, "bet-proof realized interval", which is meaningless when it comes to actual betting. The definition of "bet-proof procedure" explicitly averages over the possible outcomes of the data collection procedure X, but after you've collected X and told everyone what it is, someone who knows \Theta and knows X can calculate exactly whether the confidence interval does or does not contain \Theta, and so they win every bet they make.

So "bet-proof realized confidence interval" is really just a technical term meaning "a realized confidence interval that came from a bet proof procedure" however it doesn't have any content for prediction of bets about that realized interval. The Bayesian with perfect knowledge of \Theta and X and the confidence construction procedure wins every bet! (there's nothing uncertain about these bets).

 

"Person" as a dimension?

2017 March 6
by Daniel Lakeland

Consider things like GDP / capita and how to use them in constructing dimensionless ratios. Now, a person is not an infinitely divisible thing. It's pretty much meaningless to talk about 2.8553 people. Just like molecules, "people" is a count, a dimensionless integer. But then you can also think about the "mole" in SI units. It has all the qualities that normally make up an arbitrary dimensional unit: you can subdivide it essentially continuously (at least to 23 decimal places). And that's the key thing about dimensions. The symmetry property behind dimensional units is essentially that if you define a unit of measure of something, say the Foo, and someone else defines the Bar, and 1 Bar is equal to x Foos, then you can turn all your equations in Bar units into equations in Foo units by multiplying by x Foos/Bar wherever you have something measured in Bars.

This is closely related to the renormalization group. However, it breaks down when the thing you are measuring is not infinitely divisible (or approximately so).

So, in the asymptotic regime where you are describing large aggregates of people (countries, states, very large corporations, etc.) you can calculate statistics in terms of "per capita" and treat capita as if it were an infinitely divisible thing, and therefore as if it represents a "dimension". However, in the asymptotic regime where you are discussing a small number of people, you will always use an integer (1, 2, 3, 4 people, etc.), and so this should be treated as simply a dimensionless count that does not enter into the dimensional calculation.

So, when I calculated Total Wages Paid / Total Hours Worked / Market Cap Of Stocks * 1 hr, the dimensions of this are Dollars/Hours / Dollars * Hours = 1 = dimensionless, and this is implicitly because we're discussing what fraction of the total market a single person could buy if they received an average amount of money for their hour of work.
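Written as a display, that's just restating the dimensional bookkeeping above:

\frac{[\$]/[\mathrm{hr}]}{[\$]}\times[\mathrm{hr}] \;=\; 1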

On the other hand, when thinking about the aggregate, I calculated

(GDP/capita * Fraction Of GDP To wages ) / (GDP/Capita * Market Cap as Fraction of GDP * 1 Yr)

We are now explicitly comparing averages across a large pool of people, and we might be interested in how "capita" changes in time in both cases. "Capita" changes in a near-continuous manner because, although it's an integer, it's an integer like 325,000,000: it has 9 significant figures, and we can ignore the incremental person's discrete effect.

Whether to treat Capita or People as a dimension comes down, more or less, to a choice of how you are thinking about the problem. When the number of people involved is large and explicitly considered in a ratio, such as in the calculation of GDP/Capita, treating "Capita" as a dimension that needs to cancel makes sense. When the number of people is small and a specific integer, such as "for a family of 3 the cost to purchase food for a month is X Dollars/Month", it would make sense to treat the number of people as a dimensionless count and not calculate something like X/3 Dollars/Month/Person. So if you want to ask about the relative cost of feeding 3 people per month vs watering a lawn at cost Y dollars/month, you can say X/Y is dimensionless, ignoring the fact that a family of 3 people is involved.

The most useful case for treating people as a dimension is when there is a natural linear relationship between the number of people and the thing of interest. For example, for adults, feeding 3 of them requires about 3 times the mass of food as feeding one. But because of economies of scale related to cooking and shopping costs, there is no real reason to think that it requires 3 times the dollar cost (for home-cooked meals). So Dollars/Person is not a meaningful statistic in the asymptotic small-person-count regime. On the other hand, for a whole Battalion of troops, it would be. This is directly related to the "unit conversion symmetry" property I described in the first paragraph. If you define a Battalion of troops as 7000 and a different country defines it as 5300, then you can convert the other country's equations to your units by multiplying their Battalion numbers by 5300 and dividing by 7000, and the assumed linearity of the relationships in the asymptotic limit gives these discrete counts approximately the linear-scaling properties of an arbitrary dimensional unit.

Personal vs Macro Dimensionless Ratios

2017 March 3
by Daniel Lakeland

Here's a different version of a graph I posted yesterday:

 

This graph is (GDP/capita) * (Wages As Percentage Of GDP/100) / (GDP/capita * Stock Market Capitalization as Percentage of GDP/100) = Wages as Percent / Stock Market Cap as Percent.

Note also that the per capita cancels on the top and the bottom.

It looks like GDP cancels, but note that this is subtler than it appears. Conceptually the dimensions of this number are actually 1/[Time], since GDP is not a dollar measure but an INCOME measure (Dollars/Time). When they calculate "Stock Market Cap as Percent" they're really calculating "Stock Market Cap divided by the GDP income rate, times 1 year", so the denominator is really a dollar stock, while the wages in the numerator are genuinely an income rate in Dollars/Time. The overall dimensions are therefore 1/[Time], with units of 1/Year.
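Written out, with GDP and total wages as income rates (Dollars/Time) and market capitalization as a stock (Dollars), the numerator and denominator of the plotted ratio carry the dimensions:

\frac{[\$/\mathrm{T}]\times\left([\$/\mathrm{T}]\,/\,[\$/\mathrm{T}]\right)}{[\$/\mathrm{T}]\times\left([\$]\,/\,[\$/\mathrm{T}]\right)} \;=\; \frac{[\$/\mathrm{T}]}{[\$]} \;=\; \frac{1}{[\mathrm{T}]}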

So, this graph shows you, if you saved *all* your wages for a year, how big a fraction of the stock market you could buy relative to "the share that everyone would have if the stock market were equally divided among all people".

In 1975 you could buy a little more than "your equal share" (I won't call it "fair", it's just equal). These days you can buy a little less than 30% of an equal share.

Here's another relevant question. How much would an epsilon share of the stock market buy you in consumer goods if you sold it?

This is (stock market capitalization as a fraction of GDP * GDP/capita) / CPI * epsilon where epsilon is a sufficiently small dimensionless number that the results are in dollar amounts you might carry in your wallet.

In 1988 your per capita share of the stock market would buy you consumer goods equivalent to $25 today (3 chicken sandwiches?), whereas by 2000 it would buy you what $80 would buy you today (a sushi dinner for 3?), and it's been on a wild ride ever since. On the other hand, if you look at growth rates prior to 1990, you can see that the extension of those growth rates would put the stock market at the bottom of the 2008 crash about at the "right" level. Hmm... Certainly we do things more efficiently now that we have computers, so I could see an increased growth rate being reasonable, but pretty obviously given the periodic oscillations, there's way too much optimism.

 

Some Particular Dimensionless Ratios in the Economy

2017 March 2
by Daniel Lakeland

The Federal Reserve Economic Data (FRED) website is a great way to easily play around with measures of the economy and create new dimensionless ratios that are relevant to various interests. They have a system where you can add various datasets to a graph and construct a new measure via a formula. So, for example, here is a measure of the consumer-goods purchasing power of an hour of average labor:

Total Payroll / Total Hours Worked / CPI rescaled to today's level

When multiplied by 1 hr of marginal extra work, this becomes a quantity in Dollars, so it's a dimensionless ratio multiplied by today's prevailing wage (about $30/hr). You might call this "Real Wages" if you were an economist, but that would be wrong because there's no such thing. This graph shows that the amount of consumer goods you can consume for an hour of wages has, after stagnating for about 20 years between 1970 and 1990, finally gone up between 1995 and 2010. One suspects this is due to a combination of the Internet improving logistics and Chinese manufacturing reducing costs of production.
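As a sketch of the arithmetic behind that kind of graph (entirely synthetic numbers here, since the actual FRED series and the graph itself aren't reproduced in this post): the plotted quantity is payroll per hour deflated by the CPI, then rescaled so the most recent value lands at roughly today's prevailing wage.

import numpy as np
import pandas as pd

years = np.arange(1970, 2016)
t = years - 1970
payroll = pd.Series(3.0e12 * 1.06**t, index=years)   # total payroll, dollars/yr (made up)
hours   = pd.Series(2.0e11 * 1.015**t, index=years)  # total hours worked, hr/yr (made up)
cpi     = pd.Series(40.0 * 1.04**t, index=years)     # CPI index level (made up)

ratio = payroll / hours / cpi               # wage per hour deflated by the CPI's arbitrary base
rescaled = ratio / ratio.iloc[-1] * 30.0    # anchor the last point at about $30/hr

print(rescaled.loc[[1970, 1990, 2015]])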

But, how about this graph?

Total Payroll / Total Hours Worked / Total Stock Market Capitalization (red)

Total Payroll / Total Hours Worked / Total M2 Money Supply (blue)

These are rescaled to match approximately the $30/hr prevailing wage as of 2015, so they are dimensionless ratios multiplied by a modern $30/hr wage. They show essentially how big a piece of the money supply (blue) or how big a piece of the stock market (red) your marginal hour of work buys. Because I've rescaled to match the $30/hr of today, you can see that a marginal hour of work in 1975 bought you as much stock as if you were paid something like $160/hr in 2015, and as big a fraction of the money supply as if you were paid $60/hr in 2015.

So which is it? Has an average wage gone up from $25/hr in 1995 to $30/hr in 2015 (first graph)? Or has it gone down from $160/hr in 1975 to $30/hr in 2015? The answer is BOTH. Relative to the consumer goods people choose to buy, they are able to buy more today, but the goods they buy change through time (the CPI is not a fixed basket). So perhaps people have funneled their money towards less expensive items. Relative to investments, wages have gone WAY down (red, 2nd graph). And, as you might expect, you'll see that when you can't get much for your dollar in the stock market, savings declined dramatically:

Here's the savings rate as a percent of income, which is already dimensionless. Sure enough, in 1975, when a marginal hour of work bought you what $160 would buy you today in the stock market, savings rates were around 12.5% of income. Today, they're around 5%.

How big was the investment purchasing power of savings in the 70's compared to today's?

\frac{0.125\times 160}{0.05\times 30}= 13.3

The investment part of a 1975 marginal hour of work bought you 13 times as much investment as the investment part of today's marginal hour of work. YIKES!

There are many dimensionless ratios in the economy that are important to consumers, and others that are important to producers, or investors. It's a mistake to think in terms of things like "Real Dollars" as if it were possible to simply rescale things "for inflation" with one measure like the CPI, and recapture the past in modern money quantities.

 

A good idea for an ACA improvement

2017 March 1
by Daniel Lakeland

You know what would be a good idea to improve on the ACA? Require any company selling health insurance to offer a plan that costs exactly 1% of GDP/capita per year in premiums for a 40 year old, Male or Female. Allow them to use a 3rd-order polynomial in age to adjust the premiums for ages between 0 and 110, requiring that the coverage level be the same for all ages but letting the premium adjust so that income minus expenses has minimal variation across age, subject to the 40-year-old anchor point. Let them compete on what coverage they can offer for that premium level, with no specific coverage requirement whatsoever except that there must be some annual out-of-pocket maximum.

What would this do? It'd make available a real insurance policy: one where you pay an affordable premium, get protection from extreme events, and participate in pre-negotiated pricing. You'd wind up paying more out of pocket for services than on plusher policies, but if enough people actually did that, there would be downward pressure on health care prices, so that would be a good thing to some extent. Furthermore, it would make the individual mandate make sense for everyone. 1% of GDP/capita is today around $560/yr, or $47/mo, which is more or less the cost of 5 meals per month.

Finally, what kind of out-of-pocket max would it be possible to offer? I did a little searching on hospitalization costs a while back and found order-of-magnitude estimates that a typical hospital stay costs something like $15,000 and that something like 10% of people have a hospitalization... If pay-in equals pay-out on average (just as an order-of-magnitude estimate), then 560 + 0.1*X = 0.1*(15000 - X), and X = 4700. So you'd expect that insurance companies could offer you somewhere between $5k and $10k ceilings on annual expenses for 1% of GDP/capita.
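For the record, solving that break-even equation as written:

560 + 0.1X = 0.1(15000 - X) \;\Rightarrow\; 0.2X = 1500 - 560 = 940 \;\Rightarrow\; X = 4700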

EDIT: There was an error in the equation for the ceiling, but it's fixed; the basic conclusion didn't change.