Statics methods for MC ensembles

2017 September 5
by Daniel Lakeland

Recently I've been working on problems for which Stan struggles. The typical situation is that I have a large model with many parameters, usually most of the parameters represent some measurement for an individual item. For example a Public Use Microdata Area in the Census, or the behavior of some individual who has been surveyed, or the expression level of some gene, or what have you. And then there are a small number of parameter that describe the group-level properties of these individual measurement taken as an ensemble (a hierarchical model)

The symptoms of this struggling is that Stan adapts its step size very small, uses long trajectories, maxes out its treedepth, takes many iterations to get into the typical set, and then generally takes a long time to sample, and probably has divergences anyway. Often I kill the sampler before it even really adapts because I see the stepsize has crashed down to 1e-7 or the like and each iteration is taking tens of minutes to an hour.

Now, when we have better diagnostics for divergences, perhaps this kind of problem will be remedied somewhat. But in the mean time, solving dynamics problems is known to be hard, a lot harder typically than solving statics problems. For example, if you want to put electrons on the surface of a sphere in a configuration that minimizes total potential energy, you can do so a lot easier computationally than you can solving the dynamics of those electrons if you give each one of them a velocity constrained to the sphere... Solving the dynamics problem requires you to take small timesteps and move all the electrons along complicated curves. Solving the statics problem just requires you to simultaneously perturb all the positions in an energy minimization way.

So, one thought I had was to solve an optimization problem on ensembles.

Here's one way to think about it: suppose p(x) is a probability density on a high dimensional space x. We want to create a sample of say N points in the X space, let N = 100 or so for example.

One thing we know is that if \{x_i\} is any set of points in the X space then for K a nonstandard number of steps \{x_i'\} = \mathrm{RWMH}(x_i,K) is an ensemble in equilibrium with this p(x) distribution. Where RWMH(x,K) means K random walk metropolis hastings updates from each starting point x_i.

Furthermore, once in equilibrium, it stays in equilibrium even for K=1. Of course for K=1 it is possible that each point is rejected in the MCMC update step, and so the ensemble is also unchanged. But as both the size of the ensemble increases and K increases, this non-movement becomes dramatically improbable. In particular, we can choose something like an isotropic gaussian kernel with sufficiently small standard deviation that we get something like 10% chance of acceptance in each point, and with 100 points, the probability that the ensemble is unmoved after 1 update per point is 0.9^100 ~ 1e-5. Of course, the distance moved may not be very large, but we will have some movement of at least some of the points. If we make K = 10 or so, we'll get some movement of most of the points.

However, we want to guarantee that we do get large movement relative to our usually terrible starting point. One way to get movement is to just continue moving each particle along a line:

For each particle:

x(t) = x(0) + t (RWMH(x(0),K) - x(0))

When mapped across the ensemble, this can be thought of as transport of the ensemble in the direction of an acceptable random perturbation (acceptable because the RWMH accepted that update).

How far should we allow this transport to go? Obviously if t goes to infinity, some of the particles go to infinite distances and since our probability density is normalizable this means some of the particles go well outside the high probability region.

One particular expectation that is of interest is the ensemble entropy:

\frac{1}{N} \sum_{i=1}^N -\log(p(x_i))p(x_i)

This has the advantage that the function f(x) = -\log(p(x)) has domain equal to the domain of the density, and is therefore sensitive to all the regions of space. Furthermore, because p(x) goes to zero outside some ball, the function -\log(p(x)) is positive and so as t increases the entropy of the ensemble eventually increases.

The result is, that if x is out of equilibrium, initially as a function of increasing t the ensemble entropy should first decrease (because the RWMH accepts higher probability transitions and many of the transitions will be towards higher probability) then after a while, it should become somewhat constant, then as time goes on it has to increase again. The region of space near the entropy minimum is what we want. Yet, we don't want to just send all the points to the mode... that would be the lowest entropy ensemble possible.

This whole concept is related to concentration of measure. The entropy of an ensemble of 100 randomly chosen iid points is a function of 100 random variables and is therefore more or less constant under different realizations. The region of constant entropy is called the typical set and contains essentially 100% of the probability mass.

Now, suppose that \{x_i\} starts in equilibrium. Then, as t increases, ensemble entropy may decrease or increase initially, but the change will not be dramatic, and as t increases through time eventually entropy increases again. Since p(x) is typically smooth, the ensemble entropy as a function of t is also smooth. If it comes to a minimum somewhere, then in the vicinity of that minimum it acts like a parabola. How far should we go? Suppose that the ensemble entropy has a central limit type theorem such that when the ensemble size is large enough the ensemble entropy is distributed according to Normal(e,s(N)) for e the actual entropy of the distribution, and s some decreasing function of ensemble size N. This manifests as the trajectory having a parabolic minimum entropy. If we sample t with weights according to \exp(-e(x(t))) where e is the ensemble entropy then we should come into equilibrium (intuitively, I have no idea if this works in reality). Note that initially we are going to always go forward in time, but at equilibrium we need to project t both forward and backward.

Some questions: why doesn't this collapse to finding the mode?

The reason is that at equilibrium each individual random walk metropolis step goes either up or down in the distribution equally, so some points are heading into the tail, and others into the core. If we get close to the mode, the RWMH has more ways to go away from the mode and so we will get directions that mostly point away from it. If we get too far from the mode, RWMH will tend to ratchet upwards and give us a direction back into the typical set.

Is this any better than random walk?

I don't know, intuitively we're using a random walk to get a random perturbation direction that at equilibrium leaves the ensemble in equilibrium. So in the direction of this perturbation, we're moving along the typical set manifold. We do so in a way that more or less equally weights all points that we visited which were in the typical set (ie. if the entropy is constant along our curve, then we go to any of those points equally likely). It seems likely that this devolves to a random walk when the typical set is highly curved, and works more like HMC when the typical set is elongated. Now, Stan detects when there is a lot of curvature and drops the step size, but it then keeps the same step size for all the trajectories. so the computation required is based on the worst case. Perhaps this other technique for following the typical set as an ensemble is more robust to allowing larger step size when it's warranted? Also, it doesn't require calculating gradients. In some sense, the RWMH step finds us a direction which is pointed along the typical set, and so has zero ensemble entropy gradient on average.

The nice thing is, I have been working on some example problems in Julia, so I think I have most of the requirements in place to test this on some example problems. I'll report back.


On Morality of Real World Decisions, and Frequentist Principles

2017 June 27
by Daniel Lakeland

If you want to make decisions, such as choosing a particular point estimate of a parameter, or deciding whether to give a drug, or whatever. And you want your decision making rule to have the *Frequency related* (Frequentist) property that under repeated application it will on average have small "badness" (or larger "goodness") then you should look for your procedure within the class of procedures mathematically proven to have unbeatable frequency of badness properties. This class is the Bayesian decision rules (see Wald's Theorem and the Wiki). The boundary of the class is the Bayesian decision rules with flat priors, but we know more, we know that the frequency properties of Rule A will be better than Rule B whenever a region around the real parameter is higher in the prior probability distribution under A than under B. Then we are giving more weight to the actual correct value and so our decision is based more on what will turn out to happen.

Unfortunately we don't know the correct value exactly, but in the case where we use flat priors, they are equivalent to a prior that places 100% probability on the value being infinitely large

Now, if you, as a person who cares about the frequency properties of procedures, agree that your parameter *is* a real number, and you have any idea at all what the magnitude of that number is, say it's logarithm rounded to the nearest integer is N, then by choosing a proper prior that has normal(0,10^(N+100)) you will have lower Frequency risk of making "bad" decisions than if you used a flat prior, of course, you can do better, normal(0,10^(N+1)) will do better...

Now, decision making and morality are inevitably entertwined. Consider the existence of the "trolly problem" in moral philosophy, it's all about making choices each of which have bad consequences, but we have to make a choice, including the choice of "do nothing" which also has bad consequences. On the other hand, if you have no choice, there is no morality associated with your action. Getting hit by a drunk driver who crashes through the wall of your bedroom while you're sleeping is not a moral failing on your part for example.

But, if you have a choice of how to make real important, *CLINICAL* decisions about people's lives, and health, and societal health through engineering civil structures and the like, and you care about the frequency with which you do things that have bad consequences that you can't forsee exactly, and you *don't* make your decision by choosing a method that is better than your likelihood + flat prior + point estimate based on Mean Squared Error because you refuse to use a prior on some kind of principle, or you refuse to consider real world consequences other than mean squared error on some kind of principle, then in my opinion your principle is immoral, in the same way as prescribing a toxic drug on the principle that "I get a cut of the proceeds" is immoral.

If you make the decision because you don't know any better... then you're like the guy in the bed who gets hit by the car. But if you write books on statistics from a Frequentist perspective, and you fail to teach the Complete Class result, and you fail to emphasize the fact that you have a choice in what measure you will use in deciding on your decisions (such as the choice between Mean Squared Error in your estimate of the parameter value vs Quality Adjusted Live Years Saved of your clinical decision) then I think you're doing evil work in the same way that a person who teaches a Civil Engineering design rule that has been proven to be wrong and risk people's lives is doing evil work.

So, I do get a little worked up over this issue. Remember I have background in Civil Engineering and I work with my wife who is a research biologist at a medical school. None of this is abstract to me, it's all real world "you shouldn't do that because it's wrong/bad for society/evil/it hurts people more often"

To back up a bit though: I don't think it's evil to care about the frequency properties of your procedures. I think it's evil to *fail to care* about the real world consequences of your decisions.

From the perspective of making decisions about things, such as point estimates of parameters, or clinical decisions about treatments, being a Frequentist (meaning, trying to reduce the average badness of outcomes under repeated trials) actually *entails* doing Bayesian Decision Theory. The Frequentist principle "try to reduce the average severity of bad outcomes" implies "Do Bayesian Decision Theory".


On the perversity of "likelihoodism" or Bayesian with Flat Priors

2017 June 26
by Daniel Lakeland

"Classical" statistics is a mishmash of various heuristic ideas. Frequentism is specifically about the Frequency with which repeated application of procedures would cause something to happen if the distributional assumptions used were really true. Some Frequentist procedures are not based on Likelihoods, and some are. When they are, they are based on Likelihoods but not priors.

We can reinterpret an interval constructed from a Likelihood with no prior, as a Bayesian calculations with nonstandard priors uniform(-N,N) for N a nonstandard integer (in IST nonstandard analysis for example).

Now, let's examine the prior. What is the prior probability that the parameter will be limited? Let {[a,b]} be any finite set of intervals with standard endpoints. The prior density for uniform(-N,N) is 1/2N, so the prior probability for any interval in the set is (b-a)/2N which is infinitesimal because N is nonstandard. This is true for all finite sets of standard intervals, so by the Idealization principle, it's true for all standard intervals.

So, classical interval inference using Likelihoods alone, are equivalent to Bayesian intervals where the prior essentially dogmatically asserts that the parameter is either minus infinity or infinity. Only by renormalization via division by an infinitesimal after seeing the data, do we get standard posterior probabilities.

P(Param | Data, Model) = P(Data | Param, Model) P(Param | Model) / P(Data|Model)

Since P(Param | Model) puts infinitesimal probability on limited parameters, and the parameter *is* limited, then P(Data | Model) is infinitesimal as well. In other words, Likelihood only inference asserts that it's virtually impossible to get any data that isn't infinite.


On models of the stopping process, informativeness and uninformativeness

2017 June 24
by Daniel Lakeland

I had a great conversation with Carlos Ungil over at Andrew Gelman's blog where we delved deep into some confusions about stopping rules and their relevance to Bayesian Inference. So here I'm going to try to lay out what it is I discovered through that conversation, and I'm going to do it in the context of Cox/Jaynes probability with explicit background knowledge, translating from what I consider the much more nebulous terms like "deterministic" or "random".

Standard Textbook Stopping Problem:

First off. Let's talk about the "standard textbook" type stopping rule: "Flip a bernoulli coin with a constant p until you see 3 heads in a row and then stop" Suppose you get 8 observations HTTHTHHH. Now, the rule is completely known to you, and so you can determine immediately by looking at the data whether a person following the rule will stop or not. If someone hands you this data and say "given what you know about the rule, will they stop?" you will say P( STOP_HERE | Data, Rule) = 1. *for you* the rule is deterministic thanks to the textbook description. Under this scenario, knowing that they stopped at N=8 does not add anything that you didn't already know. Therefore it can't add any information to the inference. Basically

P(Parameters | Data, Rule, STOP) = P(Parameters | Data,Rule)

Standard Real World Stopping problem:

In the real world, the reason why people stop is rarely so clearly known to you. The experimenter tells you "Our collaborators ran a few trials and read the protocols from another lab, and tried this thing out, and it seemed to work, and so we tried it in our lab, and after collecting 8 samples we saw that the results were consistent with what we'd been told to expect, and so we stopped."

Or a survey sample of a population comes with a dataset of answers from 120 people, and a description of the protocol: "we ran a small preliminary study to test out our questions, and found in the preliminary study that the phrasing of the questions didn't seem to matter, and so we used the observed variability in this previous study to determine that a sample of 120 people should give a cost effective dataset for answering the questions we asked, we then sampled a fixed 120 people" but... because the preliminary study was based on a mixture of different questions and soforth... it's not included in the data set. Here, the information gleaned in the earlier trial study is expressed only through the choice of 120 people. The "fixed deterministic" rule "sample 120 people" is informative for a bigger model in which you include the earlier steps.

Or a slightly more sophisticated version: "we sampled until our Bayesian model for the probability of parameter q being in the range 0-0.1 was less than 0.01, ie. p(q < 0.1) < 0.01" To the people running the study, a Bayesian posterior is a deterministic function of the data and the model. Everyone who has the same model always calculates the same posterior from the given data. But note, *you* don't know what their Bayesian model was, either priors or likelihood.

Deterministic vs Random stopping rule vs Cox/Jaynes restatement

In the literature, a stopping rule is called "informative" if it is "a random stopping rule that is probabilistically dependent on a parameter of interest". I personally think this is a terrible definition, because "random" is usually a meaningless word. It gave me a lot of trouble in my conversation with Carlos, because when I think random I pretty much exclusively use that in the context of generated with a random number generator... but that's not what is meant here. So let's rephrase it in Cox/Jaynes probability terminology.

A stopping rule is informative to you if given your background knowledge, and the data up to the stopping point, you can not determine with certainty that the experiment would stop, and there is some parameter in your model which would cause you to assign different probabilities to stopping at this point for different values of that parameter.

Under this scenario, the *fact of stopping* is itself data (since it can't be inferred with 100% accuracy just from the other data).

Now in particular, notice that the informativeness of a stopping rule depends on the background data. This should be no surprise. To a person who doesn't know the seed of a random number generator runif(1) returns a "random" number, whereas to a person who knows the seed, the entire future stream of calls to runif is known with 100% certainty. It's best to not use the term random and replace it with "uncertain" or better yet "uncertain given the knowledge we have".

What can you usually infer from the stopping rule?

If you're in a situation where the stopping rule is uncertain to you, then if you believe it is helpful for your modeling purposes, you can add into your model of the data a model for the stopping rule (or a model for the choice of the "fixed" sample size). This is particularly of interest in real world rules along the lines of "my biologist friends told me what I should be getting from this experiment so I tried about 8 experiments and they kind of seemed to be consistent with what I was told, so I stopped". The actual rule for whether the experimenter would stop is very nebulous, but the fact of stopping might tell you something you could use to describe distributions over relevant real-world parameters. For example, suppose there's an issue with the way a drug is delivered that can cause toxicity if you do it "wrong". The fact that the biologist stopped after 8 experiments suggests that they believe p(DetectToxicity | DoingItWrong) is near 1, so that if you haven't seen it in 8 tries then you are virtually certain you are doing it "right".

So, eliciting information about the stopping rule is very useful because it can show you that there are potentially parameters you need to include in your model for which the fact of stopping informs those parameters, and particularly, parameters *that describe the nebulous uncertain rule*.

In the example above about sampling until a Bayesian posterior distribution excluded the range 0-0.1 with 99% probability, if someone tells you exactly the description of the Bayesian model, then if you plug in the data, you will immediately know whether the rule said stop or continue. But, if you know what the likelihood function was, but not the prior, then you could potentially infer something about the prior that was used from the fact that the sampling stopped at this point. If you think that prior was based on real background information, this lets you infer some of that real background info.


To a Cox/Jaynes Bayesian there is no such thing as "random" only "uncertain given the background information". A stopping rule can teach you something about your parameters precisely when your model doesn't predict the fact of stopping with probability = 1 *and* your model has a parameter which affects the probability of stopping conditional on data in other words, p(STOP_HERE | Data_so_Far, Background, Params) is a non-constant function of Params

Then, given the data, the fact that the experiment stopped is additional information about some of the Params

Stopping rules are not irrelevant to Bayesian models, but they are only relevant in those circumstances, if you feel that the stopping rule seems vague or the reasons for the choice of the sample size seem based on some information that you're not privy to, then you might want to invent some parameters that might help you explain the connection between the fact of stopping, and the things you want to know such as "hidden" prior probabilities in use by the experimenter that inform their stopping.


Once more about Stan and Jacobian warnings

2017 May 30
by Daniel Lakeland

In Stan, if you have an expression that looks like

foo ~ distrib(a,b,c);

Where foo is either a function, or a transformed parameter variable, and distrib is some distribution, Stan will complain about you needing to add a log-Jacobian to the target.

Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If so, you need to call target += with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:

Now, if foo is directly a transformed single parameter, then in fact, Stan is correct. But if foo is the result of applying a function to several parameters together, collapsing an N dimensional space down to 1, then there does not exist a Jacobian

for example:

exp(q/r) ~ distrib(a,b,c)

where q and r are both parameters. There is one function, but two parameters, q,r. So the Jacobian matrix is 1x2 and has no determinant. How can you understand what this means?

Consider the prior on q,r perhaps you've said
q ~ normal(qguess, qerr);
r ~ normal(rguess, rerr);

and you thought to yourself "Ok there's the prior for q and r", but then you wrote down:

exp(q/r) ~ distrib(a,b,c)

and thought to yourself what? Something like "there's the prior for exp(q/r)" which is kind of true. But exp(q/r) is not a parameter, it's a deterministic function of two parameters q,r. On the other hand, we're multiplying them all together, so the actual prior for q,r is

normal_pdf(q,qguess,qerr) * normal_pdf(r,rguess,rerr) * distrib_pdf(exp(q/r),a,b,c)/Z

that is, "exp(q/r) ~ distrib(a,b,c)" declares a weighting function on values of q,r which further downweights those regions of q,r space that cause exp(q/r) to fall outside the high density region of distrib(a,b,c) (and you need the Z to re-normalize after the reweighting, but note that in MCMC you typically don't need the normalization constant so you're ok there).

This "declarative" technique produces an intentional "tilt" on the prior for q,r that helps you create a more complex and nuanced prior. In particular, when there are various ways to get your function to be near the high probability region for distrib, you'll wind up with a non-independent prior over the parameters q,r you might have a ridge along some curve, or several modes or whatever. This is good when you really do have information that q,r should lead your function to be in the high probability region of distrib.

To generate from this model, you'd need to first generate according to the simple prior, and then reject on the basis of the weighting function, so for example

q = normal_rng(qguess,qerr);
r = normal_rng(rguess,rerr);
f = exp(q/r)
w = distrib_pdf(f,a,b,c)/distrib_pdf_max_pdf_value
p = runif(0,1);
if(p < w) accept else reject

The Reality of actual average taxes paid

2017 May 28
by Daniel Lakeland

According to IRS data, the following graph summarizes the tax situation in the US. It compares actual average taxes paid by AGI categories (points) to a linear tax plus a $500/mo UBI to all adults.

This graph shows the range all the way out to 10M/yr income, here's the range where most of us actually are:

Finally, here's the number of people in these tax brackets:

You can see, there are tons and tons of people down in the region near basically zero income, the brackets there are very closely spaced. Then families who work and earn in the range 40k to 150k are the vast majority.  Overall, there are a trivial number of people out past the ~$300,000 top bracket shown here, especially out past a $1M or more. Taxes on the ultra-rich just don't matter for revenue, because there are too few of them to make a difference. (But they matter a lot for behavior, because ultra rich people will dodge those taxes and distort economic efficient allocation of vast amounts of business resources).

Note that this compares a 35% flat tax + UBI to current. But to be deficit growth neutral, we really only need around 28% flat tax + UBI. and then use the UBI to replace many separate inefficient government programs.


WTF Economics: shifting curves UP/DOWN or LEFT/RIGHT?

2017 May 24
by Daniel Lakeland

Further evidence that I don't think like an Economist. From Mankiw's Macroecon undergrad textbook (and pretty much everywhere else you look on the web etc where they're trying to describe supply and demand curve models):

Now. In my world, if someone offers me something such as say Pears, at a particular price, I look at the price and decide my quantity I will buy. So Qd = D(P) and if I grow pears and have to harvest them and transport them to a market, then I look at the price they are selling at at the market, and decide if I'm going to pay the money to harvest and transport, so my market supplied quantity Qs = S(P)

In other words, price is the input variable, and quantity is the output variable, and so I expect all the graphs to have P on the bottom horizontal X axis and Q on the left vertical Y axis.

The fact that it's intuitive to econ professors to flip these axes... is I think very weird. When I talk with economists I'd want to say "demand increases so the demand curve shifts up", and they expect me to say "demand increases so the demand curve shifts right"

It may be the convention, but you're not going to convince me that it is a good convention to go against everything that will be learned in every other math related field.

Probably too late to fix. Sigh


Some Notes for Phil on Modeling Housing Markets

2017 May 21
by Daniel Lakeland

Given all the stuff being discussed on Gelman's blog about the housing market, here's some ideas I had about how one might formulate a continuous version of the model (rather than say an agent based version).

First, let's talk about how many houses are being rented and how many people are renting them, as a function of price. Let's posit an important component of the model, F(t,r) which is the frequency distribution of houses renting for price r at time t. Now let's also posit some variables N(t) for the total number of houses, B(t,r) the average number of Bedrooms per house renting for r, and C(t,r) the average number of people per bedroom in houses renting for r (C is for crowding). Then we can describe the relevant population of the area as:

 \mathrm{Pop}(t) = \int_0^\infty N(t)F(t,r)C(t,r)B(t,r) dr

Now F(t,r) can change in several ways. First of all there can be some turnover. When there's turnover of a house currently renting for rent r_1 and it starts renting for r_2 then we have:

\frac{d}{dt}(N(t) F(t,r_1)) = -\int N(t) F(t,r_1)To(t,r_1,r_2) dr_2 + \int N(t)F(t,r_2)To(t,r_2,r_1)dr_2 + \frac{dN(t,r_1)}{dt}

An integro-differential equation, with a kernel To(t,r_1,r_2) that describes the turnover transition of an existing apartment between different rents, and a function \frac{dN(t,r)}{dt} that describes how fast we're constructing housing and renting it out at price r, each changes in time.

Now, we'll imagine that the number of units actually on the market for a day or whatever is negligible (we don't have say 15 percent of all units empty ever for example). So we can talk about a turnover rate v(t,r) for volume of transactions at time t, price r. This defines a kind of spot price distribution. There's no one spot price, because there's no "universal apartment". Different locations and different qualities, and different sizes produce different market rental spot prices. But if v(t,r) > 0 then at time t, some volume is available at price r.

There is a relationship between the kernel To(t,r_1,r_2) and the transaction volume v(t,r), namely that the total number of apartments transitioning from anywhere into price r, plus the new sales, is the volume:

v(t,r) = \int N(t) F(t,r^*) To(t,r^*,r)dr^* + \frac{dN(t,r)}{dt}

and also, there is a relationship between N(t) and \frac{dN(t,r)}{dt} namely:

\frac{dN(t)}{dt} = \int \frac{dN(t,r)}{dt} dr

This more or less describes the dynamics of the supply, at least, if you posit specific forms for the F(0,r) function, and some dynamics of the transition kernels and the building rate and soforth.

But, in order to get information about the transition kernels and the volume, etc, you need to have some information about demand, and for this you need to posit D(t,r) a demand for an apartment (or a bedroom?) at a given rent r. And this describes the reservoir of people who are waiting in the wings to scoop up an apartment if one becomes available at a given price. We should probably split it out as a component from inside our region, and from outside our region. From inside our region will help us determine how often people are leaving their apartments.

It's clear that the demand can be a function of the character of the population, as Phil says you can get more rich people demanding services and then increasing the demand and the lower end of r. So, there is feedback between statistics of the population, and the demand.

Now, I think we have a structure that we can start to build on, specific functional forms could be hypothesized and dynamics could be observed through solving integro-differential equations, perhaps by discretizing the system and doing sums. Probably we need a bit more than this, this isn't quite a complete closed model, but it's a good start for a structure if you wanted to go after this with a continuous statistical approach.


Injecting Data into the SF Rent discussion

2017 May 17
by Daniel Lakeland

There was a massive shitstorm on Andrew Gelman's blog about SF rents, which are of course a hot button topic. Since I have the complete ACS microdata, I figured I could inject some data into the discussion at least. So here it is, the complete distribution of rents for occupied units each year in all of the SF related Census PUMAs... In total about 8328 household observations across 2010-2015


Click to see full size

Tabular values from each year:

  YEAR   MedRent
1 2010 1.0000000
2 2011 0.9565217
3 2012 1.2173913
4 2013 1.3043478
5 2014 1.3913043
6 2015 1.3913043

Note, the above was the original table I published, but it is affected by the fact that the Census discretizes the rent. Adding a normal perturbation to the normalized rent with sd = 0.01 creates a smoothed distribution, and then you can get better estimates on the median and mean of the real distribution:

  YEAR   MedRent  MeanRent
1 2010 0.9959745 1.0020145
2 2011 0.9621274 0.9797048
3 2012 1.2243687 1.2844332
4 2013 1.3043896 1.3762188
5 2014 1.3854716 1.4446855
6 2015 1.4144859 1.5111725

So clearly the distribution has ratcheted up each year, and more at the right tail than anywhere else (mean more than the median). Also Top-coding is a big problem here, lots of the rents are reported at the maximum allowable top-coded value rather than their real value because of privacy issues.

Combining information from independent experiments: further ideas

2017 April 19
by Daniel Lakeland

It's an interesting question connected to something posted at Andrew's blog. I posted one possible "solution" using Stan previously, and Corey and I did some exploration of the issues in the comments.

Here are some new thoughts I've had, but first the set up. There are N labs each of which investigate a phenomenon in which there is a fundamental ratio R = X/Y and X and Y are individually measured with error. The labs each publish some posterior distribution for X and Y. In the example problem we have

Xtrue[i] ~ normal(Xbar[i],sx[i]); // published by lab i

Ytrue[i] ~ normal(Ybar[i],sy[i]); //similar

for each lab i

Now we'd like to do inference from this on the ultimate quantity of interest R. We assume that there is an underlying Rtrue of interest, and that in each lab, due to peculiarities of their apparatus etc an underlying R[i] is at work so that Xtrue[i]/Ytrue[i] = R[i] and R[i] is "close to" Rtrue.

R[i] ~ normal(Rtrue,deltaR);

with deltaR "small" and Rtrue a parameter in our model.

How can we do inference on Rtrue?

The thing that I recently came to is that in this situation, we don't get the individual data points from the labs, we get as *data* only Xbar[i] and Ybar[i] the published values that describe the lab's posterior distributions over Xtrue[i] and Ytrue[i], so we can do something like as follows:

Rtrue ~ normal(OurRtrueGuess, someSpread);// our prior for Rtrue

R[i] ~ normal(Rtrue, deltaR);// our prior for R[i] given Rtrue, expresses the "closeness" of the individual experiments to the real R value

Ytrue[i] ~ normal(Ybar[i], sy[i]); // the published posterior probability distribution over Ytrue[i] from lab i

Xbar[i] ~ normal(R[i]*Ytrue[i],sx[i]); 
/* the likelihood of having Xbar be published
 as the estimate of Xtrue by lab i given 
Ytrue[i], R[i] and the lab published sx[i] inference error. 
That is: p(Xbar[i] | R[i],Ytrue[i])

So now what we've got is essentially p(R) p(R[i]|R) p(Ytrue[i] | Data[i]) p(Xbar[i] | R[i], Ytrue[i], Data[i])

the narrow width of p(R[i]|R) is what describes the fact that all the labs are trying to set up their apparatus to reproduce carefully a single ratio R. The Ytrue[i] distribution is as given from the lab, and finally the likelihood over the Xbar value is potentially either the published distribution from the lab, or if we have additional information, it could be some modified thing. When Xbar is a sufficient statistic and we're using a normal distribution, it can be as good as getting the whole dataset from the lab.

What's interesting is, in these cases, the output of each labs inference process becomes *observed data* that we use in a likelihood type factor in our Bayesian model for combining the data.