foo ~ distrib(a,b,c);

Where foo is either a function, or a transformed parameter variable, and distrib is some distribution, Stan will complain about you needing to add a log-Jacobian to the target.

```
DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If so, you need to call target += with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
```

Now, if foo is directly a transformed single parameter, then in fact, Stan is correct. But if foo is the result of applying a function to several parameters together, collapsing an N dimensional space down to 1, then **there does not exist a Jacobian**

for example:

`exp(q/r) ~ distrib(a,b,c)`

where q and r are both parameters. There is one function, but two parameters, q,r. So the Jacobian matrix is 1x2 and has no determinant. How can you understand what this means?

Consider the prior on q,r perhaps you've said

`q ~ normal(qguess, qerr);`

r ~ normal(rguess, rerr);

and you thought to yourself "Ok there's the prior for q and r", but then you wrote down:

`exp(q/r) ~ distrib(a,b,c)`

and thought to yourself what? Something like "there's the prior for exp(q/r)" which is kind of true. But exp(q/r) is not a parameter, it's a deterministic function of two parameters q,r. On the other hand, we're multiplying them all together, so the actual prior for q,r is

`normal_pdf(q,qguess,qerr) * normal_pdf(r,rguess,rerr) * distrib_pdf(exp(q/r),a,b,c)/Z`

that is, "exp(q/r) ~ distrib(a,b,c)" declares a weighting function on values of q,r which further downweights those regions of q,r space that cause exp(q/r) to fall outside the high density region of distrib(a,b,c) (and you need the Z to re-normalize after the reweighting, but note that in MCMC you typically don't need the normalization constant so you're ok there).

This "declarative" technique produces an intentional "tilt" on the prior for q,r that helps you create a more complex and nuanced prior. In particular, when there are various ways to get your function to be near the high probability region for distrib, you'll wind up with a non-independent prior over the parameters q,r you might have a ridge along some curve, or several modes or whatever. This is good when you really do have information that q,r should lead your function to be in the high probability region of distrib.

To generate from this model, you'd need to first generate according to the simple prior, and then reject on the basis of the weighting function, so for example

`q = normal_rng(qguess,qerr);`

r = normal_rng(rguess,rerr);

f = exp(q/r)

w = distrib_pdf(f,a,b,c)/distrib_pdf_max_pdf_value

p = runif(0,1);

if(p < w) accept
else reject

This graph shows the range all the way out to 10M/yr income, here's the range where most of us actually are:

Finally, here's the number of people in these tax brackets:

You can see, there are tons and tons of people down in the region near basically zero income, the brackets there are very closely spaced. Then families who work and earn in the range 40k to 150k are the vast majority. Overall, there are a trivial number of people out past the ~$300,000 top bracket shown here, especially out past a $1M or more. Taxes on the ultra-rich just don't matter for revenue, because there are too few of them to make a difference. (But they matter a lot for behavior, because ultra rich people will dodge those taxes and distort economic efficient allocation of vast amounts of business resources).

Note that this compares a 35% flat tax + UBI to current. But to be deficit growth neutral, we really only need around 28% flat tax + UBI. and then use the UBI to replace many separate inefficient government programs.

]]>

Now. In my world, if someone offers me something such as say Pears, at a particular price, I look at the price and decide my quantity I will buy. So Qd = D(P) and if I grow pears and have to harvest them and transport them to a market, then I look at the price they are selling at at the market, and decide if I'm going to pay the money to harvest and transport, so my market supplied quantity Qs = S(P)

In other words, price is the input variable, and quantity is the output variable, and so I expect all the graphs to have P on the bottom horizontal X axis and Q on the left vertical Y axis.

The fact that it's intuitive to econ professors to flip these axes... is I think very weird. When I talk with economists I'd want to say "demand increases so the demand curve shifts **up**", and they expect me to say "demand increases so the demand curve shifts** right"**

It may be the convention, but you're not going to convince me that it is a good convention to go against everything that will be learned in every other math related field.

Probably too late to fix. Sigh

]]>

First, let's talk about how many houses are being rented and how many people are renting them, as a function of price. Let's posit an important component of the model, which is the frequency distribution of houses renting for price r at time t. Now let's also posit some variables for the total number of houses, the average number of Bedrooms per house renting for r, and the average number of people per bedroom in houses renting for r (C is for crowding). Then we can describe the relevant population of the area as:

Now F(t,r) can change in several ways. First of all there can be some turnover. When there's turnover of a house currently renting for rent and it starts renting for then we have:

An integro-differential equation, with a kernel that describes the turnover transition of an existing apartment between different rents, and a function that describes how fast we're constructing housing and renting it out at price r, each changes in time.

Now, we'll imagine that the number of units actually on the market for a day or whatever is negligible (we don't have say 15 percent of all units empty ever for example). So we can talk about a turnover rate for volume of transactions at time t, price r. This defines a kind of spot price distribution. There's no one spot price, because there's no "universal apartment". Different locations and different qualities, and different sizes produce different market rental spot prices. But if then at time t, some volume is available at price r.

There is a relationship between the kernel and the transaction volume , namely that the total number of apartments transitioning from anywhere into price r, plus the new sales, is the volume:

and also, there is a relationship between and namely:

This more or less describes the dynamics of the supply, at least, if you posit specific forms for the function, and some dynamics of the transition kernels and the building rate and soforth.

But, in order to get information about the transition kernels and the volume, etc, you need to have some information about demand, and for this you need to posit a demand for an apartment (or a bedroom?) at a given rent r. And this describes the reservoir of people who are waiting in the wings to scoop up an apartment if one becomes available at a given price. We should probably split it out as a component from inside our region, and from outside our region. From inside our region will help us determine how often people are leaving their apartments.

It's clear that the demand can be a function of the character of the population, as Phil says you can get more rich people demanding services and then increasing the demand and the lower end of r. So, there is feedback between statistics of the population, and the demand.

Now, I think we have a structure that we can start to build on, specific functional forms could be hypothesized and dynamics could be observed through solving integro-differential equations, perhaps by discretizing the system and doing sums. Probably we need a bit more than this, this isn't quite a complete closed model, but it's a good start for a structure if you wanted to go after this with a continuous statistical approach.

]]>

Tabular values from each year:

YEAR MedRent 1 2010 1.0000000 2 2011 0.9565217 3 2012 1.2173913 4 2013 1.3043478 5 2014 1.3913043 6 2015 1.3913043

Note, the above was the original table I published, but it is affected by the fact that the Census discretizes the rent. Adding a normal perturbation to the normalized rent with sd = 0.01 creates a smoothed distribution, and then you can get better estimates on the median and mean of the real distribution:

YEAR MedRent MeanRent 1 2010 0.9959745 1.0020145 2 2011 0.9621274 0.9797048 3 2012 1.2243687 1.2844332 4 2013 1.3043896 1.3762188 5 2014 1.3854716 1.4446855 6 2015 1.4144859 1.5111725

So clearly the distribution has ratcheted up each year, and more at the right tail than anywhere else (mean more than the median). Also Top-coding is a big problem here, lots of the rents are reported at the maximum allowable top-coded value rather than their real value because of privacy issues.

]]>Here are some new thoughts I've had, but first the set up. There are N labs each of which investigate a phenomenon in which there is a fundamental ratio R = X/Y and X and Y are individually measured with error. The labs each publish some posterior distribution for X and Y. In the example problem we have

Xtrue[i] ~ normal(Xbar[i],sx[i]); // published by lab i Ytrue[i] ~ normal(Ybar[i],sy[i]); //similar

for each lab i

Now we'd like to do inference from this on the ultimate quantity of interest R. We assume that there is an underlying Rtrue of interest, and that in each lab, due to peculiarities of their apparatus etc an underlying R[i] is at work so that Xtrue[i]/Ytrue[i] = R[i] and R[i] is "close to" Rtrue.

R[i] ~ normal(Rtrue,deltaR);

with deltaR "small" and Rtrue a parameter in our model.

How can we do inference on Rtrue?

The thing that I recently came to is that in this situation, we don't get the individual data points from the labs, we get as *data* only Xbar[i] and Ybar[i] the published values that describe the lab's posterior distributions over Xtrue[i] and Ytrue[i], so we can do something like as follows:

Rtrue ~ normal(OurRtrueGuess, someSpread);// our prior for Rtrue R[i] ~ normal(Rtrue, deltaR);// our prior for R[i] given Rtrue, expresses the "closeness" of the individual experiments to the real R value Ytrue[i] ~ normal(Ybar[i], sy[i]); // the published posterior probability distribution over Ytrue[i] from lab i Xbar[i] ~ normal(R[i]*Ytrue[i],sx[i]); /* the likelihood of having Xbar be published as the estimate of Xtrue by lab i given Ytrue[i], R[i] and the lab published sx[i] inference error. That is: p(Xbar[i] | R[i],Ytrue[i]) */

So now what we've got is essentially p(R) p(R[i]|R) p(Ytrue[i] | Data[i]) p(Xbar[i] | R[i], Ytrue[i], Data[i])

the narrow width of p(R[i]|R) is what describes the fact that all the labs are trying to set up their apparatus to reproduce carefully a single ratio R. The Ytrue[i] distribution is as given from the lab, and finally the likelihood over the Xbar value is potentially either the published distribution from the lab, or if we have additional information, it could be some modified thing. When Xbar is a sufficient statistic and we're using a normal distribution, it can be as good as getting the whole dataset from the lab.

What's interesting is, in these cases, the output of each labs inference process becomes *observed data* that we use in a likelihood type factor in our Bayesian model for combining the data.

]]>

Imagine a player shoots baskets, and their probability of success is 0.5 + a sinusoidal function of shot. We don't know the phase or period exactly, we want to look at a sequence of hits and try to infer the amplitude of the variation and its frequency. In my case I'm using period = 40 and phase = 0.

## the hot hand, a player takes shots at a basket with a sinusoidal ## underlying prob of success, looking at N successive shots, if we ## know the period and phase perfectly can we detect the magnitude for ## different N = 20, 50, 100, 200, 500 library(rstan); rstan_options(auto_write = TRUE) options(mc.cores = parallel::detectCores()) set.seed(1); hitmiss = sapply(1:500,function(x){return(rbinom(1,1,prob=0.5 + 0.1 * sin(2*pi*x/40)))}) stancode = " functions{ real binp(real A,int i,real f,real ph) { return(0.5 + A*sin(2*pi()*(i+ph)/f));} } data{int hitmiss[500];} parameters{realA20;real A50; real A100; real A200; real A500; vector [5] f; vector [5] ph; } model{ A20 ~ normal(0,.5); A50 ~ normal(0,.5); A100 ~ normal(0,.5); A200 ~ normal(0,.5); A500 ~ normal(0,.5); f ~ gamma(2,2.0/50); ph ~ normal(0,20); for(i in 1:500){ if(i < 21){hitmiss[i] ~ binomial(1,binp(A20,i,f[1],ph[1])); } if(i < 51){hitmiss[i] ~ binomial(1,binp(A50,i,f[2],ph[2])); } if(i < 101){hitmiss[i] ~ binomial(1,binp(A100,i,f[3],ph[3])); } if(i < 201){hitmiss[i] ~ binomial(1,binp(A200,i,f[4],ph[4])); } if(i < 501){hitmiss[i] ~ binomial(1,binp(A500,i,f[5],ph[5])); } } } " samps <- stan(model_code=stancode,data=list(hitmiss=hitmiss), init=list( list(f=rep(50,5),A20=0.2,A50=0.2,A100=0.2,A200=0.2,A500=0.2,ph=rep(0,5)), list(f=rep(50,5),A20=0.2,A50=0.2,A100=0.2,A200=0.2,A500=0.2,ph=rep(0,5)), list(f=rep(50,5),A20=0.2,A50=0.2,A100=0.2,A200=0.2,A500=0.2,ph=rep(0,5)), list(f=rep(50,5),A20=0.2,A50=0.2,A100=0.2,A200=0.2,A500=0.2,ph=rep(0,5))), iter=10000,thin=10) samps

Running this model gives the following results:

Inference for Stan model: c8293fb08b00b7823bcc1fc716df0ef9. 4 chains, each with iter=10000; warmup=5000; thin=10; post-warmup draws per chain=500, total post-warmup draws=2000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat A20 0.15 0.01 0.11 0.01 0.06 0.12 0.21 0.39 422 1.00 A50 0.11 0.00 0.08 0.01 0.04 0.10 0.17 0.30 410 1.00 A100 0.09 0.00 0.06 0.00 0.04 0.09 0.14 0.24 569 1.00 A200 0.09 0.00 0.06 0.00 0.04 0.09 0.13 0.20 288 1.01 A500 0.13 0.00 0.03 0.06 0.12 0.13 0.16 0.20 270 1.01 f[1] 45.95 1.09 35.87 5.22 19.06 37.12 62.08 138.20 1090 1.00 f[2] 39.30 1.46 34.07 3.89 12.61 29.49 54.20 132.16 541 1.00 f[3] 37.54 1.66 29.00 5.82 17.11 32.19 47.29 112.63 305 1.01 f[4] 38.49 1.50 27.51 8.49 19.08 35.79 45.11 108.88 336 1.01 f[5] 40.39 0.28 2.02 39.52 40.31 40.62 40.98 41.68 54 1.08 ph[1] -0.97 0.95 18.74 -34.41 -13.40 -3.96 11.86 38.11 387 1.02 ph[2] 1.18 0.97 20.88 -40.78 -11.06 0.30 15.52 42.05 465 1.00 ph[3] -2.06 2.08 20.75 -43.41 -14.11 -1.80 12.54 36.89 99 1.03 ph[4] 1.48 1.35 20.94 -39.44 -10.69 1.07 14.84 42.38 240 1.02 ph[5] 5.01 0.28 4.57 -2.00 2.57 4.69 7.45 12.91 267 1.00 lp__ -581.57 0.33 4.48 -591.30 -584.43 -581.32 -578.54 -573.58 185 1.02

Suppose we decide that we've "Detected" the effect if we have an expected amplitude / sd(amplitude) of about 2, then it takes upwards of 200 shots, because at 200 shots E(A)/sd(A) = .09/.06 ~ 1.5. Also note how few effective samples I get (4 cores, 10000 samples per core, thinning by a factor of 10 so I have 2000 total samples but typical parameters have around a couple hundred effective samples. The model is hard to fit because the whole thing is noisy and messy, traceplots reflect this).

Also, this is with the benefit of a precisely repeated periodic signal and informative priors on the size, period, and phase of the effect.

If the effect varies like a smooth gaussian process with a ~ 40 shot scale but no periodicity and no informative priors, it'd be like trying estimate simultaneously 20 or 30 fourier coefficients or something... you'd need even more data and long runs to overcome the noise and get good effective sample size.

The model is ill posed, the measurement is not very informative, the Hot Hand is hard to detect from hit/miss data.

**SO LACK OF DETECTION IS NOT EVIDENCE OF ABSENCE.**

note that we want to do p(R | Xtrue,Ytrue,Ptrue,Qtrue) = p(R|Xtrue,Ytrue) P(R|Ptrue,Qtrue) and each of these is a delta function (that is, if you tell me Xtrue,Ytrue I find out R exactly)... so we have to approximate this delta function with a sufficiently tightly peaked normal distribution. Basically we tell Stan to stay in the region where both experiments agree to within epsilon = 0.001

## inference on a ratio via two separate experiments library(rstan); dataset <- data.frame(X=rnorm(10,1,.1),Y=rnorm(10,2,.3),P=rnorm(10,30,6),Q=rnorm(10,60,4)); ## now the true value of R = 1/2 = 30/60 but we have two experiments ## where we measure X,Y or we measure P,Q with measurement error that ## is of known size (we know the .1 and .3 and 6 and 4 used as ## standard deviations, assume we have a well calibrated instrument). stancode = "data { vector[10] X; vector[10] Y; vector[10] P;vector[10] Q;} parameters{ real<lower=0> Xtrue; real<lower=0> Ytrue; real<lower=0> Ptrue; real<lower=0> Qtrue; real<lower=0> Rat; } model{ // we assume on physical principals that these are positive and have an upper bound order of magnitude estimate Xtrue ~ uniform(0,1000); Ytrue ~ uniform(0,1000); Ptrue ~ uniform(0,10000); Qtrue ~ uniform(0,10000); X ~ normal(Xtrue,.1); Y ~ normal(Ytrue,.3); P ~ normal(Ptrue,6); Q ~ normal(Qtrue,4); // constrain the ratios to agree to within a small epsilon Rat ~ normal(Xtrue/Ytrue,.001); Rat ~ normal(Ptrue/Qtrue,.001); } " samps <- stan(model_code=stancode,data=dataset); samps

When I run it. I get the following output:

Inference for Stan model: 5100ead2741217b889326da6ff0f0419. 4 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=4000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat Xtrue 1.02 0.00 0.03 0.97 1.01 1.02 1.04 1.08 2105 1 Ytrue 2.06 0.00 0.08 1.91 2.00 2.06 2.11 2.22 1568 1 Ptrue 30.24 0.04 1.31 27.71 29.34 30.23 31.15 32.87 1410 1 Qtrue 60.64 0.03 1.22 58.20 59.85 60.66 61.43 63.06 2172 1 Rat 0.50 0.00 0.02 0.46 0.48 0.50 0.51 0.54 927 1 lp__ -6.28 0.04 1.59 -10.39 -7.05 -5.95 -5.11 -4.21 1588 1 Samples were drawn using NUTS(diag_e) at Sun Apr 9 19:37:53 2017. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1). >

Note that with 10 observations in each experiment, we determine that our expected value of Rat is 0.5 to within 0.00 standard error, and almost all the samples of Rat are between about 0.46 and 0.54

]]>YEAR NP GRNTP MRGP HINCP 11570 2013 2 1030 NA 53000 11835 2013 3 1210 NA 29000 12945 2013 2 NA 2200 200200 16433 2014 2 NA 3100 181000 18641 2015 2 2080 NA 128500 20854 2015 6 260 NA 57000

Apparently there is a family of 6 living in Pasadena, making $57,000/yr (HINCP) and paying $260 a month in rent (GRNTP).

Now, if you told me $2600 I wouldn't blink an eye, and as you can see that would be in line with the $2080 for a family of 2, or $3100 in mortgage for a family of 2. But $260 could only occur if you were renting from your uncle or you are acting as an apartment complex manager and part of your compensation is reduced rent, or some other similar situation.

So, given that some of these records do go through optical scanners and could come out with a dropped decimal place or the like, as well as some people have a situation like the apartment complex manager who has secret income in the form of subsidized rent... How should one think about handling this kind of situation if the goal is to estimate a function: "minimum market rent for a family of N people in each region".

One idea would be to simply trim the data, ignore anything that is say less than 20% of the observed mean or more than 5x the observed statewide mean, and this would catch most situations of both dropped or added decimal places (which would cause a factor of 10 error). Another idea would be to do a model where there's some probability to have outliers due to a separate process, and we simply learn about which records to ignore by virtue of their high posterior probability of being "outliers". But this requires us to carry an extra parameter for each observation. A third possibility is to marginalize out the outlier parameter, and learn only about the frequency of outliers and treat observed data as coming from a mixture model where we learn the mixture weight as equal to the marginalized frequency of outliers.

I'm sure there are other ways, but one issue is this: I already have a lot of parameters due to the large number of public use microdata areas I'm doing inference for, and runs are slow, so it'd be good to avoid complex models just for computational reasons.

Any thoughts?

]]>

For the ACS and similar microdata, each household record should have a Latitude and Longitude associated. These values should be equal to the actual Lat/Long of the actual household plus a random uniformly distributed perturbation. The record should give the size of this perturbation in each dimension, and the size should be determined by the local population density such that the patch contains approximately 100k people. The uniform distribution should be used because it's bounded so you can be sure you're never more than a certain distance away from the true location.

The record still has the State, so if you go across a state boundary you'd have some issues which I'm sure the census is capable of working out, but other than that, it makes way more sense to think of things in terms of fuzzing out the location using a continuous perturbation than the complexity of redrawing boundary lines every few years and completely re-doing your identification scheme and screwing up any chance you had of working with spatial-timeseries models...

Just sayin.

]]>