What to do with Census outliers?

2017 April 9
by Daniel Lakeland

Consider the following records from the Census ACS for Pasadena CA:

  row  year  persons  GRNTP  mortgage   HINCP
11570  2013        2   1030        NA   53000
11835  2013        3   1210        NA   29000
12945  2013        2     NA      2200  200200
16433  2014        2     NA      3100  181000
18641  2015        2   2080        NA  128500
20854  2015        6    260        NA   57000

Apparently there is a family of 6 living in Pasadena, making $57,000/yr (HINCP) and paying $260 a month in rent (GRNTP).

Now, if you told me $2600 I wouldn’t blink an eye, and as you can see that would be in line with the $2080 in rent for a family of 2, or the $3100 in mortgage for a family of 2. But $260 could only occur if you were renting from your uncle, or acting as an apartment complex manager with reduced rent as part of your compensation, or in some other similar situation.

So, given that some of these records go through optical scanners and could come out with a dropped decimal place or the like, and that some people are in a situation like the apartment complex manager, with hidden income in the form of subsidized rent, how should one think about handling this kind of situation if the goal is to estimate a function: “minimum market rent for a family of N people in each region”?

One idea would be to simply trim the data: ignore anything that is, say, less than 20% of the observed statewide mean or more than 5x that mean. This would catch most cases of both dropped and added decimal places (which cause a factor of 10 error). Another idea would be to fit a model in which there’s some probability that a record is an outlier generated by a separate process, and we learn which records to ignore by virtue of their high posterior probability of being “outliers”. But this requires us to carry an extra parameter for each observation. A third possibility is to marginalize out the per-observation outlier parameter and learn only about the frequency of outliers, treating the observed data as coming from a mixture model whose mixture weight is that frequency.
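The trimming idea is simple enough to sketch directly. This is only an illustration under assumptions: the function name `trim_outliers` is made up here, and the reference mean of $1500 is a hypothetical stand-in for the statewide mean, not a figure from the ACS.

```python
def trim_outliers(rents, reference_mean, low_frac=0.2, high_mult=5.0):
    """Split rents into (kept, dropped) using bounds relative to a reference mean.

    Anything below low_frac * reference_mean or above
    high_mult * reference_mean is treated as a suspected outlier.
    """
    lower = low_frac * reference_mean
    upper = high_mult * reference_mean
    kept = [r for r in rents if lower <= r <= upper]
    dropped = [r for r in rents if not (lower <= r <= upper)]
    return kept, dropped


# Monthly rents (GRNTP) from the four renter records shown above.
rents = [1030, 1210, 2080, 260]

# ASSUMPTION: 1500 is a hypothetical statewide mean rent, for illustration only.
kept, dropped = trim_outliers(rents, reference_mean=1500.0)
print("kept:", kept)
print("dropped:", dropped)
```

The bounds depend heavily on which mean is used. With the mean of just these four rents (1145), the lower bound would be 229 and the $260 record would survive the trim, which is one reason to anchor the rule to a broader reference such as the statewide mean.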

I’m sure there are other ways, but one issue is this: I already have a lot of parameters due to the large number of public use microdata areas I’m doing inference for, and runs are slow, so it’d be good to avoid complex models just for computational reasons.
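On the computational concern, the marginalized mixture may be the cheapest of the model-based options, since it adds only a mixture weight (and possibly an outlier scale) rather than one parameter per record. Below is a minimal sketch of such a likelihood, not a worked-out model. It assumes log rent is normal around a typical value, that outliers come from a much wider normal on the same log scale, and that all of the numbers (`mu`, `sigma`, `w`, `sigma_out`) are hypothetical values chosen only for illustration.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def component_logs(y, mu, sigma, w, sigma_out):
    """Log joint terms for the 'ordinary' and 'outlier' mixture components."""
    good = math.log1p(-w) + normal_logpdf(y, mu, sigma)
    bad = math.log(w) + normal_logpdf(y, mu, sigma_out)
    return good, bad


def mixture_loglik(log_rents, mu, sigma, w, sigma_out):
    """Log likelihood with the per-record outlier indicator summed out."""
    return sum(log_sum_exp(*component_logs(y, mu, sigma, w, sigma_out))
               for y in log_rents)


def outlier_probability(y, mu, sigma, w, sigma_out):
    """Probability that a record is an outlier, given the global parameters."""
    good, bad = component_logs(y, mu, sigma, w, sigma_out)
    return math.exp(bad - log_sum_exp(good, bad))


# Monthly rents (GRNTP) from the four renter records shown above.
rents = [1030, 1210, 2080, 260]
log_rents = [math.log(r) for r in rents]

# ASSUMPTIONS: hypothetical parameter values, not estimates from the data.
mu, sigma = math.log(1500.0), 0.4   # typical log rent and its spread
w, sigma_out = 0.05, 2.0            # outlier frequency and outlier spread

loglik = mixture_loglik(log_rents, mu, sigma, w, sigma_out)
probs = [outlier_probability(y, mu, sigma, w, sigma_out) for y in log_rents]
for r, p in zip(rents, probs):
    print(f"rent {r}: outlier probability {p:.3f}")
```

With this setup the sampler would see only the global parameters, and the per-record outlier probabilities can still be recovered afterwards from those parameters, so nothing about “which records to ignore” is lost by marginalizing.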

Any thoughts?

