What to do with Census outliers?
Consider the following records from the Census ACS for Pasadena CA:
          YEAR  NP  GRNTP   MRGP   HINCP
  11570   2013   2   1030     NA   53000
  11835   2013   3   1210     NA   29000
  12945   2013   2     NA   2200  200200
  16433   2014   2     NA   3100  181000
  18641   2015   2   2080     NA  128500
  20854   2015   6    260     NA   57000
Apparently there is a family of 6 living in Pasadena, making $57,000/yr (HINCP) and paying $260 a month in rent (GRNTP).
Now, if the record said $2,600 I wouldn't blink an eye: that would be in line with the $2,080 rent for a household of 2, or the $3,100 mortgage for a household of 2. But $260 could only occur in special circumstances, say you're renting from your uncle, or you work as an apartment complex manager and part of your compensation is reduced rent, or some similar situation.
So, given that some of these records pass through optical scanners and could come out with a dropped decimal place or the like, and that some respondents, like the apartment complex manager, effectively have hidden income in the form of subsidized rent: how should one handle this kind of situation when the goal is to estimate a function like "minimum market rent for a household of N people in each region"?
A few ideas:

1. Simply trim the data: ignore anything less than, say, 20% of the observed regional mean or more than 5x the observed statewide mean. This would catch most cases of a dropped or added decimal place (which produces a factor-of-10 error).

2. Fit a model in which each record has some probability of being an outlier generated by a separate process, and learn which records to ignore from their high posterior probability of being outliers. But this requires carrying an extra parameter for every observation.

3. Marginalize out the per-record outlier indicator and learn only the frequency of outliers: treat the observed data as coming from a mixture model whose mixture weight is that marginalized outlier frequency.
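To make ideas 1 and 3 concrete, here is a minimal sketch on toy data. All names and parameter values (the rent vector, the stand-in statewide mean, the mixture parameters mu, sigma, w) are illustrative assumptions, not estimates from the ACS; in a real analysis mu, sigma, and w would be fit per region.

```python
import numpy as np

# Toy gross rents for one region; the $260 record mimics a dropped decimal.
rents = np.array([1030.0, 1210.0, 2080.0, 2600.0, 1800.0, 260.0])

# Idea 1: trimming. Keep records between 20% of the local mean and
# 5x a statewide mean (a made-up stand-in value here).
statewide_mean = 1800.0
keep = (rents >= 0.2 * rents.mean()) & (rents <= 5.0 * statewide_mean)
trimmed = rents[keep]  # the $260 record falls below the lower cutoff

# Idea 3: marginalized mixture on log-rent. Each record is "good" with
# probability (1 - w) under a tight normal, or an "outlier" with
# probability w under a much wider normal. The per-record indicator is
# summed out, so the only extra parameter is the outlier frequency w.
def normal_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def mixture_loglik_terms(log_rents, mu, sigma, w, outlier_sigma=2.0):
    lp_good = np.log1p(-w) + normal_logpdf(log_rents, mu, sigma)
    lp_out = np.log(w) + normal_logpdf(log_rents, mu, outlier_sigma)
    total = np.logaddexp(lp_good, lp_out)  # marginal log-likelihood per record
    post_outlier = np.exp(lp_out - total)  # P(outlier | data, params)
    return total, post_outlier

# Assumed parameter values, for illustration only.
mu, sigma, w = np.log(1800.0), 0.3, 0.05
_, post_outlier = mixture_loglik_terms(np.log(rents), mu, sigma, w)
```

At these parameter values the $260 record gets posterior outlier probability near 1 while the in-range rents stay near 0, so even after marginalizing you can still flag suspect records from the posterior, without a per-record parameter in the sampler.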
I'm sure there are other approaches, but one constraint: I already have a lot of parameters due to the large number of public use microdata areas I'm doing inference for, and runs are slow, so for purely computational reasons it would be good to avoid overly complex models.