Headaches with “contrasts” in linear statistical models

2009 December 11
by Daniel Lakeland

There’s a topic in linear statistical modeling that always drives me crazy when I have to consider it, and I haven’t seen a good simple explanation of it, so here’s my meager attempt.

Suppose you have some data, and you want to fit a linear model to the data. Perhaps there are several variables, such as y, x1, and x2. y is something you want to predict and x1 and x2 are generally available measurements that you’d like to use to inform your predictions. So far, so good.

Suppose y is a continuous variable, like a length, time, weight, or volume, and that x1 is also continuous. On the other hand, x2 is a factor. That is, there are a discrete set of categories in which x2 can fall. Suppose there are 5 different cities in which the data are collected, and treating the cities as coordinates on the surface of the earth is not reasonable, it’s better to simply compare the 5 different cities as if they were 5 discrete places. So x2 can take on one of the values in the set {1,2,3,4,5} each value having some name associated.

So far so good. You load your data into your favorite statistics program, which if you read this blog is required to be the fantastic freely available software “R“, and you fit the following model:

model <- lm(y ~ x1 + x2, data=mydata)

What does this model mean? By default, using the “treatment” contrasts, it means that y is predicted according to the following formula:

y [is estimated by] INT * 1 + a1 * x1 + a2 * (is x2 = 2) + a3 * (is x2 = 3) + a4 * (is x2 = 4) + a5 * (is x2 = 5)

INT is the coefficient of the “intercept term”, perhaps better called the “constant term” which multiplies the number 1 and appears in every row. However in this scheme, it is intimately connected to the situation when x2 = 1 (notice that there was no “is x2 = 1” column?). This model is generally easy to interpret, but it has the property that the intercept INT is not a predictor of the overall average across all cities, it’s only a predictor of the average in city 1 (or if you use the function “relevel” you can adjust which city is the reference).

There’s a whole bunch of theory that goes along with this issue, and it goes under the name of  “orthogonal contrasts” in the ANOVA literature. If you have INT, the intercept term, and there are 5 different levels of your factor, then we can only uniquely estimate 4 comparisons between some combination of factor levels. Think of it this way: if you know the average of 5 quantities, and you know 4 of the quantities, then you automatically know the 5th quantity, there is no “estimation” required.

In other words, we’re going to represent a single data observation as a vector like this c(x1=1.257,1,0,0,0,0) for an observation in city 1, and like this for example c(x1=2.8712,0,0,1,0,0) for an observation in city 3. There is exactly one column from columns 2 through 6 that has a 1 in it, and it represents which city the observation is from.

Great, but when we build the model matrix, we’re adding another column, the constant column, and it has a 1 in every row. If we want to create a constant term that represents an overall average, then we can only estimate 4 (in general N-1) other “averages” and furthermore, those other averages must not be linearly related to the constant term.

Knowing some linear algebra is helpful here. Think of the data as a big matrix:

D = [x1,1, (x2 is 1), (x2 is 2), (x2 is 3), (x2 is 4), (x2 is 5)]

where each entry is a column vector, the second column is the constant column with entries = 1, and the last 5 are boolean column vectors (entries are 0 or 1 only) and every row has exactly 1 of the last 5 entries equal to 1. We have 7 columns, but only 6 of the columns are linearly independent (the sum of the last 5 columns IS the constant 1 column which is the same as the second column). To reduce this model matrix to 6 columns wide, we’ll multiply on the right by a 7 x 6 matrix:

C = [ (1,0,0,0…), (0,1,0,0,0…), C1, C2, C3, C4]

The first column looks like the first column of the identity matrix, and the result DC will have first column equal to the first column of D (ie. x1), the second column is like the second column of the identity, it means our second column will be like D[2,] = (1,1,1,1…), but now C1, C2, C3, and C4 are the “contrasts” which need to be linearly independent of column 1 and 2 and each other. They are only mathematically required to be linearly independent, but if we’re going to interpret the second coefficient as an overall average then C1..C4 also need to be orthogonal to (1,1,1….) and for maximal efficiency in calculation, it is best if they are also orthogonal to each other. That is, they go in totally different directions from the other contrasts. Orthogonality (actually a slightly stronger “orthonormality”) means $$C_i \cdot C_j = \delta_{ij}$$.

In “R” the contrasts of a factor are a property of the factor variable (see the function “contrasts”). You can change these by assigning to it as in “contrasts(myfactor) <- my.contrasts”. The standard “orthogonal” contrasts in R are called “helmert” contrasts, and are generated by the function “contr.helmert()”: Here are the helmert contrasts for 5 level factors:

> contr.helmert(5)
 [,1] [,2] [,3] [,4]
1   -1   -1   -1   -1
2    1   -1   -1   -1
3    0    2   -1   -1
4    0    0    3   -1
5    0    0    0    4

Notice that there are 5 rows and 4 columns, hence we’re creating 4 combinations of 5 values, and the 5th value will be the “constant” column. Finally, notice that the helmert contrasts calculate the difference between the i’th data column and the average of the previous i-1 data columns (look at the first helmert contrast column to see how when it’s multiplied on the right by D, it calculates D[2,] – D[1,] well, it only gets multiplied by the last 5 columns of D so that’s actually D[4,] – D[3,] in the original D matrix .

In summary, if you want to calculate an overall average during a linear regression with factors, you get the overall average from the constant term, and then need orthogonal contrasts for the rest of the factor values:

  1. You need contrasts that are orthogonal to the constant column, and for efficiency they should be orthogonal to each other.
  2. You may want to pick which comparisons to make, since the Helmert comparisons are not always what you’re interested in.
  3. Contrast matrices in R are always N x (N-1) and are multiplied by the columns of the data vector that are associated with that factor only. Implicitly the “first” column is a column of all ones (1) that is actually equivalent to the constant term in the model.

Hopefully this post gets indexed by a search engine and helps someone out while trying to figure out what’s going on with R contrast matrices. Writing it sure helped clarify things for me!

3 Responses
  1. Daniel Lakeland
    December 11, 2009

    Ugh. Once again, Math writing fails to meet its potential due to lack of pictures… it would be helpful to have the matrices spelled out and arrows pointing to relevant bits I think. Sure, if you understand linear algebra pretty well, you can get more out of this than some other sources on the internet, but would a 6th grader get it? If not, then I think I’ve failed to reach the full potential.

  2. Jon permalink
    April 24, 2012

    I was hopeful when this started, but you lost me with the matrix math. Also, you started out discussing 2 predictors in the model, with the other being a continuous var. It would have been great if you could have come back to that once you explained the contrasts to explain what the interaction terms would mean. Understanding interactions terms for grouping factor variables always seems like a headache to me.

  3. Daniel Lakeland
    April 25, 2012

    Jon, the interaction terms are not too bad. to compute an interaction you create a new column from the multiplication of the continuous factor column with the discrete factor column. This produces a number which is uniformly zero when the data doesn’t come from the city of interest, and uniformly equal to the continuous variable when the data does come from the city of interest (i’m sticking with the cities interpretation here, but in general “grouping of interest”)

    So the coefficient of the continuous variable now means the overall average across all the factors for the continuous variable response, and the coefficient of the interaction term tells you how much you should change the coefficient of the continuous variable when you know which group the measurement comes from.

Comments are closed.