Introduction
R has many ways to code variables (e.g. factor, numeric, character). One method to recode categorical variables that has recently become more popular is ‘contrast coding’. Contrast coding allows for recentering of categorical variables such that the intercept of a model is not the mean of one level of a category, but instead the mean of all data points in the data set. This can also be useful when running models with an interaction of two variables, as it gets rid of the problem of baselines when trying to interpret model results (see blog post on Linear Models Part 1 for more details). Today’s post will focus on what happens when a categorical variable has three levels and when contrast coding doesn’t directly match the default coding system.
Data
For this problem we’ll generate some fake data with a variable ‘group’ with three levels, ‘A’, ‘B’, and ‘C’. For the first analysis we’ll only compare two levels, ‘A’ and ‘B’. The code for generating the data is below.
library(dplyr)
data_threelevel = data_frame(group = c(rep("A", 10), rep("B", 10), rep("C", 10)),
score = c(rnorm(10, 40, 5), rnorm(10, 50, 5), rnorm(10, 60, 5)))
data_twolevel = data_threelevel %>%
filter(group != "C") %>%
mutate(group = factor(group))
Two-level Variable
To see the power of contrast coding we’ll compare a model with dummy coding (the default method of coding in R for factors) to a model with contrast coding when ‘group’ only has two levels (‘A’ and ‘B’). A summary of the data is provided below.
group | mean | sd |
---|---|---|
A |
39.77303 |
4.107021 |
B |
49.80712 |
6.025235 |
With dummy coding the default of a level is coded as 0 and the non-default level is coded as 1. Since R codes variables alphabetically ‘A’ is our default (or 0) and ‘B’ is our non-default (or 1). As a result the intercept (when x is 0) is the value of the default level, here the mean of ‘A’. The first model below uses dummy coding. As you can see, the intercept is 39.7730349, the same value as the mean of level ‘A’. The estimate for ‘group’ is the difference between group ‘A’ and group ‘B’ (10.034082).
ab_dummy.lm = lm(score ~ group, data_twolevel)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) |
39.77303 |
1.630507 |
24.393045 |
0.0000000 |
groupB |
10.03408 |
2.305885 |
4.351509 |
0.0003846 |
With contrast coding we can recode the number values for our levels so that 0 is in between each level, instead of being equal to a level. The code below does this by making a new column where ‘A’ is equal to -0.5 and ‘B’ to 0.5. Note, here I did this by making a new numeric column. Other tutorials on contrast coding often use the ‘contrasts()’ call, both methods produce the same results in the model.
data_twolevel_contrast = data_twolevel %>%
mutate(contrast_AvB = ifelse(group == "A", -0.5, 0.5))
The model below is the same as the first model, but using contrast coding. The estimate for ‘group’ (here ‘contrast_AvB’) is the same as for the model with dummy coding, but the intercept is the mean of ‘A’ and ‘B’ together (44.7900759), not just the mean of ‘A’.
ab_contrast.lm = lm(score ~ contrast_AvB, data_twolevel_contrast)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) |
44.79008 |
1.152943 |
38.848484 |
0.0000000 |
contrast_AvB |
10.03408 |
2.305885 |
4.351509 |
0.0003846 |
Three-level Variable
Where this gets more complicated is when trying to recreate dummy coding estimates with contrast coding when you have a variable with three levels. First, here is a summary of our full data set with three levels for ‘group’.
group | mean | sd |
---|---|---|
A |
39.77303 |
4.107021 |
B |
49.80712 |
6.025235 |
C |
59.76273 |
5.591157 |
As for the two-level data set, the first model here uses dummy coding. The intercept continues to be the same as for the model with only ‘A’ and ‘B’, since it is still the mean of ‘A’ (39.7730349). Similarly the estimate for ‘A’ versus ‘B’ is the same as the other model (10.034082). This model also includes the estimate for ‘A’ versus ‘C’, which is the difference of the means of ‘A’ and ‘C’ (19.9896989).
ab_ac_dummy.lm = lm(score ~ group, data_threelevel)
kable(coef(summary(ab_ac_dummy.lm)))
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) |
39.77303 |
1.677618 |
23.708033 |
0.0000000 |
groupB |
10.03408 |
2.372511 |
4.229309 |
0.0002407 |
groupC |
19.98970 |
2.372511 |
8.425546 |
0.0000000 |
To run our contrast coded model we’ll need to make two columns. Below is the code to create two columns, one is ‘A’ versus ‘B’ (same as the model with only two levels) and one is ‘A’ versus ‘C’. In both cases ‘A’ is set to -0.5, the other level is set to 0.5, and the remaining variable is set to 0 so it will be ignored.
data_threelevel_contrast = data_threelevel %>%
mutate(contrast_AvB = ifelse(group == "A", -0.5,
ifelse(group == "B", 0.5, 0))) %>%
mutate(contrast_AvC = ifelse(group == "A", -0.5,
ifelse(group == "C", 0.5, 0)))
The contrast coded model does some things as expected and some not. The intercept is now the mean of the full data set including level ‘C’ (49.7809619). However, the estimates for ‘A’ versus ‘B’ and ‘A’ versus ‘C’ don’t match the dummy coded model. What then are these estimates?
ab_ac_contrast.lm = lm(score ~ contrast_AvB + contrast_AvC, data_threelevel_contrast)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) |
49.78096 |
0.9685735 |
51.3961642 |
0.0000000 |
contrast_AvB |
0.05231 |
2.7395395 |
0.0190945 |
0.9849062 |
contrast_AvC |
19.96354 |
2.7395395 |
7.2871896 |
0.0000001 |
Three-level Variable Updated
To get our desired estimates we need to do contrast coding where we collapse together two levels and compare them (together) to the remaining level. To create this new column we’ll set both ‘A’ and ‘B’ to -0.5 and ‘C’ to 0.5. Other tutorials with contrast coding say that in this case ‘A’ and ‘B’ should be set to -0.25 so that the sum of the matrix is 0 (‘A’ vs. ‘B’: -0.5, 0.5, 0; ‘A’ and ‘B’ vs. ‘C’: -0.25, -0.25, 0.5). Both of these methods produce desired results in one sense and undesired results in another. To show the differences we’ll make a column for both types of coding.
data_threelevel_contrast_updated = data_threelevel_contrast %>%
mutate(contrast_ABvC1 = ifelse(group == "C", 0.5, -0.5)) %>%
mutate(contrast_ABvC2 = ifelse(group == "C", 0.5, -0.25))
Coding Method #1
Now we can make a new model that compares ‘A’ versus ‘B’ and then ‘A’ and ‘B’ versus ‘C’. We get our correct coefficient for ‘A’ versus ‘B’ that we got in both of our dummy coded models and our two-level contrast coded model (10.034082). Our coefficient for the second variable is correctly the mean of ‘A’ and ‘B’ minus the mean of ‘C’ (14.9726579).
ab_abc_contrast.lm = lm(score ~ contrast_AvB + contrast_ABvC1, data_threelevel_contrast_updated)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) |
52.27640 |
1.027327 |
50.885832 |
0.0000000 |
contrast_AvB |
10.03408 |
2.372511 |
4.229309 |
0.0002407 |
contrast_ABvC1 |
14.97266 |
2.054655 |
7.287190 |
0.0000001 |
However, one thing is still incorrect, our intercept. It should be the mean of the full data set (49.7809619), but instead it’s a different number. Further examination shows that it is actually the mean of ‘A’, ‘B’, and ‘C’ twice (52.2764049), or (‘A’ mean + ‘B’ mean + ‘C’ mean + ‘C’ mean) / 4. Another way to think of this is the estimate is actually the average of the mean of ‘A’ minus the mean of ‘C’, and then the mean of ‘B’ minus the mean of ‘C’, ((‘C’ mean – ‘A’ mean) + (‘C’ mean – ‘B’ mean)) / 2. As a result, it makes sense that the mean of ‘C’ would be represented twice in the computation of the intercept.
Coding Method #2
We can also build the model where we used the coding where ‘A’ and ‘B’ were set to -0.25. In this model the estimate is now correctly the mean of the full data set (49.7809619), and the estimate of ‘A’ versus ‘B’ variable is correctly the mean of ‘A’ minus the mean of ‘B’ (10.034082).
ab_abc2_contrast.lm = lm(score ~ contrast_AvB + contrast_ABvC2, data_threelevel_contrast_updated)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) |
49.78096 |
0.9685735 |
51.396164 |
0.0000000 |
contrast_AvB |
10.03408 |
2.3725108 |
4.229309 |
0.0002407 |
contrast_ABvC2 |
19.96354 |
2.7395395 |
7.287190 |
0.0000001 |
However, the estimate for ‘A’ and ‘B’ versus ‘C’ is not the mean of ‘A’ and ‘B’ minus the mean of ‘C’. It is actually the grand mean of the data minus the mean of ‘A’ minus the mean of ‘B’ plus the mean value for ‘C’ (19.9635439), or full data set mean – ‘A’ mean – ‘B’ mean + ‘C’ mean. We can also rewrite this equation as (-2 * ‘A’ mean) / 3 + (-2 * ‘B’ mean) / 3 + (4 * ‘C’ mean) / 3. Both levels ‘A’ and ‘B’ were set to -0.25, which divided by the mean of the other possible coding (0.5) equals -2 / 3, the multiplier or ‘A’ and ‘B’, -0.25 / ((0.25 + 0.5) / 2). Conversely, the 4 / 3, the multiplier for ‘C’, is 0.5 / ((0.25 + 0.5) / 2).
You may have noticed that this is the same value we got for (what we thought was) ‘A’ vs. ‘C’ in the original three-level contrast coded model. It’s also worth noting that the t-value and p-value for ‘A’ and ‘B’ versus ‘C’ is the same in both versions of our updated contrast coded models, while this is not the case for the intercept. In the end then it is probably preferred to have the correct intercept and an (slightly) incorrect estimate for one of the variables if you are using the models for significance testing. So, it is better to use the version where ‘A’ and ‘B’ are set to -0.25.
Conclusion
In conclusion, contrast coding is a way to reset the intercept to the mean of the full data set, not just a single level of a variable. For variables with only two levels this produces the same estimates (just different intercepts) as dummy coding. However, when a variable has three levels it becomes more difficult to replicate the estimates from dummy coding. The closest method appears to be to maintain one contrast and for the second comparison collapse together two levels and compare it to the third level. Here I leave as an open question, is it possible replicate the dummy coded model with three levels using contrast coding?
Thank you for this! It’s extremely clear and helpful.
Thanks so much, Page, for this nice description! It helped me a lot! How would you set the contrasts if 4 (or 5) different groups are included (I want to interpret the intercept)?
Would I just continue by using (e.g. 4 groups):
mutate(contrast_ABCvD = ifelse(group == “D”, 0.5, -0.16))
?
Thanks a lot! Best, Roman