In this post, I will illustrate omitted variable bias in an intuitive way. Omitted variable bias (OVB) arises when the regressor of interest, X, is correlated with the error term in a regression. The estimated coefficient on X is then biased: OLS cannot tell the endogenous part of the error term apart from X, so it attributes that part's effect on Y to X.
To solve this problem, we can add a control variable that absorbs the part of the error term that is correlated with X. The remaining error then has zero conditional mean given both the control and X, i.e. we rely on E[e|Control, X]=0 rather than on E[e|X]=0.
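The mechanics can be written out for the simplest case with one regressor and one omitted control. This is the standard textbook derivation rather than anything specific to this post, and the symbols W (the omitted control), u, and the beta names are notation assumed here for illustration.

```latex
\begin{aligned}
\text{True model:}\quad
  & Y = \beta_0 + \beta_1 X + \beta_2 W + u, \qquad E[u \mid X, W] = 0 \\
\text{Omitting } W\text{:}\quad
  & Y = \beta_0 + \beta_1 X + e, \qquad e = \beta_2 W + u \\
\text{Endogeneity:}\quad
  & \operatorname{Cov}(X, e) = \beta_2 \operatorname{Cov}(X, W) \neq 0
    \quad \text{if } \beta_2 \neq 0 \text{ and } \operatorname{Cov}(X, W) \neq 0 \\
\text{OLS slope of } Y \text{ on } X\text{:}\quad
  & \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
    = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(X, W)}{\operatorname{Var}(X)}
\end{aligned}
```

The second term in the last line is the bias. It vanishes only if the omitted variable has no effect on Y or is uncorrelated with X.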
For example, suppose someone wants to study the causal effect of the teacher-student (t-s) ratio on test scores. It seems plausible that a higher t-s ratio leads to higher test scores. In reality, however, students who live in high-income communities are more likely to attend classes with a high t-s ratio, and these students also tend to have more resources that help with their studies and tests. Community income is therefore an omitted variable: if we do not include it in the regression, the error term will be correlated with the t-s ratio.
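In our simulated data, X1 is the t-s ratio, X2 is the income level index, Y is the students' test score, and the error term collects other random influences on the test score. We have n=100. X1 and X2 are drawn with mean 0 and variance 1, and their variance-covariance matrix is sigma = [1, 0.8; 0.8, 1], i.e. a covariance of 0.8 between X1 and X2. The error term is normal with mean 0 and standard deviation 0.5 (rnorm(n,0,0.5) in the code). All betas (including the intercept) are equal to 1.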
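With these settings, the usual OVB formula predicts that a regression of Y on X1 alone has a slope near beta1 + beta2 * Cov(X1, X2) / Var(X1) = 1 + 1 * 0.8 / 1 = 1.8 instead of 1. The sketch below checks this numerically. It assumes a much larger sample (100000 draws) and an arbitrary seed, both chosen here for illustration rather than taken from the post, so that the estimates sit close to their limits.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)        # arbitrary seed (assumption), not the one used in the post
n_big <- 100000    # large sample (assumption) so sampling noise is small
sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)

X <- mvrnorm(n_big, mu = c(0, 0), Sigma = sigma)
Y <- 1 + X[, 1] + X[, 2] + rnorm(n_big, 0, 0.5)

slope_full  <- unname(coef(lm(Y ~ X[, 1] + X[, 2]))[2])  # controls for X2
slope_short <- unname(coef(lm(Y ~ X[, 1]))[2])           # omits X2

# Slope predicted by the OVB formula: beta1 + beta2 * Cov(X1, X2) / Var(X1)
slope_predicted <- 1 + 1 * sigma[1, 2] / sigma[1, 1]

print(c(full = slope_full, short = slope_short, predicted = slope_predicted))
```

The full-model slope should land close to 1 and the short-model slope close to the predicted value, which is the gap the figures in this post visualize.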
The figure above shows the regression line with X2 (red line) and without X2 (blue line). For predicting Y from X1 alone, the blue line does better. For causal interpretation, however, the blue line overestimates the effect of X1 on Y: the effect of the t-s ratio is overstated because the regression ignores the fact that family background also affects test performance.
Next, we add the income level X2 to the regression. To keep the visualization simple, we split the income data into 4 groups, numbered 1 (lowest) to 4 (highest), which lets us draw an approximately "conditional" picture of the estimate.
The four parallel dashed lines fit the trend of X1 conditional on the level of X2 better than the naive line (with X1 only). Their common slope is the full-regression estimate of the causal effect of the t-s ratio on test scores, which has a true value of 1 in this simulation. In other words, conditional on family income level, the effect of the t-s ratio is far smaller than the estimate obtained when family economic background is omitted.
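To make the construction of the dashed lines concrete: each line keeps the X1 slope from the full regression and shifts the intercept by the X2 coefficient times the mean of X2 within the income group. The following is a minimal sketch under the post's simulation settings. The cut points -0.6, 0 and 0.6 are the ones used for the four groups, while names such as grp and lines_by_group are illustrative.

```r
library(MASS)  # provides mvrnorm()

set.seed(555)
n <- 100
X <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.8, 0.8, 1), 2, 2))
Y <- 1 + X[, 1] + X[, 2] + rnorm(n, 0, 0.5)
x1 <- X[, 1]
x2 <- X[, 2]

# Full regression with both regressors
b <- coef(lm(Y ~ x1 + x2))

# Four income groups: [-Inf, -0.6), [-0.6, 0), [0, 0.6), [0.6, Inf)
grp <- cut(x2, breaks = c(-Inf, -0.6, 0, 0.6, Inf),
           labels = c("q1", "q2", "q3", "q4"), right = FALSE)

# Dashed line for each group: intercept b0 + b2 * mean(X2 in group), slope b1
lines_by_group <- data.frame(
  group     = levels(grp),
  intercept = unname(b["(Intercept)"]) + unname(b["x2"]) * as.vector(tapply(x2, grp, mean)),
  slope     = unname(b["x1"])
)
print(lines_by_group)
```

All four rows share the same slope. Only the intercept moves, and it rises from q1 to q4 because higher-income groups have a higher mean X2.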
The plotting code below also produces more detailed versions of the figure above.
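# Illustration of omitted variable bias (OVB)
rm(list = ls())
library(MASS)  # provides mvrnorm()

# Simulation settings
n <- 100
sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)  # Var(X1) = Var(X2) = 1, Cov(X1, X2) = 0.8
mu <- c(0, 0)
beta <- c(1, 1, 1)  # intercept, effect of X1, effect of X2

set.seed(555)
X <- mvrnorm(n, mu, sigma)  # column 1: t-s ratio (X1), column 2: income index (X2)
error <- rnorm(n, 0, 0.5)   # noise with standard deviation 0.5
Y <- beta[1] + beta[2] * X[, 1] + beta[3] * X[, 2] + error

# Full model: controls for X2
lmtrue <- lm(Y ~ X)
summary(lmtrue)

# Short model: omits X2, so the X1 coefficient absorbs part of X2's effect
lmovb <- lm(Y ~ X[, 1])
summary(lmovb)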
# Split X2 (income) into four groups at the cut points -0.6, 0 and 0.6
cond <- cut(X[, 2], breaks = c(-Inf, -0.6, 0, 0.6, Inf),
            labels = c("q1", "q2", "q3", "q4"), right = FALSE)
data <- data.frame(Y,X,cond=factor(cond))
plot(X[,1],Y,col=data$cond,xlab = "X1",pch=16,
main = "Illustration of OVB")
abline(coef(lmtrue)[1],coef(lmtrue)[2],lwd=2,col="darkred",lty=2)
abline(coef(lmovb)[1],coef(lmovb)[2],lwd=2,col="darkblue",lty=4)
# Legend colours are the factor codes 1..4, matching col = data$cond in plot()
legend("bottomright", legend = levels(data$cond),
       col = seq_along(levels(data$cond)), pch = 16)
plot(X[,1],Y,col="skyblue",xlab = "X1",pch=16,
main = "Illustration of OVB")
abline(coef(lmtrue)[1],coef(lmtrue)[2],lwd=2,col="darkred",lty=2)
abline(coef(lmovb)[1],coef(lmovb)[2],lwd=2,col="darkblue",lty=4)
# One panel per income group. The dashed red line is the full-model fit with X2
# held at the group mean: intercept b0 + b2 * mean(X2 in group), slope b1.
for (g in levels(data$cond)) {
  in_g <- data$cond == g
  plot(data$X1[in_g], data$Y[in_g], col = "skyblue", xlab = "X1", ylab = "Y", pch = 16,
       main = paste0("Illustration of OVB (", toupper(g), ")"))
  abline(coef(lmtrue)[1] + coef(lmtrue)[3] * mean(data$X2[in_g]), coef(lmtrue)[2],
         lwd = 2, col = "darkred", lty = 2)
  abline(coef(lmovb)[1], coef(lmovb)[2], lwd = 2, col = "darkblue", lty = 4)
}
# All groups in one plot, with one conditional (dashed) line per income group
plot(X[, 1], Y, col = data$cond, xlab = "X1", pch = 16,
     main = "Illustration of OVB")
groups <- levels(data$cond)
for (k in seq_along(groups)) {
  # Intercept uses the group mean of X2; colour k matches the points of group k
  abline(coef(lmtrue)[1] + coef(lmtrue)[3] * mean(data$X2[data$cond == groups[k]]),
         coef(lmtrue)[2], lwd = 1.5, col = k, lty = 2)
}
abline(coef(lmovb)[1], coef(lmovb)[2], lwd = 2, col = "darkblue", lty = 4)
legend("bottomright", legend = groups, col = seq_along(groups), pch = 16)