This post visualizes how ridge regression can outperform the OLS estimator.
The figures and code in this post come from a group project in Computational Statistics by Gewei Cao, Andreas Koundouros, Raúl Luyando, and Erik Covarrubia.
We use the following data generating process as the baseline setting: the true beta vector is (0.5, 0.5, -0.5), the error term is drawn from a normal distribution with mean 0 and variance 10, and the variance-covariance matrix sigma of the predictors has variances 2, 3 and 4 on the diagonal and 0.1 on every off-diagonal entry (see the code below), so the three predictors are nearly uncorrelated.
For the formula of ridge regression, see: https://geweicao.wixsite.com/karl-s-study-blog/post/the-coefficients-s-shrinkage-path-of-ridge-regression
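For convenience, the closed-form ridge estimator implemented in the code below (without an intercept) is
\hat{\beta}^{\text{ridge}}(\lambda) = \left( X^\top X + \lambda I_p \right)^{-1} X^\top y ,
which reduces to the OLS estimator as \lambda \to 0.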
First, we briefly review the behaviour of the OLS and ridge coefficients for different values of lambda.
We see that for lambda close to 0, the estimated ridge coefficients converge to the OLS coefficients, and the test and training errors follow the same pattern, as the figure above shows. We choose the "best" lambda by 10-fold cross-validation.
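As a quick numerical check of this convergence, here is a minimal sketch that reuses the data, x.sd and ridge() objects defined in the code section below:
# For a very small lambda, the ridge coefficients are numerically
# indistinguishable from the OLS coefficients.
ols.coef <- coef(lm(data$y ~ x.sd - 1))
ridge.coef <- ridge(x.sd, data$y, lambda = 1e-8)
max(abs(ols.coef - ridge.coef))  # close to 0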
Now we compare the performance of ridge and OLS in terms of test error: we generate 100 training observations to fit each model and 100 test observations to compute the mean squared error of the fitted models.
We then repeat this procedure 100 times to reduce the influence of random variation.
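Concretely, the test error reported for each fitted model is the mean squared prediction error on the 100 test observations:
\text{MSE}_{\text{test}} = \frac{1}{100} \sum_{i=1}^{100} \left( y_i^{\text{test}} - \hat{y}_i^{\text{test}} \right)^2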
Each point above has coordinates (test error of ridge, test error of OLS). The red line is the diagonal: if a point lies below it, the ridge test error is larger than the OLS test error in that simulation; if a point lies above it, the ridge test error is smaller. The larger the test error, the worse the performance of the chosen model.
We see that in our initial setting, in which all three predictors are nearly independent, OLS outperforms ridge. The box plot below gives the same result:
1). Highly Correlated Predictors
Now we change to a setting with highly correlated predictors. Given the shrinkage property of ridge regression, we expect ridge to outperform OLS here: with highly correlated predictors the OLS estimates have high variance, and ridge reduces it.
The variance-covariance matrix sigma now has variances 11, 15 and 11, with all pairwise correlations above 0.99 (see covmatrix(11, 15, 11, 0.01) in the code below).
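As a sanity check on how strong the correlation is, the implied correlation matrix can be inspected directly (a small sketch assuming the covmatrix() helper defined in the code section below):
sigma2 <- covmatrix(11, 15, 11, 0.01)
round(cov2cor(sigma2), 3)  # all off-diagonal entries are above 0.99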
Then we get the following result:
Now we see that ridge beats OLS in 76 of the 100 simulations, which is consistent with the theoretical argument.
In the box plot, Ridge* and OLS* refer to the highly correlated predictors setting, while the boxes without * refer to the initial setting. The red horizontal line marks the variance of the error term (10), which is the irreducible error no model can beat on average. The box plot gives the same result.
2). Noise Variables
In addition, when there are many noise variables in the model, ridge also outperforms OLS. In this situation OLS overfits the training data and gives poor predictions on the test data, while ridge shrinks the coefficients of the noise variables.
By noise variables we mean variables that do not enter the true relationship between Y and X; in practice, however, we may not know which variables are noise. We therefore train both models including these noise variables and predict with the trained models.
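To illustrate the shrinkage, one can compare the fitted coefficients of the two models after running the noise-variable simulation in the code section below (a sketch that reuses ridge.mod, lam.star and lm.obj from the last loop iteration):
ridge.coefs <- as.matrix(coef(ridge.mod, s = lam.star))[-1, 1]  # drop the intercept (fixed at 0)
cbind(ridge = ridge.coefs, ols = lm.obj$coefficients)
# Rows 4 to 10 correspond to the noise variables; their ridge estimates
# should be closer to zero than the corresponding OLS estimates.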
We see that the results match our expectation. In the box plot, Ridge? and OLS? denote the noise-variable case; both models perform worse than in the baseline setting, but ridge does not suffer as much as OLS.
Finally let us compare all three cases in one figure:
Code:
rm(list = ls())
library(MASS)    # mvrnorm()
library(glmnet)  # glmnet(), cv.glmnet()
# Parameters
set.seed(777)
n <- 100
beta.true <- c(0.5, 0.5, -0.5)
sigma <- matrix(c(2,   0.1, 0.1,
                  0.1, 3,   0.1,
                  0.1, 0.1, 4),
                nrow = 3, ncol = 3, byrow = TRUE)
mu <- rep(0,3)
# Data generator
data.generator <- function(n, sigma, mu, beta){
  x <- mvrnorm(n, mu, sigma)
  e <- rnorm(n, 0, sqrt(10))   # error term with variance 10
  y <- x %*% beta + e
  # Scale each predictor to unit sample standard deviation
  x.1 <- x[,1] / sd(x[,1])
  x.2 <- x[,2] / sd(x[,2])
  x.3 <- x[,3] / sd(x[,3])
  x.sd <- cbind(x.1, x.2, x.3)
  data <- data.frame("y" = y, "x.sd" = x.sd, "x" = x)
  return(data)
}
# Training data for the exercise
data <- data.generator(n, sigma, mu, beta.true)
# Storing the scaled and unscaled covariates in matrices for later convenience
x.sd <- cbind(data$x.sd.x.1, data$x.sd.x.2, data$x.sd.x.3)
x <- cbind(data$x.1, data$x.2, data$x.3)
# Ridge regression function: closed-form solution, no intercept
ridge <- function(x, y, lambda, p = 3){
  beta.ridge <- solve(t(x) %*% x + diag(lambda, nrow = p, ncol = p)) %*% t(x) %*% y
  return(beta.ridge)
}
set.seed(123)
# Creating a new grid for the values of lambda
grid <- 10^ seq (2, -10, length = 100)
# Generating test data
test.data <- data.generator(n, sigma, mu, beta.true)
# Uncentered test data
x.t <- cbind(test.data$x.1, test.data$x.2, test.data$x.3)
# Centered test data
x.sd.t <- cbind(test.data$x.sd.x.1, test.data$x.sd.x.2, test.data$x.sd.x.3)
# Ridge coefficients for every lambda on the grid (one column per lambda)
beta.ridge <- sapply(grid, function(l) ridge(x.sd, data$y, l))
# Containers for the train and test errors of the ridge regression
train.error <- c()
test.error <- c()
for (i in 1:length(grid)){
y.hat <- x.sd %*% beta.ridge[,i]
train.error[i] <- mean((data$y - y.hat)^2)
y.hat.t <- x.sd.t %*% beta.ridge[,i]
test.error[i] <- mean((test.data$y - y.hat.t)^2)
}
# Train and test errors for the OLS fit
lm.obj <- lm(data$y ~ x.sd.x.1 + x.sd.x.2 + x.sd.x.3 -1, data = data)
lm.obj.test <- x.sd.t %*% lm.obj$coefficients
#lm.obj.test <- predict(lm.obj, newdata= data.frame(x.sd.t))
ols.train.error <- mean((lm.obj$fitted.values - data$y)^2)
ols.test.error <- mean((lm.obj.test - test.data$y)^2)
# Plot
plot(x = log(grid), y = train.error, col = "red",
ylim = c(min(train.error, test.error), max(train.error, test.error)),
type = "n",
xlab = expression(log(lambda)), ylab = "Error Rate")
grid()
points(x = log(grid), y = test.error, col = "blue")
points(x = log(grid), y = train.error, col = "red")
abline(h = ols.train.error, col = "darkgreen", lwd = 3)
abline(h = ols.test.error, col = "orange", lwd = 3)
legend("topleft", legend = c("Ridge Training", "Ridge Test", "OLS Training",
"OLS Test"), pch = c(1,1, NA, NA),
lty = c(NA, NA, 1, 1), lwd = c(NA, NA, 3, 3),
col = c("red", "blue", "darkgreen", "orange"),
cex = 1)
set.seed(527)
rep <- 100
MSE <- matrix(NA,rep,2)
colnames(MSE) <- c("Ridge", "OLS")
for (i in 1:rep){
# Data generating process
train.data <- data.generator(n, sigma, mu, beta.true)
test.data <- data.generator(n,sigma, mu, beta.true)
# Storing the x variables in matrices for convenience
# Train covariates
x <- cbind(train.data$x.1, train.data$x.2, train.data$x.3)
x.sd <- cbind(train.data$x.sd.x.1, train.data$x.sd.x.2, train.data$x.sd.x.3)
#Test covariates
x.t <- cbind(test.data$x.1, test.data$x.2, test.data$x.3)
x.sd.t <- cbind(test.data$x.sd.x.1, test.data$x.sd.x.2, test.data$x.sd.x.3)
# Ridge regression with lambda chosen by cross-validation
ridge.mod <- glmnet(x.sd, train.data$y, alpha = 0,
                    lambda = grid, thresh = 1e-12, intercept = FALSE)
cv.ridge <- cv.glmnet(x.sd, train.data$y, alpha = 0, intercept = FALSE)
lam.star <- cv.ridge$lambda.min
ridge.test <- predict(ridge.mod, s= lam.star, newx = x.sd.t)
# Ridge MSE
MSE[i,1] <- mean(( ridge.test - test.data$y)^2)
# OLS fit on the training data
lm.obj <- lm(train.data$y ~ train.data$x.sd.x.1 + train.data$x.sd.x.2 +
               train.data$x.sd.x.3 - 1)
lm.obj.test <- x.sd.t %*% lm.obj$coefficients
#OLS MSE
MSE[i,2] <- mean( (lm.obj.test - test.data$y)^2)
}
# Plot OLS MSE vs. Ridge MSE
# Percentage of simulations in which the ridge MSE is at most the OLS MSE
# (points on or above the 45-degree line)
above <- mean(MSE[,1] <= MSE[,2])*100
plot(MSE[,1],MSE[,2], xlim = c(5,20), ylim = c(5,20),
main = "Test MSE According to Ridge vs OLS",
xlab = "Ridge", ylab = "OLS", type = "n")
grid()
points(MSE[,1],MSE[,2], col = "darkblue")
abline(a = 0, b = 1, col = "red")
text(10, 15, above)
text(10, 6, 100-above)
# Boxplot
boxplot(MSE, ylab = "MSE", col = c("cyan1", "lightsalmon"), main = "Test MSE According to Ridge vs OLS")
#b )
##### change to a highly correlated covmatrix
set.seed(527)
n <- 100
beta.true <- c(0.5, 0.5, -0.5)
# a, b, c are the variances on the diagonal; d controls how far the
# covariances fall below their maximum possible values, so a small d
# gives correlations close to 1
covmatrix <- function(a, b, c, d){
  matrix(c(a,             sqrt(a*b)-d,   sqrt(a*c)-3*d,
           sqrt(a*b)-d,   b,             sqrt(b*c)-d,
           sqrt(a*c)-3*d, sqrt(b*c)-d,   c),
         nrow = 3, ncol = 3, byrow = TRUE)
}
sigma2 <- covmatrix(11,15,11,0.01)
mu <- rep(0,3)
rep <- 100
grid <- 10^ seq (5, -2, length = 100)
MSE2 <- matrix(NA,rep,2)
colnames(MSE2) <- c("Ridge*", "OLS*")
for (i in 1:rep){
train.data <- data.generator(n, sigma2, mu, beta.true)
test.data <- data.generator(n,sigma2, mu, beta.true)
x <- cbind(train.data$x.1, train.data$x.2, train.data$x.3)
x.sd <- cbind(train.data$x.sd.x.1, train.data$x.sd.x.2, train.data$x.sd.x.3)
x.t <- cbind(test.data$x.1, test.data$x.2, test.data$x.3)
x.sd.t <- cbind(test.data$x.sd.x.1, test.data$x.sd.x.2, test.data$x.sd.x.3)
ridge.mod <- glmnet(x.sd, train.data$y, alpha = 0,
                    lambda = grid, thresh = 1e-12, intercept = FALSE)
cv.ridge <- cv.glmnet(x.sd, train.data$y, alpha = 0, intercept = FALSE)
lam.star <- cv.ridge$lambda.min
ridge.test <- predict(ridge.mod, s= lam.star, newx = x.sd.t)
lm.obj <- lm(train.data$y ~ train.data$x.sd.x.1 + train.data$x.sd.x.2 +
train.data$x.sd.x.3 -1)
lm.obj.test <- x.sd.t %*% lm.obj$coefficients
MSE2[i,1] <- mean(( ridge.test - test.data$y)^2)
MSE2[i,2] <- mean( (lm.obj.test - test.data$y)^2)
}
# Plot OLS MSE vs. Ridge MSE
above2 <- mean(MSE2[,1]<=MSE2[,2])*100
plot(MSE2[,1],MSE2[,2], col="blue",xlim = c(8.5,11.5),ylim = c(8.5,11.5),
xlab = "Ridge",ylab = "OLS", main = "Test MSE According to Ridge vs OLS (high covariance)")
lines(c(0:20),c(0:20), col = "red")
text(9, 10, above2)
text(10, 9.5, 100-above2)
# Boxplot
boxplot(cbind(MSE,MSE2), ylab = "MSE", col = c("cyan1", "lightsalmon"), main = "Test MSE According to Ridge vs OLS, various scenarios" )
lines(0:7,rep(10,8),col="red", lwd= 2.5)
#######################################################
# now assume we have some noise variables
set.seed(527)
n <- 100
beta.true <- c(0.5, 0.5, -0.5)
mu <- rep(0,3)
rep <- 100
grid <- 10^ seq (5, -2, length = 100)
MSE3 <- matrix(NA,rep,2)
colnames(MSE3) <- c("Ridge?", "OLS?")
sigma <- matrix(c(2,0.1,0.1,0.1,3,0.1,0.1,0.1,4) ,
nrow= 3, ncol= 3, byrow=TRUE)
for (i in 1:rep){
train.data <- data.generator(n, sigma, mu, beta.true)
test.data <- data.generator(n,sigma, mu, beta.true)
# Noise variables for training and test data: unit variance, not part of the true model
x4 <- rnorm(n,2,1)
x5 <- runif(n,1-sqrt(3),1+sqrt(3))
x6 <- runif(n,2-sqrt(3),2+sqrt(3))
x7 <- rnorm(n,4,1)
x8 <- runif(n,-1-sqrt(3),-1+sqrt(3))
x9 <- rnorm(n,-2,1)
x10 <- runif(n,-sqrt(3),sqrt(3))
x4t <- rnorm(n,2,1)
x5t <- runif(n,1-sqrt(3),1+sqrt(3))
x6t <- runif(n,2-sqrt(3),2+sqrt(3))
x7t <- rnorm(n,4,1)
x8t <- runif(n,-1-sqrt(3),-1+sqrt(3))
x9t <- rnorm(n,-2,1)
x10t <- runif(n,-sqrt(3),sqrt(3))
x.sd <- cbind(train.data$x.sd.x.1, train.data$x.sd.x.2,
train.data$x.sd.x.3,x4,x5,x6,x7,x8,x9,x10)
x.sd.t <- cbind(test.data$x.sd.x.1, test.data$x.sd.x.2,
test.data$x.sd.x.3,x4t,x5t,x6t,x7t,x8t,x9t,x10t)
ridge.mod <- glmnet(x.sd, train.data$y, alpha = 0,
                    lambda = grid, thresh = 1e-12, intercept = FALSE)
cv.ridge <- cv.glmnet(x.sd, train.data$y, alpha = 0, intercept = FALSE)
lam.star <- cv.ridge$lambda.min
ridge.test <- predict(ridge.mod, s= lam.star, newx = x.sd.t)
lm.obj <- lm(train.data$y ~ train.data$x.sd.x.1 + train.data$x.sd.x.2 +
train.data$x.sd.x.3+x4+x5+x6+x7+x8+x9+x10 -1)
lm.obj.test <- x.sd.t %*% lm.obj$coefficients
MSE3[i,1] <- mean(( ridge.test - test.data$y)^2)
MSE3[i,2] <- mean( (lm.obj.test - test.data$y)^2)
}
# Plot OLS MSE vs. Ridge MSE
above3 <- mean(MSE3[,1]<=MSE3[,2])*100
plot(MSE3[,1],MSE3[,2],col="blue",
xlab = "Ridge",ylab = "OLS",xlim = c(7,15),
ylim = c(7,15), main = "Test MSE According to Ridge vs OLS (noise variables)")
lines(c(0:20),c(0:20), col = "red")
text(10, 12, above3)
text(10, 8, 100-above3)
# Boxplot
boxplot(cbind(MSE,MSE2,MSE3), ylab = "MSE", col = c("cyan1", "lightsalmon"), main = "Test MSE According to Ridge vs OLS, various scenarios" )
lines(0:7,rep(10,8),col="red", lwd= 2.5)
plot(MSE[,1],MSE[,2],col="darkblue",
xlab = "Ridge",ylab = "OLS",xlim = c(7,15),
ylim = c(7,15), main = "Test MSE According to Ridge vs OLS, various scenarios", pch= 16)
grid()
points(MSE2[,1],MSE2[,2],col="orange", pch = 15)
points(MSE3[,1],MSE3[,2],col="red", pch = 17)
lines(c(0:20),c(0:20), col = "black")
legend("bottomright", c("Basline", "High covariance", "Noise variables"),
col = c("darkblue", "orange", "red"), pch = c(16,15,17))