Ridge regression is a shrinkage method: it adds a penalty on the size of the estimated coefficients.
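In the standard formulation, ridge minimizes the penalized least-squares criterion ||y - X*beta||^2 + lambda * ||beta||^2, whose closed-form solution is beta.ridge = (X'X + lambda*I)^(-1) X'y; this is exactly the formula implemented by the ridge() function in the R code at the end.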
Ridge regression typically handles high-dimensional data and correlated predictors well.
High-dimensional here means that, with p predictors and n observations, either p > n, or p < n but p is not much smaller than n (not p << n).
In these situations ridge regression can outperform OLS regression, because the OLS coefficient estimates have high variance and therefore predict poorly on test data.
To build a deeper understanding of ridge regression, it is instructive to look at the shrinkage paths of its estimated coefficients (we call them betas in the following exercise).
The shrinkage path of a coefficient is its estimated value traced out over a range of lambda values.
In practice we select lambda by cross-validation; 10-fold cross-validation usually works well.
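As an illustration of lambda selection (a sketch using the glmnet package rather than the hand-rolled ridge() function in the code below; x stands for the predictor matrix and y for the response):

library(glmnet)
# alpha = 0 selects the ridge penalty; glmnet standardizes predictors by default
cv.fit <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
cv.fit$lambda.min   # the lambda with the smallest cross-validated error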
We use the following data-generating process as the baseline setting:
the true beta vector is (0.5, 0.5, -0.5), we draw n = 100 observations, and the noise is Gaussian with variance 10 (see the R code at the end);
the variance-covariance matrix sigma of the predictors is, in the baseline case,
2.0 0.1 0.1
0.1 2.0 0.1
0.1 0.1 2.0
1). Why ridge regression needs standardized predictors
Here standardization means making the variance of each predictor equal to 1 by dividing each predictor by its standard deviation.
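In R this can be written compactly, equivalent to the column-by-column division used in the code appendix (x is the raw predictor matrix):

# divide every column of x by its sample standard deviation (no centering)
x.sd <- sweep(x, 2, apply(x, 2, sd), "/")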
We first use our baseline setting and run ridge on both the standardized data and the original data.
We can see that with higher variance (original data) the shrinkage paths are less steep, which implies that the variance of a predictor influences how strongly ridge shrinks its coefficient.
Next we use a sigma matrix in which each predictor has a different variance (2, 4 and 5 in the code).
The paths for the standardized data stay essentially unchanged, while for the original data we can see that, although beta2 (on the high-variance X2) is smaller than beta1, beta1 converges to 0 faster than beta2 as lambda grows.
This means that if we want to eliminate the influence of the predictors' units (for example, 1 dollar vs. 1000 dollars, cm vs. m, g vs. kg), we need to standardize the predictors.
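A minimal self-contained sketch of this unit effect (the factor 100 and lambda = 10 are arbitrary illustrative choices; ridge.fit is the same closed-form estimator as ridge() in the code appendix):

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
y <- x %*% c(1, 1) + rnorm(100)
ridge.fit <- function(x, y, lambda)
  solve(t(x) %*% x + diag(lambda, ncol(x))) %*% t(x) %*% y
x.rescaled <- x
x.rescaled[, 2] <- x[, 2] * 100       # "change the units" of the second predictor
ridge.fit(x, y, lambda = 10)          # both predictors are shrunk by a similar amount
ridge.fit(x.rescaled, y, lambda = 10) # the rescaled predictor is barely shrunk relative to OLS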
2). The trend for highly correlated data
For highly correlated data we use a sigma matrix whose covariance between x1 and x2 is 1.9 (with variances of 2):
2.0 1.9 0.1
1.9 2.0 0.1
0.1 0.1 2.0
Here the correlation coefficient between x1 and x2 is 1.9 / 2 = 0.95.
In the resulting paths, we find that beta1 and beta2 quickly converge to each other and then converge to 0 simultaneously.
This shows that for highly correlated predictors ridge treats them as nearly the same predictor and gives them the same shrinkage weight, which implies that ridge regression cannot be used for predictor selection.
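A minimal sketch of this behaviour (self-contained; x2 is constructed as x1 plus a small amount of noise so the two are almost perfectly correlated, and lambda = 50 is an arbitrary illustrative value):

set.seed(2)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)   # cor(x1, x2) is close to 1
x  <- cbind(x1, x2)
y  <- x1 + rnorm(200)
ridge.fit <- function(x, y, lambda)
  solve(t(x) %*% x + diag(lambda, ncol(x))) %*% t(x) %*% y
coef(lm(y ~ x - 1))          # OLS: unstable, the two coefficients are typically far apart
ridge.fit(x, y, lambda = 50) # ridge: the two coefficients are pulled close together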
And with a sigma setting that makes all three predictors highly correlated (covariance 1.9 between every pair):
we see that all three betas first converge to each other and then converge to 0 together, which leads to the same conclusion as above.
In addition, when lambda = 0 the estimated ridge betas equal the OLS betas. And because the standardized predictors are on the same scale, the coefficient that is smaller in absolute value (taking OLS as the baseline) converges to 0 faster as lambda increases.
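The lambda = 0 claim is easy to verify numerically: plugging lambda = 0 into the ridge formula gives back the OLS normal equations. A quick check (x and y are any design matrix and response, e.g. the simulated data below; lm() is fit without an intercept to match the no-intercept ridge formula):

beta.ols   <- coef(lm(y ~ x - 1))
beta.ridge <- solve(t(x) %*% x) %*% t(x) %*% y          # ridge formula with lambda = 0
all.equal(as.numeric(beta.ols), as.numeric(beta.ridge)) # TRUE up to numerical error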
R Code:
# test the shrinkage path of ridge
rm(list = ls())
library(MASS)
set.seed(777)
n <- 100
beta.true <- c(0.5, 0.5, -0.5)
# choose ONE sigma per experiment -- each assignment below overwrites the previous one,
# so comment out the settings you are not using

# baseline: equal variances (2), nearly uncorrelated predictors
sigma <- matrix(c(2,   0.1, 0.1,
                  0.1, 2,   0.1,
                  0.1, 0.1, 2),
                nrow = 3, ncol = 3, byrow = TRUE)
# different variance for each predictor (2, 4, 5)
sigma <- matrix(c(2,   0.1, 0.1,
                  0.1, 4,   0.1,
                  0.1, 0.1, 5),
                nrow = 3, ncol = 3, byrow = TRUE)
# x1 and x2 highly correlated (covariance 1.9, correlation 0.95)
sigma <- matrix(c(2,   1.9, 0.1,
                  1.9, 2,   0.1,
                  0.1, 0.1, 2),
                nrow = 3, ncol = 3, byrow = TRUE)
# x1 and x3 highly correlated
sigma <- matrix(c(2,   0.1, 1.9,
                  0.1, 2,   0.1,
                  1.9, 0.1, 2),
                nrow = 3, ncol = 3, byrow = TRUE)
# all three predictors highly correlated
sigma <- matrix(c(2,   1.9, 1.9,
                  1.9, 2,   1.9,
                  1.9, 1.9, 2),
                nrow = 3, ncol = 3, byrow = TRUE)
mu <- rep(0,3)
# simulate predictors from a multivariate normal and the response from the true betas,
# then return both the raw and the standardized (unit-variance) predictors
data.generator <- function(n, sigma, mu, beta){
  x <- mvrnorm(n, mu, sigma)
  e <- rnorm(n, 0, sqrt(10))
  y <- x %*% beta + e
  # standardize: divide each predictor by its standard deviation (no centering)
  x.1 <- x[, 1] / sd(x[, 1])
  x.2 <- x[, 2] / sd(x[, 2])
  x.3 <- x[, 3] / sd(x[, 3])
  x.sd <- cbind(x.1, x.2, x.3)
  data <- data.frame("y" = y, "x.sd" = x.sd, "x" = x)
  return(data)
}
data <- data.generator(n, sigma, mu, beta.true)
# data.frame() prefixes the matrix columns, hence the x.sd.x.1 / x.1 column names
x.sd <- cbind(data$x.sd.x.1, data$x.sd.x.2, data$x.sd.x.3)
x <- cbind(data$x.1, data$x.2, data$x.3)
# closed-form ridge estimator (no intercept): (X'X + lambda I)^(-1) X'y
ridge <- function(x, y, lambda, p = 3){
  beta.ridge <- solve(t(x) %*% x +
                        diag(lambda, nrow = p, ncol = p)) %*% t(x) %*% y
  return(beta.ridge)
}
# lambda grid from 10^3 down to 10^-3
grid <- 10^seq(3, -3, length.out = 100)
beta.ridge <- matrix(NA, nrow = 3, ncol = length(grid))
beta.ridge.nonsd <- matrix(NA, nrow = 3, ncol = length(grid))
# shrinkage paths on the standardized predictors
for (i in 1:length(grid)){
  lam <- grid[i]
  beta.ridge[, i] <- ridge(x.sd, data$y, lam)
}
# shrinkage paths on the original (non-standardized) predictors
for (i in 1:length(grid)){
  lam <- grid[i]
  beta.ridge.nonsd[, i] <- ridge(x, data$y, lam)
}
# plot shrinkage paths for the standardized data
ylimits <- c(-1.5, 1.5)
plot(x=grid, y=beta.ridge[1,], col = "red", ylim = ylimits,
xlab = expression(lambda), ylab = "",
main = expression(hat(beta) ~ "for different" ~ lambda), type = "n")
grid()
points(x=grid, y=beta.ridge[1,], col = "orange", lwd = 1)
points(x=grid, y=beta.ridge[2,], col = "red", lwd = 1)
points(x=grid, y=beta.ridge[3,], col = "darkblue", lwd = 1)
abline(h = 0, col = "black")
legend("topright", c(expression(beta[1]), expression(beta[2]), expression(beta[3])),
col = c("orange", "red", "darkblue"), pch = 1)
mtext("standardized data")
# plot shrinkage paths for the original (non-standardized) data
ylimits <- c(-1.5, 1.5)
plot(x=grid, y=beta.ridge.nonsd[1,], col = "red", ylim = ylimits,
xlab = expression(lambda), ylab = "",
main = expression(hat(beta) ~ "for different" ~ lambda), type = "n")
grid()
points(x=grid, y=beta.ridge.nonsd[1,], col = "orange", lwd = 1)
points(x=grid, y=beta.ridge.nonsd[2,], col = "red", lwd = 1)
points(x=grid, y=beta.ridge.nonsd[3,], col = "darkblue", lwd = 1)
abline(h = 0, col = "black")
legend("topright", c(expression(beta[1]), expression(beta[2]), expression(beta[3])),
col = c("orange", "red", "darkblue"), pch = 1)
mtext("non-standardized data")