Principal Component Regression (PCR) first applies Principal Components Analysis (PCA) to the predictors, selects the first M principal components, and then regresses Y on those components.
Principal components are the directions along which the independent variables (also called predictors) vary the most, so each PC captures as much of the predictors' information as possible. In addition, PCs are orthogonal to each other.
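For instance, a minimal sketch on toy data (separate from the simulation below), using R's built-in prcomp to confirm that the PC directions are orthogonal:
## Sketch: PCA directions are orthogonal (toy data, not the DGP below)
set.seed(42)
toy <- matrix(rnorm(200), ncol = 2) %*% matrix(c(2, 1, 1, 1.5), 2, 2)
pc <- prcomp(toy, scale. = TRUE)
## Columns of pc$rotation are the PC directions; their cross product is
## the identity matrix, confirming orthogonality
round(t(pc$rotation) %*% pc$rotation, 10)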
Here we use the following data-generating process:
Set the number of predictors to p = 2 and draw X ∼ Np(μ, Σ), where Σ is the covariance matrix and μ = (0, 10). The true coefficients are drawn randomly from the interval β ∈ [0, 0.1], and the errors follow a normal distribution ε ∼ N(0, 1). The number of observations is n = 100.
We choose p = 2 to make the results easier to illustrate.
library(pls)
pcr.fit <- pcr(y.train ~ x.train, data = data.train, scale = TRUE, validation = "CV")
We set scale = TRUE to standardize the predictors, so that high-variance variables do not dominate the principal components.
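To see the effect, a small sketch comparing the variance explained with and without scaling (it assumes the training objects built in the code below; explvar() is from the pls package):
## Sketch: effect of scale = TRUE on the variance explained per component
pcr.unscaled <- pcr(y.train ~ x.train, data = data.train,
                    scale = FALSE, validation = "CV")
explvar(pcr.fit)       # % of X-variance per component, standardized X
explvar(pcr.unscaled)  # without scaling, higher-variance predictors weigh more in PC1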
We can call pcr.fit$loadings or pcr.fit$projection to inspect the factor loadings. These give the weights of the linear combinations of the demeaned Xs; applying them yields the principal component scores of every observation on every component.
Note that, for standardized data, the factor loadings equal the directions of the principal components.
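As a check, a sketch assuming pcr.fit and x.train from the code below (base scale() centers and scales by the sample sd, which matches scale = TRUE in pcr):
## Sketch: PC scores are linear combinations of the centered, scaled Xs
x.std <- scale(x.train)               # center and scale each column
manual <- x.std %*% pcr.fit$loadings  # manual score computation
all.equal(unclass(pcr.fit$scores), unclass(manual),
          check.attributes = FALSE)   # should be TRUE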
For the original (unstandardized) data, the principal component directions can be computed from its variance-covariance matrix: the eigenvectors of that matrix are the PC directions. (The variance-covariance matrix of the standardized data likewise gives the same directions as the loadings of the standardized data.)
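A quick check of this equivalence (again assuming the objects from the code below; eigenvectors match the loadings only up to sign):
## Sketch: eigenvectors of the covariance of standardized data = loadings up to sign
ev <- eigen(cov(scale(x.train)))$vectors
round(abs(ev) - abs(unclass(pcr.fit$loadings)), 10)  # approximately all zeros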
In the plots below, the green line is the 1st PC direction and the red dashed line is the 2nd PC direction. The 1st PC reflects the correlation between X1 and X2 and captures most of their information: it points in the direction along which the points (X1i, X2i) vary the most, so the projections of the observations onto it have the highest variance among all possible directions in this 2-dimensional space. Consistent with this, the 1st PC is highly correlated with both X1 and X2 and can represent them, while the 2nd PC captures little additional information:
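This claim can be checked numerically, a sketch using first.com, second.com and the standardized predictors constructed in the code below:
## Sketch: correlation of each component with the predictors
cor(first.com, x1.train.sd); cor(first.com, x2.train.sd)    # both large in magnitude
cor(second.com, x1.train.sd); cor(second.com, x2.train.sd)  # both smaller in magnitude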
R code:
library(MASS)
set.seed(1)
n <- 100                                  # number of observations
p <- 2                                    # number of predictors
mu <- c(0, 10)                            # mean vector of X
sigma <- matrix(c(2, 1, 1, 1.5), nrow = p, ncol = p, byrow = TRUE)  # covariance of X
beta_range <- seq(0, 0.1, by = 0.001)
beta <- sample(beta_range, p)             # true coefficients drawn from [0, 0.1]
e.mu <- 0                                 # error mean
e.sigma <- 1                              # error standard deviation
dgp <- function(n, beta, mu, sigma){
  x <- mvrnorm(n, mu, sigma)              # draw X ~ N_p(mu, Sigma)
  error <- rnorm(n, e.mu, e.sigma)
  y <- x %*% beta + error                 # linear model with noise
  data <- data.frame(y, x)
  return(data)
}
data.train <- dgp(n, beta, mu, sigma)     # training sample
x1.train <- data.train[, 2]
x2.train <- data.train[, 3]
x.train <- cbind(x1.train, x2.train)
y.train <- data.train[, 1]
data.test <- dgp(n, beta, mu, sigma)      # independent test sample
x1.test <- data.test[, 2]
x2.test <- data.test[, 3]
x.test <- cbind(x1.test, x2.test)
y.test <- data.test[, 1]
#### For question (a)
library(pls)
pcr.fit <- pcr(y.train ~ x.train, data = data.train,
               scale = TRUE, validation = "CV")
first.com.cof <- c(pcr.fit$loadings[1, 1], pcr.fit$loadings[2, 1])   # 1st PC loadings
second.com.cof <- c(pcr.fit$loadings[1, 2], pcr.fit$loadings[2, 2])  # 2nd PC loadings
x1.train.sd <- x1.train / sd(x1.train)    # scale each predictor by its sd
x2.train.sd <- x2.train / sd(x2.train)
x.train.sd <- cbind(x1.train.sd, x2.train.sd)
## PC scores: loadings applied to the centered, scaled predictors
first.com <- first.com.cof[1] * (x1.train.sd - mean(x1.train.sd)) +
  first.com.cof[2] * (x2.train.sd - mean(x2.train.sd))
second.com <- second.com.cof[1] * (x1.train.sd - mean(x1.train.sd)) +
  second.com.cof[2] * (x2.train.sd - mean(x2.train.sd))
## Standardized data: eigenvectors of its covariance give the PC directions
ei <- eigen(cov(x.train.sd))$vectors
pl <- seq(-10, 10, 1)
## Each PC line passes through the centroid of the data
ply <- (pl - mean(x1.train.sd)) * ei[2, 1] / ei[1, 1] + mean(x2.train.sd)
ply2 <- (pl - mean(x1.train.sd)) * ei[2, 2] / ei[1, 2] + mean(x2.train.sd)
### plot
plot(x1.train.sd, x2.train.sd, asp = 1, main = "Standardized Data")
lines(pl, ply, col = "darkgreen", lwd = 2)   # 1st PC direction
lines(pl, ply2, col = "red", lty = 2)        # 2nd PC direction
### Original data: same construction with the covariance of the raw predictors
ei <- eigen(cov(x.train))$vectors
pl <- seq(-10, 10, 1)
ply <- (pl - mean(x1.train)) * ei[2, 1] / ei[1, 1] + mean(x2.train)
ply2 <- (pl - mean(x1.train)) * ei[2, 2] / ei[1, 2] + mean(x2.train)
### plot
plot(x1.train, x2.train, asp = 1, main = "Original Data")
lines(pl, ply, col = "darkgreen", lwd = 2)   # 1st PC direction
lines(pl, ply2, col = "red", lty = 2)        # 2nd PC direction
################################
plot(first.com, x1.train.sd, pch = 15, col = "red",
     main = "First Component vs. X1")
plot(first.com, x2.train.sd, pch = 16, col = "red",
     main = "First Component vs. X2")
plot(second.com, x1.train.sd, pch = 17, col = "red",
     main = "Second Component vs. X1")
plot(second.com, x2.train.sd, pch = 18, col = "red",
     main = "Second Component vs. X2")