Chapter 6 - Linear Model Selection and Regularization - Question 10

(a)

In [189]:
set.seed(1)
#creating coefficient vector
#beta=sample(1:20,20,replace=TRUE)
beta=rnorm(20)
beta[sample(1:20,5)]=0
#creating X variables matrix
n=1000
p=20
X = matrix(rnorm(n*p),n,p)
#error
err = rnorm(n)
#model
Y = X%*%beta + err
#dataset
dataset=data.frame(Y,X)

(b)

In [190]:
set.seed(1)
train = sample(nrow(dataset),100,replace=FALSE)

(c)

In [191]:
library(leaps)
set.seed(1)
bestsubset.model = regsubsets(Y~.,data=dataset,nvmax=20,subset=train)
bestsubset.summary = summary(bestsubset.model)
In [192]:
bestsubset.summary
Subset selection object
Call: regsubsets.formula(Y ~ ., data = dataset, nvmax = 20, subset = train)
20 Variables  (and intercept)
    Forced in Forced out
X1      FALSE      FALSE
X2      FALSE      FALSE
X3      FALSE      FALSE
X4      FALSE      FALSE
X5      FALSE      FALSE
X6      FALSE      FALSE
X7      FALSE      FALSE
X8      FALSE      FALSE
X9      FALSE      FALSE
X10     FALSE      FALSE
X11     FALSE      FALSE
X12     FALSE      FALSE
X13     FALSE      FALSE
X14     FALSE      FALSE
X15     FALSE      FALSE
X16     FALSE      FALSE
X17     FALSE      FALSE
X18     FALSE      FALSE
X19     FALSE      FALSE
X20     FALSE      FALSE
1 subsets of each size up to 20
Selection Algorithm: exhaustive
          X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17
1  ( 1 )  " " " " " " " " " " " " " " " " " " " " " " " " " " "*" " " " " " "
2  ( 1 )  " " " " " " "*" " " " " " " " " " " " " " " " " " " "*" " " " " " "
3  ( 1 )  " " " " " " "*" " " " " " " " " " " " " "*" " " " " "*" " " " " " "
4  ( 1 )  " " " " " " "*" " " " " " " " " " " " " "*" " " " " "*" " " " " " "
5  ( 1 )  " " " " " " "*" " " "*" " " " " " " " " "*" " " " " "*" " " " " " "
6  ( 1 )  "*" " " " " "*" " " "*" " " " " " " " " "*" " " " " "*" " " " " " "
7  ( 1 )  "*" " " " " "*" " " "*" " " " " " " " " "*" " " " " "*" " " " " " "
8  ( 1 )  "*" " " "*" "*" " " "*" " " " " " " " " "*" " " " " "*" " " " " " "
9  ( 1 )  "*" " " "*" "*" " " "*" " " "*" " " " " "*" " " " " "*" " " " " " "
10  ( 1 ) "*" " " "*" "*" " " "*" " " "*" " " " " "*" "*" " " "*" " " " " " "
11  ( 1 ) "*" " " "*" "*" " " "*" " " "*" " " " " "*" "*" " " "*" " " " " " "
12  ( 1 ) "*" " " "*" "*" " " "*" "*" "*" " " " " "*" "*" " " "*" " " " " " "
13  ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" " " " " " "
14  ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" " " "*" " "
15  ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" " " " " "*" "*" "*" "*" " " "*" " "
16  ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*" "*" " "
17  ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" " "
18  ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" " "
19  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" " "
20  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
          X18 X19 X20
1  ( 1 )  " " " " " "
2  ( 1 )  " " " " " "
3  ( 1 )  " " " " " "
4  ( 1 )  "*" " " " "
5  ( 1 )  "*" " " " "
6  ( 1 )  "*" " " " "
7  ( 1 )  "*" "*" " "
8  ( 1 )  "*" "*" " "
9  ( 1 )  "*" "*" " "
10  ( 1 ) "*" "*" " "
11  ( 1 ) "*" "*" "*"
12  ( 1 ) "*" "*" "*"
13  ( 1 ) "*" "*" "*"
14  ( 1 ) "*" "*" "*"
15  ( 1 ) "*" "*" "*"
16  ( 1 ) "*" "*" "*"
17  ( 1 ) "*" "*" "*"
18  ( 1 ) "*" "*" "*"
19  ( 1 ) "*" "*" "*"
20  ( 1 ) "*" "*" "*"
In [193]:
mse=rep(NA,20)
for(i in 1:20){
    coef_vec = coef(bestsubset.model,id=i)
    #include the fitted intercept coef_vec[1]; dropping it biases the predictions
    pred = coef_vec[1] + as.matrix(dataset[train,names(coef_vec)[-1]])%*%coef_vec[-1]
    mse[i]=mean((pred-dataset$Y[train])^2)
}
In [194]:
plot(mse,ylab="mean squared error",xlab="number of variables",type="b",main="Training MSE")

(d)

In [195]:
mse=rep(NA,20)
for(i in 1:20){
    coef_vec = coef(bestsubset.model,id=i)
    #again include the fitted intercept coef_vec[1]
    pred = coef_vec[1] + as.matrix(dataset[-train,names(coef_vec)[-1]])%*%coef_vec[-1]
    mse[i]=mean((pred-dataset$Y[-train])^2)
}
In [196]:
plot(mse,ylab="mean squared error",xlab="number of variables",type="b",main="Test MSE")

(e)

In [197]:
#The size of the model giving the least MSE on test data
which.min(mse)
13
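The MSE loops above rebuild the model matrix by hand and must handle the intercept explicitly. A common alternative, sketched here under the assumption that the fit came from a formula call as above, is to define a small `predict()` method for `regsubsets` objects so the intercept and coefficient subsetting are handled in one place (`object$call[[2]]` recovers the formula from the fitted object):

```r
# A predict() method for regsubsets objects; leaps does not provide one.
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])   # recover Y ~ . from the original call
  mat   <- model.matrix(form, newdata)    # design matrix, intercept column included
  coefi <- coef(object, id = id)          # coefficients of the id-variable model
  drop(mat[, names(coefi)] %*% coefi)     # intercept handled automatically
}
```

With this helper, the test-set predictions for the 13-variable model would be `predict(bestsubset.model, dataset[-train, ], id = 13)`.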

(f)

In [202]:
which(beta==0)
coef(bestsubset.model,id=13)
  1. 9
  2. 10
  3. 13
  4. 15
  5. 17
(Intercept)
-0.200436940618195
X1
-0.631896678963695
X3
-0.659336936437764
X4
1.65020505236265
X5
0.17769052951172
X6
-0.848233494430316
X7
0.351852029213562
X8
0.738211375618172
X11
1.42329203766404
X12
0.46204606551803
X14
-2.43818349501597
X18
0.826338866677508
X19
0.780614348262876
X20
0.490329972496041

The 13-variable model excludes all five predictors whose true coefficients are zero (X9, X10, X13, X15, X17). It also leaves out X2 and X16, whose true coefficients are nonzero but small in magnitude.
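This can be checked programmatically. A short sketch, assuming `beta` and `bestsubset.model` from the cells above are still in the workspace:

```r
zero.vars <- paste0("X", which(beta == 0))               # predictors with true coefficient 0
chosen    <- names(coef(bestsubset.model, id = 13))[-1]  # predictors in the 13-variable fit
intersect(zero.vars, chosen)  # character(0): matches the output above; no true zeros selected
setdiff(paste0("X", 1:20), c(chosen, zero.vars))         # nonzero predictors left out (X2, X16)
```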

(g)

In [245]:
b = beta
b.hat = data.frame(b)
rownames(b.hat) = names(coef(bestsubset.model,id=20))[-1]
val = rep(NA,20)
for(i in 1:20){
    coef_vec = coef(bestsubset.model,id=i)
    #rows whose predictor appears in the i-variable model
    in.model = rownames(b.hat) %in% names(coef_vec)
    b.hat[in.model,] = coef_vec[-1]
    b.hat[!in.model,] = 0
    val[i] = sqrt(sum((b-b.hat)^2))
}
plot(val,ylab="sqrt of sum of squared error of coefficients",xlab="no. of variables",type="b")
In [238]:
which.min(val)
13

The curve looks similar to the test-MSE curve in (d): the coefficient error is also minimized at the 13-variable model.
