Chapter 5: Resampling Methods, Exercise 8

(a)

In [2]:
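set.seed(1)
# This first y is overwritten below; it is kept because it advances the
# random number stream, so removing it would change x and every result.
y = rnorm(100)
x = rnorm(100)
# Quadratic signal plus standard normal noise
y = x-2*x^2+rnorm(100)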

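n is the number of observations in the dataset. p is the number of predictors used in the model.

n = 100, p = 2 (the terms x and x^2, both built from the single variable x)

In equation form the model is y = x - 2*x^2 + error, where the error term is standard normal noise.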

(b)

In [3]:
plot(x,y)

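The scatterplot of y against x shows an inverted parabola, so the relationship is clearly nonlinear. The curve peaks near x = 0.25, the vertex of x - 2*x^2.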

(c)

i

In [15]:
set.seed(1)
library(boot)
glm.model = glm(y~x,data=data.frame(x,y))
cv.err = cv.glm(data.frame(x,y),glm.model)
cv.err$delta
  1. 5.89097855988843
  2. 5.88881215196093

ii

In [16]:
glm.model = glm(y~x+I(x^2),data=data.frame(x,y))
cv.err = cv.glm(data.frame(x,y),glm.model)
cv.err$delta
  1. 1.0865955642745
  2. 1.08632580328877

iii

In [17]:
glm.model = glm(y~x+I(x^2)+I(x^3),data=data.frame(x,y))
cv.err = cv.glm(data.frame(x,y),glm.model)
cv.err$delta
  1. 1.10258509387339
  2. 1.10222658385953

iv

In [18]:
glm.model = glm(y~x+I(x^2)+I(x^3)+I(x^4),data=data.frame(x,y))
cv.err = cv.glm(data.frame(x,y),glm.model)
cv.err$delta
  1. 1.11477226814507
  2. 1.11433406148513

(d)

In [21]:
set.seed(2)
for(i in 1:4){
    glm.model = glm(y~poly(x,i),data=data.frame(x,y))
    cv.err = cv.glm(data.frame(x,y),glm.model)
    print(i)
    print(cv.err$delta)
}
[1] 1
[1] 5.890979 5.888812
[1] 2
[1] 1.086596 1.086326
[1] 3
[1] 1.102585 1.102227
[1] 4
[1] 1.114772 1.114334

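The LOOCV error estimates are identical to those in (c), even though the seed changed. LOOCV involves no random splitting: each of the n observations is held out exactly once and the model is fit on the remaining n - 1, so the result is deterministic. Using poly(x,i) instead of raw powers does not change anything either, since both span the same polynomial space and give the same fitted values.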
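To make the point about determinism concrete, here is a minimal sketch of LOOCV written out by hand. The helper name `loocv_mse` is hypothetical (it is not part of the boot package), and the data are regenerated as in part (a) so the block runs on its own. It computes the quadratic model's LOOCV error under two different seeds and checks that the two values are the same.

```r
# Regenerate the simulated data exactly as in part (a).
set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
dat <- data.frame(x, y)

# Hypothetical helper: hold out each row once, refit on the other n - 1 rows,
# and average the squared prediction errors on the held-out rows.
loocv_mse <- function(data, degree) {
  n <- nrow(data)
  sq_err <- numeric(n)
  for (j in seq_len(n)) {
    fit <- lm(y ~ poly(x, degree, raw = TRUE), data = data[-j, ])
    pred <- predict(fit, newdata = data[j, , drop = FALSE])
    sq_err[j] <- (data$y[j] - pred)^2
  }
  mean(sq_err)
}

# The seed is changed between the two calls, but nothing in the loop is random.
set.seed(1)
a <- loocv_mse(dat, 2)
set.seed(2)
b <- loocv_mse(dat, 2)
print(identical(a, b))
```

The loop visits rows 1 to n in a fixed order and never draws a random number, which is why the seed has no influence. With k-fold cross-validation for k < n, the fold assignment is random and the seed would matter.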

(e)

As expected, the quadratic model (ii) has the smallest LOOCV error (1.087, against 5.891 for the linear fit, 1.103 for the cubic and 1.115 for the quartic). This matches the true model used to simulate the data, which is quadratic in x.
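The comparison across degrees can also be sketched without refitting the model n times. For least-squares fits, the LOOCV error equals the mean of (residual_i / (1 - h_i))^2, where h_i is the leverage of observation i. The helper name `loocv_shortcut` below is an illustrative assumption, and the data are regenerated as in part (a).

```r
# Regenerate the simulated data exactly as in part (a).
set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

# Hypothetical helper: LOOCV error from a single least-squares fit,
# using the leverage values h_i returned by hatvalues().
loocv_shortcut <- function(degree) {
  fit <- lm(y ~ poly(x, degree))
  h <- hatvalues(fit)
  mean((residuals(fit) / (1 - h))^2)
}

# LOOCV error for polynomial degrees 1 to 4, and the degree that minimises it.
cv_by_degree <- sapply(1:4, loocv_shortcut)
best_degree <- which.min(cv_by_degree)
print(round(cv_by_degree, 4))
print(best_degree)
```

This shortcut applies only to linear models fit by least squares, which is the case for all four models here.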

(f)

In [22]:
# glm.model still holds the degree-4 fit from the last iteration of the loop in (d)
summary(glm.model)
Call:
glm(formula = y ~ poly(x, i), data = data.frame(x, y))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8914  -0.5244   0.0749   0.5932   2.7796  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.8277     0.1041 -17.549   <2e-16 ***
poly(x, i)1   2.3164     1.0415   2.224   0.0285 *  
poly(x, i)2 -21.0586     1.0415 -20.220   <2e-16 ***
poly(x, i)3  -0.3048     1.0415  -0.293   0.7704    
poly(x, i)4  -0.4926     1.0415  -0.473   0.6373    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 1.084654)

    Null deviance: 552.21  on 99  degrees of freedom
Residual deviance: 103.04  on 95  degrees of freedom
AIC: 298.78

Number of Fisher Scoring iterations: 2

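This summary is for the degree-4 fit. The quadratic term is highly significant (p < 2e-16) and the linear term is significant at the 5% level (p = 0.0285), while the cubic and quartic terms are not (p = 0.77 and 0.64). This agrees with the cross-validation results, where adding terms beyond degree 2 did not reduce the LOOCV error.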
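Since part (f) concerns each of the fitted models, one way to inspect all four at once is sketched below. It refits degrees 1 to 4 with lm (which gives the same least-squares t-tests as the default gaussian glm) and collects the coefficient p-values. The names `p_values` and `pv4` are illustrative, and the data are regenerated as in part (a).

```r
# Regenerate the simulated data exactly as in part (a).
set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

# For each degree, fit the polynomial model and keep the p-value column
# of the coefficient table (intercept first, then one row per degree).
p_values <- lapply(1:4, function(d) {
  fit <- lm(y ~ poly(x, d))
  coef(summary(fit))[, "Pr(>|t|)"]
})

for (d in 1:4) {
  cat("degree", d, "\n")
  print(round(p_values[[d]], 4))
}

# p-values for the degree-4 fit: intercept, linear, quadratic, cubic, quartic
pv4 <- p_values[[4]]
```

Because poly() builds orthogonal polynomials by default, the estimate and p-value for a given term are not affected by the presence of higher-order terms in the way raw powers would be, which makes the per-term tests easier to compare across the four fits.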