Chapter 5: Resampling Methods, Q8
set.seed(1)
y = rnorm(100)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)   # true model is quadratic in x
Here n is the number of observations and p is the number of predictors used to generate the data: n = 100 and p = 2. The data-generating model is y = x - 2*x^2 + error.
plot(x,y)
The scatterplot of y against x shows a clear parabolic (inverted-U) shape, consistent with the quadratic data-generating model.
set.seed(1)
library(boot)
# (i) Linear model
glm.model = glm(y ~ x, data = data.frame(x, y))
cv.err = cv.glm(data.frame(x, y), glm.model)
cv.err$delta

# (ii) Quadratic model
glm.model = glm(y ~ x + I(x^2), data = data.frame(x, y))
cv.err = cv.glm(data.frame(x, y), glm.model)
cv.err$delta

# (iii) Cubic model
glm.model = glm(y ~ x + I(x^2) + I(x^3), data = data.frame(x, y))
cv.err = cv.glm(data.frame(x, y), glm.model)
cv.err$delta

# (iv) Quartic model
glm.model = glm(y ~ x + I(x^2) + I(x^3) + I(x^4), data = data.frame(x, y))
cv.err = cv.glm(data.frame(x, y), glm.model)
cv.err$delta
set.seed(2)   # a different random seed than before
for (i in 1:4) {
  glm.model = glm(y ~ poly(x, i), data = data.frame(x, y))
  cv.err = cv.glm(data.frame(x, y), glm.model)
  print(i)
  print(cv.err$delta)
}
The LOOCV error estimates are identical to those in (c), even though a different seed was used. This is because LOOCV involves no random splitting of the data: each observation is held out exactly once, so the procedure is fully deterministic and does not depend on the random seed.
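To make the determinism concrete, LOOCV for the linear fit can be computed by hand without cv.glm. A minimal sketch (the variable names loocv.err, fit, and pred are illustrative):

```r
# Manual LOOCV for the linear model y ~ x: each observation is
# left out exactly once, so no random sampling is involved.
dat = data.frame(x, y)
loocv.err = numeric(100)
for (i in 1:100) {
  fit = glm(y ~ x, data = dat[-i, ])            # fit on the other n - 1 points
  pred = predict(fit, newdata = dat[i, ])       # predict the held-out point
  loocv.err[i] = (dat$y[i] - pred)^2            # squared error on that point
}
mean(loocv.err)
```

Running this repeatedly, with any seed, always returns the same value, which should match the raw LOOCV estimate cv.err$delta[1] reported by cv.glm for the linear fit.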
As expected, the quadratic model (ii) has the smallest LOOCV error. This is because the true model used to generate the simulated data is quadratic in x, so the cubic and quartic terms add variance without reducing bias.
summary(glm.model)   # glm.model is the degree-4 fit from the last loop iteration
In the summary of the degree-4 fit, only the coefficients of the degree-1 and degree-2 terms have p-values close to zero and are statistically significant; the cubic and quartic terms are not. These results agree with the cross-validation errors, which also favor the quadratic model.