Chapter 5, Resampling Methods - Q5

(a)

In [4]:
library(ISLR)
glm.model = glm(default~income+balance,data=Default,family=binomial)
glm.model
Call:  glm(formula = default ~ income + balance, family = binomial, 
    data = Default)

Coefficients:
(Intercept)       income      balance  
 -1.154e+01    2.081e-05    5.647e-03  

Degrees of Freedom: 9999 Total (i.e. Null);  9997 Residual
Null Deviance:	    2921 
Residual Deviance: 1579 	AIC: 1585

(b)

i

In [30]:
set.seed(1)
train = sample(nrow(Default),8000)

ii

In [31]:
glm.model = glm(default~income+balance, data=Default, family=binomial, subset=train)

iii

In [36]:
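# predicted probability of default for the held-out (validation) rows
glm.prob = predict(glm.model,newdata=Default[-train,], type="response") 
# classify as "Yes" when the predicted probability exceeds 0.5
glm.pred = rep("No",length(glm.prob))
glm.pred[glm.prob>0.5]="Yes"
# validation set error rate, in percent
sum(glm.pred!=Default[-train,]$default)/length(glm.pred)*100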
2.85
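
The error-rate line above can equivalently be written with mean(), and a confusion table shows which kind of mistake is being made. A minimal sketch on made-up probabilities; the helper name error_rate_pct and the toy vectors are illustrative assumptions, not part of the original notebook or the Default data:

```r
# Hypothetical helper (not in the original notebook): classify at a
# probability threshold and return the misclassification rate in percent.
error_rate_pct = function(prob, truth, threshold = 0.5) {
    pred = ifelse(prob > threshold, "Yes", "No")
    mean(pred != truth) * 100
}

# Toy inputs, invented for illustration only
prob  = c(0.9, 0.2, 0.6, 0.1)
truth = factor(c("Yes", "No", "No", "No"), levels = c("No", "Yes"))

# One of the four toy observations is misclassified
error_rate_pct(prob, truth)

# Confusion table: rows are predictions, columns are the true labels
table(predicted = ifelse(prob > 0.5, "Yes", "No"), actual = truth)
```

With the notebook's objects, the same call would be error_rate_pct(glm.prob, Default[-train,]$default).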

(c)

In [37]:
for(i in 2:4){
    set.seed(i)
    train = sample(nrow(Default),8000)
    glm.model = glm(default~income+balance,data=Default,family=binomial, subset=train)
    glm.prob = predict(glm.model,newdata=Default[-train,],type="response")
    glm.pred = rep("No",length(glm.prob))
    glm.pred[glm.prob>0.5]="Yes"
    print(sum(glm.pred!=Default[-train,]$default)/length(glm.pred)*100)
}
[1] 2.15
[1] 2.55
[1] 2.85

In [38]:
(2.15+2.55+2.85)/3
2.51666666666667

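Across the three additional splits (seeds 2 to 4), the validation error rate ranges from 2.15% to 2.85% and averages about 2.52%. The estimate therefore depends noticeably on which observations fall in the training set and which in the validation set.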
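
The repeated-split procedure in (c) can be wrapped in a function so that the spread across splits is summarised alongside the mean. This is a sketch under stated assumptions: val_error_pct is a hypothetical helper name, the training rows are selected with data[train, ] rather than subset= (the fit is the same, but it avoids scoping surprises inside a function), and the example only runs if the ISLR package is installed.

```r
# Hypothetical helper: one validation-set error estimate (percent) for a seed.
# `data` must contain a Yes/No factor response named `default`.
val_error_pct = function(seed, data, form = default ~ income + balance,
                         n_train = 8000) {
    set.seed(seed)
    train = sample(nrow(data), n_train)
    fit = glm(form, data = data[train, ], family = binomial)
    prob = predict(fit, newdata = data[-train, ], type = "response")
    pred = ifelse(prob > 0.5, "Yes", "No")
    mean(pred != data[-train, ]$default) * 100
}

# Run on Default only when ISLR is available
if (requireNamespace("ISLR", quietly = TRUE)) {
    data(Default, package = "ISLR")
    errs = sapply(2:4, val_error_pct, data = Default)
    print(errs)
    print(c(mean = mean(errs), sd = sd(errs), min = min(errs), max = max(errs)))
}
```

Reporting the standard deviation or range next to the mean makes it easier to judge whether a later change in error rate is larger than the split-to-split noise.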

(d)

In [45]:
set.seed(1)
train = sample(nrow(Default),8000)
glm.model = glm(default~income+balance+student,data=Default,family=binomial,subset=train)
glm.prob = predict(glm.model,newdata=Default[-train,],type="response")
glm.pred = rep("No",length(glm.prob))
glm.pred[glm.prob>0.5]="Yes"
#computing the error rate
sum(glm.pred!=Default[-train,]$default)/length(glm.pred)*100
2.5

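On the same split as in (b) (seed 1), including the student dummy variable gives a validation error rate of 2.5%, compared with 2.85% without it, a difference of 7 of the 2,000 validation observations. That gap is within the spread seen across splits in (c) (2.15% to 2.85%), so adding student does not clearly reduce the test error rate.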
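
A single split gives only one comparison, so one way to check this conclusion is to evaluate both models on the same sequence of splits and look at the paired differences. The sketch below is an assumption-laden illustration: the helper names split_error_pct and compare_on_splits are hypothetical, the choice of ten seeds is arbitrary, and the Default example only runs if the ISLR package is installed.

```r
# Hypothetical helper: validation error (percent) for one formula on one split.
split_error_pct = function(form, data, train) {
    fit = glm(form, data = data[train, ], family = binomial)
    prob = predict(fit, newdata = data[-train, ], type = "response")
    pred = ifelse(prob > 0.5, "Yes", "No")
    mean(pred != data[-train, ]$default) * 100
}

# Evaluate two formulas on identical random splits so the comparison is paired.
compare_on_splits = function(data, seeds, n_train, form_a, form_b) {
    out = t(sapply(seeds, function(s) {
        set.seed(s)
        train = sample(nrow(data), n_train)
        c(seed = s,
          a = split_error_pct(form_a, data, train),
          b = split_error_pct(form_b, data, train))
    }))
    as.data.frame(out)
}

if (requireNamespace("ISLR", quietly = TRUE)) {
    data(Default, package = "ISLR")
    res = compare_on_splits(Default, seeds = 1:10, n_train = 8000,
                            form_a = default ~ income + balance,
                            form_b = default ~ income + balance + student)
    print(res)
    # Positive values mean the model with student had the higher error
    print(summary(res$b - res$a))
}
```

If the differences scatter around zero, that supports the reading that student adds little predictive value beyond income and balance.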
