Chapter 8: Tree-Based Methods - Question 11

In [58]:
library(ISLR)
In [59]:
caravan_dataset = Caravan
p = rep(0,nrow(caravan_dataset))
p[caravan_dataset$Purchase=="Yes"]=1
caravan_dataset$Purchase = p

(a)

In [60]:
set.seed(1)
train = 1:1000

(b)

In [61]:
library(gbm)
boost.model = gbm(Purchase~.,data=caravan_dataset[train,],shrinkage=0.01,n.trees=1000,distribution="bernoulli")
Warning message in gbm.fit(x, y, offset = offset, distribution = distribution, w = w, :
“variable 50: PVRAAUT has no variation.”
Warning message in gbm.fit(x, y, offset = offset, distribution = distribution, w = w, :
“variable 71: AVRAAUT has no variation.”
In [62]:
summary(boost.model)
var       rel.inf
PPERSAUT  14.6350478
MKOOPKLA   9.4709165
MOPLHOOG   7.3145742
MBERMIDD   6.0865197
PBRAND     4.6676612
MGODGE     4.4946326
ABRAND     4.3242776
MINK3045   4.1759062
MOSTYPE    2.8640258
PWAPART    2.7819107
MAUT1      2.6192915
MBERARBG   2.1048051
MSKA       2.1018515
MAUT2      2.0217251
MSKC       1.9868434
MINKGEM    1.9212271
MGODPR     1.9177754
MBERHOOG   1.8071062
MGODOV     1.7869391
PBYSTAND   1.5727959
MSKB1      1.4355140
MFWEKIND   1.3726426
MRELGE     1.2080518
MOPLMIDD   0.9379197
MINK7512   0.9259072
MINK4575   0.9174599
MGODRK     0.9076554
MFGEKIND   0.8574537
MZPART     0.8253107
MRELOV     0.8073125
PAANHANG   0
PTRACTOR   0
PWERKT     0
PBROM      0
PPERSONG   0
PGEZONG    0
PWAOREG    0
PZEILPL    0
PPLEZIER   0
PFIETS     0
PINBOED    0
AWAPART    0
AWABEDR    0
AWALAND    0
ABESAUT    0
AMOTSCO    0
AVRAAUT    0
AAANHANG   0
ATRACTOR   0
AWERKT     0
ABROM      0
ALEVEN     0
APERSONG   0
AGEZONG    0
AWAOREG    0
AZEILPL    0
APLEZIER   0
AFIETS     0
AINBOED    0
ABYSTAND   0

The most important predictors are PPERSAUT, MKOOPKLA, and MOPLHOOG.

(c)

Boosting

In [74]:
yhat = predict(boost.model,newdata=caravan_dataset[-train,],n.trees=1000,type="response")
purchase.pred = rep("No",length(yhat))
purchase.pred[yhat>0.2]="Yes"
table(purchase.pred,Caravan$Purchase[-train])
             
purchase.pred   No  Yes
          No  4410  256
          Yes  123   33
In [102]:
#Fraction of people predicted to make a purchase who in fact make one (precision).
33/(123+33)
0.211538461538462
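
The same fraction can be read directly from the confusion matrix object, so the cell counts need not be typed in by hand. A small sketch, assuming the `purchase.pred` vector from the cell above (rows of the table are predictions, columns the observed classes):

```r
# Rebuild the confusion matrix and compute precision programmatically.
conf = table(purchase.pred, Caravan$Purchase[-train])
# True positives divided by all predicted positives: 33 / (123 + 33)
conf["Yes", "Yes"] / sum(conf["Yes", ])
```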

KNN

In [118]:
#knn
library(class)
std.x = scale(Caravan[,-86])
train.x = std.x[train,]
train.y = Caravan[train,86]
test.x = std.x[-train,]
In [119]:
set.seed(1)
knn.pred = knn(train.x,test.x,train.y,k=5)
table(knn.pred,Caravan$Purchase[-train])
        
knn.pred   No  Yes
     No  4506  279
     Yes   27   10
In [120]:
#Fraction of people predicted to make a purchase who in fact make one (precision).
10/(27+10)
0.27027027027027

Logistic Regression

In [125]:
#Logistic Regression
glm.model = glm(Purchase~.,data=Caravan,family=binomial,subset=train)
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
In [133]:
yhat = predict(glm.model,newdata=Caravan[-train,],type="response")
glm.pred = rep("No",length(yhat))
glm.pred[yhat>=0.5] = "Yes"
table(glm.pred,Caravan$Purchase[-train])
Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
“prediction from a rank-deficient fit may be misleading”
        
glm.pred   No  Yes
     No  4446  274
     Yes   87   15
In [135]:
#Fraction of people predicted to make a purchase who in fact make one (precision).
15/(15+87)
0.147058823529412

Results

Boosting: 21.2% of predicted purchasers actually purchase.

KNN: 27.0% of predicted purchasers actually purchase.

Logistic regression: 14.7% of predicted purchasers actually purchase.

KNN gives the highest precision here, though it flags far fewer customers as likely purchasers (37) than boosting does (156).
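
The boosting result above uses the 20% probability cutoff specified in the exercise. As a possible follow-up sketch (assuming `boost.model`, `caravan_dataset`, and `train` from the cells above), precision can be examined across several cutoffs to see the trade-off between how many customers are flagged and how often the flag is correct:

```r
# Held-out predicted purchase probabilities from the boosted model.
probs = predict(boost.model, newdata = caravan_dataset[-train, ],
                n.trees = 1000, type = "response")
actual = Caravan$Purchase[-train]
for (cutoff in c(0.1, 0.2, 0.3, 0.5)) {
  pred = ifelse(probs > cutoff, "Yes", "No")
  tp = sum(pred == "Yes" & actual == "Yes")  # true positives
  fp = sum(pred == "Yes" & actual == "No")   # false positives
  cat(sprintf("cutoff %.1f: precision %.3f (%d flagged)\n",
              cutoff, tp / (tp + fp), tp + fp))
}
```

Raising the cutoff generally flags fewer customers; whether precision improves depends on how well calibrated the predicted probabilities are.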