Chapter 8 Tree-Based Methods - Question 8

In [1]:
library(ISLR)
In [2]:
names(Carseats)
 [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population"
 [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"
[11] "US"

a

In [14]:
set.seed(1)
train = sample(1:nrow(Carseats),nrow(Carseats)/2)
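The held-out half of the data serves as the test set. A minimal sketch for reference (the cells below simply index Carseats[-train,] directly instead of creating these objects):

carseats.test = Carseats[-train,]      # test predictors
sales.test = Carseats$Sales[-train]    # test response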

b

In [15]:
library(tree)
carseats.tree = tree(Sales~.,data=Carseats,subset=train)
In [16]:
plot(carseats.tree)
text(carseats.tree,pretty=0)
In [17]:
#Finding the mean squared error
carseats.pred = predict(carseats.tree,newdata=Carseats[-train,])
mean((carseats.pred-Carseats$Sales[-train])^2)
4.14889745049246

c

In [20]:
cv.carseats.tree = cv.tree(carseats.tree)
cv.carseats.tree
$size
 [1] 18 17 16 15 14 12 11 10  9  8  7  6  5  4  3  1

$dev
 [1] 1013.2727  995.5856  995.5856 1040.5112 1040.5112  979.9660  983.3205
 [8]  991.4825  991.3305 1024.1369 1018.1733 1038.2824 1124.6938 1127.3483
[15] 1236.5701 1562.7692

$k
 [1]      -Inf  15.48181  15.53599  18.69038  18.74886  21.05038  23.79480
 [8]  25.78579  26.01210  30.10435  32.74801  53.28569  72.33061  78.19599
[15] 141.73781 251.22901

$method
[1] "deviance"

attr(,"class")
[1] "prune"         "tree.sequence"
In [21]:
plot(cv.carseats.tree$size,cv.carseats.tree$dev,type="b")
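Rather than reading the minimum off the plot, the optimal size can be extracted programmatically. A small sketch using the cv.tree output above (best.size is an illustrative name, not part of the original notebook):

# size with the smallest cross-validated deviance
best.size = cv.carseats.tree$size[which.min(cv.carseats.tree$dev)]
best.size   # 12 for this seed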
In [25]:
#As we can see, the tree with 12 terminal nodes gives the lowest cross-validated deviance.
prune.carseats.tree = prune.tree(carseats.tree,best=12)
plot(prune.carseats.tree)
text(prune.carseats.tree)
In [26]:
yhat = predict(prune.carseats.tree,newdata=Carseats[-train,])
mean((yhat-Carseats$Sales[-train])^2)
4.61032165137014

Pruning the tree does not decrease the test error: the test MSE is 4.1 without pruning and 4.6 with pruning.
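To see how the test error depends on the amount of pruning, one could compute the test MSE for every candidate size from the cross-validation output. A sketch along those lines (the loop and variable names are illustrative):

# test MSE for each pruned tree size (excluding the single-node stump)
sizes = cv.carseats.tree$size[cv.carseats.tree$size > 1]
test.mse = sapply(sizes, function(s) {
  pruned = prune.tree(carseats.tree, best = s)
  pred = predict(pruned, newdata = Carseats[-train,])
  mean((pred - Carseats$Sales[-train])^2)
})
cbind(size = sizes, test.mse)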

d

In [29]:
library(randomForest)
In [38]:
# p = the number of variables considered at each split
p = ncol(Carseats)-1
bag.carseats = randomForest(Sales~.,data=Carseats,mtry=p,ntree=500,importance=TRUE,subset=train)
In [39]:
yhat = predict(bag.carseats,newdata=Carseats[-train,])
mean((yhat-Carseats$Sales[-train])^2)
2.59330551265318

Bagging has reduced the test MSE from 4.1 to 2.6.

In [40]:
varImpPlot(bag.carseats)

Price and ShelveLoc are the two most important variables in predicting Sales.
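The same ranking can be read off numerically rather than from the plot. A small sketch using the importance matrix returned by randomForest (this assumes importance=TRUE was set when fitting, as above):

# variables sorted by %IncMSE for the bagged model
imp = importance(bag.carseats)
imp[order(imp[,"%IncMSE"], decreasing = TRUE),]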

e

In [41]:
# p = the number of variables considered at each split (here sqrt(p) and p/2)
p = sqrt(ncol(Carseats)-1)
p2 = (ncol(Carseats)-1)/2
randf.carseats = randomForest(Sales~.,data=Carseats,mtry=p,ntree=500,importance=TRUE,subset=train)
randf.carseats.p2 = randomForest(Sales~.,data=Carseats,mtry=p2,ntree=500,importance=TRUE,subset=train)
In [37]:
yhat = predict(randf.carseats,newdata=Carseats[-train,])
mean((yhat-Carseats$Sales[-train])^2)
3.32981861576787
In [42]:
yhat = predict(randf.carseats.p2,newdata=Carseats[-train,])
mean((yhat-Carseats$Sales[-train])^2)
2.85454077752842

Relative to bagging, the random forest increases the test MSE: 3.33 with mtry = sqrt(p) and 2.85 with mtry = p/2, compared to 2.59 for bagging.
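To describe the effect of m (the number of variables tried at each split) more fully, one could sweep mtry over its whole range and record the test MSE. A sketch of that idea (illustrative only; results depend on the seed):

# test MSE as a function of mtry
mtry.values = 1:(ncol(Carseats)-1)
mse.by.mtry = sapply(mtry.values, function(m) {
  fit = randomForest(Sales~., data = Carseats, subset = train, mtry = m, ntree = 500)
  pred = predict(fit, newdata = Carseats[-train,])
  mean((pred - Carseats$Sales[-train])^2)
})
plot(mtry.values, mse.by.mtry, type = "b", xlab = "mtry", ylab = "Test MSE")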

In [44]:
importance(randf.carseats)
varImpPlot(randf.carseats)
               %IncMSE IncNodePurity
CompPrice    7.0760560     129.67547
Income       5.4464122     126.45442
Advertising 12.9804105     138.40878
Population  -1.3719784      97.99927
Price       37.0633392     382.26242
ShelveLoc   30.4561325     236.74282
Age         17.9715510     196.77469
Education    2.0792595      72.60662
Urban       -0.5662764      16.19988
US           5.2840921      32.31290

According to the random forest as well, Price and ShelveLoc are the two most important variables.