Chapter 7 - Moving Beyond Linearity, Question 10

In [2]:
library(ISLR)
summary(College)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  

(a)

In [6]:
train = sample(1:nrow(College)%/%2)
length(names(College))
18
In [79]:
library(leaps)
regfit.forward = regsubsets(Outstate~.,data=College,nvmax=18,method="forward")
regfit_summary = summary(regfit.forward)
regfit_summary
Subset selection object
Call: regsubsets.formula(Outstate ~ ., data = College, nvmax = 18, 
    method = "forward")
17 Variables  (and intercept)
            Forced in Forced out
PrivateYes      FALSE      FALSE
Apps            FALSE      FALSE
Accept          FALSE      FALSE
Enroll          FALSE      FALSE
Top10perc       FALSE      FALSE
Top25perc       FALSE      FALSE
F.Undergrad     FALSE      FALSE
P.Undergrad     FALSE      FALSE
Room.Board      FALSE      FALSE
Books           FALSE      FALSE
Personal        FALSE      FALSE
PhD             FALSE      FALSE
Terminal        FALSE      FALSE
S.F.Ratio       FALSE      FALSE
perc.alumni     FALSE      FALSE
Expend          FALSE      FALSE
Grad.Rate       FALSE      FALSE
1 subsets of each size up to 17
Selection Algorithm: forward
          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
2  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
3  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
7  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
8  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
9  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
10  ( 1 ) "*"        " "  "*"    " "    " "       " "       " "        
11  ( 1 ) "*"        " "  "*"    " "    " "       " "       "*"        
12  ( 1 ) "*"        "*"  "*"    " "    " "       " "       "*"        
13  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
14  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
15  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
16  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
1  ( 1 )  " "         " "        " "   " "      " " " "      " "      
2  ( 1 )  " "         " "        " "   " "      " " " "      " "      
3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
5  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
6  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
7  ( 1 )  " "         "*"        " "   "*"      "*" " "      " "      
8  ( 1 )  " "         "*"        " "   "*"      "*" "*"      " "      
9  ( 1 )  " "         "*"        " "   "*"      "*" "*"      "*"      
10  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
11  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
12  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
13  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
14  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
15  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
16  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
          perc.alumni Expend Grad.Rate
1  ( 1 )  " "         "*"    " "      
2  ( 1 )  " "         "*"    " "      
3  ( 1 )  " "         "*"    " "      
4  ( 1 )  "*"         "*"    " "      
5  ( 1 )  "*"         "*"    " "      
6  ( 1 )  "*"         "*"    "*"      
7  ( 1 )  "*"         "*"    "*"      
8  ( 1 )  "*"         "*"    "*"      
9  ( 1 )  "*"         "*"    "*"      
10  ( 1 ) "*"         "*"    "*"      
11  ( 1 ) "*"         "*"    "*"      
12  ( 1 ) "*"         "*"    "*"      
13  ( 1 ) "*"         "*"    "*"      
14  ( 1 ) "*"         "*"    "*"      
15  ( 1 ) "*"         "*"    "*"      
16  ( 1 ) "*"         "*"    "*"      
17  ( 1 ) "*"         "*"    "*"      
In [13]:
names(regfit_summary)
  1. 'which'
  2. 'rsq'
  3. 'rss'
  4. 'adjr2'
  5. 'cp'
  6. 'bic'
  7. 'outmat'
  8. 'obj'
In [69]:
par(mfrow=c(2,2))
plot(regfit_summary$adjr2,type="l",ylab="adjusted R.sq", xlab="no. of predictors")
max_adjr2 = which.max(regfit_summary$adjr2)
points(max_adjr2,regfit_summary$adjr2[max_adjr2],col="red")
plot(regfit_summary$bic,type="l",ylab="BIC", xlab="no. of predictors")
min_bic = which.min(regfit_summary$bic)
points(min_bic,regfit_summary$bic[min_bic],col="red")
plot(regfit_summary$cp,type="l",ylab="CP", xlab="no. of predictors")
min_cp = which.min(regfit_summary$cp)
points(min_cp,regfit_summary$cp[min_cp],col="red")
In [70]:
k=10
folds = sample(1:k,nrow(College),replace=TRUE)
In [75]:
set.seed(1)
cverr = c()
for(i in 1:17){
    err = c()
    for(j in 1:k){
        regfit_model = regsubsets(Outstate~.,data=College[folds!=k,],method="forward")
        test_mat = model.matrix(Outstate~.,data=College[folds==k,])
        coef_vec = coef(regfit.forward,id=i)
        n = names(coef_vec)
        pred = test_mat[,n]%*%coef_vec
        err[j]=mean((pred-College[folds==k,]$Outstate)^2)
    }
    cverr[i]=mean(err)
}
plot(cverr,type="l",xlab="no. of predictors")
In [76]:
which.min(cverr)
13

According to 10-fold cross validation error the model with 13 predictors give the lowest mean squared error. However the results of CP and BIC favor 10 predictors. In the cross validation graph we can observe that the difference between the mean squared error of the model with 10 predictors and the model with 13 predictors is insignificant. Therefore for the simplification of the model we choose 10 predictors.

In [80]:
coef(regfit.forward,id=10)
(Intercept)
-2089.12270036304
PrivateYes
2799.83531388728
Accept
0.080884534097839
Room.Board
0.884514301059244
Personal
-0.361607375711528
PhD
18.7487472207267
Terminal
23.244346621605
S.F.Ratio
-61.8324122196724
perc.alumni
43.7747006003416
Expend
0.195064036803467
Grad.Rate
26.5399510112468

(b)

In [81]:
names(College)
  1. 'Private'
  2. 'Apps'
  3. 'Accept'
  4. 'Enroll'
  5. 'Top10perc'
  6. 'Top25perc'
  7. 'F.Undergrad'
  8. 'P.Undergrad'
  9. 'Outstate'
  10. 'Room.Board'
  11. 'Books'
  12. 'Personal'
  13. 'PhD'
  14. 'Terminal'
  15. 'S.F.Ratio'
  16. 'perc.alumni'
  17. 'Expend'
  18. 'Grad.Rate'
In [100]:
library(gam)
library(splines)
gam.model = gam(Outstate~Private+s(Accept,df=3)+s(Room.Board,df=3)+s(Books,df=3)+s(Personal,df=3)+s(PhD,df=3)+
                s(S.F.Ratio,df=3)+s(perc.alumni,df=3)+s(Expend,df=3)+s(Grad.Rate,df=3), data=College[train,])
In [101]:
par(mfrow=c(4,4))
plot(gam.model)

(c)

In [105]:
pred = predict(gam.model,newdata=College[-train,])
In [106]:
#R-squared
TSS = sum((College[-train,]$Outstate-mean(College[-train,]$Outstate))^2)
RSS = sum((College[-train,]$Outstate-pred)^2)
1-RSS/TSS
0.77355461544689

(d)

In [104]:
summary(gam.model)
Call: gam(formula = Outstate ~ Private + s(Accept, df = 3) + s(Room.Board, 
    df = 3) + s(Books, df = 3) + s(Personal, df = 3) + s(PhD, 
    df = 3) + s(S.F.Ratio, df = 3) + s(perc.alumni, df = 3) + 
    s(Expend, df = 3) + s(Grad.Rate, df = 3), data = College[train, 
    ])
Deviance Residuals:
     Min       1Q   Median       3Q      Max 
-6580.46 -1091.51    52.66  1149.57  7208.44 

(Dispersion Parameter for gaussian family taken to be 3141810)

    Null Deviance: 12341942890 on 775 degrees of freedom
Residual Deviance: 2346929754 on 746.9992 degrees of freedom
AIC: 13841.84 

Number of Local Scoring Iterations: 2 

Anova for Parametric Effects
                        Df     Sum Sq    Mean Sq F value    Pr(>F)    
Private                  1 2004051580 2004051580 637.865 < 2.2e-16 ***
s(Accept, df = 3)        1  817929953  817929953 260.337 < 2.2e-16 ***
s(Room.Board, df = 3)    1 1887004403 1887004403 600.611 < 2.2e-16 ***
s(Books, df = 3)         1   39064708   39064708  12.434 0.0004474 ***
s(Personal, df = 3)      1  201423425  201423425  64.111 4.503e-15 ***
s(PhD, df = 3)           1  709588375  709588375 225.853 < 2.2e-16 ***
s(S.F.Ratio, df = 3)     1  354280800  354280800 112.763 < 2.2e-16 ***
s(perc.alumni, df = 3)   1  310973847  310973847  98.979 < 2.2e-16 ***
s(Expend, df = 3)        1  905629641  905629641 288.251 < 2.2e-16 ***
s(Grad.Rate, df = 3)     1   70651493   70651493  22.488 2.534e-06 ***
Residuals              747 2346929754    3141810                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova for Nonparametric Effects
                       Npar Df Npar F     Pr(F)    
(Intercept)                                        
Private                                            
s(Accept, df = 3)            2 13.023 2.758e-06 ***
s(Room.Board, df = 3)        2  6.500  0.001589 ** 
s(Books, df = 3)             2  2.542  0.079405 .  
s(Personal, df = 3)          2  5.596  0.003868 ** 
s(PhD, df = 3)               2 12.677 3.850e-06 ***
s(S.F.Ratio, df = 3)         2 13.322 2.066e-06 ***
s(perc.alumni, df = 3)       2  3.245  0.039528 *  
s(Expend, df = 3)            2 64.218 < 2.2e-16 ***
s(Grad.Rate, df = 3)         2  6.756  0.001236 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

According to the ANOVA for the Nonparametric Effect Accept, PhD, S.F. Ratio and Expend have a strong non-linear relationship with the response

In [ ]: