Chapter 3 Linear Regression, Q9

In [1]:
Auto = read.csv("../../datasets/Auto.csv",header=TRUE,na.strings="?")
Auto = na.omit(Auto)

(a)

In [2]:
pairs(Auto)

(b)

In [3]:
auto_subset = subset(Auto, select=-name)
cor(auto_subset)
                    mpg  cylinders displacement horsepower     weight acceleration       year     origin
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285  0.5805410  0.5652088
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834 -0.3456474 -0.5689316
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005 -0.3698552 -0.6145351
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955 -0.4163615 -0.4551715
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392 -0.3091199 -0.5850054
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000  0.2903161  0.2127458
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161  1.0000000  0.1815277
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458  0.1815277  1.0000000

(c)

In [4]:
lm.model = lm(mpg~.-name, data=Auto)
summary(lm.model)
Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,	Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

(i)

The F-statistic (252.4 on 7 and 384 degrees of freedom) is far larger than 1 and its p-value is below 2.2e-16, so we reject the null hypothesis that all the regression coefficients are zero. There is strong evidence that at least one of the predictors is related to the response.
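As a quick check on the reasoning above, the F-statistic can be recomputed from the reported R-squared using F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). The sketch below only uses numbers copied from the summary output; the variable names are illustrative. Because R-squared is rounded to four digits in the printout, the result is close to, but not exactly, the reported 252.4.

```r
# Values copied from the summary(lm.model) output above
r2 <- 0.8215   # Multiple R-squared (rounded in the printout)
p  <- 7        # number of predictors
df <- 384      # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r2 / p) / ((1 - r2) / df)

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_stat, df1 = p, df2 = df, lower.tail = FALSE)

print(f_stat)
print(p_value)
```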

(ii)

displacement, weight, year and origin have a statistically significant relationship with the response: their p-values are all below 0.01. cylinders, horsepower and acceleration are not significant at the 5% level in this model.

(iii)

The coefficient of the year variable is 0.750773. Holding the other predictors fixed, mpg increases by about 0.75 per model year on average, or roughly 7.5 mpg over 10 years.
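To attach some uncertainty to that estimate, an approximate 95% confidence interval for the year coefficient can be built from the estimate and standard error in the table above. This is a minimal sketch that assumes the usual t-based interval; with the fitted object available, confint(lm.model) should give the same interval up to rounding.

```r
# Estimate and standard error of the year coefficient, copied from the output above
est <- 0.750773
se  <- 0.050973
df  <- 384                      # residual degrees of freedom

# t-based 95% confidence interval: estimate +/- t_crit * standard error
t_crit  <- qt(0.975, df)
ci_year <- est + c(-1, 1) * t_crit * se

# The same interval expressed as the change in mpg over 10 model years
ci_decade <- 10 * ci_year

print(ci_year)
print(ci_decade)
```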

(d)

In [6]:
par(mfrow=c(2,2))
plot(lm.model)
  1. The Residuals vs Fitted plot shows a curved pattern, which suggests a nonlinear relationship between the predictors and the response.
  2. The standardized residuals (Scale-Location and Residuals vs Leverage panels) show a few observations with unusually large residuals, i.e. possible outliers in the response.
  3. The Residuals vs Leverage plot indicates that observation 14 has unusually high leverage.
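The same diagnostics can be checked numerically rather than by eye. The sketch below uses the built-in mtcars data as a stand-in (an assumption, so that the example runs without Auto.csv), and the cutoffs are common rules of thumb rather than strict tests: studentized residuals beyond 3 in absolute value, and leverage above twice the average leverage (p + 1) / n.

```r
# Stand-in model on the built-in mtcars data (NOT the Auto data used above)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

lev  <- hatvalues(fit)   # leverage statistic of each observation
rstu <- rstudent(fit)    # externally studentized residuals

# Average leverage is always (p + 1) / n
avg_lev <- length(coef(fit)) / nrow(mtcars)

# Rule-of-thumb flags (assumed cutoffs, not formal tests)
high_leverage <- names(lev)[lev > 2 * avg_lev]
outliers      <- names(rstu)[abs(rstu) > 3]

print(high_leverage)
print(outliers)
```

Applied to the model in this exercise, replacing fit with lm.model would list the flagged rows of Auto by row name.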

(e)

In [10]:
lm.model_interaction = lm(mpg~.+cylinders*displacement+cylinders:year-name,data=Auto)
summary(lm.model_interaction)
Call:
lm(formula = mpg ~ . + cylinders * displacement + cylinders:year - 
    name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.9613 -1.6755 -0.0946  1.3488 12.7088 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -6.694e+01  1.234e+01  -5.427 1.02e-07 ***
cylinders               1.011e+01  2.323e+00   4.352 1.73e-05 ***
displacement           -7.331e-02  1.365e-02  -5.373 1.35e-07 ***
horsepower             -5.302e-02  1.291e-02  -4.107 4.91e-05 ***
weight                 -5.181e-03  6.021e-04  -8.606  < 2e-16 ***
acceleration            9.581e-02  8.861e-02   1.081  0.28024    
year                    1.572e+00  1.521e-01  10.329  < 2e-16 ***
origin                  7.471e-01  2.636e-01   2.835  0.00483 ** 
cylinders:displacement  1.208e-02  1.679e-03   7.197 3.27e-12 ***
cylinders:year         -1.612e-01  2.882e-02  -5.594 4.23e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.974 on 382 degrees of freedom
Multiple R-squared:  0.8582,	Adjusted R-squared:  0.8548 
F-statistic: 256.8 on 9 and 382 DF,  p-value: < 2.2e-16

Both interaction terms, cylinders:displacement and cylinders:year, are statistically significant: their p-values are very small (3.27e-12 and 4.23e-08).
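One way to read the cylinders:year interaction is that the slope of mpg on year now depends on the number of cylinders: it equals the year coefficient plus the interaction coefficient times cylinders. The sketch below evaluates that slope for a few cylinder counts, using the rounded estimates from the table above; the chosen counts are illustrative.

```r
# Rounded estimates copied from summary(lm.model_interaction) above
b_year     <- 1.572     # coefficient of year
b_cyl_year <- -0.1612   # coefficient of cylinders:year

# Slope of mpg on year for a given number of cylinders,
# holding the other predictors fixed
cyl <- c(4, 6, 8)
year_slope <- b_year + b_cyl_year * cyl
names(year_slope) <- paste0(cyl, "cyl")

print(year_slope)
```

Under this model the estimated yearly gain in mpg shrinks as the number of cylinders grows.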

(f)

In [12]:
lm.model_logY_transform = lm(log(mpg)~.+cylinders*displacement-name,data=Auto)
summary(lm.model_logY_transform)
Call:
lm(formula = log(mpg) ~ . + cylinders * displacement - name, 
    data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46071 -0.06318  0.00052  0.06522  0.39133 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.119e+00  1.744e-01  12.148  < 2e-16 ***
cylinders              -8.379e-02  1.524e-02  -5.498 7.03e-08 ***
displacement           -1.833e-03  5.268e-04  -3.479 0.000562 ***
horsepower             -2.252e-03  4.978e-04  -4.524 8.11e-06 ***
weight                 -2.237e-04  2.327e-05  -9.610  < 2e-16 ***
acceleration           -1.875e-03  3.417e-03  -0.549 0.583543    
year                    2.980e-02  1.762e-03  16.912  < 2e-16 ***
origin                  2.252e-02  1.019e-02   2.211 0.027644 *  
cylinders:displacement  3.450e-04  6.405e-05   5.386 1.26e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.115 on 383 degrees of freedom
Multiple R-squared:  0.888,	Adjusted R-squared:  0.8857 
F-statistic: 379.6 on 8 and 383 DF,  p-value: < 2.2e-16
In [14]:
par(mfrow=c(2,2))
plot(lm.model_logY_transform)

The Residuals vs Fitted plot indicates that log-transforming the response (mpg) has removed most of the nonlinearity between the predictors and the response.

In [18]:
lm.model_sqrtY_transform = lm(sqrt(mpg)~.+cylinders*displacement-name,data=Auto)
summary(lm.model_sqrtY_transform)
par(mfrow=c(2,2))
plot(lm.model_sqrtY_transform)
Call:
lm(formula = sqrt(mpg) ~ . + cylinders * displacement - name, 
    data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15528 -0.16415 -0.00293  0.16172  1.06820 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.272e+00  4.400e-01   5.163 3.92e-07 ***
cylinders              -2.410e-01  3.845e-02  -6.269 9.83e-10 ***
displacement           -6.276e-03  1.329e-03  -4.722 3.28e-06 ***
horsepower             -5.039e-03  1.256e-03  -4.013 7.22e-05 ***
weight                 -5.344e-04  5.872e-05  -9.101  < 2e-16 ***
acceleration            1.025e-03  8.620e-03   0.119   0.9054    
year                    7.453e-02  4.445e-03  16.767  < 2e-16 ***
origin                  6.251e-02  2.570e-02   2.432   0.0155 *  
cylinders:displacement  1.122e-03  1.616e-04   6.943 1.65e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2901 on 383 degrees of freedom
Multiple R-squared:  0.8722,	Adjusted R-squared:  0.8695 
F-statistic: 326.7 on 8 and 383 DF,  p-value: < 2.2e-16

Neither the log nor the square-root transformation of the response (mpg) removes the outliers. Both transformations do, however, reduce the nonlinearity between the predictors and the response.
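A rough way to compare how many outliers remain under each transformation is to count the observations whose studentized residual exceeds 3 in absolute value. The sketch below again uses the built-in mtcars data as a stand-in (an assumption, so that it runs without Auto.csv), with a hypothetical two-predictor formula; the cutoff of 3 is a rule of thumb.

```r
# Stand-in fits on the built-in mtcars data (NOT the Auto data used above)
fits <- list(
  raw  = lm(mpg ~ wt + hp, data = mtcars),
  log  = lm(log(mpg) ~ wt + hp, data = mtcars),
  sqrt = lm(sqrt(mpg) ~ wt + hp, data = mtcars)
)

# Number of observations with |studentized residual| > 3 in each fit
n_outliers <- sapply(fits, function(f) sum(abs(rstudent(f)) > 3))

print(n_outliers)
```

With the models from this exercise, the list would instead hold lm.model_interaction, lm.model_logY_transform and lm.model_sqrtY_transform.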

In [22]:
lm.model_Xsq = lm(mpg~horsepower+I(horsepower^2),data=Auto)
summary(lm.model_Xsq)
Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7135  -2.5943  -0.0859   2.2868  15.8961 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,	Adjusted R-squared:  0.686 
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Both horsepower and horsepower^2 are statistically significant: their p-values are very small (below 2e-16).
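The fitted quadratic can be inspected directly from its coefficients. The sketch below, using the estimates printed above, evaluates the fitted curve at an illustrative horsepower of 100 and locates the turning point of the parabola at -b1 / (2 * b2). The function name fitted_mpg is hypothetical.

```r
# Estimates copied from summary(lm.model_Xsq) above
b0 <- 56.9000997   # intercept
b1 <- -0.4661896   # coefficient of horsepower
b2 <- 0.0012305    # coefficient of horsepower^2

# Fitted mpg as a function of horsepower
fitted_mpg <- function(hp) b0 + b1 * hp + b2 * hp^2

# Horsepower at which the fitted parabola reaches its minimum
hp_min <- -b1 / (2 * b2)

print(fitted_mpg(100))
print(hp_min)
```

Since b2 is positive, the fitted mpg decreases with horsepower up to the turning point and increases beyond it, so predictions at very high horsepower should be treated with caution.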
