Chapter 3 Linear Regression, Q14
# (a) Generate the predictors and the response
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1+rnorm(100)/10
y = 2+2*x1+0.3*x2+rnorm(100)
The linear model has the form y = beta0 + beta1*x1 + beta2*x2 + error, with regression coefficients beta0 = 2, beta1 = 2, beta2 = 0.3.
# (b) Correlation and scatterplot of x1 and x2
cor(x1,x2)
plot(x1,x2)
# (c) Least squares fit of y on both x1 and x2
lm.model1 = lm(y~x1+x2)
summary(lm.model1)
The coefficient estimates are as follows:
beta0 = 2.13, beta1 = 1.4396, beta2 = 1.0097
Only the estimate of beta0 is close to its true value; the estimates of beta1 and beta2 are far from the true beta1 = 2 and beta2 = 0.3.
The p-value for the t-statistic of beta1 is just below the 5% critical level, so we can reject the null hypothesis that beta1 = 0. However, the p-value for the t-statistic of beta2 is larger than 5%, so we cannot reject the null hypothesis that beta2 = 0.
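As a quick numerical check, the estimates, standard errors, t-statistics, and p-values can be pulled straight out of the summary object (a minimal base-R sketch):
coef(summary(lm.model1))                  # full coefficient table
coef(summary(lm.model1))[, "Pr(>|t|)"]    # just the p-values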
# (d) Least squares fit of y on x1 alone
lm.model2 = lm(y~x1)
summary(lm.model2)
The standard error of the estimate of beta1 has decreased.
The p-value associated with the t-statistic of beta1 is near zero. Therefore we can reject the null hypothesis that beta1=0.
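The decrease can be verified by comparing x1's standard error across the two fits (a minimal sketch):
coef(summary(lm.model1))["x1", "Std. Error"]   # x1 in the joint fit from (c)
coef(summary(lm.model2))["x1", "Std. Error"]   # x1 fit alone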
# (e) Least squares fit of y on x2 alone
lm.model3 = lm(y~x2)
summary(lm.model3)
The standard error of the estimate of beta2 has decreased.
The p-value associated with the t-statistic of beta2 is near zero. Therefore we can reject the null hypothesis that beta2 = 0.
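And similarly for x2 (again a minimal sketch):
coef(summary(lm.model1))["x2", "Std. Error"]   # x2 in the joint fit from (c)
coef(summary(lm.model3))["x2", "Std. Error"]   # x2 fit alone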
The results in (c) and (e) contradict each other. According to the results in (c) we could not reject the null hypothesis that beta2 = 0, yet the results in (e) reject it decisively. We get such results because of the high collinearity between x1 and x2. Collinearity inflates the standard errors of the coefficient estimates, which shrinks the t-statistics, and so we fail to reject the null hypothesis. Only after removing one of the predictors from the model do we get a p-value that reflects the predictor's real association with y.
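One way to quantify the collinearity is the variance inflation factor (VIF). With only two predictors it can be computed directly from the R-squared of regressing one predictor on the other, so no extra packages are needed (a minimal sketch):
r2 = summary(lm(x1 ~ x2))$r.squared   # R^2 from regressing x1 on x2
1/(1 - r2)                            # VIF; identical for x1 and x2 here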
# (g) Add one additional, mismeasured observation
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y,6)
# Refit the model from (c) with the new observation
lm.model_g = lm(y~x1+x2)
summary(lm.model_g)
par(mfrow=c(2,2))
plot(lm.model_g)
# Average leverage is (p+1)/n, with p = 2 predictors
p = 2
n = length(y)
(p+1)/n
Effect of the new observation on the model in (c):
According to the studentized residuals vs. leverage plot, the newly added observation (index 101) has a leverage statistic of about 0.4, which greatly exceeds the average leverage (p+1)/n. Therefore this observation is a high-leverage point. It is not an outlier, since in the studentized residuals vs. fitted values plot all residuals lie between -2 and 2.
After the addition of the new observation, the coefficient of x2 becomes significant and the coefficient of x1 becomes insignificant.
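The leverage of the new point can also be read off numerically rather than from the plot (a minimal sketch; the 2*(p+1)/n cutoff is a common rule of thumb):
lev = hatvalues(lm.model_g)    # leverage statistic of each observation
lev[101]                       # leverage of the added point
which(lev > 2*(p+1)/n)         # observations with unusually high leverage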
# Refit the model from (d) with the new observation
lm.model_g2 = lm(y~x1)
summary(lm.model_g2)
par(mfrow=c(2,2))
plot(lm.model_g2)
par(mfrow=c(1,1))
plot(predict(lm.model_g2),rstudent(lm.model_g2))   # studentized residuals vs fitted values
Effect of the new observation on the model in (d):
There is no high-leverage point in the data. In the studentized residuals vs. fitted values plot, the one observation that falls outside the range -3 to 3 is an outlier.
The R-squared value has decreased.
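The outlier can be identified directly from the studentized residuals (a minimal sketch):
rst = rstudent(lm.model_g2)
which(abs(rst) > 3)   # observations with |studentized residual| > 3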
# Refit the model from (e) with the new observation
lm.model_g3 = lm(y~x2)
summary(lm.model_g3)
par(mfrow=c(2,2))
plot(lm.model_g3)
par(mfrow=c(1,1))
plot(predict(lm.model_g3),rstudent(lm.model_g3))   # studentized residuals vs fitted values
Effect of the new observation on the model in (e):
The studentized residuals vs. leverage plot shows that there is one observation with high leverage.
The studentized residuals vs. fitted values plot shows that all observations lie between -3 and 3, so there are no outliers.
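Again this can be checked numerically (a minimal sketch; with p = 1 predictor here, a common rule-of-thumb cutoff is 2*(1+1)/n):
lev3 = hatvalues(lm.model_g3)
lev3[101]                 # leverage of the added point
which(lev3 > 2*(1+1)/n)   # rule-of-thumb high-leverage observations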