Chapter 3 Linear Regression, Q13

In [5]:
set.seed(1)

(a)

In [6]:
x = rnorm(100,mean=0,sd=1)

(b)

In [7]:
er = rnorm(100,mean=0,sd=sqrt(0.25))

(c)

In [8]:
y = -1+0.5*x+er
In [9]:
length(y)
100

length(y) = 100. The simulated model is Y = -1 + 0.5*X + eps with eps ~ N(0, 0.25), so Beta0 = -1 and Beta1 = 0.5.

(d)

In [10]:
plot(x,y)

The scatterplot shows a roughly linear relationship between x and y, with scatter around the trend due to the noise term.
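
As a quick numerical check (a sketch beyond what the exercise asks for), the sample correlation between x and y can be compared with the value implied by the simulation, beta1*sd(x)/sd(y) = 0.5/sqrt(0.5) ≈ 0.71:

In [ ]:
# Sketch: the theoretical correlation is 0.5/sqrt(0.25 + 0.25) ~ 0.71
cor(x,y)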

(e)

In [11]:
lm.model = lm(y~x)
summary(lm.model)
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.93842 -0.30688 -0.06975  0.26970  1.17309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
x            0.49947    0.05386   9.273 4.58e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4814 on 98 degrees of freedom
Multiple R-squared:  0.4674,	Adjusted R-squared:  0.4619 
F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The estimated coefficients (intercept ≈ -1.019, slope ≈ 0.499) are very close to the true values Beta0 = -1 and Beta1 = 0.5, and both are highly significant.
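
A compact way to see this (a minimal sketch; true_coef is just a helper vector holding the simulation values) is to tabulate the fitted coefficients next to the true ones:

In [ ]:
# Sketch: compare fitted coefficients with the true simulation values (beta0 = -1, beta1 = 0.5)
true_coef = c(-1, 0.5)
cbind(estimate = coef(lm.model), true = true_coef, error = coef(lm.model) - true_coef)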

(f)

In [12]:
plot(x,y)
abline(lm.model,col="red")
abline(-1,0.5,col="blue")
legend(-2,0.5, legend = c("least square fit", "pop. regression"), col=c("red","blue"), lwd=2)

(g)

In [13]:
lm.model_quad = lm(y~x+I(x^2))
summary(lm.model_quad)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98252 -0.31270 -0.06441  0.29014  1.13500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
x            0.50858    0.05399   9.420  2.4e-15 ***
I(x^2)      -0.05946    0.04238  -1.403    0.164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.479 on 97 degrees of freedom
Multiple R-squared:  0.4779,	Adjusted R-squared:  0.4672 
F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14
In [14]:
anova(lm.model,lm.model_quad)
Res.Df      RSS Df Sum of Sq        F    Pr(>F)
    98 22.70890 NA        NA       NA        NA
    97 22.25728  1 0.4516256 1.968241 0.1638275
According to the ANOVA test, the p-value of the F-statistic (0.164) is larger than 5%, so we fail to reject the null hypothesis that both models fit the data equally well. Moreover, the p-value of the t-statistic for the I(x^2) coefficient is also larger than 5%, so we fail to reject the null hypothesis that this coefficient is zero; the quadratic term can be omitted.
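
Information criteria give a complementary view of the same comparison (a sketch, not required by the exercise); lower values indicate the preferred model:

In [ ]:
# Sketch: AIC/BIC comparison of the linear and quadratic fits (lower is preferred)
AIC(lm.model, lm.model_quad)
BIC(lm.model, lm.model_quad)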

(h)

In [15]:
set.seed(1)
x2 = rnorm(100,mean=0,sd=1)
er2 = rnorm(100,mean=0,sd=sqrt(0.1))
y2 = -1+0.5*x2+er2
In [16]:
plot(x2,y2)
In [17]:
lm.model2 = lm(y2~x2)
summary(lm.model2)
Call:
lm(formula = y2 ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59351 -0.19409 -0.04411  0.17057  0.74193 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01192    0.03067  -32.99   <2e-16 ***
x2           0.49966    0.03407   14.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3044 on 98 degrees of freedom
Multiple R-squared:  0.687,	Adjusted R-squared:  0.6838 
F-statistic: 215.1 on 1 and 98 DF,  p-value: < 2.2e-16

With less noise in the data, the residual standard error has decreased from 0.48 to 0.30 and the R-squared value has increased from 0.47 to 0.69.
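
A side-by-side view (sketch) can be pulled directly from the two summary objects:

In [ ]:
# Sketch: residual standard error and R^2 for the original and less noisy fits
data.frame(noise = c("var(eps) = 0.25", "var(eps) = 0.10"),
           RSE = c(summary(lm.model)$sigma, summary(lm.model2)$sigma),
           R.squared = c(summary(lm.model)$r.squared, summary(lm.model2)$r.squared))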

In [18]:
plot(x2,y2)
abline(lm.model2, col="red")
abline(-1,0.5, col="blue")
legend(-2,0.0, legend=c("least square fit","pop. regression"), col=c("red","blue"),lwd=2)
In [19]:
lm.model2_quad = lm(y2~x2+I(x2^2))
summary(lm.model2_quad)
Call:
lm(formula = y2 ~ x2 + I(x2^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.62140 -0.19777 -0.04073  0.18350  0.71783 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.98207    0.03721 -26.395   <2e-16 ***
x2           0.50543    0.03415  14.801   <2e-16 ***
I(x2^2)     -0.03761    0.02681  -1.403    0.164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.303 on 97 degrees of freedom
Multiple R-squared:  0.6933,	Adjusted R-squared:  0.6869 
F-statistic: 109.6 on 2 and 97 DF,  p-value: < 2.2e-16
In [20]:
anova(lm.model2,lm.model2_quad)
Res.Df      RSS Df Sum of Sq        F    Pr(>F)
    98 9.083561 NA        NA       NA        NA
    97 8.902911  1 0.1806502 1.968241 0.1638275

According to the ANOVA test, the p-value of the F-statistic (0.164) is again larger than 5%, so we fail to reject the null hypothesis that both models fit the data equally well.
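
An equivalent way to parameterise the quadratic fit (a sketch; lm.model2_poly is just an illustrative name) uses orthogonal polynomials via poly(); since it spans the same column space as x2 + I(x2^2), the model comparison is unchanged:

In [ ]:
# Sketch: orthogonal-polynomial version of the quadratic model
lm.model2_poly = lm(y2 ~ poly(x2, 2))
anova(lm.model2, lm.model2_poly)  # same RSS, F and p-value as the I(x2^2) fit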

(i)

In [21]:
set.seed(1)
x3 = rnorm(100,mean=0,sd=1)
er3 = rnorm(100,mean=0,sd=sqrt(0.6))
y3 = -1+0.5*x3+er3
lm.model3 = lm(y3~x3)
summary(lm.model3)
plot(x3,y3)
abline(lm.model3, col="red")
abline(-1,0.5, col="blue")
legend(-2,0.0, legend=c("least square fit","pop. regression"), col=c("red","blue"),lwd=2)
Call:
lm(formula = y3 ~ x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4538 -0.4754 -0.1080  0.4178  1.8173 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.02920    0.07513 -13.700  < 2e-16 ***
x3           0.49918    0.08344   5.982  3.6e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7457 on 98 degrees of freedom
Multiple R-squared:  0.2675,	Adjusted R-squared:   0.26 
F-statistic: 35.79 on 1 and 98 DF,  p-value: 3.6e-08

With the noisier data, the residual standard error has increased to 0.75 and the R-squared value has fallen to 0.27.
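
These values track the population R-squared implied by the simulation, beta1^2*Var(x) / (beta1^2*Var(x) + Var(eps)) = 0.25/(0.25 + var(eps)); a quick sketch:

In [ ]:
# Sketch: population R^2 = 0.25/(0.25 + var(eps)) for the three noise levels
sigma2 = c(original = 0.25, less_noise = 0.10, more_noise = 0.60)
round(0.25/(0.25 + sigma2), 3)  # ~0.50, 0.71, 0.29 vs. observed 0.47, 0.69, 0.27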

In [22]:
lm.model3_quad = lm(y3~x3+I(x3^2))
summary(lm.model3_quad)
anova(lm.model3,lm.model3_quad)
Call:
lm(formula = y3 ~ x3 + I(x3^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.52211 -0.48443 -0.09978  0.44949  1.75833 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.95607    0.09114 -10.491  < 2e-16 ***
x3           0.51329    0.08364   6.137 1.84e-08 ***
I(x3^2)     -0.09212    0.06566  -1.403    0.164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7421 on 97 degrees of freedom
Multiple R-squared:  0.2821,	Adjusted R-squared:  0.2673 
F-statistic: 19.05 on 2 and 97 DF,  p-value: 1.048e-07
Res.Df      RSS Df Sum of Sq        F    Pr(>F)
    98 54.50137 NA        NA       NA        NA
    97 53.41746  1  1.083901 1.968241 0.1638275

Once again the p-value of the F-statistic (0.164) is larger than 5%, so the quadratic term does not significantly improve the fit.

(j)

In [23]:
#Original dataset
confint(lm.model)
                 2.5 %     97.5 %
(Intercept) -1.1150804 -0.9226122
x            0.3925794  0.6063602
In [24]:
#Less noisy dataset
confint(lm.model2)
                 2.5 %     97.5 %
(Intercept) -1.0727832 -0.9510557
x2           0.4320613  0.5672681
In [25]:
#Noisier dataset
confint(lm.model3)
                 2.5 %     97.5 %
(Intercept) -1.1782817 -0.8801114
x3           0.3335847  0.6647725

The confidence intervals for Beta0 and Beta1 are narrowest for the less noisy dataset and widest for the noisier dataset, since the width of a confidence interval scales with the standard error of the coefficient estimate.
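
This can be quantified directly (a sketch; ci_width is just an illustrative helper): the interval widths grow with the residual standard error of each fit.

In [ ]:
# Sketch: widths (upper - lower) of the 95% confidence intervals for each fit;
# row labels are taken from the first model's coefficient names.
ci_width = function(fit) apply(confint(fit), 1, diff)
sapply(list(var_0.25 = lm.model, var_0.10 = lm.model2, var_0.60 = lm.model3), ci_width)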
