Chapter 3 Linear Regression, Q13

In [5]:
set.seed(1)

(a)

In [6]:
x = rnorm(100,mean=0,sd=1)

(b)

In [7]:
er = rnorm(100,mean=0,sd=sqrt(0.25))

(c)

In [8]:
y = -1+0.5*x+er
In [9]:
length(y)
100

length(y) = 100. The simulated model is Y = -1 + 0.5*X + eps with eps ~ N(0, 0.25), so Beta0 = -1 and Beta1 = 0.5.

(d)

In [10]:
plot(x,y)

The scatterplot shows a roughly linear relationship between x and y, with scatter around the trend due to the noise term.
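
As a quick numerical check (a sketch beyond what the exercise asks for), the sample correlation between x and y can be compared with the value implied by the simulation, beta1*sd(x)/sd(y) = 0.5/sqrt(0.5) ≈ 0.71:

In [ ]:
# Sketch: the theoretical correlation is 0.5/sqrt(0.25 + 0.25) ~ 0.71
cor(x,y)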

(e)

In [11]:
lm.model = lm(y~x)
summary(lm.model)
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.93842 -0.30688 -0.06975  0.26970  1.17309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
x            0.49947    0.05386   9.273 4.58e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4814 on 98 degrees of freedom
Multiple R-squared:  0.4674,	Adjusted R-squared:  0.4619 
F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The estimated coefficients (intercept ≈ -1.019, slope ≈ 0.499) are very close to the true values Beta0 = -1 and Beta1 = 0.5, and both are highly significant.
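
A compact way to see this (a minimal sketch; true_coef is just a helper vector holding the simulation values) is to tabulate the fitted coefficients next to the true ones:

In [ ]:
# Sketch: compare fitted coefficients with the true simulation values (beta0 = -1, beta1 = 0.5)
true_coef = c(-1, 0.5)
cbind(estimate = coef(lm.model), true = true_coef, error = coef(lm.model) - true_coef)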

(f)

In [12]:
plot(x,y)
abline(lm.model,col="red")
abline(-1,0.5,col="blue")
legend(-2,0.5, legend = c("least square fit", "pop. regression"), col=c("red","blue"), lwd=2)

(g)

In [13]:
lm.model_quad = lm(y~x+I(x^2))
summary(lm.model_quad)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98252 -0.31270 -0.06441  0.29014  1.13500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
x            0.50858    0.05399   9.420  2.4e-15 ***
I(x^2)      -0.05946    0.04238  -1.403    0.164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.479 on 97 degrees of freedom
Multiple R-squared:  0.4779,	Adjusted R-squared:  0.4672 
F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14
In [14]:
anova(lm.model,lm.model_quad)
Res.Df      RSS Df Sum of Sq        F    Pr(>F)
    98 22.70890 NA        NA       NA        NA
    97 22.25728  1 0.4516256 1.968241 0.1638275
According to the ANOVA test, the p-value of the F-statistic (0.164) is larger than 5%, so we fail to reject the null hypothesis that both models fit the data equally well. Moreover, the p-value of the t-statistic for the I(x^2) coefficient is also larger than 5%, so we fail to reject the null hypothesis that this coefficient is zero; the quadratic term can be omitted.
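
Information criteria give a complementary view of the same comparison (a sketch, not required by the exercise); lower values indicate the preferred model:

In [ ]:
# Sketch: AIC/BIC comparison of the linear and quadratic fits (lower is preferred)
AIC(lm.model, lm.model_quad)
BIC(lm.model, lm.model_quad)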

(h)

In [15]:
set.seed(1)
x2 = rnorm(100,mean=0,sd=1)
er2 = rnorm(100,mean=0,sd=sqrt(0.1))
y2 = -1+0.5*x2+er2
In [16]:
plot(x2,y2)
In [17]:
lm.model2 = lm(y2~x2)
summary(lm.model2)
Call:
lm(formula = y2 ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59351 -0.19409 -0.04411  0.17057  0.74193 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01192    0.03067  -32.99   <2e-16 ***
x2           0.49966    0.03407   14.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3044 on 98 degrees of freedom
Multiple R-squared:  0.687,	Adjusted R-squared:  0.6838 
F-statistic: 215.1 on 1 and 98 DF,  p-value: < 2.2e-16

With less noise in the data, the residual standard error has decreased from 0.48 to 0.30 and the R-squared value has increased from 0.47 to 0.69.
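
A side-by-side view (sketch) can be pulled directly from the two summary objects:

In [ ]:
# Sketch: residual standard error and R^2 for the original and less noisy fits
data.frame(noise = c("var(eps) = 0.25", "var(eps) = 0.10"),
           RSE = c(summary(lm.model)$sigma, summary(lm.model2)$sigma),
           R.squared = c(summary(lm.model)$r.squared, summary(lm.model2)$r.squared))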

In [18]:
plot(x2,y2)
abline(lm.model2, col="red")
abline(-1,0.5, col="blue")
legend(-2,0.0, legend=c("least square fit","pop. regression"), col=c("red","blue"),lwd=2)
In [19]:
lm.model2_quad = lm(y2~x2+I(x2^2))
summary(lm.model2_quad)
Call:
lm(formula = y2 ~ x2 + I(x2^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.62140 -0.19777 -0.04073  0.18350  0.71783 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.98207    0.03721 -26.395   <2e-16 ***
x2           0.50543    0.03415  14.801   <2e-16 ***
I(x2^2)     -0.03761    0.02681  -1.403    0.164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.303 on 97 degrees of freedom
Multiple R-squared:  0.6933,	Adjusted R-squared:  0.6869 
F-statistic: 109.6 on 2 and 97 DF,  p-value: < 2.2e-16
In [20]:
anova(lm.model2,lm.model2_quad)
Res.Df      RSS Df Sum of Sq        F    Pr(>F)
    98 9.083561 NA        NA       NA        NA
    97 8.902911  1 0.1806502 1.968241 0.1638275

According to the ANOVA test, the p-value of the F-statistic (0.164) is again larger than 5%, so we fail to reject the null hypothesis that both models fit the data equally well.
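
An equivalent way to parameterise the quadratic fit (a sketch; lm.model2_poly is just an illustrative name) uses orthogonal polynomials via poly(); since it spans the same column space as x2 + I(x2^2), the model comparison is unchanged:

In [ ]:
# Sketch: orthogonal-polynomial version of the quadratic model
lm.model2_poly = lm(y2 ~ poly(x2, 2))
anova(lm.model2, lm.model2_poly)  # same RSS, F and p-value as the I(x2^2) fit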

(i)

In [21]:
set.seed(1)
x3 = rnorm(100,mean=0,sd=1)
er3 = rnorm(100,mean=0,sd=sqrt(0.6))
y3 = -1+0.5*x3+er3
lm.model3 = lm(y3~x3)
summary(lm.model3)
plot(x3,y3)
abline(lm.model3, col="red")
abline(-1,0.5, col="blue")
legend(-2,0.0, legend=c("least square fit","pop. regression"), col=c("red","blue"),lwd=2)
Call:
lm(formula = y3 ~ x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4538 -0.4754 -0.1080  0.4178  1.8173 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.02920    0.07513 -13.700  < 2e-16 ***
x3           0.49918    0.08344   5.982  3.6e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7457 on 98 degrees of freedom
Multiple R-squared:  0.2675,	Adjusted R-squared:   0.26 
F-statistic: 35.79 on 1 and 98 DF,  p-value: 3.6e-08

With the noisier data, the residual standard error has increased to 0.75 and the R-squared value has fallen to 0.27.
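
These values track the population R-squared implied by the simulation, beta1^2*Var(x) / (beta1^2*Var(x) + Var(eps)) = 0.25/(0.25 + var(eps)); a quick sketch:

In [ ]:
# Sketch: population R^2 = 0.25/(0.25 + var(eps)) for the three noise levels
sigma2 = c(original = 0.25, less_noise = 0.10, more_noise = 0.60)
round(0.25/(0.25 + sigma2), 3)  # ~0.50, 0.71, 0.29 vs. observed 0.47, 0.69, 0.27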

In [22]:
lm.model3_quad = lm(y3~x3+I(x3^2))
summary(lm.model3_quad)
anova(lm.model3,lm.model3_quad)
Call:
lm(formula = y3 ~ x3 + I(x3^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.52211 -0.48443 -0.09978  0.44949  1.75833 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.95607    0.09114 -10.491  < 2e-16 ***
x3           0.51329    0.08364   6.137 1.84e-08 ***
I(x3^2)     -0.09212    0.06566  -1.403    0.164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7421 on 97 degrees of freedom
Multiple R-squared:  0.2821,	Adjusted R-squared:  0.2673 
F-statistic: 19.05 on 2 and 97 DF,  p-value: 1.048e-07
Res.Df      RSS Df Sum of Sq        F    Pr(>F)
    98 54.50137 NA        NA       NA        NA
    97 53.41746  1  1.083901 1.968241 0.1638275

Once again the p-value of the F-statistic (0.164) is larger than 5%, so the quadratic term does not significantly improve the fit.

(j)

In [23]:
#Original dataset
confint(lm.model)
                 2.5 %     97.5 %
(Intercept) -1.1150804 -0.9226122
x            0.3925794  0.6063602
In [24]:
#Less noisy dataset
confint(lm.model2)
                 2.5 %     97.5 %
(Intercept) -1.0727832 -0.9510557
x2           0.4320613  0.5672681
In [25]:
#Noisier dataset
confint(lm.model3)
                 2.5 %     97.5 %
(Intercept) -1.1782817 -0.8801114
x3           0.3335847  0.6647725

The confidence intervals for Beta0 and Beta1 are narrowest for the less noisy dataset and widest for the noisier dataset, since the width of a confidence interval scales with the standard error of the coefficient estimate.
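
This can be quantified directly (a sketch; ci_width is just an illustrative helper): the interval widths grow with the residual standard error of each fit.

In [ ]:
# Sketch: widths (upper - lower) of the 95% confidence intervals for each fit;
# row labels are taken from the first model's coefficient names.
ci_width = function(fit) apply(confint(fit), 1, diff)
sapply(list(var_0.25 = lm.model, var_0.10 = lm.model2, var_0.60 = lm.model3), ci_width)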
