8135
SYDNEY
FACULTIES OF ARTS AND SCIENCE
STAT 3012: Applied Linear Models
Semester 1, 2006
Time allowed: Three hours
1. In a study of high-density lipoprotein (HDL, labelled y) in human blood a
sample of size 42 was used. Measurements were taken on total cholesterol ($x_1$),
total triglyceride ($x_2$) as well as noting whether a sticky component, sinking
pre-beta (SPB, labelled $x_3$) was present (coded 1) or absent (coded 0). A partial
analysis is given in the attached R-output. The basis for the analysis is the
model
$$Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon_i,$$
where $\varepsilon_i \sim NID(0, \sigma^2)$.
(a) (2 marks) Write down the fitted multiple regression model. What proportion
of the total variability in Y is explained by the multiple regression?
(b) (2 marks) Provide a 95% confidence interval for $\beta_2$.
(c) (4 marks) What is the purpose of the residual versus fitted values plot?
Are the above model assumptions reasonable? Justify your answer.
(d) (3 marks) Are there any high leverage points in this data set? What
characterises a high leverage point in general?
(e) (2 marks) Are there any outliers in the data?
(f) (3 marks) Test the hypothesis $H_0: \beta_1 = \beta_2 = 0$.
(g) (2 marks) Give a 95% confidence interval for the error standard deviation,
$\sigma$, for the simple regression on $x_3$ only.
(h) (2 marks) What is the predicted difference in average HDL levels between
people with and without SPB?
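For readers working without the attached output, the following is a minimal R sketch of how the quantities in parts (a), (b) and (d) to (g) might be obtained from the data listed in the computer output for Question 1. The leverage cut-off 2p/n and the |standardised residual| > 2 rule are common rules of thumb assumed here, not thresholds stated in the paper; the object names `fit` and `fit3` are illustrative.

```r
# HDL data transcribed from the data listing in the computer output (42 rows).
y  <- c(47, 38, 47, 39, 44, 64, 58, 49, 57, 42, 54, 60, 33, 55,
        36, 36, 55, 52, 49, 47, 40, 42, 63, 40, 59, 56, 76, 67,
        57, 42, 41, 42, 39, 27, 31, 39, 56, 40, 58, 43, 40, 46)
x1 <- c(287, 236, 255, 135, 121, 171, 260, 237, 192, 349, 263, 223, 316, 288,
        256, 318, 261, 397, 295, 261, 258, 280, 339, 161, 324, 171, 265, 280,
        248, 270, 262, 264, 325, 388, 260, 284, 326, 248, 285, 361, 248, 280)
x2 <- c(111, 135,  98,  63,  46, 103, 227, 157, 115, 408, 103, 102, 274, 130,
        149, 180, 266, 167, 164, 119, 145, 247, 168,  68,  92,  56, 240, 306,
         93, 134, 154,  86, 148, 191, 123, 135, 236,  92, 153, 126, 226, 176)
x3 <- c(rep(0, 8), rep(1, 4), rep(0, 7), rep(1, 10), rep(0, 7), rep(1, 6))
dat <- data.frame(y, x1, x2, x3)

# (a) Fitted model and proportion of variability explained.
fit <- lm(y ~ x1 + x2 + x3, data = dat)
print(coef(fit))
print(summary(fit)$r.squared)

# (b) 95% confidence interval for the coefficient of x2.
print(confint(fit, "x2", level = 0.95))

# (d) Leverages above 2p/n (assumed rule of thumb).
p <- length(coef(fit))
n <- nrow(dat)
print(which(hatvalues(fit) > 2 * p / n))

# (e) Standardised residuals larger than 2 in absolute value (assumed rule).
print(which(abs(rstandard(fit)) > 2))

# (f) Partial F-test of beta_1 = beta_2 = 0: compare with the model on x3 only.
fit3 <- lm(y ~ x3, data = dat)
print(anova(fit3, fit))

# (g) Chi-squared interval for sigma in the simple regression on x3.
df3 <- df.residual(fit3)
s3  <- summary(fit3)$sigma
print(sqrt(df3 * s3^2 / qchisq(c(0.975, 0.025), df3)))
```

The interval in (g) uses the fact that, under the normal model, the residual sum of squares divided by $\sigma^2$ has a chi-squared distribution on the residual degrees of freedom.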
2. To determine the effect of exhaust index (in seconds) and pump heater voltage
on the pressure inside a vacuum tube (in microns of mercury), three exhaust
indices and two voltages were chosen at fixed levels. It was decided to run two
trials for each combination of index and voltage so in all 12 trials were run. A
completely randomised design was deemed appropriate. The results of the 12
experiments are given below.
Pump Heater            Exhaust Index
Voltage           60        90        120      Totals
127             48, 58    28, 33     7, 15      189
220             62, 54    14, 10     9, 6       155
Totals           222        85        37        344
(a) (9 marks) Calculate the interaction sum of squares and hence complete the
analysis of variance table. Define an appropriate model and hence test for
an interaction effect.
Source of Variation      d.f.    Sum of Squares
Voltage (V)
Exhaust Index (E)                  4608.167
Interaction (V × E)
Residual                  6         139.00
Total                              5126.667
(b) (3 marks) Plot the mean response curves for this data set. Comment.
(c) (2 marks) Show that the linear component of the Exhaust Index sum of
squares is 4278.125.
(d) (3 marks) Test that pressure varies linearly with exhaust index.
(e) (3 marks) Calculate a 95% confidence interval for the average difference in
pressure at the two voltage levels when the exhaust index is set at 90.
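The sums of squares quoted in parts (a) and (c) can be checked numerically. The sketch below, which assumes the twelve readings are entered row by row from the table above, fits the two-factor model with `aov` and computes the linear contrast by hand; the variable names are illustrative only.

```r
# Pressure readings from the Question 2 table, two trials per cell.
pressure <- c(48, 58, 28, 33, 7, 15,    # voltage 127: index 60, 90, 120
              62, 54, 14, 10, 9, 6)     # voltage 220: index 60, 90, 120
voltage <- factor(rep(c(127, 220), each = 6))
index   <- factor(rep(rep(c(60, 90, 120), each = 2), times = 2))

# Two-factor analysis of variance with interaction (balanced design).
fit <- aov(pressure ~ voltage * index)
tab <- summary(fit)[[1]]
ss  <- setNames(tab[["Sum Sq"]], trimws(rownames(tab)))
print(ss)

# Linear component of the exhaust index sum of squares, using the
# contrast (-1, 0, 1) on the index totals; each total is a sum of 4 readings.
index_totals <- tapply(pressure, index, sum)
lin <- c(-1, 0, 1)
ss_linear <- sum(lin * index_totals)^2 / (4 * sum(lin^2))
print(ss_linear)
```

Because the design is balanced, the sequential sums of squares reported by `aov` do not depend on the order in which the two factors are entered.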
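3. (a) (5 marks) Consider the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
where $\varepsilon \sim N(0, \sigma^2)$.
(i) Prove that the least squares estimates for the parameters satisfy
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}, \quad\text{and}\quad
\hat\beta_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})\,x_i}{\sum_{i=1}^{n} (x_i - \bar{x})\,x_i}.$$
(ii) Show that the residuals, $R_i = Y_i - (\hat\beta_0 + \hat\beta_1 x_i)$, sum to 0.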
(b) (5 marks) Da Vinci's famous sketch of a man shows a person's armspan
measured across their back is approximately equal to their height. Eight
measurements (in inches) on armspan (t) and height (h) are stored in data
in the attached output.
(i) Use the output to test the claim that the slope of the regression line
for predicting height given armspan is 1.
(ii) Use the data to obtain a 90% confidence region for the height of a
person with armspan of 64 inches.
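One possible outline of the argument for part (a) is sketched below. It is an illustration of the standard normal-equations approach, not a model answer issued with the paper; the symbol $S$ for the residual sum of squares is introduced here for convenience.

```latex
% Least squares criterion (S is notation assumed for this sketch).
S(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2

% Normal equations: set both partial derivatives to zero.
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0,
\qquad
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0

% The first equation, divided by n, gives the intercept;
% it also states directly that the residuals R_i sum to zero.
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x},
\qquad
\sum_{i=1}^{n} R_i = 0

% Substituting the intercept into the second equation gives the slope.
\sum_{i=1}^{n} x_i (Y_i - \bar{Y}) - \hat\beta_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0
\;\Longrightarrow\;
\hat\beta_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})\, x_i}{\sum_{i=1}^{n} (x_i - \bar{x})\, x_i}
```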
4. A nutrition specialist studied the effects of six diets on weight gain of domestic
rabbits. It was decided to block on litters, since rabbits were anticipated
to be similar within litters but different between litters. The weight gain was
measured on 3 rabbits in each of 10 litters (blocks). An incomplete block design
was used to assign the 6 diets (treatments) to the rabbits. The data frame
referred to below has 3 columns: treat, a factor with levels a, b, ..., f indicating
the diet, a numeric vector gain giving the weight gain and a factor block
indicating the litter. Some R output is displayed below.
(a) (3 marks) The incomplete blocking design is displayed more clearly below.
It is not cyclically generated but is balanced; how does the output below
tell us this?
> matrix(rabbit$treat, nrow = 3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] f    c    c    a    e    b    d    a    d    f
[2,] b    a    f    e    c    f    a    e    b    d
[3,] c    b    d    c    d    e    b    f    e    a
> L <- lm(gain ~ block + treat)
> S <- summary(L)
> S$cov.unscaled[11:15, 11:15]
       treatb treatc treatd treate treatf
treatb   0.50   0.25   0.25   0.25   0.25
treatc   0.25   0.50   0.25   0.25   0.25
treatd   0.25   0.25   0.50   0.25   0.25
treate   0.25   0.25   0.25   0.50   0.25
treatf   0.25   0.25   0.25   0.25   0.50
> S$coef[,1:2]
               Estimate Std. Error
(Intercept) 36.01388889   2.588631
blockb10     3.29722222   2.796041
blockb2      4.13333333   2.694333
blockb3     -1.80277778   2.694333
blockb4      8.79444444   2.796041
blockb5      2.30555556   2.796041
blockb6      5.40833333   2.694333
blockb7      5.77777778   2.796041
blockb8      9.42777778   2.796041
blockb9     -7.48055556   2.796041
treatb      -1.74166667   2.241821
treatc       0.40000000   2.241821
treatd       0.06666667   2.241821
treate      -5.22500000   2.241821
treatf       3.30000000   2.241821
> L$df.res
[1] 15
> qtukey(c(.95,.99),6,15)/sqrt(2)
[1] 3.248968 4.098088
> qt(1-c(.05,.01)/(2*15),15)
[1] 3.483677 4.273314
> sqrt(5*qf(c(.95,.99),5,15))
[1] 3.808736 4.772638
(b) (9 marks) Suppose that all pairwise differences are of interest, and this
is decided a priori. Using the output above, determine if any pairs of
treatments are significantly different at either the 5% or 1% levels.
(c) (3 marks) Do any of your conclusions change if it is acknowledged that the
decision to consider only all pairwise differences is made after looking at
the data?
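As an illustration of how the output above could be combined, the sketch below forms all 15 pairwise t-ratios from the treatment estimates (with diet a as the baseline at 0) and sets them beside the three sets of critical values printed in the output. It assumes the estimates and standard error are typed in from `S$coef`; the names `est`, `t_ratio` and `crit` are illustrative. Which critical value is appropriate for parts (b) and (c) is left to the reader.

```r
# Treatment estimates from S$coef, with baseline diet a set to 0.
est <- c(a = 0, b = -1.74166667, c = 0.40000000,
         d = 0.06666667, e = -5.22500000, f = 3.30000000)
se_coef <- 2.241821                      # Std. Error of each treat coefficient

# Residual standard deviation implied by the 0.50 diagonal of cov.unscaled.
s <- se_coef / sqrt(0.50)

# Standard error of a difference of two non-baseline coefficients:
# s * sqrt(0.50 + 0.50 - 2 * 0.25), which equals se_coef, so every
# pairwise difference (including those against a) has the same standard error.
se_diff <- s * sqrt(0.50 + 0.50 - 2 * 0.25)

# Absolute t-ratios for all pairs of diets.
pairs   <- combn(names(est), 2)
t_ratio <- abs(est[pairs[1, ]] - est[pairs[2, ]]) / se_diff
names(t_ratio) <- paste(pairs[1, ], pairs[2, ], sep = "-")

# Critical values reproduced from the calls shown in the output.
crit <- rbind(
  tukey      = qtukey(c(.95, .99), 6, 15) / sqrt(2),
  bonferroni = qt(1 - c(.05, .01) / (2 * 15), 15),
  scheffe    = sqrt(5 * qf(c(.95, .99), 5, 15)))
colnames(crit) <- c("5%", "1%")

print(round(sort(t_ratio, decreasing = TRUE), 3))
print(crit)
```

The equal standard errors for every pair are a consequence of the constant diagonal and off-diagonal entries of `S$cov.unscaled`.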
5. An experiment was performed on 5 varieties of cowpeas planted at 3 spacings
using a split-plot design. Each variety was randomly assigned to 4 of the 20
wholeplots, and each whole plot was divided into 3 subplots for the spacings.
The data is yield per plot. The data frame cowpea contains a numeric vector
yield and factors variety with 5 levels, spacing with 3 levels and wplot with
20 levels. The output below shows an appropriate analysis of variance table
when each one of the three factors is ignored.
> summary(aov(yield~wplot+spacing,cowpea))
            Df  Sum Sq Mean Sq F value    Pr(>F)
wplot       19 2126.67  111.93  4.2190 7.892e-05 ***
spacing      2  109.20   54.60  2.0581    0.1417
Residuals   38 1008.13   26.53
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(aov(yield~variety*spacing,cowpea))
                Df  Sum Sq Mean Sq F value    Pr(>F)
variety          4 1089.17  272.29 10.4683 4.425e-06 ***
spacing          2  109.20   54.60  2.0991 0.1344074
variety:spacing  8  875.13  109.39  4.2056 0.0007994 ***
Residuals       45 1170.50   26.01
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(aov(yield~variety+wplot,cowpea))
            Df  Sum Sq Mean Sq F value    Pr(>F)
variety      4 1089.17  272.29  9.7479 1.336e-05 ***
wplot       15 1037.50   69.17  2.4761    0.0113 *
Residuals   40 1117.33   27.93
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(a) (13 marks) Using the above output, draw up the appropriate analysis of
variance table for the full model with wholeplots-within-varieties regarded
as a random effect. Your table should have two sections, one for
between-wholeplot comparisons and one for within-wholeplot comparisons.
(b) (4 marks) Indicate which mean-squares would be used to construct
F-statistics for testing the following hypotheses. If possible, provide
probability expressions (e.g. $P\{F_{3,6} > 5.12\}$) for appropriate p-values.
Ignore any multiplicity issues:
i. There is no difference overall between the varieties.
ii. There is no difference overall between the spacings.
iii. The effect of spacings is the same for each variety.
iv. Different wholeplots within each variety give the same yield.
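The arithmetic behind part (a) can be sketched in R by combining sums of squares from the three tables above. The sketch assumes the usual split-plot decomposition, in which the within-wholeplot residual is what remains of the total after removing variety, wholeplots within variety, spacing and the variety by spacing interaction. The data frame `split_plot` and its column names are illustrative.

```r
# Total sum of squares and degrees of freedom, from the first table
# (wplot + spacing + Residuals); 20 wholeplots x 3 subplots gives 59 df.
ss_total <- 2126.67 + 109.20 + 1008.13
df_total <- 19 + 2 + 38

# Lines taken from the second and third tables; the residual is filled in below.
split_plot <- data.frame(
  source  = c("variety", "wplot within variety",
              "spacing", "variety:spacing", "residual"),
  stratum = c("between", "between", "within", "within", "within"),
  df      = c(4, 15, 2, 8, NA),
  ss      = c(1089.17, 1037.50, 109.20, 875.13, NA)
)

# Within-wholeplot residual obtained by subtraction.
split_plot$df[5] <- df_total - sum(split_plot$df[1:4])
split_plot$ss[5] <- ss_total - sum(split_plot$ss[1:4])
split_plot$ms    <- split_plot$ss / split_plot$df
print(split_plot)

# F-ratios under the assumed error strata: variety against wholeplots
# within variety; all other lines against the within-wholeplot residual.
ms <- setNames(split_plot$ms, split_plot$source)
f_stats <- c(
  variety           = ms[["variety"]] / ms[["wplot within variety"]],
  spacing           = ms[["spacing"]] / ms[["residual"]],
  `variety:spacing` = ms[["variety:spacing"]] / ms[["residual"]],
  wplot             = ms[["wplot within variety"]] / ms[["residual"]]
)
print(f_stats)
```

Upper-tail probabilities for these ratios could then be obtained with `pf(..., lower.tail = FALSE)` on the corresponding degrees of freedom.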
6. A $2^5$ factorial experiment in 2 replicates each with 4 blocks of size 8 is to be
conducted using an incomplete block design.
(a) (2 marks) Describe the method of randomisation which should be used for
such a design.
(b) (2 marks) How many effects are confounded in each replicate?
(c) It is desired that no effect is to be totally confounded and no main or two
factor interaction effect is to be partially confounded.
i. (2 marks) How many degrees of freedom for residuals would there be
in a design with these properties?
ii. (10 marks) Give a design with the above properties.
THIS IS THE LAST PAGE
Computer output
Question 1.
R : Copyright 2004, The R Foundation for Statistical Computing
Version 1.9.1
(2004-06-21), ISBN 3-900051-00-3
> dat<-data.frame(y,x1,x2,x3)
> dat
    y  x1  x2 x3
1  47 287 111  0
2  38 236 135  0
3  47 255  98  0
4  39 135  63  0
5  44 121  46  0
6  64 171 103  0
7  58 260 227  0
8  49 237 157  0
9  57 192 115  1
10 42 349 408  1
11 54 263 103  1
12 60 223 102  1
13 33 316 274  0
14 55 288 130  0
15 36 256 149  0
16 36 318 180  0
17 55 261 266  0
18 52 397 167  0
19 49 295 164  0
20 47 261 119  1
21 40 258 145  1
22 42 280 247  1
23 63 339 168  1
24 40 161  68  1
25 59 324  92  1
26 56 171  56  1
27 76 265 240  1
28 67 280 306  1
29 57 248  93  1
30 42 270 134  0
31 41 262 154  0
32 42 264  86  0
33 39 325 148  0
34 27 388 191  0
35 31 260 123  0
36 39 284 135  0
37 56 326 236  1
38 40 248  92  1
39 58 285 153  1
40 43 361 126  1
41 40 248 226  1
42 46 280 176  1
> summary(dat)
       y               x1              x2              x3
 Min.   :27.00   Min.   :121.0   Min.   : 46.0   Min.   :0.0000
 1st Qu.:40.00   1st Qu.:248.0   1st Qu.:103.0   1st Qu.:0.0000
 Median :46.50   Median :263.5   Median :140.0   Median :0.0000
 Mean   :47.76   Mean   :267.8   Mean   :155.0   Mean   :0.4762
3rd Qu.:56.00
3rd Qu