hw1 solns stat110 aut0708
8/8/2019 Hw1 Solns Stat110 Aut0708
http://slidepdf.com/reader/full/hw1-solns-stat110-aut0708 1/11
Statistics 110 Autumn 2007–2008
B. Srinivasan
Out: Oct. 5, 2007
Handout #5
Homework #1 Solutions
1. MB 1.1
These data are also available as the houseprices dataset in DAAG. The histogram on the linear scale emphasizes outliers, while the histogram on the log scale conceals them.
> rm(list = ls())
> library(DAAG)
> data(houseprices)
> attach(houseprices)
> layout(matrix(1:4, 2, 2, byrow = TRUE))
> plot(sale.price, area)
> hist(sale.price)
> plot(log(sale.price), area)
> hist(log(sale.price))
> detach(houseprices)
[Figure: 2 x 2 panel of plots. Top row: area against sale.price; histogram of sale.price. Bottom row: area against log(sale.price); histogram of log(sale.price).]
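To make the remark about outliers concrete, the sketch below measures how far the largest value sits from the mean, in standard-deviation units, before and after taking logs. The prices vector is hypothetical (it is not the houseprices data) and was chosen only to be right-skewed; max.z is a helper name introduced here for illustration.

```r
# Hypothetical right-skewed prices (NOT the DAAG houseprices values)
prices = c(110, 120, 125, 130, 135, 140, 150, 160, 220, 375)

# Distance of the maximum from the mean, in standard deviations
max.z = function(v) (max(v) - mean(v)) / sd(v)

z.linear = max.z(prices)
z.log = max.z(log(prices))

# The largest value is less extreme on the log scale
print(c(linear = z.linear, log = z.log))
```

For this sample the largest price is less extreme on the log scale, which is why it stands out less in a log-scale histogram.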
2. MB 1.3
It is usually good practice to explicitly clear the workspace between problems. The code below shows how to quickly inspect columns with str() and how to determine
automatically which rows and columns contain at least one missing element.
Note that the nested calls here show how R can be used as a functional programming language, in which each result is handed off to the invoking function.
> rm(list = ls())
> library(DAAG)
> data(possum)
> for (nn in 1:ncol(possum)) {
+ print(colnames(possum)[nn])
+ str(possum[, nn])
+ }
[1] "case"
num [1:104] 1 2 3 4 5 6 7 8 9 10 ...
[1] "site"
num [1:104] 1 1 1 1 1 1 1 1 1 1 ...
[1] "Pop"
Factor w/ 2 levels "Vic","other": 1 1 1 1 1 1 1 1 1 1 ...
[1] "sex"
Factor w/ 2 levels "f","m": 2 1 1 1 1 1 2 1 1 1 ...
[1] "age"
num [1:104] 8 6 6 6 2 1 2 6 9 6 ...
[1] "hdlngth"
num [1:104] 94.1 92.5 94 93.2 91.5 93.1 95.3 94.8 93.4 91.8 ...
[1] "skullw"
num [1:104] 60.4 57.6 60 57.1 56.3 54.8 58.2 57.6 56.3 58 ...
[1] "totlngth"
num [1:104] 89 91.5 95.5 92 85.5 90.5 89.5 91 91.5 89.5 ...
[1] "taill"
num [1:104] 36 36.5 39 38 36 35.5 36 37 37 37.5 ...
[1] "footlgth"
num [1:104] 74.5 72.5 75.4 76.1 71 73.2 71.5 72.7 72.4 70.9 ...
[1] "earconch"
num [1:104] 54.5 51.2 51.9 52.2 53.2 53.6 52 53.9 52.9 53.4 ...
[1] "eye"
num [1:104] 15.2 16 15.5 15.2 15.1 14.2 14.2 14.5 15.5 14.4 ...
[1] "chest"
num [1:104] 28 28.5 30 28 28.5 30 30 29 28 27.5 ...
[1] "belly"
num [1:104] 36 33 34 34 33 32 34.5 34 33 32 ...
> missing.inds = which(!complete.cases(possum))
> print(missing.inds)
[1] 41 44 46
> print(possum[missing.inds, ])
     case site Pop sex age hdlngth skullw totlngth taill footlgth earconch  eye
BB36   41    2 Vic   f   5    88.4   57.0       83  36.5       NA     40.3 15.9
BB41   44    2 Vic   m  NA    85.1   51.5       76  35.5     70.3     52.6 14.4
BB45   46    2 Vic   m  NA    91.4   54.4       84  35.0     72.8     51.2 14.4
     chest belly
BB36  27.0  30.5
BB41  23.0  27.0
BB45  24.5  35.0
> print(which(apply(is.na(possum[missing.inds, ]), 2, sum) > 0))
age footlgth
5 10
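The same row and column checks can be sketched on a small data frame without DAAG. The data frame df below is hypothetical (it is not the possum data); colSums(is.na(df)) is used here as a shorter alternative to the apply(..., 2, sum) call above and counts missing values per column over the whole data frame.

```r
# Hypothetical data frame with NAs in two columns (not the possum data)
df = data.frame(id = 1:5,
                age = c(3, NA, 5, NA, 2),
                len = c(70.1, 71.3, NA, 69.8, 72.0))

# Rows with at least one missing value
missing.rows = which(!complete.cases(df))

# Columns with at least one missing value
na.counts = colSums(is.na(df))
missing.cols = names(na.counts)[na.counts > 0]

print(missing.rows)
print(missing.cols)
```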
3. MB 1.4
This shows a different way to find which columns contain missing values: transpose the ais data frame and then run complete.cases on the result. We see that no columns have missing values, and that Gym, Netball, T_Sprnt, and W_Polo are sex-imbalanced by a factor of 2:1 or more.
> rm(list = ls())
> library(DAAG)
> data(ais)
> for (nn in 1:ncol(ais)) {
+ print(colnames(ais)[nn])
+ str(ais[, nn])
+ }
[1] "rcc"
num [1:202] 3.96 4.41 4.14 4.11 4.45 4.1 4.31 4.42 4.3 4.51 ...
[1] "wcc"
num [1:202] 7.5 8.3 5 5.3 6.8 4.4 5.3 5.7 8.9 4.4 ...
[1] "hc"
num [1:202] 37.5 38.2 36.4 37.3 41.5 37.4 39.6 39.9 41.1 41.6 ...
[1] "hg"
num [1:202] 12.3 12.7 11.6 12.6 14 12.5 12.8 13.2 13.5 12.7 ...
[1] "ferr"
num [1:202] 60 68 21 69 29 42 73 44 41 44 ...
[1] "bmi"
num [1:202] 20.6 20.7 21.9 21.9 19.0 ...
[1] "ssf"
num [1:202] 109.1 102.8 104.6 126.4 80.3 ...
[1] "pcBfat"
num [1:202] 19.8 21.3 19.9 23.7 17.6 ...
[1] "lbm"
num [1:202] 63.3 58.5 55.4 57.2 53.2 ...
[1] "ht"
num [1:202] 196 190 178 185 185 ...
[1] "wt"
num [1:202] 78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
[1] "sex"
Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
[1] "sport"
Factor w/ 10 levels "B_Ball","Field",..: 1 1 1 1 1 1 1 1 1 1 ...
> missing.cols = which(!complete.cases(t(ais)))
> print(paste("Number of missing columns = ", length(missing.cols)))
[1] "Number of missing columns = 0"
> sex.vs.sport = table(ais$sex, ais$sport)
> sex.sport.ratio = sex.vs.sport[1, ]/sex.vs.sport[2, ]
> which(sex.sport.ratio > 2 | sex.sport.ratio < 0.5)
Gym Netball T_Sprnt W_Polo
3 4 8 10
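The ratio test is easy to try on a toy table. The athletes data frame below is hypothetical (it is not the ais data). One design choice differs from the code above and is an assumption of this sketch: the comparisons use >= 2 and <= 0.5 so that an exact 2:1 split is also flagged, matching the wording "2:1 or more". Sports with no males give a ratio of Inf and sports with no females give 0, so both are flagged.

```r
# Hypothetical athletes (not the ais data)
athletes = data.frame(
  sex = factor(c("f", "f", "f", "m", "f", "m", "m", "m", "m"),
               levels = c("f", "m")),
  sport = factor(c("Gym", "Gym", "Row", "Row", "Swim", "Swim", "Swim",
                   "Polo", "Polo")))

# Cross-tabulate sex (rows) by sport (columns)
counts = table(athletes$sex, athletes$sport)

# Female-to-male ratio per sport: Inf when no males, 0 when no females
ratio = counts["f", ] / counts["m", ]

# Flag sports imbalanced by 2:1 or more in either direction
imbalanced = names(which(ratio >= 2 | ratio <= 0.5))

print(counts)
print(imbalanced)
```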
4. MB 1.6
(a) For the first plot, we explicitly use a log transform in the plot invocation.
> rm(list = ls())
> library(DAAG)
> data(Manitoba.lakes)
> attach(Manitoba.lakes)
> plot(log2(area) ~ elevation, pch = 16, xlim = c(170, 280), ylab = "Log of Area, square km",
+ xlab = "Elevation (meters above sea level)")
> text(log2(area) ~ elevation, labels = row.names(Manitoba.lakes),
+ pos = 4)
> text(log2(area) ~ elevation, labels = area, pos = 2)
> title("Manitoba's Largest Lakes (Area in square.km near points)")
[Figure: "Manitoba's Largest Lakes (Area in square.km near points)". log2(area) against elevation; x-axis "Elevation (meters above sea level)", y-axis "Log of Area, square km". Points are labeled with lake name and area: Winnipeg 24387, Winnipegosis 5374, Manitoba 4624, SouthernIndian 2247, Cedar 1353, Island 1223, Gods 1151, Cross 755, Playgreen 657.]
(b) For the second plot, we invoke plot with untransformed data but specify log-scale (i.e. unevenly spaced) ticks on the y-axis.
> plot(area ~ elevation, pch = 16, xlim = c(170, 280), ylab = "Area, square km",
+ xlab = "Elevation (meters above sea level)", log = "y")
> text(area ~ elevation, labels = row.names(Manitoba.lakes), pos = 4)
> text(area ~ elevation, labels = area, pos = 2)
> title("Manitoba's Largest Lakes (Area in square.km near points)")
> detach(Manitoba.lakes)
[Figure: "Manitoba's Largest Lakes (Area in square.km near points)". area against elevation with a log-scaled y-axis; x-axis "Elevation (meters above sea level)", y-axis "Area, square km" with ticks from 1000 to 20000. Points are labeled with lake name and area.]
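The plots in (a) and (b) place the points in the same pattern and differ only in how the y-axis is labeled, because log2(a) = log10(a) / log10(2). A minimal check of this, assuming the nine areas are those printed next to the points in the plots (so DAAG is not needed):

```r
# Lake areas (square km) as printed next to the points in the plots
area = c(24387, 5374, 4624, 2247, 1353, 1223, 1151, 755, 657)

# Vertical position used in part (a): explicit log2 transform
y.a = log2(area)

# Vertical position used in part (b): log = "y" draws on a log10 scale
y.b = log10(area)

# The two differ only by the constant factor 1/log10(2)
print(all.equal(y.a, y.b / log10(2)))
```

The choice between the two is therefore about readability: (a) shows tick labels in log units, while (b) shows them in the original square km.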
5. MB 1.7
This is fairly simple: just two invocations of dotchart. A dotchart is similar to a boxplot in that a continuous variable is plotted against a categorical variable. The main difference is that a dotchart keeps the individual dots and does not draw a box. In this case each category has exactly one value of the continuous variable, but in general this need not be so; for example, we might have multiple noisy measurements of the area for each lake.
> rm(list = ls())
> library(DAAG)
> data(Manitoba.lakes)
> layout(1:2)
> dotchart(Manitoba.lakes$area, rownames(Manitoba.lakes), main = "Manitoba's largest lakes",
+ xlab = "Area (km^2)")
> dotchart(log(Manitoba.lakes$area), rownames(Manitoba.lakes),
+ main = "Manitoba's largest lakes", xlab = "Log(Area) (km^2)")
[Figure: two dotcharts titled "Manitoba's largest lakes", one row per lake (Winnipeg, Winnipegosis, Manitoba, SouthernIndian, Cedar, Island, Gods, Cross, Playgreen). Top: x-axis "Area (km^2)". Bottom: x-axis "Log(Area) (km^2)".]
6. MB 1.13
The correlation between brain and body size is most apparent from the log-log plot. By contrast, the linear plot makes it appear as if brain and body size are not related. The square root and 0.1 power plots are successively closer to the log plot, because within this range these transformations (especially the 0.1 power) are close to the log. This can be confirmed by executing plot(log(body), body^0.1), which will show that the two transformations are close-to-linearly related (though their absolute values differ).
> par(mfrow = c(2, 2))
> library(MASS)
> attach(Animals)
> plot(body, brain)
> plot(sqrt(body), sqrt(brain))
> plot(body^0.1, brain^0.1)
> plot(log(body), log(brain))
> detach(Animals)
> par(mfrow = c(1, 1))
[Figure: 2 x 2 panel of scatterplots: brain against body; sqrt(brain) against sqrt(body); brain^0.1 against body^0.1; log(brain) against log(body).]
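The claim that small powers behave like the log follows from x^0.1 = exp(0.1 log(x)), which is approximately 1 + 0.1 log(x) while 0.1 log(x) stays small. The sketch below checks this numerically using correlation with log(body) as a rough measure of linearity. The body vector is hypothetical: a log-spaced grid chosen to span roughly the range on the plot axes, not the Animals data.

```r
# Hypothetical body weights on a log-spaced grid (NOT the MASS Animals data)
body = 10^seq(-1.5, 5, length.out = 50)

# Correlation of each transformation with log(body)
r.linear = cor(log(body), body)
r.sqrt = cor(log(body), sqrt(body))
r.tenth = cor(log(body), body^0.1)

# The smaller the power, the closer the relationship is to linear
print(round(c(linear = r.linear, sqrt = r.sqrt, tenth = r.tenth), 3))
```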
7. MB 1.16
You would first execute the code below:
> rm(list = ls())
> library(DAAG)
> data(socsupport)
> attach(socsupport)
> gender1 = abbreviate(gender, 1)
> table(gender1)
gender1
f m
71 24
> country3 = abbreviate(country, 3)
> table(country3)
country3
ast oth
85 10
> num = seq(along = gender)
> lab = paste(gender1, country3, num, sep = ":")
> plot(BDI ~ age)
> detach(socsupport)
[Figure: BDI against age group (18−20, 21−24, 25−30, 31−40, 40+).]
You would then run the command identify(BDI ~ age, labels = lab) to locate the rows with indexes 8, 12, 36, 59, 68, and 95 as outliers; these would be labeled automatically as you clicked on them. The final plot would look like this:
[Figure: the BDI against age plot with these six points labeled.]
8. MB 1.17
When x is empty (here NULL), seq(along = x) returns integer(0), which is another way of describing a vector of length 0. This is better than seq(1, length(x)), which returns c(1, 0) because seq(1, 0) counts down from 1 to 0 (see ?seq for why).
> x = c(8, 54, 534, 1630, 6611)
> seq(1, length(x))
[1] 1 2 3 4 5
> seq(along = x)
[1] 1 2 3 4 5
> x = NULL
> seq(1, length(x))
[1] 1 0
> seq(along = x)
integer(0)
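The practical consequence shows up in loops. The sketch below counts how many times a loop body runs over an empty x under each form; the counter names are introduced here for illustration. seq_along(x) is the dedicated base R function for this purpose and behaves like seq(along = x).

```r
x = NULL

# seq(1, length(x)) is seq(1, 0), i.e. c(1, 0): the body runs twice
n.bad = 0
for (i in seq(1, length(x))) n.bad = n.bad + 1

# seq(along = x) is integer(0): the body never runs
n.good = 0
for (i in seq(along = x)) n.good = n.good + 1

# seq_along(x) behaves the same way as seq(along = x)
n.along = 0
for (i in seq_along(x)) n.along = n.along + 1

print(c(n.bad, n.good, n.along))
```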