8/14/2019 Multivariate Analysis Exam 2 Dan Sewell Question 1: A. The
MULTIVARIATE ANALYSIS
EXAM 2
Dan Sewell
Question 1:
a. The first few observations of mortalities per 100,000 are below, which were
calculated in SAS (see code in Appendix), but the entirety of the data set can be
found in the exhaustive SAS output (see Appendix).
Obs    newcopd     newcvd    newpneu    newresp
  1    64.2845    336.711    32.4186    97.2557
  2    15.8351    160.462    16.8907    32.7258
  3    21.5268    206.496    21.3919    42.9862
  4    34.7171    161.644    12.1879    47.3974
  5    47.6086    289.430    19.0434    66.9543
  6    22.2840    370.108    19.8618    42.1459
  7    29.5735    242.242    29.4285    59.1470
  8    42.5145    427.039    31.8858    74.8212
  9    31.7778    216.549    28.7582    60.8236
 10    33.0126    328.191    31.1527    64.5372
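As an illustration of the rate calculation in part a, here is a minimal Python sketch of a mortality rate per 100,000. The original calculation was done in SAS; the function name and the example counts are hypothetical and are not taken from the exam data.

```python
def rate_per_100k(deaths, population):
    """Return deaths per 100,000 population (hypothetical helper)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100_000


# Hypothetical counts, for illustration only (not from the data set).
example_rate = rate_per_100k(643, 1_000_000)
```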
b. Monthly average temperatures and ozone levels, from April to September, were
computed in Microsoft Excel (using the =AVERAGE() function). The resulting data set
was then imported into SAS. The first few observations are below, but the entire
data set can be found in the exhaustive SAS output.
Obs  TempApril  TempMay  TempJune  TempJuly  TempAug  TempSep   O3April
  1    51.4000  61.6129   70.0667   76.0968  68.4516  63.7333  -2.59742
  2    56.7667  67.4839   74.9333   83.2258  79.9032  70.2667  -5.77401
  3    65.3667  69.1290   75.1667   79.3871  82.1290  73.3000  -4.00079
  4    72.0000  76.6452   81.9667   83.4516  88.5161  81.6667   8.26811
  5    58.0667  66.6452   73.9000   80.2258  77.8710  77.2333   2.01369
  6    72.0333  74.7419   80.5000   81.8710  85.2903  76.3000   4.48913
  7    49.4000  58.4194   71.2000   76.0323  71.6452  67.3667   6.72094
  8    46.1333  59.8387   68.5667   74.5806  68.0645  64.4667   2.51458
  9    62.2333  65.9677   73.6333   79.0645  79.6774  69.3667   0.53066
 10    49.8000  61.8710   70.5667   78.5806  70.5484  63.6667   1.25153

Obs     O3May    O3June    O3July     O3Aug     O3Sep
  1    2.9776    9.3160   10.1753   -1.3167   -5.4473
  2    3.3453    6.7535   13.4359    8.1736   -6.5235
  3    2.8535   -1.9897   -0.0305   13.6344    2.2994
  4    2.1398   -5.5971  -11.3581    9.2746   11.6430
  5   10.1284   14.0376   13.9645    9.5357   16.2752
  6   13.4318    0.1609   -3.6472    8.2088    3.4473
  7    2.6733    7.3469    7.7088    3.5155   -3.5027
  8    9.0380   13.3186   15.9207    1.0480    0.3954
  9    2.7441    4.2031    3.6292   12.7265   -6.9343
 10    6.0385    7.6226    7.0373   -0.5158    0.0429
Question 2:
Are mortalities related to coastline? To answer this question, I ran a MANOVA with
the following model:
\[
\begin{bmatrix} \text{newcopd}_i & \text{newcvd}_i & \text{newpneu}_i & \text{newresp}_i \end{bmatrix}
=
\begin{bmatrix} 1 & \text{coast}_i \end{bmatrix}
\begin{bmatrix}
\beta_{01} & \beta_{02} & \beta_{03} & \beta_{04} \\
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14}
\end{bmatrix}
+
\begin{bmatrix} \varepsilon_{i1} & \varepsilon_{i2} & \varepsilon_{i3} & \varepsilon_{i4} \end{bmatrix}
\]
for $i = 1, \ldots, 70$.
I used Wilks' Lambda to determine whether to reject or fail to reject the null
hypothesis that coastline has no effect (i.e. $H_0: \beta_1 = 0$, where $\beta_1$ is
the row of coast coefficients). Less relevant to the question but still tested was
the null hypothesis that the intercept is zero (i.e. $H_0: \beta_0 = 0$). The
p-values from Wilks' Lambda for these two tests are 0.003 and less than 0.001,
respectively. This implies that mortality rates do differ with geography
(specifically, with whether a city lies in a region next to the coast). Since the
mortality rates differ depending on region, it is important to look at the means of
the mortality rates for each of the two locations.
coast  N Obs  Variable   N          Mean       Std Dev       Minimum       Maximum
0         32  newcopd   32    46.9256059    10.4654806    24.8226145    74.4175238
              newcvd    32   342.2685421   104.0352160   170.4295948   706.4312769
              newpneu   32    26.9604181     6.1229006    14.3832579    39.9480052
              newresp   32    74.3137923    13.2431913    45.9742789    99.4803952
1         38  newcopd   38    36.8365308    11.7025433    15.8350620    77.1583167
              newcvd    38   304.6484248   117.8645380   160.4619615   738.6820742
              newpneu   38    20.5756272     8.0217809     7.3123619    38.0844329
              newresp   38    57.6683982    15.7261493    32.7257948   110.2571727
We can see that for each cause of death, the mean mortality rate is higher for
those cities which are not along a coastline. I conclude that mortality from
Chronic Obstructive Pulmonary Disease, cardiovascular disease, pneumonia, and
respiratory disease is higher, on average, in cities away from the coastline.
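For reference, here is a minimal Python sketch of how Wilks' Lambda can be computed for a one-way layout such as the coastal versus non-coastal comparison. It is the ratio of the determinant of the within-group sums-of-squares-and-cross-products (SSCP) matrix to the determinant of the total SSCP matrix. The toy data and all function names are illustrative assumptions; they are not the exam data or the SAS code used for the analysis.

```python
def column_means(rows):
    """Mean of each column of a list of equal-length rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def sscp(rows, center):
    """Sums-of-squares-and-cross-products matrix about `center`."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - center[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                m[a][b] += d[a] * d[b]
    return m


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            result = -result
        result *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return result


def wilks_lambda(groups):
    """Wilks' Lambda = det(within SSCP) / det(total SSCP)."""
    all_rows = [r for g in groups for r in g]
    p = len(all_rows[0])
    total = sscp(all_rows, column_means(all_rows))
    within = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = sscp(g, column_means(g))
        for a in range(p):
            for b in range(p):
                within[a][b] += s[a][b]
    return det(within) / det(total)


# Toy data: two groups, two response variables (hypothetical values).
group_a = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
group_b = [(6.0, 7.0), (7.0, 6.0), (8.0, 8.0)]
toy_lambda = wilks_lambda([group_a, group_b])
```

Values of Lambda near 0 indicate well-separated group mean vectors, while values near 1 indicate little separation; the p-value then comes from an F or chi-squared approximation, which this sketch does not implement.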
Diagnostic checks were run for normality and for equal covariance matrices. The
chi-squared test for homogeneity of the covariance matrices leads us to fail to
reject the null hypothesis (equal covariance matrices) at the 0.05 level. However,
the Henze-Zirkler test run on the residuals indicated that they are not
multivariate normal.
Question 3:
We wish to better see the underlying variation structure in the monthly averages of
temperature and ozone. To this end, I perform principal component analysis on
both temperature and ozone.
First, with monthly average temperature, I find that 96.04% of the variation in the
data is explained by the first two PCs. Further evidence for using just the first
two PCs comes from the following scree plot, in which the elbow is at 2:
[Figure: Scree plot of eigenvalues for monthly average temperature (eigenvalue against component number, 1 to 6).]
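The "percentage of variation explained" figure used above is the cumulative share of the eigenvalues. A minimal Python sketch follows; the eigenvalues are hypothetical placeholders chosen only to illustrate the arithmetic (the actual eigenvalues are in the SAS output), and the function names are assumptions.

```python
def cumulative_variance_share(eigenvalues):
    """Cumulative proportion of total variance for PCs in descending order."""
    total = sum(eigenvalues)
    shares, running = [], 0.0
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, threshold):
    """Smallest number of PCs whose cumulative share reaches `threshold`."""
    for k, share in enumerate(cumulative_variance_share(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues for six variables (NOT the values from the exam).
hypothetical_eigenvalues = [5.2, 0.56, 0.12, 0.06, 0.04, 0.02]
shares = cumulative_variance_share(hypothetical_eigenvalues)
n_pcs = components_needed(hypothetical_eigenvalues, 0.95)
```

A threshold rule like this is only one criterion; the scree-plot elbow is a second, more subjective check.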
The following table shows how each monthly average temperature is correlated
with the first two PCs.
PC    APRIL      MAY     JUNE    JULY     AUG     SEPT
1     0.928    0.987    0.952   0.809   0.957    0.949
2    -0.353   -0.0183   0.256   0.562   0.147   -0.092

Second, for average monthly ozone, I find that I should choose either 3 or 4 PCs by
looking at the scree plot below:
[Figure: Scree plot of eigenvalues for monthly average ozone (eigenvalue against component number, 1 to 6).]
The first 3 PCs account for 88.24% of the variance in average monthly ozone, and
4 PCs account for 94.38% of the variance. This is a rather subjective decision, but I
will retain 4 PCs for the subsequent analyses. Correlations between each monthly
ozone variable and each of the first four PCs are given below.
PC     APRIL       MAY        JUNE       JULY       AUG        SEP
Y1   -0.24531    0.246509   0.919129   0.972862   0.206886  -0.29725
Y2    0.44967    0.592193   0.026071  -0.0257     0.847247   0.77709
Y3    0.781158   0.603777   0.177878   0.024001  -0.4416     0.109002
Y4   -0.27142   -0.14753    0.122663   0.020533  -0.20632    0.536598

The data set with the PC scores from all 6 PCs is attached in the exhaustive SAS
output. Note that PCTi refers to the ith PC for temperature, and PCOi refers to the
ith PC for ozone, i = 1, ..., 6.
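As a reminder of how correlations between original variables and PCs, like those tabulated above, are usually obtained, the standard textbook relation is sketched below. The notation is an assumption and does not appear in the original: $e_{kj}$ is the $j$th element of the $k$th eigenvector, $\lambda_k$ the $k$th eigenvalue, and $s_{jj}$ the variance of the $j$th variable.

```latex
\[
\operatorname{Corr}(Y_k, X_j) = \frac{e_{kj}\,\sqrt{\lambda_k}}{\sqrt{s_{jj}}},
\qquad k, j = 1, \ldots, 6 .
\]
```

When the PCs are extracted from the correlation matrix, $s_{jj} = 1$ and the correlation reduces to $e_{kj}\sqrt{\lambda_k}$.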
Question 4:
In order to understand the relationships and associations between mortality and
monthly temperatures, monthly ozone levels, and geography (coastline or no
coastline), I set up a regression model with four response variables (the four
mortality rates) and 13 covariates (14 including the intercept). The following are
the regression coefficients for each of the mortality variables:
For Chronic Obstructive Pulmonary Disease:
Intercept     20.28952
TempApril     -0.19539
TempMay       -0.48570
TempJune      -0.16807
TempJuly       1.12896
TempAug       -1.52321
TempSep        1.61922
O3April        0.15609
O3May         -0.03675
O3June        -0.31666
O3July        -0.16258
O3Aug         -0.15963
O3Sep          0.29670
coast        -12.50162

For cardiovascular deaths:
Intercept   -193.94372
TempApril     10.72090
TempMay      -39.54714
TempJune      26.74436
TempJuly      22.51034
TempAug      -24.88893
TempSep        8.86207
O3April       -5.95457
O3May         14.62094
O3June        -9.52123
O3July        -0.69330
O3Aug         -1.42830
O3Sep         -0.95663
coast         39.53147

For pneumonia deaths:
Intercept   -193.94372
TempApril     10.72090
TempMay      -39.54714
TempJune      26.74436
TempJuly      22.51034
TempAug      -24.88893
TempSep        8.86207
O3April       -5.95457
O3May         14.62094
O3June        -9.52123
O3July        -0.69330
O3Aug         -1.42830
O3Sep         -0.95663
coast         39.53147
Next, we use only the PCs (specifically, PCT1, PCT2, PCO1, PCO2, PCO3, and PCO4) as
covariates, along with coast, to construct much the same model. The following are
the regression coefficients for this model.
For Chronic Obstructive Pulmonary Disease:
Intercept     48.11297
PCT1           0.13659
PCT2          -0.02778
PCO1          -0.01593
PCO2          -0.26565
PCO3           0.41358
PCO4           0.69546
coast        -12.27632

For cardiovascular deaths:
Intercept    328.24533
PCT1           0.88349
PCT2           2.34210
PCO1           0.03544
PCO2          -4.82485
PCO3          -0.93247
PCO4           2.95725
coast        -11.78788

For pneumonia deaths:
Intercept     24.82633
PCT1           0.02900
PCT2           0.27819
PCO1          -0.04426
PCO2          -0.38240
PCO3          -0.36996
PCO4          -0.24137
coast         -2.45358

For respiratory deaths:
Intercept     73.31611
PCT1           0.16035
PCT2           0.26899
PCO1          -0.06609
PCO2          -0.65850
PCO3           0.05854
PCO4           0.46664
coast        -14.80757
Again using Wilks' Lambda for our tests, we test the null hypothesis that none of
the covariates has any effect on mortality rates. The test yields a p-value of
0.008, so we reject the null hypothesis, concluding that at least one of the
covariates affects mortality rates. The results for individual covariate tests are
given in the same fashion as for the first regression model.
COVARIATE   p-value   Does it affect mortality rates?
PCT1        0.5959    No
PCT2        0.6509    No
PCO1        0.9490    No
PCO2        0.0610    No
PCO3        0.0211    Yes
PCO4        0.1783    No
Coast       0.0220    Yes
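The Yes/No column above is consistent with comparing each p-value to a 0.05 significance level. A minimal Python sketch of that decision rule follows; the 0.05 threshold is an assumption inferred from the table (and from the level used in Question 2), and the variable names are illustrative.

```python
ALPHA = 0.05  # assumed significance level; not stated explicitly for this table

# p-values from the individual Wilks' Lambda tests reported in the table.
p_values = {
    "PCT1": 0.5959,
    "PCT2": 0.6509,
    "PCO1": 0.9490,
    "PCO2": 0.0610,
    "PCO3": 0.0211,
    "PCO4": 0.1783,
    "Coast": 0.0220,
}

# Covariates whose null hypothesis of no effect is rejected at level ALPHA.
significant = [name for name, p in p_values.items() if p < ALPHA]
```

Note that PCO2 (p = 0.0610) falls just above the threshold, so its classification is sensitive to the chosen level.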
Finally, one last regression model is fitted, with only two covariates, PCO3 and
coast, just to make sure that we can really say that ozone has an effect on the
mortality rates. With this model, we get the following regression coefficients:
For Chronic Obstructive Pulmonary Disease:
Intercept     47.53327
PCO3           0.33993
coast        -11.20846

For cardiovascular deaths:
Intercept    340.00412
PCO3          -1.26673
coast        -33.44881

For pneumonia deaths:
Intercept     26.28443
PCO3          -0.37815
coast         -5.13954

For respiratory deaths:
Intercept     74.27432
PCO3          -0.02208
coast        -16.57268
The overall model is significant, as Wilks' Lambda gave a p-value of less than
0.0001. Individual tests show that both covariates are significant: the p-value for
PCO3 is 0.0166, and the p-value for coast is 0.0006. This indicates that the four
mortality rates are affected by both geography and the monthly average ozone.
Diagnostics were run to check whether the residuals were multivariate normal. To
assess the multivariate normality of the residuals, I used the Henze-Zirkler test,
which indicated that they were in fact not normal.
Question 5:
In order to see which cities have similar average monthly temperatures, I performed
both hierarchical and non-hierarchical clustering methods. First, I performed a
hierarchical clustering method based on average linkage. The problem then lies in
how many clusters to choose. The plot of the first two PCs does not give any clear
idea as to how many clusters one should choose, so I then looked at both the Cubic
Clustering Criterion (CCC) and the pseudo Hotelling's T2 statistic. On the plot of
CCC vs. number of clusters, there are peaks at 11 clusters and at 7 clusters. The
pseudo T2 also points to 11 clusters, since it jumps from 4.9 (obtaining 11
clusters) to 41.7 (obtaining 10 clusters). As a final check, I used a
non-hierarchical method, using Beale's F-type statistic to determine whether I
should choose 7 or 11 clusters. Beale's F-type statistic for this comparison is
16.9, which leads us to conclude that 11 clusters is better. Beale's F-type
statistic is computed and compared to the F value using the R code in the appendix.
See the SAS output for a plot of the first 2 temperature PCs by cluster, and for a
list of cities sorted by their cluster. So I can say that all of our cities fall
into one of 11 groups based on their average monthly temperature.
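The appendix R code is not reproduced here. As a rough illustration, the following Python sketch computes Beale's F-type statistic in the form it is commonly stated in textbooks, comparing a solution with c1 clusters to one with c2 > c1 clusters. The formula as written, the argument names, and the example inputs are assumptions for illustration and are not taken from the exam.

```python
def beale_f(w1, w2, n, p, c1, c2):
    """Beale's F-type statistic for c1 versus c2 clusters (c1 < c2).

    w1, w2: within-cluster sums of squares for the c1- and c2-cluster solutions
    n: number of observations, p: number of variables
    Returns (F statistic, (numerator df, denominator df)).
    """
    if not (0 < c1 < c2 < n):
        raise ValueError("need 0 < c1 < c2 < n")
    numerator = (w1 - w2) / w2
    denominator = ((n - c1) / (n - c2)) * (c2 / c1) ** (2.0 / p) - 1.0
    degrees_of_freedom = (p * (c2 - c1), p * (n - c2))
    return numerator / denominator, degrees_of_freedom


# Hypothetical within-cluster sums of squares (NOT values from the exam).
f_stat, df = beale_f(w1=300.0, w2=100.0, n=70, p=6, c1=7, c2=11)
```

A large value of the statistic relative to the F critical value with the returned degrees of freedom favors the solution with more clusters.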
Since parts a-f are all clustered in the same manner, I will summarize all of the
results in a tabulated form below.
From these data, the following conclusions can be made: Each of our cities can be
put into one of 11 groups, where the cities within a group have similar monthly
average temperatures. Each city can also be put into one of 11 groups, where the
cities within a group have similar monthly average ozone. They can also be put into
one of 10 groups, where the cities within a group have similar monthly average
temperatures and ozone. If we use the first few PCs from temperature and ozone
which explain most of the variance of the original variables, we can put each city in
one of 13 groups, where the cities within a group have similar average monthly
temperatures, or 15 groups, where the cities in a group have similar average
monthly ozone, or 12 groups, where the cities within a group have similar average
monthly temperatures and ozone. To find which city belongs to which group, for
each of these 6 grouping methods, see the exhaustive SAS output.
Cities clustered by ...        # of clusters     # of clusters            # indicated from    Decided number
                               indicated by      indicated by             Beale's F-type      of clusters
                               CCC               Pseudo T2                statistic
Monthly avg. Temperatures      7, 11             11                       11                  11
Monthly avg. Ozone             3, 5, 7, 9, 11    3, 7, 9, 11              11                  11
Monthly avg. Temps. and Ozone  4, 6, 10          4, 10                    10                  10
First 2 Temperature PCs        6, 11, 13         SAS did not calculate    13                  13
First 4 Ozone PCs              3, 7, 11          3, 7, 11, 15             15                  15
First 2 Temp. PCs and first    4, 6, 10          4, 7, 10, 12             12                  12
4 ozone PCs
Question 6:
It is of interest to find the correlation between temperature and ozone. To learn
more about how these two things are correlated, I used canonical correlation
analysis. First, I used all the monthly average temperatures, and correlated those
to the set of variables consisting of the monthly average ozone. Second, I used only
the first two PCs of monthly average temperature, and correlated those to the set
of the first four PCs of monthly average ozone. Note that in the SAS output, you
can find the canonical correlation coefficients, as well as the correlation matrices for
all four sets of variables, and the correlation matrices for each pair of sets of
variables (i.e. for temperatures and ozone, and for the PCs of temperatures and
ozone). For both analyses, I report results based on the first canonical variables,
since they explain most of the correlation.
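For reference, the first canonical correlation reported below can be stated in standard textbook notation. The symbols are assumptions that do not appear in the original: $X$ denotes the temperature variables, $Y$ the ozone variables, and $\Sigma$ the corresponding covariance blocks.

```latex
\[
\rho_1 \;=\; \max_{a,\,b}\; \operatorname{Corr}\!\left(a^{\top}X,\; b^{\top}Y\right),
\qquad
\rho_k^2 \;=\; k\text{th largest eigenvalue of }\;
\Sigma_{XX}^{-1}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}.
\]
```

The maximizing linear combinations $a^{\top}X$ and $b^{\top}Y$ are the first pair of canonical variables.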
For the first case, I find that the canonical correlation between the set of all
monthly average temperatures and the set of all monthly average ozone is 0.965.
This implies that there is a very high correlation between temperature and ozone.
Since it is positive, as the temperature canonical variable (temp1) increases, the
ozone canonical variable (oz1) tends to increase as well. Furthermore, we can see
which months of temperature are more correlated with the average monthly ozone,
and vice versa. First, temperatures for April and August, followed by May, have
the strongest correlation with the canonical variable temp1. They also have the
strongest (positive) correlation with the canonical variable oz1. We may conclude
from this that the average temperatures for April, August, and May have the
strongest correlation with the monthly average ozone from April to September.
Second, by looking at the monthly average ozone variables, we see that the average
ozone levels in June and July have the strongest (negative) correlation with the
canonical variable oz1. They also have the strongest (negative) correlation with
the canonical variable temp1. We may conclude that the average ozone levels during
June and July have the strongest correlation with the average monthly temperatures
from April to September.
For the second case, I find that the canonical correlation between the first two PCs
of monthly average temperatures and the first four PCs of monthly average ozone
is 0.823. This implies that there is a strong positive correlation between the PC
temperature canonical variable (PCTemp1) and the PC ozone canonical variable
(PCOz1). As one would expect, the first PC for temperature has the highest
(negative) correlation with PCTemp1, and the first two PCs for ozone have the
strongest (PCO1 is positive, and PCO2 is negative) correlation with PCOz1.
Similarly, the first PC for temperature has the strongest (negative) correlation with
PCOz1, and the first two PCs for ozone have the strongest (PCO1 is positive and
PCO2 is negative) correlation with PCTemp1. This implies that the first linear
combination of average monthly temperatures (PCT1) has the strongest correlation
with the average monthly ozone from April to September, and the first two linear
combinations of average monthly ozone (PCO1 and PCO2) have the strongest
correlation with average monthly temperatures from April to September.
Below are two plots of the canonical variable scores, first using the monthly means,
and second using the PCs.
[Figure: Plot of Temp1*Oz1 (Temp1 on the vertical axis, Oz1 on the horizontal axis). Legend: A = 1 obs, B = 2 obs, etc.]
[Figure: Plot of PCTemp1*PCOz1 (PCTemp1 on the vertical axis, PCOz1 on the horizontal axis). Legend: A = 1 obs, B = 2 obs, etc.]
The plots give a good visual reference for the fact that the canonical correlation
was stronger when using the monthly means of temperature and ozone than when
using the PCs. The canonical variable scores can be seen in the full SAS output.
Question 7:
We are now interested in being able to classify a region as having a coastline or not
having a coastline by various sets of variables. The end result of the following
analysis will be that if we are given the monthly averages for temperature and
ozone for a region, we will be able to predict whether it is coastal or not. Three
methods will be applied to 6 different sets of variables, and the misclassification
rates will determine which rule and which set of variables are most effective for
determining if a region is coastal or not. The cross-validation technique will be used
to calculate misclassification rates. The following table summarizes the results,
which can be found in detail in the SAS output. For each method and each variable
set, there is a complete listing of the cities and their classification/misclassification
which is found in the SAS output.
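To illustrate how a cross-validated misclassification count can be obtained, here is a minimal Python sketch of leave-one-out cross-validation with a k-nearest-neighbor rule. It assumes that "cross-validation" means leave-one-out; the toy data, the choice k = 3, and the function names are hypothetical and do not reproduce the SAS analysis (which used k = 5 on the 70 cities).

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_misclassified(data, k):
    """Number of points misclassified when each is held out in turn."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors


# Toy data: (feature vector, coast indicator); values are hypothetical.
clean = [
    ((50.0, 60.0), 0), ((51.0, 61.0), 0), ((52.0, 62.0), 0), ((49.0, 59.0), 0),
    ((70.0, 80.0), 1), ((71.0, 81.0), 1), ((72.0, 82.0), 1), ((69.0, 79.0), 1),
]
# Same data plus one point labeled 1 that sits inside the label-0 cluster.
noisy = clean + [((50.5, 60.5), 1)]

clean_errors = loo_misclassified(clean, k=3)
noisy_errors = loo_misclassified(noisy, k=3)
```

Dividing the error count by the number of observations gives the misclassification rate used to compare methods.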
VARIABLES                                   LINEAR DISCRIMINANT   K NEAREST NEIGHBOR   LOGISTIC
                                            FUNCTION              (K=5)                REGRESSION
TempApril to TempSep                        13/70                 12/70                 7/70
O3April to O3Sep                            21/70                 18/70                17/70
TempApril to TempSep and O3April to O3Sep   10/70                  9/70                 0/70
PCT1 and PCT2                               14/70                 20/70                14/70
PCO1 to PCO4                                21/70                 18/70                22/70
PCT1, PCT2 and PCO1 to PCO4                 15/70                 18/70                12/70
As is clearly seen, the best way to classify a region as coastal or not coastal is
to use all of the monthly means (the average monthly temperatures and the average
monthly ozone) with logistic regression, which has the lowest cross-validated
misclassification rate in the table above. In this manner, we have the highest
probability of correctly classifying a region as coastal or not coastal.
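The comparison above can be reproduced with a short Python sketch that converts the tabulated counts into rates and picks the smallest. It assumes each table entry is a number of misclassified cities out of 70; the dictionary layout and names are illustrative.

```python
TOTAL_CITIES = 70  # assumed denominator of every table entry

# Misclassified counts per variable set and method, read from the table.
misclassified = {
    "TempApril to TempSep": {"LDF": 13, "KNN (k=5)": 12, "Logistic": 7},
    "O3April to O3Sep": {"LDF": 21, "KNN (k=5)": 18, "Logistic": 17},
    "TempApril to TempSep and O3April to O3Sep":
        {"LDF": 10, "KNN (k=5)": 9, "Logistic": 0},
    "PCT1 and PCT2": {"LDF": 14, "KNN (k=5)": 20, "Logistic": 14},
    "PCO1 to PCO4": {"LDF": 21, "KNN (k=5)": 18, "Logistic": 22},
    "PCT1, PCT2 and PCO1 to PCO4": {"LDF": 15, "KNN (k=5)": 18, "Logistic": 12},
}

# Flatten to (variable set, method, rate) and take the lowest rate.
rates = [
    (variables, method, count / TOTAL_CITIES)
    for variables, row in misclassified.items()
    for method, count in row.items()
]
best_variables, best_method, best_rate = min(rates, key=lambda t: t[2])
```

Under that reading of the table, the minimum is attained by logistic regression on the full set of monthly means, matching the conclusion in the text.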