Upload: dksewell

Post on 30-May-2018


  • 8/14/2019 Multivariate Analysis Exam 2 Dan Sewell Question 1: A. The


    MULTIVARIATE ANALYSIS

    EXAM 2

    Dan Sewell

    Question 1:

    a. The first few observations of mortalities per 100,000 are below, which were

    calculated in SAS (see code in Appendix), but the entirety of the data set can be

    found in the exhaustive SAS output (see Appendix).

    Obs   newcopd   newcvd    newpneu   newresp
      1   64.2845   336.711   32.4186   97.2557
      2   15.8351   160.462   16.8907   32.7258
      3   21.5268   206.496   21.3919   42.9862
      4   34.7171   161.644   12.1879   47.3974
      5   47.6086   289.430   19.0434   66.9543
      6   22.2840   370.108   19.8618   42.1459
      7   29.5735   242.242   29.4285   59.1470
      8   42.5145   427.039   31.8858   74.8212
      9   31.7778   216.549   28.7582   60.8236
     10   33.0126   328.191   31.1527   64.5372
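    A rate per 100,000 is a count scaled by population size. The sketch below assumes the raw
    data hold a death count and a population size for each city, which the report does not
    state; the function name and the numbers are hypothetical, and the actual calculation was
    done in SAS.

```python
def rate_per_100k(deaths: float, population: float) -> float:
    """Convert a raw death count into deaths per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100_000


# Hypothetical city (not from the data set): 120 deaths among 400,000 residents.
example_rate = rate_per_100k(120, 400_000)
print(round(example_rate, 4))
```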

    b. Monthly average temperature and ozone values, from April to September, were
    computed in Microsoft Excel using the =AVERAGE() function. The resulting data set was
    then imported into SAS. The first few observations are below, but the entire data set
    can be found in the exhaustive SAS output.

    Obs   TempApril   TempMay   TempJune   TempJuly   TempAug   TempSep   O3April
      1   51.4000     61.6129   70.0667    76.0968    68.4516   63.7333   -2.59742
      2   56.7667     67.4839   74.9333    83.2258    79.9032   70.2667   -5.77401
      3   65.3667     69.1290   75.1667    79.3871    82.1290   73.3000   -4.00079
      4   72.0000     76.6452   81.9667    83.4516    88.5161   81.6667    8.26811
      5   58.0667     66.6452   73.9000    80.2258    77.8710   77.2333    2.01369
      6   72.0333     74.7419   80.5000    81.8710    85.2903   76.3000    4.48913
      7   49.4000     58.4194   71.2000    76.0323    71.6452   67.3667    6.72094
      8   46.1333     59.8387   68.5667    74.5806    68.0645   64.4667    2.51458
      9   62.2333     65.9677   73.6333    79.0645    79.6774   69.3667    0.53066
     10   49.8000     61.8710   70.5667    78.5806    70.5484   63.6667    1.25153

    Obs   O3May     O3June    O3July     O3Aug     O3Sep
      1    2.9776    9.3160    10.1753   -1.3167   -5.4473
      2    3.3453    6.7535    13.4359    8.1736   -6.5235
      3    2.8535   -1.9897    -0.0305   13.6344    2.2994
      4    2.1398   -5.5971   -11.3581    9.2746   11.6430
      5   10.1284   14.0376    13.9645    9.5357   16.2752
      6   13.4318    0.1609    -3.6472    8.2088    3.4473
      7    2.6733    7.3469     7.7088    3.5155   -3.5027
      8    9.0380   13.3186    15.9207    1.0480    0.3954
      9    2.7441    4.2031     3.6292   12.7265   -6.9343
     10    6.0385    7.6226     7.0373   -0.5158    0.0429

    Question 2:


    Are mortalities related to coastline? To answer this question, I ran MANOVA, with

    the following model:

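    \[
    \begin{bmatrix} \text{newcopd}_i & \text{newcvd}_i & \text{newpneu}_i & \text{newresp}_i \end{bmatrix}
    =
    \begin{bmatrix} 1 & \text{coast}_i \end{bmatrix}
    \begin{bmatrix}
    \beta_{01} & \beta_{02} & \beta_{03} & \beta_{04} \\
    \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14}
    \end{bmatrix}
    +
    \begin{bmatrix} \varepsilon_{i1} & \varepsilon_{i2} & \varepsilon_{i3} & \varepsilon_{i4} \end{bmatrix}
    \]

    for i = 1, ..., 70.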

    I used Wilks' Lambda to decide whether to reject the null hypothesis that coastline has
    no effect (i.e. H0: β1 = 0, where β1 is the vector of the four coastline coefficients).
    Less relevant to the question, but still tested, was the null hypothesis that the
    intercepts are zero (i.e. H0: β0 = 0). The p-values from Wilks' Lambda for these two
    tests are 0.003 and less than 0.001, respectively. This implies that mortalities do
    differ with geography (specifically, with whether a city lies in a region next to the
    coast). Since the mortality rates differ depending on region, it is important to look at
    the means of the mortality rates for each of the two locations.

    coast   N Obs   Variable    N    Mean          Std Dev       Minimum       Maximum
    0       32      newcopd     32    46.9256059    10.4654806    24.8226145    74.4175238
                    newcvd      32   342.2685421   104.0352160   170.4295948   706.4312769
                    newpneu     32    26.9604181     6.1229006    14.3832579    39.9480052
                    newresp     32    74.3137923    13.2431913    45.9742789    99.4803952
    1       38      newcopd     38    36.8365308    11.7025433    15.8350620    77.1583167
                    newcvd      38   304.6484248   117.8645380   160.4619615   738.6820742
                    newpneu     38    20.5756272     8.0217809     7.3123619    38.0844329
                    newresp     38    57.6683982    15.7261493    32.7257948   110.2571727

    We can see that for each type of death, the mean mortality rates are higher for

    those cities which are not along a coastline. I conclude that there is a higher

    probability of a Chronic Obstructive Pulmonary Disease death, a cardiovascular

    death, a pneumonia death, or a respiratory death if one lives away from the

    coastline.

    Diagnostic checks were run for multivariate normality and for equality of the covariance
    matrices. A chi-squared test of the homogeneity of the covariance matrices leads us to
    fail to reject the null hypothesis (equal covariance matrices) at the 0.05 level.
    However, when the Henze-Zirkler test was run on the residuals, it indicated that they
    are not multivariate normal.
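    To make the Wilks' Lambda statistic used above concrete, here is a minimal sketch for a
    one-way layout with two groups. It is restricted to two response variables so that the
    determinants can be written out by hand; the function names and the data are
    hypothetical, and this is not the SAS procedure used for the reported p-values.

```python
def mean2(rows):
    """Column means of a list of (x, y) pairs."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def sscp(rows, center):
    """2x2 sum of squares and cross-products of the rows about a center."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - center[0], r[1] - center[1])
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda(groups):
    """Wilks' Lambda = det(E) / det(T): within-group SSCP over total SSCP."""
    all_rows = [r for g in groups for r in g]
    total = sscp(all_rows, mean2(all_rows))
    within = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        s = sscp(g, mean2(g))
        for a in range(2):
            for b in range(2):
                within[a][b] += s[a][b]
    return det2(within) / det2(total)


# Hypothetical (rate 1, rate 2) pairs for inland and coastal cities.
inland = [(47.0, 75.0), (50.0, 70.0), (44.0, 78.0), (49.0, 73.0)]
coastal = [(36.0, 58.0), (39.0, 55.0), (33.0, 60.0), (38.0, 57.0)]
lam = wilks_lambda([inland, coastal])
print(lam)
```

    Values of Lambda near 0 mean the group means are far apart relative to the within-group
    scatter, and values near 1 mean they are nearly indistinguishable. Turning Lambda into a
    p-value needs an F distribution, which is left to the statistics package.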

    Question 3:

    We wish to better see the underlying variation structure in the monthly averages of
    temperature and ozone. To this end, I perform principal component analysis on both
    temperature and ozone.

    First, with monthly average temperature, I find that 96.04% of the variation in the data
    is explained by the first two PCs. Further evidence for using just the first two PCs
    comes from the following Scree Plot, where the elbow is at 2:

    [Figure: Scree Plot of Eigenvalues for monthly average temperature. Vertical axis:
    eigenvalue (0 to 6); horizontal axis: component number (0 to 6).]

    The following table shows how each monthly average temperature is correlated

    with the first two PCs.

    PC   APRIL    MAY       JUNE    JULY    AUG     SEPT
    1     0.928    0.987    0.952   0.809   0.957    0.949
    2    -0.353   -0.0183   0.256   0.562   0.147   -0.092

    Second, for average monthly ozone, I find that I should choose either 3 or 4 PCs by
    looking at the Scree Plot below:

    [Figure: Scree Plot of Eigenvalues for monthly average ozone. Vertical axis: eigenvalue
    (0.0 to 2.5); horizontal axis: Number (0 to 6).]

    The first 3 PCs account for 88.24% of the variance in average monthly ozone, and the
    first 4 PCs account for 94.38% of the variance. The choice is rather subjective, but I
    retain 4 PCs for the analyses that follow. Correlations between each monthly ozone
    variable and each of the first four PCs are given below.

    PC   APRIL      MAY        JUNE       JULY       AUG        SEP
    Y1   -0.24531    0.246509   0.919129   0.972862   0.206886  -0.29725
    Y2    0.44967    0.592193   0.026071  -0.0257     0.847247   0.77709
    Y3    0.781158   0.603777   0.177878   0.024001  -0.4416     0.109002
    Y4   -0.27142   -0.14753    0.122663   0.020533  -0.20632    0.536598

    The data set with the PC scores from all 6 PCs is attached in the exhaustive SAS output.
    Note that PCTi refers to the ith PC for temperature, and PCOi refers to the ith PC for
    ozone, i = 1, ..., 6.
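    The percentages quoted in this question (96.04% for two temperature PCs, 88.24% and
    94.38% for three and four ozone PCs) are cumulative shares of the eigenvalue total. A
    minimal sketch of that arithmetic follows; the eigenvalues are made-up illustrative
    numbers, not the ones from the SAS output, and the function names are hypothetical.

```python
def cumulative_variance(eigenvalues):
    """Cumulative share of total variance, largest eigenvalue first."""
    total = sum(eigenvalues)
    shares = []
    running = 0.0
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, threshold):
    """Smallest number of PCs whose cumulative share reaches the threshold."""
    for k, share in enumerate(cumulative_variance(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Illustrative eigenvalues for six standardized variables (they sum to 6).
illustrative = [3.0, 1.5, 1.0, 0.3, 0.15, 0.05]
shares = cumulative_variance(illustrative)
print([round(s, 4) for s in shares])
print(components_needed(illustrative, 0.90))
```

    A fixed threshold such as 90% is only one convention; the report combines the cumulative
    share with the elbow of the scree plot, which is why the choice between 3 and 4 ozone
    PCs is described as subjective.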

    Question 4:

    In order to understand the relationships and associations between mortality and

    monthly temperatures, monthly ozones, and geography (coastlines or no

    coastlines), I set up a regression model, with four response variables (four mortality

    rates) and 13 covariates (14 including intercept). The following are the regression

    coefficients for each of the four mortality variables:

    Regression coefficients by response (COPD = Chronic Obstructive Pulmonary Disease):

    Covariate        COPD    Cardiovascular     Pneumonia
    Intercept    20.28952       -193.94372     -193.94372
    TempApril    -0.19539         10.72090       10.72090
    TempMay      -0.48570        -39.54714      -39.54714
    TempJune     -0.16807         26.74436       26.74436
    TempJuly      1.12896         22.51034       22.51034
    TempAug      -1.52321        -24.88893      -24.88893
    TempSep       1.61922          8.86207        8.86207
    O3April       0.15609         -5.95457       -5.95457
    O3May        -0.03675         14.62094       14.62094
    O3June       -0.31666         -9.52123       -9.52123
    O3July       -0.16258         -0.69330       -0.69330
    O3Aug        -0.15963         -1.42830       -1.42830
    O3Sep         0.29670         -0.95663       -0.95663
    coast       -12.50162         39.53147       39.53147


    Next, we use only the PCs (specifically, PCT1, PCT2, PCO1, PCO2, PCO3, and PCO4) as
    covariates, along with coast, to construct much the same model. The following are the
    regression coefficients for this model (COPD = Chronic Obstructive Pulmonary Disease):

    Covariate        COPD    Cardiovascular    Pneumonia    Respiratory
    Intercept    48.11297        328.24533     24.82633       73.31611
    PCT1          0.13659          0.88349      0.02900        0.16035
    PCT2         -0.02778          2.34210      0.27819        0.26899
    PCO1         -0.01593          0.03544     -0.04426       -0.06609
    PCO2         -0.26565         -4.82485     -0.38240       -0.65850
    PCO3          0.41358         -0.93247     -0.36996        0.05854
    PCO4          0.69546          2.95725     -0.24137        0.46664
    coast       -12.27632        -11.78788     -2.45358      -14.80757

    Again using Wilks' Lambda for our tests, we test the null hypothesis that none of the
    covariates has any effect on mortality rates. The test yields a p-value of 0.008, so we
    reject, concluding that at least one of the covariates affects mortality rates. The
    results for individual covariate tests are given in the same fashion as for the first
    regression model.

    COVARIATE   p-value   Does it affect mortality rates?
    PCT1        0.5959    No
    PCT2        0.6509    No
    PCO1        0.9490    No
    PCO2        0.0610    No
    PCO3        0.0211    Yes
    PCO4        0.1783    No
    Coast       0.0220    Yes

    Finally, one last regression model is conducted, which only has two covariates,

    PCO3 and Coast, just to make sure that we can really say that ozone has an effect

    on the mortality rates. With this model, we get the following regression

    coefficients:

    Covariate        COPD    Cardiovascular    Pneumonia    Respiratory
    Intercept    47.53327        340.00412     26.28443       74.27432
    PCO3          0.33993         -1.26673     -0.37815       -0.02208
    coast       -11.20846        -33.44881     -5.13954      -16.57268

    (COPD = Chronic Obstructive Pulmonary Disease.)

    The overall model is significant, as Wilks' Lambda gave a p-value of less than 0.0001.
    Individual tests show that both covariates are significant: the p-value for PCO3 is
    0.0166, and the p-value for coast is 0.0006. This indicates that the four mortality
    rates are affected by both geography and the monthly average ozone.

    Diagnostics were run to check whether the residuals were multivariate normal. To assess
    the multivariate normality of the residuals, I used the Henze-Zirkler test, which
    indicated that they were in fact not normal.

    Question 5:

    In order to see which cities have similar average monthly temperatures, I performed both
    hierarchical and non-hierarchical clustering methods. First, I performed a hierarchical
    clustering based on average linkage. The problem then lies in how many clusters to
    choose. The plot of the first two PCs does not give any clear idea of how many clusters
    one should choose, so I looked at both the Cubic Clustering Criterion (CCC) and the
    pseudo Hotelling's T² statistic. On the plot of CCC vs. number of clusters, there appear
    to be peaks at 11 clusters and at 7 clusters. The pseudo T² statistic also supports
    choosing 11 clusters, since it jumps from 4.9 (obtaining 11 clusters) to 41.7 (obtaining
    10 clusters). As a final check, I used a non-hierarchical method, with Beale's F-type
    statistic, to determine whether I should choose 7 or 11 clusters. Beale's F-type
    statistic for this comparison is 16.9, which leads us to conclude that 11 clusters is
    better. Beale's F-type statistic is computed and compared to the F value using the R
    code in the appendix. See the SAS output for a plot of the first 2 temperature PCs by
    cluster, and for a list of cities sorted by their cluster. I can therefore say that all
    of our cities fall into one of 11 groups based on their average monthly temperature.
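    For readers without the appendix, the sketch below shows Beale's F-type statistic in the
    form commonly given in textbooks: W1 and W2 are the within-cluster sums of squared
    distances to the centroids for c1 and c2 clusters (c1 < c2), n is the number of
    observations, and p is the number of variables. This is an assumption about the formula
    rather than a copy of the author's R code, the function name is hypothetical, and the W
    values below are made up.

```python
def beale_f(w1, w2, c1, c2, n, p):
    """Beale's F-type statistic for comparing c1 clusters against c2 > c1 clusters.

    Returns the statistic and its (numerator, denominator) degrees of freedom.
    """
    if not (0 < c1 < c2 < n):
        raise ValueError("need 0 < c1 < c2 < n")
    numerator = (w1 - w2) / w2
    denominator = ((n - c1) / (n - c2)) * (c2 / c1) ** (2.0 / p) - 1.0
    df = (p * (c2 - c1), p * (n - c2))
    return numerator / denominator, df


# Small made-up case that is easy to check by hand: F = 1.0 / 1.25 = 0.8.
f_small, df_small = beale_f(w1=200.0, w2=100.0, c1=1, c2=2, n=10, p=2)

# Degrees of freedom for the report's comparison of 7 vs. 11 clusters
# (70 cities, 6 monthly temperatures); the W values here are hypothetical.
_, df_report = beale_f(w1=900.0, w2=400.0, c1=7, c2=11, n=70, p=6)
print(f_small, df_small, df_report)
```

    A large statistic relative to the F critical value with these degrees of freedom favors
    the larger number of clusters; looking up that critical value needs a statistics
    package and is not shown here.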

    Since parts a-f are all clustered in the same manner, I will summarize all of the

    results in a tabulated form below.

    From these data, the following conclusions can be made: Each of our cities can be

    put into one of 11 groups, where the cities within a group have similar monthly

    average temperatures. Each city can also be put into one of 11 groups, where the

    cities within a group have similar monthly average ozone. They can also be put into

    one of 10 groups, where the cities within a group have similar monthly average

    temperatures and ozone. If we use the first few PCs from temperature and ozone

    which explain most of the variance of the original variables, we can put each city in

    one of 13 groups, where the cities within a group have similar average monthly

    temperatures, or 15 groups, where the cities in a group have similar average

    monthly ozone, or 12 groups, where the cities within a group have similar average

    monthly temperatures and ozone. To find which city belongs to which group, for

    each of these 6 grouping methods, see the exhaustive SAS output.

    Cities clustered by . . .        # of clusters     # of clusters     # indicated      Decided
                                     indicated by      indicated by      from Beale's     number of
                                     CCC               Pseudo T²         F-type           clusters
                                                                         statistic
    Monthly avg. temperatures        7, 11             11                11               11
    Monthly avg. ozone               3, 5, 7, 9, 11    3, 7, 9, 11       11               11
    Monthly avg. temps. and ozone    4, 6, 10          4, 10             10               10
    First 2 temperature PCs          6, 11, 13         SAS did not       13               13
                                                       calculate
    First 4 ozone PCs                3, 7, 11          3, 7, 11, 15      15               15
    First 2 temp. PCs and first 4    4, 6, 10          4, 7, 10, 12      12               12
    ozone PCs


    Question 6:

    It is of interest to find the correlation between temperature and ozone. To learn

    more about how these two things are correlated, I used canonical correlation

    analysis. First, I used all the monthly average temperatures, and correlated those

    to the set of variables consisting of the monthly average ozone. Second, I used only

    the first two PCs of monthly average temperature, and correlated those to the set

    of the first four PCs of monthly average ozone. Note that in the SAS output, you

    can find the canonical correlation coefficients, as well as the correlation matrices for

    all four sets of variables, and the correlation matrices for each pair of sets of

    variables (i.e. for temperatures and ozone, and for the PCs of temperatures and

    ozone). For both analyses, I report results based on the first canonical variables,

    since they explain most of the correlation.

    For the first case, I find that the canonical correlation between the set of all monthly
    average temperatures and the set of all monthly average ozone is 0.965. This implies
    that there is a very high correlation between temperature and ozone. Since it is
    positive, as the temperature canonical variable (temp1) increases, we can feel quite
    certain that the ozone canonical variable (oz1) will also increase. Furthermore, we can
    see which months of temperature are more correlated with the average monthly ozone, and
    vice versa. First, temperatures for April and August, followed by May, have the
    strongest correlation with the canonical variable temp1. They also have the strongest
    (positive) correlation with the canonical variable oz1. We may conclude from this that
    the average temperatures for April, August, and May have the strongest correlation with
    the monthly average ozone from April to September. Second, looking at the monthly
    average ozone variables, we see that the average ozone in June and in July have the
    strongest (negative) correlation with the canonical variable oz1. They also have the
    strongest (negative) correlation with the canonical variable temp1. We may conclude that
    the average ozone during June and July have the strongest correlation with the average
    monthly temperatures from April to September.

    For the second case, I find that the canonical correlation between the first two PCs of
    monthly average temperatures and the first four PCs of monthly average ozone is 0.823.
    This implies that there is a strong positive correlation between the PC temperature
    canonical variable (PCTemp1) and the PC ozone canonical variable (PCOz1). As one would
    expect, the first PC for temperature has the highest (negative) correlation with
    PCTemp1, and the first two PCs for ozone have the strongest correlation with PCOz1
    (positive for PCO1, negative for PCO2). Similarly, the first PC for temperature has the
    strongest (negative) correlation with PCOz1, and the first two PCs for ozone have the
    strongest correlation with PCTemp1 (positive for PCO1, negative for PCO2). This implies
    that the first linear combination of average monthly temperatures (PCT1) has the
    strongest correlation with the average monthly ozone from April to September, and the
    first two linear combinations of average monthly ozone (PCO1 and PCO2) have the
    strongest correlation with average monthly temperatures from April to September.

    Below are two plots of the canonical variable scores, first using the monthly means,
    and second using the PCs.

    [Figure: Plot of Temp1*Oz1. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: Temp1
    (-2.0 to 2.0); horizontal axis: Oz1 (-2.0 to 2.0). SAS output title: "Clustering based
    on Monthly Average Ozone".]

    [Figure: Plot of PCTemp1*PCOz1. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis:
    PCTemp1 (-2 to 2); horizontal axis: PCOz1 (-3 to 2.0).]

    The plots show visually that the canonical correlation was stronger when using the
    average monthly means of temperature and ozone, as opposed to using the PCs. The
    canonical variable scores can be seen in the full SAS output.

    Question 7:

    We are now interested in being able to classify a region as having a coastline or not

    having a coastline by various sets of variables. The end result of the following

    analysis will be that if we are given the monthly averages for temperature and

    ozone for a region, we will be able to predict whether it is coastal or not. Three

    methods will be applied to 6 different sets of variables, and the misclassification

    rates will determine which rule and which set of variables are most effective for

    determining if a region is coastal or not. The cross-validation technique will be used

    to calculate misclassification rates. The following table summarizes the results,

    which can be found in detail in the SAS output. For each method and each variable

    set, there is a complete listing of the cities and their classification/misclassification

    which is found in the SAS output.

    VARIABLES                         LINEAR          K NEAREST     LOGISTIC
                                      DISCRIMINANT    NEIGHBOR      REGRESSION
                                      FUNCTION        (K=5)
    TempApril to TempSep              13/70           12/70          7/70
    O3April to O3Sep                  21/70           18/70         17/70
    TempApril to TempSep and          10/70            9/70          0/70
    O3April to O3Sep
    PCT1 and PCT2                     14/70           20/70         14/70
    PCO1 to PCO4                      21/70           18/70         22/70
    PCT1, PCT2 and PCO1 to PCO4       15/70           18/70         12/70

    As is clearly seen, our best way to correctly classify a region as being coastal or not

    coastal is using all of our monthly means (our average monthly temperatures and

    our average monthly ozone), and using logistic regression. In this manner, we have

    the highest probability of correctly classifying a region as coastal or not coastal.
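    Each entry in the table above is a count of misclassified cities out of 70 under
    cross-validation. As an illustration of how such a count arises, here is a minimal
    leave-one-out sketch for a k-nearest-neighbor rule. It assumes plain Euclidean distance
    and k = 3 on a handful of made-up points; the SAS analysis used K = 5 on 70 cities and
    may use a different distance, so this is not a reproduction of the reported numbers.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_misclassified(data, k):
    """Hold each point out in turn and count how many are mispredicted."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors


# Hypothetical (temperature, ozone) pairs; label 1 = coastal, 0 = not coastal.
cities = [
    ((60.0, 2.0), 1), ((61.0, 3.0), 1), ((62.0, 2.5), 1), ((59.0, 1.5), 1),
    ((75.0, 8.0), 0), ((76.0, 9.0), 0), ((74.0, 7.5), 0), ((77.0, 8.5), 0),
    ((60.5, 2.2), 0),  # an inland city that looks coastal, so it is misclassified
]
errors = loo_misclassified(cities, k=3)
print(f"{errors}/{len(cities)} misclassified")
```

    Holding out the point being classified keeps a rule from being scored on data it has
    already seen, which is why cross-validated counts are the fairer basis for comparing the
    three methods.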