NUIT, Newcastle University
IBM SPSS STATISTICS for Windows Intermediate / Advanced
A Training Manual for Intermediate / Experienced Users, Faculty of Medical Sciences
Dr S. T. Kometa
Table of Contents
Ordinary Regression
Repeated Measures Analysis
Data Analysis Using Crosstabulation Techniques
Types of Survival Analysis / Kaplan-Meier
Binary Logistic Regression
Multivariate Analysis of Variance (MANOVA)
Ordinary Linear Regression Model with Two Independent Variables
Why fit a regression model?
To build a model for predicting the outcome variable for a new sample of data.
To see how well the independent (explanatory) variables explain the dependent
(response) variable.
To identify which subset of many independent variables is most effective for
estimating the dependent variable.
Open the data set called world95.sav. To do this, follow these instructions:
1. Select Start -> Programs -> Statistical Software -> IBM SPSS Statistics -> IBM
SPSS Statistics 19.
2. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
3. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
4. Select the file world95.sav and click on Open.
5. Spend some time to study the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
6. Are there any missing values in the data? Yes No
Assumptions for Ordinary Linear Regression
All observations should be independent.
Your data should not suffer from multicollinearity; that is, the independent
variables should not be highly correlated. To find out if your data suffer from multicollinearity,
you have to look at the tolerances for each of the independent variables in the model.
These are printed if you select Collinearity Diagnostics in the Linear Regression
Statistics dialogue box. If any of the tolerances are small, less than 0.1 for example,
multicollinearity may be a problem.
Residuals from the model fit should follow a normal distribution.
Each of the independent (explanatory or predictor) continuous variables should have a
linear relationship with the dependent (response or outcome) variable. It is always a
good idea to check this assumption using scatterplots.
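The tolerance check described above can also be sketched numerically. The example below is not SPSS output: it uses made-up numbers for two predictors, and relies on the fact that with only two predictors the tolerance of each reduces to 1 − r², where r is their Pearson correlation.

```python
# Sketch (not SPSS output): tolerance for a predictor is 1 - R^2 from
# regressing that predictor on the other predictors. With just two
# predictors this reduces to 1 - r^2, where r is their Pearson correlation.
# The data below are invented for illustration.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor 1
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]   # hypothetical predictor 2, nearly collinear with x1

r = pearson_r(x1, x2)
tolerance = 1 - r ** 2
print(round(tolerance, 4))        # close to 0, so multicollinearity is a concern
```

Here the tolerance is well below 0.1, which by the rule of thumb above would flag a multicollinearity problem.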
Simple Linear Regression
Is the female literacy of a country useful in predicting its life expectancy? We want to build a
model of the form:
Average female life expectancy = b0 + b1 × female literacy + ε

where Average female life expectancy (lifeexpf) is the dependent (response, y, or outcome)
variable, females who can read (%) (lit_fema) is the independent (explanatory or predictor)
variable, b0 is the intercept of the line of best fit, b1 is its slope and ε is the error term.
Is there a linear relationship between average female life expectancy and female literacy? Produce a scatter plot to help you answer this question.
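Before running the SPSS procedure, it may help to see what the least-squares fit computes. The sketch below uses plain Python and invented numbers (not the world95.sav values) to estimate b0 and b1 from the standard formulas, then uses them for a prediction:

```python
# Illustrative only: the least-squares slope and intercept that SPSS's Linear
# Regression procedure estimates can be computed directly as
#   b1 = S_xy / S_xx,   b0 = mean(y) - b1 * mean(x).
# The tiny data set below is invented; use world95.sav for the real exercise.

literacy = [40, 55, 70, 85, 95]            # hypothetical female literacy (%)
life_exp = [50, 58, 66, 74, 79]            # hypothetical life expectancy (years)

n = len(literacy)
mx = sum(literacy) / n
my = sum(life_exp) / n
s_xy = sum((x - mx) * (y - my) for x, y in zip(literacy, life_exp))
s_xx = sum((x - mx) ** 2 for x in literacy)

b1 = s_xy / s_xx                           # slope
b0 = my - b1 * mx                          # intercept

predicted = b0 + b1 * 86                   # prediction for 86% literacy
print(round(b0, 3), round(b1, 3), round(predicted, 1))
```

The same plug-in step, with the coefficients SPSS reports for world95.sav, is what the Coefficients-table exercise below asks you to do by hand.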
To produce the output for the regression model, from the menus choose:
Analyze -> Regression -> Linear….
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: female who can read (%) [lit_fema]
Statistics…
Descriptives
Make sure that Estimates and Model fit are selected.
Select Collinearity diagnostics
Residuals
Casewise diagnostics
Select Outliers outside: 1.0 standard deviations
Plots…
Y: *ZRESID
X: *ZPRED
Click Next
Y: *ZPRED
X: DEPENDNT
Select Histogram and Normal probability plot
These steps will generate lots of output. Now examine the output and attempt to interpret it.
Look at the table Descriptive Statistics. What will you conclude?
Look at the table Correlations. What are the hypotheses being tested? What will you
conclude?
Look at the table Model Summary. What do you conclude?
Look at the table ANOVA. Explain what the Degrees of Freedom (DF), Sums of Squares (SS)
and Mean Squares (MS) represent. How are they related?
State the hypotheses being tested in the ANOVA table. How is the test statistic calculated and
what would your decision be?
Look at the table Coefficients. What do you conclude? Write an equation for the
regression model and use it to predict the average female life expectancy of a country
whose female literacy is 86%. What are the hypotheses being tested?
The last two columns of the Coefficient table give information about collinearity
statistics. Looking at the Tolerance, can you say if there is any problem with
multicollinearity?
The rest of the output deals with the residuals. This helps you to find out whether the assumptions
for a linear regression are met and to identify any outliers or influential cases.
Look at the table Casewise Diagnostics. What is a standardised residual? What do you
conclude?
Look at the table Residual Statistics. What do you conclude?
Look at the Histogram and Normal P-P Plot. What do you conclude about the
residuals?
Now look at the two Scatter Plots. What do you conclude?
Can you think of any restriction when using your model to predict female life
expectancy?
How would you validate a model like this?
Multiple Linear Regression
While a simple linear regression can have just one independent variable, a multiple linear
regression can have more than one independent variable. The following is a model with two
independent variables:
Average female life expectancy = b0 + b1 × infant mortality + b2 × fertility + ε
where infant mortality (deaths per 1000 live births) [babymort] is the number of infant deaths
during the first year of life per thousand live births and average number of kids [fertilty] is the
average number of children per family.
We found that literacy explained 67% of the variability of life expectancy. Now we examine
a model using infant mortality (babymort) and fertility (fertilty) to predict life expectancy.
To run the analysis, select
Analyze -> Regression -> Linear….Click on Reset.
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: average number of kids [fertilty], infant mortality (deaths per 1000 live births)
[babymort]
Case Labels: country
Statistics…
Descriptives
Make sure that Estimates and Model fit are selected.
Select Collinearity diagnostics
Plots…
Produce all partial plots
Save...
Predicted values: Standardised
Look at the table Descriptive Statistics. What do you conclude?
Look at the table Correlations. What do you conclude?
Look at the table Model Summary. What do you conclude?
Look at the ANOVA table. What do you conclude?
Look at the table Coefficients. What do you conclude? Write an equation for the
regression model and use it to predict the female life expectancy of a country whose
fertility is 3 and infant mortality is 23 per 1000 live births.
Repeated Measures Analysis of Variance
Does the anxiety rating of a person affect performance on a learning task? Twelve subjects
were assigned to one of two anxiety groups on the basis of an anxiety test, and the number of
errors made in four blocks of trials on a learning task was measured. We use repeated
measures analysis of variance technique to study the data.
Open the SPSS data file called anxiety2. Notice that there is one case for each subject and
four trials variables (trial1, trial2, trial3 and trial4).
In the repeated measures analysis-of-variance technique, we distinguish two types of factors in
the model: between-subjects factors and within-subjects factors. A between-subjects factor,
as the name suggests, divides the subjects into discrete subgroups, for example anxiety in this
data file. Anxiety divides the cases into two groups: high anxiety scores and low anxiety
scores. A within-subjects factor is any factor that distinguishes measurements made on the
same subject. For example, trial distinguishes the four measurements taken for each subject.
To produce the output in this example, from the menus choose:
Analyze
General Linear Model
Repeated Measures…
Within-Subject Factors name: replace factor1 with trial
Number of Levels: 4
Click Add and Click Define
Within-Subjects Variables (trial): trial1, trial2, trial3 and trial4
Between-Subjects Factor(s): anxiety
Options…
Select Homogeneity tests
Contrasts…
Factors: trial
Contrasts: Repeated (click Change)
Click on Continue and then OK.
Examine the results and try to interpret them.
Between-Subject Test
The test of between-subject effects is shown on the table Tests of Between-Subjects
Effects. Examine this table. What do you conclude?
Multivariate Tests
The multivariate table contains tests of the within-subjects factor, trial, and of the interaction of
the within-subjects factor with the between-subjects factor, trial*anxiety.
Examine the Multivariate Tests table. What do you conclude?
Assumptions
The vector of the dependent variables follows a normal distribution, and the variance-
covariance matrices are equal across the cells formed by the between-subject effects.
The test for this assumption is shown on the table Box’s test of Equality of Covariance
Matrices. Examine this table, what do you conclude?
It is assumed that the variance-covariance matrix of the dependent variables is circular.
The test of this assumption is shown on the table Mauchly’s Test of Sphericity. Examine
this table, what do you conclude?
If the sphericity assumption is not satisfied, use the Greenhouse-Geisser, Huynh-Feldt or Lower-
bound corrections to make your conclusion.
Now let us look at the table of Tests of Within-Subjects Effects. Examine the table, what
can you conclude?
Contrasts
A repeated contrast compares one level of trial with the subsequent level. The first
column (Source) indicates the effect being tested. For example, the label trial tests the
hypothesis that, averaged over the two anxiety groups, the mean of the specified contrast is
zero.
The second column trial represents the contrasts. For example, Level 1 vs Level 2 represents
the transformation trial 1 – trial 2. This compares the first level of trial with the second level
of trial, and so on.
The label trial*anxiety tests the hypothesis that the mean of the specified contrast is the same
for the two anxiety groups.
Now look at the Tests of Within-Subjects Contrasts. What do you conclude?
Data Analysis Using Crosstabulation Techniques in SPSS
Introduction
Crosstabulation is a powerful technique that helps you to describe the relationships between
categorical (nominal or ordinal) variables. With Crosstabulation, we can produce the
following statistics:
Observed Counts and Percentages
Expected Counts and Percentages
Residuals
Chi-Square
Relative Risk and Odds Ratio for a 2 x 2 table
Kappa Measure of agreement for an R x R table
Examples will be used to demonstrate how to produce these statistics using SPSS. The data
set used for the demonstration comes with SPSS and it is called GSS_93.sav. It has 67
variables and 1500 cases (observations). Open this data file which is located in the SPSS
folder. Study the data file in order to understand it before performing the following exercises.
Example 1: An R x C Table with Chi-Square Test of Independence
Chi-Square tests the hypothesis that the row and column variables are independent, without
indicating strength or direction of the relationship. Like most statistical tests, certain
assumptions must be met to use the Chi-Square test successfully. They are:
No cell should have an expected value (count) less than 1, and
No more than 20% of the cells should have expected values (counts) less than 5.
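The quantities behind the Chi-Square test can be sketched directly. The counts below are invented (they are not from GSS_93.sav); the point is the formula SPSS applies: each expected count is (row total × column total) / grand total, and the Pearson chi-square sums (observed − expected)² / expected over all cells.

```python
# Sketch of what the Chi-Square test computes, for a made-up 2 x 3 table of
# observed counts (rows = groups, columns = categories):
#   expected[i][j] = row_total[i] * col_total[j] / grand_total
#   chi2 = sum over cells of (observed - expected)^2 / expected

observed = [[30, 20, 10],
            [20, 30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(len(observed)) for j in range(len(observed[0])))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Check the expected-count assumptions from the text before trusting chi2.
small_cells = sum(e < 5 for row in expected for e in row)
print(round(chi2, 3), df, small_cells)
```

Comparing chi2 against the chi-square distribution with df degrees of freedom gives the p-value that SPSS reports in the Chi-Square Tests table.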
In the SPSS file, there is a variable called relig short for religion (Protestant, Catholic,
Jewish, None, Other) and another one called region4 (Northeast, Midwest, South, West). In
this example, we want to find out if religious preferences vary by region of the country.
To produce the output, from the menu choose:
Analyze -> Descriptive Statistics -> Crosstabs….
Row(s): Religious Preferences [relig]
Column(s): Region [region4]
Statistics… select Chi-Square, click Continue then OK
In the SPSS output, Pearson chi-square, likelihood-ratio chi-square, and linear-by-linear
association chi-square are displayed. Fisher's exact test and Yates' corrected chi-square are
computed for 2x2 tables.
State the null and alternative hypotheses being tested.
Examine the output. What conclusion can you draw from the output?
However, you will notice that certain assumptions are not met. The results could be
misleading. What should you do? We will discuss this further in example 2 below.
Example 2: Percentages, Expected Values, and Residuals and Omitting Categories
From the last example, we noticed that 40% of the cells had expected counts less than 5. So
this assumption was violated. Since Other and Jewish had just 15 cases each, we can drop
them out of the analysis by using Select Cases. In other words, the religious preference is
restricted to Protestant, Catholic and None.
To produce the output, use Select Cases from the Data menu to select cases with relig not
equal to 3 and relig not equal to 5 (relig ~=3 & relig ~=5). Call up the dialogue box for
Crosstabs. Reset it to default and select:
Row(s): Region [region4]
Column(s): Religious Preferences [relig]
Statistics… Select Chi-Square
Nominal: select Contingency coefficient, Phi and Cramer’s V, Lambda,
Uncertainty coefficient, click Continue
Cells…
Counts: select Expected
Percentages: select Row
Residuals: select Adjusted Standardized, click Continue then OK
Now examine the output and try to interpret it.
You can pivot the table so each group of statistics appears in its own panel. To demonstrate,
double-click the table and drag region from the row tray to the right of statistics.
Look at the Region4*Religion Preference Crosstabulation. What can you conclude?
Look at the Chi-Square Tests table. What can you conclude?
Look at the Symmetric Measures table. What can you conclude about the strength of
the relationship between religion preference and region?
Examine the table Directional Measures what do you conclude?
Example 3: Tests Within Layers of a Multiway Table
A multiway table allows you to examine the relationship between two categorical variables
within levels of a controlling variable. For example, is the relationship between marital status and view
of life the same for males and females? This example shows you how to answer this type of
question in SPSS.
Use Select Cases from the Data menu to select cases with marital not equal to 4 (marital ~=
4).
Can you think of any reason why we have decided to exclude cases where marital status
is equal to 4 (i.e. separated)?
Call up the Crosstabs dialogue box. Click Reset to restore the dialogue box defaults. Then
select:
Row(s): Marital status [marital]
Column(s): Is Life Exciting or Dull [life]
Layer 1 of 1: Respondent’s Sex [sex]
Statistics… Select Chi-Square
Cells…
Counts: select Expected
Percentages: select Row
Residuals: select Standardized and Adjusted Standardized
Examine the results and try to interpret them.
Is there a relationship between marital status and view on life? Is this relationship the
same for males and females?
Example 4: The Relative Risk and Odds Ratio for a 2 x 2 Table
The Relative Risk for 2 x 2 tables is a measure of the strength of the association between the
presence of a factor and the occurrence of an event. If the confidence interval for the statistic
includes a value of 1, you cannot assume that the factor is associated with the event. The odds
ratio can be used as an estimate of relative risk when the event is rare.
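Both statistics can be sketched from a 2 x 2 table of counts. The numbers below are invented (they are not the GSS93 counts), but the formulas are the standard ones SPSS uses in the Risk Estimate table:

```python
# Sketch of the 2 x 2 risk computations, using an invented table:
#               event     no event
# exposed         a           b
# unexposed       c           d
# Relative risk = [a/(a+b)] / [c/(c+d)];  odds ratio = (a*d) / (b*c).

a, b = 40, 60    # hypothetical: owners who voted / did not vote
c, d = 20, 80    # hypothetical: renters who voted / did not vote

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)

print(round(relative_risk, 3), round(odds_ratio, 3))
```

Notice that with these made-up counts the event is not rare, so the odds ratio (about 2.67) noticeably overstates the relative risk (2.0), which illustrates why the rare-event condition below matters.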
In the GSS93 data file, there is a variable (dwelown) that measures home ownership (owner or
renter) and another variable (vote92) that measures voting (voted or did not vote). We would like
to find out whether home owners are more likely to vote than renters.
In the Variable View window, note all the codes that have been used for the two variables of
interest. For example, dwelown uses code 3 for other and code 8 for don’t know, while vote92
uses code 3 for not eligible and code 4 for refused. Select the cases with dwelown less than 3
and vote92 less than 3.
From the menus choose:
Data -> Select Cases
Select If condition is satisfied and click If.
Enter dwelown < 3 & vote92 < 3 as the condition and click Continue then OK.
In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then
select:
Row(s): Homeowner or Renter [dwelown]
Column(s): Voting in 1992 Election [vote92]
Cells…
Percentages select Row, click Continue then OK
Examine and interpret the output.
From the crosstabulation table, what can you conclude?
Recall the crosstabs dialogue box. In the Crosstabs dialogue box, select:
Statistics…
Select Risk, click Continue
Examine the output and interpret it.
Look at the table called Risk Estimate, what can you conclude?
The odds ratio should be used as an approximation to the relative risk when the following
conditions are met:
The probability of the event is small (<0.1). This condition guarantees that the odds
ratio will make a good approximation to the relative risk.
The design of the study is case-control.
These conditions are not met in the present example. In a smoking and lung cancer study,
however, the conditions would be met and you could use the odds ratio.
Example 5: The Kappa Measure of Agreement for an R x R Table
Cohen's kappa measures the agreement between the evaluations of two raters when both are
rating the same object. A value of 1 indicates perfect agreement. A value of 0 indicates that
agreement is no better than chance. Values of kappa greater than 0.75 indicate excellent
agreement beyond chance; values between 0.40 and 0.75 indicate fair to good agreement; and values
below 0.40 indicate poor agreement. Kappa is only available for tables in which both
variables use the same category values and both variables have the same number of
categories.
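The kappa calculation itself is short. The square table below is invented (two raters classifying the same 100 subjects into three categories); the formula is the standard one, comparing observed agreement with the agreement expected by chance from the marginal totals:

```python
# Sketch of Cohen's kappa for a square table of two raters' classifications
# (rows = rater A, columns = rater B). The counts below are invented.
#   kappa = (p_observed - p_expected) / (1 - p_expected)

table = [[25,  5,  0],
         [ 4, 30,  6],
         [ 1,  5, 24]]

n = sum(sum(row) for row in table)
p_obs = sum(table[i][i] for i in range(len(table))) / n   # diagonal = agreement

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))
```

For these invented counts kappa is about 0.68, which by the guideline above would be read as fair to good agreement beyond chance.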
The table for the kappa statistic is square (R x R), with the same row and column
categories, because each subject is classified or rated twice. For example, doctor A and doctor
B diagnose the same patients as schizophrenic, manic depressive, or behaviour-disordered. Do
the two doctors agree or disagree in their diagnoses? Two teachers assess a class of 18-year-old
students. Do the teachers agree or disagree in their assessments?
In the GSS93 subset data file, we have variables that assess the educational level of
respondent’s father (padeg) and mother (madeg). Is there any agreement between the father’s
and mother’s educational levels?
To produce the output, use Select Cases from the Data menu to select cases with madeg not
equal to 2 and padeg not equal to 2 (madeg ~= 2 & padeg ~= 2). In the Crosstabs dialogue
box, click Reset to restore the dialogue box defaults, and then select:
Row(s): Father’s Highest Degree [padeg]
Column(s): Mother’s Highest Degree [madeg]
Statistics… Select kappa, click Continue
Cells…
Percentages: select Total, click Continue then OK
Examine and interpret the output.
Look at the tables from the output. What can you conclude?
Intraclass Correlation Coefficients (ICC)
We can use ICC to assess inter-rater agreement when there are more than two raters. For
example, the International Olympic Committee (IOC) trains judges to assess gymnastics
competitions. How can we find out if the judges are in agreement? ICC can help us to answer
this question. Judges have to be trained to ensure that good performances receive higher
scores than average performances, and average performances receive higher scores than poor
performances; even though two judges may differ on the precise score that should be
assigned to a particular performance.
Use the data set judges.sav to illustrate how to use SPSS to calculate ICC.
Open the data set. From the menu select:
Analyze -> Scale -> Reliability
Items: judge1, judge2, judge3, judge4, judge5, judge6, judge7
Statistics… Under Descriptives for, check Item.
Check Intraclass correlation coefficient
Model: Two-Way Random
Type: Consistency
Confidence interval: 95%
Test value: 0
Examine and interpret the output. What would you conclude?
Types of Survival Analyses and when to use them in SPSS
Life Tables: Use life tables if cases can be classified into meaningful, equal time intervals.
A life table can be used to calculate the probability of a terminal event during any interval
under study.
Kaplan-Meier: Use this technique if cases cannot be classified into equal time intervals as
above. This is common in many clinical and experimental studies.
Cox Regression: Use this technique if you want to see the relationship between survival time and
a predictor variable, for instance age or tumour type.
Using Kaplan-Meier Survival Analysis to Test Competing Pain
Relief Treatments
A pharmaceutical company is developing an anti-inflammatory medication for treating
chronic arthritic pain. Of particular interest is the time it takes for the drug to take effect and
how it compares to an existing medication. Shorter times to effect are considered better.
The results of a clinical trial are collected in pain_medication.sav. This data file is stored in
the following folder: \\campus\software\dept\spss. Open the file and study it. Use Kaplan-
Meier Survival Analysis to examine the distribution of "time to effect" and compare the
effectiveness of the two treatments.
To run a Kaplan-Meier Survival Analysis, from the menus choose:
Analyze
Survival
Kaplan-Meier...
Select Time to effect [time] as the Time variable.
Select Effect status [status] as the Status variable.
Click Define Event.
Under Value(s) Indicating Event Has Occurred type 1 in the text area next to
Single value:.
Click Continue.
Select Treatment [treatment] as a Factor.
Click Compare Factor.
Select Log rank, Breslow, and Tarone-Ware.
Click Continue.
Click Options in the Kaplan-Meier dialog box.
Select Quartiles in the Statistics group and Survival in the Plots group.
Click Continue.
Click OK in the Kaplan-Meier dialog box.
Interpretation
Survival Table
The survival table is a descriptive table that details the time until the drug takes effect. The
table is sectioned by each level of Treatment, and each observation occupies its own row in
the table.
Time: The time at which the event or censoring occurred.
Status: Indicates whether the case experienced the terminal event or was censored.
Cumulative Proportion Surviving at the Time: The proportion of cases surviving from the
start of the table until this time. When multiple cases experience the terminal event at the
same time, these estimates are printed once for that time period and apply to all the cases
whose drug took effect at that time.
N of Cumulative Events: The number of cases that have experienced the terminal event from
the start of the table until this time.
N of Remaining Cases: The number of cases that, at this time, have yet to experience the
terminal event or be censored.
Survival Functions (Curves)
The survival curves give a visual representation of the life tables. The horizontal axis shows
the time to event. In this plot, drops in the survival curve occur whenever the medication
takes effect in a patient. The vertical axis shows the probability of survival. Thus, any point
on the survival curve shows the probability that a patient on a given treatment will not have
experienced relief by that time.
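The curve described above is built by the Kaplan-Meier product-limit rule: at each event time, the running survival estimate is multiplied by the fraction of at-risk cases that did not experience the event. The sketch below uses invented (time, event) pairs, not the pain_medication.sav data:

```python
# Sketch of the Kaplan-Meier product-limit estimate with invented data.
# Each observation is (time, event) with event = 1 when the drug took
# effect and 0 when the case was censored. At each event time t:
#   S(t) = S(previous) * (1 - d_t / n_t)
# where d_t cases had the event and n_t were still at risk just before t.

data = [(2, 1), (3, 1), (4, 0), (5, 1), (5, 1), (6, 0), (8, 1)]

data.sort()
n_at_risk = len(data)
survival = 1.0
curve = []                      # (time, estimated survival) at event times
i = 0
while i < len(data):
    t = data[i][0]
    d = sum(1 for time, event in data if time == t and event == 1)
    removed = sum(1 for time, event in data if time == t)
    if d > 0:
        survival *= (1 - d / n_at_risk)
        curve.append((t, round(survival, 4)))
    n_at_risk -= removed
    i += removed
print(curve)
```

Note how the censored cases (event = 0) do not drop the curve but do shrink the at-risk count, which is why censoring still affects later estimates.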
The plot for the New drug lies below that of the Existing drug throughout most of the trial, which
suggests that the new drug may give faster relief than the old one. To determine whether these
differences are due to chance, look at the comparison tables.
Means and Medians for Survival Time
The means and medians for survival time table offers a quick numerical comparison of the
"typical" times to effect for each of the medications. Since there is a lot of overlap in the
confidence intervals, it is unlikely that there is much difference in the "average" survival
time.
Percentiles
The percentiles table gives estimates of the first quartile, median, and third quartile of the
survival distribution. The interpretation of percentiles for survival curves is that the 75th
percentile is the latest time that at least 75 percent of the patients have yet to feel relief.
Overall Comparisons
This table provides overall tests of the equality of survival times across groups. Since the
significance values of the tests are all greater than 0.05, you cannot conclude that there is a
difference between the survival curves.
Summary
With the Kaplan-Meier Survival Analysis procedure, you have examined the distribution of
time to effect for two different medications. The comparison tests show that there is not a
statistically significant difference between them.
Recommended Readings
1. Hosmer, D. W., and S. Lemeshow. 1999. Applied Survival Analysis. New York: John
Wiley and Sons.
2. Kleinbaum, D. G. 1996. Survival Analysis: A Self-Learning Text. New York:
Springer-Verlag.
3. Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion. Upper
Saddle River, N.J.: Prentice Hall, Inc.
Binary Logistic Regression Model
In this type of model, you estimate the probability of an event occurring. The model can be
written as:
Prob(event) = 1 / (1 + e^(−z))
For a single independent variable:
z = b0 + b1x1
For multiple independent variables:
z = b0 + b1x1 + b2x2 + … + bnxn
where b0, b1, b2, … are coefficients estimated from the data, x1, x2, … are the independent
variables, n is the number of independent variables and e is the base of natural logarithms
(approximately 2.718).
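The model above can be computed directly once the coefficients are known. The coefficients in the sketch below are hypothetical, purely to show the arithmetic; the exercise that follows estimates real ones from cancer.sav:

```python
# The logistic model, computed directly. The coefficients here are
# hypothetical illustrations, not estimates from cancer.sav.
import math

def prob_event(b0, bs, xs):
    """Prob(event) = 1 / (1 + e^(-z)), with z = b0 + b1*x1 + ... + bn*xn."""
    z = b0 + sum(b * x for b, x in zip(bs, xs))
    return 1 / (1 + math.exp(-z))

# A single hypothetical predictor: b0 = -2, b1 = 0.8, x1 = 3
p = prob_event(-2.0, [0.8], [3.0])
print(round(p, 4))
```

Because of the 1 / (1 + e^(−z)) shape, the result is always a probability between 0 and 1, whatever value z takes.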
Exercise
The data held in the file cancer.sav are from a study reported by Brown (1980) and are
commonly cited in texts considering binary logistic regression. The prognosis for prostate
cancer is based upon whether or not the cancer has spread to the surrounding lymph nodes. In
this classic study Brown et al (see Brown, 1980) explored the following separate indicators
for lymph node involvement in a group of 53 men known to have prostate cancer. To open
the data file, follow these instructions:
1. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
3. Select the file cancer.sav and click on Open.
4. Spend some time to study the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
The variables (corresponding to columns in the data file) are:
1) age - age of patients in years.
2) acid - level of serum acid phosphatase (acid level in King-Armstrong units)
3) xray - x-ray result (0 = negative, 1 = positive)
4) size - size of tumour (0 = small, 1 = large)
5) stage - stage of tumour (0 = less serious, 1 = more serious)
6) nodes - nodal involvement (0 = not involved, 1 = involved)
Modelling
Carry out a Forward Conditional logistic regression analysis of the data using nodal
involvement as the dependent variable and the other variables as independent variables (i.e.
covariates). You do not need to define xray, size or stage as being categorical variables, since
they are already binary variables. Follow these steps to carry out the Forward Conditional
binary logistic regression:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Method: Forward Conditional
Use the output to answer the following questions.
Look at the table Case Processing Summary. What do you conclude?
Now look at the three tables under Block 0: Beginning Block. What do you conclude?
Now look at the tables under Block 1: Method=Forward (stepwise) conditional. What do
you conclude?
Give the logistic regression equation for the final model.
Predictions
Carry out another logistic regression analysis of the data using nodal involvement as the
dependent variable but this time including ALL the covariates in the model, i.e. using the
ENTER method. Also request the Odds Ratio (OR) and the 95% Confidence Interval (CI) of
OR. Follow these steps:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Save: Under Predicted Values select Probabilities
Options: select CI for exp(B): 95%
Method: Enter
1. Give the coefficients for the full model, i.e. including all the variables. [Normally
you would only consider the statistically significant variables].
2. Which coefficients are statistically significant and why?
3. What is the probability of nodal involvement for each man in the data set? Which
case has the highest probability and which case the lowest probability of nodal
involvement?
4. Select one significant variable and give its OR and 95% CI. How would you interpret
the OR and its 95% CI?
Reference
Brown, B. W., Jr., et al. 1980. Prediction Analyses for Binary Data. In Biostatistics Casebook.
New York: John Wiley and Sons.
Multivariate Analysis of Variance (MANOVA)
The GLM Multivariate procedure allows you to model the values of multiple dependent scale
variables, based on their relationships to categorical and scale predictors.
The GLM Multivariate procedure is based on the general linear model, in which factors and
covariates are assumed to have linear relationships to the dependent variables.
Fixed Factors: Categorical predictors should be selected as factors in the model. Each level
of a factor can have a different linear effect on the value of the dependent variables. The
GLM Multivariate procedure assumes that all the model factors are fixed; that is, they are
generally thought of as variables whose values of interest are all represented in the data file,
usually by design.
Covariates: Scale predictors should be selected as covariates in the model. Within
combinations of factor levels (or cells), values of covariates are assumed to be linearly
correlated with values of the dependent variables.
Interactions: By default, the GLM Multivariate procedure produces a model with all
factorial interactions, which means that each combination of factor levels can have a different
linear effect on the dependent variable. Additionally, you may specify factor-covariate
interactions, if you believe that the linear relationship between a covariate and the dependent
variables changes for different levels of a factor.
For the purposes of testing hypotheses concerning parameter estimates, the GLM
Multivariate procedure assumes:
• The values of errors are independent of each other across observations and the
independent variables in the model. Good study design generally avoids violation of
this assumption.
• The covariance of dependent variables is constant across cells. This can be
particularly important when there are unequal cell sizes; that is, different numbers of
observations across factor-level combinations.
• Across the dependent variables, the errors have a multivariate normal distribution
with a mean of 0.
As part of the initial treatment for myocardial infarction (MI, or "heart attack"), a
thrombolytic, or "clot-busting", drug is sometimes administered to help clear the patient's
arteries before surgery. Three of the available drugs are alteplase, reteplase, and
streptokinase. Alteplase and reteplase are newer, more expensive drugs, and a regional health
care system wants to determine whether they are cost-effective enough to adopt in place of
streptokinase. One of the benefits of thrombolytic drugs is that surgery generally proceeds
more smoothly, resulting in a shorter recovery period. If the newer drugs are effective, then
patients given those drugs should have shorter lengths of stay in the hospital. Hopefully, the
shorter lengths of stay will help to make up for the greater initial cost of the newer drugs.
Running The Analysis
1. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
3. Select the file heart.sav and click on Open.
4. Spend some time studying the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
To run a GLM Multivariate analysis, from the menus choose:
1. Analyze -> General Linear Model -> Multivariate...
2. Select Length of stay [los] and Treatment costs [cost] as dependent variables.
3. Select Clot-dissolving drugs [clotsolv] and Surgical treatment [proc] as fixed factors.
4. Click Contrasts.
5. Select clotsolv (None) as the contrast to change.
6. In the Change Contrast group, select Simple as the contrast type.
7. Select First as the reference category.
8. Click Change, then click Continue.
9. Click Options in the GLM Multivariate dialogue box.
10. Select Estimates of effect size, SSCP matrices, Homogeneity tests and Spread vs.
level plot.
11. Click Continue and then OK in the GLM Multivariate dialogue box.
By default, a model is fit with Clot-dissolving drugs and Surgical treatment as main effects
and their interaction as a two-way effect.
Interpretation of Results
SSCP Matrices and Multivariate Test
This table displays the hypothesis and error sum-of-squares and cross-products (SSCP)
matrices for testing model effects. Since there are two dependent variables, each matrix has
two columns and two rows.
For example, the 2x2 matrix associated with CLOTSOLV in the table is the hypothesis
matrix for testing the significance of Clot-dissolving drugs.
The matrix associated with PROC in the table is the hypothesis matrix for testing the
significance of Surgical treatment, and the matrix associated with PROC*CLOTSOLV is
used for testing their interaction effect.
The error matrix is used in testing each effect. By analogy with the test for models with one
dependent variable, the "ratio" of the hypothesis SSCP matrix to the error matrix is used to
evaluate the effect of interest.
The multivariate test table displays four tests of significance for each model effect.
Pillai's trace is a positive-valued statistic. Increasing values of the statistic indicate effects
that contribute more to the model.
Wilks' Lambda is a positive-valued statistic that ranges from 0 to 1. Decreasing values of the
statistic indicate effects that contribute more to the model.
Hotelling's trace is the sum of the eigenvalues of the test matrix. It is a positive-valued
statistic for which increasing values indicate effects that contribute more to the model.
Hotelling's trace is always larger than Pillai's trace, but when the eigenvalues of the test
matrix are small, these two statistics will be nearly equal. This indicates that the effect
probably does not contribute much to the model.
Roy's largest root is the largest eigenvalue of the test matrix. Thus, it is a positive-valued
statistic for which increasing values indicate effects that contribute more to the model.
Roy's largest root is always less than or equal to Hotelling's trace. When these two statistics
are equal, the effect is predominantly associated with just one of the dependent variables,
there is a strong correlation between the dependent variables, or the effect does not contribute
much to the model.
There is evidence that Pillai's trace is more robust than the other statistics to violations of
model assumptions (Olson, 1974).
Each multivariate statistic is transformed into a test statistic with an approximate or exact F
distribution.
The hypothesis (numerator) and error (denominator) degrees of freedom for that F
distribution are shown.
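All four statistics can be computed from the eigenvalues of the matrix E^-1 H (the error SSCP matrix inverted and multiplied by the hypothesis SSCP matrix). SPSS does this for you, but a minimal Python sketch, using made-up eigenvalues rather than values from heart.sav, makes the relationships between the statistics concrete:

```python
# The four multivariate test statistics, computed from the eigenvalues
# of E^-1 * H. The eigenvalues below are illustrative only.

def multivariate_stats(eigenvalues):
    pillai = sum(lam / (1 + lam) for lam in eigenvalues)   # Pillai's trace
    wilks = 1.0
    for lam in eigenvalues:                                 # Wilks' lambda
        wilks *= 1 / (1 + lam)
    hotelling = sum(eigenvalues)                            # Hotelling's trace
    roy = max(eigenvalues)                                  # Roy's largest root
    return pillai, wilks, hotelling, roy

pillai, wilks, hotelling, roy = multivariate_stats([0.9, 0.05])
```

Because each eigenvalue contributes λ to Hotelling's trace but only λ/(1+λ) to Pillai's trace, Pillai's trace never exceeds Hotelling's trace, and Roy's largest root (a single eigenvalue) never exceeds their sum, matching the relationships described above.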
The significance values of the main effects, CLOTSOLV and PROC, are less than 0.05,
indicating that the effects contribute to the model.
By contrast, their interaction effect does not contribute to the model.
However, although CLOTSOLV does contribute to the model, the closeness of Pillai's trace
to Hotelling's trace suggests that it does not contribute very much.
The multivariate test table
A more straightforward way to see this is to look at the partial eta squared. The partial eta
squared statistic reports the "practical" significance of each term, based on the ratio of the
variation accounted for by the effect to the sum of that variation and the variation left to
error.
Larger values of partial eta squared indicate a greater amount of variation accounted for by
the model effect, to a maximum of 1. Since the partial eta squared is very small for
CLOTSOLV, it does not contribute very much to the model. By contrast, the partial eta
squared for PROC is quite large, which is to be expected. The surgical procedure a patient
must undergo for MI treatment is going to have a much greater effect on the length of their
hospital stay and final cost than the type of thrombolytic they receive.
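The formula behind the statistic is simply SS_effect / (SS_effect + SS_error). A small sketch, using hypothetical sums of squares rather than values from the SPSS output:

```python
# Partial eta squared: SS_effect / (SS_effect + SS_error).
# The sums of squares below are hypothetical, for illustration only.

def partial_eta_squared(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)

# An effect that is small relative to error yields a small partial eta squared:
small = partial_eta_squared(ss_effect=4.0, ss_error=396.0)    # 0.01
# An effect that is large relative to error yields a large one:
large = partial_eta_squared(ss_effect=300.0, ss_error=100.0)  # 0.75
```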
In this case, it is enough for the multivariate tests to show that CLOTSOLV is significant,
which means that the effect of at least one of the drugs is different from the others. The
contrast results will show you where the differences are.
This table displays results for each contrast. Simple contrasts using the first level of Clot-
dissolving drugs as the reference category were specified.
Thus, one contrast compares the second level to the first level; that is, the effect of reteplase
to the effect of streptokinase.
The contrast estimates show that, on average, patients given reteplase spend 0.382 fewer days
in the hospital and incur almost 600 dollars more in treatment costs than patients given
streptokinase.
Since the significance value for Length of stay is less than 0.05, you can conclude this
difference is not due to chance.
The significance value for Treatment costs is greater than 0.10, so this difference may be
entirely due to chance variation.
The second contrast compares the third level to the first level; that is, the effect of alteplase to
the effect of streptokinase.
The contrast estimates show that, on average, patients given alteplase spend about half a day
less in the hospital and incur slightly over 700 dollars more in treatment costs.
Since the significance value for Length of stay is less than 0.05, you can conclude this
difference is not due to chance.
The significance value for Treatment costs is greater than 0.10, so this difference may be
entirely due to chance variation.
The contrast results show that alteplase and reteplase seem to reduce patient length of stay.
Moreover, the reduction is enough to equalize the treatment costs, or at least bring the
difference within the random variation.
Thus, the model suggests that alteplase and reteplase should be used in place of streptokinase.
However, before adopting this plan, you should check some tests of the model assumptions.
The assumption for the multivariate approach is that the vector of the dependent variables
follows a multivariate normal distribution, and the variance-covariance matrices are equal
across the cells formed by the between-subject effects.
Box's M tests the null hypothesis that the observed covariance matrices of the dependent
variables are equal across groups.
The Box's M test statistic is transformed to an F statistic with df1 and df2 degrees of freedom.
Here, the significance value of the test is less than 0.05, suggesting that the assumptions are
not met, and thus the model results are suspect.
Box's M is sensitive to large data files, meaning that when there are a large number of cases,
it can detect even small departures from homogeneity. Moreover, it can be sensitive to
departures from the assumption of normality. As an additional check of the diagonals of the
covariance matrices, look at Levene's tests.
This table tests equality of the error variances across the cells defined by the combination of
factor levels.
A separate test is performed for each dependent variable.
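Levene's test is, in essence, a one-way ANOVA carried out on the absolute deviations of each observation from its cell centre. As a rough illustration, this sketch centres each group on its median (the Brown-Forsythe variant of the test) and uses toy data rather than heart.sav:

```python
# Levene-type test for equal variances: a one-way ANOVA on absolute
# deviations from each group's centre. This sketch uses the group median
# (the Brown-Forsythe variant); the data are illustrative only.

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def levene_f(groups):
    # Absolute deviations of each observation from its group median
    devs = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(devs)                                # number of groups
    n = sum(len(d) for d in devs)                # total observations
    grand = sum(x for d in devs for x in d) / n
    means = [sum(d) / len(d) for d in devs]
    ss_between = sum(len(d) * (m - grand) ** 2 for d, m in zip(devs, means))
    ss_within = sum((x - m) ** 2 for d, m in zip(devs, means) for x in d)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = levene_f([[1, 2, 3], [1, 5, 9]])  # the second group is more spread out
```

A large F statistic here is evidence against the equal variances assumption, just as in the SPSS table.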
The significance value for Length of stay is greater than 0.10, so there is no reason to believe
that the equal variances assumption is violated for this variable.
However, the significance value for the test of Treatment costs is less than 0.05, indicating
that the equal variances assumption is violated for this variable.
Like Box's M, Levene's test can be sensitive to large data files, so look at the spread vs. level
plot for Treatment costs for visual confirmation.
The spread-versus-level plot is a scatterplot of the cell means and standard deviations.
It provides a visual test of the equal variances assumption, with the added benefit of helping
you to assess whether violations of the assumption are due to a relationship between the cell
means and standard deviations.
This plot agrees with the result of Levene's test, that the equal variances assumption is
violated for Treatment costs.
There is also a clear positive relationship in the scatterplot, showing that as the cell mean
increases, so does the variability. This relationship suggests a possible solution to the
problem.
Since Treatment costs is a positive-valued variable, you could propose that the error term has
a multiplicative, rather than additive, effect on cost. Instead of modeling Treatment costs, you
will analyze Log-cost.
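The reasoning behind the transformation is a one-line identity: if cost = baseline x error, then log(cost) = log(baseline) + log(error), so an error that is multiplicative on the original scale becomes additive on the log scale. In Python, with illustrative numbers only:

```python
import math

# If the error acts multiplicatively, cost = baseline * error, then on the
# log scale it becomes additive: log(cost) = log(baseline) + log(error).
# The values below are illustrative only.
baseline, error = 30000.0, 1.10   # a hypothetical 10% multiplicative overrun
cost = baseline * error

lhs = math.log(cost)                            # log of the observed cost
rhs = math.log(baseline) + math.log(error)      # additive decomposition
```

This is why an error whose size grows with the cell mean, as in the spread-versus-level plot, can become roughly constant after the log transformation.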
To run an analysis using log-transformed costs, click the Dialog Recall tool and select GLM
Multivariate (or select Analyze -> General Linear Model -> Multivariate...).
1. Deselect Treatment costs as a dependent variable.
2. Select Log-cost as the dependent variable.
3. Click OK in the GLM Multivariate dialogue box.
Box's M is significant, while Levene's test is not. This can happen for several reasons:
• The covariance between Length of stay and Log-cost is not constant across cells, and
thus the model results are suspect.
• The covariances are unequal, though not by much, but the large size of the data file
causes Box's M to be overly sensitive to this departure from homogeneity.
• The covariances are equal, but the test procedure for computing Box's M, a
multivariate test, simply comes up with a different result than the univariate test.
• The distribution of Length of stay and Log-cost is different enough from a
multivariate normal distribution to cause Box's M to be significant.
In order to help decide whether you should be concerned about the significance of Box's M,
some exploratory data analysis is in order. You can use the Explore procedure to check the
assumption of normality. With the data file split by the cells, you can use the Bivariate
Correlations procedure to see whether the correlations are constant across cells.
The results for Length of stay are identical to the results from the previous model.
However, the results for Log-cost are different from those for Treatment costs.
The significance values for both contrasts are less than 0.05, suggesting that the differences in
costs between the newer drugs and streptokinase are not due to chance.
The contrast estimate for the difference between reteplase and streptokinase is 0.0217. Since
you are looking at differences in log-transformed cost, this means that the ratio of costs is
exp(0.0217) = 1.0219. That is, the costs incurred by patients given reteplase are
approximately 2.19 percent higher than the costs incurred by patients given streptokinase. If
the typical MI patient incurs 25,000 to 35,000 dollars in treatment costs, reteplase patients
incur, roughly, an extra 550 to 770 dollars in costs.
The contrast estimate for the difference between alteplase and streptokinase is 0.0243. Since
you are looking at differences in log-transformed cost, this means that the ratio of costs is
exp(0.0243) = 1.0246. That is, the costs incurred by patients given alteplase are
approximately 2.46 percent higher than the costs incurred by patients given streptokinase. If
the typical MI patient incurs 25,000 to 35,000 dollars in treatment costs, alteplase patients
incur, roughly, an extra 600 to 860 dollars in costs.
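The back-transformation in the two paragraphs above is straightforward to reproduce. The following sketch uses the contrast estimates and the 25,000 to 35,000 dollar range quoted in the text:

```python
import math

# Back-transform a contrast estimate on the log scale into a cost ratio,
# then into an extra-cost range for a typical MI patient incurring
# 25,000 to 35,000 dollars in treatment costs.

def extra_cost_range(log_contrast, low=25000, high=35000):
    ratio = math.exp(log_contrast)   # ratio of costs on the original scale
    return ratio, (ratio - 1) * low, (ratio - 1) * high

ratio_ret, lo_ret, hi_ret = extra_cost_range(0.0217)  # reteplase vs. streptokinase
ratio_alt, lo_alt, hi_alt = extra_cost_range(0.0243)  # alteplase vs. streptokinase
```

Rounding the resulting ranges gives the "extra 550 to 770 dollars" and "extra 600 to 860 dollars" figures quoted above.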
These contrast results show that while alteplase and reteplase do seem to reduce patient
length of stay, the reduction is not enough to equalize the treatment costs.
Thus, determining whether alteplase and reteplase should be used in place of streptokinase
will require further study of the cost of these drugs versus their effectiveness at increasing the
success of surgery.
Using the GLM Multivariate procedure, you have performed a multivariate analysis of
variance on the patient lengths of stay and treatment costs, using the surgical procedure
performed and thrombolytic administered as fixed factors. Your initial model indicated that
the final treatment costs for reteplase and alteplase are not significantly different from those
for streptokinase. However, that model violated the equal variances assumption. The spread
vs. level plot showed that a log-transformation of Treatment costs might be appropriate, so
the model was re-run, replacing Treatment costs with Log-cost as a dependent variable. This
second model passed Levene's test, but now showed a significant difference in the final costs
for thrombolytics. The new difference in costs translates to an extra 550 to 860 dollars for the
"average" MI patient, so further study of the cost-effectiveness of the new drugs is necessary.
What happened? The differences in Treatment costs in the original model fall in the range of
550 to 860 dollars, but that model did not find the difference to be significant. Why should it
matter now? Since Treatment costs is a positive-valued variable, its distribution is probably
right-skewed, so it is likely that there are patients who incurred unusually high costs, thus
inflating the error variation in the first model. Log-transforming Treatment costs reduces the
influence of these high-cost patients. In this case, it was enough to make the differences in
costs statistically significant.
Once satisfied with Log-cost as a dependent variable, you should fit a "final" model without
the interaction term, because it has not contributed to either of the first two models.
Recommended Reading
See the following texts for more information on multivariate linear models:
1. Bray, J. H., and S. E. Maxwell. 1985. Multivariate Analysis of Variance. Thousand
Oaks, Calif.: Sage Publications, Inc.
2. Norusis, M. 2004. SPSS 13.0 Statistical Procedures Companion. Upper Saddle River,
N.J.: Prentice Hall, Inc.
3. Olson, C. L. 1974. Comparative Robustness of Six Tests in Multivariate Analysis of
Variance. Journal of the American Statistical Association, 69:348, 894-908.