NUIT, Newcastle University
IBM SPSS STATISTICS for Windows Intermediate / Advanced
A Training Manual for Intermediate / Experienced Users, Faculty of Medical Sciences
Dr S. T. Kometa
Table of Contents
Ordinary Regression
Repeated Measures Analysis
Data Analysis Using Crosstabulation Techniques
Types of Survival Analysis / Kaplan-Meier
Binary Logistic Regression
Multivariate Analysis of Variance (MANOVA)
Ordinary Linear Regression Model with Two Independent Variables
Why fit a regression model?
To build a model for predicting the outcome variable for a new sample of data.
To see how well the independent (explanatory) variables explain the dependent
(response) variable.
To identify which subset of many independent variables is most effective for
estimating the dependent variable.
Open the data set called world95.sav. To do this, follow these instructions:
1. Select Start -> Programs -> Statistical Software -> IBM SPSS Statistics -> IBM
SPSS Statistics 19.
2. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
3. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
4. Select the file world95.sav and click on Open.
5. Spend some time to study the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
6. Are there any missing values in the data? Yes No
Assumptions for Ordinary Linear Regression
All observations should be independent.
Your data should not suffer from multicollinearity; that is, the independent
variables should not be highly correlated. To find out if your data suffer from multicollinearity,
you have to look at the tolerances for each of the independent variables in the model.
These are printed if you select Collinearity Diagnostics in the Linear Regression
Statistics dialogue box. If any of the tolerances are small, less than 0.1 for example,
multicollinearity may be a problem.
Residuals from the model fit should follow a normal distribution.
Each of the independent (explanatory or predictor) continuous variables should have a
linear relationship with the dependent (response or outcome) variable. It is always a
good idea to check this assumption using scatterplots.
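The tolerance check described above can also be sketched numerically. The example below is not SPSS output: it uses made-up numbers for two predictors, and relies on the fact that with only two predictors the tolerance of each reduces to 1 − r², where r is their Pearson correlation.

```python
# Sketch (not SPSS output): tolerance for a predictor is 1 - R^2 from
# regressing that predictor on the other predictors. With just two
# predictors this reduces to 1 - r^2, where r is their Pearson correlation.
# The data below are invented for illustration.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor 1
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]   # hypothetical predictor 2, nearly collinear with x1

r = pearson_r(x1, x2)
tolerance = 1 - r ** 2
print(round(tolerance, 4))        # close to 0, so multicollinearity is a concern
```

Here the tolerance is well below 0.1, which by the rule of thumb above would flag a multicollinearity problem.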
Simple Linear Regression
Is the female literacy of a country useful in predicting its life expectancy? We want to build a
model of the form:
Average female life expectancy = b0 + b1 × female literacy + ε

where Average female life expectancy (lifeexpf) is the dependent (response, y, or outcome)
variable, females who can read (%) (lit_fema) is the independent (explanatory or predictor)
variable, b0 is the intercept of the line of best fit, b1 is its slope and ε is the error term.
Is there a linear relationship between average female life expectancy and female literacy? Produce a scatter plot to help you answer this question.
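Before running the SPSS procedure, it may help to see what the least-squares fit computes. The sketch below uses plain Python and invented numbers (not the world95.sav values) to estimate b0 and b1 from the standard formulas, then uses them for a prediction:

```python
# Illustrative only: the least-squares slope and intercept that SPSS's Linear
# Regression procedure estimates can be computed directly as
#   b1 = S_xy / S_xx,   b0 = mean(y) - b1 * mean(x).
# The tiny data set below is invented; use world95.sav for the real exercise.

literacy = [40, 55, 70, 85, 95]            # hypothetical female literacy (%)
life_exp = [50, 58, 66, 74, 79]            # hypothetical life expectancy (years)

n = len(literacy)
mx = sum(literacy) / n
my = sum(life_exp) / n
s_xy = sum((x - mx) * (y - my) for x, y in zip(literacy, life_exp))
s_xx = sum((x - mx) ** 2 for x in literacy)

b1 = s_xy / s_xx                           # slope
b0 = my - b1 * mx                          # intercept

predicted = b0 + b1 * 86                   # prediction for 86% literacy
print(round(b0, 3), round(b1, 3), round(predicted, 1))
```

The same plug-in step, with the coefficients SPSS reports for world95.sav, is what the Coefficients-table exercise below asks you to do by hand.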
To produce the output for the regression model, from the menus choose:
Analyze -> Regression -> Linear….
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: female who can read (%) [lit_fema]
Statistics…
Descriptives
Make sure that Estimates and Model fit are selected.
Select Collinearity diagnostics
Residuals
Casewise diagnostics
Select Outliers outside: 1.0 standard deviations
Plots…
Y: *ZRESID
X: *ZPRED
Click Next
Y: *ZPRED
X: DEPENDNT
Select Histogram and Normal probability plot
These steps will generate lots of output. Now examine the output and attempt to interpret it.
Look at the table Descriptive Statistics. What will you conclude?
Look at the table Correlations. What are the hypotheses being tested? What will you
conclude?
Look at the table Model Summary. What do you conclude?
Look at the table ANOVA. Explain what the Degrees of Freedom (DF), Sums of Squares (SS)
and Mean Squares (MS) represent. How are they related?
State the hypotheses being tested in the ANOVA table. How is the test statistic calculated and
what would your decision be?
Look at the table Coefficients. What do you conclude? Write an equation for the
regression model and use it to predict the average female life expectancy of a country
whose female literacy is 86%. What are the hypotheses being tested?
The last two columns of the Coefficient table give information about collinearity
statistics. Looking at the Tolerance, can you say if there is any problem with
multicollinearity?
The rest of the output deals with the residuals. This helps you to find out whether the assumptions
for a linear regression are met and to identify any outliers or influential cases.
Look at the table Casewise Diagnostics. What is a standardised residual? What do you
conclude?
Look at the table Residual Statistics. What do you conclude?
Look at the Histogram and Normal P-P Plot. What do you conclude about the
residuals?
Now look at the two Scatter Plots. What do you conclude?
Can you think of any restriction when using your model to predict female life
expectancy?
How would you validate a model like this?
Multiple Linear Regression
While a simple linear regression can have just one independent variable, a multiple linear
regression can have more than one independent variable. The following is a model with two
independent variables:
Average female life expectancy = b0 + b1 × infant mortality + b2 × fertility + ε
where infant mortality (deaths per 1000 live births) [babymort] is the number of infant deaths
during the first year of life per thousand live births and average number of kids [fertilty] is the
average number of children per family.
We found that literacy explained 67% of the variability of life expectancy. Now we examine
a model using infant mortality (babymort) and fertility (fertilty) to predict life expectancy.
To run the analysis, select
Analyze -> Regression -> Linear….Click on Reset.
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: average number of kids [fertilty], infant mortality (deaths per 1000 live births)
[babymort]
Case Labels: country
Statistics…
Descriptives
Make sure that Estimates and Model fit are selected.
Select Collinearity diagnostics
Plots…
Produce all partial plots
Save...
Predicted values: Standardised
Look at the table Descriptive Statistics. What do you conclude?
Look at the table Correlations. What do you conclude?
Look at the table Model Summary. What do you conclude?
Look at the ANOVA table. What do you conclude?
Look at the table Coefficients. What do you conclude? Write an equation for the
regression model and use it to predict the female life expectancy of a country whose
fertility is 3 and infant mortality is 23 per 1000 live births.
Repeated Measures Analysis of Variance
Does the anxiety rating of a person affect performance on a learning task? Twelve subjects
were assigned to one of two anxiety groups on the basis of an anxiety test, and the number of
errors made in four blocks of trials on a learning task was measured. We use repeated
measures analysis of variance technique to study the data.
Open the SPSS data file called anxiety2. Notice that there is one case for each subject and
four trials variables (trial1, trial2, trial3 and trial4).
In the repeated measures analysis-of-variance technique, we distinguish two types of factors in
the model: between-subjects factors and within-subjects factors. A between-subjects factor,
as the name suggests, divides the subjects into discrete subgroups, for example anxiety in this
data file. Anxiety divides the cases into two groups: high anxiety scores and low anxiety
scores. A within-subjects factor is any factor that distinguishes measurements made on the
same subject. For example, trial distinguishes the four measurements taken for each subject.
To produce the output in this example, from the menus choose:
Analyze
General Linear Model
Repeated Measures…
Within-Subject Factors name: replace factor1 with trial
Number of Levels: 4
Click Add and Click Define
Within-Subjects Variables (trial): trial1, trial2, trial3 and trial4
Between-Subjects Factor(s): anxiety
Options…
Select Homogeneity tests
Contrasts…
Factors: trial
Contrasts: Repeated (click Change)
Click on Continue and then OK.
Examine the results and try to interpret them.
Between-Subject Test
The test of between-subject effects is shown on the table Tests of Between-Subjects
Effects. Examine this table. What do you conclude?
Multivariate Tests
The multivariate table contains tests of the within-subjects factor, trial, and of the interaction of
the within-subjects factor with the between-subjects factor, trial*anxiety.
Examine the Multivariate Tests table. What do you conclude?
Assumptions
The vector of the dependent variables follows a normal distribution, and the variance-
covariance matrices are equal across the cells formed by the between-subject effects.
The test for this assumption is shown on the table Box’s test of Equality of Covariance
Matrices. Examine this table, what do you conclude?
It is assumed that the variance-covariance matrix of the dependent variables is circular.
The test of this assumption is shown on the table Mauchly’s Test of Sphericity. Examine
this table, what do you conclude?
If the sphericity assumption is not satisfied, use the Greenhouse-Geisser, Huynh-Feldt or Lower-
bound corrections to make your conclusion.
Now let us look at the table of Tests of Within-Subjects Effects. Examine the table, what
can you conclude?
Contrasts
A repeated contrast compares one level of trial with the subsequent level. The first
column (Source) indicates the effect being tested. For example, the label trial tests the
hypothesis that, averaged over the two anxiety groups, the mean of the specified contrast is
zero.
The second column trial represents the contrasts. For example, Level 1 vs Level 2 represents
the transformation trial 1 – trial 2. This compares the first level of trial with the second level
of trial, and so on.
The label trial*anxiety tests the hypothesis that the mean of the specified contrast is the same
for the two anxiety groups.
Now look at the Tests of Within-Subjects Contrasts. What do you conclude?
Data Analysis Using Crosstabulation Techniques in SPSS
Introduction
Crosstabulation is a powerful technique that helps you to describe the relationships between
categorical (nominal or ordinal) variables. With Crosstabulation, we can produce the
following statistics:
Observed Counts and Percentages
Expected Counts and Percentages
Residuals
Chi-Square
Relative Risk and Odds Ratio for a 2 x 2 table
Kappa Measure of agreement for an R x R table
Examples will be used to demonstrate how to produce these statistics using SPSS. The data
set used for the demonstration comes with SPSS and it is called GSS_93.sav. It has 67
variables and 1500 cases (observations). Open this data file which is located in the SPSS
folder. Study the data file in order to understand it before performing the following exercises.
Example 1: An R x C Table with Chi-Square Test of Independence
Chi-Square tests the hypothesis that the row and column variables are independent, without
indicating strength or direction of the relationship. Like most statistical tests, certain
assumptions must be met to use the Chi-Square test successfully. They are:
No cell should have an expected value (count) less than 1, and
No more than 20% of the cells should have expected values (counts) less than 5.
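The quantities behind the Chi-Square test can be sketched directly. The counts below are invented (they are not from GSS_93.sav); the point is the formula SPSS applies: each expected count is (row total × column total) / grand total, and the Pearson chi-square sums (observed − expected)² / expected over all cells.

```python
# Sketch of what the Chi-Square test computes, for a made-up 2 x 3 table of
# observed counts (rows = groups, columns = categories):
#   expected[i][j] = row_total[i] * col_total[j] / grand_total
#   chi2 = sum over cells of (observed - expected)^2 / expected

observed = [[30, 20, 10],
            [20, 30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(len(observed)) for j in range(len(observed[0])))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Check the expected-count assumptions from the text before trusting chi2.
small_cells = sum(e < 5 for row in expected for e in row)
print(round(chi2, 3), df, small_cells)
```

Comparing chi2 against the chi-square distribution with df degrees of freedom gives the p-value that SPSS reports in the Chi-Square Tests table.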
In the SPSS file, there is a variable called relig short for religion (Protestant, Catholic,
Jewish, None, Other) and another one called region4 (Northeast, Midwest, South, West). In
this example, we want to find out if religious preferences vary by region of the country.
To produce the output, from the menu choose:
Analyze -> Descriptive Statistics -> Crosstabs….
Row(s): Religious Preferences [relig]
Column(s): Region [region4]
Statistics… select Chi-Square, click Continue then OK
In the SPSS output, Pearson chi-square, likelihood-ratio chi-square, and linear-by-linear
association chi-square are displayed. Fisher's exact test and Yates' corrected chi-square are
computed for 2x2 tables.
State the null and alternative hypotheses being tested.
Examine the output. What conclusion can you draw from the output?
However, you will notice that certain assumptions are not met. The results could be
misleading. What should you do? We will discuss this further in example 2 below.
Example 2: Percentages, Expected Values, and Residuals and Omitting Categories
From the last example, we noticed that 40% of the cells had expected counts less than 5. So
this assumption was violated. Since Other and Jewish had just 15 cases each, we can drop
them out of the analysis by using Select Cases. In other words, the religious preference is
restricted to Protestant, Catholic and None.
To produce the output, use Select Cases from the Data menu to select cases with relig not
equal to 3 and relig not equal to 5 (relig ~=3 & relig ~=5). Call up the dialogue box for
Crosstabs. Reset it to default and select:
Row(s): Region [region4]
Column(s): Religious Preferences [relig]
Statistics… Select Chi-Square
Nominal: select Contingency coefficient, Phi and Cramer’s V, Lambda,
Uncertainty coefficient, click Continue
Cells…
Counts: select Expected
Percentages: select Row
Residuals: select Adjusted Standardized, click Continue then OK
Now examine the output and try to interpret it.
You can pivot the table so each group of statistics appears in its own panel. To demonstrate,
double-click the table and drag region from the row tray to the right of statistics.
Look at the Region4*Religion Preference Crosstabulation. What can you conclude?
Look at the Chi-Square Tests table. What can you conclude?
Look at the Symmetric Measures table. What can you conclude about the strength of
the relationship between religion preference and region?
Examine the table Directional Measures what do you conclude?
Example 3: Tests Within Layers of a Multiway Table
A multiway table allows you to examine the relationship between two categorical variables
within levels of a controlling variable. For example, is the relationship between marital status and view
of life the same for males and females? This example shows you how to answer this type of
question in SPSS.
Use Select Cases from the Data menu to select cases with marital not equal to 4 (marital ~=
4).
Can you think of any reason why we have decided to exclude cases where marital status
is equal to 4 (i.e. separated)?
Call up the Crosstabs dialogue box. Click Reset to restore the dialogue box defaults. Then
select:
Row(s): Marital status [marital]
Column(s): Is Life Exciting or Dull [life]
Layer 1 of 1: Respondent’s Sex [sex]
Statistics… Select Chi-Square
Cells…
Counts: select Expected
Percentages: select Row
Residuals: select Standardized and Adjusted Standardized
Examine the results and try to interpret them.
Is there a relationship between marital status and view on life? Is this relationship the
same for males and females?
Example 4: The Relative Risk and Odds Ratio for a 2 x 2 Table
The Relative Risk for 2 x 2 tables is a measure of the strength of the association between the
presence of a factor and the occurrence of an event. If the confidence interval for the statistic
includes a value of 1, you cannot assume that the factor is associated with the event. The odds
ratio can be used as an estimate of relative risk when the event is rare.
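Both statistics can be sketched from a 2 x 2 table of counts. The numbers below are invented (they are not the GSS93 counts), but the formulas are the standard ones SPSS uses in the Risk Estimate table:

```python
# Sketch of the 2 x 2 risk computations, using an invented table:
#               event     no event
# exposed         a           b
# unexposed       c           d
# Relative risk = [a/(a+b)] / [c/(c+d)];  odds ratio = (a*d) / (b*c).

a, b = 40, 60    # hypothetical: owners who voted / did not vote
c, d = 20, 80    # hypothetical: renters who voted / did not vote

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)

print(round(relative_risk, 3), round(odds_ratio, 3))
```

Notice that with these made-up counts the event is not rare, so the odds ratio (about 2.67) noticeably overstates the relative risk (2.0), which illustrates why the rare-event condition below matters.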
In the GSS93 data file, there is a variable (dwelown) that measures home ownership (owner or
renter) and another variable (vote92) that measures voting (voted or did not vote). We would like
to find out whether home owners are more likely to vote than renters.
In the Variable View window, note all the codes that have been used for the two variables of
interest. For example, dwelown uses code 3 for other and code 8 for don’t know, while vote92
uses code 3 for not eligible and code 4 for refused. Select the cases with dwelown less than 3
and vote92 less than 3.
From the menus choose:
Data -> Select Cases
Select If condition is satisfied and click If.
Enter dwelown < 3 & vote92 < 3 as the condition and click Continue then OK.
In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then
select:
Row(s): Homeowner or Renter [dwelown]
Column(s): Voting in 1992 Election [vote92]
Cells…
Percentages select Row, click Continue then OK
Examine and interpret the output.
From the crosstabulation table, what can you conclude?
Recall the crosstabs dialogue box. In the Crosstabs dialogue box, select:
Statistics…
Select Risk, click Continue
Examine the output and interpret it.
Look at the table called Risk Estimate, what can you conclude?
The odds ratio should be used as an approximation to the relative risk when the following
conditions are met:
The probability of the event is small (<0.1). This condition guarantees that the odds
ratio will make a good approximation to the relative risk.
The design of the study is case-control.
These conditions are not met in the present example. In a smoking and lung cancer study,
however, the conditions would be met and you could use the odds ratio.
Example 5: The Kappa Measure of Agreement for an R x R Table
Cohen's kappa measures the agreement between the evaluations of two raters when both are
rating the same object. A value of 1 indicates perfect agreement. A value of 0 indicates that
agreement is no better than chance. Values of kappa greater than 0.75 indicate excellent
agreement beyond chance; values between 0.40 and 0.75 indicate fair to good agreement; and values
below 0.40 indicate poor agreement. Kappa is only available for tables in which both
variables use the same category values and both variables have the same number of
categories.
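The kappa calculation itself is short. The square table below is invented (two raters classifying the same 100 subjects into three categories); the formula is the standard one, comparing observed agreement with the agreement expected by chance from the marginal totals:

```python
# Sketch of Cohen's kappa for a square table of two raters' classifications
# (rows = rater A, columns = rater B). The counts below are invented.
#   kappa = (p_observed - p_expected) / (1 - p_expected)

table = [[25,  5,  0],
         [ 4, 30,  6],
         [ 1,  5, 24]]

n = sum(sum(row) for row in table)
p_obs = sum(table[i][i] for i in range(len(table))) / n   # diagonal = agreement

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))
```

For these invented counts kappa is about 0.68, which by the guideline above would be read as fair to good agreement beyond chance.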
The table for the kappa statistic is square (R x R), with the same row and column
categories, because each subject is classified or rated twice. For example, doctor A and doctor
B diagnose the same patients as schizophrenic, manic depressive, or behaviour-disordered. Do
the two doctors agree or disagree in their diagnoses? Two teachers assess a class of 18-year-old
students. Do the teachers agree or disagree in their assessments?
In the GSS93 subset data file, we have variables that assess the educational level of
respondent’s father (padeg) and mother (madeg). Is there any agreement between the father’s
and mother’s educational levels?
To produce the output, use Select Cases from the Data menu to select cases with madeg not
equal to 2 and padeg not equal to 2 (madeg ~= 2 & padeg ~= 2). In the Crosstabs dialogue
box, click Reset to restore the dialogue box defaults, and then select:
Row(s): Father’s Highest Degree [padeg]
Column(s): Mother’s Highest Degree [madeg]
Statistics… Select kappa, click Continue
Cells…
Percentages: select Total, click Continue then OK
Examine and interpret the output.
Look at the tables from the output. What can you conclude?
Intraclass Correlation Coefficients (ICC)
We can use ICC to assess inter-rater agreement when there are more than two raters. For
example, the International Olympic Committee (IOC) trains judges to assess gymnastics
competitions. How can we find out if the judges are in agreement? ICC can help us to answer
this question. Judges have to be trained to ensure that good performances receive higher
scores than average performances, and average performances receive higher scores than poor
performances; even though two judges may differ on the precise score that should be
assigned to a particular performance.
Use the data set judges.sav to illustrate how to use SPSS to calculate ICC.
Open the data set. From the menu select:
Analyze -> Scale -> Reliability
Items: judge1, judge2, judge3, judge4, judge5, judge6, judge7
Statistics… Under Descriptives for, check Item.
Check Intraclass correlation coefficient
Model: Two-Way Random
Type: Consistency
Confidence interval: 95%
Test value: 0
Examine and interpret the output. What would you conclude?
Types of Survival Analyses and when to use them in SPSS
Life Tables: Use life tables if cases can be classified into meaningful, equal time intervals.
A life table can be used to calculate the probability of a terminal event during any interval
under study.
Kaplan-Meier: Use this technique if cases cannot be classified into equal time intervals as
above. This is common in many clinical and experimental studies.
Cox Regression: Use this technique if you want to see the relationship between survival time and
a predictor variable, for instance age or tumour type.
Using Kaplan-Meier Survival Analysis to Test Competing Pain
Relief Treatments
A pharmaceutical company is developing an anti-inflammatory medication for treating
chronic arthritic pain. Of particular interest is the time it takes for the drug to take effect and
how it compares to an existing medication. Shorter times to effect are considered better.
The results of a clinical trial are collected in pain_medication.sav. This data file is stored in
the following folder: \\campus\software\dept\spss. Open the file and study it. Use Kaplan-
Meier Survival Analysis to examine the distribution of "time to effect" and compare the
effectiveness of the two treatments.
To run a Kaplan-Meier Survival Analysis, from the menus choose:
Analyze
Survival
Kaplan-Meier...
Select Time to effect [time] as the Time variable.
Select Effect status [status] as the Status variable.
Click Define Event.
Under Value(s) Indicating Event Has Occurred type 1 in the text area next to
Single value:.
Click Continue.
Select Treatment [treatment] as a Factor.
Click Compare Factor.
Select Log rank, Breslow, and Tarone-Ware.
Click Continue.
Click Options in the Kaplan-Meier dialog box.
Select Quartiles in the Statistics group and Survival in the Plots group.
Click Continue.
Click OK in the Kaplan-Meier dialog box.
Interpretation
Survival Table
The survival table is a descriptive table that details the time until the drug takes effect. The
table is sectioned by each level of Treatment, and each observation occupies its own row in
the table.
Time: The time at which the event or censoring occurred.
Status: Indicates whether the case experienced the terminal event or was censored.
Cumulative Proportion Surviving at the Time: The proportion of cases surviving from the
start of the table until this time. When multiple cases experience the terminal event at the
same time, these estimates are printed once for that time period and apply to all the cases
whose drug took effect at that time.
N of Cumulative Events: The number of cases that have experienced the terminal event from
the start of the table until this time.
N of Remaining Cases: The number of cases that, at this time, have yet to experience the
terminal event or be censored.
Survival Functions (Curves)
The survival curves give a visual representation of the life tables. The horizontal axis shows
the time to event. In this plot, drops in the survival curve occur whenever the medication
takes effect in a patient. The vertical axis shows the probability of survival. Thus, any point
on the survival curve shows the probability that a patient on a given treatment will not have
experienced relief by that time.
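The curve described above is built by the Kaplan-Meier product-limit rule: at each event time, the running survival estimate is multiplied by the fraction of at-risk cases that did not experience the event. The sketch below uses invented (time, event) pairs, not the pain_medication.sav data:

```python
# Sketch of the Kaplan-Meier product-limit estimate with invented data.
# Each observation is (time, event) with event = 1 when the drug took
# effect and 0 when the case was censored. At each event time t:
#   S(t) = S(previous) * (1 - d_t / n_t)
# where d_t cases had the event and n_t were still at risk just before t.

data = [(2, 1), (3, 1), (4, 0), (5, 1), (5, 1), (6, 0), (8, 1)]

data.sort()
n_at_risk = len(data)
survival = 1.0
curve = []                      # (time, estimated survival) at event times
i = 0
while i < len(data):
    t = data[i][0]
    d = sum(1 for time, event in data if time == t and event == 1)
    removed = sum(1 for time, event in data if time == t)
    if d > 0:
        survival *= (1 - d / n_at_risk)
        curve.append((t, round(survival, 4)))
    n_at_risk -= removed
    i += removed
print(curve)
```

Note how the censored cases (event = 0) do not drop the curve but do shrink the at-risk count, which is why censoring still affects later estimates.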
The plot for the New drug lies below that of the Existing drug throughout most of the trial, which
suggests that the new drug may give faster relief than the old one. To determine whether these
differences are due to chance, look at the comparison tables.
Means and Medians for Survival Time
The means and medians for survival time table offers a quick numerical comparison of the
"typical" times to effect for each of the medications. Since there is a lot of overlap in the
confidence intervals, it is unlikely that there is much difference in the "average" survival
time.
Percentiles
The percentiles table gives estimates of the first quartile, median, and third quartile of the
survival distribution. The interpretation of percentiles for survival curves is that the 75th
percentile is the latest time that at least 75 percent of the patients have yet to feel relief.
Overall Comparisons
This table provides overall tests of the equality of survival times across groups. Since the
significance values of the tests are all greater than 0.05, you cannot conclude that there is a
difference between the survival curves.
Summary
With the Kaplan-Meier Survival Analysis procedure, you have examined the distribution of
time to effect for two different medications. The comparison tests show that there is not a
statistically significant difference between them.
Recommended Readings
1. Hosmer, D. W., and S. Lemeshow. 1999. Applied Survival Analysis. New York: John
Wiley and Sons.
2. Kleinbaum, D. G. 1996. Survival Analysis: A Self-Learning Text. New York:
Springer-Verlag.
3. Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion. Upper
Saddle River, N.J.: Prentice Hall, Inc.
Binary Logistic Regression Model
In this type of model, you estimate the probability of an event occurring. The model can be
written as:
Prob(event) = 1 / (1 + e^(−z))
For a single independent variable:
z = b0 + b1x1
For multiple independent variables:
z = b0 + b1x1 + b2x2 + … + bnxn
where b0, b1, b2, … are coefficients estimated from the data, x1, x2, … are the independent
variables, n is the number of independent variables and e is the base of natural logarithms
(approximately 2.718).
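The model above can be computed directly once the coefficients are known. The coefficients in the sketch below are hypothetical, purely to show the arithmetic; the exercise that follows estimates real ones from cancer.sav:

```python
# The logistic model, computed directly. The coefficients here are
# hypothetical illustrations, not estimates from cancer.sav.
import math

def prob_event(b0, bs, xs):
    """Prob(event) = 1 / (1 + e^(-z)), with z = b0 + b1*x1 + ... + bn*xn."""
    z = b0 + sum(b * x for b, x in zip(bs, xs))
    return 1 / (1 + math.exp(-z))

# A single hypothetical predictor: b0 = -2, b1 = 0.8, x1 = 3
p = prob_event(-2.0, [0.8], [3.0])
print(round(p, 4))
```

Because of the 1 / (1 + e^(−z)) shape, the result is always a probability between 0 and 1, whatever value z takes.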
Exercise
The data held in the file cancer.sav are from a study reported by Brown (1980) and are
commonly cited in texts considering binary logistic regression. The prognosis for prostate
cancer is based upon whether or not the cancer has spread to the surrounding lymph nodes. In
this classic study Brown et al (see Brown, 1980) explored the following separate indicators
for lymph node involvement in a group of 53 men known to have prostate cancer. To open
the data file, follow these instructions:
1. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
3. Select the file cancer.sav and click on Open.
4. Spend some time to study the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
The variables (corresponding to columns in the data file) are:
1) age - age of patients in years.
2) acid - level of serum acid phosphatase (acid level in King-Armstrong units)
3) xray - x-ray result (0 = negative, 1 = positive)
4) size - size of tumour (0 = small, 1 = large)
5) stage - stage of tumour (0 = less serious, 1 = more serious)
6) nodes - nodal involvement (0 = not involved, 1 = involved)
Modelling
Carry out a Forward Conditional logistic regression analysis of the data using nodal
involvement as the dependent variable and the other variables as independent variables (i.e.
covariates). You do not need to define xray, size or stage as being categorical variables, since
they are already binary variables. Follow these steps to carry out the Forward Conditional
binary logistic regression:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Method: Forward Conditional
Use the output to answer the following questions.
Look at the table Case Processing Summary. What do you conclude?
Now look at the three tables under Block 0: Beginning Block. What do you conclude?
Now look at the tables under Block 1: Method=Forward (stepwise) conditional. What do
you conclude?
Give the logistic regression equation for the final model.
Predictions
Carry out another logistic regression analysis of the data using nodal involvement as the
dependent variable but this time including ALL the covariates in the model, i.e. using the
ENTER method. Also request the Odds Ratio (OR) and the 95% Confidence Interval (CI) of
OR. Follow these steps:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Save: Under Predicted Values select Probabilities
Options: select CI for exp(B): 95%
Method: Enter
1. Give the coefficients for the full model, i.e. including all the variables. [Normally
you would only consider the statistically significant variables].
2. Which coefficients are statistically significant and why?
3. What is the probability of nodal involvement for each man in the data set? Which
case has the highest probability and which case the lowest probability of nodal
involvement?
4. Select one significant variable and give its OR and 95% CI. How would you interpret
the OR and its 95% CI?
Reference
Brown, B. W., Jr., et al. 1980. Prediction Analyses for Binary Data. In Biostatistics Casebook.
New York: John Wiley and Sons.
Multivariate Analysis of Variance (MANOVA)
The GLM Multivariate procedure allows you to model the values of multiple dependent scale
variables, based on their relationships to categorical and scale predictors.
The GLM Multivariate procedure is based on the general linear model, in which factors and
covariates are assumed to have linear relationships to the dependent variables.
Fixed Factors: Categorical predictors should be selected as factors in the model. Each level
of a factor can have a different linear effect on the value of the dependent variables. The
GLM Multivariate procedure assumes that all the model factors are fixed; that is, they are
generally thought of as variables whose values of interest are all represented in the data file,
usually by design.
Covariates: Scale predictors should be selected as covariates in the model. Within
combinations of factor levels (or cells), values of covariates are assumed to be linearly
correlated with values of the dependent variables.
Interactions: By default, the GLM Multivariate procedure produces a model with all
factorial interactions, which means that each combination of factor levels can have a different
linear effect on the dependent variable. Additionally, you may specify factor-covariate
interactions, if you believe that the linear relationship between a covariate and the dependent
variables changes for different levels of a factor.
For the purposes of testing hypotheses concerning parameter estimates, the GLM
Multivariate procedure assumes:
• The values of errors are independent of each other across observations and the
independent variables in the model. Good study design generally avoids violation of
this assumption.
• The covariance of dependent variables is constant across cells. This can be
particularly important when there are unequal cell sizes; that is, different numbers of
observations across factor-level combinations.
• Across the dependent variables, the errors have a multivariate normal distribution
with a mean of 0.
As part of the initial treatment for myocardial infarction (MI, or "heart attack"), a
thrombolytic, or "clot-busting", drug is sometimes administered to help clear the patient's
arteries before surgery. Three of the available drugs are alteplase, reteplase, and
streptokinase. Alteplase and reteplase are newer, more expensive drugs, and a regional health
care system wants to determine whether they are cost-effective enough to adopt in place of
streptokinase. One of the benefits of thrombolytic drugs is that surgery generally proceeds
more smoothly, resulting in a shorter recovery period. If the newer drugs are effective, then
patients given those drugs should have shorter lengths of stay in the hospital. Hopefully, the
shorter lengths of stay will help to make up for the greater initial cost of the newer drugs.
Running The Analysis
1. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
3. Select the file heart.sav and click on Open.
4. Spend some time studying the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
To run a GLM Multivariate analysis, from the menus choose:
1. Analyze -> General Linear Model -> Multivariate...
2. Select Length of stay [los] and Treatment costs [cost] as dependent variables.
3. Select Clot-dissolving drugs [clotsolv] and Surgical treatment [proc] as fixed factors.
4. Click Contrasts.
5. Select clotsolv (None) as the contrast to change.
6. In the Change Contrast group, select Simple as the contrast type.
7. Select First as the reference category.
8. Click Change, then click Continue.
9. Click Options in the GLM Multivariate dialogue box.
10. Select Estimates of effect size, SSCP matrices, Homogeneity tests and Spread vs.
level plot.
11. Click Continue and then OK in the GLM Multivariate dialogue box.
By default, a model is fit with Clot-dissolving drugs and Surgical treatment as main effects
and their interaction as a two-way effect.
Interpretation of Results
SSCP Matrices and Multivariate Test
This table displays the hypothesis and error sum-of-squares and cross-products (SSCP)
matrices for testing model effects. Since there are two dependent variables, each matrix has
two columns and two rows.
For example, the 2x2 matrix associated with CLOTSOLV in the table is the hypothesis
matrix for testing the significance of Clot-dissolving drugs.
The matrix associated with PROC in the table is the hypothesis matrix for testing the
significance of Surgical treatment, and the matrix associated with PROC*CLOTSOLV is
used for testing their interaction effect.
The error matrix is used in testing each effect. By analogy with the test for models with one
dependent variable, the "ratio" of the hypothesis SSCP matrix to the error matrix is used to
evaluate the effect of interest.
The multivariate test table displays four tests of significance for each model effect.
Pillai's trace is a positive-valued statistic. Increasing values of the statistic indicate effects
that contribute more to the model.
Wilks' Lambda is a positive-valued statistic that ranges from 0 to 1. Decreasing values of the
statistic indicate effects that contribute more to the model.
Hotelling's trace is the sum of the eigenvalues of the test matrix. It is a positive-valued
statistic for which increasing values indicate effects that contribute more to the model.
Hotelling's trace is always larger than Pillai's trace, but when the eigenvalues of the test
matrix are small, these two statistics will be nearly equal. This indicates that the effect
probably does not contribute much to the model.
Roy's largest root is the largest eigenvalue of the test matrix. Thus, it is a positive-valued
statistic for which increasing values indicate effects that contribute more to the model.
Roy's largest root is always less than or equal to Hotelling's trace. When these two statistics
are equal, the effect is predominantly associated with just one of the dependent variables,
there is a strong correlation between the dependent variables, or the effect does not contribute
much to the model.
There is evidence that Pillai's trace is more robust than the other statistics to violations of
model assumptions (Olson, 1974).
Each multivariate statistic is transformed into a test statistic with an approximate or exact F
distribution.
The hypothesis (numerator) and error (denominator) degrees of freedom for that F
distribution are shown.
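All four statistics can be computed from the eigenvalues of the matrix E^-1 H (the error SSCP matrix inverted and multiplied by the hypothesis SSCP matrix). SPSS does this for you, but a minimal Python sketch, using made-up eigenvalues rather than values from heart.sav, makes the relationships between the statistics concrete:

```python
# The four multivariate test statistics, computed from the eigenvalues
# of E^-1 * H. The eigenvalues below are illustrative only.

def multivariate_stats(eigenvalues):
    pillai = sum(lam / (1 + lam) for lam in eigenvalues)   # Pillai's trace
    wilks = 1.0
    for lam in eigenvalues:                                 # Wilks' lambda
        wilks *= 1 / (1 + lam)
    hotelling = sum(eigenvalues)                            # Hotelling's trace
    roy = max(eigenvalues)                                  # Roy's largest root
    return pillai, wilks, hotelling, roy

pillai, wilks, hotelling, roy = multivariate_stats([0.9, 0.05])
```

Because each eigenvalue contributes λ to Hotelling's trace but only λ/(1+λ) to Pillai's trace, Pillai's trace never exceeds Hotelling's trace, and Roy's largest root (a single eigenvalue) never exceeds their sum, matching the relationships described above.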
The significance values of the main effects, CLOTSOLV and PROC, are less than 0.05,
indicating that the effects contribute to the model.
By contrast, their interaction effect does not contribute to the model.
However, although CLOTSOLV does contribute to the model, the closeness of Pillai's trace
to Hotelling's trace suggests that it does not contribute very much.
The multivariate test table
A more straightforward way to see this is to look at the partial eta squared. The partial eta
squared statistic reports the "practical" significance of each term, based on the ratio of the
variation accounted for by the effect to the sum of that variation and the variation left to
error.
Larger values of partial eta squared indicate a greater amount of variation accounted for by
the model effect, to a maximum of 1. Since the partial eta squared is very small for
CLOTSOLV, it does not contribute very much to the model. By contrast, the partial eta
squared for PROC is quite large, which is to be expected. The surgical procedure a patient
must undergo for MI treatment is going to have a much greater effect on the length of their
hospital stay and final cost than the type of thrombolytic they receive.
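The formula behind the statistic is simply SS_effect / (SS_effect + SS_error). A small sketch, using hypothetical sums of squares rather than values from the SPSS output:

```python
# Partial eta squared: SS_effect / (SS_effect + SS_error).
# The sums of squares below are hypothetical, for illustration only.

def partial_eta_squared(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)

# An effect that is small relative to error yields a small partial eta squared:
small = partial_eta_squared(ss_effect=4.0, ss_error=396.0)    # 0.01
# An effect that is large relative to error yields a large one:
large = partial_eta_squared(ss_effect=300.0, ss_error=100.0)  # 0.75
```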
In this case, it is enough for the multivariate tests to show that CLOTSOLV is significant,
which means that the effect of at least one of the drugs is different from the others. The
contrast results will show you where the differences are.
This table displays results for each contrast. Simple contrasts using the first level of Clot-
dissolving drugs as the reference category were specified.
Thus, one contrast compares the second level to the first level; that is, the effect of reteplase
to the effect of streptokinase.
The contrast estimates show that, on average, patients given reteplase spend 0.382 fewer days
in the hospital and incur almost 600 dollars more in treatment costs than patients given
streptokinase.
Since the significance value for Length of stay is less than 0.05, you can conclude this
difference is not due to chance.
The significance value for Treatment costs is greater than 0.10, so this difference may be
entirely due to chance variation.
The second contrast compares the third level to the first level; that is, the effect of alteplase to
the effect of streptokinase.
The contrast estimates show that, on average, patients given alteplase spend about half a day
less in the hospital and incur slightly over 700 dollars more in treatment costs.
Since the significance value for Length of stay is less than 0.05, you can conclude this
difference is not due to chance.
The significance value for Treatment costs is greater than 0.10, so this difference may be
entirely due to chance variation.
The contrast results show that alteplase and reteplase seem to reduce patient length of stay.
Moreover, the reduction is enough to equalize the treatment costs, or at least bring the
difference within the random variation.
Thus, the model suggests that alteplase and reteplase should be used in place of streptokinase.
However, before adopting this plan, you should check some tests of the model assumptions.
The assumption for the multivariate approach is that the vector of the dependent variables
follows a multivariate normal distribution, and the variance-covariance matrices are equal
across the cells formed by the between-subject effects.
Box's M tests the null hypothesis that the observed covariance matrices of the dependent
variables are equal across groups.
The Box's M test statistic is transformed to an F statistic with df1 and df2 degrees of freedom.
Here, the significance value of the test is less than 0.05, suggesting that the assumptions are
not met, and thus the model results are suspect.
Box's M is sensitive to large data files, meaning that when there are a large number of cases,
it can detect even small departures from homogeneity. Moreover, it can be sensitive to
departures from the assumption of normality. As an additional check of the diagonals of the
covariance matrices, look at Levene's tests.
This table tests equality of the error variances across the cells defined by the combination of
factor levels.
A separate test is performed for each dependent variable.
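Levene's test is, in essence, a one-way ANOVA carried out on the absolute deviations of each observation from its cell centre. As a rough illustration, this sketch centres each group on its median (the Brown-Forsythe variant of the test) and uses toy data rather than heart.sav:

```python
# Levene-type test for equal variances: a one-way ANOVA on absolute
# deviations from each group's centre. This sketch uses the group median
# (the Brown-Forsythe variant); the data are illustrative only.

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def levene_f(groups):
    # Absolute deviations of each observation from its group median
    devs = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(devs)                                # number of groups
    n = sum(len(d) for d in devs)                # total observations
    grand = sum(x for d in devs for x in d) / n
    means = [sum(d) / len(d) for d in devs]
    ss_between = sum(len(d) * (m - grand) ** 2 for d, m in zip(devs, means))
    ss_within = sum((x - m) ** 2 for d, m in zip(devs, means) for x in d)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = levene_f([[1, 2, 3], [1, 5, 9]])  # the second group is more spread out
```

A large F statistic here is evidence against the equal variances assumption, just as in the SPSS table.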
The significance value for Length of stay is greater than 0.10, so there is no reason to believe
that the equal variances assumption is violated for this variable.
However, the significance value for the test of Treatment costs is less than 0.05, indicating
that the equal variances assumption is violated for this variable.
Like Box's M, Levene's test can be sensitive to large data files, so look at the spread vs. level
plot for Treatment costs for visual confirmation.
The spread-versus-level plot is a scatterplot of the cell means and standard deviations.
It provides a visual test of the equal variances assumption, with the added benefit of helping
you to assess whether violations of the assumption are due to a relationship between the cell
means and standard deviations.
This plot agrees with the result of Levene's test, that the equal variances assumption is
violated for Treatment costs.
There is also a clear positive relationship in the scatterplot, showing that as the cell mean
increases, so does the variability. This relationship suggests a possible solution to the
problem.
Since Treatment costs is a positive-valued variable, you could propose that the error term has
a multiplicative, rather than additive, effect on cost. Instead of modeling Treatment costs, you
will analyze Log-cost.
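The reasoning behind the transformation is a one-line identity: if cost = baseline x error, then log(cost) = log(baseline) + log(error), so an error that is multiplicative on the original scale becomes additive on the log scale. In Python, with illustrative numbers only:

```python
import math

# If the error acts multiplicatively, cost = baseline * error, then on the
# log scale it becomes additive: log(cost) = log(baseline) + log(error).
# The values below are illustrative only.
baseline, error = 30000.0, 1.10   # a hypothetical 10% multiplicative overrun
cost = baseline * error

lhs = math.log(cost)                            # log of the observed cost
rhs = math.log(baseline) + math.log(error)      # additive decomposition
```

This is why an error whose size grows with the cell mean, as in the spread-versus-level plot, can become roughly constant after the log transformation.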
To run an analysis using log-transformed costs, click the Dialog Recall tool and select GLM
Multivariate (or select Analyze -> General Linear Model -> Multivariate...).
1. Deselect Treatment costs as a dependent variable.
2. Select Log-cost as the dependent variable.
3. Click OK in the GLM Multivariate dialogue box.
Box's M is significant, while Levene's test is not. This can happen for several reasons:
• The covariance between Length of stay and Log-cost is not constant across cells, and
thus the model results are suspect.
• The covariances are unequal, though not by much, but the large size of the data file
causes Box's M to be overly sensitive to this departure from homogeneity.
• The covariances are equal, but the test procedure for computing Box's M, a
multivariate test, simply comes up with a different result than the univariate test.
• The distribution of Length of stay and Log-cost is different enough from a
multivariate normal distribution to cause Box's M to be significant.
In order to help decide whether you should be concerned about the significance of Box's M,
some exploratory data analysis is in order. You can use the Explore procedure to check the
assumption of normality. With the data file split by the cells, you can use the Bivariate
Correlations procedure to see whether the correlations are constant across cells.
The results for Length of stay are identical to the results from the previous model.
However, the results for Log-cost are different from those for Treatment costs.
The significance values for both contrasts are less than 0.05, suggesting that the differences in
costs between the newer drugs and streptokinase are not due to chance.
The contrast estimate for the difference between reteplase and streptokinase is 0.0217. Since
you are looking at differences in log-transformed cost, this means that the ratio of costs is
exp(0.0217) = 1.0219. That is, the costs incurred by patients given reteplase are
approximately 2.19 percent higher than the costs incurred by patients given streptokinase. If
the typical MI patient incurs 25,000 to 35,000 dollars in treatment costs, reteplase patients
incur, roughly, an extra 550 to 770 dollars in costs.
The contrast estimate for the difference between alteplase and streptokinase is 0.0243. Since
you are looking at differences in log-transformed cost, this means that the ratio of costs is
exp(0.0243) = 1.0246. That is, the costs incurred by patients given alteplase are
approximately 2.46 percent higher than the costs incurred by patients given streptokinase. If
the typical MI patient incurs 25,000 to 35,000 dollars in treatment costs, alteplase patients
incur, roughly, an extra 600 to 860 dollars in costs.
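The back-transformation in the two paragraphs above is straightforward to reproduce. The following sketch uses the contrast estimates and the 25,000 to 35,000 dollar range quoted in the text:

```python
import math

# Back-transform a contrast estimate on the log scale into a cost ratio,
# then into an extra-cost range for a typical MI patient incurring
# 25,000 to 35,000 dollars in treatment costs.

def extra_cost_range(log_contrast, low=25000, high=35000):
    ratio = math.exp(log_contrast)   # ratio of costs on the original scale
    return ratio, (ratio - 1) * low, (ratio - 1) * high

ratio_ret, lo_ret, hi_ret = extra_cost_range(0.0217)  # reteplase vs. streptokinase
ratio_alt, lo_alt, hi_alt = extra_cost_range(0.0243)  # alteplase vs. streptokinase
```

Rounding the resulting ranges gives the "extra 550 to 770 dollars" and "extra 600 to 860 dollars" figures quoted above.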
These contrast results show that while alteplase and reteplase do seem to reduce patient
length of stay, the reduction is not enough to equalize the treatment costs.
Thus, determining whether alteplase and reteplase should be used in place of streptokinase
will require further study of the cost of these drugs versus their effectiveness at increasing the
success of surgery.
Using the GLM Multivariate procedure, you have performed a multivariate analysis of
variance on the patient lengths of stay and treatment costs, using the surgical procedure
performed and thrombolytic administered as fixed factors. Your initial model indicated that
the final treatment costs for reteplase and alteplase are not significantly different from those
for streptokinase. However, that model violated the equal variances assumption. The spread
vs. level plot showed that a log-transformation of Treatment costs might be appropriate, so
the model was re-run, replacing Treatment costs with Log-cost as a dependent variable. This
second model passed Levene's test, but now showed a significant difference in the final costs
for thrombolytics. The new difference in costs translates to an extra 550 to 860 dollars for the
"average" MI patient, so further study of the cost-effectiveness of the new drugs is necessary.
What happened? The differences in Treatment costs in the original model fall in the range of
550 to 860 dollars, but that model did not find the difference to be significant. Why should it
matter now? Since Treatment costs is a positive-valued variable, its distribution is probably
right-skewed, so it is likely that there are patients who incurred unusually high costs, thus
inflating the error variation in the first model. Log-transforming Treatment costs reduces the
influence of these high-cost patients. In this case, it was enough to make the differences in
costs statistically significant.
Once satisfied with Log-cost as a dependent variable, you should fit a "final" model without
the interaction term, because it has not contributed to either of the first two models.
Recommended Reading
See the following texts for more information on multivariate linear models:
1. Bray, J. H., and S. E. Maxwell. 1985. Multivariate Analysis of Variance. Thousand
Oaks, Calif.: Sage Publications, Inc.
2. Norusis, M. 2004. SPSS 13.0 Statistical Procedures Companion. Upper Saddle River,
N.J.: Prentice Hall, Inc.
3. Olson, C. L. 1974. Comparative Robustness of Six Tests in Multivariate Analysis of
Variance. Journal of the American Statistical Association, 69:348, 894-908.