Guide to SPSS Multinomial Logit Regression


Slide 1

The SPSS Sample Problem

To demonstrate multinomial logistic regression, we will work the sample problem for multinomial logistic regression in SPSS Regression Models 10.0, pages 65 - 82.

The description of the problem found on page 66 states that the 1996 General Social Survey asked people who they voted for in 1992.  Demographic variables from the GSS, such as sex, age, and education, can be used to identify the relationships between voter demographics and voter preference.

The data for this problem is: voter.sav.


Slide 2

Stage One: Define the Research Problem

In this stage, the following issues are addressed:

•Relationship to be analyzed
•Specifying the dependent and independent variables
•Method for including independent variables


Relationship to be analyzed

The goal of this analysis is to examine the relationship between presidential choice in 1992, sex, age, and education.


Slide 3

Specifying the dependent and independent variables

The dependent variable is pres92 ‘Vote for Clinton, Bush, Perot.’ It has three categories: 1 is a vote for Bush, 2 is a vote for Perot, and 3 is a vote for Clinton.  SPSS will solve the problem by contrasting votes for Bush to votes for Clinton, and votes for Perot to votes for Clinton.  By default, SPSS uses the highest-numbered category as the reference category.
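If we wanted to make that default explicit, or choose a different baseline, the dependent variable on the NOMREG command can be given a BASE specification. A minimal sketch, not part of the sample problem, mirroring the factor/covariate specification used later in the validation analysis:

NOMREG pres92 (BASE=LAST) BY degree sex WITH age educ
  /PRINT = PARAMETER SUMMARY LRT .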

The independent variables which we will use in this analysis are:

•AGE   ‘Age of respondent’
•EDUC   ‘Highest year of school completed’
•DEGREE   ‘Respondent’s Highest Degree’
•SEX   ‘Respondent’s Sex’


Method for including independent variables

The only method for including variables in multinomial logistic regression in SPSS is direct entry of all variables.


Slide 4

Stage 2: Develop the Analysis Plan: Sample Size Issues

In this stage, the following issues are addressed:

•Missing data analysis
•Minimum sample size requirement: 15-20 cases per independent variable


Missing data analysis

Only 2 of the 1847 cases have any missing data.  Since the number of cases with missing data is so small, it is unlikely to produce a missing data process that is disruptive to the analysis.  We will bypass any missing data analysis.

Minimum sample size requirement: 15-20 cases per independent variable

The data set has 1845 cases and 4 independent variables for a ratio of 462 to 1, well in excess of the requirement that we have 15-20 cases per independent variable. 


Slide 5

Stage 2: Develop the Analysis Plan: Measurement Issues

In this stage, the following issues are addressed:

•Incorporating nonmetric data with dummy variables
•Representing Curvilinear Effects with Polynomials
•Representing Interaction or Moderator Effects


Incorporating Nonmetric Data with Dummy Variables

It is not necessary to create dummy variables for nonmetric data since SPSS will do this automatically when we specify that a variable is a “factor” in the model.

Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis, though the SPSS text for this problem points out that there is a curvilinear relationship between education and voting preference, which led them to create the variable Degree ‘Respondent’s Highest Degree’. Democrats (i.e. Clinton voters) are favored by both those with little formal education and those who have advanced degrees.
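If we did want to represent the curvilinear education effect with a polynomial instead, the squared term would have to be computed before it could be entered as an additional covariate. A minimal sketch; the variable name educsq is our own, not part of the sample problem:

* Hypothetical sketch: create a squared education term for a polynomial specification.
COMPUTE educsq = educ**2 .
EXECUTE .
* educsq would then be listed as an extra covariate, e.g. ... WITH age educ educsq.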

Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.  The SPSS procedure makes it very easy to add interaction terms.


Slide 6

Stage 3: Evaluate Underlying Assumptions

In this stage, the following issues are addressed:

•Nonmetric dependent variable with two or more groups
•Metric or nonmetric independent variables


Nonmetric dependent variable with two or more groups

The dependent variable pres92   ‘Vote for Clinton, Bush, Perot’ has three categories.

Metric or nonmetric independent variables

AGE and EDUC, as metric variables, will be entered as covariates in the model.  SEX and DEGREE, as nonmetric variables, will be entered as factors.


Slide 7

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation

In this stage, the following issues are addressed:

•Compute logistic regression model


Compute the logistic regression

The steps to obtain a logistic regression analysis are detailed on the following screens.


Slide 8

Requesting a Multinomial Logistic Regression


Slide 9

Specifying the Dependent Variable


Slide 10

Specifying the Independent Variables


Slide 11

Requesting additional statistics


Slide 12

Specifying Options to Include in the Output


Slide 13

Complete the multinomial logistic regression request


Slide 14

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit:  Assessing Model Fit

In this stage, the following issues are addressed:

•Significance test of the model log likelihood (Change in -2LL)
•Measures Analogous to R²: Cox and Snell R² and Nagelkerke R²
•Classification matrices as a measure of model accuracy
•Check for Numerical Problems
•Presence of outliers


Slide 15

Significance test of the model log likelihood

The Initial Log Likelihood Function (-2 Log Likelihood, or -2LL) is a statistical measure like the total sum of squares in regression. If our independent variables have a relationship to the dependent variable, we will improve our ability to predict the dependent variable accurately, and the log likelihood measure will decrease.

The initial log likelihood value (2718.636) is a measure of a model with no independent variables, i.e. only a constant or intercept. The final log likelihood value (2600.138) is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value (118.497 = 2718.636 - 2600.138) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² value in multiple regression which tests whether or not the improvement in the model associated with the additional variables is statistically significant.

In this problem the model Chi-Square value of 118.497 has a significance < 0.0001, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
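As an aside, the same test can be reproduced by hand with SPSS compute functions. A minimal sketch; the 14 degrees of freedom is our own assumption (two non-redundant equations times seven non-intercept parameters each: AGE, EDUC, four DEGREE dummies, and one SEX dummy), not a value read from the output, so the degrees of freedom reported in the Model Fitting Information table should be used in practice:

* Sketch: model (likelihood ratio) chi-square and its significance.
COMPUTE modelchi = 2718.636 - 2600.138 .
COMPUTE sig = 1 - CDF.CHISQ(modelchi, 14) .
EXECUTE .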


Slide 16

Measures Analogous to R²

The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.

The Cox and Snell R² measure operates like R², with higher values indicating greater model fit.  However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification whose range is 0 to 1.  We will rely upon Nagelkerke's measure as indicating the strength of the relationship.
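For reference, the two measures are defined by the standard formulas below, where L_0 is the likelihood of the intercept-only model, L_M the likelihood of the fitted model, and n the number of cases:

R^2_{CS} = 1 - \left( \frac{L_0}{L_M} \right)^{2/n}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}}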

If we applied our interpretive criteria to the Nagelkerke R², we would characterize the relationship as weak.


Slide 17

The Classification Matrix as a Measure of Model Accuracy

The classification matrix in multinomial logistic regression serves the same function as it does in other classification techniques, such as discriminant analysis, i.e. evaluating the accuracy of the model.

If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted group membership and actual group membership are different, the model "misses" for that case. The overall percentage of accurate predictions (49.9% in this case) is the measure of model fit that I rely on most heavily because it has a meaning that is readily communicated, i.e. the percentage of cases for which our model predicts accurately.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and, if appropriate, the maximum by chance accuracy rate. The proportional by chance accuracy rate is equal to 0.393 (0.358^2 + 0.150^2 + 0.492^2).  A 25% increase over the proportional by chance accuracy rate would equal 0.491.  Our model accuracy rate of 49.9% meets this criterion.
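The arithmetic can be reproduced with two compute statements; a minimal sketch using the group proportions quoted above:

* Sketch: proportional by chance accuracy and the 25% improvement criterion.
COMPUTE bychance = 0.358**2 + 0.150**2 + 0.492**2 .
COMPUTE criterion = 1.25 * bychance .
EXECUTE .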

Since one of our groups (voters for Clinton) contains 49.2% of the cases, we should also apply the maximum by chance criterion.  A 25% increase over the largest group would equal 0.614.  Our model accuracy rate of 49.9% fails to meet this criterion.  The usefulness of the relationship between the demographic variables and voter preference is therefore questionable.


Slide 18

Check for Numerical Problems

There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages: multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and "complete separation," whereby the groups defined by the dependent variable can be perfectly separated by scores on one of the independent variables.

All of these problems produce large standard errors (over 2) for the variables included in the analysis and very often produce very large B coefficients as well.  If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.
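One informal way to screen for multicollinearity, not part of the SPSS sample problem, is to request tolerance statistics from an ordinary regression of the predictors; a hedged sketch (the choice of dependent variable is immaterial here, since only the collinearity statistics are of interest):

* Sketch: collinearity check for the predictor variables.
REGRESSION
  /STATISTICS COEFF TOL
  /DEPENDENT pres92
  /METHOD=ENTER age educ degree sex .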

None of the standard errors or B coefficients is excessively large, so there is no evidence of a numerical problem with this analysis.


Slide 19

Presence of outliers


Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.
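A minimal sketch of that approach; the regression below is used only as a vehicle for saving the distances, and SPSS stores them in a new variable named MAH_1 by default:

* Sketch: save Mahalanobis distances on the combination of independent variables.
REGRESSION
  /DEPENDENT pres92
  /METHOD=ENTER age educ degree sex
  /SAVE MAHAL .
* Unusually large values of MAH_1 would flag potential multivariate outliers.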


Slide 20

Stage 5: Interpret the Results

In this section, we address the following issues:

•Identifying the statistically significant predictor variables
•Direction of relationship and contribution to dependent variable


Slide 21

Identifying the statistically significant predictor variables - 1

There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and the Parameter Estimates.  The Likelihood Ratio Tests indicate each variable's contribution to the overall relationship between the dependent variable and the set of independent variables.  The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable.

Each likelihood ratio test is a hypothesis test that the variable contributes to the reduction in error measured by the -2 log likelihood statistic.  In this model, the variables AGE, DEGREE, and SEX are all significant contributors to explaining differences in voting preference.
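In symbols, each test compares the fitted model to a reduced model with that variable's effects removed (a standard result, not specific to this output), with degrees of freedom equal to the number of parameters dropped:

\chi^2_{LR} = \bigl(-2LL_{\text{reduced}}\bigr) - \bigl(-2LL_{\text{full}}\bigr)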


Slide 22

Identifying the statistically significant predictor variables - 2

The two equations in the table of Parameter Estimates are labeled by the group they contrast with the reference group.  The first equation is labeled "1 Bush" and the second equation is labeled "2 Perot." The coefficients for each logistic regression equation are found in the column labeled B. The null hypothesis that a coefficient is zero, i.e. that it does not change the odds of the dependent variable event, is tested with the Wald statistic, instead of the t-test used for the individual B coefficients in the multiple regression equation.


The variables that had a statistically significant relationship to distinguishing voters for Bush from voters for Clinton in the first logistic regression equation were DEGREE=3 (bachelor's degree) and SEX=1 (male).

The variables that had a statistically significant relationship to distinguishing voters for Perot from voters for Clinton were AGE, DEGREE=2 (junior college degree), and SEX=1 (male).


Slide 23

Direction of relationship and contribution to dependent variable - 1

Interpretation of the independent variables is aided by the "Exp(B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows (a general conversion formula is given after the bullets):

•Having a bachelor's degree rather than an advanced degree increased the likelihood that a voter would choose Bush over Clinton by about 50%.

•Being a male increased the likelihood that a voter would choose Bush over Clinton by approximately 50% (almost 60%).
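These percentages are read directly from the odds ratios. In general, for a coefficient B (the standard logistic regression interpretation, not a value taken from this output):

\text{Exp}(B) = e^{B}, \qquad \%\ \text{change in odds} = 100\,(e^{B} - 1)

so, for example, an Exp(B) of about 1.5 corresponds to roughly a 50% increase in the odds of choosing Bush over Clinton.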


Slide 24

Direction of relationship and contribution to dependent variable

Interpretation of the independent variables is aided by the "Exp(B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows:

•Each one-year increase in age made a voter about 3% less likely to choose Perot over Clinton.

•Having a junior college degree made a person about 2.3 times more likely to choose Perot over Clinton.

•Being a male doubled the likelihood that a voter would choose Perot over Clinton.


Slide 25

Stage 6: Validate The Model

The SPSS multinomial logistic procedure does not include the ability to select a subset of cases based on the value of a variable, so we cannot use our usual strategy for conducting a validation analysis.

We can, however, accomplish the same results with a step-by-step series of syntax commands, as will be shown on the following screens.  We cannot run all of the syntax commands at one time because one of the steps requires us to manually type the coefficients from the SPSS output into the syntax file so that we can calculate predicted values for the logistic regression equations.

In order to understand the steps that we will follow, we need to understand how we translate scores on the logistic regression equations into classification in a group.

The multinomial logistic regression problem for three groups is solved by contrasting two of the groups with a reference group.  In this problem, the reference group is Clinton voters.  The classification score for the reference group is 0, just as the code for any reference group for dummy coded variables is 0.

The first logistic regression equation is used to compute a logistic regression score that would test whether or not the subject is more likely a member of the group of Bush voters rather than a member of the group of Clinton voters.  Similarly, the second logistic regression equation is used to test whether or not the subject is more likely to be a Perot voter than a Clinton voter.


Slide 26

Stage 6: Validate The Model (continued)

The classification problem, thus, involves the comparison of three scores, one associated with each of the groups.  The first score (which we will label g1) is associated with voting for Bush.  The second score (which we will label g2) is associated with voting for Perot.  The third score (which we will label g3) is associated with voting for Clinton.  Calculating g1 and g2 requires substituting each subject's values on the variables into the logistic regression equations.  g3 is always 0.

The scores g1, g2, and g3 are log estimates of the odds of belonging to each group.  To convert the scores into a probability of group membership, we convert each score into its antilog equivalent and divide by the sum of the three antilog equivalents.  To estimate group membership, we compare the three probabilities, and estimate that the subject is a member of the group associated with the highest probability.
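Restating the rule just described in formula form:

P(\text{group } j) = \frac{e^{g_j}}{e^{g_1} + e^{g_2} + e^{g_3}}, \qquad j = 1, 2, 3, \quad \text{with } g_3 = 0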


Slide 27

Computing the First Validation Analysis

The first step in our validation analysis is to create the split variable.

* Compute the split variable for the learning and validation samples.
SET SEED 2000000.
COMPUTE split = uniform(1) > 0.50 .
EXECUTE .
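As an optional check, not shown in the original problem, a frequency table of the new variable would confirm that it divides the sample roughly in half:

* Sketch: verify the approximate 50/50 split.
FREQUENCIES VARIABLES=split .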


Slide 28

Creating the Multinomial Logistic Regression for the First Half of the Data

Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.

* Select the cases to include in the first validation analysis.
USE ALL.
COMPUTE filter_$=(split=0).
FILTER BY filter_$.
EXECUTE .

* Run the multinomial logistic regression for these cases.
NOMREG pres92 BY degree sex WITH age educ
  /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /MODEL
  /INTERCEPT = INCLUDE
  /PRINT = CLASSTABLE PARAMETER SUMMARY LRT .


Slide 29

Entering the Logistic Regression Coefficients into SPSS

To compute the classification scores for the logistic regression equations, we need to enter the coefficients for each equation into SPSS.

Next, we enter the B coefficients into SPSS using compute commands. For the first set of coefficients (the Bush equation), we will use the letter A followed by a number.  For the second set of coefficients (the Perot equation), we will use the letter B followed by a number.  The complete set of compute commands is shown on the next slide.


Slide 30

Create the coefficients in SPSS

* Assign the coefficients from the model just run to variables.
compute A0 = 0.4371960543979.
compute A1 = 0.000141395117344668.
compute A2 = -0.0627104600309503.
compute A3 = -0.498317435855329.
compute A4 = 0.000960703262109129.
compute A5 = 0.100394066914469.
compute A6 = 0.0289917073909467.
compute A7 = 0.395907588574936.
compute B0 = -0.181048831733511.
compute B1 = -0.0230592828938232.
compute B2 = -0.0511669299018998.
compute B3 = -0.548361281578711.
compute B4 = 0.482047826644372.
compute B5 = 0.532843066729832.
compute B6 = 0.492518027246711.
compute B7 = 0.6773170430501.
execute.


Slide 31

Entering the Logistic Regression Equations into SPSS

Before we can enter the logistic regression equations, we need to explicitly create the dummy-coded variables which the NOMREG procedure created internally for the variables that we specified as factors.

* Create the dummy coded variables which SPSS created.
* Use a logical assignment to code the variables as 0 or 1.
compute degree0 = (degree = 0).
compute degree1 = (degree = 1).
compute degree2 = (degree = 2).
compute degree3 = (degree = 3).
compute degree4 = (degree = 4).
compute sex1 = (sex = 1).
execute.

The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.

compute g1 = A0 + A1 * AGE + A2 * EDUC + A3 * DEGREE0 + A4 * DEGREE1 + A5 * DEGREE2 + A6 * DEGREE3 + A7 * SEX1.

compute g2 = B0 + B1 * AGE + B2 * EDUC + B3 * DEGREE0 + B4 * DEGREE1 + B5 * DEGREE2 + B6 * DEGREE3 + B7 * SEX1.

compute g3 = 0.
execute.

When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.


Slide 32

Converting Classification Scores into Predicted Group Membership

We convert the three scores into their antilogs using the EXP function.  When we divide each antilog by the sum of the three antilogs, we end up with a probability of membership in each group.

* Compute the probabilities of membership in each group.
compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
execute.

The following IF statements compare the probabilities to predict group membership.

* Translate the probabilities into predicted group membership.
if (p1 > p2 and p1 > p3) predgrp = 1.
if (p2 > p1 and p2 > p3) predgrp = 2.
if (p3 > p1 and p3 > p2) predgrp = 3.
execute.

When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.


Slide 33

The Classification Table

To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted voting versus actual voting.

USE ALL.
COMPUTE filter_$=(split=1).
FILTER BY filter_$.
EXECUTE.
CROSSTABS
  /TABLES=pres92 BY predgrp
  /FORMAT= AVALUE TABLES
  /CELLS= COUNT TOTAL .

These commands produce the following table.  The classification accuracy rate is computed by adding the percentages for the cells where predicted voting coincides with actual voting behavior: 6.3% + 42.2% = 48.5%.

We enter this information in the validation table.


Slide 34

Computing the Second Validation Analysis

The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on the cases where split = 0.  The results from my calculations have been entered into the validation table below.


                                       Full Model              Split = 0            Split = 1
Model Chi-Square                       118.497, p < 0.0001     42.610, p < 0.0001   92.772, p < 0.0001
Nagelkerke R²                          0.072                   0.051                0.113
Accuracy Rate for Learning Sample      49.9%                   48.8%                50.6%
Accuracy Rate for Validation Sample    n/a                     48.5%                46.9%
Significant Coefficients (p < 0.05)
  Equation 1                           DEGREE=3, SEX=1         SEX=1                DEGREE=3, SEX=1
  Equation 2                           AGE, DEGREE=2, SEX=1    AGE, SEX=1           AGE, DEGREE=2, SEX=1


Slide 35

Generalizability of the Multinomial Logistic Regression Model

We can summarize the results of the validation analyses in the following table.


                                       Full Model              Split = 0            Split = 1
Model Chi-Square                       118.497, p < 0.0001     42.610, p < 0.0001   92.772, p < 0.0001
Nagelkerke R²                          0.072                   0.051                0.113
Accuracy Rate for Learning Sample      49.9%                   48.8%                50.6%
Accuracy Rate for Validation Sample    n/a                     48.5%                46.9%
Significant Coefficients (p < 0.05)
  Equation 1                           DEGREE=3, SEX=1         SEX=1                DEGREE=3, SEX=1
  Equation 2                           AGE, DEGREE=2, SEX=1    AGE, SEX=1           AGE, DEGREE=2, SEX=1

From the validation table, we see that the original model is supported by the accuracy rates for the validation analyses. SEX and AGE would appear to be the more reliable predictors of voting behavior. However, the relationship is weak and falls short of the classification accuracy criteria for a useful model.