Multinomial Logistic Regression: Detecting Outliers and Validating Analysis Outliers Split-sample Validation

Post on 21-Dec-2015

Multinomial Logistic Regression: Detecting Outliers and Validating Analysis

Outliers

Split-sample Validation

Outliers

Multinomial logistic regression in SPSS does not compute any diagnostic statistics.

In the absence of diagnostic statistics, SPSS recommends using the Logistic Regression procedure to calculate and examine diagnostic measures.

A multinomial logistic regression for three groups compares group 1 to group 3 and group 2 to group 3. To test for outliers, we will run two binary logistic regressions, using case selection to compare group 1 to group 3 and group 2 to group 3.

From both of these analyses we will identify the cases with studentized residuals larger than 2.0 in absolute value (greater than +2.0 or less than -2.0), and test the multinomial solution without these cases. If removing them improves the classification accuracy rate by less than 2%, we will interpret the model that includes all cases.
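The outlier criterion can be sketched outside SPSS as follows. This is only an illustration in Python: the case ids and residual values are hypothetical, and `flag_outliers` is a made-up helper name, not an SPSS function.

```python
# Hypothetical studentized residuals keyed by case id (illustrative values only).
# None stands for a case that was not part of the binary comparison.
residuals = {"A": 0.4, "B": -2.3, "C": 1.9, "D": 2.6, "E": None}


def flag_outliers(sresid, cutoff=2.0):
    """Return the ids whose studentized residual exceeds the cutoff in absolute value."""
    return sorted(
        case_id
        for case_id, value in sresid.items()
        if value is not None and abs(value) > cutoff
    )


print(flag_outliers(residuals))
```

Cases with a missing residual are skipped rather than flagged, which mirrors the fact that each binary regression only scores the two groups it compares.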

Example

To demonstrate the process for detecting outliers, we will examine the relationship between the independent variables "age" [age], "highest year of school completed" [educ], and "confidence in banks and financial institutions" [confinan] and the dependent variable "opinion about spending on social security" [natsoc].

Opinion about spending on social security contains three categories: 1 = too little, 2 = about right, 3 = too much.

With all cases, including those that might be identified as outliers, the accuracy rate was 63.7%. We note this to compare with the classification accuracy after removing outliers to determine which model we will interpret.

Request multinomial logistic regression for baseline model

Select the Regression | Multinomial Logistic… command from the Analyze menu.

Selecting the dependent variable

First, highlight the dependent variable natsoc in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Selecting metric independent variables

Move the metric independent variables, age, educ and confinan to the Covariate(s) list box.

Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal.

Specifying statistics to include in the output

While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table.

Click on the Statistics… button to make a request.

Requesting the classification table

First, keep the SPSS defaults for Model and Parameters.

Second, mark the checkbox for the Classification table.

Third, click on the Continue button to complete the request.

Completing the multinomial logistic regression request

Click on the OK button to request the output for the multinomial logistic regression.

The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.

Classification

                      Predicted
Observed              TOO LITTLE   ABOUT RIGHT   TOO MUCH   Percent Correct
TOO LITTLE                   100             5          0             95.2%
ABOUT RIGHT                   50             9          0             15.3%
TOO MUCH                       6             1          0               .0%
Overall Percentage         91.2%          8.8%        .0%             63.7%

Classification accuracy for all cases

With all cases, including those that might be identified as outliers, the accuracy rate was 63.7%.

We will compare the classification accuracy of the model with all cases to the classification accuracy of the model excluding outliers.
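As a cross-check, the overall accuracy rate can be recomputed from the cell counts in the classification table. The sketch below is in Python rather than SPSS, and the dictionary layout is just one convenient way to hold the table.

```python
# Classification table counts: observed group -> predicted group -> number of cases.
table = {
    "TOO LITTLE": {"TOO LITTLE": 100, "ABOUT RIGHT": 5, "TOO MUCH": 0},
    "ABOUT RIGHT": {"TOO LITTLE": 50, "ABOUT RIGHT": 9, "TOO MUCH": 0},
    "TOO MUCH": {"TOO LITTLE": 6, "ABOUT RIGHT": 1, "TOO MUCH": 0},
}


def accuracy(classification):
    """Share of cases on the diagonal (observed group equals predicted group)."""
    correct = sum(row[observed] for observed, row in classification.items())
    total = sum(sum(row.values()) for row in classification.values())
    return correct / total


print(f"{accuracy(table):.1%}")  # 63.7%
```

The diagonal holds 100 + 9 + 0 = 109 correctly classified cases out of 171, which reproduces the 63.7% reported by SPSS.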

Outliers for the comparison of groups 1 and 3

Since multinomial logistic regression does not identify outliers, we will use binary logistic regressions to identify them.

Choose the Select Cases… command from the Data menu to include only groups 1 and 3 in the analysis.

Selecting groups 1 and 3

First, mark the If condition is satisfied option button.

Second, click on the IF… button to specify the condition.

Formula for selecting groups 1 and 3

To include only groups 1 and 3 in the analysis, we enter the formula to include cases that had a value of 1 for natsoc or a value of 3 for natsoc.

After completing the formula, click on the Continue button to close the dialog box.

Completing the selection of groups 1 and 3

To activate the selection, click on the OK button.

Binary logistic regression comparing groups 1 and 3

Select the Regression | Binary Logistic… command from the Analyze menu.

Dependent and independent variables for the comparison of groups 1 and 3

First, move the dependent variable natsoc to the Dependent variable text box.

Second, move the independent variables age, educ, and confinan to the Covariates list box.

Third, click on the Save… button to request the inclusion of studentized residuals in the data set.

Including studentized residuals in the comparison of groups 1 and 3

First, mark the checkbox for Studentized residuals in the Residuals panel.

Second, click on the Continue button to complete the specifications.

Outliers for the comparison of groups 1 and 3

Click on the OK button to request the output for the logistic regression.

Locating the case ids for outliers for groups 1 and 3

In order to exclude outliers from the multinomial logistic regression, we must identify their case ids.

Choose the Select Cases… command from the Data menu to identify cases that are outliers.

Replace the selection criteria

To replace the formula that selected cases in groups 1 and 3 of the dependent variable, click on the IF… button.

Formula for identifying outliers

Type in the formula for including outliers.

Note that we are including outliers because we want to identify them. This is different from previous procedures, where we included the cases that were not outliers in the analysis.

Click on the Continue button to close the dialog box.

Completing the selection of outliers

To activate the selection, click on the OK button.

Locating the outliers in the data editor

We used Select Cases to specify a criterion for including cases that were outliers. Select Cases assigns a 1 (true) to the filter_$ variable if a case satisfies the criterion. To locate the cases that have a filter_$ value of 1, we can sort the data set in descending order of the values for the filter variable.

Click on the column header for filter_$ and select Sort Descending from the drop-down menu.

The outliers in the data editor

At the top of the sorted column for filter_$, we see four 1s, indicating that four cases met the criterion for being considered an outlier.

Outliers for the comparison of groups 2 and 3

Since multinomial logistic regression does not identify outliers, we will use binary logistic regressions to identify them.

Choose the Select Cases… command from the Data menu to include only groups 2 and 3 in the analysis.

The process for identifying outliers is repeated for the other comparison done by the multinomial logistic regression, group 2 versus group 3.

Selecting groups 2 and 3

First, mark the If condition is satisfied option button.

Second, click on the IF… button to change the condition.

Formula for selecting groups 2 and 3

To include only groups 2 and 3 in the analysis, we enter the formula to include cases that had a value of 2 for natsoc or a value of 3 for natsoc.

After completing the formula, click on the Continue button to close the dialog box.

Completing the selection of groups 2 and 3

To activate the selection, click on the OK button.

Binary logistic regression comparing groups 2 and 3

Select the Regression | Binary Logistic… command from the Analyze menu.

Outliers for the comparison of groups 2 and 3

Click on the OK button to request the output for the logistic regression.

The specifications for the analysis are the same as the ones we used for detecting outliers for groups 1 and 3.

Locating the case ids for outliers for groups 2 and 3

In order to exclude outliers from the multinomial logistic regression, we must identify their case ids.

Choose the Select Cases… command from the Data menu to identify cases that are outliers.

Replace the selection criteria

To replace the formula that selected cases in groups 2 and 3 of the dependent variable, click on the IF… button.

Formula for identifying outliers

Type in the formula for including outliers.

Note that we use the second version of the studentized residual, sre_2.

Click on the Continue button to close the dialog box.

Completing the selection of outliers

To activate the selection, click on the OK button.

Locating the outliers in the data editor

We used Select Cases to specify a criterion for including cases that were outliers. Select Cases assigns a 1 (true) to the filter_$ variable if a case satisfies the criterion. To locate the cases that have a filter_$ value of 1, we can sort the data set in descending order of the values for the filter variable.

Click on the column header for filter_$ and select Sort Descending from the drop-down menu.

The outliers in the data editor

At the top of the sorted column for filter_$, we see that we have two outliers. Both of these cases were also among the outliers for the analysis of groups 1 and 3.

The caseid of the outliers

The case ids for the outliers are "20002045", "20002413", "20000012", and "20000816". These are the cases that we will omit from the multinomial logistic regression.

Since the studentized residuals were only calculated for a subset of the cases, the cases not included were assigned missing values and would be excluded from the analysis if the selection criteria were based on studentized residuals. We will use caseid in the selection criteria instead.

Excluding the outliers from the multinomial logistic regression

To exclude the outliers from the analysis, we will use the Select Cases… command again.

Changing the condition for the selection

Click on the IF… button to change the condition.

Excluding cases identified as outliers

To include all of the cases except the outliers, we require that caseid is not equal to any of the outliers' ids. Note that each id is put in quotation marks because caseid is string data in this data set.

After completing the formula, click on the Continue button to close the dialog box.
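The logic of the exclusion can be sketched as follows. This is an illustration in Python, not the SPSS formula itself; the four outlier ids come from the slides, while the other case ids and the `cases` list are hypothetical.

```python
# Case ids identified as outliers; stored as strings, as in the data set.
outlier_ids = {"20002045", "20002413", "20000012", "20000816"}

# Hypothetical cases; only the caseid field matters for the selection.
cases = [
    {"caseid": "20000012", "natsoc": 1},
    {"caseid": "20000013", "natsoc": 2},
    {"caseid": "20002413", "natsoc": 3},
    {"caseid": "20002414", "natsoc": 1},
]

# Keep a case only if its id differs from every outlier id.
kept = [case for case in cases if case["caseid"] not in outlier_ids]

print([case["caseid"] for case in kept])
```

Selecting on caseid rather than on the residual variable keeps every case that was never scored by one of the binary regressions, which is the reason given above for not filtering on the residuals directly.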

Completing the exclusion of the outliers

To activate the exclusion, click on the OK button.

Multinomial logistic regression excluding the outliers

Select the Regression | Multinomial Logistic… command from the Analyze menu.

Running the multinomial logistic regression without the outliers

Click on the OK button to request the output for the multinomial logistic regression.

The specifications for the analysis are the same as the ones we used for the multinomial logistic regression with all cases.

Classification accuracy after omitting outliers

With all cases the classification accuracy rate for the multinomial logistic regression model was 63.7%.

After omitting the outliers, the accuracy rate improved to 65.3%. Since the increase in accuracy (1.6 percentage points) was less than 2%, the multinomial logistic regression model with all cases will be interpreted.
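The decision rule amounts to a single comparison, sketched here in Python. The function name and return labels are made up for illustration; the 2% threshold and the two accuracy rates are the ones used in this example.

```python
def model_to_interpret(acc_all, acc_without_outliers, threshold=2.0):
    """Pick which model to interpret from two accuracy rates given in percent."""
    gain = round(acc_without_outliers - acc_all, 1)
    choice = "all cases" if gain < threshold else "outliers removed"
    return choice, gain


choice, gain = model_to_interpret(63.7, 65.3)
print(choice, gain)  # all cases 1.6
```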

75/25% Cross-validation Strategy

In this validation strategy, the cases are randomly divided into two subsets: a training sample containing 75% of the cases and a holdout sample containing the remaining 25% of the cases.

The training sample is used to derive the multinomial logistic regression model. The holdout sample is classified using the coefficients for the training model. The classification accuracy for the holdout sample is used to estimate how well the model based on the training sample will perform for the population represented by the data set.

While it is expected that the classification accuracy for the validation sample will be lower than the classification accuracy for the training sample, the difference (shrinkage) should be no larger than 2%.

In addition to satisfying the classification accuracy, we will require that the significance of the overall relationship and the relationships with individual predictors for the training sample match the significance results for the model using the full data set.

75/25% Cross-validation Strategy

SPSS does not classify cases that are not included in the training sample, so we will have to manually compute the classifications for the holdout sample if we want to use this strategy.

We will run the analysis for the training sample, use the coefficients from the training sample analysis to compute classification scores (log of the odds) for each group, compute the probabilities that correspond to each group defined by the dependent variable, and classify the case in the group with the highest probability.
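The steps from log odds to a predicted group can be sketched as below. This is a minimal Python illustration under one stated assumption: the reference group (group 3 here) has a log of the odds equal to 0, as in the usual baseline-category formulation. The log-odds values are hypothetical, not taken from the SPSS output.

```python
import math


def classify(log_odds):
    """Convert log odds per group into probabilities and pick the most likely group.

    log_odds maps group -> log of the odds relative to the reference group,
    with the reference group itself entered as 0.0 (an assumption of this sketch).
    """
    denominator = sum(math.exp(value) for value in log_odds.values())
    probabilities = {
        group: math.exp(value) / denominator for group, value in log_odds.items()
    }
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities


# Hypothetical scores for one holdout case.
predicted, probabilities = classify({1: 1.2, 2: 0.7, 3: 0.0})
print(predicted, {g: round(p, 3) for g, p in probabilities.items()})
```

Because the exponential is increasing, the group with the largest log odds always has the largest probability; computing the probabilities is still useful when the predicted chances themselves are of interest.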

Restoring the outliers to the data set

To restore the outliers to the analysis, we will use the Select Cases… command again.

First, mark the All cases option button to return the outliers to the data set.

Second, click on the OK button to activate the change.

Re-running the multinomial logistic regression with all cases

Select the Regression | Multinomial Logistic… command from the Analyze menu.

Requesting the multinomial logistic regression again

Click on the OK button to request the output for the multinomial logistic regression.

The specifications for the analysis are the same as the ones we have been using all along.

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only             258.051
Final                      242.536       15.515    6   .017

Overall Relationship

The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information".

In this analysis, the probability of the model chi-square (15.515) was p=0.017, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
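The model chi-square and its probability can be reproduced from the two -2 log likelihood values. The Python sketch below uses only the standard library; `chi2_upper_tail_even_df` is a helper written for this illustration, and it relies on the closed-form upper tail of the chi-square distribution that holds when the degrees of freedom are even (as they are here, df = 6).

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid only for even, positive df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# -2 log likelihood of the intercept-only model minus that of the final model.
chi_square = round(258.051 - 242.536, 3)
p_value = chi2_upper_tail_even_df(chi_square, df=6)
print(chi_square, round(p_value, 3))  # 15.515 0.017
```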

Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                              254.152       11.616    2   .003
AGE                                    244.186        1.650    2   .438
EDUC                                   247.902        5.366    2   .068
CONFINAN                               251.981        9.445    2   .009

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Individual relationships - 1

The statistical significance of the relationship between confidence in banks and financial institutions and opinion about spending on social security is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests".

For this relationship, the probability of the chi-square statistic (9.445) was p=0.009, less than or equal to the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with confidence in banks and financial institutions were equal to zero was rejected. The existence of a relationship between confidence in banks and financial institutions and opinion about spending on social security was supported.
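Each row of the "Likelihood Ratio Tests" table can be checked the same way. The Python sketch below subtracts the final model's -2 log likelihood (242.536, from "Model Fitting Information") from each reduced model's value. With 2 degrees of freedom the upper-tail chi-square probability simplifies to exp(-x/2), which is what the code uses; the variable names are illustrative.

```python
import math

FINAL_M2LL = 242.536  # -2 log likelihood of the final model

# -2 log likelihood of each reduced model (the final model with one effect omitted).
reduced_m2ll = {
    "Intercept": 254.152,
    "AGE": 244.186,
    "EDUC": 247.902,
    "CONFINAN": 251.981,
}

results = {}
for effect, m2ll in reduced_m2ll.items():
    chi_square = round(m2ll - FINAL_M2LL, 3)
    p_value = math.exp(-chi_square / 2.0)  # exact upper tail for df = 2
    results[effect] = (chi_square, round(p_value, 3))

for effect, (chi_square, p_value) in results.items():
    print(f"{effect:<10} {chi_square:>7.3f} {p_value:.3f}")
```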

Individual relationships - 2

In the comparison of survey respondents who thought we spend too little money on social security to survey respondents who thought we spend too much money on social security, the probability of the Wald statistic (6.263) for the variable confidence in banks and financial institutions [confinan] was 0.012. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in banks and financial institutions was equal to zero for this comparison was rejected.

Individual relationships - 3

The value of Exp(B) was 0.121 which implies that for each unit increase in confidence in banks and financial institutions the odds decreased by 87.9% (0.121 - 1.0 = -0.879).

The relationship stated in the problem is supported. Survey respondents who had more confidence in banks and financial institutions were less likely to be in the group of survey respondents who thought we spend too little money on social security, rather than the group of survey respondents who thought we spend too much money on social security. For each unit increase in confidence in banks and financial institutions, the odds of being in the group of survey respondents who thought we spend too little money on social security decreased by 87.9%.

Individual relationships - 4

In the comparison of survey respondents who thought we spend about the right amount of money on social security to survey respondents who thought we spend too much money on social security, the probability of the Wald statistic (7.276) for the variable confidence in banks and financial institutions [confinan] was 0.007. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in banks and financial institutions was equal to zero for this comparison was rejected.

Individual relationships - 5

The value of Exp(B) was 0.098 which implies that for each unit increase in confidence in banks and financial institutions the odds decreased by 90.2% (0.098 - 1.0 = -0.902).

The relationship stated in the problem is supported. Survey respondents who had more confidence in banks and financial institutions were less likely to be in the group of survey respondents who thought we spend about the right amount of money on social security, rather than the group of survey respondents who thought we spend too much money on social security. For each unit increase in confidence in banks and financial institutions, the odds of being in the group of survey respondents who thought we spend about the right amount of money on social security decreased by 90.2%.
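The conversion from Exp(B) to a percentage change in the odds, used for both comparisons, is a one-line calculation. A small Python sketch, with a helper name chosen for this illustration:

```python
def percent_change_in_odds(exp_b):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return round((exp_b - 1.0) * 100, 1)


# Exp(B) for confinan in the two comparisons reported above.
for comparison, exp_b in [("too little vs. too much", 0.121),
                          ("about right vs. too much", 0.098)]:
    print(comparison, percent_change_in_odds(exp_b))
```

A value of Exp(B) below 1 gives a negative change (a decrease in the odds), and a value above 1 gives a positive change.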

Case Processing Summary

                                   N      Marginal Percentage
SOCIAL SECURITY   TOO LITTLE       105    61.4%
                  ABOUT RIGHT      59     34.5%
                  TOO MUCH         7      4.1%
Valid                              171    100.0%
Missing                            99
Total                              270
Subpopulation                      152a

a. The dependent variable has only one value observed in 142 (93.4%) subpopulations.

Classification Accuracy - 1

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.614² + 0.345² + 0.041² = 0.498).

The independent variables could be characterized as useful predictors distinguishing survey respondents who thought we spend too little money on social security, survey respondents who thought we spend about the right amount of money on social security and survey respondents who thought we spend too much money on social security if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.

Classification

                      Predicted
Observed              TOO LITTLE   ABOUT RIGHT   TOO MUCH   Percent Correct
TOO LITTLE                   100             5          0             95.2%
ABOUT RIGHT                   50             9          0             15.3%
TOO MUCH                       6             1          0               .0%
Overall Percentage         91.2%          8.8%        .0%             63.7%

Classification Accuracy - 2

The classification accuracy rate was 63.7%, which was greater than or equal to the proportional by chance accuracy criterion of 62.2% (1.25 x 49.8% = 62.2%).

The criterion for classification accuracy is satisfied.
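The by chance accuracy rate and the 25% criterion can be recomputed from the group counts in the "Case Processing Summary". The Python sketch below works from the counts rather than the rounded percentages, which gives the same result to three decimals; the variable names are illustrative.

```python
# Number of valid cases in each group of the dependent variable.
counts = {"TOO LITTLE": 105, "ABOUT RIGHT": 59, "TOO MUCH": 7}
total = sum(counts.values())

# Proportional by chance accuracy: sum of the squared group proportions.
by_chance = sum((n / total) ** 2 for n in counts.values())

# Criterion: accuracy must be at least 25% higher than the by chance rate.
criterion = 1.25 * by_chance

observed_accuracy = 0.637  # overall percentage from the classification table
print(round(by_chance, 3), round(criterion, 3), observed_accuracy >= criterion)
# 0.498 0.622 True
```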

Validation analysis: set the random number seed

To set the random number seed, select the Random Number Seed… command from the Transform menu.

If the cases were sorted in a different order when checking outliers, they should be re-sorted by caseid, or the assignment of random numbers will not match mine.

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem. For this example, assume it is 892776.

Third, click on the OK button to complete the dialog box.

Note that SPSS does not provide you with any feedback about the change.

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

The formula for the split variable

First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split, uniform(1) <= 0.75, is shown in the text box.

The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75.

If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false.

Third, click on the OK button to complete the dialog box.
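The same idea can be sketched in Python: draw one uniform random number per case and code it 1 when it is at most 0.75. Note the assumptions: `make_split` is a hypothetical helper, and Python's random number generator is not the one SPSS uses, so the same seed will not reproduce the SPSS assignment of cases.

```python
import random


def make_split(n_cases, seed=892776, cutoff=0.75):
    """Return a list of 1s (training) and 0s (holdout), one per case, in case order."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    return [1 if rng.random() <= cutoff else 0 for _ in range(n_cases)]


split = make_split(270)
print(sum(split), "training cases,", len(split) - sum(split), "holdout cases")
```

Because each case is assigned independently, the training share will be close to 75% but usually not exactly 75%. The split also depends on the order of the cases, which is why the data set should be sorted by caseid before the random numbers are generated.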

Selecting the training sample - 1

To select the cases that we will use for the training sample, we will use the Select Cases… command again.

Selecting the training sample - 2

First, mark the If condition is satisfied option button.

Second, click on the IF… button to specify the condition.

Selecting the training sample - 3

To include the cases for the training sample, we enter the selection criterion: "split = 1".

After completing the formula, click on the Continue button to close the dialog box.

Selecting the training sample - 4

To activate the selection, click on the OK button.

Re-running the multinomial logistic regression with the training sample

Select the Regression | Multinomial Logistic… command from the Analyze menu.

Requesting the multinomial logistic regression again

Click on the OK button to request the output for the multinomial logistic regression.

The specifications for the analysis are the same as the ones we have been using all along.

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only             199.385
Final                      181.898       17.487    6   .008

Comparing the training model to full model - 1

In the cross-validation analysis, the relationship between the independent variables and the dependent variable was statistically significant.

The probability for the model chi-square (17.487) testing the overall relationship was p = 0.008.

The significance of the overall relationship between the independent variables and the dependent variable supports the interpretation of the model using the full data set.

Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                              189.239        7.341    2   .025
AGE                                    184.548        2.650    2   .266
EDUC                                   189.290        7.392    2   .025
CONFINAN                               192.355       10.457    2   .005

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Comparing the training model to full model - 2

The pattern of significance of individual predictors for the training model does not match the pattern for the full data set. Age is not significant in either model, and confinan is statistically significant in both. Educ is statistically significant in the training sample, but not for the full model.

Though we have a reason to declare the question false, we will continue on to demonstrate the statistical method.
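The comparison of significance patterns can be written out explicitly. The Python sketch below uses the Sig. values from the two "Likelihood Ratio Tests" tables (full data set and training sample) and the 0.05 level of significance; the variable names are illustrative.

```python
ALPHA = 0.05

# Sig. values for each predictor from the likelihood ratio tests.
full_model = {"AGE": 0.438, "EDUC": 0.068, "CONFINAN": 0.009}
training_model = {"AGE": 0.266, "EDUC": 0.025, "CONFINAN": 0.005}

# A predictor mismatches when it is significant in one model but not the other.
mismatches = [
    predictor
    for predictor in full_model
    if (full_model[predictor] <= ALPHA) != (training_model[predictor] <= ALPHA)
]

print(mismatches)  # ['EDUC']
```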

Comparing the teaching model to full model - 3

The statistical significance and direction of the relationship between confinan and the dependent variable for the teaching model agrees with the findings for the model using the full data set.

Classification accuracy of the training sample

The classification accuracy for the training sample is 66.2%. The final consideration in the validation analysis is to see whether or not the shrinkage in classification accuracy for the holdout sample is less than 2%.

Unfortunately, SPSS does not calculate classifications for the cases in the holdout validation sample, so we must manually calculate the values for classification of the cases. The steps and calculations on the following slides are needed to classify the holdout cases and compute classification accuracy in a crosstabs table.

Classification accuracy of the holdout sample

SPSS does not calculate classifications for the cases in the holdout validation sample, so we must manually calculate the values for classification of the cases.

The log of the odds for the first group

To classify cases, we first calculate the log of the odds for membership in each group, G1, G2, and G3.

To calculate the log of the odds for the first group (G1), we multiply each variable by its coefficient for the first group from the table of parameter estimates and add the intercept:

COMPUTE G1 = 6.573629842223 + 0.009441308512708 * AGE + 0.155649871298 * EDUC - 2.496600350832 * CONFINAN.

To get all of the decimal places for a number, double click on a cell in the table to highlight it and the full number will appear.

The log of the odds for the second group

To calculate the log of the odds for the second group (G2), we multiply each variable by its coefficient for the second group from the table of parameter estimates and add the intercept:

COMPUTE G2 = 3.664294481189 + 0.02905602322394 * AGE + 0.3303189055983 * EDUC - 2.947458591882 * CONFINAN.

The log of the odds for the third group

The third group (G3) is the reference group and does not appear in the table of parameter estimates.

By definition, the log of the odds for the reference group is equal to zero (0). We create the variable for G3 with the command:

COMPUTE G3 = 0.

The probabilities for each group

Having computed the log of the odds for each group, we convert the log of the odds back to a probability value with the following formulas:

COMPUTE P1 = EXP(G1) / (EXP(G1) + EXP(G2) + EXP(G3)).

COMPUTE P2 = EXP(G2) / (EXP(G1) + EXP(G2) + EXP(G3)).

COMPUTE P3 = EXP(G3) / (EXP(G1) + EXP(G2) + EXP(G3)).

EXECUTE.
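The same calculation can be sketched outside SPSS to check a single case by hand. The Python example below is only an illustration: the coefficients are copied from the COMPUTE statements for G1 and G2, while the function name `group_probabilities` and the example case values are hypothetical.

```python
import math

# Coefficients copied from the COMPUTE statements:
# (intercept, AGE, EDUC, CONFINAN).
COEF_G1 = (6.573629842223, 0.009441308512708, 0.155649871298, -2.496600350832)
COEF_G2 = (3.664294481189, 0.02905602322394, 0.3303189055983, -2.947458591882)

def group_probabilities(age, educ, confinan):
    """Return (P1, P2, P3) for one case from the log of the odds G1, G2, G3."""
    x = (1.0, age, educ, confinan)  # leading 1.0 multiplies the intercept
    g1 = sum(b * v for b, v in zip(COEF_G1, x))
    g2 = sum(b * v for b, v in zip(COEF_G2, x))
    g3 = 0.0  # reference group: log of the odds is zero by definition
    denom = math.exp(g1) + math.exp(g2) + math.exp(g3)
    return tuple(math.exp(g) / denom for g in (g1, g2, g3))

# Hypothetical case, not taken from the data set.
p1, p2, p3 = group_probabilities(age=40, educ=12, confinan=2)
print(p1, p2, p3, p1 + p2 + p3)
```

Because all three probabilities share the same denominator, they sum to 1 for every case, which is a quick check on the hand calculation.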

Group classification

Each case is predicted to be a member of the group to which it has the highest probability of belonging. We can accomplish this using "IF" statements in SPSS:

IF (P1 > P2 AND P1 > P3) PREDGRP = 1.

IF (P2 > P1 AND P2 > P3) PREDGRP = 2.

IF (P3 > P1 AND P3 > P2) PREDGRP = 3.

EXECUTE.
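The classification rule can be sketched in Python to make one detail explicit: because the comparisons are strict, a case whose highest probability is tied between two groups satisfies none of the three conditions and receives no predicted group. The function name `predicted_group` is hypothetical, and `None` stands in for a missing value.

```python
def predicted_group(p1, p2, p3):
    """Return 1, 2, or 3, or None when the highest probability is tied."""
    if p1 > p2 and p1 > p3:
        return 1
    if p2 > p1 and p2 > p3:
        return 2
    if p3 > p1 and p3 > p2:
        return 3
    # A tie for the maximum matches none of the strict comparisons.
    return None

print(predicted_group(0.6, 0.3, 0.1))
print(predicted_group(0.4, 0.4, 0.2))
```

With probabilities carried to many decimal places, exact ties are unlikely, but any unclassified cases would drop out of the crosstabs table and are worth checking for.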

Selecting the holdout sample - 1

To select the cases that we will use to compute classification accuracy for the holdout group, we will use the Select Cases… command again.

Our calculations predicted group membership for all cases in the data set, including the training sample. To compute the classification accuracy for the holdout sample, we will have to explicitly include only the holdout sample in the calculations.

Selecting the holdout sample - 2

First, mark the If condition is satisfied option button.

Second, click on the IF… button to specify the condition.

Selecting the holdout sample - 3

To include the cases in the 25% holdout sample, we enter the criterion: "split = 0".

After completing the formula, click on the Continue button to close the dialog box.

Selecting the holdout sample - 4

To activate the selection, click on the OK button.

The crosstabs classification accuracy table - 1

The classification accuracy table is a table of predicted group membership versus actual group membership. SPSS can create it as a cross-tabulated table.

Select the Descriptive Statistics | Crosstabs… command from the Analyze menu.

The crosstabs classification accuracy table - 2

To mimic the appearance of classification tables in SPSS, we will put the original variable, natsoc, in the rows of the table and the predicted group variable, predgrp, in the columns.

After specifying the row and column variables, we click on the Cells… button to request percentages.

The crosstabs classification accuracy table - 3

The classification accuracy rate will be the sum of the total percentages on the main diagonal.

First, to obtain these percentages, mark the check box for Total on the Percentages panel.

Second, click on the Continue button to close the dialog box.

The crosstabs classification accuracy table - 4

To complete the request for the cross-tabulated table, click on the OK button.

SOCIAL SECURITY * PREDGRP Crosstabulation

                                PREDGRP
SOCIAL SECURITY                 1.0000   2.0000   Total
TOO LITTLE      Count           21       4        25
                % of Total      51.2%    9.8%     61.0%
ABOUT RIGHT     Count           9        5        14
                % of Total      22.0%    12.2%    34.1%
TOO MUCH        Count           1        1        2
                % of Total      2.4%     2.4%     4.9%
Total           Count           31       10       41
                % of Total      75.6%    24.4%    100.0%

The crosstabs classification accuracy table - 5

The classification accuracy rate will be the sum of the total percentages on the main diagonal:

51.2% + 12.2% = 63.4%.

The criterion for supporting the classification accuracy of the model is an accuracy rate for the holdout sample that shows no more than 2% shrinkage from the accuracy rate for the training sample. The accuracy rate for the training sample was 66.2%, so the shrinkage was 66.2% - 63.4% = 2.8%. Because this exceeds 2%, the holdout sample does not satisfy the requirement, and the classification accuracy for the analysis of the full data set was not supported.
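The arithmetic can be verified from the counts in the crosstabs table. The Python sketch below is an illustration outside SPSS: the counts are copied from the table, and the column of zeros for predicted group 3 is an inference from the fact that the table shows only predicted groups 1 and 2. The variable names are hypothetical.

```python
# Rows: actual natsoc (1 too little, 2 about right, 3 too much).
# Columns: predicted group (1, 2, 3). No holdout case was predicted
# to be in group 3, so that column is filled with zeros.
counts = [
    [21, 4, 0],
    [9, 5, 0],
    [1, 1, 0],
]

total = sum(sum(row) for row in counts)
correct = sum(counts[i][i] for i in range(3))  # main diagonal

holdout_accuracy = 100.0 * correct / total
training_accuracy = 66.2  # reported for the training sample
shrinkage = training_accuracy - holdout_accuracy
supported = shrinkage <= 2.0  # criterion: no more than 2% shrinkage

print(f"{holdout_accuracy:.1f}% holdout, {shrinkage:.1f}% shrinkage, supported={supported}")
```

Working from counts rather than rounded percentages gives the same conclusion: 26 of 41 holdout cases are classified correctly, and the shrinkage is larger than the 2% allowed.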