
Slide 1

Two-Group Illustrative Example of Discriminant Analysis

Overview of Discriminant Analysis

There are many occasions when the dependent variable in our analysis is a categorical variable such as type of client, problem, treatment, organization, diagnostic category, or outcome group. If the dependent variable has two categories, we have a choice of logistic regression or discriminant analysis. If it has more than two categories, the appropriate analytic technique until recently has been discriminant analysis. An alternative to discriminant analysis with three or more groups is multinomial logistic regression, which we will consider in the last class of the semester.

There are two generic types of discriminant analysis: descriptive discriminant analysis and predictive discriminant analysis. The goal of descriptive discriminant analysis is to identify the independent variables that have a strong relationship to group membership in the categories of the dependent variable. This component or stage of discriminant analysis may be referred to as deriving the discriminant functions. The goal of predictive discriminant analysis is to use the relationships in the discriminant functions to build a valid and accurate predictive model. This may be referred to as the classification phase or stage of discriminant analysis.

While we will rely on the classification results to assess the overall fit of the model, our use of discriminant analysis will be for descriptive studies.  The text presents a substantial amount of material on the topic of reducing classification errors on which we will not spend any time.

Like multiple regression and logistic regression, the relationship between the dependent and independent variables is expressed in an equation or a function. Unlike regression analysis, discriminant analysis may produce multiple functions, which together distinguish among categories of the dependent variable. Each discriminant function produces a discriminant score. The pattern of the cases on the discriminant scores is used to estimate which group of the dependent variable a case belongs to.

Discriminant Analysis


Slide 2

Two-Group Illustrative Example of Discriminant Analysis

Overview of Discriminant Analysis (continued)

Generally, the number of discriminant functions is one less than the number of groups in the dependent variable, unless there are fewer independent variables than groups, in which case, the number of functions is bounded by the number of independent variables. Some, all, or none of the discriminant functions may be statistically significant in a particular problem, depending on how well we can distinguish among the groups.

In multiple regression, the function was derived to satisfy the mathematical property of minimizing the residual variance in the dependent variable. In discriminant analysis, the functions are derived to maximize the between groups variance relative to the within groups variance. Another way to think of this is that discriminant analysis maximizes the statistical distance between the means, or centroids (set of means of several variables), of the groups on the set of independent variables. Maximizing this distance between group means should enhance our ability to estimate which group a case belongs to because the distinction between groups is more clearly defined.

The process of translating the independent variables from the original coordinate system to the coordinate system of discriminant space uses the mathematical procedure of finding characteristic roots or eigenvalues. The matrix that is used to multiply the original scores to convert them to discriminant scores is referred to as the eigenvectors. It is not absolutely necessary that we understand the mathematical process for deriving this translation in coordinate systems to make use of it. What is necessary to remember is that this mathematical process translates the coordinate dimensions of our original problem (one coordinate or axis for each independent variables) to the reduced dimensionality of discriminant space, in which the largest eigenvalue is associated with the first dimension of the reduced space, the second largest eigenvalue is associated with the second dimension of the reduced space, etc. The translation from one coordinate system to the other is mathematically exact and precise, and changes the form of the information, but not the content.
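A sketch of this eigenvalue derivation in Python (the data below are synthetic stand-ins, not the HATCO variables; SPSS performs the equivalent computation internally):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Synthetic two-group sample: 3 hypothetical predictors, group means shifted.
g1 = rng.normal(loc=0.0, size=(40, 3))
g2 = rng.normal(loc=1.0, size=(40, 3))
X = np.vstack([g1, g2])
y = np.array([0] * 40 + [1] * 40)

grand_mean = X.mean(axis=0)
Sw = np.zeros((3, 3))  # within-groups scatter matrix
Sb = np.zeros((3, 3))  # between-groups scatter matrix
for g in (0, 1):
    Xg = X[y == g]
    mg = Xg.mean(axis=0)
    Sw += (Xg - mg).T @ (Xg - mg)
    d = (mg - grand_mean).reshape(-1, 1)
    Sb += len(Xg) * (d @ d.T)

# Solve the generalized eigenproblem Sb v = lambda * Sw v.
eigvals, eigvecs = eigh(Sb, Sw)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# With two groups, only one nonzero eigenvalue (one function) exists.
scores = X @ eigvecs[:, 0]                  # discriminant scores
```

The eigenvectors here play the role of the translation matrix described above: multiplying the original scores by them converts the cases into discriminant space.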


Slide 3

Two-Group Illustrative Example of Discriminant Analysis

Overview of Discriminant Analysis (continued)

The classification phase of discriminant analysis is analogous to the process of comparing individual cases to a group mean using standard scores or z-scores. For each group in the dependent variable, we calculate the mean and variance in discriminant space. We then convert the independent variables for an individual case to discriminant space. We compute the statistical distance that the case is from the group mean in standard score units. We guess or predict that the case is a member of the group that corresponds to the smallest distance between group mean and individual scores.
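A minimal sketch of this nearest-centroid logic, using made-up discriminant scores rather than SPSS output:

```python
import numpy as np

# Hypothetical discriminant scores for cases with known group membership.
scores = {0: np.array([-1.2, -0.8, -1.0, -1.5]),
          1: np.array([0.9, 1.1, 1.4, 0.6])}

# Mean and standard deviation of each group in discriminant space.
stats = {g: (s.mean(), s.std(ddof=1)) for g, s in scores.items()}

def classify(score):
    """Assign the case to the group whose centroid is nearest
    in standard-score (z) units, as described above."""
    dists = {g: abs(score - m) / sd for g, (m, sd) in stats.items()}
    return min(dists, key=dists.get)

print(classify(-0.9))  # falls nearest group 0's centroid
print(classify(1.0))   # falls nearest group 1's centroid
```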

We can compare the predicted group memberships to known group memberships to derive an accuracy measure, or "hit" rate, for a discriminant model. The accuracy rate for a model is notoriously inflated, or overfitted, when the same cases are used in deriving the functions, making holdout testing a necessity. SPSS provides us with a one-at-a-time holdout method, computed by sequentially holding out one case from the analysis and using the remaining cases to derive the discriminant functions used to classify that case. This process is repeated for all cases in the analysis, and the resulting accuracy is usually regarded as a less biased measure of model accuracy. To this calculation, we will add our usual split-half validation analysis.
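The one-at-a-time (leave-one-out) idea can be sketched as follows; the data and the simple nearest-mean classifier here are illustrative stand-ins, not the discriminant procedure SPSS actually runs:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic two-group sample, one score per case (illustrative only).
X = np.concatenate([rng.normal(-1, 1, 30), rng.normal(1, 1, 30)])
y = np.array([0] * 30 + [1] * 30)

def nearest_mean(x, Xtr, ytr):
    # Predict the group whose training-sample mean is closest.
    return min(set(ytr.tolist()), key=lambda g: abs(x - Xtr[ytr == g].mean()))

# Leave-one-out: refit on n-1 cases, then classify the held-out case.
hits = 0
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    hits += nearest_mean(X[i], X[mask], y[mask]) == y[i]

loo_accuracy = hits / len(X)   # less biased than resubstitution accuracy
```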

A more detailed presentation of all of the statistics and processes incorporated in discriminant analysis are presented in the text.  I will follow the text in working the two group and three group sample problems, in that we will extract a holdout sample from the very start of the analysis, and do the discriminant analysis on the cases that were not in the holdout sample.  However, when we work problems thereafter, we will conduct the discriminant analysis on the entire sample, and do the holdout analysis in stage 6 when we address the issue of validation.


Slide 4

Preliminary Division of the Data Set

Instead of conducting the analysis with the entire data set, and then splitting the data for the validation analysis, the authors opt to divide the sample prior to doing the analysis.  They use the estimation or learning sample of 60 cases to build the discriminant model and the other 40 cases for a holdout sample to validate the model.

To replicate the authors' analysis, we will create a randomly generated variable, randz, to split the sample.  We will use the cases where randz = 0 to create the discriminant model.
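A rough Python analogue of this splitting step (the seed value is a hypothetical choice, and a random draw will only approximate the authors' 60/40 proportions):

```python
import numpy as np

rng = np.random.default_rng(123)  # analogous to setting the SPSS random seed
n_cases = 100

# randz = 0 flags the estimation sample, randz = 1 the holdout sample;
# each case lands in the estimation sample with probability 0.60.
randz = (rng.random(n_cases) >= 0.60).astype(int)

estimation = np.flatnonzero(randz == 0)   # cases used to build the model
holdout = np.flatnonzero(randz == 1)      # cases reserved for validation
```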


Slide 5

Specify the Random Number Seed


Slide 6

Compute the random selection variable


Slide 7

Stage One: Define the Research Problem

In this stage, the following issues are addressed:

•Relationship to be analyzed
•Specifying the dependent and independent variables
•Method for including independent variables


Relationship to be analyzed

The purpose of this analysis is to identify the perceptions of HATCO that differ significantly between firms using the two purchasing methods: Total Value Analysis versus Specification Buying. The company would then be able to tailor sales presentations and benefits offered to best match the buyer's perceptions. To do so, discriminant analysis was selected to identify these perceptions of HATCO that best distinguish firms using each buying approach. (Text, page 281)

The data set for this analysis is HATCO.SAV.


Slide 8

Specifying the dependent and independent variables

The dependent variable is:

•X11, Purchasing Approach

The independent variables are the seven metric perception variables:

•X1, Delivery Speed
•X2, Price Level
•X3, Price Flexibility
•X4, Manufacturer Image
•X5, Service
•X6, Sales Force Image
•X7, Product Quality


Method for including independent variables

Since the purpose of this analysis is to identify the variables which do the best job of differentiating between the two groups, the stepwise method for selecting variables is appropriate.


Slide 9

Stage 2: Develop the Analysis Plan: Sample Size Issues

In this stage, the following issues are addressed:

•Missing data analysis
•Minimum sample size requirement: 20+ cases per independent variable
•Division of the sample: 20+ cases in each dependent variable group


Missing data analysis

There is no missing data in this data set.

Minimum sample size requirement: 20+ cases per independent variable

With 100 cases and 7 independent variables, we have a ratio of about 14 cases per independent variable, somewhat below the suggested ratio of 20 to 1. When we reduce the effective sample size for building the model to 60 cases, we fall to roughly a 9 to 1 ratio; however, the authors do not identify this as a problem.

Division of the sample: 20+ cases in each dependent variable group

In the original sample, we have 40 cases in the Specification Buying group and 60 in the Total Value Analysis Group, so we meet this requirement. In the sample used to build the model, we have 22 in the Specification Buying group and 38 in the Total Value Analysis group, so we still meet this requirement.


Slide 10

Stage 2: Develop the Analysis Plan: Measurement Issues

In this stage, the following issues are addressed:

•Incorporating nonmetric data with dummy variables
•Representing curvilinear effects with polynomials
•Representing interaction or moderator effects


Incorporating Nonmetric Data with Dummy Variables

All of the nonmetric variables have been recoded into dichotomous dummy-coded variables.

Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis.

Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.


Slide 11

Stage 3: Evaluate Underlying Assumptions

In this stage, the following issues are addressed:

•Nonmetric dependent variable and metric or dummy-coded independent variables
•Multivariate normality of metric independent variables: assess normality of individual variables
•Linear relationships among variables
•Assumption of equal dispersion for dependent variable groups


Nonmetric dependent variable and metric or dummy-coded independent variables

All of the variables in the analysis are metric or dummy-coded.


Slide 12

Multivariate normality of metric independent variables

Since there is no direct test of multivariate normality available to us, we assess the normality of the individual metric variables.

We did the assessment of normality for the metric variables in this data set in the class 5 exercise "Illustration of a Regression Analysis."

In that exercise, we found that the tests of normality indicated that the following variables are normally distributed: X1  'Delivery Speed', and X5  'Service'.

The following independent variables are not normally distributed:  X2 'Price Level', X3 'Price Flexibility', X4 'Manufacturer's Image', X6 'Sales force Image', and X7 'Product Quality'.

X2 'Price Level' is induced to normality by a log and a square root transformation. X7 'Product Quality' is induced to normality by a log and a square root transformation. The other non-normal variables are not improved by using transformations. 
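The logic of checking whether a transformation induces normality can be sketched with SciPy's Shapiro-Wilk test; the skewed variable below is synthetic, not one of the HATCO perceptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical positively skewed variable standing in for a perception score.
raw = rng.lognormal(mean=1.0, sigma=0.6, size=100)

# Shapiro-Wilk tests the null hypothesis that the variable is normal;
# a much higher p-value after transformation suggests normality was induced.
p_raw = stats.shapiro(raw).pvalue
p_log = stats.shapiro(np.log(raw)).pvalue
p_sqrt = stats.shapiro(np.sqrt(raw)).pvalue
```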

Note that this finding does not agree with the text, which finds that X2 'Price Level', X4 'Manufacturer Image', and X6 'Salesforce Image' are correctable with a log transformation. I have no explanation for the discrepancy.

We can include the transformed versions of the variables in an additional analysis to see if they improve the overall fit between the dependent and the independent variables.


Slide 13

Linear relationships among variables

Since our dependent variable is not metric, we cannot use it to test for linearity of the independent variables. As an alternative, we can plot each metric independent variable against all other independent variables in a scatterplot matrix to look for patterns of nonlinear relationships.  If one of the independent variables shows multiple nonlinear relationships to the other independent variables, we consider it a candidate for transformation.
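As an illustration of the scatterplot-matrix idea outside SPSS (the variable names and values below are invented), pandas can produce the same kind of display:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
# Hypothetical stand-ins for three of the metric perception variables.
df = pd.DataFrame({
    "X1 Delivery Speed": rng.normal(3.0, 1.0, 100),
    "X2 Price Level": rng.normal(2.0, 0.7, 100),
    "X3 Price Flexibility": rng.normal(8.0, 1.3, 100),
})

# One scatterplot for every pairwise combination of variables.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.savefig("scatter_matrix.png")
```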


Slide 14

Requesting a Scatterplot Matrix


Slide 15

Specifications for the Scatterplot Matrix


Slide 16

The Scatterplot Matrix

Blue fit lines were added to the scatterplot matrix to improve interpretability.

Having computed a scatterplot for all combinations of metric independent variables, we identify all of the variables that appear in any plot that shows a nonlinear trend. We will call these variables our nonlinear candidates. To identify which of the nonlinear candidates is producing the nonlinear pattern, we look at all of the plots for each of the candidate variables. The candidate variable that is not linear should show up in a nonlinear relationship in several plots with other linear variables. Hopefully, the form of the plot will suggest the power term to best represent the relationship, e.g. squared term, cubed term, etc.

None of our metric independent variables show a strong nonlinear pattern, so no transformations will be used in this analysis.


Slide 17

Assumption of equal dispersion for dependent variable groups

Box's M test evaluates the homogeneity of dispersion matrices across the subgroups of the dependent variable.  The null hypothesis is that the dispersion matrices are homogeneous.  If the analysis fails this test, we can request separate group dispersion matrices in the classification phase of the discriminant analysis to see if this improves our accuracy rate.

Box's M test is produced by the SPSS discriminant procedure, so we will defer this question until we have obtained the discriminant analysis output.


Slide 18

Stage 4: Estimation of Discriminant Functions and Overall Fit: The Discriminant Functions

In this stage, the following issues are addressed:

•Compute the discriminant analysis
•Overall significance of the discriminant function(s)


Slide 19

Compute the discriminant analysis

The steps to obtain a discriminant analysis are detailed on the following screens.

We will not produce all of the output provided in the text for two reasons.  First, some of the output can only be obtained with syntax commands.  Second, some of the authors’ analyses are either produced with other statistical software or are computed by hand.  In spite of this, we can produce sufficient output with the menu commands to do a creditable analysis.


Slide 20

Requesting a Discriminant Analysis


Slide 21

Specifying the Dependent Variable


Slide 22

Specifying the Independent Variables


Slide 23

Selecting the Cases to Include in the Analysis


Slide 24

Specifying Statistics to Include in the Output


Slide 25

Specifying the Stepwise Method for Selecting Variables


Slide 26

Specifying the Classification Requirement


Slide 27

Complete the Discriminant Analysis Request


Slide 28

Overall significance of the discriminant function(s) - 1

Similar to multiple regression analysis, our first task is to determine whether or not there is a statistically significant relationship between the independent variables and the dependent variable. We navigate to the section of output titled "Summary of Canonical Discriminant Functions" to locate the following outputs:

The key statistic indicating whether or not there is a relationship between the independent and dependent variables is the significance test for Wilks' lambda. Wilks' lambda is the proportion of the total variance in the discriminant scores NOT explained by differences among the groups. In this example about 33% of the variance is not explained by group differences. Unlike R², smaller values of Wilks' lambda are desirable.

The canonical correlation coefficient (.818) measures the association between the discriminant score and the set of independent variables. Like Wilks' lambda, it is an indicator of the strength of relationship between entities in the solution, but it does not have any necessary relationship to the classification accuracy which is our ultimate measure of the value of the model.
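For a two-group problem with a single discriminant function, Wilks' lambda, the canonical correlation, and the eigenvalue are simple transformations of one another, which we can check against the reported value of .818:

```python
# With a single discriminant function, Wilks' lambda, the canonical
# correlation R_c, and the eigenvalue are algebraically linked:
#   lambda = 1 - R_c**2        eigenvalue = R_c**2 / (1 - R_c**2)
r_c = 0.818                       # canonical correlation from the output
wilks_lambda = 1 - r_c ** 2       # about 0.331: ~33% unexplained variance
eigenvalue = r_c ** 2 / (1 - r_c ** 2)   # about 2.02
```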


Slide 29

Overall significance of the discriminant function(s) - 2

Wilks' lambda is used to test the null hypothesis that the means of all of the independent variables are equal across groups of the dependent variable. If the means of the independent variables are equal for all groups, the means will not be a useful basis for predicting the group to which a case belongs, and thus there is no relationship between the independent variables and the dependent variable. If the chi-square statistic corresponding to Wilks' lambda is statistically significant, we conclude that there is a relationship between the dependent groups and the independent variables. We should note that there is no correspondence between the size of Wilks' lambda and the accuracy of the classifications based on the discriminant functions.

The information from the table of Eigenvalues is often cited in analyses using discriminant analysis, but it is not as important to us as the statistical test of Wilks' lambda. The table of eigenvalues gives us information about the effectiveness of the discriminant functions. The eigenvalue is a ratio of the between-groups sum of squares to the within-groups or error sum of squares. The size of the eigenvalue is helpful for measuring the spread of the group centroids in the corresponding dimension of the multivariate discriminant space. Larger eigenvalues indicate that the discriminant function is more useful in distinguishing between the groups. The eigenvalues will always be listed in descending order since the solution in a discriminant analysis requires that the first discriminant function is the most capable in differentiating the groups; the second discriminant function is the second most useful function, etc.



Slide 30

Stage 4: Estimation of Discriminant Functions and Overall Fit:  Assessing Model Fit

In this stage, the following issues are addressed:

•Assumption of equal dispersion for dependent variable groups
•Classification accuracy chance criteria
•Press's Q statistic
•Presence of outliers


Slide 31

Assumption of equal dispersion for dependent variable groups

In discriminant analysis, the best measure of overall fit is classification accuracy.  The appropriateness of using the pooled covariance matrix in the classification phase is evaluated by Box's M statistic.

We examine the probability of the Box's M statistic to determine whether or not we meet the assumption of equality of the dispersion or covariance matrices (the multivariate measure of variance). This test is very sensitive, so we should select a conservative alpha value of 0.01. At that alpha level, we fail to reject the null hypothesis for this test and conclude that the dispersion matrices for our groups are equal.

Had we failed this test, our remedy would be to re-run the discriminant analysis requesting the use of separate covariance matrices in classification to see if this improves the accuracy rate of the discriminant model.


Slide 32

Classification accuracy chance criteria - 1

The classification matrix for this problem computed by SPSS is shown below. Note that this table also includes the information for classification of the holdout sample that we will use in validating the model.   Note further that these classification results do not agree with the text; I can only get the same classification table as the text if I substitute separate group covariance matrices instead of the pooled covariance matrix in classification.


Slide 33

Classification accuracy chance criteria - 2

The classification results contain two parts. The "Original" classification is the classification in which all cases in the analysis are classified by functions created using all cases in the sample. It is generally regarded as an "optimistic" accuracy rate which overestimates the accuracy of the model.

One strategy for computing a more realistic accuracy rate is produced by the "Leave-one-out" classification option which we specified in our SPSS commands. The results are shown as the "Cross-validated" component of the classification results. This method is computed by sequentially holding out one case from the analysis and using the remaining cases to derive the discriminant functions used to classify that case. This method is repeated for all cases in the analysis, and the resulting model accuracy is usually regarded as a less biased measure of model accuracy. This method is often applied in analyses of small data sets where there are insufficient cases to do a 50% holdout sample.

The best estimate of the accuracy rate that we would expect to find in the population is the rate obtained from the holdout sample, where none of the cases classified were included in the computation of the discriminant functions.  To maintain consistency with the text, we will use the holdout accuracy rate (85%) in our analysis as the measure of overall model fit.

If we based the discriminant analysis on the full data set, i.e. did not eliminate holdout cases, we would use the accuracy rate for the cross-validated cases (86.7%) in our analysis of overall model fit.

We compare the holdout accuracy rate to each of the by chance accuracy rates.

Slide 34

Classification accuracy chance criteria - 3

The proportional chance criteria for assessing model fit is calculated by summing the squared proportion that each group represents of the sample.  Using the probabilities from the output 'Prior Probabilities for Groups', the calculation in this case is (0.367 x 0.367) + (0.633 x 0.633) = 0.536.

Based on the requirement that model accuracy be 25% better than the chance criteria, the standard to use for comparing the model's accuracy is 1.25 x 0.536 = 0.67. Our model accuracy rate of 85% exceeds this standard.

The maximum chance criteria is the proportion of cases in the largest group, 63.3% in this problem. Based on the requirement that model accuracy be 25% better than the chance criteria, the standard to use for comparing the model's accuracy is 1.25 x 0.633 = 0.79. Our model accuracy rate of 85% exceeds this standard.
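Both chance criteria are simple arithmetic on the group proportions; a quick check of the figures above:

```python
# Chance criteria from the 'Prior Probabilities for Groups' output.
p_spec, p_value = 0.367, 0.633   # group proportions in the sample

proportional_chance = p_spec ** 2 + p_value ** 2   # about 0.535 (0.536 in the text's rounding)
maximum_chance = max(p_spec, p_value)              # 0.633

# The model should beat each criterion by 25%.
prop_standard = 1.25 * proportional_chance          # about 0.67
max_standard = 1.25 * maximum_chance                # about 0.79

model_accuracy = 0.85   # holdout accuracy rate used in this analysis
assert model_accuracy > prop_standard and model_accuracy > max_standard
```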


Slide 35

Press's Q statistic

Substituting the values for this problem (60 cases, 53 correct classifications, and 2 groups) into the formula for Press's Q statistic, we obtain a value of [60 - (53 x 2)]^2 / [60 x (2 - 1)] = 35.3. This value exceeds the critical value of 6.63 (Text, page 290), so we conclude that the prediction accuracy is greater than that expected by chance.
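The Press's Q computation can be written as a small function for reuse:

```python
def press_q(n_total, n_correct, n_groups):
    """Press's Q = [N - (n * K)]^2 / [N * (K - 1)], where N is the sample
    size, n the number of correct classifications, and K the group count."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))

q = press_q(60, 53, 2)          # about 35.3 for this problem
better_than_chance = q > 6.63   # compare to the chi-square critical value
```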

By all three criteria, we would interpret our model as having an accuracy above that expected by chance. Thus, this is a valuable or useful model that supports predictions of the dependent variable.


Slide 36

Presence of outliers - 1

SPSS prints Mahalanobis distance scores for each case in the table of Casewise Statistics, so we can use these as a basis for detecting outliers.

According to the SPSS Applications Guide, p. 227, cases with large values of the Mahalanobis distance from their group mean can be identified as outliers. For large samples from a multivariate normal distribution, the square of the Mahalanobis distance from a case to its group mean is approximately distributed as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. The critical value of chi-square with 3 degrees of freedom (the stepwise procedure entered three variables in the function) and an alpha of 0.01 (we only want to detect major outliers) is 11.345.

We can request this figure from SPSS using the following compute command:

COMPUTE mahcutpt = IDF.CHISQ(0.99,3).

EXECUTE.

Here, 0.99 is the cumulative probability up to the significance level of interest and 3 is the number of degrees of freedom. SPSS will create a column of values in the data set that contains the desired value.

We scan the table of Casewise Statistics to identify any cases that have a Squared Mahalanobis distance greater than 11.345 for the group to which the case is most likely to belong, i.e. under the column labeled 'Highest Group.'
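Outside SPSS, the same cutoff can be approximated with only the standard library using the Wilson-Hilferty approximation to the chi-square quantile. A sketch, with made-up case distances for illustration:

```python
from statistics import NormalDist

def chi2_critical(p, df):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)       # standard normal quantile
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3

cutoff = chi2_critical(0.99, 3)       # ~11.35 (exact value: 11.345)

# Flag cases whose squared Mahalanobis distance exceeds the cutoff
distances = {1: 2.1, 2: 12.8, 3: 5.4}     # hypothetical case distances
outliers = [case for case, d2 in distances.items() if d2 > cutoff]
```

The approximation is accurate to about two decimal places here, which is more than adequate for flagging major outliers.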


Slide 37

Presence of outliers - 2

In this particular analysis, I do not find any cases with a large enough Mahalanobis distance to indicate that they are outliers.


Slide 38

Stage 5: Interpret the Results

In this section, we address the following issues:

•Number of functions to be interpreted
•Assessing the contribution of predictor variables
•Impact of multicollinearity on solution


Number of functions to be interpreted

As indicated above, there is one significant discriminant function to be interpreted. If we had not found at least one statistically significant discriminant function, it would mean that we could not distinguish between the groups of the dependent variable based upon the independent variables, and our analysis would be concluded.


Slide 39

Role of functions in differentiating categories of the dependent variable

We had requested a "combined-groups" scatterplot in the specifications for the analysis. This graphic is useful in identifying the relationship between the discriminant functions and the groups on the dependent variable, i.e. which function differentiates between which groups.

SPSS responded with the warning message that the "All-Groups Stacked Histogram is no longer displayed." The All-Groups Stacked Histogram is the substitute that SPSS uses to visually display the distribution of cases by discriminant function score and group membership. It is not clear why this plot is unavailable, since a comparable plot is provided in the output for logistic regression. We will demonstrate the use of this plot when we present the three-group example.

With only two groups on the dependent variable and one function, a statistically significant function will obviously differentiate between the two groups on the dependent variable.


Slide 40

Identifying the statistically significant predictor variables

Assessing the contribution of predictor variables - 1


Discriminant analysis does not have a statistical test of the coefficients of individual independent variables comparable to multiple regression. To get this information, it is necessary to use a stepwise method for variable selection, where variables are added to or removed from the model if they meet a statistical criterion for increasing the differentiation between the groups, similar to the R-square change test in multiple regression. This is not a test of individual variables, but a test of combinations of variables.

In the table of Variables "Entered/Removed" shown below, at step 1, X7 'Product Quality' had the strongest individual relationship with the dependent variable groupings. At step 2, X3 'Price Flexibility', when combined with X7 'Product Quality', had the strongest overall relationship with the dependent variable groupings. At step 3, X1 'Delivery Speed', when combined with X3 'Price Flexibility' and X7 'Product Quality', had the strongest overall relationship with the dependent variable groupings. At step 4, none of the remaining variables was able to increase the discrimination between the dependent variable groupings.


Slide 41

Assessing the contribution of predictor variables - 2

If we look at the output for the one-way ANOVA tests for the individual variables, we see that there was a statistically significant difference between the dependent variable groups for the three variables included in the discriminant analysis, but we also note that there are significant differences for other variables that did not enter the discriminant function: X2 'Price Level' and X5 'Service.'


Slide 42

Assessing the contribution of predictor variables - 3

Importance of Variables and the Structure Matrix

There are a variety of methods for determining which predictor variables are more important in predicting group membership. When we use a stepwise method of variable selection, as we did in this problem, we can simply look at the order in which the variables entered, as shown in the following table.  (Note that there is not universal agreement that stepwise discriminant analysis results should be used for the purpose of determining relative predictor importance. See Carl J. Huberty, Applied Discriminant Analysis, page 127.)


Thus, we see that product quality, price flexibility, and delivery speed are the three most important predictors.


Slide 43

Assessing the contribution of predictor variables - 4

While we know which variables were important to the overall analysis, we are also concerned with which variables are important on which discriminant function. This information is provided by the structure matrix, which is a rotated correlation matrix containing the correlations between each of the independent variables and the discriminant function scores.

Importance of variables is indicated by the relative size of the absolute value of the coefficient, with more important variables having larger coefficients. In a two-group, one-function problem, the statistically significant variables are all important to function 1.

The structure matrix becomes a more important tool in determining the relative importance of the predictor variables in analyses that have more than one discriminant function.
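Since a structure coefficient is simply the correlation between a predictor and the discriminant function scores, it can be computed directly. A minimal sketch, where both the scores and the predictor values are made up for illustration:

```python
def corr(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

scores = [-1.2, -0.4, 0.5, 1.1]   # hypothetical discriminant function scores
quality = [3.0, 4.0, 6.0, 7.0]    # hypothetical values of one predictor
loading = corr(quality, scores)   # structure coefficient for this predictor
```

With more than one function, the same correlation would be computed against each function's scores, giving one column of the structure matrix per function.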

Three cautions should be observed about the structure matrices as presented by SPSS.

First, the structure matrix includes all variables, including those that do not have a statistically significant relationship to the dependent variable.


Slide 44

Assessing the contribution of predictor variables - 5

Second, the signs of the coefficients in the structure matrix are arbitrary. The sign indicates the direction of the relationship to the discriminant function scores; it does not indicate a direct or inverse relationship to the dependent variable as it does in multiple regression. Therefore, the sign of the coefficients should not be interpreted. We will look at patterns of group means across groups when we want to talk more specifically about the role of a variable in determining group membership.

Third, in structure matrices with more than one function, SPSS prints an asterisk next to the largest correlation for each variable. This asterisk does not indicate statistical significance as it does in other correlation matrices; it only indicates which function has the larger correlation with the variable. A variable could be a poor predictor for all groups and still display an asterisk in one of the function columns.


Slide 45

Assessing the contribution of predictor variables - 6


Comparing Group Means to Determine Direction of Relationships

We can determine the pattern of the relationships between the dependent variable groups and the independent variables by examining the pattern of means for the predictor variables, using the Group Statistics table produced by SPSS. The variables which were statistically significant in this analysis are highlighted.

From this output, we see that the mean for Product Quality is higher for the specification buying group, indicating the greater importance of this perception in that buying situation. The means for the two other statistically significant variables, Delivery Speed and Price Flexibility, are higher for the total value analysis group, indicating the greater importance of these variables in that buying situation. Another way to state this: if product quality is more important to you than delivery speed and price flexibility, you are more likely to favor specification buying.
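Comparing group means is a simple computation. A sketch with hypothetical ratings (the group labels follow the example, but the numbers do not come from the data set):

```python
from statistics import mean

# Hypothetical ratings keyed by buying-situation group
quality = {"specification": [8.1, 7.9, 8.4], "total value": [6.2, 6.8, 6.5]}
delivery = {"specification": [3.1, 2.9, 3.4], "total value": [4.8, 5.1, 4.6]}

# The group with the higher mean is the one for which the variable matters more
higher_quality = max(quality, key=lambda g: mean(quality[g]))
higher_delivery = max(delivery, key=lambda g: mean(delivery[g]))
```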


Slide 46

Impact of Multicollinearity on solution

Multicollinearity is indicated in SPSS output for discriminant analysis by very small tolerance values for variables, e.g. less than 0.10 (0.10 is the size of the tolerance, not its significance value). If we look at the table of Variables Not in the Analysis, we see that the smallest tolerance for any excluded variable is 0.570, supporting the conclusion that multicollinearity is not a problem in this analysis.
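Tolerance is 1 minus the squared multiple correlation of a predictor with the other predictors; in the two-predictor case this reduces to 1 - r^2. A minimal sketch with made-up scores:

```python
def tolerance_two(x1, x2):
    """Tolerance of one predictor given a single other predictor: 1 - r^2."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r = cov / (v1 * v2) ** 0.5      # Pearson correlation
    return 1.0 - r ** 2

tol = tolerance_two([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])  # 0.64
```

A value well above 0.10, like this one, would suggest multicollinearity is not a concern for that predictor.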


Slide 47

Stage 6: Validate The Model

In this section, we address the following issues:

•Generalizability of the discriminant model


Generalizability of the discriminant model

The authors use the classification accuracy for the cases not selected for the analysis, 85.0% (34/40), as evidence that the model is valid and can be generalized to the population from which the sample was drawn. 

While this is acceptable for a textbook example, in the future we will use the split-sample validation technique parallel to that used for multiple regression and logistic regression.