chapter 1: associations 1.1 introduction to categorical data 1.2 examining associations among...

Chapter 1: Associations

1.1 Introduction to Categorical Data

1.2 Examining Associations among Variables

1.3 Correspondence Analysis

1.4 Recursive Partitioning

1


1.1 Introduction to Categorical Data




2

Objectives

Recognize the differences between categorical and continuous data analysis.

Identify the scale of measurement for your response variable.

Examine the distribution of categorical data.

3

Categorical Data

Categorical data represents categories, classes and classifications, groups, or qualitative characteristics or attributes.
– respondent gender (male or female)
– product disposition (conforming or nonconforming)
– patient mortality (survived or died)

Continuous data represents measurements.
– length, time, temperature, concentration

Categorical data is qualitative; continuous data is quantitative.

Categorical data values are discrete, and the distance between categories is unknown.

4

Categorical Response

The methods presented in this course are appropriate for a response (dependent variable) that is categorical.
– Methods such as the Student t-test, a two-way analysis of variance (ANOVA), or multiple least squares linear regression are not appropriate.

The explanatory variables (independent or predictor variables) can be continuous or categorical.
– The nature of the explanatory variable can also determine which methods are appropriate.

5

Probability

The analysis or modeling of a continuous response directly applies to the value or measurement itself.
– This approach is not possible for a categorical response.

The analysis or modeling of a categorical response is based on the proportion or probability of each level.

6

Common Applications

Medicine, epidemiology, and public health

Sociology and behavioral science

Marketing and demographics

Political science

Quality and Six Sigma

7

1.01 Multiple Answer Poll

What is your area of application for categorical data analysis?

a. Medicine, epidemiology, and public health

b. Sociology and behavioral science

c. Marketing and demographics

d. Political science

e. Quality and Six Sigma

f. Other

9

Data Type for Categorical Data

You might use either the numeric or the character data type to represent categorical data, such as customer satisfaction.
– 1, 2, 3, 4, 5 (a Likert scale)
– Poor, fair, good, very good, excellent

You must use the numeric data type to represent continuous data, such as a physical measurement.

10

Modeling Type for Categorical Data

You must use either the nominal or ordinal modeling type for categorical data.
– Nominal variables contain values without any natural ordering.
  Hair color, gender, political affiliation, or county of residence
– Ordinal variables contain values with a natural order.
  Satisfaction index, income category, or level of education

You must use the continuous modeling type for interval or ratio data.

11

1.02 Multiple Choice Poll

What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender (male or female)?

a. (numeric, continuous) and (character, ordinal)

b. (numeric, ordinal) and (character, continuous)

c. (numeric, continuous) and (character, nominal)

d. (character, nominal) and (numeric, continuous)

13

1.02 Multiple Choice Poll – Correct Answer

c. (numeric, continuous) and (character, nominal)

14

Titanic Example

You will use the Titanic data set to explore the nature of categorical data.
– Class: first, second, or third class passengers, or crew members
– Age: adult or child
– Sex: male or female
– Survived: yes or no

15

This demonstration illustrates the concepts discussed previously.

Categorical Data Example

16

1.03 Multiple Choice Poll

What data type and modeling type are used for the Age variable?

a. Character, ordinal

b. Numeric, nominal

c. Character, nominal

d. Character, continuous

18

1.03 Multiple Choice Poll – Correct Answer

What data type and modeling type are used for the Age variable?

a. Character, ordinal

b. Numeric, nominal

c. Character, nominal

d. Character, continuous

19

Distribution of Continuous Data

Continuous data can take infinitely many values over a given range, limited only by the resolution of the measurement.

The distribution or frequency of these values depends on the process that generates them.
– Many examples can be described by the normal distribution.

The distribution might be asymmetric when values approach a natural boundary.

The distribution might exhibit unusual tails.

21

Distribution Models for Continuous Data

Many mathematical models exist for continuous data.

The model parameters determine the characteristics of the distribution.
– The model is fit to the data by determining the best values for the parameters.

The model can be expressed as functions:
– probability density function (PDF)
– cumulative distribution function (CDF)

Common examples of models are the normal, lognormal, Weibull, Johnson, and gamma distributions.

22

Distribution of Categorical Data

Categorical data can be realized only as discrete values, few or many.

The distribution or frequency of these values depends on the process that generates them.
– Many examples of dichotomous responses can be described by the binomial distribution.

The distribution might not be symmetric.

The distribution of many levels might exhibit unusual tails.

23

Distribution Models for Categorical Data

Many mathematical models exist for categorical data.

The model parameters determine the characteristics of the distribution.
– The model is fit to the data by determining the best values for the parameters.

The model can be expressed as functions:
– probability mass function (PMF)
– cumulative distribution function (CDF)

Common examples of models are the binomial, negative binomial, geometric, hypergeometric, and Poisson distributions.

24

Binomial Distribution Model

The basis for this distribution is a Bernoulli trial.
– There are only two possible outcomes of each trial, generally coded 1 for success or 0 for failure.
– Each individual outcome ($y_i$) is independent of the others, and the probability of outcome 1 is the same in every trial.

The total number of successes (outcomes of 1) is y.

$$y = \sum_{i=1}^{n} y_i$$

25

Binomial Distribution Model

The binomial distribution describes the probability of y, the number of successes, from 0 to n.

The parameters in this model are n, the number of trials, and π, the probability of outcome 1 in each trial.

The expected value (mean) is nπ and the variance is nπ(1 − π) for the binomial distribution.

$$\mathrm{PMF}(y) = \binom{n}{y}\,\pi^{y}\,(1-\pi)^{n-y}$$

26

Example of Binomial Distribution

A college basketball player finished the last season with a record of 77% success making free throws.
– What performance should you expect from this player if her free-throw success rate has not changed?

Specifically, how many baskets should she make in 25 attempts?
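The free-throw question can be worked through numerically. The following is a minimal Python sketch, assuming the binomial model applies with n = 25 attempts and a success probability of 0.77 taken from the slide; the function name `binomial_pmf` is illustrative and is not part of JMP.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

n, pi = 25, 0.77                  # attempts and success rate from the slide
mean = n * pi                     # expected number of baskets
variance = n * pi * (1 - pi)      # binomial variance
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]
most_likely = max(range(n + 1), key=lambda y: pmf[y])

print(f"expected baskets: {mean:.2f}")
print(f"variance: {variance:.4f}")
print(f"most likely count: {most_likely}")
```

The expected value is a long-run average, so it need not be a whole number; the most likely single count is read from the PMF instead.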

27

1.04 Multiple Choice Poll

What is the parameter π in the binomial distribution model?

a. The total number of successes

b. The probability of success in each trial

c. The number of possible outcomes from each trial

d. The proportion of failures in each trial

29

1.04 Multiple Choice Poll – Correct Answer

b. The probability of success in each trial

30

Graphics for Frequency and Proportion

Statistical graphics are designed to help you interpret the data.

The bar chart represents the frequency of each level by the length of its bar.

The mosaic plot represents the proportion of each level by the length of its segment.

31

Multinomial Distribution Model

Some categorical responses have more than two possible values.

The idea of the binomial distribution can be extended to the multinomial distribution.

32

Test Proportions

There might be supposed proportions for each of the categories in the response variable.

The sample can be used to test that supposition.

JMP calls this command Test Probabilities.

When you have a response with more than two levels, enter a probability for only the subset of levels that you want to test, and leave the others blank.
– Enter 1 for all levels to test if they are equal.

33

Chi-Square Test for Proportions

The appropriate test of proportions is based on the chi-square statistic.
– This statistic is covered in detail in the next section.

The test is available for three situations:
– Test whether probabilities are not equal to supposition
– Test whether probabilities are greater than supposed
– Test whether probabilities are less than supposed

34

Poisson Distribution

Sometimes the number of trials is not fixed and there is no practical upper limit.

The response y is the count of events over time.

The Poisson distribution is often a good model for the distribution of y.

This model has a single parameter, λ.

$$\mathrm{PMF}(y) = \frac{e^{-\lambda}\,\lambda^{y}}{y!}$$

35
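As a rough illustration of the Poisson model, the sketch below evaluates the PMF for a few counts. The rate value 3.0 is hypothetical (it is not from the course), and the parameter is written `lam` because `lambda` is a reserved word in Python.

```python
from math import exp, factorial

def poisson_pmf(y: int, lam: float) -> float:
    """Probability of observing exactly y events when the mean count is lam."""
    return exp(-lam) * lam**y / factorial(y)

lam = 3.0  # hypothetical mean count per interval, not from the slides
probs = [poisson_pmf(y, lam) for y in range(11)]
for y, p in enumerate(probs):
    print(f"P(y = {y:2d}) = {p:.4f}")
```

Unlike the binomial case there is no upper limit on y, so the listed probabilities sum to slightly less than 1.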

36


Examining Distributions

Exercise

This exercise reinforces the concepts discussed previously.

38



1.2 Examining Associations among Variables



39

Objectives

Determine whether an association exists among categorical variables.

Perform a stratified analysis of categorical variables.

40

Association

An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes.

If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.

41

No Association

[Figure: Is mood associated with the weather? The split is 72% and 28% in both groups.]

42

Association

[Figure: Is mood associated with the weather? The split is 82% and 18% in one group, 60% and 40% in the other.]

43

44


Recognizing Associations

Marginal Distribution in an Association

The marginal distribution of the response ignores the explanatory variable.

The mosaic plot explores the data without regard to any association.

45

Conditional Distribution in an Association

The conditional distribution of the response describes the frequency of the responses for each level of the explanatory variable.

The mosaic plot explores the data and the possibility of an association.

46

Two-Dimensional Mosaic Plot

This mosaic plot includes the marginal distribution on the right and conditional distribution on the left.

[Figure: mosaic plot with panels labeled conditional (left) and marginal (right).]

47

48


Exploring Associations

1.05 Quiz

Is there an association between the severity of an adverse reaction and the treatment?

50

1.05 Quiz – Correct Answer

No, the distribution of ADR SEVERITY is the same between the two levels of TREATMENT GROUP.

51

Test for Association

The row percentage (proportion or probability) is used to test the association between Survived and Class.

52

Null Hypothesis

H0: There is no association between Survived and Class.

The probability of surviving is the same, regardless of the class of the passenger.

53

Alternative Hypothesis

H1: There is an association between Survived and Class.

The probability of surviving differs among crew and first-, second-, and third-class passengers.

54

Chi-Square Test

The expected frequencies are based on the marginal distribution, or null hypothesis.

NO ASSOCIATION: observed frequencies = expected frequencies

ASSOCIATION: observed frequencies ≠ expected frequencies

55

Expected Frequency

The expected frequency of each cell is based on the marginal distribution (null hypothesis).

It is the product of the marginal proportion of the explanatory variable and the marginal frequency of the response.

0.4021 * 1490 = 599.114 (computed with the unrounded proportion; 0.4021 is shown rounded)

56

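Pearson Chi-Square Statistic

The observed frequency is compared to the expected frequency.

The cell statistics are accumulated into the sample statistic.

(73.886)^2 / 599.114 = 9.112

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Here $O_i$ and $E_i$ are the observed and expected frequencies of cell i, and n is the number of cells.

57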
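The expected frequencies and the Pearson statistic can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the Class-by-Survived counts below come from the widely circulated Titanic table, not from the slides, which quote only the Crew cell (expected 599.114, cell chi-square 9.112). No p-value is computed here.

```python
# Class-by-Survived counts from the widely circulated Titanic table
# (assumed here; the slides do not list the full table).
observed = {
    "First":  {"No": 122, "Yes": 203},
    "Second": {"No": 167, "Yes": 118},
    "Third":  {"No": 528, "Yes": 178},
    "Crew":   {"No": 673, "Yes": 212},
}

row_totals = {r: sum(cells.values()) for r, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for c, count in cells.items():
        col_totals[c] = col_totals.get(c, 0) + count
n = sum(row_totals.values())

# Expected count under no association: row proportion times column total.
expected = {
    r: {c: row_totals[r] / n * col_totals[c] for c in col_totals}
    for r in observed
}

# Each cell contributes (observed - expected)^2 / expected.
cell_chi2 = {
    r: {c: (observed[r][c] - expected[r][c]) ** 2 / expected[r][c] for c in col_totals}
    for r in observed
}
chi2 = sum(v for cells in cell_chi2.values() for v in cells.values())
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"expected Crew/No: {expected['Crew']['No']:.3f}")
print(f"cell chi-square Crew/No: {cell_chi2['Crew']['No']:.3f}")
print(f"Pearson chi-square: {chi2:.1f} on {df} df")
```

With these counts the Crew/No cell matches the two numbers quoted on the slides, which suggests the assumed table is the one used in the course.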

p-Value for Chi-Square Test

This p-value is the
– probability of observing a chi-square sample statistic at least as large as the one actually observed, given that there is no association between the variables
– probability of the association that you observe in the data occurring by chance.

58

Chi-Square Tests

Chi-square tests and the corresponding p-values
– determine whether an association exists
– do not measure the strength of an association
– depend on and reflect the sample size.

59

Agreement

A stronger relationship than an association might be sought when the two variables use the same levels.

Agreement measures the strength of such a relationship.
– Cohen’s kappa, κ, for agreement
– Bowker’s test of symmetry (association)
– McNemar’s test of agreement (Bowker’s test when levels are the same)

60

Trend in Association

Two variables might exhibit a trend in the association between their ordered levels.
– The response has two levels.
– The predictor is ordinal.

The Cochran-Armitage test is available for a trend.

61

62


Chi-Square Test

1.06 Quiz

Is there sufficient evidence that an association exists between adverse effect severity and treatment?

64

1.06 Quiz – Correct Answer

No, the p-value for the Pearson chi-square statistic is 0.7919, so there is insufficient evidence to reject the null (that no association exists) at α=0.05.

65

When Not to Use the Chi-Square Test

When more than 20% of the cells have expected counts less than five

67

Observed versus Expected Values

Observed Values

1 5 8
5 6 7
6 5 6

1 of 9 cells, or 11%, with observed value less than 5

Expected Values

3.43 4.57 6.00
4.41 5.88 7.71
4.16 5.55 7.29

4 of 9 cells, or 44%, with expected value less than 5

68
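The 20% rule is about expected counts, not observed counts, and the check is easy to automate. The sketch below recomputes the expected values for the 3x3 table of observed values shown on the slide; the variable names are illustrative.

```python
observed = [
    [1, 5, 8],
    [5, 6, 7],
    [6, 5, 6],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under no association.
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [e for row in expected for e in row]
small = sum(1 for e in cells if e < 5)
share_small = small / len(cells)

for row in expected:
    print("  ".join(f"{e:5.2f}" for e in row))
print(f"{small} of {len(cells)} cells ({share_small:.0%}) have expected count < 5")
print("chi-square approximation questionable:", share_small > 0.20)
```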

Small Samples – Fisher’s Exact Test

[Figure: sample size scale from small to large, with Fisher’s exact test marked for small samples.]

69

Example: Tea and Milk

Suppose you want to test whether someone can determine whether a cup of tea with milk had the milk poured first or the tea poured first.

70

Fisher’s Exact Test Example

8 Cups of Tea: 4 with Milk First and 4 with Tea First

Predict which cups had tea poured first.

[Figure: 2x2 table of Prepared (M, T) by Test (M, T) with fixed marginal totals of 4 for each row and column.]

71

Basis for Fisher’s Exact Test

The row and column totals are fixed.

Sample (rows: Prepared M, T; columns: Test M, T; every marginal total is 4):

3 1
1 3

Other possible samples:

0 4    1 3    2 2    4 0
4 0    3 1    2 2    0 4

72

Fisher’s Exact Test Hypotheses

Null Hypothesis: There is no association.

Alternative Hypothesis: There is an association.
– Left-tailed
– Right-tailed
– Two-tailed

73

Left-Tailed Alternative Hypothesis

The alternative hypothesis is that the prediction is worse than that by chance.

[Figure: left-tailed p-value, showing the actual table (3 1 / 1 3) with the tables 2 2 / 2 2, 1 3 / 3 1, and 0 4 / 4 0; every marginal total is 4.]

74

Right-Tailed Alternative Hypothesis

The alternative hypothesis is that the prediction is better than that by chance.

[Figure: right-tailed p-value, showing the actual table (3 1 / 1 3) with the table 4 0 / 0 4; every marginal total is 4.]

75

Two-Tailed Alternative Hypothesis

[Figure: two-tailed p-value, showing the actual table (3 1 / 1 3) with the other possible tables 0 4 / 4 0, 1 3 / 3 1, 2 2 / 2 2, and 4 0 / 0 4; every marginal total is 4.]

76
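The three p-values for the tea example can be computed directly, because with fixed margins each possible table has a hypergeometric probability. The sketch below assumes the top-left cell counts the milk-first cups correctly identified, and it uses a common convention for the two-tailed p-value (sum the probabilities of all tables no more likely than the observed one); whether JMP uses exactly this convention is an assumption here.

```python
from math import comb

def table_prob(a: int, row1: int, row2: int, col1: int) -> float:
    """Hypergeometric probability of a 2x2 table with top-left cell a and fixed margins."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

# Tea example: 4 cups prepared milk-first, 4 tea-first; the taster labels 4 as milk-first.
row1, row2, col1 = 4, 4, 4
observed_a = 3  # milk-first cups correctly identified in the sample table

lo, hi = max(0, col1 - row2), min(row1, col1)
probs = {a: table_prob(a, row1, row2, col1) for a in range(lo, hi + 1)}

left = sum(p for a, p in probs.items() if a <= observed_a)
right = sum(p for a, p in probs.items() if a >= observed_a)
# Two-tailed: tables no more probable than the observed one (small tolerance for float ties).
two = sum(p for p in probs.values() if p <= probs[observed_a] * (1 + 1e-9))

print(f"left-tailed  p = {left:.4f}")
print(f"right-tailed p = {right:.4f}")
print(f"two-tailed   p = {two:.4f}")
```

With only 70 equally likely ways to choose 4 cups out of 8, even three correct out of four is not strong evidence of skill, which the right-tailed p-value reflects.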

77


Fisher’s Exact Test

1.07 Quiz

Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05?

79

1.07 Quiz – Correct Answer

The Left test is for the specified hypothesis and the p-value=0.0007 is significant at the α=0.05 level.

80

Stratified Data Analysis

Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable.

Use this analysis when you want to examine the association between two variables within the levels of a third variable.

82

Unstratified Data Analysis

Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.

83

Stratified Data Analysis

Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.

84

Cochran-Mantel-Haenszel Statistics

85

Sample Size for CMH versus Chi-Square

It is recommended that you have either a sample size of 25 for each degree of freedom in the original table, or at least 80% of cells with an expected frequency of at least 5 (same as the unstratified test).

86

1. Correlation of Scores

Test linear association

87

2. Row Scores by Column Categories

Test equal row scores

88

3. Column Scores by Row Categories

Test equal column scores

89

4. General Association of Categories

Test general association

90

1.08 Multiple Choice Poll

Which CMH test is the most appropriate for Survived (nominal, columns) versus Class (ordinal, rows) when stratified by Sex?

a. Row Scores by Column Categories

b. General Association of Categories

c. Correlation of Scores

d. Column Scores by Row Categories

92

1.08 Multiple Choice Poll – Correct Answer

a. Row Scores by Column Categories

Row Scores for ordinal Class by Column Categories of nominal Survived.

93

CMH Statistics and 2x2 Tables

For 2x2 tables, all CMH statistics are equal.

94

When Do CMH Tests Lack Power?

The CMH statistics accumulate over the strata.

If the association is similar in all strata, then the statistics are strengthened.
– This case is easier to detect, and the tests have more power.

If the association changes or reverses across strata, then the statistics are weakened.
– This case is more difficult to detect, and the tests have less power.

95

Concordance and Discordance

A crosstabulation of ordinal data introduces the ideas of concordance and discordance.
– These ideas involve a pair of observations.

The association might exhibit a trend.

A pair is concordant if the observation that is ranked higher on X is also ranked higher on Y.

A pair is discordant if the observation that is ranked higher on X is ranked lower on Y.

A pair is tied if both observations have the same level for X or for Y (or both).
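Counting the pairs by hand is tedious, so a short sketch may help. The Python below classifies every pair of observations, treating as tied any pair that is neither concordant nor discordant, and then computes Goodman-Kruskal gamma as (C − D) / (C + D). The scores are hypothetical and are not data from the course.

```python
from itertools import combinations

def pair_counts(pairs):
    """Count concordant, discordant, and tied pairs of (x, y) ordinal scores."""
    concordant = discordant = tied = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # same ordering on X and Y
        elif direction < 0:
            discordant += 1      # opposite ordering on X and Y
        else:
            tied += 1            # same level on X, on Y, or on both
    return concordant, discordant, tied

# Hypothetical ordinal scores (1 = low, 3 = high); not data from the course.
data = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 1), (3, 3)]
c, d, t = pair_counts(data)
gamma = (c - d) / (c + d)  # gamma ignores tied pairs
print(f"concordant={c}, discordant={d}, tied={t}, gamma={gamma:.3f}")
```

A positive gamma indicates that concordant pairs outnumber discordant ones, that is, an increasing trend.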

96

Measures of Association

Measures of association for ordinal variables serve like the correlation coefficient for continuous variables that exhibit a linear trend.

Gamma: ignores ties

Kendall’s tau-b (τb): corrects for ties

Stuart’s tau-c (τc): corrects for table size and ties

Somers’ D: asymmetric modification of tau-b

Lambda: measures improvement in predicting Y, given X; two asymmetric forms

Uncertainty Coefficient U: proportion of uncertainty explained

97

98


CMH Tests

100

Exercise


1.09 Multiple Choice Poll

The Correlation of Scores CMH test has which null hypothesis?

a. There is no linear association between the row and column variables in any stratum.

b. The mean scores for each column are equal in each stratum.

c. The mean scores for each row are equal in each stratum.

d. There is no association between the row and column variables in any stratum.

102

1.09 Multiple Choice Poll – Correct Answer

a. There is no linear association between the row and column variables in any stratum.

103




1.3 Correspondence Analysis


104

Objectives

Explain how correspondence analysis can help you find associations.

Perform a simple correspondence analysis.

Interpret a correspondence plot.

105

What Is Correspondence Analysis?

Correspondence analysis is a data analysis technique that enables you to
– display the associations between the levels of two or more categorical variables graphically
– extract information from a frequency table with many levels for the rows and columns.

106

Row and Column Profiles

Row and column percentages are used to obtain row and column profiles.
– Row % gives the row profile.
– Column % gives the column profile.

[Figure: contingency table with columns A, B, and C showing the row % and column % in each cell.]

107
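Row and column profiles are simply the row and column percentages of the frequency table. The sketch below computes both for a small table; the counts are hypothetical and chosen so that rows 1 and 2 have the same profile while row 3 differs.

```python
# Hypothetical frequency table (rows 1-3, columns A-C); counts are illustrative only.
table = {
    "1": {"A": 20, "B": 30, "C": 50},
    "2": {"A": 10, "B": 15, "C": 25},
    "3": {"A": 40, "B": 10, "C": 10},
}
columns = ["A", "B", "C"]

row_totals = {r: sum(cells.values()) for r, cells in table.items()}
col_totals = {c: sum(table[r][c] for r in table) for c in columns}

# Row profile: each row's counts as proportions of its row total.
row_profiles = {r: {c: table[r][c] / row_totals[r] for c in columns} for r in table}
# Column profile: each column's counts as proportions of its column total.
col_profiles = {c: {r: table[r][c] / col_totals[c] for r in table} for c in columns}

for r, profile in row_profiles.items():
    print(r, {c: f"{p:.1%}" for c, p in profile.items()})
```

Rows with similar profiles are the ones whose points would be expected to fall close together in a correspondence plot.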

Example

Data collected for these two categorical variables:
– Mental health status (well, mild symptom formation, moderate symptom formation, or impaired)
– Parent socioeconomic status (A through F)

Is there an association?

Which levels of each variable are associated?

108

Correspondence Plot

Rows A and B have similar profiles. Their points are close together and fall away from the origin in the same direction.

The profile for Row F is different. Its point falls away from the origin in a different direction.

109

Association

Rows A and B and Column Well fall in approximately the same direction from the origin, and are relatively close to one another.

110

1.10 Multiple Answer Poll

In correspondence analysis, which of the following are true? (Choose all answers that apply.)

a. Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles.

b. Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles.

c. Row and column points that fall in the same direction away from the origin indicate that they have an association.

112

1.10 Multiple Answer Poll – Correct Answers

In correspondence analysis, which of the following are true? (Choose all answers that apply.)

a. Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles.

b. Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles.

c. Row and column points that fall in the same direction away from the origin indicate that they have an association.

113

Sample Data Set

[Figure: MOVIES data set with the variables AGE and GENDER and the movie categories ACTION, MYSTERY, COMEDY, SPORTS, ROMANCE, SCI-FI, HORROR, DRAMA, and FAMILY.]

114

Analysis Approaches

You want to perform an analysis that takes into account the three variables Movie, Age, and Gender. There are several approaches.
– Analyze a two-way table where the columns correspond to the levels of Movie and the rows correspond to combinations of the levels of Age and Gender.
– Treat Gender as a stratification variable and analyze males and females separately.

115

116


Correspondence Analysis

118

Exercise


1.11 Quiz

Ice cream brands A through D are tested by a panel, and rated from 1 through 9 (with 9 as the best score). What can you conclude from the Correspondence Analysis?

120

1.11 Quiz – Correct Answer

Answers will vary.

121





1.4 Recursive Partitioning

122

Objectives

Define recursive partitioning.

Understand the splitting criteria used in JMP.

Review algorithm parameters available in JMP.

Use the Partition platform in JMP.

123

Recursive Partitioning

Recursive partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y) and maximizing the difference in the response of the groups.

Successive splits produce a structure of rules and groups known as a decision tree, a model of the data.
– Splits are binary.
– The reverse of splitting is pruning.

The tree helps interpret the associations in the data.

124

Split into New Groups

What factors determine the country from which cars are purchased?

[Figure: the Country node (n=303) is split on size into size(Large), n=42, and size(Medium, Small), n=261.]

125

Model Metrics

R² represents the proportion of uncertainty in the data that has been accounted for by the explanatory variables.
– Larger R² is better.

The corrected Akaike’s Information Criterion (AICc) measures the decrease in the uncertainty but adds a penalty for excessive splitting.
– Smaller AICc is better.

126

Splitting Metrics

Candidate G² measures the change in the entropy.
– Larger G² values are better.

Candidate LogWorth is the negative base-10 logarithm of the p-value for the likelihood ratio chi-square.
– Larger LogWorth values are better.
– Monte Carlo simulation adjusts the p-value.

The criterion for the best split is LogWorth.
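To make the two splitting metrics concrete, the sketch below computes a candidate G² as the drop in −2 log-likelihood when a node is split in two, and then an unadjusted LogWorth as −log10 of the chi-square p-value. The counts are hypothetical, the function names are illustrative, and the p-value is not adjusted the way JMP adjusts it, so the result should not be expected to match JMP output exactly.

```python
from math import erfc, log, log10, sqrt

def node_g2(counts):
    """-2 * log-likelihood of a node under its own observed proportions."""
    n = sum(counts)
    return -2.0 * sum(k * log(k / n) for k in counts if k > 0)

def candidate_g2(left, right):
    """Drop in G^2 when a parent node is split into left and right children."""
    parent = [l + r for l, r in zip(left, right)]
    return node_g2(parent) - (node_g2(left) + node_g2(right))

# Hypothetical counts of a two-level response (yes, no) in each child node.
left, right = [45, 15], [15, 25]
g2 = candidate_g2(left, right)

# For a two-level response and a binary split the test has 1 degree of freedom,
# where the chi-square upper tail equals erfc(sqrt(x / 2)).
p_value = erfc(sqrt(g2 / 2.0))
logworth = -log10(p_value)  # unadjusted; JMP reports an adjusted p-value
print(f"G^2 = {g2:.3f}, p = {p_value:.2e}, LogWorth = {logworth:.2f}")
```

The candidate G² computed this way equals the likelihood ratio chi-square of the 2x2 table formed by the two child nodes and the two response levels.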

127

Partition Algorithm: Calculate Split Metric

[Figure: LogWorth for candidate splits on size.]

128

Partition Algorithm: Find Best Cutting Point

[Figure: LogWorth for candidate splits on size, with the best split marked.]

129

Partition Algorithm: Calculate for Other Variables

[Figure: LogWorth for candidate splits on type.]

130

Partition Algorithm: Compare the Best Splits

[Figure: the best split on type compared with the best split on size.]

131

Partition Algorithm: Partition with Best Split

132

Partition Algorithm: Repeat within Partitions
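The steps named on these slides (score candidate splits for each variable, compare the best ones, partition, and repeat) can be sketched in a few dozen lines. This is a simplified illustration, not JMP's implementation: it ranks splits by the raw G² gain rather than LogWorth, it considers only "one level versus the rest" splits, and the car records are hypothetical.

```python
from math import log

def node_g2(counts):
    """-2 * log-likelihood of a node under its own response proportions."""
    n = sum(counts.values())
    return -2.0 * sum(k * log(k / n) for k in counts.values() if k > 0)

def tally(rows, response):
    """Count the rows at each level of the response."""
    counts = {}
    for row in rows:
        counts[row[response]] = counts.get(row[response], 0) + 1
    return counts

def best_split(rows, predictors, response, min_size=1):
    """Return (gain, predictor, level) for the best one-level-versus-rest split."""
    best = None
    parent = node_g2(tally(rows, response))
    for var in predictors:
        for level in sorted({row[var] for row in rows}):
            left = [r for r in rows if r[var] == level]
            right = [r for r in rows if r[var] != level]
            if len(left) < min_size or len(right) < min_size:
                continue
            gain = parent - node_g2(tally(left, response)) - node_g2(tally(right, response))
            if best is None or gain > best[0]:
                best = (gain, var, level)
    return best

def partition(rows, predictors, response, depth=0, max_depth=2):
    """Print a small tree by splitting recursively on the best candidate."""
    split = best_split(rows, predictors, response)
    print("  " * depth + f"n={len(rows)} {tally(rows, response)}")
    if depth == max_depth or split is None or split[0] <= 1e-12:
        return
    _, var, level = split
    for side in ([r for r in rows if r[var] == level], [r for r in rows if r[var] != level]):
        partition(side, predictors, response, depth + 1, max_depth)

# Hypothetical car records; the variable names echo the slides, the values do not.
cars = (
    [{"size": "Large", "type": "Family", "country": "American"}] * 8
    + [{"size": "Small", "type": "Sporty", "country": "Japanese"}] * 6
    + [{"size": "Medium", "type": "Family", "country": "Japanese"}] * 4
    + [{"size": "Medium", "type": "Sporty", "country": "European"}] * 5
)
print(best_split(cars, ["size", "type"], "country"))
partition(cars, ["size", "type"], "country")
```

A pure node stops splitting on its own because no candidate leaves rows on both sides; `max_depth` and `min_size` are the only other stopping controls in this sketch.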

133

Under-fitting and Over-fitting

Under-fitting is a situation where too few splits are used and prediction suffers.
– The uncertainty could be reduced further.

Over-fitting is a situation where too many splits are used and prediction also suffers.
– The model incorporates features of random noise in the data, which will not recur in new data.

Both problems adversely affect model predictions.

134

Crossvalidation

Crossvalidation attempts to find the optimum number of splits.

The sample data are divided into groups.

One group is designated as the hold-out set.
– It is not used to train (fit) the model (tree).
– It is used for predictions (as if it were future cases).

The other group is used to train the model.

JMP offers two methods of crossvalidation.
– K-fold crossvalidation
– Excluded rows

135

K-fold Crossvalidation

Divide the data into k groups.

Designate one group as the hold-out set.

Designate the other groups for making the tree.

Rotate the roles of the training groups and the hold-out set until all groups have been held out once.

Combine the statistics of the hold-out sets.
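The rotation of training and hold-out groups can be sketched as follows. This Python example only builds the row-index groups (it does not fit a tree), the sample size of 23 rows is hypothetical, and the function name `k_fold_indices` is illustrative rather than a JMP feature.

```python
import random

def k_fold_indices(n_rows: int, k: int, seed: int = 0):
    """Yield (training, hold_out) index lists so that each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k groups of nearly equal size
    for i, hold_out in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, hold_out

# Hypothetical sample of 23 rows divided into k = 5 groups.
splits = list(k_fold_indices(23, 5))
for fold_number, (training, hold_out) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train on {len(training)} rows, hold out {len(hold_out)}")
```

In a full analysis, a tree would be fit to each training set and scored on the matching hold-out set, and the hold-out statistics would then be combined.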

136

Evaluate Crossvalidation

Specify the number of groups, k.
– The default is 5 groups.

The -2LogLikelihood measures the decrease in the uncertainty from the overall probabilities.

K-fold crossvalidation can lead to over-fitting.

137

Crossvalidation by Excluded Rows

A portion of the sample is randomly selected.

Exclude these rows to make the hold-out set.

The other rows are used to make the tree.

There is no universal rule for the size of the portion for the hold-out set.
– 25% to 50%

138

Stopping Rule

You can avoid repeatedly clicking the Split button by clicking the Go button that appears when crossvalidation is used.

The Partition platform continues to split until the R² value for the validation data is better than what the next 10 splits would obtain.

The R² for the training and the validation data is presented in a run chart in the Split History report.

139

Akaike’s Information Criterion

It is a popular and rigorous criterion for comparing models.

It is based on the likelihood of the data under the current model (partition).

It includes a penalty for over-fitting.

It includes a correction for small samples.

Smaller values suggest better models.

$$\mathrm{AICc} = -2\,\mathrm{Log}\,L + \underbrace{2k}_{\text{penalty}} + \underbrace{\frac{2k(k+1)}{n-k-1}}_{\text{correction}}$$

140
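The corrected criterion, AICc = −2 Log L + 2k + 2k(k + 1) / (n − k − 1), is simple to evaluate once the log-likelihood, the number of parameters k, and the sample size n are known. The sketch below uses hypothetical log-likelihoods for two trees; what JMP counts as k for a partition model is not stated in the slides, so the k values here are assumptions.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected AIC: -2 log L + 2k + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    penalty = 2 * k
    correction = 2 * k * (k + 1) / (n - k - 1)
    return -2.0 * log_likelihood + penalty + correction

# Hypothetical log-likelihoods for two trees fit to the same 100 rows.
small_tree = aicc(log_likelihood=-60.0, k=3, n=100)
large_tree = aicc(log_likelihood=-58.5, k=9, n=100)
print(f"small tree AICc = {small_tree:.2f}")
print(f"large tree AICc = {large_tree:.2f}")
```

In this illustration the larger tree fits slightly better but pays more in penalty and correction, so the smaller tree has the smaller (better) AICc.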

Special Cases

Limit the splitting by specifying the smallest group size.
– Default minimum size is 5 cases.

Outliers form their own nodes and do not interfere with the rest of the tree.

Linear relationships with continuous explanatory variables might require very many splits to adequately model the effects.

141

Missing Data

A missing response causes the entire case to be excluded unless you enable the Missing Value Categories option when launching Partition.
– A new response level is added for missing values.

A missing categorical explanatory variable is imputed (random selection of other levels) or a new category is created for missing values.

A missing continuous explanatory variable is randomly assigned to one of the two splits.

142

Evaluate Model: ROC Curve

The receiver operating characteristic (ROC) curve evaluates the ability of the model to distinguish the levels of the response.

It is based on the sensitivity (true positive rate) and 1 – specificity (false positive rate).

143

Sensitivity

The sensitivity is the probability or rate of a true positive prediction of the given level.

For this example, if the model predicts Survived=no for 992 cases out of 1004 cases where it is true, then the sensitivity is 0.988 or 98.8%.

The sensitivity should be near 1.

144

Specificity

The specificity is the probability or rate of a true negative prediction of the given level.

For this example, if the model does not predict Survived=no for 184 cases out of 494 cases where it is not true, then the specificity is 0.37 or 37%.

1 – specificity, or the false positive rate, should ideally be near 0.
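The two rates follow directly from the counts quoted on the sensitivity and specificity slides. A minimal sketch, with variable names chosen for illustration:

```python
# Counts quoted on the slides for the level Survived=no.
true_positives = 992      # predicted "no" when the response is "no"
actual_positives = 1004   # all cases where the response is "no"
true_negatives = 184      # not predicted "no" when the response is not "no"
actual_negatives = 494    # all cases where the response is not "no"

sensitivity = true_positives / actual_positives
specificity = true_negatives / actual_negatives
false_positive_rate = 1 - specificity

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"false positive rate = {false_positive_rate:.3f}")
```

Here the model is very sensitive for Survived=no but has a high false positive rate, so it labels many survivors as non-survivors.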

145

Evaluate Model: ROC Curve

Rank order the fitted probabilities for the response.

For each row, move up if the response is correct, move right if the response is wrong.

146

Area under the Curve

The area under the ROC curve (AUC) measures the goodness of fit for the tree to the data.

A general rule for interpretation of AUC:

Result              Discrimination
AUC = 0.5           None
0.7 < AUC < 0.8     Acceptable
0.8 < AUC < 0.9     Excellent
AUC > 0.9           Outstanding

147
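The ROC construction (rank by fitted probability, step up for a correct response, step right for a wrong one) and the area under the resulting curve can be sketched as follows. The scores and labels are hypothetical, and the sketch assumes no tied scores; tied fitted probabilities would need to be handled as a single diagonal step.

```python
def roc_points(scores, labels):
    """Trace the ROC curve: sort by score, step up for a positive, right for a negative."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tpr = fpr = 0.0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            tpr += 1 / positives   # move up: sensitivity increases
        else:
            fpr += 1 / negatives   # move right: false positive rate increases
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical fitted probabilities and true responses (1 = the level of interest).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
area = auc(roc_points(scores, labels))
print(f"AUC = {area:.4f}")
```

The AUC equals the proportion of (positive, negative) pairs in which the positive case receives the higher fitted probability.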

Evaluate Model: Lift Curve

Shows performance of tree predictions.

Orders cases by predicted probability.

Compares proportion of cases with one response level in a given portion to proportion of cases with this response overall.

148

Evaluate Model: Confusion Matrix

The actual response is compared to the predicted response from the model in the confusion matrix.

A model that predicts better than chance has more cases on the diagonal than off the diagonal.

This example shows a no response that is predicted well and a yes response that is not predicted well.

The confusion matrix is not useful for model selection when the marginal distribution is not near a probability of 0.5 for both levels.

149

150


Recursive Partitioning

152

Exercise


1.12 Quiz

In which leaf, and on what variable, will JMP next split?

154

1.12 Quiz – Correct Answer

Of the leaves, the highest LogWorth is for Age (0.7313), in the Gender(Female) leaf. This is where JMP will next split.

155
