Chapter 1: Associations 1.1 Introduction to Categorical Data 1.2 Examining Associations among Variables 1.3 Correspondence Analysis 1.4 Recursive Partitioning 1

Upload: leslie-miller

Post on 16-Jan-2016


Chapter 1: Associations

1.1 Introduction to Categorical Data

1.2 Examining Associations among Variables

1.3 Correspondence Analysis

1.4 Recursive Partitioning


Chapter 1: Associations

1.1 Introduction to Categorical Data

1.2 Examining Associations among Variables

1.3 Correspondence Analysis

1.4 Recursive Partitioning


Objectives
– Recognize the differences between categorical and continuous data analysis.
– Identify the scale of measurement for your response variable.
– Examine the distribution of categorical data.


Categorical Data
Categorical data represents categories, classes and classifications, groups, or qualitative characteristics or attributes.
– respondent gender (male or female)
– product disposition (conforming or nonconforming)
– patient mortality (survived or died)
Continuous data represents measurements.
– length, time, temperature, concentration
Categorical data is qualitative; continuous data is quantitative.
Categorical data values are discrete and the distance between categories is unknown.


Categorical Response
The methods presented in this course are appropriate for a response (dependent variable) that is categorical.
– Methods such as the Student t-test, a two-way analysis of variance (ANOVA), or multiple least squares linear regression are not appropriate.
The explanatory variables (independent or predictor variables) can be continuous or categorical.
– The nature of the explanatory variable can also determine which methods are appropriate.


Probability
The analysis or modeling of a continuous response directly applies to the value or measurement itself.
– This approach is not possible for a categorical response.
The analysis or modeling of a categorical response is based on the proportion or probability of each level.
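
As a minimal sketch of this idea, the Python snippet below summarizes a categorical response by the proportion of each level. The response values are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter

# Hypothetical categorical response; the values are illustrative only.
responses = ["survived", "died", "died", "survived", "died"]

def level_proportions(values):
    """Return the proportion of observations at each level."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: count / total for level, count in counts.items()}

proportions = level_proportions(responses)
print(proportions)
```

The proportions sum to 1, which is why they can be read as estimated probabilities of each level.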


Common Applications
– Medicine, epidemiology, and public health
– Sociology and behavioral science
– Marketing and demographics
– Political science
– Quality and Six Sigma


1.01 Multiple Answer Poll
What is your area of application for categorical data analysis?

a. Medicine, epidemiology, and public health

b. Sociology and behavioral science

c. Marketing and demographics

d. Political science

e. Quality and Six Sigma

f. Other


Data Type for Categorical Data
You might use either the numeric or the character data type to represent categorical data, such as customer satisfaction.
– 1, 2, 3, 4, 5 (a Likert scale)
– Poor, fair, good, very good, excellent
You must use the numeric data type to represent continuous data, such as a physical measurement.


Modeling Type for Categorical Data
You must use either the nominal or ordinal modeling type for categorical data.
– Nominal variables contain values without any natural ordering: hair color, gender, political affiliation, or county of residence.
– Ordinal variables contain values with a natural order: satisfaction index, income category, or level of education.
You must use the continuous modeling type for interval or ratio data.


1.02 Multiple Choice Poll
What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender (male or female)?

a. (numeric, continuous) and (character, ordinal)

b. (numeric, ordinal) and (character, continuous)

c. (numeric, continuous) and (character, nominal)

d. (character, nominal) and (numeric, continuous)


1.02 Multiple Choice Poll – Correct Answer
What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender (male or female)?

a. (numeric, continuous) and (character, ordinal)

b. (numeric, ordinal) and (character, continuous)

c. (numeric, continuous) and (character, nominal) (correct answer)

d. (character, nominal) and (numeric, continuous)


Titanic Example
You will use the Titanic data set to explore the nature of categorical data.
– Class: first, second, or third class passengers, or crew members
– Age: adult or child
– Sex: male or female
– Survived: yes or no


This demonstration illustrates the concepts discussed previously.

Categorical Data Example


1.03 Multiple Choice Poll
What data type and modeling type are used for the Age variable?

a. Character, ordinal

b. Numeric, nominal

c. Character, nominal

d. Character, continuous


1.03 Multiple Choice Poll – Correct Answer
What data type and modeling type are used for the Age variable?

a. Character, ordinal

b. Numeric, nominal

c. Character, nominal

d. Character, continuous


Distribution of Continuous Data
Continuous data might be realized as an infinity of values, within an arbitrary level of discreteness, over a given range.
The distribution or frequency of these values depends on the process that generates them.
– Many examples can be described by the normal distribution.
The distribution might be asymmetric when values approach a natural boundary.
The distribution might exhibit unusual tails.


Distribution Models for Continuous Data
Many mathematical models exist for continuous data.
The model parameters determine the characteristics of the distribution.
– The model is fit to the data by determining the best values for the parameters.
The model can be expressed as functions:
– probability density function (PDF)
– cumulative distribution function (CDF)
Common examples of models are the normal, lognormal, Weibull, Johnson, and gamma distributions.


Distribution of Categorical Data
Categorical data might be realized only as discrete values, few or many.
The distribution or frequency of these values depends on the process that generates them.
– Many examples of dichotomous responses can be described by the binomial distribution.
The distribution might not be symmetric.
The distribution of many levels might exhibit unusual tails.


Distribution Models for Categorical Data
Many mathematical models exist for categorical data.
The model parameters determine the characteristics of the distribution.
– The model is fit to the data by determining the best values for the parameters.
The model can be expressed as functions:
– probability mass function (PMF)
– cumulative distribution function (CDF)
Common examples of models are the binomial, negative binomial, geometric, hypergeometric, and Poisson distributions.


Binomial Distribution Model
The basis for this distribution is a Bernoulli trial.
– There are only two possible outcomes of each trial, generally 1 for success or 0 for failure.
– Each individual outcome ($y_i$) is independent of the others, and the probability of the outcome 1 is the same in every trial.
Total number of successes (outcome of 1) is y.

$$y = \sum_{i=1}^{n} y_i$$


Binomial Distribution Model
The binomial distribution describes the probability of y, the number of successes, from 0 to n.
The parameters in this model are n, the number of trials, and π, the probability of outcome 1 in each trial.
The expected value (mean) is nπ and the variance is nπ(1 − π) for the binomial distribution.

$$\mathrm{PMF}(y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y}$$


Example of Binomial Distribution
A college basketball player finished the last season with a record of 77% success making free throws.
– What performance should you expect from this player if her free-throw success rate has not changed?
– Specifically, how many baskets should she make in 25 attempts?
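
The question can be worked through with a short Python sketch of the binomial model, using the 25 attempts and the 77% success rate from the example. The function and variable names are illustrative choices, not part of the course material.

```python
from math import comb

def binomial_pmf(y, n, pi):
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

n, pi = 25, 0.77              # attempts and success rate from the example
mean = n * pi                 # expected number of baskets
variance = n * pi * (1 - pi)  # spread around that expectation
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]
most_likely = max(range(n + 1), key=lambda y: pmf[y])
print(mean, variance, most_likely)
```

The mean answers "how many baskets should she make" on average, while the full list of probabilities shows how much individual sets of 25 attempts can vary around it.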


1.04 Multiple Choice Poll
What is the parameter π in the binomial distribution model?

a. The total number of successes

b. The probability of success in each trial

c. The number of possible outcomes from each trial

d. The proportion of failures in each trial


1.04 Multiple Choice Poll – Correct Answer
What is the parameter π in the binomial distribution model?

a. The total number of successes

b. The probability of success in each trial (correct answer)

c. The number of possible outcomes from each trial

d. The proportion of failures in each trial


Graphics for Frequency and Proportion
Statistical graphics are designed to help you interpret the data.
The bar chart represents the frequency of each level by the length of its bar.
The mosaic plot represents the proportion of each level by the length of its segment.


Multinomial Distribution Model
Some categorical responses have more than two possible values.
The idea of the binomial distribution can be extended to the multinomial distribution.


Test Proportions
There might be supposed proportions for each of the categories in the response variable.
The sample can be used to test that supposition.
JMP calls this command test probabilities.
Enter a probability for only the subset of levels that you want to test, and leave the others blank, when you have a response with more than two levels.
– Enter 1 for all levels to test if they are equal.


Chi-Square Test for Proportions
The appropriate test of proportions is based on the chi-square statistic.
– This statistic is covered in detail in the next section.
The test is available for three situations:
– Test whether probabilities are not equal to supposition
– Test whether probabilities are greater than supposed
– Test whether probabilities are less than supposed
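
A minimal sketch of the statistic behind such a test is shown below in Python. The counts are hypothetical, and rescaling the supposed values so that they sum to 1 is an assumption of this sketch rather than a description of how JMP works internally. The p-value is omitted because it needs the chi-square distribution, which the Python standard library does not provide.

```python
def chi_square_gof(observed, supposed):
    """Pearson chi-square statistic for counts against supposed probabilities."""
    total = sum(observed)
    scale = sum(supposed)  # rescale so the supposed values sum to 1 (assumption)
    expected = [total * p / scale for p in supposed]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [30, 50, 20]   # hypothetical counts for three levels
supposed = [1, 1, 1]      # equal proportions, rescaled to 1/3 each
statistic = chi_square_gof(observed, supposed)
print(statistic)
```

A larger statistic means the observed counts sit further from the supposed proportions.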


Poisson Distribution
Sometimes the number of trials is not fixed and there is no practical upper limit.
The response y is the count of events over time.
The Poisson distribution is often a good model for the distribution of y.
This model has a single parameter, λ.

$$\mathrm{PMF}(y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$$
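
The Poisson probability mass function can be sketched in a few lines of Python. The single parameter is the mean count, named `lam` here, and its value below is hypothetical, chosen only for illustration.

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """Probability of observing exactly y events when the mean count is lam."""
    return lam**y * exp(-lam) / factorial(y)

lam = 3.0  # hypothetical mean count, for illustration only
pmf = [poisson_pmf(y, lam) for y in range(50)]  # counts 0..49 cover nearly all mass
mean = sum(y * p for y, p in enumerate(pmf))
print(pmf[0], mean)
```

Unlike the binomial model, there is no upper limit on y, so the list is truncated where the remaining probability is negligible.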


This demonstration illustrates the concepts discussed previously.

Examining Distributions


Exercise

This exercise reinforces the concepts discussed previously.


Chapter 1: Associations

1.1 Introduction to Categorical Data

1.2 Examining Associations among Variables

1.3 Correspondence Analysis

1.4 Recursive Partitioning


Objectives
– Determine whether an association exists among categorical variables.
– Perform a stratified analysis of categorical variables.


Association
An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes.
If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.


No Association
Is mood associated with the weather?
[Figure: mood split 72% and 28% under each weather condition]


Association
Is mood associated with the weather?
[Figure: mood split 82% and 18% under one weather condition, 60% and 40% under the other]


This demonstration illustrates the concepts discussed previously.

Recognizing Associations


Marginal Distribution in an Association
The marginal distribution of the response ignores the explanatory variable.
The mosaic plot explores the data without regard to any association.


Conditional Distribution in an Association
The conditional distribution of the response describes the frequency of the responses for each level of the explanatory variable.
The mosaic plot explores the data and the possibility of an association.
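
The difference between the marginal and conditional distributions can be sketched in Python. The counts below are hypothetical, and the variable and level names are illustrative only.

```python
# Hypothetical counts: rows are levels of the explanatory variable,
# columns are levels of the response.
table = {
    "sunny": {"good": 82, "bad": 18},
    "rainy": {"good": 60, "bad": 40},
}

def marginal(table):
    """Distribution of the response, ignoring the explanatory variable."""
    totals = {}
    for row in table.values():
        for level, count in row.items():
            totals[level] = totals.get(level, 0) + count
    n = sum(totals.values())
    return {level: count / n for level, count in totals.items()}

def conditional(table):
    """Distribution of the response within each explanatory level."""
    return {
        name: {level: count / sum(row.values()) for level, count in row.items()}
        for name, row in table.items()
    }

print(marginal(table))
print(conditional(table))
```

When the conditional distributions differ from one another, and therefore from the marginal distribution, the data suggest an association.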


Two-Dimensional Mosaic Plot
This mosaic plot includes the marginal distribution on the right and conditional distribution on the left.
[Figure: mosaic plot with panels labeled conditional (left) and marginal (right)]


This demonstration illustrates the concepts discussed previously.

Exploring Associations


1.05 Quiz
Is there an association between the severity of an adverse reaction and the treatment?


1.05 Quiz – Correct Answer
Is there an association between the severity of an adverse reaction and the treatment?

No, the distribution of ADR SEVERITY is the same between the two levels of TREATMENT GROUP.


Test for Association
The row percentage (proportion or probability) is used to test the association between Survived and Class.


Null Hypothesis
H0: There is no association between Survived and Class.
The probability of surviving is the same, regardless of the class of the passenger.


Alternative Hypothesis
H1: There is an association between Survived and Class.
The probability of surviving differs among crew, first, second, and third class passengers.


Chi-Square Test
The expected frequencies are based on the marginal distribution, or null hypothesis.
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies


Expected Frequency
The expected frequency of each cell is based on the marginal distribution (null hypothesis).
It is the product of the marginal proportion of the explanatory variable and the marginal frequency of the response.
0.4021 × 1490 = 599.114


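Pearson Chi-Square Statistic
The observed frequency is compared to the expected frequency.
The cell statistics are accumulated into the sample statistic.
(73.886)² / 599.114 = 9.112

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Here $O_i$ and $E_i$ denote the observed and expected frequencies of cell $i$.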
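
The expected frequencies and the Pearson statistic can be sketched in Python for the Survived by Class table. The counts below are the ones commonly reported for the Titanic data and are an assumption here; the slides confirm only the marginal total of 1490 and the crew cell calculations (599.114 and 9.112).

```python
# Class-by-Survived counts commonly reported for the Titanic data (assumed here).
observed = {
    "first":  {"no": 122, "yes": 203},
    "second": {"no": 167, "yes": 118},
    "third":  {"no": 528, "yes": 178},
    "crew":   {"no": 673, "yes": 212},
}

def pearson_chi_square(table):
    """Return (statistic, expected) for a two-way table of counts."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, count in cells.items():
            col_totals[c] = col_totals.get(c, 0) + count
    n = sum(row_totals.values())
    # Expected count under no association: row total * column total / n.
    expected = {
        r: {c: row_totals[r] * col_totals[c] / n for c in col_totals}
        for r in table
    }
    statistic = sum(
        (table[r][c] - expected[r][c]) ** 2 / expected[r][c]
        for r in table for c in col_totals
    )
    return statistic, expected

statistic, expected = pearson_chi_square(observed)
crew_no = expected["crew"]["no"]
cell = (observed["crew"]["no"] - crew_no) ** 2 / crew_no
print(round(crew_no, 3), round(cell, 3), round(statistic, 2))
```

Under these assumed counts, the crew and "no" cell reproduces the expected frequency and cell statistic quoted on the slides, and the sum over all eight cells gives the sample statistic.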


p-Value for Chi-Square Test
This p-value is the
– probability of observing a chi-square sample statistic at least as large as the one actually observed, given that there is no association between the variables
– probability of an association as strong as the one that you observe in the data occurring by chance alone.


Chi-Square Tests
Chi-square tests and the corresponding p-values
– determine whether an association exists
– do not measure the strength of an association
– depend on and reflect the sample size.
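
The dependence on sample size can be illustrated with a short Python sketch. Both tables below are hypothetical and have identical proportions; only the total count differs.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    return sum(
        (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2) for j in range(2)
    )

small = chi_square_2x2(12, 8, 8, 12)       # hypothetical counts
large = chi_square_2x2(120, 80, 80, 120)   # same proportions, 10 times the sample
print(small, large)
```

The strength of the association is the same in both tables, yet the statistic grows in proportion to the sample size, which is why it should not be read as a measure of strength.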


Agreement
A stronger relationship than an association might be sought when the two variables use the same levels.
Agreement measures the strength of such a relationship.
– Cohen’s kappa, κ, for agreement
– Bowker’s test of symmetry (association)
– McNemar’s test of agreement (Bowker’s test when levels are the same)
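
Cohen's kappa compares the observed proportion of agreement with the agreement expected by chance from the margins. The Python sketch below uses a hypothetical table of counts; the rater labels are illustrative only.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows and columns share levels)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n   # diagonal agreement
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_chance = sum(row_p[i] * col_p[i] for i in range(k))  # agreement by chance
    return (p_observed - p_chance) / (1 - p_chance)

ratings = [[20, 5], [10, 15]]   # hypothetical counts: rater A (rows) by rater B (columns)
kappa = cohen_kappa(ratings)
print(kappa)
```

A value of 0 means agreement no better than chance, and a value of 1 means perfect agreement.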


Trend in Association
Two variables might exhibit a trend in the association between their ordered levels.
– The response has two levels.
– The predictor is ordinal.
The Cochran-Armitage test is available for a trend.


This demonstration illustrates the concepts discussed previously.

Chi-Square Test


1.06 Quiz
Is there sufficient evidence that an association exists between adverse effect severity and treatment?


1.06 Quiz – Correct Answer
Is there sufficient evidence that an association exists between adverse effect severity and treatment?

No, the p-value for the Pearson chi-square statistic is 0.7919, so there is insufficient evidence to reject the null hypothesis (that no association exists) at α=0.05.


When Not to Use the Chi-Square Test
When more than 20% of the cells have expected counts less than five


Observed versus Expected Values

Observed Values
1 5 8
5 6 7
6 5 6
1 of 9 cells, or 11%, with observed value less than 5

Expected Values
3.43 4.57 6.00
4.41 5.88 7.71
4.16 5.55 7.29
4 of 9 cells, or 44%, with expected value less than 5
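
The expected values and the 20% rule can be checked with a short Python sketch that uses the observed values from this slide. The function name is an illustrative choice.

```python
observed = [[1, 5, 8], [5, 6, 7], [6, 5, 6]]  # observed values from the slide

def expected_counts(table):
    """Expected cell counts under no association: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
cells = [e for row in expected for e in row]
small = sum(e < 5 for e in cells)   # cells with expected count below 5
share = small / len(cells)
print(small, round(share, 2), share > 0.20)
```

The rule is applied to the expected counts, not the observed ones, so a table can fail it even when only one observed cell is small.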


Small Samples – Fisher’s Exact Test
[Figure: sample size scale from Small to Large, labeled with Fisher’s Exact Test]


Example: Tea and Milk
Suppose you want to test whether someone can determine whether a cup of tea with milk had the milk poured first or the tea poured first.


Fisher’s Exact Test Example
8 Cups of Tea: 4 with Milk First and 4 with Tea First
Predict which cups had tea poured first.
[Table: Prepared (M, T) in rows by Test (M, T) in columns, with fixed marginal totals of 4 for each row and each column]


Basis for Fisher’s Exact Test
Row and column totals are fixed at 4. Each table lists Prepared (M, T) in rows by Test (M, T) in columns, with rows separated by semicolons.
Sample: [3 1; 1 3]
Other possible samples: [0 4; 4 0], [1 3; 3 1], [2 2; 2 2], [4 0; 0 4]


Fisher’s Exact Test Hypotheses
Null Hypothesis: There is no association.
Alternative Hypothesis: There is an association.
– Left-tailed
– Right-tailed
– Two-tailed


Left-Tailed Alternative Hypothesis
The alternative hypothesis is that the prediction is worse than that by chance.
[Figure: left-tailed p-value, showing the sample [3 1; 1 3] together with the tables [0 4; 4 0], [1 3; 3 1], and [2 2; 2 2]; rows Actual (M, T), columns Test (M, T)]


Right-Tailed Alternative Hypothesis
The alternative hypothesis is that the prediction is better than that by chance.
[Figure: right-tailed p-value, showing the sample [3 1; 1 3] together with the table [4 0; 0 4]; rows Prepared (M, T), columns Test (M, T)]


Two-Tailed Alternative Hypothesis
[Figure: two-tailed p-value, showing the sample [3 1; 1 3] together with the tables [0 4; 4 0], [1 3; 3 1], [2 2; 2 2], and [4 0; 0 4]; rows Prepared (M, T), columns Test (M, T)]
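
The three p-values for the tea example can be sketched in Python with exact fractions. With all margins fixed at 4, each possible table is determined by its top-left cell k, and its probability follows the hypergeometric distribution. The two-tailed rule used here (sum the tables no more probable than the observed one) is one common convention and is an assumption of this sketch, not a statement of what any particular software reports.

```python
from fractions import Fraction
from math import comb

def table_probability(k, row=4, col=4, n=8):
    """Hypergeometric probability of k in the top-left cell with fixed margins."""
    return Fraction(comb(col, k) * comb(n - col, row - k), comb(n, row))

probs = {k: table_probability(k) for k in range(5)}   # k = 0..4
observed = 3                                          # sample table [3 1; 1 3]

left = sum(p for k, p in probs.items() if k <= observed)
right = sum(p for k, p in probs.items() if k >= observed)
# Two-tailed: tables no more probable than the observed one (assumed convention).
two = sum(p for p in probs.values() if p <= probs[observed])
print(left, right, two)
```

Because only five tables are possible, the p-values are exact sums of a handful of probabilities rather than approximations from a continuous distribution.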


This demonstration illustrates the concepts discussed previously.

Fisher’s Exact Test


1.07 Quiz
Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05?


1.07 Quiz – Correct Answer
Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05?

The Left test is for the specified hypothesis, and the p-value of 0.0007 is significant at the α=0.05 level.


Stratified Data Analysis
Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable.
Use this analysis when you want to examine the association between two variables within the levels of a third variable.


Unstratified Data Analysis

Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.


Stratified Data Analysis

Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.


Cochran-Mantel-Haenszel Statistics


Sample Size for CMH versus Chi-Square

It is recommended that you have either a sample size of 25 for each degree of freedom in the original table, or at least 80% of cells with an expected frequency of at least 5 (the same guideline as for the unstratified test).

1. Correlation of Scores

Test linear association.

[Figure: crosstabulation of variables A and B]

2. Row Scores by Column Categories

Test equal row scores.

[Figure: crosstabulation of variables A and B]

3. Column Scores by Row Categories

Test equal column scores.

[Figure: crosstabulation of variables A and B]

4. General Association of Categories

Test general association.

[Figure: crosstabulation of variables A and B]

1.08 Multiple Choice Poll

Which CMH test is the most appropriate for Survived (nominal, columns) versus Class (ordinal, rows) when stratified by Sex?

a. Row Scores by Column Categories

b. General Association of Categories

c. Correlation of Scores

d. Column Scores by Row Categories

1.08 Multiple Choice Poll – Correct Answer

a. Row Scores by Column Categories: row scores for ordinal Class by column categories of nominal Survived.

CMH Statistics and 2x2 Tables

For a 2x2 table, all CMH statistics are equal.

When Do CMH Tests Lack Power?

The CMH statistics accumulate over the strata.

If the association is similar in all strata, then the statistics are strengthened.
– This case is easier to detect, and the tests have more power.

If the association changes or reverses across strata, then the statistics are weakened.
– This case is more difficult to detect, and the tests have less power.
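The accumulation over strata can be illustrated with a minimal sketch of the CMH statistic for a set of 2x2 tables. The counts below are hypothetical and chosen only to show the effect; the function name `cmh_2x2` is an assumption for illustration and is not part of JMP.

```python
import math

def cmh_2x2(strata):
    """CMH statistic (1 df) and p-value for a list of 2x2 tables [[a, b], [c, d]]."""
    diff_sum = 0.0  # accumulated (observed - expected) for the upper-left cell
    var_sum = 0.0   # accumulated hypergeometric variance of that cell
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    stat = diff_sum ** 2 / var_sum
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail, 1 df
    return stat, p_value

# Hypothetical strata: the same association in both strata ...
similar = [[[20, 10], [10, 20]], [[20, 10], [10, 20]]]
# ... versus an association that reverses in the second stratum.
reversed_assoc = [[[20, 10], [10, 20]], [[10, 20], [20, 10]]]

stat_similar, p_similar = cmh_2x2(similar)
stat_reversed, p_reversed = cmh_2x2(reversed_assoc)
print(stat_similar, p_similar)
print(stat_reversed, p_reversed)
```

In the first case the per-stratum deviations add up and the statistic is large; in the second they cancel, so the test detects nothing even though each stratum shows an association.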


Concordance and Discordance

A crosstabulation of ordinal data introduces the ideas of concordance and discordance.
– These ideas involve a pair of observations.

The association might exhibit a trend.

A pair is concordant if the observation that is ranked higher on X is also ranked higher on Y.

A pair is discordant if the observation that is ranked higher on X is ranked lower on Y.

A pair is tied if both observations have the same level for X, for Y, or for both.
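Counting concordant and discordant pairs from a crosstabulation can be sketched as follows. The table is hypothetical, rows and columns are assumed to be in increasing ordinal order, and the function names are illustrative. The gamma statistic shown is the standard (C − D) / (C + D), which ignores tied pairs.

```python
def count_pairs(table):
    """Return (concordant, discordant) pair counts for an ordinal r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare cell (i, j) with every cell in a higher row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:    # higher on X and higher on Y
                        concordant += table[i][j] * table[k][m]
                    elif m < j:  # higher on X but lower on Y
                        discordant += table[i][j] * table[k][m]
                    # m == j: tied on Y, so the pair is ignored
    return concordant, discordant

def gamma(table):
    """Goodman-Kruskal gamma: (C - D) / (C + D), ignoring ties."""
    c, d = count_pairs(table)
    return (c - d) / (c + d)

# Hypothetical 2x2 ordinal table.
table = [[10, 5],
         [5, 10]]
print(count_pairs(table))
print(gamma(table))
```

Pairs that share a row are never visited and pairs that share a column are skipped, which is how ties drop out of the count.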


Measures of Association

Measures of association for ordinal variables serve like the correlation coefficient for continuous variables that exhibit a linear trend.

Gamma: ignores ties

Kendall’s τ-b: corrects for ties

Stuart’s τ-c: corrects for table size and ties

Somers’ D: asymmetric modification of τ-b

Lambda: measures improvement in predicting Y, given X; two asymmetric forms

Uncertainty Coefficient U: proportion of uncertainty explained


This demonstration illustrates the concepts discussed previously.

CMH Tests


Exercise

This exercise reinforces the concepts discussed previously.


1.09 Multiple Choice Poll

The Correlation of Scores CMH test has which null hypothesis?

a. There is no linear association between the row and column variables in any stratum.

b. The mean scores for each column are equal in each stratum.

c. The mean scores for each row are equal in each stratum.

d. There is no association between the row and column variables in any stratum.

1.09 Multiple Choice Poll – Correct Answer

a. There is no linear association between the row and column variables in any stratum.

1.3 Correspondence Analysis

Objectives

Explain how correspondence analysis can help you find associations.

Perform a simple correspondence analysis.

Interpret a correspondence plot.

What Is Correspondence Analysis?

Correspondence analysis is a data analysis technique that enables you to
– display the associations between the levels of two or more categorical variables graphically
– extract information from a frequency table with many levels for the rows and columns.

Row and Column Profiles

Row and column percentages are used to obtain row and column profiles. The row percentages give the row profile, and the column percentages give the column profile.

Each cell shows Row % / Column %.

Row   A               B               C
1     19.55 / 27.39   25.91 / 23.27   54.55 / 25.53
2     17.27 / 24.20   28.18 / 25.31   54.55 / 25.53
3     17.67 / 24.20   28.84 / 25.31   53.49 / 24.47
4     17.51 / 24.20   29.49 / 26.12   53.00 / 24.47
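As a rough sketch of how row and column profiles come out of a frequency table, the function below converts counts to row and column percentages. The counts are hypothetical and the function name `profiles` is an assumption for illustration.

```python
def profiles(counts):
    """Return (row_pct, col_pct) for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    # Row profile: each cell as a percentage of its row total.
    row_pct = [[100 * x / total for x in row]
               for row, total in zip(counts, row_totals)]
    # Column profile: each cell as a percentage of its column total.
    col_pct = [[100 * x / total for x, total in zip(row, col_totals)]
               for row in counts]
    return row_pct, col_pct

# Hypothetical counts: 2 rows by 3 columns (A, B, C).
counts = [[20, 30, 50],
          [10, 30, 60]]
row_pct, col_pct = profiles(counts)
for row in row_pct:
    print([round(value, 2) for value in row])
for row in col_pct:
    print([round(value, 2) for value in row])
```

Each row of `row_pct` sums to 100, as does each column of `col_pct`. Rows with similar row profiles are the ones that plot close together in a correspondence plot.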


Example

Data were collected for these two categorical variables:
– Mental health status (well, mild symptom formation, moderate symptom formation, or impaired)
– Parent socioeconomic status (A through F)

Is there an association?

Which levels of each variable are associated?

Correspondence Plot

Rows A and B have similar profiles. Their points are close together and fall away from the origin in the same direction.

The profile for Row F is different. Its point falls away from the origin in a different direction.

Association

Rows A and B and Column Well fall in approximately the same direction from the origin, and are relatively close to one another.

1.10 Multiple Answer Poll

In correspondence analysis, which of the following are true? (Choose all answers that apply.)

a. Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles.

b. Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles.

c. Row and column points that fall in the same direction away from the origin indicate that they have an association.

1.10 Multiple Answer Poll – Correct Answers

b. Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles.

c. Row and column points that fall in the same direction away from the origin indicate that they have an association.

Sample Data Set

[Figure: sample data set with the variables MOVIES (ACTION, MYSTERY, COMEDY, SPORTS, ROMANCE, SCI-FI, HORROR, DRAMA, FAMILY), AGE, and GENDER]

Analysis Approaches

You want to perform an analysis that takes into account the three variables Movie, Age, and Gender. There are several approaches.

Analyze a two-way table where the columns correspond to the levels of Movie and the rows correspond to combinations of the levels of Age and Gender.

Treat Gender as a stratification variable and analyze males and females separately.

This demonstration illustrates the concepts discussed previously.

Correspondence Analysis


Exercise

This exercise reinforces the concepts discussed previously.


1.11 Quiz

Ice cream brands A through D are tested by a panel, and rated from 1 through 9 (with 9 as the best score). What can you conclude from the Correspondence Analysis?

1.11 Quiz – Correct Answer

Answers will vary.

1.4 Recursive Partitioning

Objectives

Define recursive partitioning.

Understand the splitting criteria used in JMP.

Review algorithm parameters available in JMP.

Use the Partition platform in JMP.

Recursive Partitioning

Recursive partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y) and maximizing the difference in the response of the groups.

Successive splits produce a structure of rules and groups known as a decision tree, a model of the data.
– Splits are binary.
– The reverse of splitting is pruning.

The tree helps interpret the associations in the data.

Split into New Groups

What factors determine the country from which cars are purchased?

[Figure: the root node for Country (n=303) is split on size into size (Large), n=42, and size (Medium, Small), n=261]

Model Metrics

R² represents the amount of uncertainty in the data that has been accounted for by the explanatory variables.
– Larger R² is better.

Akaike’s Information Criterion (AICc) measures the decrease in the uncertainty but adds a penalty for excessive splitting.
– Smaller AICc is better.

Splitting Metrics

Candidate G² measures the change in the entropy.
– Larger G² values are better.

Candidate LogWorth is the negative base-10 log of the p-value for the likelihood ratio chi-square.
– Larger LogWorth values are better.
– Monte Carlo simulation adjusts the p-value.

The criterion for the best split is LogWorth.
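A minimal sketch of the two splitting metrics for one candidate split is shown below, assuming a binary response so that the likelihood ratio chi-square has 1 degree of freedom. The counts and function names are hypothetical, and the p-value here is unadjusted, so the result will not match the adjusted LogWorth that JMP reports.

```python
import math

def g_squared(table):
    """Likelihood ratio chi-square, 2 * sum(O * ln(O / E)), for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2 * observed * math.log(observed / expected)
    return g2

def logworth_1df(g2):
    """Unadjusted LogWorth, -log10(p), for a chi-square statistic with 1 df."""
    p_value = math.erfc(math.sqrt(g2 / 2))
    return -math.log10(p_value)

# Hypothetical candidate split: rows are the two sides of the split,
# columns are the two response levels.
candidate = [[30, 10],
             [10, 30]]
g2 = g_squared(candidate)
print(g2, logworth_1df(g2))
```

A split that leaves both sides with the same response mix gives G² = 0 and LogWorth = 0; the more the two sides differ, the larger both values become.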


Partition Algorithm: Calculate Split Metric

[Figure: LogWorth of candidate splits on size]

Partition Algorithm: Find Best Cutting Point

[Figure: LogWorth of candidate splits on size, with the best split marked]

Partition Algorithm: Calculate for Other Variables

[Figure: LogWorth of candidate splits on type]

Partition Algorithm: Compare the Best Splits

[Figure: best split on type compared with best split on size]

Partition Algorithm: Partition with Best Split


Partition Algorithm: Repeat within Partitions
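The partition steps (calculate the split metric, find the best cutting point, compare across variables, partition, and repeat) can be sketched roughly as follows. This is not JMP's implementation: it ranks splits by raw G² rather than by adjusted LogWorth, handles only categorical predictors, and uses hypothetical data and function names.

```python
import math
from itertools import combinations

def g_squared(left, right):
    """Likelihood ratio chi-square comparing two lists of response labels."""
    n = len(left) + len(right)
    g2 = 0.0
    for group in (left, right):
        for level in set(left) | set(right):
            observed = group.count(level)
            if observed:
                total = left.count(level) + right.count(level)
                expected = len(group) * total / n
                g2 += 2 * observed * math.log(observed / expected)
    return g2

def best_split(rows, response, predictors, min_size=5):
    """Return (G2, predictor, left_levels) for the best binary split, if any."""
    best = (0.0, None, None)
    for var in predictors:
        levels = sorted({row[var] for row in rows})
        # Keep the first level on the left so each grouping is tried once.
        for size in range(len(levels) - 1):
            for extra in combinations(levels[1:], size):
                left_levels = (levels[0],) + extra
                left = [r[response] for r in rows if r[var] in left_levels]
                right = [r[response] for r in rows if r[var] not in left_levels]
                if min(len(left), len(right)) < min_size:
                    continue
                g2 = g_squared(left, right)
                if g2 > best[0]:
                    best = (g2, var, left_levels)
    return best

def grow(rows, response, predictors, max_depth=2):
    """Split recursively; each node records its size and the split used."""
    _, var, left_levels = best_split(rows, response, predictors)
    if max_depth == 0 or var is None:
        return {"n": len(rows)}
    left = [r for r in rows if r[var] in left_levels]
    right = [r for r in rows if r[var] not in left_levels]
    return {"n": len(rows), "split": (var, left_levels),
            "left": grow(left, response, predictors, max_depth - 1),
            "right": grow(right, response, predictors, max_depth - 1)}

# Hypothetical data: country depends on size only; type is unrelated.
rows = []
for size, country in [("Large", "American"), ("Medium", "Japanese"),
                      ("Small", "Japanese")]:
    for i in range(10):
        rows.append({"size": size, "country": country,
                     "type": "Family" if i % 2 == 0 else "Sporty"})

tree = grow(rows, "country", ["size", "type"])
print(tree)
```

On this toy data the first split separates Large from Medium and Small, and no further split improves either child, so the tree stops after one split.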


Under-fitting and Over-fitting

Under-fitting is a situation where too few splits are used and prediction suffers.
– The uncertainty could be reduced further.

Over-fitting is a situation where too many splits are used and prediction also suffers.
– The model incorporates features of random noise in the data, which will not recur in new data.

Both problems adversely affect model predictions.

Crossvalidation

Crossvalidation attempts to find the optimum number of splits.

The sample data are divided into groups.

One group is designated as the hold-out set.
– It is not used to train (fit) the model (tree).
– It is used for predictions (as if it were future cases).

The other group is used to train the model.

JMP offers two methods of crossvalidation.
– K-fold crossvalidation
– Excluded rows

K-fold Crossvalidation

Divide the data into k groups.

Designate one group as the hold-out set.

Designate the other groups for making the tree.

Rotate the roles of the training groups and the hold-out set until all groups have been held out once.

Combine the statistics of the hold-out sets.
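The rotation of training and hold-out groups can be sketched with row indices alone. The generator below is a hypothetical helper, not a JMP function; it assumes a random assignment of rows to groups, and fitting the tree on each training set is left out.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train, hold_out) index lists, holding out each group once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)         # random assignment to groups
    folds = [indices[i::k] for i in range(k)]    # k groups of near-equal size
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, folds[held_out]

splits = list(k_fold_indices(23, k=5))
for train, hold_out in splits:
    print(len(train), len(hold_out))
```

Every row lands in exactly one hold-out set, so statistics computed on the hold-out sets can be combined over all rows.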


Evaluate Crossvalidation

Specify the number of groups, k.
– The default is 5 groups.

The -2LogLikelihood measures the decrease in the uncertainty from the overall probabilities.

K-fold crossvalidation can lead to over-fitting.

Crossvalidation by Excluded Rows

A portion of the sample is randomly selected.

Exclude these rows to make the hold-out set.

The other rows are used to make the tree.

There is no universal rule for the size of the portion for the hold-out set.
– 25% to 50%

Stopping Rule

You can avoid repeatedly clicking the Split button by clicking the Go button that appears when crossvalidation is used.

The Partition platform continues to split until the R² value for the validation data is better than what the next 10 splits would obtain.

The R² for the training and the validation data is presented in a run chart in the Split History report.

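Akaike’s Information Criterion

It is a popular and rigorous criterion for comparing models.

It is based on the likelihood of the data under the current model (partition).

It includes a penalty for over-fitting.

It includes a correction for small samples.

Smaller values suggest better models.

\[
\mathrm{AICc} = -2\,\mathrm{Log}\,L + \underbrace{2k}_{\text{penalty}} + \underbrace{\frac{2k(k+1)}{n-k-1}}_{\text{correction}}
\]

Here L is the likelihood, k is the number of estimated parameters, and n is the number of observations.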
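The criterion, AICc = -2 Log L + 2k + 2k(k+1)/(n-k-1), is simple to evaluate directly. The sketch below assumes that k is the number of estimated parameters and n the number of observations; the function name and the input values are hypothetical.

```python
def aicc(log_likelihood, k, n):
    """Corrected AIC: -2 log L + 2k (penalty) + 2k(k+1)/(n-k-1) (correction)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    penalty = 2 * k
    correction = 2 * k * (k + 1) / (n - k - 1)
    return -2 * log_likelihood + penalty + correction

# Hypothetical fits: the second model gains little likelihood for its extra parameters.
print(aicc(-100.0, k=3, n=50))
print(aicc(-99.5, k=6, n=50))
```

The second call has the higher log-likelihood but also the larger AICc, so the smaller model would be preferred. The correction term shrinks toward zero as n grows relative to k.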


Special Cases

Limit the splitting by specifying the smallest group size.
– Default minimum size is 5 cases.

Outliers form their own nodes and do not interfere with the rest of the tree.

Linear relationships with continuous explanatory variables might require very many splits to adequately model the effects.

Missing Data

A missing response causes the entire case to be excluded unless you enable the Missing Value Categories option when launching Partition.
– A new response level is added for missing values.

A missing categorical explanatory variable is imputed (random selection of other levels) or a new category is created for missing values.

A missing continuous explanatory variable is randomly assigned to one of the two splits.

Evaluate Model: ROC Curve

The receiver operating characteristic (ROC) curve evaluates the ability of the model to distinguish the levels of the response.

It is based on the sensitivity (true positive rate) and 1 – specificity (false positive rate).

Sensitivity

The sensitivity is the probability or rate of a true positive prediction of the given level.

For this example, if the model predicts Survived=no for 992 cases out of 1004 cases where it is true, then the sensitivity is 0.988 or 98.8%.

The sensitivity should be near 1.

Specificity

The specificity is the probability or rate of a true negative prediction of the given level.

For this example, if the model does not predict Survived=no for 184 cases out of 494 cases where it is not true, then the specificity is 0.37 or 37%.

1 – specificity, or the false positive rate, should ideally be near 0.
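The arithmetic behind the two worked examples can be checked with a few lines. The counts come from the examples above (992 of 1004 and 184 of 494); the variable names are illustrative only.

```python
# Counts from the worked examples, with Survived=no as the given level.
true_positives, actual_positives = 992, 1004   # predicted "no" when it is "no"
true_negatives, actual_negatives = 184, 494    # did not predict "no" when it is not "no"

sensitivity = true_positives / actual_positives
specificity = true_negatives / actual_negatives
false_positive_rate = 1 - specificity

print(round(sensitivity, 3))
print(round(specificity, 2))
print(round(false_positive_rate, 2))
```

The false positive rate of about 0.63 is far from the ideal value of 0, which is the weakness that the high sensitivity alone would hide.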


Evaluate Model: ROC Curve

Rank order the fitted probabilities for the response.

For each row, move up if the response is correct, move right if the response is wrong.
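The up-and-right construction can be sketched as below. The fitted probabilities and outcomes are hypothetical, the function name is an assumption, and tied probabilities (common for tree models, where every row in a leaf shares one probability) are not grouped in this simplified version.

```python
def roc_curve(probabilities, is_event):
    """Return the ROC points and the area under the curve (no tie handling)."""
    ranked = sorted(zip(probabilities, is_event), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, event in ranked if event)
    n_neg = len(ranked) - n_pos
    x = y = 0.0
    points = [(x, y)]
    auc = 0.0
    for _, event in ranked:          # highest fitted probability first
        if event:
            y += 1 / n_pos           # correct: move up (sensitivity rises)
        else:
            x += 1 / n_neg           # wrong: move right (1 - specificity rises)
            auc += y / n_neg         # area of the strip just swept out
        points.append((x, y))
    return points, auc

# Hypothetical fitted probabilities for the event and the actual outcomes.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
events = [1, 1, 0, 1, 0, 0]
points, auc = roc_curve(probs, events)
print(points)
print(auc)
```

The area accumulated on each move to the right equals the share of event and non-event pairs that the model ranks in the correct order.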


Area under the Curve

The area under the ROC curve (AUC) measures the goodness of fit for the tree to the data.

A general rule for interpretation of AUC:

Result             Discrimination
AUC = 0.5          None
0.7 < AUC < 0.8    Acceptable
0.8 < AUC < 0.9    Excellent
AUC > 0.9          Outstanding

Evaluate Model: Lift Curve

Shows performance of tree predictions.

Orders cases by predicted probability.

Compares proportion of cases with one response level in a given portion to proportion of cases with this response overall.
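One point on a lift curve can be computed as in the sketch below: order the cases by predicted probability, take the top portion, and divide its event rate by the overall event rate. The data and the function name `lift` are hypothetical.

```python
def lift(probabilities, is_event, portion):
    """Event rate in the top `portion` of ranked cases divided by the overall rate."""
    ranked = sorted(zip(probabilities, is_event), key=lambda pair: -pair[0])
    top_n = max(1, round(portion * len(ranked)))
    top_rate = sum(event for _, event in ranked[:top_n]) / top_n
    overall_rate = sum(is_event) / len(is_event)
    return top_rate / overall_rate

# Hypothetical fitted probabilities for the event and the actual outcomes.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
events = [1, 1, 0, 1, 0, 0]
for portion in (1 / 3, 1 / 2, 1.0):
    print(portion, lift(probs, events, portion))
```

A lift of 2 in the top third means that this portion contains events at twice the overall rate. The lift always returns to 1 when the portion covers all of the cases.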


Evaluate Model: Confusion Matrix

The actual response is compared to the predicted response from the model in the confusion matrix.

A model that predicts better than chance has more cases on the diagonal than off the diagonal.

This example shows a no response that is predicted well and a yes response that is not predicted well.

The confusion matrix is not useful for model selection when the marginal distribution is not near a probability of 0.5 for both levels.
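Tabulating a confusion matrix from actual and predicted responses can be sketched as follows. The counts are an assumption derived from the sensitivity and specificity examples (992 of 1004 actual no cases predicted no, 184 of 494 actual yes cases predicted yes); the function name is illustrative.

```python
from collections import Counter

def confusion_matrix(actual, predicted, levels):
    """Rows are actual levels, columns are predicted levels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in levels] for a in levels]

# Responses rebuilt from the assumed counts.
actual = ["no"] * 1004 + ["yes"] * 494
predicted = ["no"] * 992 + ["yes"] * 12 + ["no"] * 310 + ["yes"] * 184

matrix = confusion_matrix(actual, predicted, ["no", "yes"])
for level, row in zip(["no", "yes"], matrix):
    print(level, row)

on_diagonal = matrix[0][0] + matrix[1][1]
print(round(on_diagonal / len(actual), 3))
```

Most cases fall on the diagonal overall, yet the yes row has more cases off the diagonal than on it, which matches the description of a yes response that is not predicted well.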


This demonstration illustrates the concepts discussed previously.

Recursive Partitioning


Exercise

This exercise reinforces the concepts discussed previously.


1.12 Quiz

In which leaf, and on what variable, will JMP next split?

1.12 Quiz – Correct Answer

Of the leaves, the highest LogWorth is for Age (0.7313), in the Gender(Female) leaf. This is where JMP will next split.