Chapter 1: Associations
1.1 Introduction to Categorical Data
1.2 Examining Associations among Variables
1.3 Correspondence Analysis
1.4 Recursive Partitioning
1
Chapter 1: Associations
1.1 Introduction to Categorical Data
1.2 Examining Associations among Variables
1.3 Correspondence Analysis
1.4 Recursive Partitioning
2
Objectives
Recognize the differences between categorical and continuous data analysis.
Identify the scale of measurement for your response variable.
Examine the distribution of categorical data.
3
Categorical Data
Categorical data represents categories, classes and classifications, groups, or qualitative characteristics or attributes.
– respondent gender (male or female)
– product disposition (conforming or nonconforming)
– patient mortality (survived or died)
Continuous data represents measurements.
– length, time, temperature, concentration
Categorical data is qualitative; continuous data is quantitative.
Categorical data values are discrete, and the distance between categories is unknown.
4
Categorical Response
The methods presented in this course are appropriate for a response (dependent variable) that is categorical.
– Methods such as the Student t-test, a two-way analysis of variance (ANOVA), or multiple least squares linear regression are not appropriate.
The explanatory variables (independent or predictor variables) can be continuous or categorical.
– The nature of the explanatory variable can also determine which methods are appropriate.
5
Probability
The analysis or modeling of a continuous response directly applies to the value or measurement itself.
– This approach is not possible for a categorical response.
The analysis or modeling of a categorical response is based on the proportion or probability of each level.
6
Common Applications
Medicine, epidemiology, and public health
Sociology and behavioral science
Marketing and demographics
Political science
Quality and Six Sigma
7
8
1.01 Multiple Answer Poll
What is your area of application for categorical data analysis?
a. Medicine, epidemiology, and public health
b. Sociology and behavioral science
c. Marketing and demographics
d. Political science
e. Quality and Six Sigma
f. Other
9
Data Type for Categorical Data
You might use either the numeric or the character data type to represent categorical data, such as customer satisfaction.
– 1, 2, 3, 4, 5 (a Likert scale)
– Poor, fair, good, very good, excellent
You must use the numeric data type to represent continuous data, such as a physical measurement.
10
Modeling Type for Categorical Data
You must use either the nominal or ordinal modeling type for categorical data.
– Nominal variables contain values without any natural ordering, such as hair color, gender, political affiliation, or county of residence.
– Ordinal variables contain values with a natural order, such as satisfaction index, income category, or level of education.
You must use the continuous modeling type for interval or ratio data.
11
12
1.02 Multiple Choice Poll
What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender (male or female)?
a. (numeric, continuous) and (character, ordinal)
b. (numeric, ordinal) and (character, continuous)
c. (numeric, continuous) and (character, nominal)
d. (character, nominal) and (numeric, continuous)
13
1.02 Multiple Choice Poll – Correct Answer
What is the best choice for the data type and modeling type for the combination of variables Age (in years) and Gender (male or female)?
a. (numeric, continuous) and (character, ordinal)
b. (numeric, ordinal) and (character, continuous)
c. (numeric, continuous) and (character, nominal)
d. (character, nominal) and (numeric, continuous)
Correct answer: c. Age in years is a numeric measurement (continuous), and Gender has unordered character values (nominal).
14
Titanic Example
You will use the Titanic data set to explore the nature of categorical data.
– Class: first, second, or third class passengers, or crew members
– Age: adult or child
– Sex: male or female
– Survived: yes or no
15
This demonstration illustrates the concepts discussed previously.
Categorical Data Example
16
17
1.03 Multiple Choice Poll
What data type and modeling type are used for the Age variable?
a. Character, ordinal
b. Numeric, nominal
c. Character, nominal
d. Character, continuous
18
1.03 Multiple Choice Poll – Correct Answer
What data type and modeling type are used for the Age variable?
a. Character, ordinal
b. Numeric, nominal
c. Character, nominal
d. Character, continuous
19
20
Distribution of Continuous Data
Continuous data might be realized as an infinity of values, within an arbitrary level of discreteness, over a given range.
The distribution or frequency of these values depends on the process that generates them.
– Many examples can be described by the normal distribution.
The distribution might be asymmetric when values approach a natural boundary.
The distribution might exhibit unusual tails.
21
Distribution Models for Continuous Data
Many mathematical models exist for continuous data.
The model parameters determine the characteristics of the distribution.
– The model is fit to the data by determining the best values for the parameters.
The model can be expressed as functions:
– probability density function (PDF)
– cumulative distribution function (CDF)
Common examples of models are the normal, lognormal, Weibull, Johnson, and gamma distributions.
22
Distribution of Categorical Data
Categorical data might be realized only as discrete values, few or many.
The distribution or frequency of these values depends on the process that generates them.
– Many examples of dichotomous responses can be described by the binomial distribution.
The distribution might not be symmetric.
The distribution of many levels might exhibit unusual tails.
23
Distribution Models for Categorical Data
Many mathematical models exist for categorical data.
The model parameters determine the characteristics of the distribution.
– The model is fit to the data by determining the best values for the parameters.
The model can be expressed as functions:
– probability mass function (PMF)
– cumulative distribution function (CDF)
Common examples of models are the binomial, negative binomial, geometric, hypergeometric, and Poisson distributions.
24
Binomial Distribution Model
The basis for this distribution is a Bernoulli trial.
– There are only two possible outcomes of each trial, generally 1 for success or 0 for failure.
– Each individual outcome ($y_i$) is independent of the others (in other words, the probability of the outcome 1 is always the same).
The total number of successes (outcomes of 1) is y.
25
$y = \sum_{i=1}^{n} y_i$
Binomial Distribution Model
The binomial distribution describes the probability of y, the number of successes, from 0 to n.
The parameters in this model are n, the number of trials, and π, the probability of outcome 1 in each trial.
The expected value (mean) is nπ and the variance is nπ(1 − π) for the binomial distribution.
26
$\mathrm{PMF}(y) = \binom{n}{y}\,\pi^{y}\,(1-\pi)^{n-y}$
Example of Binomial Distribution
A college basketball player finished the last season with a record of 77% success making free throws.
– What performance should you expect from this player if her free-throw success rate has not changed?
Specifically, how many baskets should she make in 25 attempts?
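The free-throw question can be explored with a few lines of Python. This is a minimal sketch, not part of the course demonstration: it assumes the binomial model with n = 25 and π = 0.77 from the example, and the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

n, pi = 25, 0.77                  # attempts and success rate from the example
mean = n * pi                     # expected number of baskets, n * pi
variance = n * pi * (1 - pi)      # n * pi * (1 - pi)

# Most probable number of baskets: the y with the largest PMF value
mode = max(range(n + 1), key=lambda y: binomial_pmf(y, n, pi))

print(f"mean={mean:.2f}, variance={variance:.4f}, most likely count={mode}")
```

The mean answers "how many baskets should she make" on average, while the PMF shows how much variation around that value is still consistent with an unchanged success rate.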
27
28
1.04 Multiple Choice Poll
What is the parameter π in the binomial distribution model?
a. The total number of successes
b. The probability of success in each trial
c. The number of possible outcomes from each trial
d. The proportion of failures in each trial
29
1.04 Multiple Choice Poll – Correct Answer
What is the parameter π in the binomial distribution model?
a. The total number of successes
b. The probability of success in each trial
c. The number of possible outcomes from each trial
d. The proportion of failures in each trial
Correct answer: b. The parameter π is the probability of outcome 1 (success) in each trial.
30
Graphics for Frequency and Proportion
Statistical graphics are designed to help you interpret the data.
The bar chart represents the frequency of each level by the length of its bar.
The mosaic plot represents the proportion of each level by the length of its segment.
31
Multinomial Distribution Model
Some categorical responses have more than two possible values.
The idea of the binomial distribution can be extended to the multinomial distribution.
32
Test Proportions
There might be supposed proportions for each of the categories in the response variable.
The sample can be used to test that supposition.
JMP calls this command Test Probabilities.
When you have a response with more than two levels, enter a probability for only the subset of levels that you want to test, and leave the others blank.
– Enter 1 for all levels to test if they are equal.
33
Chi-Square Test for Proportions
The appropriate test of proportions is based on the chi-square statistic.
– This statistic is covered in detail in the next section.
The test is available for three situations:
– Test whether probabilities are not equal to those supposed
– Test whether probabilities are greater than supposed
– Test whether probabilities are less than supposed
34
Poisson Distribution
Sometimes the number of trials is not fixed and there is no practical upper limit.
The response y is the count of events over time.
The Poisson distribution is often a good model for the distribution of y.
This model has a single parameter, λ.
35
$\mathrm{PMF}(y) = \dfrac{e^{-\lambda}\,\lambda^{y}}{y!}$
36
This demonstration illustrates the concepts discussed previously.
Examining Distributions
37
Exercise
This exercise reinforces the concepts discussed previously.
38
Chapter 1: Associations
1.1 Introduction to Categorical Data
1.2 Examining Associations among Variables
1.3 Correspondence Analysis
1.4 Recursive Partitioning
39
Objectives
Determine whether an association exists among categorical variables.
Perform a stratified analysis of categorical variables.
40
Association
An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes.
If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.
41
No Association
42
Is mood associated with the weather?
[Figure: the percentages 72% and 28% appear in both groups]
Association
43
Is mood associated with the weather?
[Figure: the percentages are 82% and 18% in one group, 60% and 40% in the other]
44
This demonstration illustrates the concepts discussed previously.
Recognizing Associations
Marginal Distribution in an Association
The marginal distribution of the response ignores the explanatory variable.
The mosaic plot explores the data without regard to any association.
45
Conditional Distribution in an Association
The conditional distribution of the response describes the frequency of the responses for each level of the explanatory variable.
The mosaic plot explores the data and the possibility of an association.
46
Two-Dimensional Mosaic Plot
This mosaic plot includes the marginal distribution on the right and the conditional distribution on the left.
47
conditional marginal
48
This demonstration illustrates the concepts discussed previously.
Exploring Associations
49
1.05 Quiz
Is there an association between the severity of an adverse reaction and the treatment?
50
1.05 Quiz – Correct Answer
Is there an association between the severity of an adverse reaction and the treatment?
No, the distribution of ADR SEVERITY is the same between the two levels of TREATMENT GROUP.
51
Test for Association
The row percentage (proportion or probability) is used to test the association between Survived and Class.
52
Null Hypothesis
H0: There is no association between Survived and Class.
The probability of surviving is the same, regardless of the class of the passenger.
53
Alternative Hypothesis
H1: There is an association between Survived and Class.
The probability of surviving is different between crew, first, second, and third class passengers.
54
Chi-Square Test
The expected frequencies are based on the marginal distribution, or null hypothesis.
55
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies
Expected Frequency
The expected frequency of each cell is based on the marginal distribution (null hypothesis).
It is the product of the marginal proportion of the explanatory variable and the marginal frequency of the response.
0.4021 × 1490 = 599.114 (computed with the unrounded proportion; 0.4021 is rounded for display)
56
Pearson Chi-Square Statistic
The observed frequency is compared to the expected frequency.
The cell statistics are accumulated into the sample statistic.
(73.886)² / 599.114 = 9.112
57
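$\chi^2 = \sum_{i=1}^{n} \dfrac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed frequency and $E_i$ is the expected frequency of cell $i$, and $n$ is the number of cells.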
p-Value for Chi-Square Test
This p-value is the
– probability of observing a chi-square sample statistic at least as large as the one actually observed, given that there is no association between the variables
– probability of the association that you observe in the data occurring by chance.
58
Chi-Square Tests
Chi-square tests and the corresponding p-values
– determine whether an association exists
– do not measure the strength of an association
– depend on and reflect the sample size.
59
Agreement
A stronger relationship than an association might be sought when the two variables use the same levels.
Agreement measures the strength of such a relationship.
– Cohen’s kappa, κ, for agreement
– Bowker’s test of symmetry (association)
– McNemar’s test of agreement (Bowker’s test reduces to it for a 2x2 table)
60
Trend in Association
Two variables might exhibit a trend in the association between their ordered levels.
– The response has two levels.
– The predictor is ordinal.
The Cochran-Armitage test is available for a trend.
61
62
This demonstration illustrates the concepts discussed previously.
Chi-Square Test
63
1.06 Quiz
Is there sufficient evidence that an association exists between adverse effect severity and treatment?
64
1.06 Quiz – Correct Answer
Is there sufficient evidence that an association exists between adverse effect severity and treatment?
No, the p-value for the Pearson chi-square statistic is 0.7919, so there is insufficient evidence to reject the null hypothesis (that no association exists) at α=0.05.
65
66
When Not to Use the Chi-Square Test
67
When more than 20% of the cells have expected counts less than five.
Observed versus Expected Values
68
Observed Values
1   5   8
5   6   7
6   5   6
1 of 9 cells, or 11%, with observed value less than 5

Expected Values
3.43   4.57   6.00
4.41   5.88   7.71
4.16   5.55   7.29
4 of 9 cells, or 44%, with expected value less than 5
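The expected values on this slide can be reproduced from the observed table alone. The sketch below is an illustration in plain Python rather than the JMP procedure: it computes each expected count as row total × column total / grand total, accumulates the Pearson chi-square statistic, and applies the 20% rule. The variable names are chosen here for illustration.

```python
observed = [[1, 5, 8],
            [5, 6, 7],
            [6, 5, 6]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under no association: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

# Rule of thumb from the slide: too sparse if more than 20% of cells
# have an expected count below 5
cells = [e for row in expected for e in row]
share_small = sum(e < 5 for e in cells) / len(cells)
too_sparse = share_small > 0.20

for row in expected:
    print("  ".join(f"{e:5.2f}" for e in row))
print(f"chi-square={chi_square:.3f}, share of small cells={share_small:.0%}")
```

Note that the rule looks at expected counts, not observed counts: only one observed value is below 5, but four expected values are.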
Small Samples – Fisher’s Exact Test
69
[Figure: sample size scale from Small to Large, with Fisher’s Exact Test at the small end]
Example: Tea and Milk
Suppose you want to test whether someone can determine whether a cup of tea with milk had the milk poured first or the tea poured first.
70
Fisher’s Exact Test Example
8 Cups of Tea: 4 with Milk First and 4 with Tea First
Predict which cups had tea poured first.
71
Fixed Marginal Totals (rows: Prepared, columns: Test; M = milk first, T = tea first)
        M   T   Total
M               4
T               4
Total   4   4
Basis for Fisher’s Exact Test
72
Row and column totals are fixed at 4 (rows: Prepared, columns: Test; M = milk first, T = tea first).
Sample:
        M   T   Total
M       3   1   4
T       1   3   4
Total   4   4
Other possible samples (M row / T row):
(0, 4 / 4, 0)   (1, 3 / 3, 1)   (2, 2 / 2, 2)   (4, 0 / 0, 4)
Fisher’s Exact Test Hypotheses
Null Hypothesis: There is no association.
Alternative Hypothesis: There is an association.
– Left-tailed
– Right-tailed
– Two-tailed
73
Left-Tailed Alternative Hypothesis
74
Left-tailed p-value
The alternative hypothesis is that the prediction is worse than that by chance.
Observed table (rows: Actual, columns: Test):
        M   T   Total
M       3   1   4
T       1   3   4
Total   4   4
Other tables shown (M row / T row):
(0, 4 / 4, 0)   (1, 3 / 3, 1)   (2, 2 / 2, 2)
Right-Tailed Alternative Hypothesis
75
Right-tailed p-value
The alternative hypothesis is that the prediction is better than that by chance.
Observed table (rows: Prepared, columns: Test):
        M   T   Total
M       3   1   4
T       1   3   4
Total   4   4
Other table shown (M row / T row):
(4, 0 / 0, 4)
Two-Tailed Alternative Hypothesis
76
Two-tailed p-value
Observed table (rows: Prepared, columns: Test):
        M   T   Total
M       3   1   4
T       1   3   4
Total   4   4
Other tables shown (M row / T row):
(0, 4 / 4, 0)   (1, 3 / 3, 1)   (2, 2 / 2, 2)   (4, 0 / 0, 4)
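Because the margins are fixed at 4, each possible table is determined by its top-left cell, and its probability follows the hypergeometric distribution. The sketch below computes the three p-values for the observed table (3 correct in the top-left cell). It is an illustration, not JMP output; in particular, the two-tailed rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention and is an assumption about how the slide's two-tailed value is defined.

```python
from math import comb

def table_probability(a: int, row1: int = 4, row2: int = 4, col1: int = 4) -> float:
    """Hypergeometric probability that the top-left cell equals a,
    given fixed row totals (row1, row2) and first column total col1."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

observed = 3  # top-left cell of the sample table
probs = {a: table_probability(a) for a in range(0, 5)}

# Left tail: tables with a top-left count at most the observed one
left_p = sum(p for a, p in probs.items() if a <= observed)
# Right tail: tables with a top-left count at least the observed one
right_p = sum(p for a, p in probs.items() if a >= observed)
# Two-tailed (assumed convention): tables no more probable than the observed
two_p = sum(p for p in probs.values() if p <= probs[observed] + 1e-12)

print(f"left={left_p:.4f}, right={right_p:.4f}, two-tailed={two_p:.4f}")
```

With only 70 equally likely ways to choose 4 cups out of 8, even 3 correct out of 4 is not strong evidence against guessing, which is why the right-tailed p-value is far above 0.05.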
77
This demonstration illustrates the concepts discussed previously.
Fisher’s Exact Test
78
1.07 Quiz
Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05?
79
1.07 Quiz – Correct Answer
Which test should you use for the alternative hypothesis that finishing the prescription decreases the chance of a relapse, and is the test significant at α=0.05?
The Left test corresponds to the specified hypothesis, and the p-value of 0.0007 is significant at the α=0.05 level.
80
81
Stratified Data Analysis
Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable.
Use this analysis when you want to examine the association between two variables within the levels of a third variable.
82
Unstratified Data Analysis
Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.
83
Stratified Data Analysis
Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.
84
Cochran-Mantel-Haenszel Statistics
85
Sample Size for CMH versus Chi-Square
It is recommended that you have either a sample size of 25 for each degree of freedom in the original table or at least 80% of cells with an expected frequency of at least 5 (same as the unstratified test).
86
1. Correlation of Scores
87
[Figure: table of A by B]
Test linear association
2. Row Scores by Column Categories
88
[Figure: table of A by B]
Test equal row scores
3. Column Scores by Row Categories
89
[Figure: table of A by B]
Test equal column scores
4. General Association of Categories
90
[Figure: table of A by B]
Test general association
91
1.08 Multiple Choice Poll
Which CMH test is the most appropriate for Survived (nominal, columns) versus Class (ordinal, rows) when stratified by Sex?
a. Row Scores by Column Categories
b. General Association of Categories
c. Correlation of Scores
d. Column Scores by Row Categories
92
1.08 Multiple Choice Poll – Correct Answer
Which CMH test is the most appropriate for Survived (nominal, columns) versus Class (ordinal, rows) when stratified by Sex?
a. Row Scores by Column Categories
b. General Association of Categories
c. Correlation of Scores
d. Column Scores by Row Categories
Correct answer: a. Row Scores for ordinal Class by Column Categories of nominal Survived.
93
CMH Statistics and 2x2 Tables
94
For a 2x2 table, all CMH statistics are equal.
When Do CMH Tests Lack Power?
The CMH statistics accumulate over the strata.
If the association is similar in all strata, then the statistics are strengthened.
– This case is easier to detect, and the tests have more power.
If the association changes or reverses across strata, then the statistics are weakened.
– This case is more difficult to detect, and the tests have less power.
95
Concordance and Discordance
A crosstabulation of ordinal data introduces the ideas of concordance and discordance.
– These ideas involve a pair of observations.
The association might exhibit a trend.
A pair is concordant if the observation that is ranked higher on X is also ranked higher on Y.
A pair is discordant if the observation that is ranked higher on X is ranked lower on Y.
A pair is tied if both observations have the same level for X or for Y (or both).
96
Measures of Association
Measures of association for ordinal variables serve like the correlation coefficient for continuous variables that exhibit a linear trend.
– Gamma: ignores ties
– Kendall’s tau-b (τb): corrects for ties
– Stuart’s tau-c (τc): corrects for table size and ties
– Somers’ D: asymmetric modification of τb
– Lambda: measures improvement in predicting Y, given X; two asymmetric forms
– Uncertainty Coefficient U: proportion of uncertainty explained
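To make concordance, discordance, and gamma concrete, the sketch below counts concordant and discordant pairs in a crosstabulation whose rows and columns are both in ascending order, then computes gamma as (C − D) / (C + D), which ignores tied pairs. The 3x3 table is hypothetical and exists only for illustration; the function name is also an assumption.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an r x c table whose
    rows (X) and columns (Y) are both listed in ascending order."""
    concordant = discordant = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            # Pair cell (i, j) with every cell in a later row (higher X)
            for k in range(i + 1, rows):
                for m in range(cols):
                    if m > j:        # higher on X and higher on Y
                        concordant += table[i][j] * table[k][m]
                    elif m < j:      # higher on X but lower on Y
                        discordant += table[i][j] * table[k][m]
                    # m == j: tied on Y, counted in neither total
    return concordant, discordant

# Hypothetical ordinal-by-ordinal table (not from the course data)
table = [[20, 10, 5],
         [10, 15, 10],
         [5, 10, 20]]

c, d = concordant_discordant(table)
gamma = (c - d) / (c + d)
print(f"concordant={c}, discordant={d}, gamma={gamma:.3f}")
```

A gamma near +1 means almost all untied pairs are concordant, a value near −1 means almost all are discordant, and a value near 0 suggests no monotone trend.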
97
98
This demonstration illustrates the concepts discussed previously.
CMH Tests
99
100
Exercise
This exercise reinforces the concepts discussed previously.
101
1.09 Multiple Choice Poll
The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.
102
1.09 Multiple Choice Poll – Correct Answer
The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.
Correct answer: a. The Correlation of Scores statistic tests linear association.
103
Chapter 1: Associations
1.1 Introduction to Categorical Data
1.2 Examining Associations among Variables
1.3 Correspondence Analysis
1.4 Recursive Partitioning
104
Objectives
Explain how correspondence analysis can help you find associations.
Perform a simple correspondence analysis.
Interpret a correspondence plot.
105
What Is Correspondence Analysis?
Correspondence analysis is a data analysis technique that enables you to
– display the associations between the levels of two or more categorical variables graphically
– extract information from a frequency table with many levels for the rows and columns.
106
Row and Column Profiles
Row and column percentages are used to obtain row and column profiles.
107
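Each cell shows Row % / Column %.
       A               B               C
1      19.55 / 27.39   25.91 / 23.27   54.55 / 25.53
2      17.27 / 24.20   28.18 / 25.31   54.55 / 25.53
3      17.67 / 24.20   28.84 / 25.31   53.49 / 24.47
4      17.51 / 24.20   29.49 / 26.12   53.00 / 24.47
The Row % values across a row give the row profile.
The Column % values down a column give the column profile.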
Example
Data collected for these two categorical variables:
– Mental health status (well, mild symptom formation, moderate symptom formation, or impaired)
– Parent socioeconomic status (A through F)
Is there an association?
Which levels of each variable are associated?
108
Rows A and B have similar profiles. Their points are close together and fall away from the origin in the same direction.
The profile for Row F is different. Its point falls away from the origin in a different direction.
Correspondence Plot
109
Rows A and B and Column Well fall in approximately the same direction from the origin, and are relatively close to one another.
Association
110
111
1.10 Multiple Answer Poll
In correspondence analysis, which of the following are true? (Choose all answers that apply.)
a. Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles.
b. Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles.
c. Row and column points that fall in the same direction away from the origin indicate that they have an association.
112
1.10 Multiple Answer Poll – Correct Answers
In correspondence analysis, which of the following are true? (Choose all answers that apply.)
a. Row points that fall far from each other but in the same direction away from the origin indicate that they have similar profiles.
b. Column points that fall close together and in the same direction away from the origin indicate that they have similar profiles.
c. Row and column points that fall in the same direction away from the origin indicate that they have an association.
113
Sample Data Set
114
[Figure: sample data set with the variables AGE, GENDER, and MOVIES; movie categories are ACTION, MYSTERY, COMEDY, SPORTS, ROMANCE, SCI-FI, HORROR, DRAMA, and FAMILY]
Analysis Approaches
You want to perform an analysis that takes into account the three variables Movie, Age, and Gender. There are several approaches.
– Analyze a two-way table where the columns correspond to the levels of Movie and the rows correspond to combinations of the levels of Age and Gender.
– Treat Gender as a stratification variable and analyze males and females separately.
115
116
This demonstration illustrates the concepts discussed previously.
Correspondence Analysis
117
118
Exercise
This exercise reinforces the concepts discussed previously.
119
1.11 Quiz
Ice cream brands A through D are tested by a panel, and rated from 1 through 9 (with 9 as the best score). What can you conclude from the Correspondence Analysis?
120
1.11 Quiz – Correct Answer
Ice cream brands A through D are tested by a panel, and rated from 1 through 9 (with 9 as the best score). What can you conclude from the Correspondence Analysis?
Answers will vary.
121
Chapter 1: Associations
1.1 Introduction to Categorical Data
1.2 Examining Associations among Variables
1.3 Correspondence Analysis
1.4 Recursive Partitioning
122
Objectives
Define recursive partitioning.
Understand the splitting criteria used in JMP.
Review algorithm parameters available in JMP.
Use the Partition platform in JMP.
123
Recursive Partitioning
Recursive partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y) and maximizing the difference in the response of the groups.
Successive splits produce a structure of rules and groups known as a decision tree, a model of the data.
– Splits are binary.
– The reverse of splitting is pruning.
The tree helps interpret the associations in the data.
124
Split into New Groups
125
What factors determine the country from which cars are purchased?
[Figure: the root node Country (n = 303) is split on size into size (Large), n = 42, and size (Medium, Small), n = 261]
Model Metrics
R² represents the amount of uncertainty in the data that has been accounted for by the explanatory variables.
– Larger R² is better.
Akaike’s Information Criterion (AICc) measures the decrease in the uncertainty but adds a penalty for excessive splitting.
– Smaller AICc is better.
126
Splitting Metrics
Candidate G² measures the change in the entropy.
– Larger G² values are better.
Candidate LogWorth is the negative log of the p-value for the likelihood ratio chi-square.
– Larger LogWorth values are better.
– Monte Carlo simulation adjusts the p-value.
The criterion for the best split is LogWorth.
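The two splitting metrics can be sketched for a single candidate split. The code below treats a candidate as a 2x2 table (two child groups by two response levels), computes the likelihood ratio chi-square as G² = 2 Σ O ln(O/E), and converts it to an unadjusted LogWorth, taken here as −log10 of the p-value. This is an illustration under stated assumptions: the counts are hypothetical, the p-value formula applies only to one degree of freedom, and the Monte Carlo adjustment that JMP applies is not reproduced.

```python
from math import erfc, log, log10, sqrt

def g_squared(table):
    """Likelihood ratio chi-square, G^2 = 2 * sum(O * ln(O / E)),
    for a table of child groups (rows) by response levels (columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            if o > 0:  # empty cells contribute nothing
                e = row_totals[i] * col_totals[j] / total
                g2 += 2 * o * log(o / e)
    return g2

def logworth_df1(g2):
    """Unadjusted LogWorth, -log10(p-value), for one degree of freedom.
    For df = 1 the chi-square upper tail equals erfc(sqrt(x / 2))."""
    return -log10(erfc(sqrt(g2 / 2)))

# Hypothetical candidate splits: rows are the two child groups,
# columns are the two response levels
candidates = {
    "split A": [[30, 10], [20, 40]],
    "split B": [[26, 24], [24, 26]],
}
scores = {name: logworth_df1(g_squared(t)) for name, t in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The candidate that separates the response levels more cleanly gets the larger G² and therefore the larger LogWorth, which is why it would be chosen as the best split.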
127
Partition Algorithm: Calculate Split Metric
128
[Figure: LogWorth for candidate splits on size]
Partition Algorithm: Find Best Cutting Point
129
[Figure: LogWorth for candidate splits on size, with the best split marked]
Partition Algorithm: Calculate for Other Variables
130
[Figure: LogWorth for candidate splits on type]
Partition Algorithm: Compare the Best Splits
131
[Figure: best split on type compared with best split on size]
Partition Algorithm: Partition with Best Split
132
Partition Algorithm: Repeat within Partitions
133
Under-fitting and Over-fitting
Under-fitting is a situation where too few splits are used and prediction suffers.
– The uncertainty could be reduced further.
Over-fitting is a situation where too many splits are used and prediction also suffers.
– The model incorporates features of random noise in the data, which will not be repeated.
Both problems adversely affect model predictions.
134
Crossvalidation
Crossvalidation attempts to find the optimum number of splits.
The sample data are divided into groups.
One group is designated as the hold-out set.
– It is not used to train (fit) the model (tree).
– It is used for predictions (as if it were future cases).
The other group is used to train the model.
JMP offers two methods of crossvalidation.
– K-fold crossvalidation
– Excluded rows
135
K-fold Crossvalidation
Divide the data into k groups.
Designate one group as the hold-out set.
Designate the other groups for making the tree.
Rotate the roles of the training groups and the hold-out set until all groups have been held out once.
Combine the statistics of the hold-out sets.
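The rotation of training groups and hold-out set can be sketched as follows. This is a generic illustration of k-fold splitting on row indices, not a description of how JMP assigns rows internally; the function name, the shuffling, and the seed are assumptions made for the example.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (training_rows, hold_out_rows) pairs, one per fold.
    Every row is held out exactly once."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)                         # random assignment to groups
    folds = [rows[i::k] for i in range(k)]    # k groups of near-equal size
    for i, hold_out in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, hold_out

splits = list(k_fold_indices(23, k=5))
for training, hold_out in splits:
    print(len(training), "training rows,", len(hold_out), "hold-out rows")
```

In a full analysis, a tree would be fit on each training set and scored on its hold-out set, and the hold-out statistics would then be combined across the k rotations.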
136
Evaluate Crossvalidation
Specify the number of groups, k.
– The default is 5 groups.
The -2LogLikelihood measures the decrease in the uncertainty from the overall probabilities.
K-fold crossvalidation leads to over-fitting.
137
Crossvalidation by Excluded Rows
A portion of the sample is randomly selected.
Exclude these rows to make the hold-out set.
The other rows are used to make the tree.
There is no universal rule for the size of the portion for the hold-out set.
– 25% to 50%
138
Stopping Rule
You can avoid repeatedly clicking the Split button by clicking the Go button that appears when crossvalidation is used.
The Partition platform continues to split until the R² value for the validation data is better than what the next 10 splits would obtain.
The R² for the training and the validation data is presented in a run chart in the Split History report.
139
Akaike’s Information Criterion
It is a popular and rigorous criterion for comparing models.
It is based on the likelihood of the data under the current model (partition).
It includes a penalty for over-fitting.
It includes a correction for small samples.
Smaller values suggest better models.
140
$\mathrm{AICc} = -2\,\mathrm{Log}L + \underbrace{2k}_{\text{penalty}} + \underbrace{\dfrac{2k(k+1)}{n-k-1}}_{\text{correction}}$
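The criterion AICc = −2 LogL + 2k + 2k(k + 1)/(n − k − 1) is easy to evaluate once the log-likelihood, the number of parameters k, and the sample size n are known. The sketch below is a small illustration of that arithmetic; the function name and the example values are hypothetical.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected Akaike information criterion.
    log_likelihood: log-likelihood of the data under the model
    k: number of estimated parameters, n: number of observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    penalty = 2 * k                               # penalty for over-fitting
    correction = 2 * k * (k + 1) / (n - k - 1)    # small-sample correction
    return -2 * log_likelihood + penalty + correction

# Hypothetical comparison: a slightly better fit that needs more parameters
small_model = aicc(log_likelihood=-100.0, k=3, n=50)
large_model = aicc(log_likelihood=-99.0, k=8, n=50)
print(f"small={small_model:.2f}, large={large_model:.2f}")
```

In this made-up comparison the larger model gains only a little likelihood, so its penalty and correction outweigh the gain and the smaller model has the smaller (better) AICc.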
Special Cases
Limit the splitting by specifying the smallest group size.
– Default minimum size is 5 cases.
Outliers form their own nodes and do not interfere with the rest of the tree.
Linear relationships with continuous explanatory variables might require very many splits to adequately model the effects.
141
Missing Data
A missing response causes the entire case to be excluded unless you enable the Missing Value Categories option when launching Partition.
– A new response level is added for missing values.
A missing categorical explanatory variable is imputed (random selection of other levels) or a new category is created for missing values.
A missing continuous explanatory variable is randomly assigned to one of the two splits.
142
Evaluate Model: ROC Curve
The receiver operating characteristic (ROC) curve evaluates the ability of the model to distinguish the levels of the response.
It is based on the sensitivity (true positive rate) and 1 – specificity (false positive rate).
143
Sensitivity
The sensitivity is the probability or rate of a true positive prediction of the given level.
For this example, if the model predicts Survived=no for 992 cases out of 1004 cases where it is true, then the sensitivity is 0.988 or 98.8%.
The sensitivity should be near 1.
144
Specificity
The specificity is the probability or rate of a true negative prediction of the given level.
For this example, if the model does not predict Survived=no for 184 cases out of 494 cases where it is not true, then the specificity is 0.37 or 37%.
1 – specificity, or the false positive rate, should ideally be near 0.
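The two rates in this example follow directly from the four counts quoted on the slides. The short sketch below repeats that arithmetic; the variable names are chosen here for illustration.

```python
# Counts quoted in the Survived=no example
true_positive, actual_positive = 992, 1004   # predicted "no" when it is "no"
true_negative, actual_negative = 184, 494    # not predicted "no" when it is not "no"

sensitivity = true_positive / actual_positive    # true positive rate
specificity = true_negative / actual_negative    # true negative rate
false_positive_rate = 1 - specificity            # x-axis of the ROC curve

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.2f}, "
      f"false positive rate={false_positive_rate:.2f}")
```

A high sensitivity paired with a low specificity, as here, means the model finds nearly all Survived=no cases but also assigns that level to many cases where it is wrong.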
145
Evaluate Model: ROC Curve
Rank order the fitted probabilities for the response.
For each row, move up if the response is correct, move right if the response is wrong.
146
Area under the Curve
The area under the ROC curve (AUC) measures the goodness of fit for the tree to the data.
A general rule for interpretation of AUC:
147
Result               Discrimination
AUC = 0.5            None
0.7 < AUC < 0.8      Acceptable
0.8 < AUC < 0.9      Excellent
AUC > 0.9            Outstanding
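The "move up, move right" construction of the ROC curve and the area under it can be sketched in a few lines. This is a simplified illustration, not JMP's implementation: it assumes no tied fitted probabilities, uses a hypothetical set of four cases, and the function names are chosen here.

```python
def roc_points(probabilities, is_positive):
    """Build ROC curve points by ranking cases from the highest to the
    lowest fitted probability of the positive level."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    x = y = 0.0
    points = [(x, y)]
    for i in order:
        if is_positive[i]:
            y += 1 / n_pos     # correct: move up (sensitivity increases)
        else:
            x += 1 / n_neg     # wrong: move right (false positive rate increases)
        points.append((x, y))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical fitted probabilities and actual outcomes
probs = [0.9, 0.8, 0.7, 0.6]
actual = [True, False, True, False]
curve = roc_points(probs, actual)
print(curve, "AUC =", auc(curve))
```

A model that ranks every positive case above every negative case traces straight up and then straight right, giving an AUC of 1; a model with no discrimination follows the diagonal, giving 0.5.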
Evaluate Model: Lift Curve
Shows performance of tree predictions.
Orders cases by predicted probability.
Compares proportion of cases with one response level in a given portion to proportion of cases with this response overall.
148
Evaluate Model: Confusion Matrix
The actual response is compared to the predicted response from the model in the confusion matrix.
A model that predicts better than chance has more cases on the diagonal than off the diagonal.
This example shows a no response that is predicted well and a yes response that is not predicted well.
The confusion matrix is not useful for model selection when the marginal distribution is not near a probability of 0.5 for both levels.
149
150
This demonstration illustrates the concepts discussed previously.
Recursive Partitioning
151
152
Exercise
This exercise reinforces the concepts discussed previously.
153
1.12 Quiz
In which leaf, and on what variable, will JMP next split?
154
1.12 Quiz – Correct Answer
In which leaf, and on what variable, will JMP next split?
Of the leaves, the highest LogWorth is for Age (.7313), in the Gender(Female) leaf. This is where JMP will next split.
155