Subgroup Discovery: Finding Local Patterns in Data

Upload: caraf

Post on 06-Jan-2016



TRANSCRIPT

Page 1: Subgroup Discovery

Subgroup Discovery: Finding Local Patterns in Data

Page 2: Subgroup Discovery

Exploratory Data Analysis

Classification: model the dependence of the target on the remaining attributes.

Problem: sometimes the classifier is a black box, or uses only some of the available dependencies. For example, in decision trees some attributes may not appear because of overshadowing.

Exploratory Data Analysis: understanding the effects of all attributes on the target.

Q: How can we use ideas from C4.5 to approach this task?

A: Why not list the information gain of every attribute, and rank the attributes accordingly?
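The ranking idea can be sketched in a few lines of Python. This is a minimal sketch with an invented toy dataset; the attribute names A and B, the class column `cls`, and the helper functions are illustrative, not from the slides:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Information gain of splitting on `attr` with respect to `target`."""
    base = entropy([r[target] for r in rows])
    n = len(rows)
    rem = 0.0
    for v in {r[attr] for r in rows}:
        part = [r[target] for r in rows if r[attr] == v]
        rem += len(part) / n * entropy(part)
    return base - rem

# Invented toy data: A is informative about the class, B is not.
rows = [
    {"A": "a1", "B": "b1", "cls": "+"},
    {"A": "a1", "B": "b2", "cls": "+"},
    {"A": "a1", "B": "b1", "cls": "+"},
    {"A": "a2", "B": "b2", "cls": "-"},
    {"A": "a2", "B": "b1", "cls": "-"},
    {"A": "a2", "B": "b2", "cls": "+"},
]

# Rank all attributes by their information gain, highest first.
ranking = sorted(["A", "B"], key=lambda a: info_gain(rows, a, "cls"), reverse=True)
```

On this data the ranking puts A before B, since B carries no information about the class.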

Page 3: Subgroup Discovery

Interactions between Attributes

Single-attribute effects are not enough.

The XOR problem is an extreme example: two attributes with no information gain form a good subgroup.

Apart from A=a, B=b, C=c, …, consider also:

A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …

A=a ∧ B=b ∧ C=c, …

Page 4: Subgroup Discovery

Subgroup Discovery Task

“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute.”

Inductive constraints:
- Minimum support (or maximum support)
- Minimum quality (information gain, χ², WRAcc)
- Maximum complexity
- …

Page 5: Subgroup Discovery

Confusion Matrix

A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
- within subgroup, positive
- within subgroup, negative
- outside subgroup, positive
- outside subgroup, negative

                  target
                  T     F
    subgroup T   .42   .13   .55
             F   .12   .33   .45
                 .54   .46   1.0
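As a sketch of how these four relative frequencies might be computed from boolean subgroup-membership and target flags (the data below is invented for illustration):

```python
def confusion(subgroup, target):
    """Relative frequencies (tt, tf, ft, ff) of the four combinations
    of subgroup membership and target value."""
    n = len(subgroup)
    tt = sum(1 for s, t in zip(subgroup, target) if s and t) / n
    tf = sum(1 for s, t in zip(subgroup, target) if s and not t) / n
    ft = sum(1 for s, t in zip(subgroup, target) if not s and t) / n
    ff = sum(1 for s, t in zip(subgroup, target) if not s and not t) / n
    return tt, tf, ft, ff

# Invented example: five instances, three inside the subgroup.
in_subgroup = [True, True, True, False, False]
positive    = [True, True, False, True, False]
cm = confusion(in_subgroup, positive)   # (0.4, 0.2, 0.2, 0.2)
```

The four entries always sum to 1, which is why the slide can note that only two degrees of freedom remain once the target distribution is fixed.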

Page 6: Subgroup Discovery

Confusion Matrix

High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target.

High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target.

The target distribution on the database is fixed, so there are only two degrees of freedom.

                  target
                  T     F
    subgroup T   .42   .13   .55
             F   .12   .33   .45
                 .54   .46   1.0

Page 7: Subgroup Discovery

Quality Measures

A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number.

WRAcc, weighted relative accuracy, also known as novelty: a balance between coverage and unexpectedness.

nov(S,T) = p(ST) − p(S)p(T)

It lies between −.25 and .25; 0 means uninteresting.

                  target
                  T     F
    subgroup T   .42   .13   .55
             F   .12   .33   .45
                 .54   .46   1.0

For this table: nov(S,T) = p(ST) − p(S)p(T) = .42 − .55 × .54 = .42 − .297 = .123
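The computation on this slide, as a minimal sketch:

```python
# Novelty / WRAcc of a subgroup S for target T, as defined on slide 7.
def wracc(p_st, p_s, p_t):
    """nov(S,T) = p(ST) - p(S)p(T)"""
    return p_st - p_s * p_t

# Values from the running confusion matrix: p(ST)=.42, p(S)=.55, p(T)=.54.
q = wracc(0.42, 0.55, 0.54)   # .42 - .297 = .123
```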

Page 8: Subgroup Discovery

Quality Measures

- WRAcc: weighted relative accuracy
- Information gain
- χ²
- Correlation coefficient
- Laplace
- Jaccard
- Specificity

Page 9: Subgroup Discovery

Subgroup Discovery as Search

[Figure: the search lattice. The root is the empty (true) subgroup T; the first level contains the single conditions A=a1, A=a2, B=b1, B=b2, C=c1, …; deeper levels contain conjunctions such as A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, …, A=a1 ∧ B=b1 ∧ C=c1. Branches are pruned where the minimum support level is reached.]
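A sketch of such a top-down search, assuming a simple depth-bounded enumeration of conjunctions with support-based pruning; this is an illustrative implementation, not necessarily the exact algorithm behind the slides:

```python
def supports(rows, conds):
    """Rows satisfying every attribute=value condition in `conds`."""
    return [r for r in rows if all(r[a] == v for a, v in conds)]

def search(rows, attrs, min_sup, max_depth=2):
    """Enumerate conjunctions of conditions top-down; prune any branch
    whose support drops below min_sup (support is anti-monotone, so no
    refinement of a pruned subgroup can regain support)."""
    results = []

    def expand(conds, start):
        if len(conds) == max_depth:
            return
        for i in range(start, len(attrs)):      # fix attribute order to avoid duplicates
            a = attrs[i]
            for v in sorted({r[a] for r in rows}):
                cand = conds + ((a, v),)
                if len(supports(rows, cand)) / len(rows) >= min_sup:
                    results.append(cand)
                    expand(cand, i + 1)

    expand((), 0)
    return results

# Invented toy data: every single condition has support .5,
# every pair only .25, so all conjunctions of size 2 are pruned.
rows = [{"A": "a1", "B": "b1"}, {"A": "a1", "B": "b2"},
        {"A": "a2", "B": "b1"}, {"A": "a2", "B": "b2"}]
found = search(rows, ["A", "B"], min_sup=0.5)
```

Note that only the support check is used for pruning here; as the next slide explains, interestingness itself is not monotone and cannot be used this way directly.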

Page 10: Subgroup Discovery

Refinements are (anti-)monotonic

[Figure: nested subgroups overlapping a target concept, where S2 is a refinement of S1 and S3 is a refinement of S2.]

Refinements are (anti-)monotonic in their support…

…but not in interestingness. That may go up or down.

Page 11: Subgroup Discovery

SD vs. Separate & Conquer

Subgroup Discovery:
- produces a collection of subgroups
- local patterns
- subgroups may overlap and conflict
- a subgroup is an unusual distribution of classes

Separate & Conquer:
- produces a decision list
- classifier
- exclusive coverage of the instance space
- rules with a clear conclusion

Page 12: Subgroup Discovery

Subgroup Discovery and ROC space

Page 13: Subgroup Discovery

ROC Space

Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.

TPR = TP/Pos = TP/(TP+FN) (the fraction of all positive cases that fall inside the subgroup)
FPR = FP/Neg = FP/(FP+TN) (the fraction of all negative cases that fall inside the subgroup)

ROC = Receiver Operating Characteristics
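Applied to the running confusion matrix from slide 5, the two rates work out as follows (relative frequencies behave the same as absolute counts in these ratios):

```python
# Entries of the running confusion matrix: TP=.42, FP=.13, FN=.12, TN=.33.
tp, fp, fn, tn = 0.42, 0.13, 0.12, 0.33

tpr = tp / (tp + fn)   # fraction of all positives inside the subgroup, about 0.78
fpr = fp / (fp + tn)   # fraction of all negatives inside the subgroup, about 0.28
```

So this subgroup sits well above the diagonal in ROC space, consistent with its positive WRAcc.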

Page 14: Subgroup Discovery

ROC Space Properties

[Figure: ROC space with labeled landmarks: ‘ROC heaven’ (the perfect subgroup), ‘ROC hell’ (a random subgroup, on the diagonal), the entire database, the empty subgroup, the minimum support threshold, and the perfect negative subgroup.]

Page 15: Subgroup Discovery

Measures in ROC Space

[Figure: isometrics of WRAcc and information gain in ROC space, with the positive and negative regions indicated. Source: Flach & Fürnkranz.]

Page 16: Subgroup Discovery

Other Measures

- Precision
- Gini index
- Correlation coefficient
- Foil gain

Page 17: Subgroup Discovery

Refinements in ROC Space

Refinements of S reduce both the FPR and the TPR, so they appear to the left of and below S.

The blue polygon represents the possible refinements of S. With a convex measure, f is bounded by the measure of the polygon's corners.

If no corner is above the minimum quality or the current best (top k?), the search space below S can be pruned.
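A sketch of this pruning idea for WRAcc, in counts rather than probabilities; the formulation below (the standard "optimistic estimate" at the fp = 0 corner) is assumed for illustration, not taken from the slides:

```python
def wracc_counts(tp, fp, pos, neg):
    """WRAcc of a subgroup with tp/fp covered positives/negatives,
    in a database of pos positives and neg negatives."""
    n = pos + neg
    return tp / n - ((tp + fp) / n) * (pos / n)

def optimistic_wracc(tp, pos, neg):
    """Best WRAcc any refinement of S could reach: the corner fp = 0,
    i.e. a hypothetical refinement keeping all of S's positives and
    dropping all of its negatives."""
    return wracc_counts(tp, 0, pos, neg)

def can_prune(tp, pos, neg, best_so_far):
    """Prune the search space below S if even the best corner cannot
    beat the current best quality."""
    return optimistic_wracc(tp, pos, neg) <= best_so_far

# The running example (x100): tp=42, fp=13, pos=54, neg=46.
bound = optimistic_wracc(42, 54, 46)   # .42 - .42*.54 = .1932
```

Here the subgroup itself scores .123, but some refinement could in principle reach .1932, so this branch cannot be pruned against a current best of .123.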

Page 18: Subgroup Discovery

Combining Two Subgroups

Page 19: Subgroup Discovery

Multi-class Problems

Generalising to problems with more than 2 classes is fairly straightforward:

- χ²
- Information gain

                  target
                  C1    C2    C3
    subgroup T   .27   .06   .22   .55
             F   .03   .19   .23   .45
                 .3    .25   .45   1.0

Combine the per-class values into a quality measure.
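A sketch of the χ² computation for this table, assuming (hypothetically) N = 100 underlying examples so that the relative frequencies become counts:

```python
def chi2(table):
    """Chi-squared statistic of an observed contingency table (counts):
    sum over cells of (observed - expected)^2 / expected, where
    expected = row_total * col_total / n."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(table))
        for j in range(len(col_totals))
    )

# The slide's table scaled to hypothetical counts out of N = 100.
observed = [[27, 6, 22],   # inside subgroup:  C1, C2, C3
            [3, 19, 23]]   # outside subgroup: C1, C2, C3
stat = chi2(observed)      # about 25.2
```

The large statistic reflects what the table shows directly: the class distribution inside the subgroup deviates strongly from the overall distribution.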

Page 20: Subgroup Discovery

Numeric Subgroup Discovery

Target is numeric: find subgroups with significantly higher or lower average value

Trade-off between size of subgroup and average target value

[Figure: example subgroups with average target values h = 2200, h = 3100, and h = 3600.]
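One common way to formalize this trade-off is a mean shift weighted by subgroup size; this particular measure is an assumption for illustration, not necessarily the one behind the slides:

```python
from math import sqrt

def numeric_quality(subgroup_vals, all_vals):
    """Quality of a numeric-target subgroup: sqrt(|S|) * (mean_S - mean_DB).
    Larger subgroups need a smaller mean shift to score equally well."""
    mean_db = sum(all_vals) / len(all_vals)
    mean_s = sum(subgroup_vals) / len(subgroup_vals)
    return sqrt(len(subgroup_vals)) * (mean_s - mean_db)

# Invented target values, loosely echoing the h values in the figure.
all_vals = [2200, 2400, 3100, 3300, 3600, 3800]
high = numeric_quality([3600, 3800], all_vals)   # positive: above-average subgroup
low  = numeric_quality([2200, 2400], all_vals)   # negative: below-average subgroup
```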

Page 21: Subgroup Discovery

Quiz 1

Q: Assume you have found a subgroup with a positive WRAcc (or infoGain). Can any refinement of this subgroup be negative?

A: Yes.

Page 22: Subgroup Discovery

Quiz 2

Q: Assume both A and B are random subgroups. Can the subgroup A ∧ B be an interesting subgroup?

A: Yes. Think of the XOR problem: A ∧ B is either completely positive or completely negative.
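The quiz answer can be checked numerically; a small sketch of the XOR setting (the dataset below is constructed for illustration):

```python
# XOR data: target = A xor B, 25 copies of each combination (N = 100).
data = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1) for _ in range(25)]

def p(pred):
    """Fraction of instances satisfying a predicate."""
    return sum(1 for row in data if pred(row)) / len(data)

def nov(s, t):
    """Novelty / WRAcc: p(ST) - p(S)p(T), as on slide 7."""
    return p(lambda r: s(r) and t(r)) - p(s) * p(t)

target = lambda r: r[2] == 1
nov_a  = nov(lambda r: r[0] == 1, target)                 # A alone is random: 0.0
nov_ab = nov(lambda r: r[0] == 1 and r[1] == 1, target)   # A=1 and B=1: -0.125
```

A alone scores exactly 0 (random subgroup), while A ∧ B is a completely negative subgroup with a strongly non-zero novelty.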


Page 23: Subgroup Discovery

Quiz 3

Q: Can the combination of two positive subgroups ever produce a negative subgroup?

A: Yes.
