subgroup discovery

54
Subgroup Discovery Finding Local Patterns in Data

Upload: lucia

Post on 22-Jan-2016

39 views

Category:

Documents


0 download

DESCRIPTION

Subgroup Discovery. Finding Local Patterns in Data. Exploratory Data Analysis. Scan the data without much prior focus Find unusual parts of the data Analyse attribute dependencies interpret this as ‘ rule ’ : if X =x and Y =y then Z is unusual - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Subgroup Discovery

Subgroup Discovery

Finding Local Patterns in Data

Page 2: Subgroup Discovery

Exploratory Data Analysis

Scan the data without much prior focus

Find unusual parts of the data

Analyse attribute dependencies interpret this as ‘rule’:

if X=x and Y=y then Z is unusual

Complex data: nominal, numeric, relational?

the Subgroup

Page 3: Subgroup Discovery

Exploratory Data Analysis Classification: model the dependency of the

target on the remaining attributes. problem: sometimes classifier is a black-box, or uses

only some of the available dependencies. for example: in decision trees, some attributes may not

appear because of overshadowing.

Exploratory Data Analysis: understanding the effects of all attributes on the target.

Page 4: Subgroup Discovery

Interactions between Attributes

Single-attribute effects are not enough

XOR problem is extreme example: 2 attributes with no info gain form a good subgroup

Apart from A=a, B=b, C=c, …

consider alsoA=aB=b, A=aC=c, …, B=bC=c, …

A=aB=bC=c, …

Page 5: Subgroup Discovery

Subgroup Discovery Task

“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attributes”

Inductive constraints: Minimum support (Maximum support) Minimum quality (Information gain, X2, WRAcc) Maximum complexity …

Page 6: Subgroup Discovery

Subgroup Discovery: the Binary Target Case

Page 7: Subgroup Discovery

Confusion Matrix A confusion matrix (or contingency table) describes the

frequency of the four combinations of subgroup and target: within subgroup, positive within subgroup, negative outside subgroup, positive outside subgroup, negative

T F

T .42 .13 .55

F .12 .33

.54 1.0

subgroup

target

Page 8: Subgroup Discovery

Confusion Matrix High numbers along the TT-FF diagonal means a positive

correlation between subgroup and target

High numbers along the TF-FT diagonal means a negative correlation between subgroup and target

Target distribution on DB is fixed

Only two degrees of freedom

T F

T .42 .13 .55

F .12 .33 .45

.54 .46 1.0

subgroup

target

Page 9: Subgroup Discovery

Quality MeasuresA quality measure for subgroups summarizes the interestingness of

its confusion matrix into a single number

WRAcc, weighted relative accuracy Balance between coverage and unexpectedness

WRAcc(S,T) = p(ST) – p(S)p(T)

between −.25 and .25, 0 means uninteresting

T F

T .42 .13 .55

F .12 .33

.54 1.0

WRAcc(S,T) = p(ST)−p(S)p(T)= .42 − .297 = .123

subgroup

target

Page 10: Subgroup Discovery

Quality Measures

WRAcc: Weighted Relative Accuracy

Information gain

X2

Correlation Coefficient

Laplace

Jaccard

Specificity

Page 11: Subgroup Discovery

Subgroup Discovery as Searchtrue

A=a1 A=a2 B=b1 B=b2 C=c1 …

T F

T .42 .13 .55

F .12 .33

.54 1.0

A=a2B=b1A=a1B=b1 A=a1B=b2… …

…A=a1B=b1C=c1

minimum support level reached

Page 12: Subgroup Discovery

Refinements are (anti-)monotonic

target concept

subgroup S1

S2 refinement of S1

S3 refinement of S2

Refinements are (anti-) monotonic in their support…

…but not in interestingness. This may go up or down.

entire database

Page 13: Subgroup Discovery

Subgroup Discovery and ROC space

Page 14: Subgroup Discovery

ROC Space

Each subgroup forms a point in ROC space, in terms of its False Positive Rate, and True Positive Rate.

TPR = TP/Pos = TP/TP+FN (fraction of positive cases in the subgroup)FPR = FP/Neg = FP/FP+TN (fraction of negative cases in the subgroup)

ROC = Receiver Operating Characteristics

Page 15: Subgroup Discovery

ROC Space Properties

‘ROC heaven’perfect subgroup

‘ROC hell’random subgroup

entire database

emptysubgroup minimum support

threshold

perfectnegative subgroup

Page 16: Subgroup Discovery

Measures in ROC Space

0

WRAcc Information Gain

positive

negative

sou

rce:

Flach

& F

ürn

kran

z

isometric

Page 17: Subgroup Discovery

Other MeasuresPrecision Gini index

Correlation coefficient Foil gain

Page 18: Subgroup Discovery

Refinements in ROC SpaceRefinements of S will reduce the FPR and TPR, so will appear to the left and below S.

Blue polygon represents possible refinements of S. With a convex measure, f is bounded by measure of corners.

..

.

.

.

If corners are not above minimum quality or current best (top k?), prune search space below S.

Page 19: Subgroup Discovery

Multi-class problems Generalising to problems with more than 2 classes

is fairly staightforward:

X2 Information gain

C1 C2 C3

T .27 .06 .22 .55

F .03 .19 .23 .45

.3 .25 .45 1.0

combine values to qualitymeasure

subgroup

target

sou

rce:

Nijs

sen

& K

ok

Page 20: Subgroup Discovery

Subgroup Discovery for Numeric targets

Page 21: Subgroup Discovery

Numeric Subgroup Discovery Target is numeric: find subgroups with significantly

higher or lower average value

Trade-off between size of subgroup and average target value

h = 2200h = 2200

h = 3100h = 3100

h = 3600h = 3600

Page 22: Subgroup Discovery

Types of SD for Numeric Targets

Regression subgroup discovery numeric target has order and scale

Ordinal subgroup discovery numeric target has order

Ranked subgroup discovery numeric target has order or scale

Page 23: Subgroup Discovery

Vancouver 2010 Winter Olympics

Partial rankingobjects share a rank

Partial rankingobjects share a rank

ordinal targetordinal target

regression targetregression target

Page 24: Subgroup Discovery

Offical IOC ranking of countries (med > 0)Rank Country Medals Athletes Continent Popul. Language Family Repub.

Polar 1 USA 37 214 N. America 309 Germanic y y2 Germany 30 152 Europe 82 Germanic y n3 Canada 26 205 N. America 34 Germanic n y4 Norway 23 100 Europe 4.8 Germanic n y5 Austria 16 79 Europe 8.3 Germanic y n6 Russ. Fed. 15 179 Asia 142 Slavic y y7 Korea 14 46 Asia 73 Altaic y n9 China 11 90 Asia 1338 Sino-Tibetan y n9 Sweden 11 107 Europe 9.3 Germanic n y9 France 11 107 Europe 65 Italic y n11 Switzerland 9 144 Europe 7.8 Germanic y n12 Netherlands 8 34 Europe 16.5 Germanic n n13.5 Czech Rep. 6 92 Europe 10.5 Slavic y n13.5 Poland 6 50 Europe 38 Slavic y n16 Italy 5 110 Europe 60 Italic y n16 Japan 5 94 Asia 127 Japonic n n16 Finland 5 95 Europe 5.3 Finno-Ugric y y20 Australia 3 40 Australia 22 Germanic y n20 Belarus 3 49 Europe 9.6 Slavic y n20 Slovakia 3 73 Europe 5.4 Slavic y n20 Croatia 3 18 Europe 4.5 Slavic y n20 Slovenia 3 49 Europe 2 Slavic y n23 Latvia 2 58 Europe 2.2 Slavic y n25 Great Britain 1 52 Europe 61 Germanic n n25 Estonia 1 30 Europe 1.3 Finno-Ugric y n25 Kazakhstan 1 38 Asia 16 Turkic y n

Fractional ranksshared ranks are averaged

Fractional ranksshared ranks are averaged

Page 25: Subgroup Discovery

Interesting Subgroups‘polar = yes’

1. United States

3. Canada

4. Norway

6. Russian Federation

9. Sweden

16 Finland

‘language_family = Germanic & athletes 60’

1. United States

2. Germany

3. Canada

4. Norway

5. Austria

9. Sweden

11. Switzerland

Page 26: Subgroup Discovery

Intuitions Size: larger subgroups are more

reliable

Rank: majority of objects appear at the top

Position: ‘middle’ of subgroup should differ from middle of ranking

Deviation: objects should have similar rank

*****

**

*

language_family = Slaviclanguage_family = Slavic

Page 27: Subgroup Discovery

Intuitions Size: larger subgroups are more

reliable

Rank: majority of objects appear at the top

Position: ‘middle’ of subgroup should differ from middle of ranking

Deviation: objects should have similar rank

polar = yespolar = yes

*****

*

Page 28: Subgroup Discovery

Intuitions Size: larger subgroups are more

reliable

Rank: majority of objects appear at the top

Position: ‘middle’ of subgroup should differ from middle of ranking

Deviation: objects should have similar rank

population 10Mpopulation 10M

**

*

*******

Page 29: Subgroup Discovery

Intuitions Size: larger subgroups are more

reliable

Rank: majority of objects appear at the top

Position: ‘middle’ of subgroup should differ from middle of ranking

Deviation: objects should have similar rank

language_family = Slavic & population 10Mlanguage_family = Slavic & population 10M

*****

Page 30: Subgroup Discovery

Quality Measures Average

Mean test

z-Score

t-Statistic

Median X2 statistic

AUC of ROC

Wilcoxon-Mann-Whitney Ranks statistic

Median MAD Metric

Page 31: Subgroup Discovery

Meet Cortanathe open source Subgroup Discovery tool

Page 32: Subgroup Discovery

Cortana Features

Generic Subgroup Discovery algorithm quality measure search strategy inductive constraints

Flat file, .txt, .arff, (DB connection to come)

Support for complex targets

41 quality measures

ROC plots

Statistical validation

Page 33: Subgroup Discovery

Target Concepts

‘Classical’ Subgroup Discovery nominal targets (classification) numeric targets (regression)

Exceptional Model Mining (to be discussed in a few slides)

multiple targets regression, correlation multi-label classification

Page 34: Subgroup Discovery

Mixed Data

Data types binary nominal numeric

Numeric data is treated dynamically (no discretisation as preprocessing) all: consider all available thresholds bins: discretise the current candidate subgroup best: find best threshold, and search from there

Page 35: Subgroup Discovery

Statistical Validation Determine distribution of random results

random subsets random conditions swap-randomization

Determine minimum quality

Significance of individual results

Validate quality measures how exceptional?

Page 36: Subgroup Discovery

Open Source

You can Use Cortana binary

datamining.liacs.nl/cortana.html

Use and modify Cortana sources (Java)

Page 37: Subgroup Discovery

Exceptional Model MiningSubgroup Discovery with multiple target attributes

Page 38: Subgroup Discovery

Mixture of Distributions

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

Page 39: Subgroup Discovery

Mixture of Distributions

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

Page 40: Subgroup Discovery

Mixture of Distributions

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

For each datapoint it is unclear whether it belongs to G or G

Intensional description of exceptional subgroup G?

Model class unknown

Model parameters unknown

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

Page 41: Subgroup Discovery

Solution: extend Subgroup Discovery

Use other information than X and Y: object desciptions D

Use Subgroup Discovery to scan sub-populations in terms of D

Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.

Page 42: Subgroup Discovery

Solution: extend Subgroup Discovery

Use other information than X and Y: object desciptions D

Use Subgroup Discovery to scan sub-populations in terms of D

Model over subgroup becomes target of SD

Subgroup Discovery: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes.

Exceptional Model Mining

Page 43: Subgroup Discovery

Exceptional Model Mining

Define a target concept (X and y)

X y

object descriptiontarget concept

Page 44: Subgroup Discovery

Exceptional Model Mining

Define a target concept (X and y)

Choose a model class C

Define a quality measure φ over C

X y

modeling

object descriptiontarget concept

Page 45: Subgroup Discovery

Exceptional Model Mining

Define a target concept (X and y)

Choose a model class C

Define a quality measure φ over C

Use Subgroup Discovery to find exceptional subgroups G and associated model M

X y

modeling

Subgroup Discovery

object descriptiontarget concept

Page 46: Subgroup Discovery

Quality Measure Specify what defines an exceptional subgroup G based on

properties of model M

Absolute measure (absolute quality of M) Correlation coefficient Predictive accuracy

Difference measure (difference between M and M) Difference in slope qualitative properties of classifier

Reliable results Minimum support level Statistical significance of G

Page 47: Subgroup Discovery

Correlation Model Correlation coefficient

φρ = ρ(G)

Absolute difference in correlation

φabs = |ρ(G) - ρ(G)|

Entropy weighted absolute difference

φent = H(p)·|ρ(G) - ρ(G)|

Statistical significance of correlation difference φscd

compute z-score from ρ through Fisher transform

compute p-value from z-score

31

31

''*

nn

zzz

Page 48: Subgroup Discovery

Regression Model Compare slope b of

yi = a + b·xi + e, and

yi = a + b·xi + e

Compute significance of slope difference φssd

y = 41 568 + 3.31·x y = 30 723 + 8.45·x

drive = 1 basement = 0 #baths ≤ 1

Page 49: Subgroup Discovery

Gene Expression Data

11_band = ‘no deletion’ survival time ≤ 1919 XP_498569.1 ≤ 57

y = 3313 - 1.77·x y = 360 + 0.40·x

Page 50: Subgroup Discovery

Classification Model Decision Table Majority classifier

BDeu measure (predictiveness)

Hellinger (unusual distribution)

whole database

RIF1 160.45

prognosis = ‘unknown’

Page 51: Subgroup Discovery

General Framework

X y

modeling

Subgroup Discovery

object descriptiontarget concept

Page 52: Subgroup Discovery

General Framework

X y

Regression ●Classification ●ClusteringAssociationGraphical modeling ●…

Subgroup Discovery

object descriptiontarget concept

Page 53: Subgroup Discovery

General Framework

X y

RegressionClassificationClusteringAssociationGraphical modeling…

Subgroup Discovery ●Decision TreesSVM…

object descriptiontarget concept

Page 54: Subgroup Discovery

General Framework

X y

RegressionClassificationClusteringAssociationGraphical modeling…

Subgroup DiscoveryDecision TreesSVM…

propositional ●multi-relational ●…

target concept