Subgroup Discovery
Finding Local Patterns in Data
Exploratory Data Analysis
Scan the data without much prior focus
Find unusual parts of the data
Analyse attribute dependencies; interpret this as a ‘rule’:
if X=x and Y=y (the subgroup), then Z is unusual
Complex data: nominal, numeric
Exploratory Data Analysis
Classification: model the dependency of the target on the remaining attributes
problem: sometimes the classifier is a black box, or uses only some of the available dependencies
for example: in decision trees, some attributes may not appear because of overshadowing
Exploratory Data Analysis: understanding the effects of all attributes on the target
Interactions between Attributes
Single-attribute effects are not enough
XOR problem is an extreme example: 2 attributes with no individual info gain can form a good subgroup (see the sketch after this list)
Apart from A=a, B=b, C=c, …
consider also A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …
A=a ∧ B=b ∧ C=c, …
…
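To make the XOR point concrete, here is a minimal Python sketch (data and attribute names are invented for illustration): each attribute on its own has a WRAcc of 0 on the target, while the conjunction of the two shows a strong deviation (the sign only indicates the direction of the deviation).

```python
from itertools import product

# Toy XOR data: all combinations of two binary attributes, target T = A xor B.
data = [{"A": a, "B": b, "T": a ^ b} for a, b in product([0, 1], repeat=2)] * 25  # 100 rows

def wracc(rows, condition):
    """WRAcc(S,T) = p(ST) - p(S)p(T) for the subgroup S defined by `condition`."""
    n = len(rows)
    p_t = sum(r["T"] for r in rows) / n
    sub = [r for r in rows if condition(r)]
    p_s = len(sub) / n
    p_st = sum(r["T"] for r in sub) / n
    return p_st - p_s * p_t

print(wracc(data, lambda r: r["A"] == 1))                    #  0.0    single attribute: no signal
print(wracc(data, lambda r: r["B"] == 1))                    #  0.0    single attribute: no signal
print(wracc(data, lambda r: r["A"] == 1 and r["B"] == 1))    # -0.125  conjunction: clear deviation
```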
Subgroup Discovery Task
“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute”
Inductive constraints:
minimum coverage (maximum coverage)
minimum quality (information gain, χ², WRAcc)
maximum complexity
…
Subgroup Discovery: the Binary Target Case
Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
within subgroup, positive
within subgroup, negative
outside subgroup, positive
outside subgroup, negative

                 target
               T      F
subgroup  T   .42    .13    .55
          F   .12    .33    .45
              .54    .46    1.0
Confusion Matrix
High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target
High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target
The target distribution on the DB is fixed
Only two degrees of freedom remain
                 target
               T      F
subgroup  T   .42    .13    .55
          F   .12    .33    .45
              .54    .46    1.0
Quality Measures
A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number
WRAcc, Weighted Relative Accuracy: WRAcc(S,T) = p(ST) − p(S)p(T)
compare p(ST) to independence
balance between coverage and unexpectedness
between −.25 and .25, 0 means uninteresting
                 target
               T      F
subgroup  T   .42    .13    .55
          F   .12    .33    .45
              .54    .46    1.0

WRAcc(S,T) = p(ST) − p(S)p(T) = .42 − .55 × .54 = .42 − .297 = .123
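A small Python sketch (function name is mine) that recomputes this example directly from the four cells of the confusion matrix:

```python
def wracc_from_confusion(tp, fp, fn, tn):
    """WRAcc = p(ST) - p(S)p(T), computed from (relative or absolute) confusion-matrix cells."""
    n = tp + fp + fn + tn
    p_st = tp / n        # within subgroup, positive
    p_s = (tp + fp) / n  # subgroup coverage
    p_t = (tp + fn) / n  # overall positive rate
    return p_st - p_s * p_t

# The example above: p(ST) = .42, p(S) = .55, p(T) = .54
print(round(wracc_from_confusion(0.42, 0.13, 0.12, 0.33), 3))   # 0.123
```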
Quality Measures
WRAcc: Weighted Relative Accuracy
Information gain
χ²
Correlation Coefficient
Laplace
Jaccard
Specificity
…
Subgroup Discovery as Search
[search lattice figure: the root is the empty condition ‘true’; the first level holds single conditions A=a1, A=a2, B=b1, B=b2, C=c1, …; deeper levels hold conjunctions A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, …, A=a1 ∧ B=b1 ∧ C=c1, …; branches are expanded until the minimum coverage level is reached]
Refinements are (anti-)monotonic
[figure: the entire database containing a target concept, with subgroup S1, S2 a refinement of S1, and S3 a refinement of S2]
Refinements are (anti-)monotonic in their support…
…but not in interestingness: this may go up or down.
Subgroup Discovery and ROC space
ROC Space
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of all positive cases covered by the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of all negative cases covered by the subgroup)
ROC = Receiver Operating Characteristics
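A minimal sketch of mapping a subgroup's confusion matrix to its point in ROC space, reusing the matrix from the earlier example (function name is mine):

```python
def roc_point(tp, fp, fn, tn):
    """Return the subgroup's ROC-space coordinates (FPR, TPR)."""
    tpr = tp / (tp + fn)   # fraction of all positive cases covered by the subgroup
    fpr = fp / (fp + tn)   # fraction of all negative cases covered by the subgroup
    return fpr, tpr

print(roc_point(0.42, 0.13, 0.12, 0.33))   # (~0.283, ~0.778)
```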
ROC Space Properties
[ROC space figure with labelled points: ‘ROC heaven’ (perfect subgroup), ‘ROC hell’ (random subgroup), the entire database, the empty subgroup, the perfect negative subgroup, and the minimum coverage threshold]
Measures in ROC Space
[figure: isometrics (lines of equal quality) of WRAcc and Information Gain in ROC space, with positive, zero and negative regions; source: Flach & Fürnkranz]
Other Measures
Precision
Gini index
Correlation coefficient
Foil gain
Refinements in ROC Space
Refinements of S will reduce the FPR and TPR, so they will appear to the left of and below S.
The blue polygon represents the possible refinements of S.
With a convex measure, f is bounded by the measure of the corners.
If the corners are not above the minimum quality or the current best (top k?), prune the search space below S.
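A sketch of this pruning idea for WRAcc, which is convex: a refinement of S can only drop covered records, so its quality is bounded by the ROC corner that keeps all of S's positives and none of its negatives. The function names and interface are mine, not Cortana's.

```python
def wracc(tp, fp, pos, neg):
    """WRAcc of a subgroup covering tp positives and fp negatives,
    in a database with pos positives and neg negatives overall."""
    n = pos + neg
    return tp / n - ((tp + fp) / n) * (pos / n)

def optimistic_wracc(tp, fp, pos, neg):
    """Upper bound on WRAcc over all refinements: the best reachable ROC
    corner keeps all tp positives and drops all fp negatives."""
    n = pos + neg
    return (tp / n) * (neg / n)

def can_prune(tp, fp, pos, neg, threshold):
    """Prune the branch below this subgroup if even the optimistic estimate
    cannot reach the quality threshold (e.g. the current top-k minimum)."""
    return optimistic_wracc(tp, fp, pos, neg) < threshold

# Example with the earlier confusion matrix, read as counts per 100 records:
print(wracc(42, 13, 54, 46))              # ~0.123
print(optimistic_wracc(42, 13, 54, 46))   # ~0.193
print(can_prune(42, 13, 54, 46, 0.25))    # True: no refinement can reach 0.25
```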
Multi-class problems
Generalising to problems with more than 2 classes is fairly straightforward:
χ²
Information gain

                 target
               C1     C2     C3
subgroup  T   .27    .06    .22    .55
          F   .03    .19    .23    .45
              .30    .25    .45    1.0

combine the values into a quality measure
source: Nijssen & Kok
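A small sketch of the χ² computation for such a table, in pure Python; the relative frequencies above are scaled to counts by assuming, hypothetically, 100 records:

```python
def chi_square(observed):
    """Chi-square statistic for a subgroup-vs-target contingency table
    (rows: inside/outside the subgroup, columns: target classes)."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# The 2x3 table above, scaled to counts for a hypothetical 100-record database
print(round(chi_square([[27, 6, 22], [3, 19, 23]]), 1))   # ~25.2
```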
Subgroup Discovery for Numeric Targets
Numeric Subgroup Discovery
Target is numeric: find subgroups with a significantly higher or lower average value
Trade-off between the size of the subgroup and its average target value
[figure: example subgroups with h = 2200, h = 3100 and h = 3600]
Types of SD for Numeric Targets
Regression subgroup discovery: numeric target has order and scale
Ordinal subgroup discovery: numeric target has order
Ranked subgroup discovery: numeric target has order or scale
Vancouver 2010 Winter Olympics
Partial ranking: objects share a rank
ordinal target
regression target
Official IOC ranking of countries (med > 0)

Rank  Country        Medals  Athletes  Continent   Popul.  Language Family  Repub.  Polar
1     USA            37      214       N. America  309     Germanic         y       y
2     Germany        30      152       Europe      82      Germanic         y       n
3     Canada         26      205       N. America  34      Germanic         n       y
4     Norway         23      100       Europe      4.8     Germanic         n       y
5     Austria        16      79        Europe      8.3     Germanic         y       n
6     Russ. Fed.     15      179       Asia        142     Slavic           y       y
7     Korea          14      46        Asia        73      Altaic           y       n
9     China          11      90        Asia        1338    Sino-Tibetan     y       n
9     Sweden         11      107       Europe      9.3     Germanic         n       y
9     France         11      107       Europe      65      Italic           y       n
11    Switzerland    9       144       Europe      7.8     Germanic         y       n
12    Netherlands    8       34        Europe      16.5    Germanic         n       n
13.5  Czech Rep.     6       92        Europe      10.5    Slavic           y       n
13.5  Poland         6       50        Europe      38      Slavic           y       n
16    Italy          5       110       Europe      60      Italic           y       n
16    Japan          5       94        Asia        127     Japonic          n       n
16    Finland        5       95        Europe      5.3     Finno-Ugric      y       y
20    Australia      3       40        Australia   22      Germanic         y       n
20    Belarus        3       49        Europe      9.6     Slavic           y       n
20    Slovakia       3       73        Europe      5.4     Slavic           y       n
20    Croatia        3       18        Europe      4.5     Slavic           y       n
20    Slovenia       3       49        Europe      2       Slavic           y       n
23    Latvia         2       58        Europe      2.2     Slavic           y       n
25    Great Britain  1       52        Europe      61      Germanic         n       n
25    Estonia        1       30        Europe      1.3     Finno-Ugric      y       n
25    Kazakhstan     1       38        Asia        16      Turkic           y       n
Fractional ranks: shared ranks are averaged
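Fractional ranks of this kind can be computed, for instance, with scipy.stats.rankdata; medal counts are negated so that the best country gets rank 1:

```python
from scipy.stats import rankdata

# Medal counts of the first rows of the table above (USA .. Poland)
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6]

# rankdata ranks ascending, so negate to put the most medals at rank 1;
# method='average' assigns fractional ranks to ties.
print(rankdata([-m for m in medals], method="average"))
# [ 1.  2.  3.  4.  5.  6.  7.  9.  9.  9. 11. 12. 13.5 13.5]
```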
Interesting Subgroups
‘polar = yes’
1. United States
3. Canada
4. Norway
6. Russian Federation
9. Sweden
16. Finland
‘language_family = Germanic & athletes ≥ 60’
1. United States
2. Germany
3. Canada
4. Norway
5. Austria
9. Sweden
11. Switzerland
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example: ‘language_family = Slavic’ [figure: positions of the subgroup members within the ranking]
The same intuitions illustrated for ‘polar = yes’ [figure: positions of the subgroup members within the ranking]
The same intuitions illustrated for ‘population ≤ 10M’ [figure: positions of the subgroup members within the ranking]
The same intuitions illustrated for ‘language_family = Slavic & population ≤ 10M’ [figure: positions of the subgroup members within the ranking]
Quality Measures
Average
Mean test
z-Score
t-Statistic
Median χ² statistic
AUC of ROC
Wilcoxon-Mann-Whitney Ranks statistic
Median MAD Metric
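A sketch of two of these measures in one common formulation (exact definitions vary between tools): both the z-score and the mean test compare the subgroup's average target value to the overall average, weighted by the subgroup size. The data are invented.

```python
from statistics import mean, pstdev
from math import sqrt

def z_score(subgroup_values, all_values):
    """Deviation of the subgroup mean from the overall mean, in units of sigma / sqrt(n)."""
    mu, sigma = mean(all_values), pstdev(all_values)
    return (mean(subgroup_values) - mu) / (sigma / sqrt(len(subgroup_values)))

def mean_test(subgroup_values, all_values):
    """Mean test: subgroup-size-weighted deviation of the subgroup mean."""
    return sqrt(len(subgroup_values)) * (mean(subgroup_values) - mean(all_values))

# Hypothetical numeric target: the subgroup covers the three high values
db = [10, 12, 9, 14, 30, 28, 33, 11, 13, 10]
sub = [30, 28, 33]
print(round(z_score(sub, db), 2), round(mean_test(sub, db), 2))   # 2.59 23.09
```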
Cortana: the open source Subgroup Discovery tool
Cortana Features
Generic Subgroup Discovery algorithm: quality measure, search strategy, inductive constraints
Flat file, .txt, .arff
Support for complex targets
>41 quality measures
ROC plots
Statistical validation
Target Concepts
‘Classical’ Subgroup Discovery: nominal targets (classification), numeric targets (regression)
Exceptional Model Mining (an extension of SD): multiple targets, regression, correlation, multi-label classification
Mixed Data
Data types: binary, nominal, numeric
Numeric data is treated dynamically (no discretisation as preprocessing; see the sketch below):
all: consider all available thresholds
bins: discretise the current candidate subgroup
best: find the best threshold, and search from there
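A rough sketch of the ‘all’ and ‘best’ strategies for a single numeric attribute (names, interface and the quality function are mine; ‘bins’ would simply restrict the candidate thresholds to bin boundaries of the current subgroup):

```python
def candidate_thresholds(values):
    """'all' strategy: every distinct value occurring in the
    current candidate subgroup is a potential threshold."""
    return sorted(set(values))

def best_threshold(rows, attribute, quality):
    """'best' strategy: evaluate 'attribute <= t' for every candidate
    threshold and keep only the one with the highest quality."""
    best = None
    for t in candidate_thresholds([r[attribute] for r in rows]):
        subgroup = [r for r in rows if r[attribute] <= t]
        q = quality(subgroup, rows)
        if best is None or q > best[1]:
            best = (t, q)
    return best

# Hypothetical data: numeric attribute 'age', binary target 'T'
rows = [{"age": a, "T": t} for a, t in [(23, 1), (28, 1), (35, 1), (40, 0), (52, 0), (61, 0)]]
wracc = lambda sub, all_: (sum(r["T"] for r in sub) / len(all_)
                           - (len(sub) / len(all_)) * (sum(r["T"] for r in all_) / len(all_)))
print(best_threshold(rows, "age", wracc))   # (35, 0.25)
```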
Statistical Validation
Determine the distribution of random results (see the sketch below): random subsets, random conditions, swap-randomization
Determine minimum quality
Significance of individual results
Validate quality measures: how exceptional?
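A minimal sketch of the ‘random subsets’ variant: sample many random subgroups of a fixed size, record the quality of each, and use a high quantile of that null distribution as the minimum quality a reported subgroup of that size must exceed. Function names and interface are mine.

```python
import random

def random_quality_distribution(rows, size, quality, n_samples=1000, seed=0):
    """Null distribution: the quality measure evaluated on many random
    subsets of a fixed size, i.e. how high it gets purely by chance."""
    rng = random.Random(seed)
    return sorted(quality(rng.sample(rows, size), rows) for _ in range(n_samples))

def minimum_quality(rows, size, quality, significance=0.05, **kw):
    """Take the (1 - significance) quantile of the null distribution
    as the minimum quality threshold for subgroups of this size."""
    dist = random_quality_distribution(rows, size, quality, **kw)
    return dist[int((1 - significance) * len(dist)) - 1]

# Usage sketch, e.g. with the hypothetical rows and wracc from the previous sketch:
# threshold = minimum_quality(rows, size=3, quality=wracc)
```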
Open Source
You can:
use the Cortana binary: datamining.liacs.nl/cortana.html
use and modify the Cortana sources (Java)