Subgroup Discovery
Finding Local Patterns in Data
Exploratory Data Analysis
Scan the data without much prior focus
Find unusual parts of the data
Analyse attribute dependencies; interpret this as a ‘rule’:
if X=x and Y=y (the subgroup), then Z is unusual
Complex data: nominal, numeric
Exploratory Data Analysis
Classification: model the dependency of the target on the remaining attributes
problem: sometimes the classifier is a black box, or uses only some of the available dependencies
for example: in decision trees, some attributes may not appear because of overshadowing
Exploratory Data Analysis: understanding the effects of all attributes on the target
Interactions between Attributes
Single-attribute effects are not enough
XOR problem is an extreme example: 2 attributes with no individual info gain can form a good subgroup (see the sketch after this list)
Apart from A=a, B=b, C=c, …
consider also A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …
A=a ∧ B=b ∧ C=c, …
…
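To make the XOR point concrete, here is a minimal Python sketch (data and attribute names are invented for illustration): each attribute on its own has a WRAcc of 0 on the target, while the conjunction of the two shows a strong deviation (the sign only indicates the direction of the deviation).

```python
from itertools import product

# Toy XOR data: all combinations of two binary attributes, target T = A xor B.
data = [{"A": a, "B": b, "T": a ^ b} for a, b in product([0, 1], repeat=2)] * 25  # 100 rows

def wracc(rows, condition):
    """WRAcc(S,T) = p(ST) - p(S)p(T) for the subgroup S defined by `condition`."""
    n = len(rows)
    p_t = sum(r["T"] for r in rows) / n
    sub = [r for r in rows if condition(r)]
    p_s = len(sub) / n
    p_st = sum(r["T"] for r in sub) / n
    return p_st - p_s * p_t

print(wracc(data, lambda r: r["A"] == 1))                    #  0.0    single attribute: no signal
print(wracc(data, lambda r: r["B"] == 1))                    #  0.0    single attribute: no signal
print(wracc(data, lambda r: r["A"] == 1 and r["B"] == 1))    # -0.125  conjunction: clear deviation
```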
Subgroup Discovery Task
“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute”
Inductive constraints:
minimum coverage (maximum coverage)
minimum quality (information gain, χ², WRAcc)
maximum complexity
…
Subgroup Discovery: the Binary Target Case
Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
within subgroup, positive
within subgroup, negative
outside subgroup, positive
outside subgroup, negative

                 target
               T      F
subgroup  T   .42    .13    .55
          F   .12    .33    .45
              .54    .46    1.0
Confusion Matrix
High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target
High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target
The target distribution on the DB is fixed
Only two degrees of freedom remain
                 target
               T      F
subgroup  T   .42    .13    .55
          F   .12    .33    .45
              .54    .46    1.0
Quality Measures
A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number
WRAcc, Weighted Relative Accuracy: WRAcc(S,T) = p(ST) − p(S)p(T)
compare p(ST) to independence
balance between coverage and unexpectedness
between −.25 and .25, 0 means uninteresting
                 target
               T      F
subgroup  T   .42    .13    .55
          F   .12    .33    .45
              .54    .46    1.0

WRAcc(S,T) = p(ST) − p(S)p(T) = .42 − .55 × .54 = .42 − .297 = .123
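A small Python sketch (function name is mine) that recomputes this example directly from the four cells of the confusion matrix:

```python
def wracc_from_confusion(tp, fp, fn, tn):
    """WRAcc = p(ST) - p(S)p(T), computed from (relative or absolute) confusion-matrix cells."""
    n = tp + fp + fn + tn
    p_st = tp / n        # within subgroup, positive
    p_s = (tp + fp) / n  # subgroup coverage
    p_t = (tp + fn) / n  # overall positive rate
    return p_st - p_s * p_t

# The example above: p(ST) = .42, p(S) = .55, p(T) = .54
print(round(wracc_from_confusion(0.42, 0.13, 0.12, 0.33), 3))   # 0.123
```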
Quality Measures
WRAcc: Weighted Relative Accuracy
Information gain
χ²
Correlation Coefficient
Laplace
Jaccard
Specificity
…
Subgroup Discovery as Search
[search lattice figure: the root is the empty condition ‘true’; the first level holds single conditions A=a1, A=a2, B=b1, B=b2, C=c1, …; deeper levels hold conjunctions A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, …, A=a1 ∧ B=b1 ∧ C=c1, …; branches are expanded until the minimum coverage level is reached]
Refinements are (anti-)monotonic
[figure: the entire database containing a target concept, with subgroup S1, S2 a refinement of S1, and S3 a refinement of S2]
Refinements are (anti-)monotonic in their support…
…but not in interestingness: this may go up or down.
Subgroup Discovery and ROC space
ROC Space
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of all positive cases covered by the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of all negative cases covered by the subgroup)
ROC = Receiver Operating Characteristics
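A minimal sketch of mapping a subgroup's confusion matrix to its point in ROC space, reusing the matrix from the earlier example (function name is mine):

```python
def roc_point(tp, fp, fn, tn):
    """Return the subgroup's ROC-space coordinates (FPR, TPR)."""
    tpr = tp / (tp + fn)   # fraction of all positive cases covered by the subgroup
    fpr = fp / (fp + tn)   # fraction of all negative cases covered by the subgroup
    return fpr, tpr

print(roc_point(0.42, 0.13, 0.12, 0.33))   # (~0.283, ~0.778)
```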
ROC Space Properties
[ROC space figure with labelled points: ‘ROC heaven’ (perfect subgroup), ‘ROC hell’ (random subgroup), the entire database, the empty subgroup, the perfect negative subgroup, and the minimum coverage threshold]
Measures in ROC Space
[figure: isometrics (lines of equal quality) of WRAcc and Information Gain in ROC space, with positive, zero and negative regions; source: Flach & Fürnkranz]
Other Measures
Precision
Gini index
Correlation coefficient
Foil gain
Refinements in ROC Space
Refinements of S will reduce the FPR and TPR, so they will appear to the left of and below S.
The blue polygon represents the possible refinements of S.
With a convex measure, f is bounded by the measure of the corners.
If the corners are not above the minimum quality or the current best (top k?), prune the search space below S.
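A sketch of this pruning idea for WRAcc, which is convex: a refinement of S can only drop covered records, so its quality is bounded by the ROC corner that keeps all of S's positives and none of its negatives. The function names and interface are mine, not Cortana's.

```python
def wracc(tp, fp, pos, neg):
    """WRAcc of a subgroup covering tp positives and fp negatives,
    in a database with pos positives and neg negatives overall."""
    n = pos + neg
    return tp / n - ((tp + fp) / n) * (pos / n)

def optimistic_wracc(tp, fp, pos, neg):
    """Upper bound on WRAcc over all refinements: the best reachable ROC
    corner keeps all tp positives and drops all fp negatives."""
    n = pos + neg
    return (tp / n) * (neg / n)

def can_prune(tp, fp, pos, neg, threshold):
    """Prune the branch below this subgroup if even the optimistic estimate
    cannot reach the quality threshold (e.g. the current top-k minimum)."""
    return optimistic_wracc(tp, fp, pos, neg) < threshold

# Example with the earlier confusion matrix, read as counts per 100 records:
print(wracc(42, 13, 54, 46))              # ~0.123
print(optimistic_wracc(42, 13, 54, 46))   # ~0.193
print(can_prune(42, 13, 54, 46, 0.25))    # True: no refinement can reach 0.25
```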
Multi-class problems
Generalising to problems with more than 2 classes is fairly straightforward:
χ²
Information gain

                 target
               C1     C2     C3
subgroup  T   .27    .06    .22    .55
          F   .03    .19    .23    .45
              .30    .25    .45    1.0

combine the values into a quality measure
source: Nijssen & Kok
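A small sketch of the χ² computation for such a table, in pure Python; the relative frequencies above are scaled to counts by assuming, hypothetically, 100 records:

```python
def chi_square(observed):
    """Chi-square statistic for a subgroup-vs-target contingency table
    (rows: inside/outside the subgroup, columns: target classes)."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# The 2x3 table above, scaled to counts for a hypothetical 100-record database
print(round(chi_square([[27, 6, 22], [3, 19, 23]]), 1))   # ~25.2
```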
Subgroup Discovery for Numeric Targets
Numeric Subgroup Discovery
Target is numeric: find subgroups with a significantly higher or lower average value
Trade-off between the size of the subgroup and its average target value
[figure: example subgroups with h = 2200, h = 3100 and h = 3600]
Types of SD for Numeric Targets
Regression subgroup discovery: numeric target has order and scale
Ordinal subgroup discovery: numeric target has order
Ranked subgroup discovery: numeric target has order or scale
Vancouver 2010 Winter Olympics
Partial ranking: objects share a rank
ordinal target
regression target
Official IOC ranking of countries (med > 0)

Rank  Country        Medals  Athletes  Continent   Popul.  Language Family  Repub.  Polar
1     USA            37      214       N. America  309     Germanic         y       y
2     Germany        30      152       Europe      82      Germanic         y       n
3     Canada         26      205       N. America  34      Germanic         n       y
4     Norway         23      100       Europe      4.8     Germanic         n       y
5     Austria        16      79        Europe      8.3     Germanic         y       n
6     Russ. Fed.     15      179       Asia        142     Slavic           y       y
7     Korea          14      46        Asia        73      Altaic           y       n
9     China          11      90        Asia        1338    Sino-Tibetan     y       n
9     Sweden         11      107       Europe      9.3     Germanic         n       y
9     France         11      107       Europe      65      Italic           y       n
11    Switzerland    9       144       Europe      7.8     Germanic         y       n
12    Netherlands    8       34        Europe      16.5    Germanic         n       n
13.5  Czech Rep.     6       92        Europe      10.5    Slavic           y       n
13.5  Poland         6       50        Europe      38      Slavic           y       n
16    Italy          5       110       Europe      60      Italic           y       n
16    Japan          5       94        Asia        127     Japonic          n       n
16    Finland        5       95        Europe      5.3     Finno-Ugric      y       y
20    Australia      3       40        Australia   22      Germanic         y       n
20    Belarus        3       49        Europe      9.6     Slavic           y       n
20    Slovakia       3       73        Europe      5.4     Slavic           y       n
20    Croatia        3       18        Europe      4.5     Slavic           y       n
20    Slovenia       3       49        Europe      2       Slavic           y       n
23    Latvia         2       58        Europe      2.2     Slavic           y       n
25    Great Britain  1       52        Europe      61      Germanic         n       n
25    Estonia        1       30        Europe      1.3     Finno-Ugric      y       n
25    Kazakhstan     1       38        Asia        16      Turkic           y       n
Fractional ranks: shared ranks are averaged
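Fractional ranks of this kind can be computed, for instance, with scipy.stats.rankdata; medal counts are negated so that the best country gets rank 1:

```python
from scipy.stats import rankdata

# Medal counts of the first rows of the table above (USA .. Poland)
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6]

# rankdata ranks ascending, so negate to put the most medals at rank 1;
# method='average' assigns fractional ranks to ties.
print(rankdata([-m for m in medals], method="average"))
# [ 1.  2.  3.  4.  5.  6.  7.  9.  9.  9. 11. 12. 13.5 13.5]
```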
Interesting Subgroups
‘polar = yes’
1. United States
3. Canada
4. Norway
6. Russian Federation
9. Sweden
16. Finland
‘language_family = Germanic & athletes ≥ 60’
1. United States
2. Germany
3. Canada
4. Norway
5. Austria
9. Sweden
11. Switzerland
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example: ‘language_family = Slavic’ [figure: positions of the subgroup members within the ranking]
The same intuitions illustrated for ‘polar = yes’ [figure: positions of the subgroup members within the ranking]
The same intuitions illustrated for ‘population ≤ 10M’ [figure: positions of the subgroup members within the ranking]
The same intuitions illustrated for ‘language_family = Slavic & population ≤ 10M’ [figure: positions of the subgroup members within the ranking]
Quality Measures
Average
Mean test
z-Score
t-Statistic
Median χ² statistic
AUC of ROC
Wilcoxon-Mann-Whitney Ranks statistic
Median MAD Metric
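A sketch of two of these measures in one common formulation (exact definitions vary between tools): both the z-score and the mean test compare the subgroup's average target value to the overall average, weighted by the subgroup size. The data are invented.

```python
from statistics import mean, pstdev
from math import sqrt

def z_score(subgroup_values, all_values):
    """Deviation of the subgroup mean from the overall mean, in units of sigma / sqrt(n)."""
    mu, sigma = mean(all_values), pstdev(all_values)
    return (mean(subgroup_values) - mu) / (sigma / sqrt(len(subgroup_values)))

def mean_test(subgroup_values, all_values):
    """Mean test: subgroup-size-weighted deviation of the subgroup mean."""
    return sqrt(len(subgroup_values)) * (mean(subgroup_values) - mean(all_values))

# Hypothetical numeric target: the subgroup covers the three high values
db = [10, 12, 9, 14, 30, 28, 33, 11, 13, 10]
sub = [30, 28, 33]
print(round(z_score(sub, db), 2), round(mean_test(sub, db), 2))   # 2.59 23.09
```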
Cortana: the open source Subgroup Discovery tool
Cortana Features
Generic Subgroup Discovery algorithm: quality measure, search strategy, inductive constraints
Flat file, .txt, .arff
Support for complex targets
>41 quality measures
ROC plots
Statistical validation
Target Concepts
‘Classical’ Subgroup Discovery: nominal targets (classification), numeric targets (regression)
Exceptional Model Mining (an extension of SD): multiple targets, regression, correlation, multi-label classification
Mixed Data
Data types: binary, nominal, numeric
Numeric data is treated dynamically (no discretisation as preprocessing; see the sketch below):
all: consider all available thresholds
bins: discretise the current candidate subgroup
best: find the best threshold, and search from there
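A rough sketch of the ‘all’ and ‘best’ strategies for a single numeric attribute (names, interface and the quality function are mine; ‘bins’ would simply restrict the candidate thresholds to bin boundaries of the current subgroup):

```python
def candidate_thresholds(values):
    """'all' strategy: every distinct value occurring in the
    current candidate subgroup is a potential threshold."""
    return sorted(set(values))

def best_threshold(rows, attribute, quality):
    """'best' strategy: evaluate 'attribute <= t' for every candidate
    threshold and keep only the one with the highest quality."""
    best = None
    for t in candidate_thresholds([r[attribute] for r in rows]):
        subgroup = [r for r in rows if r[attribute] <= t]
        q = quality(subgroup, rows)
        if best is None or q > best[1]:
            best = (t, q)
    return best

# Hypothetical data: numeric attribute 'age', binary target 'T'
rows = [{"age": a, "T": t} for a, t in [(23, 1), (28, 1), (35, 1), (40, 0), (52, 0), (61, 0)]]
wracc = lambda sub, all_: (sum(r["T"] for r in sub) / len(all_)
                           - (len(sub) / len(all_)) * (sum(r["T"] for r in all_) / len(all_)))
print(best_threshold(rows, "age", wracc))   # (35, 0.25)
```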
Statistical Validation
Determine the distribution of random results (see the sketch below): random subsets, random conditions, swap-randomization
Determine minimum quality
Significance of individual results
Validate quality measures: how exceptional?
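A minimal sketch of the ‘random subsets’ variant: sample many random subgroups of a fixed size, record the quality of each, and use a high quantile of that null distribution as the minimum quality a reported subgroup of that size must exceed. Function names and interface are mine.

```python
import random

def random_quality_distribution(rows, size, quality, n_samples=1000, seed=0):
    """Null distribution: the quality measure evaluated on many random
    subsets of a fixed size, i.e. how high it gets purely by chance."""
    rng = random.Random(seed)
    return sorted(quality(rng.sample(rows, size), rows) for _ in range(n_samples))

def minimum_quality(rows, size, quality, significance=0.05, **kw):
    """Take the (1 - significance) quantile of the null distribution
    as the minimum quality threshold for subgroups of this size."""
    dist = random_quality_distribution(rows, size, quality, **kw)
    return dist[int((1 - significance) * len(dist)) - 1]

# Usage sketch, e.g. with the hypothetical rows and wracc from the previous sketch:
# threshold = minimum_quality(rows, size=3, quality=wracc)
```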
Open Source
You can:
use the Cortana binary: datamining.liacs.nl/cortana.html
use and modify the Cortana sources (Java)