Clustering Pathways Using Graph Mining Approach
Mahmud Shahriar Hossain, Monika Akbar, Pramodh Pochu, Venkata Sesha Sanagavarapu
Design Pipeline
Preprocessor
Frequent Subgraph Discovery
Graph Objects of Pathways
Mined Data
Pathway Clustering
STKE Dataset
NN Search
Pathway Relations
Dataset Properties (size)
Total Pathways = 50
[Figure: number of k-edge pathways (y-axis, 0 to 4) versus size of pathway, k (x-axis, 0 to 110)]
Dataset Properties (size)
Total Pathways = 50
[Figure: number of pathways in each size range (y-axis, 0 to 13) versus size range (x-axis, 0-10 through 100-110 in steps of 10)]
pf-ipf (tf-idf)
Transaction          Items bought
David Lopez          Orange Juice (2), Potato chip (3), Pepsi (1)
Robbie Lamb          Potato chip (3), Pepsi (3), Beer (1)
Jonathan Branden     Potato chip (1), Pepsi (1)
John Paxton          Potato chip (2), Coconut Cookies (2), Pepsi (1)
Rafal Angryk         Swiss Army Knife (15)
Jeannete Radclif     Potato chip (2), Coconut Cookies (3)
Rocky Ross           Orange Juice (2), Coconut Cookies (3)
Richard MaClaster    Coconut Cookies (3), Beer (1)
…                    …
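The transaction table reads as an analogy for pf-ipf: items play the role of edges and transactions play the role of pathways. The slides do not give the exact pf-ipf formula, so the sketch below assumes the common tf-idf form (frequency within a transaction, normalized by the transaction total, times the log of the inverse share of transactions containing the item). The function name `tf_idf` is hypothetical.

```python
import math

# First eight rows of the transaction table; the remaining rows are elided
# on the slide and are not reconstructed here.
transactions = {
    "David Lopez": {"Orange Juice": 2, "Potato chip": 3, "Pepsi": 1},
    "Robbie Lamb": {"Potato chip": 3, "Pepsi": 3, "Beer": 1},
    "Jonathan Branden": {"Potato chip": 1, "Pepsi": 1},
    "John Paxton": {"Potato chip": 2, "Coconut Cookies": 2, "Pepsi": 1},
    "Rafal Angryk": {"Swiss Army Knife": 15},
    "Jeannete Radclif": {"Potato chip": 2, "Coconut Cookies": 3},
    "Rocky Ross": {"Orange Juice": 2, "Coconut Cookies": 3},
    "Richard MaClaster": {"Coconut Cookies": 3, "Beer": 1},
}


def tf_idf(rows):
    """Assumed weighting: (count / row total) * ln(N / rows containing item)."""
    n_rows = len(rows)
    # Document frequency: in how many transactions each item appears.
    doc_freq = {}
    for items in rows.values():
        for item in items:
            doc_freq[item] = doc_freq.get(item, 0) + 1
    scores = {}
    for name, items in rows.items():
        total = sum(items.values())
        scores[name] = {
            item: (count / total) * math.log(n_rows / doc_freq[item])
            for item, count in items.items()
        }
    return scores


scores = tf_idf(transactions)
for name, items in scores.items():
    print(name, {item: round(score, 3) for item, score in items.items()})
```

Under this assumed weighting, an item bought in a single transaction (Swiss Army Knife) scores far higher than one that appears in most transactions (Potato chip), which is the property that makes a minimum-score threshold useful for pruning.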
Dataset Properties (pf-ipf)
Number of Edges in MPG = 1376
[Figure: number of edges left (y-axis, 0 to 1400) versus min_pfipf (x-axis, 0.00 to 0.20)]
Dataset Properties (pf-ipf)
Total Pathways = 50
[Figure: number of pathways left (y-axis, 20 to 50) versus min_pfipf (x-axis, 0.00 to 0.20)]
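The two pf-ipf plots track how many edges and pathways survive as min_pfipf rises. A minimal sketch of that bookkeeping follows. It assumes each pathway carries a score per edge and that a pathway counts as "left" while it keeps at least one edge; neither rule is stated on the slides. The toy scores and the names `prune` and `summarize` are hypothetical.

```python
def prune(pathway_scores, min_pfipf):
    """Keep only the (pathway, edge) entries whose score reaches min_pfipf."""
    pruned = {}
    for pathway, edges in pathway_scores.items():
        kept = {edge: s for edge, s in edges.items() if s >= min_pfipf}
        if kept:  # assumption: a pathway with no remaining edge is dropped
            pruned[pathway] = kept
    return pruned


def summarize(pruned):
    """Return (number of distinct edges left, number of pathways left)."""
    distinct_edges = set()
    for edges in pruned.values():
        distinct_edges.update(edges)
    return len(distinct_edges), len(pruned)


# Hypothetical toy scores, not taken from the STKE dataset.
toy_scores = {
    "P1": {("A", "B"): 0.15, ("B", "C"): 0.01},
    "P2": {("B", "C"): 0.03, ("C", "D"): 0.07},
    "P3": {("D", "E"): 0.005},
}

for threshold in (0.0, 0.02, 0.10):
    print(threshold, summarize(prune(toy_scores, threshold)))
```

Sweeping the threshold over a grid such as 0.00 to 0.20 and plotting the two counts would give curves of the kind shown on these slides.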
Subgraph Discovery
k     # of Subgraphs generated    Time (sec.)
1     1,376                       Existing
2     5,380                       41
3     29,565                      149
4     187,508                     971
5     1,274,852                   7,518
---   ----                        -----
min_sup=2%
• What is so novel about pruning edges?
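To make the role of min_sup concrete, here is a minimal sketch of support counting. It assumes node labels are unique, so a candidate subgraph can be treated as a set of labeled edges and containment becomes a subset test; real frequent-subgraph miners need isomorphism checks. The toy pathways and the names `support` and `frequent` are hypothetical.

```python
def support(candidate_edges, pathways):
    """Fraction of pathways whose edge set contains every candidate edge."""
    candidate = frozenset(candidate_edges)
    hits = sum(1 for edges in pathways if candidate <= edges)
    return hits / len(pathways)


def frequent(candidates, pathways, min_sup):
    """Keep the candidates whose support is at least min_sup (a fraction)."""
    return [c for c in candidates if support(c, pathways) >= min_sup]


# Hypothetical toy pathways, each given as a set of labeled edges.
toy_pathways = [
    frozenset({("A", "B"), ("B", "C")}),
    frozenset({("A", "B"), ("B", "C"), ("C", "D")}),
    frozenset({("B", "C"), ("C", "D")}),
    frozenset({("X", "Y")}),
]
toy_candidates = [
    frozenset({("A", "B"), ("B", "C")}),
    frozenset({("B", "C"), ("C", "D")}),
    frozenset({("X", "Y")}),
]

print(frequent(toy_candidates, toy_pathways, min_sup=0.5))
```

If support is measured as a fraction of pathways, min_sup = 2% over 50 pathways corresponds to a single pathway (0.02 × 50 = 1), so nearly every candidate qualifies; that is consistent with the rapid growth in the table.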
Subgraph Discovery
Contour graph for number of subgraphs
[Figure: contour and surface plots of the number of subgraphs (0 to 6000) over min_sup (4 to 20) and pf-ipf threshold (0.01 to 0.10)]
Total runs: 10 × 9
Subgraph Discovery
min_sup = 4.0%, min_tfidf = 0.01
[Figure: time (ms, 0 to 400 × 10^3) versus k (3 to 21) for FSG and SEM]
Subgraph Discovery
min_sup = 4.0%, min_tfidf = 0.01
[Figure: time (ms, 0 to 3000) versus k (3 to 21) for FSG and SEM]
Subgraph Discovery
min_sup = 4.0%, min_tfidf = 0.01
[Figure: number of attempts (0 to 1,250,000) versus k (3 to 21) for FSG and SEM]
k     Number of Subgraphs    Time Saved (%)    Attempts Saved (%)
2     186                    99.83             98.98
3     246                    98.33             86.15
4     305                    98.57             86.38
5     323                    98.95             86.91
6     313                    98.96             85.64
7     279                    98.88             83.25
8     263                    98.67             78.91
9     292                    98.38             74.76
10    364                    98.58             74.75
11    470                    98.76             78.08
12    608                    99.04             81.84
13    785                    99.22             85.02
14    980                    99.38             87.63
15    1117                   99.48             89.48
16    1075                   99.53             90.26
17    804                    99.51             89.40
18    430                    99.34             85.22
19    141                    98.76             71.22
20    20                     96.15             9.19
21    1                      75.74             -574.47
Overall attempts saved = 89.52%
Overall time saved = 99.39%
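The savings columns can be read as relative reductions against a baseline run. The slides do not state the formula, so the sketch below assumes saved % = 100 × (baseline − improved) / baseline, with the overall figure computed from totals instead of by averaging the per-k percentages. All numbers in the example are hypothetical and are not the measurements behind the table.

```python
def percent_saved(baseline, improved):
    """Assumed definition: relative reduction against the baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline


def overall_percent_saved(baseline_values, improved_values):
    """Assumed overall figure: the same formula applied to the totals."""
    return percent_saved(sum(baseline_values), sum(improved_values))


# Hypothetical per-level costs; the last level costs more than its baseline.
baseline_costs = [1000, 1000, 10]
improved_costs = [100, 200, 60]

per_level = [percent_saved(b, i) for b, i in zip(baseline_costs, improved_costs)]
print(per_level)
print(overall_percent_saved(baseline_costs, improved_costs))
```

Under this reading, a negative entry such as the one at k = 21 means more attempts were made than in the baseline at that level. Because the overall figure weights each level by its baseline cost, one small level with a large negative percentage barely moves the total.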
Clustering
Average SC mesh plot for 10 clusters using different min_sup and pf-ipf threshold values
[Figure: mesh plot of average SC (0.04 to 0.24) over min_sup (6 to 20) and pf-ipf threshold (0.01 to 0.10)]
Clustering
Average SC contour graph for 10 clusters using different min_sup and pf-ipf threshold values
[Figure: contour plot of average SC (levels 0.08 to 0.20) over min_sup (4 to 20) and pf-ipf threshold (0.01 to 0.10)]
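The slides do not expand "SC". Assuming it stands for the silhouette coefficient, a common cluster-quality score, the average can be computed from a pairwise distance matrix and a cluster label per pathway, as in the sketch below. The function name, the toy points, and the distance used are all hypothetical.

```python
def average_silhouette(dist, labels):
    """Mean silhouette over all points; needs at least two clusters.

    dist[i][j] is the distance between points i and j.
    Points that sit alone in their cluster contribute 0, by convention.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(labels)


# Hypothetical toy data: four points on a line, distance = absolute difference.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]

print(average_silhouette(toy_dist, [0, 0, 1, 1]))  # compact, well-separated clusters
print(average_silhouette(toy_dist, [0, 1, 0, 1]))  # clusters that mix far-apart points
```

Values near 1 indicate tight, well-separated clusters and values below 0 indicate points that sit closer to another cluster. A grid over min_sup and pf-ipf threshold would evaluate a score like this once per parameter pair.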
Nearest Neighbors
Each bar indicates the time for 100 executions of the NN search for one pathway
[Figure: time (ms, 0 to 16000) per sample pathway (0 to 25), Cover Tree versus brute-force method]
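For reference, the brute-force side of this comparison can be sketched in a few lines: compute the distance from the query pathway to every other pathway and keep the k closest. The slides do not say how pathways are represented or which distance is used, so the feature vectors, the Euclidean distance, and the names below are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def brute_force_knn(query_name, vectors, k):
    """Return the names of the k pathways closest to the query pathway.

    Every other pathway is compared with the query, so one search costs
    len(vectors) - 1 distance computations.
    """
    query = vectors[query_name]
    others = [name for name in vectors if name != query_name]
    others.sort(key=lambda name: euclidean(query, vectors[name]))
    return others[:k]


# Hypothetical feature vectors for four pathways.
toy_vectors = {
    "p1": (0.0, 0.0),
    "p2": (1.0, 0.0),
    "p3": (0.0, 3.0),
    "p4": (5.0, 5.0),
}

print(brute_force_knn("p1", toy_vectors, k=2))
```

A cover tree answers the same query over a metric distance while aiming to evaluate far fewer distances per search, which is the trade-off the chart compares.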
Pathway Relations (StoryTelling)
Bidirectional Search
[Diagram: bidirectional search between S and T through intermediate pathways p1, p2, p3, p7, p8, p9]
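Bidirectional search grows one frontier from S and one from T and stops when they touch. The sketch below is a plain breadth-first version over an undirected neighbor graph. The wiring of the toy graph is hypothetical (the diagram's edges are not recoverable from the slide), and the function names are assumptions. It returns one connecting chain, not necessarily the shortest.

```python
from collections import deque


def build_undirected(edges):
    """Turn a list of (u, v) pairs into a symmetric adjacency dict."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    return neighbors


def _expand_level(neighbors, frontier, parents, other_parents):
    """Expand one BFS level; return a node reached by both searches, if any."""
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt in parents:
                continue
            parents[nxt] = node
            if nxt in other_parents:
                return nxt
            frontier.append(nxt)
    return None


def _trace(parents, node):
    """Follow parent links from node back to the root of one search."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def bidirectional_search(neighbors, start, target):
    """Return one chain of nodes from start to target, or None if none exists."""
    if start == target:
        return [start]
    parents_s, parents_t = {start: None}, {target: None}
    frontier_s, frontier_t = deque([start]), deque([target])
    while frontier_s and frontier_t:
        meet = _expand_level(neighbors, frontier_s, parents_s, parents_t)
        if meet is None:
            meet = _expand_level(neighbors, frontier_t, parents_t, parents_s)
        if meet is not None:
            head = _trace(parents_s, meet)[::-1]  # start ... meet
            tail = _trace(parents_t, meet)[1:]    # node after meet ... target
            return head + tail
    return None


# Hypothetical wiring between the nodes named on the slide.
toy_graph = build_undirected([
    ("S", "p1"), ("p1", "p2"), ("p2", "p7"), ("p7", "T"),
    ("S", "p3"), ("p3", "p8"), ("T", "p9"),
])

print(bidirectional_search(toy_graph, "S", "T"))
```

In a storytelling setting, the neighbor list of each pathway would plausibly be limited to its b nearest neighbors, which is where the branching factor in the later plots would enter.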
Pathway Relations (StoryTelling)
Number of stories of each length for different branching factors
[Figure: number of t-length stories (0 to 350) versus story length, t (3 to 16), for b = 2, 4, 6, 8]
Pathway Relations (StoryTelling)
Number of stories of each length for different branching factors
[Figure: number of t-length stories (0 to 350) versus story length, t (3 to 16), for b = 2 through 10]
Pathway Relations (StoryTelling)
[Figure: total stories from all pairs (0 to 1000) versus branching factor, b (2 to 10)]
[Figure: time to generate all stories (ms, 0 to 1.4 × 10^6) versus branching factor, b (2 to 10)]
[Figure: length of the longest story (4 to 16) versus branching factor, b (2 to 10)]
Questions?