graph-based learning - school of electrical …holder/courses/cse6363/spr04/slides/...2 graph-based...

Graph-based Learning
Larry Holder
Computer Science and Engineering
University of Texas at Arlington

Graph-based Learning
  Multi-relational data mining and learning
  SUBDUE graph-based relational learner
    Discovery
    Clustering
    Graph grammar learning
    Supervised learning

Multi-Relational Data Mining
  Looking for patterns involving multiple tables (relations) in a relational database

Person
  ID   Last    First    Age   Income
  P1   Doe     John     30    80000
  P2   Doe     Sally    29    90000
  P3   Smith   Robert   35    100000

Married
  Person1   Person2
  P1        P2
  P3        P7

RichCouple(X,Y) :- Person(X,LastX,FirstX,AgeX,IncX) & Person(Y,LastY,FirstY,AgeY,IncY) & Married(X,Y) & (IncX + IncY) > 150000.
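
The RichCouple rule joins the Person and Married relations. A minimal Python sketch of that join follows; the dictionary layout and the function name `rich_couples` are illustrative assumptions, not part of any ILP system. P7 has no Person row in the excerpt, so the pair (P3, P7) cannot satisfy the rule.

```python
# Person and Married relations copied from the slide.
# Layout (id -> (last, first, age, income)) is an assumption for illustration.
PERSON = {
    "P1": ("Doe", "John", 30, 80000),
    "P2": ("Doe", "Sally", 29, 90000),
    "P3": ("Smith", "Robert", 35, 100000),
}
MARRIED = [("P1", "P2"), ("P3", "P7")]


def rich_couples(person, married, threshold=150000):
    """Return the (X, Y) pairs that satisfy the RichCouple rule."""
    result = []
    for x, y in married:
        # Person(X, ...) and Person(Y, ...) must both hold.
        if x in person and y in person:
            inc_x = person[x][3]
            inc_y = person[y][3]
            if inc_x + inc_y > threshold:
                result.append((x, y))
    return result


print(rich_couples(PERSON, MARRIED))  # prints [('P1', 'P2')]
```

The combined income of P1 and P2 is 170000, which exceeds the 150000 bound in the rule.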

Multi-Relational Data Mining
  Approaches
    Transform to non-relational problem
    First-order logic based
      Inductive Logic Programming (ILP)
    Graph based

Graph-based Data Mining
  Finding all subgraphs g within a set of graph transactions G such that
    \[ \frac{\mathit{freq}(g)}{|G|} > t \]
  where t is the minimum support
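
The support test can be sketched in Python as below. This is a brute-force illustration, not how AGM, FSG or gSpan work: it tries every assignment of pattern vertices to graph vertices, which is only feasible for tiny graphs. The tuple representation `(vertices, edges)`, the three toy transactions and all function names are assumptions made for the example. Containment here means a label-preserving, non-induced subgraph match.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if pattern matches some subgraph of graph (exhaustive search)."""
    g_vertices, g_edges = graph
    p_vertices, p_edges = pattern
    p_ids = list(p_vertices)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_vertices, len(p_ids)):
        mapping = dict(zip(p_ids, image))
        labels_ok = all(p_vertices[v] == g_vertices[mapping[v]] for v in p_ids)
        edges_ok = all((mapping[s], mapping[d], lab) in g_edges
                       for s, d, lab in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """freq(g) / |G|: fraction of transactions that contain the pattern."""
    freq = sum(contains(g, pattern) for g in transactions)
    return freq / len(transactions)


def is_frequent(pattern, transactions, t):
    return support(pattern, transactions) > t


# Toy transactions: vertex id -> label, and a set of (src, dst, label) edges.
TRANSACTIONS = [
    ({1: "triangle", 2: "square"}, {(1, 2, "on")}),
    ({1: "triangle", 2: "square", 3: "circle"}, {(1, 2, "on"), (3, 1, "on")}),
    ({1: "circle", 2: "square"}, {(1, 2, "on")}),
]
PATTERN = ({"a": "triangle", "b": "square"}, {("a", "b", "on")})

print(support(PATTERN, TRANSACTIONS))
print(is_frequent(PATTERN, TRANSACTIONS, 0.5))
```

The pattern occurs in two of the three transactions, so it passes a minimum support of 0.5 but not one of 0.7.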

Graph-based Data Mining
  Systems
    Apriori-based Graph Mining (AGM)
      Inokuchi, Washio and Motoda, 2003
    Frequent Sub-Graph discovery (FSG)
      Kuramochi and Karypis, 2001
    Graph-based Substructure pattern mining (gSpan)
      Yan and Han, 2002
  Focus on pruning and fast, code-based graph matching

Graph-based Relational Learning
  Finding patterns in graph(s)
    Discovery
    Clustering
    Supervised learning

[Figure: the Person and Married relations drawn as a graph. Three Person vertices, each with Last, First, Age and Income edges to its values (Doe, John, 30, 80000; Doe, Sally, 29, 90000; Smith, Robert, 35, 100000), plus two Married edges.]

Graph-based Relational Learning
  Graph-Based Induction (GBI)
    Yoshida, Motoda and Indurkhya, 1994
  SUBstructure Discovery Using Examples (SUBDUE)
    Cook and Holder, 1994
  Focus on efficient subgraph generation and compression-based heuristic search

SUBDUE Graph-based Discovery
  Graph representation
  Graph compression and MDL
  Discovery algorithm
  Inexact graph match
  Background knowledge
  Parallel/distributed discovery

Graph Representation
  Input is a labeled (vertices and edges) directed graph
  A substructure is a connected subgraph
  An instance of a substructure is an isomorphic subgraph of the input graph
  Input graph compressed by replacing instances with vertex representing substructure

[Figure: three panels. Input Database: rectangle R1, circle C1, and triangles T1 to T4 on squares S1 to S4. Substructure S1 (graph form): two object vertices joined by an on edge, with shape edges to triangle and square. Compressed Database: R1, C1 and four S1 vertices.]

Graph Compression and MDL
  Minimum Description Length (MDL) principle
    Best theory minimizes description length of theory and the data given theory
    Best substructure S minimizes description length of substructure definition DL(S) and compressed graph DL(G|S)
    \[ \min_{S} \left( DL(S) + DL(G|S) \right) \]
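
The effect of compression on the quantity DL(S) + DL(G|S) can be illustrated with a rough Python sketch. It makes two loud assumptions: description length is approximated by a simple count of vertices plus edges rather than a bit-level encoding, and the instances of the substructure are given rather than discovered. The graph is the shapes example used in this deck (objects 1 to 10 with shape vertices 11 to 20); `size` and `compress` are hypothetical helper names.

```python
def size(vertices, edges):
    """Crude stand-in for description length: vertex count plus edge count."""
    return len(vertices) + len(edges)


def compress(vertices, edges, instances, sub_label):
    """Replace each instance (a set of vertex ids) with one vertex labeled sub_label.

    Assumes every edge between two vertices of the same instance belongs to it.
    """
    owner = {}
    new_vertices = {}
    for i, instance in enumerate(instances, start=1):
        new_id = f"{sub_label}_{i}"
        new_vertices[new_id] = sub_label
        for v in instance:
            owner[v] = new_id
    for v, label in vertices.items():
        if v not in owner:
            new_vertices[v] = label
    new_edges = []
    for src, dst, label in edges:
        if src in owner and owner.get(dst) == owner[src]:
            continue  # edge absorbed into the substructure vertex
        new_edges.append((owner.get(src, src), owner.get(dst, dst), label))
    return new_vertices, new_edges


# Shapes graph: objects 1-10, shape vertices 11-20.
shapes = ["triangle"] * 4 + ["square"] * 4 + ["circle", "rectangle"]
vertices = {i: "object" for i in range(1, 11)}
vertices.update({i + 11: s for i, s in enumerate(shapes)})
edges = [(i, i + 10, "shape") for i in range(1, 11)]
edges += [(1, 5, "on"), (2, 6, "on"), (3, 7, "on"), (4, 8, "on"), (5, 10, "on"),
          (9, 10, "on"), (10, 2, "on"), (10, 3, "on"), (10, 4, "on")]

# Substructure S1: a triangle object on a square object (4 vertices, 3 edges).
sub_vertices = {"o1": "object", "o2": "object", "t": "triangle", "s": "square"}
sub_edges = [("o1", "t", "shape"), ("o2", "s", "shape"), ("o1", "o2", "on")]
instances = [{t, t + 4, t + 10, t + 14} for t in range(1, 5)]

cv, ce = compress(vertices, edges, instances, "S1")
before = size(vertices, edges)                        # 20 + 19 = 39
after = size(sub_vertices, sub_edges) + size(cv, ce)  # 7 + 15 = 22
print(before, after)  # prints 39 22
```

Under this size proxy, describing S1 once and the compressed graph costs 22 units against 39 for the original graph, which is why S1 scores well.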

Discovery Algorithm
  1. Create substructure for each unique vertex label

[Figure: input graph of four triangle-on-square pairs, a circle and a rectangle, connected by on edges. Substructures: triangle (4), square (4), circle (1), rectangle (1).]

Discovery Algorithm
  2. Expand best substructures by an edge or edge + neighboring vertex

[Figure: the same input graph. Substructures after expansion: triangle on square, square on rectangle, circle on rectangle, rectangle on triangle.]

Discovery Algorithm
  3. Keep only best beam-width substructures on queue
  4. Terminate when queue is empty or #discovered substructures > limit
  5. Compress graph and repeat to generate hierarchical description
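
Steps 1 to 3 can be sketched for a single round of expansion as follows. This is a heavily simplified illustration, not SUBDUE's implementation: substructures are ranked by raw instance count instead of the MDL value, only one expansion level is performed, and vertices carry their shape label directly as in the slide figures. The function names are hypothetical.

```python
from collections import Counter

# Simplified shapes graph as drawn on the slides: vertex id -> shape label.
vertices = {1: "triangle", 2: "triangle", 3: "triangle", 4: "triangle",
            5: "square", 6: "square", 7: "square", 8: "square",
            9: "circle", 10: "rectangle"}
edges = [(1, 5, "on"), (2, 6, "on"), (3, 7, "on"), (4, 8, "on"), (5, 10, "on"),
         (9, 10, "on"), (10, 2, "on"), (10, 3, "on"), (10, 4, "on")]


def seed_substructures(vertices):
    """Step 1: one single-vertex substructure per unique label, with counts."""
    return Counter(vertices.values())


def expand_by_edge(vertices, edges, seeds):
    """Step 2 (one level only): extend seeds by an edge and its neighboring vertex."""
    expanded = Counter()
    for src, dst, label in edges:
        if vertices[src] in seeds or vertices[dst] in seeds:
            expanded[(vertices[src], label, vertices[dst])] += 1
    return expanded


def keep_best(candidates, beam_width):
    """Step 3: keep the beam_width candidates with the most instances."""
    return candidates.most_common(beam_width)


seeds = seed_substructures(vertices)
queue = keep_best(expand_by_edge(vertices, edges, seeds), beam_width=2)
print(seeds)
print(queue)
```

With a beam width of 2 the queue holds triangle on square (4 instances) and rectangle on triangle (3 instances); the two single-instance candidates are dropped.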

Sample SUBDUE Input
sample.g:
  v 1 object
  v 2 object
  v 3 object
  v 4 object
  v 5 object
  v 6 object
  v 7 object
  v 8 object
  v 9 object
  v 10 object
  v 11 triangle
  v 12 triangle
  v 13 triangle
  v 14 triangle
  v 15 square
  v 16 square
  v 17 square
  v 18 square
  v 19 circle
  v 20 rectangle
  e 1 11 shape
  e 2 12 shape
  e 3 13 shape
  e 4 14 shape
  e 5 15 shape
  e 6 16 shape
  e 7 17 shape
  e 8 18 shape
  e 9 19 shape
  e 10 20 shape
  e 1 5 on
  e 2 6 on
  e 3 7 on
  e 4 8 on
  e 5 10 on
  e 9 10 on
  e 10 2 on
  e 10 3 on
  e 10 4 on

[Figure: the input database that sample.g encodes: R1, C1, and T1 to T4 on S1 to S4.]
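
Reading this format is straightforward. The sketch below parses only the two line forms that appear in sample.g, `v <id> <label>` and `e <source> <target> <label>`; whether the real tool accepts other line types is not covered here, and `parse_g` is a hypothetical helper name. The embedded text is one triangle-on-square fragment of the file.

```python
# One triangle-on-square pair taken from sample.g.
FRAGMENT = """\
v 1 object
v 5 object
v 11 triangle
v 15 square
e 1 11 shape
e 5 15 shape
e 1 5 on
"""


def parse_g(text):
    """Parse 'v id label' and 'e src dst label' lines; ignore anything else."""
    vertices = {}
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "e":
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges


vertices, edges = parse_g(FRAGMENT)
print(vertices)
print(edges)
```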

Inexact Graph Match
  Some variations may occur between instances
  Want to abstract over minor differences
  Difference = cost of transforming one graph to make it isomorphic to another
  Match if cost/size < threshold
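
Inexact Graph Match

[Figure: two small labeled graphs, one with vertices 1 (A) and 2 (B), the other with vertices 3 (B), 4 (A) and 5 (B), with edges labeled a and b, followed by a search tree over partial vertex mappings. λ denotes mapping a vertex to nothing.]

  Cost after mapping vertex 1, then after also mapping vertex 2:
    (1,3) 1:   (2,4) 7,  (2,5) 6,   (2,λ) 10
    (1,4) 0:   (2,3) 3,  (2,5) 6,   (2,λ) 9
    (1,5) 1:   (2,3) 7,  (2,4) 7,   (2,λ) 10
    (1,λ) 1:   (2,3) 9,  (2,4) 10,  (2,5) 9,  (2,λ) 11

  Least-cost match is {(1,4), (2,3)}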
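
The search for a least-cost mapping can be sketched in Python as below. The sketch is exhaustive rather than polynomially constrained, and it uses its own unit-cost model (one unit per vertex or edge substitution, deletion or insertion), so it does not reproduce the costs in the figure. The toy graphs, the use of `None` for λ, and the choice of the larger graph's vertex-plus-edge count as "size" are all assumptions.

```python
from itertools import product


def mapping_cost(g1, g2, mapping):
    """Unit-cost difference between g1 and g2 under a vertex mapping."""
    v1, e1 = g1
    v2, e2 = g2
    cost = 0
    for v, w in mapping.items():
        if w is None or v1[v] != v2[w]:
            cost += 1  # delete the vertex or substitute its label
    cost += len(set(v2) - set(mapping.values()))  # g2 vertices left unmatched
    matched = set()
    for (s, d), label in e1.items():
        image = (mapping[s], mapping[d])
        if image in e2:
            matched.add(image)
            cost += 1 if label != e2[image] else 0  # edge label substitution
        else:
            cost += 1  # edge has no counterpart in g2
    cost += len(e2) - len(matched)  # g2 edges left unmatched
    return cost


def least_cost_match(g1, g2):
    """Try every injective mapping of g1 vertices to g2 vertices or None."""
    ids1 = list(g1[0])
    targets = list(g2[0]) + [None]
    best = None
    for choice in product(targets, repeat=len(ids1)):
        used = [w for w in choice if w is not None]
        if len(used) != len(set(used)):
            continue  # two g1 vertices mapped to the same g2 vertex
        mapping = dict(zip(ids1, choice))
        cost = mapping_cost(g1, g2, mapping)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


def matches(g1, g2, threshold):
    cost, _ = least_cost_match(g1, g2)
    # Assumption: size is vertices plus edges of the larger graph.
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return cost / size < threshold


# Toy graphs: vertex id -> label, and (src, dst) -> edge label.
G1 = ({1: "A", 2: "B"}, {(1, 2): "a"})
G2 = ({3: "B", 4: "A", 5: "B"}, {(4, 3): "a", (4, 5): "b"})

print(least_cost_match(G1, G2))
print(matches(G1, G2, 0.5))
```

Here the best mapping sends 1 to 4 and 2 to 3 at cost 2 (one extra vertex and one extra edge in the second graph), giving cost/size = 2/5, which is a match at threshold 0.5 but not at 0.3.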

Inexact Graph Match
  Vertices considered by degree
  Polynomially constrained
    Greedy after n^k partial mappings considered
    Suboptimal mappings rare for k>2

Background Knowledge
  User-defined substructures
  Two alternative uses
    Prime search queue
    Initial graph compression
  Variant of discovery algorithm used to generate instances

Parallel/Distributed Discovery
  Divide graph into P partitions
  Distribute to P processors
  Each processor performs serial discovery on local partition
  Broadcast best substructures, evaluate on other processors
  Master processor stores best global substructures
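
The broadcast-and-evaluate idea can be simulated sequentially, as in the sketch below. It is only an illustration under strong assumptions: each partition is a list of (source label, edge label, target label) triples, "discovery" is reduced to counting those triples, and instances that would span a partition boundary are ignored. No real message passing is involved, and the function names are hypothetical.

```python
from collections import Counter


def local_counts(partition):
    """Serial 'discovery' on one partition: count each labeled-edge triple."""
    return Counter(partition)


def distributed_best(partitions, top_k=1):
    per_processor = [local_counts(p) for p in partitions]
    # Each processor broadcasts its top_k local substructures.
    broadcast = set()
    for counts in per_processor:
        broadcast.update(sub for sub, _ in counts.most_common(top_k))
    # Every processor evaluates every broadcast substructure; the master sums.
    global_value = {sub: sum(c[sub] for c in per_processor) for sub in broadcast}
    return sorted(global_value.items(), key=lambda kv: (-kv[1], kv[0]))


TOS = ("triangle", "on", "square")
ROT = ("rectangle", "on", "triangle")
COR = ("circle", "on", "rectangle")
PARTITIONS = [
    [TOS, TOS, COR],
    [TOS, TOS, ROT, ROT, ROT],
]

print(distributed_best(PARTITIONS))
```

The second partition's local best is rectangle on triangle (3 instances), but once both candidates are evaluated everywhere, triangle on square wins globally with 4 instances. This is the reason for the evaluation step on other processors.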

Graph-based Clustering
  Hierarchical, conceptual clustering
  Previous work defined classification trees
    Inadequate in relational domains
  Better hierarchical description: classification lattice
    A cluster can have more than one parent
    A parent can be at any level (not only one level above)
  Use iterative graph-based discovery

Clustering: DNA

[Figure: classification lattice rooted at DNA, whose clusters are chemical fragments of the molecule (including O=P-OH, C-N, C-C and CH2 groups). Coverage: 61%, 68%, 71%.]

Learning Graph Grammars
  Graph grammar production: S → P
    S is a non-terminal
    P is a graph containing terminals and/or non-terminals
    S → P1 | P2 | … | Pn
  Recursive production: S → P S | P
    P linked to S via a single edge
    Algorithm exponential in number of linking edges

Graph Grammar Learning
  SUBDUE Extensions (SubdueGL)
    Each iteration results in a graph grammar production substructure
    Production used to compress graph
      Replace instances of right-hand side with new vertex labeled with non-terminal on left-hand side
    Iterations continue until entire graph compressed to single non-terminal

SubdueGL Example
  Input graph
  Edge labels: ‘t’, ‘s’, ‘next’

[Figure: input graph containing four groups of vertices a and b with one of c, d, e, f; four repeated groups of vertices x, y, z, q; and vertices r and k.]
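
SubdueGL Example
  First production rule

[Figure: recursive production S1, whose right-hand side is the x, y, z, q subgraph, either linked to S1 or alone.]

  Input graph parsed by first production

[Figure: the input graph with the x, y, z, q groups replaced by S1; the a, b groups (with c, d, e, f), r and k remain.]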

SubdueGL Example
  Second and third production rules

[Figure: production S2, whose right-hand side is a, b and S3, either linked to S2 or alone; production S3 → c | d | e | f.]

  Input graph parsed by productions

[Figure: the parsed graph, reduced to S1, S2, r and k.]

Graph-Based Supervised Learning
  Input now a set of positive graphs and a set of negative graphs

[Figure: Input and Hypothesis panels. The hypothesis graph has three object vertices joined by two on edges, with shape edges to triangle and square.]

Graph-Based Supervised Learning
  Solution 1
    Find substructure compressing positive graphs, but not negative graphs
    Compress graphs and iterate until no further compression
  Problem
    Compressing, instead of removing, partially-covered positive graphs leads to overly-specific hypotheses

Graph-Based Supervised Learning
  Solution 2
    Find substructure covering (i.e., subgraph of) positive graphs, but not negative graphs
    Remove covered positive graphs and iterate until all covered
  Substructure value = 1 - Error
    \[ \mathit{Error} = \frac{\#\mathit{PosEgsNotCovered} + \#\mathit{NegEgsCovered}}{\#\mathit{PosEgs} + \#\mathit{NegEgs}} \]
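
The value formula and the remove-and-iterate loop of Solution 2 can be sketched as follows. To keep the example short, subgraph matching is skipped: each candidate substructure is represented only by the set of example ids it covers, which is an assumption for illustration, as are the candidate names and the helper `set_cover`.

```python
def substructure_value(pos_covered, neg_covered, num_pos, num_neg):
    """Value = 1 - Error, with Error as defined on the slide."""
    error = ((num_pos - pos_covered) + neg_covered) / (num_pos + num_neg)
    return 1 - error


def set_cover(candidates, positives, negatives):
    """Greedily pick substructures, removing covered positives each round.

    candidates maps a substructure name to the set of example ids it covers.
    """
    remaining = set(positives)
    negatives = set(negatives)
    hypothesis = []
    while remaining:
        def value(name):
            covered = candidates[name]
            return substructure_value(len(covered & remaining),
                                      len(covered & negatives),
                                      len(remaining), len(negatives))

        best = max(candidates, key=value)
        newly_covered = candidates[best] & remaining
        if not newly_covered:
            break  # no candidate covers the remaining positives
        hypothesis.append(best)
        remaining -= newly_covered
    return hypothesis, remaining


POSITIVES = {"p1", "p2", "p3", "p4"}
NEGATIVES = {"n1", "n2"}
CANDIDATES = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "n1"},
    "C": {"p4"},
}

print(substructure_value(3, 0, 4, 2))
print(set_cover(CANDIDATES, POSITIVES, NEGATIVES))
```

In the first round A has the lowest error (one uncovered positive out of six examples); after its three positives are removed, C covers the last one, so the hypothesis is A followed by C.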

Supervised Learning: Cancer
  Chemical toxicity
    SUBDUE achieved 62% accuracy classifying carcinogenic vs. non-carcinogenic compounds

[Figure: a compound graph with contains edges to atom vertices (element c, type 22, charge -13), in_group edges to groups such as six_ring and halide10, and test-result edges including ashby_alert, ames, cytogen_ca, drosophila_slrl and chromaberr; the learned substructures shown include a compound with a has_group edge to amine.]

Application Domains
  Biochemical domains
    Protein data
    DNA data
    Toxicology (cancer) data
  Spatial-temporal domains
    Earthquake data
    Aircraft Safety and Reporting System
  Web topology and search
  Social network analysis
  …

[Figure: web graph fragment with web_page vertices connected by hyperlink edges, one page labeled home.]

Summary
  Multi-relational data mining and learning
  Graph-based relational learning
    Discovery
    Clustering
    Graph grammar learning
    Supervised learning

Future Directions
  Efficient graph-based learning from incremental streaming data
  Supervised graphs
    All examples in one, connected graph
  Graph-based anomaly detection
  Improved scalability
    Graph and subgraph isomorphism