
Graph-based Learning

Larry Holder
Computer Science and Engineering
University of Texas at Arlington

Graph-based Learning

Multi-relational data mining and learning
SUBDUE graph-based relational learner
  Discovery
  Clustering
  Graph grammar learning
  Supervised learning

Multi-Relational Data Mining

Looking for patterns involving multiple tables (relations) in a relational database

Person
ID   Last   First   Age  Income
P1   Doe    John    30   80000
P2   Doe    Sally   29   90000
P3   Smith  Robert  35   100000

Married
Person1  Person2
P1       P2
P3       P7

RichCouple(X,Y) :- Person(X,LastX,FirstX,AgeX,IncX) & Person(Y,LastY,FirstY,AgeY,IncY) & Married(X,Y) & (IncX + IncY) > 150000.
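
As a rough illustration of what the RichCouple rule computes, the sketch below evaluates it over the two tables with a plain join in Python. The dictionary layout and the function name `rich_couples` are assumptions made for this example, not part of any ILP system.

```python
# Rows copied from the Person and Married tables on the slide.
person = {
    "P1": ("Doe", "John", 30, 80000),
    "P2": ("Doe", "Sally", 29, 90000),
    "P3": ("Smith", "Robert", 35, 100000),
}
married = [("P1", "P2"), ("P3", "P7")]


def rich_couples(person, married, threshold=150000):
    """Return the (X, Y) pairs that satisfy the RichCouple rule."""
    result = []
    for x, y in married:
        # Person(X, ...) and Person(Y, ...) must both hold, so skip
        # pairs that mention an ID missing from the Person table.
        if x not in person or y not in person:
            continue
        inc_x = person[x][3]
        inc_y = person[y][3]
        if inc_x + inc_y > threshold:
            result.append((x, y))
    return result


couples = rich_couples(person, married)
print(couples)
```

Only the P1/P2 pair qualifies here: their incomes sum to 170000, while P7 has no row in the Person table shown.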

Multi-Relational Data Mining

Approaches
  Transform to non-relational problem
  First-order logic based
    Inductive Logic Programming (ILP)
  Graph based

Graph-based Data Mining

Finding all subgraphs g within a set of graph transactions G such that

$$\frac{\mathit{freq}(g)}{|G|} > t$$

where t is the minimum support
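
The support test can be sketched in a few lines of Python. This is a brute-force illustration, not how AGM, FSG or gSpan work internally. It assumes freq(g) counts the transactions that contain g as a subgraph, and the graph encoding (a label dictionary plus a set of labeled edge triples) and the toy transactions are invented for the example.

```python
from itertools import permutations


def contains(graph, pattern):
    """Brute-force test: does `pattern` occur as a subgraph of `graph`?

    Both are (labels, edges) pairs: labels maps vertex id -> label and
    edges is a set of (source, target, edge_label) triples.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_ids = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_ids)):
        m = dict(zip(p_ids, image))
        if any(p_labels[v] != g_labels[m[v]] for v in p_ids):
            continue
        if all((m[s], m[t], lab) in g_edges for s, t, lab in p_edges):
            return True
    return False


def support(pattern, transactions):
    """freq(g) / |G|: fraction of graph transactions containing g."""
    freq = sum(contains(g, pattern) for g in transactions)
    return freq / len(transactions)


# Three toy transactions (made up for illustration).
t1 = ({1: "A", 2: "B"}, {(1, 2, "x")})
t2 = ({1: "A", 2: "B", 3: "C"}, {(1, 2, "x"), (2, 3, "y")})
t3 = ({1: "A", 2: "C"}, {(1, 2, "x")})
g = ({"u": "A", "v": "B"}, {("u", "v", "x")})

s = support(g, [t1, t2, t3])
t = 0.5  # minimum support
print(s, s > t)
```

Enumerating all vertex assignments is exponential in the pattern size, which is why the systems listed on the next slide concentrate on pruning and fast matching.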

Graph-based Data Mining

Systems
  Apriori-based Graph Mining (AGM)
    Inokuchi, Washio and Motoda, 2003
  Frequent Sub-Graph discovery (FSG)
    Kuramochi and Karypis, 2001
  Graph-based Substructure pattern mining (gSpan)
    Yan and Han, 2002
Focus on pruning and fast, code-based graph matching

Graph-based Relational Learning

Finding patterns in graph(s)
  Discovery
  Clustering
  Supervised learning

[Figure: three Person vertices, each with Last, First, Age and Income edges (Doe, John, 30, 80000; Doe, Sally, 29, 90000; Smith, Robert, 35, 100000), plus two Married edges]

Graph-based Relational Learning

Graph-Based Induction (GBI)
  Yoshida, Motoda and Indurkhya, 1994
SUBstructure Discovery Using Examples (SUBDUE)
  Cook and Holder, 1994
Focus on efficient subgraph generation and compression-based heuristic search

SUBDUE Graph-based Discovery

Graph representation
Graph compression and MDL
Discovery algorithm
Inexact graph match
Background knowledge
Parallel/distributed discovery

Graph Representation

Input is a labeled (vertices and edges) directed graph
A substructure is a connected subgraph
An instance of a substructure is an isomorphic subgraph of the input graph
Input graph compressed by replacing instances with vertex representing substructure

[Figure: Input Database (rectangle R1, circle C1, and four triangle/square pairs T1/S1 to T4/S4); Substructure S1 in graph form (an object with shape triangle "on" an object with shape square); Compressed Database (R1, C1 and four S1 vertices)]

Graph Representation

[Figure: compressed graph whose vertices are labeled S1 and S2]

Graph Compression and MDL

Minimum Description Length (MDL) principle
  Best theory minimizes description length of theory and the data given theory
  Best substructure S minimizes description length of substructure definition DL(S) and compressed graph DL(G|S)

$$\min_S \left( DL(S) + DL(G|S) \right)$$
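
To make the trade-off concrete, the sketch below compresses the shapes graph from the slides with the triangle-on-square substructure S1 and compares sizes before and after. It rests on a loud simplification: description length is approximated by counting vertices and edges, whereas a real MDL measure counts bits of an encoding. The helper names (`sample_graph`, `compress`, `size`) are hypothetical.

```python
def sample_graph():
    """Shapes graph: 4 triangle-on-square pairs, a circle and a rectangle."""
    shapes = ["triangle"] * 4 + ["square"] * 4 + ["circle", "rectangle"]
    vertices = {i: "object" for i in range(1, 11)}
    edges = []
    for i, shape in enumerate(shapes, start=1):
        vertices[i + 10] = shape
        edges.append((i, i + 10, "shape"))
    for src, dst in [(1, 5), (2, 6), (3, 7), (4, 8), (5, 10),
                     (9, 10), (10, 2), (10, 3), (10, 4)]:
        edges.append((src, dst, "on"))
    return vertices, edges


def compress(vertices, edges, instances, name):
    """Replace each instance (a set of vertex ids) with one vertex `name`."""
    owner = {v: f"{name}#{k}" for k, inst in enumerate(instances) for v in inst}
    new_vertices = {v: lab for v, lab in vertices.items() if v not in owner}
    new_vertices.update({new_id: name for new_id in set(owner.values())})
    new_edges = []
    for src, dst, lab in edges:
        s, d = owner.get(src, src), owner.get(dst, dst)
        if s == d and src in owner:
            continue  # edge lies inside one instance; the substructure absorbs it
        new_edges.append((s, d, lab))
    return new_vertices, new_edges


def size(vertices, edges):
    """Stand-in for description length: vertex count plus edge count."""
    return len(vertices) + len(edges)


vertices, edges = sample_graph()
# S1: object -shape-> triangle, object -on-> object, object -shape-> square,
# i.e. 4 vertices and 3 edges.
s1_size = 4 + 3
instances = [{i, i + 10, i + 4, i + 14} for i in range(1, 5)]
cv, ce = compress(vertices, edges, instances, "S1")

before = size(vertices, edges)      # size of G with no substructure
after = s1_size + size(cv, ce)      # proxy for DL(S) + DL(G|S)
print(before, after)
```

Under this proxy the uncompressed graph costs 39 units, while S1 plus the compressed graph costs 22, so S1 would be preferred over leaving the graph as is.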

Discovery Algorithm

1. Create substructure for each unique vertex label

Substructures: triangle (4), square (4), circle (1), rectangle (1)

[Figure: input graph with a circle, a rectangle and four triangle-on-square pairs joined by "on" edges]

Discovery Algorithm

2. Expand best substructures by an edge or edge+neighboring vertex

[Figure: candidate substructures obtained by expanding along "on" edges (for example, triangle on square), shown beside the input graph]

Discovery Algorithm

3. Keep only best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures > limit
5. Compress graph and repeat to generate hierarchical description
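
The control flow of steps 1 to 4 is an ordinary beam search, sketched below. To keep the example short and runnable it uses a toy stand-in domain rather than graphs: patterns are substrings of a text, expansion appends the next character, and the value is the number of characters saved by reusing the pattern. All of that, along with the names `beam_search`, `expand` and `value`, is an assumption for illustration; SUBDUE itself expands subgraphs and scores them by compression. Step 5 (compress and repeat) is left out.

```python
def beam_search(initial, expand, value, beam_width, limit):
    """Beam search over candidate patterns, following steps 1 to 4."""
    # Step 1: start from the initial single-element patterns.
    queue = sorted(initial, key=value, reverse=True)[:beam_width]
    discovered = []
    # Step 4: stop when the queue is empty or the limit is exceeded.
    while queue and len(discovered) <= limit:
        candidates = []
        for pattern in queue:
            discovered.append(pattern)
            candidates.extend(expand(pattern))        # step 2
        candidates = list(dict.fromkeys(candidates))  # drop duplicates
        # Step 3: keep only the best beam_width candidates on the queue.
        queue = sorted(candidates, key=value, reverse=True)[:beam_width]
    return sorted(discovered, key=value, reverse=True)


# Toy stand-in domain (an assumption, not SUBDUE's graphs).
TEXT = "abcabcabcxyz"


def expand(pattern):
    """Extend the pattern by the character that follows each occurrence."""
    out = set()
    start = TEXT.find(pattern)
    while start != -1:
        end = start + len(pattern)
        if end < len(TEXT):
            out.add(pattern + TEXT[end])
        start = TEXT.find(pattern, start + 1)
    return sorted(out)


def value(pattern):
    # Characters saved if every non-overlapping occurrence after the first
    # were replaced by a one-character symbol.
    return (TEXT.count(pattern) - 1) * (len(pattern) - 1)


best = beam_search(sorted(set(TEXT)), expand, value, beam_width=3, limit=50)
print(best[0], value(best[0]))
```

With a beam width of 3 the search settles on "abc", the repeated unit of the text. A narrower beam makes the search cheaper but can discard a candidate whose expansions would have scored well later.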

DNA Example

Sample SUBDUE Input

sample.g:

v 1 object
v 2 object
v 3 object
v 4 object
v 5 object
v 6 object
v 7 object
v 8 object
v 9 object
v 10 object
v 11 triangle
v 12 triangle
v 13 triangle
v 14 triangle
v 15 square
v 16 square
v 17 square
v 18 square
v 19 circle
v 20 rectangle
e 1 11 shape
e 2 12 shape
e 3 13 shape
e 4 14 shape
e 5 15 shape
e 6 16 shape
e 7 17 shape
e 8 18 shape
e 9 19 shape
e 10 20 shape
e 1 5 on
e 2 6 on
e 3 7 on
e 4 8 on
e 5 10 on
e 9 10 on
e 10 2 on
e 10 3 on
e 10 4 on

[Figure: the corresponding graph with rectangle R1, circle C1, and triangle/square pairs T1/S1 to T4/S4]
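
A reader for this format takes only a few lines. The sketch below is inferred solely from the listing on the slide, where each `v` line gives a vertex id and label and each `e` line gives two vertex ids and a label; it is not SUBDUE's own parser. Treating the first id on an `e` line as the source of a directed edge is an assumption, and only an excerpt of sample.g is embedded.

```python
from collections import Counter

# Excerpt of sample.g: one triangle-on-square pair.
SAMPLE_G = """
v 1 object
v 11 triangle
v 5 object
v 15 square
e 1 11 shape
e 5 15 shape
e 1 5 on
"""


def parse_graph(text):
    """Parse 'v <id> <label>' and 'e <src> <dst> <label>' lines."""
    vertices, edges = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "v":
            vertices[int(fields[1])] = fields[2]
        elif fields[0] == "e":
            edges.append((int(fields[1]), int(fields[2]), fields[3]))
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return vertices, edges


vertices, edges = parse_graph(SAMPLE_G)
# Step 1 of the discovery algorithm starts from the unique vertex labels.
label_counts = Counter(vertices.values())
print(label_counts)
```

Counting labels this way gives the starting substructures for the search, one per distinct vertex label.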

Inexact Graph Match

Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph to make it isomorphic to another
Match if cost/size < threshold
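
A minimal sketch of the cost/size test follows. It assumes every vertex or edge insertion, deletion and relabeling costs 1, and it takes size to be the vertex-plus-edge count of the larger graph; the slides do not specify either choice. It also searches all mappings exhaustively, so it only suits tiny graphs. The two example graphs are made up, so their costs are not those of the slide's example.

```python
from itertools import permutations


def match_cost(g1, g2):
    """Least unit-cost transformation of g1 into a graph isomorphic to g2.

    Graphs are (labels, edges): labels maps vertex -> label and edges maps
    (source, target) -> edge label. Returns (cost, mapping).
    """
    (l1, e1), (l2, e2) = g1, g2
    ids1 = list(l1)
    # None stands for "delete this g1 vertex".
    targets = list(l2) + [None] * len(ids1)
    best = (float("inf"), None)
    for image in permutations(targets, len(ids1)):
        m = dict(zip(ids1, image))
        used = {t for t in image if t is not None}
        cost = len(l2) - len(used)               # g2 vertices to insert
        for v, t in m.items():
            if t is None or l1[v] != l2[t]:
                cost += 1                        # delete or relabel a vertex
        covered = set()
        for (s, d), lab in e1.items():
            image_edge = (m[s], m[d])
            if image_edge in e2:
                covered.add(image_edge)
                cost += lab != e2[image_edge]    # relabel an edge if needed
            else:
                cost += 1                        # delete an edge
        cost += len(e2) - len(covered)           # g2 edges to insert
        if cost < best[0]:
            best = (cost, m)
    return best


def inexact_match(g1, g2, threshold):
    """Match if cost/size < threshold (size of the larger graph, assumed)."""
    cost, _ = match_cost(g1, g2)
    size = max(len(g[0]) + len(g[1]) for g in (g1, g2))
    return cost / size < threshold


g1 = ({1: "A", 2: "B"}, {(1, 2): "a"})
g2 = ({3: "B", 4: "A", 5: "B"}, {(4, 3): "a", (4, 5): "b"})
cost, mapping = match_cost(g1, g2)
print(cost, mapping, inexact_match(g1, g2, threshold=0.5))
```

Here the cheapest mapping sends vertex 1 to 4 and vertex 2 to 3, leaving one vertex and one edge of g2 to insert, for a cost of 2 against a size of 5.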

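Inexact Graph Match

[Figure: two small labeled graphs, one with vertices 1 and 2 and one with vertices 3, 4 and 5, with vertex labels A and B and edge labels a and b]

Search tree of partial vertex mappings with their costs (λ leaves the vertex unmapped):

(1,3) cost 1: (2,4) 7, (2,5) 6, (2,λ) 10
(1,4) cost 0: (2,3) 3, (2,5) 6, (2,λ) 9
(1,5) cost 1: (2,3) 7, (2,4) 7, (2,λ) 10
(1,λ) cost 1: (2,3) 9, (2,4) 10, (2,5) 9, (2,λ) 11

Least-cost match is {(1,4), (2,3)}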

Inexact Graph Match

Vertices considered by degree
Polynomially constrained
  Greedy after n^k partial mappings considered
  Suboptimal mappings rare for k>2

Background Knowledge

User-defined substructures
Two alternative uses
  Prime search queue
  Initial graph compression
Variant of discovery algorithm used to generate instances

Parallel/Distributed Discovery

Divide graph into P partitions
Distribute to P processors
Each processor performs serial discovery on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures

Graph-based Clustering

Hierarchical, conceptual clustering
Previous work defined classification trees
  Inadequate in relational domains
Better hierarchical description: classification lattice
  A cluster can have more than one parent
  A parent can be at any level (not only one level above)
Use iterative graph-based discovery

Clustering: DNA

Coverage: 61%, 68%, 71%

[Figure: classification lattice rooted at DNA, with clusters drawn as molecular fragments such as C-N and C-C bonds, a CH2 group and phosphate groups (O == P, OH)]

Learning Graph Grammars

Graph grammar production: S → P
  S is a non-terminal
  P is a graph containing terminals and/or non-terminals
  S → P1 | P2 | … | Pn
Recursive production: S → P S | P
  P linked to S via a single edge
  Algorithm exponential in number of linking edges

Example Graph Grammar

[Figure: a recursive production for S2 over a graph of vertices a, b and non-terminal S3, and a production for S3 with alternatives c, d, e and f]

Graph Grammar Learning

SUBDUE Extensions (SubdueGL)
Each iteration results in a graph grammar production substructure
Production used to compress graph
  Replace instances of right-hand side with new vertex labeled with non-terminal on left-hand side
Iterations continue until entire graph compressed to single non-terminal

SubdueGL Example

Input graph
Edge labels: ‘t’, ‘s’, ‘next’

[Figure: input graph containing four structures of vertices a and b with a third vertex c, d, e or f, four structures of vertices x, y, z and q, and vertices r and k]

SubdueGL Example

First production rule

[Figure: production S1 built from vertices x, y, z and q, with a recursive reference to S1]

Input graph parsed by first production

[Figure: the four a/b structures (with c, d, e, f) and vertices r and k, with the x/y/z/q structures replaced by S1 vertices]

SubdueGL Example

Second and third production rules

[Figure: production S2 built from vertices a and b and non-terminal S3, with a recursive reference to S2; production S3 with alternatives c, d, e and f]

Input graph parsed by productions

[Figure: parsed graph containing S2, r, k and S1 vertices]

Graph-Based Supervised Learning

Input now a set of positive graphs and a set of negative graphs

[Figure: Input example graphs and the learned Hypothesis, drawn with object vertices, "on" edges, and shape edges to triangle and square]

Graph-Based Supervised Learning

Solution 1
  Find substructure compressing positive graphs, but not negative graphs
  Compress graphs and iterate until no further compression
Problem
  Compressing, instead of removing, partially-covered positive graphs leads to overly-specific hypotheses

Graph-Based Supervised Learning

Solution 2
  Find substructure covering (i.e., subgraph of) positive graphs, but not negative graphs
  Remove covered positive graphs and iterate until all covered
Substructure value = 1 - Error

$$Error = \frac{\#PosEgsNotCovered + \#NegEgsCovered}{\#PosEgs + \#NegEgs}$$
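
The value computation is simple enough to check by hand. The sketch below follows the Error formula directly; the coverage test is a deliberate simplification in which each example graph is reduced to a set of labeled triples and a substructure covers a graph when its triples are a subset. A faithful version would need a subgraph isomorphism test. The example data and the name `substructure_value` are invented for illustration.

```python
def substructure_value(covers, positives, negatives):
    """Value = 1 - Error, with Error as defined on the slide."""
    pos_not_covered = sum(not covers(g) for g in positives)
    neg_covered = sum(bool(covers(g)) for g in negatives)
    error = (pos_not_covered + neg_covered) / (len(positives) + len(negatives))
    return 1 - error


# Simplification for illustration: each example graph is a set of
# (source_label, edge_label, target_label) triples, and a substructure
# "covers" a graph when all of its triples appear in that graph.
positives = [
    {("object", "shape", "triangle"), ("object", "on", "object")},
    {("object", "shape", "triangle"), ("object", "shape", "square")},
    {("object", "shape", "circle")},
]
negatives = [
    {("object", "shape", "square")},
    {("object", "shape", "triangle")},
]
hypothesis = {("object", "shape", "triangle")}

v = substructure_value(lambda g: hypothesis <= g, positives, negatives)
print(v)
```

In this toy data the hypothesis misses one of three positives and covers one of two negatives, so Error = 2/5 and the value is 0.6.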

Supervised Learning: Cancer

Chemical toxicity
SUBDUE achieved 62% accuracy classifying carcinogenic vs. non-carcinogenic compounds

[Figure: example substructures over compound graphs, with vertex and edge labels including compound, atom, element, type, charge, contains, six_ring, in_group, halide10, ashby_alert, ames, cytogen_ca, drosophila_slrl, chromaberr, amine, has_group, positive and p]

Application Domains

Biochemical domains
  Protein data
  DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Web topology and search
Social network analysis
…

[Figure: web_page vertices connected by hyperlink edges, with one vertex labeled home]

Summary

Multi-relational data mining and learning
Graph-based relational learning
  Discovery
  Clustering
  Graph grammar learning
  Supervised learning

Future Directions

Efficient graph-based learning from incremental streaming data
Supervised graphs
  All examples in one, connected graph
Graph-based anomaly detection
Improved scalability
  Graph and subgraph isomorphism

Further Information

Graph-based Data Mining
  http://banzai.uta.edu/gdm
SUBDUE Project
  http://ailab.uta.edu/subdue