Mining Graph Patterns Efficiently via Randomized Summaries. Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han. VLDB’09


Page 1

Mining Graph Patterns Efficiently via Randomized Summaries

Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han

VLDB’09

Page 2

Outline

Motivation
Preliminaries
SUMMARIZE-MINE FRAMEWORK
Bounding the False Negative Rate
Experiments
Conclusion

Page 3

Motivation

Graph pattern mining is heavily needed in many real applications, such as bioinformatics, hyperlinked webs and social network analysis.

Unfortunately, because subgraph isomorphism plays a fundamental role in existing methods, they all run into trouble when the cost of enumerating a huge set of isomorphic embeddings blows up, especially in large graphs where many vertices share a few identical labels.

Page 4

Motivation

We consider possible ways to reduce the number of embeddings. In particular, since many embeddings overlap substantially in real applications, we explore the possibility of “merging” these embeddings to significantly reduce the overall cardinality.

Page 5

Preliminaries

Page 6

SUMMARIZE-MINE FRAMEWORK

Summarization: For a raw database with frequency threshold min_sup, we bind vertices with identical labels into a single node and collapse the network correspondingly into a smaller summarized version. This step generalizes our view of the data to a higher level.

Mining: Apply any state-of-the-art frequent subgraph mining algorithm to the summarized database D’ = {S1, S2, …, Sn} with a slightly lowered support threshold min_sup’, which generates the pattern set FP(D’).

Verification: Check the patterns in FP(D’) against the original database D, remove those p ∈ FP(D’) whose support in D is less than min_sup, and collect the remaining patterns into R’.

Iteration: Repeat the summarization, mining and verification steps t times and combine the results from each iteration. Let R’1, R’2, …, R’t be the patterns obtained from the different iterations; the final result is R’ = R’1 ∪ R’2 ∪ … ∪ R’t. This step guarantees that the overall probability of missing any frequent pattern is bounded.

Both false positives and false negatives must be dealt with.

Figure: raw DB and summarized DB.
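The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph encoding, the `groups_per_label` parameter, and the `mine` and `support` callables are hypothetical placeholders for a real frequent subgraph miner (gSpan, for instance) and a real subgraph isomorphism check. The summarization assumes that vertices sharing a label are split at random into groups and that each group is merged into one node.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, List, Set, Tuple

# A labeled graph: vertex id -> label, plus undirected edges as 2-element frozensets.
Graph = Tuple[Dict[int, str], Set[FrozenSet[int]]]


def summarize(graph: Graph, groups_per_label: int, rng: random.Random) -> Graph:
    """Merge same-label vertices into at most `groups_per_label` nodes per label (assumed scheme)."""
    labels, edges = graph
    # Assign every vertex to a random group among those reserved for its label.
    node_of = {v: (label, rng.randrange(groups_per_label)) for v, label in labels.items()}
    # Give each non-empty (label, group) pair a fresh node id.
    ids: Dict[Tuple[str, int], int] = {}
    for v in sorted(labels):
        ids.setdefault(node_of[v], len(ids))
    new_labels = {node_id: key[0] for key, node_id in ids.items()}
    new_edges: Set[FrozenSet[int]] = set()
    for edge in edges:
        u, w = tuple(edge)
        a, b = ids[node_of[u]], ids[node_of[w]]
        if a != b:  # edges inside one merged node are dropped in this sketch
            new_edges.add(frozenset((a, b)))
    return new_labels, new_edges


def summarize_mine(
    db: List[Graph],
    min_sup: int,
    min_sup_summary: int,
    t: int,
    groups_per_label: int,
    mine: Callable[[List[Graph], int], Set[Hashable]],
    support: Callable[[Hashable, List[Graph]], int],
    seed: int = 0,
) -> Set[Hashable]:
    """Run t rounds of summarize, mine, verify and return the union of verified patterns."""
    rng = random.Random(seed)
    result: Set[Hashable] = set()
    for _ in range(t):
        summaries = [summarize(g, groups_per_label, rng) for g in db]   # summarization
        candidates = mine(summaries, min_sup_summary)                   # mining on D'
        result |= {p for p in candidates if support(p, db) >= min_sup}  # verification on D
    return result
```

Passing the miner and the support check in as callables keeps the sketch independent of any particular mining algorithm, which matches the slide's remark that any frequent subgraph miner can be used in the mining step.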


Page 8

SUMMARIZE-MINE FRAMEWORK

Take gSpan as the skeleton of the mining algorithm. Each labeled graph pattern can be transformed into a sequential representation called a DFS code.

With a defined lexicographic order on the DFS code space, all subgraph patterns can be organized into a tree structure, where
1. patterns with k edges are put on the k-th level;
2. a preorder traversal of this tree generates the DFS codes of all possible patterns in lexicographic order.
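As a small illustration of the DFS code representation, the sketch below writes out one DFS code for a labeled triangle by hand. The 5-tuple layout (i, j, l_i, l_(i,j), l_j) follows the gSpan convention; the example graph, its labels, and the helper names are made up for illustration. The code does not implement the lexicographic order or the search for a minimum DFS code.

```python
from typing import List, Tuple

# One DFS-code entry: (i, j, label_i, edge_label, label_j), where i and j are
# the DFS discovery indices of the edge's two endpoints.
Edge = Tuple[int, int, str, str, str]

# Hypothetical triangle with vertex labels A, B, C and edge labels x, y, z.
triangle_code: List[Edge] = [
    (0, 1, "A", "x", "B"),  # forward edge: discovers vertex 1
    (1, 2, "B", "y", "C"),  # forward edge: discovers vertex 2
    (2, 0, "C", "z", "A"),  # backward edge: closes the cycle back to vertex 0
]


def is_forward(edge: Edge) -> bool:
    """A forward edge goes from an earlier-discovered vertex to a newly discovered one."""
    return edge[0] < edge[1]


def tree_level(code: List[Edge]) -> int:
    """A pattern with k edges has a DFS code of length k and sits on level k of the tree."""
    return len(code)
```

Here the triangle has three edges, so its code has three entries and the pattern belongs to the third level of the pattern tree.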

Page 9

SUMMARIZE-MINE FRAMEWORK

According to DFS lexicographic order,


Page 11

SUMMARIZE-MINE FRAMEWORK

Reduce false positives

False embeddings → false positives

Technique 1: Bottom-up
sup(p1) > sup(p2) > min_sup

Technique 2: Top-down
min_sup > sup(p1) > sup(p2)

It is guaranteed that there are no false positives.


Page 13

SUMMARIZE-MINE FRAMEWORK


Page 15

Bounding the False Negative Rate

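Missed embeddings → false negatives

q(p)

The probability that all mj vertices with label lj are assigned to different groups among the xj groups for that label (and thus f continues to exist) is computed per label. Multiplying the probabilities for all L labels gives the probability for the whole embedding f.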
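The per-label probability can be illustrated under a simple assumption that the transcript does not spell out: each of the mj vertices is placed independently and uniformly at random into one of xj groups. Under that assumption, the chance that all mj vertices land in distinct groups is xj(xj − 1)…(xj − mj + 1) / xj^mj, and the product over the L labels follows directly. The function names below are illustrative only.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def distinct_groups_probability(x: int, m: int) -> Fraction:
    """Probability that m vertices, each assigned uniformly at random to one of
    x groups (an assumption of this sketch), all end up in different groups."""
    if m > x:
        return Fraction(0)  # more vertices than groups: a collision is certain
    prob = Fraction(1)
    for i in range(m):
        # The (i+1)-th vertex must avoid the i groups already taken.
        prob *= Fraction(x - i, x)
    return prob


def embedding_survival_probability(label_counts: Iterable[Tuple[int, int]]) -> Fraction:
    """Multiply the per-label probabilities; each item is a pair (x_j, m_j)."""
    prob = Fraction(1)
    for x, m in label_counts:
        prob *= distinct_groups_probability(x, m)
    return prob
```

For example, with two labels where (xj, mj) is (3, 2) and (4, 2), the per-label probabilities are 2/3 and 3/4, so the embedding survives with probability 1/2 in this model.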

Page 16

Bounding the False Negative Rate

Page 17

Bounding the False Negative Rate

The false negative rate after t iterations is (1−P)^t. To make (1−P)^t less than some small

Technique 1: For a raw database with frequency threshold min_sup, we adopt a lower frequency threshold min_sup’ for the summarized database.

Technique 2: Iterate the mining steps t times and combine the results generated each time.

It is NOT guaranteed that there are no false negatives, but the probability is bounded by
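The relationship between the number of iterations and the false negative rate can be made concrete. Taking P to be the per-iteration probability that appears in (1−P)^t, and writing `eps` for the small target value that the transcript cuts off (the name is an assumption), the smallest t with (1−P)^t ≤ eps is the ceiling of log(eps) / log(1−P). A minimal sketch:

```python
import math


def iterations_needed(p: float, eps: float) -> int:
    """Smallest t such that (1 - p) ** t <= eps, for 0 < p < 1 and 0 < eps < 1."""
    if not (0.0 < p < 1.0) or not (0.0 < eps < 1.0):
        raise ValueError("p and eps must lie strictly between 0 and 1")
    # (1 - p)^t <= eps  <=>  t >= log(eps) / log(1 - p), since log(1 - p) < 0.
    return max(1, math.ceil(math.log(eps) / math.log(1.0 - p)))
```

With p = 0.5 and eps = 0.01, seven iterations suffice, because 0.5^7 ≈ 0.0078 while 0.5^6 ≈ 0.0156.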

Page 18

Experiments

Page 19

Experiments

Page 20

Conclusion

Isomorphism tests on small graphs are much easier.

The summarize-and-mine steps are repeated t times to reduce the false negative rate; what should t be?