mining adaptively frequent closed unlabeled rooted trees ...abifet/kdd08-slides.pdf · mining...

13
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams Albert Bifet and Ricard Gavaldà Universitat Politècnica de Catalunya 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08) 2008 Las Vegas, USA

Upload: others

Post on 30-Apr-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Mining Adaptively Frequent Closed UnlabeledRooted Trees in Data Streams

Albert Bifet and Ricard Gavaldà

Universitat Politècnica de Catalunya

14th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD’08)

2008 Las Vegas, USA

Data StreamsSequence is potentiallyinfiniteHigh amount of data:sublinear spaceHigh speed of arrival:sublinear time perexample

Tree MiningMining frequent trees isbecoming an importanttaskApplications:

chemical informaticscomputer visiontext retrievalbioinformaticsWeb analysis.

Many link-basedstructures may bestudied formally bymeans of unorderedtrees

Introduction: Trees

Our trees are:RootedUnlabeledOrdered and Unordered

Our subtrees are:Induced

Two different ordered treesbut the same unordered tree

Introduction

What Is Tree Pattern Mining?

Given a dataset of trees, find the complete set of frequentsubtrees

Frequent Tree Pattern (FS):

Include all the trees whose support is no less than min_sup

Closed Frequent Tree Pattern (CS):

Include no tree which has a super-tree with the samesupport

CS ⊆ FSClosed Frequent Tree Mining provides a compactrepresentation of frequent trees without loss of information

Introduction

Unordered Subtree Mining

A: B: X: Y:X: Y:

D = {A,B},min_sup = 2

# Closed Subtrees : 2# Frequent Subtrees: 9

Closed Subtrees: X, Y

Frequent Subtrees:

Introduction

ProblemGiven a data stream D of rooted, unlabelled and unorderedtrees, find frequent closed trees.

D

We provide three algorithms,of increasing power

IncrementalSliding WindowAdaptive

Relaxed Support

Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng,Yunfeng Liu and Kunqing Xie.CLAIM: An Efficient Method for Relaxed Frequent ClosedItemsets Mining over Stream Data

Linear Relaxed Interval:The support space of allsubpatterns can be divided into n = d1/εre intervals, whereεr is a user-specified relaxed factor, and each interval canbe denoted by Ii = [li ,ui), where li = (n− i)∗ εr ≥ 0,ui = (n− i +1)∗ εr ≤ 1 and i ≤ n.Linear Relaxed closed subpattern t : if and only if thereexists no proper superpattern t ′ of t such that their suportsbelong to the same interval Ii .

Relaxed Support

As the number of closed frequent patterns is not linear withrespect support, we introduce a new relaxed support:

Logarithmic Relaxed Interval:The support space of allsubpatterns can be divided into n = d1/εre intervals, whereεr is a user-specified relaxed factor, and each interval canbe denoted by Ii = [li ,ui), where li = dc ie, ui = dc i+1−1eand i ≤ n.Logarithmic Relaxed closed subpattern t : if and only ifthere exists no proper superpattern t ′ of t such that theirsuports belong to the same interval Ii .

Galois Lattice of closed set of trees

D

We needa Galoisconnection paira closure operator

1 2 3

12 13 23

123

1 2 3

12 13 23

123

Algorithms

AlgorithmsIncremental: INCTREENAT

Sliding Window: WINTREENAT

Adaptive: ADATREENAT Uses ADWIN to monitor change

ADWIN

An adaptive sliding window whose size is recomputed onlineaccording to the rate of change observed.

ADWIN has rigorous guarantees (theorems)On ratio of false positives and negativesOn the relation of the size of the current window andchange rates

Experimental Validation: TN1

INCTREENAT

CMTreeMiner

Time(sec.)

Size (Milions)2 4 6 8

100

200

300

Figure: Time on experiments on ordered trees on TN1 dataset

Experimental Validation

5

15

25

35

45

0 21.460 42.920 64.380 85.840 107.300 128.760 150.220 171.680 193.140

Number of Samples

Nu

mb

er

of

Clo

se

d T

ree

s

AdaTreeInc 1

AdaTreeInc 2

Figure: Number of closed trees maintaining the same number ofclosed datasets on input data

Summary

ConclusionsNew logarithmic relaxed closed supportUsing Galois Latice Theory, we present methods for miningclosed trees

Incremental: INCTREENATSliding Window: WINTREENATAdaptive: ADATREENAT using ADWIN to monitor change

Future WorkLabeled Trees and XML data.