CEMiner – An Efficient Algorithm for Mining Closed Patterns from Time Interval-based Data
Yi-Cheng Chen, Wen-Chih Peng and Suh-Yin Lee
ICDM 2011
Outline
2012/6/13
Motivation, Preliminaries, Endpoint representation, CEMiner algorithm, Experimental result, Conclusion
Motivation
Existing studies focus only on mining closed sequential patterns from time point-based data.
Cont.
In this paper, we design an efficient method to discover closed temporal patterns from interval-based data.
Three contributions:
(1) We simplify the processing of complex relations, i.e., only “before”, “after” and “equal” need to be handled.
(2) Endpoint representation.
(3) A novel algorithm, CEMiner (Closed Endpoint Temporal Miner).
Preliminaries
Definition 1. Event interval and event sequence
Let E = {e1, e2, …, ek} be the set of event symbols, e.g. {A, B, C, D, E}.
The triplet (ei, si, fi) is an event interval, where ei ∈ E, si is its starting time and fi is its finishing time, e.g. (A, 2, 7).
An event sequence is a series of event interval triplets, e.g. <(A, 2, 7), (B, 5, 10), …, (E, 18, 20)>.
Cont.
Definition 2. Temporal database
Let DB = {r1, r2, …, rm} be a database in which each record ri consists of a sequence-id (SID) and an event. DB is called a temporal database.
Endpoint representation
When describing relationships among more than three events, Allen’s temporal logic may suffer from several problems.
A suitable representation is very important for describing a temporal pattern.
A new expression, the endpoint representation, is proposed to address the ambiguity and scalability problems.
Cont.
Definition 3. Endpoint sequence
Event sequence: q = <(A, 2, 7), (B, 5, 10), (C, 5, 12), (D, 16, 22), (E, 18, 20)>
Tq = {2, 7, 5, 10, 5, 12, 16, 22, 18, 20}
Endpoint sequence: qe = <2, 5, 5, 7, 10, 12, 16, 18, 20, 22>
Endpoint representation: <A+ (B+ C+) A- B- C- D+ E+ E- D->
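The conversion in Definition 3 can be sketched in a few lines of Python. This is only an illustration under assumed notation, not the authors' implementation: the function name `endpoint_representation` is hypothetical, `X+` and `X-` are taken to mean the starting and finishing endpoints of event X, and endpoints that share a time stamp are grouped in one tuple.

```python
from collections import defaultdict


def endpoint_representation(event_sequence):
    """Turn (symbol, start, finish) triplets into time-ordered endpoint groups.

    Assumed notation: 'A+' is the starting endpoint of A, 'A-' its finishing
    endpoint; endpoints with the same time stamp share one tuple.
    """
    by_time = defaultdict(list)
    for symbol, start, finish in event_sequence:
        by_time[start].append(symbol + "+")
        by_time[finish].append(symbol + "-")
    # Order groups by time stamp; sort labels inside a group for determinism.
    return [tuple(sorted(by_time[t])) for t in sorted(by_time)]


# Event sequence q from Definition 3.
q = [("A", 2, 7), ("B", 5, 10), ("C", 5, 12), ("D", 16, 22), ("E", 18, 20)]
rep = endpoint_representation(q)
print(rep)
```

B+ and C+ land in the same group because both intervals start at time 5; every other group holds a single endpoint.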
Cont.
The endpoint representation has several benefits:
(1) Scalability
(2) Nonambiguity
(3) Simplicity
CEMiner algorithm
CEMiner (Closed Endpoint Temporal Miner) uses the arrangement of endpoints to mine closed temporal patterns.
Closure checking: subsequence and supersequence
Ex. Given two sequences α = <A, B, C> and β = <A, D, B, C, E>, we say α is a subsequence of β, and β is a supersequence of α.
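The subsequence relation from the example can be tested with a short Python sketch. It assumes flat sequences of single symbols (no simultaneous endpoints), and the helper name `is_subsequence` is hypothetical.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha can be obtained from beta by deleting elements."""
    it = iter(beta)
    # 'x in it' consumes the iterator, so matches must occur in order.
    return all(x in it for x in alpha)


alpha = ["A", "B", "C"]
beta = ["A", "D", "B", "C", "E"]
print(is_subsequence(alpha, beta))  # True
print(is_subsequence(beta, alpha))  # False
```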
Cont.
Definition 4. Closed temporal pattern
CTP = {α | (α ∈ TP) ∧ (∄ β ∈ TP such that (α ⊂ β) ∧ (support(α) = support(β)))}
Given two sequences α and β: if α is a closed temporal pattern, then α is a temporal pattern and there does not exist a proper supersequence β of α with support(α) = support(β).
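Definition 4 can be read as a filter over an already mined pattern set. The Python sketch below is a brute-force illustration of the definition only, not how CEMiner checks closure. The names `closed_patterns` and `is_subsequence` and the toy supports are hypothetical, and patterns are assumed to be flat tuples of symbols.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha can be obtained from beta by deleting elements."""
    it = iter(beta)
    return all(x in it for x in alpha)


def closed_patterns(support):
    """Keep patterns with no proper supersequence of equal support.

    `support` maps each temporal pattern (a tuple) to its support count.
    """
    closed = {}
    for alpha, sup_a in support.items():
        absorbed = any(
            len(beta) > len(alpha) and sup_b == sup_a and is_subsequence(alpha, beta)
            for beta, sup_b in support.items()
        )
        if not absorbed:
            closed[alpha] = sup_a
    return closed


# Toy pattern set (made-up supports): <B> is absorbed by <A, B>.
tp = {("A",): 3, ("B",): 2, ("A", "B"): 2}
closed = closed_patterns(tp)
print(closed)
```

The quadratic comparison is exactly the cost that a closure check during pattern growth is meant to avoid.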
Cont.
Ex. min_sup = 2
The endpoint sequence <> is a temporal pattern but not a closed temporal pattern, because <> ⊂ <> and both have support 2.
Cont.
Closure checking
To verify that a new temporal pattern p is closed, we need to check whether p is a subsequence or supersequence of an existing temporal pattern p’ and whether the projected databases of p and p’ are equal.
This paper borrows BI-Directional Extension [WH04] to check pattern closure:
(1) Forward-extension
(2) Backward-extension
Cont.
Definition 5. Forward-extension and backward-extension
If an endpoint sequence α is non-closed, there must exist at least one endpoint x that can be used to extend α to a new endpoint sequence α’ with support(α) = support(α’).
α can be extended in five ways: extensions (1) and (2) make α’ a forward-extension sequence, and extensions (3), (4) and (5) make α’ a backward-extension sequence.
Cont.
If there exists neither a forward-extension endpoint nor a backward-extension endpoint, α must be a closed endpoint sequence.
CEMiner checks closure in two directions:
(1) Forward directional checking
(2) Backward directional checking
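A minimal Python sketch of the forward direction, under simplifying assumptions: sequences are plain strings of single-character symbols, and a forward-extension item is taken to be one that occurs after the first instance of the prefix in every sequence containing the prefix, as in bi-directional extension for ordinary sequences. The function name and the toy database are hypothetical, and the handling of simultaneous endpoints is left out.

```python
def forward_extension_items(prefix, database):
    """Items found after the first instance of `prefix` in every sequence
    that contains `prefix`; an empty set means no forward extension."""
    common = None
    for seq in database:
        pos = 0
        for item in prefix:
            pos = seq.find(item, pos)
            if pos < 0:
                break  # prefix not contained: this sequence does not count
            pos += 1
        else:
            items = set(seq[pos:])
            common = items if common is None else common & items
    return common or set()


# Toy database: <A, B> is supported by the first two sequences only.
db = ["ABC", "ADBC", "BD"]
ext = forward_extension_items("AB", db)
print(ext)  # {'C'}
```

Here C follows the prefix in every supporting sequence, so <A, B, C> has the same support as <A, B> and <A, B> is not closed.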
Cont.
Definition. First instance of a prefix sequence
Ex. The first instance of the prefix sequence AB in the sequence CAABC is CAAB.
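The first instance can be computed with a greedy left-to-right scan. The Python sketch below assumes sequences written as plain strings of single-character symbols, and the name `first_instance` is hypothetical.

```python
def first_instance(prefix, sequence):
    """Return the shortest leading part of `sequence` that contains `prefix`
    as a subsequence, or None if `prefix` does not occur in `sequence`."""
    pos = 0
    for item in prefix:
        pos = sequence.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return sequence[:pos]


print(first_instance("AB", "CAABC"))  # CAAB
```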
Cont.
Definition 6. The i-th last-in-first appearance
Ex. 〈ABAB(AB)(AB)〉, p = 〈〉
(1) The last-in-first appearance w.r.t. prefix p for 1 ≤ i < n, with n = 4 and i = 2. First instance: 〈ABAB(AB)(AB)〉
(2) The last-in-first appearance w.r.t. prefix p for i = n, with i = n = 4. First instance: 〈ABAB(AB)(AB)〉
Cont.
Definition 7. The i-th semi-maximum period
Ex. 〈ABAB(AB)(AB)〉, p = 〈〉
(1) Semi-maximum period of prefix p for i = 1: the part before the 1st last-in-first appearance: 〈ABAB(AB)(AB)〉
(2) Semi-maximum period of prefix p for 1 < i ≤ n, with n = 4 and i = 2: the part between
a. the end of the first instance of 〈〉: 〈AB〉
b. the 2nd last-in-first appearance w.r.t. p: B 〈ABAB(AB)(AB)〉
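Definitions 6 and 7 can be made concrete for ordinary sequences with the Python sketch below. It assumes plain strings of single-character symbols rather than endpoint sequences with simultaneous endpoints, and all function names are hypothetical. Positions are 0-based string indices.

```python
def first_instance_end(prefix, seq):
    """Index just past the first instance of `prefix` in `seq` (None if absent)."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return pos


def last_in_first(prefix, seq):
    """Positions of the 1st..n-th last-in-first appearances of `prefix` in `seq`."""
    bound = first_instance_end(prefix, seq)
    if bound is None:
        return None
    positions = []
    # Walk the prefix backwards: the i-th appearance is the last occurrence
    # of the i-th item that lies before the (i+1)-th appearance.
    for item in reversed(prefix):
        bound = seq.rfind(item, 0, bound)
        positions.append(bound)
    return positions[::-1]


def semi_maximum_periods(prefix, seq):
    """The 1st..n-th semi-maximum periods of `prefix` in `seq`."""
    lf = last_in_first(prefix, seq)
    if lf is None:
        return None
    periods = []
    for i in range(len(prefix)):
        # Start right after the first instance of the first i items (0 if i == 0).
        start = first_instance_end(prefix[:i], seq)
        periods.append(seq[start:lf[i]])
    return periods


print(last_in_first("AB", "CAABC"))         # [2, 3]
print(semi_maximum_periods("AB", "CAABC"))  # ['CA', 'A']
```

For prefix AB in CAABC the first instance is CAAB, so the 2nd last-in-first appearance is the B at index 3 and the 1st is the A at index 2.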
Cont.
EbackScan search
Let p be a prefix endpoint sequence of length n. If there exist an index i (1 ≤ i ≤ n) and an endpoint x that appears in each of the i-th semi-maximum periods of the prefix p in the database, we can derive a new endpoint sequence and we can stop growing p.
Ex. Prefix sequence p = <A, C>. B appears in the 2nd semi-maximum period of the prefix p in the database, so we can derive a new prefix sequence p’ = <A, B, C>.
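The test at the heart of this search can be sketched in Python once the semi-maximum periods are known. The sketch assumes the periods have already been computed for every sequence that contains the prefix and are given as strings of single-character symbols. The function name and the toy periods are hypothetical.

```python
def find_backward_extension(periods_per_sequence):
    """Look for an item shared by the i-th semi-maximum periods of all sequences.

    periods_per_sequence[k][i] is the (i+1)-th semi-maximum period of the
    prefix in the k-th supporting sequence. Returns (i + 1, item) for the
    first hit, or None when no backward-extension item exists.
    """
    if not periods_per_sequence:
        return None
    n = len(periods_per_sequence[0])
    for i in range(n):
        common = set(periods_per_sequence[0][i])
        for periods in periods_per_sequence[1:]:
            common &= set(periods[i])
        if common:
            return i + 1, min(common)
    return None


# Toy periods of prefix <A, C> in the sequences ABC and ADBC.
periods = [["", "B"], ["", "DB"]]
print(find_backward_extension(periods))  # (2, 'B')
```

B lies in the 2nd semi-maximum period of both sequences, which matches the slide's example where p = <A, C> yields p’ = <A, B, C>.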
CEMiner Algorithm
We use three pruning strategies to reduce the search space efficiently and effectively:
(1) pre-pruning (2) post-pruning (3) pair-pruning
CEMiner Algo.
Pair-pruning: if the endpoint is a starting endpoint, we can omit the closure checking, because starting and finishing endpoints always occur in pairs in an endpoint sequence.
CEMiner Algo.
Ex. Prefix p = <>. Endpoint B+ is a backward-extension endpoint of p, so we can stop growing p.
CEMiner Algo.
Pre-pruning: a finishing endpoint y is considered only if it has a corresponding starting endpoint in the current prefix.
CEMiner Algo.
Post-pruning: a finishing endpoint is called significant if it has a corresponding starting endpoint in the projected postfix or in the prefix.
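The significance test can be illustrated with a small Python sketch. Several things here are assumptions rather than details from the slides: endpoints are strings such as `A+` and `A-`, the pruning step simply drops finishing endpoints that are not significant, and a starting endpoint inside the postfix only counts if it occurs before its finishing endpoint. The function name `prune_postfix` is hypothetical.

```python
def prune_postfix(prefix, postfix):
    """Keep starting endpoints and significant finishing endpoints of `postfix`.

    A finishing endpoint 'X-' is kept only if 'X+' was seen in `prefix`
    or earlier in `postfix` (assumed reading of "significant").
    """
    seen_starts = {e[:-1] for e in prefix if e.endswith("+")}
    kept = []
    for endpoint in postfix:
        symbol, kind = endpoint[:-1], endpoint[-1]
        if kind == "+":
            seen_starts.add(symbol)
            kept.append(endpoint)
        elif symbol in seen_starts:
            kept.append(endpoint)
    return kept


# B- has no matching B+ in the prefix or the postfix, so it is dropped.
print(prune_postfix(["A+"], ["B-", "C+", "A-", "C-"]))  # ['C+', 'A-', 'C-']
```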
Experimental result
Conclusion
We develop an efficient algorithm, CEMiner, to discover closed temporal patterns without candidate generation, based on the proposed endpoint representation.
The algorithm further employs three pruning methods to reduce the search space effectively.