jianmin wang 1, shaoxu song 1, xiaochen zhu 1, xuemin lin 2 1 tsinghua university, china 2...

23
Efficient Recovery of Missing Events Jianmin Wang 1 , Shaoxu Song 1 , Xiaochen Zhu 1 , Xuemin Lin 2 1 Tsinghua University, China 2 University of New South Wales, Australia 1/23 VLDB 2013

Upload: archibald-curtis-thomas

Post on 03-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Efficient Recovery of Missing Events

Jianmin Wang1, Shaoxu Song1, Xiaochen Zhu1, Xuemin Lin2

1Tsinghua University, China2University of New South Wales, Australia

1/23

VLDB 2013

Outline

Motivation Recovery Algorithms

Filling gaps in causal net Branching framework Dealing with loops

Experiments Conclusion Ongoing Work

2/23

VLDB 2013

Event Data Information systems record the business history in their event logs.

3/23

VLDB 2013

Huge Amount of Event Data:

Corporation

Products

No. of Event Sequences 1,230,000

Power Generatior

3,260,000

Machinery

2,600,000

Train

E ID T ID Event Name Timestamp

1 1 Order 2013-04-22 13:33:34

2 1 Pay by Cash 2013-04-22 15:10:17

3 1 Check Inventory 2013-04-22 15:18:11

4 1 Validate 2013-04-22 15:31:50

5 1 Delivery 2013-04-23 08:14:26

Event Sequence: Order Pay by cash Check inventory Validate Delivery

Process Specification Business events often follow certain business rules or constraints

4/23

VLDB 2013

1

2

3

Order Pay by cash Check inventory Validate Delivery

Order Check inventory Pay by credit card Check credit history Validate Delivery

Order Check inventory Pay by cash Re-check Check inventory Validate Delivery

Process specification

Event sequences

follow

Constraints by Petri net:• Sequence• Parallel• Choice

A

B

C D

E

H

F GOrder Pay by

credit cardCheck credit history

Check inventory

Pay by cash

Re-check

Validate Delivery

XORsplit

XORjoin

ANDsplit

ANDjoin

Order Pay by cash Pay by cash Check inventory

Check inventory Pay by cash

Pay by cash

Pay by cash

Pay by credit card

Missing Events5/23

VLDB 2013

The Causes of Missing Events: Man-made errors (typo); System failures (power down); Lost during system integration (mismatching).

Survey in one division of CNR: 47% events are missed.

Sequence of events

A

B

C D

E

H

F GOrder Pay by

credit cardCheck credit history

Check inventory

Pay by cash

Re-check

Validate Delivery<A, B, E, F, G> complete

<A, C, E, F, G> incomplete

B

F GA

E

A

C

E

F G

D

Meaning of Recovery

Why Need Recovery?

Incomplete provenance answer

Inaccurate query answer

6/23

VLDB 2013

1 Order Check inventory Re-check Check inventory Validate Delivery

SEQ(Pay by cash, Validate), which means “Pay by cash” followed by “Validate”.

What are the prerequisites of “Validate” in sequence 1?

Event sequence 1 is lost.

“Pay by cash” is lost.Answer: “Order”, “Check Inventory” and “Re-check”.

Answer: sequence 2.

Pay by cash Pay by cash

2 Order Pay by cash Check inventory Validate Delivery

Recovery of Missing Events

Recovery Insert the missing prerequisites into the gap; Based on the specification (follows constraints).

7/23

VLDB 2013

Event sequence <ACEFG> has missing event between ACE and F, called a GAP (ACE, F).

A possible recovery is <ACEDFG>

A

B

C D

E

H

F GOrder Pay by

credit cardCheck credit history

Check inventory

Pay by cash

Re-check

Validate DeliveryA

E

C

F

D

Possible Recoveries

Multiple Possible Recoveries

8/23

• Loop structures:To recover (AB,F), we have ABEF, ABEHEF, ABEHEHEF...

VLDB 2013

• Choice structures:To recover (AE,F), we have AEBF, AECDF

What is the most likely recovery?

A

B

C D

E

H

F GOrder Pay by

credit cardCheck credit history

Check inventory

Pay by cash

Re-check

Validate DeliveryA

E

F

B

C D

A

B

F

E

H

Minimum Recovery

Intuition: Generally, people try to make the minimum mistakes. Principle: Follow the minimum change law in improving data

quality. Problem: The cost of a recovery, which is the number of inserted

events should be minimized.

9/23

VLDB 2013

To recover <AEFG>: <AEBFG> is a minimum recovery, while <AECDFG> is not.

A

BC D

EH

F GOrder Pay by

credit cardCheck credit history

Check inventory

Pay by cash

Re-check

Validate Delivery

Hardness and Related Work

Hardness: Owing to choices and parallelization of flows, there exist vast

alternatives to enumerate in the recovery; We prove the NP-hardness of this problem.

Existing Approach: The Alignment algorithm* Enumerates a factorial number of redundant recoveries.

Our Observation: All the sequences in S are equivalent w.r.t. the specification; Any of S is a minimum recovery; Not necessary to enumerate.

*M. de Leoni, F. M. Maggi, and W. M. P. van der Aalst. Aligning event logs and declarative process models for conformance checking. In BPM, pages 82–97, 2012.

10/23

VLDB 2013

To recover (A,E), it generates S: {ABCDE, ABDCE, ACBDE, ACDBE, ADBCE, ADCBE}

A

B

C

D

EA E

B

C

D

Outline

Motivation Recovery Algorithms

1. Filling gaps upon causal net2. Branching framework3. Dealing with loops

Experiments Conclusion Ongoing Work

11/23

VLDB 2013

2. Net with Choice StructuresB

3. Net with Loop StructuresH

1. Causal Net

A

C D

E

F G

1. Filling Gaps upon Causal Net

Causal Net: net without XOR structures. It can be translated into a equivalent DAG.

Based on parallel constraints & sequence constraints: Each complete sequence must be one of the topological sorts on the

DAG. A recovery of incomplete sequence is a subsequence of a complete

sequence. <ABCE> is a subsequence of <ABCDE>

Hints: fills the missing prerequisites by backtracking.

12/23

VLDB 2013

A

B

C

D

E A

B

C

D

E

2. The Branching Framework

Problem: topological sort cannot be applied on choice structure. Hints:

Enumerate all the possible execution of the choice structures Each execution is a causal net, topological sort can be applied.

13/23

VLDB 2013

To recover (A, E)

(A, BE)

(A, CDE)

A

B

C ED

A

B

C ED

E

Recovery with minimum cost

Pruning Methods

1. Branching Index: prune the branches not including the right-side event of gap.

2. Path Reachability Pruning: prune the branches that the left-side of gap are not reachable to the right-side of gap.

3. Branch and Bound: prune the branches that may lead to possible recoveries but cannot be the minimum one.

4. Local Optimality: identify groups of transitions that always share the same branching, thus only one of them needs to be computed.

14/23

VLDB 2013

A

B

C ED

EGap (A,E)Gap (A,D)

Gap (AC,E)

A

E

C LB = 4

LB = 3

E

E

A

E

EA D

3. Extension on Loops Problem: when to terminate? Hints: considers loop at most once (minimum cost).

15/23

To recover (ABC, B)

VLDB 2013

A

E

B DC

A

E

B DC

A B CBA

A B CB

E

(ABC, AB)Not valid

(ABC, EB)

A B DC

E

A B C

(ABC, EBCEB)Not minimum

E

B C

E

B

Outline

Motivation Recovery Algorithms

Filling gaps in causal net Branching framework Dealing with loops

Experiments Conclusion Ongoing Work

16/23

VLDB 2013

Experiment Setting

Real Life Data Set: employed from CNR*

Setting we randomly remove 10%-90% events from the complete

sequences; apply the recovery methods to recover the removed events.

Criteria: to evaluate the accuracy of recovery, F-measure of precision and recall.

Baseline: the Alignment approach.

*www.tangche.com

17/23

VLDB 2013

No. of Process Specifications

149Max Specification Size

63 transitions 79 places

No. of Event Sequences

4470Max Sequence Size

1000 events

Causal net 25

Net with choices (no loop) 57

Net with loops 67

Experiments on Causal Net

The accuracy is no worse than Alignment algorithm. Our algorithm shows great improvement in time cost.

18/23

VLDB 2013

0-2 3-5 6-8 9-11 12-14 21-230.01

0.1

1

10

100

1000

10000

100000time performance

Specification size

Tim

e co

st (m

s)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90.1

1

10

100

1000

10000time performance

Missing Rate

Tim

e co

st (m

s)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.10.20.30.40.50.60.70.80.9

accuracy

AlignmentBacktracking

Missing Rate

F-M

easu

re

0-2 3-5 6-8 9-11 12-14 21-230

0.10.20.30.40.50.60.70.80.9

1accuracy

Specification size

F-M

easu

re

Experiments on Net with Choices

The pruning methods are effective.

19/23

VLDB 2013

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90.01

0.1

1

10

100

1000time performance

AlignmentBranchBranch+ReachBranch+Reach+BoundLocalLocal+Reach+Bound

Missing Rate

Tim

e co

st (m

s)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.10.20.30.40.50.60.70.80.9

1 accuracy

Missing Rate

F-M

easu

re

0 - 4 5 - 9 1 0 -1 4

1 5 -1 9

2 0 -2 4

2 5 -2 9

3 0 -3 4

3 5 -3 9

4 0 -4 4

00.10.20.30.40.50.60.70.80.9

accuracy

Specification size

F-M

easu

re

0 - 4 5 - 9 1 0 -1 4

1 5 -1 9

2 0 -2 4

2 5 -2 9

3 0 -3 4

3 5 -3 9

4 0 -4 4

0.001

0.01

0.1

1

10

100

1000

10000time performance

Specification size

Tim

e co

st (m

s)

Best approach

100 200 300 400 500 600 700 800 900 1000100

1000

10000

100000

time performanceAlignmentBranchBranch+ReachBranch+Reach+BoundLocalLocal+Reach+Bound

Sequence size

Tim

e co

st (m

s)

100 200 300 400 500 600 700 800 90010000

0.10.20.30.40.50.60.70.80.9

1

accuracy

Sequence size

F-M

easu

reExperiments on Net with Loops

Local+Reach+Bound can always achieve the lowest time cost.

20/23

VLDB 2013

Best approach

Conclusion

We define the specification-based minimum recovery problem for missing events.

We prove the NP-hardness of this problem. To efficiently find the optimal recovery:

We propose a backtracking idea to reduce the redundant sequences with respect to parallel events.

We construct a branching index, and develop reachability checking and lower bounds of recovery distances to further accelerate the computation.

The local optimal method can reduce the number of intermediate results.

21/23

VLDB 2013

Ongoing Work In this paper of missing events, we assume that the logged event data

are clean. However, the logged event data may also be dirty.

Encoding example: UTF-8: Re-check GB2312: e ?□ □ □

Without addressing erroneous events, Inaccurate provenance answer Inaccurate query answer

By the insertion technique proposed in this paper, we cannot elimate the erroneous events.

Instead, we need to modify the mistakely recorded events Non-trivial, also NP-hard Heuristic approach may apply

22/23

VLDB 2013

AOrder

Pay by credit card Check credit historyCheck inventory

Validate Delivery

Check inventory□e ?□ □

C D

E E

F G

Q & AThanks!

23/23

VLDB 2013