Ternary Directed Acyclic Word Graphs (TDAWG), by Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda and Ayumi Shinohara. Presented by Peera Liewlom (The Last Algorithm Group)


Ternary Directed Acyclic Word Graphs (TDAWG)

Satoru Miyamoto, Shunsuke Inenaga,

Masayuki Takeda and Ayumi Shinohara

Presented by

Peera Liewlom

(The Last Algorithm Group)


CIAA 2003

• Eighth International Conference on Implementation and Application of Automata

• July 16-18, 2003, Santa Barbara, CA, USA

• Topic / Committee / Community


Why did I select this paper?

• DAWG dates from 1985, which is not so long ago

• Development is continuing: cDAWG, ASDAWG, morphic DAWG, WDAWG, SDAWG, two-tree DAWG, DASG, CSDAWG, etc.

• TST: 1997-98; TDAWG: 2003

• DAWG is widely applied in bioinformatics, NLP, graph theory, string matching, automata, etc.

• Speed and space trends in huge data management

• A topic for the Algorithm Group

• Matches the topics of interest in this seminar group


Content

• DFA (used in string-matching problems)

• DAWG

• Ternary Search Tree

• Paper : TDAWG, Experiment & Result

• Paper : Conclusion

• Paper : Discussion


DFA: Deterministic Finite Automata


Formalities

• Deterministic Finite Accepter (DFA)

$M = (Q, \Sigma, \delta, q_0, F)$

$Q$ : set of states

$\Sigma$ : input alphabet

$\delta$ : transition function

$q_0$ : initial state

$F$ : set of final states


Set of States $Q$

$Q = \{q_0, q_1, q_2, q_3, q_4, q_5\}$

[Figure: state diagram of the example DFA on states $q_0$ to $q_5$, with edges labelled $a$ and $b$]


Input Alphabet $\Sigma$

$\Sigma = \{a, b\}$

[Figure: state diagram of the example DFA on states $q_0$ to $q_5$, with edges labelled $a$ and $b$]


Initial State $q_0$

[Figure: state diagram of the example DFA with the initial state $q_0$ highlighted]


Set of Final States $F$

$F = \{q_4\}$

[Figure: state diagram of the example DFA with the final state $q_4$ highlighted]


Transition Function $\delta$

$\delta : Q \times \Sigma \to Q$

[Figure: state diagram of the example DFA on states $q_0$ to $q_5$, with edges labelled $a$ and $b$]


$\delta(q_0, a) = q_1$

[Figure: state diagram of the example DFA illustrating the transition from $q_0$ to $q_1$ on $a$]


$\delta(q_0, b) = q_5$

[Figure: state diagram of the example DFA illustrating the transition from $q_0$ to $q_5$ on $b$]


$\delta(q_2, b) = q_3$

[Figure: state diagram of the example DFA illustrating the transition from $q_2$ to $q_3$ on $b$]


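Transition Function $\delta$

$\delta$ | $a$ | $b$
$q_0$ | $q_1$ | $q_5$
$q_1$ | $q_5$ | $q_2$
$q_2$ | $q_2$ | $q_3$
$q_3$ | $q_4$ | $q_5$
$q_4$ | $q_5$ | $q_5$
$q_5$ | $q_5$ | $q_5$

[Figure: state diagram of the example DFA on states $q_0$ to $q_5$, with edges labelled $a$ and $b$]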
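The transition table can be turned directly into a small simulator. The following is a minimal sketch, not part of the original slides: the dictionary copies the table rows as transcribed on this slide, while the names `DELTA`, `START`, `FINAL` and `accepts` are chosen here for illustration only.

```python
# Transition table copied row by row from the slide (an assumption of this
# sketch is that the transcription of the table is accurate).
DELTA = {
    "q0": {"a": "q1", "b": "q5"},
    "q1": {"a": "q5", "b": "q2"},
    "q2": {"a": "q2", "b": "q3"},
    "q3": {"a": "q4", "b": "q5"},
    "q4": {"a": "q5", "b": "q5"},
    "q5": {"a": "q5", "b": "q5"},
}
START = "q0"    # initial state q0
FINAL = {"q4"}  # set of final states F


def accepts(word, delta=DELTA, start=START, final=FINAL):
    """Run the DFA on `word` and report whether it stops in a final state."""
    state = start
    for symbol in word:
        # One table lookup per input symbol.
        state = delta[state][symbol]
    return state in final


if __name__ == "__main__":
    for w in ["abba", "abb", "b", ""]:
        print(repr(w), accepts(w))
```

Each input symbol costs exactly one lookup, which is the behaviour the later slides compare against linked-list and ternary-tree representations of the same function.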


Another Example

$L(M) = \{\lambda, ab, abba\}$

[Figure: state diagram of a DFA $M$ on states $q_0$ to $q_5$, with three states marked "accept"]


• $L(M) = \{$ all strings with prefix $ab$ $\}$

[Figure: state diagram of a DFA on states $q_0$ to $q_3$, with edges labelled $a$ and $b$ and one state marked "accept"]


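$L(M) = \{$ all strings without substring $001$ $\}$

[Figure: state diagram of a DFA whose states include those labelled $0$, $00$ and $001$, with edges labelled $0$ and $1$]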
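The idea behind this automaton is that each state records how much of the forbidden substring has just been read. The sketch below is an illustration under assumptions: the state names `""`, `"0"`, `"00"`, `"001"` follow the labels visible on the slide, but the individual transitions are the standard ones for this language and were not recoverable from the diagram itself.

```python
def avoids_001(bits):
    """Return True if the binary string `bits` does not contain "001"."""
    # Each state is the longest suffix of the input read so far that is
    # also a prefix of "001"; "001" itself is a non-accepting trap state.
    delta = {
        "":    {"0": "0",   "1": ""},
        "0":   {"0": "00",  "1": ""},
        "00":  {"0": "00",  "1": "001"},
        "001": {"0": "001", "1": "001"},
    }
    state = ""
    for bit in bits:
        state = delta[state][bit]
    return state != "001"


if __name__ == "__main__":
    for s in ["", "0101", "110", "1001", "0001"]:
        print(repr(s), avoids_001(s))
```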


DAWG: Directed Acyclic Word Graph


DAWG


DAWG


DAWG


cDAWG


Main development idea (Methodology) | node | edge | Development highlights

1. DAWG

The prototype for the DAWG family. It redirects the edges of the branching tree so that nodes can be shared, which greatly reduces the number of nodes and gives better speed than a DAG.

2. cDAWG

Focuses on reducing the number of nodes, which in turn reduces the number of edges, so processing is faster than with a DAWG.

3. ASDAWG

Can store all subsequences together in a single graph. Suited to subsequence analysis, and greatly reduces memory usage.

4. morphic DAWG

An application in which a function is applied to DAWG-structured data.

5. WDAWG

Puts a length window on the sequence to restrict attention to the parts of interest (VLDC). Everything that is not of interest is treated as a wildcard, which makes it easier to target the analysis.

6. SDAWG

Restructures the DAWG so that it has the symmetric-tree property, giving the best average speed in use.

7. two-tree DAWG

A technique for splitting the DAWG into 2 parts, which makes updates faster because the whole tree does not have to be restructured.

8. DASG

Extends cDAWG by allowing each edge between nodes to be traversed both forward and backward.

9. CSDAWG

Adapts the DAWG tree structure so that the start point and the end point can be the same point, so that this storage scheme can be applied to graphical or geometric data such as circles or polygons.


TST: Ternary Search Tree


TST History

• Jon L. Bentley and Robert Sedgewick

• Fast Algorithms for Sorting and Searching Strings, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1997.

• Ternary Search Trees, Dr. Dobb's Journal, April 1998.

• Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology, http://www.nist.gov/


BST DST

TST


TDAWGTernary Directed Acyclic Word Graph


Introduction

• DFA: how should the transitions of each state be implemented? (time and space efficiency)

• TST: "implants" a BST for the transitions

– Good time

• DAWG: the smallest DFA for all suffixes of a string

– Good space

• TDAWG

• Proof: TDAWG vs. DAWG
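To make the "BST for transitions" idea concrete, here is a minimal ternary search tree sketch in the style described by Bentley and Sedgewick. It is an illustration only, not code from the paper: the names `Node`, `insert` and `contains` are hypothetical, the tree is not balanced, and only non-empty words are supported.

```python
class Node:
    """One TST node: a character plus three children."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = None      # subtree for characters smaller than ch
        self.eq = None      # subtree for the next character of the word
        self.hi = None      # subtree for characters larger than ch
        self.is_end = False


def insert(node, word, i=0):
    """Insert non-empty `word` below `node`; returns the (new) subtree root."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.is_end = True
    return node


def contains(node, word):
    """Return True if `word` was inserted into the tree rooted at `node`."""
    i = 0
    while node is not None and word:
        ch = word[i]
        if ch < node.ch:
            node = node.lo          # binary-search step among siblings
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(word):
            node = node.eq          # character matched, move to next one
            i += 1
        else:
            return node.is_end
    return False


if __name__ == "__main__":
    root = None
    for w in ["cat", "cap", "car", "dog"]:
        root = insert(root, w)
    print(contains(root, "cap"), contains(root, "ca"))
```

The `lo` and `hi` links form a binary search tree over the characters that can follow a given prefix, so choosing the next transition takes a number of comparisons that depends on the shape of that small tree rather than on a full scan of the alphabet.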


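Hypothesis / Theorem (1/2)

• Time = construction + search (usable online)

• DFA $M = (Q, \Sigma, \delta, q_0, F)$ with transition function $\delta : Q \times \Sigma \to Q$

• $\Sigma$ = alphabet (Chinese and Japanese: ~1000 characters)

• $Q$ = states

• Table: $O(|p|)$ search time, where $|p|$ is the length of the pattern $p$

• A table uses a very large amount of memory

• Linked list: $O(|\Sigma| \times |p|)$ search time

• If $|\Sigma|$ is large, search time becomes a problem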
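The trade-off between the two classical representations can be seen on a single state. The sketch below is illustrative only: `table_lookup`, `list_lookup`, the alphabet size `SIGMA = 256` and the sample edges are assumptions made here, and the returned step counts are a rough stand-in for search cost.

```python
SIGMA = 256  # assumed alphabet size (one slot per byte value)


def table_lookup(row, symbol):
    """Table row: one slot per alphabet symbol, indexed directly (1 step)."""
    return row[ord(symbol)], 1


def list_lookup(edges, symbol):
    """Edge list: scan (symbol, target) pairs; up to |Sigma| steps per symbol."""
    steps = 0
    for sym, target in edges:
        steps += 1
        if sym == symbol:
            return target, steps
    return None, steps


# A state with three outgoing edges, stored both ways.
edges = [("a", 1), ("b", 2), ("z", 3)]
row = [None] * SIGMA          # SIGMA slots even though only 3 are used
for sym, target in edges:
    row[ord(sym)] = target

if __name__ == "__main__":
    print(table_lookup(row, "z"))   # constant time, large memory
    print(list_lookup(edges, "z"))  # small memory, linear scan
```

The table answers in one step but reserves a slot for every symbol in every state, while the list stores only existing edges but may scan all of them for each pattern character.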


Hypothesis / Theorem (2/2)

• For TDAWG

– Uses $O(|S|)$ space

– Uses $O(\log|\Sigma| \times |p|)$ search time

– Uses $O(|\Sigma| \times |S|^2)$ construction time (Bentley & Sedgewick)

– Uses $O(|\Sigma| \times |S|)$ construction time (this paper, which applies Blumer's online DAWG construction)

• Comparison: TDAWG vs. DAWG (table and linked list)

– Space, search time, construction time
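For orientation, the following is a sketch of the standard online DAWG (suffix automaton) construction that the slide refers to, processing the string one character at a time. It is not the TDAWG algorithm of the paper: transitions are kept in plain Python dictionaries here, and the names `State`, `build_dawg` and `is_substring` are hypothetical.

```python
class State:
    """DAWG state: outgoing edges, suffix link, longest string length."""

    def __init__(self, length=0, link=None):
        self.next = {}        # symbol -> State
        self.link = link      # suffix link
        self.length = length  # length of the longest string reaching here


def build_dawg(text):
    """Build the DAWG of `text` online, one character at a time."""
    root = State()
    last = root
    for ch in text:
        cur = State(last.length + 1)
        p = last
        # Add the new edge along the suffix-link chain where it is missing.
        while p is not None and ch not in p.next:
            p.next[ch] = cur
            p = p.link
        if p is None:
            cur.link = root
        else:
            q = p.next[ch]
            if q.length == p.length + 1:
                cur.link = q
            else:
                # Split q: the clone keeps q's edges but a shorter length.
                clone = State(p.length + 1, q.link)
                clone.next = dict(q.next)
                while p is not None and p.next.get(ch) is q:
                    p.next[ch] = clone
                    p = p.link
                q.link = clone
                cur.link = clone
        last = cur
    return root


def is_substring(root, pattern):
    """Follow one edge per pattern character; any surviving path is a match."""
    state = root
    for ch in pattern:
        state = state.next.get(ch)
        if state is None:
            return False
    return True


if __name__ == "__main__":
    dawg = build_dawg("abcbc")
    print(is_substring(dawg, "bcb"), is_substring(dawg, "cc"))
```

Swapping the per-state dictionary for a table, a linked list or a search tree changes only how `state.next` is stored and queried, which is where the representations compared on this slide differ.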


TST TDAWG


Online DAWG Construction


Online TDAWG Construction


Experiment Result


Conclusion

• New data structure: TDAWG

• Construction time (English text, 256 characters)

– TDAWG < linked-list DAWG < table DAWG

• Space requirement

– linked-list DAWG < TDAWG, by about 20%

– table DAWG cannot be compared on the same scale

• Search time

– Short patterns: table DAWG is best; TDAWG < linked-list DAWG

– Log curve vs. linear curve (long patterns?)


Discussion & Future Work

• For Asian languages (thousands of characters), TDAWG should improve search time more than for English (256 characters), because search takes $O(\log|\Sigma| \times |p|)$

• Apply to other DAWGs: cDAWG, minimum DAWG, etc.

• More efficiency with an AVL tree (AVL balancing)

• Bioinformatics has only 4 characters, but a sliding window of 12 characters gives $4^{12}$ possible values
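The arithmetic behind these points can be checked quickly. In this sketch the function name `comparisons_per_symbol` and the figure of 1000 characters for Asian text are assumptions made for illustration; the counts are rough per-character costs for a linear scan versus a balanced binary search, not measurements.

```python
import math


def comparisons_per_symbol(alphabet_size):
    """Rough per-character cost: (linear scan, balanced binary search)."""
    return alphabet_size, math.ceil(math.log2(alphabet_size))


if __name__ == "__main__":
    cases = [
        ("English (bytes)", 256),
        ("Asian text (assumed size)", 1000),
        ("DNA, 12-character window", 4 ** 12),
    ]
    for name, sigma in cases:
        linear, logarithmic = comparisons_per_symbol(sigma)
        print(f"{name}: |Sigma| = {sigma}, linear ~ {linear}, log ~ {logarithmic}")
```

The gap between the linear and logarithmic columns widens as the alphabet grows, which is the reason larger alphabets are expected to benefit more.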