Advanced Database Systems and Applications COMP 6521

Upload: roger-pennings

Post on 30-Mar-2015

PROFESSOR: DR. GOSTA GRAHNE

LAB INSTRUCTOR: ASHKAN AZARNIK

GROUP 15: ADITYA DEWAL, MOHAMMAD IFTEKHARUL HOQUE, SALEH AHMED

Advanced Database Systems and Applications

COMP 6521

PROJECT 1

Develop a program that sorts numbers in ascending order using Two-Phase Multiway Merge Sort (2PMMS) with a limit of 5 MB of virtual memory.

External sorting is required when the data being sorted does not fit into the main memory of a computing device and must instead reside in slower external memory (usually a hard drive).

Our approach to solving the problem

External sorting typically uses a sort-merge technique.

In the sorting phase, chunks of data small enough to fit in main memory are read, sorted in ascending order using the quicksort algorithm, and written out to temporary files.

In the merge phase (the second phase of 2PMMS), the sorted temporary files are combined into a single larger file with a multiway merge.
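The two phases can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual code: the class name `ExternalSortSketch`, the methods, the `chunkSize` parameter, and the one-integer-per-line file format are all assumptions made for the example, and the JDK's `Arrays.sort` stands in for a hand-written quicksort.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {
    // Phase 1: read up to chunkSize numbers at a time, sort each chunk in
    // memory, and write it to its own temporary "run" file.
    // Assumes the input holds one integer per line (hypothetical format).
    static List<Path> createSortedRuns(Path input, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            int[] chunk = new int[chunkSize];
            int n = 0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                chunk[n++] = Integer.parseInt(line);
                if (n == chunkSize) {
                    runs.add(writeRun(chunk, n));
                    n = 0;
                }
            }
            if (n > 0) runs.add(writeRun(chunk, n));
        }
        return runs;
    }

    // Sorts the first n entries of chunk and writes them to a temporary file.
    static Path writeRun(int[] chunk, int n) throws IOException {
        Arrays.sort(chunk, 0, n); // stand-in for the quicksort step
        Path run = Files.createTempFile("run", ".txt");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(run))) {
            for (int i = 0; i < n; i++) out.println(chunk[i]);
        }
        return run;
    }

    // Phase 2: merge all runs in one pass. A min-heap holds the current
    // smallest unread value of each run as {value, runIndex}.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        List<BufferedReader> readers = new ArrayList<>();
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(output))) {
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader r = Files.newBufferedReader(runs.get(i));
                readers.add(r);
                String line = r.readLine();
                if (line != null) heap.add(new int[] {Integer.parseInt(line), i});
            }
            while (!heap.isEmpty()) {
                int[] top = heap.poll();
                out.println(top[0]);
                // Refill the heap from the run the value came from.
                String line = readers.get(top[1]).readLine();
                if (line != null) heap.add(new int[] {Integer.parseInt(line), top[1]});
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }
}
```

In this sketch the memory used in phase 1 is bounded by `chunkSize`, and in phase 2 by one buffered reader per run, which is the property that lets the sort work under a fixed memory limit.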

Challenges

Which algorithm to choose?

Quicksort is one of the fastest and simplest sorting algorithms because its inner loop can be implemented efficiently on most architectures.

Its average case is efficient compared to other sorting algorithms.

The complexity of quicksort is O(n log n) in the average case; the worst case is O(n^2).
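For reference, a compact in-place quicksort might look like the sketch below. It is an illustration only, not the project's code: the class name `QuickSortSketch` and the middle-element pivot choice are assumptions for the example.

```java
class QuickSortSketch {
    // Sorts a[lo..hi] (inclusive) in ascending order, in place.
    static void quickSort(int[] a, int lo, int hi) {
        while (lo < hi) {
            int p = partition(a, lo, hi);
            // Recurse into the smaller side and loop on the larger one,
            // which keeps the recursion depth at O(log n).
            if (p - lo < hi - p) {
                quickSort(a, lo, p - 1);
                lo = p + 1;
            } else {
                quickSort(a, p + 1, hi);
                hi = p - 1;
            }
        }
    }

    // Lomuto partition using the middle element as pivot.
    // Returns the final index of the pivot.
    static int partition(int[] a, int lo, int hi) {
        int mid = lo + (hi - lo) / 2;
        swap(a, mid, hi);            // park the pivot at the end
        int pivot = a[hi];
        int i = lo;                  // next slot for a value smaller than the pivot
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, hi);              // move the pivot between the two halves
        return i;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

The tight loop in `partition` compares each element against a single pivot value, which is the inner loop the slide refers to.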

List of Data Structures

Primitive Types: Boolean, Integer, Long

Abstract Types: Array, String

Arrays (Linear Data Structure): Integer Array, Boolean Array, Long Array

I/O: Scanner, PrintWriter

Buffer Size Experiments

[Figure: The execution time (sec) as a function of the buffer size (KB). Series: Small, Medium, Large. X-axis: Buffer Size (KB); Y-axis: Execution Time (sec).]

Conclusion

From our buffer size experiments we concluded that a buffer of 160,000 numbers, which occupies 2.5 MB of memory, gives the best execution time for us.

Results from Demo

Our program took 3 minutes to run during the demo.

The run was slow because of the way our program read its input and wrote its output.

PROJECT 2

Mining Frequent Itemsets from Secondary Memory

Build an application that computes the frequent itemsets of all sizes (pairs, triples, quadruples, etc.) from a set of transactions, based on an input support threshold given as a percentage.

Algorithms Considered

Apriori: Horizontal Data Layout

Eclat: Vertical Data Layout

Algorithms Considered

Apriori: Breadth-First Traversal

Eclat: Depth-First Traversal

ECLAT

Better Execution Time: Execution time is better than Apriori's.

Memory Efficient: Requires less memory than Apriori when the number of itemsets is small.

Depth-First Search: Explore the unexplored.

ECLAT Algorithm

For each item, store a list of transaction ids (tids).

Horizontal Data Layout

TID   Items
1     A,B,E
2     B,C,D
3     C,E
4     A,C,D
5     A,B,C,D
6     A,E
7     A,B
8     A,B,C
9     A,C,D
10    B

Vertical Data Layout (TID-lists)

Item  TID-list
A     1, 4, 5, 6, 7, 8, 9
B     1, 2, 5, 7, 8, 10
C     2, 3, 4, 5, 8, 9
D     2, 4, 5, 9
E     1, 3, 6
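Converting the horizontal layout into tid-lists takes a single pass over the transactions. The sketch below is illustrative only; the class name `VerticalLayoutSketch`, the method `toVertical`, and the convention that the i-th transaction in the list has tid i + 1 are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class VerticalLayoutSketch {
    // Builds item -> ascending list of tids from a horizontal layout.
    // transactions.get(i) holds the items of the transaction with tid i + 1.
    static Map<String, List<Integer>> toVertical(List<List<String>> transactions) {
        Map<String, List<Integer>> tidLists = new TreeMap<>();
        for (int i = 0; i < transactions.size(); i++) {
            int tid = i + 1;
            for (String item : transactions.get(i)) {
                // Transactions are visited in tid order, so each list stays sorted.
                tidLists.computeIfAbsent(item, k -> new ArrayList<>()).add(tid);
            }
        }
        return tidLists;
    }
}
```

The length of an item's tid-list is its support count, so single-item supports fall out of the conversion for free.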

ECLAT Algorithm

Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.

3 traversal approaches: top-down, bottom-up, hybrid

A:  1, 4, 5, 6, 7, 8, 9
B:  1, 2, 5, 7, 8, 10
AB: 1, 5, 7, 8 (the tids common to A and B, so AB has support 4)
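Because tid-lists are kept in ascending order, the intersection can be computed with a two-pointer scan. The following sketch is an illustration under that assumption; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TidListIntersectSketch {
    // Intersects two ascending tid-lists in O(|a| + |b|) time.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) {          // tid present in both lists
                out.add(x);
                i++;
                j++;
            } else if (x < y) {    // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

The size of the returned list is the support of the combined itemset, so no further pass over the transaction file is needed.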

ECLAT Implementation

List of Data Structures

Primitive Types: Boolean, Integer, Double

Abstract Types: Map, Set, List, Array, String

Arrays (Linear Data Structures): Hash Map (Hash Table), Hash Set (Hash Map), Array List (Dynamic Array), Bit Set (Bit Array), String Array

Trees: Search Tree

ECLAT Implementation

Our implementation represents the set of transactions containing an item as a bit set.

It intersects these rows to determine the support of itemsets.

The search follows a depth-first traversal of a prefix tree, as shown in Figure 1.
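A bit-set based depth-first search along these lines could be sketched as below. This is one plausible reading of the description, not the project's code: the class name `EclatBitSetSketch`, the comma-joined itemset keys, and the absolute `minSupport` count are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EclatBitSetSketch {
    // Returns every frequent itemset (items joined by ",") with its support count.
    static Map<String, Integer> mine(List<List<String>> transactions, int minSupport) {
        // Vertical layout: one bit set per item, bit t set if transaction t contains it.
        TreeMap<String, BitSet> tids = new TreeMap<>();
        for (int t = 0; t < transactions.size(); t++) {
            for (String item : transactions.get(t)) {
                tids.computeIfAbsent(item, k -> new BitSet()).set(t);
            }
        }
        List<String> items = new ArrayList<>();
        List<BitSet> rows = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : tids.entrySet()) {
            if (e.getValue().cardinality() >= minSupport) {
                items.add(e.getKey());
                rows.add(e.getValue());
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        extend("", items, rows, minSupport, result);
        return result;
    }

    // Depth-first step: rows.get(i) is the bit set of (prefix + items.get(i)).
    // Each frequent itemset is extended only with items that come after it,
    // which walks the prefix tree without visiting any itemset twice.
    static void extend(String prefix, List<String> items, List<BitSet> rows,
                       int minSupport, Map<String, Integer> result) {
        for (int i = 0; i < items.size(); i++) {
            String itemset = prefix.isEmpty() ? items.get(i) : prefix + "," + items.get(i);
            result.put(itemset, rows.get(i).cardinality());
            List<String> nextItems = new ArrayList<>();
            List<BitSet> nextRows = new ArrayList<>();
            for (int j = i + 1; j < items.size(); j++) {
                BitSet both = (BitSet) rows.get(i).clone();
                both.and(rows.get(j)); // row intersection gives the joint tid set
                if (both.cardinality() >= minSupport) {
                    nextItems.add(items.get(j));
                    nextRows.add(both);
                }
            }
            if (!nextItems.isEmpty()) {
                extend(itemset, nextItems, nextRows, minSupport, result);
            }
        }
    }
}
```

With bit sets, each intersection is a word-wise AND and each support count is a population count, which is why this representation suits dense data.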

ECLAT Implementation

Divide and Conquer Phase

Divide the file into N partitions. If an item is frequent in one partition, we do not check it again.

Merge Phase

An item may not be frequent in any single partition and still be frequent globally; such items show up when the partitions are merged.

In the merge phase we run the algorithm again on the infrequent items.

ECLAT Implementation

File size = 10000, Threshold = 2%

An item is frequent if it occurs >= 200 times (2% of 10000).

We get intermediate results by checking all the partitions.

In the merge phase we work with the infrequent items of each partition, and then merge the results to get the final output list of frequent items.
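The partition-then-merge idea can be sketched for single items as follows. This is only one possible reading of the slides: it assumes the global count threshold is applied inside each partition, that items do not repeat within a transaction, and that partitions are already loaded as lists; the class name `PartitionedCountSketch` and all parameters are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PartitionedCountSketch {
    // Returns the items whose total count reaches thresholdPercent of all transactions.
    static Set<String> frequentItems(List<List<List<String>>> partitions,
                                     double thresholdPercent) {
        int total = 0;
        for (List<List<String>> p : partitions) total += p.size();
        // e.g. 10000 transactions at 2% gives a minimum count of 200
        int minCount = (int) Math.ceil(total * thresholdPercent / 100.0);

        Set<String> frequent = new TreeSet<>();
        Map<String, Integer> pending = new HashMap<>(); // counts of items not yet frequent

        // Divide and conquer phase: count items one partition at a time.
        for (List<List<String>> partition : partitions) {
            Map<String, Integer> local = new HashMap<>();
            for (List<String> tx : partition) {
                for (String item : tx) {
                    // Items already known to be frequent are not checked again.
                    if (!frequent.contains(item)) local.merge(item, 1, Integer::sum);
                }
            }
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                if (e.getValue() >= minCount) {
                    frequent.add(e.getKey());
                    pending.remove(e.getKey());
                } else {
                    pending.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        }

        // Merge phase: items infrequent in every partition may still reach
        // the threshold once their per-partition counts are added up.
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() >= minCount) frequent.add(e.getKey());
        }
        return frequent;
    }
}
```

Only the counts for one partition plus the pending totals are held in memory at a time, which is what makes the approach usable on files larger than main memory.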

Eclat Execution Time

Execution time of Eclat for Small and Medium datasets:

[Figure: Eclat execution time, two panels (Small Dataset, Medium Dataset). X-axis: Support; Y-axis: Time (ms).]

Eclat vs. Apriori

We compared the execution times of Apriori and Eclat on the Small and Medium datasets and found the following:

[Figure: Apriori and Eclat execution times, two panels (Small Dataset, Medium Dataset). X-axis: Support; Y-axes: Apriori Time (ms) and Eclat Time (ms).]

Benefits of Divide and Conquer

The program can process large files. It gives better performance.

Results from Demo

Execution time was 35 seconds.

REFERENCES

Project 1

Database Systems: The Complete Book, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom

http://en.wikipedia.org/wiki/Quicksort

Project 2

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=846291&userType=inst

http://www.ece.northwestern.edu/~yingliu/papers/para_arm_cluster.pdf

http://ceur-ws.org/Vol-90/borgelt.pdf

http://www.isca.in/COM_IT_SCI/Archive/v1i1/2.ISCA-RJCITS-2013-001.pdf

http://www.intsci.ac.cn/shizz/fimi.pdf