Query Optimization of Frequent Itemset Mining on Multiple Databases

David Fuhry, Department of Computer Science, Kent State University
Joint work with Ruoming Jin and Abdulkareem Alali (Kent State)
March 27, 2008 / EDBT’08



TRANSCRIPT

Query Optimization of Frequent Itemset Mining on Multiple Databases

David Fuhry

Department of Computer Science
Kent State University

Joint work with Ruoming Jin and Abdulkareem Alali (Kent State)

March 27, 2008 / EDBT’08

Why do we care about mining multiple datasets?

• Multiple datasets are everywhere
– Data warehouse
– Data collected at different places, at different times
– Large dataset logically partitioned into several small datasets

• Comparing the patterns from different datasets is very important

• Combining mining results from each individual dataset is not good enough

Frequent Itemset Mining

• One of the most well-studied areas in KDD; one of the most widely used data mining techniques; one of the most costly data mining operators

• Tens (or maybe well over one hundred) of algorithms have been developed
– Among them, Apriori and FP-Tree

• Frequent Pattern Mining (FPM)
– Sequences, Trees, Graphs, Geometric structures, …

• Implemented in modern database systems
– Oracle, SQL Server, DB2, …
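Apriori, mentioned above, is the classic level-wise algorithm: frequent k-itemsets are joined into (k+1)-candidates, and any candidate with an infrequent subset is pruned before its support is counted. A minimal sketch using absolute support counts over transactions represented as Python sets (illustrative, not the implementation used in the paper):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: level-wise frequent itemset mining.

    transactions: list of sets of items; minsup: absolute support count.
    Returns {frozenset(itemset): support_count}.
    """
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count support with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result
```

FP-Tree-based miners avoid candidate generation by compressing transactions into a prefix tree, but the level-wise structure above is what the Apriori pruning property relies on.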

Motivating Examples

• Mining the data warehouse of a nation-wide store:
– Three branches in OH, MI, CA
– One week’s retail transactions

• Queries
– Find itemsets that are frequent with support level 0.1% in each of the stores
– Find itemsets that are frequent with support level 0.05% in both the stores in the mid-west, but are very infrequent (support less than 0.01%) in the west-coast store

So, how to answer these queries?

• Imagine we have only two transaction datasets, A and B

• A simple query
– Find the itemsets that are frequent in A and B with support level 0.1 and 0.3, respectively, or the itemsets that are frequent in A and B with support level 0.3 and 0.1, respectively.

• We have the following options to evaluate this query
– Option 1
• Finding frequent itemsets in A with support level 0.1
• Finding frequent itemsets in B with support level 0.3
• Finding frequent itemsets in A with support level 0.3
• Finding frequent itemsets in B with support level 0.1

How to? (cont’d)

– Option 2
• Finding frequent itemsets in A with support 0.1
• Finding frequent itemsets in B with support 0.1

– Option 3
• Finding frequent itemsets in A (or B) with support 0.1
– Among them, finding itemsets that are also frequent in B (or A) with support 0.1

– Option 4
• Finding frequent itemsets in A with support 0.3
– Among them, finding itemsets that are also frequent in B with support 0.1
• Finding frequent itemsets in B with support 0.3
– Among them, finding itemsets that are also frequent in A with support 0.1

– …

Depending on the characteristics of datasets A and B, and the support levels, each option can have very different total mining cost!

Basic Operations

• SF(A, α)
– Find frequent itemsets in dataset A with support α

• ⊔ Union (inexpensive)

• ⊓ Intersection (inexpensive)

• σβ Select frequent itemsets with support β from the results of SF(A, α) (β > α) (inexpensive)
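These operators can be modeled over mining results that carry each itemset's support; the miner behind SF is pluggable, and ⊔, ⊓, and σβ work on already-mined result sets, which is why they are labeled inexpensive. A sketch (function names and the dict encoding are illustrative):

```python
def SF(dataset, alpha, miner):
    """SF(A, α): mine frequent itemsets of `dataset` at support α.

    `miner` is any frequent-itemset miner returning {itemset: support}.
    """
    return miner(dataset, alpha)

def union(r1, r2):
    # ⊔: itemsets frequent in either result set.
    return set(r1) | set(r2)

def intersection(r1, r2):
    # ⊓: itemsets frequent in both result sets.
    return set(r1) & set(r2)

def select(result, beta):
    # σβ: keep itemsets with support >= β (β > α). No re-mining is needed,
    # which is what makes this operator inexpensive.
    return {s: sup for s, sup in result.items() if sup >= beta}
```

Because σβ only filters an existing result, mining a row once at its lowest support level yields every higher-support cell of that row essentially for free.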

Query M-Table Representation

M-Table:

      F1     F2     F3
A    0.1    0.05    -
B    0.1     -     0.05
C     -     0.1    0.1
D    0.05   0.1    0.1

Query:

(SF(A,0.1) ⊓ SF(B,0.1) ⊓ SF(D,0.05)) ⊔

(SF(A,0.05) ⊓ SF(C,0.1) ⊓ SF(D,0.1)) ⊔

(SF(B,0.05) ⊓ SF(C,0.1) ⊓ SF(D,0.1))
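The translation from a query in this disjunctive normal form to an M-table is mechanical: each conjunctive term becomes a column, and each SF(dataset, support) inside it fills one cell. A sketch (the list-of-pairs query encoding is illustrative):

```python
def build_m_table(query):
    """Build an M-table from a query in disjunctive normal form.

    query: list of terms; each term is a list of (dataset, support) pairs,
    i.e. a conjunction of SF operations. Returns {dataset: {column: support}}.
    """
    table = {}
    for col, term in enumerate(query, start=1):
        for dataset, support in term:
            table.setdefault(dataset, {})["F%d" % col] = support
    return table

# The example query:
# (SF(A,0.1) ⊓ SF(B,0.1) ⊓ SF(D,0.05)) ⊔ (SF(A,0.05) ⊓ SF(C,0.1) ⊓ SF(D,0.1))
#   ⊔ (SF(B,0.05) ⊓ SF(C,0.1) ⊓ SF(D,0.1))
query = [
    [("A", 0.1), ("B", 0.1), ("D", 0.05)],
    [("A", 0.05), ("C", 0.1), ("D", 0.1)],
    [("B", 0.05), ("C", 0.1), ("D", 0.1)],
]
```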


CF Mining Operator

Recall:
• SF(A, α)
– Find frequent itemsets in dataset A with support α

• CF(A, α, X)
– Find frequent itemsets in dataset A with support α within the search space X
– X is the result of a previous SF or CF operation (a set of frequent itemsets)
– Equivalent to SF(A, α) ⊓ X
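One way to see the CF(A, α, X) ≡ SF(A, α) ⊓ X equivalence: every CF result must already be in X, so it suffices to count support only for the itemsets in X instead of searching the full itemset lattice; that restricted counting is what usually makes CF cheaper than SF. A sketch with transactions as Python sets (illustrative, not the paper's implementation):

```python
def CF(dataset, alpha, X):
    """CF(A, α, X): frequent itemsets of `dataset` at support α, restricted
    to the search space X (itemsets from a previous SF or CF operation).

    Equivalent to SF(A, α) ⊓ X, but only |X| candidates are counted
    rather than mining the whole dataset from scratch.
    """
    n = len(dataset)  # dataset: list of transaction sets
    result = {}
    for itemset in X:
        support = sum(1 for t in dataset if itemset <= t) / n
        if support >= alpha:
            result[itemset] = support
    return result
```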

Coloring the M-Table

      F1     F2     F3     F4     F5
A    0.1    0.1    0.05    -      -
B    0.1    0.1    0.05    -      -
C     -      -     0.1    0.1    0.1
D    0.05    -     0.1    0.1    0.1

Example coloring operations: SF(A, 0.05) and SF(C, 0.1).

…When the M-Table is covered, the query is solved.

Query Plan

CF(B, 0.1, SF(A, 0.1))

Query Plan Space

Phase 1: Color at least one cell in every column using the SF mining operator. (Independent of the mining results generated from any other mining operation.)

Phase 2: Color the remaining cells using the CF mining operator. (Dependent on results from previous SF or CF operations.)

Query Plan Space (cont’d)

Example:

      F1     F2     F3     F4     F5
A    0.1    0.1    0.05    -      -
B    0.1    0.1    0.05    -      -
C     0      0     0.1    0.1    0.1
D    0.05    -     0.1    0.1    0.1

Phase 1: SF(A, 0.1)

Phase 2: CF(A, 0.05, SF(C, 0.1))
         SF(C, 0.1)
         CF(B, 0.05, SF(A, 0.1) ⊔ SF(C, 0.1))
         CF(D, 0.05, (SF(A, 0.1) ⊔ SF(B, 0.1)) ⊔ SF(C, 0))
         CF(C, 0, SF(A, 0.1) ⊔ SF(B, 0.1))

Partial Orders and Equivalent Query Plans

Phase 2 of a Query Plan represented as a DAG

Two Query Plans are equivalent if their corresponding Partial Orders are the same.

Reducing Search Space in Phase 1

Observation:CF Mining Operator usually costs less than its

corresponding SF operatorHeuristic 1:In Phase 1, perform the set of SF operations with

the minimal cost so that every column has one cell colored.

      F1     F2     F3     F4     F5
A    0.1    0.1    0.05    -      -
B    0.1    0.1    0.05    -      -
C     0      0     0.1    0.1    0.1
D    0.05    -     0.1    0.1    0.1

Phase 1: SF(A, 0.1), SF(C, 0.1). Done.
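Heuristic 1 is essentially greedy weighted set cover: each candidate SF(d, α) covers the columns where row d has a cell with support ≥ α (higher-support cells come for free via the inexpensive σ operator), and operations are picked by best coverage per unit cost. A sketch with a stand-in cost function (the paper's actual cost model is described later):

```python
def phase1_greedy(m_table, sf_cost):
    """Pick a cheap set of SF operations so every column has a colored cell.

    m_table: {dataset: {column: support}}; sf_cost(dataset, alpha) estimates
    the cost of SF(dataset, alpha). Greedy weighted set cover: repeatedly
    take the operation with the best coverage per unit cost.
    """
    # Candidate operations: one per (row, support level) in the table.
    candidates = {(d, a) for d, cells in m_table.items() for a in cells.values()}
    uncovered = {c for cells in m_table.values() for c in cells}
    plan = []
    while uncovered:
        def gain(op):
            d, a = op
            # SF(d, a) covers every column where row d has support >= a.
            covered = {c for c, s in m_table[d].items() if s >= a} & uncovered
            return len(covered) / sf_cost(d, a) if covered else 0.0
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # remaining columns unreachable (malformed table)
        d, a = best
        plan.append(("SF", d, a))
        uncovered -= {c for c, s in m_table[d].items() if s >= a}
    return plan
```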

Reducing Search Space in Phase 2

Observation: Within a row, a CF operation can cover a single cell or multiple cells. Planning cost will be high if every combination of single cells is tested.

Heuristic 2: Specify a basic unit as the smallest cell-set a CF operator can cover.

We test three different granularities of basic units:
– Granularity One: A CF operation will color each cell individually.
– Granularity Two: Within a row, a CF operation will color all cells with the same support level.
– Granularity Three: Within a row, a CF operation will color all uncolored cells.

Example (row A):

      F1     F2     F3     F4     F5
A    0.1    0.1    0.05    -      -

– Granularity One: each cell ({F1}, {F2}, {F3}) is a basic unit.
– Granularity Two: cells with the same support level ({F1, F2} and {F3}) are basic units.
– Granularity Three: all uncolored cells of the row form one basic unit.

Query Plan With Granularity 2

Example:

      F1     F2     F3     F4     F5
A    0.1    0.1    0.05    -      -
B    0.1    0.1    0.05    -      -
C     0      0     0.1    0.1    0.1
D    0.05    -     0.1    0.1    0.1

Phase 1: SF(A, 0.1), SF(C, 0.1)

Phase 2: CF(B, 0.1, SF(A, 0.1))
         CF(D, 0.1, SF(C, 0.1))
         CF(A, 0.05, SF(C, 0.1) ⊓ SF(D, 0.1))
         CF(B, 0.05, SF(C, 0.1) ⊓ SF(D, 0.1))
         CF(D, 0.05, SF(A, 0.1) ⊓ SF(B, 0.1))
         CF(C, 0, SF(A, 0.1) ⊓ SF(B, 0.1))

Cost Estimation

• Cost estimation for SF
– Factors:
• The number of transactions: n
• The average length of the transactions: |I|
• The density of the dataset: d (entropy of correlations), using the density formula proposed in [Palmerini, 2004]
• The support level: s
– Formula (best of several we tested)
– Regression to determine the parameters

• Cost estimation for CF is based on the SF estimate


Query Plan Generation with Cost Estimation

• Dynamic Programming

• Branch & Bound
– Use the cost of Greedy as the initial min cost
– Enumerate query plans, pruning branches which begin to exceed the min cost
– Cheaper completed query plans update the min cost
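The Branch & Bound strategy above can be sketched generically; the plan and extension encodings here are illustrative, but the pruning rule (drop any partial plan whose accumulated cost already reaches the current min cost) is the one described:

```python
def branch_and_bound(initial_cost, extensions, op_cost, is_complete):
    """Branch & Bound search over query plans.

    initial_cost: cost of the Greedy plan (initial min cost).
    extensions(plan): operations that can extend a partial plan.
    op_cost(op): estimated cost of one mining operation.
    is_complete(plan): True when the plan answers the whole query.
    """
    best_plan, best_cost = None, initial_cost
    stack = [((), 0.0)]
    while stack:
        plan, cost = stack.pop()
        if cost >= best_cost:
            continue  # prune: this branch already exceeds the min cost
        if is_complete(plan):
            best_plan, best_cost = plan, cost  # cheaper completed plan
            continue
        for op in extensions(plan):
            stack.append((plan + (op,), cost + op_cost(op)))
    return best_plan, best_cost
```

Seeding best_cost with the Greedy plan's cost means the search never explores a branch that cannot beat the heuristic result.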

Results for Query Evaluation

Averages are across nine datasets, ranging from 500,000 to 2,000,000 transactions.

Greedy and cost-based approaches significantly reduce evaluation costs:
– Average of 10x speedup
– Up to 20x speedup on QUEST-6
– The cost-based algorithm reduces the mining cost of the query plan generated by the heuristic algorithm by an average of 20% per query (significantly improving 40% of the queries)

Average cost (in seconds) to evaluate a query plan

Results for Query Planning

Average Cost (in seconds) to Generate a Query Plan

Using granularities 2 and 3 significantly reduces planning cost.

Cost-based algorithms’ planning cost can be competitive with Greedy.

Conclusions

• Cost-based query optimization on multiple datasets is an important part of a Knowledge Discovery and Data Mining Management System (KDDMS)

• A long-term goal for data mining
– Interactive data mining
– New techniques in a database-type environment to support efficient data mining

Thank You

Density Formula

Palmerini et al., 2004. Statistical properties of transactional databases.


Frequent Itemset Cardinality Estimation

• Sampling

• Sketch Matrix
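A simple form of the sampling approach: mine a random subset of transactions at the same relative support, and use the number of frequent itemsets found as the cardinality estimate. A sketch with no error bounds (the estimator details and the sketch-matrix technique are not reproduced here):

```python
import random

def estimate_cardinality(transactions, minsup_fraction, sample_frac, miner, seed=0):
    """Estimate |SF(A, α)| by mining a random sample of A.

    miner(sample, minsup_count) -> collection of frequent itemsets.
    The count of frequent itemsets in the sample approximates the count
    in the full dataset at the same relative support.
    """
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * sample_frac))
    sample = rng.sample(transactions, k)
    minsup_count = max(1, int(minsup_fraction * len(sample)))
    return len(miner(sample, minsup_count))
```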