V Storage Manager

Shahram Ghandeharizadeh
Computer Science Department
University of Southern California


Traces

Make sure your persistent BDB is configured with 256 MB of memory.
With a trace, say 21, use its "21Objs.Save" to create and populate your persistent database. Subsequently, use its "Trace21.1KGet" to debug your software. Start with 1 thread and expand to 2, 3, and 4.
Try to make your software as efficient as possible. If it is too slow (maybe because of low byte hit rates), then you may not be able to run "Trace21.1MGet".

Questions

Will there be another release of the workload generator before Friday?
  I do not anticipate one unless there is a bug report.
Is there an obvious item missing from the current workload generator?
  Mandatory: Invocation of the method to report cache and byte hit rates.
  Optional: Dump the content of the cache to analyze the behavior of your cache replacement technique.

Hints

BDB-Disk is a full-fledged storage manager with a buffer pool, locking, crash-recovery, and index structures.
  Configure its buffer pool size to be 256 MB.

[Figure: components labeled V Functionalities, Cache Replacement, BDB-Disk, and BDB-Mem]

Hints

Your implementation may need to keep track of different counters. Example: count the number of requests issued (and the number of requests serviced from the main-memory instance of BDB) to compute the cache hit rate.
How to do this with multiple worker threads?
  The Interlocked functions provide a mechanism for synchronizing access to a variable that is shared by multiple threads.
  You may define a "long" variable and use InterlockedIncrement: "long cntr = 0; InterlockedIncrement(&cntr);"
  Make sure to include <windows.h>.

Hints

To compute byte hit rates, you need to maintain two counters and increment them by the size of the referenced object.
Use the "InterlockedExchangeAdd" function to perform an atomic addition of two 32-bit values.
  Example: to perform a = a + b atomically, call InterlockedExchangeAdd(&a, b);
  The first argument is a pointer to the variable being updated; the second is the value to add, passed by value.
Other Interlocked functions might be useful to you, such as InterlockedExchangePointer.
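The counters needed for the cache hit rate and the byte hit rate can be sketched as follows. This is only an illustration in Python, not the Win32 code the hints refer to: a `threading.Lock` stands in for the Interlocked functions, and the class name `HitRateCounters`, the helper `run_workers`, and the 100-byte object size are all assumptions made for the example. The reading of the "two counters" as bytes requested and bytes serviced from memory is also an assumption.

```python
import threading


class HitRateCounters:
    """Counters shared by all worker threads; one lock guards every update."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests = 0         # number of requests issued
        self.hits = 0             # requests serviced from the in-memory cache
        self.bytes_requested = 0  # total size of all referenced objects
        self.bytes_hit = 0        # total size of objects serviced from memory

    def record(self, size, hit):
        # The lock makes the group of read-modify-write updates atomic.
        with self._lock:
            self.requests += 1
            self.bytes_requested += size
            if hit:
                self.hits += 1
                self.bytes_hit += size

    def cache_hit_rate(self):
        with self._lock:
            return self.hits / self.requests if self.requests else 0.0

    def byte_hit_rate(self):
        with self._lock:
            if not self.bytes_requested:
                return 0.0
            return self.bytes_hit / self.bytes_requested


def run_workers(counters, num_threads, requests_per_thread):
    """Each worker records alternating hits and misses of 100-byte objects."""
    def worker():
        for i in range(requests_per_thread):
            counters.record(size=100, hit=(i % 2 == 0))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A single lock around all four counters keeps them mutually consistent when the rates are read; the Interlocked functions instead update one variable at a time without a lock.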

Hints

With invocation of methods, local variables are pushed on the stack of a thread.
  4 different threads invoking a method will have 4 separate (mutually exclusive) sets of the local variables declared by that method.

    void Foo() {
        char res[200];
        int cntr;
        ...
    }

A global variable is not part of the stack and must be protected when multiple threads are manipulating it. How?
  Consider making it a variable local to a method. Ask: Does this variable have to be global?
  Use critical sections.
  Manage memory.
Similarly, memory allocated from the heap (new/malloc) is not a part of the stack and must be managed.
  No memory leaks.

Hints

Consider an admission control technique.
  Without admission control: Every time an object is referenced and it is not in memory, you place it in memory.
  With admission control: Every time a disk-resident object is referenced, compare its Q value with the minimum Q value to see if it should be admitted into memory.
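The admission test can be sketched as below. The slides do not define how Q is computed, so this example treats Q as a number supplied by the caller and assumes that a higher Q means a more valuable object. The function names, the dictionary-based cache, and the capacity expressed as a number of objects (at least 1) rather than bytes are all hypothetical simplifications.

```python
def should_admit(candidate_q, cached_q, capacity):
    """Decide whether a disk-resident object may enter the cache.

    candidate_q: Q value of the referenced object (assumed: higher is better).
    cached_q:    dict mapping cached object ids to their Q values.
    capacity:    maximum number of cached objects, assumed to be >= 1.
    """
    if len(cached_q) < capacity:
        return True  # free space: nothing has to be evicted
    return candidate_q > min(cached_q.values())


def reference(obj_id, q, cached_q, capacity):
    """Process one reference; returns True on a cache hit."""
    if obj_id in cached_q:
        cached_q[obj_id] = q  # refresh the stored Q value
        return True
    if should_admit(q, cached_q, capacity):
        if len(cached_q) >= capacity:
            victim = min(cached_q, key=cached_q.get)
            del cached_q[victim]  # evict the object with the minimum Q
        cached_q[obj_id] = q
    return False
```

Without admission control, the `should_admit` check would be skipped and every miss would evict the minimum-Q object, even when the newcomer is less valuable than the victim.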

Fast Algorithms for Mining Association Rules (by R. Agrawal and R. Srikant)

Shahram Ghandeharizadeh
Computer Science Department
University of Southern California

Terminology

Objective: Discover association rules over basket data.
Example: 98% of customers who purchase tires and auto accessories also get automotive services done.
Motivation: Valuable for cross-marketing and attached mailing applications.
  Watch Googlezon, http://www.youtube.com/watch?v=AT9ho2G0N_Y
Requirements:
  Fast algorithms.
  Must manipulate large data sets.

Problem Statement

Terminology

Association rule X => Y has confidence c:
  Out of those transactions that contain X, c% also contain Y.
Association rule X => Y has support s:
  s% of transactions in D contain both X and Y.
Note:
  X => A doesn't mean X+Y => A.
    X+Y => A may not have minimum support.
  X => A and A => Z doesn't mean X => Z.
    X => Z may not have minimum confidence.

Example

I = {beer, chips, salsa, nail-polish, toothpaste, toilet-paper}
D = {T1, T2, T3, …, T9999999}
  T1 = {beer, chips, salsa}
  T2 = {beer, toilet-paper}
  T3 = {nail-polish, toothpaste}
TID is the unique identifier for each transaction.
If X = {beer} then both T1 and T2 contain X.
If X = {beer, chips} then T1 contains X.
If X = {beer, nail-polish} then no transaction contains X.
The rule {beer, chips} => {salsa} holds with confidence 90% if 90% of transactions that contain {beer, chips} also contain {salsa}.
  NOTE: {beer, chips} intersect {salsa} is empty, satisfying the constraint of the formal problem specification.
The rule {beer, chips} => {salsa} has support 75% if 75% of transactions contain {beer, chips, salsa}.

Example (Cont…)

Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}

What is the confidence in {nail-polish} => {tooth-paste}?
  100% because 5000 out of 5000 transactions that contain {nail-polish} also contain {tooth-paste}.
What is the confidence in {beer} => {salsa}?
  20% because 1000 out of 5000 transactions that contain {beer} also contain {salsa}.
What is the confidence in {salsa} => {chips}?
  100% because 6000 out of 6000 transactions that contain {salsa} also contain {chips}.
What is the confidence in {salsa} => {nail-polish}?
  5/6 (83.33%) because 5000 out of 6000 transactions that contain {salsa} also contain {nail-polish}.
  Note:
    Support for {salsa, nail-polish} is 50% (5000 out of 10000).
    Support for {salsa} is 60% (6000 out of 10000).
    Conf = 50% / 60% = 83.33%.
What is the confidence in {beer, chips} => {toilet-paper}?
  0% because none of the transactions satisfy this association rule.
What is the support of {beer} => {toilet-paper}?
  40% because 4000 transactions (out of 10,000) contain {beer, toilet-paper}.
What is the support of {chips} => {salsa}?
  60% because 6000 transactions (out of 10,000) contain {chips, salsa}.
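The support and confidence figures in this example can be checked mechanically. The sketch below is an illustration only: the function names and the representation of the data as (itemset, count) pairs are assumptions, chosen so that the 10,000 transactions need not be materialized one by one.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    matching = sum(count for items, count in transactions if itemset <= items)
    total = sum(count for _, count in transactions)
    return matching / total


def confidence(lhs, rhs, transactions):
    """support(lhs union rhs) / support(lhs); 0.0 when lhs never occurs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0
    return support(lhs | rhs, transactions) / lhs_support


# The three groups of identical transactions assumed in the example,
# stored as (itemset, number of occurrences).
TRANSACTIONS = [
    (frozenset({"beer", "chips", "salsa"}), 1000),
    (frozenset({"beer", "toilet-paper"}), 4000),
    (frozenset({"nail-polish", "tooth-paste", "chips", "salsa"}), 5000),
]
```

For instance, `confidence({"beer"}, {"salsa"}, TRANSACTIONS)` divides the support of {beer, salsa} (1000 of 10,000) by the support of {beer} (5000 of 10,000).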

Example Queries

Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}

Compute all association rules with support and confidence greater than 55%.
  Answer: {chips} => {salsa}, {salsa} => {chips}

Compute all association rules with support > 30% and confidence greater than 45%.
  Answer: {chips} => {salsa}, {salsa} => {chips}, {nail-polish} => {tooth-paste}, {tooth-paste} => {nail-polish}, {nail-polish} => {chips}, {nail-polish} => {salsa}, …

Divide the Problem into Two

1. Find all sets of items that have support above minimum support.
     Itemsets with minimum support are called large itemsets and all others small itemsets.
     Algorithms: Apriori and AprioriTid.
2. Use large itemsets to generate the desired rules.
     For every large itemset l, find all non-empty subsets of l. Let a denote one subset.
     For every subset a, output a rule of the form a => (l - a) if support(l) / support(a) is at least minconf.
     Say ABCD and AB are large itemsets.
     Compute conf = support(ABCD) / support(AB).
     If conf >= minconf, AB => CD holds.
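Step 2 can be sketched as follows, assuming step 1 has already produced the large itemsets with their supports. The function name `generate_rules` and the dictionary layout are hypothetical. The sketch enumerates only proper subsets, since taking a = l would leave an empty right-hand side.

```python
from itertools import combinations


def generate_rules(large_support, minconf):
    """Return (antecedent, consequent, confidence) for every qualifying rule.

    large_support maps each large itemset (a frozenset) to its support.
    Every subset of a large itemset is itself large, so its support is
    expected to be present in the same mapping.
    """
    rules = []
    for l, l_sup in large_support.items():
        for size in range(1, len(l)):  # non-empty proper subsets of l
            for a in map(frozenset, combinations(sorted(l), size)):
                conf = l_sup / large_support[a]
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules
```

With the large itemsets {chips}, {salsa}, and {chips, salsa}, each at 60% support, this yields exactly the two rules {chips} => {salsa} and {salsa} => {chips}, both with confidence 1.0.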

Conquer

Focus on item 1:
  1. Find all sets of items that have support above a pre-specified minimum support.
Example:
  Assume the following database:
  [Figure: example transaction database]
  Itemsets with minimum support of 2 transactions?

How?

General idea:
  Multiple passes over the data.
  First pass: count the support of individual items.
  Subsequent pass:
    Generate candidates using the previous pass's large itemsets.
    Go over the data and check the actual support of the candidates.
  Stop when no new large itemsets are found.

Make several passes of DB.
  Pass 1: Count item occurrences to determine the large 1-itemsets.
    Notice that {4} is missing!
  Pass 2: Compute the apriori-gen query and count the support for each candidate by making a pass of DB.
      SELECT p.item1, q.item1
      FROM L1 p, L1 q
      WHERE p.item1 < q.item1
    Drop those with support < minsup.
  Pass j (j >= 3): Compute the candidate set using the apriori-gen algorithm.
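The first two passes can be sketched as below. The slide's database table did not survive extraction, so the four-transaction `DATABASE` here is an assumption, chosen so that item 4 falls below the minimum support of 2 transactions, in line with the remark that {4} is missing. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Assumed example database; each transaction is a set of item ids.
DATABASE = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MINSUP = 2  # minimum support, as a number of transactions


def large_1_itemsets(db, minsup):
    """Pass 1: count each item and keep those meeting minsup."""
    counts = Counter(item for t in db for item in t)
    return {item: n for item, n in counts.items() if n >= minsup}


def large_2_itemsets(db, l1, minsup):
    """Pass 2: pair up large items (p.item1 < q.item1), then count in one pass."""
    candidates = list(combinations(sorted(l1), 2))
    counts = Counter()
    for t in db:
        for pair in candidates:
            if t.issuperset(pair):
                counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= minsup}
```

`combinations(sorted(l1), 2)` produces each unordered pair once, which is the effect of the `p.item1 < q.item1` condition in the SQL query.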

Apriori-gen Algorithm

Intuition: Generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass.
  How?
    Note that when k = 2, this query computes a large number of rows: every pair of distinct rows of L1, taken once because of the p.item1 < q.item1 condition. If L1 has 100 rows, the resulting number of rows is 4950 (100 × 99 / 2).
  What is the result when k = 3? What is the SQL command?
      INSERT INTO Ck
      SELECT p.item1, p.item2, q.item2
      FROM L2 p, L2 q
      WHERE p.item1 = q.item1 AND p.item2 < q.item2
  Result?
    [Figure: one table labeled "Computed by the SQL query" and one labeled "Computed by making a pass on the DB"]
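The join step generalizes to any k. The sketch below is an illustration under the assumption that each large (k-1)-itemset is stored as a sorted tuple; the function name `apriori_join` is hypothetical.

```python
def apriori_join(large_prev):
    """Join step: combine (k-1)-itemsets that agree on their first k-2 items.

    large_prev is a collection of sorted tuples, all of the same length k-1.
    """
    candidates = []
    ordered = sorted(large_prev)
    for p in ordered:
        for q in ordered:
            # Same prefix, and p's last item strictly smaller than q's,
            # mirroring "p.item1 = q.item1 AND p.item2 < q.item2" for k = 3.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    return candidates
```

For k = 2 the shared prefix is empty, so every pair of large items is joined once. For k = 3, only itemsets with the same first item are combined.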

Intuition

Any subset of a large itemset is large.
Therefore, to find large k-itemsets:
  Create candidates by combining large (k-1)-itemsets.
  Delete those that contain any subset that is not large.

Assumptions & Definitions

Items in each transaction are kept sorted in their lexicographic order.
Number of items in an itemset is its size.
An itemset of size k is a k-itemset.
Each itemset has a count field to store the support for this itemset.
L_k is the set of large k-itemsets (those with minimum support).
C_k is the set of candidate k-itemsets. Its members are potential members of L_k.

Page 52: V Storage Manager

Apriori AlgorithmApriori Algorithm

Page 53: V Storage Manager

Apriori Algorithm

Important detail: With apriori-gen, the join may compute itemsets some of whose subsets do NOT exist in L_{k-1}. Prune these by deleting each itemset c of C_k such that some (k-1)-subset of c is not in L_{k-1}.

Example: Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}

What is the output for C4?

Page 54: V Storage Manager

Apriori Algorithm

Important detail: With apriori-gen, the join may compute itemsets some of whose subsets do NOT exist in L_{k-1}. Prune these by deleting each itemset c of C_k such that some (k-1)-subset of c is not in L_{k-1}.

Example: Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}

After the join: { {1 2 3 4}, {1 3 4 5} }
  Subsets of {1 2 3 4} are { {1 2 3}, {2 3 4}, {1 3 4}, {1 2 4} }
  Subsets of {1 3 4 5} are { {1 3 4}, {1 3 5}, {3 4 5}, {1 4 5} }
  {3 4 5} and {1 4 5} are not in L3, so {1 3 4 5} is pruned and C4 = { {1 2 3 4} }.
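The join and prune steps of apriori-gen can be sketched in a few lines of Python and run on the L3 example from this slide. This is a minimal sketch, assuming itemsets are stored as sorted tuples; the function name `apriori_gen` is chosen here to mirror the slide.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets."""
    prev = set(large_prev)
    # Join: pair itemsets that agree on all but the last item; requiring
    # p[-1] < q[-1] keeps each result sorted and generated only once.
    joined = {
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Prune: drop any candidate with a (k-1)-subset that is not large.
    return {
        c for c in joined
        if all(s in prev for s in combinations(c, len(c) - 1))
    }

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(L3)
print(C4)
```

The join alone yields {1 2 3 4} and {1 3 4 5}; the prune step removes {1 3 4 5}, matching the example.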

Page 55: V Storage Manager

Correctness

Show that C_k ⊇ L_k.

Join step of apriori-gen:

    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Prune step of apriori-gen:

    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k

The join extends L_{k-1} with all items.
The condition p.item_{k-1} < q.item_{k-1} prevents duplications.
Apriori removes those candidates whose (k-1)-subsets are not in L_{k-1}.
Any subset of a large itemset must also be large.

Page 56: V Storage Manager

AIS & SETM

AIS & SETM generate candidate itemsets based on transactions.
  Apriori uses the large itemsets to generate larger itemsets.

Example: Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
  In pass 4, when encountering a transaction with items {1 2 3 4 5}, AIS and SETM generate the following five candidate sets:
    {1 2 3} => {1 2 3 4} and {1 2 3 5}
    {1 2 4} => {1 2 4 5}
    {1 3 4} => {1 3 4 5}
    {2 3 4} => {2 3 4 5}
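The transaction-driven candidate generation in this example can be sketched as follows. This is a simplified illustration rather than a full AIS or SETM implementation: it assumes each large itemset contained in the transaction is extended with the transaction's items that come later in lexicographic order, and it takes the transaction to be {1 2 3 4 5} so that all four extended itemsets are actually contained in it. The function name `extend_in_transaction` is hypothetical.

```python
def extend_in_transaction(large_prev, transaction):
    """Extend each large itemset found in the transaction by one later item."""
    items = set(transaction)
    candidates = set()
    for itemset in large_prev:
        # Only itemsets fully contained in the transaction are extended.
        if items.issuperset(itemset):
            for item in transaction:
                # Extend only with items after the itemset's last item.
                if item > itemset[-1]:
                    candidates.add(itemset + (item,))
    return candidates

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
generated = extend_in_transaction(L3, (1, 2, 3, 4, 5))
print(sorted(generated))
```

This produces five candidates, whereas apriori-gen on the same L3 keeps only {1 2 3 4}.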

Page 57: V Storage Manager

AprioriTid

Uses the database only once, to count support for 1-itemsets in pass 1.
Builds a storage set C^k.
  Its members have the form < TID, {X_k} >.
  The X_k are the potentially large k-itemsets in transaction TID.
  For k=1, C^1 is the database.
Uses C^k in pass k+1.
Advantages:
  C^k could be smaller than the database. If a transaction does not contain a candidate k-itemset, then C^k will not have an entry for this transaction.
  For large k, each entry may be smaller than the transaction. The transaction might contain only a few candidates.

Page 58: V Storage Manager

How? (Assume minsup = 2)

1. Make a pass over the DB and count item occurrences to determine the large 1-itemsets.

Page 59: V Storage Manager

How? (Assume minsup = 2)

1. Make a pass over the DB and count item occurrences to determine the large 1-itemsets.

Page 60: V Storage Manager

How? (Assume minsup = 2)

1. Make a pass over the DB and count item occurrences to determine the large 1-itemsets.

You are here!

Page 61: V Storage Manager

How? (Assume minsup = 2)

2. Construct C^1
Note that C^1 = Database

You are here!

Page 62: V Storage Manager

How? (Assume minsup = 2)

4. Compute C2 by invoking apriori-gen

You are here!

Page 63: V Storage Manager

How? (Assume minsup = 2)

9. Count the support of the candidates in C2

You are here!

Page 64: V Storage Manager

How? (Assume minsup = 2)

10. Compute C^2
Notice what happened to T100

You are here!

Page 65: V Storage Manager

How? (Assume minsup = 2)

12. Compute L2
All entries of C2 with support >= 2

You are here!

Page 66: V Storage Manager

How? (Assume minsup = 2)

Iter 2, Step 4: Compute C3

You are here!

?

Page 67: V Storage Manager

How? (Assume minsup = 2)

Iter 2, Step 4: Compute C3

You are here!

Page 68: V Storage Manager

How? (Assume minsup = 2)

Iter 2, Step 9: Count Support

You are here!

Page 69: V Storage Manager

How? (Assume minsup = 2)

Iter 2, Step 10: Compute C^3
Transactions 100 and 400 are gone!

You are here!

Page 70: V Storage Manager

How? (Assume minsup = 2)

Iter 2, Step 12: Generate L3

You are here!

Page 71: V Storage Manager

How? (Assume minsup = 2)

Iter 3, Step 4: Generate C4

You are here!

?

Page 72: V Storage Manager

How? (Assume minsup = 2)

Iter 3, Step 4: Generate C4
Since C4 is empty, terminate the algorithm.

You are here!

Empty set
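The walkthrough above can be reproduced with a minimal AprioriTid sketch. The step numbers in the comments follow the numbering used on these slides. The four-transaction database is an assumption, chosen so that the run agrees with the slide remarks about T100 and about transactions 100 and 400; the names `apriori_tid` and `bars` are hypothetical.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Join large (k-1)-itemsets, then prune candidates with a non-large subset."""
    prev = set(large_prev)
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in prev for s in combinations(c, len(c) - 1))}

def apriori_tid(database, minsup):
    """Return (large itemsets with support, list of C^1, C^2, ...)."""
    # Step 1: the only pass over the database counts the 1-itemsets.
    counts = {}
    for items in database.values():
        for i in items:
            counts[(i,)] = counts.get((i,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    result = dict(large)
    # Step 2: C^1 is the database, with each item viewed as a 1-itemset.
    c_bar = {tid: {(i,) for i in items} for tid, items in database.items()}
    bars = [c_bar]
    while large:
        candidates = apriori_gen(large)                    # step 4
        if not candidates:
            break                                          # empty C_k: terminate
        counts = dict.fromkeys(candidates, 0)
        next_bar = {}
        for tid, entry in c_bar.items():
            # c is in the transaction if the two (k-1)-itemsets obtained by
            # dropping its last and its second-to-last item are in the entry.
            c_t = {c for c in candidates
                   if c[:-1] in entry and c[:-2] + c[-1:] in entry}
            for c in c_t:
                counts[c] += 1                             # step 9
            if c_t:
                next_bar[tid] = c_t                        # step 10
        large = {c: n for c, n in counts.items() if n >= minsup}  # step 12
        result.update(large)
        c_bar = next_bar
        bars.append(c_bar)
    return result, bars

# Assumed database: TID -> sorted items.
db = {100: (1, 3, 4), 200: (2, 3, 5), 300: (1, 2, 3, 5), 400: (2, 5)}
large_itemsets, bars = apriori_tid(db, minsup=2)
print(bars[1][100])      # the C^2 entry for T100
print(sorted(bars[2]))   # TIDs that still have an entry in C^3
```

After pass 1 the database itself is never read again; each pass scans only the previous C^k, which is why entries that contain no candidate drop out.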

Page 73: V Storage Manager

Apriori versus AprioriTid

With Apriori and AprioriTid, the size of the candidate set, C_k, becomes smaller for larger values of k.

[Figure labels: C_k with Apriori & AprioriTid; L_k]

Page 74: V Storage Manager

Apriori versus AprioriTid

AprioriTid outperforms Apriori when:
  C^k fits in memory, and
  the distribution of the large itemsets has a long tail.

AprioriTid jumps because C^k does not fit in memory.

Page 75: V Storage Manager

Execution Time Per Pass

In the earlier passes, Apriori does better than AprioriTid.

AprioriTid is better than Apriori in later passes.

Page 76: V Storage Manager

Apriori & AprioriTid

Similarities; both:
  Use the same candidate generation procedure, counting the same itemsets.
  Observe a drop in the number of candidate itemsets in the later passes.

Differences:
  In each pass, Apriori examines every transaction. AprioriTid scans C^k, and the size of C^k becomes smaller than the database size in the later passes.
  When C^k fits in main memory, AprioriTid does not incur the cost of writing and reading C^k.

Page 77: V Storage Manager

AprioriHybrid

Key idea:
  Use Apriori in the initial passes.
  Switch to AprioriTid when it expects that C^k at the end of the pass will fit in memory.

How to estimate if C^k fits in memory in the next pass?
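One way to answer this question, assuming the heuristic described in the Apriori paper: at the end of a pass the candidate counts are known, so the size of C^k (in words) can be estimated as the sum of the candidate supports plus the number of transactions. The sketch below also assumes the paper's second condition, that the number of large itemsets fell relative to the previous pass. All names and the example numbers are hypothetical.

```python
def should_switch(candidate_support, num_transactions, memory_words,
                  num_large_now, num_large_prev):
    """Decide whether AprioriHybrid should switch to AprioriTid."""
    # Estimated size of C^k in words: one word per (transaction, candidate)
    # occurrence, plus one word per transaction for its TID.
    estimated_words = sum(candidate_support.values()) + num_transactions
    fits = estimated_words <= memory_words
    shrinking = num_large_now < num_large_prev
    return fits and shrinking

# Hypothetical counts observed at the end of a pass.
support = {(1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2}
decision = should_switch(support, num_transactions=4, memory_words=100,
                         num_large_now=3, num_large_prev=4)
print(decision)
```

The estimate uses only information Apriori already has at the end of the pass, so the check adds no extra scan of the database.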

Page 78: V Storage Manager

Cost of Switching

Switching in the last pass incurs the cost of constructing C^ without using it.
  In the kth pass, AprioriHybrid incurs the cost of constructing C^k+1.
  If there are no large (k+1)-itemsets (i.e., this is the last pass), the algorithm terminates.
    With Apriori, the algorithm also terminates without making a pass of the transactions.
    AprioriHybrid builds C^k+1 and then terminates.

Page 79: V Storage Manager

Comparison

AprioriHybrid is faster if there is a gradual decline in the size of C^k.

AprioriHybrid switched in the last pass!

Page 80: V Storage Manager

Comparison (Cont…)

If C^k remains large until nearly the end and then has an abrupt drop, then AprioriHybrid will be the same as Apriori.

Page 81: V Storage Manager

Question

Page 82: V Storage Manager

Question

Why is AprioriTid worse than Apriori?
  Is AprioriTid better than Apriori for some experiment reported in this paper? If not, then why?

Page 83: V Storage Manager

Answer

Why is AprioriTid worse than Apriori?
  C^k is large in the first few passes, killing the overall execution time.

Page 84: V Storage Manager

Characteristics

For a fixed collection of system parameters (e.g., minimum support level):
  Response time increases linearly as a function of the number of transactions.
  With a larger number of items (1000 versus 10,000), the execution time decreases a little as the average support for an item decreases. Fewer itemsets provide faster execution times.

Page 85: V Storage Manager

Rest of this Semester

Project is due at midnight on Friday, April 24.
Review for midterm on April 28th. 4 papers:
  Variant indexes.
  Access path selection.
  Overview of query optimization.
  Mining Association Rules.
Midterm 2 on April 30th.
Meeting with the teams during the 1st week of May. E-mail to schedule meeting to follow.