Data Mining

Chapter 6, Part 2: Mining Sequential Patterns (GSP: Generalized Sequential Patterns)

Arif Djunaidy, FTIF ITS

8/12/2019


Limitations of Previous Algorithm - 1

Absence of time constraints

Users often want to specify maximum and/or minimum time gaps between adjacent elements of a sequential pattern. They may want to specify that a customer supports a sequential pattern only if adjacent elements occur within a specified time interval, say three months.

For example, a book club probably does not care if someone bought "Foundation", followed by "Foundation and Empire" three years later.

Rigid definition of a transaction

For many applications, it does not matter if the items in an element of a sequential pattern were present in two different transactions, as long as the transaction-times of those transactions fall within some small time window. That is, each element of the pattern can be contained in the union of the items bought in a set of transactions, as long as the difference between the maximum and minimum transaction-times is at most the size of a sliding time window.

For example, if the book club specifies a time window of a week, a customer who ordered "Foundation" on Monday, "Ringworld" on Saturday, and then "Foundation and Empire" and "Ringworld Engineers" in a single order a few weeks later would still support the pattern "Foundation" and "Ringworld", followed by "Foundation and Empire" and "Ringworld Engineers".


Limitations of Previous Algorithm - 2

Absence of taxonomies

Many datasets have a user-defined taxonomy (is-a hierarchy) over the items in the data, and users want to find patterns that include items across different levels of the taxonomy.

With a taxonomy in which "Foundation" is-a Asimov, Asimov is-a Science Fiction, and "Perfect Spy" is-a Le Carre, a customer who bought "Foundation" followed by "Perfect Spy" would support the patterns "Foundation" followed by "Perfect Spy", Asimov followed by "Perfect Spy", Science Fiction followed by Le Carre, etc.


Definition

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $T$ be a directed acyclic graph (DAG) on the literals. An edge in $T$ represents an is-a relationship, and $T$ represents a set of taxonomies. If there is an edge in $T$ from $p$ to $c$, we call $p$ a parent of $c$ and $c$ a child of $p$ ($p$ represents a generalization of $c$).

The taxonomy is modeled as a DAG rather than a tree to allow for multiple taxonomies. We call $x$ an ancestor of $y$ (and $y$ a descendant of $x$) if there is an edge from $x$ to $y$ in transitive-closure($T$).

An itemset is a non-empty set of items.

A sequence is an ordered list of itemsets. We denote a sequence $s$ by $\langle s_1 s_2 \ldots s_n \rangle$, where each $s_j$ is an itemset (also called an element of the sequence).

An item can occur only once in an element of a sequence, but can occur multiple times in different elements.

A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is a subsequence of another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.

For example, the sequence ⟨(3) (4 5) (8)⟩ is contained in ⟨(7) (3 8) (9) (4 5 6) (8)⟩, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). However, the sequence ⟨(3) (5)⟩ is not contained in ⟨(3 5)⟩ (and vice versa).
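The subsequence test can be illustrated with a minimal Python sketch. The representation (a sequence as a list of sets) and the function name `is_subsequence` are assumptions made for this illustration; a greedy left-to-right scan is sufficient because matching each element as early as possible never rules out a later match.

```python
def is_subsequence(a, b):
    """Return True if sequence a is a subsequence of sequence b.

    Both sequences are lists of sets; each set is one element (itemset).
    """
    pos = 0  # next position in b that may still be matched
    for element in a:
        # Advance to the earliest remaining element of b that is a superset.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the matched indices i_1 < i_2 < ... must strictly increase
    return True


a = [{3}, {4, 5}, {8}]
b = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(is_subsequence(a, b))                  # True
print(is_subsequence([{3}, {5}], [{3, 5}]))  # False
print(is_subsequence([{3, 5}], [{3}, {5}]))  # False
```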


Input

A database $D$ of sequences, called data-sequences, is given as input.

Each data-sequence is a list of transactions, ordered by increasing transaction-time.

A transaction has the following fields: sequence-id, transaction-id, transaction-time, and the items present in the transaction.

Items in a transaction are assumed to be leaves in $T$.

For simplicity, no data-sequence is assumed to have more than one transaction with the same transaction-time, and the transaction-time is used as the transaction-identifier.

Quantities of items in a transaction are not considered.


Support - 1

The support count (or simply support) for a sequence is defined as the fraction of total data-sequences that "contain" this sequence.

Although the word "contains" is not strictly accurate once we incorporate taxonomies, it captures the spirit of when a data-sequence contributes to the support of a sequential pattern.

The definition of when a data-sequence contains a sequence is given below, starting with the definition as in chapter 6-1, and then adding taxonomies, sliding windows, and time constraints.

As in chapter 6-1: in the absence of taxonomies, sliding windows and time constraints, a data-sequence contains a sequence $s$ if $s$ is a subsequence of the data-sequence.


Support - 2

Plus taxonomies:

A transaction $T$ contains an item $x \in I$ if $x$ is in $T$ or $x$ is an ancestor of some item in $T$.

A transaction $T$ contains an itemset $y \subseteq I$ if $T$ contains every item in $y$.

A data-sequence $d = \langle d_1 \ldots d_m \rangle$ contains a sequence $s = \langle s_1 \ldots s_n \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $s_1$ is contained in $d_{i_1}$, $s_2$ is contained in $d_{i_2}$, ..., $s_n$ is contained in $d_{i_n}$.

If there is no taxonomy, this definition degenerates into a simple subsequence test (as in chapter 6-1).
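A small Python sketch of containment under a taxonomy follows. The `PARENTS` map is a hypothetical fragment of the book taxonomy, restricted to the is-a links mentioned in the text, and all function names are assumptions of this example.

```python
# Hypothetical taxonomy fragment: child -> list of parents (a DAG, so a
# child may have several parents).
PARENTS = {
    "Foundation": ["Asimov"],
    "Asimov": ["Science Fiction"],
    "Perfect Spy": ["Le Carre"],
}


def ancestors(item, parents=PARENTS):
    """All ancestors of item, i.e. everything reachable through is-a edges."""
    result, stack = set(), list(parents.get(item, []))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents.get(p, []))
    return result


def transaction_contains(transaction, itemset, parents=PARENTS):
    """A transaction contains an itemset if every item of the itemset is in
    the transaction or is an ancestor of some item in the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parents)
    return set(itemset) <= extended


def sequence_contains(data_sequence, pattern, parents=PARENTS):
    """Subsequence test where 'is contained in' uses the taxonomy."""
    pos = 0
    for element in pattern:
        while pos < len(data_sequence) and not transaction_contains(
            data_sequence[pos], element, parents
        ):
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1
    return True


d = [{"Foundation"}, {"Perfect Spy"}]
print(sequence_contains(d, [{"Asimov"}, {"Perfect Spy"}]))        # True
print(sequence_contains(d, [{"Science Fiction"}, {"Le Carre"}]))  # True
```

With an empty `parents` map the function reduces to the plain subsequence test.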



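Support - 3

Plus sliding windows:

The sliding window generalization relaxes the definition of when a data-sequence contributes to the support of a sequence:

- By allowing a set of transactions to contain an element of a sequence, and

- As long as the difference in transaction-times between the transactions in the set does not exceed the user-specified window-size

Formally, a data-sequence $d = \langle d_1 \ldots d_m \rangle$ contains a sequence $s = \langle s_1 \ldots s_n \rangle$ if there exist integers $l_1 \le u_1 < l_2 \le u_2 < \ldots < l_n \le u_n$ such that:

1. $s_i$ is contained in $\bigcup_{k=l_i}^{u_i} d_k$, for $1 \le i \le n$, and

2. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_i}) \le \text{window-size}$, for $1 \le i \le n$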



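Support - 4

Plus time constraints:

Time constraints restrict the gap between the sets of transactions that contain consecutive elements of the sequence.

Given user-specified window-size, max-gap and min-gap, a data-sequence $d = \langle d_1 \ldots d_m \rangle$ contains a sequence $s = \langle s_1 \ldots s_n \rangle$ if there exist integers $l_1 \le u_1 < l_2 \le u_2 < \ldots < l_n \le u_n$ such that:

1. $s_i$ is contained in $\bigcup_{k=l_i}^{u_i} d_k$, for $1 \le i \le n$

2. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_i}) \le \text{window-size}$, for $1 \le i \le n$

3. $\text{transaction-time}(d_{l_i}) - \text{transaction-time}(d_{u_{i-1}}) > \text{min-gap}$, for $2 \le i \le n$

4. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_{i-1}}) \le \text{max-gap}$, for $2 \le i \le n$

The first two conditions are the same as in the earlier definition of when a data-sequence contains a pattern. The third condition specifies the minimum time-gap constraint, and the last specifies the maximum time-gap constraint.

$\text{transaction-time}(d_{l_i})$ is referred to as start-time($s_i$), and $\text{transaction-time}(d_{u_i})$ is referred to as end-time($s_i$). start-time($s_i$) and end-time($s_i$) correspond to the first and last transaction-times of the set of transactions that contain $s_i$.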
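To make the combined definition concrete, here is a deliberately naive Python sketch that searches exhaustively for index ranges satisfying the window-size, min-gap and max-gap constraints. It is a direct transcription of the definition under the assumption that the window-size and max-gap bounds are inclusive and the min-gap bound is strict; it is not the counting procedure used by GSP. The function name, the data layout and the book-club data are assumptions of this example.

```python
def contains(d, s, window_size=0, min_gap=0, max_gap=float("inf")):
    """Exhaustive check of the containment definition.

    d: list of (transaction_time, set_of_items), sorted by transaction_time.
    s: list of sets (the elements of the candidate sequence).
    """
    m, n = len(d), len(s)

    def search(i, prev_l, prev_u):
        # Try to place element s[i] on transactions d[l..u] with l > prev_u.
        if i == n:
            return True
        for l in range(prev_u + 1, m):
            # min-gap: start-time(s_i) - end-time(s_{i-1}) must exceed min_gap
            if i > 0 and d[l][0] - d[prev_u][0] <= min_gap:
                continue
            union = set()
            for u in range(l, m):
                # window-size: end-time(s_i) - start-time(s_i) <= window_size
                if d[u][0] - d[l][0] > window_size:
                    break
                # max-gap: end-time(s_i) - start-time(s_{i-1}) <= max_gap
                if i > 0 and d[u][0] - d[prev_l][0] > max_gap:
                    break
                union |= d[u][1]
                if s[i] <= union and search(i + 1, l, u):
                    return True
        return False

    return search(0, -1, -1)


# Hypothetical book-club customer; times are days.
d = [
    (1, {"Foundation"}),
    (6, {"Ringworld"}),
    (30, {"Foundation and Empire", "Ringworld Engineers"}),
]
s = [{"Foundation", "Ringworld"}, {"Foundation and Empire", "Ringworld Engineers"}]
print(contains(d, s, window_size=7))              # True
print(contains(d, s, window_size=0))              # False
print(contains(d, s, window_size=7, max_gap=20))  # False
```

With `window_size=0`, `min_gap=0` and no `max_gap`, the check reduces to the plain subsequence test, since transaction-times within a data-sequence are distinct.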



Problem Definition

Given a database $D$ of data-sequences, a taxonomy $T$, user-specified min-gap and max-gap time constraints, and a user-specified sliding-window size, the problem of mining sequential patterns is to find all sequences whose support is greater than the user-specified minimum support. Each such sequence represents a sequential pattern, also called a frequent sequence.

Given a frequent sequence $s = \langle s_1 s_2 \ldots s_n \rangle$, it is often useful to know the "support relationship" between elements of the sequence (i.e., what fraction of the data-sequences that support $\langle s_1 \ldots s_i \rangle$ support the entire sequence $s$).

Note that if there is no taxonomy, min-gap = 0, max-gap = $\infty$ and window-size = 0, we get the notion of sequential patterns as explained in chapter 6-1, where there are no time constraints and the items in an element come from a single transaction.



Illustration Example - 1

Consider the data-sequences shown below. For simplicity, we have assumed that the transaction-times are integers; they could represent, for instance, the number of days after January 1.

Assume that the minimum support has been set to 2 data-sequences.



Illustration Example - 2

With the problem definition explained in chapter 6-1, the only 2-element sequential patterns are:

⟨(Ringworld) (Ringworld Engineers)⟩ and ⟨(Foundation) (Ringworld Engineers)⟩

Setting a sliding-window of 7 days adds the pattern ⟨(Foundation, Ringworld) (Ringworld Engineers)⟩

(Foundation and Ringworld are present within a period of 7 days in data-sequence C1.)

Further setting a max-gap of 30 days results in all patterns being dropped, since they are no longer supported by customer C2 (they are only supported by customer C1).

If we only add the taxonomy, but no sliding-window or time constraints, one of the patterns added is:

⟨(Foundation) (Asimov)⟩



GSP: Algorithm

The algorithm makes multiple passes over the data.

The first pass determines the support of each item (i.e., the number of data-sequences that include the item).

- At the end of the first pass, the algorithm knows which items are frequent (i.e., items having minimum support)

- Each such item yields a 1-element frequent sequence consisting of that item

Each subsequent pass starts with a seed set: the frequent sequences found in the previous pass.

- The seed set is used to generate new potentially frequent sequences, called candidate sequences

- Each candidate sequence has one more item than a seed sequence, so all the candidate sequences in a pass have the same number of items

- The support for these candidate sequences is counted during the pass over the data

- At the end of the pass, the algorithm determines which of the candidate sequences are actually frequent

- These frequent candidates become the seed for the next pass

The algorithm terminates when there are no frequent sequences at the end of a pass, or when no candidate sequences are generated.
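The pass structure can be sketched in Python as follows. This is only a skeleton of the level-wise loop: the candidate generator and the containment test are passed in as parameters (`generate_candidates` and `contains` are hypothetical names), support is kept as an absolute count of data-sequences, and a data-sequence is assumed to be a list of item sets with the transaction-times omitted.

```python
from collections import Counter


def frequent_items(database, min_support):
    """First pass: count, for every item, the data-sequences that include it."""
    counts = Counter()
    for d in database:
        # An item is counted once per data-sequence, however often it occurs.
        counts.update(set().union(*d))
    return sorted(x for x, c in counts.items() if c >= min_support)


def gsp(database, min_support, generate_candidates, contains):
    """Level-wise skeleton of the multi-pass loop.

    generate_candidates(seed) -> iterable of candidate sequences
    contains(data_sequence, candidate) -> bool
    Sequences are tuples of frozensets so that they can be dictionary keys.
    """
    seed = [(frozenset([x]),) for x in frequent_items(database, min_support)]
    frequent = list(seed)
    while seed:
        candidates = list(generate_candidates(seed))
        if not candidates:
            break  # no candidate sequences were generated
        counts = Counter()
        for d in database:  # one pass over the data
            for c in candidates:
                if contains(d, c):
                    counts[c] += 1
        # Frequent candidates become the seed set for the next pass.
        seed = [c for c in candidates if counts[c] >= min_support]
        frequent.extend(seed)
    return frequent
```

Keeping candidate generation and counting behind two function parameters mirrors the two key details that distinguish the algorithm; any join/prune procedure and any containment test with the signatures above can be plugged in.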



GSP: Two Key Details

Candidate generation: how candidate sequences are generated before the pass begins.

- The goal is to generate as few candidates as possible while maintaining completeness

Counting candidates: how the support count for the candidate sequences is determined.

This discussion first considers time constraints and sliding windows, but not taxonomies.

Extensions required to handle taxonomies are described later.



GSP: Candidate Generation - 1

A sequence with $k$ items is denoted as a $k$-sequence.

- If an item occurs multiple times in different elements of a sequence, each occurrence contributes to the value of $k$

Let $L_k$ denote the set of all frequent $k$-sequences and $C_k$ denote the set of candidate $k$-sequences.

Problem definition: given $L_{k-1}$ (the set of all frequent $(k-1)$-sequences), generate a superset of the set of all frequent $k$-sequences.

Definition: given a sequence $s = \langle s_1 s_2 \ldots s_n \rangle$ and a subsequence $c$, $c$ is a contiguous subsequence of $s$ if any of the following conditions holds:

1. $c$ is derived from $s$ by dropping an item from either $s_1$ or $s_n$

2. $c$ is derived from $s$ by dropping an item from an element $s_i$ which has at least 2 items

3. $c$ is a contiguous subsequence of $c'$, and $c'$ is a contiguous subsequence of $s$

Example: consider the sequence $s$ = ⟨(1, 2) (3, 4) (5) (6)⟩

- Sequences ⟨(2) (3, 4) (5)⟩, ⟨(1, 2) (3) (5) (6)⟩ and ⟨(3) (5)⟩ are some of the contiguous subsequences of $s$

- But ⟨(1, 2) (3, 4) (6)⟩ and ⟨(1) (5) (6)⟩ are not
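The recursive definition of a contiguous subsequence can be checked with a short Python sketch that enumerates all contiguous subsequences by repeatedly applying the single-item drops. The tuple-of-frozensets representation and the function names are assumptions of this example.

```python
def drop_one(s):
    """Sequences obtained from s by one allowed drop: an item from the first
    or last element, or an item from any element with at least 2 items."""
    out = set()
    n = len(s)
    for i, element in enumerate(s):
        if i in (0, n - 1) or len(element) >= 2:
            for x in element:
                rest = element - {x}
                # An element that becomes empty disappears from the sequence.
                new = s[:i] + ((rest,) if rest else ()) + s[i + 1:]
                if new:
                    out.add(new)
    return out


def contiguous_subsequences(s):
    """All contiguous subsequences of s (transitive closure of drop_one)."""
    seen, frontier = set(), {s}
    while frontier:
        next_frontier = set()
        for t in frontier:
            for c in drop_one(t):
                if c not in seen:
                    seen.add(c)
                    next_frontier.add(c)
        frontier = next_frontier
    return seen


fs = frozenset
s = (fs({1, 2}), fs({3, 4}), fs({5}), fs({6}))
subs = contiguous_subsequences(s)
print((fs({2}), fs({3, 4}), fs({5})) in subs)     # True
print((fs({3}), fs({5})) in subs)                 # True
print((fs({1, 2}), fs({3, 4}), fs({6})) in subs)  # False
```

The last check fails because the single-item element (5) sits in the middle of the sequence and can only be dropped after it has become the first or last element.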



GSP: Candidate Generation - 2

Join phase: generate candidate sequences by joining $L_{k-1}$ with $L_{k-1}$.

- A sequence $s_1$ joins with $s_2$ if the subsequence obtained by dropping the first item of $s_1$ is the same as the subsequence obtained by dropping the last item of $s_2$

- The candidate sequence generated by joining $s_1$ with $s_2$ is the sequence $s_1$ extended with the last item in $s_2$

- The added item becomes a separate element if it was a separate element in $s_2$; otherwise it becomes part of the last element of $s_1$

- When joining $L_1$ with $L_1$, add the item in $s_2$ both as part of an itemset and as a separate element, since both ⟨(x) (y)⟩ and ⟨(x, y)⟩ give the same sequence ⟨(y)⟩ on deletion of the first item

Prune phase: delete candidate sequences that have a contiguous $(k-1)$-subsequence whose support count is less than the minimum support.

- If there is no max-gap constraint, also delete candidate sequences that have any subsequence without minimum support


GSP: Candidate Generation - Example

In the join phase, the sequence ⟨(1, 2) (3)⟩ joins with ⟨(2) (3, 4)⟩ to generate ⟨(1, 2) (3, 4)⟩, and with ⟨(2) (3) (5)⟩ to generate ⟨(1, 2) (3) (5)⟩.

The remaining sequences do not join with any sequence in $L_3$. For instance, ⟨(1, 2) (4)⟩ does not join with any sequence since there is no sequence of the form ⟨(2) (4, x)⟩ or ⟨(2) (4) (x)⟩.

In the prune phase, ⟨(1, 2) (3) (5)⟩ is dropped since its contiguous subsequence ⟨(1) (3) (5)⟩ is not in $L_3$.

Table: $L_3$ and $C_4$ after the join and prune phases
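The join and prune phases can be sketched in Python as below. Several things here are assumptions of the example: sequences are tuples of sorted item tuples (so "first item" and "last item" are well defined), the special case of joining $L_1$ with $L_1$ is left out, only the contiguous-subsequence pruning rule is applied, and the set `L3` is a hypothetical input chosen to be consistent with the sequences named in the example, since the table itself is not reproduced in the text.

```python
def drop_first(s):
    """Drop the first item of the first element."""
    head = s[0][1:]
    return ((head,) if head else ()) + s[1:]


def drop_last(s):
    """Drop the last item of the last element."""
    tail = s[-1][:-1]
    return s[:-1] + ((tail,) if tail else ())


def join(s1, s2):
    """Join s1 with s2, or return None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    x = s2[-1][-1]
    if len(s2[-1]) == 1:  # x was a separate element in s2
        return s1 + ((x,),)
    return s1[:-1] + (s1[-1] + (x,),)  # x joins the last element of s1


def one_item_drops(s):
    """Contiguous subsequences of s that have exactly one item fewer."""
    n = len(s)
    for i, e in enumerate(s):
        if i in (0, n - 1) or len(e) >= 2:
            for j in range(len(e)):
                rest = e[:j] + e[j + 1:]
                yield s[:i] + ((rest,) if rest else ()) + s[i + 1:]


def generate_candidates(frequent_prev):
    """Join phase followed by the contiguous-subsequence prune phase."""
    prev = set(frequent_prev)
    joined = set()
    for a in prev:
        for b in prev:
            c = join(a, b)
            if c is not None:
                joined.add(c)
    return {c for c in joined if all(sub in prev for sub in one_item_drops(c))}


# Hypothetical L3, consistent with the sequences named in the example.
L3 = [
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
]
print(generate_candidates(L3))  # {((1, 2), (3, 4))}
```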


GSP: Counting Candidates - 1

Reducing the number of candidates that need to be checked

While making a pass: read one data-sequence at a time, and increment the support count of candidates contained in the data-sequence.

Thus, given a set of candidate sequences $C$ and a data-sequence $d$, find all sequences in $C$ that are contained in $d$. The following 2 techniques can be used to solve this problem:

1. Use a hash-tree data structure to reduce the number of candidates in $C$ that are checked for a data-sequence

- A node of the hash-tree either contains a list of sequences (a leaf node) or a hash table (an interior node)

- In an interior node, each non-empty bucket of the hash table points to another node

- The root of the hash-tree is defined to be at depth 1

- An interior node at depth $p$ points to nodes at depth $p+1$

2. Transform the representation of the data-sequence $d$ so that the process of finding whether a specific candidate is a subsequence of $d$ can be done efficiently (see example later).
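One possible form of the second technique is sketched below in Python: the data-sequence is turned into an index from each item to the sorted list of its transaction-times, which makes it cheap to look up where an element can next occur. The exact transformation is not spelled out in the text, so the index layout, the function names `build_index` and `first_occurrence`, the inclusive `not_before` bound, and the sample data-sequence are all assumptions of this example.

```python
from bisect import bisect_left


def build_index(d):
    """Map each item to the sorted list of transaction-times where it occurs.

    d: list of (transaction_time, set_of_items), sorted by transaction_time.
    """
    index = {}
    for t, items in d:
        for x in items:
            index.setdefault(x, []).append(t)
    return index


def first_occurrence(index, element, not_before, window_size=0):
    """Find (start_time, end_time) of an occurrence of `element` whose
    transactions all have time >= not_before and span at most window_size.
    Returns None if there is no such occurrence."""
    lo = not_before
    while True:
        found = []
        for x in element:
            times = index.get(x, [])
            k = bisect_left(times, lo)  # first time >= lo for this item
            if k == len(times):
                return None
            found.append(times[k])
        start, end = min(found), max(found)
        if end - start <= window_size:
            return start, end
        # Too spread out: earlier times cannot share a window ending at `end`.
        lo = end - window_size


# Hypothetical data-sequence.
d = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
     (65, {3}), (90, {2, 4}), (95, {6})]
index = build_index(d)
print(first_occurrence(index, {1, 2}, 15))                # (50, 50)
print(first_occurrence(index, {2, 6}, 0, window_size=5))  # (90, 95)
```

Each retry strictly raises the lower bound `lo`, so the loop ends after at most as many rounds as there are transaction-times in the index.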


Checking whether a data-sequence contains a specific sequence

Let $d$ be a data-sequence and $s = \langle s_1 \ldots s_n \rangle$ be a candidate sequence.

Contains test algorithm:

- The algorithm for checking if the data-sequence $d$ contains a candidate sequence $s$ alternates between 2 phases: forward and backward

- The algorithm starts in the forward phase from the first element

- The algorithm assumes the existence of a procedure that finds the first occurrence of an element of $s$ in $d$ after a given time

Forward phase:

- The algorithm finds successive elements of $s$ in $d$ as long as the difference between the end-time of the element just found and the start-time of the previous element does not exceed max-gap

- Recall that for an element $s_i$, start-time($s_i$) and end-time($s_i$) correspond to the first and last transaction-times of the set of transactions that contain $s_i$

- If the difference is more than max-gap, the algorithm switches to the backward phase

- If an element is not found, the data-sequence does not contain $s$


Backward phase:

- The algorithm backtracks and "pulls up" previous elements

- If $s_i$ is the current element and end-time($s_i$) = $t$, the algorithm finds the first set of transactions containing $s_{i-1}$ whose transaction-times are after $t$ - max-gap

- The start-time for $s_{i-1}$ (after $s_{i-1}$ is pulled up) could be after the end-time for $s_i$

- Pulling up $s_{i-1}$ may necessitate pulling up $s_{i-2}$, because the max-gap constraint between $s_{i-1}$ and $s_{i-2}$ may no longer be satisfied

- The algorithm moves backwards until either the max-gap constraint between the element just pulled up and the previous element is satisfied, or the first element has been pulled up

- The algorithm then switches to the forward phase, finding elements of $s$ in $d$ starting from the element after the last element pulled up

- If any element cannot be pulled up (that is, there is no subsequent set of transactions which contain the element), the data-sequence does not contain $s$

The procedure of switching between the backward and forward phases is repeated until all the elements are found.

Although the algorithm moves back and forth among the elements of $s$, it terminates: for any element $s_i$, the algorithm always checks whether a later set of transactions contains $s_i$, so the transaction-times for an element always increase.


GSP: Counting Candidates - Example /1

Consider the case when max-gap = 30, min-gap = 5, and window-size = 0.

For the candidate sequence ⟨(1, 2) (3) (4)⟩:

1. Element (1, 2) is first found at transaction-time = 10, then element (3) is found at time = 45

2. Since the gap between those elements (35 days) exceeds max-gap, (1, 2) is pulled up

3. The first occurrence of (1, 2) is searched for after time 15, because end-time((3)) = 45 and max-gap = 30; even if (1, 2) occurs at some time before 15, it still will not satisfy the max-gap constraint

4. Element (1, 2) is found at time 50. Since this is the first element, we do not have to check whether the max-gap constraint between (1, 2) and the element before it is satisfied

5. We now move to the forward phase


GSP: Counting Candidates - Example /2

Consider the case when max-gap = 30, min-gap = 5, and window-size = 0.

For the candidate sequence ⟨(1, 2) (3) (4)⟩:

5. ... We now move to the forward phase

6. Since (3) no longer occurs more than 5 days after (1, 2), the next occurrence of (3) after time 55 is searched for; element (3) is found at time 65

7. Since the max-gap constraint between (3) and (1, 2) is satisfied, we continue to move forward and find (4) at time 90

8. Since the max-gap constraint between (4) and (3) is satisfied, the process stops.
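The walk-through can be reproduced with a minimal Python sketch of the forward/backward test. It is restricted to window-size = 0 (so each element lies in a single transaction and its start-time equals its end-time), it treats the max-gap bound as inclusive when pulling an element up, and the data-sequence `d` is an assumed input chosen to match the transaction-times mentioned in the example, since the original table is not reproduced in the text. The function names are hypothetical.

```python
def first_time(d, element, strictly_after=float("-inf"), at_least=float("-inf")):
    """Earliest transaction-time t with t > strictly_after and t >= at_least
    whose transaction contains every item of `element`; None if none exists."""
    for t, items in d:
        if t > strictly_after and t >= at_least and element <= items:
            return t
    return None


def find_occurrence(d, s, min_gap, max_gap):
    """Forward/backward contains test for window-size = 0.

    Returns the list of transaction-times matched to the elements of s,
    or None if the data-sequence d does not contain s.
    """
    n = len(s)
    times = [float("-inf")] * n  # current time per element; only ever increases
    i = 0
    while i < n:
        # Forward phase: s[i] must start more than min_gap after s[i-1],
        # and never earlier than a time already established for s[i].
        lower = times[i - 1] + min_gap if i > 0 else float("-inf")
        t = first_time(d, s[i], strictly_after=lower, at_least=times[i])
        if t is None:
            return None
        times[i] = t
        # Backward phase: pull up earlier elements while max-gap is violated.
        j = i
        while j > 0 and times[j] - times[j - 1] > max_gap:
            t = first_time(d, s[j - 1], at_least=times[j] - max_gap)
            if t is None:
                return None
            times[j - 1] = t
            j -= 1
        i = j + 1  # resume forward after the last element pulled up
    return times


# Assumed data-sequence matching the times used in the example.
d = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
     (65, {3}), (90, {2, 4}), (95, {6})]
s = [{1, 2}, {3}, {4}]
print(find_occurrence(d, s, min_gap=5, max_gap=30))  # [50, 65, 90]
```

The run follows the steps above: (1, 2) at 10 and (3) at 45 violate max-gap, (1, 2) is pulled up to 50, (3) is then re-found at 65, and (4) is found at 90.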


GSP: Taxonomies - 1

The basic approach is to replace each data-sequence $d$ with an "extended-sequence" $d'$, where each transaction $d'_i$ of $d'$ contains the items in the corresponding transaction $d_i$ of $d$, as well as all the ancestors of each item in $d_i$.

For example:

With the book taxonomy introduced earlier, a data-sequence ⟨(Foundation, Ringworld) (Second Foundation)⟩ would be replaced with the extended-sequence ⟨(Foundation, Ringworld, Asimov, Niven, Science Fiction) (Second Foundation, Asimov, Science Fiction)⟩.

GSP is then executed on those extended-sequences.
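Building an extended-sequence can be sketched in a few lines of Python. The `PARENTS` map is a hypothetical fragment of the taxonomy, limited to the is-a links implied by the example, and the function names are assumptions.

```python
# Hypothetical taxonomy fragment: child -> list of parents.
PARENTS = {
    "Foundation": ["Asimov"],
    "Second Foundation": ["Asimov"],
    "Ringworld": ["Niven"],
    "Asimov": ["Science Fiction"],
    "Niven": ["Science Fiction"],
}


def ancestors(item, parents=PARENTS):
    """All ancestors of item in the taxonomy DAG."""
    result, stack = set(), list(parents.get(item, []))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents.get(p, []))
    return result


def extend_sequence(d, parents=PARENTS):
    """Replace each transaction by its items plus all of their ancestors."""
    extended = []
    for transaction in d:
        items = set(transaction)
        for x in transaction:
            items |= ancestors(x, parents)
        extended.append(items)
    return extended


d = [{"Foundation", "Ringworld"}, {"Second Foundation"}]
for transaction in extend_sequence(d):
    print(sorted(transaction))
# ['Asimov', 'Foundation', 'Niven', 'Ringworld', 'Science Fiction']
# ['Asimov', 'Science Fiction', 'Second Foundation']
```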



GSP: Taxonomies - 2

There are 2 optimizations that can improve performance considerably:

1. Pre-compute the ancestors of each item and, before making a pass over the data, drop ancestors that are not in any of the candidates being counted

For example, if Ringworld, Second Foundation and Niven are not in any of the candidates being counted in the current pass, replace the data-sequence ⟨(Foundation, Ringworld) (Second Foundation)⟩ with the extended-sequence ⟨(Foundation, Asimov, Science Fiction) (Asimov, Science Fiction)⟩ instead of the extended-sequence ⟨(Foundation, Ringworld, Asimov, Niven, Science Fiction) (Second Foundation, Asimov, Science Fiction)⟩

2. Do not count sequential patterns with an element that contains both an item $x$ and its ancestor $y$, since the support for such a pattern will always be the same as the support for the sequential pattern without $y$ (any transaction that contains $x$ will also contain $y$)
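The first optimization can be sketched in Python as follows. The pre-computed ancestor table and the set of items occurring in the current candidates are hypothetical inputs; following the example, every item that occurs in no candidate is dropped, not only ancestors.

```python
# Hypothetical pre-computed table: item -> set of all its ancestors.
ANCESTORS = {
    "Foundation": {"Asimov", "Science Fiction"},
    "Second Foundation": {"Asimov", "Science Fiction"},
    "Ringworld": {"Niven", "Science Fiction"},
}


def extend_for_pass(d, candidate_items, ancestor_table=ANCESTORS):
    """Extended-sequence restricted to items that occur in some candidate
    counted in the current pass."""
    extended = []
    for transaction in d:
        items = set(transaction)
        for x in transaction:
            items |= ancestor_table.get(x, set())
        # Items that are in no candidate cannot affect any support count.
        extended.append(items & candidate_items)
    return extended


d = [{"Foundation", "Ringworld"}, {"Second Foundation"}]
candidate_items = {"Foundation", "Asimov", "Science Fiction"}
for transaction in extend_for_pass(d, candidate_items):
    print(sorted(transaction))
# ['Asimov', 'Foundation', 'Science Fiction']
# ['Asimov', 'Science Fiction']
```

Computing the table once and filtering per pass keeps the extended transactions small, which matters because every candidate check touches them.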


GSP: Taxonomies - 3

Incorporating taxonomies can result in many redundant sequential patterns (a related issue).

For example, suppose that the support of Asimov = 20%, the support of Foundation = 10%, and the support of the pattern ⟨(Asimov) (Ringworld)⟩ = 15%.

- Given that information, the support of the pattern ⟨(Foundation) (Ringworld)⟩ is expected to be 7.5%, since half the Asimovs are Foundations

- If the actual support of ⟨(Foundation) (Ringworld)⟩ is close to 7.5%, the pattern can be considered redundant

The essential idea is that, given a user-specified interest-level $I$, only patterns that have no ancestors, or patterns whose actual support is at least $I$ times their expected support (based on the support of their ancestors), are worth displaying.
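The arithmetic behind the expected support and the interest test can be written out as a small Python sketch. The function names, the actual support values and the interest-level 1.5 are hypothetical; the sketch covers only the case of a single item replaced by its ancestor.

```python
def expected_support(ancestor_pattern_support, item_support, ancestor_item_support):
    """Expected support of a pattern obtained by specializing one item:
    scale the ancestor pattern's support by the item's share of its ancestor."""
    return ancestor_pattern_support * item_support / ancestor_item_support


def is_interesting(actual_support, expected, interest_level):
    """Keep the pattern only if it is at least I times its expected support."""
    return actual_support >= interest_level * expected


# Asimov = 20%, Foundation = 10%, <(Asimov) (Ringworld)> = 15%
e = expected_support(0.15, 0.10, 0.20)
print(round(e, 4))                    # 0.075
print(is_interesting(0.08, e, 1.5))   # False: close to the expected 7.5%
print(is_interesting(0.12, e, 1.5))   # True
```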



End of Chapter 6-2
