Mining Frequent Patterns from Data Streams


Page 1: Mining Frequent Patterns from Data Streams

Mining Frequent Patterns from Data Streams

Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Charu Aggarwal and Jiawei Han, Jure Leskovec

Page 2: Mining Frequent Patterns from Data Streams

OUTLINE

- Data Streams
  - Characteristics of Data Streams
  - Key Challenges in Stream Data
- Frequent Pattern Mining over Data Streams
  - Counting Itemsets
  - Lossy Counting
  - Extensions

Page 3: Mining Frequent Patterns from Data Streams

Data Stream

- What is a data stream?
- A data stream is an ordered sequence of instances that, in many data stream mining applications, can be read only once (or a small number of times) using limited computing and storage capabilities.

Page 4: Mining Frequent Patterns from Data Streams

Data Streams

- Traditional DBMS
  - Data stored in finite, persistent data sets
- Data Streams
  - Continuous, ordered, changing, fast, huge amount
  - Managed by a Data Stream Management System (DSMS)

Page 5: Mining Frequent Patterns from Data Streams

DBMS versus DSMS

DBMS (persistent relations):
- One-time queries
- Random access
- "Unbounded" disk store
- Only current state matters
- No real-time services
- Relatively low update rate
- Data at any granularity
- Assumes precise data
- Access plan determined by query processor and physical DB design

DSMS (transient streams):
- Continuous queries
- Sequential access
- Bounded main memory
- Historical data is important
- Real-time requirements
- Possibly multi-GB arrival rate
- Data at fine granularity
- Data may be stale/imprecise
- Unpredictable/variable data arrival and characteristics

Page 6: Mining Frequent Patterns from Data Streams

Data Streams

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics of data streams: fast changing, requiring fast, real-time response

Page 7: Mining Frequent Patterns from Data Streams

Characteristics of Data Streams

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics:
  - Fast changing and requires fast, real-time response
  - Huge volumes of continuous data, possibly infinite
  - The data stream model captures nicely our data processing needs of today

Page 8: Mining Frequent Patterns from Data Streams

APPLICATIONS

Example: Freeboard.io - Dashboards For the Internet Of Things - https://freeboard.io/

Sensors, Weather, Stock Exchange, Self-Driving Cars, Trends, Tweets, Logs, Articles/News, Wikipedia Edits, ATM Transactions, Chats, Television, Earthquakes, Music Similarities, CO2 Levels, Car Tracking, ...

Page 9: Mining Frequent Patterns from Data Streams

Stream Data Applications

- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)

Page 10: Mining Frequent Patterns from Data Streams

Characteristics of Data Streams

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics:
  - Fast changing and requires fast, real-time response
  - Huge volumes of continuous data, possibly infinite
  - The data stream model captures nicely our data processing needs of today
  - Random access is expensive: single-scan algorithms (can only have one look)
  - Store only a summary of the data seen thus far
  - Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level, multi-dimensional processing

Page 11: Mining Frequent Patterns from Data Streams

Key Challenges in Stream Data

- Mining precise frequent patterns in stream data: unrealistic
- Infinite length
- Concept-drift
- Concept-evolution
- Feature evolution

Page 12: Mining Frequent Patterns from Data Streams

Key Challenges: Infinite Length

- In many data mining situations, we do not know the entire data set in advance. Stream management is important when the input rate is controlled externally.
- Examples: Google queries, Twitter or Facebook status updates

Page 13: Mining Frequent Patterns from Data Streams

Key Challenges: Infinite Length

- Infinite length: impractical to store and use all historical data
  - Requires infinite storage
  - And infinite running time

[Figure: an unbounded stream of 0's and 1's arriving over time]

Page 14: Mining Frequent Patterns from Data Streams

Key Challenges: Concept-Drift

[Figure: a data chunk of positive and negative instances. The current hyperplane has shifted away from the previous hyperplane, and the instances between the two are victims of concept-drift.]

Page 15: Mining Frequent Patterns from Data Streams

Key Challenges: Concept-Evolution

- Concept-evolution occurs when a new class arrives in the stream.
- In this example, we again see a data chunk of two-dimensional data points.
- There are two classes here, + and -. Suppose we train a rule-based classifier using this chunk.
- Suppose a new class x arrives in the stream in the next chunk.
- If we use the same classification rules, all novel-class instances will be misclassified as either + or -.

[Figure: two chunks of + and - instances in a two-dimensional (x, y) feature space, partitioned at x1, y1, y2 into regions A, B, C, D. In the second chunk, a novel class x appears in a region that the rules below assign to + or -.]

Classification rules:

R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -

Page 16: Mining Frequent Patterns from Data Streams

Key Challenges: Dynamic Features

- Why do new features evolve?
- Infinite data stream
  - Normally, the global feature set is unknown
  - New features may appear
- Concept drift
  - As concepts drift, new features may appear
- Concept evolution
  - A new class normally brings a new set of features

Page 17: Mining Frequent Patterns from Data Streams

Key Challenges: Dynamic Features

Feature Extraction & Selection

Existing classification models need a complete, fixed feature set that applies to all chunks. Global features are difficult to predict. One solution is to use all English words to generate the vector, but the dimensionality of that vector would be too high.

[Figure: the ith chunk (feature sets such as "runway, climb", "runway, clear, ramp", "runway, ground, ramp") and the (i+1)st chunk pass through feature space conversion into the current model for classification & novel class detection, and into training a new model. The ith chunk, the (i+1)st chunk, and the models have different feature sets.]

Page 18: Mining Frequent Patterns from Data Streams

Frequent Pattern Mining over Data Stream

- Items Counting
- Lossy Counting
- Extensions

Page 19: Mining Frequent Patterns from Data Streams

Items Counting

Page 20: Mining Frequent Patterns from Data Streams

Counting Bits – (1)

- Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's are in the last k bits?" where k ≤ N.
- Obvious solution: store the most recent N bits.
  - When a new bit comes in, discard the (N+1)st most recent bit.
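The obvious solution can be sketched with a fixed-size buffer; a minimal illustration in Python (class and method names are hypothetical):

```python
from collections import deque

class ExactWindowCounter:
    """Exact answers, but needs O(N) memory: store the most recent N bits."""
    def __init__(self, n):
        self.n = n
        # Appending to a full deque with maxlen discards the oldest bit,
        # exactly the "discard the (N+1)st bit" rule above.
        self.window = deque(maxlen=n)

    def add(self, bit):
        self.window.append(bit)

    def ones(self, k):
        """How many 1's in the last k bits (k <= N)?"""
        return sum(list(self.window)[-k:])
```

For example, feeding the bits 1, 0, 1, 1, 1 into a window of size N = 4 leaves the window holding 0, 1, 1, 1, so `ones(4)` is 3 and `ones(2)` is 2.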

Page 21: Mining Frequent Patterns from Data Streams

Counting Bits – (2)

- You can't get an exact answer without storing the entire window.
- Real problem: what if we cannot afford to store N bits?
  - E.g., we're processing 1 billion streams and N = 1 billion.
- But we're happy with an approximate answer.

Page 22: Mining Frequent Patterns from Data Streams

DGIM* Method

- Store O(log² N) bits per stream.
- Gives an approximate answer, never off by more than 50%.
  - The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits.

*Datar, Gionis, Indyk, and Motwani

Page 23: Mining Frequent Patterns from Data Streams

Something That Doesn't (Quite) Work

- Summarize exponentially increasing regions of the stream, looking backward.
- Drop small regions if they begin at the same point as a larger region.

Page 24: Mining Frequent Patterns from Data Streams

Key Idea

- Summarize blocks of the stream with specific numbers of 1's.
- Block sizes (numbers of 1's) increase exponentially as we go back in time.

Page 25: Mining Frequent Patterns from Data Streams

Example: Bucketized Stream

1001010110001011010101010101011010101010101110101010111010100010110010

Within the window of the last N bits: 2 buckets of size 1, 1 of size 2, 2 of size 4, 2 of size 8, and at least 1 of size 16 (partially beyond the window).

Page 26: Mining Frequent Patterns from Data Streams

Timestamps

- Each bit in the stream has a timestamp, starting 1, 2, ...
- Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log₂ N) bits.

Page 27: Mining Frequent Patterns from Data Streams

Buckets

A bucket in the DGIM method is a record consisting of:

1. The timestamp of its end [O(log N) bits].
2. The number of 1's between its beginning and end [O(log log N) bits].

- Constraint on buckets: the number of 1's must be a power of 2.
  - That explains the O(log log N) in (2).

Page 28: Mining Frequent Patterns from Data Streams

Representing a Stream by Buckets

- Either one or two buckets with the same power-of-2 number of 1's.
- Buckets do not overlap in timestamps.
- Buckets are sorted by size.
  - Earlier buckets are not smaller than later buckets.
- Buckets disappear when their end-time is > N time units in the past.

Page 29: Mining Frequent Patterns from Data Streams

Updating Buckets – (1)

- When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time.
- If the current bit is 0, no other changes are needed.

Page 30: Mining Frequent Patterns from Data Streams

Updating Buckets – (2)

If the current bit is 1:

1. Create a new bucket of size 1 for just this bit.
   - End timestamp = current time.
2. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2.
3. If there are now three buckets of size 2, combine the oldest two into a bucket of size 4.
4. And so on...

Page 31: Mining Frequent Patterns from Data Streams

Example

1001010110001011010101010101011010101010101110101010111010100010110010
0010101100010110101010101010110101010101011101010101110101000101100101
0101100010110101010101010110101010101011101010101110101000101100101101

(Each line shows the window after new bits arrive; buckets merge as described by the update rules.)

Page 32: Mining Frequent Patterns from Data Streams

Querying

To estimate the number of 1's in the most recent N bits:

1. Sum the sizes of all buckets but the last.
2. Add half the size of the last bucket.

- Remember: we don't know how many 1's of the last bucket are still within the window.
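The update and query rules can be put together in a short sketch. This is a simplified illustration, not the full method: timestamps are kept absolute rather than modulo N, and the class and variable names are made up for the example.

```python
class DGIM:
    """Sketch of DGIM: approximate count of 1's in the last n bits.

    Buckets are (end_timestamp, size) pairs, newest first; sizes are
    powers of 2, with at most two buckets of each size.
    """
    def __init__(self, n):
        self.n = n
        self.t = 0           # current time
        self.buckets = []    # index 0 = newest bucket

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end-time falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit == 0:
            return           # a 0 needs no other changes
        self.buckets.insert(0, (self.t, 1))  # new size-1 bucket for this bit
        # Whenever three buckets share a size, merge the OLDEST two into one
        # of double size, keeping the later of their two end-timestamps.
        i = 0
        while i + 2 < len(self.buckets):
            if (self.buckets[i][1] == self.buckets[i + 1][1]
                    == self.buckets[i + 2][1]):
                ts = self.buckets[i + 1][0]
                size = self.buckets[i + 1][1] * 2
                del self.buckets[i + 2]
                self.buckets[i + 1] = (ts, size)
            i += 1

    def count(self):
        """Estimate: sum of all bucket sizes, minus half the last bucket."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```

Feeding seven 1's into a large window leaves buckets of sizes [1, 2, 4]; the estimate is (1 + 2 + 4) − 4/2 = 5, within the 50% bound of the true count 7.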

Page 33: Mining Frequent Patterns from Data Streams

Example: Bucketized Stream

1001010110001011010101010101011010101010101110101010111010100010110010

Within the window of the last N bits: 2 buckets of size 1, 1 of size 2, 2 of size 4, 2 of size 8, and at least 1 of size 16 (partially beyond the window).

Page 34: Mining Frequent Patterns from Data Streams

Error Bound

- Suppose the last bucket has size 2^k.
- Then by assuming 2^(k-1) of its 1's are still within the window, we make an error of at most 2^(k-1).
- Since there is at least one bucket of each of the sizes less than 2^k, the true sum is at least 1 + 2 + ... + 2^(k-1) = 2^k - 1.
- Thus, the error is at most 50%.

Page 35: Mining Frequent Patterns from Data Streams

Frequent Pattern Mining over Data Stream

- Items Counting
- Lossy Counting
- Extensions

Page 36: Mining Frequent Patterns from Data Streams

LOSSY COUNTING

Page 37: Mining Frequent Patterns from Data Streams

Mining Approximate Frequent Patterns

- Mining precise frequent patterns in stream data is unrealistic
  - Even storing them in a compressed form, such as an FP-tree
- Approximate answers are often sufficient (e.g., trend/pattern analysis)
- Example: a router is interested in all flows
  - whose frequency is at least 1% (σ) of the entire traffic stream seen so far
  - and feels that 1/10 of σ (ε = 0.1%) error is comfortable
- How to mine frequent patterns with good approximation?
  - Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
  - Major idea: do not track items until they become frequent
  - Advantage: guaranteed error bound
  - Disadvantage: keeps a large set of traces

Page 38: Mining Frequent Patterns from Data Streams

Lossy Counting for Frequent Single Items

Bucket 1 | Bucket 2 | Bucket 3

Divide the stream into 'buckets' (bucket size is 1/ε = 1000).

Page 39: Mining Frequent Patterns from Data Streams

First Bucket of Stream

[Figure: item counts from the first bucket are added to an empty summary.]

At the bucket boundary, decrease all counters by 1.

Page 40: Mining Frequent Patterns from Data Streams

Next Bucket of Stream

[Figure: item counts from the next bucket are added to the summary.]

At the bucket boundary, decrease all counters by 1.

Page 41: Mining Frequent Patterns from Data Streams

Approximation Guarantee

- Given: (1) support threshold σ, (2) error threshold ε, and (3) stream length N
- Output: items with frequency counts exceeding (σ – ε)N
- How much do we undercount?
  - If the stream length seen so far is N and the bucket size is 1/ε, then the frequency count error ≤ #buckets = N / bucket-size = N / (1/ε) = εN
- Approximation guarantee:
  - No false negatives
  - False positives have true frequency count at least (σ – ε)N
  - Frequency counts are underestimated by at most εN
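The per-item scheme on the preceding slides fits in a few lines. This is a minimal sketch of the decrement-at-boundary variant shown above; the function name and the ε used below are made up for the example.

```python
def lossy_count(stream, epsilon):
    """Lossy Counting for single items.

    Counts are undercounted by at most one per bucket boundary, i.e. by
    at most epsilon * N after N items.
    """
    width = int(1 / epsilon)   # bucket size = 1/epsilon
    counts = {}
    for n, item in enumerate(stream, start=1):
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:
            # Bucket boundary: decrease all counters by 1, drop zeros.
            counts = {i: c - 1 for i, c in counts.items() if c > 1}
    return counts
```

With ε = 0.5 (bucket size 2) and the stream a, a, b, a, the true count of 'a' is 3 but the returned count is 1: it was decremented at each of the two bucket boundaries, so the undercount is exactly the bound εN = 2, and 'b' was pruned. Reporting every item whose count reaches (σ – ε)N then yields no false negatives.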

Page 42: Mining Frequent Patterns from Data Streams

Lossy Counting for Frequent Itemsets

Divide the stream into 'buckets' as for frequent items, but fill as many buckets as possible in main memory at one time.

Bucket 1 | Bucket 2 | Bucket 3

If we put 3 buckets of data into main memory at one time, then decrease each frequency count by 3.

Page 43: Mining Frequent Patterns from Data Streams

Update of Summary Data Structure

[Figure: the summary data structure is combined with counts from 3 buckets of data in memory; each itemset's counter is increased by its count in the buckets and then decreased by 3. One itemset's counter drops to zero and is deleted.]

That's why we choose a large number of buckets: more itemsets get deleted.

Page 44: Mining Frequent Patterns from Data Streams

Pruning Itemsets – Apriori Rule

If we find that an itemset is not frequent, then we need not consider its supersets.

[Figure: an itemset found infrequent over the summary plus the 3 buckets of data in memory is pruned, together with all of its supersets.]

Page 45: Mining Frequent Patterns from Data Streams

Summary of Lossy Counting

- Strengths
  - A simple idea
  - Can be extended to frequent itemsets
- Weaknesses
  - The space bound is not good
  - For frequent itemsets, it scans each record many times
  - The output is based on all previous data, but sometimes we are only interested in recent data
- A space-saving method for mining frequent items in streams:
  - Metwally, Agrawal, and El Abbadi, ICDT'05

Page 46: Mining Frequent Patterns from Data Streams

Extensions

Page 47: Mining Frequent Patterns from Data Streams

Extensions

- Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
  - Mines approximate frequent patterns
  - Keeps only current frequent patterns; no changes can be detected
- FP-Stream (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
  - Uses a tilted time window frame
  - Mines evolution and dramatic changes of frequent patterns
- Moment (Y. Chi, ICDM '04)
  - Very similar to FP-tree, except that it keeps a dynamic set of items
  - Maintains closed frequent itemsets over a stream sliding window

Page 48: Mining Frequent Patterns from Data Streams

Lossy Counting versus FP-Stream

- Lossy Counting (Manku & Motwani, VLDB'02)
  - Keeps only current frequent patterns; no changes can be detected
- FP-Stream: mines evolution and dramatic changes of frequent patterns (Giannella, Han, Yan, Yu, 2003)
  - Uses a tilted time window frame
  - Uses a compressed form to store significant (approximate) frequent patterns and their time-dependent traces

Page 49: Mining Frequent Patterns from Data Streams

Summary of FP-Stream

- Mines frequent itemsets at multiple time granularities, based on FP-Growth
- Maintains
  - a pattern tree
  - tilted-time windows
- Advantages
  - Allows answering time-sensitive queries
  - Gives greater weight to recent data
- Drawback
  - Time and memory complexity

Page 50: Mining Frequent Patterns from Data Streams

Moment

Two straightforward approaches to sliding-window mining:
- Regenerate frequent itemsets from the entire window whenever a new transaction enters or an old transaction leaves the window
- Store every itemset, frequent or not, in a traditional data structure such as a prefix tree, and update its support whenever a new transaction enters or an old transaction leaves the window

Drawbacks:
- Mining each window from scratch is too expensive
  - Subsequent windows have many frequent patterns in common
- Updating frequent patterns on every new tuple is also too expensive

Page 51: Mining Frequent Patterns from Data Streams

Summary of Moment

- Computes closed frequent itemsets in a sliding window
- Uses a Closed Enumeration Tree
- Uses four types of nodes:
  - Closed nodes
  - Intermediate nodes
  - Unpromising gateway nodes
  - Infrequent gateway nodes
- Adding transactions: closed itemsets remain closed
- Removing transactions: infrequent itemsets remain infrequent

Page 52: Mining Frequent Patterns from Data Streams

References

[Agrawal'94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[Cheung'03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support. In DEAS, 2003.
[Chi'04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski'03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han'00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh'04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
[Leung'05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen'96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.
[Mozafari'08] B. Mozafari, H. Thakkar, and C. Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In ICDE, pages 179–188, 2008.
[Thakkar'09] H. Thakkar, B. Mozafari, and C. Zaniolo. Continuous post-mining of association rules in a data stream management system. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Y. Zhao, C. Zhang, and L. Cao (eds.), ISBN 978-1-60566-404-0.