Mining Frequent Patterns from Data Streams


Page 1: Mining Frequent Patterns from Data Streams

Mining Frequent Patterns from Data Streams

Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Charu Aggarwal and Jiawei Han, Jure Leskovec

Page 2: Mining Frequent Patterns from Data Streams

OUTLINE

- Data Streams
  - Characteristics of Data Streams
  - Key Challenges in Stream Data
- Frequent Pattern Mining over Data Streams
  - Counting Itemsets
  - Lossy Counting
  - Extensions

Page 3: Mining Frequent Patterns from Data Streams

Data Stream

- What is a data stream?
- A data stream is an ordered sequence of instances that, in many data stream mining applications, can be read only once (or a small number of times) using limited computing and storage capabilities.

Page 4: Mining Frequent Patterns from Data Streams

Data Streams

- Traditional DBMS
  - Data stored in finite, persistent data sets
- Data Streams
  - Continuous, ordered, changing, fast, huge amount
  - Managed by a Data Stream Management System (DSMS)

Page 5: Mining Frequent Patterns from Data Streams

DBMS versus DSMS

DBMS (persistent relations):
- One-time queries
- Random access
- "Unbounded" disk store
- Only current state matters
- No real-time services
- Relatively low update rate
- Data at any granularity
- Assumes precise data
- Access plan determined by query processor and physical DB design

DSMS (transient streams):
- Continuous queries
- Sequential access
- Bounded main memory
- Historical data is important
- Real-time requirements
- Possibly multi-GB arrival rate
- Data at fine granularity
- Data may be stale/imprecise
- Unpredictable/variable data arrival and characteristics

Page 6: Mining Frequent Patterns from Data Streams

Data Streams

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics of data streams: fast changing, requiring fast, real-time response

Page 7: Mining Frequent Patterns from Data Streams

Characteristics of Data Streams

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics:
  - Fast changing and requires fast, real-time response
  - Huge volumes of continuous data, possibly infinite
  - The data stream model captures nicely our data processing needs of today

Page 8: Mining Frequent Patterns from Data Streams

APPLICATIONS

Example: Freeboard.io - Dashboards For the Internet Of Things - https://freeboard.io/

Sensors, Weather, Stock Exchange, Self-Driving Cars, Trends, Tweets, Logs, Articles/News, Wikipedia Edits, ATM Transactions, Chats, Television, Earthquakes, Music Similarities, CO2 Levels, Car Tracking, ...

Page 9: Mining Frequent Patterns from Data Streams

Stream Data Applications

- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)

Page 10: Mining Frequent Patterns from Data Streams

Characteristics of Data Streams

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics:
  - Fast changing and requires fast, real-time response
  - Huge volumes of continuous data, possibly infinite
  - The data stream model captures nicely our data processing needs of today
  - Random access is expensive: single-scan algorithms (can only have one look)
  - Store only a summary of the data seen thus far
  - Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level, multi-dimensional processing

Page 11: Mining Frequent Patterns from Data Streams

Key Challenges in Stream Data

- Mining precise frequent patterns in stream data: unrealistic
- Infinite length
- Concept-drift
- Concept-evolution
- Feature evolution

Page 12: Mining Frequent Patterns from Data Streams

Key Challenges: Infinite Length

- In many data mining situations, we do not know the entire data set in advance. Stream management is important when the input rate is controlled externally.
- Examples: Google queries, Twitter or Facebook status updates

Page 13: Mining Frequent Patterns from Data Streams

Key Challenges: Infinite Length

- Infinite length: impractical to store and use all historical data
  - Requires infinite storage
  - And infinite running time

[Figure: an unbounded stream of 0's and 1's arriving over time]

Page 14: Mining Frequent Patterns from Data Streams

Key Challenges: Concept-Drift

[Figure: a data chunk of positive and negative instances. The current hyperplane has shifted away from the previous hyperplane, and the instances between the two are victims of concept-drift.]

Page 15: Mining Frequent Patterns from Data Streams

Key Challenges: Concept-Evolution

- Concept-evolution occurs when a new class arrives in the stream.
- In this example, we again see a data chunk of two-dimensional data points.
- There are two classes here, + and -. Suppose we train a rule-based classifier using this chunk.
- Suppose a new class x arrives in the stream in the next chunk.
- If we use the same classification rules, all novel-class instances will be misclassified as either + or -.

[Figure: two chunks of + and - instances in a two-dimensional (x, y) feature space, partitioned at x1, y1, y2 into regions A, B, C, D. In the second chunk, a novel class x appears in a region that the rules below assign to + or -.]

Classification rules:

R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -

Page 16: Mining Frequent Patterns from Data Streams

Key Challenges: Dynamic Features

- Why do new features evolve?
- Infinite data stream
  - Normally, the global feature set is unknown
  - New features may appear
- Concept drift
  - As concepts drift, new features may appear
- Concept evolution
  - A new class normally brings a new set of features

Page 17: Mining Frequent Patterns from Data Streams

Key Challenges: Dynamic Features

Feature Extraction & Selection

Existing classification models need a complete, fixed feature set that applies to all chunks. Global features are difficult to predict. One solution is to use all English words to generate the vector, but the dimensionality of that vector would be too high.

[Figure: the ith chunk (feature sets such as "runway, climb", "runway, clear, ramp", "runway, ground, ramp") and the (i+1)st chunk pass through feature space conversion into the current model for classification & novel class detection, and into training a new model. The ith chunk, the (i+1)st chunk, and the models have different feature sets.]

Page 18: Mining Frequent Patterns from Data Streams

Frequent Pattern Mining over Data Stream

- Items Counting
- Lossy Counting
- Extensions

Page 19: Mining Frequent Patterns from Data Streams

Items Counting

Page 20: Mining Frequent Patterns from Data Streams

Counting Bits – (1)

- Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's are in the last k bits?" where k ≤ N.
- Obvious solution: store the most recent N bits.
  - When a new bit comes in, discard the (N+1)st most recent bit.
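The obvious solution can be sketched with a fixed-size buffer; a minimal illustration in Python (class and method names are hypothetical):

```python
from collections import deque

class ExactWindowCounter:
    """Exact answers, but needs O(N) memory: store the most recent N bits."""
    def __init__(self, n):
        self.n = n
        # Appending to a full deque with maxlen discards the oldest bit,
        # exactly the "discard the (N+1)st bit" rule above.
        self.window = deque(maxlen=n)

    def add(self, bit):
        self.window.append(bit)

    def ones(self, k):
        """How many 1's in the last k bits (k <= N)?"""
        return sum(list(self.window)[-k:])
```

For example, feeding the bits 1, 0, 1, 1, 1 into a window of size N = 4 leaves the window holding 0, 1, 1, 1, so `ones(4)` is 3 and `ones(2)` is 2.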

Page 21: Mining Frequent Patterns from Data Streams

Counting Bits – (2)

- You can't get an exact answer without storing the entire window.
- Real problem: what if we cannot afford to store N bits?
  - E.g., we're processing 1 billion streams and N = 1 billion.
- But we're happy with an approximate answer.

Page 22: Mining Frequent Patterns from Data Streams

DGIM* Method

- Store O(log² N) bits per stream.
- Gives an approximate answer, never off by more than 50%.
  - The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits.

*Datar, Gionis, Indyk, and Motwani

Page 23: Mining Frequent Patterns from Data Streams

Something That Doesn't (Quite) Work

- Summarize exponentially increasing regions of the stream, looking backward.
- Drop small regions if they begin at the same point as a larger region.

Page 24: Mining Frequent Patterns from Data Streams

Key Idea

- Summarize blocks of the stream with specific numbers of 1's.
- Block sizes (numbers of 1's) increase exponentially as we go back in time.

Page 25: Mining Frequent Patterns from Data Streams

Example: Bucketized Stream

1001010110001011010101010101011010101010101110101010111010100010110010

Within the window of the last N bits: 2 buckets of size 1, 1 of size 2, 2 of size 4, 2 of size 8, and at least 1 of size 16 (partially beyond the window).

Page 26: Mining Frequent Patterns from Data Streams

Timestamps

- Each bit in the stream has a timestamp, starting 1, 2, ...
- Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log₂ N) bits.

Page 27: Mining Frequent Patterns from Data Streams

Buckets

A bucket in the DGIM method is a record consisting of:

1. The timestamp of its end [O(log N) bits].
2. The number of 1's between its beginning and end [O(log log N) bits].

- Constraint on buckets: the number of 1's must be a power of 2.
  - That explains the O(log log N) in (2).

Page 28: Mining Frequent Patterns from Data Streams

Representing a Stream by Buckets

- Either one or two buckets with the same power-of-2 number of 1's.
- Buckets do not overlap in timestamps.
- Buckets are sorted by size.
  - Earlier buckets are not smaller than later buckets.
- Buckets disappear when their end-time is > N time units in the past.

Page 29: Mining Frequent Patterns from Data Streams

Updating Buckets – (1)

- When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time.
- If the current bit is 0, no other changes are needed.

Page 30: Mining Frequent Patterns from Data Streams

Updating Buckets – (2)

If the current bit is 1:

1. Create a new bucket of size 1 for just this bit.
   - End timestamp = current time.
2. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2.
3. If there are now three buckets of size 2, combine the oldest two into a bucket of size 4.
4. And so on...

Page 31: Mining Frequent Patterns from Data Streams

Example

1001010110001011010101010101011010101010101110101010111010100010110010
0010101100010110101010101010110101010101011101010101110101000101100101
0101100010110101010101010110101010101011101010101110101000101100101101

(Each line shows the window after new bits arrive; buckets merge as described by the update rules.)

Page 32: Mining Frequent Patterns from Data Streams

Querying

To estimate the number of 1's in the most recent N bits:

1. Sum the sizes of all buckets but the last.
2. Add half the size of the last bucket.

- Remember: we don't know how many 1's of the last bucket are still within the window.
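The update and query rules can be put together in a short sketch. This is a simplified illustration, not the full method: timestamps are kept absolute rather than modulo N, and the class and variable names are made up for the example.

```python
class DGIM:
    """Sketch of DGIM: approximate count of 1's in the last n bits.

    Buckets are (end_timestamp, size) pairs, newest first; sizes are
    powers of 2, with at most two buckets of each size.
    """
    def __init__(self, n):
        self.n = n
        self.t = 0           # current time
        self.buckets = []    # index 0 = newest bucket

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end-time falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit == 0:
            return           # a 0 needs no other changes
        self.buckets.insert(0, (self.t, 1))  # new size-1 bucket for this bit
        # Whenever three buckets share a size, merge the OLDEST two into one
        # of double size, keeping the later of their two end-timestamps.
        i = 0
        while i + 2 < len(self.buckets):
            if (self.buckets[i][1] == self.buckets[i + 1][1]
                    == self.buckets[i + 2][1]):
                ts = self.buckets[i + 1][0]
                size = self.buckets[i + 1][1] * 2
                del self.buckets[i + 2]
                self.buckets[i + 1] = (ts, size)
            i += 1

    def count(self):
        """Estimate: sum of all bucket sizes, minus half the last bucket."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```

Feeding seven 1's into a large window leaves buckets of sizes [1, 2, 4]; the estimate is (1 + 2 + 4) − 4/2 = 5, within the 50% bound of the true count 7.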

Page 33: Mining Frequent Patterns from Data Streams

Example: Bucketized Stream

1001010110001011010101010101011010101010101110101010111010100010110010

Within the window of the last N bits: 2 buckets of size 1, 1 of size 2, 2 of size 4, 2 of size 8, and at least 1 of size 16 (partially beyond the window).

Page 34: Mining Frequent Patterns from Data Streams

Error Bound

- Suppose the last bucket has size 2^k.
- Then by assuming 2^(k-1) of its 1's are still within the window, we make an error of at most 2^(k-1).
- Since there is at least one bucket of each of the sizes less than 2^k, the true sum is at least 1 + 2 + ... + 2^(k-1) = 2^k - 1.
- Thus, the error is at most 50%.

Page 35: Mining Frequent Patterns from Data Streams

Frequent Pattern Mining over Data Stream

- Items Counting
- Lossy Counting
- Extensions

Page 36: Mining Frequent Patterns from Data Streams

LOSSY COUNTING

Page 37: Mining Frequent Patterns from Data Streams

Mining Approximate Frequent Patterns

- Mining precise frequent patterns in stream data is unrealistic
  - Even storing them in a compressed form, such as an FP-tree
- Approximate answers are often sufficient (e.g., trend/pattern analysis)
- Example: a router is interested in all flows
  - whose frequency is at least 1% (σ) of the entire traffic stream seen so far
  - and feels that 1/10 of σ (ε = 0.1%) error is comfortable
- How to mine frequent patterns with good approximation?
  - Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
  - Major idea: do not track items until they become frequent
  - Advantage: guaranteed error bound
  - Disadvantage: keeps a large set of traces

Page 38: Mining Frequent Patterns from Data Streams

Lossy Counting for Frequent Single Items

Bucket 1 | Bucket 2 | Bucket 3

Divide the stream into 'buckets' (bucket size is 1/ε = 1000).

Page 39: Mining Frequent Patterns from Data Streams

First Bucket of Stream

[Figure: item counts from the first bucket are added to an empty summary.]

At the bucket boundary, decrease all counters by 1.

Page 40: Mining Frequent Patterns from Data Streams

Next Bucket of Stream

[Figure: item counts from the next bucket are added to the summary.]

At the bucket boundary, decrease all counters by 1.

Page 41: Mining Frequent Patterns from Data Streams

Approximation Guarantee

- Given: (1) support threshold σ, (2) error threshold ε, and (3) stream length N
- Output: items with frequency counts exceeding (σ – ε)N
- How much do we undercount?
  - If the stream length seen so far is N and the bucket size is 1/ε, then the frequency count error ≤ #buckets = N / bucket-size = N / (1/ε) = εN
- Approximation guarantee:
  - No false negatives
  - False positives have true frequency count at least (σ – ε)N
  - Frequency counts are underestimated by at most εN
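The per-item scheme on the preceding slides fits in a few lines. This is a minimal sketch of the decrement-at-boundary variant shown above; the function name and the ε used below are made up for the example.

```python
def lossy_count(stream, epsilon):
    """Lossy Counting for single items.

    Counts are undercounted by at most one per bucket boundary, i.e. by
    at most epsilon * N after N items.
    """
    width = int(1 / epsilon)   # bucket size = 1/epsilon
    counts = {}
    for n, item in enumerate(stream, start=1):
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:
            # Bucket boundary: decrease all counters by 1, drop zeros.
            counts = {i: c - 1 for i, c in counts.items() if c > 1}
    return counts
```

With ε = 0.5 (bucket size 2) and the stream a, a, b, a, the true count of 'a' is 3 but the returned count is 1: it was decremented at each of the two bucket boundaries, so the undercount is exactly the bound εN = 2, and 'b' was pruned. Reporting every item whose count reaches (σ – ε)N then yields no false negatives.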

Page 42: Mining Frequent Patterns from Data Streams

Lossy Counting for Frequent Itemsets

Divide the stream into 'buckets' as for frequent items, but fill as many buckets as possible in main memory at one time.

Bucket 1 | Bucket 2 | Bucket 3

If we put 3 buckets of data into main memory at one time, then decrease each frequency count by 3.

Page 43: Mining Frequent Patterns from Data Streams

Update of Summary Data Structure

[Figure: the summary data structure is combined with counts from 3 buckets of data in memory; each itemset's counter is increased by its count in the buckets and then decreased by 3. One itemset's counter drops to zero and is deleted.]

That's why we choose a large number of buckets: more itemsets get deleted.

Page 44: Mining Frequent Patterns from Data Streams

Pruning Itemsets – Apriori Rule

If we find that an itemset is not frequent, then we need not consider its supersets.

[Figure: an itemset found infrequent over the summary plus the 3 buckets of data in memory is pruned, together with all of its supersets.]

Page 45: Mining Frequent Patterns from Data Streams

Summary of Lossy Counting

- Strengths
  - A simple idea
  - Can be extended to frequent itemsets
- Weaknesses
  - The space bound is not good
  - For frequent itemsets, it scans each record many times
  - The output is based on all previous data, but sometimes we are only interested in recent data
- A space-saving method for mining frequent items in streams:
  - Metwally, Agrawal, and El Abbadi, ICDT'05

Page 46: Mining Frequent Patterns from Data Streams

Extensions

Page 47: Mining Frequent Patterns from Data Streams

Extensions

- Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
  - Mines approximate frequent patterns
  - Keeps only current frequent patterns; no changes can be detected
- FP-Stream (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
  - Uses a tilted time window frame
  - Mines evolution and dramatic changes of frequent patterns
- Moment (Y. Chi, ICDM '04)
  - Very similar to FP-tree, except that it keeps a dynamic set of items
  - Maintains closed frequent itemsets over a stream sliding window

Page 48: Mining Frequent Patterns from Data Streams

Lossy Counting versus FP-Stream

- Lossy Counting (Manku & Motwani, VLDB'02)
  - Keeps only current frequent patterns; no changes can be detected
- FP-Stream: mines evolution and dramatic changes of frequent patterns (Giannella, Han, Yan, Yu, 2003)
  - Uses a tilted time window frame
  - Uses a compressed form to store significant (approximate) frequent patterns and their time-dependent traces

Page 49: Mining Frequent Patterns from Data Streams

Summary of FP-Stream

- Mines frequent itemsets at multiple time granularities, based on FP-Growth
- Maintains
  - a pattern tree
  - tilted-time windows
- Advantages
  - Allows answering time-sensitive queries
  - Gives greater weight to recent data
- Drawback
  - Time and memory complexity

Page 50: Mining Frequent Patterns from Data Streams

Moment

Two straightforward approaches to sliding-window mining:
- Regenerate frequent itemsets from the entire window whenever a new transaction enters or an old transaction leaves the window
- Store every itemset, frequent or not, in a traditional data structure such as a prefix tree, and update its support whenever a new transaction enters or an old transaction leaves the window

Drawbacks:
- Mining each window from scratch is too expensive
  - Subsequent windows have many frequent patterns in common
- Updating frequent patterns on every new tuple is also too expensive

Page 51: Mining Frequent Patterns from Data Streams

Summary of Moment

- Computes closed frequent itemsets in a sliding window
- Uses a Closed Enumeration Tree
- Uses four types of nodes:
  - Closed nodes
  - Intermediate nodes
  - Unpromising gateway nodes
  - Infrequent gateway nodes
- Adding transactions: closed itemsets remain closed
- Removing transactions: infrequent itemsets remain infrequent

Page 52: Mining Frequent Patterns from Data Streams

References

[Agrawal'94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[Cheung'03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support. In DEAS, 2003.
[Chi'04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski'03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han'00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh'04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
[Leung'05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen'96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.
[Mozafari'08] B. Mozafari, H. Thakkar, and C. Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In ICDE, pages 179–188, 2008.
[Thakkar'09] H. Thakkar, B. Mozafari, and C. Zaniolo. Continuous post-mining of association rules in a data stream management system. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Y. Zhao, C. Zhang, and L. Cao (eds.), ISBN 978-1-60566-404-0.