198:671 Processing Massive Data Sets S. Muthukrishnan

Details

• Meeting: Core B, Thursday 6-8 PM.
• Muthu: x7212, Core 319, Office: Monday 3-4.
• Graham: x4580, Core 413, Office:
• We meet
– [1] 01/30
– [4] 02/06 02/13 02/20 02/27
– [3] 03/06 03/13 03/20 03/27
– [4] 04/03 04/10 04/17 04/24
– [1] 05/01 05/08?

• Write down your email addresses now. WHO DID NOT GET AN EMAIL FROM ME THIS WEEK?

The Data Stream Phenomenon

• Highly detailed, automatic, rapid data feeds.
– Radar: meteorological observations.
– Satellite: geodetics, radiation, …
– Astronomical surveys: optical, IR, radio, …
– Internet: traffic logs, user queries, email, financial, …
– Sensor nodes: many more “observation points”.

• Need for near-real time analysis of data feeds.
– Detect outliers, extreme events, fraud, intrusion, anomalous activity, complex correlations, classification, …
– Monitoring.

Review of last lecture

• Webonym, webmorphism.

• Some questions that arose (Badri asked them?):
– How can email info be collected, cutting across IP network layers? This is routinely done in the IP business. 100% of inter-ISP email is SMTP, which can be logged. IMAP, POP3, …
– Credit card transactions in the US: Visa apparently did 6.2 billion transactions in the US last year. What is the number of packets sent by a 1Gb/s link in one hour, assuming the average packet size is 40 bytes?
– Do we have to look at large datasets for streaming algorithms to be interesting? NO: database monitoring.
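The packet-count question above is plain arithmetic, sketched below. The assumptions are not from the slide: the link is fully utilized for the whole hour, "1Gb/s" means 10^9 bits per second, and framing overhead is ignored. The function name `packets_per_hour` is illustrative only.

```python
# Back-of-the-envelope packet count for a saturated link.
# Assumptions (not stated on the slide): decimal gigabit (1e9 bits/s),
# 100% utilization, no per-packet framing overhead.

LINK_BITS_PER_SECOND = 1_000_000_000   # 1 Gb/s
PACKET_BYTES = 40                      # average packet size from the slide
SECONDS_PER_HOUR = 3600
VISA_TRANSACTIONS_PER_YEAR = 6_200_000_000  # figure quoted on the slide


def packets_per_hour(link_bps=LINK_BITS_PER_SECOND, packet_bytes=PACKET_BYTES):
    """Number of whole packets a saturated link carries in one hour."""
    bits_per_packet = 8 * packet_bytes
    return link_bps * SECONDS_PER_HOUR // bits_per_packet


if __name__ == "__main__":
    hourly = packets_per_hour()
    print(f"packets per hour: {hourly:,}")
    print(f"ratio to Visa's yearly transactions: {hourly / VISA_TRANSACTIONS_PER_YEAR:.2f}")
```

Under these assumptions a single link carries more packets in an hour than the quoted number of Visa transactions in a year.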

Homeworks

• HW1: Guess a few data sets that are likely to be large, and estimate their sizes. What is the largest dataset size you can think of?

• HW2: What is the best algorithm you can design for the problem of finding k missing numbers?

• HW3: What is an estimate for the total amount of information a human being sees during the course of their life?

• HW4: List a few queries you may pose to packet traffic streams.
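As a warm-up for HW2, here is a minimal sketch of the simplest case, k = 1. The puzzle statement is not reproduced on these slides, so the setup is an assumption: the stream contains every integer from 1 to n exactly once, except for one missing value. The name `find_one_missing` is hypothetical. The general case of k missing numbers is left as the homework asks.

```python
def find_one_missing(stream, n):
    """Return the single value in 1..n absent from `stream`.

    One pass, one running counter. Assumes (not stated on the slide) that
    every other value in 1..n appears exactly once.
    """
    remaining = n * (n + 1) // 2   # sum of 1..n
    for x in stream:
        remaining -= x             # subtract each value as it streams by
    return remaining


if __name__ == "__main__":
    print(find_one_missing(iter([1, 2, 4, 5]), n=5))
```

The counter never exceeds n(n+1)/2, so it fits in O(log n) bits, which is the kind of space bound the course is after.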

Questions: writeup

Which portion engaged you most?

Which portion grated on you the most?

What did you take away from the writeup?

Application to many areas including Geometry.

Fuller, detailed solutions were missing.

????

Telephone/Internet Measurements

(111.12.111, 121.25.211, 01/02/02, 14.12.21, 14.35.00, 12412, 100)
(212.78.123, 121.25.311, 01/02/02, 14.12.21, 14.35.01, 24, 1)

(202 262 47yx, 800 call att, 01/02/02, 14.12.21, 14.35.00)
(973 360 7212, 202 262 47yx, 01/02/02, 14.36.00, 14.38.00)

Network management calls for rapid analysis of MASSIVE amounts of such data, in particular, summarizing various signals.

SNMP, TCP logs, packet logs, flow logs, fault alarms, …

Call detail records, SS7 signaling, diagnostics, ...

Models of Data Streams

• Signal s[1…n]. n is universe size.

• Implicitly presented.

• Three models:
– Timeseries model: s[1], s[2], … arrive in order.
– Cash Register model: s[j] = s[j] + a(k), with a(k) > 0 (increments only).
– Turnstile model: s[j] = s[j] + u(k), where u(k) may be positive or negative.

• Any other models? Students were curious why we need the Turnstile model.
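The three models differ only in how each arriving item changes the signal. The sketch below makes that concrete on a toy stream. It materializes s in a dictionary purely for illustration, which a real streaming algorithm could not afford to do, and the function names are hypothetical.

```python
from collections import defaultdict


def time_series(values):
    """Timeseries model: the k-th arrival IS s[k]; entries arrive in index order."""
    return {k: v for k, v in enumerate(values, start=1)}


def cash_register(updates):
    """Cash Register model: each (j, a) adds a > 0 to s[j]; entries only grow."""
    s = defaultdict(int)
    for j, a in updates:
        if a <= 0:
            raise ValueError("cash register updates must be positive")
        s[j] += a
    return dict(s)


def turnstile(updates):
    """Turnstile model: each (j, u) adds u to s[j]; u may be negative."""
    s = defaultdict(int)
    for j, u in updates:
        s[j] += u
    return dict(s)


if __name__ == "__main__":
    print(time_series([5, 0, 7]))
    print(cash_register([(3, 2), (7, 1), (3, 5)]))
    print(turnstile([(3, 2), (3, -2), (7, 1)]))
```

The turnstile example shows an entry returning to zero, as happens when something that was counted later departs; the cash register model cannot express that.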

IP Network Signals

• Number of bytes (packets) sent by a source IP address during the day.
– 2^(32) sized one-dimensional array; increment only.

• Number of flows between a source and a destination IP address during the day.
– 2^(64) sized two-dimensional array; aggregate packets.

• Number of active flows per source IP.
– 2^(32) sized one-dimensional array; increment and decrement.

• Number of active flows per second.
– One dimensional time series.

Students mentioned Multidimensional signals.

Models of Data Streams

• Compute functions on s.
– How many distinct numbers in the stream?
– How many items are rare?
– What is the variance of s?
– What is a good histogram for it?
– What are the k largest s[j]’s?
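To pin down what these questions ask, here is an exact, non-streaming baseline on a toy stream. It stores every count, so it is the reference answer a small-space algorithm would approximate, not a streaming algorithm itself. Several choices are assumptions: each stream item increments its own entry by one, "rare" means seen exactly once, and the variance is taken over all n entries of s including zeros. The histogram question is omitted, and `summarize` is a hypothetical name.

```python
from collections import Counter
import heapq


def summarize(stream_items, n, k):
    """Exact answers for a unit-increment stream over a universe of size n."""
    s = Counter(stream_items)                      # s[j] = occurrences of j
    counts = list(s.values())
    mean = sum(counts) / n                         # unseen entries count as 0
    sum_sq_dev = sum((c - mean) ** 2 for c in counts)
    sum_sq_dev += (n - len(counts)) * mean ** 2    # contribution of the zeros
    return {
        "distinct": len(s),
        "rare": sum(1 for c in counts if c == 1),  # assumed meaning of "rare"
        "variance": sum_sq_dev / n,
        "top_k": heapq.nlargest(k, s.items(), key=lambda kv: kv[1]),
    }


if __name__ == "__main__":
    print(summarize(["a", "b", "a", "c", "a", "b"], n=4, k=2))
```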

Data Stream Models…

• Desiderata:
– Per item processing time
– Space stored
– Time for computing functions on s
– Must all be polylog(n,||s||).
– Why? See the writeup for the Polylog(n,t) definition.

Homework 2

• HW2.1: Pick one of the other streaming scenarios and write down:
– Describe the scenario precisely.
– What is the format of the data that is being streamed?
– What are suitable datastream models?
– What are interesting analysis questions?

• E.g.:
– Email logs: headers and contents.
– Webserver logs. Multiple webservers.
– Atmospheric observations through radio telescopes.
– Satellite observations of earth.

Other applications of data stream models

• One pass algorithms.

• Database monitoring.
– Selectivity estimation.
– Approximate query answering.
– Data quality monitoring.

Project: Stream Join

• We are receiving two data streams, and want to output the join of these two streams: pairs of items which agree on some key.

• Characterize the "relative disorderliness" of the streams: if it is small, then maybe we can do well.

• If we have a constant amount of memory, can we produce an approximation of the best possible result for that amount of memory?

• Implement, and evaluate how it performs in practice.
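One possible starting point for this project is a bounded-buffer join: remember only the most recent items from each stream and match every new arrival against the other stream's buffer. This is a simple baseline chosen for illustration, not a method the slides prescribe. The event format `(side, key, value)`, the name `windowed_stream_join`, and the `capacity` parameter are all assumptions.

```python
from collections import deque


def windowed_stream_join(events, capacity):
    """Join two interleaved streams using at most `capacity` items per side.

    `events` yields (side, key, value) with side "L" or "R", in arrival order.
    Yields (key, left_value, right_value) for every match that is still
    buffered when its partner arrives; older partners are missed.
    """
    buffers = {"L": deque(), "R": deque()}
    for side, key, value in events:
        other = "R" if side == "L" else "L"
        # Probe the other stream's buffer (linear scan; capacity is constant).
        for other_key, other_value in buffers[other]:
            if other_key == key:
                if side == "L":
                    yield (key, value, other_value)
                else:
                    yield (key, other_value, value)
        # Remember the new item, evicting the oldest if over capacity.
        buffers[side].append((key, value))
        if len(buffers[side]) > capacity:
            buffers[side].popleft()


if __name__ == "__main__":
    events = [("L", 1, "a"), ("R", 2, "x"), ("R", 1, "y"),
              ("L", 2, "b"), ("L", 3, "c"), ("R", 3, "z")]
    print(list(windowed_stream_join(events, capacity=2)))
    print(list(windowed_stream_join(events, capacity=1)))
```

With a smaller buffer the join misses the pair whose two halves arrive far apart, which is the effect the "relative disorderliness" question is about: the closer matching items arrive to each other, the less memory is needed to catch them.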

Scientific Data Streams

• From the Feb issue of Wired magazine:
– A new oil drill transmits data about its current drilling conditions at 1Mb/s.
– Scientists are using seismic readings from around the world in order to identify atomic particles passing through the earth.

• What are the characteristics of these kinds of (scientific) data streams?

• What questions arise, and what techniques can we apply to answer them?

• What new and interesting problems do we encounter?

Permutations and Disorder

• Recent interest in counting the number of inversions in sequences (how many pairs i > j have a[i] < a[j]). Many similar questions remain to be explored.

• What about related measures of disorderliness, e.g. the maximum (i-j) for which a[i] < a[j]?

• What is the Longest Increasing Subsequence of the sequence? (using small space & 1 pass)

• Are these problems easier if each value is guaranteed to occur at most once?

• Can we find approximations of these quantities, or prove they are hard to approximate?
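For reference, the two main quantities on this slide can be computed exactly offline. The sketch below uses standard textbook methods: merge sort for inversions and patience sorting for the Longest Increasing Subsequence. Neither is a small-space, one-pass algorithm, since the first stores the whole sequence and the second stores up to one value per LIS element. They are offered only as ground truth for evaluating streaming approximations. The function names are illustrative, and "increasing" is taken to mean strictly increasing.

```python
from bisect import bisect_left


def count_inversions(a):
    """Number of pairs i > j with a[i] < a[j], via merge sort in O(n log n)."""
    def sort_count(seq):
        if len(seq) <= 1:
            return seq, 0
        mid = len(seq) // 2
        left, left_inv = sort_count(seq[:mid])
        right, right_inv = sort_count(seq[mid:])
        merged, inv, i, j = [], left_inv + right_inv, 0, 0
        while i < len(left) and j < len(right):
            if right[j] < left[i]:
                inv += len(left) - i      # right[j] is below all remaining left items
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_count(list(a))[1]


def lis_length(a):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[p] = smallest possible tail of an increasing run of length p + 1
    for x in a:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


if __name__ == "__main__":
    seq = [3, 1, 4, 2]
    print(count_inversions(seq), lis_length(seq))
```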

Frequency Change

• Will later present how to monitor “top 10” items (e.g. Amazon best sellers, by monitoring transactions).

• Recent work: do this with arrival/departure of items.

• Open problem: what are top changes in frequency?

– Which stocks have biggest rise/fall in value?

– Which internet hosts have biggest relative change in traffic?

• Implement some solutions, run them on massive data.
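An exact baseline makes the open problem concrete: count each item in two periods and rank items by how much the count moved. It keeps a full table of counts, so it is not a streaming solution, only a reference against which small-space solutions could be checked. Measuring change as an absolute difference is one assumed reading (the slide also mentions relative change), and `top_frequency_changes` is a hypothetical name.

```python
from collections import Counter


def top_frequency_changes(before, after, k):
    """Return the k items whose counts changed most between two periods.

    Each result is (item, after_count - before_count); ties are broken by
    item name so the output is deterministic.
    """
    freq_before, freq_after = Counter(before), Counter(after)
    items = set(freq_before) | set(freq_after)
    changes = {item: freq_after[item] - freq_before[item] for item in items}
    ranked = sorted(changes.items(), key=lambda kv: (-abs(kv[1]), str(kv[0])))
    return ranked[:k]


if __name__ == "__main__":
    yesterday = ["a", "a", "b", "c"]
    today = ["a", "b", "b", "b", "d"]
    print(top_frequency_changes(yesterday, today, k=2))
```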

Homework

• HW2.3: Describe your chosen project for the semester. State the problem. State what you will do.

Left out of the lecture…

Few good terms property…

• In general it may be impossible to solve certain problems in data stream models.
– Lower bound for the missing items puzzle.

• What helps is: few good terms property.

In distributions that occur in nature, there are few huge values, few histogram buckets,…

Wavelets: Haar Wavelets

• a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8]

• w[l] is the dot product of a with the l-th basis vector.

• Orthonormal Basis.

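[Figure: Haar wavelet decomposition of a[1] … a[8]. The diagram combines the adjacent pairs (a[1], a[2]), (a[3], a[4]), (a[5], a[6]), (a[7], a[8]) in two rows of terms, each divided by 2, and labels the basis vectors [1] through [8]. The operators joining each pair, and any radical signs, were lost in extraction.]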
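A minimal sketch of the Haar transform follows, written to match the "Orthonormal Basis" bullet: adjacent pairs are combined into a scaled sum and a scaled difference, and the step is repeated on the sums. The 1/√2 scaling and the ordering of the output coefficients are assumptions, since the slide's figure did not survive extraction. The function names are illustrative, and the input length is assumed to be a power of two.

```python
import math


def haar_transform(a):
    """Orthonormal Haar wavelet coefficients of a sequence of length 2^m.

    Output order: overall (scaled) average first, then detail coefficients
    from coarsest to finest.
    """
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(x) for x in a]
    details = []
    root2 = math.sqrt(2.0)
    while len(current) > 1:
        pairs = [(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        level_details = [(x - y) / root2 for x, y in pairs]
        current = [(x + y) / root2 for x, y in pairs]
        details = level_details + details   # coarser levels go in front
    return current + details


def top_k_energy(coeffs, k):
    """Sum of squares of the k largest-magnitude coefficients."""
    return sum(sorted((c * c for c in coeffs), reverse=True)[:k])


if __name__ == "__main__":
    w = haar_transform([1, 2, 3, 4, 5, 6, 7, 8])
    total = sum(c * c for c in w)
    for k in range(1, len(w) + 1):
        print(k, round(top_k_energy(w, k) / total, 4))
```

Because the basis is orthonormal, the total energy of the coefficients equals the total energy of the signal, so the printed ratios show how much of the signal the top k coefficients capture.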

Homework

• HW2.2: Take datasets from multiple disciplines, calculate their Haar wavelet coefficients and plot the energy of top k coefficients as k increases.