198:671 processing massive data sets

198:671 Processing Massive Data Sets

S. Muthukrishnan

Details

• Meeting: Core B, Thursday 6–8 PM.
• Muthu: x7212, Core 319, Office: Monday 3–4.
• Graham: x4580, Core 413, Office:
• We meet:
– [1] 01/30
– [4] 02/06 02/13 02/20 02/27
– [3] 03/06 03/13 03/20 03/27
– [4] 04/03 04/10 04/17 04/24
– [1] 05/01 05/08?

• Write down your email addresses now. WHO DID NOT GET AN EMAIL FROM ME THIS WEEK?

The Data Stream Phenomenon

• Highly detailed, automatic, rapid data feeds.
– Radar: meteorological observations.
– Satellite: geodetics, radiation, …
– Astronomical surveys: optical, IR, radio, …
– Internet: traffic logs, user queries, email, financial, …
– Sensor nodes: many more "observation points".

• Need for near-real-time analysis of data feeds.
– Detect outliers, extreme events, fraud, intrusion, anomalous activity, complex correlations, classification, …
– Monitoring.

Review of last lecture

• Webonym, webmorphism.

• Some questions that arose (Badri asked them?):

– How to collect email info, cutting across IP network layers? Routinely done in the IP business. 100% of inter-ISP email is SMTP, which can be logged. IMAP, POP3, …

– Credit card transactions in the US. Visa apparently processed 6.2 billion transactions in the US last year. What is the number of packets sent by a 1 Gb/s link in one hour, assuming the average packet size is 40 bytes?

– Do we have to look at large datasets for streaming algorithms to be interesting? NO: database monitoring.
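The packet-count question above is a short back-of-the-envelope calculation. A minimal sketch, assuming that 1 Gb/s means 10^9 bits per second and that the link is fully saturated with no framing overhead (both are assumptions, not stated in the lecture):

```python
# Assumption: 1 Gb/s = 10**9 bits per second, link fully utilized.
LINK_BITS_PER_SEC = 10**9
PACKET_BYTES = 40          # average packet size from the question
SECONDS_PER_HOUR = 3600


def packets_per_hour(bits_per_sec=LINK_BITS_PER_SEC, packet_bytes=PACKET_BYTES):
    """Number of packets a saturated link carries in one hour."""
    bytes_per_sec = bits_per_sec // 8
    packets_per_sec = bytes_per_sec // packet_bytes
    return packets_per_sec * SECONDS_PER_HOUR


print(packets_per_hour())  # 11250000000
```

Under these assumptions a single link carries about 1.1 × 10^10 packets per hour, more than the 6.2 billion yearly card transactions quoted above.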

Homeworks

• HW1: Guess a few data sets that are likely to be large, and estimate their sizes. What is the largest dataset size you can think of?

• HW2: What is the best algorithm you can design for the problem of finding k missing numbers?

• HW3: What is an estimate for the total amount of information a human being sees during the course of their life?

• HW4: List a few queries you may pose to packet traffic streams.
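As a starting point for HW2, the cases k = 1 and k = 2 can be handled in one pass with a constant number of counters. The sketch below assumes the stream holds each of 1..n exactly once except for the missing values; the function names are hypothetical, and this is a baseline for small k rather than the best algorithm for general k.

```python
import math


def find_one_missing(stream, n):
    """One pass, one counter: the stream holds 1..n except one value."""
    remaining = n * (n + 1) // 2          # sum of 1..n
    for x in stream:
        remaining -= x
    return remaining


def find_two_missing(stream, n):
    """One pass, two counters: the stream holds 1..n except two values."""
    s = n * (n + 1) // 2                  # sum of 1..n
    q = n * (n + 1) * (2 * n + 1) // 6    # sum of squares of 1..n
    for x in stream:
        s -= x
        q -= x * x
    # Now s = a + b and q = a^2 + b^2, so (a - b)^2 = 2q - s^2.
    d = math.isqrt(2 * q - s * s)
    return (s - d) // 2, (s + d) // 2


print(find_one_missing([1, 2, 4, 5], 5))   # 3
print(find_two_missing([1, 3, 4, 6], 6))   # (2, 5)
```

The counters hold numbers of size about n^2 and n^3, so the space is O(log n) bits for fixed k.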

Questions: writeup

Which portion engaged you most?

Which portion grated on you the most?

What did you take away from the writeup?

Application to many areas, including Geometry.

Fuller, detailed solutions were missing.

????

Telephone/Internet Measurements

(111.12.111, 121.25.211, 01/02/02, 14.12.21, 14.35.00, 12412, 100)
(212.78.123, 121.25.311, 01/02/02, 14.12.21, 14.35.01, 24, 1)

(202 262 47yx, 800 call att, 01/02/02, 14.12.21, 14.35.00)
(973 360 7212, 202 262 47yx, 01/02/02, 14.36.00, 14.38.00)

Network management calls for rapid analysis of MASSIVE amounts of such data, in particular, summarizing various signals.

SNMP, TCP logs, packet logs, flow logs, fault alarms, …

Call detail records, SS7 signaling, diagnostics, ...

Models of Data Streams

• Signal s[1…n]. n is universe size.

• Implicitly presented.

• Three models:
– Timeseries model: s(1), s(2), …
– Cash Register model: s(j) = s(j) + a(k), with a(k) > 0.
– Turnstile model: s(j) = s(j) + u(k).

• Any other models? Students were curious why we need the Turnstile model.
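The three models differ only in how arrivals update the signal. A minimal sketch, assuming updates arrive as (j, amount) pairs and storing s explicitly in a dictionary purely for illustration (a real streaming algorithm would not keep all of s; the function names are hypothetical):

```python
def timeseries(values):
    """Timeseries model: the i-th arrival is s(i) itself."""
    return {i: v for i, v in enumerate(values, start=1)}


def cash_register(updates):
    """Cash Register model: each update (j, a) adds a > 0 to s(j)."""
    s = {}
    for j, a in updates:
        if a <= 0:
            raise ValueError("cash register updates must be positive")
        s[j] = s.get(j, 0) + a
    return s


def turnstile(updates):
    """Turnstile model: each update (j, u) adds u to s(j); u may be negative."""
    s = {}
    for j, u in updates:
        s[j] = s.get(j, 0) + u
    return s


print(cash_register([(3, 2), (3, 5), (7, 1)]))   # {3: 7, 7: 1}
print(turnstile([(3, 2), (3, -2), (7, 1)]))      # {3: 0, 7: 1}
```

One possible motivation for the Turnstile model: a count of currently active flows per source must go down when a flow ends, which the Cash Register model cannot express.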

IP Network Signals

• Number of bytes (packets) sent by a source IP address during the day.

• Number of flows between a source and a destination IP address during the day.

• Number of active flows per source IP.

• Number of active flows per second.

2^(32) sized one-dimensional array; increment only

2^(64) sized two-dimensional array; aggregate packets.

One dimensional time series.

2^(32) sized one-dimensional array; increment and decrement.

Students mentioned Multidimensional signals.

Models of Data Streams

• Compute functions on s.
– How many distinct numbers in the stream?
– How many items are rare?
– What is the variance of s?
– What is a good histogram for it?
– What are the k largest s[j]’s?
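For reference, several of these functions are easy to compute exactly when s is stored in full. The sketch below assumes s is a sparse dictionary mapping j to s[j], takes "rare" to mean a count of at most a chosen threshold, and takes the variance over all n positions, zeros included. All three are assumptions, and the space used is linear, so this is an offline baseline and not a streaming solution.

```python
import heapq


def distinct_count(s):
    """Number of positions j with s[j] != 0."""
    return sum(1 for v in s.values() if v != 0)


def rare_count(s, threshold=1):
    """Number of positions with 0 < s[j] <= threshold (assumed meaning of 'rare')."""
    return sum(1 for v in s.values() if 0 < v <= threshold)


def variance(s, n):
    """Variance of s over all n positions; absent positions count as 0."""
    mean = sum(s.values()) / n
    mean_sq = sum(v * v for v in s.values()) / n
    return mean_sq - mean * mean


def top_k(s, k):
    """The k (j, s[j]) pairs with the largest values."""
    return heapq.nlargest(k, s.items(), key=lambda kv: kv[1])


s = {1: 3, 2: 1, 5: 4}          # signal over a universe of size n = 8
print(distinct_count(s))        # 3
print(rare_count(s))            # 1
print(variance(s, 8))           # 2.25
print(top_k(s, 2))              # [(5, 4), (1, 3)]
```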

Data Stream Models…

• Desiderata:
– Per-item processing time
– Space stored
– Time for computing functions on s
– Must all be polylog(n, ||s||).
– Why? See the writeup for the polylog(n, t) definition.

Homework 2

• HW2.1: Pick one of the other streaming scenarios and write down:
– Describe the scenario precisely.
– What is the format of the data that is being streamed?
– What are suitable data stream models?
– What are interesting analysis questions?

• E.g.:
– Email logs: headers and contents.
– Webserver logs. Multiple webservers.
– Atmospheric observations through radio telescopes.
– Satellite observations of Earth.

Other applications of data stream models

• One pass algorithms.

• Database monitoring.
– Selectivity estimation.
– Approximate query answering.
– Data quality monitoring.

Project: Stream Join

• We are receiving two data streams, and want to output the join of these two streams: pairs of items which agree on some key.

• Characterize the "relative disorderliness" of the streams: if it is small, then maybe we can do well.

• If we have a constant amount of memory, can we produce an approximation of the best possible result for that amount of memory?

• Implement, and evaluate how it performs in practice.
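One possible baseline for this project is a symmetric hash join that remembers only the most recent items of each stream. The sketch below is an assumption about how such a baseline might look, not a prescribed solution: arrivals are modeled as (side, key, value) triples, and `window` is a hypothetical per-stream memory budget.

```python
from collections import deque


def window_join(arrivals, window):
    """Join two interleaved streams, keeping the last `window` items per stream.

    arrivals: iterable of (side, key, value) with side in {0, 1}.
    Yields (value_from_stream_0, value_from_stream_1) for matching keys.
    """
    order = (deque(), deque())   # arrival order of keys, per stream
    index = ({}, {})             # key -> deque of buffered values, per stream
    for side, key, value in arrivals:
        other = 1 - side
        # Probe the other stream's buffer for items with the same key.
        for match in index[other].get(key, ()):
            yield (value, match) if side == 0 else (match, value)
        # Buffer the new item, then evict the oldest if over budget.
        order[side].append(key)
        index[side].setdefault(key, deque()).append(value)
        if len(order[side]) > window:
            old_key = order[side].popleft()
            bucket = index[side][old_key]
            bucket.popleft()
            if not bucket:
                del index[side][old_key]


arrivals = [(0, "a", 1), (1, "a", 10), (0, "b", 2),
            (0, "c", 3), (1, "b", 20), (1, "a", 11)]
print(list(window_join(arrivals, window=2)))    # [(1, 10), (2, 20)]
print(list(window_join(arrivals, window=10)))   # [(1, 10), (2, 20), (1, 11)]
```

With the small window the late pair (1, 11) is lost because its partner was already evicted, which illustrates how the disorderliness of the streams limits what a fixed amount of memory can produce.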

Scientific Data Streams

• From the February issue of Wired magazine:

– A new oil drill transmits data about its current drilling conditions at 1 Mb/s.

– Scientists are using seismic readings from around the world in order to identify atomic particles passing through the earth.

• What are the characteristics of these kinds of (scientific) data streams?

• What questions arise, and what techniques can we apply to answer them?

• What new and interesting problems do we encounter?

Permutations and Disorder

• Recent interest in counting the number of inversions in sequences (how many pairs i > j have a[i] < a[j]). Many similar questions remain to be explored.

• What about related measures of disorderliness, e.g. max (i - j) for which a[i] < a[j]?

• What is the Longest Increasing Subsequence of the sequence? (using small space & 1 pass)

• Are these problems easier if each value is guaranteed to occur at most once?

• Can we find approximations of these quantities, or prove they are hard to approximate?
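To experiment with these measures, exact offline versions are useful as ground truth. The sketch below uses standard textbook techniques (merge sort for inversions, patience sorting for the longest increasing subsequence, brute force for the maximum displacement). It stores the whole sequence, so it is a reference point under that assumption and not a small-space one-pass algorithm.

```python
from bisect import bisect_left


def count_inversions(a):
    """Number of pairs i > j with a[i] < a[j], via merge sort."""
    def sort(xs):
        if len(xs) <= 1:
            return xs, 0
        mid = len(xs) // 2
        left, inv_left = sort(xs[:mid])
        right, inv_right = sort(xs[mid:])
        merged, inv, i, j = [], inv_left + inv_right, 0, 0
        while i < len(left) and j < len(right):
            if right[j] < left[i]:
                inv += len(left) - i      # right[j] precedes all remaining left items
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        merged += left[i:]
        merged += right[j:]
        return merged, inv
    return sort(list(a))[1]


def max_displacement(a):
    """max (i - j) over pairs i > j with a[i] < a[j]; 0 if the sequence is sorted."""
    return max((i - j for i in range(len(a)) for j in range(i) if a[i] < a[j]),
               default=0)


def lis_length(a):
    """Length of the longest strictly increasing subsequence."""
    tails = []                            # tails[k]: smallest tail of an LIS of length k+1
    for x in a:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


a = [3, 1, 4, 2, 5]
print(count_inversions(a), max_displacement(a), lis_length(a))   # 3 3 3
```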

Frequency Change

• Will later present how to monitor “top 10” items (e.g. Amazon best sellers, by monitoring transactions).

• Recent work: do this with arrival/departure of items.

• Open problem: what are the top changes in frequency?

– Which stocks have biggest rise/fall in value?

– Which internet hosts have biggest relative change in traffic?

• Implement some solutions, run them on massive data.
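An exact baseline helps when evaluating any small-space solution on real data. The sketch below assumes the data is split into two periods, each a list of item identifiers, and ranks items by absolute change in count; relative change would need a different score. The function name and the tie-breaking rule are hypothetical, and the space used is linear in the number of distinct items.

```python
from collections import Counter


def top_changes(before, after, k):
    """The k items whose frequency changed most between two periods."""
    c0, c1 = Counter(before), Counter(after)
    delta = {x: c1[x] - c0[x] for x in c0.keys() | c1.keys()}
    # Largest absolute change first; ties broken by item name for determinism.
    ranked = sorted(delta.items(), key=lambda kv: (-abs(kv[1]), str(kv[0])))
    return ranked[:k]


before = ["a", "a", "b", "c"]
after = ["a", "b", "b", "b", "d"]
print(top_changes(before, after, 2))   # [('b', 2), ('a', -1)]
```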

Homework

• HW2.3: Describe your chosen project for the semester. State the problem. State what you will do.

Left out of the lecture…

Few good terms property…

• In general it may be impossible to solve certain problems in data stream models.
– Lower bound for the missing items puzzle.

• What helps is: few good terms property.

In distributions that occur in nature, there are few huge values, few histogram buckets,…

Wavelets: Haar Wavelets

• a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8]

• w[l] is the dot-product of a with basis vector [l]

• Orthonormal Basis.

[Figure: Haar wavelet decomposition of a[1] … a[8]. Two rows of pairwise terms over (a[2], a[1]), (a[4], a[3]), (a[6], a[5]) and (a[8], a[7]), each divided by 2, with the basis vectors labeled [1] to [8].]

Homework

• HW2.2 Take datasets from multiple disciplines, calculate their Haar wavelet coefficients and plot the energy of top k coefficients as k increases.
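For HW2.2, a minimal sketch of the computation, assuming the input length is a power of two and using the orthonormal normalization (division by √2 at each level). The sign convention and the ordering of the coefficients are assumptions; the function names are hypothetical.

```python
import math


def haar_transform(a):
    """Orthonormal Haar coefficients of a sequence whose length is a power of two."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(x) for x in a]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        sums = [(current[i] + current[i + 1]) / math.sqrt(2) for i in pairs]
        diffs = [(current[i] - current[i + 1]) / math.sqrt(2) for i in pairs]
        details = diffs + details          # coarser levels end up in front
        current = sums
    return current + details               # overall term first, then details


def top_k_energy(coeffs):
    """Fraction of total energy held by the k largest coefficients, for k = 1..n."""
    squares = sorted((c * c for c in coeffs), reverse=True)
    total = sum(squares)
    fractions, running = [], 0.0
    for sq in squares:
        running += sq
        fractions.append(running / total if total else 0.0)
    return fractions


w = haar_transform([1, 2, 3, 4, 5, 6, 7, 8])
for k, frac in enumerate(top_k_energy(w), start=1):
    print(k, round(frac, 4))
```

Because the basis is orthonormal, the sum of squared coefficients equals the sum of squared inputs, so the last fraction is 1. Plotting the fractions against k gives the curve the homework asks for.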
