Carnegie Mellon Databases. Finding Frequent Items in Distributed Data Streams. Amit Manjhi, V. Shkapenyuk, K. Dhamdhere, C. Olston. Carnegie Mellon University. ICDE 2005.


Page 1: Finding Frequent Items in Distributed Data Streams

Amit Manjhi

V. Shkapenyuk, K. Dhamdhere, C. Olston

Carnegie Mellon University

ICDE 2005

Page 2: Usage Monitoring in Large Networks

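[Figure: packets from users A, B, and C arriving over time from the Internet at several monitoring machines.]

Find bandwidth hogs: users consuming a lot of bandwidth across all machines, along with their bandwidth usage.

Mapping: each packet is an item; each machine is a node monitoring a stream.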

Page 3: Other Applications of the Same Problem

Find globally frequent items and their frequencies.

Items                             Nodes        Applications
Accesses to web pages             Web servers  Keep tabs on popular web pages
Packets to specific destinations  Machines     Detect DDoS attacks
Signatures of different worms     Routers      Detect prevalent worms

Page 4: Simple approach may not be scalable

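[Figure: per-item frequency histograms from Node 1, Node 2, ..., Node m are added together into a single "Sum" histogram (axes: items vs. frequencies), with a 1% threshold line marked on it.]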

Not scalable, particularly for large ‘m’
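The simple approach can be sketched in a few lines. This is only an illustration under assumed toy data: the function name `globally_frequent` and the three example histograms are hypothetical, not taken from the slides. Every node ships its complete histogram, so the central site receives one counter per (node, item) pair, which is what becomes costly as m grows.

```python
from collections import Counter

def globally_frequent(node_counts, s):
    """Sum per-node counters exactly and keep items above fraction s of the total."""
    total = Counter()
    for counts in node_counts:          # one full histogram shipped per node
        total.update(counts)
    n = sum(total.values())
    return {item: c for item, c in total.items() if c > s * n}

# Hypothetical toy data: three machines observing users A, B, and C
nodes = [Counter(A=1, B=6), Counter(B=5, C=1), Counter(A=1, B=4, C=2)]
heavy = globally_frequent(nodes, s=0.5)

# Communication cost of this approach: every counter from every node
counters_shipped = sum(len(c) for c in nodes)
```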

Page 5: Hierarchical approach alleviates load on the root

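[Figure: leaf nodes M1, M2, ..., Mm feed a tree rooted at R. Histograms are combined using in-network aggregation, and answers (items above the 1% threshold) emerge at the root.]

Problem: excessive communication due to long tails.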

Page 6: For acceptable communication, need approximation

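[Figure: the same tree of leaf nodes M1, M2, ..., Mm and root R, combining histograms using in-network aggregation. The root now produces approximate answers, and X marks on the tree indicate candidate places to approximate.]

Where to introduce approximation?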

Page 7: Outline

• Motivation
• Problem statement
• Drawback of existing solution
• Our solutions
• Evaluation
• Summary

Page 8: Formal Problem Statement

[Figure: leaf nodes M1, M2, ..., Mm in a tree rooted at R; approximate answers are produced at the root.]

• Find frequencies of all items whose frequency exceeds s% of the total
• Error tolerance: ε% of the total, ε ≪ s
• Example: s = 1, ε = 0.1
• Periodic answers (every “epoch” seconds)

Goal: minimize communication

Page 9: Simple solution: Early drop

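[Figure: leaf nodes M1, M2, ..., Mm in a tree rooted at R, annotated bottom to top with the steps below.]

1. Collect and decrement data [Manku, Motwani VLDB ’02]
2. Combine histograms
3. Obtain approximate answers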
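A minimal sketch of the early-drop idea, under stated assumptions: each leaf counts its items exactly for one epoch, then subtracts ε times its own stream size from every counter and drops whatever falls to zero or below. This is a simplified batch stand-in, not the streaming algorithm of Manku and Motwani, and the names `decrement` and `early_drop` as well as the toy data are hypothetical. Because the per-leaf budgets sum to ε times the global total, no surviving estimate undercounts by more than that.

```python
from collections import Counter

def decrement(counts, amount):
    """Subtract `amount` from every counter; drop counters that reach zero or below."""
    return Counter({k: v - amount for k, v in counts.items() if v - amount > 0})

def early_drop(leaf_counts, eps):
    """Each leaf prunes using only its local stream size; the root sums the survivors."""
    root = Counter()
    for counts in leaf_counts:
        n_local = sum(counts.values())
        root.update(decrement(counts, eps * n_local))
    return root

# Hypothetical toy data: two leaves with 8 items each, eps = 25%
leaves = [Counter(a=6, b=1, c=1), Counter(b=7, d=1)]
estimate = early_drop(leaves, eps=0.25)
```

In this toy run item b is pruned entirely at the first leaf even though it is the most frequent item overall, which shows how each leaf decides with local information only.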

Page 10: Drawback of Early Drop

[Figure: worked example with ε = 0.3 on a tree with nodes labeled M1, M2, M3, I1, I2, I3, and R, and legend entries A, B, and C, shown before and after the local decrements. The per-node counts are not recoverable from the extracted text.]

Drawback: locally frequent items reach the root.

Reason: decrements are based on local decisions.

Page 11: Solution space: Setting precision gradient

[Figure: precision (from exact to maximum possible error ε) plotted against height (leaf to root). Early drop and Late drop mark the two extremes; the region between them is marked with question marks.]

Need to balance two competing pressures:
1. Early reduction of data
2. Informed reduction of data

Page 12: Optimal precision gradient depends on the application

Optimal precision gradient depends on the objective the application wants to achieve

We study two objectives:

1. Minimize total load on root node – conserve resources for other tasks

2. Minimize load on maximally loaded link – maximize ability to scale to large datasets

Load: number of counters traversing a link
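The two load metrics can be made concrete with a small simulation. The sketch below rests on several assumptions that are not in the slides: a complete tree, a gradient given as a list of cumulative error fractions (one per sending level, leaves first, children of the root last), and a pruning rule in which a node subtracts the newly allowed error fraction times the number of items in its subtree. The names `prune` and `run_tree` and the toy data are hypothetical.

```python
from collections import Counter

def prune(counts, amount):
    """Subtract `amount` from every counter; drop counters that reach zero or below."""
    return Counter({k: v - amount for k, v in counts.items() if v - amount > 0})

def run_tree(leaf_streams, fanout, gradient):
    """Aggregate leaf histograms up a complete tree.

    Returns (root_summary, root_load, max_link_load), where load is the
    number of counters traversing a link.
    """
    # Each entry is (summary, number of items seen in the subtree)
    level = [(Counter(s), sum(s.values())) for s in leaf_streams]
    prev_eps, root_load, max_link = 0.0, 0, 0
    for eps in gradient:
        # Nodes at this level prune by the extra error they are allowed, then send
        sent = [(prune(c, (eps - prev_eps) * n), n) for c, n in level]
        max_link = max(max_link, max(len(c) for c, _ in sent))
        root_load = sum(len(c) for c, _ in sent)   # final value: last level sends to root
        # Group `fanout` siblings under each parent and merge their summaries
        level = []
        for j in range(0, len(sent), fanout):
            group = sent[j:j + fanout]
            merged = Counter()
            for c, _ in group:
                merged.update(c)
            level.append((merged, sum(n for _, n in group)))
        prev_eps = eps
    assert len(level) == 1, "leaf count must equal fanout ** len(gradient)"
    return level[0][0], root_load, max_link

# Hypothetical toy data: 4 leaves, fan-out 2, eps = 25%
leaves = [Counter(a=6, b=2), Counter(a=5, c=3), Counter(a=6, d=2), Counter(a=5, e=3)]
early = run_tree(leaves, 2, [0.25, 0.25])   # all pruning at the leaves
late = run_tree(leaves, 2, [0.0, 0.25])     # all pruning at the children of the root
```

On this toy input the early-drop gradient delivers 4 counters to the root and the late gradient delivers 2, while the maximum link load is 2 in both cases. The numbers are specific to the toy data and say nothing about the workloads in the talk.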

Page 13: Objective 1: Minimize load on root

Simple; all decrements done by children of root node

Intuition: delay decrementing until most information about distribution is available

[Figure: precision (from exact to maximum possible error ε) plotted against height (leaf to root), comparing Early drop, Late drop, and MinRootLoad.]

Page 14: Objective 2: Minimize maximum link load

For different inputs, different precision gradients are optimal

Find the “precision gradient” that minimizes the maximum load on any link, in the worst-case across all possible inputs

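[Figure: the set of worst-case inputs ℐ_WC drawn as a subset of the set of all inputs ℐ.]

For any input I ∈ ℐ − ℐ_WC, there exists an input I′ ∈ ℐ_WC whose maximum load is no lower than that of I, for any precision gradient.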

Page 15: Properties of ℐ_WC

1. No item occurrence common to any two streams

2. All items in a stream occur with equal frequency

3. The same number of items occur in each input stream; the same number of distinct items occur in each input stream
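The three properties are easy to satisfy constructively. The generator below is a hypothetical helper for building such inputs in experiments, not something from the talk; the function name, the item-naming scheme, and the parameters are all assumptions.

```python
def worst_case_input(num_streams, distinct_per_stream, freq_per_item):
    """Build per-stream histograms that satisfy the three listed properties."""
    streams = []
    for j in range(num_streams):
        # Item names embed the stream index, so no item appears in two streams
        streams.append(
            {f"s{j}_item{k}": freq_per_item for k in range(distinct_per_stream)}
        )
    return streams

streams = worst_case_input(num_streams=3, distinct_per_stream=4, freq_per_item=5)
```

Each stream here holds 4 distinct items occurring 5 times each, so all streams have the same size and the same number of distinct items, and no item is shared.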

Page 16: Minimize maximum link load

To minimize the maximum load for any input in ℐ_WC, set ε_i according to the closed-form expression given in the paper (proof in paper). [Formula: an expression in i, d, and l; its exact form is not recoverable from the extracted text.]

Intuition: gradual gradient.

[Figure: precision (from exact to maximum possible error ε) plotted against height (leaf to root), comparing Early drop, Late drop, and MinMaxLoad_WC.]

Page 17: Non-worst-case inputs

Real data is unlikely to exhibit worst-case characteristics, so the gradient that is optimal for the worst case may not perform well in practice.

Hybrid solution: MinMaxLoad_NWC

• Commonality parameter: measures commonality between streams, estimated by sampling data
• Commonality: the extent to which locally frequent items are also globally frequent

[Figure: a spectrum from MinMaxLoad_WC (no commonality, parameter = 0) to Early drop (maximum commonality, parameter = 1).]
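The slide fixes only the two endpoints of the hybrid and does not state how intermediate commonality values are handled. One plausible reading, shown purely as an assumption, is a linear blend between the worst-case gradient and early drop. In this sketch a gradient is a list of cumulative error fractions per level (leaves first), early drop is represented as ε at every level, and the function name `blend_gradient` and the placeholder worst-case gradient are hypothetical.

```python
def blend_gradient(wc_gradient, eps, commonality):
    """Assumed linear blend: commonality 0 gives wc_gradient, 1 gives early drop."""
    if not 0.0 <= commonality <= 1.0:
        raise ValueError("commonality must lie in [0, 1]")
    # Early drop spends the whole error budget eps at the leaves,
    # so its cumulative gradient equals eps at every level.
    return [(1 - commonality) * g + commonality * eps for g in wc_gradient]

# Placeholder gradual gradient, NOT the paper's MinMaxLoad_WC formula
wc = [0.0, 0.5, 1.0]
halfway = blend_gradient(wc, eps=1.0, commonality=0.5)
```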

Page 18: Outline

• Motivation
• Problem statement
• Drawback of existing solution
• Our solutions: MinRootLoad, MinMaxLoad_WC, MinMaxLoad_NWC
• Evaluation
  • Workloads
  • Simulation results for the two metrics
• Summary

Page 19: Workloads

• Internet2 traffic logs (5-minute epochs)
  • Find hosts receiving a large number of packets; can be used as evidence of a DoS attack
• Auction and bulletin-board sites, run in a distributed manner (15-minute epochs)
  • Find frequent database queries (usage monitoring)
• Topology used:
  • 216 leaf nodes, fan-out = 6, 3 levels
• s = 1%, ε = 0.1%
• Commonality parameter: Bulletin-board (0.57), Internet2 (0.68), Auction (0.84)

Page 20: Load on root node

Page 21: Maximum load on any link

Page 22: Related Work

• Most prior work considers only the single-stream case, not a distributed setting, e.g., [Manku, Motwani VLDB ’02; Demaine et al. ESA ’03; Karp et al. TODS ’03; Estan, Varghese SIGCOMM ’02]
• Top-k monitoring [Babcock, Olston SIGMOD ’03]: did not study precision gradient setting in a hierarchy
• Most closely related work [Greenwald, Khanna PODS ’04]: addresses a more general problem; does not find the optimal gradient

Page 23: Summary

• Find frequent items in distributed streams; use hierarchical topology
• Gradual precision gradient minimizes communication
• Theoretical result: proof of optimality
• Empirical result, compared to existing solutions:
  • Factor of 5 improvement in load on the root
  • Factor of 2 improvement in max. load on any link

Page 24: Questions?

Thank You!

Proofs, details found at:

http://www.cs.cmu.edu/~manjhi/

Page 25: Results in detail

Workload   Total        Unique  Above 1%  Above 0.9%  Above 0.1%
Internet2  23 million   71K     3         5           139
Auction    2.2 million  140K    12        12          32
BBoard     1.5 million  113K    11        11          44
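To put the percentages in absolute terms, the short calculation below converts s = 1% and ε = 0.1% into item counts for each workload. The totals are the rounded figures from this slide, so the results are approximate; the helper name `thresholds` and its per-mille parameters are hypothetical.

```python
# Rounded totals as listed on the slide
totals = {"Internet2": 23_000_000, "Auction": 2_200_000, "BBoard": 1_500_000}

def thresholds(total, s_per_mille=10, eps_per_mille=1):
    """Return (support count, error budget); defaults encode s = 1% and eps = 0.1%."""
    # Integer arithmetic in per-mille units avoids floating-point rounding
    return total * s_per_mille // 1000, total * eps_per_mille // 1000

absolute = {name: thresholds(n) for name, n in totals.items()}
```

For Internet2, for example, an item must account for more than about 230,000 of the 23 million total to count as frequent, and reported frequencies may be off by up to about 23,000.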

Page 26: Worst Case

• Extended set of inputs:
  • Items with fractional frequencies
  • Items with fractional weights
• w(I): maximum load on a link for input instance I
• For any input I ∈ ℐ − ℐ_WC, there exists I′ ∈ ℐ_WC such that w(I′) ≥ w(I); ℐ_WC is characterized next