peer-to-peer result dissemination in high-volume data filtering shariq rizvi and paul burstein cs...

20
Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Upload: louisa-parks

Post on 19-Jan-2016

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Peer-to-Peer Result Dissemination in High-Volume Data Filtering

Shariq Rizvi and Paul BursteinCS 294-4: Peer-to-Peer Systems

Page 2: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

P2P: A Delivery Infrastructure Overcast

Application-level multicasting Build data distribution trees Adapt to changing network conditions Inner nodes heavily loaded

SplitStream Load-balancing across all peers Split content into redundant streams Redundancy offers resilience to failures

Page 3: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Our Focus

Dynamic Application-level Multicast Single source Multiple receivers High-volume data flow (“document streams”) Dynamic: very large number of “groups” IP multicast is bad

Rigid to deploy Dynamic groups?

“Intelligent” trees on the fly?

Page 4: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Organization

Motivation Data filtering YFilter@Berkeley Distributed YFilter

Dynamic multicast Unstructured overlay network Metrics Experiments

Summary & future work

Page 5: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Data Filtering

Pub-sub systems XML: the “wire format” for data

Web services RDF Site Summary (RSS) data feeds

News Stock ticks

Personalized content delivery Message brokers

Filtering Transformation Delivery

Page 6: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

YFilter: A Data Filtering Engine

Picture blatantly stolen from “Path Sharing and Predicate Evaluation for High-Performance XML Filtering”, Diao et al., TODS 2003

Page 7: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

YFilter: Some Numbers

Incoming document flow – 10-20 per second Document sizes – 20KB Subscribers – Lots! Processing bottleneck

50ms per document with 100,000 simple XML path queries Dissemination bottleneck

Thousands of recepients per document – bandwidth needed ~ GbPS

Solution: Distributed filtering

Page 8: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Content-Based Routing

Embed filtering logic into the network “XML routers”

Overlay topologies (e.g. mesh) Parent routers hold disjunction of child routers’

queries Streams filtered on the fly Problems

Low network economy – scalability? Query aggregation challenges

Page 9: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Distributed Hierarchical Filtering

Filter CoreClients Clients

Recurring theme: dynamic multicast

Page 10: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Peer-to-Peer Result Dissemination

Source

Clients

Page 11: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Application-Level Dynamic Multicast Each document has a different receiver list Exploit “peers” for dissemination Build trees on the fly

Pass documents wrapped with receiver identities Each peer contributes a fanout

Possibly high delivery delays Heuristic: Try to minimize tree height

Application-level approach: high traffic Heuristic: Exploit geographical distribution of clients at

source

Page 12: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Possible Evaluation Metrics

Delivery delay Network economy

Document loss Out-of-order delivery

Page 13: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Experimental Setup

PlanetLab testbed Over 200 nodes 1-10 clients per node

Document Size:20KB

Generation Rate: 1document/second

Query Selectivity: 10%

Filter Fanout: 2 Filter Host:

planetlab1.lcs.mit.edu

Client Fanout: 1 - 20% - Modem 2 - 40% - DSL 4 - 40% - Cable

Page 14: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Result 1: Distribution of Delays Delivery Delay Distribution - 200 Clients

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

010

0020

0030

0040

0050

0060

0070

0080

00

Delivery Delay (ms)

% C

lien

ts

Page 15: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Result 2: ScalabilityDelivery Delay Distribution

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

010

0020

0030

0040

0050

0060

0070

0080

0090

00

1000

0

1100

0

1200

0

1300

0

1400

0

Delivery Delay (ms)

% C

lien

ts

200 Clients

400 Clients

1000 Clients

2000 Clients

Page 16: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Result 3: Bandwidth RequirementsOutgoing Bandwidth

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5Outgoing Bandwidth (KBps)

% C

lien

ts

200 Clients

400 Clients

1000 Clients

2000 Clients

Page 17: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Exploiting Geographical Distribution of Clients

Page 18: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Result 4: With the optimizationRegional Optimization

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000

Delivery Delay (ms)

% C

lient

s

2000 Clients

2000 Clients OP

Page 19: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Summary

Current filtering engines – processing and bandwidth bottlenecks

A possible scheme for distributed filtering Recurring theme: highly dynamic multicast

Application-level multicast Peer-to-peer delivery Trees construction on the fly

PlanetLab is crazy

Page 20: Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems

Future Work

Reliable, dedicated delivery nodes Exploiting query similarity for discovering

multicast groups