carnegie mellon university complex queries in distributed publish- subscribe systems ashwin r....

Carnegie Mellon University

Complex queries in distributed publish-subscribe systems

Ashwin R. Bharambe,

Justin Weisz and

Srinivasan Seshan


Outline

Publish-subscribe systems Subscription Languages Routing Protocols

MERCURY architecture Preliminary evaluation

Scalability Performance

Future work


Publish-subscribe systems

Challenges Subscription language - how to express “interests”? Routing mechanism - how is content routed and where

is it matched?

MatcherS P

?

PublicationSubscription

MatchedPublication


Subscription language

How to express interests? Channels or Subjects All content attributes

Operators? Exact matches Range queries Regular expressions

Subject = MATH

Name = ‘A’ & Age < 25

Price = 350$

300$ ≤ Price ≤ 350$

Name ~= Ash[win]*B*


MERCURY subscription language Subscription =

list of (type, attribute, rel-op, value) Can implement boolean expressions

(AND/ORs)

( int, price, LESS_THAN, 300 ) ( string, name, EQUALS, “Opensig” )

Publication = list of (type, attribute, value)


Example: Virtual reality

User

x ≥ 50x ≤ 150y ≥ 150y ≤ 250

Interests

x 100y 200

Events

(100,200)

(150,150)

(50,250)

Arena

Virtual World


Routing mechanism

Centralized? Easy to make publications “meet” subscriptions Single point of failure – not robust!

Distributed? Where are subscriptions stored? How do publications “meet” subscriptions? Broadcast-based solutions not scalable


Distributed routing - goals

Scalability is a key goal Flooding anything is bad, bad, bad… System should not have hot-spot in terms of:

Computational load – matching ) Subscriptions should be evenly distributed

Number of packets routed or received ) Publications should be evenly distributed

Yet – we should have low delivery delays!!


Previous work

Systems like Scribe use DHTs for scalability Why can’t we ? Exact matches vs. Range queries!

int x 1 int x 10hash

Subscription

int x = 9hash

Publication

?

How about generating 10 subscriptions ? Too many subscriptions Works for discrete-valued attributes only


Attribute Hubs

Divide range of an attribute into bins Each node responsible for range of attribute values

Hprice

[240, 320)

[80, 160)

[160, 240)[0, 80)

Attribute Hub

Hub-nodes connected through a circular overlay

Circle only for connectivity One hub per attribute Routing algo

compare value in content to my range


Routing illustrated

Send subscription to any one attribute hub Send publications to all attribute hubs

[0, 80)

[210, 320)

50 ≤ x ≤ 150150 ≤ y ≤ 250

x 100y 200

Hx

[240, 320)

[80, 160)

[160, 240)

Hy

[0, 105)

[105, 210)

Subscription

Publication

Rendezvous point


Efficient routing

Reduce number of hops Each hub-node maintains “small” number of

pointers to distant parts of the hub How to maintain these pointers?

Send ACKs for publication receipts Various caching policies determine the structure

of the pointer table e.g., LRU, Uniform-spacing, Exponential-spacing


Routing illustrated

[0, 80)

[240, 320)

[80, 160)

[160, 240)

Hx

x 100y 200

Publication

[105, 210)Hy

[210, 320)

[0, 105)

ACK

AC

K


Evaluation

Workload Experimental setup Metrics


Workload

One of our target apps multi-player games Model

Virtual world as square Subscriptions as rectangles around current

positions


Experimental setup

Player movements simulated using mobility models from ns-2

Two hubs – x and y co-ordinates Half the nodes in each hub Uniform partition of range


Metrics

Scalability metric load Number of publications routed by a node Averaged over time

Performance metric publication delivery delay Time between sending of a publication and its

receipt by all subscribers Averaged over all subscribers of a publication Averaged over all publications


Results: scalability

0

0.05

0.1

0.15

0.2

0.25

0.3

0.4 0.6 0.8 1 1.2 1.4 1.6Normalized load

Fra

ctio

n o

f n

od

es

100 nodes

50 nodes

20 nodes Ideal graph: delta function

Observed variation: ±12%


Results: performance

0

5

10

15

20

25

20 nodes 50 nodes 72 nodes 100nodes

200nodes

Dela

ys - in

secs

single

n-lru

n-uniform

nocache

graph-dia

Without caching: linear scaling

Caching reduces delays to near optimal

Workload effects ?

47.25

Cache size: log(n)


Conclusions

Expressive subscription language Decentralized architecture Scalability

Avoids flooding of subscriptions and publications – reduces network traffic

Distributes publications and subscriptions throughout the network – prevents swamping


Future Work

Load balancing Sensitive to data value

distribution Adapt ranges dynamically

according to the distribution

Affects pointer management, caching, etc.

x

Pr(

X=

x)


Future Work

Perform sensitivity analysis for different kinds of workloads

Generic API for building applications on top of MERCURY To be released soon

Build a full-fledged distributed Quake-II

carnegie mellon university complex queries in distributed publish- subscribe systems ashwin r....

Documents

range slide

value slide

scalable slide

hash subscription int

attribute routing algo

attribute hub hubnodes

interests x

exponentialspacing slide