towards efficient stream processing in the wide area

40
Towards Efficient Stream Processing in the Wide Area Matvey Arye Siddhartha Sen, Ariel Rabkin, Michael J. Freedman Princeton University

Upload: quanda

Post on 24-Feb-2016

21 views

Category:

Documents


0 download

DESCRIPTION

Towards Efficient Stream Processing in the Wide Area. Matvey A rye Siddhartha Sen , Ariel Rabkin , Michael J. Freedman Princeton University. Our Problem Domain. Also Our Problem Domain. Use Cases. Network Monitoring Internet Service Monitoring Military Intelligence Smart Grid - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Towards Efficient Stream Processing in the Wide Area

Towards Efficient Stream Processing in the Wide Area

Matvey Arye Siddhartha Sen, Ariel Rabkin,

Michael J. Freedman

Princeton University

Page 2: Towards Efficient Stream Processing in the Wide Area

Our Problem Domain

Page 3: Towards Efficient Stream Processing in the Wide Area

Also Our Problem Domain

Page 4: Towards Efficient Stream Processing in the Wide Area

Use Cases

• Network Monitoring• Internet Service Monitoring• Military Intelligence• Smart Grid• Environmental Sensing• Internet of Things

Page 5: Towards Efficient Stream Processing in the Wide Area

The World of Analytical ProcessingReal-Time Historical

Streaming OLAP Databases

Page 6: Towards Efficient Stream Processing in the Wide Area

The World of Analytical ProcessingReal-Time Historical

Streaming OLAP Databases

Simpler queriesStanding queries

Real-time answers

Borealis/StreambaseSystem-S, Storm

High ingest timeFast query time

Oracle, SAP, IBM

Single Datacenter

Page 7: Towards Efficient Stream Processing in the Wide Area

Data Transfer

Page 8: Towards Efficient Stream Processing in the Wide Area

[Above the Clouds, Armbrust et. al.]

Trends in Cost/Performance2003-2008

Series1

CPU(16x)Storage(10x)Bandwidth(2.7x)

Page 9: Towards Efficient Stream Processing in the Wide Area

Aggregate At Local Datacenters

Page 10: Towards Efficient Stream Processing in the Wide Area

The World of Analytical ProcessingReal-Time Historical

Streaming OLAP Databases

Single Datacenter

Wide area JetStream

JetStream = Real-time + Historical + Wide Area

Page 11: Towards Efficient Stream Processing in the Wide Area

Large Caveat

• Preliminary work

• We want feedback and suggestions

Page 12: Towards Efficient Stream Processing in the Wide Area

Challenges

• Query placement and scheduling• Approximation of answers• Supporting User Defined Functions (UDFs)• Queries on historical data• Adaptation to network changes• Handling node failures

Page 13: Towards Efficient Stream Processing in the Wide Area

Motivating Example

• “Top-K domains served by a CDN”– Recall CDN is globally distributed– Services many domains

• Main Challenge: Minimize backhaul of data

Page 14: Towards Efficient Stream Processing in the Wide Area

How Is the Query Specified

Union Count Sort Limit

Page 15: Towards Efficient Stream Processing in the Wide Area

Problems

Single aggregation point

Runs on a single node

Union Count Sort Limit

Page 16: Towards Efficient Stream Processing in the Wide Area

DC3

DC1

Aggregate at local DC

CountPartial

DC2CountPartial

Less Data

Union Count Sort Limit

Page 17: Towards Efficient Stream Processing in the Wide Area

Count Partials

CountPartial Union Count

(Google,1)(Google,1)(Google,1)

(Google,1)

(Google,1)

(Google,5)

Page 18: Towards Efficient Stream Processing in the Wide Area

DC3Non-Distributed Computation

DC1

DC2

Union Count Sort Limit

Page 19: Towards Efficient Stream Processing in the Wide Area

DC3Split Count

DC1

DC2

Union Sort Limit

CountA-H

CountI-M

CountN-Z

Page 20: Towards Efficient Stream Processing in the Wide Area

DC3Split Union

DC1

DC2

Sort Limit

CountA-H

CountI-M

CountN-Z

Load Bal.

Load Bal.

Page 21: Towards Efficient Stream Processing in the Wide Area

DC3Do Partial Sort

DC1

DC2

Sort Limit

CountA-H

CountI-M

CountN-Z

Load Bal.

Load Bal.

SortPartial

SortPartial

SortPartial

Page 22: Towards Efficient Stream Processing in the Wide Area

DC3Push Limit Back

Load Bal.

CountA-H

SortPartial Limit

CountI-M

CountN-Z

Load Bal.

Sort Limit

DC1

DC2

SortPartial

SortPartial

Limit

Limit

Page 23: Towards Efficient Stream Processing in the Wide Area

DC3Single Host

Distributed Version

Load Bal.

CountA-H

SortPartial Limit

CountI-M

CountN-Z

Load Bal.

Sort Limit

DC1

DC2

SortPartial

SortPartial

Limit

Limit

Page 24: Towards Efficient Stream Processing in the Wide Area

What Is New

• Previous streaming systems – User guided transformations (System-S, Storm)– Simple transforms (Aurora)

• JetStream– More complex transforms – Transformation is network aware– Annotations for user defined functions

Page 25: Towards Efficient Stream Processing in the Wide Area

Joint Problems

• Transformations– Choosing which ones

• Placement– Network constrained– Heterogeneous nodes– Resource availability

• Decision has to be made at run-time

Page 26: Towards Efficient Stream Processing in the Wide Area

Tackling the Joint Problems

• Using heuristics

• Split into increasingly more local decisions– Global decisions are coarse grained• Example: Assign operators to DCs

– Localized decisions• Operate only on local part of subgraph• Have more current view of available resources• Do not affect other parts of of query graph placement

Page 27: Towards Efficient Stream Processing in the Wide Area

DC3

DC1

Bottlenecks Still Possible

CountPartial

DC2CountPartial

Possible Bottleneck

Union Count Sort Limit

Use Approximations when necessary

Page 28: Towards Efficient Stream Processing in the Wide Area

DC3

DC1

Adjusting Amount of ApproximationAs a reaction to network dynamism

CountPartial

DC2CountPartial

If bottleneck goes away, return to exact answers

Union Count Sort Limit

Page 29: Towards Efficient Stream Processing in the Wide Area

Approximation Challenges

• How to quantify error for approximations?– Uniform across approximation methods– Easy to understand– Integrates well with metrics for source/node failures

• How do we allow UDF approximation algorithms– Which exact operators can they replace– Quantifying the tradeoffs– Placement & Scheduling

Page 30: Towards Efficient Stream Processing in the Wide Area

DC3

DC1

Approximation Composition

CountPartial

DC2CountPartial

If we approximate count, how does that error affect sort & final answer?

Union Count Sort Limit

Error=eError=?

Page 31: Towards Efficient Stream Processing in the Wide Area

DC3

DC1

Approximations in Uneven Networks

CountPartial

DC2CountPartial

Do we need to approximate link DC1-DC3 if we approximate link DC2-DC3?

Union Count Sort Limit

Low Bandwidth Link; Needs Approximation

High Bandwidth Link

Page 32: Towards Efficient Stream Processing in the Wide Area

Discovering data trends?

• How has top-k changed over past hour?

• Current streaming systems don’t answer this– Except by using centralized DBs.

• JetStream proposes using storage at the edges

Page 33: Towards Efficient Stream Processing in the Wide Area

Hypercube Data Structure

Google Yahoo<5Kb (10, 5ms) (100,20ms)50Kb-1Mb

(0, 0ms) (1, 4ms)

>1Mb (5,10ms (5, 30ms)

Minute 1 … 60

Page 34: Towards Efficient Stream Processing in the Wide Area

Hypercube Data Structure

Day

Hour

Minute

1 … 31

1 … 24

1 … 60

All

01 … 12Month

Page 35: Towards Efficient Stream Processing in the Wide Area

Hypercube Data Structure

Day

Hour

Minute

1 … 31

1 … 24

1 … 60

All

01 … 12Month

Aggr

egat

e

Google Yahoo

<5Kb (90, 9ms) (500,20ms)

50Kb-1Mb

(0, 0ms) (5, 9ms)

>1Mb (5,10ms (10, 30ms)

Page 36: Towards Efficient Stream Processing in the Wide Area

Query: “Last Hour and a half”(without materializing intermediate nodes)

Day

Hour

Minute

1 … 31

1 …2

1 … 60

All

01 … 12Month

1 …30…

Page 37: Towards Efficient Stream Processing in the Wide Area

Query: “Last Hour and a half”by materializing intermediate nodes

Day

Hour

Minute

1 … 31

1 …2

1 … 60

All

01 … 12Month

1 …30…

Page 38: Towards Efficient Stream Processing in the Wide Area

Historical Queries

• Hypercubes have been used before– In the database literature

• What’s Novel– Storage at the edges (and in the network)– Time hierarchy

Page 39: Towards Efficient Stream Processing in the Wide Area

Challenges we talked about

• Query placement and scheduling• Approximation of answers• Supporting User Defined Functions (UDFs)• Queries on historical data• Adaptation to network changes• Handling node failures

Page 40: Towards Efficient Stream Processing in the Wide Area

Conclusion

JetStream Explores…+ Stream Processing+ Historical Data / Trend Analysis+ Wide Area

[email protected]