Continuously Adaptive Processing of Data and Query Streams
Michael Franklin, UC Berkeley
April 2002
Joint work with Joe Hellerstein and the Berkeley DB Group.
Slide 2: Overview
- Autonomic Computing is a technique for making computer systems self-managing.
- But the real issue is supporting autonomic organizations.
- Organizations need information in a timely and tailored manner: workflows, business flows, news, business intelligence, sensor input, ...
- The information infrastructure is the "nervous system" of the organization.
- Adaptive, responsive data management systems are essential.
Slide 3: Opportunities for Autonomic Data Management
- The functions currently expected of the DBA must be automated (see next talk): data placement, index selection, caching, pre-fetching, "freshening", ...
- We are pursuing an approach based on User Profiles.
- The query processing engine itself must be made "autonomic". This is the focus of today's talk.
- But the issues are related: profiles are a type of standing query.
Slide 4: Example 1: "Real-Time Business"
- Event-driven processing in B2B and enterprise apps: supply-chain, CRM, trade reconciliation, order processing, etc.
- (Quasi) real-time flow of events and data.
- Must manage these flows to drive business processes.
- Mine the flows to create and adjust business rules.
- Can also "tap into" the flows for on-line analysis.
Slide 5: Example 2: Information Dissemination
[Figure: data flows from Data Sources, through filtering against User Profiles, to Users as Filtered Data.]
- Document creation or a crawler initiates the flow of data towards users.
- Users and the system initiate a flow of profiles back towards the data.
Slide 6: Example 3: Sensor Nets
- Tiny devices measure the environment: Berkeley "motes", Smart Dust, Smart Tags, ...
- A major research thrust at Berkeley. Apps: transportation, seismic, energy, ...
- The devices form dynamic ad hoc networks, and aggregate and communicate streams of values.
- Sensors are the stimulus receptors of the autonomic organization.
Slide 7: Common Features
- Centrality of dataflow:
  - The architecture is focused on data movement: moving streams of data through code in a network.
  - Requires intelligent, low-overhead routing.
- Volatility of the environment:
  - Dynamic resources and topology; partial failures.
  - Long-running (never-ending?) tasks.
  - Potential for user interaction during the flow.
  - Large scale: users, data, resources, ...
  - Requires adaptivity.
Slide 8: Review: Distributed Query Processing

    SELECT eid, ename, title
    FROM   Emp, Proj, Assign
    WHERE  Emp.eid = Assign.eid
      AND  Proj.pid = Assign.pid
      AND  Emp.loc <> Proj.loc

- The system handles query plan generation and optimization, and ensures correct execution.
- Originally conceived for corporate networks.
(Figure ©1998 Ozsu and Valduriez)
Slide 9: Unpredictability: Wide-Area + Wrapped Sources
- Sources may be unreachable or slow to respond.
- Data statistics and cost estimates may be unavailable or unreliable.
- There is no view into the sources' internal processing and structure.
- Traditional, static query processing approaches cannot cope with such problems at run-time.
Slide 10: Unpredictability: User-in-the-Loop
- Batch processing is inappropriate for many apps: feedback must be provided to the user as quickly as possible.
- Data access becomes a cooperative, iterative process: the user may correct, refine, redirect, or change the query.
Slide 11: Unpredictability: Mobility & Data Streams
- Mobility: location-centric queries; moving endpoints change data staging needs.
- Data streams/sensors: varying data arrival rates; adapting resolutions; push vs. pull.
- Queries can be long-lived and streams may be infinite, so a static plan is doomed to failure.
Slide 12: Adaptive Query Processing
- Existing systems are adaptive, but at a slow rate:
  - Collect stats.
  - Compile and optimize the query.
  - Eventually collect stats again, or change the schema.
  - Re-compile and re-optimize if necessary.
- [Survey: Hellerstein, Franklin, et al., IEEE Data Engineering Bulletin, 2000]
[Adaptivity-spectrum diagram: current DBMS = static plans.]
Slide 13: Adaptive Query Processing
- Goal: allow adjustments for runtime conditions.
- Basic idea: leave "options" in the compiled plan.
- At runtime, bind the options based on observed state: available memory, load, cache contents, etc.
- Once bound, the plan is followed for the entire query.
- Dynamic, Parametric, Competitive, ... plans [HP88, GW89, IN+92, GC94, AC+96, AZ96, LP97]
[Adaptivity-spectrum diagram: current DBMS (static plans), then late binding.]
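The late-binding idea on this slide can be sketched in a few lines: the compiled plan carries two join "options", and one is bound at startup from observed state. This is a hypothetical illustration, not code from any of the cited systems; the operator names, the memory threshold, and `run_plan` are all invented for the example.

```python
# Hedged sketch of "late binding": a plan keeps two join options and binds
# one at query start based on observed memory. Illustrative names only.

def hash_join(left, right):
    # Build an in-memory hash table on the left input, probe with the right.
    table = {}
    for key, val in left:
        table.setdefault(key, []).append(val)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]

def nested_loop_join(left, right):
    # No build table: cheap on memory, expensive on time.
    right = list(right)
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def run_plan(left, right, free_memory_mb):
    # The "option" is bound once, at startup, then followed for the whole query.
    join = hash_join if free_memory_mb > 64 else nested_loop_join
    return join(left, right)

rows = run_plan([(1, "a"), (2, "b")], [(1, "x"), (3, "y")], free_memory_mb=128)
```

Either binding produces the same answer; only the resource profile differs, which is why the choice can safely be deferred to startup.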
Slide 14: Adaptive Query Processing
- Start with a compiled plan.
- Observe performance between blocking (or blocked) operators.
- Re-optimize the remainder of the plan upon divergence from the expected data delivery rate [AF+96, UFA98] or data statistics [KD98].
[Adaptivity-spectrum diagram: current DBMS (static plans), late binding, then inter-operator adaptivity (Query Scrambling, Mid-Query Re-opt).]
Slide 15: Adaptive Query Processing
- Join operators themselves can be made adaptive:
  - to user needs (Ripple Joins [HH99]),
  - to memory limitations (DPHJ [IF+99]),
  - to memory limitations and delays (XJoin [UF00]).
- Plan re-optimization can also be done in mid-operation (Convergent QP [IHW02]).
[Adaptivity-spectrum diagram: current DBMS (static plans), late binding, inter-operator (Query Scrambling, Mid-Query Re-opt), then intra-operator adaptivity (Ripple Join, XJoin, DPHJ, Convergent QP).]
Slide 16: Adaptive Query Processing
- Per-tuple adaptivity is the region that we are exploring in the Telegraph project at Berkeley.
[Adaptivity-spectrum diagram: current DBMS (static plans), late binding, inter-operator, intra-operator, then per-tuple adaptivity (Eddies, CACQ, PSoup), with open questions ("???") beyond both ends of the spectrum.]
Slide 17: Telegraph Overview
- An adaptive system for large-scale dataflow processing.
- A convergence of data management and networking approaches.
- Based on an extensible set of operators:
  1) Ingress (data access) operators: screen scraper, Napster/Gnutella readers, file readers, sensor proxies.
  2) Non-blocking data processing operators: selections (filters), XJoins, ...
  3) Adaptive routing operators: Eddies, SteMs, Flux, etc.
Slide 18: Routing Operators: Eddies [Avnur & Hellerstein, SIGMOD 00]
- How should operators be ordered and reordered over time?
  - Traditionally, via performance and economic/administrative feedback.
  - That won't work for never-ending queries over volatile streams.
- Instead, use adaptive record routing: re-optimization becomes a change in routing policy.
[Figure: a static dataflow pipes tuples through operators A, B, C, D in a fixed order; an eddy routes each tuple among A, B, C, D adaptively.]
Slide 19: Eddy: Per-Tuple Adaptivity
- Adjusts the flow adaptively: tuples are routed through the operators in different orders, but each tuple must visit every operator once before output.
- State is maintained on a per-tuple basis.
- Two complementary routing mechanisms:
  - Back pressure: each operator has a queue; don't route to operators with full queues. This avoids expensive operators.
  - Lottery scheduling: give preference to operators that are more selective.
- Initial results showed the eddy could "learn" good plans and adjust to changes in stream contents over time.
- We are currently exploring the inherent tradeoffs of such fine-grained adaptivity.
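The two routing mechanisms above can be sketched as a toy routing loop: tickets reward selective operators, and operators with full queues are skipped. This is an illustrative sketch, not Telegraph's implementation; the class names, the ticket-update rule, and the queue bound are assumptions made for the example.

```python
import random

# Toy sketch of eddy-style per-tuple routing: each tuple visits every
# operator once; routing favors selective operators (lottery tickets)
# and avoids operators whose queues are full (back pressure).

class Operator:
    def __init__(self, name, predicate, max_queue=8):
        self.name, self.predicate = name, predicate
        self.tickets = 1        # lottery weight; grows when this op drops tuples
        self.queue = []         # back pressure: a full queue blocks routing
        self.max_queue = max_queue

    def process(self, tuple_):
        passed = self.predicate(tuple_)
        if not passed:
            self.tickets += 1   # reward selectivity: dropping early saves work
        return passed

def eddy(tuples, operators):
    results = []
    for t in tuples:
        pending = {op.name for op in operators}   # per-tuple state: ops left to visit
        alive = True
        while alive and pending:
            candidates = [op for op in operators
                          if op.name in pending and len(op.queue) < op.max_queue]
            op = random.choices(candidates,
                                weights=[c.tickets for c in candidates])[0]
            alive = op.process(t)
            pending.discard(op.name)
        if alive:
            results.append(t)   # survived every operator
    return results

out = eddy(range(10), [Operator("even", lambda x: x % 2 == 0),
                       Operator("small", lambda x: x < 5)])
```

Whatever order the lottery picks, the surviving tuples are the same; only the amount of wasted work changes, which is exactly what the routing policy optimizes.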
Slide 20: Non-Blocking Operators: Join
- Traditional hash joins block when one input stalls: the build input must be consumed in full before probing can begin.
[Figure: build from Source A into Hash Table A; probe with Source B.]
Slide 21: Non-Blocking Operators: Join (cont.)
- The Symmetric Hash Join [WA91] blocks only if both inputs stall.
- It processes tuples as they arrive from the sources.
- It produces all answer tuples, and no spurious duplicates.
[Figure: each source both builds into its own hash table (Hash Table A, Hash Table B) and probes the other's.]
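The build-then-probe symmetry can be sketched as follows. This is a minimal illustration of the [WA91] idea, not its actual implementation; the stream encoding (side tag, key, value) and the function name are invented here.

```python
# Sketch of the symmetric (pipelined) hash join: each input keeps its own
# hash table; every arriving tuple is built into its own side's table,
# then probed against the other side's table, so matches stream out
# without waiting for either input to finish.

def symmetric_hash_join(stream):
    # stream yields ("A", key, value) or ("B", key, value) in arrival order
    tables = {"A": {}, "B": {}}
    for side, key, val in stream:
        tables[side].setdefault(key, []).append(val)   # build own side first
        if side == "A":
            for b in tables["B"].get(key, []):         # probe the other side
                yield (key, val, b)
        else:
            for a in tables["A"].get(key, []):
                yield (key, a, val)

# Interleaved arrivals: matches appear as soon as both halves have arrived.
matches = list(symmetric_hash_join([("A", 1, "a1"), ("B", 1, "b1"),
                                    ("B", 2, "b2"), ("A", 2, "a2")]))
```

Because a tuple never probes the table it was just built into, each matching pair is emitted exactly once: no spurious duplicates.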
Slide 22: SteMs: "State Modules" [Raman 01]
- A generalization of the symmetric join: split the join into modules that contain the current state of the operators.
- A SteM is basically a store with one or more indexes.
- SteMs allow sharing of state across multiple queries.
- Supports XJoin, continuous queries, competitive processing, ...
Slide 23: Using SteMs for Joins
- SteMs maintain intermediate state for multiple joins.
- An Eddy routes tuples through the necessary modules.
- "Build then probe" processing must be enforced.
- Note: the results of individual joins can also be maintained in SteMs.
[Figure: an eddy routes tuples from sources A, B, C, D through the SteMs HashA, HashB, HashC, HashD.]
Slide 24: CACQ [Madden et al., SIGMOD 02]
- Expect hundreds to thousands of queries over the same sources.
- Continuously Adaptive Continuous Queries: combine operators from many queries to improve efficiency (i.e., share work).
[Figure: Query 1 (Stocks.symbol = 'MSFT') and Query 2 (Stocks.symbol = 'AAPL') both filter the same Stock Quotes stream.]
Slide 25: Combining Queries in CACQ

    Q1 = SELECT * FROM S WHERE S.a = s1 AND S.b = s4;
    Q2 = SELECT * FROM S WHERE S.a = s2 AND S.b = s5;
    Q3 = SELECT * FROM S WHERE S.a = s3 AND S.b = s6;

- SteMs can also be used to store the query specifications.
- A good predicate index is needed when there are many queries.
- The Eddy now routes tuples for multiple queries simultaneously; CACQ leverages Eddies for adaptive CQ processing.
- Results show advantages over static approaches.
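The "predicate index" mentioned above can be sketched as a hash-based structure over equality predicates, so each arriving tuple is matched against all registered queries with one lookup per attribute instead of one scan per query. This is a simplified illustration of work sharing, not CACQ's actual grouped-filter implementation; `PredicateIndex` and its methods are invented names, and only equality predicates are handled.

```python
from collections import defaultdict

# Sketch of a shared predicate index for many conjunctive equality queries:
# attr -> value -> set of query ids, plus a per-query predicate count so a
# query matches only when all of its predicates are satisfied.

class PredicateIndex:
    def __init__(self):
        self.eq = defaultdict(lambda: defaultdict(set))  # attr -> value -> qids
        self.n_preds = defaultdict(int)                  # qid -> predicate count

    def add(self, query_id, attr, value):
        self.eq[attr][value].add(query_id)
        self.n_preds[query_id] += 1

    def match(self, tuple_):
        # one hash lookup per attribute, shared by every registered query
        hits = defaultdict(int)
        for attr, value in tuple_.items():
            for qid in self.eq.get(attr, {}).get(value, ()):
                hits[qid] += 1
        # a query is satisfied only if all of its predicates matched
        return {qid for qid, n in hits.items() if n == self.n_preds[qid]}

idx = PredicateIndex()
idx.add("Q1", "a", "s1"); idx.add("Q1", "b", "s4")
idx.add("Q2", "a", "s2"); idx.add("Q2", "b", "s5")
matched = idx.match({"a": "s1", "b": "s4"})   # satisfies Q1 only
```

The per-tuple cost grows with the number of attributes, not the number of queries, which is the point of sharing work across Q1...Qn.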
Slide 26: In the Beginning
[Figure: a traditional database system: a Query probes an Index over stored Data to produce a Result.]
Slide 27: Pub-Sub / CQ / Filtering
[Figure: incoming Data probes an Index over stored Queries to produce Results.]
- These systems can be looked at as database systems turned upside down.
Slide 28: PSoup [Chandrasekaran 02]
- Logical (?) conclusion: treat queries and data as duals.
- CACQ and similar systems are "filters": there is no historical data, and answers are continuously returned as new data streams into the system.
- Such continuous answers are often:
  - Infeasible: intermittent connectivity and "Data Recharging" profiles.
  - Inefficient: often the user does not want to be continuously interrupted.
Slide 29: PSoup (cont.)
- Query processing is treated as an n-way symmetric join between data tuples and query specifications.
- PSoup model: repeated invocation of standing queries over windows of streaming data.
Slide 30: PSoup: Query & Data Duality
[Figure: incoming Data probes an Index over stored Queries, while the Data is itself indexed and stored, feeding the Result.]

Slide 31: PSoup: Query & Data Duality (cont.)
[Figure: symmetrically, an incoming Query probes an Index over the stored Data and is itself added to the indexed Query store; both probes feed the Result.]
Slide 32: PSoup (cont.)
- Queries specify a "window" (e.g., the last 10 minutes).
- Queries are registered with the system and entered into a "Query SteM".
- Data tuples stream into the system and are entered into a "Data SteM".
- Matching is done on the fly (in the background), effectively keeping shared materialized views.
- When a query is invoked, the query's window is applied to the materialized answer and the results are returned to the user.
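The build/probe symmetry described above can be sketched as a toy model in which new queries probe the stored data, new tuples probe the stored queries, and the window is applied only at invocation time. This is an illustration of the idea, not the PSoup implementation; the class and method names are invented, and the "SteMs" here are plain dictionaries and lists without real indexes.

```python
from collections import defaultdict

# Toy PSoup: queries and data are duals. Every new tuple probes the stored
# queries; every new query probes the stored data; matches accumulate in a
# materialized results structure. Invoking a query just applies its window.

class PSoup:
    def __init__(self):
        self.queries = {}                  # query id -> predicate ("Query SteM")
        self.data = []                     # (timestamp, tuple)    ("Data SteM")
        self.results = defaultdict(list)   # query id -> matching (ts, tuple)

    def register_query(self, qid, predicate):
        self.queries[qid] = predicate
        for ts, t in self.data:            # build query, probe stored data
            if predicate(t):
                self.results[qid].append((ts, t))

    def insert_data(self, t, ts):
        self.data.append((ts, t))
        for qid, pred in self.queries.items():   # build data, probe stored queries
            if pred(t):
                self.results[qid].append((ts, t))

    def invoke(self, qid, now, window):
        # no results reach the user until invocation; the window is applied here
        return [t for ts, t in self.results[qid] if now - window <= ts <= now]

p = PSoup()
p.insert_data({"a": 3, "b": 5}, ts=100)
p.register_query("Q24", lambda t: t["a"] <= 4 and t["b"] >= 3)
p.insert_data({"a": 6, "b": 3}, ts=200)   # fails R.a <= 4, never materialized
p.insert_data({"a": 2, "b": 9}, ts=650)
answer = p.invoke("Q24", now=700, window=600)
```

Each (query, tuple) pair is matched exactly once, on whichever side arrived second, so the materialized answer stays duplicate-free as both streams grow.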
Slide 33: PSoup
[Figure: a Data Store of R tuples (columns ID, R.a, R.b) alongside a Query Store of registered predicates (IDs 20-23): 0 < R.a <= 5; R.a = 4 AND R.b = 3; 0 < R.b < 4; R.a > 4 AND R.b = 3.]

Slide 34: PSoup: New Select Query Arrival

    SELECT *
    FROM R
    WHERE R.a <= 4 AND R.b >= 3
    BEGIN (NOW - 600)
    END (NOW)

- The new query is assigned ID 24, and its predicate (R.a <= 4 AND R.b >= 3) is BUILT into the Query Store.
- The new predicate is then PROBED against the Data Store: each stored tuple of R is evaluated against it, and the true/false outcomes are recorded in the results structure.
Slide 35: Selection: New Data
- A new tuple of R (ID 53, with R.a = 6 and R.b = 3) arrives and is BUILT into the Data Store.
[Figure: the Data Store and Query Store as before, with query 24 (R.a <= 4 AND R.b >= 3) now registered.]
Slide 36: Selection: New Data (cont.)
- The new tuple (ID 53) is then PROBED against the Query Store: it is evaluated against every registered predicate, and the true/false outcomes are recorded in the results structure.
[Figure: tuple 53 flows past the stored queries (IDs 20-24); each predicate evaluates to T or F.]
Slide 37: PSoup: Query Invocation
- PSoup continuously maintains materialized views over streaming data and queries.
- Data is not returned to the user until the query is invoked; invocation requires applying the "windows" to the precomputed results.
- This adaptive approach allows the system to continuously absorb new data and new queries without recompilation.
- Lots of issues to study: query indexing, spilling to disk, bulk processing.
Slide 38: Conclusions
- The goal is to support autonomic organizations; the information infrastructure is the "nervous system" of such organizations.
- Existing query processing technology is simply too brittle.
- The Telegraph project at Berkeley is "pushing the envelope" on adaptive query processing.
[Adaptivity-spectrum diagram repeated: static plans, late binding, inter-operator, intra-operator, per-tuple (Eddies, CACQ, PSoup), with open questions ("???") beyond both ends.]