Continuously Adaptive Processing of Data and Query Streams
Michael Franklin, UC Berkeley
April 2002
Joint work with Joe Hellerstein and the Berkeley DB Group.
Slide 2: Overview
- Autonomic Computing is a technique for making computer systems self-managing.
- But the real issue is supporting autonomic organizations.
- Organizations need information in a timely and tailored manner: workflows, business flows, news, business intelligence, sensor input, ...
- The information infrastructure is the "nervous system" of the organization.
- Adaptive, responsive data management systems are essential.
Slide 3: Opportunities for Autonomic Data Management
- The functions currently expected of the DBA must be automated (see next talk): data placement, index selection, caching, pre-fetching, "freshening", ...
- We are pursuing an approach based on User Profiles.
- The query processing engine itself must be made "autonomic". This is the focus of today's talk.
- But the issues are related: profiles are a type of standing query.
Slide 4: Example 1: "Real-Time Business"
- Event-driven processing in B2B and enterprise apps: supply-chain, CRM, trade reconciliation, order processing, etc.
- (Quasi) real-time flow of events and data.
- Must manage these flows to drive business processes.
- Mine the flows to create and adjust business rules.
- Can also "tap into" the flows for on-line analysis.
Slide 5: Example 2: Information Dissemination
[Figure: data flows from Data Sources, through filtering against User Profiles, to Users as Filtered Data.]
- Document creation or a crawler initiates the flow of data towards users.
- Users and the system initiate a flow of profiles back towards the data.
Slide 6: Example 3: Sensor Nets
- Tiny devices measure the environment: Berkeley "motes", Smart Dust, Smart Tags, ...
- A major research thrust at Berkeley. Apps: transportation, seismic, energy, ...
- The devices form dynamic ad hoc networks, and aggregate and communicate streams of values.
- Sensors are the stimulus receptors of the autonomic organization.
Slide 7: Common Features
- Centrality of dataflow:
  - The architecture is focused on data movement: moving streams of data through code in a network.
  - Requires intelligent, low-overhead routing.
- Volatility of the environment:
  - Dynamic resources and topology; partial failures.
  - Long-running (never-ending?) tasks.
  - Potential for user interaction during the flow.
  - Large scale: users, data, resources, ...
  - Requires adaptivity.
Slide 8: Review: Distributed Query Processing

    SELECT eid, ename, title
    FROM   Emp, Proj, Assign
    WHERE  Emp.eid = Assign.eid
      AND  Proj.pid = Assign.pid
      AND  Emp.loc <> Proj.loc

- The system handles query plan generation and optimization, and ensures correct execution.
- Originally conceived for corporate networks.
(Figure ©1998 Ozsu and Valduriez)
Slide 9: Unpredictability: Wide-Area + Wrapped Sources
- Sources may be unreachable or slow to respond.
- Data statistics and cost estimates may be unavailable or unreliable.
- There is no view into the sources' internal processing and structure.
- Traditional, static query processing approaches cannot cope with such problems at run-time.
Slide 10: Unpredictability: User-in-the-Loop
- Batch processing is inappropriate for many apps: feedback must be provided to the user as quickly as possible.
- Data access becomes a cooperative, iterative process: the user may correct, refine, redirect, or change the query.
Slide 11: Unpredictability: Mobility & Data Streams
- Mobility: location-centric queries; moving endpoints change data staging needs.
- Data streams/sensors: varying data arrival rates; adapting resolutions; push vs. pull.
- Queries can be long-lived and streams may be infinite, so a static plan is doomed to failure.
Slide 12: Adaptive Query Processing
- Existing systems are adaptive, but at a slow rate:
  - Collect stats.
  - Compile and optimize the query.
  - Eventually collect stats again, or change the schema.
  - Re-compile and re-optimize if necessary.
- [Survey: Hellerstein, Franklin, et al., IEEE Data Engineering Bulletin, 2000]
[Adaptivity-spectrum diagram: current DBMS = static plans.]
Slide 13: Adaptive Query Processing
- Goal: allow adjustments for runtime conditions.
- Basic idea: leave "options" in the compiled plan.
- At runtime, bind the options based on observed state: available memory, load, cache contents, etc.
- Once bound, the plan is followed for the entire query.
- Dynamic, Parametric, Competitive, ... plans [HP88, GW89, IN+92, GC94, AC+96, AZ96, LP97]
[Adaptivity-spectrum diagram: current DBMS (static plans), then late binding.]
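The late-binding idea on this slide can be sketched in a few lines: the compiled plan carries two join "options", and one is bound at startup from observed state. This is a hypothetical illustration, not code from any of the cited systems; the operator names, the memory threshold, and `run_plan` are all invented for the example.

```python
# Hedged sketch of "late binding": a plan keeps two join options and binds
# one at query start based on observed memory. Illustrative names only.

def hash_join(left, right):
    # Build an in-memory hash table on the left input, probe with the right.
    table = {}
    for key, val in left:
        table.setdefault(key, []).append(val)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]

def nested_loop_join(left, right):
    # No build table: cheap on memory, expensive on time.
    right = list(right)
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def run_plan(left, right, free_memory_mb):
    # The "option" is bound once, at startup, then followed for the whole query.
    join = hash_join if free_memory_mb > 64 else nested_loop_join
    return join(left, right)

rows = run_plan([(1, "a"), (2, "b")], [(1, "x"), (3, "y")], free_memory_mb=128)
```

Either binding produces the same answer; only the resource profile differs, which is why the choice can safely be deferred to startup.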
Slide 14: Adaptive Query Processing
- Start with a compiled plan.
- Observe performance between blocking (or blocked) operators.
- Re-optimize the remainder of the plan upon divergence from the expected data delivery rate [AF+96, UFA98] or data statistics [KD98].
[Adaptivity-spectrum diagram: current DBMS (static plans), late binding, then inter-operator adaptivity (Query Scrambling, Mid-Query Re-opt).]
Slide 15: Adaptive Query Processing
- Join operators themselves can be made adaptive:
  - to user needs (Ripple Joins [HH99]),
  - to memory limitations (DPHJ [IF+99]),
  - to memory limitations and delays (XJoin [UF00]).
- Plan re-optimization can also be done in mid-operation (Convergent QP [IHW02]).
[Adaptivity-spectrum diagram: current DBMS (static plans), late binding, inter-operator (Query Scrambling, Mid-Query Re-opt), then intra-operator adaptivity (Ripple Join, XJoin, DPHJ, Convergent QP).]
Slide 16: Adaptive Query Processing
- Per-tuple adaptivity is the region that we are exploring in the Telegraph project at Berkeley.
[Adaptivity-spectrum diagram: current DBMS (static plans), late binding, inter-operator, intra-operator, then per-tuple adaptivity (Eddies, CACQ, PSoup), with open questions ("???") beyond both ends of the spectrum.]
Slide 17: Telegraph Overview
- An adaptive system for large-scale dataflow processing.
- A convergence of data management and networking approaches.
- Based on an extensible set of operators:
  1) Ingress (data access) operators: screen scraper, Napster/Gnutella readers, file readers, sensor proxies.
  2) Non-blocking data processing operators: selections (filters), XJoins, ...
  3) Adaptive routing operators: Eddies, SteMs, Flux, etc.
Slide 18: Routing Operators: Eddies [Avnur & Hellerstein, SIGMOD 00]
- How should operators be ordered and reordered over time?
  - Traditionally, via performance and economic/administrative feedback.
  - That won't work for never-ending queries over volatile streams.
- Instead, use adaptive record routing: re-optimization becomes a change in routing policy.
[Figure: a static dataflow pipes tuples through operators A, B, C, D in a fixed order; an eddy routes each tuple among A, B, C, D adaptively.]
Slide 19: Eddy: Per-Tuple Adaptivity
- Adjusts the flow adaptively: tuples are routed through the operators in different orders, but each tuple must visit every operator once before output.
- State is maintained on a per-tuple basis.
- Two complementary routing mechanisms:
  - Back pressure: each operator has a queue; don't route to operators with full queues. This avoids expensive operators.
  - Lottery scheduling: give preference to operators that are more selective.
- Initial results showed the eddy could "learn" good plans and adjust to changes in stream contents over time.
- We are currently exploring the inherent tradeoffs of such fine-grained adaptivity.
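The two routing mechanisms above can be sketched as a toy routing loop: tickets reward selective operators, and operators with full queues are skipped. This is an illustrative sketch, not Telegraph's implementation; the class names, the ticket-update rule, and the queue bound are assumptions made for the example.

```python
import random

# Toy sketch of eddy-style per-tuple routing: each tuple visits every
# operator once; routing favors selective operators (lottery tickets)
# and avoids operators whose queues are full (back pressure).

class Operator:
    def __init__(self, name, predicate, max_queue=8):
        self.name, self.predicate = name, predicate
        self.tickets = 1        # lottery weight; grows when this op drops tuples
        self.queue = []         # back pressure: a full queue blocks routing
        self.max_queue = max_queue

    def process(self, tuple_):
        passed = self.predicate(tuple_)
        if not passed:
            self.tickets += 1   # reward selectivity: dropping early saves work
        return passed

def eddy(tuples, operators):
    results = []
    for t in tuples:
        pending = {op.name for op in operators}   # per-tuple state: ops left to visit
        alive = True
        while alive and pending:
            candidates = [op for op in operators
                          if op.name in pending and len(op.queue) < op.max_queue]
            op = random.choices(candidates,
                                weights=[c.tickets for c in candidates])[0]
            alive = op.process(t)
            pending.discard(op.name)
        if alive:
            results.append(t)   # survived every operator
    return results

out = eddy(range(10), [Operator("even", lambda x: x % 2 == 0),
                       Operator("small", lambda x: x < 5)])
```

Whatever order the lottery picks, the surviving tuples are the same; only the amount of wasted work changes, which is exactly what the routing policy optimizes.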
Slide 20: Non-Blocking Operators: Join
- Traditional hash joins block when one input stalls: the build input must be consumed in full before probing can begin.
[Figure: build from Source A into Hash Table A; probe with Source B.]
Slide 21: Non-Blocking Operators: Join (cont.)
- The Symmetric Hash Join [WA91] blocks only if both inputs stall.
- It processes tuples as they arrive from the sources.
- It produces all answer tuples, and no spurious duplicates.
[Figure: each source both builds into its own hash table (Hash Table A, Hash Table B) and probes the other's.]
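The build-then-probe symmetry can be sketched as follows. This is a minimal illustration of the [WA91] idea, not its actual implementation; the stream encoding (side tag, key, value) and the function name are invented here.

```python
# Sketch of the symmetric (pipelined) hash join: each input keeps its own
# hash table; every arriving tuple is built into its own side's table,
# then probed against the other side's table, so matches stream out
# without waiting for either input to finish.

def symmetric_hash_join(stream):
    # stream yields ("A", key, value) or ("B", key, value) in arrival order
    tables = {"A": {}, "B": {}}
    for side, key, val in stream:
        tables[side].setdefault(key, []).append(val)   # build own side first
        if side == "A":
            for b in tables["B"].get(key, []):         # probe the other side
                yield (key, val, b)
        else:
            for a in tables["A"].get(key, []):
                yield (key, a, val)

# Interleaved arrivals: matches appear as soon as both halves have arrived.
matches = list(symmetric_hash_join([("A", 1, "a1"), ("B", 1, "b1"),
                                    ("B", 2, "b2"), ("A", 2, "a2")]))
```

Because a tuple never probes the table it was just built into, each matching pair is emitted exactly once: no spurious duplicates.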
Slide 22: SteMs: "State Modules" [Raman 01]
- A generalization of the symmetric join: split the join into modules that contain the current state of the operators.
- A SteM is basically a store with one or more indexes.
- SteMs allow sharing of state across multiple queries.
- Supports XJoin, continuous queries, competitive processing, ...
Slide 23: Using SteMs for Joins
- SteMs maintain intermediate state for multiple joins.
- An Eddy routes tuples through the necessary modules.
- "Build then probe" processing must be enforced.
- Note: the results of individual joins can also be maintained in SteMs.
[Figure: an eddy routes tuples from sources A, B, C, D through the SteMs HashA, HashB, HashC, HashD.]
Slide 24: CACQ [Madden et al., SIGMOD 02]
- Expect hundreds to thousands of queries over the same sources.
- Continuously Adaptive Continuous Queries: combine operators from many queries to improve efficiency (i.e., share work).
[Figure: Query 1 (Stocks.symbol = 'MSFT') and Query 2 (Stocks.symbol = 'AAPL') both filter the same Stock Quotes stream.]
Slide 25: Combining Queries in CACQ

    Q1 = SELECT * FROM S WHERE S.a = s1 AND S.b = s4;
    Q2 = SELECT * FROM S WHERE S.a = s2 AND S.b = s5;
    Q3 = SELECT * FROM S WHERE S.a = s3 AND S.b = s6;

- SteMs can also be used to store the query specifications.
- A good predicate index is needed when there are many queries.
- The Eddy now routes tuples for multiple queries simultaneously; CACQ leverages Eddies for adaptive CQ processing.
- Results show advantages over static approaches.
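The "predicate index" mentioned above can be sketched as a hash-based structure over equality predicates, so each arriving tuple is matched against all registered queries with one lookup per attribute instead of one scan per query. This is a simplified illustration of work sharing, not CACQ's actual grouped-filter implementation; `PredicateIndex` and its methods are invented names, and only equality predicates are handled.

```python
from collections import defaultdict

# Sketch of a shared predicate index for many conjunctive equality queries:
# attr -> value -> set of query ids, plus a per-query predicate count so a
# query matches only when all of its predicates are satisfied.

class PredicateIndex:
    def __init__(self):
        self.eq = defaultdict(lambda: defaultdict(set))  # attr -> value -> qids
        self.n_preds = defaultdict(int)                  # qid -> predicate count

    def add(self, query_id, attr, value):
        self.eq[attr][value].add(query_id)
        self.n_preds[query_id] += 1

    def match(self, tuple_):
        # one hash lookup per attribute, shared by every registered query
        hits = defaultdict(int)
        for attr, value in tuple_.items():
            for qid in self.eq.get(attr, {}).get(value, ()):
                hits[qid] += 1
        # a query is satisfied only if all of its predicates matched
        return {qid for qid, n in hits.items() if n == self.n_preds[qid]}

idx = PredicateIndex()
idx.add("Q1", "a", "s1"); idx.add("Q1", "b", "s4")
idx.add("Q2", "a", "s2"); idx.add("Q2", "b", "s5")
matched = idx.match({"a": "s1", "b": "s4"})   # satisfies Q1 only
```

The per-tuple cost grows with the number of attributes, not the number of queries, which is the point of sharing work across Q1...Qn.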
Slide 26: In the Beginning
[Figure: a traditional database system: a Query probes an Index over stored Data to produce a Result.]
Slide 27: Pub-Sub / CQ / Filtering
[Figure: incoming Data probes an Index over stored Queries to produce Results.]
- These systems can be looked at as database systems turned upside down.
Slide 28: PSoup [Chandrasekaran 02]
- Logical (?) conclusion: treat queries and data as duals.
- CACQ and similar systems are "filters": there is no historical data, and answers are continuously returned as new data streams into the system.
- Such continuous answers are often:
  - Infeasible: intermittent connectivity and "Data Recharging" profiles.
  - Inefficient: often the user does not want to be continuously interrupted.
Slide 29: PSoup (cont.)
- Query processing is treated as an n-way symmetric join between data tuples and query specifications.
- PSoup model: repeated invocation of standing queries over windows of streaming data.
Slide 30: PSoup: Query & Data Duality
[Figure: incoming Data probes an Index over stored Queries, while the Data is itself indexed and stored, feeding the Result.]

Slide 31: PSoup: Query & Data Duality (cont.)
[Figure: symmetrically, an incoming Query probes an Index over the stored Data and is itself added to the indexed Query store; both probes feed the Result.]
Slide 32: PSoup (cont.)
- Queries specify a "window" (e.g., the last 10 minutes).
- Queries are registered with the system and entered into a "Query SteM".
- Data tuples stream into the system and are entered into a "Data SteM".
- Matching is done on the fly (in the background), effectively keeping shared materialized views.
- When a query is invoked, the query's window is applied to the materialized answer and the results are returned to the user.
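The build/probe symmetry described above can be sketched as a toy model in which new queries probe the stored data, new tuples probe the stored queries, and the window is applied only at invocation time. This is an illustration of the idea, not the PSoup implementation; the class and method names are invented, and the "SteMs" here are plain dictionaries and lists without real indexes.

```python
from collections import defaultdict

# Toy PSoup: queries and data are duals. Every new tuple probes the stored
# queries; every new query probes the stored data; matches accumulate in a
# materialized results structure. Invoking a query just applies its window.

class PSoup:
    def __init__(self):
        self.queries = {}                  # query id -> predicate ("Query SteM")
        self.data = []                     # (timestamp, tuple)    ("Data SteM")
        self.results = defaultdict(list)   # query id -> matching (ts, tuple)

    def register_query(self, qid, predicate):
        self.queries[qid] = predicate
        for ts, t in self.data:            # build query, probe stored data
            if predicate(t):
                self.results[qid].append((ts, t))

    def insert_data(self, t, ts):
        self.data.append((ts, t))
        for qid, pred in self.queries.items():   # build data, probe stored queries
            if pred(t):
                self.results[qid].append((ts, t))

    def invoke(self, qid, now, window):
        # no results reach the user until invocation; the window is applied here
        return [t for ts, t in self.results[qid] if now - window <= ts <= now]

p = PSoup()
p.insert_data({"a": 3, "b": 5}, ts=100)
p.register_query("Q24", lambda t: t["a"] <= 4 and t["b"] >= 3)
p.insert_data({"a": 6, "b": 3}, ts=200)   # fails R.a <= 4, never materialized
p.insert_data({"a": 2, "b": 9}, ts=650)
answer = p.invoke("Q24", now=700, window=600)
```

Each (query, tuple) pair is matched exactly once, on whichever side arrived second, so the materialized answer stays duplicate-free as both streams grow.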
Slide 33: PSoup
[Figure: a Data Store of R tuples (columns ID, R.a, R.b) alongside a Query Store of registered predicates (IDs 20-23): 0 < R.a <= 5; R.a = 4 AND R.b = 3; 0 < R.b < 4; R.a > 4 AND R.b = 3.]

Slide 34: PSoup: New Select Query Arrival

    SELECT *
    FROM R
    WHERE R.a <= 4 AND R.b >= 3
    BEGIN (NOW - 600)
    END (NOW)

- The new query is assigned ID 24, and its predicate (R.a <= 4 AND R.b >= 3) is BUILT into the Query Store.
- The new predicate is then PROBED against the Data Store: each stored tuple of R is evaluated against it, and the true/false outcomes are recorded in the results structure.
Slide 35: Selection: New Data
- A new tuple of R (ID 53, with R.a = 6 and R.b = 3) arrives and is BUILT into the Data Store.
[Figure: the Data Store and Query Store as before, with query 24 (R.a <= 4 AND R.b >= 3) now registered.]
Slide 36: Selection: New Data (cont.)
- The new tuple (ID 53) is then PROBED against the Query Store: it is evaluated against every registered predicate, and the true/false outcomes are recorded in the results structure.
[Figure: tuple 53 flows past the stored queries (IDs 20-24); each predicate evaluates to T or F.]
Slide 37: PSoup: Query Invocation
- PSoup continuously maintains materialized views over streaming data and queries.
- Data is not returned to the user until the query is invoked; invocation requires applying the "windows" to the precomputed results.
- This adaptive approach allows the system to continuously absorb new data and new queries without recompilation.
- Lots of issues to study: query indexing, spilling to disk, bulk processing.
Slide 38: Conclusions
- The goal is to support autonomic organizations; the information infrastructure is the "nervous system" of such organizations.
- Existing query processing technology is simply too brittle.
- The Telegraph project at Berkeley is "pushing the envelope" on adaptive query processing.
[Adaptivity-spectrum diagram repeated: static plans, late binding, inter-operator, intra-operator, per-tuple (Eddies, CACQ, PSoup), with open questions ("???") beyond both ends.]