elke a. rundensteiner topics projects in database and information systems, such as, web information...
Post on 20-Dec-2015
219 views
TRANSCRIPT
Elke A. Rundensteiner
Topics projects in database and
Information systems, such as,web information systems,
distributed databases,Etc.
Database Systems Research Lab
Email: [email protected]
Office: Fuller 238
Phone: x – 5815
Webpages: http://www.cs.wpi.edu/~rundenst
http://davis.wpi.edu/dsrg
2
Core Projects in a Nutshell: Distributed Data Sources:
EVE : Data Warehousing over Distributed Data Sources
TOTAL-ETL : Distributed Extract Transform Load Tools
Web Information Systems: RAINBOW : XML Query
Engine and Tools XML to Relational Mapping
Technology
Databases and Visualization: Visualization-Driven
Data Caching Prefetching based on
User Access Patterns Stream Monitoring
Systems: New Generation of
Query Engines Adaptive Distributed
Optimizations
3
DSRG Research Group
2 faculty 8 Phd students 5 MS students 4 MQP teams
4
Stream Monitoring Applications Monitor troop movements during combat and
warn when soldiers veer off course Send alert when patient’s vital signs begin to
deteriorate Monitor incoming news feeds to see stories on
“Iraq” Scour network traffic logs looking for intruders
5
Properties of Monitoring Applications Queries and monitors run continuously,
possibly unending Applications have varying service
preferences: Patient monitoring only wants freshest data Remote sensors have limited memory News service wishes maximal throughput Taking 60 seconds to process vital signs and
sound an alert may be too long
6
Properties of Streaming Data
Possibly never ending stream of data Unpredictable arrival patterns:
Network congestion Weather (for external sensors) Sensor moves out of range
7
Querying and Monitoring Streaming Data
New Data Format: Querying and Analyzing Continuous, unbounded, rapid, time-varying streams of data elements
Traditional DB vs. Stream System
Persistent finite relations vs. transient infinite data streams
One-time queries vs. large number of continuously running queries
Random access vs. sequential access
Query plan determined by static design decisions vs. unpredictable data characteristics and arrival patterns
8
Databases Upside Down
data
Query
Query
Query
Query
data
data
data
data
data
streamsof data
static data
Standing queries
one-time queries
9
CAPE: Constraint-Aware Adaptive Processing Engine
Elke A. Rundensteiner, G. Heineman, Luping Ding, Timothy Sutherland, Yali Zhu
Brad Pielech, Nishant Mehta
Natasha Bogdanova, Mariana Jbantova
Extracted from VLDB 2004 CAPE Demo.
http://davis.wpi.edu/dsrg/CAPE/index.html
10
Uncertainties in Stream Query Processing
RegisterContinuous Queries
Stream QueryEngine
Stream QueryEngine
Streaming DataStreaming Result
May have different QoS Requirements.
May have time-varying rates and data di
stribution.
Available resources for executing each operator
may vary over time.
Adaptations are required for stream query engine.
11
What is CAPE?
Constraint-aware
Adaptive
Continuous Query
Processing
Engine
Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.
Incorporate heterogeneous-grained adaptivity at all query processing levels.
- Adaptive query operator execution- Adaptive query plan re-optimization- Adaptive operator scheduling- Adaptive query plan distribution
Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.
12
CAPE System Architecture
Distribution Manager
Plan Reoptimizer
OperatorScheduler
OperatorConfigurator
OperatorConfigurator
OperatorScheduler
PlanReoptimizer
CAPE Query Engine
QoS Inspector
Execution Engine
StorageManager
StreamSender
StreamFeeder
StreamReceiver
Internet
Control Flow
Data Flow
Legend:
DistributionManager
Query PlanGenerator
Stream / QueryRegistration
GUI
Query 2 . . Query nQuery 1
StreamingData
End User
13
Research Contributions Scalable Query Operators (Punctuations)
Adapt and select among tasks such as memory purging, stream reading, memory-to-disk shuffling, punctuation propagation, etc.
Cooperative Plan Optimization Operators request and selectively provide punctuations (meta-knowledge) to improve
their performance
Adaptive Operator Scheduling Selector scores alternate scheduling algorithm based on their effect on QoS
requirements, and selects candidate. On-line Query Plan Migration
On-line plan restructuring and then online migration to the new plan even for stateful operators.
Distributed Plan Execution Adaptively distribute computations across multiple machines to optimize QoS
requirements without information loss
14
Constraint-Aware Query Operators
Item_id Bidder_id Bid_price Bid_time
1082 Marlie 820.00 Nov-13-03 11:02:00
1080 Ultrasale 1000.00 Nov-13-03 11:05:00
1082 Jocelyn 850.00 Nov-13-03 11:14:00
1082 * * *
No more bidwill be placedon item 1082.
Punctuations [TMS+03] are embedded inside data streams to announce that no tuple after a punctuation will match this punctuation.
Sliding windows characterize data sets that current result is based on.
Punctuation
TSa
TSb - Wa
TSa - Wb
TSb
Wb
Timeline
T0
Wa
Stream A Stream B
• Utilize data-level constraint, punctuations, to purge no-longer-needed data• Exploit time-based constraints, sliding windows, to discard expired data• Incorporate optimizations enabled by interactions between different constraints• Include ability to propagate punctuations to help downstream operators• Employ adaptive execution logic to cope with dynamic stream environments
Window Join:
15
JoinA
a1 *…
a1
a1
a2
a3
b2
b4
b1
b2
A B
a1
a1
a3
c1
c6
c2
A C
… a1
a1
a1
a1
b2
b4
b2
b4
A Bc1
c1
c6
a6
C
a3 b2 c2
JoinA
…
a2
a3
b1
b2
A B
a1
a1
a3
c1
c6
c2
A C
… a1
a1
a1
a1
b2
b4
b2
b4
A Bc1
c1
c6
a6
C
a3 b2 c2
Punctuation-Exploiting Join Optimization
JoinA
…
a1
a1
a2
a3
b2
b4
b1
b2
A B
a1
a1
a3
c1
c6
c2
A C…
a1
a1
a1
a1
b2
b4
b2
b4
A Bc1
c1
c6
a6
C
a3 b2 c2
JoinA…
… a1
a1
a1
a1
b2
b4
b2
b4
A Bc1
c1
c6
a6
C
a3 b2 c2
1246
TS
Window-Based Join Optimization
357
TSa2 c5 8
Window: 6
Window: 8
a1
a2
a3
b4
b1
b2
A B
246
TS
a1
a1
a3
c1
c6
c2
A C357
TS
a2 c5 8
a2 b1 c5
a1
a1
b2
b4
a1 b2 1
17
Run-time Plan Re-Optimization Step 1 - Decide when to optimize
Statistics Monitoring Step 2 – Generate new query plan
Query Optimization Step 3 – Replace current plan by new plan
Plan Migration
18
Plan Migration Strategy : Out with the Old and In with the New ?
Migration Steps Pause execution of old plan Drain out all tuples inside old plan Replace old plan by new plan Resume execution of new plan
AB
BC
A B C
AB
BC
A B C
Problem: Works for stateless operators only
19
Stateful Operator in CQ
Why stateful Need non-blocking operators in CQ Operator needs to output partial results State data structure keep received tuples
AB
A B
b1b2b3b4b5
ax
State A State B
ax
ax b2ax b3
Key Observation: The purge of tuples in states relies on processing of new tuples.
Example: Symmetric NL join w/ window constraints
20
Migration Strategy Revisited
Steps(1) Pause execution of old plan(2) Drain out all tuples inside old plan(3) Replace old plan by new plan(4) Resume execution of new plan
AB
BC
A B C(2)
All tuples drained
(4)Processing
Resumed
(3) Old Replaced
By new
Deadlock Waiting Problem:
21
Dynamic Plan Migration
Dynamic Plan Migration Input (two migration boxes)
One contains old plan One contains new plan Have same input and output queues
Result Old box is replaced by new box
Two Strategies: Moving State Strategy Parallel Track
Guarantee Valid Migration No missing tuples No duplicates
BC
AB
QA QB QC QD
QABCD
AB
CD
BC
QA QB QC QD
QABCD
SAB SC
SA SBSB
SC
SBC SD
SBCDSACD
SABCSD
Key points:- Involved plans contain stateful operators- Migrate yet still retain useful states and discard useless states.
22
Distribution Manager
Connection Manager
Query Plan Manager
Runtime Monitor
Distribution DecisionMaker
Distributed Strategy Repository
Cost Model Repository
Configuration Repository
CAPE Engine ( Query Processor)
Distribution Manager
Connection Manager: synchronizes query plan distribution between query processors.
Query Plan Manager: maintains the location of each query operator and verifies that a particular distribution is valid.
Runtime Monitor: receives statistics from each query processor and calculates respective load over time.
Distribution Decision Maker: creates initial distribution and at runtime reallocates query operators to different machines.
23
M1 M2 M3Legend:
M1
M2
M3
Round-Robin Distribution Grouping Distribution
Goal: To minimize network connectivity.
Algorithm: Takes each query plan and creates sub-plans where neighbouring operators are grouped together.
Goal: To equalize workload per machine.
Algorithm: Iteratively takes each query operator and places it on the query processor with the least number of operators.
Initial Distribution Policies
24
Distribution Manager
Distribution Table
M 2Operator 8
M 2Operator 7
M 1Operator 6
M 1Operator 5
M 2Operator 4
M 2Operator 3
M 1Operator 2
M 1Operator 1
MachineOperator
Initial Distribution Process
Stream Source
Application
4321
876
5
1 2 3 4
5
67 8
M1
M2
Step 1 Step 2
Step 1: Create distribution table using initial distribution algorithm.
Step 2: Send distribution information to processing machines
25
Redistribution - New Distribution Balance Computation
4100 tuplesM 2
2000 tuplesM 1
M 2Operator 8
M 2Operator 7
M 1Operator 6
M 1Operator 5
M 2Operator 4
M 2Operator 3
M 1Operator 2
M 1Operator 1
Machine
(M)
Operator (OP)
Op 3: .3
Op 4: .2
Op 7: .3
Op 8: .2
.91M 2
Op 1: .25
Op 2: .25
Op 5: .25
Op 6: .25
.44M 1
Operator Cost
CostMachine
Statistics Table
M Capacity: 4500 tuples
Distribution Table
Op 3: .4
Op 4: .3
Op 8: .3
.64M 2
Op 1: .15
Op 2: .15
Op 5: .15
Op 6: .15
.71M 1
Operator Cost
CostMachineBalance
Cost Table (current) Cost Table (desired)
Cape’s costs: # tuples in memory and network output rate. Operators redistributed based on redistribution policy. Redistribution policies of Cape: Balance and Degradation. 6-Step Protocol: Moving Operators Across Machines
Op 7: .4
Cost per machine is determined as percentage of memory filled with tuples.
26
Move Operators Across Machines
27
Experimental Results of Distribution and Re-distribution Algorithms
0
100000
200000
300000
400000
500000
600000
700000
800000
Time (m)
To
tal
Tu
ple
s O
utp
utt
ed Round Robin Distribution
Grouping Distribution
RR + Redist
Grouping + Redist
Query Plan Performance with a Query Plan of 40 Operators.
Observations:
Initial distribution is important for query plan performance.
Redistribution improves at run-time query plan performance.
28
So Much Fun Challenges, So Little Time …
Motion Query Processing :
Sensor on Cars Novel classes of location-aware queries Queries such as “Find me the five nearest
restaurants to my current (moving) location” Not just data but also queries are streaming in. Massive Query Sharing and Shared Execution Novel classes of location-aware queries
29
Publications, TRs & URLs[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, “CAPE: A Cons
traint-Aware Adaptive Stream Processing Engine”. Invited Book Chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.
[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams”. SIGMOD 2004, pages 431-442.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams“. EDBT 2004, pages 587-604.
[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams“. CIKM 2004, to appear.
[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, “MJoin: A Metadata-Aware Stream Join Operator”. DEBS 2003.
[SPR04] T. Sutherland, B. Pielech and E. A. Rundensteiner, "Adaptive Scheduling Framework for A Continuous Query System“. Tech Report, WPI-CS-TR-04-16, 2004.
[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture“. Tech Report, WPI-CS-TR-04-18, 2004.
CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html
Raindrop : XQueries on XML Streams (Automaton Meets Algebra)
Hong Su, Elke Rundensteiner, Murali Mani, Ming Li
Worcester Polytechnic Institute
Worcester, MA
** Extracted from VLDB 2004 demo **
31
XML Stream Processing
data sources
data requesters
Networks
32
What’s Special for XML Stream Processing?<Biditems>
<book year=“2001">
<title>Dream Catcher</title>
<author><last>King</last><first>S.</first></author>
<publisher>Bt Bound </publisher>
<price> 30 </initial>
</book>
…
<biditems> <book> <title> Dream Catcher </title> …
Token-by-Token access manner
timeline
Pattern retrieval + Filtering + Restructuring
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>
Token: not a direct counterpart of a tuple
30Bt BoundS.KingDream2001
pricepublisherfirstlasttitleyear
Pattern Retrieval on Token Streams
33
Two Computation Paradigms
Automata-based [yfilter, xscan, xpush…] Algebraic [niagara00, …]
FOR $a in stream(bids)//auction, $b in $a/seller[homepage], $c in $a/bidder[sameAddr]WHERE $b/*/phone = “508”Return <auction> $b, $c </auction>
1auction
*
2
3seller
bidder
Automata
8Navigate
$a, /seller->$b
Navigate $a, /bidder-> $c
Tagger
Algebra
Navigate stream(bids),//auction->$a
4
homepage
9sameAddr
5 6* phone
…
7
bid
34
Comparison of Two Paradigms
Either paradigm has deficiencies
Both paradigms complement each other
Automata Paradigm Algebra Paradigm
Good for pattern retrieval on tokens Does not support token inputs
Need patches for filtering and restructuring
Good for filtering and restructuring
Present all details on same low level Support multiple descriptive levels (e.g., logical plan, physical plan)
Little studied as query processing paradigm
Well studied as query process paradigm
35
This Raindrop framework integrates both paradigms into one uniform model
36
One Uniform Algebraic View
Token-based plan (automata plan)
Tuple-based plan
Tuple stream
XML data stream
Query answer
Algebraic Stream Logical Plan
37
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>
Tuple-based plan
Token-based plan (automata plan)
38
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>
StructuralJoin$b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
Tuple-based plan
39
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>
StructuralJoin$b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
Select$p<30
Tagger “Inexpensive”, $t->$r
40
Automata Push-In/Out?
Token-based plan (automata plan)
Tuple-based Plan
Tuple stream
XML data stream
Query answer
Pattern retrieval in Semantics-focused plan
Apply “push into automata”
41
Optimization Example: Computation Into or Out of Automata?
TokenNavigate $a, /bid/bi
dder->$c
ExtractUnnest $a, $c
ExtractUnnest $a, $b
StructuralJoin $a
TokenNavigate $a, /seller->$
b
TokenNavigate stream(bids), //a
uction->$a
ExtracUnnest stream(bids), $a
NavigateUnnest $a, /seller-
>$b
NavigateUnnest $a, /bid/bid
der->$c
TokenNavigate stream(bids), //aucti
on->$a
NavUnnest stream(bids), //auction->$a
NavigateUnnest $a, /seller ->$b
NavigateUnest $a, /bid/bidder ->$c
Out of Automata Into Automata
Automata Plan
Automata Plan
…
… …
42
Research Challenges:- Semantic Query Optimization- Automata versus Algebra
Optimization- Memory-Sensitive Non-Blocking
Evaluation- On-the-fly Plan Retargeting- XML Load Shedding …
43
See DSRG Web SiteVisit DSRG Research Labs
318 & 319Attend DSRG Research
MeetingCome and talk
Want More ?