elke a. rundensteiner topics projects in database and information systems, such as, web information...

Elke A. Rundensteiner

Topics projects in database and

Information systems, such as,web information systems,

distributed databases,Etc.

Database Systems Research Lab

Email: [email protected]

Office: Fuller 238

Phone: x – 5815

Webpages: http://www.cs.wpi.edu/~rundenst

http://davis.wpi.edu/dsrg

2

Core Projects in a Nutshell: Distributed Data Sources:

EVE : Data Warehousing over Distributed Data Sources

TOTAL-ETL : Distributed Extract Transform Load Tools

Web Information Systems: RAINBOW : XML Query

Engine and Tools XML to Relational Mapping

Technology

Databases and Visualization: Visualization-Driven

Data Caching Prefetching based on

User Access Patterns Stream Monitoring

Systems: New Generation of

Query Engines Adaptive Distributed

Optimizations

3

DSRG Research Group

2 faculty 8 Phd students 5 MS students 4 MQP teams

4

Stream Monitoring Applications Monitor troop movements during combat and

warn when soldiers veer off course Send alert when patient’s vital signs begin to

deteriorate Monitor incoming news feeds to see stories on

“Iraq” Scour network traffic logs looking for intruders

5

Properties of Monitoring Applications Queries and monitors run continuously,

possibly unending Applications have varying service

preferences: Patient monitoring only wants freshest data Remote sensors have limited memory News service wishes maximal throughput Taking 60 seconds to process vital signs and

sound an alert may be too long

6

Properties of Streaming Data

Possibly never ending stream of data Unpredictable arrival patterns:

Network congestion Weather (for external sensors) Sensor moves out of range

7

Querying and Monitoring Streaming Data

New Data Format: Querying and Analyzing Continuous, unbounded, rapid, time-varying streams of data elements

Traditional DB vs. Stream System

Persistent finite relations vs. transient infinite data streams

One-time queries vs. large number of continuously running queries

Random access vs. sequential access

Query plan determined by static design decisions vs. unpredictable data characteristics and arrival patterns

8

Databases Upside Down

data

Query

Query

Query

Query

data

data

data

data

data

streamsof data

static data

Standing queries

one-time queries

9

CAPE: Constraint-Aware Adaptive Processing Engine

Elke A. Rundensteiner, G. Heineman, Luping Ding, Timothy Sutherland, Yali Zhu

Brad Pielech, Nishant Mehta

Natasha Bogdanova, Mariana Jbantova

Extracted from VLDB 2004 CAPE Demo.

http://davis.wpi.edu/dsrg/CAPE/index.html

10

Uncertainties in Stream Query Processing

RegisterContinuous Queries

Stream QueryEngine

Stream QueryEngine

Streaming DataStreaming Result

May have different QoS Requirements.

May have time-varying rates and data di

stribution.

Available resources for executing each operator

may vary over time.

Adaptations are required for stream query engine.

11

What is CAPE?

Constraint-aware

Adaptive

Continuous Query

Processing

Engine

Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.

Incorporate heterogeneous-grained adaptivity at all query processing levels.

- Adaptive query operator execution- Adaptive query plan re-optimization- Adaptive operator scheduling- Adaptive query plan distribution

Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.

12

CAPE System Architecture

Distribution Manager

Plan Reoptimizer

OperatorScheduler

OperatorConfigurator

OperatorConfigurator

OperatorScheduler

PlanReoptimizer

CAPE Query Engine

QoS Inspector

Execution Engine

StorageManager

StreamSender

StreamFeeder

StreamReceiver

Internet

Control Flow

Data Flow

Legend:

DistributionManager

Query PlanGenerator

Stream / QueryRegistration

GUI

Query 2 . . Query nQuery 1

StreamingData

End User

13

Research Contributions Scalable Query Operators (Punctuations)

Adapt and select among tasks such as memory purging, stream reading, memory-to-disk shuffling, punctuation propagation, etc.

Cooperative Plan Optimization Operators request and selectively provide punctuations (meta-knowledge) to improve

their performance

Adaptive Operator Scheduling Selector scores alternate scheduling algorithm based on their effect on QoS

requirements, and selects candidate. On-line Query Plan Migration

On-line plan restructuring and then online migration to the new plan even for stateful operators.

Distributed Plan Execution Adaptively distribute computations across multiple machines to optimize QoS

requirements without information loss

14

Constraint-Aware Query Operators

Item_id Bidder_id Bid_price Bid_time

1082 Marlie 820.00 Nov-13-03 11:02:00

1080 Ultrasale 1000.00 Nov-13-03 11:05:00

1082 Jocelyn 850.00 Nov-13-03 11:14:00

1082 * * *

No more bidwill be placedon item 1082.

Punctuations [TMS+03] are embedded inside data streams to announce that no tuple after a punctuation will match this punctuation.

Sliding windows characterize data sets that current result is based on.

Punctuation

TSa

TSb - Wa

TSa - Wb

TSb

Wb

Timeline

T0

Wa

Stream A Stream B

• Utilize data-level constraint, punctuations, to purge no-longer-needed data• Exploit time-based constraints, sliding windows, to discard expired data• Incorporate optimizations enabled by interactions between different constraints• Include ability to propagate punctuations to help downstream operators• Employ adaptive execution logic to cope with dynamic stream environments

Window Join:

15

JoinA

a1 *…

a1

a1

a2

a3

b2

b4

b1

b2

A B

a1

a1

a3

c1

c6

c2

A C

… a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

JoinA

…

a2

a3

b1

b2

A B

a1

a1

a3

c1

c6

c2

A C

… a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

Punctuation-Exploiting Join Optimization

JoinA

…

a1

a1

a2

a3

b2

b4

b1

b2

A B

a1

a1

a3

c1

c6

c2

A C…

a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

JoinA…

… a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

1246

TS

Window-Based Join Optimization

357

TSa2 c5 8

Window: 6

Window: 8

a1

a2

a3

b4

b1

b2

A B

246

TS

a1

a1

a3

c1

c6

c2

A C357

TS

a2 c5 8

a2 b1 c5

a1

a1

b2

b4

a1 b2 1

17

Run-time Plan Re-Optimization Step 1 - Decide when to optimize

Statistics Monitoring Step 2 – Generate new query plan

Query Optimization Step 3 – Replace current plan by new plan

Plan Migration

18

Plan Migration Strategy : Out with the Old and In with the New ?

Migration Steps Pause execution of old plan Drain out all tuples inside old plan Replace old plan by new plan Resume execution of new plan

AB

BC

A B C

AB

BC

A B C

Problem: Works for stateless operators only

19

Stateful Operator in CQ

Why stateful Need non-blocking operators in CQ Operator needs to output partial results State data structure keep received tuples

AB

A B

b1b2b3b4b5

ax

State A State B

ax

ax b2ax b3

Key Observation: The purge of tuples in states relies on processing of new tuples.

Example: Symmetric NL join w/ window constraints

20

Migration Strategy Revisited

Steps(1) Pause execution of old plan(2) Drain out all tuples inside old plan(3) Replace old plan by new plan(4) Resume execution of new plan

AB

BC

A B C(2)

All tuples drained

(4)Processing

Resumed

(3) Old Replaced

By new

Deadlock Waiting Problem:

21

Dynamic Plan Migration

Dynamic Plan Migration Input (two migration boxes)

One contains old plan One contains new plan Have same input and output queues

Result Old box is replaced by new box

Two Strategies: Moving State Strategy Parallel Track

Guarantee Valid Migration No missing tuples No duplicates

BC

AB

QA QB QC QD

QABCD

AB

CD

BC

QA QB QC QD

QABCD

SAB SC

SA SBSB

SC

SBC SD

SBCDSACD

SABCSD

Key points:- Involved plans contain stateful operators- Migrate yet still retain useful states and discard useless states.

22


Connection Manager

Query Plan Manager

Runtime Monitor

Distribution DecisionMaker

Distributed Strategy Repository

Cost Model Repository

Configuration Repository

CAPE Engine ( Query Processor)


Connection Manager: synchronizes query plan distribution between query processors.

Query Plan Manager: maintains the location of each query operator and verifies that a particular distribution is valid.

Runtime Monitor: receives statistics from each query processor and calculates respective load over time.

Distribution Decision Maker: creates initial distribution and at runtime reallocates query operators to different machines.

23

M1 M2 M3Legend:

M1

M2

M3

Round-Robin Distribution Grouping Distribution

Goal: To minimize network connectivity.

Algorithm: Takes each query plan and creates sub-plans where neighbouring operators are grouped together.

Goal: To equalize workload per machine.

Algorithm: Iteratively takes each query operator and places it on the query processor with the least number of operators.

Initial Distribution Policies

24


Distribution Table

M 2Operator 8

M 2Operator 7

M 1Operator 6

M 1Operator 5

M 2Operator 4

M 2Operator 3

M 1Operator 2

M 1Operator 1

MachineOperator

Initial Distribution Process

Stream Source

Application

4321

876

5

1 2 3 4

5

67 8

M1

M2

Step 1 Step 2

Step 1: Create distribution table using initial distribution algorithm.

Step 2: Send distribution information to processing machines

25

Redistribution - New Distribution Balance Computation

4100 tuplesM 2

2000 tuplesM 1

M 2Operator 8

M 2Operator 7

M 1Operator 6

M 1Operator 5

M 2Operator 4

M 2Operator 3

M 1Operator 2

M 1Operator 1

Machine

(M)

Operator (OP)

Op 3: .3

Op 4: .2

Op 7: .3

Op 8: .2

.91M 2

Op 1: .25

Op 2: .25

Op 5: .25

Op 6: .25

.44M 1

Operator Cost

CostMachine

Statistics Table

M Capacity: 4500 tuples

Distribution Table

Op 3: .4

Op 4: .3

Op 8: .3

.64M 2

Op 1: .15

Op 2: .15

Op 5: .15

Op 6: .15

.71M 1

Operator Cost

CostMachineBalance

Cost Table (current) Cost Table (desired)

Cape’s costs: # tuples in memory and network output rate. Operators redistributed based on redistribution policy. Redistribution policies of Cape: Balance and Degradation. 6-Step Protocol: Moving Operators Across Machines

Op 7: .4

Cost per machine is determined as percentage of memory filled with tuples.

26

Move Operators Across Machines

27

Experimental Results of Distribution and Re-distribution Algorithms

0

100000

200000

300000

400000

500000

600000

700000

800000

Time (m)

To

tal

Tu

ple

s O

utp

utt

ed Round Robin Distribution

Grouping Distribution

RR + Redist

Grouping + Redist

Query Plan Performance with a Query Plan of 40 Operators.

Observations:

Initial distribution is important for query plan performance.

Redistribution improves at run-time query plan performance.

28

So Much Fun Challenges, So Little Time …

Motion Query Processing :

Sensor on Cars Novel classes of location-aware queries Queries such as “Find me the five nearest

restaurants to my current (moving) location” Not just data but also queries are streaming in. Massive Query Sharing and Shared Execution Novel classes of location-aware queries

29

Publications, TRs & URLs[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, “CAPE: A Cons

traint-Aware Adaptive Stream Processing Engine”. Invited Book Chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.

[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams”. SIGMOD 2004, pages 431-442.

[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams“. EDBT 2004, pages 587-604.

[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams“. CIKM 2004, to appear.

[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, “MJoin: A Metadata-Aware Stream Join Operator”. DEBS 2003.

[SPR04] T. Sutherland, B. Pielech and E. A. Rundensteiner, "Adaptive Scheduling Framework for A Continuous Query System“. Tech Report, WPI-CS-TR-04-16, 2004.

[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture“. Tech Report, WPI-CS-TR-04-18, 2004.

CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html

Raindrop : XQueries on XML Streams (Automaton Meets Algebra)

Hong Su, Elke Rundensteiner, Murali Mani, Ming Li

Worcester Polytechnic Institute

Worcester, MA

** Extracted from VLDB 2004 demo **

31

XML Stream Processing

data sources

data requesters

Networks

32

What’s Special for XML Stream Processing?<Biditems>

<book year=“2001">

<title>Dream Catcher</title>

<author><last>King</last><first>S.</first></author>

<publisher>Bt Bound </publisher>

<price> 30 </initial>

</book>

…

<biditems> <book> <title> Dream Catcher </title> …

Token-by-Token access manner

timeline

Pattern retrieval + Filtering + Restructuring

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

Token: not a direct counterpart of a tuple

30Bt BoundS.KingDream2001

pricepublisherfirstlasttitleyear

Pattern Retrieval on Token Streams

33

Two Computation Paradigms

Automata-based [yfilter, xscan, xpush…] Algebraic [niagara00, …]

FOR $a in stream(bids)//auction, $b in $a/seller[homepage], $c in $a/bidder[sameAddr]WHERE $b/*/phone = “508”Return <auction> $b, $c </auction>

1auction

*

2

3seller

bidder

Automata

8Navigate

$a, /seller->$b

Navigate $a, /bidder-> $c

Tagger

Algebra

Navigate stream(bids),//auction->$a

4

homepage

9sameAddr

5 6* phone

…

7

bid

34

Comparison of Two Paradigms

Either paradigm has deficiencies

Both paradigms complement each other

Automata Paradigm Algebra Paradigm

Good for pattern retrieval on tokens Does not support token inputs

Need patches for filtering and restructuring

Good for filtering and restructuring

Present all details on same low level Support multiple descriptive levels (e.g., logical plan, physical plan)

Little studied as query processing paradigm

Well studied as query process paradigm

35

This Raindrop framework integrates both paradigms into one uniform model

36

One Uniform Algebraic View

Token-based plan (automata plan)

Tuple-based plan

Tuple stream

XML data stream

Query answer

Algebraic Stream Logical Plan

37

Example Uniform Algebraic Plan


Tuple-based plan


38



StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Tuple-based plan

39



StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Select$p<30

Tagger “Inexpensive”, $t->$r

40

Automata Push-In/Out?


Tuple-based Plan

Tuple stream

XML data stream

Query answer

Pattern retrieval in Semantics-focused plan

Apply “push into automata”

41

Optimization Example: Computation Into or Out of Automata?

TokenNavigate $a, /bid/bi

dder->$c

ExtractUnnest $a, $c

ExtractUnnest $a, $b

StructuralJoin $a

TokenNavigate $a, /seller->$

b

TokenNavigate stream(bids), //a

uction->$a

ExtracUnnest stream(bids), $a

NavigateUnnest $a, /seller-

>$b

NavigateUnnest $a, /bid/bid

der->$c

TokenNavigate stream(bids), //aucti

on->$a

NavUnnest stream(bids), //auction->$a

NavigateUnnest $a, /seller ->$b

NavigateUnest $a, /bid/bidder ->$c

Out of Automata Into Automata

Automata Plan

Automata Plan

…

… …

42

Research Challenges:- Semantic Query Optimization- Automata versus Algebra

Optimization- Memory-Sensitive Non-Blocking

Evaluation- On-the-fly Plan Retargeting- XML Load Shedding …

43

See DSRG Web SiteVisit DSRG Research Labs

318 & 319Attend DSRG Research

MeetingCome and talk

Want More ?

elke a. rundensteiner topics projects in database and information systems, such as, web information...

Documents

data warehousing

data distribution

properties of streaming

stream query processing

xml query engine

range slide

edudsrg slide

long slide