elke a. rundensteiner topics projects in database and information systems, such as, web information...

42
Elke A. Rundensteiner Topics projects in database and Information systems, such as web information systems, distributed databases, Etc. Database Systems Research Lab Email: rundenst @ cs . wpi . edu Office: Fuller 238 Phone: x – 5815 Webpages: http://www. cs . wpi . edu /~ rundenst http://davis.wpi.edu/dsrg

Post on 20-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

Elke A. Rundensteiner

Topics projects in database and

Information systems, such as,web information systems,

distributed databases,Etc.

Database Systems Research Lab

Email: [email protected]

Office: Fuller 238

Phone: x – 5815

Webpages: http://www.cs.wpi.edu/~rundenst

http://davis.wpi.edu/dsrg

Page 2: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

2

Core Projects in a Nutshell: Distributed Data Sources:

EVE : Data Warehousing over Distributed Data Sources

TOTAL-ETL : Distributed Extract Transform Load Tools

Web Information Systems: RAINBOW : XML Query

Engine and Tools XML to Relational Mapping

Technology

Databases and Visualization: Visualization-Driven

Data Caching Prefetching based on

User Access Patterns Stream Monitoring

Systems: New Generation of

Query Engines Adaptive Distributed

Optimizations

Page 3: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

3

DSRG Research Group

2 faculty 8 Phd students 5 MS students 4 MQP teams

Page 4: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

4

Stream Monitoring Applications Monitor troop movements during combat and

warn when soldiers veer off course Send alert when patient’s vital signs begin to

deteriorate Monitor incoming news feeds to see stories on

“Iraq” Scour network traffic logs looking for intruders

Page 5: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

5

Properties of Monitoring Applications Queries and monitors run continuously,

possibly unending Applications have varying service

preferences: Patient monitoring only wants freshest data Remote sensors have limited memory News service wishes maximal throughput Taking 60 seconds to process vital signs and

sound an alert may be too long

Page 6: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

6

Properties of Streaming Data

Possibly never ending stream of data Unpredictable arrival patterns:

Network congestion Weather (for external sensors) Sensor moves out of range

Page 7: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

7

Querying and Monitoring Streaming Data

New Data Format: Querying and Analyzing Continuous, unbounded, rapid, time-varying streams of data elements

Traditional DB vs. Stream System

Persistent finite relations vs. transient infinite data streams

One-time queries vs. large number of continuously running queries

Random access vs. sequential access

Query plan determined by static design decisions vs. unpredictable data characteristics and arrival patterns

Page 8: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

8

Databases Upside Down

data

Query

Query

Query

Query

data

data

data

data

data

streamsof data

static data

Standing queries

one-time queries

Page 9: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

9

CAPE: Constraint-Aware Adaptive Processing Engine

Elke A. Rundensteiner, G. Heineman, Luping Ding, Timothy Sutherland, Yali Zhu

Brad Pielech, Nishant Mehta

Natasha Bogdanova, Mariana Jbantova

Extracted from VLDB 2004 CAPE Demo.

http://davis.wpi.edu/dsrg/CAPE/index.html

Page 10: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

10

Uncertainties in Stream Query Processing

RegisterContinuous Queries

Stream QueryEngine

Stream QueryEngine

Streaming DataStreaming Result

May have different QoS Requirements.

May have time-varying rates and data di

stribution.

Available resources for executing each operator

may vary over time.

Adaptations are required for stream query engine.

Page 11: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

11

What is CAPE?

Constraint-aware

Adaptive

Continuous Query

Processing

Engine

Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.

Incorporate heterogeneous-grained adaptivity at all query processing levels.

- Adaptive query operator execution- Adaptive query plan re-optimization- Adaptive operator scheduling- Adaptive query plan distribution

Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.

Page 12: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

12

CAPE System Architecture

Distribution Manager

Plan Reoptimizer

OperatorScheduler

OperatorConfigurator

OperatorConfigurator

OperatorScheduler

PlanReoptimizer

CAPE Query Engine

QoS Inspector

Execution Engine

StorageManager

StreamSender

StreamFeeder

StreamReceiver

Internet

Control Flow

Data Flow

Legend:

DistributionManager

Query PlanGenerator

Stream / QueryRegistration

GUI

Query 2 . . Query nQuery 1

StreamingData

End User

Page 13: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

13

Research Contributions Scalable Query Operators (Punctuations)

Adapt and select among tasks such as memory purging, stream reading, memory-to-disk shuffling, punctuation propagation, etc.

Cooperative Plan Optimization Operators request and selectively provide punctuations (meta-knowledge) to improve

their performance

Adaptive Operator Scheduling Selector scores alternate scheduling algorithm based on their effect on QoS

requirements, and selects candidate. On-line Query Plan Migration

On-line plan restructuring and then online migration to the new plan even for stateful operators.

Distributed Plan Execution Adaptively distribute computations across multiple machines to optimize QoS

requirements without information loss

Page 14: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

14

Constraint-Aware Query Operators

Item_id Bidder_id Bid_price Bid_time

1082 Marlie 820.00 Nov-13-03 11:02:00

1080 Ultrasale 1000.00 Nov-13-03 11:05:00

1082 Jocelyn 850.00 Nov-13-03 11:14:00

1082 * * *

No more bidwill be placedon item 1082.

Punctuations [TMS+03] are embedded inside data streams to announce that no tuple after a punctuation will match this punctuation.

Sliding windows characterize data sets that current result is based on.

Punctuation

TSa

TSb - Wa

TSa - Wb

TSb

Wb

Timeline

T0

Wa

Stream A Stream B

• Utilize data-level constraint, punctuations, to purge no-longer-needed data• Exploit time-based constraints, sliding windows, to discard expired data• Incorporate optimizations enabled by interactions between different constraints• Include ability to propagate punctuations to help downstream operators• Employ adaptive execution logic to cope with dynamic stream environments

Window Join:

Page 15: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

15

JoinA

a1 *…

a1

a1

a2

a3

b2

b4

b1

b2

A B

a1

a1

a3

c1

c6

c2

A C

… a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

JoinA

a2

a3

b1

b2

A B

a1

a1

a3

c1

c6

c2

A C

… a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

Punctuation-Exploiting Join Optimization

JoinA

a1

a1

a2

a3

b2

b4

b1

b2

A B

a1

a1

a3

c1

c6

c2

A C…

a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

JoinA…

… a1

a1

a1

a1

b2

b4

b2

b4

A Bc1

c1

c6

a6

C

a3 b2 c2

1246

TS

Window-Based Join Optimization

357

TSa2 c5 8

Window: 6

Window: 8

a1

a2

a3

b4

b1

b2

A B

246

TS

a1

a1

a3

c1

c6

c2

A C357

TS

a2 c5 8

a2 b1 c5

a1

a1

b2

b4

a1 b2 1

Page 16: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

17

Run-time Plan Re-Optimization Step 1 - Decide when to optimize

Statistics Monitoring Step 2 – Generate new query plan

Query Optimization Step 3 – Replace current plan by new plan

Plan Migration

Page 17: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

18

Plan Migration Strategy : Out with the Old and In with the New ?

Migration Steps Pause execution of old plan Drain out all tuples inside old plan Replace old plan by new plan Resume execution of new plan

AB

BC

A B C

AB

BC

A B C

Problem: Works for stateless operators only

Page 18: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

19

Stateful Operator in CQ

Why stateful Need non-blocking operators in CQ Operator needs to output partial results State data structure keep received tuples

AB

A B

b1b2b3b4b5

ax

State A State B

ax

ax b2ax b3

Key Observation: The purge of tuples in states relies on processing of new tuples.

Example: Symmetric NL join w/ window constraints

Page 19: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

20

Migration Strategy Revisited

Steps(1) Pause execution of old plan(2) Drain out all tuples inside old plan(3) Replace old plan by new plan(4) Resume execution of new plan

AB

BC

A B C(2)

All tuples drained

(4)Processing

Resumed

(3) Old Replaced

By new

Deadlock Waiting Problem:

Page 20: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

21

Dynamic Plan Migration

Dynamic Plan Migration Input (two migration boxes)

One contains old plan One contains new plan Have same input and output queues

Result Old box is replaced by new box

Two Strategies: Moving State Strategy Parallel Track

Guarantee Valid Migration No missing tuples No duplicates

BC

AB

QA QB QC QD

QABCD

AB

CD

BC

QA QB QC QD

QABCD

SAB SC

SA SBSB

SC

SBC SD

SBCDSACD

SABCSD

Key points:- Involved plans contain stateful operators- Migrate yet still retain useful states and discard useless states.

Page 21: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

22

Distribution Manager

Connection Manager

Query Plan Manager

Runtime Monitor

Distribution DecisionMaker

Distributed Strategy Repository

Cost Model Repository

Configuration Repository

CAPE Engine ( Query Processor)

Distribution Manager

Connection Manager: synchronizes query plan distribution between query processors.

Query Plan Manager: maintains the location of each query operator and verifies that a particular distribution is valid.

Runtime Monitor: receives statistics from each query processor and calculates respective load over time.

Distribution Decision Maker: creates initial distribution and at runtime reallocates query operators to different machines.

Page 22: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

23

M1 M2 M3Legend:

M1

M2

M3

Round-Robin Distribution Grouping Distribution

Goal: To minimize network connectivity.

Algorithm: Takes each query plan and creates sub-plans where neighbouring operators are grouped together.

Goal: To equalize workload per machine.

Algorithm: Iteratively takes each query operator and places it on the query processor with the least number of operators.

Initial Distribution Policies

Page 23: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

24

Distribution Manager

Distribution Table

M 2Operator 8

M 2Operator 7

M 1Operator 6

M 1Operator 5

M 2Operator 4

M 2Operator 3

M 1Operator 2

M 1Operator 1

MachineOperator

Initial Distribution Process

Stream Source

Application

4321

876

5

1 2 3 4

5

67 8

M1

M2

Step 1 Step 2

Step 1: Create distribution table using initial distribution algorithm.

Step 2: Send distribution information to processing machines

Page 24: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

25

Redistribution - New Distribution Balance Computation

4100 tuplesM 2

2000 tuplesM 1

M 2Operator 8

M 2Operator 7

M 1Operator 6

M 1Operator 5

M 2Operator 4

M 2Operator 3

M 1Operator 2

M 1Operator 1

Machine

(M)

Operator (OP)

Op 3: .3

Op 4: .2

Op 7: .3

Op 8: .2

.91M 2

Op 1: .25

Op 2: .25

Op 5: .25

Op 6: .25

.44M 1

Operator Cost

CostMachine

Statistics Table

M Capacity: 4500 tuples

Distribution Table

Op 3: .4

Op 4: .3

Op 8: .3

.64M 2

Op 1: .15

Op 2: .15

Op 5: .15

Op 6: .15

.71M 1

Operator Cost

CostMachineBalance

Cost Table (current) Cost Table (desired)

Cape’s costs: # tuples in memory and network output rate. Operators redistributed based on redistribution policy. Redistribution policies of Cape: Balance and Degradation. 6-Step Protocol: Moving Operators Across Machines

Op 7: .4

Cost per machine is determined as percentage of memory filled with tuples.

Page 25: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

26

Move Operators Across Machines

Page 26: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

27

Experimental Results of Distribution and Re-distribution Algorithms

0

100000

200000

300000

400000

500000

600000

700000

800000

Time (m)

To

tal

Tu

ple

s O

utp

utt

ed Round Robin Distribution

Grouping Distribution

RR + Redist

Grouping + Redist

Query Plan Performance with a Query Plan of 40 Operators.

Observations:

Initial distribution is important for query plan performance.

Redistribution improves at run-time query plan performance.

Page 27: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

28

So Much Fun Challenges, So Little Time …

Motion Query Processing :

Sensor on Cars Novel classes of location-aware queries Queries such as “Find me the five nearest

restaurants to my current (moving) location” Not just data but also queries are streaming in. Massive Query Sharing and Shared Execution Novel classes of location-aware queries

Page 28: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

29

Publications, TRs & URLs[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, “CAPE: A Cons

traint-Aware Adaptive Stream Processing Engine”. Invited Book Chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004.

[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams”. SIGMOD 2004, pages 431-442.

[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams“. EDBT 2004, pages 587-604.

[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams“. CIKM 2004, to appear.

[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, “MJoin: A Metadata-Aware Stream Join Operator”. DEBS 2003.

[SPR04] T. Sutherland, B. Pielech and E. A. Rundensteiner, "Adaptive Scheduling Framework for A Continuous Query System“. Tech Report, WPI-CS-TR-04-16, 2004.

[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture“. Tech Report, WPI-CS-TR-04-18, 2004.

CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html

Page 29: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

Raindrop : XQueries on XML Streams (Automaton Meets Algebra)

Hong Su, Elke Rundensteiner, Murali Mani, Ming Li

Worcester Polytechnic Institute

Worcester, MA

** Extracted from VLDB 2004 demo **

Page 30: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

31

XML Stream Processing

data sources

data requesters

Networks

Page 31: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

32

What’s Special for XML Stream Processing?<Biditems>

<book year=“2001">

<title>Dream Catcher</title>

<author><last>King</last><first>S.</first></author>

<publisher>Bt Bound </publisher>

<price> 30 </initial>

</book>

<biditems> <book> <title> Dream Catcher </title> …

Token-by-Token access manner

timeline

Pattern retrieval + Filtering + Restructuring

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>

Token: not a direct counterpart of a tuple

30Bt BoundS.KingDream2001

pricepublisherfirstlasttitleyear

Pattern Retrieval on Token Streams

Page 32: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

33

Two Computation Paradigms

Automata-based [yfilter, xscan, xpush…] Algebraic [niagara00, …]

FOR $a in stream(bids)//auction, $b in $a/seller[homepage], $c in $a/bidder[sameAddr]WHERE $b/*/phone = “508”Return <auction> $b, $c </auction>

1auction

*

2

3seller

bidder

Automata

8Navigate

$a, /seller->$b

Navigate $a, /bidder-> $c

Tagger

Algebra

Navigate stream(bids),//auction->$a

4

homepage

9sameAddr

5 6* phone

7

bid

Page 33: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

34

Comparison of Two Paradigms

Either paradigm has deficiencies

Both paradigms complement each other

Automata Paradigm Algebra Paradigm

Good for pattern retrieval on tokens Does not support token inputs

Need patches for filtering and restructuring

Good for filtering and restructuring

Present all details on same low level Support multiple descriptive levels (e.g., logical plan, physical plan)

Little studied as query processing paradigm

Well studied as query process paradigm

Page 34: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

35

This Raindrop framework integrates both paradigms into one uniform model

Page 35: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

36

One Uniform Algebraic View

Token-based plan (automata plan)

Tuple-based plan

Tuple stream

XML data stream

Query answer

Algebraic Stream Logical Plan

Page 36: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

37

Example Uniform Algebraic Plan

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>

Tuple-based plan

Token-based plan (automata plan)

Page 37: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

38

Example Uniform Algebraic Plan

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>

StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Tuple-based plan

Page 38: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

39

Example Uniform Algebraic Plan

FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>

StructuralJoin$b

ExtractNest $b, $p

ExtractNest $b, $t

Navigate $b, /price->$p

Navigate $b, /title->$t

Navigate $S1, //book ->$b

Select$p<30

Tagger “Inexpensive”, $t->$r

Page 39: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

40

Automata Push-In/Out?

Token-based plan (automata plan)

Tuple-based Plan

Tuple stream

XML data stream

Query answer

Pattern retrieval in Semantics-focused plan

Apply “push into automata”

Page 40: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

41

Optimization Example: Computation Into or Out of Automata?

TokenNavigate $a, /bid/bi

dder->$c

ExtractUnnest $a, $c

ExtractUnnest $a, $b

StructuralJoin $a

TokenNavigate $a, /seller->$

b

TokenNavigate stream(bids), //a

uction->$a

ExtracUnnest stream(bids), $a

NavigateUnnest $a, /seller-

>$b

NavigateUnnest $a, /bid/bid

der->$c

TokenNavigate stream(bids), //aucti

on->$a

NavUnnest stream(bids), //auction->$a

NavigateUnnest $a, /seller ->$b

NavigateUnest $a, /bid/bidder ->$c

Out of Automata Into Automata

Automata Plan

Automata Plan

… …

Page 41: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

42

Research Challenges:- Semantic Query Optimization- Automata versus Algebra

Optimization- Memory-Sensitive Non-Blocking

Evaluation- On-the-fly Plan Retargeting- XML Load Shedding …

Page 42: Elke A. Rundensteiner Topics projects in database and Information systems, such as, web information systems, distributed databases, Etc. Database Systems

43

See DSRG Web SiteVisit DSRG Research Labs

318 & 319Attend DSRG Research

MeetingCome and talk

Want More ?