stream data management system prototypes ying sheng, richard sia june 1, 2004 professor carlo...

Stream Data Management System Prototypes

Ying Sheng, Richard Sia

June 1, 2004

Professor Carlo Zaniolo CS 240B

Spring 2004

Outline

Motivation of DSMS

Aurora (Brown, Brandeis, MIT)
  Model
  Operator Scheduling
  Storage/Memory Management
  QoS Issue

STREAM (Stanford)
  System Architecture
  Query Language
  Query Plans and Execution
  Performance Issues
  Approximation Techniques
  STREAM Interface

Conclusion

Motivation

From HADP (Human-Active, DBMS-Passive) to DAHP (DBMS-Active, Human-Passive)

Continuous data and static queries

Monitoring using sensors
  Military
  Traffic
  Environment

Financial analysis

Object tracking

Aurora

Aurora – Model

General-purpose DSMS
Continuous stream data comes in
Data flows through a set of operators
Output goes to an application or is materialized

Aurora – Model

Components
  Storage manager
  Scheduler
  Load Shedder
  Router
  QoS Monitor
  GUI

Aurora – Model

Three kinds of query supported
  Continuous query
  View
  Ad-hoc query

Aurora – Model

8 primitive operators (boxes)
  Windowed: Slide, Tumble, Latch, Resample
  Non-windowed: Filter, Map, GroupBy, Join

Aurora – Operator Optimization

Each operator (box) b is associated with
  Selectivity: s(b), also written sel(b)
  Computation time: c(b), also written cost(b)

General optimization techniques
  Pushing projection upstream
  Combining boxes
  Reordering boxes

Aurora – Operator Optimization

Case 1: cost of running box a before box b
  c(a) + s(a)c(b)

Case 2: cost of running box b before box a
  c(b) + s(b)c(a)

Criterion for switching box positions (from a before b to b before a)
  c(a) + s(a)c(b) > c(b) + s(b)c(a)
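The reordering criterion can be checked numerically. The following is a minimal sketch, assuming each box is described only by a per-tuple cost c(b) and a selectivity s(b); the `Box` type and the function names are hypothetical and not part of Aurora.

```python
from collections import namedtuple

# Hypothetical description of a box: per-tuple cost c(b) and selectivity s(b).
Box = namedtuple("Box", ["cost", "selectivity"])

def pipeline_cost(first, second):
    """Expected per-tuple cost of running `first` and then `second`.

    Only the fraction s(first) of tuples reaches `second`.
    """
    return first.cost + first.selectivity * second.cost

def should_swap(a, b):
    """True when running b before a is cheaper than running a before b."""
    return pipeline_cost(a, b) > pipeline_cost(b, a)

a = Box(cost=10.0, selectivity=0.9)   # expensive, passes most tuples
b = Box(cost=1.0, selectivity=0.1)    # cheap, drops most tuples
print(pipeline_cost(a, b), pipeline_cost(b, a), should_swap(a, b))
```

With these illustrative numbers the cheap, selective box b should run first, because it discards most tuples before the expensive box a sees them.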

Aurora – Operator Scheduling

Scheduling by OS
  One thread per box, shifting the job to the OS
  Easier to program

Aurora Scheduler
  Single thread for the scheduler
  The scheduler picks the box with the highest priority and calls it to consume tuples from its queue
  Allows finer control of resources
  Scalable!

Aurora – Operator Scheduling

Aurora – Operator Scheduling

Problem: which box to execute next?

Min-Cost (MC)
  Reduce computation cost

Min-Latency (ML)
  Return results as soon as possible

Min-Memory (MM)
  Reduce memory usage of queues

Aurora – Operator Scheduling

Example (figure; labels: streams, application, downstream)

Aurora – Operator Scheduling

Min-Cost
  Objective: avoid the overhead of calling boxes

Min-Latency
  Prefer the box that can produce output tuples in the shortest time

Min-Memory
  Give preference to the box that consumes more tuples with less computation time
  Similar to "Chain Operator Scheduling"

More at: Operator Scheduling in a Data Stream Manager, VLDB 2003

Aurora – Storage/Memory Management

Manage the queue in front of each box
  Two boxes may share the same queue
  Windowed operators

The initial queue size is 128 KB

Queues are managed as circular queues
  On overflow, the queue size is doubled; when the queue is under-used, it is halved
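The grow-and-shrink policy for a circular queue can be sketched as follows. This is an illustration only, not Aurora's storage manager: the class name `GrowableRing` is hypothetical, capacity is counted in slots rather than bytes, and the shrink trigger (halve when at most a quarter full) is an assumption chosen to avoid repeated resizing near the boundary.

```python
class GrowableRing:
    """Circular FIFO queue that doubles on overflow and halves when mostly empty."""

    def __init__(self, capacity=4):
        self.min_capacity = capacity
        self.buf = [None] * capacity
        self.head = 0          # index of the oldest element
        self.size = 0          # number of stored elements

    def push(self, item):
        if self.size == len(self.buf):           # overflow: double the buffer
            self._resize(2 * len(self.buf))
        self.buf[(self.head + self.size) % len(self.buf)] = item
        self.size += 1

    def pop(self):
        if self.size == 0:
            raise IndexError("pop from empty queue")
        item = self.buf[self.head]
        self.buf[self.head] = None
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        # Assumed shrink rule: halve when at most a quarter of the slots are used.
        if len(self.buf) > self.min_capacity and self.size <= len(self.buf) // 4:
            self._resize(len(self.buf) // 2)
        return item

    def _resize(self, new_capacity):
        # Copy live elements in FIFO order to the front of a new buffer.
        items = [self.buf[(self.head + i) % len(self.buf)] for i in range(self.size)]
        self.buf = items + [None] * (new_capacity - self.size)
        self.head = 0

q = GrowableRing(capacity=2)
for i in range(5):
    q.push(i)                  # capacity grows 2 -> 4 -> 8
```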

Aurora – Storage/Memory Management

Queues are swapped between memory and disk based on the priority of the boxes using them

The storage manager works with the operator scheduler to exchange box priority and buffer-state information

Connection Point Management
  A B-tree indexed on timestamp is built to support random access to tuples by ad-hoc queries

Aurora – Storage/Memory Management

Aurora – QoS Issue

Different queries/applications have different QoS requirements, for example:
  Stock market monitoring
  Average temperature of a set of sensors

QoS Graph

Latency-based QoS graph (figure; labels: eol(b), est(b), latency(b), cost(D(b)), critical point)

Aurora – QoS-driven Scheduling

Assign a priority to each box b:
  priority(b) = [utility(b), est(b)]

utility(b) = gradient(eol(b))
  How much the QoS degrades by the time the tuple leaves the system if we process it now.

est(b)
  How soon the box will exhibit another performance degradation if we do not process it now.
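The priority rule can be sketched in code under several stated assumptions that are not in the slides: the latency-based QoS graph is piecewise linear and given as a list of (latency, QoS) points, utility is taken as the magnitude of the downward slope at eol(b), est(b) is approximated as the distance from eol(b) to the next critical point, and ties on utility are broken in favor of the smaller est. All function names are hypothetical.

```python
from bisect import bisect_right

def slope_at(graph, latency):
    """Slope of a piecewise-linear QoS graph [(latency, qos), ...] at `latency`."""
    xs = [x for x, _ in graph]
    i = bisect_right(xs, latency) - 1
    i = max(0, min(i, len(graph) - 2))      # clamp to a valid segment
    (x0, y0), (x1, y1) = graph[i], graph[i + 1]
    return (y1 - y0) / (x1 - x0)

def slack_to_next_critical_point(graph, latency):
    """Assumed stand-in for est(b): time until the next critical point."""
    for x, _ in graph:
        if x > latency:
            return x - latency
    return float("inf")

def priority(graph, eol):
    utility = -slope_at(graph, eol)          # how fast QoS is dropping at eol(b)
    est = slack_to_next_critical_point(graph, eol)
    return utility, est

def pick_next(boxes):
    """boxes: {name: (graph, eol)}. Highest utility first, then smallest est."""
    def key(name):
        utility, est = priority(*boxes[name])
        return (utility, -est)
    return max(boxes, key=key)

# QoS stays at 1.0 until latency 10, then falls linearly to 0 at latency 20.
graph = [(0, 1.0), (10, 1.0), (20, 0.0)]
boxes = {"A": (graph, 5), "B": (graph, 12)}
print(pick_next(boxes))
```

Box B is already on the falling part of its graph, so delaying it costs QoS immediately, while box A still has slack before its critical point.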

Performance (200 queries/applications, each with 5 boxes)
  Round robin: 0.43
  QoS-driven scheduling: 0.85

Aurora – Current Status

Main components of a DSMS are introduced
  Operator scheduler
  Memory/storage management
  QoS concept in a stress environment
  Load shedding

Implemented in C++, with a Java-based GUI
Depends on a few external software packages/libraries

More?
  Distributed architecture: Aurora*
  Fault tolerance or disaster recovery?

STREAM

STREAM – Introduction

General-purpose prototype DSMS
Supports data streams and stored relations
Declarative language for registering continuous queries
Flexible query plans and execution strategies
Aggressive sharing of state and computation among queries

STREAM – Introduction

Designed to cope with
  Stream rates that may be high, variable, bursty
  Continuous query loads that may be high, volatile

Primary coping techniques
  Graceful approximation as necessary
  Careful resource allocation and use
  Continuous self-monitoring and reoptimization

STREAM – System Architecture

(Figure; labels: Input streams, Register Query, Streamed Result, Stored Result, Archive, Stored Relations, Scratch Store)

STREAM – Query Language

Continuous Query Language (CQL)

Extends SQL with
  Streams as a new data type
    Stream: unbounded bag of pairs <tuple, timestamp>
    Relation: time-varying bag of tuples
  Continuous instead of one-time semantics

Three classes of operators
  Relation-to-relation
  Stream-to-relation
  Relation-to-stream

STREAM – CQL Operators

Relation-to-relation
  SQL constructs

Stream-to-relation
  Tuple-based sliding window: [Rows N], [Rows Unbounded]
  Time-based sliding window: [Range ω], [Now]
  Partitioned sliding window: [Partition By A1,…,Ak Rows N]

Relation-to-stream
  Istream: insert stream
  Dstream: delete stream
  Rstream: relation stream
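How the operator classes compose can be illustrated with a small sketch: a [Rows N] window turns a stream into a time-varying relation, and Istream/Dstream turn successive relation states back into streams of inserted and deleted tuples. This is a simplified model, not STREAM's implementation; it assumes every stream element has a distinct timestamp, and all function names are hypothetical.

```python
from collections import Counter, deque

def rows_window(stream, n):
    """Stream-to-relation: after each arrival, yield (timestamp, bag of last n tuples)."""
    buf = deque(maxlen=n)
    for tup, ts in stream:
        buf.append(tup)
        yield ts, Counter(buf)          # a Counter models a bag (multiset)

def istream(relation_states):
    """Relation-to-stream: emit tuples that entered the relation at each timestamp."""
    prev = Counter()
    for ts, cur in relation_states:
        for tup in (cur - prev).elements():
            yield tup, ts
        prev = cur

def dstream(relation_states):
    """Relation-to-stream: emit tuples that left the relation at each timestamp."""
    prev = Counter()
    for ts, cur in relation_states:
        for tup in (prev - cur).elements():
            yield tup, ts
        prev = cur

stream = [("a", 1), ("b", 2), ("c", 3)]
inserted = list(istream(rows_window(stream, 2)))
deleted = list(dstream(rows_window(stream, 2)))
print(inserted, deleted)
```

With a two-row window, tuple "a" leaves the relation when "c" arrives at time 3, which is the only element the delete stream reports.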

STREAM – Example Query 1

Two example streams:
  Orders (orderID, customer, cost)
  Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”:
    Select Sum(O.cost)
    From Orders O, Fulfillments F [Range 1 Day]
    Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

STREAM – Example Query 2

Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost:
    Select F.clerk, Max(O.cost)
    From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
    Where O.orderID = F.orderID
    Group By F.clerk

STREAM – Simplified Query 2

Result is a relation, updated as stream elements arrive:
    Select F.clerk, Max(O.cost)
    From O, F [Rows 100]
    Where O.orderID = F.orderID
    Group By F.clerk

STREAM – Simplified Query 2

Result is streamed: emits a <clerk, max> stream element whenever the max changes for a clerk (or a new clerk appears):
    Select Istream(F.clerk, Max(O.cost))
    From O, F [Rows 100]
    Where O.orderID = F.orderID
    Group By F.clerk

STREAM – Example Query 3

Relation: CurPrice(stock, price)

Average price over the last day for each stock:
    Select stock, Avg(price)
    From Istream(CurPrice) [Range 1 Day]
    Group By stock

Istream provides the history of CurPrice
Window on the history (back to a relation), then group and aggregate

STREAM – Query Plans and Execution

When a continuous query is registered, a query plan is generated
  The new plan is merged with existing plans
  Users can also create and manipulate plans directly

Plans are composed of three main components:

Operators
  Flag: insertion (+), deletion (-)
  Elements: tuple-timestamp-flag triples
  Streams: only + elements
  Relations: both + and - elements

Queues
  Enforce nondecreasing timestamps (“heartbeats”)
  Mechanisms for buffering tuples

States (Synopses)

Global scheduler for plan execution

STREAM – States

States (Synopses)
  Summarize elements seen so far (exact or approximate) for operators requiring history
  Used to implement windows

Example: synopsis join (State1 ⋈ State2)
  Sliding-window join
  Approximation of full join

STREAM – Simple Query Plan

    Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10
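A plan for this query needs one synopsis per input. The following is a minimal sketch of a symmetric window join, not STREAM's actual operator: the class `WindowJoin` and its methods are hypothetical, timestamps are assumed to be in seconds and nondecreasing, and only newly produced join results are emitted (result deletions caused by window expiry are ignored).

```python
from collections import deque

class WindowJoin:
    """Joins S1 [Rows n] with S2 [Range r] on attribute A, keeping rows with A > 10."""

    def __init__(self, rows=1000, range_seconds=120):
        self.s1 = deque(maxlen=rows)   # synopsis for S1: last `rows` tuples
        self.s2 = deque()              # synopsis for S2: (timestamp, tuple) pairs
        self.range = range_seconds

    def _expire_s2(self, now):
        # Drop S2 tuples whose timestamp is older than now - range.
        while self.s2 and self.s2[0][0] < now - self.range:
            self.s2.popleft()

    def on_s1(self, tup, ts):
        self._expire_s2(ts)
        # Every S1 tuple occupies a window slot, even if it fails the filter;
        # filtering before the window would change what [Rows n] contains.
        self.s1.append(tup)
        if tup["A"] <= 10:
            return []
        return [(tup, t2) for _, t2 in self.s2 if t2["A"] == tup["A"]]

    def on_s2(self, tup, ts):
        self._expire_s2(ts)
        # S1.A = S2.A and S1.A > 10 imply S2.A > 10, so S2 can be filtered early.
        if tup["A"] <= 10:
            return []
        self.s2.append((ts, tup))
        return [(t1, tup) for t1 in self.s1 if t1["A"] == tup["A"]]

j = WindowJoin(rows=2, range_seconds=120)
r1 = j.on_s1({"A": 11}, 0)
r2 = j.on_s2({"A": 11, "B": 1}, 10)
```

Pushing the filter below the time-based window on S2 is safe because expiry depends only on timestamps; the row-based window on S1 does not allow the same shortcut.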

STREAM – Performance Issues

Synopsis Sharing
  Eliminate data redundancy

Exploiting Constraints
  Selectively discard data to reduce state

Operator Scheduling
  Reduce queue sizes

STREAM – Synopsis Sharing

Eliminate redundancy by replacing nearly identical synopses with
  lightweight stubs
  a single store to hold the actual tuples

The store tracks the progress of each stub and presents the appropriate view to each stub.

The store contains the union of the tuples in its corresponding stubs.

STREAM – Synopsis Sharing

    Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10

    Select A, Max(B)
    From S1 [Rows 200]
    Group By A
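Both queries keep a row-based window over S1, so their synopses can share one store. The sketch below shows the idea for row-based windows only; the class `SharedStore`, its methods, and the stub names are hypothetical, and STREAM's real store also tracks per-stub progress, which is omitted here.

```python
from collections import deque

class SharedStore:
    """Holds the tuples of one stream once; each stub sees its own [Rows n] view."""

    def __init__(self):
        self.tuples = deque()
        self.stub_rows = {}                  # stub name -> window size in rows

    def add_stub(self, name, rows):
        if rows < 1:
            raise ValueError("rows must be at least 1")
        self.stub_rows[name] = rows

    def insert(self, tup):
        self.tuples.append(tup)
        # Keep only what the largest window needs: the union of all stub views.
        keep = max(self.stub_rows.values(), default=0)
        while len(self.tuples) > keep:
            self.tuples.popleft()

    def view(self, name):
        """Tuples visible to one stub: the most recent `rows` tuples."""
        rows = self.stub_rows[name]
        return list(self.tuples)[-rows:]

store = SharedStore()
store.add_stub("join_synopsis", rows=3)      # stands in for S1 [Rows 1000]
store.add_stub("agg_synopsis", rows=2)       # stands in for S1 [Rows 200]
for t in [1, 2, 3, 4, 5]:
    store.insert(t)
```

The store never holds more tuples than the largest window, and the smaller window is served as a suffix of the same data instead of a second copy.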

STREAM – Exploiting Constraints

Specify an adherence parameter k to capture how closely a given stream or set of streams adheres to a constraint of a given type
  Referential integrity k-constraint
  Ordered-arrival k-constraint
  Clustered-arrival k-constraint

Query execution plans reduce or eliminate state based on k-constraints

If a constraint is violated, the result is approximate

STREAM – Operator Scheduling

Goal: minimize total queue size for unpredictable, bursty stream arrival patterns

Chain Scheduling Algorithm:

1. Mark the first operator in the plan as the “current” operator

2. Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time.

3. Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains.

4. Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order.
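Steps 1 to 3 can be sketched for a single operator path. This is an illustration under stated assumptions, not STREAM's code: each operator is given as a hypothetical (cost, selectivity) pair with positive cost, queue size is measured relative to one input tuple, and when two blocks reduce the queue at the same rate the shorter one is chosen.

```python
def build_chains(operators):
    """Partition a path of operators into chains.

    operators: list of (cost, selectivity) pairs in plan order.
    Returns a list of (start, end) index ranges, with `end` exclusive.
    """
    # Cumulative processing time and remaining tuple size after each operator.
    times, sizes = [0.0], [1.0]
    for cost, selectivity in operators:
        times.append(times[-1] + cost)
        sizes.append(sizes[-1] * selectivity)

    chains = []
    current = 0                                   # step 1
    while current < len(operators):
        # Step 2: block starting at `current` with the largest size
        # reduction per unit of processing time.
        best_end = max(
            range(current + 1, len(operators) + 1),
            key=lambda j: (sizes[current] - sizes[j]) / (times[j] - times[current]),
        )
        chains.append((current, best_end))
        current = best_end                        # step 3
    return chains

# A weak filter followed by a strong filter is grouped into one chain.
plan = [(1.0, 0.9), (1.0, 0.1), (1.0, 0.5)]
print(build_chains(plan))
```

Grouping the first two operators pays off because the weak filter alone frees little memory per unit time, while the pair together frees much more.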

Proven: within a constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries

Empirical results: large savings over naive strategies for many queries

But minimizing queue sizes is at odds with minimizing latency

STREAM – Approximation

CPU-Limited Approximation
  Insufficient CPU time to process each stream element due to the high data arrival rate.
  Load-shedding sampling operators
  Approximate by probabilistically dropping elements before they are processed

Memory-Limited Approximation
  The total state required for all registered queries exceeds available memory.
  The system selectively shrinks or discards synopses.
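The CPU-limited case can be sketched as a sampling operator placed in front of an expensive operator. This is a minimal illustration; the function name `load_shed` and its parameters are hypothetical, and how STREAM chooses the sampling rate is not shown.

```python
import random

def load_shed(stream, keep_probability, rng=None):
    """Pass each element through with probability `keep_probability`, else drop it."""
    rng = rng or random.Random()
    for element in stream:
        if rng.random() < keep_probability:
            yield element

# Keep roughly half of the elements; a seeded generator makes runs repeatable.
kept = list(load_shed(range(1000), 0.5, rng=random.Random(0)))
# A count over the original stream can be estimated by scaling the sample.
estimated_count = len(kept) / 0.5
```

Scaling by the inverse of the sampling rate gives an approximate answer for counts and sums; other aggregates such as maxima cannot be corrected this way.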

STREAM – Query Interface

View the structure of query plans and their component entities.
View the detailed properties of each entity.
Dynamically adjust entity properties.
View monitoring graphs that display time-varying entity properties plotted dynamically against time.
  Queue sizes, throughput, overall memory usage, and join selectivity.

STREAM – Query Plan Monitoring

STREAM – Current Status

Version 1.0 up and running

Includes a new monitoring and adaptive query processing infrastructure: StreaMon
  Executor runs query plans to produce results.
  Profiler collects and maintains statistics about stream and plan characteristics.
  Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics.

Web demo available at http://shark.stanford.edu:8080/

Future Directions:
  Distributed Stream Processing
  Crash Recovery
  Improved Approximation
  Classification of Applications

Conclusion

Ideal DSMS
  Well-defined and flexible query language
  User-friendly interface
  Scalable
    Operator scheduling
    Storage management
    Synopsis sharing
    Approximation
  Quality assurance
  Fault tolerant

References

R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in Proceedings of the 1st CIDR Conference, 2003.

S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in Proceedings of the SIGMOD Conference, 2002.

D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of the VLDB Conference, 2002.

D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the VLDB Conference, 2003.

Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html

Aurora Project Website: http://www.cs.brown.edu/research/aurora
