stream data management system prototypes ying sheng, richard sia june 1, 2004 professor carlo...
Post on 22-Dec-2015
217 Views
Preview:
TRANSCRIPT
Stream Data Management System Prototypes
Ying Sheng, Richard SiaJune 1, 2004
Professor Carlo Zaniolo CS 240B
Spring 2004
Outline Motivation of DSMS Aurora (Brown, Brandeis, MIT)
Model Operator Scheduling Storage/Memory Management QoS issue
STREAM (Stanford) System Architecture Query Language Query Plans and Execution Performance Issues Approximation Techniques STREAM Interface
Conclusion
Motivation
HADP DAHP Continuous data and static queries
Monitoring using sensor Military Traffic Environment
Financial analysis Object tracking
Aurora
Aurora – Model General Purpose DSMS Continuous stream data comes Flow through a set of operators Output to application or materialized
Aurora – Model Components
Storage manager Scheduler Load Shedder Router QoS Monitor GUI
Aurora – Model 3 kinds of query supported
Continuous View Ad-Hoc Query
Aurora – Model 8 primitive operators (Box)
Windowed Slide Tumble Latch Resample
Non-windowed Filter Map GroupBy Join
Aurora – Operator Optimization Each operator associated with
Selectivity: s(b), sel(b) Computation time: c(b), cost(b)
General Optimization Techniques Pushing projection upstream Combining boxes Reordering boxes
Aurora – Operator Optimization Case 1 : cost of ab
c(a) + s(a)c(b)
Case 2: cost of ba c(b) + s(b)c(a)
Criteria for switching box position c(a)+s(a)c(b) > c(b)+s(b)c(a)
a b
b a
Aurora – Operator Scheduling Scheduling by OS
One thread per box, shift the job to OS Easier to program
Aurora Scheduler Single thread for the scheduler The scheduler pick a box with highest priority and
call the box to consume tuples from queue Allow finer control of resource
Scalable !
Aurora – Operator Scheduling
Aurora – Operator Scheduling Problem: which box to execute next?
Min-Cost (MC) Reduce computation cost
Min-Latency (ML) Return result as soon as possible
Min-Memory (MM) Reduce memory usage of queue
Aurora – Operator Scheduling
Example
b4
b5
b6
b2
b3 b1
streams application
Downstream
Aurora – Operator Scheduling Min-Cost
Objective: avoid overhead of calling boxes
Min-Latency Prefer box which can produce tuples in the output at a
shorter period of time
Min-Memory Give preference to box which will consume more tuples
with less computation time Similar to “Chain Operator Scheduling”
More at:Operator Scheduling in a Data Stream Manager, VLDB 2003
Aurora – Storage/Memory Management Manage the queue in front of each box
2 boxes sharing the same queue windowed operator
The initial queue size is 128 KB Queues are managed as a circular queue
If overflow, double the queue size, or vice versa
Aurora – Storage/Memory Management Swap in/out between memory / disk based
on priority of boxes using it
Work with Operator Scheduler to exchange box priority and buffer-state information
Connection Point Management A B-tree indexed on timestamp is built to support
random access of tuples by ad-hoc query
Aurora – Storage/Memory Management
Aurora – QoS Issue Different queries/applications have
different QoS requirement Stock market monitoring Average temperature of a set of sensor
QoS Graph
Latency-based QoS Graph
b
time
QoS
0
D(b)
eol(b)
est(b)
latency(b)
cost(D(b))
Critical Point
Aurora – QoS-driven Scheduling Assign priority to each box based on
priority (b) = [utility (b), est (b)] utility (b) = gradient (eol (b))
How is the QoS degrading by the time the tuple leave the system when we process it now.
est (b) How soon it will exhibit another performance
degradation if we don’t process it now.
Performance 200 queries/application, each with 5 boxes Round robin - 0.43 QoS driven scheduling – 0.85
Aurora – Current Status Main components of a DSMS are introduced
Operator scheduler Memory/storage management QoS concept in stress environment Load shedding
Implemented in C++, with Java-based GUI Dependent on a few software/library
More? Distributed architecture – Aurora* Fault tolerance or disaster recovery ?
STREAM
STREAM – Introduction General-purpose prototype DSMS Supports data streams and stored
relations Declarative language for registering
continuous queries Flexible query plans and execution
strategies Aggressive sharing of state and
computation among queries
STREAM – Introduction Designed to cope with
Stream rates that may be high, variable, bursty
Continuous query loads that may be high, volatile
Primary coping techniques Graceful approximation as necessary Careful resource allocation and use Continuous self-monitoring and
reoptimization
DSMS
Scratch Store
STREAM – System Architecture
Input streams
RegisterQuery
StreamedResult
StoredResult
Archive
StoredRelations
STREAM – Query Language
Continuous Query Language – CQL Extends SQL with
Streams as new data type Stream: Unbounded bag of pairs <tuple,
timestamp> Relation: time-varying bags of tuples
Continuous instead of one-time semantics Three classes of operators
Relation-to-relation Stream-to-relation Relation-to-stream
STREAM – CQL Operators Relation-to-relation
SQL constructs Stream-to-relation
Tuple-based sliding window: [Rows N], [Rows Unbounded]
Time-based sliding window: [Range ω], [Now] Partitioned sliding window: [Partition By A1,…Ak
Rows N] Relation-to-stream
Istream: insert stream Dstream: delete stream Rstream: relation stream
STREAM – Example Query 1
Two example streams:Orders (orderID, customer, cost)Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”:Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk =
“Sue” And O.customer = “Joe”
STREAM – Example Query 2
Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By
clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk
STREAM – Simplified Query 2
Result is a relation, updated as stream elements arrive:Select F.clerk, Max(O.cost)From O, F [Rows 100]Where O.orderID = F.orderIDGroup By F.clerk
STREAM – Simplified Query 2
Result is streamed: Emits <clerk, max> stream element whenever max changes for a clerk (or new clerk):Select Istream(F.clerk, Max(O.cost))From O, F [Rows 100]Where O.orderID = F.orderIDGroup By F.clerk
STREAM – Example Query 3 Relation: CurPrice(stock, price) Average price over last day for
each stock: Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day]Group By stock
Istream provides history of CurPrice Window on history (back to relation),
group and aggregate
STREAM – Query plans and Execution When a continuous query is registered, generate a
query plan New plan merged with existing plans Users can also create & manipulate plans directly
Plans composed of three main components: Operators
Flag: insertion(+), deletion (-) Elements: tuple-timestamp-flag tuples Streams: only + elements Relations: both + and - elements
Queues Enforce nondecreasing timestamps (“heartbeats”) Mechanisms for buffering tuples
States (Synopses) Global scheduler for plan execution
STREAM – States
States (Synopses) Summarize elements seen so far
(exact or approximate) for operators requiring history
To implement windows Example: synopsis join
Sliding-window join Approximation of full join
State1 State2⋈
STREAM – Simple Query Plan
Select * From S1 [Rows 1000], S2 [Range 2 Minutes]Where S1.A = S2.A
And S1.A > 10
STREAM – Performance Issues
Synopsis Sharing Eliminate data redundancy
Exploiting Constraints Selectively discard data to reduce
state Operator Scheduling
Reduce queue sizes
STREAM – Synopsis Sharing
Eliminate redundancy by replacing the nearly identical synopses with
light weight stubs a single store to hold the actual tuples
Store tracks the progress of each stub, presents the appropriate view to each stub.
The store contains the union of its corresponding stubs
STREAM – Synopsis Sharing
Select * From S1 [Rows 1000], S2 [Range 2
Minutes]Where S1.A = S2.A
And S1.A > 10
Select A, Max(B) From S1 [Rows 200]Group By A
STREAM – Exploiting Constraints
Specify an adherence parameter k to capture how closely a given stream or sets of streams adheres to a constraint of that type Referential integrity k-constraint Ordered-arrival k-constraint Clustered-arrival k-constraint
Query execution plans reduce or eliminate sate based on k-constraints
If constraint violated, get approximate result
STREAM – Operator Scheduling Goal:Goal: minimize total queue size for unpredictable,
bursty stream arrival patterns Chain Scheduling Algorithm:Chain Scheduling Algorithm:
1. Mark the first operator in the plan as the “current” operator
2. Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time.
3. Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains.
4. Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order.
Proven:Proven: within constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries
Empirical results:Empirical results: large savings over naive strategies for many queries
But minimizing queue sizes is at odds with minimizing latency
STREAM – Approximation CPU-Limited Approximation
Insufficient CPU time to process each stream element due to the high data arrival rate.
load-shedding sampling operators Approximate by probabilistically dropping
elements before they are processed Memory-Limited Approximation
The total state required for all registered queries exceeds available memory.
The system selectively shrinks or discards synopses.
STREAM – Query Interface View the structure of query plans the
their component entities. View the detailed properties of each
entity. Dynamically adjust entity properties. View monitoring graphs that display
time-varying entity properties plotted dynamically against time. Queue sizes, throughput, overall memory
usage, and join selectivity.
STREAM – Query Plan Monitoring
STREAM – Current Status Version 1.0 up and running Includes a new monitoring and adaptive query
processing infrastructure – StreaMon Executor runs query plans to produce results. Profiler collects and maintains statistics about stream and
plan characteristics. Reoptimizer ensures that the plans and memory structures
are the most efficient for current characteristics. Web demo available at http://shark.stanford.edu:8080/ Future Directions:
Distributed Stream Processing Crash Recovery Improved Approximation Classification of Applications
Conclusion Ideal DSMS
Well defined and flexible query language User-friendly interface Scalable
Operator scheduling Storage management Synopsis sharing Approximation
Quality assurance Fault tolerant
References R. Motwani et al., “Query Processing, Approximation, and
Resource Management in a Data Stream Management System”, in proceedings of the 1st CIDR Conference, 2003.
S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in proceedings of SIGMOD Conference, 2002
D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of VLDB conference, 2002.
D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of VLDB conference, 2003
Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html
Aurora Project Website: http://www.cs.brown.edu/research/aurora
End
top related