chapter 10: stream-based data management title: retrospective on aurora authors: hari balakrishnan,...

Chapter 10: Stream-based Data Management

• Title: Retrospective on Aurora
• Authors: Hari Balakrishnan et al.

Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core

• Problem
  – Problem statement
  – Why is this problem important?
  – Why is this problem hard?

• Approaches
  – Approach description, key concepts
  – Contributions (novelty, improvements)
  – Assumptions

Problem Statement

• Given
  – Stream data
  – Experience from developing five stream-based applications using the Aurora stream processing engine

• Find
  – Key requirements of streaming applications

• Objectives
  – Reflect on the design of Aurora based on this experience
  – Eliminate the limitations and address new challenges in a follow-on project, Borealis

• Constraints
  – Data streams arrive in no particular order.
  – Data streams arrive without any temporal regularity.

Why is this problem important?

• Stream-processing applications
  – Financial services: stock ticker
  – Transportation: congestion pricing, dynamic tolls
  – Sensor networks: environment monitoring
  – Defense: battalion monitoring

Why is this problem hard?

• High update rate
• Time series
  – Streaming applications entail time series.
  – Time-series operations are not well supported by current DBMSs.

• Real-time constraints
  – Outbound processing, where data are stored before being processed, cannot deliver real-time latency.
  – SPEs must adopt inbound processing, where query processing is performed directly on incoming messages.

• Spikes in message load
  – Incoming traffic is bursty.
  – Quality of Service (QoS) requirements

Novel Contributions

• Comparison with SQL-centric related work
  – Data Flow Network (DFN) centric
  – Developer: composes the DFN using a graphical user interface
  – Optimizer: rearranges the DFN, e.g., swaps boxes
  – Compiler: translates the DFN to an intermediate representation
  – Run-time: schedules tasks based on QoS requirements

• Other contributions: lessons learnt
  – Identify characteristics of streaming applications
    • from 5 case studies
  – Identify core performance tuning ideas

Aurora Architecture

• Aurora is based on a dataflow-style ‘boxes and arrows’ paradigm, unlike other systems that expose a SQL-style query interface (issuing queries back and forth adds system overhead and latency).

• An Aurora network can be spread across any number of machines for scalability and availability.
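The ‘boxes and arrows’ idea can be illustrated with a minimal Python sketch. This is not Aurora's API: the helper names (filter_box, map_box, connect) and the tuple fields (symbol, price) are assumptions made only for illustration. Each box is a function from one stream to another, and an arrow is simply the act of feeding one box's output into the next.

```python
from typing import Callable, Dict, Iterable, Iterator

Tuple_ = Dict[str, object]                       # a stream tuple, modeled as a dict (assumption)
Box = Callable[[Iterable[Tuple_]], Iterator[Tuple_]]


def filter_box(predicate: Callable[[Tuple_], bool]) -> Box:
    """Box that forwards only the tuples satisfying the predicate."""
    def run(stream: Iterable[Tuple_]) -> Iterator[Tuple_]:
        return (t for t in stream if predicate(t))
    return run


def map_box(fn: Callable[[Tuple_], Tuple_]) -> Box:
    """Box that applies a user-defined function to every tuple."""
    def run(stream: Iterable[Tuple_]) -> Iterator[Tuple_]:
        return (fn(t) for t in stream)
    return run


def connect(*boxes: Box) -> Box:
    """Arrows: chain boxes so each one consumes the previous one's output."""
    def run(stream: Iterable[Tuple_]) -> Iterator[Tuple_]:
        for box in boxes:
            stream = box(stream)
        return iter(stream)
    return run


# Hypothetical network: keep expensive ticks, then convert the price to cents.
network = connect(
    filter_box(lambda t: t["price"] > 100),
    map_box(lambda t: {"symbol": t["symbol"], "price_cents": int(t["price"] * 100)}),
)

ticks = [{"symbol": "A", "price": 99.5}, {"symbol": "B", "price": 101.25}]
result = list(network(ticks))
```

Because the network is lazy, each tuple flows through the boxes as it arrives and is never stored first, which is the inbound-processing style the slides contrast with a store-then-query DBMS.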

Aurora GUI

[Figure: Aurora GUI with Input, Operator, and Output elements labeled]

Aurora Operators

Aurora Case Study 1: Financial Services

• The application detects feed problems and triggers a switch between feeds in real time.

• Hierarchical alarms
  – A low alarm is triggered when an update is delayed beyond a threshold (e.g., 5 sec).
  – A high alarm is triggered when low alarms accumulate beyond a threshold (e.g., 100 times).

• The boxes circled in red separate the alarms from both Reuters and Comstock into alarms from NYSE and alarms from NASDAQ.

Filter & Merging techniques

• This case study illustrates the ability to detect stream imperfections and extend functionality using user-defined Map functions.
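The hierarchical alarm described for this case study can be sketched as two chained generators. This is a simplified illustration, not the Aurora network from the paper: the function names are hypothetical, delays are detected only when the next update finally arrives, and resetting the counter after a high alarm is an assumption.

```python
from typing import Iterable, Iterator


def low_alarms(arrival_times: Iterable[float], max_gap: float = 5.0) -> Iterator[float]:
    """Yield the arrival time of each update that came more than max_gap seconds late."""
    previous = None
    for now in arrival_times:
        if previous is not None and now - previous > max_gap:
            yield now
        previous = now


def high_alarms(low_alarm_times: Iterable[float], max_low_alarms: int = 100) -> Iterator[float]:
    """Yield a high alarm once low alarms accumulate beyond max_low_alarms."""
    count = 0
    for t in low_alarm_times:
        count += 1
        if count > max_low_alarms:
            yield t
            count = 0  # assumption: start counting again after each high alarm


# Small thresholds so the behavior is easy to follow by hand.
arrivals = [0, 1, 8, 9, 20, 21, 30]
lows = list(low_alarms(arrivals, max_gap=5.0))
highs = list(high_alarms(low_alarms(arrivals, max_gap=5.0), max_low_alarms=2))
```

With these inputs the gaps of 7, 11, and 9 seconds raise low alarms, and the third low alarm exceeds the threshold of 2 and raises a high alarm.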

Aurora Case Study 2: Linear Road Benchmark

• Linear Road is a benchmark for stream processing engines.

• It simulates an urban highway system that uses ‘variable tolling’ (i.e., congestion-based pricing).

• A Linear Road implementation must support
  – Two continuous queries
    • Calculate a segment toll every time a vehicle enters the segment.
    • Detect and report accidents and adjust tolls accordingly.
  – Three historical queries
    • Request an account balance
    • Day’s total expenditure for a given vehicle
    • Prediction of travel time between two segments using historical data
  – Each of these queries must be answered with a specified accuracy and within a specified response time.
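The first continuous query (charging a toll whenever a vehicle enters a segment) can be sketched as follows. The toll function here is a made-up placeholder, not the formula defined by the benchmark, and the report layout (vehicle, segment) is also an assumption; accident detection is left out.

```python
from typing import Dict, Iterable, List, Tuple


def toll_for(vehicles_in_segment: int) -> int:
    """Hypothetical congestion-based toll (NOT the benchmark's formula):
    free for up to 2 vehicles, then 10 units per additional vehicle."""
    return max(0, vehicles_in_segment - 2) * 10


def charge_tolls(position_reports: Iterable[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (vehicle, segment, toll) for every segment entry in the report stream."""
    current_segment: Dict[str, int] = {}   # vehicle -> segment it is in now
    occupancy: Dict[int, int] = {}         # segment -> number of vehicles in it
    charges = []
    for vehicle, segment in position_reports:
        previous = current_segment.get(vehicle)
        if previous == segment:
            continue                       # still in the same segment: no new toll
        if previous is not None:
            occupancy[previous] -= 1       # vehicle left its old segment
        occupancy[segment] = occupancy.get(segment, 0) + 1
        current_segment[vehicle] = segment
        charges.append((vehicle, segment, toll_for(occupancy[segment])))
    return charges


reports = [("a", 1), ("b", 1), ("c", 1), ("a", 1), ("a", 2)]
charges = charge_tolls(reports)
```

Only the third vehicle to enter segment 1 pays, because it is the first to find the segment congested under the placeholder rule.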

Aurora Case Study 3: Battalion Monitoring

• Aircraft gather data and send them to monitoring stations.
• Enemy units crossing a given line signal an attack.
• The limited resource is the bandwidth between aircraft and ground. When an attack is initiated, selective dropping of data is allowed so that important classes of data are served.

• The authors used this application to test their load-shedding techniques.
  – Insert random drop boxes to discard a fraction of their input tuples.
  – Insert semantic, predicate-based drop filters.

• Observations
  – The semantic load-shedding techniques achieve the least value-utility loss.
  – As load increases, the two techniques show similar performance.
  – At high loads, all algorithms converge to the same loss levels.
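The two load-shedding techniques can be contrasted in a short sketch. The tuple fields (kind, utility) and the utility values are invented for illustration and do not come from the paper.

```python
import random
from typing import Callable, Dict, List

Tuple_ = Dict[str, object]


def random_drop(tuples: List[Tuple_], drop_fraction: float, rng: random.Random) -> List[Tuple_]:
    """Random drop box: discard roughly drop_fraction of the input, ignoring content."""
    return [t for t in tuples if rng.random() >= drop_fraction]


def semantic_drop(tuples: List[Tuple_], keep: Callable[[Tuple_], bool]) -> List[Tuple_]:
    """Semantic drop box: discard tuples that fail a predicate on their content."""
    return [t for t in tuples if keep(t)]


def total_utility(tuples: List[Tuple_]) -> int:
    return sum(t["utility"] for t in tuples)


# Hypothetical workload: enemy positions matter far more than friendly ones.
stream = [{"kind": "enemy", "utility": 10}] * 20 + [{"kind": "friendly", "utility": 1}] * 80

rng = random.Random(0)  # seeded so repeated runs drop the same tuples
kept_random = random_drop(stream, drop_fraction=0.8, rng=rng)
kept_semantic = semantic_drop(stream, keep=lambda t: t["kind"] == "enemy")
```

Both boxes shed about 80% of the tuples here, but the semantic filter keeps every high-utility tuple, while the random box keeps them only in proportion to their share of the stream.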

Aurora Case Study 4: Environmental Monitoring

• Monitoring toxins in water.
• Stream data are fish behavior (e.g., breathing rate) and water quality (e.g., temperature).
• When the fish behave abnormally, an alarm is sounded.
• The application uses 1-, 2-, and 4-hour sliding windows over the water data.
• Ease of developing stream applications
  – Aurora proved very convenient for sliding-window calculations.
  – Aurora’s GUI proved invaluable.
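A time-based sliding window of the kind used in this application can be sketched as below. The class name and the choice of a mean as the aggregate are assumptions, and the sketch assumes timestamps arrive in non-decreasing order, which real streams do not guarantee.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the values whose timestamps fall in (now - width, now]."""

    def __init__(self, width_seconds: float) -> None:
        self.width = width_seconds
        self.items = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Insert a reading, evict expired readings, and return the current mean."""
        self.items.append((timestamp, value))
        self.total += value
        while self.items[0][0] <= timestamp - self.width:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total / len(self.items)


# One window per horizon mentioned in the slides: 1, 2, and 4 hours.
windows = {hours: SlidingWindowMean(hours * 3600) for hours in (1, 2, 4)}

readings = [(0, 10.0), (1800, 20.0), (3600, 30.0)]  # (seconds, temperature)
means = [{h: w.add(ts, v) for h, w in windows.items()} for ts, v in readings]
```

After the third reading, the 1-hour window has evicted the reading at time 0, while the 2-hour and 4-hour windows still include it.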

Aurora Case Study 5: Medusa

• Medusa is a distributed stream-processing system that uses Aurora.
• It takes Aurora queries and distributes them across multiple nodes.
• It offers several benefits:
  – Incremental scalability over multiple nodes.
  – High availability through mutual monitoring between nodes.
  – Composition of stream feeds from different participants.
  – Handling of load spikes by the federated system.

Lessons Learnt: Application Characteristics

• Common queries
  – Historical data using an open window
    • Last 10 weeks’ worth of toll data for each driver
  – Aggregate: how much has a driver spent on tolls over the past 10 weeks?
  – Tables of historical data with arbitrary update patterns

• Synchronization
  – Stream applications rely on shared data and computation.
  – WaitFor (P: Predicate, T: Timeout)

• Unpredictable stream behavior
  – The financial services application detects the arrival rate of a stream.
  – The military application adjusts resources during times of stress.
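The slides give only the signature WaitFor (P: Predicate, T: Timeout). One plausible reading, sketched below, is a two-input operator that buffers each tuple from one stream until a tuple on the other stream satisfies P together with it, and discards it if T elapses first. The event layout (timestamp, side, tuple) and the 'left'/'right' labels are assumptions made for this illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Tuple_ = Dict[str, object]
Event = Tuple[float, str, Tuple_]   # (timestamp, "left" or "right", tuple)


def wait_for(events: Iterable[Event],
             predicate: Callable[[Tuple_, Tuple_], bool],
             timeout: float) -> Iterator[Tuple_]:
    """Release buffered 'left' tuples when a matching 'right' tuple arrives in time."""
    buffered: List[Tuple[float, Tuple_]] = []   # (arrival time, left tuple)
    for now, side, tup in events:               # events assumed in timestamp order
        # Discard buffered tuples that have waited longer than the timeout.
        buffered = [(t0, b) for t0, b in buffered if now - t0 <= timeout]
        if side == "left":
            buffered.append((now, tup))
            continue
        # A right-hand tuple releases every buffered tuple it matches.
        released = [b for _, b in buffered if predicate(b, tup)]
        buffered = [(t0, b) for t0, b in buffered if not predicate(b, tup)]
        yield from released


events = [
    (0, "left", {"id": 1}),
    (1, "left", {"id": 2}),
    (2, "right", {"id": 2}),    # arrives in time: releases id 2
    (20, "right", {"id": 1}),   # too late: id 1 has already timed out
]
released = list(wait_for(events, lambda a, b: a["id"] == b["id"], timeout=10))
```

Expiry is checked only when a new event arrives, so a quiet period leaves stale tuples in the buffer until the next event; a production operator would more likely use a timer.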

Lessons Learnt: Performance Tuning

• Requirements
  – Main-memory implementation
  – Data movement across DFN elements
  – Scheduling of DFN elements

• Performance decisions
  – Memory copying: memcpy() implementations
  – Scheduler
    • Reduce scheduler overheads by aggressive profiling
  – Tight loops
    • Keep unnecessary housekeeping out of tight loops
  – Data structures
    • Optimize data structures used to implement DFN elements

Future Plans: Borealis

• Dynamic revision of query results
  – Intelligently corrects query results that have already been emitted, using corrected data that arrive later.

• Dynamic query modification
  – E.g., traders wish to be alerted to interesting events, where the definition of ‘interesting’ varies.

• Distributed optimization
  – Server-heavy and sensor-heavy optimization problems are emerging.
  – More flexible optimization is needed to handle a very large number of devices.

• Implementation plans

Summary

• Paper’s focus
  – Identify the requirements of stream applications from the experience of designing and implementing the Aurora stream-processing engine

• Ideas
  – Describe five applications and their implementation in detail.
  – Reflect on the design of Aurora based on the experience.
  – Discuss future ideas for the follow-on project.

• Contributions
  – Identify key requirements of streaming applications

• Analytical validation
  – Case study

Assumptions, Rewrite today

• Assumptions
  – Archiving is not necessary!
  – Performance is more important than a declarative query language

• Rewrite today
  – Compare performance with the competition, e.g., STREAM
  – Allow archiving along with stream processing
  – Consider other applications
    • RFID, cell phone applications
  – Include the current status of the Borealis implementation.
