SQLstream Structure 2012: Back to the Future - Dataflow Comes of Age
Post on 14-Nov-2014
Copyright © 2012 – Proprietary and Confidential Information of SQLstream Inc.
Back to the Future: Dataflow Finally Comes of Age
Damian Black, CEO, SQLstream
Real-time Big Data with
Relational Streaming Dataflow Technology
Brief History of Dataflow
What is Dataflow?
» Parallel processing model invented in the 70s
» Graph-based execution, without destructive updates
» Data flows along arcs to nodes, is combined, and flows along output arcs
What happened to Dataflow?
» A number of experimental parallel computers were designed and built
» Transputer and Occam were literally decades ahead of their time
» Due for a resurgence thanks to inexpensive multi-core servers & SQL
What is Relational Streaming?
» A dataflow paradigm for processing Streaming Big Data tuples
» Familiar relational expressions with automatic optimization
» Relational queries executed continuously on a massively parallel scale
Dataflow Graph: Pipelined and Superscalar Processing
Relational Streaming: DAGs of fine-grained dataflow.
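To make the dataflow model concrete, here is a minimal Python sketch of a dataflow graph. It is not SQLstream's engine; the graph, the node names (`sum`, `diff`, `product`) and the `run_dataflow` helper are all hypothetical. A node fires once every input arc carries a value, each value is written exactly once (no destructive updates), and nodes that become ready in the same wave are independent, so a real engine could run them in parallel.

```python
import operator

# Hypothetical graph: node name -> (function, names of its input arcs).
GRAPH = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
    "product": (operator.mul, ["sum", "diff"]),
}


def run_dataflow(graph, inputs):
    """Fire nodes in waves; return all arc values and the firing order."""
    values = dict(inputs)      # each arc value is assigned once, never overwritten
    pending = dict(graph)
    waves = []
    while pending:
        # A node is ready when every one of its input arcs already has a value.
        ready = [name for name, (_, deps) in pending.items()
                 if all(dep in values for dep in deps)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        for name in ready:     # nodes in one wave do not depend on each other
            func, deps = pending.pop(name)
            values[name] = func(*(values[dep] for dep in deps))
        waves.append(sorted(ready))
    return values, waves


values, waves = run_dataflow(GRAPH, {"a": 5, "b": 3})
print(values["product"])  # 16
print(waves)              # [['diff', 'sum'], ['product']]
```

Here `sum` and `diff` fire together in the first wave, and `product` fires only after both of its inputs have arrived.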
Comparison of Techniques for Dataflow Scaling
                       Hadoop and HDFS                  Relational Streaming
Data Distribution      Fat File                         Fat Stream
Dataflow Enablement    Generate new tuples from old,    Generate new tuples from old,
                       leaving old tuples unaltered     leaving old tuples unaltered
Dataflow: Hadoop versus Relational Streaming
Hadoop style: data chunking coarse-grained dataflow.
Relational Streaming: DAGs of fine-grained dataflow.
» Hadoop Map Reduce Process
» Relational Streaming Approach:
» Continuous Parallel Dataflow Execution
» Real-time Answers Immediately
» Intelligently populate data store: Hadoop or Data Warehouse
[Figure: Parallel Dataflow Execution pipeline with the stages Collect, Clean, Aggregate, Analyze, Deliver; labeled Low Latency]
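The five stages named on this slide can be sketched as a chain of Python generators, so each record moves through every stage as soon as it arrives instead of waiting for a batch. This is an illustrative sketch only: the record format (`"<url> <status>"`), the error rule (status 500 or above) and the alert threshold are assumptions, not details from the talk.

```python
from collections import defaultdict


def collect(source):
    """Collect: pull raw records from a source as they arrive."""
    for raw in source:
        yield raw


def clean(records):
    """Clean: drop malformed records and parse the rest into new tuples."""
    for raw in records:
        parts = raw.split()
        if len(parts) == 2 and parts[1].isdigit():
            yield (parts[0], int(parts[1]))


def aggregate(records):
    """Aggregate: keep a running error count per URL (assumed: status >= 500)."""
    errors = defaultdict(int)
    for url, status in records:
        if status >= 500:
            errors[url] += 1
            yield (url, errors[url])


def analyze(counts, threshold=2):
    """Analyze: flag a URL when its error count reaches the threshold."""
    for url, count in counts:
        if count == threshold:
            yield f"ALERT {url}: {count} errors"


def deliver(alerts, sink):
    """Deliver: hand each alert to a sink (here, a plain list)."""
    for alert in alerts:
        sink.append(alert)


raw_feed = ["/home 200", "/api 500", "garbage", "/api 503", "/home 500"]
delivered = []
deliver(analyze(aggregate(clean(collect(raw_feed)))), delivered)
print(delivered)  # ['ALERT /api: 2 errors']
```

Each stage yields new tuples and leaves its input untouched, which is what lets the stages be chained or parallelized freely.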
Relational Streaming synergies with Hadoop
» Relational Stream Processors co-located with Hadoop Servers
» Stream and re-stream into and out of local data stores in parallel
» Combination performs Real-time and Historical processing:
  » Querying the future – Continuous ETL and Analytics (parallel pipelines)
  » Querying the past – Hadoop batch jobs on stored tuples (parallel batches)
[Figure: Hadoop & Relational Streaming Server nodes, each labeled Group, Agg, Join, Project, Select; Reduce, Combine, Map, Split; Sort; Order]
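The split between querying the future and querying the past can be illustrated with a small Python sketch. It assumes a plain list as a stand-in for the local data store and uses hypothetical function names; it is not Hadoop or SQLstream code. The continuous query answers as each tuple arrives and also persists the tuple, while the batch job later scans everything stored.

```python
from collections import Counter


def continuous_counts(stream, store):
    """Querying the future: emit a running total per key as each tuple arrives."""
    running = Counter()
    for key, value in stream:
        store.append((key, value))   # stand-in for a write to a local data store
        running[key] += value
        yield key, running[key]      # answer is available immediately


def batch_totals(store):
    """Querying the past: batch job over all tuples stored so far."""
    totals = Counter()
    for key, value in store:
        totals[key] += value
    return dict(totals)


store = []
live = list(continuous_counts([("a", 1), ("b", 2), ("a", 3)], store))
history = batch_totals(store)
print(live)     # [('a', 1), ('b', 2), ('a', 4)]
print(history)  # {'a': 4, 'b': 2}
```

Both views are computed from the same tuples; the difference is when the answer becomes available.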
Application Example – Google: “Youtube Mozilla Glow”
» Mozilla Firefox 4 – Real-time Download Monitor
» Continuous processing of download requests
» Real-time integration with Hadoop and HBase
Cloud Monitoring – Detecting Service Error Spikes
» Millions of records per second
» Real-time Bollinger Bands
» Amazon EC2

-- Emit rows whose error count exceeds the 1-minute average for that url
-- by more than two standard deviations.
SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
FROM (
    SELECT STREAM ROWTIME, url, "numErrorsLastMinute",
        AVG("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS "avgErrorsPerMinute",
        STDDEV("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS "stdDevErrorsPerMinute"
    FROM "ServiceRequestsPerMinute"
) AS S
WHERE S."numErrorsLastMinute" > S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";
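For readers who do not know streaming SQL, the logic of the spike-detection query on this slide can be mirrored in a few lines of Python. This is a sketch under stated assumptions, not the engine's behavior: rows are `(rowtime_seconds, url, errors)` tuples arriving in rowtime order, the one-minute window includes the current row, and the population standard deviation is used (the slide does not say which variant `STDDEV` computes).

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW_SECONDS = 60  # mirrors RANGE INTERVAL '1' MINUTE PRECEDING


def detect_spikes(rows):
    """Yield rows whose error count exceeds mean + 2 * stddev for their url."""
    windows = defaultdict(deque)  # one sliding window per url (PARTITION BY url)
    for rowtime, url, errors in rows:
        window = windows[url]
        window.append((rowtime, errors))
        # Evict rows older than one minute before the current row.
        while window[0][0] < rowtime - WINDOW_SECONDS:
            window.popleft()
        values = [e for _, e in window]
        if errors > mean(values) + 2 * pstdev(values):
            yield rowtime, url, errors


rows = [
    (0, "/a", 1), (10, "/a", 1), (20, "/a", 1), (30, "/a", 1), (40, "/a", 1),
    (50, "/b", 1),
    (50, "/a", 10),  # sudden jump in errors for /a
]
spikes = list(detect_spikes(rows))
print(spikes)  # [(50, '/a', 10)]
```

Because the current row is part of its own window, a jump is only flagged once enough quieter rows are in the window to keep the band narrow; in this sketch the window holds six rows when the spike arrives.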
[Figure: cluster of nodes, each labeled "stream Server"]
A New Streaming Data Management Quadrant

                                 Historical analysis,    Continuous analysis,
                                 periodic batches        real-time processing
High-level declarative
language & operation             Data Warehouses         Relational Streaming
Low-level procedural
language & operation             Hadoop Big Data         Messaging Middleware

Other labels in the figure: Real-time Big Data; Batched Big Data
Benefits of Real-time “Big Dataflow” with Relational Streaming
1. Real-time Integration: continuous, real-time data integration
2. Real-time Analysis: process, analyze, and react, all in real time
3. Real-time Parallel Processing: made easy, auto-optimized, massive scale
Dataflow finally comes of age. Relational Streaming: The Next Wave of Big Data.
Query the Future®. The Future of Query.
Confidential and Trade Secret SQLstream Inc. © 2012
Thanks! Any questions?