Designing Hybrid Data Processing Systems for Heterogeneous Servers
Peter Pietzuch
Large-Scale Distributed Systems (LSDS) Group
Imperial College London
http://lsds.doc.ic.ac.uk
<[email protected]>
University of Cambridge – Cambridge, United Kingdom – November 2017
Data is the New Oil
• Many new sources of data become available
– Most data is produced continuously
• Data powers a plethora of new and personalised services…
[Figure: data sources: mobile devices, scientific instruments, cameras, social feeds, IoT devices, Internet services and web sites, RFID tags, data repositories]
Peter Pietzuch - Imperial College London
Data-Intensive Systems
• Data analytics over web click streams
– How to maximise user experience with relevant content?
– How to analyse “click paths” to trace most common user routes?
[Figure: web analytics levels: Hits, Page Views, Visits, Unique Visitors, Uniquely Identified Visitors; axis: Volume of Available Data]
• Machine learning models for online prediction
– E.g. serving adverts on search engines
• Solution: AdPredictor
– Bayesian learning algorithm ranks adverts according to click probabilities
[Figure: online learning loop over features f1 … fn with label y ∈ {−1, 1}; steps: predict, update]
Throughput and Result Freshness Matter
• Facebook Insights: aggregates 9 GB/s, < 10 sec latency
• Feedzai: 40K credit card transactions/s, < 25 ms latency
• Google Zeitgeist: 40K user queries/s, < 1 ms latency
• NovaSparks: 150M trade options/s, < 1 ms latency
[Figure: a data-intensive system must combine high-throughput processing with low-latency results]
Design Space for Data-Intensive Systems
• Tension between performance and algorithmic complexity
[Figure: design space with axes Result latency (10s, 1s, 100ms, 10ms, 1ms) and Data amount (MBs, GBs, TBs); regions: easy for most algorithms, hard for machine learning algorithms, hard for all algorithms]
Algorithmic Complexity Increases
[Figure: systems ordered by increasing algorithmic complexity: Publish/Subscribe (topic-based filtering, content-based filtering), Complex Event Processing (CEP) (complex pattern matching), Stream processing (stream queries), Online machine learning and data mining. Example inputs: event types T1(a, b, c), T2(c, d, e), T3(g, i, h) and tuples (highway, segment, direction, speed). Processing steps named: Pre-process, Parallelize, Aggregate, Share state, Iterate]
Scale Out Model in Data Centres
Task Parallelism vs. Data Parallelism
[Figure: input data flows to servers in data centre, which produce results; each server is labelled with one of the two queries below]
Task parallelism: Multiple data processing jobs
Data parallelism: Single data processing job

select highway, segment, direction, AVG(speed) as avg
from Vehicles[range 5 seconds slide 1 second]
group by highway, segment, direction
having avg < 40

select distinct W.cid
from Payments [range 300 seconds] as W,
     Payments [partition-by 1 row] as L
where W.cid = L.cid and W.region != L.region
Distributed Dataflow Systems
• Idea: Execute data-parallel tasks on cluster nodes
• Tasks organised as dataflow graph
• Almost all big data systems do this:
– Apache Hadoop, Apache Spark, Apache Storm, Apache Flink, Google TensorFlow, ...
[Figure: dataflow graph with operators at parallelism degree 3 and parallelism degree 2]
Nobody Ever Got Fired For Using Hadoop/Spark
• 2012 study of MapReduce workloads
– Microsoft: median job size < 14 GB
– Yahoo: median job size < 12.5 GB
– Facebook: 90% of jobs < 100 GB
• Many data-intensive jobs easily fit into memory
• One server cheaper/more efficient than compute cluster
(A. Rowstron, D. Narayanan, A. Donnelly, G. O’Shea, A. Douglas, HotCDP’12)
Parallelism of Heterogeneous Servers
Servers have many parallel CPU cores
Heterogeneous servers with GPUs common
New types of compute accelerators: Xeon Phi, Google's TPUs, FPGAs, ...
[Figure: heterogeneous server with two CPU sockets (Socket 1 and Socket 2), each with cores C1 to C8, an L3 cache and DRAM (10s of CPU cores); a GPU with SMX1 ... SMXN, an L2 cache and DRAM (1000s of GPU cores); labels on the connection between them: command queue, PCIe bus, DMA]
Slide courtesy of Torsten Hoefler (Systems Group, ETH Zürich)
Servers Are Becoming Increasingly Heterogeneous
⇒ How can Data-Intensive Systems Exploit Heterogeneous Hardware?
Roadmap
• SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD’16]
• (1) How to parallelise computation on modern hardware?
• (2) How to utilise heterogeneous servers?
• (3) Experimental performance results
Analytics with Window-based Stream Queries
• Real-time analytics over data streams
• Windows define finite data amount for processing
[Figure: stream of (highway, segment, direction, speed) tuples with a window ending at "now"]
• Time-based window with size τ at current time t: [t - τ : t], e.g. Vehicles[Range τ seconds]
• Count-based window with size n: last n tuples, e.g. Vehicles[Rows n]
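The two window types can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function names `time_window` and `count_window` and the `(timestamp, speed)` tuple layout are assumptions made for the example, not part of any system shown in the slides.

```python
from collections import deque


def time_window(stream, t, tau):
    """Time-based window: tuples whose timestamp lies in [t - tau, t]."""
    return [(ts, value) for ts, value in stream if t - tau <= ts <= t]


def count_window(stream, n):
    """Count-based window: the last n tuples of the stream."""
    return list(deque(stream, maxlen=n))


# Hypothetical stream of (timestamp in seconds, speed) tuples.
stream = [(1, 50), (2, 42), (4, 38), (7, 61), (8, 35)]

print(time_window(stream, t=8, tau=5))   # tuples with timestamps 3..8
print(count_window(stream, n=2))         # the two most recent tuples
```

A time-based window can hold any number of tuples depending on the arrival rate, whereas a count-based window always holds at most n.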
Defining Stream Query Semantics
• Windows convert data streams to dynamic relations (database table)
[Figure: a window specification converts Streams to Relations; the stream operators Istream, Dstream and Rstream convert Relations back to Streams; any relational query (select, project, join, group by, etc.) maps Relations to Relations]
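As a rough illustration of the three stream operators, the sketch below treats a relation as a Python set and compares its contents at two consecutive instants. The set-based model is a simplification assumed for this example (stream query languages usually define relations as bags), and the function names are only illustrative.

```python
def istream(prev, curr):
    """Insert stream: tuples in the relation now that were absent before."""
    return curr - prev


def dstream(prev, curr):
    """Delete stream: tuples that were in the relation before but are gone now."""
    return prev - curr


def rstream(curr):
    """Relation stream: every tuple currently in the relation."""
    return set(curr)


# Hypothetical relation contents at two consecutive instants.
r_prev = {("A1", 3), ("A1", 4)}
r_curr = {("A1", 4), ("A1", 5)}

print(istream(r_prev, r_curr))
print(dstream(r_prev, r_curr))
print(rstream(r_curr))
```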
SQL Stream Queries
• SQL provides well-defined declarative semantics for queries
– Based on relational algebra (select, project, join, …)
• Example: Identify slow moving traffic on highway
– Input stream: Vehicles(highway, segment, direction, speed)
– Find highway segments with average speed below 40 km/h

select highway, segment, direction, AVG(speed) as avg
from Vehicles[range 5 sec slide 1 sec]
group by highway, segment, direction
having avg < 40

[Figure: the query annotated with its input data, operators and output]
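To make the query semantics concrete, here is a minimal Python sketch that evaluates the slow-traffic query for a single window position. The tuple layout `(timestamp, highway, segment, direction, speed)`, the function name `slow_segments` and the sample data are assumptions for illustration; a real engine would re-evaluate the query every slide (1 second) rather than on demand.

```python
from collections import defaultdict


def slow_segments(vehicles, now, range_sec=5, threshold=40):
    """Evaluate the query over the window [now - range_sec, now]."""
    groups = defaultdict(list)
    for ts, highway, segment, direction, speed in vehicles:
        if now - range_sec <= ts <= now:                        # window
            groups[(highway, segment, direction)].append(speed)  # group by
    averages = {key: sum(v) / len(v) for key, v in groups.items()}
    return {key: avg for key, avg in averages.items() if avg < threshold}  # having


# Hypothetical input tuples.
vehicles = [
    (1, "A1", 3, "N", 35.0),
    (2, "A1", 3, "N", 41.0),
    (3, "A1", 4, "N", 80.0),
    (5, "A1", 3, "N", 38.0),
    (9, "A1", 4, "N", 30.0),
]

print(slow_segments(vehicles, now=6))
print(slow_segments(vehicles, now=10))
```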
(1) How to Parallelise Computation?
• Perform query evaluation across sliding windows in parallel
– Exploit data parallelism across stream
[Figure: stream tuples 1 to 6 covered by overlapping windows w1 to w4; size: 4 sec, slide: 1 sec]
How to use GPUs with Stream Queries?
• Naive strategy parallelises computation along window boundaries
[Figure: tuples 1 to 6 (size: 4 sec, slide: 1 sec); Task T1 and Task T2 each process a window, then partial results are combined]
⇒ Window-based parallelism results in redundant computation
How to use GPUs with Stream Queries?
• Parallel processing of non-overlapping window data?
[Figure: windows w1 to w4 over tuples 1 to 6 (size: 4 sec, slide: 1 sec); tasks T1 to T5 process non-overlapping window data, then partial results are combined]
⇒ Slide-based parallelism limits degree of parallelism
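The difference between the two strategies can be sketched by counting how often each input tuple is read. The code below is an illustrative assumption rather than any system's implementation: it computes a windowed SUM over a list of values, once with one task per window and once with one task per slide-sized chunk, and assumes the window size is a multiple of the slide.

```python
def window_based(values, size, slide):
    """One task per window: overlapping tuples are re-read by every task."""
    results, reads = [], 0
    for start in range(0, len(values) - size + 1, slide):
        window = values[start:start + size]
        reads += len(window)
        results.append(sum(window))
    return results, reads


def slide_based(values, size, slide):
    """One task per slide-sized chunk; a window combines size // slide chunks."""
    assert size % slide == 0, "sketch assumes size is a multiple of slide"
    chunks = [sum(values[i:i + slide])
              for i in range(0, len(values) - slide + 1, slide)]
    reads = slide * len(chunks)
    per_window = size // slide
    results = [sum(chunks[i:i + per_window])
               for i in range(len(chunks) - per_window + 1)]
    return results, reads


values = [1, 2, 3, 4, 5, 6]   # one tuple per second, as in the figure
print(window_based(values, size=4, slide=1))
print(slide_based(values, size=4, slide=1))
```

Both strategies produce the same window results, but the window-based version reads twice as many tuples here, while the slide-based version reads each tuple once at the cost of tasks that are only one slide long.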
Apache Spark: Small Slides → Low Throughput
• Spark relates window slide to micro-batch size used for parallelisation
[Figure: throughput (10^6 tuples/s, 0 to 2) against window slide (10^6 tuples, 0 to 9) for the query below]
select AVG(S.1) from S [rows 1024 slide x]
⇒ Avoid coupling system parameters with query definition
SABER: Parallel Window Processing
• Idea: Parallelise using task size that is best for hardware
• Task contains one or more window fragments
[Figure: stream of tuples 1 to 15 split into tasks T1, T2 and T3 with 5 tuples/task; windows w1 to w5 (size: 7 tuples, slide: 2 tuples) cross task boundaries, so each task holds fragments of several windows]
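The figure's parameters (size 7 tuples, slide 2 tuples, 5 tuples/task) can be used to work out which window fragments end up in which task. The helper below is a hypothetical sketch of that arithmetic, numbering tuples, windows and tasks from 1 as on the slide; it is not SABER's actual dispatching code.

```python
def window_fragments(task_index, task_size, window_size, slide):
    """Map window number -> (first, last) tuple of its fragment in one task.

    Window k covers tuples (k-1)*slide + 1 .. (k-1)*slide + window_size.
    """
    task_first = (task_index - 1) * task_size + 1
    task_last = task_index * task_size
    fragments = {}
    k = 1
    while (k - 1) * slide + 1 <= task_last:    # window k starts inside or before the task
        w_first = (k - 1) * slide + 1
        w_last = w_first + window_size - 1
        lo, hi = max(w_first, task_first), min(w_last, task_last)
        if lo <= hi:                           # window k overlaps the task
            fragments[k] = (lo, hi)
        k += 1
    return fragments


for task in (1, 2):
    print(task, window_fragments(task, task_size=5, window_size=7, slide=2))
```

Under these assumptions T1 holds fragments of w1, w2 and w3, and T2 holds fragments of w1 to w5, which agrees with the window labels shown next to the workers on the following slide.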
SABER: Window Fragment Processing
• Process window fragments in parallel
• Reassemble partial results to obtain overall result
[Figure: Worker A processes task T1 (fragments of w1, w2, w3) and Worker B processes task T2 (fragments of w1 to w5); the result stage places the w1 result and w2 result into slots of the output result circular buffer, with the remaining slots shown as empty]
Partial result reassembly must also be done in parallel
API for Operator Implementation
• Fragment function ff
– Processes window fragments
• Assembly function fa
– Merges partial window results
• Batch function fb
– Composes fragment functions within task
– Allows incremental processing
[Figure: tasks T1 and T2 (5 tuples/task) over tuples 1 to 10 with size: 7 tuples, slide: 2 tuples; ff is applied to the fragments of w1 and w2 in each task, fb groups the ff calls within a task, and fa merges the partial results into the w1 results and w2 results that form the output]
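For an AVG aggregate, the three functions could look roughly like the sketch below. The signatures, the `(sum, count)` partial-result format and the prefix-sum trick inside `fb` are assumptions chosen for illustration; the slides name the functions but do not show their actual interfaces.

```python
def ff(fragment):
    """Fragment function: partial aggregate (sum, count) of one window fragment."""
    return (sum(fragment), len(fragment))


def fa(partials):
    """Assembly function: merge partial aggregates into the window's AVG."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def fb(task, spans):
    """Batch function: produce the ff result for every fragment in a task.

    `spans` maps window number -> (start, end) offsets into the task, end
    exclusive. A prefix sum is computed once so overlapping fragments share work.
    """
    prefix = [0]
    for value in task:
        prefix.append(prefix[-1] + value)
    return {w: (prefix[end] - prefix[start], end - start)
            for w, (start, end) in spans.items()}


# Tuples 1..10 split into two tasks of 5; w1 = tuples 1..7, w2 = tuples 3..9.
values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
t1, t2 = values[:5], values[5:]
p1 = fb(t1, {1: (0, 5), 2: (2, 5)})
p2 = fb(t2, {1: (0, 2), 2: (0, 4)})

print(fa([p1[1], p2[1]]))   # AVG over w1
print(fa([p1[2], p2[2]]))   # AVG over w2
```

Because each task produces only partial results, the two tasks can run on different workers and the assembly step runs once both partials for a window are available.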
SABER: Performance of Window-based Queries
[Figure: throughput (GB/s, 0 to 8) and latency (sec, 0 to 0.2) against window slide (64 to 16384 tuples) for the query below]
select AVG(S.1) from S [rows 1024 slide x]
⇒ Performance of window-based queries remains predictable
How to Pick the Task Size?
• Problem: Small data transfers over PCIe bus costly
– Example: select * from S where p1 [rows 1 slide 1]
[Figure: throughput (GB/s, 0 to 8) against task size (0.0625 KB to 4096 KB) for CPU-only processing and GPU-only processing; annotations: limited by dispatcher and thread contention; limited by data movement]
Roadmap
• SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD’16]
• (1) How to parallelise computation on modern hardware?
⇒ Avoid coupling system parameters with processing semantics
• (2) How to utilise heterogeneous servers?
(2) How to Utilise Heterogeneous Servers?
• Hard to decide acceleration potential of heterogeneous processors
– Depends on operator semantics, window definition, data distribution, …
[Figure: throughput (GB/s) of CPU execution and GPU execution for Aggregation, Group-by and θ-join queries; annotations: GPU is faster; CPU is faster]
⇒ Don’t leave decision about heterogeneous processors to users
SABER: Hybrid Execution Model
• Idea: Execute tasks on all heterogeneous processors (CPUs, GPUs, ...)
• Fully utilise all hardware parallelism available in dedicated servers
[Figure: task queue holding tasks T1 to T10 of queries QA and QB, served by a CPU worker and a GPU worker; timeline (ticks 0, 3, 6, 9, 12) showing which tasks run on the CPU and which on the GPU]
Static Task Scheduling using Cost Model?
• Profile tasks to obtain cost model
• Assign tasks to processor with shortest execution time

      CPU    GPU
QA    3 ms   2 ms
QB    3 ms   1 ms

[Figure: task queue T1 to T10 (queries QA and QB) served by CPU workers and a GPU worker; static assignment: QA on CPU and QB on GPU; timeline (ticks 0, 3, 6, 9, 12) of the tasks on each processor]
⇒ Static scheduling under-utilises processors
First-Come First-Serve Task Scheduling?
• Assign tasks to processors first-come, first-served
– CPU/GPU execute both QA and QB tasks

      CPU    GPU
QA    3 ms   2 ms
QB    3 ms   1 ms

[Figure: task queue T1 to T10 (queries QA and QB) served by CPU workers and a GPU worker; first-come first-served timeline (ticks 0, 3, 6, 9, 12) of the tasks on each processor]
⇒ FCFS ignores effectiveness of processors for given task
Heterogeneous Lookahead Scheduling (HLS)
• Idea: Scheduler assigns tasks to idle processors dynamically
– Skips tasks that could be executed faster by another processor

      CPU    GPU
QA    3 ms   2 ms
QB    3 ms   1 ms

[Figure: task queue T1 to T10 (queries QA and QB) served by CPU workers and a GPU worker; HLS timeline (ticks 0, 3, 6, 9, 12) of the tasks on each processor]
Steps taken for the CPU worker in the figure:
– T1 is slower on CPU: skip
– T2 is slower on CPU: skip
– T3 is slower on CPU, but the GPU already has 3 ms of work: select T3
– Skip T4, T5 and T6 and select T7
– Finally skip T8 and T9 and select T10
⇒ HLS achieves aggregate throughput of all heterogeneous processors
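The lookahead rule can be sketched as a small simulation using the cost table from the slide. Several things here are assumptions: the task-to-query mapping is reconstructed from the flattened queue labels and may not match the original figure exactly, there is one worker per processor, ties go to the CPU, and costs are fixed rather than measured at runtime. It is an illustration of the skipping rule, not SABER's scheduler.

```python
# Execution time per task in ms, taken from the slide's cost table.
COST = {"QA": {"CPU": 3, "GPU": 2}, "QB": {"CPU": 3, "GPU": 1}}

# Task-to-query mapping as read from the flattened queue labels (an assumption).
TASKS = [("T1", "QA"), ("T2", "QB"), ("T3", "QB"), ("T4", "QB"), ("T5", "QB"),
         ("T6", "QB"), ("T7", "QA"), ("T8", "QB"), ("T9", "QA"), ("T10", "QA")]


def preferred(query):
    """Processor with the shortest execution time for this query."""
    return min(COST[query], key=COST[query].get)


def hls_select(queue, proc):
    """Index of the task an idle processor should take, or None.

    Walks the queue from the head. A task is taken if `proc` is its preferred
    processor, or if the tasks skipped so far already give the preferred
    processor at least as much work as `proc` needs for this task.
    """
    backlog = {"CPU": 0, "GPU": 0}
    for i, (_, query) in enumerate(queue):
        best = preferred(query)
        if best == proc or backlog[best] >= COST[query][proc]:
            return i
        backlog[best] += COST[query][best]   # skipped: leave it to `best`
    return None


def simulate(tasks):
    """Run the rule with one CPU and one GPU worker; return schedule and makespan."""
    queue = list(tasks)
    free_at = {"CPU": 0, "GPU": 0}
    schedule = {"CPU": [], "GPU": []}
    while queue:
        proc = min(free_at, key=free_at.get)   # next idle processor, CPU on ties
        i = hls_select(queue, proc)
        if i is None:                          # nothing worth taking: other worker goes
            proc = "GPU" if proc == "CPU" else "CPU"
            i = hls_select(queue, proc)
        name, query = queue.pop(i)
        schedule[proc].append(name)
        free_at[proc] += COST[query][proc]
    return schedule, max(free_at.values())


schedule, makespan = simulate(TASKS)
print(schedule, makespan)
```

With these assumptions the CPU ends up running T3, T7 and T10, the same tasks the slide's walkthrough selects, while the GPU runs the rest.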
SABER Hybrid Stream Processing Engine
[Figure: SABER pipeline with three stages: dispatching stage (dispatch fixed-size tasks), scheduling & execution stage (dequeue tasks based on HLS; operators run on CPU and GPU), result stage (merge & forward partial window results)]
Implementation: Java (15K LOC), C & OpenCL (4K LOC)
Roadmap
• SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD’16]
• (1) How to parallelise computation on modern hardware?
⇒ Avoid coupling system parameters with processing semantics
• (2) How to utilise heterogeneous servers?
⇒ Hybrid execution utilises all heterogeneous processors
• (3) Experimental performance results
Experimental Evaluation
Testbed:
– Ubuntu Linux 14.04, NVIDIA driver 346.47
– CPU: Intel Xeon 2.6 GHz, 16 cores, 64 GB RAM
– GPU: NVIDIA Quadro K5200, 2,304 cores, 8 GB RAM
– PCIe 3.0 x16
– 10 Gbps NIC
Datasets:
– Google Cluster Data: 144M job events from Google infrastructure
– Smart Grid Measurements: 974M plug measurements from houses
– Linear Road Benchmark: 11M car positions and speed on highway
What is SABER’s Performance?
[Figure: throughput (10^6 tuples/s, 0 to 50) split into SABER (CPU contrib.) and SABER (GPU contrib.) for queries CM2 (Cluster Mgmt.), SG1 and SG2 (Smart Grid), LRB3 and LRB4 (LRB); operator labels include select, aggr avg, group-by avg and group-by cnt; hardware: Intel Xeon 2.6 GHz with 16 cores, NVIDIA Quadro K5200 with 2,304 cores]
⇒ SABER exploits both CPUs and GPUs effectively for different queries
Is Hybrid Throughput Additive?
[Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only) and SABER for Aggregation, Group-by and θ-join; annotations: GPU is faster; CPU is faster; not additive due to queue contention]
⇒ Aggregate throughput of CPU + GPU always highest
What is the Trade-Off between CPUs and GPUs?
• Hybrid processing model benefits from GPU's ability to process complex predicates fast
[Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only) and SABER against the number of selection predicates (1 to 64, annotated "Dispatch bound") and against the number of join predicates (1 to 64); both panels use windows [rows 1024, slide 1024]]
Does SABER Adapt to Workload Changes?
[Figure: selectivity (0 to 0.2) and throughput (GB/s, 0 to 8) over time (0 to 60 seconds) for SABER and SABER (GPU contribution)]
HLS periodically uses idle, non-preferred processor to run tasks to update query task throughput matrix
⇒ Higher selectivity → more predicates evaluated → GPU preferred
Summary
• Heterogeneous servers have huge impact on data-intensive systems
– Shift from scale out to scale up model
– Need new general-purpose system designs for heterogeneous servers
☛ SABER: Hybrid Stream Processing Engine for CPUs & GPUs
• (1) Parallelise computation to fit hardware capabilities
⇒ Decouple hardware/system parameters from processing semantics
• (2) Fully utilise all heterogeneous processors independently of workload
⇒ Hybrid processing model to achieve aggregate CPU/GPU throughput
Acknowledgement: LSDS Group at Imperial College London
Thank you! Any Questions?
Peter Pietzuch
http://lsds.doc.ic.ac.uk
<[email protected]>
We're Hiring! Post-docs, PhDs