A Survey on In-a-Box Parallel Computing and Its Implications on System Software Research
TRANSCRIPT
A Survey on In-a-Box Parallel Computing and Its Implications on System Software Research
Changwoo Min ([email protected])
Motivation
"Technology ratios matter." (Jim Gray)
"In the face of such '10X' forces, you can lose control of your destiny." (Andrew S. Grove)
What are the implications of the multicore evolution for system software researchers?
Survey Scope and Strategy
[Diagram: the surveyed stack, from multicore CPUs and GPGPUs up through the virtual machine monitor, operating system, system library, parallel programming model, parallel middleware, and parallel applications]
Contents
Background
Parallel Programming Model and Productivity Tools
Optimization of System Software
Supporting GPU in a Virtualized Environment
Utilizing GPU in Middleware
Conclusion
Background
Why multicore?
Multicore CPU
Power wall
ILP(instruction level parallelism) wall
Memory wall
Wire delay
GPGPU (General-Purpose computing on Graphics Processing Units)
A GPU traditionally handles computation only for computer graphics.
GPGPU adds the following to the rendering pipeline:
programmable stages
higher-precision arithmetic
It then uses stream processing on non-graphics data.
[Figure: Architecture of a GPGPU core]
Parallel Programming Model and
Productivity Tools
OpenMP
Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran
Use language extension – “#pragma omp”
Need compiler support
OpenMP (cont’d)
Fork-and-join model
Bounded parallel loop, reduction
Task-creation-and-join model
Unbounded loop, recursive algorithm, producer/consumer
Intel TBB (Threading Building Blocks)
Similar to OpenMP
API for shared memory multiprocessing
Fork-and-join
parallel_for, parallel_reduce
Task-creation-and-join
Task scheduler
Different from OpenMP
C++ template library
Concurrent container class
Hash map, vector, queue
Various synchronization mechanisms
mutex, spin lock, …
Atomic type, atomic operations
Scalable memory allocator
Nvidia CUDA (Compute Unified Device Architecture)
CUDA
Computing engine in Nvidia GPU
Programming framework for Nvidia GPU
Use CUDA extended C
declspecs, keywords, intrinsic, runtime API, function launch, …
[Figures: CUDA extended C; Compiling CUDA code; Processing flow on CUDA]
Nvidia CUDA (cont’d)
[Figures: Execution model; Kernel; Memory access]
OpenCL (Open Compute Language)
CPU/GPU heterogeneous computing framework
standardized by the Khronos Group
[Figures: OpenCL memory model; CUDA and OpenCL examples]
Lithe: Enabling Efficient Composition of
Parallel Libraries
Who?
ParLab, UC Berkeley, HotPar’09
Problem
Composing parallel libraries shows performance anomalies.
Lithe: Enabling Efficient Composition of
Parallel Libraries (cont’d)
Solution
Virtualized threads are bad for parallel libraries.
Harts
Unvirtualized hardware thread context
Sharing harts
Lithe
Cooperative hierarchical scheduler framework for harts
Concurrency bug detection: DataCollider
Who?
Microsoft Research, OSDI’10
Problem
Detecting concurrency (data race) bugs is difficult.
For a large system such as the Windows kernel, runtime overhead is critical.
Solution
Sampling using code breakpoints
When a code breakpoint traps:
set a data breakpoint on its operand,
sleep for a while;
if the data has changed, it may be a data race.
Concurrency bug detection: SyncFinder
Who?
UC San Diego, OSDI ’10
Problem
How to find ad-hoc synchronization
Solution
Formalize patterns of ad-hoc synchronization
Detect such patterns using LLVM
Optimization of System Software
Memory Allocation: Hoard
Who?
UT, ASPLOS’00
Problem
The memory allocator is a performance bottleneck in multiprocessor environments.
Lock contention, false sharing, blowup
Allocator-induced false sharing
Memory Allocation: Hoard (cont’d)
Solution
Per-processor heaps to reduce lock contention and false sharing
Global heap
Borrow memory from the global heap to grow a per-processor heap
Return memory to the global heap when a per-processor heap holds too much free memory
Memory Allocation: Xmalloc
Who?
UIUC, ICCIT’10
Problem
Building a scalable malloc for CUDA, where hundreds of threads allocate concurrently
Solution
Memory allocation coalescing
System Call: FlexSC
Who?
University of Toronto, OSDI’10
Problem
The negative performance impact of system calls is huge.
Direct cost + indirect cost
Solution
Batching and asynchronous system calls
Revisiting OS Architecture
Multikernel
Who?
ETH Zurich, Microsoft Research Cambridge, SOSP’09
Problem
System diversity
It is no longer acceptable (or useful) to tune a general-purpose OS design for a particular hardware model.
Multikernel (cont’d)
Problem (cont’d)
The interconnect matters
Core diversity
Programmable NICs
GPU
FPGA in CPU sockets
[Figures: 8-socket Nehalem; On-chip interconnects; SHM vs. message passing (SHM: stalled cycles, no locking!)]
Multikernel (cont’d)
Solution
Today’s computer is already a distributed system. Why isn’t your OS?
Barrelfish
Implementation of the multikernel approach
Message passing, shared nothing, replica maintenance
An Analysis of Linux Scalability to Many
Cores
Who?
MIT CSAIL, OSDI’10
Problem
Given such scalability concerns, is Linux scalable enough?
Solution
Tested Linux scalability on 48 Intel cores with 7 applications
No kernel problems up to 48 cores
3,002 LOC of patches
Sloppy counter: a replicated reference counter
Supporting GPU in a virtualized
environment
HyVM (Hybrid Virtual Machines)
Who?
Georgia Tech
Problem
Asymmetries in performance, memory and cache
Functional differences
Multiple accelerators
Vector processor
Floating point
Additional instructions for accelerations
Solution
heterogeneity- and asymmetry-aware hypervisors
HyVM (cont’d)
Solution (cont’d)
[Figures: HyVM architecture; GViM GPU virtualization architecture; Memory management in GViM; Harmony CPU/GPU co-scheduling]
VMGL (Virtualizing OpenGL)
Who?
University of Toronto, VEE’07
Problem
How to support OpenGL in a virtual machine environment
Solution
Forward OpenGL commands to the driver domain
Utilizing GPU in Middleware
StoreGPU
Who?
University of British Columbia, HPDC’10
Problem
In CAS (Content-Addressable Storage), how to minimize the cost of hash calculation
Solution
Offloading to GPU
[Figure: StoreGPU architecture]
PacketShader
Who?
KAIST, SIGCOMM’10, NSDI’11
Problem
How to boost the performance of a software router
Solution
Offload stateless (parallelizable) packet processing to GPU
[Figures: PacketShader architecture; Basic workflow of PacketShader]
Conclusion