A Survey on In-a-Box Parallel Computing and Its Implications on System Software Research
TRANSCRIPT
A Survey on In-a-Box Parallel Computing and Its Implications on System Software Research
Changwoo Min ([email protected])
Motivation
"Technology ratios matter." (Jim Gray)
"In the face of such '10X' forces, you can lose control of your destiny." (Andrew S. Grove)
What are the implications of the multicore evolution for system software researchers?
Survey Scope and Strategy
[Diagram: the surveyed stack, from multicore CPUs and GPGPUs up through the virtual machine monitor, operating system, system library, parallel programming model, parallel middleware, and parallel applications]
Contents
Background
Parallel Programming Model and Productivity Tools
Optimization of System Software
Supporting GPU in a Virtualized Environment
Utilizing GPU in Middleware
Conclusion
Background
Why multicore?
Multicore CPU
Power wall
ILP(instruction level parallelism) wall
Memory wall
Wire delay
GPGPU (General-Purpose computing on Graphics Processing Units)
A GPU traditionally handles computation only for computer graphics.
GPGPU adds the following to the rendering pipeline:
programmable stages
higher-precision arithmetic
It then uses stream processing on non-graphics data.
[Figure: Architecture of a GPGPU core]
Parallel Programming Model and
Productivity Tools
OpenMP
Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran
Use language extension – “#pragma omp”
Need compiler support
OpenMP (cont’d)
Fork-and-join model
Bounded parallel loop, reduction
Task-creation-and-join model
Unbounded loop, recursive algorithm, producer/consumer
Intel TBB (Threading Building Blocks)
Similar to OpenMP
API for shared memory multiprocessing
Fork-and-join
parallel_for, parallel_reduce
Task-creation-and-join
Task scheduler
Different from OpenMP
C++ template library
Concurrent container class
Hash map, vector, queue
Various synchronization mechanisms
mutex, spin lock, …
Atomic type, atomic operations
Scalable memory allocator
Nvidia CUDA (Compute Unified Device Architecture)
CUDA
Computing engine in Nvidia GPU
Programming framework for Nvidia GPU
Use CUDA extended C
declspecs, keywords, intrinsic, runtime API, function launch, …
[Figures: CUDA extended C; Compiling CUDA code; Processing flow on CUDA]
Nvidia CUDA (cont’d)
[Figures: Execution model; Kernel; Memory access]
OpenCL (Open Compute Language)
CPU/GPU heterogeneous computing framework
standardized by the Khronos Group
[Figures: OpenCL memory model; CUDA and OpenCL examples]
Lithe: Enabling Efficient Composition of
Parallel Libraries
Who?
ParLab, UC Berkeley, HotPar’09
Problem
Composing parallel libraries shows performance anomalies.
Lithe: Enabling Efficient Composition of
Parallel Libraries (cont’d)
Solution
Virtualized threads are bad for parallel libraries.
Harts
Unvirtualized hardware thread context
Sharing harts
Lithe
Cooperative hierarchical scheduler framework for harts
Concurrency bug detection: DataCollider
Who?
Microsoft Research, OSDI’10
Problem
Detecting concurrency (data race) bugs is difficult.
For a large system such as the Windows kernel, runtime overhead is critical.
Solution
Sampling using code breakpoints
When a code breakpoint traps:
set a data breakpoint on its operand,
sleep for a while;
if the data has changed, it may be a data race.
Concurrency bug detection: SyncFinder
Who?
UC San Diego, OSDI ’10
Problem
How to find ad-hoc synchronization
Solution
Formalize patterns of ad-hoc synchronization
Detect such patterns using LLVM
Optimization of System Software
Memory Allocation: Hoard
Who?
UT, ASPLOS’00
Problem
The memory allocator is a performance bottleneck in multiprocessor environments.
Lock contention, false sharing, blowup
Allocator-induced false sharing
Memory Allocation: Hoard (cont’d)
Solution
Per-processor heaps to reduce lock contention and false sharing
Global heap
Borrow memory from the global heap to grow a per-processor heap
Return memory to the global heap when a per-processor heap holds too much free memory
Memory Allocation: Xmalloc
Who?
UIUC, ICCIT’10
Problem
Building a scalable malloc for CUDA, where hundreds of threads allocate concurrently
Solution
Memory allocation coalescing
System Call: FlexSC
Who?
University of Toronto, OSDI’10
Problem
The negative performance impact of system calls is huge.
Direct cost + indirect cost
Solution
Batching and asynchronous system calls
Revisiting OS Architecture
Multikernel
Who?
ETH Zurich, Microsoft Research Cambridge, SOSP’09
Problem
System diversity
It is no longer acceptable (or useful) to tune a general-purpose OS design for a particular hardware model.
Multikernel (cont’d)
Problem (cont’d)
The interconnect matters
Core diversity
Programmable NICs
GPU
FPGA in CPU sockets
[Figures: 8-socket Nehalem; On-chip interconnects; SHM vs. message passing (SHM: stalled cycles, no locking!)]
Multikernel (cont’d)
Solution
Today’s computer is already a distributed system. Why isn’t your OS?
Barrelfish
Implementation of the multikernel approach
Message passing, shared nothing, replica maintenance
An Analysis of Linux Scalability to Many
Cores
Who?
MIT CSAIL, OSDI’10
Problem
Given such scalability concerns, is Linux scalable enough?
Solution
Tested Linux scalability on 48 Intel cores with 7 applications
No kernel problems up to 48 cores
3,002 LOC of patches
Sloppy counter: a replicated reference counter
Supporting GPU in a virtualized
environment
HyVM (Hybrid Virtual Machines)
Who?
Georgia Tech
Problem
Asymmetries in performance, memory and cache
Functional differences
Multiple accelerators
Vector processor
Floating point
Additional instructions for accelerations
Solution
heterogeneity- and asymmetry-aware hypervisors
HyVM (cont’d)
Solution (cont’d)
[Figures: HyVM architecture; GViM GPU virtualization architecture; Memory management in GViM; Harmony CPU/GPU co-scheduling]
VMGL (Virtualizing OpenGL)
Who?
University of Toronto, VEE’07
Problem
How to support OpenGL in a virtual machine environment
Solution
Forward OpenGL commands to the driver domain
Utilizing GPU in Middleware
StoreGPU
Who?
University of British Columbia, HPDC’10
Problem
In CAS (Content-Addressable Storage), how to minimize the cost of hash calculation
Solution
Offloading to GPU
[Figure: StoreGPU architecture]
PacketShader
Who?
KAIST, SIGCOMM’10, NSDI’11
Problem
How to boost the performance of a software router
Solution
Offload stateless (parallelizable) packet processing to GPU
[Figures: PacketShader architecture; Basic workflow of PacketShader]
Conclusion