A Survey on in-a-box parallel computing and its implications on system software research Changwoo Min ([email protected])

Post on 08-May-2015

Page 1: A Survey on in-a-box parallel computing and its implications on system software research

A Survey on in-a-box parallel computing and its implications on system software research

Changwoo Min ([email protected])

Page 2: A Survey on in-a-box parallel computing and its implications on system software research

Motivation

Technology ratios matter, Jim Gray

In the face of such "10X" forces, you can lose control of your destiny, Andrew S Grove

What are the implications of multicore evolution for system software researchers?

Page 3: A Survey on in-a-box parallel computing and its implications on system software research

Survey Scope and Strategy

[Figure: layered stack of the surveyed components: parallel application, parallel middleware, parallel programming model, system library, operating system, and virtual machine monitor, on top of multicore CPUs and GPGPUs]

Page 4: A Survey on in-a-box parallel computing and its implications on system software research

Contents

Background

Parallel Programming Model and Productivity Tools

Optimization of System Software

Supporting GPU in a Virtualized Environment

Utilizing GPU in Middleware

Conclusion

Page 5: A Survey on in-a-box parallel computing and its implications on system software research

Background

Page 6: A Survey on in-a-box parallel computing and its implications on system software research

Why multicore?

Multicore CPU

Power wall

ILP (instruction-level parallelism) wall

Memory wall

Wire delay

GPGPU (General-Purpose computing on Graphics Processing Units)

A GPU typically handles computation only for computer graphics.

Add the following to the rendering pipeline:

programmable stages

higher precision arithmetic

Use stream processing on non-graphics data.

Page 7: A Survey on in-a-box parallel computing and its implications on system software research

Architecture of GPGPU core

Page 8: A Survey on in-a-box parallel computing and its implications on system software research

Parallel Programming Model and Productivity Tools

Page 9: A Survey on in-a-box parallel computing and its implications on system software research

OpenMP

Parallel programming API for shared-memory multiprocessing in C, C++, and Fortran

Uses a language extension: “#pragma omp”

Needs compiler support

Page 10: A Survey on in-a-box parallel computing and its implications on system software research

OpenMP (cont’d)

Fork-and-join model

Bounded parallel loop, reduction

Task-creation-and-join model

Unbounded loop, recursive algorithm, producer/consumer
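
The two models above can be sketched in C++ as follows. This is a minimal illustration, not code from the survey: the function names parallel_sum, fib_task, and parallel_fib are hypothetical. It assumes a compiler with OpenMP enabled (for example, the -fopenmp flag in GCC or Clang); without it the pragmas are ignored and the code runs serially with the same results.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Fork-and-join: a bounded loop whose iterations are split across threads.
// The reduction clause gives each thread a private partial sum and combines
// the partial sums when the threads join.
std::int64_t parallel_sum(const std::vector<int>& values) {
    std::int64_t sum = 0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += values[i];
    }
    return sum;
}

// Task-creation-and-join: a recursive algorithm in which each call spawns
// two child tasks and waits for both with taskwait.
std::int64_t fib_task(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    std::int64_t a = 0;
    std::int64_t b = 0;
    #pragma omp task shared(a)
    a = fib_task(n - 1);
    #pragma omp task shared(b)
    b = fib_task(n - 2);
    #pragma omp taskwait
    return a + b;
}

// Entry point for the task version: one thread creates the root task and
// the other threads of the team execute the tasks it spawns.
std::int64_t parallel_fib(int n) {
    std::int64_t result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

A production version of fib_task would normally stop creating tasks below some cutoff depth, since one task per call is far too fine-grained to pay off.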

Page 11: A Survey on in-a-box parallel computing and its implications on system software research

Intel TBB (Threading Building Blocks)

Similar to OpenMP

API for shared memory multiprocessing

Fork-and-join

parallel-for, parallel-reduce

Task-creation-and-join

Task scheduler

Different from OpenMP

C++ template library

Concurrent container classes

Hash map, vector, queue

Various synchronization mechanisms

mutex, spin lock, …

Atomic type, atomic operations

Scalable memory allocator

Page 12: A Survey on in-a-box parallel computing and its implications on system software research

Nvidia CUDA (Compute Unified Device Architecture)

CUDA

Computing engine in Nvidia GPU

Programming framework for Nvidia GPU

Uses CUDA-extended C

declspecs, keywords, intrinsics, runtime API, function launch, …

[Figures: CUDA-extended C; compiling CUDA code; processing flow on CUDA]

Page 13: A Survey on in-a-box parallel computing and its implications on system software research

Nvidia CUDA (cont’d)

Execution Model Kernel Memory Access

Page 14: A Survey on in-a-box parallel computing and its implications on system software research

OpenCL (Open Computing Language)

CPU/GPU heterogeneous computing framework

Standardized by the Khronos Group

[Figures: OpenCL memory model; CUDA and OpenCL examples]

Page 15: A Survey on in-a-box parallel computing and its implications on system software research

Lithe: Enabling Efficient Composition of Parallel Libraries

Who?

ParLab, UC Berkeley, HotPar’09

Problem

Composing parallel libraries causes performance anomalies

Page 16: A Survey on in-a-box parallel computing and its implications on system software research

Lithe: Enabling Efficient Composition of Parallel Libraries (cont’d)

Solution

Virtualized threads are bad for parallel libraries.

Harts

Unvirtualized hardware thread context

Sharing harts

Lithe

Cooperative hierarchical scheduler framework for harts

Page 17: A Survey on in-a-box parallel computing and its implications on system software research

Concurrency bug detection: DataCollider

Who?

Microsoft Research, OSDI’10

Problem

Detecting data race bugs is difficult.

For a large system such as the Windows kernel, runtime overhead is critical.

Solution

Sample memory accesses using code breakpoints

When a code breakpoint traps:

Set a data breakpoint on its operand

Sleep for a while

If the data has changed, it could be a data race.
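
The "sleep, then check whether the data changed" step can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not DataCollider's implementation: probe_for_race is a hypothetical name, and the pause callback stands in for sleeping while other threads run. A real tool would arm a hardware data breakpoint on the address during the delay.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model of one sampled memory access.
// `pause` models "sleep for a while": it performs whatever other threads
// would do to memory during the delay (an assumption of this sketch).
bool probe_for_race(const volatile int* addr,
                    const std::function<void()>& pause) {
    assert(addr != nullptr);
    const int before = *addr;  // snapshot the operand when the code breakpoint traps
    pause();                   // delay the sampled thread
    const int after = *addr;   // re-read the operand after the delay
    return before != after;    // a change means some other access collided
}
```

For example, probe_for_race(&x, [&] { x = 2; }) reports a possible race when x was not already 2, while an empty pause reports none. Comparing values can only notice conflicting writes that change the value, which is why the slide words the result as "could be" a data race.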

Page 18: A Survey on in-a-box parallel computing and its implications on system software research

Concurrency bug detection: SyncFinder

Who?

UC San Diego, OSDI ’10

Problem

How to find ad-hoc synchronization

Solution

Formalize patterns of ad-hoc synchronization

Detect such patterns using LLVM

Page 19: A Survey on in-a-box parallel computing and its implications on system software research

Optimization of System Software

Page 20: A Survey on in-a-box parallel computing and its implications on system software research

Memory Allocation: Hoard

Who?

UT Austin, ASPLOS’00

Problem

The memory allocator is a performance bottleneck in multiprocessor environments.

Lock contention, false sharing, blowup

Allocator-induced false sharing

Page 21: A Survey on in-a-box parallel computing and its implications on system software research

Memory Allocation: Hoard (cont’d)

Solution

Per-processor heaps to reduce lock contention and false sharing

Global heap

Borrow memory from the global heap to grow a per-processor heap

Return memory to the global heap if there is too much free memory in a per-processor heap
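
The borrow and return rules can be sketched as a simplified accounting model. This is an illustration only, not Hoard's implementation: HoardModel, kChunk, and kMaxFree are hypothetical names and values, blocks are merely counted rather than allocated, and a freed block is credited to the heap of the freeing processor. Locking is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameters of this sketch (not Hoard's actual values).
constexpr std::size_t kChunk = 64;     // blocks moved per borrow or return
constexpr std::size_t kMaxFree = 128;  // free blocks a per-processor heap may keep

struct HoardModel {
    std::size_t global_free;              // free blocks in the global heap
    std::vector<std::size_t> local_free;  // free blocks in each per-processor heap

    HoardModel(std::size_t processors, std::size_t global_blocks)
        : global_free(global_blocks), local_free(processors, 0) {}

    // Allocate one block on `cpu`. The common case touches only the local
    // heap; a chunk is borrowed from the global heap when the local heap
    // has no free block left.
    bool allocate(std::size_t cpu) {
        assert(cpu < local_free.size());
        if (local_free[cpu] == 0) {
            if (global_free < kChunk) return false;  // model is out of memory
            global_free -= kChunk;
            local_free[cpu] += kChunk;
        }
        --local_free[cpu];
        return true;
    }

    // Free one block on `cpu`. If the local heap now holds too much free
    // memory, one chunk goes back to the global heap.
    void release(std::size_t cpu) {
        assert(cpu < local_free.size());
        ++local_free[cpu];
        if (local_free[cpu] > kMaxFree) {
            local_free[cpu] -= kChunk;
            global_free += kChunk;
        }
    }
};
```

Because only the borrow and return paths touch global_free, most operations need no shared lock, and the cap on local free memory keeps one processor from hoarding memory that others could use.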

Page 22: A Survey on in-a-box parallel computing and its implications on system software research

Memory Allocation: Xmalloc

Who?

UIUC, ICCIT’10

Problem

Scalable malloc for CUDA, where hundreds of threads run concurrently.

Solution

Memory allocation coalescing
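
The idea of allocation coalescing can be sketched on the host in C++: the requests of one group of threads are summed, served by a single call to the underlying allocator, and each thread receives an offset into the shared block. This is a CPU-side illustration under assumptions, not XMalloc's GPU code: CoalescedAllocation, coalesce_allocations, and the 16-byte alignment are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct CoalescedAllocation {
    void* base;                        // the single underlying allocation
    std::vector<std::size_t> offsets;  // offset of each request within base
    std::size_t total;                 // total bytes requested from the allocator
};

// Serve all requests of one thread group with one call to malloc.
CoalescedAllocation coalesce_allocations(const std::vector<std::size_t>& sizes) {
    constexpr std::size_t kAlign = 16;  // hypothetical per-request alignment
    CoalescedAllocation out{nullptr, {}, 0};
    out.offsets.reserve(sizes.size());
    for (std::size_t size : sizes) {
        out.offsets.push_back(out.total);  // exclusive prefix sum of rounded sizes
        out.total += (size + kAlign - 1) / kAlign * kAlign;
    }
    if (out.total > 0) {
        out.base = std::malloc(out.total);  // one allocator call for the whole group
    }
    return out;
}
```

The sketch leaves out freeing: a coalesced block can only go back to the allocator once every request carved from it has been released, which needs a per-block count of live sub-allocations.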

Page 23: A Survey on in-a-box parallel computing and its implications on system software research

System Call: FlexSC

Who?

University of Toronto, OSDI’10

Problem

The negative performance impact of system calls is large.

Direct cost + indirect cost

Solution

Batching, asynchronous system calls
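
Batching and asynchrony can be sketched with a table of request slots shared between the application and the kernel: the application posts requests without trapping, and the kernel side later executes every pending request in one pass. This is a single-threaded user-space model, not FlexSC's interface: SyscallPage, SyscallEntry, and their members are hypothetical names, and a std::function stands in for the system call number and arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

enum class EntryState { Free, Submitted, Done };

struct SyscallEntry {
    EntryState state = EntryState::Free;
    std::function<long()> call;  // stands in for syscall number + arguments
    long result = 0;
};

class SyscallPage {
public:
    explicit SyscallPage(std::size_t slots) : entries_(slots) {}

    // Application side: post a request and return immediately.
    // Returns the slot index, or -1 if every slot is in use.
    int submit(std::function<long()> call) {
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            if (entries_[i].state == EntryState::Free) {
                entries_[i].call = std::move(call);
                entries_[i].state = EntryState::Submitted;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Kernel side: execute all pending requests in one pass (the batch).
    std::size_t process_batch() {
        std::size_t executed = 0;
        for (SyscallEntry& e : entries_) {
            if (e.state == EntryState::Submitted) {
                e.result = e.call();
                e.state = EntryState::Done;
                ++executed;
            }
        }
        return executed;
    }

    bool done(int slot) const { return entry(slot).state == EntryState::Done; }

    // Application side: read a finished result and free the slot.
    long collect(int slot) {
        assert(done(slot));
        SyscallEntry& e = entries_[static_cast<std::size_t>(slot)];
        e.state = EntryState::Free;
        return e.result;
    }

private:
    const SyscallEntry& entry(int slot) const {
        assert(slot >= 0 && static_cast<std::size_t>(slot) < entries_.size());
        return entries_[static_cast<std::size_t>(slot)];
    }
    std::vector<SyscallEntry> entries_;
};
```

Posting several requests before a single process_batch call is what amortizes the direct cost of a mode switch; keeping the application running between submit and collect is the asynchronous part.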

Page 24: A Survey on in-a-box parallel computing and its implications on system software research

Revisiting OS Architecture

Page 25: A Survey on in-a-box parallel computing and its implications on system software research

Multikernel

Who?

ETH Zurich, Microsoft Research Cambridge, SOSP’09

Problem

System diversity

It is no longer acceptable (or useful) to tune a general-purpose OS design for a particular hardware model.

Page 26: A Survey on in-a-box parallel computing and its implications on system software research

Multikernel (cont’d)

Problem (cont’d)

The interconnect matters

Core diversity

Programmable NICs

GPU

FPGA in CPU sockets

[Figures: 8-socket Nehalem; on-chip interconnects; SHM vs. message passing (SHM: stalled cycles, no locking!)]

Page 27: A Survey on in-a-box parallel computing and its implications on system software research

Multikernel (cont’d)

Solution

Today’s computer is already a distributed system. Why isn’t your OS?

Barrelfish

Implementation of the multikernel approach

Message passing, shared nothing, replica maintenance

Page 28: A Survey on in-a-box parallel computing and its implications on system software research

An Analysis of Linux Scalability to Many Cores

Who?

MIT CSAIL, OSDI’10

Problem

If many-core systems call for new OS designs, is Linux scalable enough?

Solution

Test Linux scalability on a 48-core AMD machine with 7 applications

No kernel scalability problems up to 48 cores

3002 LOC of patches

Sloppy counter: a replicated reference counter
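
A replicated reference counter of this kind can be sketched as one central count plus a per-core count of spare references. The central count equals the references in use plus all spares, so most acquire and release operations touch only per-core state. This is a single-threaded illustration, not the kernel patch: SloppyCounter and its members are hypothetical names, and a real version would update the central count atomically and keep each per-core count on its own cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

class SloppyCounter {
public:
    SloppyCounter(std::size_t cores, long threshold)
        : spare_(cores, 0), threshold_(threshold) {
        assert(threshold >= 0);
    }

    // Take one reference on `core`: use a local spare if there is one,
    // otherwise fall back to the shared central count.
    void acquire(std::size_t core) {
        assert(core < spare_.size());
        if (spare_[core] > 0) {
            --spare_[core];
        } else {
            ++central_;
        }
    }

    // Drop one reference on `core`: keep it as a local spare, and hand
    // spares back to the central count once they exceed the threshold.
    void release(std::size_t core) {
        assert(core < spare_.size());
        ++spare_[core];
        if (spare_[core] > threshold_) {
            const long excess = spare_[core] - threshold_;
            spare_[core] -= excess;
            central_ -= excess;
        }
    }

    long central() const { return central_; }
    long spare(std::size_t core) const { return spare_[core]; }

    // True number of references in use: central count minus all spares.
    long in_use() const {
        return central_ - std::accumulate(spare_.begin(), spare_.end(), 0L);
    }

private:
    long central_ = 0;         // shared count: in-use references + all spares
    std::vector<long> spare_;  // per-core spare references
    long threshold_;           // maximum spares a core keeps locally
};
```

The central count alone overstates the number of live references, which is why the counter is called sloppy; an exact value requires visiting every per-core count, as in_use does.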

Page 29: A Survey on in-a-box parallel computing and its implications on system software research

Supporting GPU in a Virtualized Environment

Page 30: A Survey on in-a-box parallel computing and its implications on system software research

HyVM (Hybrid Virtual Machines)

Who?

Georgia Tech

Problem

Asymmetries in performance, memory and cache

Functional differences

Multiple accelerators

Vector processor

Floating point

Additional instructions for acceleration

Solution

Heterogeneity- and asymmetry-aware hypervisors

Page 31: A Survey on in-a-box parallel computing and its implications on system software research

HyVM (cont’d)

Solution (cont’d)

[Figures: HyVM architecture; GViM: GPU virtualization architecture; memory management in GViM; Harmony CPU/GPU co-scheduling]

Page 32: A Survey on in-a-box parallel computing and its implications on system software research

VMGL (Virtualizing OpenGL)

Who?

University of Toronto, VEE’07

Problem

How to support OpenGL in a virtual machine environment

Solution

Forward OpenGL commands to the driver domain

Page 33: A Survey on in-a-box parallel computing and its implications on system software research

Utilizing GPU in Middleware

Page 34: A Survey on in-a-box parallel computing and its implications on system software research

StoreGPU

Who?

University of British Columbia, HPDC’10

Problem

In CAS (content-addressable storage), how to minimize the hash calculation cost

Solution

Offloading to GPU

StoreGPU Architecture

Page 35: A Survey on in-a-box parallel computing and its implications on system software research

PacketShader

Who?

KAIST, SIGCOMM’10, NSDI’11

Problem

How to boost the performance of software routers

Solution

Offload stateless (parallelizable) packet processing to GPU

[Figures: PacketShader architecture; basic workflow of PacketShader]

Page 36: A Survey on in-a-box parallel computing and its implications on system software research

Conclusion

Page 37: A Survey on in-a-box parallel computing and its implications on system software research
