1
Parallel Applications
Parallel Hardware
Parallel Software
The Parallel Computing Landscape: A View from Berkeley 2.0
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick
Dagstuhl, September 3, 2007
2
A Parallel Revolution, Ready or Not
Embedded: per-product ASICs give way to programmable platforms; a multicore chip is the most competitive path (amortize design costs + reduce design risk + flexible platforms)
PC, Server: Power Wall + Memory Wall = Brick Wall; the end of the way we've scaled uniprocessors for the last 40 years
New Moore's Law: 2X processors ("cores") per chip every technology generation, but the same clock rate
"This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs ...; instead, this ... is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions." (The Parallel Computing Landscape: A Berkeley View)
A sea change for the HW & SW industries, since it changes the model of programming and debugging
3
[Chart: millions of PCs sold per year (0 to 250), 1985 to 2015]
P.S. The Parallel Revolution May Fail
John Hennessy, President, Stanford University, 1/07: "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
"A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4:10, 1/07.
100% failure rate of parallel computer companies: Convex, Encore, MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Transputer, Thinking Machines, ...
What if IT goes from a growth industry to a replacement industry? If SW can't effectively use 8, 16, 32, ... cores per chip, SW is no faster on a new computer, and customers buy only when the old computer wears out.
4
Need a Fresh Approach to Parallelism
Berkeley researchers from many backgrounds have been meeting since Feb. 2005 to discuss parallelism: Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, ...
Backgrounds: circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
Tried to learn from successes in high-performance computing (LBNL) and parallel embedded computing (BWRC)
Led to the "Berkeley View" tech report and the new Parallel Computing Laboratory ("Par Lab")
Goal: productive, efficient, correct programming of 100+ cores, scaling as cores double every 2 years (!)
5
Why Target 100+ Cores?
A 5-year research program should aim 8+ years out
Multicore: 2X cores / 2 yrs ≈ 64 cores in 8 years
Manycore: 8X to 16X multicore
[Chart: projected cores per chip (1 to 512, log scale) vs. year, 2003 to 2015; labeled region: Automatic Parallelization, Thread-Level Speculation]
6
Re-inventing Client/Server
Laptop/handheld as the future client, datacenter as the future server; focus on a single socket (laptop/handheld or blade)
"The Datacenter is the Computer": building-sized computers (Microsoft, ...)
"The Laptop/Handheld is the Computer": '08: Dell laptops outnumber desktops? '06 profit? 1B cell phones/yr, increasing in function. Otellini demoed the "Universal Communicator," a combination cell phone, PC, and video device; Apple iPhone
7
Try Application-Driven Research?
Conventional wisdom in CS research: users don't know what they want, so computer scientists solve individual parallel problems with a clever language feature (e.g., futures), a new compiler pass, or a novel hardware widget (e.g., SIMD)
Approach: push (foist?) CS nuggets/solutions on users. Problem: stupid users don't learn/use the proper solution
Another approach: work with domain experts developing compelling apps; provide the HW/SW infrastructure necessary to build, compose, and understand parallel software written in multiple languages; let research be guided by commonly recurring patterns actually observed while developing compelling apps
8
4 Themes of View 2.0 / Par Lab
1. Applications: compelling apps driving the research agenda top-down
2. Identify computational bottlenecks: breaking through disciplinary boundaries
3. Developing parallel software with productivity, efficiency, and correctness: 2 layers + C&C language, autotuning
4. OS and architecture: composable primitives, not pre-packaged solutions; deconstruction, fast barrier synchronization, partitions
9
Par Lab Research Overview: easy to write correct programs that run efficiently on manycore
[Stack diagram]
Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Dwarfs
Productivity Layer: Composition & Coordination Language (C&CL), C&CL compiler/interpreter, parallel libraries, parallel frameworks
Efficiency Layer: sketching, autotuners, legacy code, schedulers, communication & synchronization primitives, efficiency languages, efficiency-language compilers
Correctness (spans both layers): static verification, type systems, directed testing, dynamic checking, debugging with replay
OS: legacy OS, OS libraries & services, hypervisor
Arch.: multicore/GPGPU, RAMP manycore
10
Theme 1. Applications. What are the problems?
"Who needs 100 cores to run M/S Word?" We need compelling apps that use 100s of cores.
How did we pick applications?
1. Enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology
2. Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
3. Requires significant speed-up, or a smaller, more efficient platform, to work as intended
4. As a whole, the applications cover the most important platforms (handheld, laptop, games) and markets (consumer, business, health)
11
Coronary Artery Disease
Modeling to help patient compliance? 400K deaths/year, 16M with symptoms, 72M with high blood pressure
Massively parallel, real-time variations: CFD + FE; solid (non-linear), fluid (Newtonian), pulsatile; blood pressure, activity, habitus, cholesterol
[Figure: simulation, before and after]
12
Content-Based Image Retrieval
[Diagram: query by example → similarity metric over an image database (1000's of images) → candidate results → relevance feedback → final result]
Built around key characteristics of personal databases: very large number of pictures (>5K), non-labeled images, many pictures of few people, complex pictures including people, events, places, and objects
13
Compelling Laptop/Handheld Apps
Meeting Diarist: laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
Teleconference speaker identifier, speech helper: L/Hs used for a teleconference identify who is speaking and provide a "closed caption" hint of what is being said
14
Compelling Laptop/Handheld Apps
Musicians have an insatiable appetite for computation: more channels, more instruments, more processing, more interaction! Latency must be low (5 ms); must be reliable (no clicks)
Music Enhancer: enhanced sound delivery systems for home sound systems using large microphone and speaker arrays; laptop/handheld recreates 3D sound over ear buds
Hearing Augmenter: laptop/handheld as an accelerator for a hearing aid
Novel Instrument User Interface: new composition and performance systems beyond keyboards; input device for laptop/handheld
The Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters.
15
Parallel Browser
Goal: desktop-quality browsing on handhelds, enabled by 4G networks and better output devices
Bottlenecks to parallelize: parsing, rendering, scripting
SkipJax: a parallel replacement for JavaScript/AJAX, based on Brown's FlapJax
16
Theme 2. Use computational bottlenecks instead of conventional benchmarks?
How do we invent the parallel systems of the future while tied to the old code, programming models, and CPUs of the past?
Look for common patterns of communication and computation in:
1. Embedded computing (42 EEMBC benchmarks)
2. Desktop/server computing (28 SPEC2006)
3. Database / text mining software
4. Games/graphics/vision
5. Machine learning
6. High-performance computing (the original "7 Dwarfs")
Result: 13 "Dwarfs"
17
How do compelling apps relate to the 13 dwarfs?
[Table: dwarf popularity (red = hot, blue = cool) across columns Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser]
The 13 dwarfs: 1. Finite State Machines, 2. Combinational Logic, 3. Graph Traversal, 4. Structured Grid, 5. Dense Matrix, 6. Sparse Matrix, 7. Spectral (FFT), 8. Dynamic Programming, 9. N-Body, 10. MapReduce, 11. Backtrack / Branch & Bound, 12. Graphical Models, 13. Unstructured Grid
18
What about I/O in the Dwarfs?
High-speed serial lines mean many parallel I/O channels per chip
Storage will be much faster: "Flash is disk, disk is tape," at least for laptop/handheld; iPod, camera, cellphone, and laptop dollars rapidly improve flash memory
New technologies are even more promising: Phase-Change RAM is denser, writes faster, and wears out 100X more slowly
Disk → flash means a 10,000X improvement in latency (10 milliseconds to 1 microsecond)
Network will be much faster: 10 Gbit/s copper is standard, 100 Gbit/s is coming; parallel serial channels multiply bandwidth; we already need multiple fat cores to handle TCP/IP at 10 Gbit/s; Super 3G → 4G, 100 Mb/s → 1 Gb/s WAN wireless to the handset
19
4 Valuable Roles of Dwarfs
1. "Anti-benchmarks": dwarfs are not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware
2. Universal, understandable vocabulary: to talk across disciplinary boundaries
3. Bootstrapping: parallelize parallel research; allow analysis of HW & SW designs without waiting years for full apps
4. Targets for libraries (see later)
20
Themes 1 and 2 Summary
Application-driven research vs. CS solution-driven research
Drill down on (initially) 5 app areas to guide the research agenda
Dwarfs represent a broader set of apps to guide the research agenda
Dwarfs help break through traditional interfaces: benchmarking, multidisciplinary conversations, parallelizing parallel research, and targets for libraries
21
Theme 3: Developing Parallel Software
[Par Lab overview diagram, as on slide 9, with the software stack (C&CL, parallel libraries and frameworks, efficiency layer, correctness tools) highlighted]
22
Dwarfs become Libraries or Frameworks
Observation: dwarfs are of 2 types
Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
Patterns/Frameworks: MapReduce; graph traversal, graphical models; dynamic programming; backtracking/B&B; N-body; (un)structured grid
Algorithms in the dwarfs can be implemented either as compact parallel computations within a traditional library, or as a compute/communicate pattern implemented as a framework
Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce)
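The FFT example above can be sketched concretely: a map-style framework is instantiated with a 1D transform, and transposes stitch the rows and columns together. The names `map_frame` and `transform_2d` are illustrative stand-ins, not Par Lab APIs, and the "framework" runs serially here for clarity.

```python
# Sketch: build a 2D transform by instantiating a map framework with a
# 1D transform, then transposing (row-column decomposition, as for 2D FFT).
# All names are hypothetical illustrations, not the actual C&CL design.

def map_frame(work, items):
    # Stand-in for a parallel map framework; executed serially here.
    # The framework, not the library author, would own the parallelism.
    return [work(x) for x in items]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

def transform_2d(rows, transform_1d):
    # Apply the 1D transform to every row, transpose, transform the new
    # rows, and transpose back.
    step1 = transpose(map_frame(transform_1d, rows))
    return transpose(map_frame(transform_1d, step1))
```

With `transform_1d` a real 1D FFT this yields a 2D FFT; with any other row transform it yields the corresponding separable 2D operation, which is the sense in which the library is "built by instantiating" the framework.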
23
Two Layers for Two Types of Programmer
Efficiency Layer (10% of today's programmers): expert programmers build frameworks & libraries, hypervisors, ...; "bare metal" efficiency is possible at this layer
Productivity Layer (90% of today's programmers): domain experts / naïve programmers productively build parallel apps using frameworks & libraries; frameworks & libraries are composed to form app frameworks
Effective composition techniques allow the efficiency programmers to be highly leveraged
Create a language for Composition and Coordination (C&C)
24
Productivity Layer
[Par Lab overview diagram, as on slide 9, with the productivity layer (C&CL, C&CL compiler/interpreter, parallel libraries, parallel frameworks) highlighted]
25
C&C Language Requirements
Applications specify the C&C language requirements:
Constructs for creating application frameworks
Primitive parallelism constructs: data parallelism, divide-and-conquer parallelism, event-driven execution
Constructs for composing frameworks: frameworks require independence, and independence is proven at instantiation with a variety of techniques
Needs low runtime overhead and the ability to measure and control performance
26
Coordination & Composition
Composition is key to software reuse
Solutions exist for libraries with hidden parallelism: partitions in the OS help runtime composition
Instantiating parallel frameworks is harder: the framework specifies the required independence, e.g., operations in map and divide-and-conquer must not interfere
Guaranteed independence through types: type-system extensions (side effects, ownership); extra: Image READ Array[double]
Data decomposition may be implicit or explicit: Partition: Array[T] → List[Array[T]] (well understood); Partition: Graph[T] → List[Graph[T]]
Efficiency-layer code has these specifications at interfaces, where they are verified, tested, or asserted
Independence is proven by checking side effects and overlap at instantiation
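The "overlap check at instantiation" idea can be made concrete with a small sketch. The names below (`partition`, `check_independent`) are illustrative, not the C&CL design: a partition of an array's index space is accepted only if its pieces are pairwise disjoint, so a framework applying workers to the pieces cannot interfere.

```python
# Illustrative sketch of independence checking at instantiation.
# Names are hypothetical; the real system would use types/verification.

def partition(n, k):
    # Split indices 0..n-1 into k contiguous, non-overlapping ranges,
    # the well-understood Array[T] -> List[Array[T]] case.
    bounds = [i * n // k for i in range(k + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(k)]

def check_independent(pieces):
    # The "overlap check": no index may appear in two pieces, so
    # writes through different pieces cannot interfere.
    seen = set()
    for piece in pieces:
        for i in piece:
            if i in seen:
                raise ValueError("pieces overlap: not independent")
            seen.add(i)
    return True
```

For graphs the same check is much harder, which is why the slide singles out Partition: Graph[T] as an open problem.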
27
Coordination & Composition
Coordination is used to create parallelism, supporting the parallelism patterns of applications:
Data parallelism (degree = data size, not core count); may be nested: forall images, forall blocks in image
Divide-and-conquer (parallelism from recursion)
Event-driven: nondeterminism at the algorithm level (e.g., the Branch & Bound dwarf)
Examples: DCT, Delaunay, sparse LU, etc.
Serial semantics with limited nondeterminism:
Data parallelism comes from array/aggregate operations and loops without side effects (serial)
Divide-and-conquer parallelism comes from recursive function invocations with non-overlapping side effects (serial)
Event-driven programs are written as guarded atomic commands, which may be implemented as transactions
Discovered parallelism is mapped to available resources; techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints
28
Efficiency Layer
[Par Lab overview diagram, as on slide 9, with the efficiency layer (sketching, autotuners, legacy code, schedulers, communication & synchronization primitives, efficiency languages and their compilers) highlighted]
29
Efficiency Layer: Ground OS/HW Requirements
Create parallelism, manage resources: affinity, bandwidth
Queries: #cores, #memory spaces, etc.
Latency hiding (prefetch, DMA, ...) to enable bandwidth saturation, i.e., SpMV should run at peak bandwidth
Programming in legacy languages with threads
For portability across local RAM vs. cache: a global address space with a memory hierarchy; build by extending UPC
[Diagram: private on-chip memory, shared partitioned on-chip memory, and shared off-chip DRAM, with variables (x, y, l, m) placed across the partitions]
30
Efficiency Layer: Software RAMP
The efficiency layer is an abstract machine model + selective virtualization
Software RAMP provides add-ons:
Schedulers: add a runtime with dynamic scheduling for a dynamic task tree; add a semi-static schedule for a given graph (general task graph with weights & structure; divide-and-conquer with task stealing)
Memory movement / sharing primitives: support to manage local RAM
Synchronization primitives
Use of verification and testing at this level
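"Div&Conq with task stealing" refers to the classic scheme where each worker owns a deque: the owner pushes and pops at one end, and an idle worker steals from the other end of a victim's deque. The sketch below is a serial simulation of that policy (real schedulers use concurrent deques); the function name is illustrative.

```python
# Serial simulation of work stealing: owners pop from the right (LIFO,
# good locality); thieves steal the oldest task from the left (FIFO,
# tends to steal large subtrees in divide-and-conquer).
from collections import deque
import random

def run_work_stealing(tasks, n_workers, seed=0):
    rng = random.Random(seed)
    deques = [deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):
        deques[i % n_workers].append(t)
    done = []
    while any(deques):
        for w in range(n_workers):
            if deques[w]:
                done.append(deques[w].pop())          # local LIFO pop
            else:
                victims = [v for v in range(n_workers) if deques[v]]
                if victims:
                    done.append(deques[rng.choice(victims)].popleft())  # steal
    return done
```

Every task runs exactly once regardless of how the load was distributed, which is the scheduler property the runtime add-on must guarantee.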
31
Synthesis via Sketching and Autotuning
Autotuning: efficient by search
Examples: spectral (FFTW, SPIRAL), dense (PHiPAC, ATLAS), sparse (OSKI), structured grids (stencils)
Can select among algorithm/data-structure changes not producible by compiler transformations
Extensive tuning knobs at the efficiency level; performance feedback from hardware and OS
Sketching: correct by construction
[Diagram: a simple spec (3-loop 3D stencil) plus an optimized skeleton sketch (5 loops, missing some indices/bounds) synthesize optimized code (tiled, prefetched, time-skewed)]
Optimization: 1.5X more entries (zeros), 1.5X speedup. Compilers won't do this!
32
Autotuning Sparse Matrices
Sparse matrix-vector multiply (SpMV) tuning steps:
Register-block to compress the matrix data structure (choose r1 x r2)
Cache-block so the corresponding vectors fit in local memory (c1 x c2)
Parallelize by dividing the matrix evenly (p1 x p2)
Prefetch for some distance d
Machine-specific code for SSE, etc.
[Chart: GFlop/s (0 to 3) for naïve serial, naïve parallel, and tuned parallel SpMV on matrices Protein, FEM1-3, Epidem, Circuit, LP, on Opteron (2x dual) and Clovertown (2x quad)]
Autotuning is more important than parallelism on these machines!
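The tuning loop itself is just empirical search: time each candidate kernel on the actual machine and matrix, and keep the fastest. A toy sketch (hypothetical names; a baseline CSR SpMV stands in for the generated block variants):

```python
# Sketch of OSKI/ATLAS-style empirical search. In a real autotuner the
# candidate list would hold generated r1 x r2 register-blocked variants;
# here a single baseline CSR kernel illustrates the interface.
import time

def spmv_csr(vals, cols, row_ptr, x):
    # Baseline CSR sparse matrix-vector multiply: y = A @ x.
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += vals[k] * x[cols[k]]
        y[r] = s
    return y

def autotune(kernels, args, trials=3):
    # Pick the fastest kernel by measuring on this machine; all kernels
    # are assumed to compute the same result.
    best, best_t = None, float("inf")
    for kernel in kernels:
        t = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            kernel(*args)
            t = min(t, time.perf_counter() - start)
        if t < best_t:
            best, best_t = kernel, t
    return best
```

Because the choice among block sizes changes the data structure itself, this selection is exactly what the slide says a compiler transformation cannot produce.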
33
Theme 3 Summary
Two layers for two types of programmers
Productivity layer: safety
The Composition & Coordination language has serial semantics with controlled nondeterminism
Frameworks and libraries have enforced contracts and leverage expert programmers by reusing library/framework code
Efficiency layer: expressiveness and portability
The base is an abstract machine model; Software RAMP provides selective virtualization (add-ons); sketching and autotuning support library/framework development
34
Theme 4: OS and Architecture
[Par Lab overview diagram, as on slide 9, with the OS and architecture layers (legacy OS, OS libraries & services, hypervisor, multicore/GPGPU, RAMP manycore) highlighted]
35
Why change OS and Architecture?
Old world: the OS is the only parallel code; user jobs are sequential. New world: many user applications are highly parallel
The efficiency layer wants to control its own scheduling of threads onto cores (e.g., select non-virtualized or virtualized)
Legacy OS + virtual machine layers make this intractable: VMM scheduler + guest OS scheduler + application scheduler?
Parallel applications need efficient inter-thread communication and synchronization operations
Legacy architectures have poor inter-thread communication (cache-coherent shared memory) and unwieldy synchronization (spin locks)
Managing data movement is important for the efficiency layer, which must be able to saturate memory bandwidth with independent requests
Legacy OS + caches/prefetching make bad decisions about where data should sit relative to the accessing thread within an application
36
Par Lab OS and Architecture Theme
Deconstruct the OS and architecture into software-composable primitives, not pre-packaged solutions
Very thin hypervisor layer providing hierarchical partitions: protection only, no scheduling/multiplexing in the hypervisor; traditional OS functionality lives in user libraries and service partitions; the efficiency layer takes over thread scheduling within its partition
Low-overhead hardware synchronization primitives: barriers + signals, atomic memory operations (fetch+op), user-level active messages; memory read/write-set tracking to support hybrid hardware/software coherence, transactional memory (TM), race detection, and checkpointing; the efficiency layer uses and exports customized synchronization facilities
Configurable memory hierarchy: private and/or shared caches, local RAM, DMA engine; the efficiency layer controls whether and how to cache data, and controls data movement around the system
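A fetch+op primitive is exactly what a fast barrier needs. As a sketch, here is a sense-reversing barrier in Python threads, with a lock-protected counter standing in for the hardware atomic fetch-and-add the slide describes (the class name is hypothetical):

```python
# Sketch: sense-reversing barrier over a fetch-and-add style counter.
# A lock + condition variable emulate the hardware fetch+op primitive.
import threading

class FetchAddBarrier:
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            my_sense = not self.sense
            self.count += 1              # the "fetch+add"
            if self.count == self.n:     # last arriver releases everyone
                self.count = 0
                self.sense = my_sense    # flip sense: barrier is reusable
                self.cv.notify_all()
            else:
                while self.sense != my_sense:
                    self.cv.wait()
```

In hardware the fetch+add and the sense flag would be single memory operations, which is why the slide counts barrier+signal support among the low-overhead primitives worth exposing.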
37
Partition-Based Deconstructed OS
The hypervisor is a very thin software layer over the hardware, providing only a resource-protection mechanism, no resource multiplexing
Traditional OS functionality is provided as libraries (cf. Exokernel) and service partitions (e.g., device drivers)
The hypervisor allows a partition to subdivide into hierarchical child partitions, where the parent retains control over its children. Motivating examples: protecting the browser from third-party plug-ins; predictable real time for audio-processing units in the music application; supporting legacy parallel libraries/frameworks with their runtime in its own partition; a trusted OS or VM (e.g., Singularity or a Java VM) can avoid mandatory hardware protection checking within its own partition
[Diagram: a root partition containing a web browser with plug-ins in a Singularity child partition, a music framework with UI and audio-unit partitions, a legacy OS running processes, and a device-driver service partition]
38
Hardware Partitioning Support
A partition can contain its own cores, L1 and L2 $/RAM, DRAM, and interconnect bandwidth allocation
Inter-partition communication goes through protected shared memory
"Un-virtual" partitions give the efficiency layer complete "bare metal" control of the hardware
Per-partition performance counters give feedback to runtimes/autotuners
Benefits: efficiency (fewer layers, custom OS); enables new exposed HW primitives; performance isolation/predictability; robustness to faults/errors; security
[Diagram: CPU + L1 + L2-bank + DRAM slices joined by an L2 interconnect and a DRAM & I/O interconnect, divided into Partition 1 and Partition 2 with a protected shared-memory region between them]
39
Example: 16x16 Array of Cores
Customized Parallel Processors
Efficiency-layer software uses hardware synchronization and communication primitives to glue cores together into a customized application "processor"
A partition is always swapped in en masse; software protocols can support dynamic partition resizing
[Diagram: partition boundaries on a 16x16 core array; only cores shown, with separately sizable allocations for other resources, e.g., L2 memory]
If "processors are the new transistors," then "partitions are the new processors"
40
RAMP Manycore Prototype
A separate multi-university RAMP project is building FPGA emulation infrastructure: BEE3 boards with Chuck Thacker/Microsoft
Expect to fit hundreds of 64-bit cores with full instrumentation in one rack
Runs at ~100 MHz, fast enough for application-software development
Flexible cycle-accurate timing models: what if DRAM latency is 100 cycles? 200? 1000? What if a barrier takes 5 cycles? 20? 50?
"Tapeout" every day, to incorporate feedback from the application and software layers
Rapidly distribute hardware ideas to the larger community
[Photo: RAMP Blue, July 2007. 1000+ RISC cores @ 90 MHz. Works! Runs the UPC version of the NAS parallel benchmarks.]
41
Summary: A Berkeley View 2.0
Our industry has bet its future on parallelism (!) Recruit the best minds to help?
Try apps-driven vs. CS solution-driven research
Dwarfs as lingua franca, anti-benchmarks, ...
Efficiency layer for ≈ 10% of today's programmers
Productivity layer for ≈ 90% of today's programmers
C&C language to help compose and coordinate
Autotuners vs. parallelizing compilers
OS & HW: composable primitives vs. solutions
[Par Lab overview diagram, as on slide 9: easy to write correct programs that run efficiently on manycore]
42
Acknowledgments
Berkeley View material comes from discussions with:
Profs. Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, John Wawrzynek, Kathy Yelick
UCB grad students Bryan Catanzaro, Jike Chong, Joe Gebis, William Plishker, and Sam Williams
LBNL: Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf
See view.eecs.berkeley.edu
RAMP is based on the work of the RAMP developers: Krste Asanovic (Berkeley), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), and John Wawrzynek (Berkeley, PI)
See ramp.eecs.berkeley.edu