1
Parallel Applications
Parallel Hardware
Parallel Software
The Parallel Computing Landscape: A View from Berkeley 2.0
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick
Dagstuhl, September 3, 2007
2
A Parallel Revolution, Ready or Not
Embedded: per-product ASICs give way to programmable platforms; a multicore chip is the most competitive path (amortize design costs + reduce design risk + flexible platforms)
PC, Server: Power Wall + Memory Wall = Brick Wall; the end of the way we've scaled uniprocessors for the last 40 years
New Moore's Law: 2X processors ("cores") per chip every technology generation, but the same clock rate
"This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs ...; instead, this ... is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions." (The Parallel Computing Landscape: A Berkeley View)
A sea change for the HW & SW industries, since it changes the model of programming and debugging
3
[Chart: millions of PCs sold per year (0 to 250), 1985 to 2015]
P.S. The Parallel Revolution May Fail
John Hennessy, President, Stanford University, 1/07: "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
"A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4:10, 1/07.
100% failure rate of parallel computer companies: Convex, Encore, MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Transputer, Thinking Machines, ...
What if IT goes from a growth industry to a replacement industry? If SW can't effectively use 8, 16, 32, ... cores per chip, SW is no faster on a new computer, and customers buy only when the old computer wears out.
4
Need a Fresh Approach to Parallelism
Berkeley researchers from many backgrounds have been meeting since Feb. 2005 to discuss parallelism: Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, ...
Backgrounds: circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
Tried to learn from successes in high-performance computing (LBNL) and parallel embedded computing (BWRC)
Led to the "Berkeley View" tech report and the new Parallel Computing Laboratory ("Par Lab")
Goal: productive, efficient, correct programming of 100+ cores, scaling as cores double every 2 years (!)
5
Why Target 100+ Cores?
A 5-year research program should aim 8+ years out
Multicore: 2X cores / 2 yrs ≈ 64 cores in 8 years
Manycore: 8X to 16X multicore
[Chart: projected cores per chip (1 to 512, log scale) vs. year, 2003 to 2015; labeled region: Automatic Parallelization, Thread-Level Speculation]
6
Re-inventing Client/Server
Laptop/handheld as the future client, datacenter as the future server; focus on a single socket (laptop/handheld or blade)
"The Datacenter is the Computer": building-sized computers (Microsoft, ...)
"The Laptop/Handheld is the Computer": '08: Dell laptops outnumber desktops? '06 profit? 1B cell phones/yr, increasing in function. Otellini demoed the "Universal Communicator," a combination cell phone, PC, and video device; Apple iPhone
7
Try Application-Driven Research?
Conventional wisdom in CS research: users don't know what they want, so computer scientists solve individual parallel problems with a clever language feature (e.g., futures), a new compiler pass, or a novel hardware widget (e.g., SIMD)
Approach: push (foist?) CS nuggets/solutions on users. Problem: stupid users don't learn/use the proper solution
Another approach: work with domain experts developing compelling apps; provide the HW/SW infrastructure necessary to build, compose, and understand parallel software written in multiple languages; let research be guided by commonly recurring patterns actually observed while developing compelling apps
8
4 Themes of View 2.0 / Par Lab
1. Applications: compelling apps driving the research agenda top-down
2. Identify computational bottlenecks: breaking through disciplinary boundaries
3. Developing parallel software with productivity, efficiency, and correctness: 2 layers + C&C language, autotuning
4. OS and architecture: composable primitives, not pre-packaged solutions; deconstruction, fast barrier synchronization, partitions
9
Par Lab Research Overview: easy to write correct programs that run efficiently on manycore
[Stack diagram]
Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Dwarfs
Productivity Layer: Composition & Coordination Language (C&CL), C&CL compiler/interpreter, parallel libraries, parallel frameworks
Efficiency Layer: sketching, autotuners, legacy code, schedulers, communication & synchronization primitives, efficiency languages, efficiency-language compilers
Correctness (spans both layers): static verification, type systems, directed testing, dynamic checking, debugging with replay
OS: legacy OS, OS libraries & services, hypervisor
Arch.: multicore/GPGPU, RAMP manycore
10
Theme 1. Applications. What are the problems?
"Who needs 100 cores to run M/S Word?" We need compelling apps that use 100s of cores.
How did we pick applications?
1. Enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology
2. Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
3. Requires significant speed-up, or a smaller, more efficient platform, to work as intended
4. As a whole, the applications cover the most important platforms (handheld, laptop, games) and markets (consumer, business, health)
11
Coronary Artery Disease
Modeling to help patient compliance? 400K deaths/year, 16M with symptoms, 72M with high blood pressure
Massively parallel, real-time variations: CFD + FE; solid (non-linear), fluid (Newtonian), pulsatile; blood pressure, activity, habitus, cholesterol
[Figure: simulation, before and after]
12
Content-Based Image Retrieval
[Diagram: query by example → similarity metric over an image database (1000's of images) → candidate results → relevance feedback → final result]
Built around key characteristics of personal databases: very large number of pictures (>5K), non-labeled images, many pictures of few people, complex pictures including people, events, places, and objects
13
Compelling Laptop/Handheld Apps
Meeting Diarist: laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
Teleconference speaker identifier, speech helper: L/Hs used for a teleconference identify who is speaking and provide a "closed caption" hint of what is being said
14
Compelling Laptop/Handheld Apps
Musicians have an insatiable appetite for computation: more channels, more instruments, more processing, more interaction! Latency must be low (5 ms); must be reliable (no clicks)
Music Enhancer: enhanced sound delivery systems for home sound systems using large microphone and speaker arrays; laptop/handheld recreates 3D sound over ear buds
Hearing Augmenter: laptop/handheld as an accelerator for a hearing aid
Novel Instrument User Interface: new composition and performance systems beyond keyboards; input device for laptop/handheld
The Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters.
15
Parallel Browser
Goal: desktop-quality browsing on handhelds, enabled by 4G networks and better output devices
Bottlenecks to parallelize: parsing, rendering, scripting
SkipJax: a parallel replacement for JavaScript/AJAX, based on Brown's FlapJax
16
Theme 2. Use computational bottlenecks instead of conventional benchmarks?
How do we invent the parallel systems of the future while tied to the old code, programming models, and CPUs of the past?
Look for common patterns of communication and computation in:
1. Embedded computing (42 EEMBC benchmarks)
2. Desktop/server computing (28 SPEC2006)
3. Database / text mining software
4. Games/graphics/vision
5. Machine learning
6. High-performance computing (the original "7 Dwarfs")
Result: 13 "Dwarfs"
17
How do compelling apps relate to the 13 dwarfs?
[Table: dwarf popularity (red = hot, blue = cool) across columns Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser]
The 13 dwarfs: 1. Finite State Machines, 2. Combinational Logic, 3. Graph Traversal, 4. Structured Grid, 5. Dense Matrix, 6. Sparse Matrix, 7. Spectral (FFT), 8. Dynamic Programming, 9. N-Body, 10. MapReduce, 11. Backtrack / Branch & Bound, 12. Graphical Models, 13. Unstructured Grid
18
What about I/O in the Dwarfs?
High-speed serial lines mean many parallel I/O channels per chip
Storage will be much faster: "Flash is disk, disk is tape," at least for laptop/handheld; iPod, camera, cellphone, and laptop dollars rapidly improve flash memory
New technologies are even more promising: Phase-Change RAM is denser, writes faster, and wears out 100X more slowly
Disk → flash means a 10,000X improvement in latency (10 milliseconds to 1 microsecond)
Network will be much faster: 10 Gbit/s copper is standard, 100 Gbit/s is coming; parallel serial channels multiply bandwidth; we already need multiple fat cores to handle TCP/IP at 10 Gbit/s; Super 3G → 4G, 100 Mb/s → 1 Gb/s WAN wireless to the handset
19
4 Valuable Roles of Dwarfs
1. "Anti-benchmarks": dwarfs are not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware
2. Universal, understandable vocabulary: to talk across disciplinary boundaries
3. Bootstrapping: parallelize parallel research; allow analysis of HW & SW designs without waiting years for full apps
4. Targets for libraries (see later)
20
Themes 1 and 2 Summary
Application-driven research vs. CS solution-driven research
Drill down on (initially) 5 app areas to guide the research agenda
Dwarfs represent a broader set of apps to guide the research agenda
Dwarfs help break through traditional interfaces: benchmarking, multidisciplinary conversations, parallelizing parallel research, and targets for libraries
21
Theme 3: Developing Parallel Software
[Par Lab overview diagram, as on slide 9, with the software stack (C&CL, parallel libraries and frameworks, efficiency layer, correctness tools) highlighted]
22
Dwarfs become Libraries or Frameworks
Observation: dwarfs are of 2 types
Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
Patterns/Frameworks: MapReduce; graph traversal, graphical models; dynamic programming; backtracking/B&B; N-body; (un)structured grid
Algorithms in the dwarfs can be implemented either as compact parallel computations within a traditional library, or as a compute/communicate pattern implemented as a framework
Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce)
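The FFT example above can be sketched concretely: a map-style framework is instantiated with a 1D transform, and transposes stitch the rows and columns together. The names `map_frame` and `transform_2d` are illustrative stand-ins, not Par Lab APIs, and the "framework" runs serially here for clarity.

```python
# Sketch: build a 2D transform by instantiating a map framework with a
# 1D transform, then transposing (row-column decomposition, as for 2D FFT).
# All names are hypothetical illustrations, not the actual C&CL design.

def map_frame(work, items):
    # Stand-in for a parallel map framework; executed serially here.
    # The framework, not the library author, would own the parallelism.
    return [work(x) for x in items]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

def transform_2d(rows, transform_1d):
    # Apply the 1D transform to every row, transpose, transform the new
    # rows, and transpose back.
    step1 = transpose(map_frame(transform_1d, rows))
    return transpose(map_frame(transform_1d, step1))
```

With `transform_1d` a real 1D FFT this yields a 2D FFT; with any other row transform it yields the corresponding separable 2D operation, which is the sense in which the library is "built by instantiating" the framework.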
23
Two Layers for Two Types of Programmer
Efficiency Layer (10% of today's programmers): expert programmers build frameworks & libraries, hypervisors, ...; "bare metal" efficiency is possible at this layer
Productivity Layer (90% of today's programmers): domain experts / naïve programmers productively build parallel apps using frameworks & libraries; frameworks & libraries are composed to form app frameworks
Effective composition techniques allow the efficiency programmers to be highly leveraged
Create a language for Composition and Coordination (C&C)
24
Productivity Layer
[Par Lab overview diagram, as on slide 9, with the productivity layer (C&CL, C&CL compiler/interpreter, parallel libraries, parallel frameworks) highlighted]
25
C&C Language Requirements
Applications specify the C&C language requirements:
Constructs for creating application frameworks
Primitive parallelism constructs: data parallelism, divide-and-conquer parallelism, event-driven execution
Constructs for composing frameworks: frameworks require independence, and independence is proven at instantiation with a variety of techniques
Needs low runtime overhead and the ability to measure and control performance
26
Coordination & Composition
Composition is key to software reuse
Solutions exist for libraries with hidden parallelism: partitions in the OS help runtime composition
Instantiating parallel frameworks is harder: the framework specifies the required independence, e.g., operations in map and divide-and-conquer must not interfere
Guaranteed independence through types: type-system extensions (side effects, ownership); extra: Image READ Array[double]
Data decomposition may be implicit or explicit: Partition: Array[T] → List[Array[T]] (well understood); Partition: Graph[T] → List[Graph[T]]
Efficiency-layer code has these specifications at interfaces, where they are verified, tested, or asserted
Independence is proven by checking side effects and overlap at instantiation
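The "overlap check at instantiation" idea can be made concrete with a small sketch. The names below (`partition`, `check_independent`) are illustrative, not the C&CL design: a partition of an array's index space is accepted only if its pieces are pairwise disjoint, so a framework applying workers to the pieces cannot interfere.

```python
# Illustrative sketch of independence checking at instantiation.
# Names are hypothetical; the real system would use types/verification.

def partition(n, k):
    # Split indices 0..n-1 into k contiguous, non-overlapping ranges,
    # the well-understood Array[T] -> List[Array[T]] case.
    bounds = [i * n // k for i in range(k + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(k)]

def check_independent(pieces):
    # The "overlap check": no index may appear in two pieces, so
    # writes through different pieces cannot interfere.
    seen = set()
    for piece in pieces:
        for i in piece:
            if i in seen:
                raise ValueError("pieces overlap: not independent")
            seen.add(i)
    return True
```

For graphs the same check is much harder, which is why the slide singles out Partition: Graph[T] as an open problem.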
27
Coordination & Composition
Coordination is used to create parallelism, supporting the parallelism patterns of applications:
Data parallelism (degree = data size, not core count); may be nested: forall images, forall blocks in image
Divide-and-conquer (parallelism from recursion)
Event-driven: nondeterminism at the algorithm level (e.g., the Branch & Bound dwarf)
Examples: DCT, Delaunay, sparse LU, etc.
Serial semantics with limited nondeterminism:
Data parallelism comes from array/aggregate operations and loops without side effects (serial)
Divide-and-conquer parallelism comes from recursive function invocations with non-overlapping side effects (serial)
Event-driven programs are written as guarded atomic commands, which may be implemented as transactions
Discovered parallelism is mapped to available resources; techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints
28
Efficiency Layer
[Par Lab overview diagram, as on slide 9, with the efficiency layer (sketching, autotuners, legacy code, schedulers, communication & synchronization primitives, efficiency languages and their compilers) highlighted]
29
Efficiency Layer: Ground OS/HW Requirements
Create parallelism, manage resources: affinity, bandwidth
Queries: #cores, #memory spaces, etc.
Latency hiding (prefetch, DMA, ...) to enable bandwidth saturation, i.e., SpMV should run at peak bandwidth
Programming in legacy languages with threads
For portability across local RAM vs. cache: a global address space with a memory hierarchy; build by extending UPC
[Diagram: private on-chip memory, shared partitioned on-chip memory, and shared off-chip DRAM, with variables (x, y, l, m) placed across the partitions]
30
Efficiency Layer: Software RAMP
The efficiency layer is an abstract machine model + selective virtualization
Software RAMP provides add-ons:
Schedulers: add a runtime with dynamic scheduling for a dynamic task tree; add a semi-static schedule for a given graph (general task graph with weights & structure; divide-and-conquer with task stealing)
Memory movement / sharing primitives: support to manage local RAM
Synchronization primitives
Use of verification and testing at this level
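"Div&Conq with task stealing" refers to the classic scheme where each worker owns a deque: the owner pushes and pops at one end, and an idle worker steals from the other end of a victim's deque. The sketch below is a serial simulation of that policy (real schedulers use concurrent deques); the function name is illustrative.

```python
# Serial simulation of work stealing: owners pop from the right (LIFO,
# good locality); thieves steal the oldest task from the left (FIFO,
# tends to steal large subtrees in divide-and-conquer).
from collections import deque
import random

def run_work_stealing(tasks, n_workers, seed=0):
    rng = random.Random(seed)
    deques = [deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):
        deques[i % n_workers].append(t)
    done = []
    while any(deques):
        for w in range(n_workers):
            if deques[w]:
                done.append(deques[w].pop())          # local LIFO pop
            else:
                victims = [v for v in range(n_workers) if deques[v]]
                if victims:
                    done.append(deques[rng.choice(victims)].popleft())  # steal
    return done
```

Every task runs exactly once regardless of how the load was distributed, which is the scheduler property the runtime add-on must guarantee.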
31
Synthesis via Sketching and Autotuning
Autotuning: efficient by search
Examples: spectral (FFTW, SPIRAL), dense (PHiPAC, ATLAS), sparse (OSKI), structured grids (stencils)
Can select among algorithm/data-structure changes not producible by compiler transformations
Extensive tuning knobs at the efficiency level; performance feedback from hardware and OS
Sketching: correct by construction
[Diagram: a simple spec (3-loop 3D stencil) plus an optimized skeleton sketch (5 loops, missing some indices/bounds) synthesize optimized code (tiled, prefetched, time-skewed)]
Optimization: 1.5X more entries (zeros), 1.5X speedup. Compilers won't do this!
32
Autotuning Sparse Matrices
Sparse matrix-vector multiply (SpMV) tuning steps:
Register-block to compress the matrix data structure (choose r1 x r2)
Cache-block so the corresponding vectors fit in local memory (c1 x c2)
Parallelize by dividing the matrix evenly (p1 x p2)
Prefetch for some distance d
Machine-specific code for SSE, etc.
[Chart: GFlop/s (0 to 3) for naïve serial, naïve parallel, and tuned parallel SpMV on matrices Protein, FEM1-3, Epidem, Circuit, LP, on Opteron (2x dual) and Clovertown (2x quad)]
Autotuning is more important than parallelism on these machines!
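The tuning loop itself is just empirical search: time each candidate kernel on the actual machine and matrix, and keep the fastest. A toy sketch (hypothetical names; a baseline CSR SpMV stands in for the generated block variants):

```python
# Sketch of OSKI/ATLAS-style empirical search. In a real autotuner the
# candidate list would hold generated r1 x r2 register-blocked variants;
# here a single baseline CSR kernel illustrates the interface.
import time

def spmv_csr(vals, cols, row_ptr, x):
    # Baseline CSR sparse matrix-vector multiply: y = A @ x.
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += vals[k] * x[cols[k]]
        y[r] = s
    return y

def autotune(kernels, args, trials=3):
    # Pick the fastest kernel by measuring on this machine; all kernels
    # are assumed to compute the same result.
    best, best_t = None, float("inf")
    for kernel in kernels:
        t = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            kernel(*args)
            t = min(t, time.perf_counter() - start)
        if t < best_t:
            best, best_t = kernel, t
    return best
```

Because the choice among block sizes changes the data structure itself, this selection is exactly what the slide says a compiler transformation cannot produce.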
33
Theme 3 Summary
Two layers for two types of programmers
Productivity layer: safety
The Composition & Coordination language has serial semantics with controlled nondeterminism
Frameworks and libraries have enforced contracts and leverage expert programmers by reusing library/framework code
Efficiency layer: expressiveness and portability
The base is an abstract machine model; Software RAMP provides selective virtualization (add-ons); sketching and autotuning support library/framework development
34
Theme 4: OS and Architecture
[Par Lab overview diagram, as on slide 9, with the OS and architecture layers (legacy OS, OS libraries & services, hypervisor, multicore/GPGPU, RAMP manycore) highlighted]
35
Why change OS and Architecture?
Old world: the OS is the only parallel code; user jobs are sequential. New world: many user applications are highly parallel
The efficiency layer wants to control its own scheduling of threads onto cores (e.g., select non-virtualized or virtualized)
Legacy OS + virtual machine layers make this intractable: VMM scheduler + guest OS scheduler + application scheduler?
Parallel applications need efficient inter-thread communication and synchronization operations
Legacy architectures have poor inter-thread communication (cache-coherent shared memory) and unwieldy synchronization (spin locks)
Managing data movement is important for the efficiency layer, which must be able to saturate memory bandwidth with independent requests
Legacy OS + caches/prefetching make bad decisions about where data should sit relative to the accessing thread within an application
36
Par Lab OS and Architecture Theme
Deconstruct the OS and architecture into software-composable primitives, not pre-packaged solutions
Very thin hypervisor layer providing hierarchical partitions: protection only, no scheduling/multiplexing in the hypervisor; traditional OS functionality lives in user libraries and service partitions; the efficiency layer takes over thread scheduling within its partition
Low-overhead hardware synchronization primitives: barriers + signals, atomic memory operations (fetch+op), user-level active messages; memory read/write-set tracking to support hybrid hardware/software coherence, transactional memory (TM), race detection, and checkpointing; the efficiency layer uses and exports customized synchronization facilities
Configurable memory hierarchy: private and/or shared caches, local RAM, DMA engine; the efficiency layer controls whether and how to cache data, and controls data movement around the system
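A fetch+op primitive is exactly what a fast barrier needs. As a sketch, here is a sense-reversing barrier in Python threads, with a lock-protected counter standing in for the hardware atomic fetch-and-add the slide describes (the class name is hypothetical):

```python
# Sketch: sense-reversing barrier over a fetch-and-add style counter.
# A lock + condition variable emulate the hardware fetch+op primitive.
import threading

class FetchAddBarrier:
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            my_sense = not self.sense
            self.count += 1              # the "fetch+add"
            if self.count == self.n:     # last arriver releases everyone
                self.count = 0
                self.sense = my_sense    # flip sense: barrier is reusable
                self.cv.notify_all()
            else:
                while self.sense != my_sense:
                    self.cv.wait()
```

In hardware the fetch+add and the sense flag would be single memory operations, which is why the slide counts barrier+signal support among the low-overhead primitives worth exposing.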
37
Partition-Based Deconstructed OS
The hypervisor is a very thin software layer over the hardware, providing only a resource-protection mechanism, no resource multiplexing
Traditional OS functionality is provided as libraries (cf. Exokernel) and service partitions (e.g., device drivers)
The hypervisor allows a partition to subdivide into hierarchical child partitions, where the parent retains control over its children. Motivating examples: protecting the browser from third-party plug-ins; predictable real time for audio-processing units in the music application; supporting legacy parallel libraries/frameworks with their runtime in its own partition; a trusted OS or VM (e.g., Singularity or a Java VM) can avoid mandatory hardware protection checking within its own partition
[Diagram: a root partition containing a web browser with plug-ins in a Singularity child partition, a music framework with UI and audio-unit partitions, a legacy OS running processes, and a device-driver service partition]
38
Hardware Partitioning Support
A partition can contain its own cores, L1 and L2 $/RAM, DRAM, and interconnect bandwidth allocation
Inter-partition communication goes through protected shared memory
"Un-virtual" partitions give the efficiency layer complete "bare metal" control of the hardware
Per-partition performance counters give feedback to runtimes/autotuners
Benefits: efficiency (fewer layers, custom OS); enables new exposed HW primitives; performance isolation/predictability; robustness to faults/errors; security
[Diagram: CPU + L1 + L2-bank + DRAM slices joined by an L2 interconnect and a DRAM & I/O interconnect, divided into Partition 1 and Partition 2 with a protected shared-memory region between them]
39
Example: 16x16 Array of Cores
Customized Parallel Processors
Efficiency-layer software uses hardware synchronization and communication primitives to glue cores together into a customized application "processor"
A partition is always swapped in en masse; software protocols can support dynamic partition resizing
[Diagram: partition boundaries on a 16x16 core array; only cores shown, with separately sizable allocations for other resources, e.g., L2 memory]
If "processors are the new transistors," then "partitions are the new processors"
40
RAMP Manycore Prototype
A separate multi-university RAMP project is building FPGA emulation infrastructure: BEE3 boards with Chuck Thacker/Microsoft
Expect to fit hundreds of 64-bit cores with full instrumentation in one rack
Runs at ~100 MHz, fast enough for application-software development
Flexible cycle-accurate timing models: what if DRAM latency is 100 cycles? 200? 1000? What if a barrier takes 5 cycles? 20? 50?
"Tapeout" every day, to incorporate feedback from the application and software layers
Rapidly distribute hardware ideas to the larger community
[Photo: RAMP Blue, July 2007. 1000+ RISC cores @ 90 MHz. Works! Runs the UPC version of the NAS parallel benchmarks.]
41
Summary: A Berkeley View 2.0
Our industry has bet its future on parallelism (!) Recruit the best minds to help?
Try apps-driven vs. CS solution-driven research
Dwarfs as lingua franca, anti-benchmarks, ...
Efficiency layer for ≈ 10% of today's programmers
Productivity layer for ≈ 90% of today's programmers
C&C language to help compose and coordinate
Autotuners vs. parallelizing compilers
OS & HW: composable primitives vs. solutions
[Par Lab overview diagram, as on slide 9: easy to write correct programs that run efficiently on manycore]
42
Acknowledgments
Berkeley View material comes from discussions with:
Profs. Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, John Wawrzynek, Kathy Yelick
UCB grad students Bryan Catanzaro, Jike Chong, Joe Gebis, William Plishker, and Sam Williams
LBNL: Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf
See view.eecs.berkeley.edu
RAMP is based on the work of the RAMP developers: Krste Asanovic (Berkeley), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), and John Wawrzynek (Berkeley, PI)
See ramp.eecs.berkeley.edu