TRANSCRIPT
(How) Can Programmers Conquer the
Multicore Menace?
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
Today: The Happily Oblivious Average Joe Programmer
• Joe is oblivious about the processor
– Moore's law brings Joe performance
– Sufficient for Joe's requirements
• Joe has built a solid boundary between Hardware and Software
– High-level languages abstract away the processors
– Ex: Java bytecode is machine independent
• This abstraction has provided a lot of freedom for Joe
• Parallel programming is practiced by only a few experts
[Figure: Moore's Law. Uniprocessor performance (vs. VAX-11/780, log scale) and transistor counts (10,000 up to 1,000,000,000) for the 8086 through Itanium 2, 1978-2016; growth of 25%/year, then 52%/year, then ??%/year. From David Patterson; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
[Figure: Uniprocessor Performance (SPECint). The same chart, highlighting the 25%/year and 52%/year growth eras. From David Patterson; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
[Figure: Uniprocessor Performance (SPECint), repeated with the uncertain ??%/year era added after 2004.]
Squandering of Moore's Dividend
• 10,000x performance gain in 30 years! (~46% per year)
• Where did this performance go?
• In the last decade we concentrated on correctness and programmer productivity
• Little to no emphasis on performance
• This is reflected in:
– Languages
– Tools
– Research
– Education
• Software Engineering: the only engineering discipline where performance or efficiency is not a central theme
Matrix Multiply: An Example of Unchecked Excesses
• Abstraction and Software Engineering
– Immutable Types
– Dynamic Dispatch
– Object Oriented
• High Level Languages
• Memory Management
– Transpose for unit stride
– Tile for cache locality
• Vectorization
• Prefetching
• Parallelization
[Figure annotations: speedup factors 2,271x, 296,260x, 522x, 1,117x, 7,514x, 12,316x, 33,453x, 87,042x, 220x]
Matrix Multiply: An Example of Unchecked Excesses
• Typical Software Engineering Approach
– In Java
– Object oriented
– Immutable
– Abstract types
– No memory optimizations
– No parallelization
• Good Performance Engineering Approach
– In C/Assembly
– Memory optimized (blocked)
– BLAS libraries
– Parallelized (to 4 cores)
• In Comparison: Lowest to Highest MPG in transportation
[Figure annotations: 14,700x, 296,260x, 294,000x]
Joe the Parallel Programmer
• Moore's law is not bringing any more performance gains
• If Joe needs performance he has to deal with multicores
– Joe has to deal with performance
– Joe has to deal with parallelism
Joe
Why Parallelism is Hard
• A huge increase in complexity and work for the programmer
– The programmer has to think about performance!
– Parallelism has to be designed in at every level
• Programmers are trained to think sequentially
– Deconstructing problems into parallel tasks is hard for many of us
• Parallelism is not easy to implement
– Parallelism cannot be abstracted or layered away
– Code and data have to be restructured in very different (non-intuitive) ways
• Parallel programs are very hard to debug
– Combinatorial explosion of possible execution orderings
– Race condition and deadlock bugs are non-deterministic and elusive
– Non-deterministic bugs go away in the lab environment and with instrumentation
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
– Joint work with Marek Olszewski and Jason Ansel
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
Racing for Lock Acquisition
• Two threads
– Start at the same time
– 1st thread: 1000 instructions to the lock acquisition
– 2nd thread: 1100 instructions to the lock acquisition
[Figure: instruction count vs. time for the two racing threads]
Non-Determinism
• Inherent in parallel applications
– Accesses to shared data can experience many possible interleavings
– New! This was not the case for sequential applications
– Almost never part of program specifications
– Even the simplest parallel programs, e.g. a work queue, are non-deterministic
• Non-determinism is undesirable
– Hard to create programs with repeatable results
– Difficult to perform cyclic debugging
– Testing offers weaker guarantees
Deterministic Multithreading
• Observation:
– Non-determinism need not be a required property of threads
– We can interleave thread communication in a deterministic manner
– Call this Deterministic Multithreading
• Deterministic multithreading:
– Makes debugging easier
– Tests offer guarantees again
– Supports existing programming models/languages
– Allows programmers to "determinize" computations that have previously been difficult to determinize using today's programming idioms
– e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and Delaunay Triangulation (Kulkarni et al. 2008)
Deterministic Multithreading
• Strong Determinism
– Deterministic interleaving of all accesses to shared data for a given input
– Attractive, but difficult to achieve efficiently without hardware support
• Weak Determinism
– Deterministic interleaving of all lock acquisitions for a given input
– Cheaper to enforce
– Offers the same guarantees as strong determinism for data-race-free program executions
– Data-race freedom can be checked with a dynamic race detector!
Kendo
• A Prototype Deterministic Locking Framework
– Provides Weak Determinism for C and C++ code
– Runs on commodity hardware today!
– Implements a subset of the pthreads API
– Enforces determinism without sacrificing load balance
– Tracks the progress of threads to dynamically construct the deterministic interleaving: Deterministic Logical Time
– Incurs low performance overhead (16% geomean on Splash2)
Deterministic Logical Time
• Abstract counterpart to physical time
– Used to deterministically order events on an SMP machine
– Necessary to construct the deterministic interleaving
• Represented as P independently updated deterministic logical clocks
– Not updated based on the progress of other threads (unlike Lamport clocks)
– Event1 (on Thread 1) occurs before Event2 (on Thread 2) in Deterministic Logical Time if Thread 1 has a lower deterministic logical clock than Thread 2 at the time of the events
Deterministic Logical Clocks
• Requirements
– Must be based on events that are deterministically reproducible from run to run
– Must track the progress of threads in physical time as closely as possible (for better load balancing of the deterministic interleaving)
– Must be cheap to compute
– Must be portable across micro-architectures
– Must be stored in memory for other threads to observe
Deterministic Logical Clocks
• Some x86 performance counter events satisfy many of these requirements
– Chose the "Retired Store Instructions" event
• Required changes to the Linux kernel
– Performance counters are accessible only at kernel level
– Added an interrupt service routine that increments each thread's deterministic logical clock (in memory) on every performance counter overflow
– The frequency of overflows can be controlled
Locking Algorithm
• Construct a deterministic interleaving of lock acquires from the deterministic logical clocks
– Simulate the interleaving that would occur if running in deterministic logical time
• Uses the concept of a turn. It is a thread's turn when:
– All threads with smaller IDs have greater deterministic logical clocks
– All threads with larger IDs have greater or equal deterministic logical clocks
Locking Algorithm

function det_mutex_lock(l) {
    pause_logical_clock();
    wait_for_turn();
    lock(l);
    inc_logical_clock();
    enable_logical_clock();
}

function det_mutex_unlock(l) {
    unlock(l);
}
Example (deterministic logical time vs. physical time)
• Thread 1 reaches det_lock(a) with deterministic logical clock t=25 while Thread 2 is at t=18, so Thread 1 blocks in wait_for_turn().
• Thread 2 reaches det_lock(a) at t=22, passes wait_for_turn() (its clock is lowest), and calls lock().
• Thread 2 will always acquire the lock first, regardless of physical timing!
• Once Thread 2's clock passes t=25 (e.g. t=26), it becomes Thread 1's turn; Thread 1 calls lock(a) and blocks on the held lock.
• Thread 2 releases the lock with det_unlock(a) (by then at t=32); Thread 1 acquires it and increments its clock (to t=28).
Locking Algorithm Improvements
• Eliminate deadlocks in nested locks
– Make the thread increment its deterministic logical clock while it spins on the lock
– Must do so deterministically
• Queuing for fairness
• Lock priority boosting
• See the ASPLOS 2009 paper on Kendo for details
Evaluation
• Methodology
– Converted the Splash2 benchmark suite to use the Kendo framework
– Eliminated data races
– Checked determinism by examining the output and the final deterministic logical clocks of each thread
• Experimental Framework
– Processor: Intel Core2 Quad-core running at 2.66GHz
– OS: Linux 2.6.23 (modified for performance counter support)
Results
[Figure: execution time relative to non-deterministic (0 to 1.6) for tsp, quicksort, ocean, barnes, radiosity, raytrace, fmm, volrend, water-nsqrd, and the mean; bars split into Application Time, Interrupt Overhead, and Deterministic Wait Overhead]
Effect of Interrupt Frequency
[Figure: execution time relative to non-deterministic (0 to 5) for interrupt periods from 64 to 16K; bars split into Application Time, Interrupt Overhead, and Deterministic Wait Overhead]
Related Work
• DMP – Deterministic Multiprocessing
– Hardware design that provides Strong Determinism
• StreamIt Language
– Streaming programming model that allows only one interleaving of inter-thread communication
• Cilk Language
– Fork/join programming model that can produce programs whose semantics always match a deterministic "serialization" of the code
– Cannot be used with locks
– Must be data-race free (can be checked with a Cilk race detector)
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
– Joint work with Jason Ansel, Cy Chan, Yee Lok Wong, Qin Zhao, and Alan Edelman
• Conquering the Multicore Menace
Observation 1: Algorithmic Choice
• For many problems there are multiple algorithms
– In most cases there is no single winner
– An algorithm will be the best performing for a given:
– Input size
– Amount of parallelism
– Communication bandwidth / synchronization cost
– Data layout
– The data itself (sparse data, convergence criteria, etc.)
• Multicores expose many of these to the programmer
– Exponential growth of cores (impact of Moore's law)
– Wide variation of memory systems, types of cores, etc.
• No single algorithm can be the best for all cases
Observation 2: Natural Parallelism
• The world is a parallel place
– It is natural to many, e.g. mathematicians
– ∑, sets, simultaneous equations, etc.
• It seems that computer scientists have a hard time thinking in parallel
– We have unnecessarily imposed sequential ordering on the world
– Statements executed in sequence
– for i = 1 to n
– Recursive decomposition (given f(n), find f(n+1))
• This was useful at one time to limit complexity... but is a big problem in the era of multicores
Observation 3: Autotuning
• In the good old days: model-based optimization
• Now
– Machines are too complex to model accurately
– Compiler passes have many subtle interactions
– Thousands of knobs and billions of choices
• But...
– Computers are cheap
– We can do end-to-end execution of multiple runs
– Then use machine learning to find the best choice
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
}
• Implicitly parallel description
[Diagram: A is c wide and h tall, B is w wide and c tall, AB is w wide and h tall; the base rule computes AB cell (x,y) from A row y and B column x]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
// Recursively decompose in c
to(AB ab)
from(A.region(0, 0, c/2, h ) a1,
A.region(c/2, 0, c, h ) a2,
B.region(0, 0, w, c/2) b1,
B.region(0, c/2, w, c ) b2) {
ab = MatrixAdd(MatrixMultiply(a1, b1),
MatrixMultiply(a2, b2));
}
• Implicitly parallel description
• Algorithmic choice
[Diagram: A split into a1 and a2 along c, B split into b1 and b2 along c; AB is the sum of the two partial products]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
// Recursively decompose in c
to(AB ab)
from(A.region(0, 0, c/2, h ) a1,
A.region(c/2, 0, c, h ) a2,
B.region(0, 0, w, c/2) b1,
B.region(0, c/2, w, c ) b2) {
ab = MatrixAdd(MatrixMultiply(a1, b1),
MatrixMultiply(a2, b2));
}
// Recursively decompose in w
to(AB.region(0, 0, w/2, h ) ab1,
AB.region(w/2, 0, w, h ) ab2)
from( A a,
B.region(0, 0, w/2, c ) b1,
B.region(w/2, 0, w, c ) b2) {
ab1 = MatrixMultiply(a, b1);
ab2 = MatrixMultiply(a, b2);
}
[Diagram: B split into b1 and b2 along w; AB split into ab1 and ab2, each computed from all of A]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
// Recursively decompose in c
to(AB ab)
from(A.region(0, 0, c/2, h ) a1,
A.region(c/2, 0, c, h ) a2,
B.region(0, 0, w, c/2) b1,
B.region(0, c/2, w, c ) b2) {
ab = MatrixAdd(MatrixMultiply(a1, b1),
MatrixMultiply(a2, b2));
}
// Recursively decompose in w
to(AB.region(0, 0, w/2, h ) ab1,
AB.region(w/2, 0, w, h ) ab2)
from( A a,
B.region(0, 0, w/2, c ) b1,
B.region(w/2, 0, w, c ) b2) {
ab1 = MatrixMultiply(a, b1);
ab2 = MatrixMultiply(a, b2);
}
// Recursively decompose in h
to(AB.region(0, 0, w, h/2) ab1,
AB.region(0, h/2, w, h ) ab2)
from(A.region(0, 0, c, h/2) a1,
A.region(0, h/2, c, h ) a2,
B b) {
ab1=MatrixMultiply(a1, b);
ab2=MatrixMultiply(a2, b);
}
}
PetaBricks Compiler Internals
[Diagram: PetaBricks Source Code → compiler passes over Rule/Transform Headers, Rule Body IR, and a Choice Dependency Graph → Code Generation → C++ (sequential leaf code plus parallel, dynamically scheduled code) → Runtime]
Choice Grids

transform RollingSum from A[n] to B[n] {
  Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
  Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Diagram: choice grid for output B over cells 0…n: cell 0 can use only Rule2; cells 1…n can use Rule1 or Rule2]
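The two rules are easy to render in plain C++ (function names are illustrative), which makes the trade-off concrete:

```cpp
#include <numeric>
#include <vector>

// Rule 1 reuses B[i-1]: O(n) total work but inherently sequential.
std::vector<double> rolling_rule1(const std::vector<double>& a) {
    std::vector<double> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = (i ? b[i - 1] : 0.0) + a[i];   // B[i] = B[i-1] + A[i]
    return b;
}

// Rule 2 recomputes each prefix from A alone: O(n^2) total work, but
// every cell is independent, so it parallelizes trivially.
std::vector<double> rolling_rule2(const std::vector<double>& a) {
    std::vector<double> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = std::accumulate(a.begin(), a.begin() + i + 1, 0.0);
    return b;
}
```

The choice grid records that cell 0 must use Rule2 (it has no left neighbor), while every other cell may use either; the autotuner decides which, and where to switch.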
Choice Dependency Graph

transform RollingSum from A[n] to B[n] {
  Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
  Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Diagram: the choice grid from the previous slide with dependency edges annotated by rule and direction, e.g. (r2, =), (r1, =, -1), (r1, <), (r1, =), (r2, <=)]
PetaBricks Autotuning
[Diagram: the compilation pipeline extended with an Autotuner that repeatedly runs the compiled user code on the Parallel Runtime Engine and produces a Choice Configuration File]
PetaBricks Execution
[Diagram: the compiled user code runs on the Parallel Runtime Engine, guided by the dependency graph and the Choice Configuration File; the configuration is used to prune the choices]
Experimental Setup
• Test System
– Dual quad-core (8 cores) Xeon X5460 @ 3.16GHz with 8GB RAM
– CSAIL Debian 4.0 (etch), kernel 2.6.18
• Training
– Using our hybrid genetic tuner
– Trained using all 8 cores
– Training times varied from ~1 min to ~1 hour
Sort
[Figure: time vs. input size (0 to 2000) for Insertion Sort, Quick Sort, Merge Sort, and Radix Sort; a second chart adds the Autotuned version]
Eigenvector Solve
[Figure: time vs. size (0 to 500) for Bisection, DC, and QR; a second chart adds the Autotuned version]
Poisson
[Figure: time (log scale) vs. matrix size (3 to 2049) for Direct, Jacobi, SOR, and Multigrid; a second chart adds the Autotuned version]
Scalability
[Figure: speedup (1 to 8) vs. number of cores (1 to 8) for MM, Sort, Poisson, and Eigenvector Solve]
Impact of Autotuning
• Custom hybrid genetic tuner
• Huge gains by training on the target architecture:

                                  Trained On:
  Run On:                         Niagara (8 cores)   Xeon (8 cores)   Xeon (1 core)
  SunFire T200 Niagara (8 cores)  1.00x               0.72x
  Xeon E7340 (8 cores)            0.43x               1.00x            0.30x
Related Work
• SPARSITY, OSKI – Sparse Matrices
• ATLAS, FLAME – Linear Algebra
• FFTW – Fast Fourier Transforms
• STAPL – Template Framework Library
• SPL – Digital Signal Processing
• High-level optimization via automated statistical modeling (Eric Brewer)
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
Conquering the Menace
• Parallelism Extraction
– The world is parallel, but most computer science is based in sequential thinking
– Parallel Languages: a natural way to describe the maximal concurrency in the problem
– Parallel Thinking: theory, algorithms, data structures, education
• Parallelism Management
– Mapping algorithmic parallelism to a given architecture
– New hardware support
– Easier to enforce correctness
– Reduce the cost of bad decisions
– A Universal Parallel Compiler
Hardware Opportunities
• We don't have to contend with uniprocessors
• Not your same old multiprocessor problem
– How does going from Multiprocessors to Multicores impact programs?
– What changed?
– Where is the impact?
– Communication Bandwidth
– Communication Latency
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed?
– Number of wires
– Clock rate
– Multiplexing
• Impact on the programming model?
– Massive data exchange is possible
– Data movement is not the bottleneck → processor affinity is not that important
• 32 Gigabits/sec → ~300 Terabits/sec: a 10,000x increase
Parallel Language Opportunities
• We need a lot more innovation! Languages that...
– require no non-intuitive reorganization of data or code
– make the programmer focus on concurrency, but not performance → off-load the parallelism and performance issues to the compiler (akin to ILP compilation for VLIW machines)
– eliminate hard problems such as race conditions and deadlocks (akin to the elimination of memory bugs in Java)
– inform the programmer if they have done something illegal (akin to a type system or runtime null-pointer checks)
– take advantage of domains to reduce the parallelization burden (akin to the StreamIt language for the streaming domain)
– use novel hardware to eliminate problems and help the programmer (akin to cache coherence hardware)
Compilation Opportunities
• Universal Parallel Compiler: GCC for Uniprocessors
– Easily portable to any uniprocessor
– Able to obtain respectable performance → a single program (in C) runs on all uniprocessors
• MultiCompiler: a Universal Compiler for Parallel Systems
– The language exposes maximal parallelism → the compiler manages it
– Unlike uniprocessors, many individual decisions are performance critical
– Candidates: don't bind a single decision, keep multiple tracks
– Learning: learn and improve heuristics
– Adaptation: dynamically choose candidates and adapt the program to resources and runtime conditions
Conclusions
• Kendo
– The first system to efficiently provide weak determinism on commodity hardware
– Provides a systematic method of reproducing many non-deterministic bugs
– Incurs modest performance overhead when running on 4 processors
– This low overhead makes it possible to leave it on while an application is deployed
• PetaBricks
– The first language where micro-level algorithmic choice can be naturally expressed
– Autotuning can find the best choice
– Can switch between choices as the solution is constructed
• Switching to multicores without losing the gains in programmer productivity may be the Grandest of the Grand Challenges
– Half a century of work, still no winning solution
– Will affect everyone!
– A lot more work to do to solve this problem!!!
http://groups.csail.mit.edu/commit/