t4: compiling sequential code for effective speculative
TRANSCRIPT
![Page 1: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/1.jpg)
T4: Compiling Sequential Code for Effective Speculative
Parallelization in HardwareV IC TOR A . Y IN GM ARK C . JEFFREY *D AN IEL SAN C HEZ
I SC A 2 020
*University of Toronto starting Fall 2020
![Page 2: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/2.jpg)
Parallelization: Gap between programmers and hardware
Multicores are everywhere Programmers still write sequential code
ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE 2
Speculative parallelization: new architectures and compilers to parallelize sequential code without knowing what is safe to run in parallel
Intel Skylake-SP (2017): 28 cores per die
1.…2.…3.…
![Page 3: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/3.jpg)
T4: Trees of Tiny Timestamped TasksOur T4 compiler exploits recently proposed hardware features:◦ Timestamps encode order, letting tasks spawn out-of-order
◦ Trees unfold branches in parallel for high-throughput spawn
◦ Compiler optimizations make task spawn efficient
◦ Efficient parallel spawns allows for tiny tasks (10’s of instructions)» Tiny tasks create opportunities to reduce communication and improve locality
We target hard-to-parallelize C/C++ benchmarks from SPEC CPU2006◦ Modest overheads (gmean 31% on 1 core)
◦ Speedups up to 49x on 64 cores
3ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
swarm.csail.mit.edu
![Page 4: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/4.jpg)
4ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Background
T4 Principles in Action
T4: Parallelizing Entire Programs
Evaluation
Background
T4 Principles in Action
T4: Parallelizing Entire Programs
Evaluation
![Page 5: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/5.jpg)
Thread-Level Speculation (TLS) [Multiscalar (’92-’98), Hydra (’94-’05), Superthreaded (’96), Atlas (’99),
Krishnan et al. (‘98-’01), STAMPede (‘98-’08), Cintra et al. (’00, ‘02), IMT (‘03), TCC (’04), POSH (‘06), Bulk (’06), Luo et al. (’09-’13), RASP (‘11), MTX (‘10-’20), and many others]
◦ Divide program into tasks (e.g., loop iterations or function calls)
◦ Speculatively execute tasks in parallel
◦ Detect dependencies at runtime and recover
Prior TLS systems did not scale many real-world programs beyond a few cores due to
◦ Expensive aborts
◦ Serial bottlenecks in task spawns or commits
5ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 6: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/6.jpg)
TLS creates chains of tasks
Example: maximal independent set◦ Iterates through vertices in graph
One task per outer-loop iteration◦ Each tasks spawns the next
◦ Hardware tries to run tasks in parallel
Hardware tracks memory accesses to discover data dependences
6
Time
A
B
C
D
E
F …
for (int v = 0; v < numVertices; v++) {if (state[v] == UNVISITED) {state[v] = INCLUDED;for (int nbr : neighbors(v))
state[nbr] = EXCLUDED;}
}
rd
wr
Indirect memory accesses
ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 7: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/7.jpg)
Task chains incur costly misspeculation recovery
Tasks abort if they violated data dependence
Tasks that abort must roll back their effects, including successors they spawned or forwarded data to
7
Time
D′E′
F′ …
rd
A
B
C
D
E
F …
A
B
C
D
E
F
rd
wr
…
for (int v = 0; v < numVertices; v++) {if (state[v] == UNVISITED) {state[v] = INCLUDED;for (int nbr : neighbors(v))
state[nbr] = EXCLUDED;}
}
Unselective aborts waste a lot of workABORT
RE-EXECUTE
ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 8: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/8.jpg)
Swarm architecture[Jeffrey et al. MICRO’15, MICRO’16, MICRO’18; Subramanian et al. ISCA’17]
Execution model:◦ Program comprises timestamped tasks◦ Tasks spawn children with greater or
equal timestamp◦ Tasks appear to run sequentially,
in timestamp order
Detects order violations and selectively aborts dependent tasks
Distributed task units queue, dispatch, and commit multiple tasks per cycle◦ <2% area overhead◦ Runs hundreds of tiny speculative tasks
8ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 9: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/9.jpg)
9ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Background
T4 Principles in Action
T4: Parallelizing Entire Programs
Evaluation
![Page 10: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/10.jpg)
T4’s decoupled spawn enables selective aborts
T4 compiles sequential C/C++ to exploit parallelism on Swarm
Put most work into worker tasks at the leaves of the task tree◦ Use Swarm’s mechanisms for
cheap selective aborts
10
for (int v = 0; v < numVertices; v++) {if (state[v] == UNVISITED) {state[v] = INCLUDED;for (int nbr : neighbors(v))
state[nbr] = EXCLUDED;}
}
Time…
D′rd
wr
A
B
C
D
E
F
rd
Spawners
Workers
ABORT
RE-EXECUTE
ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 11: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/11.jpg)
Tiny tasks make aborts cheapIsolate contentious memory accesses into tiny tasks, to limit the damage when they abort
11ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
wr
wr wr
Time…
Time
wr
wr wr
…
for (int v = 0; v < numVertices; v++) {if (state[v] == UNVISITED) {state[v] = INCLUDED;for (int nbr : neighbors(v))
state[nbr] = EXCLUDED;}
}
Parallelize outer loop only:
Parallelize both loops:Tiny tasks (a few instructions)are difficult to spawn effectively
![Page 12: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/12.jpg)
Tiny tasks (a few instructions)are difficult to spawn effectively
T4’s balanced task trees enable scalability
Spawners recursively divide the range of iterations
12
for (int v = 0; v < numVertices; v++) {if (state[v] == UNVISITED) {state[v] = INCLUDED;for (int nbr : neighbors(v))
state[nbr] = EXCLUDED;}
}
ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
…………
Spawners
Workers
Spawners
Balanced spawner trees reduce critical path length to O(log(tripcount))
![Page 13: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/13.jpg)
13ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Background
T4 Principles in Action
T4: Parallelizing Entire Programs
Evaluation
![Page 14: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/14.jpg)
T4: Parallelizing entire real-world programsT4 divides the entire program into tasks starting from the first instruction of main()
T4 automatically generates tasks from◦ Loop iterations◦ Function calls◦ Continuations of the above
T4 extracts nested parallelism fromthe entire program despite◦ Loops with unknown tripcount◦ Opaque function calls◦ Data-dependent control flow◦ Arbitrary pointer manipulation
14ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 15: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/15.jpg)
Progressive expansion of unknown-tripcount loops
Progressive expansion generates balanced spawnertrees for loops with unknown tripcount
- loops with break statements
- while loops
15ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
int i = 0;while (status[i]) {
if (foo(i)) break;i++;
}:
Source code: void iter(Timestamp i) {if (!done) {if (!status[i]) done = 1;else if (foo(i)) done = 1;
}}
0
iter(0)
iter(1)
4
iter(4)
iter(5)
2
iter(2)
iter(3)
6
iter(6)
iter(7)10
iter(10)
iter(11)8
iter(8)
iter(9)12
iter(12)
iter(13)
![Page 16: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/16.jpg)
Continuation-passing style eliminates the call stack
Problem: Independent function spawns serialize on stack-frame allocation
Solution:◦ When needed, T4 allocates continuation closures on the heap instead◦ T4 optimizations ensure most tasks don’t need memory allocation◦ These software techniques could apply to any TLS system
16ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
g
g
f
f
f
ffor (int i = 0; i < N; i++) {
float x = f();if (x > 0.0) g(x);
}
![Page 17: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/17.jpg)
Me
mo
ry con
troller / IO
Memory controller / IO
Memory controller / IO
Me
mo
ry c
on
tro
ller /
IO
0xE6823
Spatial-hint generation for localityTiny tasks may access only one memory location, which is known when the task is spawned.
Hardware uses these spatial hints to improve locality:◦ maps each address to a tile.
◦ Send tasks for that address to that tile.
17ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
A
C1
B
C2
![Page 18: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/18.jpg)
Manual annotations for task splitting
Programmer may add task boundaries for tiny tasks
Guaranteed to haveno effect on program output
Added <0.1% to source code
18ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 19: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/19.jpg)
T4 implementation in LLVM/Clang
Intraprocedural passes: small compile times (linear in code size)Use all standard LLVM optimizations to generate high-quality codeMore in the paper:◦ Topological sorting to generate timestamps◦ Bundling stack allocations to the heap with privatization◦ Loop task coarsening to reduce false sharing of cache lines◦ Case studies and sensitivity studies
19ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Object fileClang
frontend
LLVM backend
Optimizations
(e.g., -O3)x86_64 code
generation
C/C++
source
code
T4 Parallelization
Passes
T4 Parallelization
Passes
![Page 20: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/20.jpg)
20ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Background
T4 Principles in Action
T4: Parallelizing Entire Programs
Evaluation
![Page 21: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/21.jpg)
Methodology1-, 4-, 16-, and 64-core systems
C/C++ benchmarks from SPEC CPU2006
All speedups normalized to serial code compiled with clang -O3
21ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
4-wide superscalar out-of-order cores (Haswell-like)
32 KB L1 caches1 MB L2 cache per 4-core tile4 MB L3 slice per 4-core tile
256 entries 64 entries/tile (1024 tasks for 64-core chip)
4 4×4 mesh networks
![Page 22: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/22.jpg)
T4 scales to tens of cores
22
Hot loops have some independent iterations
Hot loops have serializing variables updated every iteration
ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 23: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/23.jpg)
T4 overheads are moderate
23ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Serial code, -O3 Parallelized with T4
Task-spawn overheads are geo. mean 31%
![Page 24: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/24.jpg)
Parallelization redoubles performance
24ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
Cores spend most time executing useful work, not aborting
Para
llel s
pee
du
p
![Page 25: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/25.jpg)
Parallelization redoubles performance
25ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
T4 scales many programs to tens of cores
Para
llel s
pee
du
p
![Page 26: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/26.jpg)
ContributionsT4 compiler provides parallelization needed to allow sequential programmers to use multicores
T4 broadens the applications for which speculative parallelization is effective by exploiting the recent Swarm architecture◦ Parallelization of sequential C/C++ yields speedups of up to 49× on 64 cores
New code transformations:◦ Decoupled spawners enable cheap selective aborts of tiny tasks◦ Progressive expansion: balanced task trees for unknown-tripcount loops◦ Stack elimination and loop task coarsening reduce false sharing◦ Task spawn optimizations
26ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
![Page 27: T4: Compiling Sequential Code for Effective Speculative](https://reader033.vdocument.in/reader033/viewer/2022060211/62935ac31d96125131005921/html5/thumbnails/27.jpg)
Questions?T4 is open-source and available for you to build on:
Join online Q&A @ ISCA: First paper in Session 2B on June 1, 20209am in Los Angeles, Noon in New York, 6pm in Brussels, Midnight in Beijing
27ISCA 2020 T4: COMPILING SEQUENTIAL CODE FOR EFFECTIVE SPECULATIVE PARALLELIZATION IN HARDWARE
swarm.csail.mit.edu