Compiling with multicore, Jeehyung Lee, 15-745 Spring 2009
1
Compiling with multicore
Jeehyung Lee
15-745 Spring 2009
2
Papers
Automatic Thread Extraction with Decoupled Software Pipelining
• Fully automatic
• Fine-grained pipelining
A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
• Semi-automatic
• Coarse-grained pipelining
3
First paper
Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August
From Princeton University
4
What is the paper about?
Despite the increasing use of multiprocessors, many single-threaded applications do not benefit
Let the compiler automatically extract threads and exploit lurking pipeline parallelism
• Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)
5
Why decoupled pipelining?
Example
Linked list traversal
6
Why decoupled pipelining?
DOACROSS
Critical path: # of iterations * (LD latency + communication latency)
7
Why decoupled pipelining?
DSWP
Critical path: # of iterations * LD latency
One-way pipelining
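The DOACROSS and DSWP formulas on these two slides can be compared with a quick back-of-the-envelope calculation. This is only an illustrative sketch: the function names and the latency numbers are assumptions made up for the example, not measurements from the paper, and the one-time pipeline-fill term added for DSWP is a simplification.

```python
def doacross_critical_path(iterations, ld_latency, comm_latency):
    """DOACROSS: every iteration waits for the pointer to arrive from
    another core, so communication latency is paid on each iteration."""
    return iterations * (ld_latency + comm_latency)


def dswp_critical_path(iterations, ld_latency, comm_latency):
    """DSWP: the pointer-chasing recurrence stays on one core, so the
    communication latency is paid once to fill the pipeline
    (a simplifying assumption for this sketch)."""
    return iterations * ld_latency + comm_latency


# Hypothetical numbers: 1000 list nodes, 3-cycle load, 10-cycle inter-core hop.
print(doacross_critical_path(1000, 3, 10))  # 13000
print(dswp_critical_path(1000, 3, 10))      # 3010
```

With these made-up numbers the recurrence dominates under DSWP, while under DOACROSS the inter-core hop dominates every iteration.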
8
DSWP
Flow of data (dependences) among cores is acyclic
With inter-core queues, threads can be decoupled
• Efficiency + high tolerance for latency
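As a rough illustration of this structure, the linked-list traversal can be split into two threads joined by a queue. This is a minimal sketch, not the paper's generated code: `Node`, `traverse`, `update`, `run_pipeline` and the queue depth are all hypothetical names and values, and Python threads are used only to show the shape of the pipeline (CPython's global interpreter lock means this will not actually run faster).

```python
import queue
import threading


class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next


_DONE = object()  # sentinel marking the end of the stream


def traverse(head, q):
    """Stage 1: owns the pointer-chasing recurrence (node = node.next)."""
    node = head
    while node is not None:
        q.put(node)       # one-way flow: stage 1 never needs a value from stage 2
        node = node.next
    q.put(_DONE)


def update(q):
    """Stage 2: does the per-node work on whatever stage 1 has produced."""
    while True:
        node = q.get()
        if node is _DONE:
            break
        node.val += 1


def run_pipeline(head, depth=32):
    q = queue.Queue(maxsize=depth)  # bounded buffer decouples the two stages
    t1 = threading.Thread(target=traverse, args=(head, q))
    t2 = threading.Thread(target=update, args=(q,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()


def build_list(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(head):
    out = []
    while head is not None:
        out.append(head.val)
        head = head.next
    return out


head = build_list([1, 2, 3])
run_pipeline(head)
print(to_list(head))  # [2, 3, 4]
```

Because values only flow from `traverse` to `update`, a slow hop between the two stages delays the start of stage 2 but does not lengthen each iteration of stage 1.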
9
DSWP Algorithm
• Build dependence graph
• Find strongly connected components (SCCs)
• Create DAG of SCCs
• Partition DAG
• Split code into partitions
• Add flows to partitions
10
Build dependence graph
Include every traditional dependence (data, control, and memory), plus the paper's extensions
11
Find SCC
SCC: instructions that form a dependence cycle in the loop
Instructions in the same SCC cannot be split across threads, so each SCC stays in one partition
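To make the SCC step concrete, here is a small sketch using Tarjan's algorithm, one standard way to find SCCs (the slides do not say which algorithm the compiler uses). The dependence graph is a hand-written, hypothetical model of a linked-list loop: the instruction names and edges are assumptions for illustration, with an edge pointing from an instruction to each instruction that depends on it.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps node -> list of successor nodes."""
    index = {}
    low = {}
    stack = []
    on_stack = set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of an SCC: pop its members off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical dependence graph for: while node: node.val += 1; node = node.next
graph = {
    "advance": ["advance", "test", "load", "store"],  # node = node.next
    "test": ["advance", "load", "add", "store"],      # loop-exit branch
    "load": ["add"],                                  # t = node.val
    "add": ["store"],                                 # t = t + 1
    "store": [],                                      # node.val = t
}
print(strongly_connected_components(graph))
# [['store'], ['add'], ['load'], ['advance', 'test']]
```

In this toy graph only the pointer advance and the exit test depend on each other, so they form the one multi-instruction SCC; the load, add and store are free to move to another thread.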
12
Create DAG of SCCs
Merge the instructions in each SCC into a single node and redirect the dependence arrows to the merged nodes
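A minimal sketch of this merging step, assuming the SCCs have already been found. The function name `condense`, the instruction names, and the example graph are hypothetical; edges inside an SCC disappear, and edges between SCCs become edges between the merged nodes.

```python
def condense(graph, sccs):
    """Collapse each SCC into one node; keep only edges between different SCCs."""
    owner = {instr: i for i, comp in enumerate(sccs) for instr in comp}
    dag = {i: set() for i in range(len(sccs))}
    for src, dsts in graph.items():
        for dst in dsts:
            if owner[src] != owner[dst]:
                dag[owner[src]].add(owner[dst])
    return dag


# Hypothetical dependence graph of a linked-list loop and its SCCs.
graph = {
    "advance": ["advance", "test", "load", "store"],
    "test": ["advance", "load", "add", "store"],
    "load": ["add"],
    "add": ["store"],
    "store": [],
}
sccs = [["advance", "test"], ["load"], ["add"], ["store"]]

dag = condense(graph, sccs)
print(dag)  # {0: {1, 2, 3}, 1: {2}, 2: {3}, 3: set()}
```

The result has no cycles by construction, which is what allows the partitioning step to keep all communication flowing in one direction.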
13
Partition DAG
Partition DAG nodes into n partitions (n <= # of processors)
Use a heuristic to maximize load balance
• Decide # of partitions (threads)
• Start filling partition 1 with nodes from the top of the DAG. When the partition is full (estimated by # of cycles), move on to the next partition
• Find the best # of threads and its partitioning
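The heuristic on this slide can be sketched as a greedy fill over a topological order of the DAG. This is a simplified illustration, not the paper's exact heuristic: the function names, the "full" threshold (total cycles divided by the thread count), the tie-breaking rule, and the cycle estimates are all assumptions.

```python
def partition(topo_nodes, cycles, num_threads):
    """Greedy fill: walk the DAG in topological order and close a partition
    once adding the next node would exceed total_cycles / num_threads."""
    target = sum(cycles[n] for n in topo_nodes) / num_threads
    parts = [[]]
    load = 0
    for n in topo_nodes:
        if parts[-1] and load + cycles[n] > target and len(parts) < num_threads:
            parts.append([])
            load = 0
        parts[-1].append(n)
        load += cycles[n]
    return parts


def bottleneck(parts, cycles):
    """Pipeline throughput is limited by the heaviest stage."""
    return max(sum(cycles[n] for n in p) for p in parts)


def best_partition(topo_nodes, cycles, max_threads):
    """Try every thread count and keep the one with the lightest bottleneck,
    preferring fewer threads on a tie."""
    candidates = [partition(topo_nodes, cycles, k)
                  for k in range(1, max_threads + 1)]
    return min(candidates, key=lambda p: (bottleneck(p, cycles), len(p)))


# Hypothetical DAG nodes in topological order, with made-up cycle estimates.
nodes = ["advance+test", "load", "add", "store"]
cycles = {"advance+test": 4, "load": 3, "add": 1, "store": 2}
print(best_partition(nodes, cycles, 2))
# [['advance+test'], ['load', 'add', 'store']]
```

Filling in topological order means every dependence between partitions points from an earlier partition to a later one, so the flow among threads stays acyclic.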
14
Split code and insert flows (done!)
For each partition, emit the basic blocks relevant to the SCC nodes it contains
Add code for the dependence flows (produce/consume operations on the inter-core queues)
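One way to picture the flow-insertion step is to list every dependence whose two ends landed in different partitions, since each of those needs a value sent through a queue. The sketch below is illustrative only: `cross_partition_flows`, the instruction names, and the partition assignment are hypothetical.

```python
def cross_partition_flows(graph, part_of):
    """Return sorted (producer_instr, src_partition, dst_partition) triples,
    one per value that must cross a partition boundary."""
    flows = set()
    for src, dsts in graph.items():
        for dst in dsts:
            if part_of[src] != part_of[dst]:
                flows.add((src, part_of[src], part_of[dst]))
    return sorted(flows)


# Hypothetical dependence graph and a two-thread partitioning of it.
graph = {
    "advance": ["advance", "test", "load", "store"],
    "test": ["advance", "load", "add", "store"],
    "load": ["add"],
    "add": ["store"],
    "store": [],
}
part_of = {"advance": 0, "test": 0, "load": 1, "add": 1, "store": 1}

print(cross_partition_flows(graph, part_of))
# [('advance', 0, 1), ('test', 0, 1)]
```

In this toy example two values cross the boundary: the pointer produced by `advance` and the branch outcome from `test`, so the second thread knows both which node to work on and when to stop.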
15
Result
19.4% speedup on important benchmark loops, 9.2% overall
When core bandwidth is halved
• Single-threaded code slows down by 17.1%
• DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
Promising enabler for Thread-Level Parallelism (TLP)?
16
Second Paper
A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
William Thies, Vikram Chandrasekhar, and Saman Amarasinghe
From MIT
17
What is the paper about?
Despite the increasing use of multiprocessors, many single-threaded… (repeated)
Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
Let people define the pipeline, and learn the practical dependences at run time
18
What is the paper about?
Despite the increasing use of multiprocessors, many single-threaded… (repeated)
Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
Let people define the stages, and learn the practical dependences at run time
…for streaming applications
19
Interface
Add annotations in the body of the top-level loop
20
Dynamic analysis
The system creates a stream graph according to annotations.
How do they find dependencies?
21
Dynamic analysis
Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages
22
Dynamic analysis
Run the application on training examples, and record every relevant store-load pair across pipeline boundaries
This gives us the practical dependences: those actually observed on the training inputs
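The recording idea can be sketched as a tracer that remembers which stage last stored to each address and logs an edge whenever a different stage loads it. This is a toy model under stated assumptions: `DependenceTracer`, the stage names, and the symbolic addresses are hypothetical, and a real system would instrument the running program rather than rely on explicit calls like these.

```python
class DependenceTracer:
    def __init__(self):
        self.last_writer = {}   # address -> stage that last stored to it
        self.edges = set()      # (producer_stage, consumer_stage)
        self.stage = None

    def enter_stage(self, name):
        """Called at each stage boundary in the annotated loop body."""
        self.stage = name

    def store(self, addr):
        self.last_writer[addr] = self.stage

    def load(self, addr):
        writer = self.last_writer.get(addr)
        if writer is not None and writer != self.stage:
            # A store in one stage reached a load in another stage.
            self.edges.add((writer, self.stage))


# Hypothetical training run of a three-stage streaming loop.
tracer = DependenceTracer()
for frame in range(3):
    tracer.enter_stage("read")
    tracer.store("buf")
    tracer.enter_stage("decode")
    tracer.load("buf")
    tracer.store("pixels")
    tracer.enter_stage("write")
    tracer.load("pixels")

print(sorted(tracer.edges))  # [('decode', 'write'), ('read', 'decode')]
```

The edges found this way reflect only what the training inputs exercised, which is why the result is practical rather than provably complete.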
23
Interface
The program shows the complete stream graph
The user decides whether they like this pipelining or not
• If yes, done!
• Else, redo the annotations. Iterate until satisfied
24
Actual pipelining
When compiled, the annotation macros emit code that forks the original program once for each pipeline stage
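The fork-per-stage idea can be illustrated with a two-stage toy pipeline in which a forked copy of the process runs the second stage and receives its input over a pipe. This is a Unix-only sketch under stated assumptions, not the code the macros actually emit: `run_forked_pipeline`, the two stages (square, then sum), and the text-over-pipe protocol are all made up for the example.

```python
import os


def run_forked_pipeline(items):
    """Stage 1 (parent) squares each item; stage 2 (forked child) sums them."""
    data_r, data_w = os.pipe()      # stage 1 -> stage 2
    result_r, result_w = os.pipe()  # stage 2 -> parent, final answer only

    pid = os.fork()
    if pid == 0:
        # Child: a full copy of the program that runs only stage 2.
        os.close(data_w)
        os.close(result_r)
        total = 0
        with os.fdopen(data_r) as stream:
            for line in stream:
                total += int(line)
        with os.fdopen(result_w, "w") as out:
            out.write(str(total))
        os._exit(0)

    # Parent: runs only stage 1.
    os.close(data_r)
    os.close(result_w)
    with os.fdopen(data_w, "w") as stream:
        for x in items:
            stream.write(f"{x * x}\n")
    with os.fdopen(result_r) as result:
        total = int(result.read())
    os.waitpid(pid, 0)
    return total


print(run_forked_pipeline([1, 2, 3]))  # 14
```

Because each stage is a separate process with its own copy of memory, only the values explicitly sent over the pipe are shared, which is why the recorded cross-stage dependences matter.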
25
Result
Average 2.78x speedup, max 3.89x on 4 cores
Seems unsound but practical (?)