![Page 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/1.jpg)
Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms
Review student: Fan Bai
Instructor: Dr. Sushil Prasad
2012.03.21
Andrew Nere, Atif Hashmi, and Mikko Lipasti
University of Wisconsin–Madison
IPDPS 2011
![Page 2](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/2.jpg)
Purpose
Background
Why the algorithm can be parallelized
Mapping to CUDA
Optimization methods
Experimental results
Outline
![Page 3](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/3.jpg)
Utilize Nvidia GPUs to accelerate a neocortex-inspired learning algorithm
The purpose of this paper
![Page 4](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/4.jpg)
The neocortex is the part of the brain that is unique to mammals and is largely responsible for executive processing skills such as mathematics, music, language, vision, and perception.
The neocortex comprises around 77% of the entire human brain.
For a typical adult, it is estimated that the neocortex has around 11.5 billion neurons and 360 trillion synapses, or connections between neurons.
What is the Neocortex?
![Page 5](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/5.jpg)
Hierarchical and regular structure, composed of cortical columns.
Neuroscientist Vernon Mountcastle was the first to observe the structural uniformity of the neocortex. He proposed that the neocortex is composed of millions of nearly identical functional units, which he termed cortical columns because of their seemingly column-shaped organization.
Neocortex
![Page 6](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/6.jpg)
Neuroscientists Hubel and Mountcastle further classified cortical columns into hypercolumns and minicolumns.
Neocortex
![Page 7](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/7.jpg)
•Minicolumns
– 80–100 neurons
– Represent unique features
– Share a common receptive field
•Hypercolumns
– 50–100 minicolumns
– Functional unit of the neocortex
•Connectivity
– Lateral
– Feedforward (bottom-up)
– Feedback (top-down)
Cortical Columns
![Page 8](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/8.jpg)
Cortical Network Model
![Page 9](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/9.jpg)
Highly Parallel
![Page 10](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/10.jpg)
“Compute Unified Device Architecture”
•Hardware
– Streaming Multiprocessors (SMs)
– Shared memory (16–48 KB)
– DRAM (1–6 GB)
•Programming Framework
– Threads: 1000s
– CTAs: Cooperative Thread Arrays, groups of threads
– Kernel: group of CTAs
Nvidia CUDA
![Page 11](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/11.jpg)
Mapping to CUDA
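The mapping is not spelled out on this slide, but a later slide ("Single kernel with 1 hypercolumn per CTA") suggests the natural decomposition: one CTA per hypercolumn, with threads handling the minicolumns inside it. A minimal Python sketch of that launch-planning arithmetic follows; the one-thread-per-minicolumn assignment and all sizes are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the hierarchy-to-CUDA mapping:
# one CTA (thread block) per hypercolumn, one thread per minicolumn.

def plan_kernel_launch(hypercolumns_in_level, minicolumns_per_hypercolumn):
    """Return (num_ctas, threads_per_cta) for one hierarchy level."""
    num_ctas = hypercolumns_in_level            # one CTA per hypercolumn
    threads_per_cta = minicolumns_per_hypercolumn  # one thread per minicolumn
    return num_ctas, threads_per_cta

# In the multiple-kernel baseline, each hierarchy level would get its own
# launch: e.g. a bottom level of 1024 hypercolumns with 100 minicolumns
# each becomes 1024 CTAs of 100 threads.
print(plan_kernel_launch(1024, 100))  # → (1024, 100)
```

Note how the grid shrinks toward the top of the hierarchy, which is exactly why small upper layers underutilize the GPU in the baseline.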
![Page 12](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/12.jpg)
Experimental Setup
![Page 13](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/13.jpg)
GPGPU Performance
![Page 14](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/14.jpg)
Problem: multiple-kernel launch overhead
– 1–2.5% of execution time
– No CTA-to-CTA communication
Problem: GPGPU resources underutilized
– Convergence is a key part of the model / algorithm
– Performance benefits diminish:
(1) 50x speedup for large layers
(2) > 10x SLOWDOWN for small layers
Limitations of Multiple Kernels
![Page 15](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/15.jpg)
We can see that 1-2.5% of the total execution time for a hierarchy is spent on the additional kernel launch overhead, with smaller cortical networks suffering from larger overhead.
Execution time
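The overhead pattern above follows from simple arithmetic: with one kernel launch per hierarchy level, a fixed per-launch cost makes up a larger share of total time when each level's compute is small. A hedged Python sketch, using assumed (not measured) costs:

```python
# Illustrative model of kernel-launch overhead: `levels` launches at a
# fixed cost each, plus per-level compute time. All numbers are assumed
# for illustration only, not measurements from the paper.

def launch_overhead_fraction(levels, launch_cost_us, compute_us_per_level):
    overhead = levels * launch_cost_us
    total = overhead + sum(compute_us_per_level)
    return overhead / total

# A small cortical network (little compute per level) suffers a larger
# relative overhead than a large one, matching the trend on the slide.
small = launch_overhead_fraction(4, 10.0, [200.0] * 4)   # ~4.8%
large = launch_overhead_fraction(4, 10.0, [5000.0] * 4)  # ~0.2%
print(f"small: {small:.1%}  large: {large:.1%}")
```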
![Page 16](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/16.jpg)
(1) 50x speedup for large layers
(2) > 10x SLOWDOWN for small layers
GPGPU resources underutilized
![Page 17](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/17.jpg)
Solution 1: Pipeline cortical network execution
(1) Single kernel with 1 hypercolumn per CTA
(2) Double buffering maintains dependencies: a double buffer between hierarchy levels guarantees that producer-consumer relationships are enforced.
(3) Trade-offs:
– Improves resource utilization
– Still requires multiple kernel launches to fully propagate through the hierarchy
– Increases storage overhead
Algorithmic Optimizations: Pipelining to Increase Resource Utilization
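The pipelining idea can be sketched on the host: in each pass (one kernel launch), every level reads its inputs from a "front" buffer and writes outputs to a "back" buffer, so producer-consumer dependencies hold even though all levels advance concurrently. This is an illustrative Python sketch, not the paper's code; the `+ 1` transform is a stand-in for the real hypercolumn computation.

```python
# Simulate pipelined execution with double buffering between levels.
# Level i reads activations from buffer key i and writes to key i + 1.

def pipelined_passes(inputs, num_levels, num_passes):
    front = {0: list(inputs)}            # level -> activations this pass
    for _ in range(num_passes):
        back = {0: list(inputs)}         # fresh input keeps feeding level 0
        for level in range(num_levels):
            if level in front:           # this level's inputs are ready
                back[level + 1] = [x + 1 for x in front[level]]
        front = back                     # swap buffers between passes
    return front

# It takes num_levels passes (kernel launches) for data to fully
# propagate to the top of a 3-level hierarchy, as the slide notes.
state = pipelined_passes([0, 0], num_levels=3, num_passes=3)
print(state[3])  # → [3, 3]
```

The extra buffer per level is exactly the "increases storage overhead" cost: every inter-level connection is stored twice.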
![Page 18](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/18.jpg)
Software work-queue
Ideally we would like to be able to execute the entire cortical architecture on the GPU concurrently, reducing the overhead to a single kernel launch. However, a limitation of the CUDA architecture is that there is no guarantee as to the order in which CTAs are scheduled.
We instead create a software work-queue to explicitly orchestrate the order in which hypercolumns are executed. The work-queue is managed directly in the GPU's global memory space, as in Figure 9.
![Page 19](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/19.jpg)
Algorithmic Optimizations, Solution 2: Work-queue
(1) Single kernel launch: a single CUDA kernel is launched with only as many CTAs as can concurrently fit across all of the SMs in the GPGPU, as determined by the Occupancy Calculator (Figure 9 shows 2 concurrent CTAs per Streaming Multiprocessor (SM)).
(2) Each CTA uses an atomic primitive to gain a unique index into the work-queue (solid blue arrows 'A' and 'C'). The work-queue contains each hypercolumn's ID in the cortical network and is organized to execute hypercolumns in order from the bottom of the hierarchy to the top. If all of a hypercolumn's input activations are available, it can calculate its output activations (in Figure 9, HC0's inputs are ready, while HC9 must wait for its inputs to be produced by HC0).
![Page 20](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/20.jpg)
Once a hypercolumn has calculated its output activations, they are written back to global memory.
The dashed red arrow (B) in the figure depicts how HC0 indicates to HC9 that all input activations are available via an atomic increment of the flag.
Finally, the CTA atomically indexes into the work-queue again to execute another hypercolumn, until the work-queue is empty.
![Page 21](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/21.jpg)
(3) Concurrent CTAs execute the entire cortical network
– Doesn't rely on the CTA scheduler
– CUDA disclaimer: relies on CTA-to-CTA communication, whose ordering the CUDA model does not guarantee
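The work-queue loop described above can be simulated on the host: a small pool of workers (standing in for the resident CTAs) atomically grab hypercolumn IDs from the queue, spin until their inputs' ready-flags are set, compute, then flag their consumers. This Python sketch uses threads and a shared counter in place of CUDA atomics; the topology and worker count are illustrative assumptions.

```python
import itertools
import threading
import time

def run_work_queue(queue, parents, num_workers=2):
    """queue: hypercolumn IDs, bottom of the hierarchy first.
    parents: consumer ID -> list of producer IDs it must wait for."""
    next_idx = itertools.count()          # analogue of atomicAdd on a queue index
    ready = {hc: 0 for hc in queue}       # analogue of per-hypercolumn flags
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    done, lock = [], threading.Lock()

    def worker():
        while True:
            i = next(next_idx)            # grab a unique queue slot
            if i >= len(queue):
                return                    # queue empty: this "CTA" retires
            hc = queue[i]
            while ready[hc] < len(parents.get(hc, [])):
                time.sleep(0)             # spin until all inputs are produced
            with lock:
                done.append(hc)           # "compute output activations"
                for c in children.get(hc, []):
                    ready[c] += 1         # flag input available to consumers

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers: w.start()
    for w in workers: w.join()
    return done

# HC0 and HC1 feed HC2; the queue is ordered bottom-up as on the slide.
print(run_work_queue(["HC0", "HC1", "HC2"], {"HC2": ["HC0", "HC1"]}))
```

As with the real scheme, correctness depends on the queue being ordered bottom-up so that a spinning worker's producers are always ahead of it in the queue.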
![Page 22](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/22.jpg)
Work Queue -Example
![Page 23](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/23.jpg)
Single GPU Optimization Results
![Page 24](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/24.jpg)
GT200 Architecture Performance
![Page 25](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/25.jpg)
Multi-GPU Systems
![Page 26](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/26.jpg)
![Page 27](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/27.jpg)
Multi-GPU Systems
![Page 28](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/28.jpg)
Multi-GPU Results
![Page 29](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/29.jpg)
Problem: Synchronization and workload imbalance
Solution: Key algorithmic optimizations
Profiling / distributing cortical networks on multi-GPU systems
Provide insight into Nvidia GPU architectures
My summary
![Page 30](https://reader036.vdocument.in/reader036/viewer/2022062516/56649daf5503460f94a9ceb8/html5/thumbnails/30.jpg)
Cortical network algorithm well suited to GPGPUs
– 34x speedup baseline / 39x with optimizations
– Synchronization overhead / workload imbalance: combat with algorithmic changes
Fermi vs. GTX 280 architecture
– Application sensitive (32 vs. 128 threads)
– Improved GigaThread CTA scheduler in Fermi
Multi-GPU implementation
– Online profiling / deployment
– 60x speedup vs. serial
Conclusion