![Page 1: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/1.jpg)
Understanding Sources of Inefficiency in General-Purpose Chips
Hameed, Rehan, et al.
PRESENTED BY:
XIAOMING GUO
SIJIA HE
1
![Page 2: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/2.jpg)
Outline
➢Motivation
➢H.264 Basics
➢Key ideas
➢Implementation & Evaluation
➢Summary
➢Conclusion
➢Discussion
2
![Page 3: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/3.jpg)
We Love General-Purpose Chips…
Runs a broad range of applications!
14nm lithography
4GHz frequency
4 cores/8 threads
64KB L1 cache/core
256KB L2 cache/core
8MB LLC
Intel Core i7-6700K
https://www.techpowerup.com/215333/intel-skylake-die-layout-detailed.html
3
![Page 4: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/4.jpg)
But… general-purpose chips are inefficient for specific tasks

| Implementation | Performance (fps) | Area (mm²) | Energy/frame (mJ) |
|---|---|---|---|
| Intel (720×480 SD) | 30 | 122 | 742 |
| Intel (1280×720 HD) | 11 | 122 | 2023 |
| ASIC (1280×720 HD) | 30 | 8 | 4 |

Highly optimized H.264 encoder implementation on a 2.8GHz Intel Pentium 4 vs. ASIC
The general-purpose (GP) chip is about 500x less energy efficient than the ASIC for HD H.264 encoding (2023 mJ vs. 4 mJ per frame)
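The 500x figure can be checked directly from the HD rows of the table. The short sketch below only redoes that arithmetic; the dictionary names are made up for illustration.

```python
# Figures copied from the HD rows of the comparison table on this slide.
intel_hd = {"fps": 11, "area_mm2": 122, "energy_mj_per_frame": 2023}
asic_hd = {"fps": 30, "area_mm2": 8, "energy_mj_per_frame": 4}

# How many times more energy the GP chip spends per encoded frame.
energy_ratio = intel_hd["energy_mj_per_frame"] / asic_hd["energy_mj_per_frame"]
# How many times more silicon area the GP chip uses.
area_ratio = intel_hd["area_mm2"] / asic_hd["area_mm2"]
# How many times faster the ASIC encodes HD video.
fps_ratio = asic_hd["fps"] / intel_hd["fps"]

print(f"energy per frame: {energy_ratio:.0f}x higher on the GP chip")
print(f"area ratio (GP / ASIC): {area_ratio}")
print(f"throughput ratio (ASIC / GP): {fps_ratio:.2f}")
```

The energy ratio comes out just above 500, which is where the rounded "500x" on the slide comes from.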
4
![Page 5: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/5.jpg)
Outline
➢Introduction
➢H.264 Basics
➢Key ideas
➢Implementation & Evaluation
➢Summary
➢Conclusion
➢Discussion
5
![Page 6: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/6.jpg)
H.264 Basics
One of the most commonly used video coding formats
[Figure: H.264 encoder pipeline, with stages marked as Sequential or Parallel]
Video Source → Motion Prediction [Inter-frame Prediction (IME, FME) / Intra-frame Prediction (IP)] → Transform/Quantize → Entropy Encoding (CABAC) → Compressed Video
6
![Page 7: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/7.jpg)
H.264 Basics: An Example
http://blog.webmproject.org/2010/07/inside-webm-technology-vp8-intra-and.html
7
![Page 8: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/8.jpg)
Outline
➢Introduction
➢H.264 Basics
➢Key ideas
➢Implementation & Evaluation
➢Summary
➢Conclusion
➢Discussion
8
![Page 9: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/9.jpg)
Key Ideas
1. Use a Tensilica-based, extensible CMP system as the baseline
2. Explore customizations of the baseline that reduce overheads to transform the baseline into an ASIC-like H.264 encoder
3. Study how the overhead is reduced
4. Understand the source of inefficiency in GP
9
![Page 10: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/10.jpg)
Outline
➢Introduction
➢H.264 Basics
➢Key ideas
➢Implementation & Evaluation
➢Summary
➢Conclusion
➢Discussion
10
![Page 11: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/11.jpg)
Implementation - Baseline
11
![Page 12: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/12.jpg)
Implementation – Customization Step 1
Data-Level Parallelism: up to 18-way SIMD
https://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html
Instruction-Level Parallelism: up to 3-slot VLIW
Instruction bundle: Instruction 1 | Instruction 2 | Instruction 3
Execution units: FPU | Integer Unit | Memory Unit
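To make the two forms of parallelism concrete, here is a minimal Python sketch. It models an 18-lane SIMD unit by processing a pixel array in chunks of 18 and counts how many vector operations that takes, then gives a best-case issue-cycle count for a 3-slot VLIW. The function names and the counting model are assumptions for illustration. They are not the Tensilica instruction set, and real VLIW slots are not interchangeable (the slide shows FPU, integer, and memory slots), so the cycle count is only a lower bound.

```python
from math import ceil

SIMD_WIDTH = 18   # "up to 18-way SIMD" from the slide
VLIW_SLOTS = 3    # "up to 3-slot VLIW" from the slide


def simd_abs_diff(a, b, width=SIMD_WIDTH):
    """Lane-wise |a - b|, processed in chunks of `width` elements.

    Returns the result list and the number of vector operations issued.
    """
    out, vector_ops = [], 0
    for start in range(0, len(a), width):
        lanes_a = a[start:start + width]
        lanes_b = b[start:start + width]
        # One "vector instruction" handles every lane of the chunk at once.
        out.extend(abs(x - y) for x, y in zip(lanes_a, lanes_b))
        vector_ops += 1
    return out, vector_ops


def best_case_issue_cycles(vector_ops, slots=VLIW_SLOTS):
    """Lower bound on issue cycles if every VLIW slot could be filled."""
    return ceil(vector_ops / slots)


a = list(range(256))
b = list(range(255, -1, -1))
diffs, ops = simd_abs_diff(a, b)
print(ops, best_case_issue_cycles(ops))
```

A scalar loop would issue 256 subtract-and-absolute-value operations for the same data, so the instruction count drops by roughly the SIMD width.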
12
![Page 13: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/13.jpg)
Evaluation – Customization Step 1
Speedup: 9.2x
Good: An order of magnitude increase in performance and energy efficiency
Bad: Still two orders of magnitude worse than ASIC
Taken from the presentation slides for this paper at ISCA 2010
13
![Page 14: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/14.jpg)
Evaluation – Customization Step 1
Good: Functional units (FUs) become more energy efficient
Bad: Overheads from other operations are still significant. FUs account for only 10% of total energy consumed.
Taken from the presentation slides for this paper at ISCA 2010
14
![Page 15: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/15.jpg)
Implementation – Customization Step 2
Operation Fusion: Combine several instructions into one
Pseudo-code:
acc = 0;
acc = Addshft(acc, x0, x1, 20);
acc = Addshft(acc, x-1, x2, -5);
acc = Addshft(acc, x-2, x3, 1);
xn = Sat(acc);
Without fusion, a single step such as acc = acc + 20(x0 + x1) takes four instructions:
temp1 = 20 * x0
temp2 = 20 * x1
temp3 = temp1 + temp2
acc = acc + temp3
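The pseudo-code above can be written as a runnable sketch. The version below assumes the standard H.264 six-tap half-sample filter (taps 1, -5, 20, 20, -5, 1), with symmetric pixel pairs (x0, x1), (x-1, x2), and (x-2, x3) feeding each fused step. The rounding shift `(acc + 16) >> 5` and the clamp to 0..255 follow the H.264 half-sample interpolation rule; the slide only shows `Sat(acc)`, so that normalization is an addition here. The names `add_shft`, `sat`, and `six_tap_fused` are illustrative.

```python
def add_shft(acc, a, b, coeff):
    """Model of one fused operation: acc + coeff * (a + b)."""
    return acc + coeff * (a + b)


def sat(value, lo=0, hi=255):
    """Saturate to the 8-bit pixel range."""
    return max(lo, min(hi, value))


def six_tap_fused(x_m2, x_m1, x0, x1, x2, x3):
    """Half-sample value from six neighbours, using three fused steps."""
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, x_m1, x2, -5)
    acc = add_shft(acc, x_m2, x3, 1)
    # Round and divide by 32 (the taps sum to 32), then saturate.
    return sat((acc + 16) >> 5)


def six_tap_unfused(x_m2, x_m1, x0, x1, x2, x3):
    """Same filter as a plain multiply-accumulate, for comparison."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * p for t, p in zip(taps, (x_m2, x_m1, x0, x1, x2, x3)))
    return sat((acc + 16) >> 5)


print(six_tap_fused(100, 100, 100, 100, 100, 100))  # flat region stays at 100
```

The fused form issues three operations where the unfused form needs six multiplies and five adds, which is the saving the slide describes.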
15
![Page 16: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/16.jpg)
Evaluation – Customization Step 2
Good: The compiler can perform operation fusion, so it is “free”.
Bad: No significant change in performance and energy efficiency compared to SIMD + VLIW
Speedup: 15.7x (vs. baseline)
Speedup: 1.7x (vs. SIMD + VLIW)
Taken from the presentation slides for this paper at ISCA 2010
16
![Page 17: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/17.jpg)
Evaluation – Observations So Far…
Exploiting data- and instruction-level parallelism helps
➢ 10x improvement in both performance and energy efficiency
The inefficiency is dominated by data movement
➢ Functional units (FUs) take only 10% of total energy even with SIMD + VLIW
Need large computation with low communication
➢ Specialized hardware that performs hundreds of operations with one instruction
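The 10% figure limits what any further FU-only optimization can achieve. The sketch below is a back-of-envelope, Amdahl-style bound; the function name and the framing are assumptions for illustration and do not come from the paper.

```python
def max_energy_gain(fraction_optimized, component_gain=float("inf")):
    """Amdahl-style bound on the total energy reduction factor.

    fraction_optimized: share of total energy spent in the part being improved.
    component_gain: factor by which that part's energy shrinks (inf = free).
    """
    remaining = (1 - fraction_optimized) + fraction_optimized / component_gain
    return 1 / remaining


# FUs are 10% of the energy: even free FUs save only about 11% overall.
print(max_energy_gain(0.10))
# The other 90% is overhead: removing all of it gives about 10x at best.
print(max_energy_gain(0.90))
```

Under this model, closing a gap of two orders of magnitude requires cutting the overhead per useful operation, which is what the next customization step aims at.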
17
![Page 18: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/18.jpg)
Implementation – Customization Step 3
Special Hardware for FME
18
![Page 19: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/19.jpg)
Implementation – Customization Step 3
Special Hardware for FME
[Figure: filter input pixels x3, x2, x1, x0, x-1, x-2]
19
![Page 20: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/20.jpg)
Implementation – Customization Step 3
Special Hardware for FME
[Figure: filter input pixels x3, x2, x1, x0, x-1]
20
![Page 21: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/21.jpg)
Implementation – Customization Step 3
Special Hardware for FME
[Figure: filter input pixels x4, x3, x2, x1, x0, x-1]
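The three FME slides step from inputs x-2..x3 to x-1..x4, which suggests a register that shifts by one pixel per output. A minimal software sketch of that idea follows, assuming a one-dimensional row and the standard H.264 six-tap half-sample filter. The `deque` stands in for a hardware shift register, and the function name and fetch counter are illustrative only.

```python
from collections import deque

TAPS = (1, -5, 20, 20, -5, 1)  # standard H.264 half-sample filter taps


def half_pel_row(pixels):
    """Slide a six-entry shift register over one row of pixels.

    Returns the filtered outputs and the number of pixel fetches.
    """
    window = deque(maxlen=len(TAPS))  # oldest pixel drops out on append
    outputs = []
    fetches = 0
    for p in pixels:
        window.append(p)              # exactly one new pixel per step
        fetches += 1
        if len(window) == len(TAPS):
            acc = sum(t * x for t, x in zip(TAPS, window))
            # Round, divide by 32, and saturate to the 8-bit range.
            outputs.append(max(0, min(255, (acc + 16) >> 5)))
    return outputs, fetches


row = [100] * 10
values, fetched = half_pel_row(row)
print(values, fetched)
```

Each output reuses five of the six pixels from the previous step, so a row of n pixels needs n fetches instead of 6 × (n − 5).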
21
![Page 22: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/22.jpg)
Implementation – Customization Step 3
Special Hardware for IME
• SAD: Sum of Absolute Differences
• 256 SADs per operation
• Pixel blocks in the red box can shift down and right, maximizing data reuse.
• Minimize bits fetched per operation
Taken from the presentation slides for this paper at ISCA 2010
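For readers unfamiliar with SAD-based motion search, here is a minimal pure-Python sketch. It assumes a 16×16 block, so one block comparison involves 256 absolute differences; reading the slide's "256 SAD / operation" this way is an assumption. The exhaustive search and the function names are illustrative and do not describe the hardware's actual search strategy.

```python
def sad_block(cur, ref, dx, dy, n=16):
    """Sum of absolute differences between an n x n block `cur` and the
    n x n window of `ref` whose top-left corner is at column dx, row dy."""
    return sum(
        abs(cur[r][c] - ref[dy + r][dx + c])
        for r in range(n)
        for c in range(n)
    )


def best_match(cur, ref, n=16):
    """Exhaustive search: try every position of the block inside `ref`
    and return the (dx, dy) with the smallest SAD."""
    positions = [
        (dx, dy)
        for dy in range(len(ref) - n + 1)
        for dx in range(len(ref[0]) - n + 1)
    ]
    return min(positions, key=lambda p: sad_block(cur, ref, p[0], p[1], n))


# Synthetic 20x20 reference frame; the current block is cut out at (3, 2).
ref = [[(7 * r + 13 * c) % 251 for c in range(20)] for r in range(20)]
cur = [row[3:19] for row in ref[2:18]]
print(best_match(cur, ref))
```

Neighbouring search positions overlap in all but one row or column of the window, which is why shifting the stored pixels down or right lets the hardware fetch very few new bits per comparison.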
22
![Page 23: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/23.jpg)
Evaluation – Customization Step 3
Good: Orders of magnitude faster than GP. Energy efficiency within 3x of ASIC.
Bad: May be very difficult to implement in a GP chip. Requires customized data storage, functional units, and instructions.
Speedup: 256x (vs. baseline)
Taken from the presentation slides for this paper at ISCA 2010
23
![Page 24: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/24.jpg)
Evaluation – Customization Step 3
Good: Over 35% of the energy is now spent in the functional units (FUs). More data reuse lowers total energy.
Bad: Most of the code involves “magic” instructions
Taken from the presentation slides for this paper at ISCA 2010
24
Processor Energy Breakdown
![Page 25: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/25.jpg)
Outline
➢Introduction
➢H.264 Basics
➢Key ideas
➢Implementation & Evaluation
➢Summary
➢Conclusion
➢Discussion
25
![Page 26: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/26.jpg)
Summary
➢Basic operations take only hundreds of femtojoules per operation in an ASIC
➢Instruction/data fetch, pipeline registers, control, etc. take 140pJ/op
➢Even with SIMD, VLIW, and operation fusion, the gap is hard to close
➢Need “magic instructions” that each perform hundreds of operations
➢Can be done by coupling customized data storage with a customized datapath
➢Example: using shift registers to enable data reuse and parallel access
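The case for "magic instructions" is an amortization argument, sketched below. The 140 pJ overhead is the figure from the slide, treated here as a fixed cost per instruction. The 0.3 pJ cost of a basic operation is an assumed placeholder for "hundreds of femtojoules" and is not a number from the paper.

```python
OVERHEAD_PJ = 140.0  # fetch/pipeline/control overhead quoted on the slide
OP_PJ = 0.3          # ASSUMED stand-in for "hundreds of femtojoules" per op


def energy_per_op(ops_per_instruction, overhead_pj=OVERHEAD_PJ, op_pj=OP_PJ):
    """Average energy per useful operation when one instruction's
    overhead is shared across `ops_per_instruction` operations."""
    return overhead_pj / ops_per_instruction + op_pj


for n in (1, 16, 256):
    print(n, energy_per_op(n))
```

Under these assumptions the overhead term still dominates at SIMD-like widths and only approaches the cost of the operation itself once an instruction performs hundreds of operations.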
26
![Page 27: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/27.jpg)
Outline
➢Introduction
➢H.264 Basics
➢Key ideas
➢Implementation & Evaluation
➢Summary
➢Conclusion
➢Discussion
27
![Page 28: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/28.jpg)
Conclusion
➢Achieving ASIC-like energy efficiency is difficult because of the large overheads in instruction fetch, pipeline registers, control, etc.
➢Need customized hardware that fits within the processor framework and can execute hundreds of operations per instruction.
➢Need tools that let designers create customized hardware easily, to reduce the cost. (Bridging Moore’s Law performance gap is more about “How Much” does it cost – Prof. Todd Austin)
28
![Page 29: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/29.jpg)
Questions?
29
![Page 30: Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023 ASIC (1280×720 HD) 30 8 4 ... • Pixels blocks in red box can shift down and right,](https://reader033.vdocument.in/reader033/viewer/2022051906/5ff86b9bf41b5851e852c2bd/html5/thumbnails/30.jpg)
Discussion
1. The paper only studied the gap between general-purpose chips and an H.264 ASIC. Do the results also apply to other applications?
2. The authors introduced “magic” instructions that can perform hundreds of operations to achieve ASIC-like efficiency. Is it realistic to have those instructions in a real design?
30