CS671 Parallel Programming in the Many-Core Era
Lecture 4: Introduction to Locality Theory and Practice
Zheng Zhang
Rutgers University
Review: Memory Wall
‣ The processor-memory performance gap
Memory Hierarchy
‣ Hierarchical memory
  * L1, L2, L3 cache
  * scratch-pad, off-chip memory, disk cache ...
  * automatic placement and replacement
  * separation of concerns: data usage vs. coherence management
‣ Trading space for time
  * the faster the access
  * the smaller the data capacity
‣ Software solution
  * exploit locality -- temporal and/or spatial
  * transform computation order or data layout
  * compilers, runtime, performance tuning tools
The Story of the Locality Theory
‣ Started as an empirical observation
  “During any interval of execution, a program favors a subset of its pages, and this set of favored pages changes slowly” -- Peter Denning
‣ How to quantify?
  * the performance of a machine
  * the demand of a program
  * the locality of an operation
  * is there a “primary” metric?
‣ Two example quantities
  * reuse time & footprint
Locality Statistics
‣ Miss Ratio
‣ Reuse Distance
‣ Footprint
Cache Miss Ratio
‣ Cache performance of the integer portion of SPEC CPU2000
Reuse Distance
‣ Reuse distance of an access to datum d
  the number of distinct data elements accessed since the previous access to d
‣ Locality signature of an execution
  the distribution of all finite reuse distances; it determines the working set size and the miss rate of (fully associative LRU) caches of all sizes
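To illustrate how a reuse-distance distribution can yield miss rates for all cache sizes at once, here is a minimal sketch. It assumes a fully associative LRU cache and the convention that an access hits when its reuse distance is smaller than the cache size; the function name `miss_ratio_curve` and the use of `math.inf` for first accesses are choices made for this example, not part of the lecture.

```python
import math
from collections import Counter

def miss_ratio_curve(reuse_distances, cache_sizes):
    """Miss ratio of a fully associative LRU cache for each requested size.

    reuse_distances holds one entry per access; math.inf marks a first
    access (a cold miss). Assumed convention: an access with reuse
    distance r hits in a cache of c elements exactly when r < c.
    """
    total = len(reuse_distances)
    if total == 0:
        return {c: 0.0 for c in cache_sizes}
    histogram = Counter(reuse_distances)  # the "locality signature"
    curve = {}
    for c in cache_sizes:
        misses = sum(n for dist, n in histogram.items() if dist >= c)
        curve[c] = misses / total
    return curve

# Trace "abcabc": three cold misses, then three reuses at distance 2.
example = [math.inf, math.inf, math.inf, 2, 2, 2]
curve = miss_ratio_curve(example, [1, 2, 3])
```

In this example only the cache of size 3 captures the reuses, so one histogram answers the question for every cache size without re-simulating the trace.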
Reuse Distance Calculation I
‣ Naive counting: O(N) time per access, O(N) space
  -- N is the number of memory accesses
  -- M is the number of distinct data elements
‣ Too costly: N up to 120 billion, M 25 million
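A minimal sketch of what naive counting could look like, assuming the whole trace is held in memory and that a first access is reported as `math.inf`. The name `reuse_distances_naive` is hypothetical. Each access may rescan a stretch of the trace as long as the trace itself, which is where the O(N) time per access comes from.

```python
import math

def reuse_distances_naive(trace):
    """Reuse distance of every access, by rescanning the trace.

    For each access, count the distinct elements seen strictly between
    it and the previous access to the same element.
    """
    last_seen = {}  # element -> index of its most recent access
    distances = []
    for i, d in enumerate(trace):
        if d in last_seen:
            between = trace[last_seen[d] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(math.inf)  # first access: no finite distance
        last_seen[d] = i
    return distances

result = reuse_distances_naive("abcba")
```

For the trace "abcba", the second access to b sees only c in between (distance 1), and the second access to a sees b and c (distance 2).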
Reuse Distance Calculation II
‣ Stack algorithm [Mattson+ IBM 70]
  -- O(M) time per access, O(M) space
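The idea behind a stack-based measurement can be sketched as follows. This is an illustrative version, not the original formulation: it keeps an LRU stack as a Python list with the most recently used element at the front, so the depth at which an element is found equals its reuse distance. The name `reuse_distances_stack` is an assumption of this example.

```python
import math

def reuse_distances_stack(trace):
    """Reuse distances via an LRU stack (most recent element first).

    The stack holds each distinct element once, so both the search and
    the storage are bounded by M, the number of distinct elements.
    """
    stack = []
    distances = []
    for d in trace:
        if d in stack:
            depth = stack.index(d)  # distinct elements used since d
            stack.pop(depth)
            distances.append(depth)
        else:
            distances.append(math.inf)  # first access
        stack.insert(0, d)  # d becomes the most recently used
    return distances

result = reuse_distances_stack("abcba")
```

Compared with rescanning the trace, the work per access no longer grows with the trace length, only with the number of distinct elements.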
Reuse Distance Calculation III
‣ Tree-based algorithm -- search tree [Olken LBL 81, Sugumar & Abraham UM 93]
  -- O(log M) time per access, O(M) space
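One way to get logarithmic time per access is sketched below. Note the substitution: instead of the balanced search tree of the cited work, this example uses a Fenwick (binary indexed) tree over trace positions, which is simpler to write but costs O(log N) time per access and O(N) space rather than O(log M) and O(M). The names `Fenwick` and `reuse_distances_tree` are hypothetical.

```python
import math

class Fenwick:
    """Binary indexed tree over positions 0..n-1 holding 0/1 marks."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        i += 1  # internal indices are 1-based
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of the marks at positions 0..i-1."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

def reuse_distances_tree(trace):
    # Invariant: exactly one mark per distinct element, placed at the
    # position of its most recent access.
    marks = Fenwick(len(trace))
    last_seen = {}
    distances = []
    for i, d in enumerate(trace):
        if d in last_seen:
            p = last_seen[d]
            # Marks strictly between p and i = distinct elements in between.
            distances.append(marks.prefix_sum(i) - marks.prefix_sum(p + 1))
            marks.add(p, -1)  # d's most recent access moves to i
        else:
            distances.append(math.inf)
        marks.add(i, 1)
        last_seen[d] = i
    return distances

result = reuse_distances_tree("abcba")
```

The key observation carries over to the search-tree formulation: if each element is recorded only at its last access time, the reuse distance is a range count over time stamps.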
Reuse Distance Calculation III
• Stack algorithm [Mattson+ IBM 70]: O(M) time per access, O(M) space
• Search tree [Olken LBL 81, Sugumar & Abraham UM 93]: O(log M) time per access, O(M) space
• Space cost remains a major problem
• [Ding+ PLDI’03/TOPLAS’09]: O(N log log M) time and O(log M) space
Footprint
‣ Amount of data accessed in an execution period
  • fp(w): average footprint of ALL windows of length w
  • length-n trace, O(n^2) windows
  • 1 billion accesses, half quintillion windows
‣ Example: “abbb”
  • 3 length-2 windows: “ab”, “bb”, “bb”
  • footprints 2, 1, 1
  • the average fp(2) = (2 + 1 + 1)/3 = 4/3
‣ Example: “xyz xyz”
  • fp(i) = i for 0 <= i <= 3
  • fp(i) = 3 for i > 3
Reuse Time? [Xiang+ ASPLOS’13]
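The two examples can be checked with a brute-force sketch that enumerates every window of a given length. This is only the definition written out, not an efficient measurement method; `average_footprint` is a name chosen for this example, and exact fractions are used so that 4/3 is not rounded.

```python
from fractions import Fraction

def average_footprint(trace, w):
    """fp(w): mean number of distinct elements over all length-w windows."""
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    windows = [trace[i : i + w] for i in range(n - w + 1)]
    return Fraction(sum(len(set(win)) for win in windows), len(windows))

fp_abbb = average_footprint("abbb", 2)      # windows "ab", "bb", "bb"
fp_xyz = [average_footprint("xyzxyz", w) for w in range(1, 7)]
```

A length-n trace has n(n+1)/2 windows in total, which is about 5 x 10^17 for n = 10^9, so enumerating windows like this does not scale beyond toy traces.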
Footprint Measurement
‣ Working set
  limit value in an infinitely long trace [Denning & Schwartz 1972]
‣ Direct counting
  single window size [Thiebaut & Stone TOCS’87], seminal paper on footprints in shared cache
‣ Statistical approximation
  [Denning & Schwartz 1972; Suh et al. ICS’01; Berg & Hagersten ISPASS’04; Chandra et al. HPCA’05; Shen et al. POPL’07]
‣ Precise definition/solution
  footprint distribution, O(n log m) [Xiang et al. PPoPP’11]
  footprint function, O(n) [Xiang et al. PACT’11]