TRANSCRIPT
Refrint: Intelligent Refresh to Minimize Power in On-Chip Multiprocessor Cache Hierarchies
Aditya Agrawal, Prabhat Jain, Amin Ansari and Josep Torrellas
University of Illinois at Urbana Champaign
http://iacoma.cs.uiuc.edu
Motivation
• As Vdd decreases, leakage power becomes more important
• On-chip SRAM memories are a major contributor to leakage
• eDRAMs have low leakage power
– Already used as the LLC in IBM POWER7
• Problem: Refresh energy
• Goal: Only refresh the lines that will be used soon
Feb 26, 2013, HPCA 2013, Shenzhen, China
Contributions
• Refrint: Intelligent fine-grained refresh of eDRAMs
• Only refresh lines which will be used soon
• Don’t refresh lines that are inactive or frequently used
• Significant energy reductions with Refrint
E(Refrint eDRAM cache hierarchy) / E(SRAM cache hierarchy) = 0.30
E(Conventional eDRAM cache hierarchy) / E(SRAM cache hierarchy) = 0.56
Outline
• Motivation and Contribution
• Refrint
– Sources of Unnecessary Refreshes
– Time-based policy
– Data-based policy
• Implementation
• Evaluation Setup
• Results
• Conclusion
Unnecessary Refreshes
Two sources of unnecessary refreshes
• Cold lines
– Not accessed or accessed far apart in time
– Found in lower level caches like L3
– Propose: data-based policies (What to refresh?)
[Figure: timeline of a cold line, showing the last access followed by unnecessary refreshes until the retention time expires]
Unnecessary Refreshes
• Hot lines
– Actively accessed (and automatically refreshed)
– Found in upper level caches like L2
– Propose: time-based policies (When to refresh?)
[Figure: timeline of a hot line, showing accesses that implicitly refresh it within the retention time, making the scheduled refresh unnecessary]
Time-Based Policy: Polyphase
• For hot lines: decides when to refresh
• The retention period is divided into Phases
• Each cache line records the phase when it was last accessed
• A line is refreshed only when the same phase arrives in the next retention period
[Figure: retention time divided into phases 0-3. After an access, Periodic refreshes the line at the end of the retention period, while Polyphase refreshes it only when the access phase recurs in the next retention period]
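To make the Periodic vs. Polyphase difference concrete, here is a small, idealized Python sketch (my own illustration, not the paper's implementation; it treats time as continuous rather than as 2^N discrete phases): an access implicitly refreshes a line, so under Polyphase a hot line that is touched more often than once per retention period never needs an explicit refresh, while Periodic refreshes it at every period boundary anyway.

```python
RETENTION = 100  # retention period in arbitrary time units (assumed value)

def refreshes_periodic(horizon, retention=RETENTION):
    """Periodic policy: refresh at every period boundary, regardless of accesses."""
    return list(range(retention, horizon, retention))

def refreshes_polyphase(accesses, horizon, retention=RETENTION):
    """Idealized Polyphase: a refresh is due one full retention period after
    the last access or refresh; an access implicitly refreshes the line."""
    events = sorted(accesses)
    if not events:
        return []
    out, last, idx = [], events[0], 1
    while True:
        deadline = last + retention
        # accesses before the deadline implicitly refresh the line
        while idx < len(events) and events[idx] <= deadline:
            last = events[idx]
            idx += 1
            deadline = last + retention
        if deadline >= horizon:
            break
        out.append(deadline)  # explicit refresh needed
        last = deadline
    return out
```

For a line accessed every 60 units over a 300-unit window, Periodic issues two refreshes while Polyphase issues none.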
Polyphase Effectiveness
• Polyphase pays off when the frequency of accesses exceeds the refresh rate
• True for higher-level caches, but their refresh energy is small
• LLCs have high refresh energy but few accesses
• LLCs can benefit from Polyphase under
– Fine-grained sharing (repeated writebacks and reads)
– Significant conflict in higher level caches
– Accesses bypassing higher level caches
[Figure: Polyphase effectiveness as a function of access frequency, relative to 1/retention time]
Data-Based Policy
• For cold lines: decides what to refresh
• 4 simple policies using the line state
– All: refresh all lines
– Valid: refresh only valid lines (both clean and dirty)
– Dirty: refresh only dirty lines
– WB(n,m): let an idle dirty line be refreshed n times before writing it back, and an idle valid clean line be refreshed m times before invalidating it
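A minimal sketch of how the four data-based policies could decide a line's fate at refresh time (an assumed simplification: line state reduced to valid/dirty flags plus an idle-refresh count for WB):

```python
def data_policy_action(policy, valid, dirty, idle_count, n=4, m=4):
    """Return the action for one line coming up for refresh:
    'refresh', 'writeback', 'invalidate', or 'skip'."""
    if policy == "All":
        return "refresh"                                # refresh every line
    if policy == "Valid":
        return "refresh" if valid else "skip"           # clean and dirty lines
    if policy == "Dirty":
        return "refresh" if valid and dirty else "skip"
    if policy == "WB":
        # WB(n,m): an idle dirty line is refreshed n times before writeback;
        # an idle valid clean line is refreshed m times before invalidation
        if not valid:
            return "skip"
        if idle_count < (n if dirty else m):
            return "refresh"
        return "writeback" if dirty else "invalidate"
    raise ValueError(policy)
```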
Data-Based Policy: Effectiveness
[Figure: 2x2 categorization of applications as seen from the LLC, by footprint (small vs. large) and visibility (low vs. high). Class 1 applications suit WB(n,m) with small n,m; Class 2 suit WB(n,m) with large n,m; Class 3 suit the Valid policy]
Outline
• Motivation and Contribution
• Refrint
• Implementation
– Key Ideas
– Hardware Support
– Operation
• Issuing Refresh Request
• Processing Refresh Request
• Evaluation Setup
• Results
• Conclusion
Key Ideas
• Retention time is divided into 2^N intervals: the Global Phase
• Each cache line has N bits: Local Phase
• Phase Array:
– A hardware structure in the cache controller
– Holds the N local-phase bits and a copy of the valid bit for each cache line
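A one-liner makes the phase arithmetic concrete (values assumed from the evaluation setup: 50 us retention; N = 2 gives the 4-phase configuration):

```python
N = 2                 # local-phase bits per line (assumed)
NUM_PHASES = 1 << N   # retention period split into 2**N = 4 global phases
RETENTION_US = 50     # eDRAM retention time, from the evaluation setup

def global_phase(time_us):
    """N-bit phase index of an absolute time within its retention period."""
    return int(time_us % RETENTION_US) * NUM_PHASES // RETENTION_US
```

On each access, storing `global_phase(now)` into the line's local-phase bits is all the bookkeeping a line needs.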
Hardware Support
[Figure: block diagram of the cache controller. The data array (State, Data + Tag) is paired with a Phase Array holding each line's local-phase bits, valid bit, and Count; comparators match local phases against the Global Phase, and decision logic turns matches into refresh requests alongside normal R/W requests]
Operation: Issuing Refresh
# On a normal read or write access:
    local phase = global phase

# At the beginning of each global phase:
    hold all read and write requests
    for (all lines of the cache) {
        if ((global phase == local phase) && (line is Valid))
            issue a refresh request   # processing on the next slide
    }
    release read and write requests
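The same issuing logic as runnable Python (an assumed representation: the phase array as a list of (local_phase, valid) pairs):

```python
def issue_refreshes(phase_array, global_phase):
    """At the start of a global phase, return the indices of valid lines
    whose local phase matches and therefore need a refresh request."""
    return [i for i, (local_phase, valid) in enumerate(phase_array)
            if valid and local_phase == global_phase]

def on_access(phase_array, line, global_phase):
    """A normal read/write implicitly refreshes the line: record the
    current global phase as its local phase."""
    _, valid = phase_array[line]
    phase_array[line] = (global_phase, valid)
```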
Operation: Processing Refresh
– All: refresh all lines
– Valid: refresh all valid lines
– Dirty: refresh all dirty lines; invalidate clean lines
– WB(n,m): see the next slide
Processing Refresh for WB(n,m)
if (Count >= 1)
    refresh line
    Count--
else if (Dirty == 1)
    write back
    State = Valid Clean
    Count = m
else if (Valid == 1)
    invalidate
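The WB(n,m) processing step above, as a runnable Python sketch with the line state held in a dict (an assumed representation; Count is assumed to start at n when a line becomes dirty):

```python
def process_refresh_wb(line, m):
    """Process one refresh request under WB(n,m).
    line: {'count': int, 'valid': bool, 'dirty': bool}. Returns the action."""
    if line['count'] >= 1:
        line['count'] -= 1     # idle budget remains: refresh the line
        return 'refresh'
    if line['dirty']:
        line['dirty'] = False  # budget exhausted: write back, keep valid clean
        line['count'] = m      # clean line now gets m idle refreshes
        return 'writeback'
    if line['valid']:
        line['valid'] = False  # clean budget exhausted: invalidate
        return 'invalidate'
    return 'none'
```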
Outline
• Motivation and Contribution
• Refrint
• Implementation
• Evaluation Setup
• Results
• Conclusion
Architectural Parameters
Simulated Architectural Parameters
Chip: 16-core CMP
Core: MIPS32, 2-issue out-of-order
IL1 (SRAM): 32 KB, 2-way
DL1 (SRAM): 32 KB, 4-way, private
L2 (eDRAM): 256 KB, 8-way, private
L3 (eDRAM): 16 MB, 16 banks, shared
L3 bank: 1 MB, 8-way
Line size: 64 bytes
Network: 4 × 4 torus
Coherence: MESI directory protocol at L3
Technology Parameters & Tools
Technology Parameters
Technology node: 32 nm
Frequency: 1000 MHz
Device type: LOP (low operating power)
Tools and Applications
Architectural simulator: SESC
Timing and power: McPAT & CACTI
Applications: SPLASH-2 and PARSEC
Assumptions
Parameters for L2 and L3
eDRAM access time = SRAM access time
eDRAM access energy = SRAM access energy
eDRAM leakage power = (1/8) × SRAM leakage power
eDRAM line refresh time = eDRAM line access time
eDRAM line refresh energy = eDRAM line access energy
Parameter Sweep
Retention time: 50 us
Timing policy: Periodic; Polyphase (1 phase); Polyphase (2 phases); Polyphase (4 phases)
Data policy: All; Valid; Dirty; WB(4,4); WB(8,8); WB(16,16); WB(32,32)
Total combinations: 29 (4 × 7 = 28, plus the SRAM baseline)
Outline
• Motivation and Contribution
• Refrint
• Implementation
• Evaluation Setup
• Results
– Cache hierarchy (+ DRAM access) energy
– Total energy
– Execution time
– Effectiveness of Polyphase
• Conclusion
Plots
• Retention time: 50 us
• 4 time-based policies
– Periodic (P)
– Polyphase with 1 phase (PP1)
• Lookup for invalid lines is done in the phase array, saving cycles w.r.t. P
– Polyphase with 2 and 4 phases (PP2, PP4)
• 7 data-based policies
– All, Valid, Dirty
– WB(4,4), WB(8,8), WB(16,16), WB(32,32)
Note: Baseline is SRAM cache hierarchy
Cache Hierarchy Energy
• Very large reduction in refresh energy
• PP1, PP2, and PP4 do better than Periodic (P)
• PP1 is as good as PP2 and PP4
• WB(32,32) does better than other data-policies
Total On-chip Energy
• Same trends as cache hierarchy energy
• Conventional eDRAM hierarchy consumes 77% of baseline
• Refrint eDRAM hierarchy consumes 58% of baseline
Execution Time
• Conventional eDRAM hierarchy slows down by 25%
• Refrint eDRAM hierarchy slows down by only 6%
Effectiveness of Polyphase
• Kernel with fine-grained sharing: L3 sees updates more often than the refresh rate
• Polyphase PP4 saves significant energy
• Across all data policies: PP4 > PP2 > PP1
Outline
• Motivation and Contribution
• Refrint
• Implementation
• Evaluation Setup
• Results
– L3 and L2 energy
– Total energy
– Execution time
– Effectiveness of Polyphase
• Conclusion
Conclusion
• eDRAM + Refrint shaves away most of the refresh energy
– Refrint eDRAM hierarchy
• Consumes 30% of baseline cache hierarchy energy
• Slowdown of 6%
– Conventional eDRAM hierarchy
• Consumes 56% of baseline cache hierarchy energy
• Slowdown of 25%
• Simple hardware implementation
Questions
Thanks! 谢谢 (Thank you)