CSE 661 PAPER PRESENTATION
PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING
By C. J. Hughes et al.
Presented by SALAMI, Hamza Onoruoiza (g201002240)
OUTLINE OF PRESENTATION
Throughput Computing
Benchmarks Used
Degree of Sharing of L2 Caches in Benchmarks
Cache Designs Considered
Experimental Setup
Results (Performance and Energy)
Possible Improvements
Final Results
Conclusion, Comments and Questions
THROUGHPUT COMPUTING
Performing a huge number of computations with large amounts of parallelism
Also known as GPGPU
BENCHMARKS USED
Working set: 64KB – 2MB
64 threads, each with a private 32KB cache
256KB L2 cache
L2 < 2MB may result in bad performance
[Figure: L1 miss rate without prefetching]
BENCHMARKS USED (2)
[Figure: L1 miss rate without prefetching]
DEGREE OF SHARING OF L2 CACHE IN BENCHMARKS
SHARING DEGREE
◦ Spatial: fraction of each cache line accessed. Most data is private, except for svm.
◦ Temporal: fraction of accesses to a line. Shared data is prevalent, e.g. in pcg, 0.1% of lines are involved in global reads/writes yet account for 19.2% of L2 cache accesses.
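The two sharing metrics above can be made concrete with a small sketch. This is illustrative only: the trace format (a list of `(line, core)` access pairs) and the function name are assumptions, not the paper's methodology.

```python
# Hypothetical sketch of the two sharing metrics on an L2 access trace.
# Spatial sharing  = fraction of distinct lines touched by more than one core.
# Temporal sharing = fraction of all accesses that go to those shared lines.
# The (line, core) trace format is an assumption made for illustration.
from collections import defaultdict

def sharing_degrees(trace):
    cores_per_line = defaultdict(set)
    hits_per_line = defaultdict(int)
    for line, core in trace:
        cores_per_line[line].add(core)
        hits_per_line[line] += 1
    shared = {l for l, cores in cores_per_line.items() if len(cores) > 1}
    spatial = len(shared) / len(cores_per_line)
    temporal = sum(hits_per_line[l] for l in shared) / len(trace)
    return spatial, temporal
```

With such metrics, a pcg-like benchmark would show a tiny spatial fraction but a much larger temporal fraction, matching the 0.1% vs. 19.2% figures on this slide.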
CACHE DESIGNS CONSIDERED
ASSUMPTIONS
Two-level caching (private L1, varying L2); inclusive cache
Directory-based coherence
Tiled design (Tile = Core + Private Caches + Switch)
CACHE DESIGNS CONSIDERED (2)
1) PRIVATE LLC
LLC is private to each tile's core
Most flexible design (replicas of a cache line can exist in all LLCs simultaneously)
Fewer unique cache lines => more LLC misses
Each tile contains a tag directory
Hash function (cache block address) = home tile
The home tile provides info on which LLC(s) hold the required data
Cache-to-cache transfer takes place
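The "hash function (cache block address) = home tile" step can be sketched in a few lines. This is a minimal illustration, not the paper's actual hash: the tile count, block size, and modulo hash are all assumed values.

```python
# Hypothetical sketch: mapping a cache-block address to its home tile,
# whose tag directory tracks which LLC(s) hold the block.
NUM_TILES = 64    # assumed: one tile per core, as in the 64-thread setup
BLOCK_SIZE = 64   # assumed: 64-byte cache lines

def home_tile(block_address: int) -> int:
    """Hash a block address to the tile that serves as its home."""
    block_number = block_address // BLOCK_SIZE
    return block_number % NUM_TILES  # simple modulo hash (an assumption)
```

On a miss, the requesting tile would query `home_tile(addr)`'s tag directory, which then points it at an LLC holding the line for a cache-to-cache transfer.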
CACHE DESIGNS CONSIDERED (3)
2) UNCONTROLLED REPLICATION
Similar to Private LLC
Tries to increase the no. of unique lines
Eviction of a cache block with one sharer? Move the block to its home tile. Already in its home tile? Evict it from the chip.
CACHE DESIGNS CONSIDERED (4)
3) CONTROLLED REPLICATION
Builds on Uncontrolled Replication
Tries to further increase the no. of unique lines
Each block has a reference bit. Reference bit = 1 => likely part of the working set
Duplicate copies of cache blocks not in active use are favored for LRU eviction
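Controlled Replication's eviction preference can be sketched as a victim-selection routine. The data structures and field names here are illustrative assumptions; the slide only specifies the policy (prefer duplicates whose reference bit is 0).

```python
# Hypothetical sketch of Controlled Replication's victim selection:
# duplicate copies whose reference bit is 0 (not recently referenced)
# are preferred for eviction over unique or recently used blocks.
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    is_replica: bool  # another LLC also holds a copy of this block
    ref_bit: int      # 1 => likely part of the working set
    lru_age: int      # larger = older

def choose_victim(cache_set):
    # First choice: the oldest replica with ref_bit == 0.
    candidates = [l for l in cache_set if l.is_replica and l.ref_bit == 0]
    if not candidates:
        candidates = cache_set  # otherwise fall back to plain LRU
    return max(candidates, key=lambda l: l.lru_age)
```

Evicting idle duplicates first keeps more unique blocks on chip, which is exactly how this design raises effective cache capacity over Uncontrolled Replication.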
CACHE DESIGNS CONSIDERED (5)
4) NO REPLICATION
Limited flexibility
Cache lines reside in at most one LLC at a time
Shared lines are held in the line's home tile's LLC (=> easy accessibility)
Private lines are held in the user's LLC (the RDP points to the line's location)
Eviction of a private line, or an increased number of sharers, returns the block to its home LLC
CACHE DESIGNS CONSIDERED (6)
5) SHARED
Least flexibility
All cache lines reside in their home LLC
Easy to find lines
Increased average access latency and on-die traffic for private lines
CACHE DESIGNS CONSIDERED (7)
CACHE DESIGNS: Private | Uncontrolled Replication | Controlled Replication | No Replication | Shared
Compared along four dimensions:
◦ Flexibility
◦ Reduction in on-die bandwidth usage
◦ Effective cache capacity (no. of unique blocks)
◦ Reduction in off-die bandwidth usage
(Flexibility decreases from Private to Shared, while effective cache capacity, and with it the reduction in off-die bandwidth usage, increases.)
EXPERIMENTAL SETUP
A simulator is used
L1 has a hardware stride prefetcher
Energy consumption components:
• Storage energy: tag and cache-line accesses to the LLC, tag directory and RDP
• On-die data messages
• On-die coherence messages
• Off-die accesses
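The energy accounting above can be sketched as a simple weighted sum over event counts. The per-event costs below are made-up placeholders for illustration; the paper's actual energy parameters are not given on this slide.

```python
# Illustrative sketch of the evaluated energy breakdown: total energy is
# the sum over the four modeled components, each weighted by a per-event
# cost. The costs here are invented placeholders, not the paper's numbers.
ENERGY_PER_EVENT = {             # assumed units: nJ per event
    "storage_access": 0.5,       # LLC / tag-directory / RDP tag+line access
    "on_die_data_msg": 1.0,
    "on_die_coherence_msg": 0.3,
    "off_die_access": 20.0,      # off-die accesses dominate per event
}

def total_energy(event_counts: dict) -> float:
    """Sum modeled energy over the four components."""
    return sum(ENERGY_PER_EVENT[k] * n for k, n in event_counts.items())
```

Because off-die accesses are the most expensive events, designs that keep more unique lines on chip can win on energy even if they add some on-die traffic.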
RESULTS (PERFORMANCE)
• The least flexible designs offer better performance!
• Least flexible designs:
  • High throughput to heavily read/written lines (on a miss, the home tile responds directly; no need for an acknowledgement)
  • A single write causes invalidations for readers (less impact for the centralized-data designs, worse for the flexible designs)
• Flexible designs:
  • No centralized data storage
  • No overlapped cache-to-cache transfers; the directory waits for an acknowledgement from the sending tile before processing another request
RESULTS (ENERGY)
• Flexible designs consume significantly less energy than the other designs!
• Flexible designs minimize on-die traffic because of replication
• Off-die traffic increases (fewer unique lines), but most lines have few sharers. See Figure 1
• On-die traffic for No Replication is better than Shared due to data migration
• Off-die traffic decreases as we move from Private to Uncontrolled Replication to Controlled Replication
RESULTS SO FAR…
• Flexible designs are more energy efficient
• Less flexible designs offer better performance
• Controlled Replication uses the least energy
• Can we improve its parallelism for handling multiple reads of the same cache line?
POSSIBLE IMPROVEMENTS
Tag Directory Buffer
◦ A small, fully associative buffer added to the tag directory to hold clean lines having at least 3 sharing readers (similar to the Shared design)
Tag Directory Buffer All
◦ Similar to Tag Directory Buffer
◦ In this case, all read-shared lines are placed in the tag directory buffer
A four-entry buffer of 256 bytes is used
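The Tag Directory Buffer idea can be sketched as a tiny cache at the tag directory. The four-entry size matches the slide; the admission check, FIFO replacement, and class interface are assumptions made for illustration.

```python
# Hypothetical sketch of the Tag Directory Buffer: a tiny, fully
# associative buffer at the tag directory that holds clean lines with at
# least 3 sharing readers, so read misses can be served directly rather
# than through a serialized cache-to-cache transfer.
class TagDirectoryBuffer:
    def __init__(self, entries=4):          # four entries, per the slide
        self.entries = entries
        self.data = {}                      # line address -> clean data

    def maybe_admit(self, line, data, num_sharers, dirty):
        if dirty or num_sharers < 3:
            return                          # only clean, read-shared lines
        if len(self.data) >= self.entries:
            # Evict the oldest entry (FIFO replacement is an assumption).
            self.data.pop(next(iter(self.data)))
        self.data[line] = data

    def lookup(self, line):
        return self.data.get(line)          # hit => serve the read here
```

The "All" variant would simply drop the `num_sharers < 3` admission check and accept every read-shared line.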
POSSIBLE IMPROVEMENTS (2)
Sharing Migration
◦ Similar to Tag Directory Buffer
◦ However, uses the home tile's LLC instead of a buffer
Sharing Migration All
◦ Similar to Tag Directory Buffer All
◦ Uses the home tile's LLC instead of a buffer
Parallel Reads
◦ Allows simultaneous (overlapped) cache-to-cache transfers of the same cache line for reads
FINAL RESULTS
Tag Directory Buffer provides the highest performance and close to the least energy consumption. See also Figure 3
CONCLUDING REMARKS, COMMENTS AND QUESTIONS
THANK YOU