Xian-He Sun
C-AMAT: Concurrent Average Memory Access Time
Xian-He Sun
Illinois Institute of Technology, April 2015
With Yuhang Liu and Dawei Wang
Outline
Motivation
Memory System and Metrics
C-AMAT: Definition and Contribution
Experimental Design and Verification
Application and Related Work
Conclusion
References
X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.
D. Wang and X.-H. Sun, "APC: A Novel Memory Metric and Measurement Methodology for Modern Memory System," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626-1639, 2014.
Motivation
Processor is 400x faster than memory, and applications become more data intensive
Data access becomes THE performance bottleneck of high-end computing
Many concurrency-based technologies have been developed to improve data access speed, but their impact on final performance is elusive, so they are not fully utilized
Existing memory optimization strategies are still primarily based on the sequential single-access assumption
Memory Wall Problem
[Figure: Processor-DRAM memory gap. µProc performance grows 1.52x/yr (2x every 1.5 years, "Moore's Law"), later 1.20x/yr; DRAM performance grows 7%/yr (2x every 10 years); the processor-memory performance gap grows 50% per year]
• 1980: no cache in microprocessor; 2010: 3-level cache on chip, 4-level cache off chip
• 1989: the first Intel processor with on-chip L1 cache was the Intel 486, 8KB size
• 1995: the first Intel processor with on-chip L2 cache was the Intel Pentium Pro, 256KB size
• 2003: the first Intel processor with on-chip L3 cache was the Intel Itanium 2, 6MB size
Source: Computer Architecture: A Quantitative Approach
Extremely Unbalanced Operation Latency
[Figure: operation latency in cycles; IO access takes 5~15M cycles]
Data Access becomes Performance Bottleneck
[Figures: GROMACS (molecular dynamics); MPQC (Massively Parallel Quantum Chemistry); Multi-Grid solver (CFD); Microstructure]
Data Access becomes Performance Bottleneck
Computational Fluid Dynamics
Data mining
Computational Finance
Adaptive Multigrid
Solution: Memory Hierarchy
Levels from upper (faster) to lower (larger), with capacity and access time:
Registers: <8KB, <0.2~0.5 ns
L1 Cache: <128KB, 0.5-1 ns
L2 Cache: <50MB, 1-10 ns
Main Memory: gigabytes, 50-100 ns
Disk: terabytes, 5 ms
Staging/transfer unit between adjacent levels:
Registers and L1 Cache: instruction operands, 1-8 bytes, managed by program/compiler
L1 Cache and L2 Cache: blocks, 32-128 bytes, managed by L1 cache controller
L2 Cache and Memory: blocks, 32-128 bytes, managed by L2 cache controller
Memory and Disk: pages, 4K-4M bytes, managed by OS
Data Access Concurrency Exists
Solution: Memory Hierarchy & Parallelism
CPU: multi-core, multi-threading, multi-issue; out-of-order execution, speculative execution, runahead execution
Cache: multi-banked cache, multi-level cache; pipelined cache, non-blocking cache, data prefetching, write buffer
Memory: multi-channel, multi-rank, multi-bank; pipeline, non-blocking, prefetching, write buffer
Disks: parallel file system, input-output (I/O)
Extremely Unbalanced Operation Latency
[Figure: operation latency in cycles; IO access takes 5~15M cycles]
Assumption of Current Solutions
Memory Hierarchy: Locality
Concurrency: Data access pattern
o Data stream
Performance varies widely
Existing Memory Metrics
Miss Rate (MR)
o {the number of missed memory accesses} over {the number of total memory accesses}
Misses Per Kilo-Instructions (MPKI)
o {the number of missed memory accesses × 1000} over {the number of total committed instructions}
Average Miss Penalty (AMP)
o {the sum of all single-miss latencies} over {the number of missed memory accesses}
Average Memory Access Time (AMAT)
o AMAT = Hit time + MR × AMP
Flaw of Existing Metrics
o They focus on a single component, or
o on a single memory access
o They miss memory parallelism/concurrency
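As a rough illustration of how the four metrics above relate, the following Python sketch computes them from raw counts. The function names and all input numbers are hypothetical, chosen only to keep the arithmetic easy to follow; they are not measurements from the talk.

```python
def miss_rate(misses, accesses):
    """MR: missed memory accesses over total memory accesses."""
    return misses / accesses

def mpki(misses, instructions):
    """MPKI: misses per one thousand committed instructions."""
    return misses * 1000 / instructions

def average_miss_penalty(miss_latencies):
    """AMP: sum of single-miss latencies over the number of misses."""
    return sum(miss_latencies) / len(miss_latencies)

def amat(hit_time, mr, amp):
    """AMAT = Hit time + MR x AMP."""
    return hit_time + mr * amp

# Hypothetical workload: 1000 accesses, 4000 committed instructions,
# 50 misses of 20 cycles each, and a 2-cycle hit time.
miss_latencies = [20] * 50
mr = miss_rate(len(miss_latencies), 1000)
amp = average_miss_penalty(miss_latencies)
print("MR   =", mr)
print("MPKI =", mpki(len(miss_latencies), 4000))
print("AMP  =", amp)
print("AMAT =", amat(2, mr, amp))
```

Nothing in this calculation records whether two misses overlapped in time, which is the flaw noted on the slide.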
Concurrent AMAT (C-AMAT)
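$$\text{C-AMAT} = \frac{H}{C_H} + pMR \times \frac{pAMP}{C_M}$$
• H is the hit time
• C_H is the hit concurrency
• C_M is the pure miss concurrency
• pMR and pAMP are the pure-miss ratio and the average pure-miss penalty
• A pure-miss cycle is a miss cycle during which there is no hit
For comparison, the sequential metric is
$$\text{AMAT} = H + MR \times AMP$$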
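The two formulas can be compared directly in a few lines of Python. This is only a sketch: the function names and the input values are hypothetical. It shows that the C-AMAT expression reduces to AMAT when both concurrencies are 1 and every miss is a pure miss, and that it shrinks once hits and pure misses overlap.

```python
def amat(hit_time, mr, amp):
    """AMAT = H + MR * AMP."""
    return hit_time + mr * amp

def c_amat(hit_time, c_h, pmr, pamp, c_m):
    """C-AMAT = H / C_H + pMR * pAMP / C_M."""
    return hit_time / c_h + pmr * pamp / c_m

# Hypothetical sequential case: no overlap, so C_H = C_M = 1,
# pMR = MR and pAMP = AMP.
sequential = c_amat(hit_time=2, c_h=1, pmr=0.4, pamp=2, c_m=1)

# Hypothetical concurrent case: two hits overlap on average, half of
# the misses are hidden behind hits, and two pure misses overlap.
concurrent = c_amat(hit_time=2, c_h=2, pmr=0.2, pamp=2, c_m=2)

print("AMAT              =", amat(2, 0.4, 2))
print("C-AMAT sequential =", sequential)
print("C-AMAT concurrent =", concurrent)
```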
Different perspectives
Sequential perspective: AMAT
Concurrent perspective: C-AMAT
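[Figure: timeline of five overlapping accesses (Access 1 to Access 5), each with a hit phase; some are followed by a miss phase. Miss cycles that overlap a hit phase of another access are hit/miss cycles; miss cycles with no overlapping hit are pure miss cycles]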
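The difference between the two perspectives can be made concrete with a small cycle-by-cycle sketch. The timeline below is hypothetical (it is not the one drawn on the slide), and the code assumes that a pure miss is a miss with at least one pure-miss cycle, that C_H is the total hit-phase cycles divided by the cycles with hit activity, and that C_M is the total per-access pure-miss cycles divided by the number of pure-miss cycles.

```python
from fractions import Fraction

def analyze(accesses):
    """accesses: list of (start_cycle, hit_cycles, miss_cycles) tuples."""
    hits_in_cycle = {}    # cycle -> number of accesses in their hit phase
    misses_in_cycle = {}  # cycle -> indices of accesses in their miss phase
    for i, (start, hit, miss) in enumerate(accesses):
        for c in range(start, start + hit):
            hits_in_cycle[c] = hits_in_cycle.get(c, 0) + 1
        for c in range(start + hit, start + hit + miss):
            misses_in_cycle.setdefault(c, []).append(i)

    # A pure-miss cycle has miss activity and no hit activity.
    pure_cycles = [c for c in misses_in_cycle if c not in hits_in_cycle]
    # Assumption: a pure miss is a miss with at least one pure-miss cycle.
    pure_cycles_of = {}
    for c in pure_cycles:
        for i in misses_in_cycle[c]:
            pure_cycles_of[i] = pure_cycles_of.get(i, 0) + 1

    n = len(accesses)
    hit_total = sum(hit for _, hit, _ in accesses)
    miss_lat = [miss for _, _, miss in accesses if miss > 0]
    pure_total = sum(pure_cycles_of.values())

    h = Fraction(hit_total, n)
    c_h = Fraction(hit_total, len(hits_in_cycle))
    pmr = Fraction(len(pure_cycles_of), n)
    pamp = Fraction(pure_total, len(pure_cycles_of)) if pure_cycles_of else Fraction(0)
    c_m = Fraction(pure_total, len(pure_cycles)) if pure_cycles else Fraction(1)
    mr = Fraction(len(miss_lat), n)
    amp = Fraction(sum(miss_lat), len(miss_lat)) if miss_lat else Fraction(0)

    amat = h + mr * amp
    c_amat = h / c_h + pmr * pamp / c_m
    # Cycles in which the memory system is doing anything at all.
    active_cycles = len(set(hits_in_cycle) | set(misses_in_cycle))
    return amat, c_amat, Fraction(active_cycles, n)

# Hypothetical timeline: five accesses issued one cycle apart, each with a
# 2-cycle hit phase; the third and fourth also miss (3 and 1 cycles).
timeline = [(0, 2, 0), (1, 2, 0), (2, 2, 3), (3, 2, 1), (4, 2, 0)]
amat, c_amat, cycles_per_access = analyze(timeline)
print("AMAT   =", amat)
print("C-AMAT =", c_amat)
print("active cycles per access =", cycles_per_access)
```

In this sketch the fourth access misses, but its miss cycle is hidden behind another access's hit phase, so it is not a pure miss and adds nothing to C-AMAT. The last value printed equals C-AMAT, which matches the view of C-AMAT as memory-active cycles per access.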
Pure Miss
Miss is not important (pure miss is)
The penalty is due to pure misses:
$$pMR \times \frac{pAMP}{C_M}$$
[Figure: the same five-access timeline, marking the pure miss cycles within the miss cycles]
C-AMAT is Recursive
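$$\text{C-AMAT}_1 = \frac{H_1}{C_{H_1}} + pMR_1 \times \eta_1 \times \text{C-AMAT}_2$$
$$\text{C-AMAT}_1 = \frac{H_1}{C_{H_1}} + pMR_1 \times \frac{pAMP_1}{C_{M_1}}$$
$$\text{C-AMAT}_2 = \frac{H_2}{C_{H_2}} + pMR_2 \times \frac{pAMP_2}{C_{M_2}}$$
where
$$\eta_1 = \frac{pAMP_1}{AMP_1} \times \frac{C_{m_1}}{C_{M_1}}$$
The first equation shows the recurrence relation between C-AMAT1 and C-AMAT2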
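A small numeric check can make the recurrence easier to read. The sketch below is an illustration under stated assumptions: the function names are hypothetical, C_m1 is taken to be the concurrency of all L1 misses (as opposed to C_M1, the pure miss concurrency), and the L1 and L2 values are made-up numbers picked so that the direct and recursive forms can be compared exactly.

```python
from fractions import Fraction as F

def c_amat_direct(h, c_h, pmr, pamp, c_pure):
    """C-AMAT_1 = H_1 / C_H1 + pMR_1 * pAMP_1 / C_M1."""
    return h / c_h + pmr * pamp / c_pure

def eta(pamp, amp, c_miss, c_pure):
    """eta_1 = (pAMP_1 / AMP_1) * (C_m1 / C_M1)."""
    return (pamp / amp) * (c_miss / c_pure)

def c_amat_recursive(h, c_h, pmr, eta1, c_amat_next):
    """C-AMAT_1 = H_1 / C_H1 + pMR_1 * eta_1 * C-AMAT_2."""
    return h / c_h + pmr * eta1 * c_amat_next

# Hypothetical L1 parameters.
h1, c_h1 = F(2), F(5, 3)
pmr1, pamp1, c_pure1 = F(1, 5), F(1), F(1)
amp1, c_miss1 = F(2), F(4, 3)
# Hypothetical L2 value, chosen here to be consistent with the L1 numbers.
c_amat2 = F(3, 2)

eta1 = eta(pamp1, amp1, c_miss1, c_pure1)
direct = c_amat_direct(h1, c_h1, pmr1, pamp1, c_pure1)
recursive = c_amat_recursive(h1, c_h1, pmr1, eta1, c_amat2)
print("eta_1 =", eta1)
print("direct C-AMAT_1    =", direct)
print("recursive C-AMAT_1 =", recursive)
```

On a real system C-AMAT2 would be measured at the L2 cache rather than chosen by hand.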
The physical meaning of η1
R1 = pure miss cycles / miss cycles
R2 = pure misses / misses
η1 = R1 / R2
The penalty at L2 is C-AMAT2
The actual delay impact is η1 × C-AMAT2
η1 is the L1 (concurrency) data delay reducer
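The ratio form of η1 can be evaluated from four counters. The following sketch uses hypothetical counts and a hypothetical C-AMAT2 value, and the function name is an assumption made for illustration.

```python
from fractions import Fraction

def eta_from_counts(pure_miss_cycles, miss_cycles, pure_misses, misses):
    """eta_1 = R1 / R2, with R1 = pure miss cycles / miss cycles
    and R2 = pure misses / misses."""
    r1 = Fraction(pure_miss_cycles, miss_cycles)
    r2 = Fraction(pure_misses, misses)
    return r1 / r2

# Hypothetical L1 counters: 3 miss cycles of which 1 is pure,
# 2 misses of which 1 is a pure miss.
eta1 = eta_from_counts(pure_miss_cycles=1, miss_cycles=3,
                       pure_misses=1, misses=2)
c_amat2 = Fraction(3, 2)          # hypothetical L2 penalty in cycles
delay_impact = eta1 * c_amat2     # the part of the L2 penalty that is felt
print("eta_1 =", eta1)
print("actual delay impact =", delay_impact)
```

With these numbers the L2 penalty of 1.5 cycles is reduced to 1 cycle of actual delay, which is the sense in which η1 acts as a delay reducer.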
Architecture Impacts
CH could be contributed by
o multi-port cache
o multi-banked cache
o pipelined cache structures
CM could be contributed by
o non-blocking cache structures
o prefetching logic
These techniques can increase both CH and CM
o out-of-order execution
o multiple issue pipeline
o SMT
o CMP
Detecting System
[Figure: detecting system with a CPU interface, a cache with a hit concurrency detector, an MSHR with a miss concurrency detector, and a C-AMAT analyzer]
Structure for detecting cache hit concurrency and cache miss concurrency using the C-AMAT metric
Experimental Environment
Simulator
o GEM5
Benchmarks
o 29 benchmarks from SPEC CPU2006 suite
For each benchmark, 10 million instructions were simulated to collect statistics
Average values of the corresponding memory metrics are shown
A good memory metric should match the actual design choices for modern processors
Default configuration
[Table: default processor and cache configuration parameters for simulated testing of C-AMAT]
Experimental Results
[Figure: L1 DCache AMAT and C-AMAT when changing issue pipeline width]
AMAT gets worse and C-AMAT gets better as concurrency increases
Experimental Results
[Figure: L1 DCache AMAT and C-AMAT when changing MSHR size]
AMAT gets worse and C-AMAT gets better as concurrency increases
Experimental Results
[Figure: L2 Cache AMAT and C-AMAT when changing MSHR size]
AMAT gets worse and C-AMAT gets better as concurrency increases
More results can be found in X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.
Potential of C-AMAT and Data Concurrency
Assume the total running time is T
Data stall time is d; d/T can be up to 70%, that is, d is 0.7 T
Compute time is t, and t is 0.3 T
Therefore, data stall time can be up to 0.7/0.3 ≈ 2.3 times the compute time
If layered performance matching is achieved, so that data access concurrency provides enough overlap, data stall time is only 1% of compute time
Then memory performance can be improved about 230 times (2.3/0.01)
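The arithmetic behind the 230x figure can be replayed in a few lines. This is only a sketch of the slide's back-of-the-envelope reasoning; the function name is hypothetical, and the 70% and 1% inputs are the values quoted above.

```python
def stall_to_compute_ratio(stall_fraction):
    """d / t when d = stall_fraction * T and t = (1 - stall_fraction) * T."""
    return stall_fraction / (1.0 - stall_fraction)

before = stall_to_compute_ratio(0.7)  # data stall is 70% of running time
after = 0.01                          # target: stall is 1% of compute time
improvement = before / after

print(f"stall/compute before: {before:.2f}")
print(f"stall/compute after:  {after:.2f}")
print(f"reduction in data stall time: about {improvement:.0f}x")
```

Without rounding 0.7/0.3 to 2.3 first, the factor comes out slightly above 230.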
Improvement potential due to concurrency
Aided by concurrency, memory system performance can be improved by up to hundreds of times (230X) at each layer of a memory hierarchy with layered performance matching
How the 230x Improvement Is Achieved
Increasing data access concurrency to achieve a 230x speedup of memory system performance with our LPM algorithm
Technique Impact Analysis (Original)
Figure 2.11 on page 96 of Hennessy & Patterson's latest book
Technique Impact Analysis (Ours)
A new technique summary table with C-AMAT
The Impact of C-AMAT
New understanding of memory systems with a rigorous mathematical description
Unifies the influence of data locality and concurrency under one formulation
Foundation for developing new concurrency-based optimizations, and utilizing existing locality-based optimizations
Foundation for automatic tuning for best configuration, partition, and scheduling, etc.
C-AMAT in Action
New C-AMAT model
CPU-time = IC × (CPI_exe + f_mem × C-AMAT × (1 − overlapRatio_c-m)) × cycle-time
Data stall time: f_mem × C-AMAT × (1 − overlapRatio_c-m)
Traditional AMAT model
[Figure: traditional AMAT-based CPU-time equation with its data stall time term marked]
Only pure misses cause processor stalls, and the penalty is formulated here
Y.-H. Liu and X.-H. Sun, “Reevaluating data stall time with the consideration of data access concurrency,” Journal of Computer Science and Technology, vol. 30, no. 2, pp. 227–245, 2015.
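The new CPU-time model is straightforward to evaluate once its inputs are known. The sketch below transcribes the formula into Python; the function and parameter names are hypothetical, and every input value is made up for illustration rather than taken from the cited paper.

```python
def cpu_time(ic, cpi_exe, f_mem, c_amat, overlap_ratio_cm, cycle_time):
    """CPU-time = IC * (CPI_exe + f_mem * C-AMAT * (1 - overlapRatio_c-m)) * cycle-time."""
    data_stall = f_mem * c_amat * (1.0 - overlap_ratio_cm)
    return ic * (cpi_exe + data_stall) * cycle_time

# Hypothetical program: one million instructions, 30% of them memory
# accesses, C-AMAT of 1.4 cycles, half of the memory time overlapped
# with computation, 1 ns cycle time.
t = cpu_time(ic=1_000_000, cpi_exe=1.0, f_mem=0.3,
             c_amat=1.4, overlap_ratio_cm=0.5, cycle_time=1e-9)
print(f"CPU-time = {t * 1e3:.3f} ms")
```

Lowering C-AMAT or raising the overlap ratio both shrink the data stall term, so the model exposes two separate levers for optimization.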
C-AMAT in Action
Layered performance matching at each layer of the memory hierarchy
Using recursive C-AMAT to measure and mitigate layered performance mismatch
For instance, the impact of C-AMAT2 can be trimmed by pMR1 and η1
The key is to reduce pure misses, not misses, and data concurrency can do so
Y.-H. Liu, X.-H. Sun, "LPM: Layered Performance Matching in Memory Hierarchy," Illinois Institute of Technology Technical Report (IIT/CS-SCS-2014-08), 2014.
C-AMAT in Action
Online Reconfiguration and Smart Scheduling
o A performance optimization tool has been developed based on C-AMAT
o Provides measurement and optimization suggestions
o Measures C-AMAT on existing computing systems
o Optimization in hardware reconfiguration
o Optimization in software task partitioning and scheduling
Y.-H. Liu, X.-H. Sun, "TuningC: A Concurrency-aware Optimization Tool," Illinois Institute of Technology Technical Report (IIT/CS-SCS-2015-05), 2015.
Related Work: APC Versus C-AMAT
Access Per (memory active) Cycle (APC)
o APC = A/T
APC is a measurement, a companion of C-AMAT
C-AMAT is an analysis and optimization tool
APC is very different from the traditional IPC
o Memory Active Cycle (data centric/access)
o Overlapping mode (concurrent data access)
The value of C-AMAT can be obtained without measuring its five parameters
C-AMAT = 1/APC
D. Wang and X.-H. Sun, "Memory Access Cycle and the Measurement of Memory Systems," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626-1639, July 2014.
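A short sketch of the APC measurement, assuming A is the number of accesses and T is the number of memory active cycles counted in overlapping mode (a cycle with several accesses in flight is counted once). The interval data and function names are hypothetical.

```python
from fractions import Fraction

def memory_active_cycles(intervals):
    """Count cycles with at least one access in flight.
    Each interval is half-open: [start_cycle, end_cycle)."""
    active = set()
    for start, end in intervals:
        active.update(range(start, end))
    return len(active)

def apc(intervals):
    """APC = A / T: accesses per memory active cycle."""
    return Fraction(len(intervals), memory_active_cycles(intervals))

# Hypothetical in-flight intervals of five overlapping accesses.
intervals = [(0, 2), (1, 3), (2, 7), (3, 6), (4, 6)]
apc_value = apc(intervals)
c_amat = 1 / apc_value  # C-AMAT = 1 / APC
print("APC    =", apc_value)
print("C-AMAT =", c_amat)
```

Only the access count and the active-cycle count are needed here, so the value of C-AMAT is obtained without knowing H, C_H, pMR, pAMP, or C_M individually.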
Related Work: MLP
Memory Level Parallelism (MLP)
o Average number of outstanding long-latency main memory accesses when there is at least one such outstanding access
o Assuming each off-chip memory access has a constant latency, say m cycles, APC_M = MLP/m
o That means APC_M is directly proportional to MLP
o APC is a superset of MLP
C-AMAT is an analytical tool and a measurement; MLP is a measurement
MLP does not consider locality, while APC and C-AMAT do
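The relation APC_M = MLP/m can be checked on a toy trace. The sketch below assumes a constant latency of m cycles per off-chip access and uses hypothetical issue times; the function name is also an assumption.

```python
from fractions import Fraction

def mlp_and_apc(start_cycles, m):
    """Return (MLP, APC_M) for accesses that each stay outstanding m cycles."""
    outstanding = {}  # cycle -> number of outstanding accesses
    for s in start_cycles:
        for c in range(s, s + m):
            outstanding[c] = outstanding.get(c, 0) + 1
    active = len(outstanding)  # cycles with at least one outstanding access
    mlp = Fraction(sum(outstanding.values()), active)
    apc_m = Fraction(len(start_cycles), active)
    return mlp, apc_m

# Hypothetical trace: three overlapping accesses and one isolated access.
m = 4
mlp, apc_m = mlp_and_apc([0, 1, 2, 10], m)
print("MLP   =", mlp)
print("APC_M =", apc_m)
print("MLP/m =", mlp / m)
```

Each access contributes exactly m outstanding-cycles, so the sum of outstanding counts is A × m, and dividing by the active cycles gives MLP = m × APC_M.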
Conclusions
Data access delay is the premier bottleneck of computing
Hardware memory concurrency exists but is underutilized
C-AMAT unifies data concurrency with locality for combined data access optimizations
C-AMAT-guided optimization can improve memory system performance by up to 230 times
This 230X number could be even larger. With multicore technology, CPUs can be built faster. The question is whether data can be moved up fast enough
Conclusions
Develop C-AMAT based technologies to reduce data access time!
Thank You & Questions?