System-Level Exploration of Power, Temperature, Performance, and Area
for Multicore Architectures
Houman Homayoun1, Manish Arora1, Luis Angel D. Bathen2
1Department of Computer Science and Engineering, University of California, San Diego
2Department of Computer Science, University of California, Irvine
Outline
- Why power, temperature and reliability? (H. Homayoun)
- Tools for architects (H. Homayoun)
- Thermal simulation using HotSpot (H. Homayoun)
- Power/performance modeling for NVMs using NVSim (H. Homayoun)
- Cycle-level NoC simulation using DARSIM (H. Homayoun)
- Using bottleneck analysis and McPAT for efficient CPU design space exploration (M. Arora)
- SimpleScalar: a computer system design and analysis infrastructure (L. Bathen)
- PHiLOSoftware: software-controlled memory virtualization (L. Bathen): SimpleScalar + CACTI + NVSIM + SystemC
Power
Symptoms: short-lived batteries, huge heatsinks, high electric bills, unreliable microprocessors.
Power consumption is a first-order design constraint:
- Embedded systems: battery lifetime
- High-performance systems: heat dissipation
- Large-scale systems: billing cost
[Theme diagram: Power - Thermal - Reliability]
Power Importance - Data Center Example
2006: data centers in the US consumed 61 billion kWh, an electricity cost of about $9 billion1
2012: doubles to $18 billion1
Data center power breakdown: processor + RAM (65%), storage (21%), network (9%), others (5%)
Impact of a 10% reduction in processor and RAM power consumption: a 6.5% reduction of total power consumption
$18B x 6.5% = $1.1 billion in savings
1Environmental Protection Agency (EPA), "Report to Congress on Server and Data Center Energy Efficiency", August 2, 2007
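The savings arithmetic above can be checked directly: a 10% cut in the 65% processor + RAM share yields the 6.5% total reduction.

```python
# Data-center savings estimate using the slide's EPA-derived figures.
total_bill = 18e9          # projected 2012 electricity cost, $
proc_ram_share = 0.65      # fraction of power going to processor + RAM
reduction = 0.10           # assumed processor/RAM power reduction

total_reduction = proc_ram_share * reduction   # 6.5% of total power
savings = total_bill * total_reduction         # ~ the slide's $1.1 billion
print(f"total reduction: {total_reduction:.1%}, savings: ${savings / 1e9:.2f}B")
```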
Temperature Trend - Temperature Crisis
Energy = heat, and heat dissipation is costly: increasing power dissipation in computer systems leads to ever-increasing cooling costs.
[Chart: max power density (W/cm^2), log scale from 1 to 10000, vs. year 1975-2015, for processors from the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6 through POWER2/3/4/6, Itanium 2, AMD K8, Athlon II, Core 2 Duo, and Nehalem; power density passes that of a hot plate and approaches that of a nuclear reactor and a rocket nozzle.]
Reliability Trend - Reliability Crisis
Lifetime reliability is decreasing:
- High power densities lead to high temperatures; every 10 degree temperature increase roughly doubles the failure rate
- Technology scaling brings manufacturing defects and process variation
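The rule of thumb above (failure rate doubling per 10 degrees) can be sketched as a simple exponential scaling; the 10-degree doubling interval is the slide's rule of thumb, not a fitted constant.

```python
def failure_rate_multiplier(delta_t: float, doubling_interval: float = 10.0) -> float:
    """Relative failure rate after a temperature increase of delta_t degrees,
    using the rule of thumb that the rate doubles every `doubling_interval` degrees."""
    return 2.0 ** (delta_t / doubling_interval)

print(failure_rate_multiplier(10))  # 2.0 (one doubling)
print(failure_rate_multiplier(25))  # ~5.66 (2.5 doublings)
```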
Source: N. Kim, et al., Analyzing the Impact of Joint Optimization of Cell Size, TVLSI 2011
[Figure: probability of failure vs. relative cell size]
Efficiency Crisis
Moore's law allows the transistor budget to double every 18 months, but the power and thermal budgets have not changed significantly: an efficiency problem in new generations of microprocessors.
[Figure: Intel Teraflops (many-core) chip, 8x10 cores, with cores powered down]
What Can We Do About It?
In order to achieve "sustainable computing" and fight back against the "power problem", we need to rethink from a "green computing" perspective:
- Understand the levels of design abstraction: technology level, circuit level, and architecture level.
- Understand where power/temperature is dissipated, where reliability issues are exposed, and where performance bottlenecks exist.
- Think about ways to tackle these issues at all levels.
Tools for Architects
Performance:
- CPU: SimpleScalar, GEMS5, SMTSIM
- GPU: GPGPU-Sim
- Network: DARSIM, NoC-SIM
Power:
- CPU: Wattch
- SRAM/CAM, eDRAM cache: CACTI
- Non-volatile memory: NVSim
Reliability: VARIUS
Temperature: HotSpot
Power, timing, area for CPU + cache + main memory: McPAT
Thermal Simulation Using HOTSPOT
Slides Courtesy of A Quick Thermal Tutorial by Kevin Skadron and Mircea Stan
Thermal Modeling1
A fine-grained, dynamic model for temperature that architects can use:
- Accounts for adjacency and package
- Requires no detailed designs
- Provides a detailed temperature distribution
- Fast
HotSpot: a compact model based on thermal Rs and Cs, parameterized for various architectures, power models, floorplans, and thermal packages.
[1] A Quick Thermal Tutorial, Kevin Skadron and Mircea Stan
HotSpot
- Time evolution of temperature is driven by unit activities and power dissipations averaged over N cycles (10K cycles shown to provide enough accuracy)
- Power dissipations can come from any power simulator; they act as "current sources" in the RC circuit (the 'P' vector in the equations)
- Simulation overhead in Wattch/SimpleScalar: <1%
- Requires models of:
  - Floorplan: important for adjacency
  - Package: important for spreading and time constants
- R and C matrices are derived from the above
Example System1
[Figure: cross-section of an IC package showing the heat sink, heat spreader, interface material, die, PCB, and pins]
[1] A Quick Thermal Tutorial, Kevin Skadron and Mircea Stan
HotSpot Implementation
Primarily a circuit solver.
Steady-state solution:
- Mainly matrix inversion, done in two steps: decomposition of the matrix into lower and upper triangular matrices, then successive backward substitution of solved variables
- Implements the pseudocode from CLR
Transient solution:
- Inputs: current temperature and power; output: temperature for the next interval
- Computed using a fourth-order Runge-Kutta (RK4) method
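The transient step above can be sketched as classic RK4 applied to the lumped RC network C dT/dt = P - G T. The two-node G and C matrices below are toy values for illustration; HotSpot derives its real matrices from the floorplan and package.

```python
import numpy as np

def rk4_thermal_step(T, P, G, Cinv, dt):
    """One RK4 step of C dT/dt = P - G T, i.e. dT/dt = Cinv @ (P - G @ T)."""
    f = lambda T: Cinv @ (P - G @ T)
    k1 = f(T)
    k2 = f(T + 0.5 * dt * k1)
    k3 = f(T + 0.5 * dt * k2)
    k4 = f(T + dt * k3)
    return T + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy 2-node network: two blocks coupled laterally, each with a path to ambient.
G = np.array([[1.5, -0.5], [-0.5, 1.5]])  # thermal conductance matrix (W/K)
Cinv = np.diag([10.0, 10.0])              # inverse heat capacities (K/J)
T = np.array([0.0, 0.0])                  # temperature rise over ambient (K)
P = np.array([3.0, 0.0])                  # unit power inputs (W)
for _ in range(1000):
    T = rk4_thermal_step(T, P, G, Cinv, 0.01)
print(T)  # converges toward the steady state solve(G, P)
```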
Validation
- Validated and calibrated using MICRED test chips: a 9x9 array of power dissipators and sensors, compared to HotSpot configured with the same grid and package
  - Within 7% for both steady-state and transient step response
  - Interface material (chip/spreader) matters
- Also validated against an FPGA: can instantiate a temperature sensor based on a ring oscillator and counter
- Also validated against IBM ANSYS FEM simulations
HotSpot Interface
Inputs:
- a power trace file
- a floorplan file
- a config file (package information)
Outputs:
- the corresponding transient temperatures
- steady-state temperatures
- thermal map (Perl script)
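A power trace can be produced by any script or simulator. The sketch below assumes the format used by the examples in the HotSpot distribution: the first line lists the floorplan unit names, and each subsequent line gives one power sample (in watts) per unit for one averaging interval. The unit names and values here are illustrative.

```python
# Sketch: generate a HotSpot power trace (.ptrace) file.
units = ["Icache", "Dcache", "IntExec"]   # must match names in the .flp file
samples = [
    [0.9, 1.1, 2.4],   # interval 0: power per unit, in watts
    [0.8, 1.3, 2.1],   # interval 1
]
with open("gcc.ptrace", "w") as f:
    f.write("\t".join(units) + "\n")
    for row in samples:
        f.write("\t".join(f"{p:.4f}" for p in row) + "\n")
```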
Config File
# thermal model parameters

# chip specs
# chip thickness in meters
-t_chip 0.00015
# silicon thermal conductivity in W/(m-K)
-k_chip 100.0
# silicon specific heat in J/(m^3-K)
-p_chip 1.75e6
# temperature threshold for DTM (kelvin)
-thermal_threshold 354.95

# heat sink specs
# convection capacitance in J/K
-c_convec 140.4
# convection resistance in K/W
-r_convec 0.1
# heatsink side in meters
-s_sink 0.06
# heatsink thickness in meters
-t_sink 0.0069
# heatsink thermal conductivity in W/(m-K)
-k_sink 400.0
# heatsink specific heat in J/(m^3-K)
-p_sink 3.55e6

# heat spreader specs
# spreader side in meters
-s_spreader 0.03
# spreader thickness in meters
-t_spreader 0.001
# heat spreader thermal conductivity in W/(m-K)
-k_spreader 400.0
# heat spreader specific heat in J/(m^3-K)
-p_spreader 3.55e6

# interface material specs
# interface material thickness in meters
-t_interface 2.0e-05
# interface material thermal conductivity in W/(m-K)
-k_interface 4.0
# interface material specific heat in J/(m^3-K)
-p_interface 4.0e6
FLP File
# Floorplan close to the Alpha EV6 processor
# Line Format: <unit-name>\t<width>\t<height>\t<left-x>\t<bottom-y>
# all dimensions are in meters
# comment lines begin with a '#'
# comments and empty lines are ignored
L2_left    0.004900  0.006200  0.000000  0.009800
L2         0.016000  0.009800  0.000000  0.000000
L2_right   0.004900  0.006200  0.011100  0.009800
Icache     0.003100  0.002600  0.004900  0.009800
Dcache     0.003100  0.002600  0.008000  0.009800
Bpred_0    0.001033  0.000700  0.004900  0.012400
Bpred_1    0.001033  0.000700  0.005933  0.012400
Bpred_2    0.001033  0.000700  0.006967  0.012400
DTB_0      0.001033  0.000700  0.008000  0.012400
DTB_1      0.001033  0.000700  0.009033  0.012400
DTB_2      0.001033  0.000700  0.010067  0.012400
FPAdd_0    0.001100  0.000900  0.004900  0.013100
FPAdd_1    0.001100  0.000900  0.006000  0.013100
FPReg_0    0.000550  0.000380  0.004900  0.014000
FPReg_1    0.000550  0.000380  0.005450  0.014000
FPMul_0    0.001100  0.000950  0.004900  0.014380
FPMul_1    0.001100  0.000950  0.006000  0.014380
FPMap_0    0.001100  0.000670  0.004900  0.015330
FPMap_1    0.001100  0.000670  0.006000  0.015330
IntMap     0.000900  0.001350  0.007100  0.014650
IntQ       0.001300  0.001350  0.008000  0.014650
IntReg_0   0.000900  0.000670  0.009300  0.015330
IntReg_1   0.000900  0.000670  0.010200  0.015330
IntExec    0.001800  0.002230  0.009300  0.013100
FPQ        0.000900  0.001550  0.007100  0.013100
LdStQ      0.001300  0.000950  0.008000  0.013700
ITB_0      0.000650  0.000600  0.008000  0.013100
ITB_1      0.000650  0.000600  0.008650  0.013100
HotSpot Modes of Running
Block level: fast, less accurate. Example:
hotspot -c hotspot.config -f ev6.flp -p gcc.ptrace -o gcc.ttrace -steady_file gcc.steady
Grid level: slow, more accurate. Example:
hotspot -c hotspot.config -f ev6.flp -p gcc.ptrace -steady_file gcc.steady -model_type grid
Change the grid size to trade off speed against accuracy; the default grid size is 64x64.
3D Modeling with HotSpot
HotSpot's grid model is capable of modeling stacked 3D chips. HotSpot needs a Layer Configuration File (LCF) for 3D simulation; the LCF specifies the set of vertical layers to be modeled, including each layer's physical properties (thickness, conductivity, etc.).
Example: LCF file for two layers
# <Layer Number>
# <Lateral heat flow Y/N?>
# <Power Dissipation Y/N?>
# <Specific heat capacity in J/(m^3K)>
# <Resistivity in (m-K)/W>
# <Thickness in m>
# <floorplan file>

# Layer 0: silicon
0
Y
Y
1.75e6
0.01
0.00015
ev6.flp

# Layer 1: thermal interface material (TIM)
1
Y
N
4e6
0.25
2.0e-05
ev6.flp

The sample file above shows an LCF corresponding to the default HotSpot configuration with two layers: one layer of silicon and one layer of thermal interface material (TIM).
Command line example:
hotspot -c hotspot.config -f <some_random_file> -p example.ptrace -o example.ttrace -model_type grid -grid_layer_file example.lcf
Example: Modeling memory peripheral temperature
Power/Performance Modeling for Non-Volatile Memories Using NVSim
Why yet another circuit-level estimation tool for cache memories?
- Emerging non-volatile memory devices show a large variation in performance, energy, and density.
- Some of them are performance-optimized; some of them are area-optimized.
- For system-level research, it is NOT correct to pick random device parameters from multiple sources.
NVSim1
NVSim is designed to be a general circuit-level performance, power, and area model.
Memory technologies supported: NAND, PCM, MRAM (STT-RAM), memristor, SRAM, DRAM, eDRAM.
[1] "Design Implications of Memristor-Based RRAM Cross-Point Structures", C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, DATE 2011
NVSim Model
Developed on the basis of CACTI:
- CACTI models SRAM and DRAM caches
- CACTI does NOT support eNVM
[Figure: CACTI-modeled memory subarray, a 2D array of memory cells plus peripheral circuitry: wordline drivers, row decoders, precharge & equalization, bitline mux, sense amplifiers, sense amplifier mux, output/write drivers. NVSim makes modifications at the subarray level and the bank level.]
Tricks (Subarray Level)
Why is the circuit design space so large? Many design tricks:
- Transistor type: high-performance, low-power, low-standby
- Interconnect type: wire pitch, repeater design
- Sense amp: current-sensing, voltage-sensing
- Driver: area-optimized, latency-optimized
- Array: MOS-accessed, cross-point
Configuring NVSim
NVSim provides a variety of functionalities by supporting two categories of configuration input files: <.cfg> files and <.cell> files.
- <.cfg> configuration: <.cfg> files specify the non-volatile memory module parameters and tune the design exploration knobs. The details of how to configure <.cfg> files are on the cfg files page.
- <.cell> configuration: <.cell> files specify the non-volatile memory cell properties. The information held in these files usually comes from the device level. NVSim provides default <.cell> files for PC-RAM, STT-RAM, and R-RAM, and also allows advanced users to tailor their own cell properties by adding new <.cell> files. The details of how to configure <.cell> files are on the cell files page.
NVSim Interface
-DesignTarget: cache
//-DesignTarget: RAM
//-DesignTarget: CAM

-CacheAccessMode: Normal
//-CacheAccessMode: Fast
//-CacheAccessMode: Sequential

//-OptimizationTarget: ReadLatency
//-OptimizationTarget: WriteLatency
//-OptimizationTarget: ReadDynamicEnergy
//-OptimizationTarget: WriteDynamicEnergy
//-OptimizationTarget: ReadEDP
//-OptimizationTarget: WriteEDP
//-OptimizationTarget: LeakagePower
-OptimizationTarget: Area
//-OptimizationTarget: Exploration

//-ProcessNode: 200
//-ProcessNode: 120
//-ProcessNode: 90
//-ProcessNode: 65
-ProcessNode: 45
//-ProcessNode: 32

-Capacity (KB): 128
//-Capacity (MB): 1
-WordWidth (bit): 512
-Associativity (for cache only): 8

-DeviceRoadmap: HP
//-DeviceRoadmap: LSTP
//-DeviceRoadmap: LOP

-Routing: H-tree
//-Routing: non-H-tree

-MemoryCellInputFile: SRAM.cell
//-MemoryCellInputFile: Memristor_3.cell
//-MemoryCellInputFile: PCRAM_JSSC_2007.cell
//-MemoryCellInputFile: PCRAM_JSSC_2008.cell
//-MemoryCellInputFile: PCRAM_IEDM_2004.cell
//-MemoryCellInputFile: MRAM_ISSCC_2007.cell
//-MemoryCellInputFile: MRAM_ISSCC_2010_14_2.cell
//-MemoryCellInputFile: MRAM_Qualcomm_IEDM.cell
//-MemoryCellInputFile: SLCNAND.cell

-Temperature (K): 380
Example 1: PCRAM1
32-nm 16MB 8-way L3 caches (with different PCRAM design optimizations)
[1] PCRAMsim: System-Level Performance, Energy, and Area Modeling for Phase-Change RAM, Xiangyu Dong, et al., ICCAD 2009
Cycle-Level NoC Simulation Using DARSIM
DARSIM1
A parallel, highly configurable, cycle-level network-on-chip simulator based on an ingress-queued wormhole router architecture. Most hardware parameters are configurable, including geometry, bandwidth, and crossbar dimensions.
Packets arrive flit-by-flit on ingress ports and are buffered in ingress virtual channel (VC) buffers until they have been assigned a next-hop node and VC; they then compete for the crossbar and, after crossing, depart from the egress ports.
[Figure: basic datapath of an NoC router modeled by DARSIM]
[1] Darsim: A Parallel Cycle-Level NoC Simulator, M. Lis, K. S. Shim, M. H. Cho, P. Ren, O. Khan, and S. Devadas, ISPASS 2011
DARSIM Simulation Parameters
--cycles arg        simulate for arg cycles (0 = until drained)
--packets arg       simulate until arg packets arrive (0 = until drained)
--stats-start arg   start statistics after cycle arg (default: 0)
--no-stats          do not report statistics
--no-fast-forward   do not fast-forward when system is drained
--memory-traces arg read memory traces from file arg
--log-file arg      write a log to file arg
--random-seed arg   set random seed (default: use system entropy)
--version           show program version and exit
-h [ --help ]       show this help message and exit
Sample Config File
[geometry]
height = 8
width = 8
[routing]
node = weighted
queue = set
one queue per flow = false
one flow per queue = false
[node]
queue size = 8
[bandwidth]
cpu = 16/1
net = 16
north = 1/1
east = 1/1
south = 1/1
west = 1/1
[queues]
cpu = 0 1
net = 8 9
north = 16 18
east = 28 30
south = 20 22
west = 24 26
[core]
default = injector
Network Configuration
-x (arg) : network width (8)
-y (arg) : network height (8)
-v (arg) : number of virtual channels per set (1)
-q (arg) : capacity of each virtual channel in flits (4)
-c (arg) : core type (memtraceCore)
-m (arg) : memory type (privateSharedMSI)
-n (arg) : number of VC sets
-o (arg) : output filename (output.cfg)
Sample Flow Event
tick 12094
flow 0x001b0000 size 13
tick 12140
flow 0x00001f00 size 5
tick 12141
flow 0x001f0000 size 5
tick 12212
flow 0x00002100 size 5
tick 12212
flow 0x00210000 size 13

The first two lines indicate that at cycle 12094, a packet consisting of 13 flits is injected at Node 27 (0x1b) with destination Node 0 (0x00).
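Reading the trace above, the flow ID appears to pack the source node into bits 16-23 and the destination node into bits 8-15; a small parser under that assumption (the layout is inferred from the sample, not from DARSIM documentation):

```python
def parse_flow_id(flow_id: int) -> tuple:
    """Split a DARSIM flow ID into (source, destination) node numbers,
    assuming the 0x00SSDD00 layout suggested by the sample trace."""
    src = (flow_id >> 16) & 0xFF
    dst = (flow_id >> 8) & 0xFF
    return src, dst

print(parse_flow_id(0x001B0000))  # (27, 0): injected at node 27, destined for node 0
print(parse_flow_id(0x00001F00))  # (0, 31)
```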
Statistics
flit counts:
flow 00010000: offered 58, sent 58, received 58 (0 in flight)
flow 00020000: offered 44, sent 44, received 44 (0 in flight)
flow 00030000: offered 34, sent 34, received 34 (0 in flight)
flow 00040700: offered 32, sent 32, received 32 (0 in flight)
flow 00050700: offered 34, sent 34, received 34 (0 in flight)
.....
all flows counts: offered 109724, sent 109724, received 109724 (0 in flight)

in-network sent flit latencies (mean +/- s.d., [min..max] in # cycles):
flow 00010000: 4.06897 +/- 0.364931, range [4..6]
flow 00020000: 5.20455 +/- 0.756173, range [5..9]
flow 00030000: 6.38235 +/- 0.874969, range [6..10]
flow 00040700: 6.1875 +/- 0.526634, range [6..8]
flow 00050700: 5.11765 +/- 0.32219, range [5..6]
.....
all flows in-network flit latency: 9.95079 +/- 20.5398
Example: Effect of Routing and VC1
[Figure] The effect of routing and VC configuration on network transit latency in a relatively congested network on the WATER benchmark: while O1TURN and ROMM clearly outperform XY, the margin is not particularly impressive.
[1] "Scalable Accurate Multicore Simulation in the 1000 core era", M. Lis et al., ISPASS 2011
Design Flow
Trace-Driven NUCA Non-Volatile Cache Simulation in 3D
- A cycle-accurate simulator such as SimpleScalar or SMTSIM feeds a cache trace to the NVSim-based simulator and to DARSIM for network simulation
- The NVSim-based simulator feeds back the new cache latency for performance impact
- HotSpot 3D (Virginia) performs the 3D thermal simulation and feeds temperature back for accurate leakage power modeling
Questions?
Using Bottleneck Analysis and McPAT for Efficient CPU Design Space Exploration
Manish Arora1
Computer Science and Engineering, University of California, San Diego
1Credit also goes to my co-authors: Feng Wang (Qualcomm), Bob Rychlik (Qualcomm), and Dean Tullsen (UC San Diego)
Tackling Design Complexity
- Increasingly complex design decisions; multicore exacerbates the problem
- Accurate simulation is slow; simulation of all design points is not feasible
- Commonly followed techniques are inadequate, e.g. sensitivity analysis:
  - Vary a single parameter while keeping the other parameters fixed, e.g. measure L2 performance by varying its size and keeping everything else constant
  - Dependent on the choice of fixed point of reference: L2 performance is correlated with L1 size
June 4 2012
Accelerating Design Space Exploration
Speeding up individual simulations:
- Benchmark subsetting (SMARTS [1], SimPoint [2], and MinneSpec [3])
- Analytical models instead of cycle-accurate simulation (Karkhanis et al. [4])
- Regression models to derive performance models (Lee et al. [5])
Design space pruning:
- Hill climbing (Systems et al. [6])
- Tabu search (Axelsson et al. [7])
- Genetic search (Palesi et al. [8])
- Plackett and Burman (P&B) based design (Yi et al. [9] and Arora et al. [10])
Plackett and Burman (P&B) Based Design
Advantages:
- Exploration over ranges of parameter values
- Linear or near-linear number of experiments
- Non-iterative technique (exploits cluster parallelism)
Workings:
- Provide a high value (+1) and low value (-1) for each component: CPU freq 2GHz (+1) / 1GHz (-1), L2 cache 1MB (+1) / 256KB (-1), and so on
- Run a P&B-specified set of experiments
- Evaluate the "impact" of each component, e.g. CPU frequency has a 30% influence when CPU freq is changed from 1GHz to 2GHz AND L2 from 256KB to 1MB AND ...
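As an illustration of the P&B workings described above (not the talk's actual design), the standard 8-run Plackett-Burman matrix for up to seven two-level factors can be built from cyclic shifts of its generator row; each factor's effect is then estimated from the dot product of its column with the responses. The response model below is synthetic.

```python
import numpy as np

# Standard 8-run Plackett-Burman design: cyclic shifts of the generator row,
# plus a final all-minus row. Columns are balanced and mutually orthogonal.
gen = [1, 1, 1, -1, 1, -1, -1]
rows = [np.roll(gen, i) for i in range(7)] + [[-1] * 7]
design = np.array(rows)                    # 8 runs x 7 factors

# Hypothetical responses: performance depends only on factor 0 (say, CPU freq).
y = 10.0 + 2.0 * design[:, 0]

# Effect of factor j = mean(y at +1) - mean(y at -1) = (column . y) / (runs/2)
effects = design.T @ y / (design.shape[0] / 2)
print(effects)  # factor 0 has effect 4.0; the orthogonal factors come out 0
```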
System Under Design
- Sub-system consisting of 11 components, with up to 10 choices per component
- 12 mobile-CPU-centric benchmarks
Using P&B for Cost-Optimized Designs
Recapping P&B:
- P&B yields a unit-less "impact" (the influence of changing a component)
- Provides "impact" trends by changing upper bounds
Constrained systems:
- Most systems are cost constrained (area, power, or energy), so cost must be considered together with performance
- L2 cache size has a higher impact than L2 associativity, but L2 associativity might still provide the best bang for the buck
Use cost-normalized marginal impact: impact gained / cost incurred. Use McPAT [11, 12] to evaluate baseline and marginal costs.
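The cost-normalized ranking above can be sketched in a few lines; the impact and area numbers here are made up purely for illustration.

```python
# Hypothetical (impact %, marginal area cost in mm^2) per component upgrade.
candidates = {
    "L2 size 256KB->1MB":  (12.0, 4.0),
    "L2 assoc 4->8":       (5.0, 0.5),
    "CPU freq 1GHz->2GHz": (30.0, 6.0),
}
# Rank by cost-normalized marginal impact: impact gained / cost incurred.
ranked = sorted(candidates.items(),
                key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (impact, cost) in ranked:
    print(f"{name}: {impact / cost:.1f} impact per mm^2")
```

With these illustrative numbers the modest associativity bump wins the ranking even though its raw impact is the smallest, which is exactly the "bang for the buck" point made above.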
McPAT: High-Level Features
- Integrated modeling framework: power (peak, dynamic, short-circuit, and leakage), area, and critical-path timing
- Hardware validated (~20% error)
- Configurable system components: cores, NoC, clock tree, PLL, caches, memory controllers, etc.
- Technology nodes from 90nm to 22nm
- Device types: bulk CMOS, SOI, and double-gate
- Flexible XML interface
- Standalone operation or integration with a performance simulator
McPAT 1.0 Overview (from [11])
[Diagram: unspecified parameters are filled in and structures are optimized to satisfy timing; the resulting models and configurations are then used to evaluate numbers]
Framework Components
- Hierarchical power, area, and timing model: models structures at a low level but allows high-level configuration; models core details rigorously and allows multiple cores to be connected
- Optimizer for circuit-level implementations: determines unspecified parameters in the internal chip representation. The user specifies the cache size and number of banks while the optimizer specifies cache bank wordline and bitline lengths; the user can choose to specify everything themselves
- Internal chip representation: driven by user inputs and those generated by the optimizer
Hierarchical Modeling (from [11])
Power, Area and Timing Modeling
Power modeling:
- Dynamic power from load capacitance modeling, supply voltage, clock frequency, and activity factors
- Short-circuit power using published models
- Leakage using published data and existing models such as MASTAR
Timing modeling:
- Estimate the critical path; use RC delays to estimate time, similar to CACTI
Area modeling:
- Similar to CACTI for gates and regular structures; empirical modeling techniques for non-regular structures
Multicore Architectural Modeling
- Core: configurable models of fetch, execute, load-store, OOO, etc.; reservation-station-style and physical-register-file architectures; in-order, OOO, and multithreaded architectures
- NoC: signal link and router models
- Shared and private cache hierarchies
- Memory controller: front end, transaction processing, and PHY models
- Clocking: PLL and clock tree models
Circuit and Technology Level Modeling
- Wires: hierarchical repeated wires for local and global wires; short wires using pi-RC models; latches automatically inserted and modeled to satisfy clock rates
- Devices: uses ITRS 2007 roadmap data; 90nm, 65nm, 45nm, 32nm, and 22nm nodes supported; planar bulk (down to 36nm), SOI (down to 25nm), and double-gate (22nm) modeled
- Support for power-saving modes: McPAT 1.0 supports multiple sleep states, coming this summer (v0.8 current)
McPAT Operation
- Requires input from the user and the simulator: target clock rate, architectural and technology parameters, optimization function (timing or ED^2), and unit activity factors
- McPAT optimizes structures to satisfy timing; configurations not satisfying timing are discarded
- Optimization functions are applied to all timing-satisfying configurations
- Numbers are calculated using the remaining configurations plus activity factors
Downloading and Installing
- Current version 0.8 available from the HP Labs website: http://www.hpl.hp.com/research/mcpat/
- Download and build the tool ("make" works); works on Unix-compatible systems
- Command-line operation (standalone XML input); print levels provide verbose results
- Alternatively, can be built together with a simulator
Running McPAT
Standalone mode (with XML input file):
- Architectural and technology details are specified within the XML
- Find the correspondence between McPAT stats and simulator stats
- Run the performance simulation and pass counters to the XML

<component id="system.core0.icache" name="icache">
  <param name="icache_config" value="131072,32,8,1,8,3,32,0"/>
  <stat name="read_accesses" value="2000"/>
  <stat name="read_misses" value="116"/>
  <stat name="conflicts" value="9"/>
</component>
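One way to pass simulator counters into the XML is to emit the <component> entry programmatically; a sketch using the Python standard library, with the slide's example counter values:

```python
import xml.etree.ElementTree as ET

def icache_component(counters):
    """Build a McPAT-style <component> element for the i-cache from raw counters."""
    comp = ET.Element("component", id="system.core0.icache", name="icache")
    ET.SubElement(comp, "param", name="icache_config",
                  value="131072,32,8,1,8,3,32,0")
    for stat, value in counters.items():
        ET.SubElement(comp, "stat", name=stat, value=str(value))
    return comp

comp = icache_component({"read_accesses": 2000, "read_misses": 116, "conflicts": 9})
print(ET.tostring(comp, encoding="unicode"))
```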
Integrated with multiple simulators (M5, SMTSIM, Multi2SIM, etc.); the documentation gives tips on building together with a simulator.
XML Specification (Top Level)
XML Specification (Core)
XML Specification (Memory Controller)
Results (Top Level)
Results (Core)
Results (Memory Controller)
Cost Normalized Marginal Impact
- Created an XML specification for our processor system
- Modeled a mobile processor and obtained activity factors from a custom cycle-accurate simulator
- Obtained baseline power and area, then marginal costs, then cost-normalized marginal impact
Results: Cost Normalized Impact
Obtaining Cost-Optimized Designs
- Make design decisions utilizing the marginal impact and marginal cost information
Results: Cost-Optimized Designs
- Set budgets to 70%-40% of the highest-end system; the selection algorithm minimizes impact loss while reducing cost as much as possible
- Performance within 16% of peak @ 40% area; performance within 19% of peak @ nearly half the power
To Summarize
- Looked at the problem of efficient design space exploration
- Used the Plackett and Burman method to yield "impact", a measure of bottleneck, for components
- Understood the basic workings of McPAT
- Understood the use of McPAT to obtain area and power costs for various system configurations
- Used cost numbers to obtain cost-normalized impact
- Used cost-normalized impact values to make efficient design choices
References
[1] Wunderlich et al., "SMARTS: Accelerating microarchitectural simulation via rigorous statistical sampling", ISCA 2003.
[2] Sherwood et al., "Automatically characterizing large scale program behavior", ASPLOS 2002.
[3] KleinOsowski et al., "MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research", CAL 2002.
[4] Karkhanis et al., "A first-order superscalar processor model", ISCA 2004.
[5] Lee et al., "Accurate and efficient regression modeling for microarchitectural performance and power prediction", ASPLOS 2006.
[6] Systems et al., "Spacewalker: Automated design space exploration", HP Labs 2001.
[7] Axelsson et al., "Architecture synthesis and partitioning of realtime systems", CODES 1997.
[8] Palesi et al., "Multi-objective design space exploration using genetic algorithms", CODES 2002.
[9] Yi et al., "A statistically rigorous approach for improving simulation methodology", HPCA 2003.
[10] Arora et al., "Efficient system design using the SAAB methodology", SAMOS 2012.
[11] S. Li et al., "McPAT: An integrated Power, Area and Timing framework", MICRO 2009.
[12] S. Li et al., McPAT 1.0 technical report, HP Labs 2009.
Questions?
SimpleScalar Simulator (ISS) and PHiLOSoftware Framework (SystemC)
Luis Angel D. Bathen (Danny)3
Slides courtesy of: Kyoungwoo Lee1, Aviral Shrivastava2, and Nikil Dutt3
1Dept. of Computer Science, Yonsei University
2Dept. of Computer Science & Engineering, Arizona State University
3Dept. of Computer Science, University of California, Irvine
Contents
- SimpleScalar overview
- Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
- PHiLOSoftware simulator: SimpleScalar + CACTI + SystemC TLM
- Demo 2: bus protocol selection
- Demo 3: software-controlled memory virtualization
Overview
What is an architectural simulator? A tool that reproduces the behavior of a computing device.
Why use a simulator?
- Leverages a faster, more flexible software development cycle
- Permits more design space exploration
- Facilitates validation before hardware becomes available
- Level of abstraction is tailored to the design task
- Possible to increase/improve system instrumentation
- Usually less expensive than building a real system
A Taxonomy of Simulation Tools
[Figure: taxonomy of simulation tools; shaded tools are included in the SimpleScalar Tool Set]
Functional vs. Performance
Functional simulators implement the architecture: they perform real execution and implement what programmers see.
Performance simulators implement the microarchitecture: they model system resources/internals, are concerned with time, and do not implement what programmers see.
Trace- vs. Execution-Driven
Trace-driven: the simulator reads a 'trace' of the instructions captured during a previous execution. Easy to implement; no functional components necessary.
Execution-driven: the simulator runs the program (trace-on-the-fly). Difficult to implement, but with advantages:
- Faster than tracing
- No need to store traces
- Register and memory values usually are not in the trace
- Supports mis-speculation cost modeling
SimpleScalar Tool Set Overview
- Computer architecture research test bed
- Compilers, assembler, linker, libraries, and simulators
- Targeted to the virtual SimpleScalar architecture
- Hosted on most any Unix-like machine

SimpleScalar Suite

Strengths of SimpleScalar
- Highly flexible: functional simulator + performance simulator
- Portable: the virtual target runs on most Unix-like host systems, and the simulators can support multiple target ISAs
- Extensible: source is included for the compiler, libraries, and simulators; easy to write simulators
- Performance: runs codes approaching 'real' sizes
Hello World!
./sim-safe helloworld

Create a new file, hello.c, with the following code:
#include <stdio.h>
main() {
    printf("Hello World!\n");
}
Then compile it using the following command:
$ $IDIR/bin/sslittle-na-sstrix-gcc -o hello hello.c
That should generate a file hello, which we then run on the simulator:
$ $IDIR/simplesim-3.0/sim-safe hello
In the output, you should be able to find the following:
sim: ** starting functional simulation **
Hello World!
TESTS-PISA - test-math
./sim-safe test-math

A simple set of test executables lives in ./tests-pisa/bin.little/: anagram, test-fmath, test-lswlr, test-printf, test-llong, test-math.
Run test-math:
$ $IDIR/simplesim-3.0/sim-safe tests-pisa/bin.little/test-math
In the output (1)
./sim-safe test-math

sim: ** starting functional simulation **
pow(12.0, 2.0) == 144.000000
pow(10.0, 3.0) == 1000.000000
pow(10.0, -3.0) == 0.001000
str: 123.456
x: 123.000000
str: 123.456
x: 123.456000
str: 123.456
x: 123.456000
123.456 123.456000 123 1000
sinh(2.0) = 3.62686
sinh(3.0) = 10.01787
h=3.60555
atan2(3,2) = 0.98279
pow(3.60555,4.0) = 169
169 / exp(0.98279 * 5) = 1.24102
3.93117 + 5*log(3.60555) = 10.34355
cos(10.34355) = -0.6068, sin(10.34355) = -0.79486
x 0.5xx0.5 xx 0.5x
-1e-17 != -1e-17 Worked!
In the output (2)
./sim-safe test-math

sim: ** simulation statistics **
sim_num_insn              213703 # total number of instructions executed
sim_num_refs               56899 # total number of loads and stores executed
sim_elapsed_time               1 # total simulation time in seconds
sim_inst_rate        213703.0000 # simulation speed (in insts/sec)
ld_text_base          0x00400000 # program text (code) segment base
ld_text_size               91744 # program text (code) size in bytes
ld_data_base          0x10000000 # program initialized data segment base
ld_data_size               13028 # program init'ed `.data' and uninit'ed `.bss' size in bytes
ld_stack_base         0x7fffc000 # program stack segment base (highest address in stack)
ld_stack_size              16384 # program initial stack size
ld_prog_entry         0x00400140 # program entry point (initial PC)
ld_environ_base       0x7fff8000 # program environment base address
ld_target_big_endian           0 # target executable endian-ness, non-zero if big endian
mem.page_count                33 # total number of pages allocated
mem.page_mem                132k # total size of memory pages allocated
mem.ptab_misses               34 # total first level page table misses
mem.ptab_accesses        1546771 # total page table accesses
mem.ptab_miss_rate        0.0000 # first level page table miss rate
Cache Simulator
./sim-cache test-math

Run test-math with sim-cache:
$ $IDIR/simplesim-3.0/sim-cache tests-pisa/bin.little/test-math
In the output
./sim-cache test-math

sim: ** simulation statistics **
sim_num_insn          213703 # total number of instructions executed
sim_num_refs           56899 # total number of loads and stores executed
sim_elapsed_time           1 # total simulation time in seconds
sim_inst_rate    213703.0000 # simulation speed (in insts/sec)
il1.accesses          213703 # total number of accesses
il1.hits              189940 # total number of hits
il1.misses             23763 # total number of misses
il1.replacements       23507 # total number of replacements
il1.writebacks             0 # total number of writebacks
il1.invalidations          0 # total number of invalidations
il1.miss_rate         0.1112 # miss rate (i.e., misses/ref)
il1.repl_rate         0.1100 # replacement rate (i.e., repls/ref)
il1.wb_rate           0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses           57480 # total number of accesses
dl1.hits               56675 # total number of hits
dl1.misses               805 # total number of misses
dl1.replacements         549 # total number of replacements
dl1.writebacks           416 # total number of writebacks
dl1.invalidations          0 # total number of invalidations
dl1.miss_rate         0.0140 # miss rate (i.e., misses/ref)
dl1.repl_rate         0.0096 # replacement rate (i.e., repls/ref)
dl1.wb_rate           0.0072 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
...
Cache Configuration
93
./sim-cache -cache:dl1 dl1:32:32:32:f test-math
Cache configuration
<name>:<nsets>:<bsize>:<assoc>:<repl>
<name>  - name of the cache being defined
<nsets> - number of sets in the cache
<bsize> - block size of the cache
<assoc> - associativity of the cache
<repl>  - block replacement strategy: 'l' = LRU, 'f' = FIFO, 'r' = random
Example: -cache:dl1 dl1:4096:32:1:l
Run test-math with sim-cache:
$ $IDIR/simplesim-3.0/sim-cache -cache:dl1 dl1:32:32:32:f tests-pisa/bin.little/test-math
In the output
94
./sim-cache -cache:dl1 dl1:32:32:32:f test-math
sim: ** simulation statistics **
sim_num_insn         213703  # total number of instructions executed
sim_num_refs          56899  # total number of loads and stores executed
sim_elapsed_time          1  # total simulation time in seconds
sim_inst_rate   213703.0000  # simulation speed (in insts/sec)
il1.accesses         213703  # total number of accesses
il1.hits             189940  # total number of hits
il1.misses            23763  # total number of misses
il1.replacements      23507  # total number of replacements
il1.writebacks            0  # total number of writebacks
il1.invalidations         0  # total number of invalidations
il1.miss_rate        0.1112  # miss rate (i.e., misses/ref)
il1.repl_rate        0.1100  # replacement rate (i.e., repls/ref)
il1.wb_rate          0.0000  # writeback rate (i.e., wrbks/ref)
il1.inv_rate         0.0000  # invalidation rate (i.e., invs/ref)
dl1.accesses          57480  # total number of accesses
dl1.hits              56938  # total number of hits
dl1.misses              542  # total number of misses
dl1.replacements          0  # total number of replacements
dl1.writebacks            0  # total number of writebacks
dl1.invalidations         0  # total number of invalidations
dl1.miss_rate        0.0094  # miss rate (i.e., misses/ref)
dl1.repl_rate        0.0000  # replacement rate (i.e., repls/ref)
dl1.wb_rate          0.0000  # writeback rate (i.e., wrbks/ref)
dl1.inv_rate         0.0000  # invalidation rate (i.e., invs/ref)
…
Difference
95
Different Cache Configurations
Different configurations:
< -cache:dl1 dl1:32:32:32:f  # l1 data cache config, i.e., {<config>|none}
> -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}

dl1 output differences:
< dl1.hits          56938  # total number of hits
< dl1.misses          542  # total number of misses
< dl1.replacements      0  # total number of replacements
< dl1.writebacks        0  # total number of writebacks
> dl1.hits          56675  # total number of hits
> dl1.misses          805  # total number of misses
> dl1.replacements    549  # total number of replacements
> dl1.writebacks      416  # total number of writebacks
Performance Simulation
96
./sim-outorder test-math
sim-outorder
  Performance simulation
  Out-of-order issue
Run test-math with sim-outorder:
$ $IDIR/simplesim-3.0/sim-outorder tests-pisa/bin.little/test-math
In the output
97
./sim-outorder test-math
sim: ** simulation statistics **
sim_num_insn         213703  # total number of instructions committed
sim_num_refs          56899  # total number of loads and stores committed
sim_num_loads         34105  # total number of loads committed
sim_num_stores   22794.0000  # total number of stores committed
sim_num_branches      38594  # total number of branches committed
sim_elapsed_time          1  # total simulation time in seconds
sim_inst_rate   213703.0000  # simulation speed (in insts/sec)
sim_total_insn       233029  # total number of instructions executed
sim_total_refs        61927  # total number of loads and stores executed
sim_total_loads       37545  # total number of loads executed
sim_total_stores 24382.0000  # total number of stores executed
sim_total_branches    42770  # total number of branches executed
sim_cycle            224302  # total simulation time in cycles
sim_IPC              0.9527  # instructions per cycle
sim_CPI              1.0496  # cycles per instruction
sim_exec_BW          1.0389  # total instructions (mis-spec + committed) per cycle
sim_IPB              5.5372  # instructions per branch
IFQ_count            352201  # cumulative IFQ occupancy
IFQ_fcount            74028  # cumulative IFQ full count
ifq_occupancy        1.5702  # avg IFQ occupancy (insn's)
ifq_rate             1.0389  # avg IFQ dispatch rate (insn/cycle)
ifq_latency          1.5114  # avg IFQ occupant latency (cycle's)
ifq_full             0.3300  # fraction of time (cycle's) IFQ was full
RUU_count           1440457  # cumulative RUU occupancy
RUU_fcount            45203  # cumulative RUU full count
ruu_occupancy        6.4220  # avg RUU occupancy (insn's)
ruu_rate             1.0389  # avg RUU dispatch rate (insn/cycle)
ruu_latency          6.1814  # avg RUU occupant latency (cycle's)
ruu_full             0.2015  # fraction of time (cycle's) RUU was full
…
Contents
98
SimpleScalar Overview
  Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
PHiLOSoftware Simulator: SimpleScalar + CACTI + SystemC TLM
  Demo 2: Bus Protocol Selection
  Demo 3: Software-Controlled Memory Virtualization
A Taxonomy of Simulation Tools
99
Software-Controlled Memories
Two types of memory subsystems:
  Hardware-controlled caches
  Software-controlled SRAMs: ScratchPad Memories (SPMs), Tightly Coupled Memories (ARM), Local Stores (Cell), streaming memories, software caches
SPMs are preferred over caches because they:
  Have a smaller area footprint
  Consume less power
  Are more predictable
  Allow explicit control of their data
SPMs lack hardware support:
  Compiler and programmer need to manage them explicitly
  Accessed through physical addresses
  Difficult to capture irregularity at compile time
What about sharing their address space?
Figure: ARM Cortex-M3 address map
New memory subsystems are being deployed with distributed on-chip software-controlled memories
[Leverich et al., ISCA ‘07]
04/19/23 100
Goal: abstract the physical characteristics of the device
Challenges:
  Software-controlled memories are explicitly accessed; programmers assume full access to their physical space
  Sharing issues and security risks (open environments)
  Different physical characteristics:
    Due to voltage scaling: increased latencies, higher error rates
    Due to the interconnection network: different access latencies (local vs. remote)
    Due to technology: process variations, SRAM vs. NVM characteristics
Virtualization as a viable solution
  Traditional virtualization makes no distinction between address spaces
  The two problems are mutually dependent
Can we exploit virtualization to minimize programmer burden while opportunistically exploiting the variation in physical characteristics of the device?
101
Figure: PHiLOSoftware's Heterogeneous Virtual Address Space. CPUs with voltage-scaled and nominal-voltage SPMs and NVMs connect through an interconnection network to a memory controller driving low-power and high-power DIMMs; on-chip space can be allocated locally or remotely, with SRAM preferred over NVM over DRAM. The resulting virtual address spaces:
(1) Voltage-scaled: low power / mid latency
(2) Voltage-scaled: fault-tolerant
(3) Nominal-voltage: high power / low latency
(4) Nominal-voltage: low priority
(5) Nominal-voltage: higher power / latency
(6) NVM: high write power / latency
(7) NVM: higher write power / latency
(8) Low-power DRAM
(9) High-power DRAM
Propose the idea of virtual address spaces with different characteristics
102
PHiLOSoftware Framework
Figure: the framework spans three layers.
Application Layer (static): applications and services carry software annotations -- @LowPower(arr,a,64), @Reliable(arr,b,64), @Secure(arr,c,128) -- processed by the compiler (static analysis); third-party apps arrive from an application market via over-the-air updates, guided by a user profile manager.
Run-Time Layer (dynamic): the run-time system (OS/hypervisor) hosts the application OS, services OS, private OS, and third-party OS.
Platform Configuration (variable): the platform type (CMP, NoC, MPSoC) and memory technology (SRAM, NVMs, DRAM) -- CPUs, SPMs, emerging NVMs, DRAM, HDD, and a DMA engine spanning voltage-scaled low-power, medium-power, and high-power domains -- drive the allocation policies.
103
PHiLOSoftware Simulator
Figure: the simulator couples a simulated RTOS environment and a simulated virtualized environment (apps A1-A8 on GuestOS1-GuestOS4 over a hypervisor), running embedded benchmarks (SHA, AES, BLOWFISH, H263, MOTION, JPEG, GSM, ADPCM) on a nominal-voltage CMP with four CPUs, four SPMs, a manager, S-DMA, and off-chip memories spanning low-power voltage-scaled, medium-power, and high-power domains.
Inputs: application policies and annotations (@LowPower(arr,a,64), @Reliable(arr,b,64), @Secure(arr,c,128)), SimpleScalar traces, memory technology models (CACTI/NVSim), a platform DB, and fault & variability models feed the SystemC TLM power/performance models.
while(...) {
  ...
  if (!p.is_finished() && p.has_more()) {
    inst = p.get_next_inst();
#ifdef VERBOSE
    printf("%g %s : EXECUTE_INST <%s, 0x%lx, 0x%lx>\n",
           sc_simulation_time(), name(), inst.op.c_str(), inst.pc, inst.address);
#endif
#ifdef LOAD_INSTRUCTIONS
    entry_p = p.v_lut.lookup(inst.pc);
    if (entry_p != NULL) {
      if (entry_p->in_spm) {
        spm_ax_inc(p.id);
#ifdef VSPM_ENABLED
        read(&off_bus_port, entry_p->pa, packet);
#else
        read(&on_bus_port, entry_p->pa, packet);
#endif
      }
#ifdef RISC_CYCLES_EN
      wait(RISC_CYCLES(S_ISA_TO_INT_ISA(inst.op)) + inst.cycles, SC_NS);
#else
      wait(1, SC_NS);
#endif
...
// create cpus
gen_0 = new generator("gen_0", MIN_PRIO+1, NORMAL, 2, 0x00);  // priority, HW ID
gen_1 = new generator("gen_1", MIN_PRIO+2, NORMAL, 2, 0x01);
spmvisor_0 = new spmvisor("spmvisor_0", SPMVISORBASE,
                          SPMVISORBASE + SPMVISORSPACESIZE - 1,  // physical start/end address
                          1, 1, MIN_PRIO+0, NORMAL, 2, 0x04);
// mem
spm_0 = new spm("spm_0", SPMBASE + 0*SPM_SIZE, SPMBASE + 1*SPM_SIZE - 1, 1, 1);
spm_1 = new spm("spm_1", SPMBASE + 1*SPM_SIZE, SPMBASE + 2*SPM_SIZE - 1, 1, 1);
// bus
bus_0 = new channel_amba2("bus_amba2_0", false);
bus_arbiter_0 = new arbiter("arbiter_amba2_0", STATIC, false);  // arbitration policy
bus_1 = new channel_amba2("bus_amba2_1", false);
bus_arbiter_1 = new arbiter("arbiter_amba2_1", STATIC, false);

Sample architecture configuration for a 2-core CMP with SPMs; the callouts in the slide highlight each module's priority, HW ID, physical start/end addresses, and the bus arbitration policy.

// connect modules to the bus (module connectivity)
bus_0->arbiter_port(*bus_arbiter_0);
gen_0->off_bus_port(*bus_0);  gen_0->on_bus_port(*bus_1);
gen_1->off_bus_port(*bus_0);  gen_1->on_bus_port(*bus_1);
spmvisor_0->bus_port_req(*bus_0);  spmvisor_0->bus_port_spm(*bus_1);
bus_0->slave_port(*spmvisor_0);  bus_0->slave_port(*ram);
bus_1->slave_port(*spm_0);  bus_1->slave_port(*spm_1);
Contents
105
SimpleScalar Overview
  Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
PHiLOSoftware Simulator: SimpleScalar + CACTI + SystemC TLM
  Demo 2: Bus Protocol Selection
  Demo 3: Software-Controlled Memory Virtualization
PHiLOSoftware DEMO1: Bus Protocol Selection
Figure: four CPUs, each with an 8KB I$ and 8KB D$, modeled by SimpleScalar and attached to an AMBA AHB TLM bus model (SystemC TLM, timed with CACTI).
SimpleScalar trace (access type / address / number of bytes):
lw    r16,0(r29)      0x400140   0x400140 _YY_R_ 0x7fff8000 4 0
lui   r28,0x1001      0x400148
addiu r28,r28,-26160  0x400150
addiu r17,r29,4       0x400158
addiu r3,r17,4        0x400160
sll   r2,r16,2        0x400168
addu  r3,r3,r2        0x400170
addu  r18,r0,r3       0x400178
sw    r18,-32412(r28) 0x400180   0x400180 _YY_W_ 0x10001b34 4 0
addiu r29,r29,-24     0x400188
addu  r4,r0,r16       0x400190
addu  r5,r0,r17       0x400198
addu  r6,r0,r18       0x4001a0
jal   0x403800        0x4001a8
addiu r29,r29,-24     0x403800
DEMO1: Running
Usage: testbench.x <arb protocol> <cycles to simulate>
./cmp_2cpu_cache.x STATIC 1000000
<arb protocol> can be STATIC, RANDOM, ROUNDROBIN, TDMA, or TDMA_RR
<cycles to simulate>: from 10,000 upward; use -1 for a full application run
Sample run output:
501858 ram : read 0x1046370347 at address 0xc14
501940 ram : write 0x364210008 at address 0xffdfe20
501946 simple_cpu_0 : EXECUTE_INST <lw, 0x4005c8, 0x2147450764>
501951 simple_cpu_0 : EXECUTE_INST <addiu, 0x0, 0x11032>
501951 simple_cpu_0 : EXECUTE_INST <sw, 0x4005d8, 0x2147450764>
502022 ram : read 0x628966950 at address 0xc18
502114 simple_cpu_0 : EXECUTE_INST <sw, 0x4005e8, 0x268445784>
DEMO1: Output Analysis
Effect of the arbitration protocol:
  STATIC is blocking (e.g., CPU 0 has highest priority) and may starve the other CPUs
    Run 10,000 cycles with the STATIC protocol to see an example
  Each arbitration protocol behaves differently, which can be observed in the number of arbitration wait cycles at each cache interface
./cmp_2cpu_cache.x STATIC 1000000
  cache_0: Arbitration wait cycles (Reads)  = 0
  cache_0: Arbitration wait cycles (Writes) = 0
  cache_1: Arbitration wait cycles (Reads)  = 4581
  cache_1: Arbitration wait cycles (Writes) = 391
./cmp_2cpu_cache.x TDMA 1000000
  cache_0: Arbitration wait cycles (Reads)  = 103313
  cache_0: Arbitration wait cycles (Writes) = 128494
  cache_1: Arbitration wait cycles (Reads)  = 148688
  cache_1: Arbitration wait cycles (Writes) = 28329
DEMO1: Output Analysis
TDMA
  Fair, but increases total execution time because of fixed time slots
  FULL SIM: 8.44379e+06 cycles
STATIC
  Might cause starvation; processes with high priority finish faster (e.g., CPU0 has highest priority)
  FULL SIM: 5.93757e+06 cycles
  Reversed priorities (CPU1 > CPU0): 6.0784e+06 cycles
    CPU0 has the longest task to execute, so it suffers performance degradation when CPU1 has higher priority
DEMO1: 8-Core CMP
./cmp_8cpu_cache.x STATIC 1000000
  1 million cycles, STATIC arbitration protocol
./cmp_8cpu_cache.x RANDOM 1000000
  Same as above, with RANDOM arbitration
STATIC (transaction starvation; only caches 0-2 shown):
ram: |cache_0| Reads: 2240 Writes: 4936
ram: |cache_1| Reads: 3072 Writes: 1518
ram: |cache_2| Reads:  475 Writes:   19
RANDOM (greater fairness):
ram: |cache_0| Reads: 1150 Writes: 416
ram: |cache_1| Reads: 1445 Writes: 198
ram: |cache_2| Reads: 1088 Writes: 492
ram: |cache_3| Reads:  768 Writes: 679
ram: |cache_4| Reads:  576 Writes: 880
ram: |cache_5| Reads:  576 Writes: 901
ram: |cache_6| Reads:  768 Writes: 784
ram: |cache_7| Reads: 1088 Writes: 470
DEMO1: 8 Core CMP – Protocol Comparison
Protocol     Reads  Writes  Total  Avg Transactions  Avg Arb. Wait Cycles  $ Read STDev  $ Write STDev
STATIC        5787    6473  12260       766.25              9802.5         1145.329531   1636.738941
RANDOM        7459    4820  12279       767.4375           10051.125        288.6667358   232.6671442
ROUNDROBIN    7076    4460  11536       721                68994.75         238.1438641   238.1438641
TDMA          7076    4460  11536       721                68994.75         238.1438641   238.1438641

- Almost the same number of accesses for all protocols
- A higher STDev means higher variation in the number of accesses across caches, e.g., STATIC vs. RANDOM, RR, TDMA
Contents
112
SimpleScalar Overview
  Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
PHiLOSoftware Simulator: SimpleScalar + CACTI + SystemC TLM
  Demo 2: Bus Protocol Selection
  Demo 3: Software-Controlled Memory Virtualization
Shared SPMs in Heterogeneous Multi-tasking Environments (Open Environment)
Problem: fixed allocation policies
  Enforce pre-defined policies for known single/multiple apps
  When no SPM space is available, all data is mapped off-chip
  What about data and task criticality/priority?
Need dynamic and selective enforcement of allocation policies
  Reduced power consumption / better performance
Figure: two CPUs with SPM0/SPM1 and main memory (MM) run App1-App4 over an RTOS, connected by an AMBA AHB bus with crypto and DMA engines; arrows mark MM-SPM transfers, and tasks carry high, medium, or low priority. App5, a newly downloaded application the user wants to launch, cannot run because the fixed policy has already committed the SPM space.
Fixed policies do not work well in open environments where the number of applications running concurrently is unknown!
Virtual ScratchPad Memories (vSPMs)
Figure: the physical address map splits into SPM space (0 to 4K-1), Protected Evict Memory (PEM) space (4K to 8K-1), and MM space (up to nGB-1). Applications A1 and A2 compete for the same physical SPM, but each sees its own dedicated vSPM (vSPM1, vSPM2) built from 1K blocks.
Block-based priority-driven allocation policies
Priority-based Dynamic Memory Allocation
  May have data- and application-based priorities
  Create vSPM(s) prior to running applications
  Selective eviction: no need to evict data from SPMs on every context switch
Figure: CPU0/CPU1 with SPM0/SPM1, a PEM region, and data blocks S1-S6 at priorities P1-P3. Under normal vSPM allocation, low-priority data is mapped to the PEM space. When a new (trusted) application must launch, selective, data-priority-driven eviction moves only the low-priority blocks to PEM.
  ALL data is protected through vSPMs
  Supports priority-based selective allocation (data and application driven)
  Minimizes the overhead of creating trusted environments (data protected through the SPMVisor)
DEMO3: Setup
  Total of 4 active cores, with 8KB of SPM space per core (32KB on-chip total)
  SPMVisor: 1 vSPM simulated per app, for a total of 8 x 8KB = 64KB of on-chip virtualized space
  Simulated virtualized environment generates the input for the SystemC model: annotated traces with per-application context-switch information
Figure: the SPMVisor mediates between four CPUs with SPMs and the MM, crypto, and S-DMA modules; the simulated virtualized environment runs apps A1-A8 on GuestOS1-GuestOS4 over a hypervisor.
name  time_slot  cx_cost  mem_sz(MB)  n_app  {program name, ...}
os_a  10000      4000     128         2      adpcm/mem.trace aes/mem.trace
os_b  10000      4000     128         2      blowfish/mem.trace gsm/mem.trace
os_c  10000      4000     128         2      h263/mem.trace jpeg/mem.trace
os_d  10000      4000     128         2      motion/mem.trace sha/mem.trace
hyp   20000      6000     128

name  entries  policy (supported: fully assoc (FIFO) = 1; not yet supported: 2-way set assoc = 2, 4-way set assoc = 4)
dtlb  12       1
DEMO3: Hypervisor - CX vs. vSPMs
Hypervisor CX: on a context switch, evict data from the SPMs and load data for the new tasks onto the SPMs; protects the integrity of SPM data
Hypervisor w/ vSPMs: no need to evict data from the SPMs; each application has a dedicated virtual space; at run time, only the SPM allocation tables are loaded

Per-object lifetime and access profile (from the annotated traces):
Object Name      T_Start   T_End   Lifetime  Addr_Start  Addr_End    # of Accesses
Buffer_0x7f<7>:        1      87         86  2147450879  2147454974              7
Buffer_0x7f<8>:       16  233608     233592  2147446783  2147450878          84825
Buffer_0x10<0>:     1295  232119     230824   268435456   268439551           3822
Buffer_0x10<1>:       89  233598     233509   268439552   268443647            370
Buffer_0x10<2>:        9  233169     233160   268443648   268447743          25088
Buffer_0x10<3>:   226971  233166       6195   268447744   268451839           1058
Buffer_0x10<4>:   228507  230038       1531   268451840   268455935           1024
Buffer_0x10<5>:   230043  231574       1531   268455936   268460031           1024
Buffer_0x10<6>:   231816  233043       1227   268460032   268464127             17
DEMO3: Hypervisor - CX vs. vSPMs (Cont.)
Comparison of CX and vSPMs
E1: spmvisor_e1.x - 2 applications, 1 OS, hypervisor, 1 core, 1 SPM, 2 vSPMs
    rtos_e1.x     - 2 applications, 1 OS, hypervisor, 1 core, 1 SPM

                   Traditional (CX)  SPMVisor     Improvement
Execution Time     2.89E+07          1.46E+07     49% lower execution time
Total Energy (nJ)  677486.6228       52877.27006  92% energy savings
DEMO3: Hypervisor - CX vs. vSPMs (Cont.)
Comparison of CX and vSPMs
E4: spmvisor_e4.x - 8 applications, 4 OSes, hypervisor, 4 cores, 4 SPMs, 8 vSPMs
    rtos_e4.x     - 8 applications, 4 OSes, hypervisor, 4 cores, 4 SPMs

                   Traditional (CX)  SPMVisor     Improvement
Execution Time     6.25E+07          2.18E+07     65% lower execution time
Total Energy (nJ)  2475745.172       465330.5984  81% energy savings
- The number of data evictions/loads due to context switching hurts both performance and energy
Questions?
120
Contact Information
Houman Homayoun
Email: [email protected]
http://cseweb.ucsd.edu/~hhomayoun/

Manish Arora
Email: [email protected]
http://cseweb.ucsd.edu/~marora/

Luis Angel Bathen
Email: [email protected]/~lbathen/
121
Architecture Tools Publicly Available for Download
HotSpot          http://lava.cs.virginia.edu/HotSpot/
DARSIM (Hornet)  http://csg.csail.mit.edu/hornet/
NVSIM            http://www.rioshering.com/nvsimwiki/index.php?title=Main_Page
McPAT            http://www.hpl.hp.com/research/mcpat/
CACTI            http://www.hpl.hp.com/research/cacti/
gem5 (M5)        http://www.m5sim.org/Main_Page
SimpleScalar     http://www.simplescalar.com/
GPGPU-SIM        http://www.gpgpu-sim.org/
VARIUS           http://iacoma.cs.uiuc.edu/varius/
122