System-Level Exploration of Power, Temperature, Performance, and Area
for Multicore Architectures
Houman Homayoun1, Manish Arora1, Luis Angel D. Bathen2
1Department of Computer Science and Engineering, University of California, San Diego
2Department of Computer Science, University of California, Irvine
Outline
- Why power, temperature and reliability? (H. Homayoun)
- Tools for architects (H. Homayoun)
- Thermal simulation using HotSpot (H. Homayoun)
- Power/performance modeling for NVMs using NVSim (H. Homayoun)
- Cycle-level NoC simulation using DARSIM (H. Homayoun)
- Using bottleneck analysis and McPAT for efficient CPU design space exploration (M. Arora)
- SimpleScalar: a computer system design and analysis infrastructure (L. Bathen)
- PHiLOSoftware: software-controlled memory virtualization (L. Bathen): SimpleScalar + CACTI + NVSIM + SystemC
Power
Symptoms: short-lived batteries, huge heatsinks, high electric bills, unreliable microprocessors.
Power consumption is a first-order design constraint:
- Embedded systems: battery lifetime
- High-performance systems: heat dissipation
- Large-scale systems: billing cost
[Theme diagram: Power - Thermal - Reliability]
Power Importance - Data Center Example
2006: data centers in the US consumed 61 billion kWh, an electricity cost of about $9 billion1
2012: doubles to $18 billion1
Data center power breakdown: processor + RAM (65%), storage (21%), network (9%), others (5%)
Impact of a 10% reduction in processor and RAM power consumption: a 6.5% reduction of total power consumption
$18B x 6.5% = $1.1 billion in savings
1Environmental Protection Agency (EPA), "Report to Congress on Server and Data Center Energy Efficiency", August 2, 2007
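The savings arithmetic above can be checked directly: a 10% cut in the 65% processor + RAM share yields the 6.5% total reduction.

```python
# Data-center savings estimate using the slide's EPA-derived figures.
total_bill = 18e9          # projected 2012 electricity cost, $
proc_ram_share = 0.65      # fraction of power going to processor + RAM
reduction = 0.10           # assumed processor/RAM power reduction

total_reduction = proc_ram_share * reduction   # 6.5% of total power
savings = total_bill * total_reduction         # ~ the slide's $1.1 billion
print(f"total reduction: {total_reduction:.1%}, savings: ${savings / 1e9:.2f}B")
```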
Temperature Trend - Temperature Crisis
Energy = heat, and heat dissipation is costly: increasing power dissipation in computer systems leads to ever-increasing cooling costs.
[Chart: max power density (W/cm^2), log scale from 1 to 10000, vs. year 1975-2015, for processors from the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6 through POWER2/3/4/6, Itanium 2, AMD K8, Athlon II, Core 2 Duo, and Nehalem; power density passes that of a hot plate and approaches that of a nuclear reactor and a rocket nozzle.]
Reliability Trend - Reliability Crisis
Lifetime reliability is decreasing:
- High power densities lead to high temperatures; every 10 degree temperature increase roughly doubles the failure rate
- Technology scaling brings manufacturing defects and process variation
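The rule of thumb above (failure rate doubling per 10 degrees) can be sketched as a simple exponential scaling; the 10-degree doubling interval is the slide's rule of thumb, not a fitted constant.

```python
def failure_rate_multiplier(delta_t: float, doubling_interval: float = 10.0) -> float:
    """Relative failure rate after a temperature increase of delta_t degrees,
    using the rule of thumb that the rate doubles every `doubling_interval` degrees."""
    return 2.0 ** (delta_t / doubling_interval)

print(failure_rate_multiplier(10))  # 2.0 (one doubling)
print(failure_rate_multiplier(25))  # ~5.66 (2.5 doublings)
```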
Source: N. Kim, et al., Analyzing the Impact of Joint Optimization of Cell Size, TVLSI 2011
[Figure: probability of failure vs. relative cell size]
Efficiency Crisis
Moore's law allows the transistor budget to double every 18 months, but the power and thermal budgets have not changed significantly: an efficiency problem in new generations of microprocessors.
[Figure: Intel Teraflops (many-core) chip, 8x10 cores, with cores powered down]
What Can We Do About It?
In order to achieve "sustainable computing" and fight back against the "power problem", we need to rethink from a "green computing" perspective:
- Understand the levels of design abstraction: technology level, circuit level, and architecture level.
- Understand where power/temperature is dissipated, where reliability issues are exposed, and where performance bottlenecks exist.
- Think about ways to tackle these issues at all levels.
Tools for Architects
Performance:
- CPU: SimpleScalar, GEMS5, SMTSIM
- GPU: GPGPU-Sim
- Network: DARSIM, NoC-SIM
Power:
- CPU: Wattch
- SRAM/CAM, eDRAM cache: CACTI
- Non-volatile memory: NVSim
Reliability: VARIUS
Temperature: HotSpot
Power, timing, area for CPU + cache + main memory: McPAT
Thermal Simulation Using HOTSPOT
Slides Courtesy of A Quick Thermal Tutorial by Kevin Skadron and Mircea Stan
Thermal Modeling1
A fine-grained, dynamic model for temperature that architects can use:
- Accounts for adjacency and package
- Requires no detailed designs
- Provides a detailed temperature distribution
- Fast
HotSpot: a compact model based on thermal Rs and Cs, parameterized for various architectures, power models, floorplans, and thermal packages.
[1] A Quick Thermal Tutorial, Kevin Skadron and Mircea Stan
HotSpot
- Time evolution of temperature is driven by unit activities and power dissipations averaged over N cycles (10K cycles shown to provide enough accuracy)
- Power dissipations can come from any power simulator; they act as "current sources" in the RC circuit (the 'P' vector in the equations)
- Simulation overhead in Wattch/SimpleScalar: <1%
- Requires models of:
  - Floorplan: important for adjacency
  - Package: important for spreading and time constants
- R and C matrices are derived from the above
Example System1
[Figure: cross-section of an IC package showing the heat sink, heat spreader, interface material, die, PCB, and pins]
[1] A Quick Thermal Tutorial, Kevin Skadron and Mircea Stan
HotSpot Implementation
Primarily a circuit solver.
Steady-state solution:
- Mainly matrix inversion, done in two steps: decomposition of the matrix into lower and upper triangular matrices, then successive backward substitution of solved variables
- Implements the pseudocode from CLR
Transient solution:
- Inputs: current temperature and power; output: temperature for the next interval
- Computed using a fourth-order Runge-Kutta (RK4) method
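The transient step above can be sketched as classic RK4 applied to the lumped RC network C dT/dt = P - G T. The two-node G and C matrices below are toy values for illustration; HotSpot derives its real matrices from the floorplan and package.

```python
import numpy as np

def rk4_thermal_step(T, P, G, Cinv, dt):
    """One RK4 step of C dT/dt = P - G T, i.e. dT/dt = Cinv @ (P - G @ T)."""
    f = lambda T: Cinv @ (P - G @ T)
    k1 = f(T)
    k2 = f(T + 0.5 * dt * k1)
    k3 = f(T + 0.5 * dt * k2)
    k4 = f(T + dt * k3)
    return T + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy 2-node network: two blocks coupled laterally, each with a path to ambient.
G = np.array([[1.5, -0.5], [-0.5, 1.5]])  # thermal conductance matrix (W/K)
Cinv = np.diag([10.0, 10.0])              # inverse heat capacities (K/J)
T = np.array([0.0, 0.0])                  # temperature rise over ambient (K)
P = np.array([3.0, 0.0])                  # unit power inputs (W)
for _ in range(1000):
    T = rk4_thermal_step(T, P, G, Cinv, 0.01)
print(T)  # converges toward the steady state solve(G, P)
```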
Validation
- Validated and calibrated using MICRED test chips: a 9x9 array of power dissipators and sensors, compared to HotSpot configured with the same grid and package
  - Within 7% for both steady-state and transient step response
  - Interface material (chip/spreader) matters
- Also validated against an FPGA: can instantiate a temperature sensor based on a ring oscillator and counter
- Also validated against IBM ANSYS FEM simulations
HotSpot Interface
Inputs:
- a power trace file
- a floorplan file
- a config file (package information)
Outputs:
- the corresponding transient temperatures
- steady-state temperatures
- thermal map (Perl script)
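A power trace can be produced by any script or simulator. The sketch below assumes the format used by the examples in the HotSpot distribution: the first line lists the floorplan unit names, and each subsequent line gives one power sample (in watts) per unit for one averaging interval. The unit names and values here are illustrative.

```python
# Sketch: generate a HotSpot power trace (.ptrace) file.
units = ["Icache", "Dcache", "IntExec"]   # must match names in the .flp file
samples = [
    [0.9, 1.1, 2.4],   # interval 0: power per unit, in watts
    [0.8, 1.3, 2.1],   # interval 1
]
with open("gcc.ptrace", "w") as f:
    f.write("\t".join(units) + "\n")
    for row in samples:
        f.write("\t".join(f"{p:.4f}" for p in row) + "\n")
```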
Config File
# thermal model parameters

# chip specs
# chip thickness in meters
-t_chip 0.00015
# silicon thermal conductivity in W/(m-K)
-k_chip 100.0
# silicon specific heat in J/(m^3-K)
-p_chip 1.75e6
# temperature threshold for DTM (kelvin)
-thermal_threshold 354.95

# heat sink specs
# convection capacitance in J/K
-c_convec 140.4
# convection resistance in K/W
-r_convec 0.1
# heatsink side in meters
-s_sink 0.06
# heatsink thickness in meters
-t_sink 0.0069
# heatsink thermal conductivity in W/(m-K)
-k_sink 400.0
# heatsink specific heat in J/(m^3-K)
-p_sink 3.55e6

# heat spreader specs
# spreader side in meters
-s_spreader 0.03
# spreader thickness in meters
-t_spreader 0.001
# heat spreader thermal conductivity in W/(m-K)
-k_spreader 400.0
# heat spreader specific heat in J/(m^3-K)
-p_spreader 3.55e6

# interface material specs
# interface material thickness in meters
-t_interface 2.0e-05
# interface material thermal conductivity in W/(m-K)
-k_interface 4.0
# interface material specific heat in J/(m^3-K)
-p_interface 4.0e6
FLP File
# Floorplan close to the Alpha EV6 processor
# Line Format: <unit-name>\t<width>\t<height>\t<left-x>\t<bottom-y>
# all dimensions are in meters
# comment lines begin with a '#'
# comments and empty lines are ignored
L2_left    0.004900  0.006200  0.000000  0.009800
L2         0.016000  0.009800  0.000000  0.000000
L2_right   0.004900  0.006200  0.011100  0.009800
Icache     0.003100  0.002600  0.004900  0.009800
Dcache     0.003100  0.002600  0.008000  0.009800
Bpred_0    0.001033  0.000700  0.004900  0.012400
Bpred_1    0.001033  0.000700  0.005933  0.012400
Bpred_2    0.001033  0.000700  0.006967  0.012400
DTB_0      0.001033  0.000700  0.008000  0.012400
DTB_1      0.001033  0.000700  0.009033  0.012400
DTB_2      0.001033  0.000700  0.010067  0.012400
FPAdd_0    0.001100  0.000900  0.004900  0.013100
FPAdd_1    0.001100  0.000900  0.006000  0.013100
FPReg_0    0.000550  0.000380  0.004900  0.014000
FPReg_1    0.000550  0.000380  0.005450  0.014000
FPMul_0    0.001100  0.000950  0.004900  0.014380
FPMul_1    0.001100  0.000950  0.006000  0.014380
FPMap_0    0.001100  0.000670  0.004900  0.015330
FPMap_1    0.001100  0.000670  0.006000  0.015330
IntMap     0.000900  0.001350  0.007100  0.014650
IntQ       0.001300  0.001350  0.008000  0.014650
IntReg_0   0.000900  0.000670  0.009300  0.015330
IntReg_1   0.000900  0.000670  0.010200  0.015330
IntExec    0.001800  0.002230  0.009300  0.013100
FPQ        0.000900  0.001550  0.007100  0.013100
LdStQ      0.001300  0.000950  0.008000  0.013700
ITB_0      0.000650  0.000600  0.008000  0.013100
ITB_1      0.000650  0.000600  0.008650  0.013100
HotSpot Modes of Running
Block level: fast, less accurate. Example:
hotspot -c hotspot.config -f ev6.flp -p gcc.ptrace -o gcc.ttrace -steady_file gcc.steady
Grid level: slow, more accurate. Example:
hotspot -c hotspot.config -f ev6.flp -p gcc.ptrace -steady_file gcc.steady -model_type grid
Change the grid size to trade off speed against accuracy; the default grid size is 64x64.
3D Modeling with HotSpot
HotSpot's grid model is capable of modeling stacked 3D chips. HotSpot needs a Layer Configuration File (LCF) for 3D simulation; the LCF specifies the set of vertical layers to be modeled, including each layer's physical properties (thickness, conductivity, etc.).
Example: LCF file for two layers
# <Layer Number>
# <Lateral heat flow Y/N?>
# <Power Dissipation Y/N?>
# <Specific heat capacity in J/(m^3K)>
# <Resistivity in (m-K)/W>
# <Thickness in m>
# <floorplan file>

# Layer 0: silicon
0
Y
Y
1.75e6
0.01
0.00015
ev6.flp

# Layer 1: thermal interface material (TIM)
1
Y
N
4e6
0.25
2.0e-05
ev6.flp

The sample file above shows an LCF corresponding to the default HotSpot configuration with two layers: one layer of silicon and one layer of thermal interface material (TIM).
Command line example:
hotspot -c hotspot.config -f <some_random_file> -p example.ptrace -o example.ttrace -model_type grid -grid_layer_file example.lcf
Example: Modeling memory peripheral temperature
Power/Performance Modeling for Non-Volatile Memories Using NVSim
Why yet another circuit-level estimation tool for cache memories?
- Emerging non-volatile memory devices show a large variation in performance, energy, and density.
- Some of them are performance-optimized; some of them are area-optimized.
- For system-level research, it is NOT correct to pick random device parameters from multiple sources.
NVSim1
NVSim is designed to be a general circuit-level performance, power, and area model.
Memory technologies supported: NAND, PCM, MRAM (STT-RAM), memristor, SRAM, DRAM, eDRAM.
[1] "Design Implications of Memristor-Based RRAM Cross-Point Structures", C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, DATE 2011
NVSim Model
Developed on the basis of CACTI:
- CACTI models SRAM and DRAM caches
- CACTI does NOT support eNVM
[Figure: CACTI-modeled memory subarray, a 2D array of memory cells plus peripheral circuitry: wordline drivers, row decoders, precharge & equalization, bitline mux, sense amplifiers, sense amplifier mux, output/write drivers. NVSim makes modifications at the subarray level and the bank level.]
Tricks (Subarray Level)
Why is the circuit design space so large? Many design tricks:
- Transistor type: high-performance, low-power, low-standby
- Interconnect type: wire pitch, repeater design
- Sense amp: current-sensing, voltage-sensing
- Driver: area-optimized, latency-optimized
- Array: MOS-accessed, cross-point
Configuring NVSim
NVSim provides a variety of functionalities by supporting two categories of configuration input files: <.cfg> files and <.cell> files.
- <.cfg> configuration: <.cfg> files specify the non-volatile memory module parameters and tune the design exploration knobs. The details of how to configure <.cfg> files are on the cfg files page.
- <.cell> configuration: <.cell> files specify the non-volatile memory cell properties. The information held in these files usually comes from the device level. NVSim provides default <.cell> files for PC-RAM, STT-RAM, and R-RAM, and also allows advanced users to tailor their own cell properties by adding new <.cell> files. The details of how to configure <.cell> files are on the cell files page.
NVSim Interface
-DesignTarget: cache
//-DesignTarget: RAM
//-DesignTarget: CAM

-CacheAccessMode: Normal
//-CacheAccessMode: Fast
//-CacheAccessMode: Sequential

//-OptimizationTarget: ReadLatency
//-OptimizationTarget: WriteLatency
//-OptimizationTarget: ReadDynamicEnergy
//-OptimizationTarget: WriteDynamicEnergy
//-OptimizationTarget: ReadEDP
//-OptimizationTarget: WriteEDP
//-OptimizationTarget: LeakagePower
-OptimizationTarget: Area
//-OptimizationTarget: Exploration

//-ProcessNode: 200
//-ProcessNode: 120
//-ProcessNode: 90
//-ProcessNode: 65
-ProcessNode: 45
//-ProcessNode: 32

-Capacity (KB): 128
//-Capacity (MB): 1
-WordWidth (bit): 512
-Associativity (for cache only): 8

-DeviceRoadmap: HP
//-DeviceRoadmap: LSTP
//-DeviceRoadmap: LOP

-Routing: H-tree
//-Routing: non-H-tree

-MemoryCellInputFile: SRAM.cell
//-MemoryCellInputFile: Memristor_3.cell
//-MemoryCellInputFile: PCRAM_JSSC_2007.cell
//-MemoryCellInputFile: PCRAM_JSSC_2008.cell
//-MemoryCellInputFile: PCRAM_IEDM_2004.cell
//-MemoryCellInputFile: MRAM_ISSCC_2007.cell
//-MemoryCellInputFile: MRAM_ISSCC_2010_14_2.cell
//-MemoryCellInputFile: MRAM_Qualcomm_IEDM.cell
//-MemoryCellInputFile: SLCNAND.cell

-Temperature (K): 380
Example 1: PCRAM1
32-nm 16MB 8-way L3 caches (with different PCRAM design optimizations)
[1] PCRAMsim: System-Level Performance, Energy, and Area Modeling for Phase-Change RAM, Xiangyu Dong, et al., ICCAD 2009
Cycle-Level NoC Simulation Using DARSIM
DARSIM1
A parallel, highly configurable, cycle-level network-on-chip simulator based on an ingress-queued wormhole router architecture. Most hardware parameters are configurable, including geometry, bandwidth, and crossbar dimensions.
Packets arrive flit-by-flit on ingress ports and are buffered in ingress virtual channel (VC) buffers until they have been assigned a next-hop node and VC; they then compete for the crossbar and, after crossing, depart from the egress ports.
[Figure: basic datapath of an NoC router modeled by DARSIM]
[1] Darsim: A Parallel Cycle-Level NoC Simulator, M. Lis, K. S. Shim, M. H. Cho, P. Ren, O. Khan, and S. Devadas, ISPASS 2011
DARSIM Simulation Parameters
--cycles arg        simulate for arg cycles (0 = until drained)
--packets arg       simulate until arg packets arrive (0 = until drained)
--stats-start arg   start statistics after cycle arg (default: 0)
--no-stats          do not report statistics
--no-fast-forward   do not fast-forward when system is drained
--memory-traces arg read memory traces from file arg
--log-file arg      write a log to file arg
--random-seed arg   set random seed (default: use system entropy)
--version           show program version and exit
-h [ --help ]       show this help message and exit
Sample Config File
[geometry]
height = 8
width = 8
[routing]
node = weighted
queue = set
one queue per flow = false
one flow per queue = false
[node]
queue size = 8
[bandwidth]
cpu = 16/1
net = 16
north = 1/1
east = 1/1
south = 1/1
west = 1/1
[queues]
cpu = 0 1
net = 8 9
north = 16 18
east = 28 30
south = 20 22
west = 24 26
[core]
default = injector
Network Configuration
-x (arg) : network width (8)
-y (arg) : network height (8)
-v (arg) : number of virtual channels per set (1)
-q (arg) : capacity of each virtual channel in flits (4)
-c (arg) : core type (memtraceCore)
-m (arg) : memory type (privateSharedMSI)
-n (arg) : number of VC sets
-o (arg) : output filename (output.cfg)
Sample Flow Event
tick 12094
flow 0x001b0000 size 13
tick 12140
flow 0x00001f00 size 5
tick 12141
flow 0x001f0000 size 5
tick 12212
flow 0x00002100 size 5
tick 12212
flow 0x00210000 size 13

The first two lines indicate that at cycle 12094, a packet consisting of 13 flits is injected at Node 27 (0x1b) with destination Node 0 (0x00).
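Reading the trace above, the flow ID appears to pack the source node into bits 16-23 and the destination node into bits 8-15; a small parser under that assumption (the layout is inferred from the sample, not from DARSIM documentation):

```python
def parse_flow_id(flow_id: int) -> tuple:
    """Split a DARSIM flow ID into (source, destination) node numbers,
    assuming the 0x00SSDD00 layout suggested by the sample trace."""
    src = (flow_id >> 16) & 0xFF
    dst = (flow_id >> 8) & 0xFF
    return src, dst

print(parse_flow_id(0x001B0000))  # (27, 0): injected at node 27, destined for node 0
print(parse_flow_id(0x00001F00))  # (0, 31)
```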
Statistics
flit counts:
flow 00010000: offered 58, sent 58, received 58 (0 in flight)
flow 00020000: offered 44, sent 44, received 44 (0 in flight)
flow 00030000: offered 34, sent 34, received 34 (0 in flight)
flow 00040700: offered 32, sent 32, received 32 (0 in flight)
flow 00050700: offered 34, sent 34, received 34 (0 in flight)
.....
all flows counts: offered 109724, sent 109724, received 109724 (0 in flight)

in-network sent flit latencies (mean +/- s.d., [min..max] in # cycles):
flow 00010000: 4.06897 +/- 0.364931, range [4..6]
flow 00020000: 5.20455 +/- 0.756173, range [5..9]
flow 00030000: 6.38235 +/- 0.874969, range [6..10]
flow 00040700: 6.1875 +/- 0.526634, range [6..8]
flow 00050700: 5.11765 +/- 0.32219, range [5..6]
.....
all flows in-network flit latency: 9.95079 +/- 20.5398
Example: Effect of Routing and VC1
[Figure] The effect of routing and VC configuration on network transit latency in a relatively congested network on the WATER benchmark: while O1TURN and ROMM clearly outperform XY, the margin is not particularly impressive.
[1] "Scalable Accurate Multicore Simulation in the 1000 core era", M. Lis et al., ISPASS 2011
Design Flow
Trace-Driven NUCA Non-Volatile Cache Simulation in 3D
- A cycle-accurate simulator such as SimpleScalar or SMTSIM feeds a cache trace to the NVSim-based simulator and to DARSIM for network simulation
- The NVSim-based simulator feeds back the new cache latency for performance impact
- HotSpot 3D (Virginia) performs the 3D thermal simulation and feeds temperature back for accurate leakage power modeling
Questions?
Using Bottleneck Analysis and McPAT for Efficient CPU Design Space Exploration
Manish Arora1
Computer Science and Engineering, University of California, San Diego
1Credit also goes to my co-authors: Feng Wang (Qualcomm), Bob Rychlik (Qualcomm), and Dean Tullsen (UC San Diego)
Tackling Design Complexity
- Increasingly complex design decisions; multicore exacerbates the problem
- Accurate simulation is slow; simulation of all design points is not feasible
- Commonly followed techniques are inadequate, e.g. sensitivity analysis:
  - Vary a single parameter while keeping the other parameters fixed, e.g. measure L2 performance by varying its size and keeping everything else constant
  - Dependent on the choice of fixed point of reference: L2 performance is correlated with L1 size
June 4 2012
Accelerating Design Space Exploration
Speeding up individual simulations:
- Benchmark subsetting (SMARTS [1], SimPoint [2], and MinneSpec [3])
- Analytical models instead of cycle-accurate simulation (Karkhanis et al. [4])
- Regression models to derive performance models (Lee et al. [5])
Design space pruning:
- Hill climbing (Systems et al. [6])
- Tabu search (Axelsson et al. [7])
- Genetic search (Palesi et al. [8])
- Plackett and Burman (P&B) based design (Yi et al. [9] and Arora et al. [10])
Plackett and Burman (P&B) Based Design
Advantages:
- Exploration over ranges of parameter values
- Linear or near-linear number of experiments
- Non-iterative technique (exploits cluster parallelism)
Workings:
- Provide a high value (+1) and low value (-1) for each component: CPU freq 2GHz (+1) / 1GHz (-1), L2 cache 1MB (+1) / 256KB (-1), and so on
- Run a P&B-specified set of experiments
- Evaluate the "impact" of each component, e.g. CPU frequency has a 30% influence when CPU freq is changed from 1GHz to 2GHz AND L2 from 256KB to 1MB AND ...
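As an illustration of the P&B workings described above (not the talk's actual design), the standard 8-run Plackett-Burman matrix for up to seven two-level factors can be built from cyclic shifts of its generator row; each factor's effect is then estimated from the dot product of its column with the responses. The response model below is synthetic.

```python
import numpy as np

# Standard 8-run Plackett-Burman design: cyclic shifts of the generator row,
# plus a final all-minus row. Columns are balanced and mutually orthogonal.
gen = [1, 1, 1, -1, 1, -1, -1]
rows = [np.roll(gen, i) for i in range(7)] + [[-1] * 7]
design = np.array(rows)                    # 8 runs x 7 factors

# Hypothetical responses: performance depends only on factor 0 (say, CPU freq).
y = 10.0 + 2.0 * design[:, 0]

# Effect of factor j = mean(y at +1) - mean(y at -1) = (column . y) / (runs/2)
effects = design.T @ y / (design.shape[0] / 2)
print(effects)  # factor 0 has effect 4.0; the orthogonal factors come out 0
```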
System Under Design
- Sub-system consisting of 11 components, with up to 10 choices per component
- 12 mobile-CPU-centric benchmarks
Using P&B for Cost-Optimized Designs
Recapping P&B:
- P&B yields a unit-less "impact" (the influence of changing a component)
- Provides "impact" trends by changing upper bounds
Constrained systems:
- Most systems are cost constrained (area, power, or energy), so cost must be considered together with performance
- L2 cache size has a higher impact than L2 associativity, but L2 associativity might still provide the best bang for the buck
Use cost-normalized marginal impact: impact gained / cost incurred. Use McPAT [11, 12] to evaluate baseline and marginal costs.
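The cost-normalized ranking above can be sketched in a few lines; the impact and area numbers here are made up purely for illustration.

```python
# Hypothetical (impact %, marginal area cost in mm^2) per component upgrade.
candidates = {
    "L2 size 256KB->1MB":  (12.0, 4.0),
    "L2 assoc 4->8":       (5.0, 0.5),
    "CPU freq 1GHz->2GHz": (30.0, 6.0),
}
# Rank by cost-normalized marginal impact: impact gained / cost incurred.
ranked = sorted(candidates.items(),
                key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (impact, cost) in ranked:
    print(f"{name}: {impact / cost:.1f} impact per mm^2")
```

With these illustrative numbers the modest associativity bump wins the ranking even though its raw impact is the smallest, which is exactly the "bang for the buck" point made above.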
McPAT: High-Level Features
- Integrated modeling framework: power (peak, dynamic, short-circuit, and leakage), area, and critical-path timing
- Hardware validated (~20% error)
- Configurable system components: cores, NoC, clock tree, PLL, caches, memory controllers, etc.
- Technology nodes from 90nm to 22nm
- Device types: bulk CMOS, SOI, and double-gate
- Flexible XML interface
- Standalone operation or integration with a performance simulator
McPAT 1.0 Overview (from [11])
[Diagram: unspecified parameters are filled in and structures are optimized to satisfy timing; the resulting models and configurations are then used to evaluate numbers]
Framework Components
- Hierarchical power, area, and timing model: models structures at a low level but allows high-level configuration; models core details rigorously and allows multiple cores to be connected
- Optimizer for circuit-level implementations: determines unspecified parameters in the internal chip representation. The user specifies the cache size and number of banks while the optimizer specifies cache bank wordline and bitline lengths; the user can choose to specify everything themselves
- Internal chip representation: driven by user inputs and those generated by the optimizer
Hierarchical Modeling (from [11])
Power, Area and Timing Modeling
Power modeling:
- Dynamic power from load capacitance modeling, supply voltage, clock frequency, and activity factors
- Short-circuit power using published models
- Leakage using published data and existing models such as MASTAR
Timing modeling:
- Estimate the critical path; use RC delays to estimate time, similar to CACTI
Area modeling:
- Similar to CACTI for gates and regular structures; empirical modeling techniques for non-regular structures
Multicore Architectural Modeling
- Core: configurable models of fetch, execute, load-store, OOO, etc.; reservation-station-style and physical-register-file architectures; in-order, OOO, and multithreaded architectures
- NoC: signal link and router models
- Shared and private cache hierarchies
- Memory controller: front end, transaction processing, and PHY models
- Clocking: PLL and clock tree models
Circuit and Technology Level Modeling
- Wires: hierarchical repeated wires for local and global wires; short wires using pi-RC models; latches automatically inserted and modeled to satisfy clock rates
- Devices: uses ITRS 2007 roadmap data; 90nm, 65nm, 45nm, 32nm, and 22nm nodes supported; planar bulk (down to 36nm), SOI (down to 25nm), and double-gate (22nm) modeled
- Support for power-saving modes: McPAT 1.0 supports multiple sleep states, coming this summer (v0.8 current)
McPAT Operation
- Requires input from the user and the simulator: target clock rate, architectural and technology parameters, optimization function (timing or ED^2), and unit activity factors
- McPAT optimizes structures to satisfy timing; configurations not satisfying timing are discarded
- Optimization functions are applied to all timing-satisfying configurations
- Numbers are calculated using the remaining configurations plus activity factors
Downloading and Installing
- Current version 0.8 available from the HP Labs website: http://www.hpl.hp.com/research/mcpat/
- Download and build the tool ("make" works); works on Unix-compatible systems
- Command-line operation (standalone XML input); print levels provide verbose results
- Alternatively, can be built together with a simulator
Running McPAT
Standalone mode (with XML input file):
- Architectural and technology details are specified within the XML
- Find the correspondence between McPAT stats and simulator stats
- Run the performance simulation and pass counters to the XML

<component id="system.core0.icache" name="icache">
  <param name="icache_config" value="131072,32,8,1,8,3,32,0"/>
  <stat name="read_accesses" value="2000"/>
  <stat name="read_misses" value="116"/>
  <stat name="conflicts" value="9"/>
</component>
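One way to pass simulator counters into the XML is to emit the <component> entry programmatically; a sketch using the Python standard library, with the slide's example counter values:

```python
import xml.etree.ElementTree as ET

def icache_component(counters):
    """Build a McPAT-style <component> element for the i-cache from raw counters."""
    comp = ET.Element("component", id="system.core0.icache", name="icache")
    ET.SubElement(comp, "param", name="icache_config",
                  value="131072,32,8,1,8,3,32,0")
    for stat, value in counters.items():
        ET.SubElement(comp, "stat", name=stat, value=str(value))
    return comp

comp = icache_component({"read_accesses": 2000, "read_misses": 116, "conflicts": 9})
print(ET.tostring(comp, encoding="unicode"))
```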
Integrated with multiple simulators (M5, SMTSIM, Multi2SIM, etc.); the documentation gives tips on building together with a simulator.
XML Specification (Top Level)
XML Specification (Core)
XML Specification (Memory Controller)
Results (Top Level)
Results (Core)
Results (Memory Controller)
Cost Normalized Marginal Impact
- Created an XML specification for our processor system
- Modeled a mobile processor and obtained activity factors from a custom cycle-accurate simulator
- Obtained baseline power and area, then marginal costs, then cost-normalized marginal impact
Results: Cost Normalized Impact
Obtaining Cost-Optimized Designs
- Make design decisions utilizing the marginal impact and marginal cost information
Results: Cost-Optimized Designs
- Set budgets to 70%-40% of the highest-end system; the selection algorithm minimizes impact loss while reducing cost as much as possible
- Performance within 16% of peak @ 40% area; performance within 19% of peak @ nearly half the power
To Summarize
- Looked at the problem of efficient design space exploration
- Used the Plackett and Burman method to yield "impact", a measure of bottleneck, for components
- Understood the basic workings of McPAT
- Understood the use of McPAT to obtain area and power costs for various system configurations
- Used cost numbers to obtain cost-normalized impact
- Used cost-normalized impact values to make efficient design choices
References
[1] Wunderlich et al., "SMARTS: Accelerating microarchitectural simulation via rigorous statistical sampling", ISCA 2003.
[2] Sherwood et al., "Automatically characterizing large scale program behavior", ASPLOS 2002.
[3] KleinOsowski et al., "MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research", CAL 2002.
[4] Karkhanis et al., "A first-order superscalar processor model", ISCA 2004.
[5] Lee et al., "Accurate and efficient regression modeling for microarchitectural performance and power prediction", ASPLOS 2006.
[6] Systems et al., "Spacewalker: Automated design space exploration", HP Labs 2001.
[7] Axelsson et al., "Architecture synthesis and partitioning of realtime systems", CODES 1997.
[8] Palesi et al., "Multi-objective design space exploration using genetic algorithms", CODES 2002.
[9] Yi et al., "A statistically rigorous approach for improving simulation methodology", HPCA 2003.
[10] Arora et al., "Efficient system design using the SAAB methodology", SAMOS 2012.
[11] S. Li et al., "McPAT: An integrated Power, Area and Timing framework", MICRO 2009.
[12] S. Li et al., McPAT 1.0 technical report, HP Labs 2009.
Questions?
SimpleScalar Simulator (ISS) and PHiLOSoftware Framework (SystemC)
Luis Angel D. Bathen (Danny)3
Slides courtesy of: Kyoungwoo Lee1, Aviral Shrivastava2, and Nikil Dutt3
1Dept. of Computer Science, Yonsei University
2Dept. of Computer Science & Engineering, Arizona State University
3Dept. of Computer Science, University of California, Irvine
Contents
- SimpleScalar overview
- Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
- PHiLOSoftware simulator: SimpleScalar + CACTI + SystemC TLM
- Demo 2: bus protocol selection
- Demo 3: software-controlled memory virtualization
Overview
What is an architectural simulator? A tool that reproduces the behavior of a computing device.
Why use a simulator?
- Leverages a faster, more flexible software development cycle
- Permits more design space exploration
- Facilitates validation before hardware becomes available
- Level of abstraction is tailored to the design task
- Possible to increase/improve system instrumentation
- Usually less expensive than building a real system
A Taxonomy of Simulation Tools
[Figure: taxonomy of simulation tools; shaded tools are included in the SimpleScalar Tool Set]
Functional vs. Performance
Functional simulators implement the architecture: they perform real execution and implement what programmers see.
Performance simulators implement the microarchitecture: they model system resources/internals, are concerned with time, and do not implement what programmers see.
Trace- vs. Execution-Driven
Trace-driven: the simulator reads a 'trace' of the instructions captured during a previous execution. Easy to implement; no functional components necessary.
Execution-driven: the simulator runs the program (trace-on-the-fly). Difficult to implement, but with advantages:
- Faster than tracing
- No need to store traces
- Register and memory values usually are not in the trace
- Supports mis-speculation cost modeling
SimpleScalar Tool Set Overview
- Computer architecture research test bed
- Compilers, assembler, linker, libraries, and simulators
- Targeted to the virtual SimpleScalar architecture
- Hosted on most any Unix-like machine

SimpleScalar Suite

Strengths of SimpleScalar
- Highly flexible: functional simulator + performance simulator
- Portable: the virtual target runs on most Unix-like host systems, and the simulators can support multiple target ISAs
- Extensible: source is included for the compiler, libraries, and simulators; easy to write simulators
- Performance: runs codes approaching 'real' sizes
Hello World!
./sim-safe helloworld

Create a new file, hello.c, with the following code:
#include <stdio.h>
main() {
    printf("Hello World!\n");
}
Then compile it using the following command:
$ $IDIR/bin/sslittle-na-sstrix-gcc -o hello hello.c
That should generate a file hello, which we then run on the simulator:
$ $IDIR/simplesim-3.0/sim-safe hello
In the output, you should be able to find the following:
sim: ** starting functional simulation **
Hello World!
TESTS-PISA - test-math
./sim-safe test-math

A simple set of test executables lives in ./tests-pisa/bin.little/: anagram, test-fmath, test-lswlr, test-printf, test-llong, test-math.
Run test-math:
$ $IDIR/simplesim-3.0/sim-safe tests-pisa/bin.little/test-math
In the output (1)
./sim-safe test-math

sim: ** starting functional simulation **
pow(12.0, 2.0) == 144.000000
pow(10.0, 3.0) == 1000.000000
pow(10.0, -3.0) == 0.001000
str: 123.456
x: 123.000000
str: 123.456
x: 123.456000
str: 123.456
x: 123.456000
123.456 123.456000 123 1000
sinh(2.0) = 3.62686
sinh(3.0) = 10.01787
h=3.60555
atan2(3,2) = 0.98279
pow(3.60555,4.0) = 169
169 / exp(0.98279 * 5) = 1.24102
3.93117 + 5*log(3.60555) = 10.34355
cos(10.34355) = -0.6068, sin(10.34355) = -0.79486
x 0.5xx0.5 xx 0.5x
-1e-17 != -1e-17 Worked!
In the output (2)
./sim-safe test-math

sim: ** simulation statistics **
sim_num_insn              213703 # total number of instructions executed
sim_num_refs               56899 # total number of loads and stores executed
sim_elapsed_time               1 # total simulation time in seconds
sim_inst_rate        213703.0000 # simulation speed (in insts/sec)
ld_text_base          0x00400000 # program text (code) segment base
ld_text_size               91744 # program text (code) size in bytes
ld_data_base          0x10000000 # program initialized data segment base
ld_data_size               13028 # program init'ed `.data' and uninit'ed `.bss' size in bytes
ld_stack_base         0x7fffc000 # program stack segment base (highest address in stack)
ld_stack_size              16384 # program initial stack size
ld_prog_entry         0x00400140 # program entry point (initial PC)
ld_environ_base       0x7fff8000 # program environment base address
ld_target_big_endian           0 # target executable endian-ness, non-zero if big endian
mem.page_count                33 # total number of pages allocated
mem.page_mem                132k # total size of memory pages allocated
mem.ptab_misses               34 # total first level page table misses
mem.ptab_accesses        1546771 # total page table accesses
mem.ptab_miss_rate        0.0000 # first level page table miss rate
Cache Simulator
./sim-cache test-math

Run test-math with sim-cache:
$ $IDIR/simplesim-3.0/sim-cache tests-pisa/bin.little/test-math
In the output
./sim-cache test-math

sim: ** simulation statistics **
sim_num_insn          213703 # total number of instructions executed
sim_num_refs           56899 # total number of loads and stores executed
sim_elapsed_time           1 # total simulation time in seconds
sim_inst_rate    213703.0000 # simulation speed (in insts/sec)
il1.accesses          213703 # total number of accesses
il1.hits              189940 # total number of hits
il1.misses             23763 # total number of misses
il1.replacements       23507 # total number of replacements
il1.writebacks             0 # total number of writebacks
il1.invalidations          0 # total number of invalidations
il1.miss_rate         0.1112 # miss rate (i.e., misses/ref)
il1.repl_rate         0.1100 # replacement rate (i.e., repls/ref)
il1.wb_rate           0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses           57480 # total number of accesses
dl1.hits               56675 # total number of hits
dl1.misses               805 # total number of misses
dl1.replacements         549 # total number of replacements
dl1.writebacks           416 # total number of writebacks
dl1.invalidations          0 # total number of invalidations
dl1.miss_rate         0.0140 # miss rate (i.e., misses/ref)
dl1.repl_rate         0.0096 # replacement rate (i.e., repls/ref)
dl1.wb_rate           0.0072 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
...
Cache Configuration
93
./sim-cache -cache:dl1 dl1:32:32:32:f test-math
Cache configuration
<name>:<nsets>:<bsize>:<assoc>:<repl>
<name>  - name of the cache being defined
<nsets> - number of sets in the cache
<bsize> - block size of the cache
<assoc> - associativity of the cache
<repl>  - block replacement strategy: 'l' = LRU, 'f' = FIFO, 'r' = random
Example: -cache:dl1 dl1:4096:32:1:l
Run test-math with sim-cache:
$ $IDIR/simplesim-3.0/sim-cache -cache:dl1 dl1:32:32:32:f tests-pisa/bin.little/test-math
In the output
94
./sim-cache -cache:dl1 dl1:32:32:32:f test-math
sim: ** simulation statistics **
sim_num_insn         213703  # total number of instructions executed
sim_num_refs          56899  # total number of loads and stores executed
sim_elapsed_time          1  # total simulation time in seconds
sim_inst_rate   213703.0000  # simulation speed (in insts/sec)
il1.accesses         213703  # total number of accesses
il1.hits             189940  # total number of hits
il1.misses            23763  # total number of misses
il1.replacements      23507  # total number of replacements
il1.writebacks            0  # total number of writebacks
il1.invalidations         0  # total number of invalidations
il1.miss_rate        0.1112  # miss rate (i.e., misses/ref)
il1.repl_rate        0.1100  # replacement rate (i.e., repls/ref)
il1.wb_rate          0.0000  # writeback rate (i.e., wrbks/ref)
il1.inv_rate         0.0000  # invalidation rate (i.e., invs/ref)
dl1.accesses          57480  # total number of accesses
dl1.hits              56938  # total number of hits
dl1.misses              542  # total number of misses
dl1.replacements          0  # total number of replacements
dl1.writebacks            0  # total number of writebacks
dl1.invalidations         0  # total number of invalidations
dl1.miss_rate        0.0094  # miss rate (i.e., misses/ref)
dl1.repl_rate        0.0000  # replacement rate (i.e., repls/ref)
dl1.wb_rate          0.0000  # writeback rate (i.e., wrbks/ref)
dl1.inv_rate         0.0000  # invalidation rate (i.e., invs/ref)
…
Difference
95
Different Cache Configurations
Different configurations:
< -cache:dl1 dl1:32:32:32:f  # l1 data cache config, i.e., {<config>|none}
> -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}

dl1 output differences:
< dl1.hits          56938  # total number of hits
< dl1.misses          542  # total number of misses
< dl1.replacements      0  # total number of replacements
< dl1.writebacks        0  # total number of writebacks
> dl1.hits          56675  # total number of hits
> dl1.misses          805  # total number of misses
> dl1.replacements    549  # total number of replacements
> dl1.writebacks      416  # total number of writebacks
Performance Simulation
96
./sim-outorder test-math
sim-outorder
  Performance simulation
  Out-of-order issue
Run test-math with sim-outorder:
$ $IDIR/simplesim-3.0/sim-outorder tests-pisa/bin.little/test-math
In the output
97
./sim-outorder test-math
sim: ** simulation statistics **
sim_num_insn         213703  # total number of instructions committed
sim_num_refs          56899  # total number of loads and stores committed
sim_num_loads         34105  # total number of loads committed
sim_num_stores   22794.0000  # total number of stores committed
sim_num_branches      38594  # total number of branches committed
sim_elapsed_time          1  # total simulation time in seconds
sim_inst_rate   213703.0000  # simulation speed (in insts/sec)
sim_total_insn       233029  # total number of instructions executed
sim_total_refs        61927  # total number of loads and stores executed
sim_total_loads       37545  # total number of loads executed
sim_total_stores 24382.0000  # total number of stores executed
sim_total_branches    42770  # total number of branches executed
sim_cycle            224302  # total simulation time in cycles
sim_IPC              0.9527  # instructions per cycle
sim_CPI              1.0496  # cycles per instruction
sim_exec_BW          1.0389  # total instructions (mis-spec + committed) per cycle
sim_IPB              5.5372  # instructions per branch
IFQ_count            352201  # cumulative IFQ occupancy
IFQ_fcount            74028  # cumulative IFQ full count
ifq_occupancy        1.5702  # avg IFQ occupancy (insn's)
ifq_rate             1.0389  # avg IFQ dispatch rate (insn/cycle)
ifq_latency          1.5114  # avg IFQ occupant latency (cycle's)
ifq_full             0.3300  # fraction of time (cycle's) IFQ was full
RUU_count           1440457  # cumulative RUU occupancy
RUU_fcount            45203  # cumulative RUU full count
ruu_occupancy        6.4220  # avg RUU occupancy (insn's)
ruu_rate             1.0389  # avg RUU dispatch rate (insn/cycle)
ruu_latency          6.1814  # avg RUU occupant latency (cycle's)
ruu_full             0.2015  # fraction of time (cycle's) RUU was full
…
Contents
98
SimpleScalar Overview
  Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
PHiLOSoftware Simulator: SimpleScalar + CACTI + SystemC TLM
  Demo 2: Bus Protocol Selection
  Demo 3: Software-Controlled Memory Virtualization
A Taxonomy of Simulation Tools
99
Software-Controlled Memories
Two types of memory subsystems:
  Hardware-controlled caches
  Software-controlled SRAMs: ScratchPad Memories (SPMs), Tightly Coupled Memories (ARM), Local Stores (Cell), streaming memories, software caches
SPMs are preferred over caches because they:
  Have a smaller area footprint
  Consume less power
  Are more predictable
  Allow explicit control of their data
SPMs lack hardware support:
  Compiler and programmer need to manage them explicitly
  Accessed through physical addresses
  Difficult to capture irregularity at compile time
What about sharing their address space?
Figure: ARM Cortex-M3 address map
New memory subsystems are being deployed with distributed on-chip software-controlled memories
[Leverich et al., ISCA ‘07]
04/19/23 100
Goal: abstract the physical characteristics of the device
Challenges:
  Software-controlled memories are explicitly accessed; programmers assume full access to their physical space
  Sharing issues and security risks (open environments)
  Different physical characteristics:
    Due to voltage scaling: increased latencies, higher error rates
    Due to the interconnection network: different access latencies (local vs. remote)
    Due to technology: process variations, SRAM vs. NVM characteristics
Virtualization as a viable solution
  Traditional virtualization makes no distinction between address spaces
  The two problems are mutually dependent
Can we exploit virtualization to minimize programmer burden while opportunistically exploiting the variation in physical characteristics of the device?
101
Figure: PHiLOSoftware's Heterogeneous Virtual Address Space. CPUs with voltage-scaled and nominal-voltage SPMs and NVMs connect through an interconnection network to a memory controller driving low-power and high-power DIMMs; on-chip space can be allocated locally or remotely, with SRAM preferred over NVM over DRAM. The resulting virtual address spaces:
(1) Voltage-scaled: low power / mid latency
(2) Voltage-scaled: fault-tolerant
(3) Nominal-voltage: high power / low latency
(4) Nominal-voltage: low priority
(5) Nominal-voltage: higher power / latency
(6) NVM: high write power / latency
(7) NVM: higher write power / latency
(8) Low-power DRAM
(9) High-power DRAM
Propose the idea of virtual address spaces with different characteristics
102
PHiLOSoftware Framework
Figure: the framework spans three layers.
Application Layer (static): applications and services carry software annotations -- @LowPower(arr,a,64), @Reliable(arr,b,64), @Secure(arr,c,128) -- processed by the compiler (static analysis); third-party apps arrive from an application market via over-the-air updates, guided by a user profile manager.
Run-Time Layer (dynamic): the run-time system (OS/hypervisor) hosts the application OS, services OS, private OS, and third-party OS.
Platform Configuration (variable): the platform type (CMP, NoC, MPSoC) and memory technology (SRAM, NVMs, DRAM) -- CPUs, SPMs, emerging NVMs, DRAM, HDD, and a DMA engine spanning voltage-scaled low-power, medium-power, and high-power domains -- drive the allocation policies.
103
PHiLOSoftware Simulator
Figure: the simulator couples a simulated RTOS environment and a simulated virtualized environment (apps A1-A8 on GuestOS1-GuestOS4 over a hypervisor), running embedded benchmarks (SHA, AES, BLOWFISH, H263, MOTION, JPEG, GSM, ADPCM) on a nominal-voltage CMP with four CPUs, four SPMs, a manager, S-DMA, and off-chip memories spanning low-power voltage-scaled, medium-power, and high-power domains.
Inputs: application policies and annotations (@LowPower(arr,a,64), @Reliable(arr,b,64), @Secure(arr,c,128)), SimpleScalar traces, memory technology models (CACTI/NVSim), a platform DB, and fault & variability models feed the SystemC TLM power/performance models.
while(...) {
  ...
  if (!p.is_finished() && p.has_more()) {
    inst = p.get_next_inst();
#ifdef VERBOSE
    printf("%g %s : EXECUTE_INST <%s, 0x%lx, 0x%lx>\n",
           sc_simulation_time(), name(), inst.op.c_str(), inst.pc, inst.address);
#endif
#ifdef LOAD_INSTRUCTIONS
    entry_p = p.v_lut.lookup(inst.pc);
    if (entry_p != NULL) {
      if (entry_p->in_spm) {
        spm_ax_inc(p.id);
#ifdef VSPM_ENABLED
        read(&off_bus_port, entry_p->pa, packet);
#else
        read(&on_bus_port, entry_p->pa, packet);
#endif
      }
#ifdef RISC_CYCLES_EN
      wait(RISC_CYCLES(S_ISA_TO_INT_ISA(inst.op)) + inst.cycles, SC_NS);
#else
      wait(1, SC_NS);
#endif
...
// create cpus
gen_0 = new generator("gen_0", MIN_PRIO+1, NORMAL, 2, 0x00);  // priority, HW ID
gen_1 = new generator("gen_1", MIN_PRIO+2, NORMAL, 2, 0x01);
spmvisor_0 = new spmvisor("spmvisor_0", SPMVISORBASE,
                          SPMVISORBASE + SPMVISORSPACESIZE - 1,  // physical start/end address
                          1, 1, MIN_PRIO+0, NORMAL, 2, 0x04);
// mem
spm_0 = new spm("spm_0", SPMBASE + 0*SPM_SIZE, SPMBASE + 1*SPM_SIZE - 1, 1, 1);
spm_1 = new spm("spm_1", SPMBASE + 1*SPM_SIZE, SPMBASE + 2*SPM_SIZE - 1, 1, 1);
// bus
bus_0 = new channel_amba2("bus_amba2_0", false);
bus_arbiter_0 = new arbiter("arbiter_amba2_0", STATIC, false);  // arbitration policy
bus_1 = new channel_amba2("bus_amba2_1", false);
bus_arbiter_1 = new arbiter("arbiter_amba2_1", STATIC, false);

Sample architecture configuration for a 2-core CMP with SPMs; the callouts in the slide highlight each module's priority, HW ID, physical start/end addresses, and the bus arbitration policy.

// connect modules to the bus (module connectivity)
bus_0->arbiter_port(*bus_arbiter_0);
gen_0->off_bus_port(*bus_0);  gen_0->on_bus_port(*bus_1);
gen_1->off_bus_port(*bus_0);  gen_1->on_bus_port(*bus_1);
spmvisor_0->bus_port_req(*bus_0);  spmvisor_0->bus_port_spm(*bus_1);
bus_0->slave_port(*spmvisor_0);  bus_0->slave_port(*ram);
bus_1->slave_port(*spm_0);  bus_1->slave_port(*spm_1);
Contents
105
SimpleScalar Overview
  Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
PHiLOSoftware Simulator: SimpleScalar + CACTI + SystemC TLM
  Demo 2: Bus Protocol Selection
  Demo 3: Software-Controlled Memory Virtualization
PHiLOSoftware DEMO1: Bus Protocol Selection
Figure: four CPUs, each with an 8KB I$ and 8KB D$, modeled by SimpleScalar and attached to an AMBA AHB TLM bus model (SystemC TLM, timed with CACTI).
SimpleScalar trace (access type / address / number of bytes):
lw    r16,0(r29)      0x400140   0x400140 _YY_R_ 0x7fff8000 4 0
lui   r28,0x1001      0x400148
addiu r28,r28,-26160  0x400150
addiu r17,r29,4       0x400158
addiu r3,r17,4        0x400160
sll   r2,r16,2        0x400168
addu  r3,r3,r2        0x400170
addu  r18,r0,r3       0x400178
sw    r18,-32412(r28) 0x400180   0x400180 _YY_W_ 0x10001b34 4 0
addiu r29,r29,-24     0x400188
addu  r4,r0,r16       0x400190
addu  r5,r0,r17       0x400198
addu  r6,r0,r18       0x4001a0
jal   0x403800        0x4001a8
addiu r29,r29,-24     0x403800
DEMO1: Running
Usage: testbench.x <arb protocol> <cycles to simulate>
./cmp_2cpu_cache.x STATIC 1000000
<arb protocol> can be STATIC, RANDOM, ROUNDROBIN, TDMA, or TDMA_RR
<cycles to simulate>: from 10,000 upward; use -1 for a full application run
Sample run output:
501858 ram : read 0x1046370347 at address 0xc14
501940 ram : write 0x364210008 at address 0xffdfe20
501946 simple_cpu_0 : EXECUTE_INST <lw, 0x4005c8, 0x2147450764>
501951 simple_cpu_0 : EXECUTE_INST <addiu, 0x0, 0x11032>
501951 simple_cpu_0 : EXECUTE_INST <sw, 0x4005d8, 0x2147450764>
502022 ram : read 0x628966950 at address 0xc18
502114 simple_cpu_0 : EXECUTE_INST <sw, 0x4005e8, 0x268445784>
DEMO1: Output Analysis
Effect of the arbitration protocol:
  STATIC is blocking (e.g., CPU 0 has highest priority) and may starve the other CPUs
    Run 10,000 cycles with the STATIC protocol to see an example
  Each arbitration protocol behaves differently, which can be observed in the number of arbitration wait cycles at each cache interface
./cmp_2cpu_cache.x STATIC 1000000
  cache_0: Arbitration wait cycles (Reads)  = 0
  cache_0: Arbitration wait cycles (Writes) = 0
  cache_1: Arbitration wait cycles (Reads)  = 4581
  cache_1: Arbitration wait cycles (Writes) = 391
./cmp_2cpu_cache.x TDMA 1000000
  cache_0: Arbitration wait cycles (Reads)  = 103313
  cache_0: Arbitration wait cycles (Writes) = 128494
  cache_1: Arbitration wait cycles (Reads)  = 148688
  cache_1: Arbitration wait cycles (Writes) = 28329
DEMO1: Output Analysis
TDMA
  Fair, but increases total execution time because of fixed time slots
  FULL SIM: 8.44379e+06 cycles
STATIC
  Might cause starvation; processes with high priority finish faster (e.g., CPU0 has highest priority)
  FULL SIM: 5.93757e+06 cycles
  Reversed priorities (CPU1 > CPU0): 6.0784e+06 cycles
    CPU0 has the longest task to execute, so it suffers performance degradation when CPU1 has higher priority
DEMO1: 8-Core CMP
./cmp_8cpu_cache.x STATIC 1000000
  1 million cycles, STATIC arbitration protocol
./cmp_8cpu_cache.x RANDOM 1000000
  Same as above, with RANDOM arbitration
STATIC (transaction starvation; only caches 0-2 shown):
ram: |cache_0| Reads: 2240 Writes: 4936
ram: |cache_1| Reads: 3072 Writes: 1518
ram: |cache_2| Reads:  475 Writes:   19
RANDOM (greater fairness):
ram: |cache_0| Reads: 1150 Writes: 416
ram: |cache_1| Reads: 1445 Writes: 198
ram: |cache_2| Reads: 1088 Writes: 492
ram: |cache_3| Reads:  768 Writes: 679
ram: |cache_4| Reads:  576 Writes: 880
ram: |cache_5| Reads:  576 Writes: 901
ram: |cache_6| Reads:  768 Writes: 784
ram: |cache_7| Reads: 1088 Writes: 470
DEMO1: 8 Core CMP – Protocol Comparison
Protocol     Reads  Writes  Total  Avg Transactions  Avg Arb. Wait Cycles  $ Read STDev  $ Write STDev
STATIC        5787    6473  12260       766.25              9802.5         1145.329531   1636.738941
RANDOM        7459    4820  12279       767.4375           10051.125        288.6667358   232.6671442
ROUNDROBIN    7076    4460  11536       721                68994.75         238.1438641   238.1438641
TDMA          7076    4460  11536       721                68994.75         238.1438641   238.1438641

- Almost the same number of accesses for all protocols
- A higher STDev means higher variation in the number of accesses across caches, e.g., STATIC vs. RANDOM, RR, TDMA
Contents
112
SimpleScalar Overview
  Demo 1: a simple simulation (w/ 3.1 version of SimpleScalar)
PHiLOSoftware Simulator: SimpleScalar + CACTI + SystemC TLM
  Demo 2: Bus Protocol Selection
  Demo 3: Software-Controlled Memory Virtualization
Shared SPMs in Heterogeneous Multi-tasking Environments (Open Environment)
Problem: fixed allocation policies
  Enforce pre-defined policies for known single/multiple apps
  When no SPM space is available, all data is mapped off-chip
  What about data and task criticality/priority?
Need dynamic and selective enforcement of allocation policies
  Reduced power consumption / better performance
Figure: two CPUs with SPM0/SPM1 and main memory (MM) run App1-App4 over an RTOS, connected by an AMBA AHB bus with crypto and DMA engines; arrows mark MM-SPM transfers, and tasks carry high, medium, or low priority. App5, a newly downloaded application the user wants to launch, cannot run because the fixed policy has already committed the SPM space.
Fixed policies do not work well in open environments where the number of applications running concurrently is unknown!
Virtual ScratchPad Memories (vSPMs)
Figure: the physical address map splits into SPM space (0 to 4K-1), Protected Evict Memory (PEM) space (4K to 8K-1), and MM space (up to nGB-1). Applications A1 and A2 compete for the same physical SPM, but each sees its own dedicated vSPM (vSPM1, vSPM2) built from 1K blocks.
Block-based priority-driven allocation policies
Priority-based Dynamic Memory Allocation
  May have data- and application-based priorities
  Create vSPM(s) prior to running applications
  Selective eviction: no need to evict data from SPMs on every context switch
Figure: CPU0/CPU1 with SPM0/SPM1, a PEM region, and data blocks S1-S6 at priorities P1-P3. Under normal vSPM allocation, low-priority data is mapped to the PEM space. When a new (trusted) application must launch, selective, data-priority-driven eviction moves only the low-priority blocks to PEM.
  ALL data is protected through vSPMs
  Supports priority-based selective allocation (data and application driven)
  Minimizes the overhead of creating trusted environments (data protected through the SPMVisor)
DEMO3: Setup
  Total of 4 active cores, with 8KB of SPM space per core (32KB on-chip total)
  SPMVisor: 1 vSPM simulated per app, for a total of 8 x 8KB = 64KB of on-chip virtualized space
  Simulated virtualized environment generates the input for the SystemC model: annotated traces with per-application context-switch information
Figure: the SPMVisor mediates between four CPUs with SPMs and the MM, crypto, and S-DMA modules; the simulated virtualized environment runs apps A1-A8 on GuestOS1-GuestOS4 over a hypervisor.
name  time_slot  cx_cost  mem_sz(MB)  n_app  {program name, ...}
os_a  10000      4000     128         2      adpcm/mem.trace aes/mem.trace
os_b  10000      4000     128         2      blowfish/mem.trace gsm/mem.trace
os_c  10000      4000     128         2      h263/mem.trace jpeg/mem.trace
os_d  10000      4000     128         2      motion/mem.trace sha/mem.trace
hyp   20000      6000     128

name  entries  policy (supported: fully assoc (FIFO) = 1; not yet supported: 2-way set assoc = 2, 4-way set assoc = 4)
dtlb  12       1
DEMO3: Hypervisor - CX vs. vSPMs
Hypervisor CX: on a context switch, evict data from the SPMs and load data for the new tasks onto the SPMs; protects the integrity of SPM data
Hypervisor w/ vSPMs: no need to evict data from the SPMs; each application has a dedicated virtual space; at run time, only the SPM allocation tables are loaded

Per-object lifetime and access profile (from the annotated traces):
Object Name      T_Start   T_End   Lifetime  Addr_Start  Addr_End    # of Accesses
Buffer_0x7f<7>:        1      87         86  2147450879  2147454974              7
Buffer_0x7f<8>:       16  233608     233592  2147446783  2147450878          84825
Buffer_0x10<0>:     1295  232119     230824   268435456   268439551           3822
Buffer_0x10<1>:       89  233598     233509   268439552   268443647            370
Buffer_0x10<2>:        9  233169     233160   268443648   268447743          25088
Buffer_0x10<3>:   226971  233166       6195   268447744   268451839           1058
Buffer_0x10<4>:   228507  230038       1531   268451840   268455935           1024
Buffer_0x10<5>:   230043  231574       1531   268455936   268460031           1024
Buffer_0x10<6>:   231816  233043       1227   268460032   268464127             17
DEMO3: Hypervisor - CX vs. vSPMs (Cont.)
Comparison of CX and vSPMs
E1: spmvisor_e1.x - 2 applications, 1 OS, hypervisor, 1 core, 1 SPM, 2 vSPMs
    rtos_e1.x     - 2 applications, 1 OS, hypervisor, 1 core, 1 SPM

                   Traditional (CX)  SPMVisor     Improvement
Execution Time     2.89E+07          1.46E+07     49% lower execution time
Total Energy (nJ)  677486.6228       52877.27006  92% energy savings
DEMO3: Hypervisor - CX vs. vSPMs (Cont.)
Comparison of CX and vSPMs
E4: spmvisor_e4.x - 8 applications, 4 OSes, hypervisor, 4 cores, 4 SPMs, 8 vSPMs
    rtos_e4.x     - 8 applications, 4 OSes, hypervisor, 4 cores, 4 SPMs

                   Traditional (CX)  SPMVisor     Improvement
Execution Time     6.25E+07          2.18E+07     65% lower execution time
Total Energy (nJ)  2475745.172       465330.5984  81% energy savings
- The number of data evictions/loads due to context switching hurts both performance and energy
Questions?
120
Contact Information
Houman Homayoun
Email: [email protected]
http://cseweb.ucsd.edu/~hhomayoun/

Manish Arora
Email: [email protected]
http://cseweb.ucsd.edu/~marora/

Luis Angel Bathen
Email: [email protected]/~lbathen/
121
Architecture Tools Publicly Available for Download
HotSpot          http://lava.cs.virginia.edu/HotSpot/
DARSIM (Hornet)  http://csg.csail.mit.edu/hornet/
NVSIM            http://www.rioshering.com/nvsimwiki/index.php?title=Main_Page
McPAT            http://www.hpl.hp.com/research/mcpat/
CACTI            http://www.hpl.hp.com/research/cacti/
gem5 (M5)        http://www.m5sim.org/Main_Page
SimpleScalar     http://www.simplescalar.com/
GPGPU-SIM        http://www.gpgpu-sim.org/
VARIUS           http://iacoma.cs.uiuc.edu/varius/
122