

Applying Control Theory to the Caches of Multiprocessors

Department of EECS, University of Tennessee, Knoxville

Kai Ma

2

Applying Control Theory to the Caches of Multiprocessors

The shared L2 cache is one of the most important on-chip shared resources:

• Largest area and leakage power consumer
• One of the dominant factors in overall performance

Two papers:

• Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors
• SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors

Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors

Department of EECS, University of Tennessee, Knoxville

Xiaorui Wang, Kai Ma, Yefu Wang

4

Background

NUCA (Non-Uniform Cache Architecture)

Key idea: different cache banks have different access latencies.

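To make the key idea concrete, here is a minimal sketch of a NUCA-style latency model, assuming a 4x4 bank grid and a hop-count cost; the grid size, base latency, and per-hop cost are illustrative numbers, not parameters from the paper.

```python
# Illustrative NUCA latency model: a bank's access latency grows with its
# on-chip distance from the requesting core. All numbers are made up.

def bank_latency(core_xy, bank_xy, base_cycles=6, cycles_per_hop=2):
    """Access latency (in cycles) of one cache bank as seen from one core."""
    hops = abs(core_xy[0] - bank_xy[0]) + abs(core_xy[1] - bank_xy[1])
    return base_cycles + cycles_per_hop * hops

# Latency map of a 4x4 bank grid as seen from the core at position (0, 0):
core = (0, 0)
for y in range(4):
    print([bank_latency(core, (x, y)) for x in range(4)])
```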

5

Introduction

• The power of the cache needs to be constrained.
• With power under control, the performance of the caches also needs to be guaranteed.
• Why control relative latency, the ratio between the average cache access latencies of two threads (e.g., average latencies of 18 and 12 cycles give a relative latency of 1.5)?
  1. Accelerate critical threads
  2. Reduce priority inversion

6

System Design

(Block diagram: Threads 0-3 run on cores 0-3 and share the L2 cache. Each thread has a Latency Monitor feeding a Relative Latency Controller, forming the relative latency control loops; a Power Monitor feeds the Power Controller, forming the power control loop. Both loops drive the Cache Resizing and Partitioning Modulator, which assigns cache banks to the threads and leaves the remaining banks inactive.)

7

Relative Latency Controller (RLC)

• PI (Proportional-Integral) controller: system modeling, controller design, control analysis
• Handles workload variation and total cache size variation

(Feedback loop example: relative latency set point 1.5, measured relative latency 1.2 from the shared L2 caches, error 0.3; the RLC increases the new cache ratio by 0.2.)

8

Relative Latency Model

$$l_i(k) = \sum_{j=1}^{n_1} a_j \, l_i(k-j) + \sum_{j=1}^{n_2} b_j \, c_i(k-j)$$

• $l_i(k)$ is the relative latency between the $i$th and $(i+1)$th cores; $c_i$ is the cache size ratio between the $i$th and $(i+1)$th cores.
• The model orders $n_1, n_2$ and the parameters $a_i, b_i$ are obtained by system identification (a sketch follows after the table).

Model orders and error:

              n1 = 0    n1 = 1    n1 = 2
    n2 = 1     0.25      0.17      0.17
    n2 = 2     0.22      0.17      0.17
    n2 = 3     0.18      0.15      0.15
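A minimal system-identification sketch for the difference-equation model above, assuming orders n1 = n2 = 2 and a synthetic trace; the numpy least-squares fit and all numbers are illustrative, not the paper's identification procedure or parameter values.

```python
import numpy as np

# Fit l(k) = a1*l(k-1) + a2*l(k-2) + b1*c(k-1) + b2*c(k-2)
# to measured traces of relative latency l and cache size ratio c.
def identify_rl_model(l, c, n1=2, n2=2):
    rows, targets = [], []
    start = max(n1, n2)
    for k in range(start, len(l)):
        row = [l[k - j] for j in range(1, n1 + 1)] + \
              [c[k - j] for j in range(1, n2 + 1)]
        rows.append(row)
        targets.append(l[k])
    params, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return params[:n1], params[n1:]      # (a coefficients, b coefficients)

# Synthetic example traces (illustrative only, not measured data):
rng = np.random.default_rng(0)
c = 1.0 + 0.2 * rng.standard_normal(200)          # cache size ratio input
l = np.zeros(200)
for k in range(2, 200):
    l[k] = 0.6 * l[k-1] - 0.1 * l[k-2] + 0.4 * c[k-1] + 0.1 * c[k-2] \
           + 0.01 * rng.standard_normal()
a, b = identify_rl_model(l, c)
print("a:", np.round(a, 3), "b:", np.round(b, 3))
```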

9

Controller Design

• PI (Proportional-Integral) controller
• Design method: Root Locus

(Block diagram: the error $e(k)$ between the relative latency set point and the measured relative latency feeds the proportional term $K_P e(k)$ and the integral term $K_I \sum e(k)$; the controller outputs the new cache ratio applied to the shared L2 caches.)

Incremental form of the controller:

$$c_i(k) = c_i(k-1) + K_1 e(k) + K_2 e(k-1)$$
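A minimal sketch of the RLC as a discrete-time incremental PI controller, assuming the update form reconstructed above; the gains, initial ratio, and clamping bounds are illustrative placeholders, not the values obtained from the root-locus design.

```python
# Incremental (velocity-form) PI controller for the cache size ratio.
class RelativeLatencyController:
    def __init__(self, set_point, k1=0.5, k2=-0.3):
        self.set_point = set_point
        self.k1, self.k2 = k1, k2
        self.cache_ratio = 1.0        # current cache size ratio c_i
        self.prev_error = 0.0

    def update(self, measured_relative_latency):
        error = self.set_point - measured_relative_latency
        self.cache_ratio += self.k1 * error + self.k2 * self.prev_error
        self.prev_error = error
        # Keep the actuator output in a feasible range (illustrative bounds).
        self.cache_ratio = min(max(self.cache_ratio, 0.1), 4.0)
        return self.cache_ratio

rlc = RelativeLatencyController(set_point=1.5)
# Measured relative latency 1.2, so error is 0.3; with these illustrative
# gains the cache ratio is nudged upward (compare the RLC example above).
print(rlc.update(measured_relative_latency=1.2))
```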

10

Control Analysis

• Derive the transfer function of the controller.
• Derive the transfer function of the system, allowing for model variations: $l_i(k) = a_1' \, l_i(k-1) + b_1 \, c_i(k-1)$
• Derive the transfer function of the closed-loop system and compute its poles.
• Stability range: $0.69 \le a_1' \le 1.18$
• The control period of the power control loop is selected to be longer than the settling time of the relative latency control loop.
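A minimal sketch of the closed-loop pole check, assuming the first-order plant above and the incremental PI law from the previous slide; closing the loop gives a second-order characteristic polynomial, and stability requires both roots inside the unit circle. The gains and the b1 value below are illustrative, so the boundary this sketch finds will not match the slide's stated range.

```python
import numpy as np

def closed_loop_poles(a1, b1, k1, k2):
    """Poles of the loop: plant l(k) = a1*l(k-1) + b1*c(k-1),
    controller c(k) = c(k-1) + k1*e(k) + k2*e(k-1), e = set point - l.
    Characteristic polynomial: z^2 + (b1*k1 - 1 - a1)*z + (a1 + b1*k2) = 0.
    """
    return np.roots([1.0, b1 * k1 - 1.0 - a1, a1 + b1 * k2])

def is_stable(a1, b1, k1, k2):
    return bool(np.all(np.abs(closed_loop_poles(a1, b1, k1, k2)) < 1.0))

# Sweep the (varied) plant pole a1' to see where a fixed controller stays
# stable; b1 and the gains are illustrative values, not the paper's.
for a1 in (0.5, 0.69, 0.9, 1.18, 1.4):
    print(a1, is_stable(a1, b1=0.4, k1=0.5, k2=-0.3))
```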

11

Power Controller

System model: leakage power is proportional to the cache size and accounts for the largest portion of cache power, so

$$p(k) = c \cdot s(k) + d$$

where $s(k)$ is the total cache size in the $k$th power control period, $p(k)$ is the cache power in the $k$th power control period, and $c, d$ are parameters that depend on the application.

• PI controller
• Controller analysis: $c' > 0$ and $c' \le 0.76$
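A minimal sketch of the power control loop, assuming the linear, leakage-dominated power model p(k) = c*s(k) + d above and an incremental PI law that resizes the number of active cache banks; the budget, model parameters, and gains are illustrative.

```python
# Power controller: keeps measured cache power near a budget by resizing
# the total active cache (number of active banks). All numbers are made up.
class PowerController:
    def __init__(self, power_budget_w, k1=2.0, k2=-1.0,
                 total_banks=16, min_banks=4):
        self.budget = power_budget_w
        self.k1, self.k2 = k1, k2
        self.total_banks, self.min_banks = total_banks, min_banks
        self.active_banks = total_banks
        self.prev_error = 0.0

    def update(self, measured_power_w):
        error = self.budget - measured_power_w       # positive => headroom
        delta = self.k1 * error + self.k2 * self.prev_error
        self.prev_error = error
        self.active_banks = int(round(
            min(max(self.active_banks + delta, self.min_banks),
                self.total_banks)))
        return self.active_banks

def cache_power(active_banks, c=0.5, d=2.0):
    """Illustrative plant: leakage-dominated power, c watts per bank, offset d."""
    return c * active_banks + d

pc = PowerController(power_budget_w=8.0)
for _ in range(6):
    banks = pc.update(cache_power(pc.active_banks))
    print(banks, round(cache_power(banks), 2))
```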

12

Simulation

• Simulator: SimpleScalar with a NUCA cache (Alpha 21264-like core)
• Power readings: dynamic part from Wattch (with CACTI), leakage part from HotLeakage
• Workload: selected workloads from SPEC2000
• Actuator: cache bank resizing and partitioning

(Figure: four 4x4 layouts of the 16 L2 cache banks, showing different cache bank resizing and partitioning configurations.)

13

Single Control Evaluation

(Plots: responses of the control loops to an RLC set point change, a power controller set point change, a workload switch, and a change in the total cache bank count; the workload is switched partway through each run.)

14

Relative Latency & IPC

15

Coordination

Cache access latencies and IPC values of the four threads on the four cores of the CMP.

Cache access latencies and IPC values of the two threads on Core 0 and Core 1 for different benchmarks.

16

Conclusions: Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors

• Simultaneously control power and relative latency
• Achieve the desired performance differentiation
• Theoretically analyze the stability of each single control loop and of the coordinated system

SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors

Shekhar Srikantaiah, Mahmut Kandemir, *Qian Wang

Department of CSE

*Department of MNE

The Pennsylvania State University

18

Introduction

• Lack of control over shared on-chip resources:
  - Degraded performance isolation
  - No Quality of Service (QoS) guarantee
• It is challenging to achieve high utilization while guaranteeing QoS:
  - Static or dynamic resource reservations may lead to low resource utilization.
  - Existing heuristic adjustments cannot provide theoretical guarantees such as a settling time or a stability range.

19

Contribution

• Two-layer, control theory based SHARP (SHAred Resource Partitioning) architecture
• Propose an empirical model
• Design a customized application controller (Reinforced Oscillation Resistant, ROR, controller)
• Study two policies that can be used in SHARP:
  - SD (Service Differentiation)
  - FSI (Fair Speedup Improvement), where the fair speedup FS (sketched below) is

$$FS = N \Big/ \sum_{i=1}^{N} \frac{IPC_{base}(app_i)}{IPC_{scheme}(app_i)}$$
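A minimal sketch of the fair speedup metric as reconstructed above, i.e. the harmonic mean of per-application IPC speedups over a baseline; the IPC numbers are made up for illustration, not measured results.

```python
def fair_speedup(ipc_scheme, ipc_base):
    """FS = N / sum_i( IPC_base(app_i) / IPC_scheme(app_i) ):
    the harmonic mean of per-application speedups over the baseline."""
    assert len(ipc_scheme) == len(ipc_base)
    return len(ipc_scheme) / sum(b / s for s, b in zip(ipc_scheme, ipc_base))

# Illustrative IPCs for 4 co-running applications (not measured data):
ipc_base   = [0.8, 1.2, 0.5, 1.0]   # e.g., an equal static partitioning
ipc_scheme = [0.9, 1.3, 0.7, 1.0]   # e.g., under SHARP with the FSI policy
print(round(fair_speedup(ipc_scheme, ipc_base), 3))
```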

20

System Design

21

Why not PID?

Disadvantages of a PID (Proportional-Integral-Derivative) controller here:
• Painstaking to tune the parameters
• Hard to integrate with a hierarchical architecture
• Sensitive to model variation at run time
• Static parameters
• Generic controller (not problem-specific)
• Based on a linear model

22

Application Controller

23

Pre-Actuation Negotiator (PAN)

Map an overly demanded cache partition to a feasible partition (a sketch follows below).

Policies:
• SD (Service Differentiation)
• FSI (Fair Speedup Improvement)

$$w_i^* = \left\lfloor w_i \left(1 - \frac{spill}{\sum_{i=0}^{N} w_i}\right) \right\rfloor, \qquad spill = \sum_{i=0}^{N} w_i - W$$
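A minimal sketch of the pre-actuation negotiation as reconstructed above: when the requested ways exceed the W available ways, each request is scaled down by the spill and floored to an integer way count. The leftover-way redistribution rule is an assumption added to keep the example self-contained, not necessarily the paper's exact policy.

```python
def negotiate_partition(demands, total_ways):
    """Map an overly demanded way allocation to a feasible one.

    demands    : requested way counts per application (may sum to > total_ways)
    total_ways : W, the number of ways in the shared cache
    """
    requested = sum(demands)
    if requested <= total_ways:
        return list(demands)                      # already feasible
    spill = requested - total_ways
    scaled = [int(w * (1 - spill / requested)) for w in demands]   # floor
    # Flooring can leave a few ways unassigned; hand them out one by one
    # (largest original demand first); this tie-breaking rule is an assumption.
    leftover = total_ways - sum(scaled)
    order = sorted(range(len(demands)), key=lambda i: demands[i], reverse=True)
    for i in order[:leftover]:
        scaled[i] += 1
    return scaled

print(negotiate_partition([10, 8, 6, 4], total_ways=16))
```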

24

SHARP Controller

• Increase the IPC set points when cache ways are underutilized
• Works with both the FSI and SD policies
• Proof of guaranteed optimal utilization

(Set-point update rule relating the references $P^{ref}_i(t+1)$ and $P^{ref}_j(t)$, the measured IPCs $P^{out}_j(t)$, the allocated ways $w_i^*(t)$, and the total way count $W$.)

25

Experimental Setup

• Simulator: Simics (full-system simulator)
• Operating system: Solaris 10
• Configurations: 2 and 8 cores
• Workload: 6 mixes of applications selected from SPEC2000

26

Evaluation (Application Controller)

Long-run results of the PID controller and the ROR controller

27

Evaluation (FSI)

SHARP vs Baselines

28

Evaluation (SD)

Adaptation of IPC with the SD policy using the ROR controllers.

29

Sensitivity & Scalability

Sensitivity analysis for different reference points

Scalability (8 cores)

30

Conclusion: SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors

• Propose and design the SHARP control architecture for shared L2 caches
• Validate SHARP with different management policies (FSI or SD)
• Achieve the desired FS and SD specifications

31

Critiques (1)

• How should the relative latency set point be decided?

• For the purpose of accelerating critical threads, parallel workloads may be more applicable than the SPEC2000 mixes used in the evaluation.

32

Critiques (2)

• No stability proof

• Insufficient description of how the parameters of the application controllers are updated

33

Comparison: relative latency control with a power constraint vs. the SHARP control architecture

• Goal: guarantee the NUCA L2 cache relative latency under different power budgets vs. improve conventional L2 cache utilization while guaranteeing the QoS metrics
• Design: two-layer hierarchical design in both
• Controller: PI vs. ROR
• Coordination & stability analysis: yes vs. no
• Actuator: cache bank resizing and partitioning vs. cache way resizing and partitioning
• Evaluation: SimpleScalar vs. Simics

34

Q & A

Thank you

35

Backup Slides Start

36

Relative Controller Evaluation (2)

37

Application Controller Evaluation (2)

38

Guaranteed Optimal Utilization Proof

$K_i$ are time-varying coefficients that depend on the applications.

(Derivation: starting from the set-point update $P^{ref}_i(t+1) = P^{ref}_i(t) + K_i(t)(\cdots)$ and the relation between the allocated ways $w_i^*(t)$, the measured IPCs $P^{out}_i(t)$, and the total way count $W$, the proof shows that the updates drive the total number of allocated ways to $W$.)

39

System Design
