Dynamic and Application-Driven I-Cache Partitioning for Low-Power Embedded Multitasking

Mathew Paul and Peter Petrov
Proceedings of the IEEE Symposium on Application Specific Processors (SASP ’09), July 2009

112/04/19

The abundance of wireless connectivity and the increased workload complexity have further underlined the importance of energy efficiency for modern embedded applications. The cache memory is a major contributor to the system power consumption, and as such is a primary target for energy reduction techniques. Recent advances in configurable cache architecture have enabled an entirely new set of approaches for application-driven energy- and cost-efficient cache resource utilization.

We propose a run-time cross-layer specialization methodology, which leverages configurable cache architectures to achieve an energy- and performance-conscious adaptive mapping of instruction cache resources to tasks in dynamic multitasking workloads.

Abstract

Sizable leakage and dynamic power reductions are achieved with only a negligible and system-controlled performance impact. The methodology assumes no prior information regarding the dynamics and the structure of the workload. As the proposed dynamic cache partitioning alleviates the detrimental effects of cache interference, performance is maintained very close to the baseline case, while achieving 50%-70% reductions in dynamic and static leakage power for the on-chip instruction cache.

Abstract – Cont.

The cache memory is a major contributor to the total dynamic and leakage power
  It occupies up to 50% of the die area and 80% of the transistor budget
How can the configurable cache be customized dynamically so that a task gets only its required cache volume?
  Goal: reduce power consumption with limited degradation in performance

What’s the Problem

[Figure: Normal Cache vs. Energy Efficient Cache for Task0. Performance doesn’t improve noticeably beyond half of the cache, so the remaining half is left idle.]

Partition the instruction cache and adapt its utilization at run time
  Cache partitioning: eliminate cache interference
  Utilize the configurable cache: only the required subsection of the cache is active

The Proposed Methodology for Dynamic Cache Customization

[Figure: dynamic multitasking workload with Task0, Task1, and Task2. Only one task is active at a time; the others are idle.]

From t2 to t3:
Based on the cache partition formation (initial partition) policy
  Cache requirements of each task (detailed later)
  。Task0: 2K 2-way, Task1: 8K 4-way, Task2: 4K 2-way

Functional Overview

[Figure: dynamic multitasking workload mapped onto the 16K 4-way baseline cache. Each task maps to a subsection equal to its required cache size; the section of Task2 is active during Task2 execution, and the remainder is in low-power drowsy mode.]
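To make the mapping concrete, the following minimal sketch computes how much of the baseline cache could stay in drowsy mode while each task runs, using the per-task requirements quoted on this slide. The function names are hypothetical, and the fit check looks at capacity only; it ignores way and set geometry.

```python
BASELINE_KB = 16  # 16K 4-way baseline cache from the slides

# Per-task requirements quoted on the slide: (size in KB, number of ways).
REQUIREMENTS = {"Task0": (2, 2), "Task1": (8, 4), "Task2": (4, 2)}

def drowsy_fraction(task):
    """Fraction of the baseline cache that can stay drowsy while `task` runs."""
    size_kb, _ways = REQUIREMENTS[task]
    return 1 - size_kb / BASELINE_KB

def fits_exclusively(requirements, capacity_kb=BASELINE_KB):
    """Capacity-only check: do the summed requirements fit in the cache?"""
    return sum(size for size, _ in requirements.values()) <= capacity_kb
```

For these three tasks the summed requirement is 14K, so a capacity-only check passes for the 16K cache.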

However, overlapping cache partitions can be unavoidable
  Some tasks may require larger cache partitions
Overlap brings the problem of cache interference
  It can result in performance worse than the required miss-rate bound
Handle such cases through dynamic partition update
  Update the overlapped partitions dynamically

Functional Overview – Cont.

[Figure: three partitioning scenarios for Task0, Task1, and Task2. Ideal case: tasks map to exclusive partitions. Initial partition: partitions overlap. Dynamic partition update: a partition is enlarged when performance worsens.]

The mechanisms required for efficient cache utilization with minimal interference
  Initial partition formation
  。Identify the individual task cache requirement at compile time
    Use the cache miss statistics local to each task
  Initial partition assignment
  。Assign the initial partition to a task at run time
    Set the “Cache Way Select Register (CWSR)” and the “mask register” to vary the number of sets
  Dynamic partition update policy
  。Fine-tune the partition size when performance worsens
    Ensure the miss rate remains within the threshold bounds
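As a rough illustration of how a way-select register and a set mask could confine a task to its partition, the sketch below models the index computation of a configurable cache. The slides only name the CWSR and the mask register; the line size, the bit layout, and the names `cwsr`, `set_mask`, and `set_base` are assumptions made for illustration.

```python
LINE_BYTES = 32      # assumed cache line size
SETS_PER_WAY = 128   # 16K 4-way with 32-byte lines: 16384 / (4 * 32)

def partition_set_index(addr, set_mask, set_base=0):
    """Map an address to a set inside the task's partition.

    set_mask keeps only the low index bits the partition may use;
    set_base (hypothetical) positions the partition within the full cache
    and is assumed to be aligned to the partition size.
    """
    full_index = (addr // LINE_BYTES) % SETS_PER_WAY
    return set_base | (full_index & set_mask)

def enabled_ways(cwsr, num_ways=4):
    """Return the ways whose bit is set in the way-select register value."""
    return [w for w in range(num_ways) if (cwsr >> w) & 1]
```

Under these assumptions, a 2K 2-way partition would use 32 sets per way (`set_mask = 31`) and a way-select value such as `0b0011`.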

Dynamic Cache Customization

Identify the cache requirement and determine the initial partition size for each task
  Aim at reducing energy while keeping performance close to the baseline case, i.e., BASE(Ti)
  Use IND_BASE(Ti) instead
  。Then define a “Threshold” that accounts for the cache interference
  Hence, the miss-rate bound for a task is IND_BASE(Ti) + Threshold
The starting cache configuration is picked such that
  。MISS(Pi,j) ≦ IND_BASE(Ti) + Threshold

Part1: Initial Partition Formation

[Figure: task interleaving on the baseline cache (task0 to task3, Task 4) versus task0 running alone]
BASE(Ti): actual baseline miss rate of task Ti with interference (not available at compile time)
IND_BASE(Ti): miss rate of task Ti when the baseline cache is used in isolation (task-specific)

MCS (Missrate Cache Space) table
  Cache miss statistics for each cache configuration
  。Obtained through profiling

Part1: Initial Partition Formation - Example

[Figure: MCS tables for Task0, Task1, and Task2; axes: cache way size (512, 1K, 2K, 4K, 8K) and # of ways]

Miss-rate bound per task, with Threshold = 0.1%:
Task0: MISS(P0,j) ≦ IND_BASE(T0) + Threshold = 0% + 0.1%
Task1: MISS(P1,j) ≦ IND_BASE(T1) + Threshold = 0.15% + 0.1%
Task2: MISS(P2,j) ≦ IND_BASE(T2) + Threshold = 0.17% + 0.1%

Find the minimal cache that satisfies the condition
Starting configuration for G721: 8K 2-way
Starting configuration for LAME: 4K 4-way
Starting configuration for GSM: 8K 2-way
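The selection rule above can be sketched as a small search over an MCS table. This is a minimal sketch, not the authors' implementation: the table values are invented for illustration (they are not profile data from the paper), the function name is hypothetical, and breaking ties toward fewer ways is an assumption.

```python
THRESHOLD = 0.10  # percent, as quoted on the slides

def starting_configuration(mcs_table, ind_base, threshold=THRESHOLD):
    """Pick the smallest configuration whose profiled miss rate meets the bound.

    mcs_table maps (way_size_kb, ways) -> miss rate in percent.
    Ties in total size are broken toward fewer ways (an assumption).
    """
    bound = ind_base + threshold
    candidates = [cfg for cfg, miss in mcs_table.items() if miss <= bound]
    if not candidates:
        return None
    return min(candidates, key=lambda cfg: (cfg[0] * cfg[1], cfg[1]))

# Hypothetical MCS entries for a task with IND_BASE = 0.15% (not real data).
example_mcs = {
    (0.5, 4): 1.90,  # 2K total
    (1, 2): 1.20,    # 2K total
    (1, 4): 0.22,    # 4K total
    (2, 2): 0.40,    # 4K total
    (4, 2): 0.16,    # 8K total
}
```

With these made-up entries, `starting_configuration(example_mcs, 0.15)` selects a 1K way size with 4 ways, i.e. a 4K 4-way partition.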

Assign the initial partition to a task at run time
  Set the control register and mask register of the configurable cache
Attempt to assign partitions exclusive of each other
  But this is not always possible
  。The total cache requirement of G721, LAME, and GSM is 20K, but only 16K is available

Part2: Initial Partition Assignment

At time t0, allocate 8K 2-way to G721
At time t1, allocate 4K 4-way to LAME (it cannot be made exclusive of G721, so overlapping is allowed)
At time t2, allocate 8K 2-way to GSM (with a small portion being used by LAME)
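The allocation sequence above can be reproduced with a toy placement model. The sketch treats the 16K 4-way cache as a grid of 4 ways by four 1K blocks and places each partition where it overlaps existing partitions least. The grid model, the contiguous-way restriction, and the function name `place` are all assumptions; the slides do not describe the actual placement algorithm.

```python
from itertools import product

NUM_WAYS, BLOCKS_PER_WAY = 4, 4  # 16K 4-way cache modeled as 1K blocks

def place(partitions, name, way_kb, ways):
    """Place a (way_kb x ways) partition where it overlaps existing ones least.

    partitions maps task name -> set of (way, block) cells and is updated
    in place. Returns the overlapped capacity in KB.
    """
    used = set().union(*partitions.values())
    best = None
    for w0, b0 in product(range(NUM_WAYS - ways + 1),
                          range(BLOCKS_PER_WAY - way_kb + 1)):
        cells = {(w, b) for w in range(w0, w0 + ways)
                        for b in range(b0, b0 + way_kb)}
        overlap = len(cells & used)
        if best is None or overlap < best[0]:
            best = (overlap, cells)
    partitions[name] = best[1]
    return best[0]
```

Under this toy model, placing G721 (4K way size, 2 ways), LAME (1K way size, 4 ways), and GSM (4K way size, 2 ways) in that order leaves G721 and GSM disjoint, while LAME shares 2K with each of them.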

Overlapping partitions cannot always be prevented
  Interference may push miss rates beyond the bound

Part3: Dynamic Partition Update

Trigger the dynamic partition update
  HW miss counter inside CPU ＞ IND_BASE(Ti) + Threshold
  Trigger partition rescaling
  Enlarge the partition size until the miss rate is below the miss-rate bound
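The trigger condition can be expressed as a simple comparison. In this minimal sketch, `misses` and `accesses` stand in for hardware counter readings; how and when the counters are sampled is not stated on the slides, so the per-interval sampling and the function name are assumptions.

```python
def needs_rescaling(misses, accesses, ind_base, threshold=0.10):
    """Return True when the measured miss rate (percent) exceeds the bound.

    misses and accesses are assumed counter readings taken over one
    monitoring interval; ind_base and threshold are in percent.
    """
    if accesses == 0:
        return False
    miss_rate = 100.0 * misses / accesses
    return miss_rate > ind_base + threshold
```

For a task with IND_BASE = 0.15%, a measured rate of 0.4% would trigger rescaling, while 0.2% would not.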

Partition rescaling trades off power savings to meet the performance requirement

Part3: Dynamic Partition Update - Example

[Figure: MCS table for LAME; cache way sizes 512, 1K, 2K, 4K, 8K]
For LAME, the miss-rate bound is exceeded in the overlapped region
  LAME: IND_BASE(T1) + Threshold = 0.25%
  The next configuration after 4K 4-way with a miss rate below 0.25% is 6K 3-way
GSM is rescaled to 12K 3-way due to increased overlap with the rescaled LAME partition
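The rescaling step can be sketched as a search for the next larger configuration that meets the bound. Only the 4K 4-way starting point, the 6K 3-way result, and the 0.25% bound come from the slides; the miss-rate values in `lame_mcs` and the function names are hypothetical.

```python
def total_kb(cfg):
    """Total partition size in KB for a (way_size_kb, ways) configuration."""
    way_kb, ways = cfg
    return way_kb * ways

def next_configuration(mcs_table, current, bound):
    """Return the smallest configuration larger than `current` below `bound`.

    mcs_table maps (way_size_kb, ways) -> profiled miss rate in percent.
    Returns None when no larger configuration satisfies the bound.
    """
    larger = [cfg for cfg in mcs_table if total_kb(cfg) > total_kb(current)]
    for cfg in sorted(larger, key=total_kb):
        if mcs_table[cfg] < bound:
            return cfg
    return None

# Hypothetical MCS entries for LAME (illustrative values, not real data).
lame_mcs = {(1, 4): 0.22, (2, 3): 0.20, (4, 2): 0.16, (4, 4): 0.15}
```

Starting from `(1, 4)` (4K 4-way), the search returns `(2, 3)`, i.e. 6K 3-way. The current configuration already meets the bound in isolation; rescaling is needed only because interference in the overlapped region raises the measured miss rate.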

Partition reshuffling
  When a task leaves on completing execution
  。The cache resource is freed up and becomes available to the currently executing tasks
  The previously rescaled partition is considered for reshuffling
  。Completely allocate this task’s starting configuration without overlap

Part3: Dynamic Partition Update - Example

Reshuffling
At time t4, both G721 and GSM complete and only LAME is left executing
Reshuffle to the starting configuration (reverting to the smaller partition reduces power)
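A minimal sketch of the reshuffling decision follows. It reverts the remaining tasks to their starting sizes once those sizes fit together in the cache. The data layout and the function name are hypothetical, and the check considers total capacity only, not way or set geometry.

```python
def reshuffle(active, capacity_kb=16):
    """Revert rescaled tasks to their starting size when capacity allows.

    active maps task name -> {"start_kb": ..., "current_kb": ...} and is
    updated in place. Returns True if the tasks were reverted.
    """
    if sum(t["start_kb"] for t in active.values()) <= capacity_kb:
        for t in active.values():
            t["current_kb"] = t["start_kb"]
        return True
    return False
```

In the example above, once G721 and GSM have completed, LAME alone fits easily and returns from 6K to its 4K starting configuration.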

Use the cache configurations found in high-end embedded processors (Intel XScale and ARM9)
  16K 4-way
  32K 4-way
Scheduling policy to model multitasking
  Round-robin policy with a context switch every 33K instructions
The miss-rate impact threshold is set to 0.1%
Evaluate two categories of benchmarks
  Static benchmarks: all tasks start and finish at the same time
  Dynamic benchmarks:

Experiment Setup

[Figure: Structure of Dynamic Benchmarks]
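The round-robin policy from the setup can be modeled with a short generator. This is only a sketch of the scheduling idea: the task names and instruction counts in any call are made up, and the simulator used in the paper is not described on the slides.

```python
def round_robin(remaining, quantum=33_000):
    """Yield (task, instructions_run) slices until every task finishes.

    remaining maps task name -> instructions left; the 33K-instruction
    quantum follows the context-switch setting quoted in the setup.
    """
    remaining = dict(remaining)  # work on a copy
    while remaining:
        for task in list(remaining):
            run = min(quantum, remaining[task])
            yield task, run
            remaining[task] -= run
            if remaining[task] == 0:
                del remaining[task]
```

Each yielded slice marks a context switch, which is the point where a partitioning scheme would switch the active cache partition to the next task.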

Partitioning: apply the initial partition assignment only
Rescaling: apply partitioning + rescaling
Reshuffling: apply partitioning + rescaling + reshuffling
For some configurations, rescaling and reshuffling are omitted
  since the miss-rate impact is already within the threshold after the initial assignment

Miss-Rate Impact: Miss-Rate Increase Compared to Baseline Cache

[Figure: miss-rate impact per benchmark, with an arrow marking the better direction. Annotation: after rescaling, the miss-rate impact is reduced.]

GSM is subjected to rescaling
  The miss-rate bound is exceeded due to interference in the overlapped region
The partition reshuffling maximizes power reduction
  Power reduction is achieved while keeping the miss-rate impact below the threshold value

BM_3 Individual Task Miss-Rates for 16K Cache

[Figure: BM_3 individual task miss rates, with an arrow marking the better direction. Annotations: performance improves, with miss rates even lower than the baseline case; the miss-rate bound of 0.27% is exceeded.]

[Figure: shared cache with interleaved tasks (task0, task1, task2, task3, Task 4); thrashing]
