pedro benedicte, carles hernandez, jaume abella, francisco ... · jaume abella, francisco j....

35
Euromicro Conference July 4 th on Real-Time Systems Barcelona, Spain HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla

Upload: others

Post on 17-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Euromicro Conference July 4th on Real-Time Systems Barcelona, Spain

HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems

Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla

Page 2: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Performance in Real-Time Systems

• Future, more complex features require an increase in guaranteed performance

• COTS hardware used in HPC/commodity domain offers higher performance

• Common features: • Multicores

• Caches

2

NVIDIA Pascal (auto) SnapDragon

(auto)

NXP T2080 (avionics/rail) Zynq UltraScale+ EC

(space/auto)

• We focus on multicore with multilevel caches (MLC)

Page 3: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write policy

3

At the heart of MLC write policy

Metrics

Performance

Energy

Reliability

Coherence

Complexity

Page 4: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Contributions

• Analysis of most used policies • Write-Through (WT)

• Write-Back (WB)

• Write policies used in commercial processors

• Proposal HWP: Hybrid Write Policy • Try to take the best of both policies

• Evaluation • Guaranteed Performance

• Energy

• Reliability

• Coherence complexity

4

Page 5: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Assumptions

• Multi-core system • Private level of cache • Shared level of cache • Bus to connect the different cores

• Reliability • Parity when no correction needed • SECDED otherwise

• Coherence • Snooping based protocol

• Good with a moderate number of cores • Also can be applied to directory based

• We assume write-invalidate (MESI)

5

L1

Bus

Core

L2

Memory

ECC

L1

Core

L1

Core

Page 6: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-Through

6

L1 L1

L2

Bus

Core Core write A

A

Page 7: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-Through

6

L1 L1

L2

Bus

Core Core

A

A

A read A

Page 8: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-Through

6

L1 L1

L2

Bus

Core Core

ECC

Parity

Parity

Metric

Performance

Energy

Coherence simplicity

Reliability cost

Metric

Performance

Energy

Coherence simplicity

Metric

Performance

Energy

Page 9: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Shared bus writes

• Each write is sent to the bus

• Store takes k bus cycles

• Bus admits 1/k accesses per cycle without saturation

• 4 cores accessing

• Bus admits 1/(4·k)

• WT increases load on bus with writes

7

k k k Bus access time

Bus access time

k k k k k k k

Page 10: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Store percentage in real-time applications

• 9% stores on average • Data-intensive

real-time applications have a higher percentage of memory operations

• 4 cores: 36% stores

• If store takes > 3 cycles

Bus saturated

8

0

5

10

15

20

25

30EEMBC automotive store %

0

5

10

15

20

25

30MediaBench store %

Page 11: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

WT: reliability and coherence complexity

• Reliability: • dL1 does not keep dirty data

• No need to correct data in dL1 • Just detect error and request to L2

• Parity in dL1 • 1,6% overhead

• Data in L2 is always updated

• SECDED in L2 • 12,5% overhead

• Coherence: • Data is always in L2, no dirty state

• A simple valid/invalid protocol is enough

9

64 bit line P

SECDED 64 bit line

Page 12: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-though: summary

1. Stores to bus can create contention and affect guaranteed performance

2. More accesses to bus and L2 increase energy consumption

3. Only requires parity in L1

4. Simple coherence protocol

10

(higher is better)

1

2

3

4

Page 13: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-Back

11

L1 L1

L2

Bus

Core Core write A

A

Page 14: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-Back

11

L1 L1

L2

Bus

Core Core

A

A

A read A

A

Page 15: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-Back

11

L1 L1

L2

Bus

Core Core

ECC

ECC ECC

Metric

Performance

Energy

Coherence simplicity

Reliability cost

Metric

Performance

Energy

Coherence simplicity

Metric

Performance

Energy

Page 16: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write-back: summary

• Reduced pressure on bus improves guaranteed performance and energy consumption

• ECC (SECDED) is required for private caches • There can be dirty data in L1

• Increase in coherence protocol complexity • Due to private dirty lines tracking

12

Page 17: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Write Policies in Commercial Architectures

• There is a mixture of WT/WB implementations • No obvious solution

• Both solutions can be appropriate depending on the requirements

13

Processor Cores Frequency L1 WT? L1 WB?

ARM Cortex R5 1-2 160MHz Yes, ECC/parity Yes, ECC/parity

ARM Cortex M7 1-2 200MHz Yes, ECC Yes, ECC

Freescale PowerQUICC 1 250MHz Yes, ECC Yes, parity

Freescale P4080 8 1,5GHz No Yes, ECC

Cobham LEON 3 2 100MHz Yes, parity No

Cobham LEON 4 4 150MHz Yes, parity No

Page 18: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

WT and WB comparison

14

Write-through Write-back

• Each policy has pros and cons

• We want to get the best of each policy

HWP

Page 19: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Hybrid Write Policy: main idea

• Observations: • Coherence complex with WB because shared cache lines accessed may

be dirty in local L1 caches

• Private data is unaffected by cache coherence

• A significant percentage of data is only accessed by one processor (even in parallel applications), so no coherence management is needed

• Based on these observations, we propose HWP: • Shared data is managed like in WT cache

• Private data is managed like in WB caches

• Elements to consider: • Classify data as private/shared

• Implementation (cost, complexity…)

15

Page 20: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Hybrid Write Policy

16

L1 L1

L2

Bus

Core Core write A

A Shared data

Page 21: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Hybrid Write Policy

16

L1 L1

L2

Bus

Core Core

A

A

A read A

Shared data

Page 22: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Hybrid Write Policy

16

L1 L1

L2

Bus

Core Core

Private data

write A

A

Page 23: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Hybrid Write Policy

16

L1 L1

L2

Bus

Core Core

ECC

ECC ECC

Metric

Performance

Energy

Coherence simplicity

Reliability cost

Metric

Performance

Energy

Coherence simplicity

Metric

Performance

Energy

Page 24: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Private/Shared data classification

• The hardware needs to know if data is shared or private

• Page granularity is optimal for OS

• If any data in a page is shared, the page is classified as shared

• Techniques already exist in both OS (Linux) and real hardware platform (LEON3)

• Possible techniques: • Dynamic classification

• Predictability issues in RTS

• Software address partitioning • We assume this solution

17

Page 25: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Implementation

• Small hardware modifications

18

Page 26: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

HWP: summary

• Guaranteed performance • Accesses to bus are limited to shared

data

• Energy consumption of bus and L2 also reduced

• Reliability • Sensitive data could be marked as

shared so is always in L2

• For critical applications, SECDED needed, private data can be in L1 and not in L2

• Coherence • Same coherence complexity as WT

19

Page 27: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

WT, WB and HWT comparison

20

Write-through Write-back Hybrid Write Policy

Page 28: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Evaluation: Setup

21

• SoCLib simulator for cycles

• CACTI for energy usage • Architecture based on NGMP

• With 8 cores instead of 4

• Private iL1 and dL1, shared L2 • Benchmarks:

• EEMBC automotive, MediaBench

Page 29: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Methodology

• 4 different mixes from single thread benchmarks

• Suppose different percentages of shared data to evaluate the different scenarios

• Model for bus contention [1] • Uses PMC to count the type of the competing cores’ accesses

• With this model we obtain partially time composable WCET estimates

• To summarize, the model takes into consideration the worst possible accesses the other cores DO make

• Task : 100 accesses to bus Other tasks: 50 accesses to bus

• The model takes into account only the 50 potential interferences

• More tight WCET estimates

22 [1] J. Jalle et al. Bounding resource contention interference in the next-generation microprocessor (NGMP)

Page 30: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Guaranteed performance

23

• Normalized WCET bus contention

• 10% of data is shared

• WT does not scale well with the

number of cores • HWP scales similar to WB • Some degradation due to shared

accesses

WT

HWP

WB

Cores

Page 31: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Guaranteed performance

24

0% shared data 10% shared data 20% shared data 40% shared data

• Each plot normalized to its own single-core • Same trends we saw are seen across all setups

Page 32: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Energy

25

• Coherence is higher in WB policy • Reliability has a small energy cost • Main difference: L2 access energy

EEMBC MediaBench

Page 33: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Coherence

26

EEMBC MediaBench

• Invalidation messages • WT has a high number • WB and HWP only broadcast to shared data

• Shared dirty data communication • Significant impact in WB

Invalidation messages Shared dirty data communication

Invalidation messages Shared dirty data communication

Page 34: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Conclusions

• Both WT and WB offer tradeoffs in different metrics • No best policy, commercial architectures show this

• HWP tries to improve this • Not perfect, but improves overall

• Guaranteed performance and energy similar to WB

• Coherence complexity like WT

27

Page 35: Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco ... · Jaume Abella, Francisco J. Cazorla . Performance in Real-Time Systems •Future, more complex features require an

Thank you! Any questions?

Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla