
Page 1:

CMP-MSI, Feb. 11th 2007

Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors

Carmelo Acosta 1, Francisco J. Cazorla 2, Alex Ramírez 1,2, Mateo Valero 1,2

1 UPC-Barcelona
2 Barcelona Supercomputing Center

Page 2:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 3:

Introduction

As process technology advances, deciding what to do with the growing number of transistors becomes ever more important.

Current trend: replicate cores.
Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad
AMD: Opteron Dual-Core, Opteron Quad-Core
IBM: POWER4, POWER5
Sun Microsystems: Niagara T1, Niagara T2

Page 4:

Introduction

[Die photos: POWER4 (CMP) and POWER5 (CMP+SMT).]

Memory Subsystem (green) spreads over more than half the chip area.

Page 5:

Introduction

Each L1 is connected to each L2 bank with a bus-based interconnection network.

Page 6:

Goal

Is prior research in the SMT field directly applicable to the new CMP+SMT scenario?

No: we have to revisit well-known SMT ideas.

Instruction Fetch Policy

Page 7:

ICOUNT

[Diagram: fetch stage and reorder buffer (ROB) shared among the running threads.]

Page 8:

ICOUNT

[Diagram: fetch stage and ROB; the blue thread suffers an L2 miss and its fetch is stalled.]

Under ICOUNT, the processor's resources are balanced between the running threads. However, all the resources devoted to the blue thread remain unused until its L2 miss is resolved.
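To make the policy concrete, here is a minimal sketch of ICOUNT-style thread selection (hypothetical names and structures, not the SMTsim implementation): each cycle, fetch bandwidth goes to the thread with the fewest in-flight instructions in the pre-issue stages.

#include <limits>
#include <vector>

// Minimal sketch of ICOUNT-style thread selection (hypothetical names): fetch
// from the thread with the fewest in-flight instructions, so that resources
// stay balanced across the running threads.
struct ThreadState {
    int in_flight;       // instructions in decode/rename/issue queues
    bool fetch_stalled;  // e.g. waiting on an I-cache miss
};

int icount_select(const std::vector<ThreadState>& threads) {
    int best = -1;
    int best_count = std::numeric_limits<int>::max();
    for (int t = 0; t < static_cast<int>(threads.size()); ++t) {
        if (threads[t].fetch_stalled) continue;   // skip threads that cannot fetch
        if (threads[t].in_flight < best_count) {  // fewest in-flight instructions wins
            best_count = threads[t].in_flight;
            best = t;
        }
    }
    return best;  // -1 if no thread can fetch this cycle
}

Because a thread stalled on an L2 miss keeps a high in-flight count, ICOUNT simply stops fetching from it, but the resources that thread already holds stay occupied until the miss returns.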

Page 9:

FLUSH

[Diagram: fetch stage and ROB; the blue thread suffers an L2 miss and a FLUSH is triggered.]

All resources devoted to the pending instructions of the blue thread are freed.
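The mechanism can be sketched roughly as follows (hypothetical names, assuming a simple squash of the offending thread's uncommitted ROB entries; an illustration, not the paper's code):

#include <deque>
#include <vector>

// Minimal sketch of the FLUSH mechanism (hypothetical names): when a load from
// one thread is known (or predicted) to miss in the L2, squash that thread's
// uncommitted instructions and stall its fetch until the data returns, so the
// shared resources it was holding become available to the other threads.
struct Instr {
    int thread_id;
    bool committed;
};

struct SMTCore {
    std::deque<Instr> rob;            // shared reorder buffer
    std::vector<bool> fetch_stalled;  // one flag per hardware thread

    void on_l2_miss(int thread_id) {
        for (auto it = rob.begin(); it != rob.end();) {
            if (it->thread_id == thread_id && !it->committed)
                it = rob.erase(it);   // free the entry; the instruction is re-fetched later
            else
                ++it;
        }
        fetch_stalled[thread_id] = true;   // stop fetching from the offending thread
    }

    void on_l2_fill(int thread_id) {
        fetch_stalled[thread_id] = false;  // miss resolved: resume fetch for the thread
    }
};

The squashed instructions are re-fetched once the miss is serviced, trading some extra fetch work for freeing shared resources during the long-latency miss.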

Page 10:

FLUSH

[Diagram: fetch stage and ROB; the flushed (blue) thread remains stalled while its L2 miss is serviced.]

The freed resources allow the remaining threads to make additional forward progress. The flush can be triggered either by the (late) detection of the L2 miss or by L2 miss prediction.

Page 11:

Single vs Multi Core

[Diagram: a single-core configuration (one core with private I$/D$ and L2 banks b0-b3) versus a multi-core configuration (several cores, each with private I$/D$, sharing L2 banks b0-b3).]

More pressure on both:
• the interconnection network
• the shared L2 banks
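As a back-of-the-envelope illustration (a toy model with assumed parameters, not data from the paper), the effect of this added pressure on the shared bus can be sketched with a simple queuing estimate:

#include <cstdio>

// Toy queuing model (purely illustrative): the average L2 hit latency seen by a
// core grows with the number of cores, because every core's requests contend
// for the shared bus to the L2 banks.
int main() {
    const double base_hit_latency = 15.0;  // assumed unloaded L2 hit latency (cycles)
    const double bus_occupancy    = 4.0;   // assumed cycles a request holds the bus
    const double reqs_per_core    = 0.05;  // assumed L2 requests per core per cycle

    for (int cores = 1; cores <= 4; ++cores) {
        double utilization = cores * reqs_per_core * bus_occupancy;  // bus utilization
        // M/M/1-style estimate of the average wait for the bus.
        double queue_wait = (utilization < 1.0)
                          ? bus_occupancy * utilization / (1.0 - utilization)
                          : 1e9;
        std::printf("%d core(s): ~%.1f-cycle average L2 hit latency\n",
                    cores, base_hit_latency + queue_wait);
    }
    return 0;
}

Even this crude model shows the average L2 hit latency growing quickly as the shared interconnect approaches saturation, which is the trend quantified in the results later in the talk.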

Page 12:

Single vs Multi Core

[Same diagram as Page 11: single-core versus multi-core configurations sharing the L2 banks.]

The L2 access latency becomes more unpredictable, which is bad for FLUSH.

Page 13:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 14:

Simulation Methodology

Trace-driven SMT simulator derived from SMTsim.

C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = number of threads per core).

[Diagram: evaluated CMP configuration; each core has private I$ and D$ and connects to the shared L2 banks b0-b3.]

[Table: core details (* per thread).]

Page 15:

Simulation Methodology

Instruction Fetch Policies:

ICOUNT

FLUSH

Workloads classified by type:
ILP: all threads have good memory behavior.
MEM: all threads have bad memory behavior.
MIX: mixes both types of threads.
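As a rough illustration of how such a classification could be derived (hypothetical threshold and names, not the paper's criteria), threads could be labeled by their L2 miss rate:

#include <vector>

// Illustrative workload classification (hypothetical threshold): threads with a
// high L2 miss rate are considered memory-bound, the rest ILP-bound; a workload
// containing both kinds is labeled MIX.
enum class WorkloadType { ILP, MEM, MIX };

WorkloadType classify(const std::vector<double>& l2_misses_per_instr,
                      double mem_threshold = 0.01) {  // assumed cut-off
    bool any_mem = false, any_ilp = false;
    for (double rate : l2_misses_per_instr) {
        if (rate >= mem_threshold) any_mem = true;
        else                       any_ilp = true;
    }
    if (any_mem && any_ilp) return WorkloadType::MIX;
    return any_mem ? WorkloadType::MEM : WorkloadType::ILP;
}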

Page 16:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 17:

Results: Single-Core (2 threads)

FLUSH yields a 22% average speedup over ICOUNT, mainly on MEM and MIX workloads.

Page 18:

Results: Multi-Core (2 threads/core)

FLUSH drops to a 9% average slowdown relative to ICOUNT on a four-core configuration.

More cores, less speedup.

Page 19:

Results: L2 Hit Latency on Multi-Core

[Chart: L2 hit latency distribution (cycles) for each configuration. More cores lead to higher average latency and greater dispersion.]

Page 20:

Results: L2 Miss Prediction

In this four-core example, the best choice is to predict an L2 miss once a load has been outstanding for 90 cycles.
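A minimal sketch of this kind of threshold-based prediction (hypothetical names; the 90-cycle value is simply the best choice reported on this slide):

#include <cstdint>

// Threshold-based L2 miss prediction used to trigger FLUSH early (sketch with
// hypothetical names): a load still outstanding after `threshold_cycles` is
// predicted to miss in the L2, instead of waiting for the (late) actual detection.
struct OutstandingLoad {
    int thread_id;
    std::uint64_t issue_cycle;
};

bool predict_l2_miss(const OutstandingLoad& load,
                     std::uint64_t current_cycle,
                     std::uint64_t threshold_cycles = 90) {  // best value in this example
    return (current_cycle - load.issue_cycle) >= threshold_cycles;
}

Each cycle, the core would check its outstanding loads and, when the prediction fires for one of them, trigger FLUSH for that load's thread.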

Page 21:

Results: L2 Miss Prediction

But in this other four-core example, the best choice is not to predict L2 misses at all.

Page 22:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 23:

Conclusions

Future high-degree CMPs open new challenging research topics in CMP+SMT cooperation.

The CMP outer cache level and interconnection characteristics may heavily affect SMT intra-core performance.

For example, FLUSH relies on a predictable L2 hit latency, which is heavily affected in a CMP+SMT scenario.

FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from a single-core to a quad-core configuration.

Page 24:

Thank you

Questions?