
Page 1:

CMP-MSI, Feb. 11th 2007

Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors

Carmelo Acosta 1, Francisco J. Cazorla 2, Alex Ramírez 1,2, Mateo Valero 1,2

1 UPC-Barcelona
2 Barcelona Supercomputing Center

Page 2:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 3:

Introduction

As process technology advances, deciding what to do with the growing number of transistors becomes ever more important.

Current trend: replicate cores.
Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad
AMD: Opteron Dual-Core, Opteron Quad-Core
IBM: POWER4, POWER5
Sun Microsystems: Niagara T1, Niagara T2

Page 4:

Introduction

[Die photos: POWER4 (CMP) and POWER5 (CMP+SMT).]

Memory Subsystem (green) spreads over more than half the chip area.

Page 5:

Introduction

Each L1 is connected to each L2 bank with a bus-based interconnection network.

Page 6:

Goal

Is prior research in the SMT field directly applicable to the new CMP+SMT scenario?

No: we have to revisit well-known SMT ideas.

Instruction Fetch Policy

Page 7:

ICOUNT

[Diagram: fetch stage and reorder buffer (ROB) shared among the running threads.]

Page 8:

ICOUNT

[Diagram: fetch stage and ROB; the blue thread suffers an L2 miss and its fetch is stalled.]

Under ICOUNT, the processor's resources are balanced between the running threads. However, all the resources devoted to the blue thread remain unused until its L2 miss is resolved.
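To make the policy concrete, here is a minimal sketch of ICOUNT-style thread selection (hypothetical names and structures, not the SMTsim implementation): each cycle, fetch bandwidth goes to the thread with the fewest in-flight instructions in the pre-issue stages.

#include <limits>
#include <vector>

// Minimal sketch of ICOUNT-style thread selection (hypothetical names): fetch
// from the thread with the fewest in-flight instructions, so that resources
// stay balanced across the running threads.
struct ThreadState {
    int in_flight;       // instructions in decode/rename/issue queues
    bool fetch_stalled;  // e.g. waiting on an I-cache miss
};

int icount_select(const std::vector<ThreadState>& threads) {
    int best = -1;
    int best_count = std::numeric_limits<int>::max();
    for (int t = 0; t < static_cast<int>(threads.size()); ++t) {
        if (threads[t].fetch_stalled) continue;   // skip threads that cannot fetch
        if (threads[t].in_flight < best_count) {  // fewest in-flight instructions wins
            best_count = threads[t].in_flight;
            best = t;
        }
    }
    return best;  // -1 if no thread can fetch this cycle
}

Because a thread stalled on an L2 miss keeps a high in-flight count, ICOUNT simply stops fetching from it, but the resources that thread already holds stay occupied until the miss returns.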

Page 9:

FLUSH

[Diagram: fetch stage and ROB; the blue thread suffers an L2 miss and a FLUSH is triggered.]

All resources devoted to the pending instructions of the blue thread are freed.
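The mechanism can be sketched roughly as follows (hypothetical names, assuming a simple squash of the offending thread's uncommitted ROB entries; an illustration, not the paper's code):

#include <deque>
#include <vector>

// Minimal sketch of the FLUSH mechanism (hypothetical names): when a load from
// one thread is known (or predicted) to miss in the L2, squash that thread's
// uncommitted instructions and stall its fetch until the data returns, so the
// shared resources it was holding become available to the other threads.
struct Instr {
    int thread_id;
    bool committed;
};

struct SMTCore {
    std::deque<Instr> rob;            // shared reorder buffer
    std::vector<bool> fetch_stalled;  // one flag per hardware thread

    void on_l2_miss(int thread_id) {
        for (auto it = rob.begin(); it != rob.end();) {
            if (it->thread_id == thread_id && !it->committed)
                it = rob.erase(it);   // free the entry; the instruction is re-fetched later
            else
                ++it;
        }
        fetch_stalled[thread_id] = true;   // stop fetching from the offending thread
    }

    void on_l2_fill(int thread_id) {
        fetch_stalled[thread_id] = false;  // miss resolved: resume fetch for the thread
    }
};

The squashed instructions are re-fetched once the miss is serviced, trading some extra fetch work for freeing shared resources during the long-latency miss.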

Page 10:

FLUSH

[Diagram: fetch stage and ROB; the flushed (blue) thread remains stalled while its L2 miss is serviced.]

The freed resources allow the remaining threads to make additional forward progress. The flush can be triggered either by the (late) detection of the L2 miss or by L2 miss prediction.

Page 11:

Single vs Multi Core

[Diagram: a single-core configuration (one core with private I$/D$ and L2 banks b0-b3) versus a multi-core configuration (several cores, each with private I$/D$, sharing L2 banks b0-b3).]

More pressure on both:
• the interconnection network
• the shared L2 banks
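As a back-of-the-envelope illustration (a toy model with assumed parameters, not data from the paper), the effect of this added pressure on the shared bus can be sketched with a simple queuing estimate:

#include <cstdio>

// Toy queuing model (purely illustrative): the average L2 hit latency seen by a
// core grows with the number of cores, because every core's requests contend
// for the shared bus to the L2 banks.
int main() {
    const double base_hit_latency = 15.0;  // assumed unloaded L2 hit latency (cycles)
    const double bus_occupancy    = 4.0;   // assumed cycles a request holds the bus
    const double reqs_per_core    = 0.05;  // assumed L2 requests per core per cycle

    for (int cores = 1; cores <= 4; ++cores) {
        double utilization = cores * reqs_per_core * bus_occupancy;  // bus utilization
        // M/M/1-style estimate of the average wait for the bus.
        double queue_wait = (utilization < 1.0)
                          ? bus_occupancy * utilization / (1.0 - utilization)
                          : 1e9;
        std::printf("%d core(s): ~%.1f-cycle average L2 hit latency\n",
                    cores, base_hit_latency + queue_wait);
    }
    return 0;
}

Even this crude model shows the average L2 hit latency growing quickly as the shared interconnect approaches saturation, which is the trend quantified in the results later in the talk.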

Page 12:

Single vs Multi Core

[Same diagram as Page 11: single-core versus multi-core configurations sharing the L2 banks.]

The L2 access latency becomes more unpredictable, which is bad for FLUSH.

Page 13:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 14:

Simulation Methodology

Trace-driven SMT simulator derived from SMTsim.

C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = number of threads per core).

[Diagram: evaluated CMP configuration; each core has private I$ and D$ and connects to the shared L2 banks b0-b3.]

[Table: core details (* per thread).]

Page 15:

Simulation Methodology

Instruction Fetch Policies:

ICOUNT

FLUSH

Workloads classified by type:
ILP: all threads have good memory behavior.
MEM: all threads have bad memory behavior.
MIX: mixes both types of threads.
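As a rough illustration of how such a classification could be derived (hypothetical threshold and names, not the paper's criteria), threads could be labeled by their L2 miss rate:

#include <vector>

// Illustrative workload classification (hypothetical threshold): threads with a
// high L2 miss rate are considered memory-bound, the rest ILP-bound; a workload
// containing both kinds is labeled MIX.
enum class WorkloadType { ILP, MEM, MIX };

WorkloadType classify(const std::vector<double>& l2_misses_per_instr,
                      double mem_threshold = 0.01) {  // assumed cut-off
    bool any_mem = false, any_ilp = false;
    for (double rate : l2_misses_per_instr) {
        if (rate >= mem_threshold) any_mem = true;
        else                       any_ilp = true;
    }
    if (any_mem && any_ilp) return WorkloadType::MIX;
    return any_mem ? WorkloadType::MEM : WorkloadType::ILP;
}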

Page 16:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 17:

Results: Single-Core (2 threads)

FLUSH yields a 22% average speedup over ICOUNT, mainly on MEM and MIX workloads.

Page 18:

Results: Multi-Core (2 threads/core)

FLUSH drops to a 9% average slowdown relative to ICOUNT on a four-core configuration.

More cores, less speedup.

Page 19:

Results: L2 Hit Latency on Multi-Core

[Chart: L2 hit latency distribution (cycles) for each configuration. More cores lead to higher average latency and greater dispersion.]

Page 20:

Results: L2 Miss Prediction

In this four-core example, the best choice is to predict an L2 miss once a load has been outstanding for 90 cycles.
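A minimal sketch of this kind of threshold-based prediction (hypothetical names; the 90-cycle value is simply the best choice reported on this slide):

#include <cstdint>

// Threshold-based L2 miss prediction used to trigger FLUSH early (sketch with
// hypothetical names): a load still outstanding after `threshold_cycles` is
// predicted to miss in the L2, instead of waiting for the (late) actual detection.
struct OutstandingLoad {
    int thread_id;
    std::uint64_t issue_cycle;
};

bool predict_l2_miss(const OutstandingLoad& load,
                     std::uint64_t current_cycle,
                     std::uint64_t threshold_cycles = 90) {  // best value in this example
    return (current_cycle - load.issue_cycle) >= threshold_cycles;
}

Each cycle, the core would check its outstanding loads and, when the prediction fires for one of them, trigger FLUSH for that load's thread.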

Page 21:

Results: L2 Miss Prediction

But in this other four-core example, the best choice is not to predict L2 misses at all.

Page 22:

Overview

Introduction

Simulation Methodology

Results

Conclusions

Page 23:

Conclusions

Future high-degree CMPs open new challenging research topics in CMP+SMT cooperation.

The CMP outer cache level and interconnection characteristics may heavily affect SMT intra-core performance.

For example, FLUSH relies on a predictable L2 hit latency, which is heavily affected in a CMP+SMT scenario.

FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from a single-core to a quad-core configuration.

Page 24:

Thank you

Questions?