sparc64(tm) vii fujitsu's next generation quad-core processor · 2014. 5. 2. · tr=190m cmos...

18
All Rights Reserved,Copyright© FUJITSU LIMITED 2008 SPARC64 VII Fujitsu’s Next Generation Quad-Core Processor August 26, 2008 Takumi Maruyama LSI Development Division Next Generation Technical Computing Unit Fujitsu Limited

Upload: others

Post on 27-Feb-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

All Rights Reserved,Copyright© FUJITSU LIMITED 2008

SPARC64™ VIIFujitsu’s Next Generation

Quad-Core Processor

August 26, 2008

Takumi Maruyama

LSI Development DivisionNext Generation Technical Computing UnitFujitsu Limited

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

2004~2000~2003

1998~1999

1996~1997

~1995

SPARC64II

SPARC64™Processor

SPARC64SPARC64™™ProcessorProcessor

Tr=190MCMOS Cu

130nm

Tr=30MCMOS Cu

180nm / 150nm

Tr=190MCMOS Cu

130nmSPARC64

V

SPARC64GP

Tr=30MCMOS Al

250nm / 220nm

Tr=46MCMOS Cu180nm

GS8900

MainframeMainframeMainframe

:Technology generation

GS21600

GS8800B

Tr=400MCMOS Cu

90nm

Tr=540MCMOS Cu

90nm

SPARC64VI

GS8800

Tr=500MCMOS Cu

90nm

Tr=10MCMOS Al

350nm

Store AheadBranch HistoryPrefetch

Single-chip CPU

Non-Blocking $O-O-O ExecutionSuper-Scalar

L2$ on DieMulti-core / Multi-thread

Fujitsu Processor DevelopmentFujitsu Processor Development

High Perform

anceT

echnologyH

igh Reliability

Technology

Cache ECCRegister/ALU ParityInstruction RetryCache Dynamic DegradationRC/RT/History

Cache ECCRegister/ALU ParityInstruction RetryCache Dynamic DegradationRC/RT/History

GS8600

GS21900

Tr=600MCMOS Cu

65nm

SPARC64V +

HPCHPCHPC

SPARC64VII

GS21

SPARC64

UNIXUNIXUNIXHardware Barrier

Venus

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

SPARC64™ VII Chip

• Architecture Features• 4core x 2threads (SMT)• Embedded 6MB L2$• 2.5GHz• Jupiter Bus

• Fujitsu 65nm CMOS• 21.31mm x 20.86mm• 600M transistors• 456 signal pins

• 135 W (max)• 44% power reduction per core

from SPARC64TM VI

Core 1Core 1 Core 3Core 3

Core 2Core 2

System I/OSystem I/O

SystemInterfaceSystem

Interface

L2 CacheControl

L2 CacheControl

L2 CacheTag

L2 CacheTag

InstructionControl

InstructionControl

ExecutionUnit

ExecutionUnit

L1D CacheL1D Cache

L1 Cache Control

L1 Cache Control

L2 CacheData

L2 CacheData

L1I CacheL1I Cache

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

SPARC64™ VII Design Target• Upgradeable from current SPARC64™ VI on SPARC

Enterprise ServersKeep single thread performance & high reliabilityby reusing SPARC64™ VI core as much as possibleSame system I/F: Jupiter-Bus

• More Throughput PerformanceDual core to Quad-coreVMT (Vertical Multithreading) to SMT (Simultaneous Multithreading)

• Technical ComputingShared L2$ & Hardware Barrier Increased Bus frequency (on Fujitsu ‘FX1’ server)

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

L1 I$64KB2Way

BranchTarget

Address8Kentry4Way

Decode& Issue

RSE8x2Entry

RSA10Entry

RSF8x2Entry

RSBR10Entry

GUB32Registers

GPR156Registers

x2

EXA

EXB

EAGA

EAGB

FPR64Registers

x2

FUB48Registers

FLA

FLB

FetchPort

16Entry

Store Port

16Entry

Store Buffer16Entry

L1 D$64KB2Way

System BusInterface

Fetch(4stages)

Issue(2stages)

Dispatch Reg.-Read(4stages)

Execute Memory(L1$: 3stages)

CSE64Entry

Commit(2stages)

PCx2

ControlRegisters

x2

Pipeline Overview

L2$6MB

12Way

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Simultaneous Multi-Threading• Fine Grain Multi-thread

• Fetch/Issue/Commit stage:Alternatively select one of the two thread each cycle

• Execute stage:Select instructions based on oldest-ready independently from threads.

SMT throughput increase• x 1.2 INT/FP• x 1.3 Java• x 1.3 OLTP

t0 t1 t0 t1

t2 t3 t2 t3

time

Core0

Core1

t0Core0t1

t2 Core1

t4t5

t6

Core2

Core3

SPARC64 VII

SPARC64 VI

t3

6481648488x28x28One

SPFPFUBGUBRSFRSE8+8 32+32

CSE

8+8

Port

24+24

Rename Registers

8x28x2

Reservation Station.

24+244+4

IB

Two

#Active threads

The other thread is idle

t1

t3

t5

t7 t7

• To avoid interaction between threads.

• Automatically combined if the other thread is idle.

• Split HW Queues

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Integrated Multi-core Parallel Architecture• L2 cache

– Shared by 4 Cores to avoid false-sharing– B/W between L2 cache and L1cache

(2x of SPARC64TM VI)• L2$→L1$: 32byte/cycle/2core x 4core• L2$←L1$: 16byte/cycle/core x 4core

• Hardware Barrier– High speed synchronization mechanism

between cores in a CPU chip. – Reduce overhead for parallel execution

L1$

Hardware Barrier

IUEU

SharedL2$

(6MB12way)

LBSYCPU Core

Handles the multi-core CPU as one equivalent fast CPU with Compiler Technology (Fujitsu’s Parallelnavi Language Package 3.0)

– Automatic parallelization– Optimize the innermost loop

DO J=1,NDO I=1,MA(I,J)=A(I,J+1)*B(I,J)

ENDEND

PPP

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Hardware Barrier• Barrier resources

– BST: Barrier STatus– BST_mask: Select Bits in BST– LBSY: Last Barrier synchronization status– Synchronization is established when all

BST bits selected by BST_mask are set to the same value.

• Usage– Each core updates BST– Wait until LBSY gets flipped

Synchronization time: 60ns– 10 times faster than SW

Sample Code of Barrier Synchronization/** %r1: VA of a window ASI, %r2:, %r3: work*/ldxa [%r1]ASI_LBSY, %r2 ! read current LBSYnot %r2 ! inverse LBSYand %r2, 1, %r2 ! mask out reserved bitsstxa %r2, [%r1]ASI_BST ! update BSTmembar #storeload ! to make sure stxa is completeloop:ldxa [%r1]ASI_LBSY, %r3 ! read LBSYand %r3, 1, %r3 ! mask out reserved bitssubcc %r3, %r2, %g0 ! check if status changedbne,a loopsleep ! if not changed, sleep for a while

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Other performance enhancements• Faster FMA (Fused Multiply-Add)

– 7cycles→6cycles– DGEMM efficiency: 93% on 4 cores

LINPACK=2,023Gflops with 64CPU• Prefetch Improvement

– HW prefetch algorithm further refined – SW is now able to specify “strong”

prefetch to avoid prefetch lost.• Shared context

– Virtual address space shared by two or more processes

– Effective to save TLB entries• New Instruction

– Integer Multiply-Add with FP registers• etc…

Performance relative to SPARC64TM VI– Single thread performance : x1.0 - x1.2– Throughput: x1.7 - x1.9

Single thread Multi-thread throughput

SPARC64TM VII Relative Performance(SPARC64TM VI=1.0)

0 .0

0 .5

1 .0

1 .5

2 .0

INT FP INT FP JAVA

Hardware measured results

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Reliability, Availability, Serviceability

• New RAS features of SPARC64™VII• Integer registers are ECC protected• Number of checkers increased to ~3,400

• Existing RAS features

Error

PC

Fetch Execute Commit

PC

SW visibleresources

SW visibleresources

Single step execution

Instruction Retry

Back to normal execution after the re-executed Instruction gets committed without error.

X

GPR,FPR

Memory

GPR,FPR

Memory

Fetch Execute Commit

Update of SW visible resources

flush

CSE

RSE,RSF

RSA

RSBR

GUB,FUB

IWRIBF

ALUEAGA/BEXA/BFLA/B

CSE

RSE,RSF

RSA

RSBR

GUB,FUB

IWRIBF

ALUEAGA/BEXA/BFLA/B

Cache Tag ECC (L2$)Protection Dupicate & Parity (L1$)

Data ECC (L2$, L1D$)Parity (L1I$)

Cache

ECC (INT Regs)Parity (FP Regs, etc)

HW Instruction Retry YesHistory Yes

Dynamic Degradation Yes

ALU/Register

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

RAS coverageRF/Latch (0.9Mbits)

Green: 1bit error CorrectableYellow: 1bit error DetectableGray: 1bit error harmless

• All RAMs are ECC protected or Duplicated

• Most latches are parity protected

Guaranteed Data Integrity

RAM (71.5Mbits)

More than 70% of Transient States are 1bit error detectable.

Recoverable through HW Instruction Retry

L2$(data)

L1$L2$(tag)

Others Arch Reg (INT)

Transient States

Arch Reg (FP) etc

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Servers based on SPARC64TM VII

JSC

SPARC64TM

VII

40GB/s40GB/s

JSC DDR2

DDR2

• SPARC Enterprise: UNIX– Max 64CPUs / SMP– Upgradeable from SPARC64TM VI– Possible to mix SPARC64TM VI and

SPARC64TM VII within the same SB

• FX1: Technical Computing– Single CPU node– Increased Jupiter Bus freq = 1.26GHz– System chip is newly designed to realize

• Higher memory throughput• Lower Memory latency• Smaller footprint

– Performance• FP throughput/socket: x1.5• STREAM benchmark: >13.5GB/s

SC

SC

MC

MC

MC

MC

CPU

CPU

CPU

CPU

DDR2

DDR2

DDR2

DDR2

SC

SC

SPARC Enterprise M8000

FX1

SPARC64TM VI or VII46GB/s (for 4CPUs)

32B

32B20B(in)+12B(out)

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Performance Analysis

• Performance Analyzer– About 100 performance events can be

monitored– Available through cputrack() and

cpustat()– 8 performance events can be

gathered at the same time.

• Commit base performance events– Number of commit instructions each

cycle.– The cause of no commit

• L2$miss• L1D$miss• Fetch miss• Execution Unit busy• …

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

0

1

2

3

SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1

bwaves gamess milc zeusmp gromacs cactus leslie3d namd dealII soplex povray calculix Gems tonto lbm wrf sphinx3 Average

CPI

SPARC64TM VII CPI (Cycle Per Instruction) Example

LowerPerformance

HigherPerformance

4 Commit2-3 Commit1 Commit

0 Commit due to core

0 Commit due to $0 Commit due to system

SE: SPARC Enterprise Server M8000

FX1: FX1 Technical Computing Server

Memory access cost has been reduced on FX1.

Hardware measured results

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

SPARC64TM Relative Performance (SPARC64TM V/1.35GHz = 1.0)

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

1.0 1.5 2.0 2.5Single-Thread (Integer)

Mul

ti-Th

read

Thr

ougp

utSPARC64™ Future

• Design History:Evolution rather than revolution

• SPARC64TM V (1core)RASSingle thread performance

• SPARC64TM VI (2core x 2VMT)Throughput

• SPARC64TM VII (4core x 2SMT)More ThroughputHigh Performance Computing

• What’s next?

SPARC64TM VI

SPARC64TM VII

SPARC64TM VII w/ FX1

Single thread Throughput

FP

JAVA

Hardware measured results

SPARC64TM V

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

SPARC64TM VII Summary• The same system I/F with existing SPARC64TM VI to protect

customer’s investment

• 4core x 2SMT design realizes high throughput without sacrificing single thread performance

• Shared L2$ and HW barrier makes 4 cores behave as a single processor with compiler technology.

• The new system design has fully exploited potential of SPARC64TM VII.

• Fujitsu will continue to develop SPARC64TM series to meet the needs of a new era.

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Venus

• For PETA-scale Computing Server8coreEmbedded Memory Controller

• SPARC-V9 + extension (HPC-ACE)SIMD

• Fujitsu 45nm CMOS

128GF@socket

L1D$

IBUF DecIssue

RSA

RSE

RSF

RSBR

FP/SP

EX

FL

AGEN

L1I$

BRpredict

L2$

MC

DDR3DDR3

x8

DDR3…

Venus

SPARC64™ VII All Rights Reserved,Copyright© FUJITSU LIMITED 2008

Abbreviations• SPARC64TM VII

– RSA: Reservation Station for Address generation– RSE: Reservation Station for Execution– RSF: Reservation Station for Floating-point– RSBR: Reservation Station for Branch– GUB: General Update Buffer– FUB: Floating point Update Buffer– GPR: General Purpose Register– FPR: Floating Point Register– CSE: Commit Stack Entry– FP: Fetch Port– SP: Store Port

• Chip-sets– SC: System Controller– MC: Memory Controller– JSC: Jupiter System Controller– SB: System Board

• Others– DGEMM: Double precision matrix multiply and add