micro-architecture of godson-3 multi-core processor of godson-3 multi-core processor weiwu hu, jian...

31
1 Micro-architecture of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese Academy of Sciences [email protected] Hotchips 2008, San Francisco, August 26

Upload: hoangliem

Post on 30-Apr-2018

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

1

Micro-architecture of Godson-3 Multi-Core Processor

Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen

Institute of Computing Technology (ICT)

Chinese Academy of Sciences

[email protected]

Hotchips 2008, San Francisco, August 26

Page 2: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

2

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS

Godson is the academic name of LoongsonTM

Page 3: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

3

National Project High performance CPUs are of national strategic importance

Chinese ICT industry is growing to a significant scale2007 Demand: ICT market = 5.6 trillion RMB 2007 Supply: only 22% by domestic companies, 3.75% profits

Godson CPU is supported by National 863 projectNational 973 projectNational Science Foundation of ChinaNational key projectKey project of Chinese Academy of Sciences

Page 4: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

4

Godson CPU Briefs

ICT started Godson CPU design in 2001 The 32-bit Godson-1 CPU in 2002 is the

first general purpose CPU in China The 64-bit Godson-2B in 2003.10 The 64-bit Godson-2C in 2004.12 The 64-bit Godson-2E in 2006.03 Each tripled the performance of its

previous one

Page 5: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

5

Godson Development

10

100

1000

10000

1999 2000 2001 2002 2003 2004 2005 2006

Intel/AMD/HP/IBM/SGI/Sparc SPEC cpu2000 rate

Godson rate

Page 6: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

6

Programs Reftime Run time Ratio

164.gzip 1400 403 347

175.vpr 1400 273 512

176.gcc 1100 221 497

181.mcf 1800 307 586

186.crafty 1000 167 598

197.parser 1800 472 382

252.eon 1300 188 690

253.perlbmk 1800 354 508

254.gap 1100 240 458

255.vortex 1900 263 722

256.bzip2 1500 365 411

300.twolf 3000 645 465

SPEC int2000 <503>

Godson-2E SPEC CPU2000 RatePrograms Ref time Run time Ratio

168.wupwise 1600 238 672

171.swim 3100 660 469

172.mgrid 1800 579 311

173.applu 2100 549 382

177.mesa 1400 221 634

178.galgel 2900 412 704

179.art 2600 416 624

183.equake 1300 208 624

187.facerec 1900 300 632

188.ammp 2200 432 509

189.lucas 2000 396 506

191.fma3d 2100 531 395

200.sixtrack 1100 345 319

301.apsi 2600 528 493

SPEC fp2000 <503>

Page 7: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

7

Godson-2E and Godson-2F 1.0GHz@90nm CMOS, 5-7W 47M xtors, area 36mm^2 Godson-2 CPU core

64-bit MIPS III compatible Four-issue, OOO 64KB+64KB L1 (four-way) 512KB L2 (four-way)

On-chip DDR controller SysAD Front-end bus

1.0GHz@90nm CMOS, 3-5W 51M xtors, area 43mm^2 Godson-2 CPU core

64-bit MIPS III compatible Four-issue, OOO 64KB+64KB L1 (four-way) 512KB L2 (four-way)

On-Chip DDR2 controller PCI/PCIX, local IO, GPIO, etc. Volume production

Page 8: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

8

Low end roadmap: From CPU to SOC

CPU

NorthBridge

SouthBridge

Graphic

CPU+NB

SouthBridge

Graphic

CPU+NB

GPU+SouthBridge

CPU+GPU+NB+SB

2E 2006 2H 20092F 2007 2G 2008

PCI

PCI PCI/HT

Page 9: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

9

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS

Page 10: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

10

Scalable architecture Reconfigurable CPU core and L2 X86 binary translation optimization Low power consumption >1.0GHz@65nm

Godson-3 Briefs

Page 11: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

11

Scalable Architecture Design Scalable interconnection network

Crossbar + MeshSingle crossbar connects cores, L2s, and four directions

Directory-based cache coherence protocolDistributed L2 caches are globally addressedEach cache block has a directory entry Both data cache and instruction cache are recorded in directory

8x8 Xbar

P0 P1 P2 P3

L2 L2 L2 L2

E

S

W

N

E

S

W

N

Page 12: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

12

Reconfigurable

Architecture

6*6 AXI Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4

m5

s5s1 s2 s3 s4

5*4 AXI Switch

MC1MC0

m0

s0

DM

A C

on

trolle

r

HT

HT

DM

A C

on

trolle

r

PC

I...Shared L2 can be configured as internal RAM; DMA to internal RAM directly

General Purpose Core:64-bit, 4-issue, OOO,AXI interface

Multiple Purpose Core: LINPACK, biology, signal processing, AXI interface

DMA engine supports pre-fetch and matrix transpose

8 configurable address windows of each master port allow pages migration across L2 and memory

Page 13: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

13

GS464 general purpose core MIPS64, 200+ more instructions for X86

binary translation and media acceleration Four-issue superscalar OOO pipeline Two fix, two FP, one memory units Two FP units each supports full pipelined

double/paired-single MAC operation 48-bit VA and PA, 128-bit memory access 64KB icache and 64KB dcache, 4-way 64-entry fully associated TLB, 16-entry

ITLB, variable page size Non-blocking accesses, load speculation Directory-based cache-coherence for CMP Parity check for icache, ECC for dcache EJTAG for debugging Standard 128-bit AXI interface

DMA+AXI Controller

XBAR

1024*64

4W4R

AXI interface

Micro-Controller

Micro-Code Store

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

Processor Interface

BTB

BHT

ITLB

ICache

FixQueue

Reorder Queue

FloatingPoint

RegisterFile

ALU1

ALU2

FPU1

FPU2

FloatQueue

CP0Queue

DCACHE

DTLB

ROQBRQ

IntegerRegister

File

AGU

Write back Bus

Commit Bus

Map Bus

missqueue

Refill Bus

imemread dmemwrite

PC

PC

+16

dmemread, duncacheucqueue wtbkqueue

AXI Interface

EJTAG TAP ControllerTest Controller

JTAG InterfaceTest Interface

GS464 Architecture

Decode

Bus

Branch Bus

clock, reset, int, …

Pre

de

co

de

r

Deco

de

r

Reg

iste

r Ma

pp

er

Tag

Com

pare

GStera multiple purpose core Target for LINPACK, biology

computation, signal processing 8-16 MACs per node Big multi-port register file Reconfigurable based on applications. Standard 128-bit AXI interface

Page 14: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

14

Matrix Transposing Performance

15+times faster

0

50000000

100000000

150000000

200000000

250000000

256x256 512x512 1024x1024

dsp

cpu

Page 15: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

15

Hardware Support for X86 Binary Translation

Define new instructionsX86 ISA function and MIPS ISA formatBinary translation mechanism supporting>200 instructions are defined with 5% additional silicon cost

Speed up X86-to-MIPS binary translation 10 times compared to software only QEMU

649 6121010

421

1643

4178

2312

5430

4851

6743

9032

6086

9980

8616

10212

7954

14350

11611

9964

0

2000

4000

6000

8000

10000

12000

14000

16000

IDCT FFT(FX) FFT(FP) GP EFLAG

No HO, No SO

HO Only

Both

Ideal

MS windows Linux apps. on X86

Linux apps. on MIPS

System level X86 VM Process level X86 VM

Linux on MIPS

Enhanced MIPS decode

Enhanced Godson internal operations

Figure 2. The architecture of Godson-2 virtual machine

Page 16: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

16

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS

Page 17: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

17

Physical implementation

65nm CMOS LP/GP technology Cell-based design methodology

DC + ICC Manual P&R for critical cells

2008: 4-core (4GP + 0MP) + 4MB L2GP: General purpose coreMP: Multiple purpose core10w@1GHz

2009: 8-core (4GP + 4MP) + 4MB L2 20w@1GHz

Page 18: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

18

4-core

20084-core (general purpose core)65nm, 1GHz, 10w

6*6 X1 Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4

m5

s5s1 s2 s3 s4

5*4 X2 Switch

MC1MC0

m0

s0

DM

A C

on

trolle

r

HT

, PC

IE

HT

, PC

IE

DM

A C

on

trolle

r

PC

I...

Page 19: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

19

8-core

X1 Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4

m5

s5s1 s2 s3 s4

X2 Switch

MC0

PCILPC

m0

s0

DM

A C

on

trolle

r

HT

, PC

IE

X1 Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4

m5

s5s1 s2 s3 s4

X2 Switch

MC1

m0

s0

HT

, PC

IE

DM

A C

on

trolle

r

20098-core (4GP + 4MP)65nm, 1GHz, 20w

Page 20: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

20

Physical register file

Size: 321um x 262um

Power: 50mW@1GHz

Delay: 470ps

TLB CAM

Size: 224 um x 235 um

Power: 55mw @ 1GHz

Delay: 550ps

Full Customer Register file and CAM

Page 21: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

21

HT1.0 Driver & Receiver

FlipChip Compatible 2Row design

800mw @ 1.6Gbps

Size: 250um x 300um

Power: < 10mW

Freq: 1.6GHz

Jitter: 20 ps

HyperTransport PHY

Page 22: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

22

TEST CHIP

ST 65nm

1206um x 1206um

Function:

CAM1W1R - BIST

CAM1W1R - Scan

RAM4W4R - BIST

RAM4W4R - Scan

ICT_PLL - Freq. test

HT1.0 - BIST

HT1.0 - Error rate test

Test Chip for Customer Blocks

Page 23: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

23

Cell-based High Performance Physical Design

The Full Hierarchical Design Methodology Manual placement & route for critical paths Manual placement of all FFs and clock buffers,

manual clock gating Architecture optimization with physical feedback

ictreg1

ictreg0

Mul

Div

regfile0

rissuebus2

resbus2

fwdbus2

mapbus0/1/2/3

resbus0

resbus1

resb

us0

resb

us2

fwdb

us2

br_p

c_va

lue

resb

us2/

3

fwdb

us2/

3

jr_ta

rget

alu1

qissbus

rissbus

alu_buf_vsrc

alu2

cam0fxqueue

Page 24: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

24

Clock Tree

H-Tree + Mesh Manual placement

of FFs Manual clock gate

Page 25: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

25

Layout of 4-core Godson-3

Xbar

Xbar

L2L2L2 L2

GS464GS464

GS464 GS464

HT HT

DMA DMA

DDR2/3 DDR2/3

Page 26: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

26

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS

Page 27: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

27

PetaFLOPS and TeraFLOPS

PetaFLOPS for Large Scale ApplicationsTo build PetaFLOPS HPC with Godson-3 in 2010

TeraFLOPS for Personal HPCPutting desktop to pocketsPutting TeraFLOPS to desktopHigh-performance computing for the masses

Page 28: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

28

Scaling down TeraFLOPS

$100K/2007“refrigerator”

$50K/2008“washing machine”

$10K/2009“microwave oven”

TeraFLOPS in 1997

Page 29: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

29

References

The architecture of Godson-2 superscalar architecture is available at: Weiwu Hu, Fuxin Zhang, Zusong Li, “Microarchitecture of the Godson-2

Processor”, Journal of Computer Science and Technology, 20(2):243-249, Mar. 2005

Weiwu Hu, Jiye Zhao, Shiqiang Zhong, Xu Yang, Elio Guidetti, Chris Wu, “Implementing a 1GHz Four-issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology”, Journal of Computer Science and Technology, 22(1):1-14, Jan. 2007

The experiences learning from Godson processor design is available at: Weiwu Hu, Jian Wang, “Making Effective Decisions in Computer

Architects’ Real-World: Lessons and Experiences with Godson-2 Processor Designs”, Journal of Computer Science and Technology, 23(4), July 2008

The architecture of Godson-3 multi-core is available at: HotChip’08

Page 30: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

30

Concluding Remarks

CPU R&D are of national strategic importance Godson-3 has a low-power, scalable architecture Godson-3 will be used to build

client side systemsPetaflops machinesTeraflops systems for the masses

The Godson team at ICT: open cooperation

Page 31: Micro-architecture of Godson-3 Multi-Core Processor of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology (ICT) Chinese

31

Thanks