hc20.26.620.micro-architecture of godson-3 multi-core processor

35
8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 1/35 1 Micro-architecture of Godson-3 Multi-Core Processor Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen Institute of Computing Technology Chinese Academy of Sciences [email protected]

Upload: dahuang

Post on 30-May-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 1/35

1

Micro-architecture of Godson-3Multi-Core Processor

Weiwu Hu, Jian Wang, Xiang Gao, and Yunji ChenInstitute of Computing Technology

Chinese Academy of [email protected]

Page 2: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 2/35

2

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation Preliminary performance evaluation PetaFLOPS and TeraFLOPS

Godson is the academic name of Loongson TM

Page 3: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 3/35

3

National Project High performance CPU is national strategic product National security: CPU is key components

Chinese IT industry is big but not strong: 5.6 trillion RMB in 2007, only22% by domestic companies, 3.75% profits

Godson CPU is supported by

National 863 project National 973 project National Science Foundation of China National key project Key project of Chinese Academy of Sciences

Page 4: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 4/35

4

Godson CPU Briefs

ICT started Godson CPU design in 2001. The 32-bit Godson-1 CPU in 2002 is thefirst general purpose CPU in China.

The 64-bit Godson-2B in 2003.10 The 64-bit Godson-2C in 2004.12 The 64-bit Godson-2E in 2006.03 Each Triple the performance of itsprevious one.

Page 5: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 5/35

5

Godson Development

Intel/AMD/HP/IBM/SGI/Sparc SPEC cpu2000 rate

Godson rate

Page 6: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 6/35

6

Programs Reftime Run time Ratio164.gzip 1400 403 347

175.vpr 1400 273 512

176.gcc 1100 221 497

181.mcf 1800 307 586

186.crafty 1000 167 598

197.parser 1800 472 382

252.eon 1300 188 690

253.perlbmk 1800 354 508

254.gap 1100 240 458

255.vortex 1900 263 722

256.bzip2 1500 365 411300.twolf 3000 645 465

SPEC int2000 <503>

Godson-2E SPEC CPU2000 RatePrograms Ref time Run time Ratio

168.wupwise 1600 238 672

171.swim 3100 660 469

172.mgrid 1800 579 311

173.applu 2100 549 382

177.mesa 1400 221 634

178.galgel 2900 412 704

179.art 2600 416 624

183.equake 1300 208 624

187.facerec 1900 300 632

188.ammp 2200 432 509

189.lucas 2000 396 506191.fma3d 2100 531 395

200.sixtrack 1100 345 319

301.apsi 2600 528 493

SPEC fp2000 <503>

Page 7: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 7/35

7

Godson-2E and Godson-2F 1.0GHz@90nm CMOS, 5-7W 47M xtors, area 36mm^2 Godson-2 CPU Core

64-bit MIPS III Compatible Four-issue, OOO 64KB+64KB L1 (four-way)

512KB L2 (four-way) On-chip DDR Controller SysAD Front-end bus

1.0GHz@90nm CMOS, 3-5W 51M xtors, area 43mm^2 Godson-2 CPU Core

64-bit MIPS III Compatible Four-issue, OOO 64KB+64KB L1 (four-way)

512KB L2 (four-way) On-Chip DDR2 controller. PCI/PCIX, Local IO, GPIO, etc. Volume production

Page 8: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 8/35

Page 9: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 9/35

9

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation Preliminary performance evaluation PetaFLOPS and TeraFLOPS

Page 10: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 10/35

10

Distributed scalable architecture Reconfigurable CPU core and L2 X86 binary translation speedup Low Power Consumption >1.0GHz@65nm

Godson-3 Briefs

Page 11: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 11/35

11

Scalable Architecture Design Scalable interconnection network

Crossbar + Mesh Single crossbar connects cores, L2s, and four directions

Directory-based cache coherence protocol Distributed L2 caches are global addressed Each cache block has a directory entry Both data cache and instruction cache are recorded in directory

8x8 Xbar

P0 P1 P2 P3

L2 L2 L2 L2

E S

W

N

E S W

N

Page 12: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 12/35

12

ReconfigurableArchitecture

6*6 AXI Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4 m5

s5s1 s2 s3 s4

5*4 AXI Switch

MC1MC0

m0 s0

DMA

C on

t r ol l e

r

HT

,P

C I E

HT ,P

C I E

DMA

C on

t r ol l er

P C I ...

Shared L2 can be configuredas internal RAM, DMA tointernal RAM directly

General Purpose Core,64-bit, 4-issue, OOO,AXI interface

Multiple Purpose Core LINPACK, biology, signal

processing, AXI interface

DMA engine supports pre-fetch and matrix

8 configurable addresswindows of each master port allow pagesmigration across L2 andmemory

Page 13: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 13/35

13

Generality vs. Efficiency

Power, area efficiency

Generality

5% 20%-30% 50~60%

multicore

Streamprocessor

manycore

General purpose core +multiple purpose core for wide applicationsGodson-3

Page 14: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 14/35

14

GS464 general purpose core MIPS64, 200+ more instructions for X86

binary translation and media acceleration Four-issue superscalar OOO pipeline Two fix, two FP, one memory units Two FP units each supports full pipelined

double/paired-single MAC operation 48-bit VA and PA, 128-bit memory access 64KB icache and 64KB dcache, 4-way 64-entry fully associated TLB, 16-entry

ITLB, variable page size Non-blocking accesses, load speculation

Directory-based cache-coherence for CMP Parity check for icache, ECC for dcache EJTAG for debugging Standard 128-bit AXI interface

DMA+AXI Controller

XBAR

1024*64

4W4R

AXI interface

Micro-Controller

Micro-CodeStore

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

1024*64

4W4R

Processor Interface

BTB

BHT

ITLB

ICache

FixQueue

Reorder Queue

FloatingPoint

Register File

ALU1

ALU2

FPU1

FPU2

FloatQueue

CP0Queue

DCACHE

DTLB

ROQBRQ

Integer Register

File AGU

Write back Bus Commit Bus

Map Bus

missqueue

Refill Bus imemread dmemwrite

PC

P C + 1

6

dmemread, duncache

ucqueue wtbkqueue AXI Interface EJTAG TAP Controller Test Controller

JTAG Interface Test Interface

GS464 Architecture

D e c o d e B

u s

Branch Bus

clock, reset, int, …

P r e

d e c o d er

D e c o d er

R e gi s t er M

a p p er

T a g C om

p ar e

GStera multiple purpose core Target for LINPACK, biology

computation, signal processing 8-16 MACs per node Big multi-port register file Reconfigurable based on applications. Standard 128-bit AXI interface

Page 15: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 15/35

15

Hardware Support for X86 Binary Translation

Define new instructions X86 ISA function and MIPS ISA format Binary translation mechanism supporting >200 instructions are defined with 5% additional silicon cost

Speedup X86-to-MIPS binary translation by 10 times

MS windows Linux apps. on X86

Linux apps. on MIPS System level X86 VM Process level X86 VM

Linux on MIPS

Enhanced MIPS decode

Enhanced Godson internal operations

Figure 2. The architecture of Godson-2 virtual machine

Page 16: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 16/35

16

Low Power Design

Low leakage process is selected LP/GP mixed HVT/SVT mixed

Manual clock gating regarding architecture Much efficient in reducing power consumptioncompared to P&R tools

Power management

Module level (CPU core, HT, PCIE, DDR2) clock gating Frequency Scaling Temperature Sensor

Page 17: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 17/35

17

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation Preliminary performance evaluation PetaFLOPS and TeraFLOPS

Page 18: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 18/35

18

Physical implementation

65nm CMOS LP/GP technology Cell-based design methodology

DC + ICC Manual P&R for critical cells

2008: 4-core (4GP + 0MP) + 4MB L2 GP: General purpose core MP: Multiple purpose core 10w@1GHz

2009: 8-core (4GP + 4MP) + 4MB L2 20w@1GHz

Page 19: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 19/35

19

4-core

2008 4-core (general purpose core) 65nm, 1GHz, 10w

6*6 X1 Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4 m5

s5s1 s2 s3 s4

5*4 X2 Switch

MC1MC0

m0 s0

DMA

C on

t r ol l er

HT

,P C I E

HT

,P C I E

DMA

C on

t r ol l er

P C I ...

Page 20: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 20/35

20

8-core

X1 Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4 m5

s5s1 s2 s3 s4

X2 Switch

MC0

PCILPC

m0 s0

DMA

C on t r ol l er

HT ,

P C I E

X1 Switch

S0

P0 P1 P2 P3

S1 S2 S3

m1 m2 m3 m4 m5

s5s1 s2 s3 s4

X2 Switch

MC1

m0 s0

HT ,P C I E

DMA

C on t r ol l er

2009 8-core (4GP + 4MP) 65nm, 1GHz, 20w

Page 21: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 21/35

21

Physical register file

Size: 321um x 262um

Power: 50mW@1GHz

Delay: 470ps

TLB CAM

Size: 224 um x 235 um

Power: 55mw @ 1GHz

Delay: 550ps

Full Customer Register file and CAM

Page 22: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 22/35

22

HT1.0 Driver & Receiver

FlipChip Compatible 2Row design

800mw @ 1.6Gbps

Size: 250um x 300um

Power: < 10mW

Freq: 1.6GHz

Jitter: 20 ps

HyperTransport PHY

Page 23: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 23/35

23

TEST CHIP

ST 65nm

1206um x 1206um

Function:

CAM1W1R - BIST

CAM1W1R - Scan

RAM4W4R - BIST

RAM4W4R - Scan ICT_PLL - Freq. test

HT1.0 - BIST

HT1.0 - Error rate test

Test Chip for Customer Blocks

Page 24: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 24/35

24

Cell-based high performance physical design The Full Hierarchical Design Methodology Manual placement & route for critical paths Manual placement of all FFs and clock buffers,manual clock gating Architecture optimization with physical feedback

Page 25: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 25/35

25

Clock Tree

H-Tree + Mesh Manual placementof FFs Manual clock gate

Page 26: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 26/35

26

Layout of 4-core Godson-3

Xbar

Xbar

L2L2L2 L2

GS464GS464

GS464 GS464

PCIE PCIE

HT HT

DDR2/3 DDR2/3

Page 27: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 27/35

27

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation Preliminary performance evaluation PetaFLOPS and TeraFLOPS

Page 28: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 28/35

28

Preliminary Performance Evaluation

Virtual Machine Performance Miscellaneous Performance

Page 29: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 29/35

29

Preliminary Virtual Machine Performance

QEMU@LS2F 800MHz vs. PIII 850MHz Average efficiency 18% (except mcf and art) INT 15%, FP 21%

QEMU@LS2F 800MHz vs. LS2F 800MHz Average efficiency 14% INT 17%, FP 11%

QEMU @LS3 vs. LS2F Average efficiency 27% INT 34%, FP %19

Page 30: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 30/35

30

Miscellaneous Performance

Barrier 4x1: ~500ns 4x2: ~2000ns 4x4: ~10000ns

Matrix Transposing: 15+times faster

Page 31: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 31/35

31

Contents

A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation Preliminary performance evaluation PetaFLOPS and TeraFLOPS

Page 32: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 32/35

32

PetaFLOPS and TeraFLOPS

PetaFLOPS for National Strategy Application To build PetaFLOPS HPC with Godson-3 in 2010.

TeraFLOPS for Personal HPC Putting desktop to pockets Putting TeraFLOPS to desktop: computing for the masses

Page 33: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 33/35

33

Scaling down TeraFLOPS

$100K/2007“refrigerator”

$50K/2008“washing machine”

$10K/2009“microwave oven”

TeraFLOPS in 1997

Page 34: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 34/35

34

References

The architecture of Godson-2 superscalar architecture isavailable at: Weiwu Hu, Fuxin Zhang, Zusong Li, “Microarchitecture of the Godson-2

Processor”, Journal of Computer Science and Technology , 20(2):243-249,Mar. 2005

Weiwu Hu, Jiye Zhao, Shiqiang Zhong, Xu Yang, Elio Guidetti, Chris Wu,“Implementing a 1GHz Four-issue Out-of-Order Execution Microprocessorin a Standard Cell ASIC Methodology”, Journal of Computer Science and Technology , 22(1):1-14, Jan. 2007

The experiences learning from Godson processor design isavailable at:

Weiwu Hu, Jian Wang, “Making Effective Decisions in ComputerArchitects’ Real-World: Lessons and Experiences with Godson-2 ProcessorDesigns”, Journal of Computer Science and Technology , 23(4), July 2008

The architecture of Godson-3 multi-core is available at: HotChip’08

Page 35: HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor

http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 35/35

35

Thanks