8/14/2019 HC20.26.620.Micro-Architecture of Godson-3 Multi-Core Processor
http://slidepdf.com/reader/full/hc2026620micro-architecture-of-godson-3-multi-core-processor 1/35
Micro-architecture of Godson-3 Multi-Core Processor
Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen
Institute of Computing Technology, Chinese Academy of Sciences
Contents
  A brief introduction to Godson processors
  The architecture of Godson-3 multi-core processor
  Physical implementation
  Preliminary performance evaluation
  PetaFLOPS and TeraFLOPS
Godson is the academic name of Loongson(TM)
National Project
  High-performance CPUs are a national strategic product
  National security: CPUs are key components
  The Chinese IT industry is big but not strong: 5.6 trillion RMB in 2007, only 22% by domestic companies, 3.75% profits
Godson CPU is supported by
  National 863 project
  National 973 project
  National Science Foundation of China
  National key project
  Key project of Chinese Academy of Sciences
Godson CPU Briefs
  ICT started Godson CPU design in 2001.
  The 32-bit Godson-1 CPU in 2002 is the first general-purpose CPU in China.
  The 64-bit Godson-2B in 2003.10
  The 64-bit Godson-2C in 2004.12
  The 64-bit Godson-2E in 2006.03
  Each triples the performance of its predecessor.
Godson Development
[Figure: SPEC CPU2000 rate, Intel/AMD/HP/IBM/SGI/Sparc processors compared with Godson]
Godson-2E SPEC CPU2000 Rate

Programs      Ref time  Run time  Ratio
164.gzip      1400      403       347
175.vpr       1400      273       512
176.gcc       1100      221       497
181.mcf       1800      307       586
186.crafty    1000      167       598
197.parser    1800      472       382
252.eon       1300      188       690
253.perlbmk   1800      354       508
254.gap       1100      240       458
255.vortex    1900      263       722
256.bzip2     1500      365       411
300.twolf     3000      645       465
SPEC int2000                      <503>

Programs      Ref time  Run time  Ratio
168.wupwise   1600      238       672
171.swim      3100      660       469
172.mgrid     1800      579       311
173.applu     2100      549       382
177.mesa      1400      221       634
178.galgel    2900      412       704
179.art       2600      416       624
183.equake    1300      208       624
187.facerec   1900      300       632
188.ammp      2200      432       509
189.lucas     2000      396       506
191.fma3d     2100      531       395
200.sixtrack  1100      345       319
301.apsi      2600      528       493
SPEC fp2000                       <503>
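The summary figures in the two tables can be checked with a short calculation. The sketch below assumes the standard SPEC CPU2000 conventions: each ratio is 100 times the reference time divided by the run time, and the overall score is the geometric mean of the per-program ratios. The function names are illustrative only.

```python
import math

# Ratio columns copied from the Godson-2E tables above
INT_RATIOS = [347, 512, 497, 586, 598, 382, 690, 508, 458, 722, 411, 465]
FP_RATIOS = [672, 469, 311, 382, 634, 704, 624, 624, 632, 509, 506, 395, 319, 493]


def spec_ratio(ref_time: float, run_time: float) -> float:
    """SPEC CPU2000 ratio: 100 * reference time / measured run time."""
    return 100.0 * ref_time / run_time


def geometric_mean(values) -> float:
    """Geometric mean computed through logarithms to avoid huge products."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


gzip_ratio = spec_ratio(1400, 403)       # 164.gzip row
int_score = geometric_mean(INT_RATIOS)   # compare with SPEC int2000 <503>
fp_score = geometric_mean(FP_RATIOS)     # compare with SPEC fp2000 <503>
print(int(gzip_ratio), round(int_score), round(fp_score))
```

Both geometric means round to the 503 shown in the tables, which suggests the two summary rows are consistent with the per-program ratios.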
Godson-2E and Godson-2F
Godson-2E
  1.0GHz@90nm CMOS, 5-7W
  47M transistors, area 36mm^2
  Godson-2 CPU core
    64-bit MIPS III compatible
    Four-issue, OOO
    64KB+64KB L1 (four-way)
    512KB L2 (four-way)
  On-chip DDR controller
  SysAD front-end bus
Godson-2F
  1.0GHz@90nm CMOS, 3-5W
  51M transistors, area 43mm^2
  Godson-2 CPU core
    64-bit MIPS III compatible
    Four-issue, OOO
    64KB+64KB L1 (four-way)
    512KB L2 (four-way)
  On-chip DDR2 controller
  PCI/PCIX, Local IO, GPIO, etc.
  Volume production
Godson-3 Briefs
  Distributed scalable architecture
  Reconfigurable CPU core and L2
  X86 binary translation speedup
  Low power consumption
  >1.0GHz@65nm
Scalable Architecture Design
  Scalable interconnection network
    Crossbar + mesh
    A single crossbar connects cores, L2s, and four directions
  Directory-based cache coherence protocol
    Distributed L2 caches are globally addressed
    Each cache block has a directory entry
    Both data cache and instruction cache are recorded in the directory
[Figure: one node, an 8x8 crossbar connecting four cores (P0-P3), four L2 banks, and four mesh directions (E, S, W, N)]
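To make the directory idea concrete, here is a minimal sketch of a per-block directory entry that tracks instruction-cache and data-cache copies separately, as the slide describes. The field layout, the method names, and the invalidate-on-write policy are assumptions chosen for illustration; the slides do not give the real entry format or protocol states.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_CORES = 4  # assumption: four cores per node, as drawn in the figure


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry for one L2 block (not the real hardware format)."""
    icache_sharers: int = 0      # bit i set: core i's instruction cache has a copy
    dcache_sharers: int = 0      # bit i set: core i's data cache has a copy
    owner: Optional[int] = None  # core with exclusive (writable) access, if any

    def record_read(self, core: int, is_instruction: bool) -> None:
        """Record a new shared copy (owner downgrade on reads is omitted here)."""
        if is_instruction:
            self.icache_sharers |= 1 << core
        else:
            self.dcache_sharers |= 1 << core

    def grant_write(self, core: int) -> List[Tuple[int, str]]:
        """Give `core` exclusive access; return the copies that must be invalidated."""
        victims = []
        for c in range(NUM_CORES):
            if (self.icache_sharers >> c) & 1:
                victims.append((c, "icache"))
            if c != core and (self.dcache_sharers >> c) & 1:
                victims.append((c, "dcache"))
        self.icache_sharers = 0
        self.dcache_sharers = 1 << core
        self.owner = core
        return victims


entry = DirectoryEntry()
entry.record_read(0, is_instruction=True)   # core 0 fetches the block as code
entry.record_read(1, is_instruction=False)  # core 1 loads it as data
entry.record_read(2, is_instruction=False)  # core 2 loads it as data
invalidations = entry.grant_write(2)        # core 2 now stores to the block
print(invalidations)
```

Recording instruction-cache copies in the same entry means a store to a block that another core has fetched as code can invalidate that copy in hardware, which is one plausible reason for tracking both caches.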
Reconfigurable Architecture
[Figure: a 6*6 AXI switch (master ports m0-m5, slave ports s0-s5) with cores P0-P3 and shared L2 banks S0-S3; a 5*4 AXI switch with memory controllers MC0 and MC1; DMA controllers and HT, PCIE, PCI interfaces]
  Shared L2 can be configured as internal RAM; DMA to internal RAM directly
  General purpose core: 64-bit, 4-issue, OOO, AXI interface
  Multiple purpose core: LINPACK, biology, signal processing, AXI interface
  DMA engine supports pre-fetch and matrix
  8 configurable address windows on each master port allow page migration across L2 and memory
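The address-window mechanism can be pictured as a small routing table on each master port. The sketch below is only an illustration: the window fields (`base`, `size`, `target`), the first-match priority rule, and the target names are assumptions, since the slides state only that each master port has 8 configurable windows.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_WINDOWS = 8  # from the slide: 8 configurable windows per master port


@dataclass
class AddressWindow:
    base: int    # first address covered by the window
    size: int    # window length in bytes
    target: str  # hypothetical name of the slave that serves the window


class MasterPort:
    def __init__(self) -> None:
        self.windows: List[Optional[AddressWindow]] = [None] * MAX_WINDOWS

    def configure(self, index: int, window: AddressWindow) -> None:
        """Program one of the eight window slots."""
        self.windows[index] = window

    def route(self, addr: int) -> str:
        """Return the target of the first window containing addr (assumed priority)."""
        for w in self.windows:
            if w is not None and w.base <= addr < w.base + w.size:
                return w.target
        raise ValueError(f"address {addr:#x} hits no window")


port = MasterPort()
# Slot 1: a large region served by a memory controller
port.configure(1, AddressWindow(0x0000_0000, 0x1000_0000, "MC0"))
before = port.route(0x0004_2000)

# "Migrate" one 16 KB page into an L2 bank used as internal RAM by
# programming a higher-priority window over the same addresses.
port.configure(0, AddressWindow(0x0004_0000, 0x4000, "L2-S0"))
after = port.route(0x0004_2000)
elsewhere = port.route(0x0005_0000)
print(before, after, elsewhere)
```

In this model, moving a page between L2 and memory needs no change to the addresses software uses; only the window registers are reprogrammed (the data copy itself, for example by DMA, is not modeled).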
Generality vs. Efficiency
[Figure: power and area efficiency plotted against generality; multicore, manycore, and stream processor are marked, with efficiency labels 5%, 20%-30%, and 50~60%]
Godson-3: general purpose core + multiple purpose core for wide applications
GS464 general purpose core
  MIPS64, 200+ more instructions for X86 binary translation and media acceleration
  Four-issue superscalar OOO pipeline
  Two fixed-point, two FP, one memory unit
  Two FP units, each supporting fully pipelined double/paired-single MAC operation
  48-bit VA and PA, 128-bit memory access
  64KB icache and 64KB dcache, 4-way
  64-entry fully associative TLB, 16-entry ITLB, variable page size
  Non-blocking accesses, load speculation
  Directory-based cache coherence for CMP
  Parity check for icache, ECC for dcache
  EJTAG for debugging
  Standard 128-bit AXI interface
[Figure: GS464 Architecture. Blocks: PC, BTB, BHT, ITLB, ICache, Predecoder, Decoder, Register Mapper, Fix Queue, Float Queue, CP0 Queue, Reorder Queue (ROQ), BRQ, Integer Register File, Floating Point Register File, ALU1, ALU2, FPU1, FPU2, AGU, DTLB, DCACHE, Tag Compare, miss queue, ucqueue, wtbkqueue, Processor Interface, AXI Interface, EJTAG TAP Controller, Test Controller. Buses: Decode Bus, Branch Bus, Map Bus, Write back Bus, Commit Bus, Refill Bus.]
GStera multiple purpose core
  Target for LINPACK, biology computation, signal processing
  8-16 MACs per node
  Big multi-port register file
  Reconfigurable based on applications
  Standard 128-bit AXI interface
[Figure: GStera. Blocks: DMA+AXI Controller, XBAR, eight 1024*64 4W4R register banks, Micro-Controller, Micro-Code Store, AXI interface.]
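The MAC counts above translate into rough peak floating-point rates. The following is a back-of-the-envelope sketch, not a figure from the slides: it assumes a 1 GHz clock, one multiply-accumulate issued per MAC unit per cycle, 2 floating-point operations per MAC, and that a GStera node runs at the same clock as a GS464 core.

```python
def peak_gflops(mac_units: int, freq_ghz: float, flops_per_mac: int = 2) -> float:
    """Peak rate if every MAC unit completes one multiply-accumulate per cycle."""
    return mac_units * flops_per_mac * freq_ghz


FREQ_GHZ = 1.0  # assumption: the 1GHz clock quoted elsewhere in the slides

gs464_core = peak_gflops(mac_units=2, freq_ghz=FREQ_GHZ)  # two FP units per core
four_gs464 = 4 * gs464_core                               # four general purpose cores
gstera_low = peak_gflops(mac_units=8, freq_ghz=FREQ_GHZ)  # 8 MACs per node
gstera_high = peak_gflops(mac_units=16, freq_ghz=FREQ_GHZ)  # 16 MACs per node

print(gs464_core, four_gs464, gstera_low, gstera_high)
```

These are double-precision peak numbers under the stated assumptions; paired-single operation would double the single-precision rate, and sustained performance on real workloads would be lower.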
Hardware Support for X86 Binary Translation
  Define new instructions
    X86 ISA function and MIPS ISA format
    More than 200 instructions supporting the binary translation mechanism are defined, at 5% additional silicon cost
  Speed up X86-to-MIPS binary translation by 10 times
[Figure 2. The architecture of Godson-2 virtual machine. Labels: MS Windows; Linux apps on X86; Linux apps on MIPS; System-level X86 VM; Process-level X86 VM; Linux on MIPS; Enhanced MIPS decode; Enhanced Godson internal operations]
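One way to see why instructions with "X86 ISA function and MIPS ISA format" could help is to look at what a translator must do without them. The sketch below models a 32-bit x86 ADD together with a subset of the flags it sets (CF, ZF, SF, OF); on a plain MIPS target each flag would cost extra instructions after the add itself. Treating flag computation as a target of the new instructions is an assumption made for illustration, since the slides do not list the instructions.

```python
MASK32 = 0xFFFFFFFF


def x86_add32(a: int, b: int):
    """Return (result, flags) for a 32-bit x86 ADD, computed step by step."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "CF": full > MASK32,          # carry out of bit 31
        "ZF": result == 0,            # result is zero
        "SF": bool(result >> 31),     # sign bit of the result
        # signed overflow: operands share a sign that the result does not
        "OF": bool(((~(a ^ b) & (a ^ result)) >> 31) & 1),
    }
    return result, flags


overflow_case = x86_add32(0x7FFFFFFF, 1)  # largest positive int + 1
carry_case = x86_add32(0xFFFFFFFF, 1)     # wraps around to zero
print(overflow_case)
print(carry_case)
```

Each dictionary entry stands for work a software-only translator would emit after every flag-setting x86 instruction; an instruction that produces the result and the flags together removes that overhead.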
Low Power Design
  Low-leakage process is selected
    LP/GP mixed
    HVT/SVT mixed
  Manual clock gating based on the architecture
    Much more efficient at reducing power consumption than P&R tools
  Power management
    Module-level (CPU core, HT, PCIE, DDR2) clock gating
    Frequency scaling
    Temperature sensor
Physical implementation
  65nm CMOS LP/GP technology
  Cell-based design methodology
    DC + ICC
    Manual P&R for critical cells
  2008: 4-core (4GP + 0MP) + 4MB L2
    GP: general purpose core
    MP: multiple purpose core
    10W@1GHz
  2009: 8-core (4GP + 4MP) + 4MB L2
    20W@1GHz
4-core
2008 4-core (general purpose core), 65nm, 1GHz, 10W
[Figure: 4-core block diagram: cores P0-P3 and L2 banks S0-S3 on a 6*6 X1 switch (master ports m0-m5, slave ports s0-s5); a 5*4 X2 switch with memory controllers MC0 and MC1; DMA controllers and HT, PCIE, PCI interfaces]
8-core
2009 8-core (4GP + 4MP), 65nm, 1GHz, 20W
[Figure: 8-core block diagram: two nodes, each with P0-P3 and S0-S3 on an X1 switch (master ports m0-m5, slave ports s0-s5), an X2 switch, a DMA controller, and HT/PCIE; one node has memory controller MC0 plus PCI and LPC, the other has MC1]
Full-Custom Register File and CAM
  Physical register file
    Size: 321um x 262um
    Power: 50mW@1GHz
    Delay: 470ps
  TLB CAM
    Size: 224um x 235um
    Power: 55mW@1GHz
    Delay: 550ps
HyperTransport PHY
  HT1.0 Driver & Receiver
    Flip-chip compatible 2-row design
    800mW @ 1.6Gbps
  [Unlabeled block]
    Size: 250um x 300um
    Power: < 10mW
    Freq: 1.6GHz
    Jitter: 20ps
Test Chip for Custom Blocks
  TEST CHIP
    ST 65nm
    1206um x 1206um
  Function:
    CAM1W1R - BIST
    CAM1W1R - Scan
    RAM4W4R - BIST
    RAM4W4R - Scan
    ICT_PLL - Freq. test
    HT1.0 - BIST
    HT1.0 - Error rate test
Cell-based high performance physical design
  The full hierarchical design methodology
  Manual placement & route for critical paths
  Manual placement of all FFs and clock buffers, manual clock gating
  Architecture optimization with physical feedback
Clock Tree
  H-Tree + Mesh
  Manual placement of FFs
  Manual clock gate
Layout of 4-core Godson-3
[Figure: die layout with four GS464 cores, four L2 blocks, two Xbars, two PCIE, two HT, and two DDR2/3 interfaces]
Preliminary Performance Evaluation
  Virtual Machine Performance
  Miscellaneous Performance
Preliminary Virtual Machine Performance
  QEMU@LS2F 800MHz vs. PIII 850MHz
    Average efficiency 18% (except mcf and art)
    INT 15%, FP 21%
  QEMU@LS2F 800MHz vs. LS2F 800MHz
    Average efficiency 14%
    INT 17%, FP 11%
  QEMU@LS3 vs. LS2F
    Average efficiency 27%
    INT 34%, FP 19%
Miscellaneous Performance
  Barrier
    4x1: ~500ns
    4x2: ~2000ns
    4x4: ~10000ns
  Matrix transposing: 15+ times faster
PetaFLOPS and TeraFLOPS
  PetaFLOPS for national strategy applications
    To build a PetaFLOPS HPC with Godson-3 in 2010.
  TeraFLOPS for personal HPC
    Putting desktop to pockets
    Putting TeraFLOPS to desktop: computing for the masses
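A rough sense of scale for the PetaFLOPS goal can be had from peak arithmetic alone. The sketch below is an estimate under stated assumptions, not a figure from the talk: a 1 GHz clock, 2 floating-point operations per MAC per cycle, 2 MAC units per general purpose (GP) core, 8 to 16 MACs per multiple purpose (MP) core, and no allowance for interconnect or sustained-efficiency losses.

```python
FREQ_GHZ = 1        # assumed clock, in GHz
FLOPS_PER_MAC = 2   # one multiply plus one add
GP_MACS = 2         # two FP MAC units per general purpose core


def chip_peak_gflops(gp_cores: int, mp_cores: int, macs_per_mp: int) -> int:
    """Peak GFLOPS of one chip from its GP and MP core counts."""
    gp = gp_cores * GP_MACS * FLOPS_PER_MAC * FREQ_GHZ
    mp = mp_cores * macs_per_mp * FLOPS_PER_MAC * FREQ_GHZ
    return gp + mp


def chips_needed(target_gflops: int, per_chip_gflops: int) -> int:
    """Ceiling division: smallest chip count whose combined peak meets the target."""
    return -(-target_gflops // per_chip_gflops)


PETAFLOPS_IN_GFLOPS = 1_000_000

configs = {
    "4GP + 0MP": chip_peak_gflops(4, 0, 0),
    "4GP + 4MP, 8 MACs each": chip_peak_gflops(4, 4, 8),
    "4GP + 4MP, 16 MACs each": chip_peak_gflops(4, 4, 16),
}
counts = {name: chips_needed(PETAFLOPS_IN_GFLOPS, g) for name, g in configs.items()}
for name in configs:
    print(name, configs[name], "GFLOPS peak,", counts[name], "chips")
```

Under these assumptions the multiple purpose cores cut the chip count for a peak PetaFLOPS by roughly a factor of five to nine compared with a GP-only chip.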
Scaling down TeraFLOPS
  TeraFLOPS in 1997
  $100K/2007 “refrigerator”
  $50K/2008 “washing machine”
  $10K/2009 “microwave oven”
References
  The Godson-2 superscalar architecture is described in:
    Weiwu Hu, Fuxin Zhang, Zusong Li, “Microarchitecture of the Godson-2 Processor”, Journal of Computer Science and Technology, 20(2):243-249, Mar. 2005
    Weiwu Hu, Jiye Zhao, Shiqiang Zhong, Xu Yang, Elio Guidetti, Chris Wu, “Implementing a 1GHz Four-issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology”, Journal of Computer Science and Technology, 22(1):1-14, Jan. 2007
  The experience gained from Godson processor design is described in:
    Weiwu Hu, Jian Wang, “Making Effective Decisions in Computer Architects’ Real-World: Lessons and Experiences with Godson-2 Processor Designs”, Journal of Computer Science and Technology, 23(4), July 2008
  The architecture of the Godson-3 multi-core is available at: HotChip’08
Thanks