Lecture on High Performance Processor Architecture (CS05162)
TRIPS: A Polymorphous Tiled Architecture

Ren Yongqing (renyq@mail.ustc.edu.cn)
Fall 2007, University of Science and Technology of China
Department of Computer Science and Technology
23424 USTC CS AN Hong 2
Outline
- Background
- TRIPS ISA & Architecture Introduction
- TRIPS Compile & Schedule
- TRIPS Polymorphous
- TRIPS ASIC Implementation
- Problems
Key Trends
- Power: frequency increases are slowing (stopping)
- Wire delays
- Reliability
- Reconfigurability

(Figure: wire-delay trend across process nodes (130 nm, 90 nm, 65 nm, 35 nm) on a 20 mm die.)
SuperScalar Core
Conventional Microarchitectures
TRIPS Project Goals
- Technology-scalable processor and memory architectures
  - Techniques to scale to 35 nm and beyond
  - Enable high clock rates if desired
  - High design productivity through replication
- Good performance across diverse workloads
  - Exploit instruction-, thread-, and data-level parallelism
  - Work with standard programming models
- Power-efficient instruction-level parallelism
- Demonstrate via custom hardware prototype
  - Implement with a small design team
  - Evaluate, identify bottlenecks, tune the microarchitecture
Key Features
- EDGE ISA
  - Block-oriented instruction set architecture
  - Helps reduce bottlenecks and expose ILP
- Tiled microarchitecture
  - Modular design
  - No global wires
- TRIPS processor
  - Distributed processor design
  - Dataflow graph execution engine
- NUCA L2 cache
  - Distributed cache design
TRIPS Chip
- 2 TRIPS processors
- NUCA L2 cache
  - 1 MB, 16 banks
- On-Chip Network (OCN)
  - 2D mesh network
  - Replaces the on-chip bus
- Misc. controllers
  - 2 DDR SDRAM controllers
  - 2 DMA controllers
  - External bus controller
  - C2C network controller
TRIPS Processor
Want an aggressive general-purpose processor:
- Up to 16 instructions per cycle
- Up to 4 loads and stores per cycle
- Up to 64 outstanding L1 data cache misses
- Up to 1024 dynamically executed instructions
- Up to 4 simultaneous multithreading (SMT) threads

But existing microarchitectures don't scale well:
- Structures become large, multi-ported, and slow
- Lots of overhead to convert from sequential instruction semantics
- Vulnerable to speculation hazards

TRIPS introduces a new microarchitecture and ISA.
EDGE ISA: Explicit Data Graph Execution
- Block-oriented
  - Atomically fetch, execute, and commit whole blocks of instructions
  - Programs are partitioned into blocks
  - Each block holds dozens of instructions
  - Sequential execution semantics at the block level
  - Dataflow execution semantics inside each block
- Direct target encoding
  - Encode instructions so that results go directly to the instruction(s) that will consume them
  - No need to go through a centralized register file and rename logic
Block Formation
- Basic blocks are often too small (just a few instructions)
- Predication allows larger hyperblocks to be created
- Loop unrolling and function inlining also help
- TRIPS blocks can hold up to 128 instructions
- Large blocks improve fetch bandwidth and expose ILP
- Hard-to-predict branches can sometimes be hidden inside a hyperblock
TRIPS Block Format
- Each block is formed from two to five 128-byte program "chunks"
- Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
- The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions
- Each instruction chunk holds 32 4-byte instructions (including NOPs)
- A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions

(Figure: the PC points at a block of five 128-byte chunks: a header chunk followed by instruction chunks 0 through 3.)
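The chunk arithmetic above can be sketched directly; `block_chunks` and `icache_footprint` are hypothetical helper names for illustration, not part of any TRIPS toolchain:

```python
def block_chunks(num_insts: int) -> int:
    """Chunks a block occupies in memory: one 128-byte header chunk
    plus one 128-byte instruction chunk per 32 instructions."""
    assert 1 <= num_insts <= 128
    inst_chunks = -(-num_insts // 32)   # ceiling division
    return 1 + inst_chunks              # 2..5 chunks total

def icache_footprint(num_insts: int) -> int:
    """The L1 I-cache always expands a block to five chunks."""
    return 5 * 128

# e.g. a 40-instruction block needs 3 chunks (384 bytes) in memory,
# but occupies 640 bytes once expanded in the L1 I-cache.
```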
Processor Tiles
Partition all major structures into banks, distribute, and interconnect:
- Execution Tile (E)
  - 64-entry instruction queue bank
  - Single-issue execute pipeline
- Register Tile (R)
  - 32-entry register bank (per thread)
- Data Tile (D)
  - 8 KB data cache bank
  - LSQ and MHU banks
- Instruction Tile (I)
  - 16 KB instruction cache bank
- Global Control Tile (G)
  - Tracks up to 8 blocks of instructions
  - Branch prediction & resolution logic
Grid Processor Tiles and Interfaces

(Figure: the tile grid: a top row of one G tile and four R tiles, and four rows each holding one D tile and four E tiles, with an I tile at the end of every row. The tiles are connected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network).)
Mapping TRIPS Blocks to the Microarchitecture

(Figure: each D-tile holds 32 LSID entries per frame (frames 0 through 7); each R-tile holds the architecture register files and read/write queues per frame and per thread (threads 0 through 3); each E-tile holds instruction reservation stations (instruction, OP1, OP2) with 8 slots per frame.)
Mapping TRIPS Blocks to the Microarchitecture

(Figure: block i's header chunk maps to the read/write queues and LSID space, and its instruction chunks 0 through 3 map to reservation-station slots; block i is mapped into frame 0.)
Mapping TRIPS Blocks to the Microarchitecture

(Figure: the same mapping applied to the next block: block i+1 is mapped into frame 1.)
Mapping Target Identifiers to Reservation Stations
An ISA target identifier encodes a type (2 bits), Y coordinate (2 bits), X coordinate (2 bits), and slot (3 bits); the 3-bit frame number is assigned by the G-tile at runtime.

(Figure: target = 87, OP1 in block 4 decodes to frame 100, slot 101, operand field 10 = OP1, selecting a reservation station in the E-tile at grid position [10, 11].)
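As a sketch, the fields above can be packed and unpacked as follows; only the field widths come from the slide, while the exact bit ordering inside a TRIPS instruction word is an assumption here:

```python
def encode_target(ttype: int, y: int, x: int, slot: int) -> int:
    """Pack the static ISA target fields as type(2) | Y(2) | X(2) | slot(3).
    The 3-bit frame is not encoded; the G-tile appends it at runtime."""
    assert ttype < 4 and y < 4 and x < 4 and slot < 8
    return (ttype << 7) | (y << 5) | (x << 3) | slot

def decode_station(target: int, frame: int):
    """Return (y, x, slot, frame, operand_type) for a runtime target."""
    slot = target & 0x7
    x = (target >> 3) & 0x3
    y = (target >> 5) & 0x3
    ttype = (target >> 7) & 0x3
    return y, x, slot, frame, ttype
```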
Block Fetch
- Fetch commands are sent to each instruction cache bank
- The fetch pipeline is from 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into instruction queue banks (chosen by the compiler)
- The EDGE ISA allows instructions to be fetched out of order
Block Execution
- Instructions execute (out of order) when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to the register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
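A minimal sketch of that circular frame management, assuming a hypothetical `FrameManager` that partitions the 8 frames evenly (1 thread x 8 frames for D-morph, 4 threads x 2 frames for T-morph):

```python
from collections import deque

class FrameManager:
    """G-tile-style frame allocation over 8 frames (sketch)."""
    def __init__(self, threads: int = 1):
        per = 8 // threads                      # D-morph: 8; T-morph: 2
        self.free = {t: deque(range(t * per, (t + 1) * per))
                     for t in range(threads)}
        self.inflight = {t: deque() for t in range(threads)}

    def fetch_block(self, t: int):
        """Allocate the next frame for thread t (None if all are busy)."""
        if not self.free[t]:
            return None
        f = self.free[t].popleft()
        self.inflight[t].append(f)
        return f

    def commit_oldest(self, t: int):
        """Blocks commit in order; the oldest frame is recycled."""
        self.free[t].append(self.inflight[t].popleft())
```

In the D-morph configuration, `FrameManager(1)` lets a single thread keep 8 speculative blocks in flight before fetch must stall.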
NUCA L2 Cache
- The prototype has a 1 MB L2 cache divided into sixteen 64 KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
Compiling for TRIPS
The front end is your standard compiler (you've seen this before): C and FORTRAN front ends feed inlining, loop unrolling, and flattening, then scalar optimizations, then code generation (Alpha, SPARC, PPC, TRIPS). The TRIPS-specific back end then performs:
- TRIPS block formation
- Register allocation (splitting for spill code)
- Peephole optimization, load/store ID assignment, store nullification
- Block splitting
- Scheduling and assembly
Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance.
- O4: hyperblocks as TRIPS blocks.

(Figure: a control-flow graph of basic blocks B1 through B7 becomes 7 TRIPS blocks under O3 but a single TRIPS block under O4.)
Size Analysis
How big is this block? 3 instructions? 5 instructions? More?

    read sp, g1
    movi t3, 1
    store 384(sp), t3
    store 8(sp), t3

The maximum immediate is 256, and immediate instructions have only one target, so the block expands:

    read sp, g1
    movi t3, 1
    mov t4, t3
    addi t7, sp, 256
    store 128(t7), t4
    store 8(sp), t4
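The offset rewrite above can be expressed as a tiny legalization pass; `legalize_store` is a hypothetical helper that handles only the single split shown on the slide:

```python
def legalize_store(offset: int, base: str, val: str, tmp: str = "t7"):
    """If the store offset exceeds the maximum immediate (256),
    materialize a new base with addi and store relative to it."""
    MAX_IMM = 256
    if offset <= MAX_IMM:
        return [f"store {offset}({base}), {val}"]
    # One splitting step, as on the slide; larger offsets would chain.
    return [f"addi {tmp}, {base}, {MAX_IMM}",
            f"store {offset - MAX_IMM}({tmp}), {val}"]
```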
Too Big? Block Splitting
What if the block is too large?
- Predicated blocks: reverse if-convert
- Unpredicated (basic blocks): insert a branch and label

Before:

    L0: read t4, g10
        addi t3, t4, 16
        load t5, 0(t3)
        ...
        subi t7, t5, 8
        mult t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

After splitting:

    L0: read t4, g10
        addi t3, t4, 16
        load t5, 0(t3)
        ...
        branch L1
        write g11, t5

    L1: read t4, g10
        read t5, g11
        subi t7, t5, 8
        mult t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2
Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts, SPEC2000:

    Benchmark   Alpha   TRIPS (O4)
    applu         247      331
    apsi         1326      196
    gcc          4490     6622
    mesa         2614     3821
    mgrid         366       77
    sixtrack      494      220

Total spills (18 Alpha vs. 6 TRIPS); average spill: 1 store for 2-3 loads.
Block Termination Constraint
- Block termination: constant output per block
  - A constant number of stores and writes execute
  - One branch
  - Simplifies the hardware logic for detecting block completion
- All writes complete
  - Write nullification
- All stores complete
  - Store nullification
  - LSID assignment
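Completion detection can be sketched as a predicate over the header's store mask; the function and argument names here are illustrative, not taken from the TRIPS hardware:

```python
def block_done(store_mask: int, stores_seen: set, writes_expected: int,
               writes_seen: int, branch_seen: bool) -> bool:
    """A block completes when its one branch has resolved, all register
    writes have arrived, and every store named in the header's 32-bit
    store mask has issued (or been nullified, which also marks its LSID)."""
    all_stores = all(store_mask & (1 << lsid) == 0 or lsid in stores_seen
                     for lsid in range(32))
    return branch_seen and writes_seen == writes_expected and all_stores
```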
TRIPS Scheduling Problem

(Figure: a flowgraph of hyperblocks built from ld/shl/add/sw/br, add/add/ld/cmp/br, sub/shl/ld/cmp/br, ld/add/add/sw/br, and sw/sw/add/cmp/br chains, to be mapped onto the execution grid between the register file and the data caches.)

- Place instructions on the 4x4x8 grid
- Encode placement in target form
Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]:
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
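A toy version of such a scheduler, assuming a plain Python dependence graph; it keeps only two of the listed heuristics (greedy top-down placement and critical-path priority) and omits reprioritization and the utilization/locality terms:

```python
def list_schedule(deps):
    """deps: {inst: [producers]}. Greedily place instructions on a
    4x4 ALU grid, longest dependence chain first, choosing the free
    slot that minimizes Manhattan distance (OPN hops) to producers."""
    consumers = {i: [] for i in deps}
    for i, ps in deps.items():
        for p in ps:
            consumers[p].append(i)

    height = {}
    def h(i):   # length of the longest chain below i (critical path)
        if i not in height:
            height[i] = 1 + max((h(c) for c in consumers[i]), default=0)
        return height[i]

    free = [(r, c) for r in range(4) for c in range(4)]
    placed = {}
    for i in sorted(deps, key=h, reverse=True):   # producers come first
        def cost(slot):
            return sum(abs(slot[0] - placed[p][0]) + abs(slot[1] - placed[p][1])
                       for p in deps[i] if p in placed)
        best = min(free, key=cost)
        placed[i] = best
        free.remove(best)
    return placed
```

Because a producer's chain is always longer than its consumer's, sorting by chain length guarantees producers are placed before their consumers, so the distance cost is always well defined.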
TRIPS Polymorphous: Different Levels of Parallelism
- Instruction-level parallelism [Nagarajan et al., MICRO '01]
  - Populate a large instruction window with useful instructions
  - Schedule instructions to optimize communication and concurrency
- Thread-level parallelism
  - Partition the instruction window among different threads
  - Reduce contention for instruction and data supply
- Data-level parallelism
  - Provide a high density of computational elements
  - Provide high bandwidth to/from data memory
TRIPS Configurable Resources
Aggregating Reservation Stations: Frames
Extracting ILP: Frames for Speculation
Configuring Frames for TLP
Using Frames for DLP
Configuring Data Memory for DLP
Configuring Data Memory for DLP
- Regular data accesses
- A subset of the L2 cache banks is configured as an SRF (stream register file)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
Performance Results
- ILP (instruction window occupancy)
  - Peak: 4x4x128 array = 2048 instructions
  - Sustained: 493 for SPEC INT, 1412 for SPEC FP
  - Bottleneck: branch prediction
- TLP (instruction and data supply)
  - Peak: 100% efficiency
  - Sustained: 87% for two threads, 61% for four threads
- DLP (data supply bandwidth)
  - Peak: 16 ops/cycle
  - Sustained: 6.9 ops/cycle
ASIC Implementation
- 130 nm, 7LM IBM ASIC process
- 335 mm^2 die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006
Functional Area Breakdown
TRIPS Summary
- Distributed microarchitecture
  - Acknowledges and tolerates wire delay
  - Scalable protocols tailored for distributed components
- Tiled microarchitecture
  - Simplifies scaling
  - Improves design productivity
- Coarse-grained, homogeneous approach with polymorphism
  - ILP: a well-partitioned, powerful uniprocessor (GPA)
  - TLP: divide the instruction window among different threads
  - DLP: mapping reuse of instructions and constants in the grid
Problems
- Scalability
  - Larger grid -> more communication latency
  - ILP, DLP, TLP?
  - Multicore / manycore?
- Compatibility
  - Instruction & block code
- Low-efficiency architecture
  - Instruction buffer
  - I-cache bank, GDN
  - Operand network
  - LSQ & read/write queues
- Polymorphism
  - How to realize DLP?
Scalable?
- Larger grid

(Figure: the tile-grid and network diagram again (G/R/D/E/I tiles connected by the GDN, GSN, GCN, and OPN), considered at a larger scale.)
Scalable
- Multicore / manycore
Compatibility
Low-efficiency architecture
(Figure: link utilization for SPEC CPU 2000.)
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 2
Outline
Background TRIPS ISA amp Architecture introduction TRIPS Compile amp Schedule TRIPS Polymorphous TRIPS ASIC Implementation Problem
23424 USTC CS AN Hong 3
Key Trends
Power Slowing (stopping) frequency increases Wire Delays Reliability Reconfigurable
90 nm
65 nm
35 nm
130 nm
20 mm
23424 USTC CS AN Hong 4
SuperScalar Core
23424 USTC CS AN Hong 5
Conventional Microarchitectures
23424 USTC CS AN Hong 6
TRIPS Project Goals Technology scalable processor and memory
architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication
Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models
Power-efficient instruction level parallelism Demonstrate via custom hardware prototype
minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture
23424 USTC CS AN Hong 7
Key Features EDGE ISA
minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP
Tiled Microarchitectureminus Modular designminus No global wires
TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine
NUCA L2 Cacheminus Distributed cache design
23424 USTC CS AN Hong 8
TRIPS Chip
2 TRIPS Processors NUCA L2 Cache
minus 1 MB 16 banks
On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus
Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller
23424 USTC CS AN Hong 9
TRIPS Processor Want an aggressive general-purpose processor
minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads
But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction
semanticsminus Vulnerable to speculation hazards
TRIPS introduces a new microarchitecture and ISA
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 3
Key Trends
Power Slowing (stopping) frequency increases Wire Delays Reliability Reconfigurable
90 nm
65 nm
35 nm
130 nm
20 mm
23424 USTC CS AN Hong 4
SuperScalar Core
23424 USTC CS AN Hong 5
Conventional Microarchitectures
23424 USTC CS AN Hong 6
TRIPS Project Goals Technology scalable processor and memory
architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication
Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models
Power-efficient instruction level parallelism Demonstrate via custom hardware prototype
minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture
23424 USTC CS AN Hong 7
Key Features EDGE ISA
minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP
Tiled Microarchitectureminus Modular designminus No global wires
TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine
NUCA L2 Cacheminus Distributed cache design
23424 USTC CS AN Hong 8
TRIPS Chip
2 TRIPS Processors NUCA L2 Cache
minus 1 MB 16 banks
On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus
Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller
23424 USTC CS AN Hong 9
TRIPS Processor Want an aggressive general-purpose processor
minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads
But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction
semanticsminus Vulnerable to speculation hazards
TRIPS introduces a new microarchitecture and ISA
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation
- Basic blocks are often too small (just a few instructions)
- Predication allows larger hyperblocks to be created
- Loop unrolling and function inlining also help
- TRIPS blocks can hold up to 128 instructions
- Large blocks improve fetch bandwidth and expose ILP
- Hard-to-predict branches can sometimes be hidden inside a hyperblock
TRIPS Block Format
- Each block is formed from two to five 128-byte program "chunks"
- Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
- The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions
- Each instruction chunk holds 32 4-byte instructions (including NOPs)
- A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions
[Figure: a block in memory. The PC points at the header chunk, followed by instruction chunks 0-3; each chunk is 128 bytes.]
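As a quick sanity check on the numbers above, here is a minimal Python model of the chunk arithmetic. The constants come from the slide; the function names are illustrative, not part of any TRIPS tool:

```python
CHUNK_BYTES = 128
INSTS_PER_CHUNK = 32   # 32 four-byte instructions per chunk
MAX_INST_CHUNKS = 4    # a block = 1 header chunk + up to 4 instruction chunks

def block_bytes_in_icache(inst_chunks: int) -> int:
    """Blocks are padded out to the full five chunks in the L1 I-cache."""
    assert 1 <= inst_chunks <= MAX_INST_CHUNKS
    return (1 + MAX_INST_CHUNKS) * CHUNK_BYTES  # always 640 bytes

def max_regular_instructions() -> int:
    """A maximally sized block: 4 instruction chunks of 32 instructions."""
    return MAX_INST_CHUNKS * INSTS_PER_CHUNK    # 128
```

So even a two-chunk block occupies 640 bytes of I-cache, which is why small blocks waste fetch bandwidth.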
Processor Tiles
Partition all major structures into banks, then distribute and interconnect them.
Execution Tile (E)
- 64-entry instruction queue bank
- Single-issue execute pipeline
Register Tile (R)
- 32-entry register bank (per thread)
Data Tile (D)
- 8 KB data cache bank
- LSQ and MHU banks
Instruction Tile (I)
- 16 KB instruction cache bank
Global Control Tile (G)
- Tracks up to 8 blocks of instructions
- Branch prediction & resolution logic
Grid Processor Tiles and Interfaces
[Figure: the tile grid (one G tile, a row of R tiles, columns of I and D tiles, and a 4x4 array of E tiles) connected by the global dispatch network (GDN), global status network (GSN), global control network (GCN), and operand network (OPN).]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: each D-tile holds 32 LSQ entries per frame (indexed by LSID); each R-tile holds read/write queues per frame plus architecture register files per thread; each E-tile holds instruction reservation stations, 8 slots by 8 frames, each entry with an instruction and two operand fields (OP1, OP2).]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: the header chunk and instruction chunks 0-3 of block i are distributed across the tiles and mapped into Frame 0.]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: block i+1 is mapped into Frame 1 while block i remains resident in Frame 0.]
Mapping Target Identifiers to Reservation Stations
An ISA target identifier contains Type (2 bits), Y (2 bits), X (2 bits), and Slot (3 bits); the 3-bit Frame field is assigned by the G-tile at runtime.
[Figure: for Target = 87 (OP1), the Y/X bits select E-tile [10,11] in the grid, and the remaining bits decode as Frame 100 (4), Slot 101 (5), OP 10 (operand 1), identifying one reservation station entry.]
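The field widths above come from the slide (Type 2 bits, Y 2 bits, X 2 bits, Slot 3 bits, plus the runtime-assigned 3-bit Frame). The packing order in this Python sketch is an assumption made for illustration, not the documented TRIPS encoding:

```python
# Field widths per the slide; the bit ordering below is assumed.
FIELDS = [("frame", 3), ("slot", 3), ("x", 2), ("y", 2), ("type", 2)]

def pack(**vals) -> int:
    """Pack named fields into one identifier word, most significant first."""
    word = 0
    for name, width in FIELDS:
        v = vals[name]
        assert 0 <= v < (1 << width), f"{name} out of range"
        word = (word << width) | v
    return word

def unpack(word: int) -> dict:
    """Recover the fields, reading from the least significant end."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

# The slide's example: Frame 100b = 4, Slot 101b = 5, operand OP1 (10b).
ident = pack(frame=0b100, slot=0b101, x=0b11, y=0b10, type=0b10)
decoded = unpack(ident)
```

The roundtrip recovers Frame 4, Slot 5, and operand type 2 (OP1), matching the decode shown in the figure.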
Block Fetch
- Fetch commands are sent to each instruction cache bank
- The fetch pipeline is from 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into instruction queue banks (chosen by the compiler)
- The EDGE ISA allows instructions to be fetched out of order
Block Execution
- Instructions execute (out of order) when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to the register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement
Block Execution Timeline
[Figure: timeline in cycles (0, 5, 10, 30, 40). Blocks Bi through Bi+8 occupy Frames 0-7; each block is fetched, executes for a variable time, then commits, with execute and commit overlapped across multiple blocks.]
The G-tile manages frames as a circular buffer:
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
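The circular-buffer frame management described above can be sketched as follows. This is a toy model with invented names (FrameManager, allocate, commit), not the G-tile's actual logic:

```python
from collections import deque

class FrameManager:
    """Toy sketch of the G-tile's frame ring: 8 frames split among threads."""
    def __init__(self, threads: int, total_frames: int = 8):
        per_thread = total_frames // threads
        # Each thread owns a fixed, contiguous slice of the frame space.
        self.free = {t: deque(range(t * per_thread, (t + 1) * per_thread))
                     for t in range(threads)}

    def allocate(self, thread: int):
        """Map the thread's next (speculative) block into its oldest free frame."""
        return self.free[thread].popleft() if self.free[thread] else None

    def commit(self, thread: int, frame: int):
        """A committed block returns its frame to the tail of the ring."""
        self.free[thread].append(frame)

d_morph = FrameManager(threads=1)   # D-morph: one thread gets all 8 frames
t_morph = FrameManager(threads=4)   # T-morph: four threads, 2 frames each
```

In the D-morph all eight frames feed one thread's speculation depth; in the T-morph the same storage is partitioned, which is the essence of the polymorphism discussed later.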
NUCA L2 Cache
- The prototype has a 1 MB L2 cache divided into sixteen 64 KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
Compiling for TRIPS
[Figure: compiler flow. C and FORTRAN frontends feed inlining, loop unrolling, flattening, and scalar optimizations ("your standard compiler; you've seen this before"), then code generation for Alpha, SPARC, PPC, or TRIPS. The TRIPS backend adds: TRIPS block formation; register allocation with splitting for spill code; peephole and load/store ID assignment; store nullification; block splitting; scheduling and assembly.]
Fixed Size Constraint: 128 Instructions
- At -O3, every basic block becomes a TRIPS block: simple, but not high performance
- At -O4, hyperblocks become TRIPS blocks
[Figure: a flowgraph of basic blocks B1-B7, mapped either as 7 TRIPS blocks or merged into 1 TRIPS block.]
Size Analysis
How big is this block? 3 instructions? 5 instructions? More?

    read  sp, g1
    movi  t3, 1
    store 384(sp), t3
    store 8(sp), t3

The maximum immediate is 256, and immediate instructions have only one target, so the code expands to:

    read  sp, g1
    movi  t3, 1
    mov   t4, t3
    addi  t7, sp, 256
    store 128(t7), t4
    store 8(sp), t4
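The expansion above (384 = 256 + 128) suggests how a code generator can split an over-range offset. Here is a sketch assuming the slide's maximum immediate of 256; the helper name and textual output format are invented for illustration:

```python
MAX_IMM = 256  # per the slide: the largest immediate an instruction can hold

def emit_store(offset: int, base: str, src: str) -> list[str]:
    """Split an out-of-range store offset into addi steps plus a final store."""
    code = []
    reg = base
    while offset > MAX_IMM:
        # Peel off MAX_IMM at a time into a temporary base register.
        code.append(f"addi t7, {reg}, {MAX_IMM}")
        reg = "t7"
        offset -= MAX_IMM
    code.append(f"store {offset}({reg}), {src}")
    return code
```

For offset 384 this produces exactly the addi/store pair shown above, while offset 8 stays a single store; each extra instruction counts against the 128-instruction block budget.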
Too Big? Block Splitting
What if the block is too large?
- Unpredicated (basic blocks): insert a branch and label
- Predicated blocks: reverse if-convert

    L0: read  t4, g10
        addi  t3, t4, 16
        load  t5, 0(t3)
        ...
        subi  t7, t5, 8
        mult  t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

splits into

    L0: read   t4, g10
        addi   t3, t4, 16
        load   t5, 0(t3)
        ...
        branch L1
        write  g11, t5

    L1: read  t4, g10
        read  t5, g11
        subi  t7, t5, 8
        mult  t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2
Register Constraints
Linear scan allocator:
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts on SPEC2000 (Alpha vs. TRIPS at -O4):

    benchmark   Alpha   TRIPS (O4)
    applu         247      331
    apsi         1326      196
    gcc          4490     6622
    mesa         2614     3821
    mgrid         366       77
    sixtrack      494      220

Total spills (18 Alpha vs. 6 TRIPS). An average spill costs 1 store for 2-3 loads.
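The linear scan allocator named above is the classic Poletto/Sarkar algorithm. A toy Python version over live intervals (not the TRIPS compiler's implementation; the spill heuristic here is the textbook "spill the interval that ends last") looks like this:

```python
def linear_scan(intervals, num_regs):
    """intervals: list of (name, start, end). Returns (assignment, spills)."""
    intervals = sorted(intervals, key=lambda iv: iv[1])  # scan by start point
    active, free = [], list(range(num_regs))
    assign, spills = {}, []
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts; free their regs.
        for iv in [a for a in active if a[2] <= start]:
            active.remove(iv)
            free.append(assign[iv[0]])
        if free:
            assign[name] = free.pop()
            active.append((name, start, end))
            active.sort(key=lambda iv: iv[2])            # keep sorted by end
        else:
            victim = active[-1]                          # lives the longest
            if victim[2] > end:
                # Spill the victim and steal its register for the new interval.
                active.pop()
                spills.append(victim[0])
                assign[name] = assign.pop(victim[0])
                active.append((name, start, end))
                active.sort(key=lambda iv: iv[2])
            else:
                spills.append(name)                      # spill the newcomer
    return assign, spills
```

Treating each hyperblock as one large instruction, as the slide suggests, simply means the live intervals are computed at hyperblock granularity before a scan like this runs.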
Block Termination Constraint
Block termination: constant output per block.
- A constant number of stores and writes execute
- Exactly one branch
- Simplifies the hardware logic for detecting block completion
All writes complete:
- Write nullification
All stores complete:
- Store nullification
- LSID assignment
TRIPS Scheduling Problem
[Figure: a hyperblock's flowgraph of instructions (ld, shl, add, sub, cmp, sw, br) must be placed onto the execution grid between the register file and the data caches.]
- Place instructions on the 4x4x8 grid
- Encode the placement in target form
Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]:
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional-unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
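A greedy, top-down, critical-path-first placement like the one described can be sketched as follows. The priority function (height to the furthest consumer) and the Manhattan-distance cost are assumptions for illustration, not the PACT 2005 scheduler itself:

```python
def critical_height(deps):
    """deps: {inst: [producers]}. Height of each inst's longest consumer chain."""
    consumers = {i: [] for i in deps}
    for inst, prods in deps.items():
        for p in prods:
            consumers[p].append(inst)
    height = {}
    def h(i):
        if i not in height:
            height[i] = 1 + max((h(c) for c in consumers[i]), default=0)
        return height[i]
    for i in deps:
        h(i)
    return height

def schedule(deps, grid=4):
    """Greedy top-down: critical-path order, each instruction placed at the
    free grid slot minimizing total distance to its already-placed producers."""
    height = critical_height(deps)
    placed = {}
    free = [(r, c) for r in range(grid) for c in range(grid)]
    # Producers always have strictly greater height than their consumers,
    # so descending height is a valid top-down (topological) order.
    for inst in sorted(deps, key=lambda i: -height[i]):
        def cost(slot):
            return sum(abs(slot[0] - placed[p][0]) + abs(slot[1] - placed[p][1])
                       for p in deps[inst] if p in placed)
        best = min(free, key=cost)
        free.remove(best)
        placed[inst] = best
    return placed
```

On a chain ld -> add -> sw this packs the three instructions into adjacent slots, keeping operand hops short, which is the locality the real scheduler's heuristics aim for.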
TRIPS Polymorphous: Different Levels of Parallelism
Instruction-level parallelism [Nagarajan et al., MICRO '01]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency
Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply
Data-level parallelism
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory
TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP

Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks is configured as a stream register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants are saved in reservation stations
Performance Results
ILP: instruction window occupancy
- Peak: 4x4x128 array = 2048 instructions
- Sustained: 493 for SPECint, 1412 for SPECfp
- Bottleneck: branch prediction
TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm^2 die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005; system bring-up: spring 2006
Functional Area Breakdown
TRIPS Summary
Distributed microarchitecture:
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture:
- Simplifies scaling
- Improves design productivity
Coarse-grained, homogeneous approach with polymorphism:
- ILP: a well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: map and reuse instructions and constants in the grid
Problems
Scalability:
- A larger grid means more communication latency
- ILP, DLP vs. TLP
- Multicore? Manycore?
Compatibility:
- Instruction & block code
Low-efficiency structures:
- Instruction buffer
- I-cache banks and the GDN
- Operand network
- LSQ & read/write queues
Polymorphism:
- How to realize DLP?
Scalability: Larger Grid
[Figure: the tile grid with its GDN, GSN, GCN, and OPN links, repeated from "Grid Processor Tiles and Interfaces"; every one of these networks must stretch as the grid grows.]
Scalability: Multicore? Manycore?
Compatibility
Low-efficiency structures
[Figure: link utilization for SPEC CPU 2000.]
Polymorphism: how to realize DLP?
[Figure: the Imagine stream processor, shown for comparison.]
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 4
SuperScalar Core
23424 USTC CS AN Hong 5
Conventional Microarchitectures
23424 USTC CS AN Hong 6
TRIPS Project Goals Technology scalable processor and memory
architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication
Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models
Power-efficient instruction level parallelism Demonstrate via custom hardware prototype
minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture
23424 USTC CS AN Hong 7
Key Features EDGE ISA
minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP
Tiled Microarchitectureminus Modular designminus No global wires
TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine
NUCA L2 Cacheminus Distributed cache design
23424 USTC CS AN Hong 8
TRIPS Chip
2 TRIPS Processors NUCA L2 Cache
minus 1 MB 16 banks
On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus
Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller
23424 USTC CS AN Hong 9
TRIPS Processor Want an aggressive general-purpose processor
minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads
But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction
semanticsminus Vulnerable to speculation hazards
TRIPS introduces a new microarchitecture and ISA
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 5
Conventional Microarchitectures
23424 USTC CS AN Hong 6
TRIPS Project Goals Technology scalable processor and memory
architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication
Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models
Power-efficient instruction level parallelism Demonstrate via custom hardware prototype
minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture
23424 USTC CS AN Hong 7
Key Features EDGE ISA
minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP
Tiled Microarchitectureminus Modular designminus No global wires
TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine
NUCA L2 Cacheminus Distributed cache design
23424 USTC CS AN Hong 8
TRIPS Chip
2 TRIPS Processors NUCA L2 Cache
minus 1 MB 16 banks
On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus
Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller
23424 USTC CS AN Hong 9
TRIPS Processor Want an aggressive general-purpose processor
minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads
But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction
semanticsminus Vulnerable to speculation hazards
TRIPS introduces a new microarchitecture and ISA
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution
- Instructions execute out of order when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
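The firing rule above (an instruction executes once all of its operands have arrived, then forwards its result to its consumers) can be sketched as a toy dataflow interpreter. This is an illustrative model, not the TRIPS microarchitecture; the data structures are invented for the sketch.

```python
from collections import deque

# Toy dataflow execution of one block: an instruction fires when all
# of its operands have arrived, then sends its result to its targets.
def run_block(instrs):
    """instrs: id -> (fn, n_operands, [target ids])."""
    operands = {i: [] for i in instrs}
    ready = deque(i for i, (_, n, _) in instrs.items() if n == 0)
    results = {}
    while ready:
        i = ready.popleft()
        fn, _, targets = instrs[i]
        results[i] = fn(*operands[i])        # execute (out of order)
        for t in targets:                    # operand-network delivery
            operands[t].append(results[i])
            if len(operands[t]) == instrs[t][1]:
                ready.append(t)
    return results

# movi t1, 2 ; movi t2, 3 ; add t3 = t1 + t2
block = {
    "movi1": (lambda: 2, 0, ["add"]),
    "movi2": (lambda: 3, 0, ["add"]),
    "add":   (lambda a, b: a + b, 2, []),
}
res = run_block(block)
```

Note there is no program counter inside the loop: order of execution is determined purely by operand arrival, which is the dataflow-within-a-block semantics of the EDGE ISA.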
Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement
Block Execution Timeline
[Figure: timeline in cycles (0, 5, 10, 30, 40) showing blocks Bi through Bi+8 occupying frames 0-7; each block goes through FETCH, a variable-length EXECUTE, and COMMIT, with execute/commit overlapped across multiple blocks.]
- The G-tile manages frames as a circular buffer
  - D-morph: 1 thread, 8 frames
  - T-morph: up to 4 threads, 2 frames each
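The circular-buffer frame management above can be sketched as follows. The D-morph/T-morph split (8 frames for one thread, or 2 frames for each of up to 4 threads) comes from the slide; the class and method names are invented for the sketch.

```python
# Sketch: the G-tile hands out frames to speculative blocks in circular
# order and retires them in the same order (blocks commit in order).
class FrameRing:
    def __init__(self, frames):
        self.frames = frames          # frames owned by this thread
        self.head = 0                 # oldest in-flight block
        self.tail = 0                 # next frame to allocate
        self.used = 0

    def allocate(self):
        if self.used == len(self.frames):
            return None               # window full: stall block fetch
        f = self.frames[self.tail % len(self.frames)]
        self.tail += 1
        self.used += 1
        return f

    def commit_oldest(self):
        self.head += 1
        self.used -= 1

# D-morph: one thread owns all 8 frames
dmorph = FrameRing(list(range(8)))
# T-morph: up to 4 threads, 2 frames each (per the slide)
tmorph = [FrameRing([2 * t, 2 * t + 1]) for t in range(4)]
```

In the D-morph, seven of the eight frames hold speculative next blocks behind the oldest one, which is how the window reaches "up to 8 blocks execute simultaneously."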
NUCA L2 Cache
- The prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
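Distance on the mesh determines a request's hop latency. As a rough illustration (the routing policy is an assumption; the slide only states that the network is a wormhole-routed 4x10 mesh), under dimension-ordered routing the hop count is the Manhattan distance between banks:

```python
# Sketch: hop count under X-then-Y dimension-ordered routing on a
# 4x10 mesh. Coordinates and routing policy are illustrative.
MESH_ROWS, MESH_COLS = 4, 10

def xy_hops(src, dst):
    (sr, sc), (dr, dc) = src, dst
    assert 0 <= sr < MESH_ROWS and 0 <= dr < MESH_ROWS
    assert 0 <= sc < MESH_COLS and 0 <= dc < MESH_COLS
    return abs(sc - dc) + abs(sr - dr)   # Manhattan distance

# Worst case: opposite corners of the 4x10 mesh
worst = xy_hops((0, 0), (3, 9))
```

This non-uniform distance is exactly what makes the cache a NUCA design: close banks answer in a few hops, far corners cost many more.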
Compiling for TRIPS
[Figure: compiler flow. Frontend (C, FORTRAN) → Inlining / Loop Unrolling / Flattening → Scalar Optimizations → Code Generation (Alpha, SPARC, PPC, TRIPS): your standard compiler, you've seen this before. TRIPS backend: TRIPS Block Formation → Register Allocation / Splitting for Spill Code → Peephole / Load-Store ID Assignment → Store Nullification → Block Splitting → Scheduling and Assembly.]
Fixed Size Constraint: 128 Instructions
- At -O3, every basic block is a TRIPS block: simple, but not high performance
- At -O4, hyperblocks become TRIPS blocks
[Figure: control-flow graph of basic blocks B1-B7, compiled either as 7 TRIPS blocks or merged into 1 TRIPS block.]
Size Analysis: How big is this block? 3 instructions? 5? More?
    read  sp, g1
    movi  t3, 1
    store 384(sp), t3
    store 8(sp), t3
The max immediate is 256, and immediate instructions have one target, so the block expands to:
    read  sp, g1
    movi  t3, 1
    mov   t4, t3
    addi  t7, sp, 256
    store 128(t7), t4
    store 8(sp), t4
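The expansion above can be captured by a tiny size estimator: a store whose offset exceeds the immediate limit costs an extra addi to form the address. This is a deliberate simplification for illustration; only the 256 max-immediate limit comes from the slide (the fan-out mov in the example is ignored here).

```python
# Sketch: estimate how many instructions a sequence of stores needs
# under the 256 max-immediate constraint. Simplified model: each
# out-of-range offset costs one extra addi to materialize the address.
MAX_IMM = 256

def store_cost(offsets):
    cost = 0
    for off in offsets:
        cost += 1                 # the store itself
        if off >= MAX_IMM:
            cost += 1             # addi to form base + large offset
    return cost

# The example block: store 384(sp) and store 8(sp)
cost = store_cost([384, 8])
```

The compiler must run this kind of accounting when forming hyperblocks, since the 128-instruction budget is checked against the expanded size, not the source-level instruction count.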
Too Big? Block Splitting
What if the block is too large?
- Unpredicated (basic blocks): insert a branch and label
- Predicated blocks: reverse if-convert
Before:
    L0: read t4, g10; addi t3, t4, 16; load t5, 0(t3); …; subi t7, t5, 8; mult t8, t7, t7; store 0(t4), t8; store 8(t4), t8; branch L2
After splitting:
    L0: read t4, g10; addi t3, t4, 16; load t5, 0(t3); …; write g11, t5; branch L1
    L1: read t4, g10; read t5, g11; subi t7, t5, 8; mult t8, t7, t7; store 0(t4), t8; store 8(t4), t8; branch L2
Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts, SPEC2000 (Alpha vs. TRIPS at -O4):
    applu:     247 vs.  331
    apsi:     1326 vs.  196
    gcc:      4490 vs. 6622
    mesa:     2614 vs. 3821
    mgrid:     366 vs.   77
    sixtrack:  494 vs.  220
Total spills (18 Alpha vs. 6 TRIPS)
Average spill: 1 store for 2-3 loads
Block Termination Constraint
Block termination: constant output per block
- A constant number of stores and writes execute
- One branch
- Simplifies the hardware logic for detecting block completion
All writes complete
- Write nullification
All stores complete
- Store nullification
- LSID assignment
TRIPS Scheduling Problem
[Figure: the flowgraph of a hyperblock (groups of ld, shl, add, sub, cmp, sw, br instructions) being placed onto the grid, with the register file above and the data caches alongside.]
- Place instructions on the 4x4x8 grid
- Encode placement in target form
Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
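A minimal version of such a greedy placement loop might look like this. The priority function (critical-path height) and distance metric are simple stand-ins for the heuristics listed above; the function and parameter names are invented for the sketch.

```python
# Sketch of a greedy, top-down list scheduler: repeatedly place the
# ready instruction with the greatest remaining critical-path height
# onto the free grid slot nearest its already-placed producers.
def schedule(deps, height, slots):
    """deps: instr -> set of producer instrs; height: instr ->
    critical-path length to a block exit; slots: (y, x) positions."""
    placed, free, remaining = {}, list(slots), set(deps)
    while remaining:
        # ready = all producers already placed
        ready = [i for i in remaining if deps[i] <= placed.keys()]
        inst = max(ready, key=lambda i: height[i])  # critical path first
        prods = [placed[p] for p in deps[inst]]
        pos = min(free, key=lambda s: sum(
            abs(s[0] - q[0]) + abs(s[1] - q[1]) for q in prods))
        placed[inst] = pos
        free.remove(pos)
        remaining.remove(inst)
    return placed

# a feeds b; a has greater height, so it is placed first,
# and b lands on the free slot nearest a
placement = schedule({"a": set(), "b": {"a"}},
                     {"a": 2, "b": 1},
                     [(0, 0), (3, 3), (0, 1)])
```

Placing consumers near producers is what reduces operand-network hops; a production scheduler would additionally weigh cache and register-bank locality, as the slide notes.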
TRIPS Polymorphous: Different Levels of Parallelism
Instruction-level parallelism [Nagarajan et al., MICRO'01]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency
Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply
Data-level parallelism
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory
TRIPS Configurable Resources
Aggregating Reservation Stations: Frames
Extracting ILP: Frames for Speculation
Configuring Frames for TLP
Using Frames for DLP
Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks is configured as a streaming register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
Performance Results
ILP (instruction window occupancy)
- Peak: 4x4x128 array → 2048 instructions
- Sustained: 493 for SPEC INT, 1412 for SPEC FP
- Bottleneck: branch prediction
TLP (instruction and data supply)
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP (data supply bandwidth)
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out fall 2005; system bring-up spring 2006
Functional Area Breakdown
TRIPS Summary
Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture
- Simplifies scalability
- Improves design productivity
Coarse-grained, homogeneous approach with polymorphism
- ILP: a well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping/reuse of instructions and constants in the grid
Problems
Scalable?
- Larger grid → more communication latency
- ILP vs. DLP/TLP
- Multicore / manycore
Compatibility
- Instruction & block code
Low-efficiency architecture
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues
Polymorphous
- How to realize DLP?
Scalable? Larger Grid
[Figure: the tile grid (G, R, I, D, E) with its interconnection networks labeled: GDN (global dispatch network), GSN (global status network), GCN (global control network), OPN (operand network).]
Scalable? Multicore / Manycore
Compatibility
Low-efficiency architecture
[Figure: link utilization for SPEC CPU 2000]
Polymorphous
[Figure: the Imagine stream processor]
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 6
TRIPS Project Goals Technology scalable processor and memory
architecturesminus Techniques to scale to 35nm and beyondminus Enable high clock rates if desiredminus High design productivity through replication
Good performance across diverse workloadsminus Exploit instruction thread and data level parallelismminus Work with standard programming models
Power-efficient instruction level parallelism Demonstrate via custom hardware prototype
minus Implement with small design teamminus Evaluate identify bottlenecks tune microarchitecture
23424 USTC CS AN Hong 7
Key Features EDGE ISA
minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP
Tiled Microarchitectureminus Modular designminus No global wires
TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine
NUCA L2 Cacheminus Distributed cache design
23424 USTC CS AN Hong 8
TRIPS Chip
2 TRIPS Processors NUCA L2 Cache
minus 1 MB 16 banks
On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus
Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller
23424 USTC CS AN Hong 9
TRIPS Processor Want an aggressive general-purpose processor
minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads
But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction
semanticsminus Vulnerable to speculation hazards
TRIPS introduces a new microarchitecture and ISA
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 7
Key Features EDGE ISA
minus Block-oriented instruction set architectureminus Helps reduce bottlenecks and expose ILP
Tiled Microarchitectureminus Modular designminus No global wires
TRIPS Processorminus Distributed processor designminus Dataflow graph execution engine
NUCA L2 Cacheminus Distributed cache design
23424 USTC CS AN Hong 8
TRIPS Chip
2 TRIPS Processors NUCA L2 Cache
minus 1 MB 16 banks
On-Chip Network (OCN)minus 2D mesh networkminus Replaces on-chip bus
Misc Controllersminus 2 DDR SDRAM controllersminus 2 DMA controllersminus External bus controllerminus C2C network controller
23424 USTC CS AN Hong 9
TRIPS Processor Want an aggressive general-purpose processor
minus Up to 16 instructions per cycleminus Up to 4 loads and stores per cycleminus Up to 64 outstanding L1 data cache missesminus Up to 1024 dynamically executed instructionsminus Up to 4 simultaneous multithreading (SMT) threads
But existing microarchitectures donrsquot scale wellminus Structures become large multi-ported and slowminus Lots of overhead to convert from sequential instruction
semanticsminus Vulnerable to speculation hazards
TRIPS introduces a new microarchitecture and ISA
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources [figure]
Aggregating Reservation Stations: Frames [figure]
Extracting ILP: Frames for Speculation [figure]
Configuring Frames for TLP [figure]
Using Frames for DLP [figure]
Configuring Data Memory for DLP [figure]
Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks configured as a stream register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
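The "reduced address communication" comes from describing a regular access pattern once instead of sending one address per element: banks acting as a stream register file can regenerate addresses locally from a descriptor. A hypothetical descriptor expansion (the encoding is illustrative, not the TRIPS format):

```python
# For regular (strided) accesses, one (base, stride, length) stream
# descriptor replaces per-element addresses; the SRF banks expand it
# locally. Hypothetical encoding for illustration.

def stream_addresses(base, stride, length):
    """Expand a (base, stride, length) descriptor into element addresses."""
    return [base + i * stride for i in range(length)]

# 64 strided loads: 64 addresses on the operand network,
# versus 3 descriptor words sent once.
addrs = stream_addresses(base=0x1000, stride=8, length=64)
words_saved = 64 - 3
```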
Performance results
ILP (instruction window occupancy):
- Peak: 4x4x128 array = 2048 instructions
- Sustained: 493 for SPEC INT, 1412 for SPEC FP
- Bottleneck: branch prediction
TLP (instruction and data supply):
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP (data supply bandwidth):
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
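As a sanity check on these figures (reading the sustained DLP rate as 6.9 ops/cycle, since a sustained rate cannot exceed the 16 ops/cycle peak), the sustained numbers as fractions of peak:

```python
# Back-of-the-envelope check: sustained window occupancy and DLP
# throughput as fractions of the stated peaks.

peak_window = 4 * 4 * 128            # 2048-entry instruction window
int_frac = 493 / peak_window         # SPEC INT occupancy
fp_frac  = 1412 / peak_window        # SPEC FP occupancy
dlp_frac = 6.9 / 16                  # sustained vs peak ops/cycle
```

So the window is roughly a quarter full on SPEC INT and about two-thirds full on SPEC FP, consistent with branch prediction being the integer-side bottleneck.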
ASIC Implementation
- 130 nm, 7LM IBM ASIC process
- 335 mm2 die; 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005; system bring-up: spring 2006
Functional Area Breakdown [figure]
TRIPS Summary
Distributed microarchitecture:
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture:
- Simplifies scalability
- Improves design productivity
Coarse-grained, homogeneous approach with polymorphism:
- ILP: a well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid
Problems
Scalability:
- A larger grid means more communication latency
- ILP, DLP, TLP
- Multicore, manycore
Compatibility:
- Instruction & block code
Low-efficiency architecture:
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues
Polymorphism:
- How to realize DLP?
Scalability: larger grid
[figure: the tile array (G, R, I, D, E tiles) and its networks — global dispatch (GDN), global status (GSN), global control (GCN), and operand (OPN)]
Scalability: multicore, manycore [figure]
Compatibility
Low-efficiency architecture
[figure: link utilization for SPEC CPU 2000]
Polymorphism: Imagine [figure]
23424 USTC CS AN Hong 10
EDGE ISA Explicit Data Graph Execution
(EDGE) Block-Oriented
minus Atomically fetch execute and commit whole blocks of instructions
minus Programs are partitioned into blocksminus Each block holds dozens of instructionsminus Sequential execution semantics at the
block levelminus Dataflow execution semantics inside
each block
Direct Target Encodingminus Encode instructions so that results go
directly to the instruction(s) that will consume them
minus No need to go through centralized register file and rename logic
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
[Figure: the header chunk and instruction chunks 0-3 of block i are distributed across the tiles' queues and reservation stations, mapped into Frame 0.]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: block i+1 is mapped into Frame 1 in the same way.]
Mapping Target Identifiers to Reservation Stations
[Figure: an ISA target identifier is encoded as Type (2 bits), Y (2 bits), X (2 bits), and Slot (3 bits); the 3-bit Frame is assigned by the G-tile at runtime. Example: Target = 87, OP1, in Frame 4 decodes as Frame 100, Slot 101, OP 10 (= OP1), selecting operand OP1 of reservation-station slot 101 in frame 100 at E-tile [10,11].]
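The bit-field structure of a target identifier can be sketched with a generic packer/unpacker. The field order below is an assumption taken from the slide's field list, not a statement of the exact TRIPS encoding:

```python
# Field widths from the slide: Type(2) | Y(2) | X(2) | Slot(3). The 3-bit
# Frame is assigned by the G-tile at runtime and is not part of the ISA
# target word modeled here.
FIELDS = (("type", 2), ("y", 2), ("x", 2), ("slot", 3))  # order assumed

def encode_target(**vals):
    word = 0
    for name, width in FIELDS:
        v = vals[name]
        assert 0 <= v < (1 << width)
        word = (word << width) | v
    return word

def decode_target(word):
    out = {}
    for name, width in reversed(FIELDS):  # peel fields off from the low end
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

# OP1 (type 10) of slot 101 at tile coordinates Y=10, X=11, as in the figure.
t = encode_target(type=0b10, y=0b10, x=0b11, slot=0b101)
```

Encoding and decoding round-trip, so the same word can route an operand to a specific reservation-station entry.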
Block Fetch
- Fetch commands are sent to each instruction cache bank.
- The fetch pipeline is from 4 to 11 stages deep.
- A new block fetch can be initiated every 8 cycles.
- Instructions are fetched into instruction queue banks chosen by the compiler.
- The EDGE ISA allows instructions to be fetched out of order.
Block Execution
- Instructions execute (out of order) when all of their operands arrive.
- Intermediate values are sent from instruction to instruction.
- Register reads and writes access the register banks.
- Loads and stores access the data cache banks.
- Branch results go to the global controller.
- Up to 8 blocks can execute simultaneously.
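The fire-when-operands-arrive rule is the heart of dataflow execution. A minimal sketch (hypothetical instruction format; the real machine delivers operands over the OPN, not a Python dict):

```python
from collections import deque

def execute_block(instrs, initial):
    """Dataflow-firing sketch: each instruction waits until all of its
    operands have arrived, then executes and forwards its result to its
    target instructions.
    instrs: {name: (op, [operand names], [target names])}
    initial: operand values injected by register-read instructions."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    arrived = {k: dict(v) for k, v in initial.items()}
    ready = deque(n for n, (_, srcs, _) in instrs.items()
                  if len(arrived.get(n, {})) == len(srcs))
    results = {}
    while ready:
        name = ready.popleft()
        op, srcs, targets = instrs[name]
        results[name] = ops[op](*(arrived[name][s] for s in srcs))
        for t in targets:  # the operand network delivers the value
            arrived.setdefault(t, {})[name] = results[name]
            if len(arrived[t]) == len(instrs[t][1]):
                ready.append(t)
    return results

# i0 = a + b; i1 = i0 * c. i1 cannot fire until i0's result arrives.
r = execute_block(
    {"i0": ("add", ["a", "b"], ["i1"]),
     "i1": ("mul", ["i0", "c"], [])},
    {"i0": {"a": 2, "b": 3}, "i1": {"c": 4}})
```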
Block Commit
- Block completion is detected and reported to the global controller.
- If no exceptions occurred, the results may be committed.
- Writes are committed to register files.
- Stores are committed to cache or memory.
- Resources are deallocated after a commit acknowledgement.
Block Execution Timeline
[Figure: timeline in cycles (0 to ~40) showing fetch, execute (variable execution time), and commit for blocks Bi through Bi+8 mapped to frames 2-7 and 0-1; execute and commit are overlapped across multiple blocks.]
The G-tile manages frames as a circular buffer:
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
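The circular-buffer frame management can be sketched as follows. The class and its methods are invented names for illustration, with the D-morph/T-morph split from the slide:

```python
class FrameAllocator:
    """Sketch of G-tile frame management: frames form a circular buffer
    per thread. D-morph gives one thread all 8 frames; T-morph splits
    them 2-per-thread among up to 4 threads."""
    def __init__(self, n_frames=8, n_threads=1):
        assert n_frames % n_threads == 0
        self.per_thread = n_frames // n_threads
        self.base = {t: t * self.per_thread for t in range(n_threads)}
        self.head = {t: 0 for t in range(n_threads)}  # next frame to allocate
        self.used = {t: 0 for t in range(n_threads)}

    def allocate(self, thread=0):
        if self.used[thread] == self.per_thread:
            return None  # window full: fetch must stall
        f = self.base[thread] + self.head[thread]
        self.head[thread] = (self.head[thread] + 1) % self.per_thread
        self.used[thread] += 1
        return f

    def commit(self, thread=0):
        self.used[thread] -= 1  # oldest block committed, its frame is freed

d_morph = FrameAllocator(8, 1)   # 1 thread, 8 speculative frames
frames = [d_morph.allocate() for _ in range(8)]

t_morph = FrameAllocator(8, 4)   # 4 threads, 2 frames each
```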
NUCA L2 Cache
- The prototype has a 1MB L2 cache divided into sixteen 64KB banks.
- 4x10 2D mesh topology; links are 128 bits wide.
- Each processor can initiate 5 requests per cycle.
- Requests and replies are wormhole-routed across the network.
- 4 virtual channels prevent deadlocks.
- The cache can sustain over 100 bytes per cycle to the processors.
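Two small helpers illustrate the banked-cache arithmetic. The interleaving scheme and tile coordinates are assumptions for illustration, not the prototype's actual address mapping:

```python
BANK_BYTES = 64 * 1024  # 1MB / 16 banks

def bank_of(addr, n_banks=16):
    """Pick an L2 bank by coarse interleaving (assumed mapping)."""
    return (addr // BANK_BYTES) % n_banks

def mesh_hops(src, dst):
    """Manhattan distance between (x, y) tiles under dimension-order
    routing, as in the 4x10 mesh of the NUCA array."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)

# Consecutive 64KB regions land in consecutive banks, wrapping at 16.
b0, b1, b16 = bank_of(0), bank_of(64 * 1024), bank_of(16 * 64 * 1024)
corner_to_corner = mesh_hops((0, 0), (3, 9))  # worst case in a 4x10 mesh
```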
Compiling for TRIPS
Frontend (C, FORTRAN) -> Inlining, Loop Unrolling, Flattening -> Scalar Optimizations ("your standard compiler, you've seen this before") -> Code Generation (Alpha, SPARC, PPC, TRIPS) -> TRIPS Block Formation -> Register Allocation / Splitting for Spill Code -> Peephole, Load/Store ID Assignment, Store Nullification -> Block Splitting -> Scheduling and Assembly
Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance: an example CFG of basic blocks B1-B7 becomes 7 TRIPS blocks.
- O4: hyperblocks as TRIPS blocks: the same CFG becomes 1 TRIPS block.
Size Analysis: how big is this block? 3 instructions? 5 instructions? More?

  read sp, g1
  movi t3, 1
  store 384(sp), t3
  store 8(sp), t3

The maximum immediate is 256, and immediate instructions have only one target, so the code expands to:

  read sp, g1
  movi t3, 1
  mov t4, t3
  addi t7, sp, 256
  store 128(t7), t4
  store 8(sp), t4
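The offset-legalization step in the example above can be sketched directly. The tuple-based IR is hypothetical, and only a single adjustment instruction is handled (offsets beyond 2 x 256 would need more, which is omitted):

```python
MAX_IMM = 256  # maximum immediate per the slide

def legalize_store(base, offset, val_reg, tmp_regs):
    """Expand `store offset(base), val` when the offset exceeds the
    immediate limit, as in the slide's 384(sp) example.
    Returns a list of (op, ...) tuples in a hypothetical IR."""
    if offset <= MAX_IMM:
        return [("store", offset, base, val_reg)]
    t = tmp_regs.pop()  # scratch register for the adjusted base
    return [("addi", t, base, MAX_IMM),
            ("store", offset - MAX_IMM, t, val_reg)]

out = legalize_store("sp", 384, "t3", ["t7"])
```

The slide's `mov t4, t3` covers a separate constraint (an immediate instruction has only one target, so fanout must be duplicated); that step is not modeled here.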
Too Big? Block Splitting: what if the block is too large?
- Predicated blocks: reverse if-convert.
- Unpredicated (basic blocks): insert a branch and label.

Before:
  L0: read t4, g10
      addi t3, t4, 16
      load t5, 0(t3)
      ...
      subi t7, t5, 8
      mult t8, t7, t7
      store 0(t4), t8
      store 8(t4), t8
      branch L2

After splitting:
  L0: read t4, g10
      addi t3, t4, 16
      load t5, 0(t3)
      ...
      branch L1
      write g11, t5
  L1: read t4, g10
      read t5, g11
      subi t7, t5, 8
      mult t8, t7, t7
      store 0(t4), t8
      store 8(t4), t8
      branch L2
Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Hyperblocks as large instructions

Total spills (18 Alpha vs 6 TRIPS):

  SPEC2000   Alpha   TRIPS (O4)
  applu        247      331
  apsi        1326      196
  gcc         4490     6622
  mesa        2614     3821
  mgrid        366       77
  sixtrack     494      220

Average spill: 1 store for 2-3 loads.
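A minimal linear-scan allocator over live intervals shows the style of allocator the slide names; this is the textbook algorithm with the simplest spill policy, not the TRIPS compiler's implementation:

```python
def linear_scan(intervals, n_regs):
    """Linear-scan register allocation.
    intervals: {var: (start, end)} live interval per variable.
    Returns (assignment {var: reg}, spilled [var])."""
    order = sorted(intervals, key=lambda v: intervals[v][0])
    free = list(range(n_regs))
    active = []          # (end, var) pairs currently holding a register
    assign, spilled = {}, []
    for v in order:
        start, end = intervals[v]
        # Expire intervals that ended before this one starts.
        for e, u in list(active):
            if e < start:
                active.remove((e, u))
                free.append(assign[u])
        if free:
            assign[v] = free.pop()
            active.append((end, v))
        else:
            spilled.append(v)  # simplest policy: spill the newcomer
    return assign, spilled

# With 2 registers, "c" overlaps both live intervals and must spill;
# "d" starts after "a" and "b" expire, so it gets a register.
assign, spilled = linear_scan(
    {"a": (0, 3), "b": (1, 2), "c": (2, 5), "d": (4, 6)}, n_regs=2)
```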
Block Termination Constraint
Block termination means constant output per block:
- A constant number of stores and writes execute
- One branch executes
- This simplifies the hardware logic for detecting block completion
All writes complete:
- Write nullification
All stores complete:
- Store nullification
- LSID assignment
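Completion detection under the constant-output rule can be sketched as a small tracker. The class and the treatment of nullified stores/writes as ordinary arrivals are assumptions for illustration:

```python
class BlockTracker:
    """A block is complete when every store announced in the header's
    store mask has executed or been nullified, every register write has
    arrived or been nullified, and exactly one branch has resolved."""
    def __init__(self, store_mask, n_writes):
        self.pending_stores = set(store_mask)  # LSIDs from the header
        self.pending_writes = n_writes
        self.branch_seen = False

    def store(self, lsid):
        self.pending_stores.discard(lsid)      # real store or nullification

    def write(self):
        self.pending_writes -= 1               # real write or nullification

    def branch(self):
        self.branch_seen = True

    def complete(self):
        return (not self.pending_stores
                and self.pending_writes == 0
                and self.branch_seen)

b = BlockTracker(store_mask={0, 3}, n_writes=1)
b.store(3); b.write(); b.branch()
done_early = b.complete()  # LSID 0 still outstanding
b.store(0)                 # e.g. a nullified store on the untaken path
done_late = b.complete()
```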
TRIPS Scheduling Problem
[Figure: the instructions of a hyperblock (chains of ld, shl, add, sub, cmp, sw, br) are placed from the flowgraph onto the execution grid, which connects to the register file and data caches.]
- Place instructions on the 4x4x8 grid
- Encode placement in target form
Scheduling Algorithms: heuristic-based list scheduler [PACT 2005]
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
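A toy greedy critical-path list scheduler conveys the flavor of the algorithm. It only models the priority ordering and grid placement; the real scheduler's locality and utilization heuristics are omitted:

```python
def schedule(deps, grid=(4, 4), slots=8):
    """Place instructions on an (x, y, slot) grid, scheduling the
    longest dependence chains first.
    deps: {inst: set of instructions it depends on}."""
    depth = {}
    def critical(i):
        # Height of i: longest path through its consumers (memoized).
        if i not in depth:
            depth[i] = 1 + max((critical(j) for j, ds in deps.items()
                                if i in ds), default=0)
        return depth[i]
    placement, nxt = {}, 0
    for i in sorted(deps, key=critical, reverse=True):
        x = nxt % grid[0]
        y = (nxt // grid[0]) % grid[1]
        s = nxt // (grid[0] * grid[1])
        assert s < slots, "block exceeds grid capacity (4x4x8 = 128)"
        placement[i] = (x, y, s)
        nxt += 1
    return placement

# ld -> add -> st: ld heads the critical path and is placed first.
p = schedule({"ld": set(), "add": {"ld"}, "st": {"add"}})
```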
TRIPS Polymorphous: Different Levels of Parallelism
Instruction-level parallelism [Nagarajan et al., MICRO'01]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency
Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply
Data-level parallelism
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory
TRIPS Configurable Resources
Aggregating Reservation Stations Frames
Extracting ILP Frames for Speculation
Configuring Frames for TLP
Using Frames for DLP
Configuring Data Memory for DLP
Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks configured as a stream register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
Performance Results
ILP (instruction window occupancy):
- Peak: 4x4x128 array -> 2048 instructions
- Sustained: 493 for SPECint, 1412 for SPECfp
- Bottleneck: branch prediction
TLP (instruction and data supply):
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP (data supply bandwidth):
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out fall 2005
- System bring-up spring 2006
Functional Area Breakdown
TRIPS Summary
Distributed microarchitecture:
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture:
- Simplifies scalability
- Improves design productivity
Coarse-grained, homogeneous approach with polymorphism:
- ILP: a well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid
Problems
Scalability:
- Larger grid -> more communication latency
- ILP, DLP, TLP?
- Multicore / manycore?
Compatibility:
- Instruction & block code
Low-efficiency architecture:
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues
Polymorphism:
- How to realize DLP?
Scalability: Larger Grid
[Figure: the grid processor tile array and its networks (GDN, GSN, GCN, OPN) from slide 14; a larger grid means longer network paths and more communication latency.]
Scalability: Multicore / Manycore
Compatibility
Low-efficiency architecture
Link utilization for SPEC CPU 2000
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 11
Block Formation Basic blocks are often too small
(just a few insts) Predication allows larger
hyperblocks to be created Loop unrolling and function
inlining also help TRIPS blocks can hold up to 128
instructions Large blocks improve fetch
bandwidth and expose ILP Hard-to-predict branches can
sometimes be hidden inside a hyperblock
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 12
TRIPS Block Format Each block is formed from two to
five 128-byte programldquochunksrdquo Blocks with fewer than five
chunks are expanded to five chunks in the L1 I-cache
The header chunk includes a block header (execution flags plus a store mask) and register readwrite instructions
Each instruction chunk holds 32 4-byte instructions (including NOPs)
A maximally sized block contains 128 regular instructions 32 read instructions and 32 write instructions
HeaderChunk
InstructionChunk 0
PC
128 Bytes
128 Bytes
128 Bytes
128 Bytes
128 Bytes
InstructionChunk 1
InstructionChunk 2
InstructionChunk 3
23424 USTC CS AN Hong 13
Processor Tiles Partition all major structures into
banks distribute and interconnect Execution Tile (E)
minus 64-entry Instruction Queue bankminus Single-issue execute pipeline
Register Tile (R)minus 32-entry Register bank (per thread)
Data Tile (D)minus 8KB Data Cache bankminus LSQ and MHU banks
Instruction Tile (I)minus 16KB Instruction Cache bank
Global Control Tile (G)minus Tracks up to 8 blocks of instsminus Branch prediction amp resolution logic
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
Processor Tiles
Partition all major structures into banks, then distribute and interconnect them.
Execution tile (E)
- 64-entry instruction queue bank
- Single-issue execute pipeline
Register tile (R)
- 32-entry register bank (per thread)
Data tile (D)
- 8KB data cache bank
- LSQ and MHU banks
Instruction tile (I)
- 16KB instruction cache bank
Global control tile (G)
- Tracks up to 8 blocks of instructions
- Branch prediction & resolution logic
Grid Processor Tiles and Interfaces
[Figure: 5x5 grid of tiles (G, R, I, D, E) connected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network)]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: the tile grid alongside each tile's per-frame state. D-tile: 32 LSID entries per frame (frames 0-7); R-tile: architecture register files and read/write queues, per frame and per thread (threads 0-3); E-tile: instruction reservation stations with OP1/OP2 operand fields, 8 slots per frame]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: block i's header chunk and instruction chunks 0-3 mapped into Frame 0 of the D-, R-, and E-tile structures]
Mapping TRIPS Blocks to the Microarchitecture
[Figure: block i+1's header chunk and instruction chunks 0-3 mapped into Frame 1]
Mapping Target Identifiers to Reservation Stations
An ISA target identifier encodes Type (2 bits), Y (2 bits), X (2 bits), and Slot (3 bits); the 3-bit Frame is assigned by the G-tile at runtime.
[Figure: example of target 87, naming OP1 of an instruction; with the runtime frame appended, the operand is routed to the E-tile at grid position [10,11] and written into frame 100, slot 101, operand field 10 (= OP1)]
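The field widths above can be made concrete with a small sketch. The bit layout, field order, and helper names below are illustrative assumptions based only on this slide, not the exact EDGE encoding:

```python
# Sketch of TRIPS target-identifier packing/unpacking, using the widths on
# the slide: Type (2 bits), Y (2 bits), X (2 bits), Slot (3 bits); the
# 3-bit Frame is supplied separately by the G-tile at runtime.
# Field order and the operand-type codes are assumptions for illustration.

OPERAND_TYPES = {0b01: "PRED", 0b10: "OP1", 0b11: "OP2"}  # assumed mapping

def encode_target(op_type, y, x, slot):
    """Pack the compiler-visible fields of a target identifier."""
    assert op_type < 4 and y < 4 and x < 4 and slot < 8
    return (op_type << 7) | (y << 5) | (x << 3) | slot

def decode_target(target, frame):
    """Split a target into the reservation-station coordinates it names."""
    op_type = (target >> 7) & 0b11
    y = (target >> 5) & 0b11
    x = (target >> 3) & 0b11
    slot = target & 0b111
    return {"tile": (y, x), "frame": frame, "slot": slot,
            "operand": OPERAND_TYPES.get(op_type, "?")}

# Operand destined for OP1 of slot 5 on the E-tile at row 2, column 3,
# with the G-tile assigning frame 4 at runtime.
d = decode_target(encode_target(0b10, 2, 3, 5), frame=0b100)
```

Because the frame is appended at runtime, the same static block can be fetched into any free frame without re-encoding its targets.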
Block Fetch
- Fetch commands are sent to each instruction cache bank
- The fetch pipeline is 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into instruction queue banks (chosen by the compiler)
- The EDGE ISA allows instructions to be fetched out of order
Block Execution
- Instructions execute (out of order) when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
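The firing rule above, where an instruction executes as soon as all of its operands arrive and forwards its result directly to its consumers, can be sketched as a tiny dataflow interpreter. This is a toy model of the execution rule, not the TRIPS microarchitecture; the graph encoding is invented for the example:

```python
# Toy dataflow-graph execution: each instruction fires once all of its
# operand slots are filled, then forwards its result to its target
# instructions, mirroring the block-execution rule on the slide.

def run_block(instrs, seeds):
    """instrs: name -> (fn, arity, target names); seeds: initial operands."""
    pending = {n: list(seeds.get(n, [])) for n in instrs}
    results = {}
    ready = [n for n in instrs if len(pending[n]) == instrs[n][1]]
    while ready:
        n = ready.pop()
        fn, arity, targets = instrs[n]
        results[n] = fn(*pending[n])
        for t in targets:                      # forward result to consumers
            pending[t].append(results[n])
            if len(pending[t]) == instrs[t][1]:
                ready.append(t)                # consumer now has all operands
    return results

# (3 + 4) squared: 'add' fires when both operands arrive, then feeds
# both operand slots of 'mul', which fires in turn.
g = {"add": (lambda a, b: a + b, 2, ["mul", "mul"]),
     "mul": (lambda x, y: x * y, 2, [])}
out = run_block(g, {"add": [3, 4]})
```

Note there is no centralized issue logic: readiness is detected purely by counting arrived operands, which is what lets the hardware distribute execution across tiles.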
Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to the register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement
Block Execution Timeline
[Figure: fetch, execute, and commit phases of blocks Bi through Bi+8 overlapped across frames 0-7 on a cycle axis; execution time per block is variable, and execute/commit are overlapped across multiple blocks]
The G-tile manages frames as a circular buffer:
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
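The frame management just described can be sketched in a few lines: 8 frames treated as a circular buffer, either all given to one speculating thread (D-morph) or divided statically among threads (T-morph). The function names and structure are illustrative, not the prototype's control logic:

```python
# Sketch of G-tile frame management: 8 frames form a circular buffer.
# D-morph gives one thread all 8 frames for speculation; T-morph splits
# them among up to 4 threads (2 frames each), per the slide.

NUM_FRAMES = 8

def frame_partition(num_threads):
    """Return each thread's frame list under the D-morph/T-morph split."""
    assert num_threads in (1, 2, 4)
    per_thread = NUM_FRAMES // num_threads
    return [list(range(t * per_thread, (t + 1) * per_thread))
            for t in range(num_threads)]

def next_frame(current, frames):
    """Advance within a thread's frames as a circular buffer."""
    return frames[(frames.index(current) + 1) % len(frames)]

d_morph = frame_partition(1)   # one thread owns frames 0..7
t_morph = frame_partition(4)   # four threads own 2 frames each
```

The circular-buffer view is what makes the timeline figure work: block Bi+8 reuses Frame 0 as soon as Bi commits and its frame is deallocated.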
NUCA L2 Cache
- The prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
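To make the mesh concrete, here is a minimal route computation on a 4x10 grid. The slide gives the topology but not the routing algorithm, so dimension-order (X-then-Y) routing, a common choice for deadlock-free wormhole NoCs, is an assumption for illustration:

```python
# Dimension-order (X-then-Y) route on the 4x10 mesh. The routing policy is
# an illustrative assumption; the slide only specifies topology, link
# width, and wormhole switching.

def xy_route(src, dst):
    """Return the list of (row, col) hops from src to dst, X first."""
    (r, c), (dr, dc) = src, dst
    path = [(r, c)]
    while c != dc:                      # traverse the X dimension first
        c += 1 if dc > c else -1
        path.append((r, c))
    while r != dr:                      # then the Y dimension
        r += 1 if dr > r else -1
        path.append((r, c))
    return path

route = xy_route((0, 0), (3, 9))        # corner to corner on the 4x10 mesh
```

Worst-case distance is therefore 9 + 3 = 12 hops, which is the kind of non-uniform latency that gives a NUCA (non-uniform cache access) organization its name: banks near the requesting processor respond in far fewer hops.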
Compiling for TRIPS
Frontend (C, FORTRAN): inlining, loop unrolling, flattening; scalar optimizations. Your standard compiler: you've seen this before.
Code generation (Alpha, SPARC, PPC, TRIPS backends):
- TRIPS block formation
- Register allocation, splitting for spill code
- Peephole optimization, load/store ID assignment
- Store nullification
- Block splitting
- Scheduling and assembly
Fixed Size Constraint: 128 Instructions
- At -O3, every basic block is a TRIPS block: simple, but not high performance
- At -O4, hyperblocks become TRIPS blocks
[Figure: a control-flow graph of basic blocks B1-B7, mapped either as 7 separate TRIPS blocks or merged into 1 TRIPS block]
Size Analysis
How big is this block? 3 instructions? 5 instructions? More?

  read  sp, g1
  movi  t3, 1
  store 384(sp), t3
  store 8(sp), t3

The maximum immediate is 256, and an immediate instruction has only one target, so the block expands to:

  read  sp, g1
  movi  t3, 1
  mov   t4, t3
  addi  t7, sp, 256
  store 128(t7), t4
  store 8(sp), t4
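The offset expansion shown above can be expressed as a small legalization rule: when a store's offset exceeds the immediate limit, materialize a nearby base with an addi and keep only the remainder in the store. The mnemonics follow the slide; `legalize_store` and its signature are invented for illustration, not the real compiler pass:

```python
# Sketch of the offset legalization shown above: store immediates are
# limited (max 256 per the slide), so a large offset is split into an
# addi that builds a nearby base plus a store with a small remainder.

MAX_IMM = 256

def legalize_store(base, offset, src, tmp="t7"):
    """Return the instruction list implementing 'store offset(base), src'."""
    if offset < MAX_IMM:
        return [f"store {offset}({base}), {src}"]
    chunk = (offset // MAX_IMM) * MAX_IMM      # portion moved into the addi
    return [f"addi {tmp}, {base}, {chunk}",
            f"store {offset - chunk}({tmp}), {src}"]

big = legalize_store("sp", 384, "t3")   # needs the addi expansion
small = legalize_store("sp", 8, "t3")   # fits in one instruction
```

This is why block size is hard to predict before code generation: one source-level store may become one or two TRIPS instructions depending on its offset.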
Too Big? Block Splitting
What if the block is too large? Predicated blocks: reverse if-convert. Unpredicated (basic blocks): insert a branch and a label.

  L0: read   t4, g10
      addi   t3, t4, 16
      load   t5, 0(t3)
      ...
      subi   t7, t5, 8
      mult   t8, t7, t7
      store  0(t4), t8
      store  8(t4), t8
      branch L2

becomes:

  L0: read   t4, g10
      addi   t3, t4, 16
      load   t5, 0(t3)
      ...
      write  g11, t5
      branch L1

  L1: read   t4, g10
      read   t5, g11
      subi   t7, t5, 8
      mult   t8, t7, t7
      store  0(t4), t8
      store  8(t4), t8
      branch L2
Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Total spills, SPEC2000 (18 Alpha vs. 6 TRIPS):

  Benchmark   Alpha   TRIPS (O4)
  applu         247      331
  apsi         1326      196
  gcc          4490     6622
  mesa         2614     3821
  mgrid         366       77
  sixtrack      494      220

Average spill: 1 store for 2-3 loads.
Block Termination Constraint
Block termination: constant output per block
- A constant number of stores and register writes execute
- Exactly one branch
- This simplifies the hardware logic for detecting block completion
All writes complete:
- Write nullification
All stores complete:
- Store nullification
- LSID assignment
TRIPS Scheduling Problem
- Place instructions on the 4x4x8 grid
- Encode placement in target form
[Figure: a hyperblock's flowgraph, with instruction groups such as ld/shl/add/sw/br, add/add/ld/cmp/br, sub/shl/ld/cmp/br, ld/add/add/sw/br, and sw/sw/add/cmp/br, scheduled onto the execution grid between the register file and the data caches]
Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
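The first two heuristics above (greedy top-down order, critical-path priority) can be sketched as a toy list scheduler. It deliberately ignores operand routing, cache locality, and register-bank locality, so it is a teaching sketch of the idea, not the PACT 2005 algorithm:

```python
# Toy greedy list scheduler: rank instructions by critical-path length
# (longest path to a leaf of the dataflow graph), then place them
# top-down into grid slots, highest priority first.

def critical_path(deps):
    """deps: instr -> list of successors. Return each instr's path length."""
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = 1 + max((depth(s) for s in deps.get(n, [])), default=0)
        return memo[n]
    for n in deps:
        depth(n)
    return memo

def schedule(deps, rows=4, cols=4):
    """Assign each instruction a (row, col) grid slot, greedy top-down."""
    prio = critical_path(deps)
    order = sorted(prio, key=prio.get, reverse=True)   # critical path first
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    return dict(zip(order, slots))

# A small chain (ld -> add -> cmp -> br) plus an independent mov.
hb = {"ld": ["add"], "add": ["cmp"], "cmp": ["br"], "br": [], "mov": []}
placement = schedule(hb)
```

Even this sketch shows why reprioritizing after each placement matters in the real scheduler: once an instruction lands on a tile, the routing distance to its consumers changes the effective priority of everything still unplaced.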
TRIPS Polymorphous: Different Levels of Parallelism
Instruction-level parallelism [Nagarajan et al., MICRO '01]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency
Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply
Data-level parallelism
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory
TRIPS Configurable Resources
Aggregating Reservation Stations: Frames
Extracting ILP: Frames for Speculation
Configuring Frames for TLP
Using Frames for DLP
Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks is configured as a streaming register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
Performance Results
ILP: instruction window occupancy
- Peak: 4x4x128 array = 2048 instructions
- Sustained: 493 for SPECint, 1412 for SPECfp
- Bottleneck: branch prediction
TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 14
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
Grid Processor Tiles and Interfaces
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 15
Mapping TRIPS Blocks to the Microarchitecture
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
I OP1 OP2Block 4
7
0
Slot
100 10 10110 11
Target = 87 OP1
Frame 4
Type(2 bits)
Y(2 bits)
X(2 bits)
Slot(3 bits)
Frame(3 bits)
ISA Target IdentifierFrame
(assigned by GTat runtime)
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
[1011]
10 11100 10 101
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7
Slot
E-tile
Frame 100Slot 101OP 10 = OP1
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 16
Mapping TRIPS Blocks to the Microarchitecture
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
otE-tile[33]
HeaderChunk
InstChunk 0
InstChunk 3
Block i mapped into Frame 0
23424 USTC CS AN Hong 17
Mapping TRIPS Blocks to the Microarchitecture
HeaderChunk
InstChunk 0
InstChunk 3
Block i+1 mapped into Frame 1
I OP1 OP2
Instruction reservation stationsFrame0 1 2 3 4 5 6
7
0
7Sl
ot
E-tile[33]
Frame
0 1 23 4 5 6 7
0
31
LSID
D-tile[3]
Architecture Register Files
ReadWrite Queues
Frame0 1 2 3 4 5 6 7
Thread0 1 2 3 R-tile[3]
RW
23424 USTC CS AN Hong 18
Mapping Target Identifiers to Reservation Stations
An ISA target identifier encodes Type (2 bits), X (2 bits), Y (2 bits), and Slot (3 bits); the 3-bit Frame field is assigned by the G-tile at runtime.
[Figure: example for target 87, operand OP1. The X/Y bits 10/11 route the operand to E-tile [10,11]; frame bits 100 select frame 4; slot bits 101 select slot 5; the OP field 10 designates OP1.]
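The field widths above invite a tiny decoder. The following is a hypothetical sketch, not the real TRIPS encoding: the widths (Type 2, X 2, Y 2, Slot 3 bits) come from the slide, but the bit ordering within the identifier is an assumption.

```python
def decode_target(ident: int) -> dict:
    """Split a 9-bit target identifier into its fields.

    Layout assumed here (MSB to LSB): Type | X | Y | Slot.
    The 3-bit Frame is not part of the identifier; the G-tile
    assigns it at runtime when the block is mapped.
    """
    slot = ident & 0b111          # low 3 bits: reservation-station slot
    y    = (ident >> 3) & 0b11    # 2 bits: E-tile row
    x    = (ident >> 5) & 0b11    # 2 bits: E-tile column
    typ  = (ident >> 7) & 0b11    # 2 bits: e.g. which operand (OP1/OP2)
    return {"type": typ, "x": x, "y": y, "slot": slot}

# Type=10, X=11, Y=01, Slot=101:
fields = decode_target(0b10_11_01_101)
assert fields == {"type": 2, "x": 3, "y": 1, "slot": 5}
```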
23424 USTC CS AN Hong 19
Block Fetch
- Fetch commands are sent to each instruction cache bank
- The fetch pipeline is 4 to 11 stages deep
- A new block fetch can be initiated every 8 cycles
- Instructions are fetched into instruction queue banks chosen by the compiler
- The EDGE ISA allows instructions to be fetched out of order
23424 USTC CS AN Hong 20
Block Execution
- Instructions execute out of order when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
23424 USTC CS AN Hong 21
Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to the register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
[Figure: timeline in cycles of blocks Bi through Bi+8 mapped onto frames 0-7; each block is fetched, executes for a variable time, then commits, with execute and commit overlapped across multiple blocks.]
- The G-tile manages frames as a circular buffer
  - D-morph: 1 thread, 8 frames
  - T-morph: up to 4 threads, 2 frames each
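The frame-management policy can be illustrated in a few lines. This is an illustrative model, not the actual G-tile logic: it only captures the partitioning described above (D-morph: one thread owns all 8 frames; T-morph: up to 4 threads own 2 frames each), with frames allocated and retired in circular order.

```python
class FrameManager:
    """Toy model of frames as a circular buffer, split per morph mode."""

    def __init__(self, num_frames=8, threads=1):
        assert num_frames % threads == 0
        self.per_thread = num_frames // threads   # 8 in D-morph, 2 in 4-thread T-morph
        # each thread owns a contiguous window of frames
        self.base = {t: t * self.per_thread for t in range(threads)}
        self.head = {t: 0 for t in range(threads)}  # next frame to allocate
        self.used = {t: 0 for t in range(threads)}

    def allocate(self, thread=0):
        """Map the next speculative block; None if the thread's window is full."""
        if self.used[thread] == self.per_thread:
            return None
        f = self.base[thread] + self.head[thread]
        self.head[thread] = (self.head[thread] + 1) % self.per_thread
        self.used[thread] += 1
        return f

    def commit_oldest(self, thread=0):
        """The oldest block commits, freeing its frame."""
        self.used[thread] -= 1

d = FrameManager(threads=1)            # D-morph: one thread, 8 frames
assert [d.allocate() for _ in range(8)] == list(range(8))
t = FrameManager(threads=4)            # T-morph: 4 threads, 2 frames each
assert t.allocate(thread=2) == 4       # thread 2's window starts at frame 4
```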
23424 USTC CS AN Hong 23
NUCA L2 Cache
- The prototype has a 1 MB L2 cache divided into sixteen 64 KB banks
- 4x10 2D mesh topology; links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
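The cache figures above are easy to sanity-check: sixteen 64 KB banks give 1 MB, and each 128-bit link carries 16 bytes per cycle, so only a handful of concurrently replying banks are needed to exceed 100 bytes per cycle.

```python
# Back-of-the-envelope check of the NUCA L2 figures from the slide.
banks, bank_kb = 16, 64
assert banks * bank_kb == 1024        # 16 x 64 KB = 1 MB total

link_bits = 128
bytes_per_link_cycle = link_bits // 8
assert bytes_per_link_cycle == 16     # each 128-bit link moves 16 B/cycle

# If 7 of the 16 banks return data in the same cycle over independent
# links, aggregate bandwidth already exceeds the 100 B/cycle cited.
assert 7 * bytes_per_link_cycle > 100
```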
23424 USTC CS AN Hong 24
Compiling for TRIPS
[Figure: compiler pipeline. A standard front end ("you've seen this before") takes C and FORTRAN through inlining, loop unrolling, flattening, and scalar optimizations; code generation targets Alpha, SPARC, PPC, or TRIPS. TRIPS-specific phases follow: block formation, register allocation with splitting for spill code, peephole optimization, load/store ID assignment, store nullification, block splitting, and scheduling and assembly.]
23424 USTC CS AN Hong 25
Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance
- O4: hyperblocks as TRIPS blocks
[Figure: a seven-node flowgraph B1-B7 forms 7 TRIPS blocks under O3, but a single TRIPS block under O4.]
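The O3/O4 contrast can be sketched with a toy block former. This is a deliberately simplified model: real hyperblock formation uses predication and the control-flow graph, but the sketch shows the core constraint, that merging stops when the 128-instruction budget would be exceeded. The basic-block sizes used below are made up.

```python
# Hypothetical greedy sketch of the O4 idea: merge consecutive basic
# blocks into one TRIPS block while the 128-instruction budget holds.
BLOCK_LIMIT = 128

def form_trips_blocks(basic_block_sizes):
    """basic_block_sizes: instruction count per basic block, in layout order.
    Returns the instruction count of each resulting TRIPS block."""
    trips_blocks, current = [], 0
    for size in basic_block_sizes:
        if current and current + size > BLOCK_LIMIT:
            trips_blocks.append(current)   # close the block, start a new one
            current = 0
        current += size
    if current:
        trips_blocks.append(current)
    return trips_blocks

# Seven small basic blocks (like the B1..B7 flowgraph of the slide)
# can fuse into a single TRIPS block under O4-style merging:
assert len(form_trips_blocks([12, 9, 15, 20, 18, 11, 25])) == 1
```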
23424 USTC CS AN Hong 26
Size Analysis: how big is this block? 3 instructions? 5? More?

    read  sp, g1
    movi  t3, 1
    store 384(sp), t3
    store 8(sp), t3

The maximum immediate is 256, and immediate instructions have one target, so the code expands:

    read  sp, g1
    movi  t3, 1
    mov   t4, t3
    addi  t7, sp, 256
    store 128(t7), t4
    store 8(sp), t4
23424 USTC CS AN Hong 27
Too Big? Block Splitting
What if the block is too large?
- Predicated blocks: reverse if-convert
- Unpredicated (basic blocks): insert a branch and label

Before:

    L0: read t4, g10
        addi t3, t4, 16
        load t5, 0(t3)
        ...
        subi t7, t5, 8
        mult t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

After splitting:

    L0: read t4, g10
        addi t3, t4, 16
        load t5, 0(t3)
        ...
        branch L1
        write g11, t5

    L1: read t4, g10
        read t5, g11
        subi t7, t5, 8
        mult t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2
23424 USTC CS AN Hong 28
Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts, SPEC2000 (Alpha vs. TRIPS at O4):

    benchmark    Alpha   TRIPS (O4)
    applu          247      331
    apsi          1326      196
    gcc           4490     6622
    mesa          2614     3821
    mgrid          366       77
    sixtrack       494      220

Total spills: 18 Alpha vs. 6 TRIPS. Average spill: 1 store for 2-3 loads.
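A minimal linear-scan allocator, in the classic style, shows the mechanism the slide names. The real TRIPS allocator additionally computes liveness over hyperblocks, treats hyperblocks as large instructions, and splits live ranges for spill code; this sketch does none of that.

```python
def linear_scan(intervals, num_regs=128):
    """Classic linear-scan register allocation sketch.

    intervals: list of (name, start, end) live ranges, any order.
    Returns (assignment dict, list of spilled names)."""
    intervals = sorted(intervals, key=lambda iv: iv[1])   # by start point
    free = list(range(num_regs))
    active = []            # (end, name, reg), kept sorted by end point
    assign, spilled = {}, []
    for name, start, end in intervals:
        # expire intervals that ended before this one starts
        while active and active[0][0] <= start:
            _, _, reg = active.pop(0)
            free.append(reg)
        if free:
            reg = free.pop(0)
            assign[name] = reg
            active.append((end, name, reg))
            active.sort()
        else:
            spilled.append(name)   # no live-range splitting in this sketch
    return assign, spilled

assign, spilled = linear_scan(
    [("a", 0, 4), ("b", 1, 3), ("c", 5, 9)], num_regs=2)
assert spilled == [] and assign["a"] != assign["b"]
```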
23424 USTC CS AN Hong 29
Block Termination Constraint
- Block termination: constant output per block
  - A constant number of stores and writes execute
  - One branch
  - Simplifies hardware logic for detecting block completion
- All writes complete: write nullification
- All stores complete: store nullification, LSID assignment
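The completion rule can be modeled directly: because nullification makes the number of register writes and stores constant per block, the hardware only has to count them plus the single branch. This sketch is illustrative; the interface and expected counts are invented, not the actual G-tile protocol.

```python
class BlockTracker:
    """Count outputs against per-block totals to detect completion."""

    def __init__(self, expected_writes, expected_stores):
        self.writes = expected_writes   # real or nullified register writes
        self.stores = expected_stores   # real or nullified stores (by LSID)
        self.branch = False             # exactly one branch per block

    def on_write(self):  self.writes -= 1
    def on_store(self):  self.stores -= 1
    def on_branch(self): self.branch = True

    def complete(self):
        return self.writes == 0 and self.stores == 0 and self.branch

b = BlockTracker(expected_writes=2, expected_stores=1)
b.on_write(); b.on_write(); b.on_store()
assert not b.complete()      # still waiting on the block's one branch
b.on_branch()
assert b.complete()
```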
23424 USTC CS AN Hong 30
TRIPS Scheduling Problem
[Figure: a hyperblock's flowgraph of instruction chains (ld, shl, add, sub, cmp, sw, br) must be placed between the register file and the data caches.]
- Place instructions on the 4x4x8 grid
- Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling Algorithms
Heuristic-based list scheduler [PACT 2005]:
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
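A toy greedy list scheduler conveys the flavor of the heuristics listed above (critical-path priority, locality-aware placement). It is not the PACT 2005 algorithm: the dependence graph, priorities, and tie-breaking are all simplified assumptions.

```python
GRID, SLOTS = 4, 8   # 4x4 E-tiles, 8 reservation-station slots each

def schedule(deps, priority):
    """deps: {inst: [producers]}; priority: {inst: criticality, higher first}.
    Returns {inst: (x, y, slot)} placement."""
    placed, load = {}, {}                      # load = slots used per tile
    ready = [i for i, ps in deps.items() if not ps]
    while ready:
        ready.sort(key=lambda i: -priority[i]) # most critical first
        inst = ready.pop(0)
        # prefer a producer's tile (operand locality), else least loaded
        prods = [placed[p][:2] for p in deps[inst] if p in placed]
        candidates = prods or [(x, y) for x in range(GRID) for y in range(GRID)]
        tile = min(candidates, key=lambda t: load.get(t, 0))
        if load.get(tile, 0) >= SLOTS:         # tile full: fall back globally
            tile = min(((x, y) for x in range(GRID) for y in range(GRID)),
                       key=lambda t: load.get(t, 0))
        placed[inst] = (tile[0], tile[1], load.get(tile, 0))
        load[tile] = load.get(tile, 0) + 1
        ready += [i for i, ps in deps.items()
                  if i not in placed and i not in ready
                  and all(p in placed for p in ps)]
    return placed

p = schedule({"ld": [], "add": ["ld"], "st": ["add"]},
             {"ld": 3, "add": 2, "st": 1})
assert p["add"][:2] == p["ld"][:2]   # consumer lands on its producer's tile
```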
23424 USTC CS AN Hong 32
TRIPS Polymorphous: Different Levels of Parallelism
- Instruction-level parallelism [Nagarajan et al., MICRO '01]
  - Populate a large instruction window with useful instructions
  - Schedule instructions to optimize communication and concurrency
- Thread-level parallelism
  - Partition the instruction window among different threads
  - Reduce contention for instruction and data supply
- Data-level parallelism
  - Provide a high density of computational elements
  - Provide high bandwidth to/from data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations: Frames
23424 USTC CS AN Hong 35
Extracting ILP: Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
- Regular data accesses
- A subset of L2 cache banks is configured as a stream register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance Results
- ILP: instruction window occupancy
  - Peak: 4x4x128 array = 2048 instructions
  - Sustained: 493 for SPECint, 1412 for SPECfp
  - Bottleneck: branch prediction
- TLP: instruction and data supply
  - Peak: 100% efficiency
  - Sustained: 87% for two threads, 61% for four threads
- DLP: data supply bandwidth
  - Peak: 16 ops/cycle
  - Sustained: 6.9 ops/cycle
23424 USTC CS AN Hong 41
ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary
- Distributed microarchitecture
  - Acknowledges and tolerates wire delay
  - Scalable protocols tailored for distributed components
- Tiled microarchitecture
  - Simplifies scalability
  - Improves design productivity
- Coarse-grained, homogeneous approach with polymorphism
  - ILP: well-partitioned, powerful uniprocessor (GPA)
  - TLP: divide the instruction window among different threads
  - DLP: mapping reuse of instructions and constants in the grid
23424 USTC CS AN Hong 44
Problems
- Scalability
  - Larger grid → more communication latency
  - ILP, DLP, TLP
  - Multicore / manycore
- Compatibility
  - Instruction & block code
- Low-efficiency architecture
  - Instruction buffer
  - I-cache bank, GDN
  - Operand network
  - LSQ & read/write queues
- Polymorphous
  - How to realize DLP?
23424 USTC CS AN Hong 45
Scalability: Larger Grid
[Figure: the TRIPS tile grid (G, R, I, D, and E tiles) and its four point-to-point networks: the global dispatch network (GDN), global status network (GSN), global control network (GCN), and operand network (OPN).]
23424 USTC CS AN Hong 46
Scalable
Multicore / Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
[Figure: link utilization for SPEC CPU 2000]
23424 USTC CS AN Hong 49
Polymorphous
[Figure: the Imagine stream processor]
23424 USTC CS AN Hong 19
Block Fetch Fetch commands sent to
each Instruction Cache bank
The fetch pipeline is from 4 to 11 stages deep
A new block fetch can be initiated every 8 cycles
Instructions are fetched into Instruction Queue banks (chosen by the compiler)
EDGE ISA allows instructions to be fetched out-of-order
23424 USTC CS AN Hong 20
Block Execution Instructions execute (out-oforder)
when all of their operands arrive Intermediate values are sent
from instruction to instruction Register reads and writes
access the register banks Loads and stores access the
data cache banks Branch results go to the
global controller Up to 8 blocks can execute
simultaneously
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance Results
ILP: instruction window occupancy
- Peak: 4x4x128 array → 2048 instructions
- Sustained: 493 for SPEC INT, 1412 for SPEC FP
- Bottleneck: branch prediction
TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads
DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
ASIC Implementation
- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006
Functional Area Breakdown
TRIPS Summary
Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture
- Simplifies scalability
- Improves design productivity
Coarse-grained, homogeneous approach with polymorphism
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid
Problems
Scalable?
- Larger grid → more communication latency
- ILP vs. DLP/TLP
- Multicore/manycore
Compatibility
- Instruction & block code
Low-efficiency architecture
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ & read/write queues
Polymorphous
- How to realize DLP?
Scalable
Larger grid
[Figure: TRIPS tile grid (G, R, I, D, E tiles) interconnected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network)]
Scalable
Multicore/Manycore
Compatibility
Low-efficiency architecture
Link utilization for SPEC CPU 2000
Polymorphous
Imagine
Block Execution
- Instructions execute out of order, when all of their operands arrive
- Intermediate values are sent from instruction to instruction
- Register reads and writes access the register banks
- Loads and stores access the data cache banks
- Branch results go to the global controller
- Up to 8 blocks can execute simultaneously
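The firing rule ("execute when all operands arrive") can be sketched directly. The textual block encoding below is invented for illustration and is not the TRIPS ISA.

```python
import operator

def dataflow_execute(block):
    """Execute a block's instructions in dataflow order: any
    instruction whose operands have all arrived may fire, regardless
    of its position in the block.

    block: {name: (op, operands)} where operands name producer
    instructions, or hold a literal for 'const'.
    Returns the value each instruction produced.
    """
    ops = {'add': operator.add, 'sub': operator.sub, 'mul': operator.mul}
    values, pending = {}, dict(block)
    while pending:
        # Everything whose operands are present fires this round.
        fired = [n for n, (op, srcs) in pending.items()
                 if op == 'const' or all(s in values for s in srcs)]
        assert fired, "deadlock: cyclic dependences in the block"
        for n in fired:
            op, srcs = pending.pop(n)
            values[n] = srcs[0] if op == 'const' else ops[op](
                values[srcs[0]], values[srcs[1]])
    return values
```

Note that no instruction pointer sequences the body; arrival of operands alone drives execution, which is exactly what lets the hardware run many instructions of a block concurrently.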
Block Commit
- Block completion is detected and reported to the global controller
- If no exceptions occurred, the results may be committed
- Writes are committed to the register files
- Stores are committed to cache or memory
- Resources are deallocated after a commit acknowledgement
Block Execution Timeline
[Figure: timeline in cycles (0, 5, 10, 30, 40) showing blocks Bi through Bi+8 mapped to frames 0-7, each passing through FETCH, EXECUTE (variable execution time), and COMMIT, with execute/commit overlapped across multiple blocks]
G-tile manages frames as a circular buffer
- D-morph: 1 thread, 8 frames
- T-morph: up to 4 threads, 2 frames each
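A toy model of that frame bookkeeping follows. The API and field names are hypothetical; the real G-tile also tracks speculation state per frame.

```python
class FrameManager:
    """Sketch of G-tile frame management: a fixed pool of 8 frames
    treated as a circular buffer, partitioned evenly among running
    threads. D-morph: one thread owns all 8 frames; T-morph: up to
    4 threads with 2 frames each."""

    def __init__(self, num_threads=1, total_frames=8):
        assert total_frames % num_threads == 0
        self.per_thread = total_frames // num_threads
        self.base = {t: t * self.per_thread for t in range(num_threads)}
        self.next_slot = {t: 0 for t in range(num_threads)}
        self.inflight = {t: [] for t in range(num_threads)}  # oldest first

    def fetch_block(self, thread):
        """Allocate the next frame for a newly fetched block; return
        None (stall fetch) when the thread's partition is full."""
        if len(self.inflight[thread]) == self.per_thread:
            return None
        frame = self.base[thread] + self.next_slot[thread]
        self.next_slot[thread] = (self.next_slot[thread] + 1) % self.per_thread
        self.inflight[thread].append(frame)
        return frame

    def commit_oldest(self, thread):
        """Blocks commit in order: deallocate the thread's oldest frame."""
        return self.inflight[thread].pop(0)
```

Reconfiguring between D-morph and T-morph is then just re-partitioning the same physical frames, which is the sense in which the resources are "polymorphous".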
NUCA L2 Cache
- Prototype has a 1MB L2 cache divided into sixteen 64KB banks
- 4x10 2D mesh topology
- Links are 128 bits wide
- Each processor can initiate 5 requests per cycle
- Requests and replies are wormhole-routed across the network
- 4 virtual channels prevent deadlocks
- Can sustain over 100 bytes per cycle to the processors
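One way such a banked cache spreads requests over the mesh is to interleave consecutive cache lines across the sixteen banks. The mapping below is a hypothetical illustration of that idea, not the prototype's actual address hash.

```python
def l2_bank(addr, num_banks=16, line_bytes=64):
    """Map a physical address to a NUCA L2 bank by interleaving
    consecutive cache lines across banks (assumed policy): adjacent
    lines land in adjacent banks, so streaming accesses exercise
    many banks, and many mesh paths, concurrently."""
    return (addr // line_bytes) % num_banks
```

Under this policy a unit-stride stream touches all sixteen banks before revisiting one, which is what lets the mesh sustain high aggregate bandwidth to the processors.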
Compiling for TRIPS
Your standard compiler (you've seen this before):
- Frontend (C, FORTRAN)
- Inlining, Loop Unrolling, Flattening
- Scalar Optimizations
- Code Generation (Alpha, SPARC, PPC, TRIPS)
TRIPS-specific backend phases:
- TRIPS Block Formation
- Register Allocation / Splitting for Spill Code
- Peephole / Load-Store ID Assignment
- Store Nullification
- Block Splitting
- Scheduling and Assembly
Fixed Size Constraint: 128 Instructions
- O3: every basic block is a TRIPS block. Simple, but not high performance.
- O4: hyperblocks as TRIPS blocks.
[Figure: control-flow graph of basic blocks B1-B7, compiled either as 7 TRIPS blocks (one per basic block) or merged into 1 TRIPS block (one hyperblock)]
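The size constraint itself can be shown with a deliberately simplified sketch: greedily merge consecutive basic blocks into one TRIPS block while the 128-instruction budget holds. Real O4 block formation predicates control flow into hyperblocks; this only illustrates the budget check.

```python
def form_trips_blocks(basic_blocks, max_insts=128):
    """Greedy sketch of fixed-size block formation.

    basic_blocks: instruction count of each basic block, in layout
    order. Returns TRIPS blocks as lists of basic-block indices, each
    holding at most max_insts instructions."""
    blocks, current, size = [], [], 0
    for i, n in enumerate(basic_blocks):
        if current and size + n > max_insts:
            blocks.append(current)      # budget exceeded: seal the block
            current, size = [], 0
        current.append(i)
        size += n
    if current:
        blocks.append(current)
    return blocks
```

Small basic blocks merge into one large TRIPS block (the O4 case), while a block already near 128 instructions stands alone (the O3 degenerate case).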
Size Analysis: How big is this block? 3 instructions? 5? More?

  read sp g1
  movi t3 1
  store 384(sp) t3
  store 8(sp) t3

The maximum immediate is 256, and immediate instructions have only one target, so the block actually expands to:

  read sp g1
  movi t3 1
  mov t4 t3
  addi t7 sp 256
  store 128(t7) t4
  store 8(sp) t4
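The offset expansion on this slide follows mechanically from the immediate limit. A sketch of that rewrite (the mnemonics and the temporary t7 mirror the slide, but the helper itself is hypothetical):

```python
def legalize_store(base, offset, src, max_imm=256, tmp='t7'):
    """Rewrite a store whose offset exceeds the maximum immediate
    (256) into an addi that builds a closer base register, plus a
    store with a now-legal offset. Handles offsets below 2*max_imm;
    larger offsets would need repeated addi instructions."""
    if offset < max_imm:
        return [f'store {offset}({base}) {src}']
    return [f'addi {tmp} {base} {max_imm}',
            f'store {offset - max_imm}({tmp}) {src}']
```

This is why the compiler must estimate post-expansion size when checking the 128-instruction budget: the source-level instruction count underestimates the real block.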
Too Big? Block Splitting
What if the block is too large?
- Predicated blocks: reverse if-convert.
- Unpredicated (basic blocks): insert a branch and label.

Before (one oversized block):

  L0: read t4 g10
      addi t3 t4 16
      load t5 0(t3)
      ...
      subi t7 t5 8
      mult t8 t7 t7
      store 0(t4) t8
      store 8(t4) t8
      branch L2

After splitting (live value t5 passed through g11):

  L0: read t4 g10
      addi t3 t4 16
      load t5 0(t3)
      ...
      write g11 t5
      branch L1

  L1: read t4 g10
      read t5 g11
      subi t7 t5 8
      mult t8 t7 t7
      store 0(t4) t8
      store 8(t4) t8
      branch L2
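For unpredicated code the transformation is mechanical: cut the instruction list, communicate live values through the register file, and branch to the new label. A sketch over a textual IR (using g11 for the first live value to mirror the slide; the helper and its register numbering are hypothetical):

```python
def split_block(insts, cut, live_across, label='L1', first_reg=11):
    """Split an oversized block at position `cut`. Values live across
    the cut are written to global registers at the end of the first
    block and read back at the top of the second."""
    writes = [f'write g{first_reg + k} {v}'
              for k, v in enumerate(live_across)]
    reads = [f'read {v} g{first_reg + k}'
             for k, v in enumerate(live_across)]
    first = insts[:cut] + writes + [f'branch {label}']
    second = [f'{label}:'] + reads + insts[cut:]
    return first, second
```

The cost of splitting is visible in the output: each live value adds a write/read pair, which is why the compiler prefers to form blocks that fit in the first place.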
Register Constraints: Linear Scan Allocator
- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Treat hyperblocks as large instructions

Spill counts on SPEC2000 (Alpha vs. TRIPS at O4):

  Benchmark   Alpha   TRIPS (O4)
  applu         247      331
  apsi         1326      196
  gcc          4490     6622
  mesa         2614     3821
  mgrid         366       77
  sixtrack      494      220

Total spills: 18 Alpha vs. 6 TRIPS
Average spill: 1 store for 2-3 loads
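The allocator's core policy can be sketched as classic linear scan over live intervals (Poletto/Sarkar style). The interval format and the spill-furthest-end heuristic are the standard textbook ones, not the TRIPS compiler's code, which additionally works over hyperblock liveness and 4 register banks.

```python
def linear_scan(intervals, num_regs):
    """Linear-scan register allocation over live intervals.

    intervals: {name: (start, end)}. Returns (allocation, spilled).
    When registers run out, spill whichever live interval ends last."""
    alloc, spilled, active, free = {}, [], [], list(range(num_regs))
    for name, (start, end) in sorted(intervals.items(),
                                     key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(alloc[old])
        if free:
            alloc[name] = free.pop(0)
            active.append((end, name))
        else:
            last_end, victim = active[-1]
            if last_end > end:
                # Steal the register of the interval ending furthest away.
                active.pop()
                alloc[name] = alloc.pop(victim)
                spilled.append(victim)
                active.append((end, name))
            else:
                spilled.append(name)   # this interval ends last: spill it
        active.sort()
    return alloc, spilled
```

Treating each hyperblock as one large instruction keeps the interval list short, which is part of why TRIPS sees fewer spills than Alpha on several benchmarks above.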
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 21
Block Commit 1048577 Block completion is detected
and reported to the global controller
1048577 If no exceptions occurred the results may be committed
1048577 Writes are committed to Register files
1048577 Stores are committed to cache or memory
1048577 Resources are deallocated after a commit acknowledgement
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 22
Block Execution Timeline
COMMITFETCH EXECUTE
5 10 30 400Frame 2 Bi
(variable execution time)
Time (cycles)
Frame 4
Frame 5
Frame 6
Frame 7
Frame 0
Frame 1
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Frame 3 Bi+1
Executecommit overlapped across multiple blocks
Bi+8
G-tile manages frames as a circular bufferminus D-morph 1 thread 8 framesminus T-morph up to 4 threads 2 frames each
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 23
NUCA L2 Cache 1048577 Prototype has 1MB L2
cache divided into sixteen 64KB banks
1048577 4x10 2D mesh topology 1048577 Links are 128 bits wide 1048577 Each processor can initiate
5 requests per cycle 1048577 Requests and replies are
wormhole-routed across the network
1048577 4 virtual channels prevent deadlocks
1048577 Can sustain over 100 bytes per cycle to the processors
23424 USTC CS AN Hong 24
Compiling for TRIPS
C
InliningLoop UnrollingFlattening
Scalar Optimizations
Your standard compileryoursquove seen this before
Frontend
FORTRAN
Code Generation
Alpha SPARC PPC TRIPS
TRIPS Block Formation
Register AllocationSplitting for Spill Code
PeepholeLoadStore ID Assignment
Store Nullification
Block Splitting
Scheduling and Assembly
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
Compiling for TRIPS

Frontend (C, FORTRAN):
- Inlining, Loop Unrolling, Flattening
- Scalar Optimizations
- Code Generation (Alpha, SPARC, PPC, TRIPS)
(Your standard compiler; you've seen this before.)

TRIPS-specific backend:
- TRIPS Block Formation
- Register Allocation / Splitting for Spill Code
- Peephole / Load-Store ID Assignment
- Store Nullification
- Block Splitting
- Scheduling and Assembly
Fixed Size Constraint: 128 Instructions

- O3: every basic block is a TRIPS block. Simple, but not high performance.
- O4: hyperblocks as TRIPS blocks.

(Figure: a control-flow graph with basic blocks B1-B7; at O3 it compiles to 7 TRIPS blocks, at O4 to a single TRIPS block.)
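The O4 idea above, merging successive basic blocks into one hyperblock while the 128-instruction budget holds, can be sketched as a toy greedy pass. This is illustrative only, not the actual TRIPS compiler algorithm, which also considers predication and control flow:

```python
# Toy sketch: greedily fuse consecutive basic blocks into one TRIPS
# block while the 128-instruction budget holds. (Illustrative; the real
# hyperblock former also handles predication and control flow.)
BLOCK_LIMIT = 128

def form_hyperblocks(basic_blocks):
    """basic_blocks: list of instruction counts, in layout order."""
    hyperblocks, current = [], 0
    for size in basic_blocks:
        if current and current + size > BLOCK_LIMIT:
            hyperblocks.append(current)   # close the full hyperblock
            current = 0
        current += size
        # a single block larger than the limit must be split later
    if current:
        hyperblocks.append(current)
    return hyperblocks

# Seven small basic blocks fuse into two TRIPS blocks:
print(form_hyperblocks([20, 15, 15, 30, 30, 10, 20]))  # [120, 20]
```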
Size Analysis: How big is this block? 3 instructions? 5 instructions? More?

    read  sp, g1
    movi  t3, 1
    store 384(sp), t3
    store 8(sp), t3

Two ISA constraints inflate it: the maximum immediate is 256, and an immediate instruction has only one target. The generated block actually becomes:

    read  sp, g1
    movi  t3, 1
    mov   t4, t3
    addi  t7, sp, 256
    store 128(t7), t4
    store 8(sp), t4
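The expansion above can be modeled with a small counting function. This is a hypothetical cost model for exactly the slide's two constraints (one extra addi per oversized offset, one extra mov per extra consumer of an immediate result), not the real ISA's sizing rules:

```python
# Illustrative size estimate: offsets at or above the 256 immediate
# limit cost an extra addi, and an immediate-form instruction names only
# one target, so each extra consumer of its result costs an extra mov.
MAX_IMM = 256

def block_size(store_offsets, movi_fanout):
    size = 1 + 1                       # read sp; movi t3, 1
    if movi_fanout > 1:                # one target per immediate instruction
        size += movi_fanout - 1        # mov instructions to replicate
    for off in store_offsets:
        if off >= MAX_IMM:
            size += 1                  # addi to build the large offset
        size += 1                      # the store itself
    return size

print(block_size([384, 8], movi_fanout=2))  # 6 instructions, not 4
```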
Too Big? Block Splitting: What if the block is too large?

Predicated blocks: reverse if-convert. The single predicated block

    L0: read  t4, g10
        addi  t3, t4, 16
        load  t5, 0(t3)
        ...
        subi  t7, t5, 8
        mult  t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

splits into two blocks, passing t5 through register g11:

    L0: read  t4, g10
        addi  t3, t4, 16
        load  t5, 0(t3)
        ...
        write g11, t5
        branch L1

    L1: read  t4, g10
        read  t5, g11
        subi  t7, t5, 8
        mult  t8, t7, t7
        store 0(t4), t8
        store 8(t4), t8
        branch L2

Unpredicated (basic blocks): insert a branch and a label.
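The mechanics of the unpredicated case, cutting an oversized block and stitching the halves with a branch and a label while live values cross the cut through registers, can be sketched as below. The write/read spellings are illustrative stand-ins, not actual TRIPS assembler syntax:

```python
# Sketch of block splitting: cut an oversized instruction list, end the
# first half with register writes for values live across the cut plus a
# branch, and start the second half with a label and matching reads.
def split_block(instrs, live_across, limit):
    """instrs: list of strings; live_across: temps live at the cut."""
    if len(instrs) <= limit:
        return [instrs]                      # already fits
    head = instrs[:limit]
    tail = instrs[limit:]
    head += [f"write g, {t}" for t in live_across] + ["branch L1"]
    tail = ["L1:"] + [f"read {t}, g" for t in live_across] + tail
    return [head, tail]

halves = split_block(["load t5", "subi t7", "mult t8"], ["t5"], limit=2)
print(len(halves))  # 2
```

A real splitter would also charge the inserted writes, reads, and branch against each half's budget; the sketch omits that accounting.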
Register Constraints: Linear Scan Allocator

- 128 registers (32 x 4 banks)
- Compute liveness over hyperblocks
- Ignore local variables
- Hyperblocks as large instructions

Spill counts, Alpha vs. TRIPS (O4):

    SPEC2000    Alpha   TRIPS (O4)
    applu         247      331
    apsi         1326      196
    gcc          4490     6622
    mesa         2614     3821
    mgrid         366       77
    sixtrack      494      220

Total spills (18 Alpha vs. 6 TRIPS). Average spill: 1 store for 2-3 loads.
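A minimal linear-scan allocation over live intervals looks like the following. This is a generic textbook sketch, not the TRIPS allocator itself, which computes liveness at hyperblock granularity as the bullets above describe:

```python
# Minimal linear-scan register allocation over live intervals: walk
# intervals by start point, expire intervals that have ended, assign a
# free register if one exists, otherwise spill.
def linear_scan(intervals, num_regs):
    """intervals: {name: (start, end)}; returns ({name: reg}, spilled)."""
    alloc, spilled, active = {}, [], []       # active: (end, name)
    for name, (start, end) in sorted(intervals.items(),
                                     key=lambda kv: kv[1][0]):
        active = [(e, n) for e, n in active if e > start]  # expire old
        if len(active) < num_regs:
            used = {alloc[n] for _, n in active}
            alloc[name] = min(set(range(num_regs)) - used)
            active.append((end, name))
            active.sort()
        else:
            spilled.append(name)              # no free register: spill
    return alloc, spilled

alloc, spilled = linear_scan({"a": (0, 4), "b": (1, 3), "c": (2, 5)}, 2)
print(spilled)  # ['c']
```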
Block Termination Constraint

Block termination: constant output per block.
- A constant number of stores and writes execute
- One branch
- Simplifies the hardware logic for detecting block completion

All writes complete:
- Write nullification

All stores complete:
- Store nullification
- LSID assignment
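The reason constant outputs simplify completion detection is that the hardware can just count events. A toy model of that logic, under the assumption that a nullified store or write still signals completion without a side effect:

```python
# Toy completion detector: a block is done once the architected number
# of stores, register writes, and exactly one branch have been observed.
# Nullified stores/writes still count toward completion.
class BlockTracker:
    def __init__(self, n_stores, n_writes):
        self.need = {"store": n_stores, "write": n_writes, "branch": 1}
        self.seen = {"store": 0, "write": 0, "branch": 0}

    def observe(self, kind):        # kind: 'store' | 'write' | 'branch'
        self.seen[kind] += 1        # a nullified store counts too

    def complete(self):
        return all(self.seen[k] >= self.need[k] for k in self.need)

t = BlockTracker(n_stores=2, n_writes=1)
for ev in ["store", "write", "store", "branch"]:
    t.observe(ev)
print(t.complete())  # True
```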
TRIPS Scheduling Problem

(Figure: the flowgraph of a hyperblock, with instruction groups such as ld/shl/add/sw/br and add/add/ld/cmp/br, is mapped onto the ALU grid between the register file and the data caches.)

- Place instructions on the 4x4x8 grid
- Encode placement in target form
Scheduling Algorithms

Heuristic-based list scheduler [PACT 2005]:
- Greedy, top-down
- Prioritizes the critical path
- Reprioritizes after each placement
- Balances functional unit utilization
- Accounts for data cache locality
- Accounts for register bank locality
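The core of such a scheduler, greedy, top-down, critical path first, can be sketched as below. The PACT 2005 scheduler additionally weighs functional-unit load and cache/register-bank locality, which this sketch omits; the priority here is simply the longest latency path from an instruction to any leaf:

```python
# Toy critical-path list scheduler: repeatedly issue the ready
# instruction with the greatest remaining-path height (ties broken by
# name for determinism). Omits the locality and FU-balance heuristics.
def schedule(deps, latency):
    """deps: {inst: [predecessors]}; returns insts in issue order."""
    succs = {i: [] for i in deps}
    for i, ps in deps.items():
        for p in ps:
            succs[p].append(i)
    prio = {}
    def height(i):                       # longest path to a leaf
        if i not in prio:
            prio[i] = latency[i] + max((height(s) for s in succs[i]),
                                       default=0)
        return prio[i]
    order, done = [], set()
    ready = {i for i in deps if not deps[i]}
    while ready:
        nxt = max(ready, key=lambda i: (height(i), i))  # critical path first
        order.append(nxt); done.add(nxt); ready.discard(nxt)
        ready |= {s for s in succs[nxt]
                  if s not in done and set(deps[s]) <= done}
    return order

deps = {"ld": [], "add": ["ld"], "cmp": ["add"], "br": ["cmp"], "mov": []}
lat = {"ld": 3, "add": 1, "cmp": 1, "br": 1, "mov": 1}
print(schedule(deps, lat))  # ['ld', 'add', 'cmp', 'mov', 'br']
```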
TRIPS Polymorphous: Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., MICRO 2001]:
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency

Thread-level parallelism:
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply

Data-level parallelism:
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory
TRIPS Configurable Resources

Aggregating Reservation Stations: Frames

Extracting ILP: Frames for Speculation

Configuring Frames for TLP

Using Frames for DLP
Configuring Data Memory for DLP

- Regular data accesses
- Subset of L2 cache banks configured as a streaming register file (SRF)
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
Performance Results

ILP (instruction window occupancy):
- Peak: 4x4x128 array, 2048 instructions
- Sustained: 493 for SPEC INT, 1412 for SPEC FP
- Bottleneck: branch prediction

TLP (instruction and data supply):
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads

DLP (data supply bandwidth):
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle
ASIC Implementation

- 130 nm 7LM IBM ASIC process
- 335 mm^2 die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006

Functional Area Breakdown
TRIPS Summary

Distributed microarchitecture:
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components

Tiled microarchitecture:
- Simplifies scalability
- Improves design productivity

Coarse-grained, homogeneous approach with polymorphism:
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid
Problems

Scalability:
- Larger grid means more communication latency
- ILP vs. DLP/TLP
- Multicore / manycore

Compatibility:
- Instruction and block code

Low-efficiency architecture:
- Instruction buffer
- I-cache bank, GDN
- Operand network
- LSQ and read/write queues

Polymorphism:
- How to realize DLP?
Scalability: Larger Grid

(Figure: the tiled layout, one global control (G) tile, register (R) tiles, instruction (I) tiles, data (D) tiles, and a 4x4 array of execution (E) tiles, connected by the on-chip networks.)

Network legend:
- GDN: global dispatch network
- GSN: global status network
- GCN: global control network
- OPN: operand network
Scalability: Multicore / Manycore

Compatibility

Low-Efficiency Architecture: Link Utilization for SPEC CPU 2000

Polymorphous: Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 25
Fixed Size Constraint 128 Instructions
bull O3 every basic block is a TRIPS block Simple but not high performance
bull O4 hyperblocks as TRIPS blocks
B1
B3B2
B4 B5
B6
B7
7 TRIPS Blocks
1 TRIPS Block
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 26
Size Analysis How big is this block 3 Instructions 5 Instructions More
read sp g1movi t3 1
store 384(sp) t3
store 8(sp) t3
Max immediate is 256
Immediate instructions have one targetread sp g1movi t3 1mov t4 t3
addi t7 sp 256store 128(t7) t4
store 8(sp) t4
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 27
Too Big Block Splitting What if the block is too large
Predicated blocksReverse if-convert
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
L0 read t4 g10 addi t3 t4 16 load t5 0(t3) hellip branch L1 write g11 t5
L1 read t4 g10 read t5 g11 subi t7 t5 8 mult t8 t7 t7 store 0(t4) t8 store 8(t4) t8 branch L2
Unpredicated (basic blocks) Insert a branch and label
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 28
bull 128 registers (32 x 4 banks)
bull Compute liveness over hyperblocksbull Ignore local variablesbull Hyperblocks as large instructions
Register Constraints Linear Scan Allocator
SPEC2000 Alpha TRIPS (O4)
applu 247 331
apsi 1326 196
gcc 4490 6622
mesa 2614 3821
mgrid 366 77
sixtrack 494 220
Total spills (18 Alpha vs 6 TRIPS)
Average spill 1 store for 2-3 loads
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 29
Block Termination Constraint
Block Termination constant output per blockminus Constant number of stores and writes executeminus One branchminus Simplifies hardware logic for detecting block completion
All writes completeminus Write nullification
All stores completeminus Store nullificationminus LSID assignment
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation

- 130 nm 7LM IBM ASIC process
- 335 mm² die
- 47.5 mm x 47.5 mm package
- ~170 million transistors
- ~600 signal I/Os
- ~500 MHz clock frequency
- Tape-out: fall 2005
- System bring-up: spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary

Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components

Tiled microarchitecture
- Simplifies scalability
- Improves design productivity

Coarse-grained, homogeneous approach with polymorphism
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid
23424 USTC CS AN Hong 44
Problems

Scalability
- Larger grid → more communication latency
- ILP vs. DLP/TLP
- Multicore / manycore

Compatibility
- Instruction & block code

Low-efficiency architecture
- Instruction buffer
- I-cache bank & GDN
- Operand network
- LSQ & read/write queue

Polymorphism
- How to realize DLP?
23424 USTC CS AN Hong 45
Scalability: Larger Grid

[Figure: the tiled processor core, a grid of tiles (G = global control, R = register, I = instruction cache, D = data cache, E = execution), interconnected by the global dispatch network (GDN), global status network (GSN), global control network (GCN), and operand network (OPN)]
23424 USTC CS AN Hong 46
Scalability: Multicore / Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-Efficiency Architecture

[Figure: link utilization for SPEC CPU 2000]
23424 USTC CS AN Hong 49
Polymorphous

Comparison with the Imagine stream processor
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 30
ldshladdswbr
TRIPS Scheduling Problem
addaddldcmpbr
subshlldcmpbr
ldaddaddswbr
swswaddcmpbr
ld
Register File
Data C
aches
Hyperblock
addadd
Flowgraph
bull Place instructions on 4x4x8 gridbull Encode placement in target form
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 31
Scheduling AlgorithmsHeuristic-based list scheduler [PACT 2005]
minus Greedy top-downminus Prioritizes critical pathminus Reprioritizes after each placementminus Balances functional unit utilizationminus Accounts for data cache localityminus Accounts for register bank locality
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 32
TRIPS Polymorphous Different Levels of Parallelism Instruction-level parallelism[Nagarajan et al Micro01]
minus Populate large instruction window with useful instructionsminus Schedule instructions to optimize communication andminus concurrency
Thread-level parallelismminus Partition instruction window among different threadsminus Reduce contentions for instruction and data supply
Data-level parallelismminus Provide high density of computational elementsminus Provide high bandwidth tofrom data memory
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 33
TRIPS Configurable Resources
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 34
Aggregating Reservation Stations Frames
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 35
Extracting ILP Frames for Speculation
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 36
Configuring Frames for TLP
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 37
Using Frames for DLP
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary Distributed microarchitecture
minus 1048577 Acknowledges and tolerates wire delayminus 1048577 Scalable protocols tailored for distributed components
Tiled microarchitectureminus 1048577 Simplifies scalabilityminus 1048577 Improves design productivity
Coarse-grained homogeneous approach with polymorphismminus ILP Well-partitioned powerful uniprocessor (GPA)minus TLP Divide instruction window among different threadsminus DLP Mapping reuse of instructions and constants in grid
23424 USTC CS AN Hong 44
Problems Scalable
minus Larger gridmore communication latencyminus ILP DLPTLPminus Multicore Manycore
Compatibilityminus Instruction amp block code
Low-efficiency architectureminus Instruction bufferminus I-Cache bank GDNminus Oprand networkminus LSQ amp readwrite queue
Polymorphousminus How to realize DLP
23424 USTC CS AN Hong 45
Scalable
Larger grid
GSN global status network
GDN global dispatch network
OPN operand network
GDN global dispatch network GDN global dispatch network
GSN global status network
GCN global control network
OPN operand network
G R R RRI
D E E EEI
D E E EEI
D E E EEI
D E E EEI
GCN global control network
GDN global dispatch network
GSN global status network
OPN operand network OPN operand network
GDN global dispatch network
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- Slide 24
- Slide 25
- Slide 26
- Slide 27
- Slide 28
- Slide 29
- Slide 30
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- Slide 35
- Slide 36
- Slide 37
- Slide 38
- Slide 39
- Slide 40
- Slide 41
- Slide 42
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- Slide 48
- Slide 49
-
23424 USTC CS AN Hong 38
Configuring Data Memory for DLP
23424 USTC CS AN Hong 39
Configuring Data Memory for DLP
Regular data accesses Subset of L2 cache banks configured as SRF High bandwidth data channels to SRF Reduced address communication Constants saved in reservation stations
23424 USTC CS AN Hong 40
Performance results ILP instruction window occupancy
minus Peak 4x4x128 array 1048577 2048 instructionsminus Sustained 493 for Spec Int 1412 for Spec FPminus Bottleneck branch prediction
TLP instruction and data supplyminus Peak 100 efficiencyminus Sustained 87 for two threads 61 for four threads
DLP data supply bandwidthminus Peak 16 opscycleminus Sustained 69 opscycle
23424 USTC CS AN Hong 41
ASIC Implementation 1048577 130 nm 7LM IBM ASIC
process 1048577 335 mm2 die 1048577 475 mm x 475 mm
package 1048577 ~170 million transistors 1048577 ~600 signal IOs 1048577 ~500 MHz clock freq 1048577 Tape-out fall 2005 1048577 System bring-up spring 2006
23424 USTC CS AN Hong 42
Functional Area Breakdown
23424 USTC CS AN Hong 43
TRIPS Summary
Distributed microarchitecture
- Acknowledges and tolerates wire delay
- Scalable protocols tailored for distributed components
Tiled microarchitecture
- Simplifies scalability
- Improves design productivity
Coarse-grained homogeneous approach with polymorphism
- ILP: well-partitioned, powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping reuse of instructions and constants in the grid
23424 USTC CS AN Hong 44
Problems
Scalable?
- Larger grid -> more communication latency
- ILP, DLP, TLP
- Multicore, Manycore
Compatibility
- Instruction & block code
Low-efficiency architecture
- Instruction buffer
- I-Cache bank, GDN
- Operand network
- LSQ & read/write queue
Polymorphous
- How to realize DLP?
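The "larger grid -> more communication latency" concern can be made concrete with a toy model (mine, not TRIPS-specific data): on an n x n operand network with nearest-neighbor links, the average Manhattan hop distance between two tiles grows linearly with n, so operand delivery latency rises as the grid is scaled up.

```python
# Toy model: average Manhattan hop count between two uniformly random
# tiles on an n x n grid, as a proxy for operand network latency.
from itertools import product

def avg_hops(n: int) -> float:
    tiles = list(product(range(n), range(n)))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in tiles for (bx, by) in tiles)
    return total / len(tiles) ** 2

for n in (4, 8, 16):
    print(f"{n}x{n} grid: {avg_hops(n):.2f} average hops")
```

This is one motivation for the multicore/manycore alternative listed above: several small grids keep most operand traffic short-range instead of paying the growing diameter of one large grid.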
23424 USTC CS AN Hong 45
Scalable
Larger grid
[Figure: TRIPS tile array (G, R, I, D, and E tiles) interconnected by the GDN (global dispatch network), GSN (global status network), GCN (global control network), and OPN (operand network)]
23424 USTC CS AN Hong 46
Scalable
Multicore Manycore
23424 USTC CS AN Hong 47
Compatibility
23424 USTC CS AN Hong 48
Low-efficiency architecture
Link utilization for SPEC CPU 2000
23424 USTC CS AN Hong 49
Polymorphous
Imagine