ASCI Winterschool on Embedded Systems
March 2004, Renesse

Processor Components: the cornerstones of future platforms
with emphasis on ILP exploitation

Henk Corporaal
Peter Knijnenburg
ASCI winterschool H.C.-P.K. 2
Future
We foresee that many characteristics of current high-performance architectures will find their way into the embedded domain.
ASCI winterschool H.C.-P.K. 3
What are we talking about?
ILP = Instruction-Level Parallelism =
the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
ASCI winterschool H.C.-P.K. 4
Processor Components: Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 5
Motivation for ILP (and other types of parallelism)
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  – multi-media (image, audio, video, 3-D)
  – intelligent search and filtering engines
  – neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: P = f·C·V²
ASCI winterschool H.C.-P.K. 6
Low power through parallelism
• Sequential processor
  – switching capacitance C
  – frequency f
  – voltage V
  – P = f·C·V²
• Parallel processor (two times the number of units)
  – switching capacitance 2C
  – frequency f/2
  – voltage V' < V
  – P = (f/2)·2C·V'² = f·C·V'² < f·C·V²
  (a numeric example follows below)
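A quick numeric sketch of this argument: halving the frequency lets the parallel design run at a lower supply voltage, and since power scales with V², the two units together can burn less than the single fast one. The specific frequency, capacitance and voltage values below are illustrative assumptions, not numbers from the slides.

# Illustrative power comparison for the P = f*C*V^2 model above.
# All numeric values are assumptions chosen only for illustration.
f, C, V = 400e6, 1e-9, 1.2            # sequential: 400 MHz, 1 nF switched, 1.2 V
P_seq = f * C * V**2                  # P = f*C*V^2

V_par = 0.9                           # assumption: a lower voltage suffices at f/2
P_par = (f / 2) * (2 * C) * V_par**2  # two units at half frequency: (f/2)*2C*V'^2

print(f"sequential: {P_seq:.3f} W, parallel: {P_par:.3f} W "
      f"({P_par / P_seq:.0%} of sequential)")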
ASCI winterschool H.C.-P.K. 7
ILP Goals
• Making the most powerful single chip processor
• Exploiting parallelism between independent instructions (or operations) in programs
• Exploit hardware concurrency
  – multiple FUs, buses, register files, bypass paths, etc.
• Code compatibility
  – binary: superscalar and super-pipelined
  – HLL: VLIW
• Incorporate enhanced functionality (ASIP)
ASCI winterschool H.C.-P.K. 8
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 9
Trends in Computer Architecture
• Bridging the semantic gap
• Performance increase
• VLSI developments
• Architecture developments: design space
• The role of the compiler
• Right match
ASCI winterschool H.C.-P.K. 10
Very simple processor

[Figure: processor datapath – function unit(s), data memory accessed through MAR/MDR, register file (r0, r1, r2, ...), instruction register and decode logic]
ASCI winterschool H.C.-P.K. 11
Bridging the Semantic Gap

Programming domains
• Application domain
• Architecture domain
• Datapath domain

Example:

L_application:   A := B + C

L_architecture (after SW compilation or interpretation):
  LD r1, M(&B)
  LD r2, M(&C)
  ADD r1, r1, r2
  ST r1, M(&A)

L_datapath (after HW interpretation):
  &B -> MAR;  MDR -> r1
  &C -> MAR;  MDR -> r2
  r1 -> ALUinput-1
  r2 -> ALUinput-2
  ALUoutput := ALUinput-1 + ALUinput-2
  ALUoutput -> r1
  r1 -> MDR;  &A -> MAR
ASCI winterschool H.C.-P.K. 12
Bridging the Semantic Gap: Different Methods

[Figure: four ways of mapping an Application onto Operations & Data Transports – Direct Execution Architectures (direct hardware interpretation of the application), Microcoded and CISC Architectures (compilation and/or software interpretation down to the architecture level, then micro-code interpretation to operations & data transports), and RISC Architectures (compilation and/or software interpretation directly to operations & data transports)]
ASCI winterschool H.C.-P.K. 13
Bridging the Semantic Gap: What happens to the semantic level?

[Figure: semantic level of the architecture over the years 1950–2010, positioned between the datapath domain and the application domain – the level rose towards CISC, dropped again with RISC, and its future course is marked with a question mark; the gap above the architecture is bridged by the compiler and/or software interpretation, the gap below by hardware interpretation]
ASCI winterschool H.C.-P.K. 14
Performance Increase

Microprocessor SPEC ratings

[Figure: SPECint92 and SPECfp92 ratings of microprocessors, 1978–2002, on a logarithmic scale from 0.1 to 1000, with fitted growth curves]

• ~50% SPECint improvement / year
• ~60% SPECfp improvement / year
ASCI winterschool H.C.-P.K. 15
VLSI Developments

[Figure: density in transistors/chip (10^3 – 10^7) and minimum feature size in µm (0.1 – 10) versus year, 1970–2000]

# Transistors (DRAM) ~ 2^((year − 1956) · 2/3)

Cycle time:
t_cycle ~ t_gate · #gate_levels + wiring_delay + pad_delay

What happens to these contributions?
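As a quick sanity check of the growth formula above, a few lines of Python evaluate the slide's approximation for some sample years (the chosen years are arbitrary; the formula is the one given above):

# Evaluate the slide's DRAM-density approximation:
#   transistors(year) ~ 2 ** ((year - 1956) * 2/3)
for year in (1970, 1985, 2000):
    transistors = 2 ** ((year - 1956) * 2 / 3)
    print(f"{year}: ~{transistors:.1e} transistors per DRAM chip")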
ASCI winterschool H.C.-P.K. 16
Architecture Developments
How to improve performance?
• (Super)-pipelining
• Powerful instructions
  – MD-technique: multiple data operands per operation
  – MO-technique: multiple operations per instruction
• Multiple instruction issue
ASCI winterschool H.C.-P.K. 17
Architecture Developments
Pipelined Execution of Instructions
IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

[Figure: simple 5-stage pipeline – four successive instructions each pass through IF, DC, RF, EX, WB, overlapped one cycle apart over cycles 1–8]

Purpose:
• Reduce #gate_levels in the critical path
• Reduce CPI close to one
• More efficient hardware

Problems – hazards cause pipeline stalls:
• Structural hazards: add more hardware
• Control hazards, branch penalties: use branch prediction
• Data hazards: bypassing required

Superpipelining: split one or more of the critical pipeline stages
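To make the "CPI close to one" point concrete, here is a small sketch of the effective CPI when control hazards stall the pipeline. The branch frequency, penalty and misprediction rate are assumed, illustrative numbers, not figures from the slides.

# Effective CPI for a single-issue pipeline with branch stalls (illustrative numbers).
base_cpi     = 1.0    # ideal pipelined CPI
branch_freq  = 0.20   # assumed fraction of instructions that are branches
mispred_rate = 0.10   # assumed misprediction rate with branch prediction
penalty      = 3      # assumed cycles lost per mispredicted branch

cpi = base_cpi + branch_freq * mispred_rate * penalty
print(f"effective CPI = {cpi:.2f}")   # 1.06 with these assumptions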
ASCI winterschool H.C.-P.K. 18
Architecture Developments
Powerful Instructions (1)
MD-technique: multiple data operands per operation

Two styles:
• Vector
• SIMD

[Figure: SIMD execution method – each instruction (1, 2, 3, ..., n) is executed in the same time step on all nodes (node 1, node 2, ..., node K); Vector execution method – instructions 1 .. K, K+1, ... stream through the functional units FU-1 .. FU-K, overlapped in time]

Example: a = B * c + d
ASCI winterschool H.C.-P.K. 19
Architecture Developments
Powerful Instructions (1)

Vector computing
• FU mix may match the application domain
• Use of interleaved memory
• FUs need to be tightly connected

SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• SIMD on a restricted scale: multi-media instructions
  – MMX, Sun VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia, ...
  – Example: Σ_{i=1..4} |a_i − b_i|  (see the sketch below)
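The Σ|a_i − b_i| example is a sum of absolute differences (SAD), the core of block matching in video coding; a multimedia instruction computes it over packed subwords in one go. A plain scalar sketch of what such an instruction computes, not tied to any particular instruction set:

# Sum of absolute differences over four packed elements, as computed by a
# single multimedia (SIMD) instruction; shown here as plain scalar code.
def sad4(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(sad4([10, 20, 30, 40], [12, 18, 33, 40]))   # -> 7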
ASCI winterschool H.C.-P.K. 20
Architecture Developments
Powerful Instructions (2)
MO-technique: multiple operations per instruction
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)

VLIW instruction example (one field per FU):
  FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
ASCI winterschool H.C.-P.K. 21
Architecture Developments: Powerful Instructions (2)
VLIW Characteristics
• Only RISC-like operation support -> short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions
• Not binary compatible
ASCI winterschool H.C.-P.K. 22
Architecture Developments
Multiple instruction issue (per cycle)
Who guarantees semantic correctness?
• User specifies multiple instruction streams
  – MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions
  – Superscalar
• Compile into a dataflow representation
  – Dataflow processors
ASCI winterschool H.C.-P.K. 23
Multiple instruction issue
Three Approaches
Example code:
  a := b + 15;
  c := 3.14 * d;
  e := c / f;

Translation to DDG (Data Dependence Graph)

[Figure: the DDG of the example – ld &b feeds + (with constant 15), which feeds st &a; ld &d feeds * (with constant 3.14), which feeds st &c and the / node; ld &f feeds the / node, whose result feeds st &e]
ASCI winterschool H.C.-P.K. 24
Generated Code
Instr.  Sequential code       Dataflow code
I1      ld   r1, M(&b)        ld   M(&b)   -> I2
I2      addi r1, r1, 15       addi 15      -> I3
I3      st   r1, M(&a)        st   M(&a)
I4      ld   r1, M(&d)        ld   M(&d)   -> I5
I5      muli r1, r1, 3.14     muli 3.14    -> I6, I8
I6      st   r1, M(&c)        st   M(&c)
I7      ld   r2, M(&f)        ld   M(&f)   -> I8
I8      div  r1, r1, r2       div          -> I9
I9      st   r1, M(&e)        st   M(&e)

Notes:
• An MIMD may execute two streams: (1) I1–I3, (2) I4–I9
  – No dependencies between the streams; in practice communication and synchronization are required between streams
• A superscalar issues multiple instructions from the sequential stream
  – Obey dependencies (true and name dependencies)
  – Reverse engineering of the DDG is needed at run-time
• Dataflow code is a direct representation of the DDG
ASCI winterschool H.C.-P.K. 25
Instruction Pipeline Overview

[Figure: pipeline organization per architecture class]
CISC:           IF  DC  RF  EX  WB
RISC:           IF  DC/RF  EX  WB
Superscalar:    k parallel pipelines IF-DC-RF-EX feeding a ROB/issue stage, then WB per pipeline
Superpipelined: IF1 IF2 ... IFs  DC  RF  EX1 EX2 ... EX5  WB  (critical stages split)
VLIW:           one shared IF and DC, then k parallel RF-EX-WB lanes
Dataflow:       k parallel RF-EX-WB lanes, without a sequential fetch/decode front-end
ASCI winterschool H.C.-P.K. 26
Four-dimensional representation of the architecture design space <I, O, D, S>

[Figure: axes Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining degree 'S' – CISC and RISC sit near the origin; superpipelined machines score on S; VLIW scores on O (~10); superscalar, MIMD and dataflow score on I (10–100); vector and SIMD score on D (10–100)]
ASCI winterschool H.C.-P.K. 27
Architecture design space
Typical values of K (# of functional units or processor nodes) and <I, O, D, S> for different architectures; Mpar = I·O·D·S

Architecture     K    I    O    D    S    Mpar
CISC             1    0.2  1.2  1.1  1    0.26
RISC             1    1    1    1    1.2  1.2
VLIW             10   1    10   1    1.2  12
Superscalar      3    3    1    1    1.2  3.6
Superpipelined   1    1    1    1    3    3
Vector           7    0.1  1    64   5    32
SIMD             128  1    1    128  1.2  154
MIMD             32   32   1    1    1.2  38
Dataflow         10   10   1    1    1.2  12

S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
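Mpar is simply the product of the four axes; a few lines reproduce some of the table's values (numbers taken directly from the table above, which rounds 153.6 to 154 for SIMD):

# Mpar = I * O * D * S, reproduced for a few rows of the table above.
rows = {
    "VLIW":        (1, 10, 1, 1.2),
    "Superscalar": (3, 1, 1, 1.2),
    "SIMD":        (1, 1, 128, 1.2),   # table rounds 153.6 to 154
}
for name, (I, O, D, S) in rows.items():
    print(f"{name:12s} Mpar = {I * O * D * S:g}")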
ASCI winterschool H.C.-P.K. 28
The Role of the Compiler
9 steps required to translate an HLL program
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports
ASCI winterschool H.C.-P.K. 29
Division of responsibilities between hardware and compiler

[Figure: for each architecture class, the translation steps – front-end, determine dependencies, binding of operands, scheduling, binding of operations, binding of transports, execute – are split between compiler and hardware. Going from superscalar via dataflow, multi-threaded and independence architectures to VLIW and TTA, ever more of these steps move from the hardware to the compiler.]
ASCI winterschool H.C.-P.K. 30
The Right Match
[Figure: transistors per CPU chip (10^3 – 10^8, logarithmic) versus year of introduction, 1972–2000 – 8-bit microprocessors, then CISC and RISC 32-bit cores, RISC + MMU + 64-bit FP, and finally VLIW, superscalar, dataflow and MIMD designs, each matching the transistor budget of its time]
ASCI winterschool H.C.-P.K. 31
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 32
RISC basics

[Figure: 4-stage pipeline (IF DC EX WB) – four successive instructions over cycles 1–8, one stage apart, with forwarding between the EX stages of dependent instructions]

[Figure: RISC datapath (instruction-fetch path not shown) – the register file and an immediate field feed the function-unit operand registers through muxes; the ALU and memory unit return results over bypass buses to the muxes and the register file]
ASCI winterschool H.C.-P.K. 33
Why RISC? Make the common case fast
• Reduced number of instructions
• Limited addressing modes
  – load-store architecture
• Large uniform register set
• Limited number of instruction sizes (preferably one)
  – know directly where the following instruction starts
• Limited number of instruction formats

Enables pipelining
ASCI winterschool H.C.-P.K. 34
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 35
ILP Processors
• Overview
• General ILP organization
• VLIW concept
  – examples: TriMedia, Mpact, TMS320C6x, IA-64
• Superscalar concept
  – examples: HP PA-8000, Alpha 21264, MIPS R10k/R12k, Pentium I–IV, AMD K5–K7, UltraSPARC
  – (Ref: IEEE Micro, April 1996 (HotChips issue))
• Comparing superscalar and VLIW
ASCI winterschool H.C.-P.K. 36
General ILP processor organization

[Figure: central processing unit – instruction memory feeds an instruction fetch unit and instruction decode unit, which issue to function units FU-1 ... FU-K; the FUs share a central register file and access data memory]
ASCI winterschool H.C.-P.K. 37
ILP processor characteristics
• Issue multiple operations/instructions per cycle
• Multiple concurrent Function Units
• Pipelined execution
• Shared register file
• Four superscalar variants
  – in-order / out-of-order execution
  – in-order / out-of-order completion
ASCI winterschool H.C.-P.K. 38
VLIW concept

[Figure: a VLIW architecture with 7 FUs – an integer register file shared by three integer FUs and two load/store units, a floating-point register file shared by two FP FUs, one instruction memory issuing the long instruction, and a data memory]
ASCI winterschool H.C.-P.K. 39
VLIW example: TriMedia

TriMedia overview
• 5-issue VLIW processor, 32-bit, 128 registers, 27 FUs
• 32 KB instruction cache, 16 KB data cache; 8-way set-associative caches; dual-ported data cache
• Guarded operations
• VLD coprocessor: Huffman decoder for MPEG-1/2
• SDRAM memory interface, timers, 32-bit 33 MHz PCI interface
• Video in (19 Mpix/s) and video out (40 Mpix/s)
• Audio in (stereo digital audio) and audio out (2–8 channel digital audio)
• I2C interface, serial interface
ASCI winterschool H.C.-P.K. 40
VLIW example: TMS320C62
TMS320C62 VelociTI processor
• 8 operations (of 32 bits) per instruction (256 bits)
• Two clusters
  – 8 FUs, 4 FUs per cluster: 2 multipliers, 6 ALUs
  – 2 × 16 registers
  – one port available to read from the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All operations conditional
• 5 ns, 200 MHz, 0.25 µm, 5-layer CMOS
• 128 KB on-chip RAM
ASCI winterschool H.C.-P.K. 41
VelociTI C64 datapath

[Figure: one cluster of the VelociTI C64 datapath]
ASCI winterschool H.C.-P.K. 42
VLIW example: IA-64

Intel/HP 64-bit VLIW-like architecture
• 128-bit instruction bundle containing 3 instructions
• 128 integer + 128 floating-point registers: 7-bit register ids
• Guarded instructions
  – 64-entry boolean register file; relies heavily on if-conversion to remove branches
• Instruction independence specified explicitly
  – some extra bits per bundle
• Fully interlocked
  – i.e. no delay slots: operations are latency compatible within a family of architectures
• Split loads
  – non-trapping load + exception check
ASCI winterschool H.C.-P.K. 43
Intel Itanium 2
• EPIC
• 0.18 µm, 6ML
• 8 issue slots
• 1 GHz (8000 MIPS)
• 130 W (max)
• 61 MOPS/W
• 128-bit bundle (3 × 41 b + 5 b)
ASCI winterschool H.C.-P.K. 44
Superscalar: Concept

[Figure: superscalar organization – instructions flow from instruction memory through an instruction cache and decoder into reservation stations in front of a branch unit, ALU-1, ALU-2, a logic & shift unit, a load unit and a store unit; a reorder buffer and register file collect results; the load/store units access data memory through a data cache using the generated addresses]
ASCI winterschool H.C.-P.K. 45
Intel Pentium 4
• Superscalar
• 0.12 µm, 6ML
• 1.0 V
• 3 issue
• > 3 GHz
• 58 W
• 20-stage pipeline
• ALUs clocked at 2×
• Trace cache
ASCI winterschool H.C.-P.K. 46
Pentium 4
• Trace cache
• Hyper-threading
• Add with ½-cycle throughput (1½-cycle latency)

[Figure: staggered ALU – in successive half cycles the adder produces the least significant 16 bits (forwarding the carry), then the most significant 16 bits, then calculates the flags]
ASCI winterschool H.C.-P.K. 47
P4 vs. P-II / P-III pipeline

Basic P6 pipeline (introduced at 733 MHz, 0.18 µ):
  1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic Pentium 4 processor pipeline (introduced at 1.4 GHz, 0.18 µ):
  1–2 TC Nxt IP | 3–4 TC Fetch | 5 Drive | 6 Alloc | 7–8 Rename | 9 Que | 10–12 Sch | 13–14 Disp | 15–16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive
ASCI winterschool H.C.-P.K. 48
Example with Higher IPC and Faster Clock!

Code sequence: Ld, Add, Add, Ld, Add, Add

P6 @ 1 GHz:                       10 clocks = 10 ns,   IPC = 0.6
Pentium® 4 processor @ 1.4 GHz:    6 clocks = 4.3 ns,  IPC = 1.0
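The numbers follow directly: six instructions in 10 cycles at 1 GHz versus six instructions in 6 cycles at 1.4 GHz.

# Reproduce the IPC / execution-time figures for the 6-instruction sequence above.
instructions = 6
for name, cycles, freq_ghz in (("P6 @ 1 GHz", 10, 1.0),
                               ("Pentium 4 @ 1.4 GHz", 6, 1.4)):
    time_ns = cycles / freq_ghz
    print(f"{name}: {time_ns:.1f} ns, IPC = {instructions / cycles:.1f}")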
ASCI winterschool H.C.-P.K. 49
Superscalar Issues
• How to fetch multiple instructions in time (across basic block boundaries)? -> trace cache
• Handling control hazards: branch prediction
• Non-blocking memory system: hit over miss
• Handling dependencies: renaming
• How to support precise interrupts? -> ROB
• How to recover from a mispredicted branch path? -> ROB
ASCI winterschool H.C.-P.K. 50
Renaming
All four instructions may issue simultaneously (if resources are available)

Renaming is implemented using
  – a reorder buffer: Pentium II/III, HP PA-8000, PowerPC 604, SPARC64
  – direct register remapping: MIPS R10k/R12k, DEC 21264

Example:
#  Original code      Dependence   Latency   Renamed version
1  mul r1, r2, r3                  4         mul p1, p2, p3
2  st  r1, 3(r2)      RaW          1         st  p1, 3(p2)
3  add r1, r5, #4     WaW, WaR     1         add p4, p5, #4
4  shl r2, r1, r3     RaW, WaR     1         shl p6, p4, p3
ASCI winterschool H.C.-P.K. 51
Renaming
Note: the old mapping r1–p1 is not needed anymore; however, p1 is still live.

When may we reuse physical register p1?
  – the old mapping has changed (r1–p4), and
  – p1 has been committed

Mapping (after I4):
  logical register   physical register
  r1                 p4
  r2                 p6
  r3                 p3
  r4                 -
  r5                 p5
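A minimal sketch of this rename mechanism as code: a map table from architectural to physical registers plus a free list. The register numbers follow the slide's example; everything else (register counts, encoding) is illustrative.

# Minimal register-renaming sketch: architectural registers are mapped onto a
# larger physical register file via a map table and a free list.
map_table = {"r2": "p2", "r3": "p3", "r5": "p5"}   # initial mappings (as in the slide)
free_list = ["p1", "p4", "p6", "p7", "p8"]          # free physical registers (assumed)

def rename(dst, srcs):
    phys_srcs = [map_table[s] for s in srcs]  # sources read the CURRENT mapping
    phys_dst = free_list.pop(0)               # destination gets a fresh physical reg
    map_table[dst] = phys_dst                 # later readers see the new mapping
    return phys_dst, phys_srcs

print(rename("r1", ["r2", "r3"]))   # mul r1,r2,r3  -> ('p1', ['p2', 'p3'])
print(map_table["r1"])              # st  r1,3(r2)  reads p1
print(rename("r1", ["r5"]))         # add r1,r5,#4  -> ('p4', ['p5'])  (WaW removed)
print(rename("r2", ["r1", "r3"]))   # shl r2,r1,r3  -> ('p6', ['p4', 'p3'])  (WaR removed)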
ASCI winterschool H.C.-P.K. 52
Branch Prediction
• Branch prediction techniques, why?
  – speculatively execute beyond branches
  – reduce branch penalties
• Classification
  – Static techniques; prediction based on:
    • profiling information
    • static analysis of code: use of heuristics
  – Dynamic techniques
    • 1-level: branch prediction buffer with n-bit prediction counters
    • 2-level: branch correlation using branch history
    • hybrid methods (e.g. Alpha 21264)
  – Combinations of static and dynamic
ASCI winterschool H.C.-P.K. 53
Static Techniques: Heuristic Based (Ball and Larus '93)
• Loop Branch Heuristic
  – the back-edge will be taken 88% of the time
• Pointer Heuristic
  – a comparison of two pointers will fail 60% of the time
• Call Heuristic
  – a successor block containing a call and which does not post-dominate the block containing the branch will not be taken 78% of the time
• Opcode Heuristic
  – a test of an integer for less than zero, less than or equal to zero, or equal to some constant will fail 84% of the time
• Loop Exit Heuristic
  – a branch in a loop in which no successor block is a loop head will not exit the loop 80% of the time
ASCI winterschool H.C.-P.K. 54
Static Heuristic Based (Ball and Larus '93)
• Return Heuristic
  – a successor block containing a return will not be taken 72% of the time
• Store Heuristic
  – a successor block containing a store instruction and which does not post-dominate will not be taken 55% of the time
• Loop Header Heuristic
  – a successor block which is a loop header or a loop pre-header (i.e. passes control unconditionally to a loop head which it dominates) and which does not post-dominate will be taken 75% of the time
• Guard Heuristic
  – a successor block in which a register is used before being defined and which does not post-dominate will be taken 62% of the time, if that register is an operand of the branch
ASCI winterschool H.C.-P.K. 55
Static Heuristic Based Prediction

When multiple predictors apply, we combine them using 'Dempster-Shafer' evidence combination:

P_new = (P_old · P_heuristic) / (P_old · P_heuristic + (1 − P_old) · (1 − P_heuristic))

For example, if both the 'Loop Exit' (0.8) and 'Store' (0.45) heuristics apply:

P_new = 0.8 · 0.45 / (0.8 · 0.45 + (1 − 0.8) · (1 − 0.45)) = 0.766
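The combination rule is easy to apply mechanically; a tiny helper, written here just to check the slide's number, folds two heuristic probabilities together:

# Dempster-Shafer style combination of branch probabilities, as on the slide
# (the 0.8 and 0.45 values are the slide's example inputs).
def combine(p_old, p_heuristic):
    return (p_old * p_heuristic) / (p_old * p_heuristic +
                                    (1 - p_old) * (1 - p_heuristic))

print(round(combine(0.8, 0.45), 3))   # Loop Exit (0.8) + Store (0.45) -> 0.766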
ASCI winterschool H.C.-P.K. 56
Dynamic Techniques: Branch Prediction Buffer – 1-bit prediction

[Figure: the lower K bits of the branch address index a table of 2^K one-bit prediction entries]

Problems
• Aliasing: the lower K bits of different branch instructions can be the same
  – Solution: use tags (the buffer then effectively becomes a cache); however, this is very expensive
• Loops are predicted wrong twice
  – Solution: use an n-bit saturating counter for prediction
    * taken      if counter ≥ 2^(n−1)
    * not taken  if counter < 2^(n−1)
  – A 2-bit saturating counter predicts a loop wrong only once
ASCI winterschool H.C.-P.K. 57
Using n-bit Saturating Counters

[Figure: the branch address indexes a table of n-bit saturating up/down counters, which deliver the prediction]

2-bit saturating counter scheme: states 11/T and 10/T predict taken, 01/N and 00/N predict not-taken; a taken branch moves the counter up, a not-taken branch moves it down, saturating at 11 and 00.
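A sketch of this 2-bit scheme as code; the table size and the example outcome stream are made-up illustrations:

# 2-bit saturating counter branch predictor (1-level scheme described above).
# Table size and the branch outcome stream below are illustrative assumptions.
K = 10
table = [2] * (1 << K)            # 2-bit counters, initialised to "weakly taken"

def predict_and_update(pc, taken):
    idx = pc & ((1 << K) - 1)     # lower K bits of the branch address
    prediction = table[idx] >= 2  # taken if counter >= 2^(n-1)
    table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    return prediction

# A loop branch taken 9 times, then not taken: only the last outcome is mispredicted.
outcomes = [True] * 9 + [False]
hits = sum(predict_and_update(0x400123, t) == t for t in outcomes)
print(f"{hits}/{len(outcomes)} correct")   # -> 9/10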
ASCI winterschool H.C.-P.K. 58
Branch Correlation Using Branch History
Two schemes (a, k, m, n):
• PA: per-address history, a > 0
• GA: global history, a = 0

[Figure: a bits of the branch address select one of 2^a branch history registers of k bits; the k history bits together with m branch address bits index the pattern history table of n-bit saturating up/down counters, which deliver the prediction]

Table size (usually n = 2): #bits = k · 2^a + 2^k · 2^m · n

Variant: gshare (Scott McFarling '93) – a GA scheme which indexes the pattern history table with the XOR of PC address bits and branch history bits.
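A compact sketch of the gshare variant; the history length and table size are arbitrary choices made only for this example:

# gshare predictor sketch: global branch history XOR-ed with PC bits indexes a
# table of 2-bit saturating counters. History length / table size are assumed.
K = 12                      # log2(table size) = history length
table = [2] * (1 << K)      # 2-bit counters, weakly taken
history = 0                 # global branch history register (K bits)

def gshare(pc, taken):
    global history
    idx = (pc ^ history) & ((1 << K) - 1)       # XOR of PC and history bits
    prediction = table[idx] >= 2
    table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    history = ((history << 1) | taken) & ((1 << K) - 1)
    return prediction

print(gshare(0x4a10, True))   # first prediction at this PC/history -> True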
ASCI winterschool H.C.-P.K. 59
Predicting the Target Address
1. Branch Target Buffer (BTB)
2. Branch Folding (Store instruction in BTB)
3. Return Stack
ASCI winterschool H.C.-P.K. 60
Accuracy (taking the best combination of parameters):

[Figure: branch prediction accuracy (%) from 89 to 98 versus predictor size from 64 bytes to 64 KB, comparing Bimodal, GAs, PAs, GA(0,11,5,2) and PA(10,6,4,2) predictors]
ASCI winterschool H.C.-P.K. 61
Comparing Superscalar and VLIW
Characteristic              Superscalar          VLIW
Architecture type           Multiple issue       Multiple operations
Complexity                  High                 Low
Binary code compatibility   Yes                  No
Source code compatibility   Yes                  Yes, if good compiler
Scheduling                  Dynamic              Static
Scheduling window           10 instructions      100 – 1000 instructions
Speculation                 Dynamic              Static
Branch prediction           Dynamic              Static
Mem. ref. disambiguation    Dynamic              Static
Scalability                 Medium               High
Functional flexibility      High                 Very high
Application                 General purpose      Special purpose
ASCI winterschool H.C.-P.K. 62
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 63
Reducing Datapath Complexity: TTA
TTA: Transport Triggered Architecture

Overview

Philosophy: MIRROR THE PROGRAMMING PARADIGM
• Program transports; operations are side effects of transports
• The compiler is in control of the hardware transport capacity
ASCI winterschool H.C.-P.K. 64
Transport Triggered Architecture

[Figure: general structure of a TTA – a number of function units plus an integer register file, an FP register file and a boolean register file, all connected through sockets to a set of data-transport buses (move buses)]
ASCI winterschool H.C.-P.K. 65
Program TTAs

How to do data operations?
1. Transport of operands to the FU
   • operand move(s)
   • trigger move
2. Transport of results from the FU
   • result move(s)

How to do control flow?
1. Jump:   #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call:   pc -> r; #call-address -> pcd

Example: Add r3, r1, r2 becomes
  r1   -> Oint    // operand move to integer unit
  r2   -> Tadd    // trigger move to integer unit
  ....            // addition operation in progress
  Rint -> r3      // result move from integer unit

[Figure: FU pipeline – operand and trigger registers feed an internal stage, which produces the result register]
ASCI winterschool H.C.-P.K. 66
Program TTAs

Scheduling advantages of Transport Triggered Architectures

1. Software bypassing
   Rint -> r1; r1 -> Tadd        becomes   Rint -> r1; Rint -> Tadd
2. Dead writeback removal
   Rint -> r1; Rint -> Tadd      becomes   Rint -> Tadd
3. Common operand elimination
   #4 -> Oint; r1 -> Tadd        becomes   #4 -> Oint; r1 -> Tadd
   #4 -> Oint; r2 -> Tadd                  r2 -> Tadd
4. Decouple operand, trigger and result moves completely
   r1 -> Oint; r2 -> Tadd                  r1 -> Oint
   Rint -> r3                    becomes   r2 -> Tadd
                                           Rint -> r3
   (the individual moves may be scheduled in different cycles)
ASCI winterschool H.C.-P.K. 67
TTA Advantages
Summary of advantages of TTAs

• Better usage of transport capacity
  – instead of 3 transports per dyadic operation, about 2 are needed
  – the number of register ports is reduced by at least 50%
  – inter-FU connectivity is reduced by 50–70%
• No full connectivity required
• Both the transport capacity and the number of register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #register files, etc. can be changed
• TTAs are easy to design and can have short cycle times
ASCI winterschool H.C.-P.K. 68
TTA automatic DSE (design space exploration)

[Figure: the Move framework – an optimizer proposes architecture parameters, optionally steered by user interaction; a parametric compiler produces parallel object code and a hardware generator produces the chip; their feedback drives the optimizer, yielding a Pareto curve of solutions in the cost versus execution-time plane]
ASCI winterschool H.C.-P.K. 69
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 70
Tensilica Xtensa
• Configurable RISC
• 0.13 µm
• 0.9 V
• 1 issue slot / 5-stage pipeline
• 490 MHz typical
• 39.2 mW (no memories)
• 12500 MOPS/W
• Tool support
• Optional vector unit
• Special function units
ASCI winterschool H.C.-P.K. 71
Fine-grained reconfigurable: Xilinx XC4000 FPGA

[Figure: an array of configurable logic blocks (CLBs) connected through switch matrices and programmable interconnect, surrounded by I/O blocks (IOBs). The IOB detail shows input/output buffers, slew-rate control, passive pull-up/pull-down and the pad; the CLB detail shows the F, G and H function generators, two flip-flops with set/reset control, and the C1–C4, F1–F4, G1–G4, K, X and Y signals]
ASCI winterschool H.C.-P.K. 72
Coarse-grained reconfigurable: Chameleon CS2000

Highlights:
• 32-bit datapath (ALU/shift)
• 16×24 multiplier
• distributed local memory
• fixed timing
ASCI winterschool H.C.-P.K. 73
Hybrid FPGAs: Virtex-II Pro

[Figure: Virtex-II Pro die photo (courtesy of Xilinx) – embedded PowerPC cores, reconfigurable logic blocks, memory blocks, and GHz I/O: up to 16 serial transceivers]
ASCI winterschool H.C.-P.K. 74
HW or SW reconfigurable?

[Figure: design space of reconfiguration time versus datapath granularity (fine to coarse) – reconfiguration time ranges from 1 cycle (subword parallelism), via loop buffer and context reset, up to full spatial reconfiguration; FPGAs sit at the fine-grained, spatial-mapping corner, VLIWs at the coarse-grained, temporal-mapping corner; configuration bandwidth varies accordingly]
ASCI winterschool H.C.-P.K. 75
Granularity Makes a Difference

                     Fine-grained architecture   Coarse-grained architecture
Clock speed          Low                          High
Configuration time   Long                         Short
Unit amount          Large                        Small
Flexibility          High                         Low
Power                High                         Low
Area                 Large                        Small
ASCI winterschool H.C.-P.K. 76
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Multi-threading
• Summary and Conclusions
ASCI winterschool H.C.-P.K. 77
Simultaneous Multithreading Characteristics
• An SMT has separate front-ends for the different threads but shares the back-end between all threads.
• Each thread has its own
  – re-order buffer
  – branch history register
• Registers, caches, branch prediction tables, instruction queues, FUs, etc. are shared.
ASCI winterschool H.C.-P.K. 78
Multi-threading in Uniprocessor Architectures

[Figure: issue slots versus clock cycles for a superscalar, a concurrent multithreading processor, and a simultaneous multithreading processor – each slot is either empty or filled by thread 1, 2, 3 or 4; only SMT may fill the issue slots of a single cycle from several threads]
ASCI winterschool H.C.-P.K. 79
Instruction Fetch Policies
• The instruction fetch policy decides from which threads to fetch each cycle.
• Performance and throughput are highly sensitive to the instruction fetch policy.
• The "standard" icount policy fetches from the thread with the fewest instructions in the front-end.
• The performance of a thread depends on the policy as well as the workload, and becomes highly unpredictable.
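A sketch of the icount idea; the per-thread occupancy numbers below are invented, and a real implementation would count instructions in the decode/rename/issue queues in hardware:

# icount fetch policy sketch: each cycle, fetch from the thread that currently
# has the fewest instructions in the front-end. Occupancy numbers are invented.
def icount_select(front_end_counts):
    """front_end_counts: {thread_id: #instructions in decode/rename/queues}"""
    return min(front_end_counts, key=front_end_counts.get)

print(icount_select({0: 12, 1: 5, 2: 9, 3: 5}))   # -> 1 (first thread with the minimum)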
ASCI winterschool H.C.-P.K. 80
Resource Allocation in SMT
• Better to perform dynamic resource allocation to drive instruction fetch.
• DCRA outperforms icount in many cases.
• It is possible to use resource allocation to guarantee a certain percentage of single-thread performance.
• This improves predictability and hence the suitability of SMT for real-time embedded systems.
ASCI winterschool H.C.-P.K. 81
Future Processor Components
• The new TriMedia has a deep pipeline, L1 and L2 caches, and branch prediction.
• META is a (simple) simultaneous multithreaded architecture.
• Calistro is an embedded multi-processor platform for mobile applications.
• Imagine (Stanford) combines operation-level (VLIW) and data-level (SIMD) parallelism.
• The TRIPS (UT Austin / IBM) and SCALE (MIT) processors combine task-, operation- and data-level parallelism.
ASCI winterschool H.C.-P.K. 82
Summary and Conclusions
ILP architectures have great potential
• Superscalars
  – binary-compatible upgrade path
• VLIWs
  – very flexible ASIPs
• TTAs
  – avoid control and datapath bottlenecks
  – completely compiler controlled
  – very good cost-performance ratio
  – low power
• Multi-threading
  – surpasses the exploitable ILP in applications
  – how to choose threads?