TKT-3526 Processor Design – Compilation for TTA
Posted on 11-Mar-2019
2
Compilation for TTA
• Compilation is the process of translating an algorithm written in a High Level Language (HLL) into machine code
• The compiler must preserve the meaning of the program being compiled
• The compiler should improve the input program in some way
• It consists of several stages:
  • Processing source code in the HLL
  • Optimizing the intermediate representation
  • Generating target machine code
3
Processing source code
• The task is to verify that the source code is correct in the given HLL
• Lexical analysis:
  • Is this sequence of characters a keyword of the language?
  • Is this sequence of characters a variable or an operator?
  • Is this sequence of characters a number or a string?
  • …
• whille (0a < 0.23zf9) printff ("Correct!");
  • Not valid C code – spelling errors, malformed literals …
• The final result is a sequence of tokens
• There are tools that generate lexical analyzers (Flex, JLex, …)
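The token-recognition step can be sketched with a small regex-based tokenizer – a toy stand-in for what a generator like Flex would produce. The token names and the tiny keyword set here are assumptions for illustration only:

```python
import re

# Token patterns tried in order; keywords are tested before identifiers,
# so "while" becomes a KEYWORD but the misspelled "whille" stays an IDENT.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:while|if|return)\b"),
    ("NUMBER",  r"\d+\.\d+|\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[<>+\-*/=(){};]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Turn a character sequence into a sequence of (kind, text) tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":       # whitespace carries no meaning here
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Note how the spelling error from the slide is not flagged by the lexer itself – "whille" is still a well-formed identifier; only the parser can discover it is not a valid statement.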
4
Processing source code
• Syntactic analysis (parsing) analyzes the sequence of tokens to test whether it is grammatically correct
• while ( a < 0.23) printf; ("Correct!")
• a = 12.23 +
• Wrong code: the C grammar requires operator + to have two arguments
• printf needs arguments to form a function call, and ; should appear afterwards …
• There are tools that generate parsers (Bison, Yacc, ANTLR, …)
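The dangling-operator error from the a = 12.23 + example can be caught by even a minimal hand-written expression parser. This is a sketch only – the token-list interface is assumed, and a real Bison/Yacc grammar works very differently:

```python
def parse_expr(tokens):
    """Parse 'term (+ term)*' over a list of token strings.
    Raises SyntaxError when an operator is missing its second operand."""
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in "+-":
            raise SyntaxError(f"operand expected at position {pos}")
        pos += 1                         # consume one operand token

    term()
    while pos < len(tokens) and tokens[pos] in "+-":
        pos += 1                         # consume the operator ...
        term()                           # ... which must be followed by an operand
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return True
```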
5
Processing source code
• Semantic analysis adds semantic information and builds the symbol table
  • Type checking
  • Object binding
  • …
• The final internal representation (parse tree) is then ready for optimization and code generation
• These three phases (lexical, syntactic and semantic analysis) form the compiler frontend – the language-dependent part
6
Optimizations
• Optimizations on the internal representation can be target independent:
  • Variables are replaced by constants if the value is known
  • Unreachable code (e.g. after a return) is removed
  • Function inlining
  • Loop transformations (e.g. unrolling)
  • …
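The first transformation listed – replacing variables by constants when the value is known – can be sketched on a toy three-address IR. The tuple format and the restriction to + and * are assumptions made for this illustration:

```python
def fold_constants(ir):
    """Propagate known constant values through a tiny three-address IR.
    Each instruction is (dest, op, a, b); operands are ints or variable names.
    Returns the remaining instructions and the constants discovered."""
    known = {}                               # variable -> constant value
    out = []
    for dest, op, a, b in ir:
        a = known.get(a, a)                  # substitute known constants
        b = known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = a + b if op == "+" else a * b   # fold at compile time
        else:
            known.pop(dest, None)            # dest is no longer a known constant
            out.append((dest, op, a, b))
    return out, known
```

In the test below, x and y are folded away entirely; only the instruction depending on the unknown variable w survives, with y already replaced by 20.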
7
Target dependent optimizations
• Up to this point, the work is done in the LLVM compiler, independently of the target TTA architecture
• Once the target TTA is defined, the following still needs to be done:
  • Instruction selection
  • Register allocation
  • Instruction scheduling
8
Instruction Selection
• The target TTA architecture description defines which Function Units are available and what operations they can perform
• Below is the abstract syntax tree of the expression:
• x * 2 + z * 2 * y

            +
          /   \
         *     *
        / \   / \
       x   2 *   y
            / \
           z   2
9
Instruction Selection (TTA)
• The instruction selector finds, in the target instruction set, operations for the expressions in the syntax tree (operation pattern matching)
• This can include custom operations (MAC, ADDSUB)
• Nonexistent operations are replaced with workarounds:
  • The code uses floating point multiplication, but there is no such operation in the instruction set
  • Convert it to a function call to a software implementation
10
Instruction Selection (TTA)
• We can map the tree below to 3 operations:
  • ADD
  • SUB
  • MUL
• Or, in case the TTA has ADDSUB, to 2:
  • ADDSUB
  • MUL

          *
         / \
        +   -
       / \ / \
      x  y x  y

  (x + y) * (x - y)
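Covering the tree with a custom ADDSUB operation can be sketched as bottom-up tree pattern matching. The nested-tuple tree encoding and the specific pattern test here are illustrative assumptions, not the actual matcher used by the compiler:

```python
def select(node):
    """Bottom-up instruction selection on an expression tree.
    A node is ('op', left, right) or a leaf variable name; returns the
    list of target operations chosen to cover the tree."""
    if isinstance(node, str):
        return []                            # leaves need no operation
    op, l, r = node
    ops = select(l) + select(r)
    # Custom-op pattern: x+y and x-y over the same operands feeding '*'
    # collapse into one ADDSUB, saving an operation.
    if (op == "*" and not isinstance(l, str) and not isinstance(r, str)
            and l[0] == "+" and r[0] == "-" and l[1:] == r[1:]):
        return ["ADDSUB", "MUL"]
    return ops + [{"+": "ADD", "-": "SUB", "*": "MUL"}[op]]
```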
11
Register Allocation
• The number of registers present in hardware is limited
• The number of variables in the processed source code is not (for practical purposes there is some limit)
• We need to map the variables in the code to hardware registers
• The thing to consider is the live range of each variable:

  a = b + c
  d = a + b
  a = c + d

• The value of a changes in line 3; it is a new variable and can be placed in a different register than the original a
• Variable c is needed in the first and third lines
• In the second line we need 4 registers to store all of the variables
• Optimal register allocation can be done by graph coloring
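The graph-coloring idea can be sketched as follows: variables are nodes, an edge joins two variables whose live ranges overlap, and each color is a register. This greedy version is a simplification – real allocators add spilling, coalescing and careful node ordering:

```python
def color_interference_graph(edges, variables, num_regs):
    """Greedy coloring of an interference graph.
    An edge (a, b) means a and b are live at the same time and must get
    different registers; colors 0..num_regs-1 stand for hardware registers."""
    neighbors = {v: set() for v in variables}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    allocation = {}
    for v in variables:                      # fixed order; real heuristics vary
        taken = {allocation[n] for n in neighbors[v] if n in allocation}
        allocation[v] = next(r for r in range(num_regs) if r not in taken)
    return allocation
```

If no free color is found for some node, the greedy step fails (StopIteration) – that is exactly the point where a real allocator would spill a variable to memory.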
12
Register Allocation
• Registers are usually grouped together into a register file
• The number of ports of the register file limits the number of simultaneous register reads and writes!
• With many operations in parallel, a huge number of ports is required
• Multiported register files are expensive – area, energy, …
• A common solution:
  • Split the registers into multiple register files
  • It is then not sufficient to allocate a register to a variable; register file allocation is also required
  • A single register file may be connected to only a subset of the function units (creating clusters)
13
Instruction Scheduling
• Once the instructions and registers are selected, they need to be put into an order for the target machine to execute
• The simplest solution is to follow the order in the source code
  • Guarantees the same semantics as the source
  • Operations are executed in the same order as the programmer wrote them
• The problem is inefficiency
  • Loading or storing a value from/to memory is slow
  • The processor could do something else at that time!
• This is common in "out of order" execution processors
  • Hardware analyzes a short sequence of instructions and issues them in such a way as to shorten the execution time
14
Instruction Scheduling
• Processors that do not have out of order execution require a different mechanism
• Static scheduling of operations is the compiler's job
• Reorder operations, compared to the order in the source code, to shorten execution time
• Another aspect – instruction level parallelism
  • Some processors allow issuing several operations per cycle – this is not how the code is written in the source
• Dependence analysis in the compiler is required
  • Find how the variables depend on each other
  • Schedule operations that can be performed independently in parallel
15
Compiling for TTA
• So far, we have a frontend processing source code and performing some target independent optimizations
• We have a backend selecting instructions based on the instruction set of the defined architecture
• We have registers and register files allocated based on the number and location of registers in the target architecture
• We have dependences between operations analyzed, and operations issued in parallel where possible
• Is this sufficient for compiling for TTA?
16
NO
• With TTAs, connectivity is also a resource and needs to be allocated. And then there are those moves.
[Figure: an example TTA architecture – an ALU function unit { add, sub, and, ior, min, minu, xo… }, a shifter function unit { shl, shr, shru }, two register files (RF1: 16x32, RF2: 16x16) and a global control unit { jump, call }, all connected to transport buses 0–3]
17
Compiling for TTA, the steps
• LLVM (Low Level Virtual Machine) works as the frontend compiler
  • Processes input written in an HLL and creates the intermediate representation
  • Does a lot of target independent optimizations, guided by compiler flags provided by the TTA developer team
• LLVM also works as the middle end
  • Accepts a particular TTA architecture as a plugin and performs instruction selection and register allocation
• The resulting output is still missing instruction scheduling!
  • Scheduling is performed in the TTA backend (together with a few more optimizations)
18
Instruction format of TTA
• A TTA can be characterized as a one instruction computer
• There really is only one actual instruction:
  • Transport from A to B (also known as a move)
• The number of transport buses determines the maximum number of parallel moves
• When encoded for the target architecture, the instruction is the set of moves defined for each of the buses

  add r1, r2, r3
  mul r1, r3, r1
  sub r5, r3, r1

  vs

  r2 -> add.1; r3 -> add.2;
  add.r -> mul.1; r3 -> mul.2;
  mul.r -> sub.1; r3 -> sub.2;
  sub.r -> r5; nop;
19
Instruction format of TTA
• An instruction for a machine with 6 buses, with 6 moves:

  r1 -> add.1 | nop | 12 -> add.2 | sub.r -> r5 | nop | sub.r -> r8

• An instruction for a machine with 6 buses, with 2 moves and a large immediate encoded in the remaining bits of the instruction:

  r12 -> mul.1 | nop | 4372895748
20
Steps to perform Instruction Scheduling 1 (CFG)
• Analyze the intermediate representation generated so far (instructions are selected, variables are assigned to registers)
• Construct the Control Flow Graph (CFG)
  • This graph captures changes in control flow inside a function – if/else statements, loops, calls
  • It splits the code into graph nodes – Basic Blocks
  • Basic Block – a sequence of code with a single entry and a single exit – no control flow inside a Basic Block
  • The graph connects Basic Blocks with edges representing the flow of execution
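Splitting a linear instruction list into Basic Blocks can be sketched with the classic "leader" rule: the first instruction, every branch target, and every instruction after a branch starts a new block. The instruction encoding used here is an assumption for illustration:

```python
def find_basic_blocks(instrs):
    """Split a linear instruction list into basic blocks.
    An instruction is ('op', target); 'jump'/'branch' ends a block and its
    integer target index starts a new one (a 'leader')."""
    leaders = {0}                            # the function entry is a leader
    for i, (op, target) in enumerate(instrs):
        if op in ("jump", "branch"):
            if target is not None:
                leaders.add(target)          # a jump target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)           # the fall-through starts a block
    order = sorted(leaders)
    # Each block runs from its leader up to the next leader (or the end).
    return [list(range(s, e)) for s, e in zip(order, order[1:] + [len(instrs)])]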
21
CFG continues
• Equivalent of:

  if (condition1 is true) {          // block 15-27
      do {                           // block 28-55
          statements;
      } while (condition2 is true);
  }
  statements;                        // block 56-61
  return;

[Figure: the corresponding CFG – Entry leads via Jump_normal to block 15-27; its FallThrough_true edge enters the loop body 28-55 and its Jump_false edge skips to block 56-61; block 28-55 loops back to itself via Jump_BackEdge_true and leaves via FallThrough_false to block 56-61, which reaches Exit via Jump_normal]
22
Steps to perform Instruction Scheduling 2 (DDG)
• Now that we have identified Basic Blocks without control flow inside, we can look at dependences

  add r1, r2, r3
  mul r4, r1, r3
  sub r5, r2, r3

• Register r1 is defined in the first statement and used in the second, so we cannot change the order of those two statements or compute them in parallel
• But the third statement does not use values computed in the first or second statement, so it can be executed in parallel with either of them!

  add r1, r2, r3 ; sub r5, r2, r3
  or
  mul r4, r1, r3 ; sub r5, r2, r3
23
DDG continues
• There are several types of register dependences
  • A register is read after it is written (Read After Write)
  • A register is overwritten by a new value (Write After Write)
  • A register must be read before it is overwritten (Write After Read)
  • A register is read after it is read (Read After Read) – this one is not relevant for us
• We define the same dependences for memory locations as well
• We represent individual statements as nodes in a graph and dependences as edges between statements
• In the case of TTA, the statements are single moves
24
DDG continues

[Figure: the data dependence graph of one function as produced by the compiler – each node is a single move (e.g. sp -> add.1, iarg1 -> ldw.1, gt.3 -> bool, ! 56 -> jump.1), and each edge is labeled with its dependence type and the resource it concerns (e.g. R_raw:universal_integer_rf.1, R_war:…, R_waw:…, M_raw:, M_waw:, RA_raw:RA, R_G_raw:universal_boolean_rf.0); loop-carried edges carry a LOOP:1_ prefix]
25
• The actual task of instruction scheduling is simple
• Order the nodes in such a way that the dependences are not broken
• The simplest way would be to start at the top and use topological sort to get the order in which the nodes should be written into the instructions
• But would it run on the target processor?
• No.
• The schedule is constrained by the data dependences as well as by the resources available
• So for each node we need to check whether it can be scheduled in the position (instruction) where we want it!
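The dependence-only ordering mentioned above is a plain topological sort of the DDG. Here is a sketch using Kahn's algorithm; as the slide warns, this gives a legal order but ignores resources entirely:

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: emit DDG nodes so that every dependence edge
    (a, b) places a before b. Resource constraints are NOT considered."""
    indegree = {n: 0 for n in nodes}
    succs = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
        indegree[b] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)   # no pending deps
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in succs[n]:
            indegree[s] -= 1
            if indegree[s] == 0:          # all its dependences are satisfied
                ready.append(s)
    return order
```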
26
[Figure: the example TTA architecture repeated – ALU and shifter function units, register files RF1 (16x32) and RF2 (16x16), control unit { jump, call }, transport buses 0–3]
27
• Check whether there is a function unit providing the required operation available in the given cycle
• Check whether there are buses available for transporting the operands and the result at the required cycles
  • Check whether the sources and destinations of the moves are actually connected by some bus
  • Is that bus available in the given cycle?
• The goal is still to generate the fastest code possible
  • Smallest number of instructions – packing moves into instructions as tightly as possible
28
Modeling resources
• The interaction between the instruction scheduler and the resource manager is as follows
  • The instruction scheduler finds a candidate node for scheduling and a cycle in which the node can be scheduled respecting the dependences
  • The scheduler checks with the Resource Manager whether the node can be scheduled in hardware
• What does this mean in practice?
  • Our compiler needs to create a model of the target architecture
  • It needs to keep a record of the status of resource assignments in each cycle of the partially constructed schedule
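The scheduler/resource-manager handshake can be sketched as a per-cycle occupancy table. This is a toy model with only function units and buses; the class name and interface are assumptions, not the compiler's actual API:

```python
class ResourceManager:
    """Track per-cycle usage of function units and buses.
    The scheduler proposes (cycle, fu, bus); the assignment is committed
    only when both resources exist and are free in that cycle."""

    def __init__(self, fus, buses):
        self.fus, self.buses = set(fus), set(buses)
        self.used = {}                      # cycle -> set of busy resources

    def can_assign(self, cycle, fu, bus):
        busy = self.used.get(cycle, set())
        return (fu in self.fus and bus in self.buses
                and fu not in busy and bus not in busy)

    def assign(self, cycle, fu, bus):
        if not self.can_assign(cycle, fu, bus):
            return False                    # scheduler must try another cycle
        self.used.setdefault(cycle, set()).update({fu, bus})
        return True
```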
29
Resource Modeling
• What does this mean in practice?
• Finding an optimal schedule based on data dependences alone can be done easily (simply use topological sort)
• This is however hypothetical – it works only on a processor with unlimited resources
• In the real case, instruction scheduling spends half of its execution time analyzing the graph and half checking resource availability
30
[Figure: a fragment of a final schedule across cycles 0–9 – e.g. cycle 0 transports integer5.0 -> fu22.o0 and 1 -> fu22.trigger.shru; later cycles move the shift result fu22.r0 -> integer3.0, feed fu24 (gtu) with integer3.0 and float2.31, transport the comparison result fu24.r0 -> float2.31 -> boolean0.1, and finally issue the guarded jump ?boolean0.1 2206 -> gcu.trigger.jump]