
1

TKT-3526 Processor Design

Compiling for TTA Architectures

Vladimir Guzma

2

Compilation for TTA

•  Compilation is the process of translating an algorithm written in a High Level Language (HLL) into machine code
•  The compiler must preserve the meaning of the program being compiled
•  The compiler must improve the input program in some way

•  It consists of several stages:
•  Processing source code in the HLL
•  Optimizing the intermediate representation
•  Generating target machine code

3

Processing source code

•  The task is to verify that the source code is correct in the given HLL

•  Lexical analysis:
•  Is the sequence of characters a keyword of the language?
•  Is the sequence of characters a variable, an operator?
•  Is the sequence of characters a number/string?
•  …
•  whille (0a < 0.23zf9) printff (“Correct!”);
•  Not valid C code: spelling errors, malformed numbers…
•  The final result is a sequence of tokens

•  There are tools to generate lexical analyzers (Flex, JLex, …); a toy classifier is sketched below
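As a rough illustration (not part of the original slides), a hand-written classifier for single lexemes might look like this in C; the token names and the helper function are invented for the sketch:

    #include <ctype.h>
    #include <string.h>

    /* Token kinds a minimal C-like lexer could distinguish. */
    enum token { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_ERROR };

    /* Classify one lexeme: "while" is a keyword, "whille" is merely an
       identifier, and "0a" or "0.23zf9" are malformed numbers. */
    enum token classify(const char *s)
    {
        if (strcmp(s, "while") == 0)
            return TOK_KEYWORD;
        if (isdigit((unsigned char)s[0])) {
            for (const char *p = s; *p; ++p)
                if (!isdigit((unsigned char)*p) && *p != '.')
                    return TOK_ERROR;    /* digits followed by letters */
            return TOK_NUMBER;
        }
        return TOK_IDENT;                /* e.g. "whille", "printff" */
    }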

4

Processing source code

•  Syntactic analysis (parsing) checks whether the sequence of tokens is grammatically correct
•  while ( a < 0.23) printf; (“Correct!”)
•  a = 12.23 +
•  Wrong code: the C grammar requires the operator + to have two arguments
•  printf needs arguments to form a function call, and ; should appear afterwards …
•  There are tools to generate parsers (Bison, Yacc, ANTLR, …); a sketch of one hand-written rule follows below
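As a sketch of what one such grammar rule looks like when written by hand (all helper functions below are hypothetical, standing in for a real lexer interface):

    /* Hypothetical lexer interface assumed for this sketch. */
    int  peek(void);           /* look at the next token         */
    void next_token(void);     /* consume one token              */
    int  at_end(void);         /* nonzero at end of input        */
    void parse_primary(void);  /* parse a variable or a constant */
    void error(const char *msg);

    /* One recursive-descent rule for additive expressions: every '+'
       must be followed by a right operand, so "a = 12.23 +" fails. */
    void parse_additive(void)
    {
        parse_primary();                     /* left operand  */
        while (peek() == '+') {
            next_token();                    /* consume '+'   */
            if (at_end())
                error("operator + requires two arguments");
            parse_primary();                 /* right operand */
        }
    }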

5

Processing source code

•  Semantic analysis adds semantic information and builds the symbol table
•  Type checking
•  Object binding
•  …

•  The final internal representation (parse tree) is ready for optimization and code generation

•  These three phases (lexical, syntactic and semantic analysis) form the compiler frontend – the language-dependent part

6

Optimizations

•  Optimizations on the internal representation can be target independent (a small example follows below)
•  Variables are replaced by constants if the value is known
•  Unreachable code (e.g. after a return) is removed
•  Inlining
•  Loop transformations (unrolling)
•  …
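A small C example of the kinds of rewrites listed above (the function is invented for illustration):

    int f(void)
    {
        int x = 4;          /* value known at compile time           */
        int y = x * 2;      /* constant propagation + folding: y = 8 */
        return y;
        y = 0;              /* unreachable after return: removed     */
    }

    /* After these target independent optimizations the body is
       effectively just:  return 8;  */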

7

Target dependent optimizations

•  Up to this point, the work is done in the LLVM compiler, independent of the target TTA architecture

•  Once the target TTA is defined, the following still needs to be done:
•  Instruction selection
•  Register allocation
•  Instruction scheduling

8

Instruction Selection

•  The target TTA architecture description defines which Function Units are available and what operations they can perform

•  Below, the abstract syntax tree of the expression:

•  x * 2 + z * 2 * y

              +
            /   \
           *     *
          / \   / \
         x   2 *   y
              / \
             z   2

9

Instruction Selection (TTA)

•  The instruction selector finds, from the target instruction set, operations for the expressions in the syntax tree (operation pattern matching)
•  This can include custom operations (MAC, ADDSUB)
•  Replacing nonexistent operations with workarounds

•  The code uses floating point multiplication, but there is no such operation in the instruction set

•  Convert it into a function call to a software implementation (sketched below)
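For example, a floating point multiplication can be lowered to a call; the symbol below follows the common libgcc naming convention, but the exact routine depends on the runtime library of the target:

    /* Source code: */
    float scale(float a, float b) { return a * b; }

    /* With no floating point multiplier in the instruction set, the
       selector emits a call to a software implementation instead: */
    float __mulsf3(float a, float b);   /* soft-float multiply routine */
    float scale_lowered(float a, float b) { return __mulsf3(a, b); }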

10

Instruction Selection (TTA)

•  We can map the tree below onto 3 operations:
•  ADD
•  SUB
•  MUL
•  Or, in case the TTA has ADDSUB:
•  ADDSUB
•  MUL

            *
          /   \
         +     -
        / \   / \
       x   y x   y

      (x + y) * (x - y)

11

Register Allocation

•  The number of registers present in hardware is limited
•  The number of variables in the processed source code is not limited (for practical purposes there is some limit)
•  We need to map variables in the code to hardware registers
•  A thing to consider – the live ranges of variables:

   a = b + c
   d = a + b
   a = c + d

•  The value of a changes on line 3; it is a new value and can be placed in a different register than the original a
•  Variable c is needed on the first and third lines
•  On the second line we need 4 registers to store all of the variables
•  Optimal register allocation can be done by graph coloring (a greedy sketch follows below)
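A minimal greedy-coloring sketch in C, using the interference from the three-line example above; the matrix and register count are assumptions of the sketch, and real allocators (e.g. Chaitin-Briggs) add spilling and coalescing:

    #include <stdio.h>

    #define NVARS 4   /* a, b, c, d from the example */
    #define NREGS 4

    int main(void)
    {
        /* At "d = a + b" all four values overlap, so every pair of
           variables interferes and 4 registers are needed. */
        int interferes[NVARS][NVARS] = {
            /*        a  b  c  d */
            /* a */ { 0, 1, 1, 1 },
            /* b */ { 1, 0, 1, 1 },
            /* c */ { 1, 1, 0, 1 },
            /* d */ { 1, 1, 1, 0 },
        };
        int color[NVARS];
        for (int v = 0; v < NVARS; v++) {
            int used[NREGS] = {0};
            for (int u = 0; u < v; u++)       /* colors taken by      */
                if (interferes[v][u])         /* interfering earlier  */
                    used[color[u]] = 1;       /* variables            */
            color[v] = -1;                    /* -1 would mean spill  */
            for (int r = 0; r < NREGS; r++)
                if (!used[r]) { color[v] = r; break; }
            printf("var %c -> r%d\n", 'a' + v, color[v]);
        }
        return 0;
    }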

12

Register Allocation

•  Registers are usually grouped together into a register file
•  The number of ports of a register file limits the number of simultaneous register reads and writes! (see the sketch after this list)
•  With many operations in parallel -> a huge number of ports is required
•  Multiported register files are expensive – area, energy …
•  Common solution:
•  Split the registers into multiple register files
•  It is not sufficient to allocate a register to a variable; register file allocation is also required
•  A single register file can be connected to only a subset of the function units (creating clusters)
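A tiny sketch of how a compiler can track read-port pressure per cycle; the port count and array bound are assumptions for illustration:

    #include <stdbool.h>

    #define MAXCYCLES  1024
    #define READ_PORTS 2      /* assume an RF with 2 read ports */

    static int reads_in_cycle[MAXCYCLES];

    /* Can another read from this register file start in this cycle? */
    bool can_read(int cycle)  { return reads_in_cycle[cycle] < READ_PORTS; }

    /* Record a scheduled read so later moves see the port as taken. */
    void note_read(int cycle) { reads_in_cycle[cycle]++; }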

13

Instruction Scheduling

•  Once the instructions and registers are selected, they need to be put into an order for the target machine to execute

•  The simplest solution – follow the order in the source code
•  Guarantees the same semantics as the source
•  Operations are executed in the same order as the programmer wrote them
•  The problem is inefficiency
•  Loading or storing a value from/to memory is slow
•  The processor could do something else at that time!
•  This is common in “out of order” execution processors
•  Hardware analyses a short sequence of instructions and issues them in such a way as to shorten the execution time

14

Instruction Scheduling

•  Processors that do not have out of order execution require a different mechanism
•  Static scheduling of operations is work for the compiler
•  Reorder operations, compared to their order in the source code, to shorten execution time
•  Another aspect – instruction level parallelism

•  Some processors allow issuing several operations per cycle – this is not how the code is written in the source

•  Dependence analysis in the compiler is required
•  Find how the variables depend on each other
•  Schedule operations that can be performed independently in parallel

15

Compiling for TTA

•  So far, we have a frontend processing source code and performing some target independent optimizations

•  We have a backend selecting instructions based on the instruction set of the defined architecture

•  We have registers and register files allocated based on the number and location of registers in the target architecture

•  We have dependences between operations analyzed, and we issue operations in parallel where possible

•  Is this sufficient for compiling for TTA?

16

NO

•  With TTAs, connectivity is also a problem and needs to be allocated. And then there are those moves.

[Figure: example TTA machine – FU ALU { add, sub, and, ior, min, minu, xo… }, FU shifter { shl, shr, shru }, RF RF1 (16x32), RF RF2 (16x16), GCU control_unit { jump, call }, all connected to transport buses 0–3]

17

Compiling for TTA, the steps

•  LLVM (Low Level Virtual Machine) works as the frontend compiler.
•  Processes input written in an HLL and creates the intermediate representation.
•  Does a lot of target independent optimizations, guided by compiler flags provided by the TTA developer team.

•  LLVM also works as the middle end.
•  Accepts a particular TTA architecture as a plugin and performs instruction selection and register allocation.

•  The resulting output is still missing instruction scheduling!
•  Scheduling is performed in the TTA backend (together with a few more optimizations).

18

Instruction format of TTA

•  TTA can be characterized as a one instruction computer
•  There really is only one actual instruction:
•  Transport from A to B (also known as a move)
•  The number of transport buses determines the maximum number of parallel moves
•  When encoded for the target architecture, an instruction is the set of moves defined for each of the buses

   Add r1, r2, r3          r2 -> add.1;    r3 -> add.2;
   Mul r1, r3, r1    vs    add.r -> mul.1; r3 -> mul.2;
   Sub r5, r3, r1          mul.r -> sub.1; r3 -> sub.2;
                           sub.r -> r5;    nop;

19

Instruction format of TTA

•  Instruction for a machine with 6 buses, with 6 moves:

   r1 -> add.1   nop   12 -> add.2   sub.r -> r5   nop   sub.r -> r8

•  Instruction for a machine with 6 buses, with 2 moves and a large number encoded in the remaining bits of the instruction:

   r12 -> mul.1   nop   4372895748

20

Steps to perform Instruction Scheduling 1 (CFG)

•  Analyze the intermediate representation generated so far (instructions are selected, variables are assigned to registers)
•  Construct the Control Flow Graph (CFG)

•  This graph captures changes in control flow inside a function – if/else statements, loops, calls

•  It splits the code into graph nodes – Basic Blocks
•  Basic Block – a sequence of code with a single entry and a single exit – no control flow inside a Basic Block (the leader rule below sketches how blocks are found)
•  The graph connects Basic Blocks with edges representing the flow of execution
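A minimal sketch of the classic “leader” rule for finding Basic Block boundaries; the instruction accessors are hypothetical:

    #include <stdbool.h>

    #define MAXI 1024
    bool leader[MAXI];

    extern int  ninstr;                 /* number of instructions        */
    extern bool is_branch(int i);       /* jump/call/conditional branch? */
    extern int  branch_target(int i);   /* index the branch goes to      */

    void mark_leaders(void)
    {
        leader[0] = true;                         /* first instruction  */
        for (int i = 0; i < ninstr; i++) {
            if (is_branch(i)) {
                leader[branch_target(i)] = true;  /* target starts a BB */
                if (i + 1 < ninstr)
                    leader[i + 1] = true;         /* fall-through too   */
            }
        }
        /* each Basic Block runs from one leader up to the next one */
    }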

21

CFG continues

•  Equivalent of:

   if (condition1 is true) {        // Basic Block 15-27
       do {                         // Basic Block 28-55
           statements;
       } while (condition2 is true);
   }
   statements;                      // Basic Block 56-61
   return;

[Figure: the corresponding CFG – Entry -> 15-27 (Jump_normal); 15-27 -> 28-55 (FallThrough_true) and -> 56-61 (Jump_false); 28-55 -> 28-55 (Jump_BackEdge_true) and -> 56-61 (FallThrough_false); 56-61 -> Exit (Jump_normal)]

22

Steps to perform Instruction Scheduling 2 (DDG)

•  Now that we have identified Basic Blocks without control flow inside, we can look at dependencies:

   add r1, r2, r3
   mul r4, r1, r3
   sub r5, r2, r3

•  Register r1 is defined in the first statement and used in the second, so we cannot change the order of those two statements or compute them in parallel

•  But the third statement does not use values computed in the first or second statement, so it can be executed in parallel with either of them!

   add r1, r2, r3 ; sub r5, r2, r3         add r1, r2, r3
   mul r4, r1, r3                    or    mul r4, r1, r3 ; sub r5, r2, r3

23

DDG continues

•  There are several types of register dependencies:
•  A register is read after it is written (Read After Write)
•  A register is overwritten by a new value (Write After Write)
•  A register must be read before it is overwritten (Write After Read)
•  A register is read after it is read (Read After Read) – this one is not relevant for us
•  Similarly, we define the same dependencies for memory locations
•  We represent individual statements as nodes in a graph and dependencies as arrows between statements (a minimal test is sketched below)
•  In the case of TTA, the statements are single moves

24

DDG continues

[Figure: data dependence graph of a complete small function. Nodes are individual moves (e.g. sp -> add.1, ra -> stw.2, add.3 -> sp, ? 28 -> jump.1); edges carry dependence labels such as R_raw, R_war, R_waw for registers, M_raw, M_war, M_waw for memory, RA_raw/RA_war for the return address, loop-carried variants prefixed LOOP:1_, and operand/trigger edges O:ADD, O:MUL, O:GT, O:SHL, O:LDW, T]

25

•  The actual task of instruction scheduling is simple:
•  Order the nodes in such a way that the dependencies are not broken
•  The simplest way would be to start at the top and use a topological sort to get the order in which the nodes should be written into instructions

•  But would it run on the target processor?
•  No.
•  The schedule is constrained by the data dependencies as well as by the resources available
•  So we need to check, for each node, whether it can be scheduled in the position (instruction) where we want it! (see the list-scheduling sketch below)
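A skeleton of the list-scheduling loop this implies, in C; all helper functions are hypothetical stand-ins for the DDG and the resource manager described on the following slides:

    #include <stdbool.h>

    #define MAXN 256
    extern int  nnodes;
    extern bool scheduled[MAXN];
    extern bool deps_scheduled(int n);       /* all DDG predecessors placed? */
    extern int  earliest_cycle(int n);       /* first cycle allowed by deps  */
    extern bool resources_fit(int n, int c); /* FU, bus and RF ports free?   */
    extern void place(int n, int c);         /* commit node n to cycle c     */

    void list_schedule(void)
    {
        int done = 0;
        while (done < nnodes) {
            for (int n = 0; n < nnodes; n++) {
                if (scheduled[n] || !deps_scheduled(n))
                    continue;
                int c = earliest_cycle(n);
                while (!resources_fit(n, c))  /* push later until HW allows */
                    c++;
                place(n, c);
                scheduled[n] = true;
                done++;
            }
        }
    }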

26

[Figure: the example TTA machine again – FU ALU { add, sub, and, ior, min, minu, xo… }, FU shifter { shl, shr, shru }, RF RF1 (16x32), RF RF2 (16x16), GCU control_unit { jump, call }, transport buses 0–3]

27

•  Check if a function unit providing the required operation is available in the given cycle

•  Check if there are buses available for transporting the operands and the result at the required cycles
•  Check if the sources and destinations of the moves are actually connected by some bus
•  Is that bus available in the given cycle?

•  The goal is still to generate the fastest code possible
•  Smallest number of instructions – packing moves into instructions as tightly as possible

28

Modeling resources

•  The interaction between instruction scheduling and the resource manager is as follows:
•  The instruction scheduler finds a candidate node for scheduling and finds a cycle in which the node can be scheduled respecting the dependencies
•  The scheduler checks with the Resource Manager whether the node can be scheduled in hardware

•  What does this mean in practice?
•  Our compiler needs to create a model of the target architecture
•  It needs to keep a record of the status of the resource assignments in each cycle already partially scheduled (a sketch of such bookkeeping follows below)
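A sketch of the per-cycle bookkeeping such a resource model can keep; the sizes and field names are invented for illustration:

    #include <stdbool.h>

    #define MAXCYCLES 1024
    #define NBUSES    4
    #define NFUS      3

    struct cycle_state {
        bool bus_busy[NBUSES];   /* at most one move per bus per cycle */
        bool fu_busy[NFUS];      /* is the FU's slot already taken?    */
    };

    static struct cycle_state model[MAXCYCLES];

    /* The scheduler's question: can this move use bus b and FU f in cycle c? */
    bool can_assign(int c, int b, int f)
    { return !model[c].bus_busy[b] && !model[c].fu_busy[f]; }

    /* On success, record the assignment so later queries see it. */
    void assign(int c, int b, int f)
    { model[c].bus_busy[b] = true; model[c].fu_busy[f] = true; }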

29

Resource Modeling

•  What does this mean in practice?

•  Finding an optimal schedule based on data dependencies alone can be done easily (simply use a topological sort)

•  This is however hypothetical – it works only on a processor with unlimited resources

•  In the real case, instruction scheduling spends half of its execution time analyzing the graph and half the time checking resource availability

30

[Figure: the DDG after scheduling, laid out over cycles 0–9. Nodes are numbered moves such as 0: integer5.0 -> fu22.o0, 1: 1 -> fu22.trigger.shru, 2: integer2.2 -> float2.31, 3: fu22.r0 -> integer3.0, 4: integer3.0 -> fu24.o0, 5: float2.31 -> fu24.trigger.gtu, 6: fu24.r0 -> float2.31, 7: float2.31 -> boolean0.1, 8: ?boolean0.1 2206 -> gcu.trigger.jump, placed so that the dependence edges (R_raw, R_G_raw, O:SHRU, O:GTU, T) are respected]