TKT-3526 Processor Design – Compilation for TTA
Posted on 11-Mar-2019
2
Compilation for TTA
• Compilation is the process of translating an algorithm written in a High Level Language (HLL) into machine code
• The compiler must preserve the meaning of the program being compiled
• The compiler should improve the input program in some way
• It consists of several stages:
  • Processing source code in the HLL
  • Optimizing the intermediate representation
  • Generating target machine code
3
Processing source code
• The task is to verify that the source code is correct in the given HLL
• Lexical analysis:
  • Is this sequence of characters a keyword of the language?
  • Is this sequence of characters a variable or an operator?
  • Is this sequence of characters a number or a string?
  • …
• whille (0a < 0.23zf9) printff ("Correct!");
  • Not valid C code – spelling errors, malformed literals …
• The final result is a sequence of tokens
• There are tools that generate lexical analyzers (Flex, JLex, …)
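The token-recognition step can be sketched with a small regex-based tokenizer – a toy stand-in for what a generator like Flex would produce. The token names and the tiny keyword set here are assumptions for illustration only:

```python
import re

# Token patterns tried in order; keywords are tested before identifiers,
# so "while" becomes a KEYWORD but the misspelled "whille" stays an IDENT.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:while|if|return)\b"),
    ("NUMBER",  r"\d+\.\d+|\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[<>+\-*/=(){};]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Turn a character sequence into a sequence of (kind, text) tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":       # whitespace carries no meaning here
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Note how the spelling error from the slide is not flagged by the lexer itself – "whille" is still a well-formed identifier; only the parser can discover it is not a valid statement.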
4
Processing source code
• Syntactic analysis (parsing) analyzes the sequence of tokens to test whether it is grammatically correct
• while ( a < 0.23) printf; ("Correct!")
• a = 12.23 +
• Wrong code: the C grammar requires operator + to have two arguments
• printf needs arguments to form a function call, and ; should appear afterwards …
• There are tools that generate parsers (Bison, Yacc, ANTLR, …)
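The dangling-operator error from the a = 12.23 + example can be caught by even a minimal hand-written expression parser. This is a sketch only – the token-list interface is assumed, and a real Bison/Yacc grammar works very differently:

```python
def parse_expr(tokens):
    """Parse 'term (+ term)*' over a list of token strings.
    Raises SyntaxError when an operator is missing its second operand."""
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in "+-":
            raise SyntaxError(f"operand expected at position {pos}")
        pos += 1                         # consume one operand token

    term()
    while pos < len(tokens) and tokens[pos] in "+-":
        pos += 1                         # consume the operator ...
        term()                           # ... which must be followed by an operand
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return True
```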
5
Processing source code
• Semantic analysis adds semantic information and builds the symbol table
  • Type checking
  • Object binding
  • …
• The final internal representation (parse tree) is then ready for optimization and code generation
• These three phases (lexical, syntactic and semantic analysis) form the compiler frontend – the language-dependent part
6
Optimizations
• Optimizations on the internal representation can be target independent:
  • Variables are replaced by constants if the value is known
  • Unreachable code (e.g. after a return) is removed
  • Function inlining
  • Loop transformations (e.g. unrolling)
  • …
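The first transformation listed – replacing variables by constants when the value is known – can be sketched on a toy three-address IR. The tuple format and the restriction to + and * are assumptions made for this illustration:

```python
def fold_constants(ir):
    """Propagate known constant values through a tiny three-address IR.
    Each instruction is (dest, op, a, b); operands are ints or variable names.
    Returns the remaining instructions and the constants discovered."""
    known = {}                               # variable -> constant value
    out = []
    for dest, op, a, b in ir:
        a = known.get(a, a)                  # substitute known constants
        b = known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = a + b if op == "+" else a * b   # fold at compile time
        else:
            known.pop(dest, None)            # dest is no longer a known constant
            out.append((dest, op, a, b))
    return out, known
```

In the test below, x and y are folded away entirely; only the instruction depending on the unknown variable w survives, with y already replaced by 20.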
7
Target dependent optimizations
• Up to this point, the work is done in the LLVM compiler, independently of the target TTA architecture
• Once the target TTA is defined, the following still needs to be done:
  • Instruction selection
  • Register allocation
  • Instruction scheduling
8
Instruction Selection
• The target TTA architecture description defines which Function Units are available and what operations they can perform
• Below is the abstract syntax tree of the expression:
• x * 2 + z * 2 * y

            +
          /   \
         *     *
        / \   / \
       x   2 *   y
            / \
           z   2
9
Instruction Selection (TTA)
• The instruction selector finds, in the target instruction set, operations for the expressions in the syntax tree (operation pattern matching)
• This can include custom operations (MAC, ADDSUB)
• Nonexistent operations are replaced with workarounds:
  • The code uses floating point multiplication, but there is no such operation in the instruction set
  • Convert it to a function call to a software implementation
10
Instruction Selection (TTA)
• We can map the tree below to 3 operations:
  • ADD
  • SUB
  • MUL
• Or, in case the TTA has ADDSUB, to 2:
  • ADDSUB
  • MUL

          *
         / \
        +   -
       / \ / \
      x  y x  y

  (x + y) * (x - y)
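Covering the tree with a custom ADDSUB operation can be sketched as bottom-up tree pattern matching. The nested-tuple tree encoding and the specific pattern test here are illustrative assumptions, not the actual matcher used by the compiler:

```python
def select(node):
    """Bottom-up instruction selection on an expression tree.
    A node is ('op', left, right) or a leaf variable name; returns the
    list of target operations chosen to cover the tree."""
    if isinstance(node, str):
        return []                            # leaves need no operation
    op, l, r = node
    ops = select(l) + select(r)
    # Custom-op pattern: x+y and x-y over the same operands feeding '*'
    # collapse into one ADDSUB, saving an operation.
    if (op == "*" and not isinstance(l, str) and not isinstance(r, str)
            and l[0] == "+" and r[0] == "-" and l[1:] == r[1:]):
        return ["ADDSUB", "MUL"]
    return ops + [{"+": "ADD", "-": "SUB", "*": "MUL"}[op]]
```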
11
Register Allocation
• The number of registers present in hardware is limited
• The number of variables in the processed source code is not (for practical purposes there is some limit)
• We need to map the variables in the code to hardware registers
• The thing to consider is the live range of each variable:

  a = b + c
  d = a + b
  a = c + d

• The value of a changes in line 3; it is a new variable and can be placed in a different register than the original a
• Variable c is needed in the first and third lines
• In the second line we need 4 registers to store all of the variables
• Optimal register allocation can be done by graph coloring
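The graph-coloring idea can be sketched as follows: variables are nodes, an edge joins two variables whose live ranges overlap, and each color is a register. This greedy version is a simplification – real allocators add spilling, coalescing and careful node ordering:

```python
def color_interference_graph(edges, variables, num_regs):
    """Greedy coloring of an interference graph.
    An edge (a, b) means a and b are live at the same time and must get
    different registers; colors 0..num_regs-1 stand for hardware registers."""
    neighbors = {v: set() for v in variables}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    allocation = {}
    for v in variables:                      # fixed order; real heuristics vary
        taken = {allocation[n] for n in neighbors[v] if n in allocation}
        allocation[v] = next(r for r in range(num_regs) if r not in taken)
    return allocation
```

If no free color is found for some node, the greedy step fails (StopIteration) – that is exactly the point where a real allocator would spill a variable to memory.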
12
Register Allocation
• Registers are usually grouped together into a register file
• The number of ports of the register file limits the number of simultaneous register reads and writes!
• With many operations in parallel, a huge number of ports is required
• Multiported register files are expensive – area, energy, …
• A common solution:
  • Split the registers into multiple register files
  • It is then not sufficient to allocate a register to a variable; register file allocation is also required
  • A single register file may be connected to only a subset of the function units (creating clusters)
13
Instruction Scheduling
• Once the instructions and registers are selected, they need to be put into an order for the target machine to execute
• The simplest solution is to follow the order in the source code
  • Guarantees the same semantics as the source
  • Operations are executed in the same order as the programmer wrote them
• The problem is inefficiency
  • Loading or storing a value from/to memory is slow
  • The processor could do something else at that time!
• This is common in "out of order" execution processors
  • Hardware analyzes a short sequence of instructions and issues them in such a way as to shorten the execution time
14
Instruction Scheduling
• Processors that do not have out of order execution require a different mechanism
• Static scheduling of operations is the compiler's job
• Reorder operations, compared to the order in the source code, to shorten execution time
• Another aspect – instruction level parallelism
  • Some processors allow issuing several operations per cycle – this is not how the code is written in the source
• Dependence analysis in the compiler is required
  • Find how the variables depend on each other
  • Schedule operations that can be performed independently in parallel
15
Compiling for TTA
• So far, we have a frontend processing source code and performing some target independent optimizations
• We have a backend selecting instructions based on the instruction set of the defined architecture
• We have registers and register files allocated based on the number and location of registers in the target architecture
• We have dependences between operations analyzed, and operations issued in parallel where possible
• Is this sufficient for compiling for TTA?
16
NO
• With TTAs, connectivity is also a resource and needs to be allocated. And then there are those moves.
[Figure: an example TTA architecture – an ALU function unit { add, sub, and, ior, min, minu, xo… }, a shifter function unit { shl, shr, shru }, two register files (RF1: 16x32, RF2: 16x16) and a global control unit { jump, call }, all connected to transport buses 0–3]
17
Compiling for TTA, the steps
• LLVM (Low Level Virtual Machine) works as the frontend compiler
  • Processes input written in an HLL and creates the intermediate representation
  • Does a lot of target independent optimizations, guided by compiler flags provided by the TTA developer team
• LLVM also works as the middle end
  • Accepts a particular TTA architecture as a plugin and performs instruction selection and register allocation
• The resulting output is still missing instruction scheduling!
  • Scheduling is performed in the TTA backend (together with a few more optimizations)
18
Instruction format of TTA
• A TTA can be characterized as a one instruction computer
• There really is only one actual instruction:
  • Transport from A to B (also known as a move)
• The number of transport buses determines the maximum number of parallel moves
• When encoded for the target architecture, the instruction is the set of moves defined for each of the buses

  add r1, r2, r3
  mul r1, r3, r1
  sub r5, r3, r1

  vs

  r2 -> add.1; r3 -> add.2;
  add.r -> mul.1; r3 -> mul.2;
  mul.r -> sub.1; r3 -> sub.2;
  sub.r -> r5; nop;
19
Instruction format of TTA
• An instruction for a machine with 6 buses, with 6 moves:

  r1 -> add.1 | nop | 12 -> add.2 | sub.r -> r5 | nop | sub.r -> r8

• An instruction for a machine with 6 buses, with 2 moves and a large immediate encoded in the remaining bits of the instruction:

  r12 -> mul.1 | nop | 4372895748
20
Steps to perform Instruction Scheduling 1 (CFG)
• Analyze the intermediate representation generated so far (instructions are selected, variables are assigned to registers)
• Construct the Control Flow Graph (CFG)
  • This graph captures changes in control flow inside a function – if/else statements, loops, calls
  • It splits the code into graph nodes – Basic Blocks
  • Basic Block – a sequence of code with a single entry and a single exit – no control flow inside a Basic Block
  • The graph connects Basic Blocks with edges representing the flow of execution
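Splitting a linear instruction list into Basic Blocks can be sketched with the classic "leader" rule: the first instruction, every branch target, and every instruction after a branch starts a new block. The instruction encoding used here is an assumption for illustration:

```python
def find_basic_blocks(instrs):
    """Split a linear instruction list into basic blocks.
    An instruction is ('op', target); 'jump'/'branch' ends a block and its
    integer target index starts a new one (a 'leader')."""
    leaders = {0}                            # the function entry is a leader
    for i, (op, target) in enumerate(instrs):
        if op in ("jump", "branch"):
            if target is not None:
                leaders.add(target)          # a jump target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)           # the fall-through starts a block
    order = sorted(leaders)
    # Each block runs from its leader up to the next leader (or the end).
    return [list(range(s, e)) for s, e in zip(order, order[1:] + [len(instrs)])]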
21
CFG continues
• Equivalent of:

  if (condition1 is true) {          // block 15-27
      do {                           // block 28-55
          statements;
      } while (condition2 is true);
  }
  statements;                        // block 56-61
  return;

[Figure: the corresponding CFG – Entry leads via Jump_normal to block 15-27; its FallThrough_true edge enters the loop body 28-55 and its Jump_false edge skips to block 56-61; block 28-55 loops back to itself via Jump_BackEdge_true and leaves via FallThrough_false to block 56-61, which reaches Exit via Jump_normal]
22
Steps to perform Instruction Scheduling 2 (DDG)
• Now that we have identified Basic Blocks without control flow inside, we can look at dependences

  add r1, r2, r3
  mul r4, r1, r3
  sub r5, r2, r3

• Register r1 is defined in the first statement and used in the second, so we cannot change the order of those two statements or compute them in parallel
• But the third statement does not use values computed in the first or second statement, so it can be executed in parallel with either of them!

  add r1, r2, r3 ; sub r5, r2, r3
  or
  mul r4, r1, r3 ; sub r5, r2, r3
23
DDG continues
• There are several types of register dependences
  • A register is read after it is written (Read After Write)
  • A register is overwritten by a new value (Write After Write)
  • A register must be read before it is overwritten (Write After Read)
  • A register is read after it is read (Read After Read) – this one is not relevant for us
• We define the same dependences for memory locations as well
• We represent individual statements as nodes in a graph and dependences as edges between statements
• In the case of TTA, the statements are single moves
24
DDG continues

[Figure: the data dependence graph of one function as produced by the compiler – each node is a single move (e.g. sp -> add.1, iarg1 -> ldw.1, gt.3 -> bool, ! 56 -> jump.1), and each edge is labeled with its dependence type and the resource it concerns (e.g. R_raw:universal_integer_rf.1, R_war:…, R_waw:…, M_raw:, M_waw:, RA_raw:RA, R_G_raw:universal_boolean_rf.0); loop-carried edges carry a LOOP:1_ prefix]
25
• The actual task of instruction scheduling is simple
• Order the nodes in such a way that the dependences are not broken
• The simplest way would be to start at the top and use topological sort to get the order in which the nodes should be written into the instructions
• But would it run on the target processor?
• No.
• The schedule is constrained by the data dependences as well as by the resources available
• So for each node we need to check whether it can be scheduled in the position (instruction) where we want it!
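The dependence-only ordering mentioned above is a plain topological sort of the DDG. Here is a sketch using Kahn's algorithm; as the slide warns, this gives a legal order but ignores resources entirely:

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: emit DDG nodes so that every dependence edge
    (a, b) places a before b. Resource constraints are NOT considered."""
    indegree = {n: 0 for n in nodes}
    succs = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
        indegree[b] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)   # no pending deps
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in succs[n]:
            indegree[s] -= 1
            if indegree[s] == 0:          # all its dependences are satisfied
                ready.append(s)
    return order
```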
26
[Figure: the example TTA architecture repeated – ALU and shifter function units, register files RF1 (16x32) and RF2 (16x16), control unit { jump, call }, transport buses 0–3]
27
• Check whether there is a function unit providing the required operation available in the given cycle
• Check whether there are buses available for transporting the operands and the result at the required cycles
  • Check whether the sources and destinations of the moves are actually connected by some bus
  • Is that bus available in the given cycle?
• The goal is still to generate the fastest code possible
  • Smallest number of instructions – packing moves into instructions as tightly as possible
28
Modeling resources
• The interaction between the instruction scheduler and the resource manager is as follows
  • The instruction scheduler finds a candidate node for scheduling and a cycle in which the node can be scheduled respecting the dependences
  • The scheduler checks with the Resource Manager whether the node can be scheduled in hardware
• What does this mean in practice?
  • Our compiler needs to create a model of the target architecture
  • It needs to keep a record of the status of resource assignments in each cycle of the partially constructed schedule
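The scheduler/resource-manager handshake can be sketched as a per-cycle occupancy table. This is a toy model with only function units and buses; the class name and interface are assumptions, not the compiler's actual API:

```python
class ResourceManager:
    """Track per-cycle usage of function units and buses.
    The scheduler proposes (cycle, fu, bus); the assignment is committed
    only when both resources exist and are free in that cycle."""

    def __init__(self, fus, buses):
        self.fus, self.buses = set(fus), set(buses)
        self.used = {}                      # cycle -> set of busy resources

    def can_assign(self, cycle, fu, bus):
        busy = self.used.get(cycle, set())
        return (fu in self.fus and bus in self.buses
                and fu not in busy and bus not in busy)

    def assign(self, cycle, fu, bus):
        if not self.can_assign(cycle, fu, bus):
            return False                    # scheduler must try another cycle
        self.used.setdefault(cycle, set()).update({fu, bus})
        return True
```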
29
Resource Modeling
• What does this mean in practice?
• Finding an optimal schedule based on data dependences alone can be done easily (simply use topological sort)
• This is however hypothetical – it works only on a processor with unlimited resources
• In the real case, instruction scheduling spends half of its execution time analyzing the graph and half checking resource availability
30
[Figure: a fragment of a final schedule across cycles 0–9 – e.g. cycle 0 transports integer5.0 -> fu22.o0 and 1 -> fu22.trigger.shru; later cycles move the shift result fu22.r0 -> integer3.0, feed fu24 (gtu) with integer3.0 and float2.31, transport the comparison result fu24.r0 -> float2.31 -> boolean0.1, and finally issue the guarded jump ?boolean0.1 2206 -> gcu.trigger.jump]