intel itanium ia64 - merced barbora petrtýlová tomáš kubeš ls 2002/2003 presentaiton for e36aps...

Intel Itanium IA64 - Merced

Barbora Petrtýlová, Tomáš Kubeš

LS 2002/2003, Presentation for E36APS

presented: 24. 4. 2003

Introduction

Basics

• Brand new Intel architecture (designed from the ground up)
• Not compatible with x86
• 64-bit (only 44-bit physical addressing, 54-bit virtual addressing)
• RISC + superscalar
• EPIC (Explicitly Parallel Instruction Computing)
• Speeds: 733 and 800 MHz

Performance

• EPIC technology enables up to 20 operations per clock (peak)
• BUT it needs optimized code
• First tests: running x86 code (SW emulation) gave performance on the level of a 150 MHz Pentium I (hence the nickname "Itanic")
• Generally a server or workstation processor; up to 32 processors can be in one machine (512 processors can work together)

HW Overview I

Massive HW resources
• 17 execution units
• 128 integer registers
• 128 floating point registers
• 64 predicate registers
• 8 branch registers
• Supports register stacking, register rotation, predication, branch hints, speculation, parallelism
• Processor also contains its own ROM (CPU information) and EEPROM (can be programmed)

HW Overview II

• Execution: 6 instructions per clock (effective value up to 20 operations)
• 10-stage pipeline
• 4 ALU, 4 multimedia ALU, 4 FP (up to 8 FP ops/cycle), 2 load/store, 3 branch units
• Allows instruction templates, which can increase the effective number of executed operations
• Cache: 32 KB (split) L1 / 96 KB L2 / 2-4 MB L3
• Transistor count: 25 million transistors in the CPU; 300 million in the cache

Instruction Set & Predicating

Basics

• Instruction set: based on classical RISC, but offering new instructions for branch prediction, prefetching, and parallelism
• Has SIMD instructions (one instruction operating on multiple single-precision FP or integer data)
• Instruction set architecture: "revolutionary"; each triplet of instructions is packed into a bundle, which can give the instructions special properties
• Each instruction has a predicate, a reference to a predicate register; if its value is 0, the instruction is carried out as a NOP

Instruction Format

[qp] mnemonic [comp1] [comp2] dest src
• qp: qualifying predicate register
• mnemonic: instruction name
• comp: completer (qualifier)
• dest, src: destination and source registers, generally 3 in total

Bundle (128 bits):
| instruction 2 (41 bits) | instruction 1 (41 bits) | instruction 0 (41 bits) | template (5 bits) |

EPIC "architecture"
• The compiler knows about parallelism.
• The compiler supports parallelism and is able to express it.
• Dependent instructions in each bundle need to be explicitly marked; groups of instructions that can be executed in parallel need to be explicitly delimited by a cycle break.
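The bundle layout above (three 41-bit instruction slots plus a 5-bit template, 128 bits in total) can be sketched in a few lines of Python. This is only an illustration: it assumes the template occupies the lowest 5 bits with instruction 0 directly above it, and the helper names `pack_bundle` and `unpack_bundle` are hypothetical, not part of any Intel tool.

```python
SLOT_BITS = 41      # width of one instruction slot
TEMPLATE_BITS = 5   # width of the template field


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        # Slot i starts right above the template and the lower slots.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    )
    return template, slots
```

Packing and then unpacking returns the original fields, and 5 + 3 * 41 = 128 confirms that the four fields fill the bundle exactly.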

Instruction Bundles

• The machine fetches two bundles each clock.
• Each bundle can have its own template.
• Depending on the template, an instruction can represent more than one operation.

Ex:
Standard: up to 8 ops (2 branches, 2 load/store, 2 ALU, 2 post-increment)
Scientific: 12 ops (4 DP load, 4 DP FP, 2 ALU, 2 post-increment)
Digital content: 20 ops, SIMD (8 SP load, 8 SP FP, 2 ALU, 2 post-increment)
* There are 10 template formats altogether.

Predicating

• Itanium has 64 1-bit predicate registers.
• Each physical instruction has a predicate.
• The predicate determines whether the instruction is executed normally or as a NOP.
• This makes it possible to implement program branching without jumps, which disturb the pipeline so much.

Predicating - Motivation I

• Why predicate? Let's see an example! (Note: the example does not correspond exactly to a real Itanium program; that would make it too complicated.)

Ex:
    some independent instructions A
    add r2, r3, r4 ;;
    cmp r2, r7 ;;
    je equal
    some instructions B
equal:
    some instructions C

(";;" marks that the next instruction depends on this one)

• What happens if the branch is taken, but the processor predicted that it would not be?

Predicating - Motivation II

• On a pipelined processor, taking a mispredicted branch usually means flushing the whole pipeline!
• For Itanium, it would also mean emptying buffers and queues = throwing away 9*6 instructions from the pipeline + those from the buffer, which could mean losing up to 200 effective operations (not counting the probable need for a memory access)
• Tests show that 5%-10% of wrong predictions can decrease performance by 25%!

NIGHTMARE OF ALL HW DEVELOPERS
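The cost argument above can be made concrete with a rough model. The sketch below is an assumption-laden illustration, not a reproduction of the tests the slide refers to: it charges a fixed flush penalty to every mispredicted branch, and the parameters `base_ipc` and `branch_fraction` are hypothetical workload inputs rather than Itanium measurements.

```python
def lost_issue_slots(penalty_cycles=9, issue_width=6):
    """Issue slots thrown away by one pipeline flush (9 cycles * 6 instructions)."""
    return penalty_cycles * issue_width


def mispredict_slowdown(base_ipc, branch_fraction, mispredict_rate, penalty_cycles=9):
    """Run-time ratio (with mispredictions / without) under a simple CPI model.

    base_ipc        -- instructions per cycle achieved without any misprediction
    branch_fraction -- fraction of instructions that are branches
    mispredict_rate -- fraction of branches that are mispredicted
    """
    base_cpi = 1.0 / base_ipc
    extra_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return (base_cpi + extra_cpi) / base_cpi
```

For example, `mispredict_slowdown(2, 0.2, 0.1)` models a program that sustains 2 instructions per cycle, has 20% branches, and mispredicts 10% of them. The wider the machine (higher `base_ipc`), the larger the relative loss per misprediction, which is why the penalty matters so much for a 6-wide core.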

Predicating - Motivation III

• So what would Itanium do? (assuming an optimizing compiler)
- Use scheduling and predicating!

Predicating transforms this:

    some indep. instructions A
    add r2, r3, r4 ;;
    cmp r2, r7 ;;
    je equal
    some instructions B
    jmp next
equal:
    some instructions C
next:
    other instructions D

into this:

    add r2, r3, r4
    some indep. instructions A
    cmp.eq p1, p2 = r2, r7
    some more indep. insts.
    (p2) some instructions B
    (p1) some instructions C
    other instructions D

• This costs only the few issue slots of the instructions whose predicate turns out to be 0.
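The effect of predication can be imitated with a tiny interpreter. The sketch below is a simplified model written for this summary, not Itanium semantics in full: a compare writes a predicate and its complement, and every instruction whose qualifying predicate is 0 behaves as a NOP. The function names and the register values 100 and 200 are made up for illustration.

```python
def cmp_eq(preds, p_true, p_false, a, b):
    """Model of cmp.eq p_true, p_false = a, b: set a predicate and its complement."""
    preds[p_true] = int(a == b)
    preds[p_false] = int(a != b)


def run_predicated(instructions, regs, preds):
    """Run (qp, dest, fn) triples; an instruction with predicate 0 acts as a NOP."""
    for qp, dest, fn in instructions:
        if preds[qp]:
            regs[dest] = fn(regs)
    return regs


def if_converted(r3, r4, r7):
    """Branch-free version of: r2 = r3 + r4; r8 = 100 if r2 == r7 else 200."""
    regs = {"r2": 0, "r3": r3, "r4": r4, "r7": r7, "r8": 0}
    preds = [0] * 64
    preds[0] = 1  # p0 always reads as 1, so "(p0)" means unconditional
    run_predicated([(0, "r2", lambda r: r["r3"] + r["r4"])], regs, preds)
    cmp_eq(preds, 1, 2, regs["r2"], regs["r7"])
    # Both arms are issued; exactly one of p1/p2 is set, so only one takes effect.
    run_predicated(
        [(1, "r8", lambda r: 100), (2, "r8", lambda r: 200)], regs, preds
    )
    return regs["r8"]
```

Both arms always flow through the pipeline, so there is no branch to mispredict; the price is the issue slots spent on the arm that is squashed.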

Predicating - Effects

• Removes unpredictable branches

• The basic block of parallel instructions grows

• ILP in a block increases

• Thus it allows much better resource utilization

Predicating - Conclusion

• Predicating means that both outcomes of a branch are put into execution and the wrong one is discarded
• The reduction of the branch misprediction penalty is HUGE
• This effect is significant for short branches
• Drawback: instructions for both outcomes need to be fetched, so this method is effective only for short branches

Instruction Pipeline

INSTRUCTION PIPELINE: BLOCK DIAGRAM

[Figure: block diagram of the instruction pipeline]

INSTRUCTION PIPELINE: DESCRIPTION

• 10-stage in-order core pipeline
• executes up to 6 instructions in parallel each cycle
• 2-3 stages shorter than Pentium III but 2-3 stages longer than Alpha 21264
• queue of instructions: instruction fetch continues even when the execution part is stalled, and vice versa
• 8 bundles of instructions (24 instructions) in the queue
  - enough to cover a resteer
  - insufficient to fully cover a misprediction

INSTRUCTION PIPELINE: FRONT-END

• IPG - address calculation
• FET - instruction cache access
• ROT - instruction rotation
Instruction fetch and instruction delivery into a decoupling buffer in ROT; the bold line is the point of decoupling.

INSTRUCTION DELIVERY

• EXP - expand
• REN - register rename/remapping
Dispersal and register renaming.

INSTRUCTION PIPELINE: OPERAND DELIVERY

• WLD - word-line decode
• REG - register read
Operand delivery.

EXECUTION CORE

• EXE - instruction execution
• DET - exception detection
• WRB - writeback
Wide parallel execution, followed by exception management and retirement.

• Itanium™ does not change the order of instructions, BUT they may finish in a different order

Instruction Fetching & Jump Prediction

Instruction Fetching

Fetch:
• Itanium has 16 KB of L1 instruction cache (4-way set associative).
• Fetches 2 bundles (6 instructions) per clock.
• They are fed to the decoupling buffer (holds 8 bundles).
• From the decoupling buffer, they are sent to the instruction issue and register rename logic, depending on the availability of resources (and on possible dependencies).
• Instructions can only be issued in order.

Instruction Prefetching

Prefetch:
• Itanium has sophisticated prediction logic (its principles are explained two slides later)
• Probable target addresses can be stored in 4 target address registers
• Itanium tries to speculatively fetch the instructions for the likely branch outcome into the decoupling buffer
• Prefetch can also be initiated by SW; it can probe whether the instructions for a branch target are in the L1 cache
• SW-prefetched instructions are taken from the L2 cache, filled into a streaming buffer, and eventually stored in the L1I cache

Why Predict?

• The penalty for a wrong branch prediction is very high: 9 cycles are lost = 9*6 instructions; a memory access might be required = major slowdown
• Short branches can be fully eliminated by predicating, but long ones cannot (not effective)
• Longer branches need to be accurately predicted; tests show that 5%-10% of wrong predictions can decrease performance by 25%!
• So Itanium is equipped with complex prediction means, both at the HW and at the compiler level

HW for Branch Prediction

• 4 TAR registers: they store the branch target address and the branch instruction address. The latter is compared with the current state of the Program Counter; when the PC reaches this value, the instruction at the target address is fetched in the next cycle
• 8-entry RSB: return stack buffer, to know where to return to from procedure calls
• 512-entry BHT: branch history table (20 Kbit). Itanium does not store branches designated as statically predicted, which increases BHT efficiency
• 64-entry BTAC: branch target address cache
• 2 branch address calculation units
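Of the structures listed above, the return stack buffer is the simplest to picture in code. The following is a minimal sketch, assuming a plain LIFO of 8 return addresses in which the oldest entry is silently dropped on overflow; the overflow policy and the class name `ReturnStackBuffer` are assumptions, since the slides do not describe them.

```python
from collections import deque


class ReturnStackBuffer:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=8):
        # A deque with maxlen discards the oldest entry when a new one is
        # appended to a full buffer (assumed overflow behaviour).
        self._stack = deque(maxlen=depth)

    def on_call(self, return_address):
        """Remember where the matching return should resume."""
        self._stack.append(return_address)

    def predict_return(self):
        """Predicted target of the next return, or None if the buffer is empty."""
        return self._stack.pop() if self._stack else None
```

With call chains deeper than the buffer, the innermost returns are still predicted correctly and only the outermost ones fall back to the slower prediction paths.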

Branch Prediction I

• Itanium employs a hierarchy of branch prediction structures to deliver high-accuracy predictions
• It is assisted by branch hint directives: branch predict instructions (brp) and hints directly in the branch instruction code
• These provide: the branch target address, a static hint, and an indication of where to use dynamic prediction
• The machine provides 4 progressive predictions and corrections to the fetch pointer:
  - Resteer 1: single-cycle, using the address from the compiler-fed Target Address Registers (Itanium has 4 TARs)
  - Resteer 2: two-level multiway predictor and return predictor
  - Resteer 3, 4: branch address calculation and prediction (BAC1, BAC2)

Branch Prediction II

• The compiler is able to load entries into the BTAC
• A branch that is in the BTAC or RSB goes directly to the pipeline (fetch)
• If a branch is in the BHT but not in the BTAC, its target address needs to be calculated in one of the BACs (branch address calculators)
• If a branch is not in the BHT, the BAC uses information from the static prediction hint
• BAC1 is able to detect the end of a loop (suppresses the TAR)
• BAC2 can compute the target address of any branch

[Figure: branch prediction structures along the front-end stages IPG, FET, ROT, EXP: I-cache and I-TLB, 8-bundle instruction queue, 4-entry TAR, 512-entry BHT, 64-entry MBHT, 64-entry BTAC, 8-entry RSB, BAC1, BAC2]

Reminder:
• TAR: target address register
• RSB: return stack buffer
• BHT: branch history table
• BTAC: branch target address cache

Branch Prediction

• Using a prediction from the BTAC or RSB causes a 1-cycle bubble in instruction fetch
• BAC1 causes a 2-cycle bubble
• BAC2 even a 3-cycle bubble

BUT:
• Instruction fetch is decoupled from instruction execution, so bubbles can usually be compensated by instructions waiting in the decoupling buffer
• So the pipeline only stops if the prediction was wrong, since the true result of a branch is known in the DET stage

Branch Prediction Conclusion

• The missed-branch penalty is high for Itanium
• So Itanium possesses powerful tools for branch prediction: various prediction hints at the compiler level and complex prediction logic at the HW level
• In most cases this ensures that the proper instructions for the branch outcome are fed to the pipeline before the true result is known
• Thus the pipeline only needs to stop when the prediction was wrong, which is a rare case if the code is optimized properly

Instruction Queue & Execution, Work with Registers

INSTRUCTION QUEUE

• buffer between the first (instruction fetch) and the second (execution) part of the pipeline; one part can keep working even if the other is stalled
• the queue is sized for 8 bundles (groups of instructions)
• up to 2 bundles can be selected in each cycle
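How such a queue hides fetch bubbles can be shown with a cycle-by-cycle toy model. This is a sketch under simplifying assumptions (bundles fetched in a cycle become available to the back end in the next cycle, and the function name `simulate_queue` is hypothetical); it is not a model of the real Merced timing.

```python
def simulate_queue(fetch_per_cycle, issue_per_cycle, capacity=8):
    """Toy model of the decoupling buffer between fetch and execution.

    fetch_per_cycle[i] -- bundles the front end can deliver in cycle i (0 to 2)
    issue_per_cycle[i] -- bundles the back end is able to accept in cycle i (0 to 2)
    Returns (bundles issued in each cycle, bundles left in the queue).
    """
    occupancy = 0
    issued = []
    for fetched, wanted in zip(fetch_per_cycle, issue_per_cycle):
        # The back end drains bundles that were already waiting in the queue.
        out = min(wanted, occupancy)
        occupancy -= out
        # The front end refills the queue, limited by the free space.
        occupancy += min(fetched, capacity - occupancy)
        issued.append(out)
    return issued, occupancy


# Back end stalled for 3 cycles, then a 2-cycle fetch bubble (cycles 4 and 5).
issued, left = simulate_queue(
    fetch_per_cycle=[2, 2, 2, 2, 0, 0, 2, 2],
    issue_per_cycle=[0, 0, 0, 2, 2, 2, 2, 2],
)
```

In this run the back end keeps issuing two bundles per cycle straight through the fetch bubble, because the queue filled up while execution was stalled.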

INSTRUCTION EXECUTION

• in the execution stage of the pipeline
• in-order instruction execution
• the instruction stream is divided into so-called instruction groups
• the end of an instruction group is defined in the template part of the bundles
• if the instructions selected from the instruction queue belong to the same instruction group:
  - they are assigned to functional units
  - operands are chosen for those instructions
  - renaming/remapping of registers (if necessary) is performed
  - the contents of the specific (now renamed!) registers are loaded
  - the instruction is executed
  - possible exceptions are handled and misprediction is checked
  - the result is written

WORK WITH REGISTERS

• large number of registers: register file

AVAILABLE REGISTERS

• 128 integer registers (64 bits + 1 NaT bit)
• 128 floating point registers
• 64 predicate registers (1 bit: fire/do not fire)
• 8 branch registers
• 128 application registers
• CPUID registers

WORK WITH REGISTERS: GENERAL PURPOSE REGISTERS

• 65 bits: 64 bits for data and one NaT bit
• accessible from all privilege levels
• 2 subgroups:
static GRs
  - GR0 – GR31: visible and shared by all subprograms (procedures)
  - GR0 has the permanent value 0
stacked GRs
  - GR32 – GR127
  - register stack frame

WORK WITH REGISTERS: FLOATING POINT REGISTERS

• IA-64 fully implements the IEEE 754 standard
• accessible from all privilege levels
• usage: floating point operations
• 2 subgroups:
static FRs
  - FR0 – FR31
  - FR0 and FR1 have the permanent values 0.0 and +1.0, respectively
rotating FRs
  - FR32 – FR127
  - can be renamed -> faster loop execution

WORK WITH REGISTERS: PREDICATE REGISTERS

• accessible from all privilege levels
• usage: store the results of compare instructions
• 2 subgroups:
static PRs
  - PR0 – PR15
  - PR0: if used as a source operand, it has the permanent value 1; if used as a destination operand, the result of the operation is ignored
rotating PRs
  - PR16 – PR63
  - can be renamed -> faster loop execution

WORK WITH REGISTERS: BRANCH REGISTERS

• usage: store information about branching of the program
• accessible from all privilege levels

INSTRUCTION POINTER

• holds the address of the bundle containing the currently executing IA-64 instruction
• its value can be read directly but not modified directly
• the lowest four bits are zero

CURRENT FRAME MARKER

• describes the current state of the stack frame
• cannot be read or written directly

CPU IDENTIFICATION REGISTERS (CPUID)

• their number is greater than or equal to 4
• registers 0 – 3 contain information about the processor

WORK WITH REGISTERS: APPLICATION REGISTERS

• usage: special-purpose registers for specific functions
• kernel registers (AR0 – AR7)
• previous function state register (AR64)
• loop count register (AR65)
• etc.

USER MASK

• information about the addressing method and the arrangement of addressable units in memory
• byte order of multi-byte units (big-endian, little-endian) and user-defined performance monitors

PERFORMANCE MONITOR DATA REGISTERS

• information about instruction execution
• can be set only by privileged instructions
• read-only access from all privilege levels

WORK WITH REGISTERS: ROTATING REGISTERS

• general purpose (GR32 – GR127), floating-point (FR32 – FR127), and predicate registers (PR16 – PR63); the RRB value for each type is stored in the CFM register
• enable more effective loop execution
• loop execution: sequential or parallel
• problem with parallel execution: each loop iteration works with the same registers -> loop unrolling
• most modern processors use so-called register renaming
  - on the hardware level: usually complicated
  - on the software level: done by the compiler -> may lead to code growth
• IA-64 uses loop unrolling but reduces the code growth by rotating registers
• in the loop, offsets to a base (stored in the Register Rename Base, RRB) are used instead of "absolute" register numbers; the value of this base is decremented after each rotation

WORK WITH REGISTERS: ROTATING REGISTERS

• one loop iteration: if a value A is stored in some register X, then after one rotation the value will be in register X+1; if X is the highest register, for example general purpose register GR127, then the value A will be located in register GR32 after one iteration

ex. simple loop (the register "rXX" in the code represents GRXX):

loop: ld4 r34=[r10],4    ; load 4 B into "r34", address is stored in r10
      st4 [r11]=r36,4    ; store the value loaded into "r34" two iterations earlier
                         ; to the address stored in r11
      br.ctop loop       ; decrement loop counter and branch

note: "r34" denotes only an offset; in reality (if rrb.gr=40) it could be register GR74

• the hardware automatically renames registers to improve software loop performance without the additional overhead required in traditional models
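The renaming rule behind the loop example can be sketched as a small model. It assumes the full range GR32 to GR127 rotates (96 registers) and that the physical register is found by adding the rename base to the offset, which is consistent with the note that "r34" with rrb.gr=40 is GR74. The class and function names are hypothetical, and the two guards in the loop stand in for the stage predicates that real software-pipelined code would use.

```python
class RotatingFile:
    """Rotating part of a register file (virtual registers 32..127)."""

    FIRST = 32

    def __init__(self, size=96):
        self.size = size
        self.physical = [0] * size
        self.rrb = 0  # register rename base

    def physical_number(self, virtual):
        """Physical register that a virtual register name currently maps to."""
        return self.FIRST + (virtual - self.FIRST + self.rrb) % self.size

    def read(self, virtual):
        return self.physical[self.physical_number(virtual) - self.FIRST]

    def write(self, virtual, value):
        self.physical[self.physical_number(virtual) - self.FIRST] = value

    def rotate(self):
        """One rotation: the base is decremented, so the value in rX shows up in rX+1."""
        self.rrb = (self.rrb - 1) % self.size


def pipelined_copy(source):
    """Copy a list the way the ld4/st4/br.ctop loop does, two iterations apart."""
    regs = RotatingFile()
    dest = []
    # Two extra iterations drain the values still travelling from r34 to r36.
    for i in range(len(source) + 2):
        if i < len(source):
            regs.write(34, source[i])    # ld4 r34=[r10],4
        if i >= 2:
            dest.append(regs.read(36))   # st4 [r11]=r36,4
        regs.rotate()                    # br.ctop loop
    return dest
```

The loop body never changes its register names, yet each iteration works on a fresh physical register; this is what lets the load of one iteration overlap with the store of an earlier one without unrolling the code.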

WORK WITH REGISTERS: REGISTER STACK

• easier handling of subprograms
• registers GR32 – GR127
• each subroutine (procedure) has a set of registers reserved for itself: the stack frame (0 – 96 registers)
• achieved by register renaming
• appears to be infinite from the outside!
• Register Stack Engine (RSE)
• the stack frame is generally divided into two sections:
  - local data part
  - output data part
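The frame mechanism can be sketched as follows. The model assumes that on a call the callee's frame starts at the caller's output section, so the caller's outputs become the callee's first registers, and it uses an unbounded dictionary in place of the physical registers plus the RSE's spilling to memory. The class `RegisterStack` and its method names are hypothetical.

```python
class RegisterStack:
    """Toy model of stacked registers: every procedure sees its frame at r32."""

    def __init__(self):
        self.physical = {}  # stands in for physical registers + RSE backing store
        self.base = 0       # physical slot currently named r32
        self.locals = 0     # size of the local part of the frame
        self.outputs = 0    # size of the output part of the frame
        self.saved = []     # frames of the callers

    def alloc(self, locals_, outputs):
        """Resize the current frame (a frame holds at most 96 registers)."""
        if locals_ + outputs > 96:
            raise ValueError("frame larger than 96 registers")
        self.locals, self.outputs = locals_, outputs

    def _slot(self, reg):
        if not 32 <= reg < 32 + self.locals + self.outputs:
            raise IndexError("register outside the current frame")
        return self.base + reg - 32

    def read(self, reg):
        return self.physical.get(self._slot(reg), 0)

    def write(self, reg, value):
        self.physical[self._slot(reg)] = value

    def call(self):
        """Enter a procedure: its r32 is the caller's first output register."""
        self.saved.append((self.base, self.locals, self.outputs))
        self.base += self.locals
        self.locals = 0  # the callee starts with only the caller's outputs visible

    def ret(self):
        """Return: restore the caller's frame."""
        self.base, self.locals, self.outputs = self.saved.pop()
```

A caller that allocates two locals and one output (r32, r33, r34) passes an argument by writing r34; after `call()` the callee reads it as r32, and the caller's locals are untouched when `ret()` restores its frame.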

Instructions Load/Store, Speculative Execution

LOAD AND STORE INSTRUCTIONS: LOAD

• transfers data into general-purpose (GR) and floating-point (FR) registers and possibly into floating-point pairs (pairs of floating-point registers)
• general register loads: data of size 1, 2, 4 and 8 bytes can be transferred
• floating-point loads:
  - single precision (4 B)
  - double precision (8 B)
  - double-extended precision (10 B)
  - single precision pair (8 B)
  - double precision pair (16 B)
• load instructions can be speculative

LOAD AND STORE INSTRUCTIONS: STORE

• the opposite of loads: data is transferred from registers to memory
• NOT defined for floating-point pairs!
• data of the same sizes as for loads can be stored (except for single and double precision pairs)
• all store instructions are NON-speculative

SPECULATIVE EXECUTION

• reduces the impact of memory latency
• at compile time, the compiler places an instruction that is to be executed speculatively earlier in the code than the original position of that instruction
• speculative instructions are those that can be executed speculatively
• each instruction that stores its results into GRs or FRs can be speculative
• an instruction that modifies other registers is non-speculative
• speculation as implemented in Itanium has two forms

SPECULATIVE EXECUTION: CONTROL SPECULATION

• optimisation in which an instruction is executed before the dynamic flow of the program reaches the place where its result is needed; intended for instructions with long latency
• the rest of the program is executed in parallel
• basic necessary condition: when a speculatively executed instruction raises an exception, the exception is deferred and the erroneous state is marked in the target register:
  - if it is a general purpose register, its NaT bit is set to 1
  - if it is a floating point register, the NaT value is stored in it
• implemented by placing the speculative instruction (e.g. ld.s) in the code before the original instruction (in this case ld) and placing an instruction that checks the result of the speculative execution (chk.s) at the original position

SPECULATIVE EXECUTION: CONTROL SPECULATION

• the result(s) of a speculative instruction can be used by other speculative instructions
• exception token: the mark of a deferred exception is passed on to the results of such dependent instructions
• usually applied to instructions in different branches of a program, which are then executed speculatively before the branch itself
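The ld.s / chk.s pairing can be imitated in a few lines. This is a simplified model, assuming that memory is a dictionary and that a missing key plays the role of a faulting address; the names `ld_s`, `add`, `chk_s` and `load_plus_one` are illustrative helpers, not real instructions' full semantics.

```python
from dataclasses import dataclass


@dataclass
class Register:
    value: int = 0
    nat: bool = False  # "Not a Thing": marks a deferred exception


def ld_s(memory, address):
    """Speculative load: a faulting access sets NaT instead of raising."""
    if address not in memory:
        return Register(nat=True)
    return Register(memory[address])


def add(a, b):
    """Arithmetic on a NaT operand yields NaT, so the token propagates."""
    if a.nat or b.nat:
        return Register(nat=True)
    return Register(a.value + b.value)


def chk_s(reg, recovery):
    """Check at the original position: run the recovery code if the token is set."""
    return recovery() if reg.nat else reg


def load_plus_one(memory, address):
    # Hoisted part: executed early, possibly before a guarding branch.
    early = add(ld_s(memory, address), Register(1))

    def recovery():
        # Non-speculative re-execution; a real fault surfaces here (KeyError).
        return add(Register(memory[address]), Register(1))

    return chk_s(early, recovery).value
```

When the address is valid the early result is used directly; when it is not, the exception appears only at the check, that is, only if the program really reaches the point where the original load stood.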

SPECULATIVE EXECUTION: DATA SPECULATION

• optimisation which enables speculative execution of instructions whose operands could depend on the results of other (non-speculative) instructions -> data dependencies
• example: a store instruction precedes a load instruction which we would like to execute speculatively, ahead of the store
  - question: how do we know whether the result of the load depends on the store, which now follows it?
  - ALAT: Advanced Load Address Table (Merced: the ALAT has 32 entries and is indexed by a 7-bit register ID)
• no deferral of exceptions
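The role of the ALAT can be sketched as a table that remembers advanced loads and forgets them when a later store touches the same bytes. The model below is an assumption-based illustration: it keys entries by target register, compares full address ranges, and evicts the oldest entry when the 32 entries are used up, none of which is claimed to match the real Merced organisation.

```python
class ALAT:
    """Toy Advanced Load Address Table."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = {}  # target register -> (address, size in bytes)

    def advanced_load(self, reg, address, size):
        """Record a load that was moved above a possibly conflicting store."""
        self.table.pop(reg, None)
        if len(self.table) >= self.entries:
            # Assumed replacement policy: drop the oldest entry.
            self.table.pop(next(iter(self.table)))
        self.table[reg] = (address, size)

    def store(self, address, size):
        """Every store removes the entries whose bytes it overlaps."""
        self.table = {
            reg: (a, s)
            for reg, (a, s) in self.table.items()
            if a + s <= address or address + size <= a
        }

    def check(self, reg):
        """True if the advanced load is still valid; False means reload."""
        return reg in self.table
```

A load into "r5" from address 0x100 survives a store to 0x200 but is invalidated by a store to 0x102, in which case the check fails and the load has to be repeated at its original position.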

SPECULATIVE EXECUTION: SUMMARY

• improves performance by allowing the compiler to schedule load instructions ahead of branches and stores to reduce memory latency
• the basis of the mechanism is the physical execution of an instruction before its actual location in the program; the results of the instruction are already known at the moment they are needed, which speeds up program execution for time-consuming instructions
• if the result of the speculatively executed instruction is incorrect -> the instruction must be executed again at its actual location in the program

Conclusion

Itanium IA64 - Merced:
• Processor designed from the ground up, based on the latest knowledge (design thoughts started in about 1998), using many new techniques and having new features: 64-bit, massive HW resources, superscalar execution (6 instructions/clock), RISC & EPIC, register renaming + stacking, sophisticated jump prediction, predication...
• The drawback is that code optimization is vital, absolutely necessary.

• It was quite big piece to handle – even for

References

• [1] Intel: Itanium Product Brief (www.intel.com)
• [2] Intel: Itanium Hardware Developer's Manual (www.intel.com)
• [3] Intel: Itanium Data Sheet (www.intel.com)
• [4] Intel: IA64 Assembler: User's Guide (www.intel.com)
• [5] Intel: Understanding Itanium Architecture, presentation slides (www.intel.com)

Presentation can be downloaded also from:www.tomaskubes.net/download.html
