Arun Hariharan (N.M.S.U)

MOTIVATION

Need for high speed computing and Architecture

More complex compilers (JAVA)

Large Database Systems

Distributed Computing on Internet

Peer competition from other manufacturers

SOLUTION

Instruction Level Parallelism (ILP) in general-purpose Microprocessors

Wide floating-point exponents

Register Stack Engine

Hardware exception deferral

Control speculation

Register rotation

Large register files

Data speculation

Predication

Parallel semantics

GOALS OF ARCHITECTURE

Overcome performance limiters :

Branches

Memory Latency

Sequential Program Model

Long Architectural Life

Large Register File

Fully Interlocked Architecture – Not tied to any particular design

No Fixed Issue Width – e.g. instruction groups can be of any length.

REGISTER RESOURCES

• 128 65-bit General Registers (1 KB) (64 data bits + 1 “NaT”, Not a Thing, bit)

• 128 82-bit Floating Point Registers

• Space for up to 128 64-bit special-purpose application registers (1 KB)

• Eight 64-bit branch registers for function call linkage and return

• 64 one-bit predicate registers

INSTRUCTION ENCODING

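Key Words

• Long life

• Instruction bundle

Instruction fields: Op code | Reg 1 | Reg 2 | Reg 3 | Predicate

Field widths: 14 bit | 7 bit | 7 bit | 7 bit | 6 bit = 41 bits

Instruction bundle: three 41-bit instructions plus a 5-bit field = 128 bits

The 5-bit field is also called the Template

• Helps to decode and route instructions

• Marks end of basic block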
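The bit arithmetic of the encoding can be checked with a small packing routine. This is a minimal sketch, assuming a 41-bit instruction made of a 14-bit op code, three 7-bit register numbers and a 6-bit predicate, and a 128-bit bundle of three instructions plus a 5-bit template. The function names are hypothetical, the op code is treated as one opaque field, and the field placement (predicate and template in the low bits) is an assumption of this sketch.

```python
SLOT_BITS = 41       # one instruction
TEMPLATE_BITS = 5    # bundle template


def pack_instruction(opcode, r1, r2, r3, pred):
    """Pack one 41-bit instruction: 14 + 7 + 7 + 7 + 6 bits."""
    if not 0 <= opcode < (1 << 14):
        raise ValueError("op code must fit in 14 bits")
    if any(not 0 <= r < 128 for r in (r1, r2, r3)):
        raise ValueError("register numbers must fit in 7 bits")
    if not 0 <= pred < 64:
        raise ValueError("predicate number must fit in 6 bits")
    return (opcode << 27) | (r3 << 20) | (r2 << 13) | (r1 << 6) | pred


def unpack_instruction(word):
    """Return (opcode, r1, r2, r3, pred) from a 41-bit word."""
    return (
        (word >> 27) & 0x3FFF,
        (word >> 6) & 0x7F,
        (word >> 13) & 0x7F,
        (word >> 20) & 0x7F,
        word & 0x3F,
    )


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS) or len(slots) != 3:
        raise ValueError("need a 5-bit template and exactly three slots")
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle
```

Seven-bit register fields are what allow 128 registers to be named, and the six-bit predicate field selects one of the 64 predicates.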

DISTRIBUTING RESPONSIBILITY

Shift a lot of the complexity to the compiler

Out-of-Order Execution

Control Flow Parallelism

Influencing Dynamic Events – The processor learns hints from the compiler about branch prediction, instruction/data caching and pre-fetching.

ILP – Instruction Level Parallelism

• Sequential in-order execution was not enough to achieve maximum parallelism

• Out-of-order execution – it is the compiler’s task to create instruction groups so that all instructions in an instruction group can be safely executed in parallel

Key Word

• Basic Block
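The idea of a compiler forming instruction groups can be sketched as a greedy pass over the instructions in program order. This is only an illustration, assuming each instruction is described by one destination register and its source registers; it ignores memory dependences, predicates and hardware resource limits, and a real compiler would also reorder instructions. The name form_groups is hypothetical.

```python
def form_groups(instructions):
    """Split (dest, sources) tuples into groups that are safe to run in parallel.

    A new group starts whenever an instruction reads or rewrites a register
    that was already written inside the current group.
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        if dest in written or any(src in written for src in sources):
            groups.append(current)          # close the group at the dependence
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r5",)),        # r4 = r5        (independent of the line above)
    ("r6", ("r1", "r4")),   # r6 = r1 + r4   (needs both earlier results)
    ("r7", ("r6",)),        # r7 = r6        (needs r6)
]
groups = form_groups(program)
```

Here the first two instructions share a group, while the third and fourth each start a new one because they depend on a result from the group before.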

CONTROL FLOW PARALLELISM

Traditional execution

• Compare a and 0

• Check flag if true

• Store flag value for further computation

• Compare b <= 5

• Check flag if true

• Store flag value for further computation

• Compare if any one had set the flag.

• Move 8 to r3

In IA-64

• Initialize p1 to false

• Set compare condition’s prerequisite

• Compare in parallel

• Branch
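The two approaches can be compared side by side in a small model. This is a sketch only: it assumes the condition being evaluated is "a equals 0 or b is at most 5" (the slide does not state the first comparison operator), and both function names are hypothetical.

```python
def traditional(a, b, r3):
    """Serial evaluation: compare, keep a flag, compare again, combine."""
    flag_a = (a == 0)          # compare a and 0, store the flag
    flag_b = (b <= 5)          # compare b <= 5, store the flag
    if flag_a or flag_b:       # check whether any compare set its flag
        r3 = 8                 # move 8 to r3
    return r3


def ia64_style(a, b, r3):
    """Parallel compares that can only turn the shared predicate on."""
    p1 = False                             # initialize p1 to false
    compares = (a == 0, b <= 5)            # independent, so they may run together
    for outcome in compares:
        if outcome:
            p1 = True                      # each compare only ever sets p1
    if p1:                                 # predicated move instead of a branch chain
        r3 = 8
    return r3
```

Because every compare can only set p1 and never clears it, the order in which the compares finish does not matter, which is what lets them execute in the same cycle.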

FINDING AND CREATING PARALLELISM

BRANCHES LIMIT ILP:

• Sequential, no-predict: normal bank teller

• Sequential, predict: fill out slip in advance (predict whether deposit or withdrawal)

• Predicated Execution: fill out both slips, throw away whichever is wrong
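The bank-teller analogy for predicated execution can be written out as follows. This is a rough model, assuming a simple deposit-or-withdrawal choice; the function names are hypothetical, and real predicated instructions suppress their result in hardware rather than through Python conditionals.

```python
def branching_teller(is_deposit, balance, amount):
    """Branch first, then fill out only the slip that is needed."""
    if is_deposit:
        return balance + amount
    return balance - amount


def predicated_teller(is_deposit, balance, amount):
    """Fill out both slips, then keep the one whose predicate is true."""
    p1, p2 = is_deposit, not is_deposit    # one compare produces both predicates
    deposit_slip = balance + amount        # both computations have no branch
    withdrawal_slip = balance - amount     # between them and can overlap
    result = balance
    if p1:                                 # (p1) commit the deposit slip
        result = deposit_slip
    if p2:                                 # (p2) commit the withdrawal slip
        result = withdrawal_slip
    return result
```

The predicated version does more work in total, but it has no branch to mispredict.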

FINDING AND CREATING PARALLELISM (cont..)

Scheduling and Speculation

Moving basic blocks ahead of barriers – it is the compiler’s task, instead of the processor’s, to find a possible route and schedule it.

Use of basic blocks (Define)

Best possible Route – Most predicted flow of program (speculation), not all instructions are executed

Compilers – Have a bird’s-eye view of the program, unlike the processor.

CONTROL SPECULATION

Removing branches – Expensive

Not all can be removed

Moving basic blocks can cause Exceptions

Key Word

• Fix-up Code
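How a deferred exception and fix-up code fit together can be illustrated with a toy model. This is a sketch under stated assumptions: memory is a Python dict, a missing address stands in for a faulting load, the NAT token plays the role of the NaT bit, and all names (LoadFault, speculative_load, increment_if) are hypothetical.

```python
class LoadFault(Exception):
    """Stands in for a hardware exception such as a page fault."""


NAT = object()  # "Not a Thing" token marking a deferred exception


def speculative_load(memory, addr):
    """Speculative load: on a bad address, return NAT instead of faulting."""
    return memory[addr] if addr in memory else NAT


def checked_load(memory, addr):
    """Ordinary load: a bad address raises immediately."""
    if addr not in memory:
        raise LoadFault(f"invalid address {addr}")
    return memory[addr]


def increment_if(memory, addr, condition):
    """Model of 'if condition: return load(addr) + 1' with the load hoisted."""
    value = speculative_load(memory, addr)        # moved above the branch
    result = NAT if value is NAT else value + 1   # NAT propagates through uses
    if not condition:
        return None                               # speculative work is discarded
    if result is NAT:                             # the check finds a deferred fault
        result = checked_load(memory, addr) + 1   # fix-up code redoes the load
    return result
```

A bad address only raises when the guarded path is really taken; if the branch goes the other way, the deferred exception is simply dropped.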

DATA SPECULATION

ALAT – Advanced Load Address Table

Key Word

• Fix-up Code
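The role of the ALAT can be sketched with a small model in which a load is moved above a store that might write the same address. This is an illustration only: the Machine class, its method names and the register name "r8" are hypothetical, and the table is simplified to one address per register.

```python
class Machine:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.alat = {}                       # register name -> address it loaded from

    def advanced_load(self, reg, addr):
        """Advanced load: read early and remember the address in the ALAT."""
        self.alat[reg] = addr
        return self.memory[addr]

    def store(self, addr, value):
        """A store removes every ALAT entry that refers to the same address."""
        self.memory[addr] = value
        self.alat = {r: a for r, a in self.alat.items() if a != addr}

    def check_load(self, reg, addr, value):
        """Check: keep the early value if the entry survived, else reload."""
        if self.alat.get(reg) == addr:
            return value
        return self.memory[addr]             # fix-up: redo the load


def load_hoisted_above_store(machine, store_addr, load_addr):
    x = machine.advanced_load("r8", load_addr)    # load moved above the store
    machine.store(store_addr, 99)                 # may or may not alias the load
    x = machine.check_load("r8", load_addr, x)    # verify at the original position
    return x + 1
```

When the two addresses differ, the early value is used as is; when they alias, the check notices the missing entry and the load is repeated.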

REGISTER MODEL

• 128 64-bit registers, of which 32 are fixed for µP operations (like RISC)

• 96 are free for the compiler to use.

• Unlimited register use is possible, as registers are paged to memory in the background using the RSE (Register Stack Engine)

• “Alloc” specifies the number of registers for locals and outputs (outputs carry parameters to calls).

• Registers are renamed so that a program’s frame always starts at register 32 (range 32 to 127).

RSE (Register Stack Engine)

Automatically saves/restores stack registers without software intervention (Can work synchronously)

• Provides the illusion of infinite physical registers by mapping to a stack of registers saved in memory

• Overflow: Alloc needs more registers than are available

• Underflow: Return needs to restore a frame saved in memory

RSE may be designed to utilize unused memory bandwidth to perform register spill and fill operations in the background (Asynchronously - Speculatively to load and store data)
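Overflow and underflow can be illustrated with a toy register stack. This is a sketch under simplifying assumptions: it spills and fills whole frames rather than individual registers, it does so synchronously, and it ignores the overlap between a caller's output registers and the callee's inputs. The RegisterStack class and its method names are hypothetical.

```python
class RegisterStack:
    def __init__(self, physical=96):
        self.physical = physical      # number of stacked physical registers
        self.frames = []              # frames currently held in registers
        self.backing_store = []       # frames spilled to memory
        self.spills = 0
        self.fills = 0

    def _resident(self):
        return sum(len(frame) for frame in self.frames)

    def alloc(self, size):
        """Called on procedure entry: reserve a frame of `size` registers."""
        if size > self.physical:
            raise ValueError("frame larger than the physical register file")
        while self._resident() + size > self.physical:
            # Overflow: spill the oldest resident frame to memory.
            self.backing_store.append(self.frames.pop(0))
            self.spills += 1
        self.frames.append([0] * size)
        return self.frames[-1]

    def ret(self):
        """Called on return: drop the current frame, restore the caller's."""
        self.frames.pop()
        if not self.frames and self.backing_store:
            # Underflow: the caller's frame was spilled, so fill it back.
            self.frames.append(self.backing_store.pop())
            self.fills += 1
        return self.frames[-1] if self.frames else None
```

The calling code never sees the spill or the fill; it only sees that each frame still holds its values after the callee returns.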

SOFTWARE PIPELINE

Time complexity is calculated by ____ ?

ANS : Big-O notation, e.g. O(n). This notation is used to count time spent in loops, because loops take most of the execution time.

Can we implement loops in parallel ?

ANS : Yes, if we resolve some problems:

• Managing the loop count,

• Handling the renaming of registers for the pipeline,

• Finishing the work in progress when the loop ends,

• Starting the pipeline when the loop is entered, and

• Unrolling to expose cross-iteration parallelism.

IA-64 Solution

• Special architecture

• Loop count LC

• Epilog count EC

• Use of register rename base (rrb)
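How LC, EC and the register rename base work together can be shown with a toy two-stage pipeline that copies a list: stage 1 loads an element and stage 2 stores the element loaded one iteration earlier. This is a simplified model, assuming a single rename base shared by the registers and the stage predicates and a small rotating region of eight entries; the function name pipelined_copy is hypothetical, and the loop branch is only loosely modeled on the IA-64 counted-loop branch.

```python
def pipelined_copy(src, nrot=8):
    """Copy src with a two-stage software pipeline (load, then store)."""
    dst = []
    if not src:
        return dst
    regs = [None] * nrot       # toy rotating registers
    preds = [False] * nrot     # toy rotating stage predicates
    rrb = 0                    # register rename base

    def phys(logical):
        # Map a logical register number to a physical one.
        return (logical + rrb) % nrot

    lc = len(src) - 1          # loop count: iterations still to be started
    ec = 2                     # epilog count: number of pipeline stages
    preds[phys(0)] = True      # start the pipeline: enable stage 1
    next_load = 0

    while True:
        if preds[phys(0)]:                 # stage 1: load into logical register 0
            regs[phys(0)] = src[next_load]
            next_load += 1
        if preds[phys(1)]:                 # stage 2: store last iteration's load
            dst.append(regs[phys(1)])

        # Loop branch: manage LC and EC, then rotate.
        if lc > 0:
            lc -= 1
            start_new, taken = True, True      # kernel: start another iteration
        elif ec > 1:
            ec -= 1
            start_new, taken = False, True     # epilog: drain work in progress
        else:
            ec = 0
            start_new, taken = False, False    # pipeline is empty: leave the loop
        rrb = (rrb - 1) % nrot                 # logical 1 now names old logical 0
        preds[phys(0)] = start_new
        if not taken:
            return dst
```

Rotation is what renames registers between iterations, LC controls how many iterations are started, and EC keeps the loop running until the last stage has finished.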

SUMMARY

• Synergy

• ILP by compiler and hardware

• Data and Control Speculation

• Multi-chip and multi-processing

• EPIC – Explicit parallel instruction computing

• “RISC architectures claim to match many of the features of IA-64 with similar sounding instructions. However, just like a tank formed by bolting weapons and armor to an old truck, the benefits are limited to specific conditions, but fall short in the heat of battle.”

• Existing RISC architectures that use ‘cmoves’ and similar instructions may remove branches, but at the cost of adding so many instructions that the benefits are nearly outweighed by the code-bloat (hardly worth the trade-off). The reason why ILP works with IA-64 is the use of completely new architectural constructs such as predicates that are not available to any existing RISC architecture.

• Traditional RISC architectures can use a ‘non-faulting load’ to avoid costly error handling when loading data ahead of time which may not be valid. But if you want to turn off the errors, why have errors in the first place? Traditional RISC architectures face one of two alternatives: add extra error-checking code which, once again, cancels out the performance benefit of speculative execution; or ‘work without a net,’ risking disastrous undetected errors due to turning off the error messages. IA-64 gets around both problems by offering a novel architectural approach to dealing with errors when loading data.

RISC vs. IA-64 – Whitepaper by Intel & HP (1999)

Benchmark comparison

BACKWARD COMPATIBILITY

Intel promises compatibility with the 32-bit software (IA-32).

It should be possible to run software in real mode (16 bits), protected mode (32 bits) and virtual 8086 mode (16 bits).

Questions?

REFERENCES

1. Ricardo Zelenovsky and Alexandre Mendonca – “Intel 64-bit Architecture” – 2001

2. Bruce Jacob – “The IA-64 Architecture” – University of Maryland (College Park)

3. Whitepaper – “IA-64 Architecture Innovations” – HP & Intel – 1999

4. Carole Dulong et al. – “An overview of Intel IA-64 Compiler”

5. M. F. Guest – “Intel’s Itanium IA-64 Processor: Overview and Initial Experience” – CLRC Daresbury Laboratory
