Post on 12-Jan-2016
Arun Hariharan (N.M.S.U)
MOTIVATION
Need for high speed computing and Architecture
More complex compilers (JAVA)
Large Database Systems
Distributed Computing on Internet
Peer competition from other manufacturers
SOLUTION
Instruction Level Parallelism (ILP) in general-purpose Microprocessors
Wide floating-point exponents
Register Stack Engine
Hardware exception deferral
Control speculation
Register rotation
Large register files
Data speculation
Predication
Parallel semantics
GOALS OF ARCHITECTURE
Overcome performance limiters:
Branches
Memory Latency
Sequential Program Model
Long Architectural Life
Large Register File
Fully Interlocked Architecture – Not tied to any particular design
No fixed issue width – e.g. not tied to an instruction length.
REGISTER RESOURCES
• 128 65-bit general registers (64 data bits + 1 “NaT” bit; 1 KB of data)
• 128 82-bit floating-point registers
• Space for up to 128 64-bit special-purpose application registers (1 KB)
• Eight 64-bit branch registers for function call linkage and return
• 64 one-bit predicate registers
INSTRUCTION ENCODING
Key Words
• Long life
• Instruction bundle
Instruction fields: Op code | Reg 1 | Reg 2 | Reg 3 | Predicate
Each register field is 7 bits and the predicate field is 6 bits; one instruction = 41 bits
5-bit field, also called Template
• Helps to decode and route instructions
• Marks end of basic block
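The encoding above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual IA-64 bundle layout of one 5-bit template plus three 41-bit instruction slots (128 bits in total). The exact bit positions inside a slot and the helper names (`pack_bundle`, `unpack_bundle`, `encode_fields`, `decode_fields`) are assumptions made for the example; real instruction formats vary in how the remaining opcode bits are used.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41

def encode_fields(opcode, reg1, reg2, reg3, predicate):
    """Build one 41-bit slot (assumed layout: predicate in the low 6 bits)."""
    assert opcode < (1 << 14) and predicate < (1 << 6)
    assert all(r < (1 << 7) for r in (reg1, reg2, reg3))
    return (opcode << 27) | (reg3 << 20) | (reg2 << 13) | (reg1 << 6) | predicate

def decode_fields(slot):
    """Split a 41-bit slot into its fields (same assumed layout)."""
    return {
        "predicate": slot & 0x3F,        # 6 bits select one of 64 predicates
        "reg1": (slot >> 6) & 0x7F,      # 7 bits select one of 128 registers
        "reg2": (slot >> 13) & 0x7F,
        "reg3": (slot >> 20) & 0x7F,
        "opcode": (slot >> 27) & 0x3FFF, # remaining 14 bits
    }

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == 3
    bundle = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Recover the template and the three slots from a 128-bit bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, slots
```

The field widths add up as 6 + 7 + 7 + 7 + 14 = 41 bits per slot and 5 + 3 × 41 = 128 bits per bundle.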
DISTRIBUTING RESPONSIBILITY
Shift a lot of the complexity to the compiler
ILP
Out-of-Order Execution
Control Flow Parallelism
Influencing Dynamic Events – The hardware takes hints from the compiler about branch prediction, instruction/data caching & pre-fetching.
ILP – Instruction Level Parallelism
• Sequential in-order execution is not enough to achieve maximum parallelism
• Out-of-order execution – It is the compiler's task to create instruction groups so that all instructions in an instruction group can be safely executed in parallel
Key Word
• Basic Block
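How a compiler might form instruction groups can be sketched as follows. This is a simplified greedy model, not Intel's compiler algorithm: the function name `form_groups` and the tuple representation of instructions are hypothetical, and only register read-after-write and write-after-write dependencies are considered.

```python
def form_groups(instructions):
    """Split a program into instruction groups.

    instructions: list of (dest, sources) tuples in program order.
    A new group starts whenever an instruction reads or writes a
    register that was already written inside the current group.
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        if dest in written or any(s in written for s in sources):
            groups.append(current)          # close the group (a "stop")
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

program = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r5", "r6")),   # r4 = r5 + r6  (independent of the first)
    ("r7", ("r1", "r4")),   # r7 = r1 + r4  (needs both results)
    ("r8", ("r7",)),        # r8 = r7 + 1   (needs r7)
]
groups = form_groups(program)
```

In this example the first two instructions share a group, while the third and fourth each need a group of their own because they depend on earlier results.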
CONTROL FLOW PARALLELISM
Traditional execution
• Compare a and 0
• Check flag if true
• Store flag value for further computation
• Compare b <= 5
• Check flag if true
• Store flag value for further computation
• (and so on for each further condition)
• Compare if any one had set the flag.
• Move 8 to r3
In IA-64
• Initialize p1 to false
• Set compare condition’s prerequisite
• Compare in parallel
• Branch
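The IA-64 steps above can be modeled in Python. This is a rough sketch, assuming the two conditions from the slide (`a == 0` and `b <= 5`) are combined with OR; the helper names `or_compare` and `predicated_move` are hypothetical. The point it illustrates is that each compare can only set the predicate, never clear it, so the compares do not depend on each other and their order does not matter.

```python
from itertools import permutations

def or_compare(p, condition):
    """OR-type compare: may set the predicate to True, never clears it."""
    return True if condition else p

def predicated_move(a, b, order=(0, 1)):
    conditions = [a == 0, b <= 5]
    p1 = False                       # initialize p1 to false
    for i in order:                  # compares may complete in any order
        p1 = or_compare(p1, conditions[i])
    r3 = 0
    if p1:                           # models "(p1) move 8 to r3"
        r3 = 8
    return r3

# Every ordering of the compares yields the same result.
results = {predicated_move(1, 5, order) for order in permutations((0, 1))}
```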
FINDING AND CREATING PARALLELISM
BRANCHES LIMIT ILP:
Sequential, no-predict: normal bank teller
Sequential, predict: fill out slip in advance (predict whether deposit or withdrawal)
Predicated Execution: fill out both slips, throw away whichever is wrong
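The bank-teller analogy maps onto code roughly as follows. This is an illustrative sketch only: the function names are hypothetical, and the Python `if` statements in `predicated` stand in for predicate-guarded instructions rather than real branches.

```python
def branching(balance, amount, is_deposit):
    """Conventional code: the branch decides which slip gets filled out."""
    if is_deposit:
        return balance + amount
    return balance - amount

def predicated(balance, amount, is_deposit):
    """Predicated code: fill out both slips, keep only the right one."""
    p1, p2 = is_deposit, not is_deposit    # one compare sets both predicates
    deposit_slip = balance + amount         # both computations always run
    withdrawal_slip = balance - amount
    result = balance
    if p1:                                  # (p1) commit the deposit
        result = deposit_slip
    if p2:                                  # (p2) commit the withdrawal
        result = withdrawal_slip
    return result
```

Both versions return the same value for every input; the predicated one trades some wasted work for the absence of a branch to predict.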
FINDING AND CREATING PARALLELISM (cont..)
Scheduling and Speculation
Moving basic blocks ahead of barriers – It is the compiler's task, rather than the processor's, to find a possible route and schedule it.
Use of basic blocks (Define)
Best possible Route – Most predicted flow of program (speculation), not all instructions are executed
Compilers – Have a bird's-eye view of the program, unlike the processor.
CONTROL SPECULATION
Removing branches – Expensive
Not all can be removed
Moving basic blocks can cause exceptions
Key Word
• Fix-up Code
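A load hoisted above its branch, together with its fix-up code, might be modeled like this. It is a toy model under stated assumptions: memory is a Python dict, the `NAT` object stands in for the NaT deferred-exception marker, and `speculative_load` and `run` are hypothetical names loosely mirroring a speculative load followed by a check.

```python
NAT = object()   # stand-in for the "Not a Thing" deferred-exception token

def speculative_load(memory, address):
    """Speculative load: return NAT instead of faulting on a bad address."""
    return memory.get(address, NAT)

def run(memory, pointer):
    # The load and its dependent multiply are hoisted above the branch.
    value = speculative_load(memory, pointer)
    doubled = NAT if value is NAT else value * 2   # NaT propagates
    if pointer is None:          # original branch: the load was not needed
        return 0
    if doubled is NAT:           # speculation check
        # Fix-up code: redo the work non-speculatively so that a real
        # fault (here a KeyError) is finally raised.
        doubled = memory[pointer] * 2
    return doubled
```

A null pointer no longer faults on the hoisted load, while a genuinely bad address still raises once the result is actually needed.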
DATA SPECULATION
ALAT – Advanced Load Address Table
Key Word
• Fix-up Code
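The role of the ALAT can be sketched with a small table keyed by target register. This is a simplified model, not the hardware design: the class and method names are hypothetical, addresses are matched exactly, and the table has unlimited capacity.

```python
class ALAT:
    """Toy Advanced Load Address Table: target register -> load address."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, memory, reg, address):
        """Load early (ahead of possibly conflicting stores) and record it."""
        self.entries[reg] = address
        return memory[address]

    def store(self, memory, address, value):
        """A store removes every entry that tracks the same address."""
        memory[address] = value
        self.entries = {r: a for r, a in self.entries.items() if a != address}

    def check_load(self, memory, reg, address, value):
        """Keep the early value if the entry survived, else reload (fix-up)."""
        if self.entries.get(reg) == address:
            return value
        return memory[address]

memory = {100: 1, 200: 2}
alat = ALAT()
early = alat.advanced_load(memory, "r1", 100)
alat.store(memory, 200, 9)                       # no conflict with address 100
kept = alat.check_load(memory, "r1", 100, early)
stale = alat.advanced_load(memory, "r2", 200)
alat.store(memory, 200, 7)                       # conflicts with r2's load
fixed = alat.check_load(memory, "r2", 200, stale)
```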
REGISTER MODEL
• 128 64-bit registers, of which 32 are fixed for µP operations (like RISC)
• 96 are free for the compiler to use
• Effectively unlimited register use is possible, as registers are paged to memory in the background by the RSE (Register Stack Engine)
• “Alloc” specifies the number of registers for locals and outputs (outputs carry parameters to calls)
• Registers are renamed so that each procedure's registers start at 32 (range 32 to 127)
RSE (Register Stack Engine)
Automatically saves/restores stack registers without software intervention (can work synchronously)
• Provides the illusion of infinite physical registers by mapping the stacked registers onto a stack in memory
• Overflow: Alloc needs more registers than are available
• Underflow: Return needs to restore a frame saved in memory
RSE may be designed to utilize unused memory bandwidth to perform register spill and fill operations in the background (Asynchronously - Speculatively to load and store data)
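The overflow and underflow cases can be illustrated with a small model. This is a sketch under simplifying assumptions: it spills and fills whole frames synchronously, whereas a real RSE moves individual registers and may work in the background. The class name `RegisterStack` and its fields are hypothetical.

```python
class RegisterStack:
    """Toy register stack: frames live in registers or in a backing store."""

    def __init__(self, physical=96):
        self.physical = physical      # stacked registers available
        self.frames = []              # sizes of resident frames, oldest first
        self.backing_store = []       # sizes of frames spilled to memory
        self.spills = 0
        self.fills = 0

    def alloc(self, size):
        """Procedure entry: reserve a frame, spilling old frames if needed."""
        assert size <= self.physical
        while sum(self.frames) + size > self.physical:   # overflow
            self.backing_store.append(self.frames.pop(0))
            self.spills += 1
        self.frames.append(size)

    def ret(self):
        """Procedure return: drop the frame, refill the caller's if spilled."""
        self.frames.pop()
        if not self.frames and self.backing_store:       # underflow
            self.frames.append(self.backing_store.pop())
            self.fills += 1

rs = RegisterStack(physical=96)
for _ in range(3):
    rs.alloc(40)      # the third call does not fit and forces one spill
rs.ret()
rs.ret()              # the caller's frame has to come back from memory
```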
SOFTWARE PIPELINE
Time complexity is expressed in big-O notation, e.g. O(n)
This notation is used to count the time spent in loops
That is because loops take most of the execution time
Can we execute loops in parallel?
ANS: Yes, if we resolve some problems:
• Managing the loop count
• Handling the renaming of registers for the pipeline
• Finishing the work in progress when the loop ends
• Starting the pipeline when the loop is entered
• Unrolling to expose cross-iteration parallelism
IA-64 Solution
• Special architecture
• Loop count LC
• Epilog count EC
• Use of register rename base (rrb)
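How LC, EC and rotating registers cooperate can be shown with a small simulation of a three-stage loop (load, add, store) that computes `x[i] + 1`. This is a sketch under loud assumptions: the function name is hypothetical, Python lists stand in for the rotating predicate and data registers, and the loop-closing logic is only a simplified model of a counted loop branch.

```python
def pipelined_increment(x):
    """Return ([v + 1 for v in x], number of kernel passes)."""
    if not x:
        return [], 0
    stages = 3
    lc = len(x) - 1                 # loop count: iterations still to start
    ec = stages                     # epilog count: passes needed to drain
    preds = [True, False, False]    # stage predicates (rotating)
    regs = [None, None, None]       # rotating data registers
    y, next_load, passes = [], 0, 0
    while True:
        passes += 1
        # One kernel pass: every stage runs, guarded by its predicate.
        if preds[2]:
            y.append(regs[2])             # stage 3: store
        if preds[1]:
            regs[1] = regs[1] + 1         # stage 2: add
        if preds[0]:
            regs[0] = x[next_load]        # stage 1: load
            next_load += 1
        # Loop branch: start a new iteration, drain, or exit.
        if lc > 0:
            lc -= 1
            new_pred = True               # prolog/kernel: start iteration
        elif ec > 1:
            ec -= 1
            new_pred = False              # epilog: only drain
        else:
            break
        # Register rotation: each value moves to the next stage's name.
        preds = [new_pred] + preds[:-1]
        regs = [None] + regs[:-1]
    return y, passes

result, passes = pipelined_increment([10, 20, 30])
```

For n elements and 3 stages the kernel runs n + 2 times, with up to three iterations in flight at once and no separate prolog or epilog code.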
SUMMARY
• Synergy
• ILP by compiler and hardware
• Data and Control Speculation
• Multi-chip and multi-processing
• EPIC – Explicit parallel instruction computing
• “RISC architectures claim to match many of the features of IA-64 with similar sounding instructions. However, just like a tank formed by bolting weapons and armor to an old truck, the benefits are limited to specific conditions, but fall short in the heat of battle.”
• Existing RISC architectures that use ‘cmoves’ and similar instructions may remove branches, but at the cost of adding so many instructions that the benefits are nearly outweighed by the code-bloat (hardly worth the trade-off). The reason why ILP works with IA-64 is the use of completely new architectural constructs such as predicates that are not available to any existing RISC architecture.
• Traditional RISC architectures can use a ‘non-faulting load’ to avoid costly error handling when loading data ahead of time which may not be valid. But if you want to turn off the errors, why have errors in the first place? Traditional RISC architectures face one of two alternatives: add extra error-checking code which, once again, cancels out the performance benefit of speculative execution ; or ‘work without a net,’ risking disastrous undetected errors due to turning off the error messages. IA-64 gets around both problems by offering a novel architectural approach to dealing with errors when loading data.
RISC vs. IA-64 – Whitepaper by Intel & HP (1999)
Benchmark comparison
BACKWARD COMPATIBILITY
Intel promises compatibility with 32-bit software (IA-32).
It should be possible to run software in real mode (16 bits), protected mode (32 bits) and virtual 8086 mode (16 bits).
Questions?
REFERENCES
1. Ricardo Zelenovsky and Alexandre Mendonca – “Intel 64-bit Architecture” – 2001
2. Bruce Jacob – “The IA-64 Architecture” – University of Maryland (College Park)
3. Whitepaper – “IA-64 Architecture Innovations” – HP & Intel – 1999
4. Carole Dulong et al. – “An overview of Intel IA-64 Compiler”
5. M. F. Guest – “Intel’s Itanium IA-64 Processor: Overview and Initial Experience” – CLRC Daresbury Laboratory