Superscalar Pipeline Architectures
By: Matthew Osborne, Philip Ho, Xun Chen
April 19, 2004
Superscalar Architecture
• Relatively new: superscalar microprocessors first appeared around 1990
• Builds on the concept of pipelining
• Superscalar architectures can process multiple instructions in one clock cycle (multiple instruction execution units)
• Allows for instruction execution rate to exceed the clock rate (CPI of less than 1)
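The relationship between CPI and instruction throughput can be illustrated with a small calculation. The cycle and instruction counts below are made-up numbers for illustration, not measurements from any processor.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Average instructions per clock cycle (the reciprocal of CPI)."""
    return instructions / cycles


# Hypothetical run: 1000 instructions completed in 625 clock cycles.
print(cpi(625, 1000))  # 0.625
print(ipc(625, 1000))  # 1.6
```

A CPI below 1 is only possible when more than one instruction can complete in the same cycle, which is what the multiple execution units provide.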
Overview of Selected Superscalar Architectures
• Intel
• MIPS
• PowerPC
• T1000 Architectures
• Hobbes: A Multi-threaded superscalar
Intel Superscalar Architecture
According to Sara Sarimento, in her essay “Recent History of Intel Architecture – A Refresher”:
- Intel’s first use of a superscalar architecture was in its Pentium processor
- “Instruction Level Parallelism”: instructions that do not depend on one another’s outcome execute concurrently, which uses more of the available hardware resources and increases instruction throughput.
Intel P5 Microarchitecture
• Used in the initial Pentium processor
• Could execute up to 2 instructions simultaneously
• Instructions were sent through the pipeline in order: if the next two instructions had a dependency, only one instruction (pipe) was executed and the second execution unit (pipe) went unused for that clock cycle.
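The in-order pairing behaviour described above can be sketched as a toy model. This is a simplified illustration, not the Pentium's actual pairing rules: the `Instr` type and the dependency check (read-after-write and write-after-write only) are assumptions made for the example.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def depends(first: Instr, second: Instr) -> bool:
    """True if `second` cannot issue in the same cycle as `first`.

    Only read-after-write and write-after-write hazards are checked.
    """
    return first.dest in second.srcs or first.dest == second.dest


def dual_issue_cycles(program: list) -> int:
    """Count issue cycles for an in-order, 2-wide machine."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and not depends(program[i], program[i + 1]):
            i += 2   # both pipes are used this cycle
        else:
            i += 1   # the second pipe goes unused this cycle
        cycles += 1
    return cycles


program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r1", "r5")),  # needs r1 from the previous instruction
    Instr("r6", ("r7", "r8")),
    Instr("r9", ("r2", "r3")),
]
print(dual_issue_cycles(program))  # 3
```

Because issue is strictly in order, the dependent second instruction blocks pairing in the first cycle even though later instructions are independent.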
Intel P6 Microarchitecture
- Used in the Pentium Pro, Pentium II and Pentium III processors
- 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (µops) for the Out-of-Order Execution unit
- 10-stage instruction pipeline
Intel P6 Microarchitecture
• “Out of Order” instruction execution - executes instructions without data dependency issues out of order for a higher level of hardware utilization
• “Scheduler” unit resolves data dependency issues between individual instructions
• “Re-Order Buffer” puts instructions back in program order before their results are written back (retired)
• Up to 3 µops can be retired per clock cycle
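A minimal sketch of out-of-order completion followed by in-order retirement, under simplifying assumptions that are not taken from the P6 design: unlimited execution units, made-up latencies, and a retirement limit of 3 per cycle. The `simulate` function and the instruction tuples are hypothetical names for illustration.

```python
def simulate(program, retire_width=3):
    """Return (finish, retire) cycle lists for a toy out-of-order core.

    program: list of (name, latency, deps), where deps holds the indices
    of earlier instructions whose results this instruction needs.
    Assumes unlimited execution units, so an instruction starts as soon
    as its operands are ready, regardless of program order.
    """
    finish = []
    for _name, latency, deps in program:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)

    retire = []
    cycle, retired_this_cycle = 0, 0
    for done in finish:
        # Retire in program order: never before the instruction finishes,
        # and never before the previous instruction has retired.
        earliest = max(done, cycle)
        if earliest == cycle and retired_this_cycle == retire_width:
            earliest += 1          # retirement slots are full this cycle
        if earliest > cycle:
            cycle, retired_this_cycle = earliest, 0
        retired_this_cycle += 1
        retire.append(cycle)
    return finish, retire


program = [
    ("load", 3, []),
    ("add",  1, [0]),   # waits for the load
    ("mul",  1, []),    # independent: finishes before the load
    ("sub",  1, []),
    ("or",   1, []),
    ("xor",  1, []),
]
finish, retire = simulate(program)
print(finish)  # [3, 4, 1, 1, 1, 1]
print(retire)  # [3, 4, 4, 4, 5, 5]
```

The independent instructions finish early, but the retire cycles never decrease: the re-order step keeps the visible results in program order.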
Intel NetBurst MicroArchitecture
- New architecture used for the Intel Pentium 4 and Xeon processors
Intel NetBurst Microarchitecture
Changes from P6 Architecture
• Only one instruction decoder present
• Decoder moved outside the Out-of-Order Execution Unit; an Execution Trace Cache was added in its place
• Increased number of pipeline stages to 20
• Improved branch prediction algorithms
• ALUs operate twice as quickly as their P6 counterparts
Intel NetBurst Microarchitecture
Execution Trace Cache
• Alleviates delays in fetching and translating CISC instructions to their appropriate µops
• Instructions are now decoded by a translation engine, with the resulting µops stored as traces (sequences of µops) in the Execution Trace Cache.
• Traces are stored along the path of predicted program execution flow, with the results of branches in the code integrated into this path
• Delivers up to 3 µops to the core of the Execution Unit per clock cycle
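The idea of caching decoded µops can be sketched as a lookup table keyed by start address. This is a conceptual illustration only: the `TraceCache` class, the `toy_decode` function, and the 7-µop expansion are invented for the example and do not describe NetBurst's real organisation.

```python
class TraceCache:
    """Toy cache of decoded µop sequences, keyed by start address."""

    def __init__(self, decode):
        self._decode = decode   # slow path: translate instructions to µops
        self._traces = {}       # start address -> list of µops
        self.decode_calls = 0

    def fetch(self, address):
        """Return the trace starting at `address`, decoding only on a miss."""
        if address not in self._traces:
            self.decode_calls += 1
            self._traces[address] = self._decode(address)
        return self._traces[address]


def deliver(trace, width=3):
    """Split a trace into per-cycle groups of at most `width` µops."""
    return [trace[i:i + width] for i in range(0, len(trace), width)]


def toy_decode(address):
    # Hypothetical decoder: pretends the code at `address` expands to 7 µops.
    return [f"uop{n}@{address:#x}" for n in range(7)]


cache = TraceCache(toy_decode)
first = cache.fetch(0x40)    # miss: the decoder runs
second = cache.fetch(0x40)   # hit: the decoder is bypassed
print(cache.decode_calls)                    # 1
print([len(g) for g in deliver(second)])     # [3, 3, 1]
```

On a hit the decode step is skipped entirely, which is the delay the trace cache is meant to alleviate.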
Intel NetBurst Microarchitecture
Branch Prediction
• Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible
• Targets are fetched from the Execution Trace Cache if cached there; otherwise they are fetched from the memory hierarchy
• Downside: despite the improved prediction algorithm, mispredicted branches are one of the biggest costs of this architecture, because the instruction pipeline is longer than in previous architectures.
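A back-of-the-envelope model shows why a longer pipeline makes mispredictions more expensive. All numbers here are hypothetical, and the model assumes the misprediction penalty in cycles is roughly equal to the pipeline depth, which is a simplification.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average stall cycles per instruction from mispredictions."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 5% of them mispredicted.
shallow = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=10)
deep = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)
print(round(shallow, 3))  # 1.1
print(round(deep, 3))     # 1.2
```

With the same predictor accuracy, doubling the penalty doubles the cycles lost to mispredictions, so a deeper pipeline needs a better predictor just to break even.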
MIPS Superscalar Architecture
• MIPS is a RISC instruction set architecture, whereas Intel’s x86 is CISC; this made a superscalar design easier to implement for MIPS than for Intel’s CISC platform
• The first MIPS processor with a superscalar architecture was the 64-bit MIPS R8000, released in 1994.
![Page 13: Superscalar Pipeline Architectures By: Matthew Osborne, Philip Ho, Xun Chen April 19, 2004](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e225503460f94b0e6d8/html5/thumbnails/13.jpg)
MIPS R8000 Processor
R8000 Chip Set Diagram
Courtesy of Silicon Graphics http://sgi.cartsys.net/i2sec7.html
MIPS R8000 Features
• Superscalar
• Can issue up to 4 instructions per cycle, in order
• Multi-component chip set (Integer Unit, Floating Point Unit, Tag RAMs and Data Streaming Cache)
• Designed for peak performance with Floating Point Operations
MIPS R8000 Pitfalls
• Integer operation performance limited
• Very high cost
As a result of these two key factors:
• The R8000 was only in the marketplace for about a year.
• This processor was mainly used only in the scientific community
![Page 16: Superscalar Pipeline Architectures By: Matthew Osborne, Philip Ho, Xun Chen April 19, 2004](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e225503460f94b0e6d8/html5/thumbnails/16.jpg)
MIPS R10000 Processor
Superscalar Pipeline Architecture for the R10000 processor. Diagram courtesy of R10000 Microprocessor User’s Manual.
http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html
R10000 Processor - Features
• Introduced in 1995
• Improved integer instruction performance
• Ability to create a multi-processor system (can attach up to 4 R10000 chips together)
• Fetches and decodes 4 instructions each clock cycle/pipeline stage
• “Out Of Order” Instruction Execution: first MIPS processor to support this feature
![Page 18: Superscalar Pipeline Architectures By: Matthew Osborne, Philip Ho, Xun Chen April 19, 2004](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e225503460f94b0e6d8/html5/thumbnails/18.jpg)
R10000 Block Diagram
Each decoded instruction is sent to one of 3 instruction queues:
- Address Queue (Load/Store Instructions)
- Integer Queue (Integer ALU Operations)
- Floating Point Queue (Floating Point Arithmetic Operations)
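The routing of decoded instructions into the three queues can be sketched as follows. The instruction class labels (`"load"`, `"int_alu"`, and so on) and the example instruction strings are hypothetical; only the three queue roles come from the slide.

```python
from collections import deque

# Hypothetical instruction classes mapped to the three queues.
QUEUE_FOR_CLASS = {
    "load": "address",
    "store": "address",
    "int_alu": "integer",
    "fp_arith": "floating_point",
}


def dispatch(decoded):
    """Place each (op_class, text) pair into its instruction queue."""
    queues = {"address": deque(), "integer": deque(), "floating_point": deque()}
    for op_class, text in decoded:
        queues[QUEUE_FOR_CLASS[op_class]].append(text)
    return queues


decoded = [
    ("load", "lw r1, 0(r2)"),
    ("int_alu", "addu r3, r1, r4"),
    ("fp_arith", "add.d f2, f4, f6"),
    ("store", "sw r3, 4(r2)"),
]
queues = dispatch(decoded)
print(list(queues["address"]))         # ['lw r1, 0(r2)', 'sw r3, 4(r2)']
print(list(queues["integer"]))         # ['addu r3, r1, r4']
print(list(queues["floating_point"]))  # ['add.d f2, f4, f6']
```

Separating the queues by instruction type lets each one feed its own execution pipelines independently of the others.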
MIPS R10000 Processor
• 5 Execution Pipelines
- Load/Store Unit
- Two Integer ALUs
- Floating Point Adder
- Floating Point Multiplier
• Can process up to 4 out-of-order instructions simultaneously
• Base architecture core that all successor MIPS processors have been built from
PowerPC
• Direct descendant of the IBM 801, RT PC and RS/6000
• All are RISC
• The RS/6000 was the first superscalar design in this line
• PowerPC 601 superscalar design similar to RS/6000
• Later versions extend superscalar concept
![Page 21: Superscalar Pipeline Architectures By: Matthew Osborne, Philip Ho, Xun Chen April 19, 2004](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e225503460f94b0e6d8/html5/thumbnails/21.jpg)
PowerPC 601 Pipeline Structure
![Page 22: Superscalar Pipeline Architectures By: Matthew Osborne, Philip Ho, Xun Chen April 19, 2004](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e225503460f94b0e6d8/html5/thumbnails/22.jpg)
PowerPC 601 Pipeline
![Page 23: Superscalar Pipeline Architectures By: Matthew Osborne, Philip Ho, Xun Chen April 19, 2004](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e225503460f94b0e6d8/html5/thumbnails/23.jpg)
PowerPC 601 General View
PowerPC storage model
• Supports byte (8-bit), halfword (16-bit), word (32-bit) and doubleword (64-bit) data types.
• Handles string operations for multi-byte strings up to 128 bytes
• 32-bit PowerPC implementations support a 4-GB effective address space.
• 64-bit PowerPC implementations support a 16-exabyte effective address space.
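The two address-space sizes follow directly from the address width, as this short check shows. It uses binary units (1 GB taken as 2^30 bytes and 1 exabyte as 2^60 bytes), which is the assumption under which the figures above come out exactly.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


GIB = 2 ** 30   # bytes in a binary gigabyte
EIB = 2 ** 60   # bytes in a binary exabyte

print(address_space_bytes(32) // GIB)  # 4
print(address_space_bytes(64) // EIB)  # 16
```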
General-purpose registers (GPR)
• User Instruction Set architecture specifies all implementations have 32 GPRs
• GPRs are the source and destination of all integer operations
• No lookup is done for GPR0’s contents when it is named as the base register in an address computation: the value 0 is used instead.
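In the PowerPC architecture, load and store address computations treat a base-register field of 0 as the literal value 0 rather than reading GPR0. A minimal sketch of that rule for a 32-bit implementation with a base-plus-displacement address, where the function name and register-file list are hypothetical:

```python
def effective_address(gpr, ra, displacement):
    """Base-plus-displacement effective address on a 32-bit implementation.

    gpr: list of 32 register values; ra: base register number.
    When ra is 0 the base is the constant 0, not the contents of GPR0.
    """
    base = 0 if ra == 0 else gpr[ra]
    return (base + displacement) & 0xFFFFFFFF   # wrap to 32 bits


gpr = [0] * 32
gpr[0] = 0xDEAD      # deliberately non-zero to show it is ignored
gpr[3] = 0x1000

print(hex(effective_address(gpr, 0, 8)))  # 0x8
print(hex(effective_address(gpr, 3, 8)))  # 0x1008
```

This gives programs a way to address memory with an absolute displacement without first clearing a register.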
Floating-point registers (FPR)
• All implementations have 32 FPRs.
• FPRs are the source and destination operands of all floating-point operations.
• FPRs can contain 32-bit and 64-bit signed and unsigned integer values, as well as single-precision and double-precision floating-point values.
Special-purpose registers (SPR)
• Give status and control of resources within the processor core.
• SPRs that applications can read and write without support from a system service include the Count Register, the Link Register and the Integer Exception Register.
• SPRs that applications can read only with support from a system service include the Time Base and other timers.
T1000 Architectures
• The T1000 architectures are reconfigurable computing architectures embedded into a superscalar processor.
• T1000 architectures rely on a programmable functional unit (PFU) integrated into the datapath.
• The T1000 is assumed to be a 4-issue out-of-order machine, which helps tolerate the latencies of some data-dependent instruction sequences.
• A T1000 extended instruction is encoded as a register-register operation with a specific opcode.
Hobbes
• Multi-threaded architectures attempt to increase pipeline utilization by concurrently executing instructions from different threads.
• The architecture chosen was an aggressive, speculative, out-of-order superscalar processor based on the MIPS R2000 instruction set.
• The Hobbes architecture combines multi-threading with superscalar issue, with the supposition that strengths of one should offset the weaknesses of the other.
• By supporting superscalar issue from more than one thread, the architecture overcomes the lack of instruction-level parallelism that plagues other superscalar structures.
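The benefit of issuing from more than one thread can be illustrated with a toy issue-slot count. This is not the Hobbes issue policy: the per-cycle counts of ready instructions are invented, and the model simply fills a 4-wide issue stage with whatever is ready across all threads.

```python
def slots_filled(ready, issue_width=4):
    """ready[c][t] is the number of ready instructions in thread t at cycle c.

    Returns the number of issue slots filled in each cycle.
    """
    return [min(issue_width, sum(per_thread)) for per_thread in ready]


def utilization(ready, issue_width=4):
    """Fraction of all issue slots that were filled."""
    filled = slots_filled(ready, issue_width)
    return sum(filled) / (issue_width * len(filled))


# Hypothetical ready-instruction counts over four cycles.
one_thread = [[1], [2], [1], [1]]
two_threads = [[1, 2], [2, 1], [1, 1], [1, 3]]

print(utilization(one_thread))   # 0.3125
print(utilization(two_threads))  # 0.75
```

A single thread with little instruction-level parallelism leaves most slots empty; a second thread supplies independent instructions to fill them.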
Background
• The Hobbes micro-architecture draws its inspiration from two widely differing architectures: Multi-threaded and superscalar.
• It is hoped that combining the fundamental concepts of these architectures will build upon their respective strengths and compensate for their corresponding weaknesses, allowing a hybrid to be greater than the sum of its parts.
Multi-threaded Architectures
• Multi-threaded processors can concurrently execute instructions from more than one thread.
• The contexts of multiple threads are stored on-board, which allows instructions to be issued from different threads.
• Traditional multi-threaded architectures have usually implemented a round-robin execution strategy that switches instruction execution to a new thread every cycle.
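A round-robin strategy of this kind can be sketched in a few lines. The thread names and instruction labels are placeholders, and the model issues at most one instruction per cycle, which is an assumption made for the example.

```python
def round_robin_issue(threads, cycles):
    """Issue one instruction per cycle, switching to the next thread each cycle.

    threads: dict mapping thread name -> list of instructions in program order.
    Returns a list of (cycle, thread, instruction); instruction is None when
    the selected thread has nothing left to issue (a wasted slot).
    """
    order = list(threads)
    next_index = {name: 0 for name in order}
    issued = []
    for c in range(cycles):
        name = order[c % len(order)]
        stream = threads[name]
        if next_index[name] < len(stream):
            issued.append((c, name, stream[next_index[name]]))
            next_index[name] += 1
        else:
            issued.append((c, name, None))
    return issued


threads = {"A": ["a0", "a1"], "B": ["b0", "b1"], "C": ["c0"]}
for entry in round_robin_issue(threads, 6):
    print(entry)
```

In this sketch a thread that has run out of work still receives its turn, which shows how strict per-cycle rotation can waste issue slots.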
The Thread Unit of Hobbes
• The Thread unit contains all of the elements required to support a single thread.
• It consists of a fetch buffer, issue buffer, decode logic, branch adder and the thread state storage.
The Thread Unit
• Instruction fetch is performed by reading an entire cacheline of four words and storing it in the fetch buffer.
• Each thread decodes and issues its instructions in program order. After an instruction has been decoded, it is stalled until all of its operands are available.
• Once the operands are ready, the instruction is placed into the issue buffer and the issue unit is notified.
• The register file is very similar to that found on the R2000. It has two write ports, and both may be used by the same thread.
• Branches which do not affect the register file are executed in the thread unit and are not issued to the execution unit.
The Execution Units of Hobbes
• The Hobbes architecture has an almost identical set of execution units to that of an out-of-order superscalar processor.
• The characteristics of the execution units approximately correspond to those of the R2000/R2010.
• Execution Units
• Integer: 2 ALUs, Shifter, Multiply / Divide, Load / Store, Data cache interface
• FP: FP Convert, FP Add, FP Multiply, FP Divide
Superscalar Architecture
• Superscalar processors improve performance by reducing the average number of cycles required to execute each instruction
• This is accomplished by issuing and executing more than one independent instruction per cycle, rather than limiting execution to just one instruction per cycle as traditional pipelined architectures do.
• For superscalar architectures to experience speed-up over traditional pipelined architectures they require the average level of available instruction-level parallelism to be greater than one.
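One common way to estimate the available instruction-level parallelism of a code fragment is to divide the instruction count by the length of its longest dependency chain. The sketch below does this under two assumptions not stated in the slides: every instruction takes one cycle, and only true data dependencies matter. The `average_ilp` name and the dependency lists are illustrative.

```python
def average_ilp(deps):
    """Instructions divided by critical-path length, assuming unit latency.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on.
    """
    if not deps:
        raise ValueError("program must contain at least one instruction")
    depth = []
    for d in deps:
        # An instruction sits one level below its deepest producer.
        depth.append(1 + max((depth[j] for j in d), default=0))
    return len(deps) / max(depth)


# Six instructions whose longest dependency chain is three long.
mixed = [[], [], [0, 1], [], [2, 3], []]
# A pure chain: each instruction needs the previous result.
chain = [[], [0], [1]]

print(average_ilp(mixed))  # 2.0
print(average_ilp(chain))  # 1.0
```

The chain has an average ILP of exactly one, so a superscalar machine would gain nothing on it over a simple pipeline; the mixed fragment could in principle keep two issue slots busy.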
References
• Hennessy, John L. and Patterson, David A. “Computer Organization and Design, The Hardware/Software Interface.” San Francisco: Morgan Kaufmann Publishers, 1998.
• Sarimento, Sara. “Recent History of Intel Architecture – A Refresher.” 17 April 2004. Intel Corporation, www.intel.com. 18 April 2004. http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm
• Zhou & Martonosi. “Augmenting Modern Superscalar Architectures with Configurable Extended Instructions.” 19 April 2004. http://ipdps.eece.unm.edu/2000/raw/18000943.pdf
• Kish & Preiss. “Hobbes: A Multi-Threaded Superscalar Architecture.” 19 April 2004. http://www.brpreiss.com/page75.html
• R10000 Processor User’s Manual. 9 Dec 1996. SGI Corporation. 22 April 2004. http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html#HEADING1
• “MIPS Architecture.” 17 April 2004. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Main_Page. 23 April 2004. http://en.wikipedia.org/wiki/MIPS_architecture
• Mapleson, Ian. “Indigo 2 and Power Indigo 2 Technical Report.” SiliconGraphics. 23 April 2004. http://sgi.cartsys.net/i2sec7.html
• “Power PC Architecture.” 23 April 2004. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html