Post on 19-Jan-2016
* Other brands and names may be claimed as the property of others.
ECE 371 Microprocessors
Chapter 6: Intel® x86 Microprocessor Architecture
Derived from Dr. Herbert G. Mayer's 2003 presentation to Intel's Software College
Status 8/30/2015
For use at CCUT, Fall 2015

Agenda
– Assumptions
– Speed Limitations
– x86 Architecture Progression
– Architecture Enhancements
– Intel® x86 Architectures

Assumptions
– Audience: understands general x86 architecture
– Knows some assembly language
  – Flavor used here: GNU assembler gas
  – Result on right-hand side:
    mov [temp], %eax; is a load into register a
    add %eax, %ebx; new integer sum is in register b
  – Different from Microsoft* masm, and tasm
– Understands some architectural concepts:
  – Caches, multi-level caches, (some MESI)
  – Threading, multi-threaded code
  – Blocking (cache), blocking (aka tiling), blocking (thread synch.)
– Causes of pipeline stalls
  – Control flow change
  – Data dependence, registers and data
– NOT discussed: VTune, CISC vs. RISC

Speed Limitations

Agenda
– Performance Limiters
– Register Starvation
– Processor-Memory Gap
– Processor Stalls
– Store Forwarding
– Misc Limitations:
  – Spin-Lock in Multi Thread
  – Misaligned Data
  – Denorm Floats

Performance Limiters
Architectural limitations the programmer or compiler can overcome:
– Indirect limitations: stall via branch, call, return
– Incidental limits: resource constraint
– Historical limits: register starved x86
– Technological: ALU speed vs. memory access speed
– Logical limits: data- and resource dependence

Register Starvation
– How many regs needed (compiler or programmer)?
  – Infinite is perfect
  – 1024 is very good
  – 64 acceptable
  – 16 is crummy
  – 4+4 is x86
  – 1 is saa (single-accumulator architecture)
– Formally on x86: 16 regs. Quick test:
  – ax, bx, cx, dx
  – si, di
  – bp, sp, ip
  – cs, ds, ss, es, fs, gs, flags
– Of which ax, bx, cx, dx are GPRs, almost
– Rest can be used as better temps
– ax & dx used for * and /, cx for loop

Register Starvation
– Absence of regs causes
  – Spurious memory spills and loads
  – False data dependences (not dependencies)
– Except single-accumulator arch: no other arch is more register starved than x86

Instruction stream (added ops, mem latency):
  mov %eax, [mem1]
  use stuff, %eax
  mov [mem1], %eax

Instruction stream (false DD):
  mov %eax, [tmp]
  add %ebx, %eax
  imul %ecx
  mov %eax, [prod]
  mov [tmp], %eax

And the Programmer?
– No solution in ISA, x86 had 4 GPRs since 8086
– Improved via internal register renaming
  – Pentium® Pro has hundreds of internal regs
– Added registers in MMX
  – Visible to you, programmer and compiler
  – fp(0) .. fp(7), 80 bits as FP, 64 bits as MMX, but note: context switch
– Added registers in SSE
  – xmm(0) .. xmm(7), 128 bits

Processor-Memory Gap
[Figure: performance (log scale, 1 to 1000) vs. time (1980 to 2002). CPU curve, "Moore's Law": µProc 60%/yr. DRAM curve: 7%/yr. Processor-memory performance gap grows 50% / year. Source: David Patterson, UC Berkeley]
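The three growth rates in the processor-memory gap figure are consistent with each other, which a short calculation can confirm. The sketch below assumes both curves compound yearly from the same starting point; the function names are illustrative and not part of the original slides.

```python
def relative_gap_growth(cpu_rate: float, dram_rate: float) -> float:
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years: int, cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """CPU/DRAM performance ratio after `years`, both normalized to 1 at year 0."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


if __name__ == "__main__":
    # 60%/yr CPU growth against 7%/yr DRAM growth: roughly 50%/yr gap growth
    print(f"gap growth per year: {relative_gap_growth(0.60, 0.07):.1%}")
    print(f"gap after 10 years:  {gap_after(10):.1f}x")
```

Compounding is what makes the gap grow so quickly: a modest-sounding yearly difference becomes a factor of dozens within a decade.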

Bridging the Gap: Trend
[Figure: performance vs. time for CPU and DRAM. Caches and multilevel caches bridge the gap. Instruction level: Intel® Pentium II processor, out of order execution, ~30%. Thread level: Intel® Xeon™ processor, Hyperthreading Technology, ~30%.]
Hyperthreading Technology: feeds two threads to exploit shared execution units

Impact of Memory Latency
– Memory speed has NOT kept up with advance in processor speed
  – Avg. integer add ~0.16 ns (Xeon), but memory accesses take ~10 ns or more
– CPU hardware resource utilization is only 35% on average
  – Limited due to memory stalls and dependencies
– Possible solutions to memory speed mismatch?
Memory speed mismatch is a major source of CPU stalls
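To put the two latency figures side by side: the sketch below computes how many integer adds fit into one memory access, using the slide's ~0.16 ns and ~10 ns values as defaults. The function name is illustrative only.

```python
def adds_per_memory_access(add_ns: float = 0.16, mem_ns: float = 10.0) -> float:
    """Number of back-to-back integer adds that fit into one memory access."""
    if add_ns <= 0 or mem_ns <= 0:
        raise ValueError("latencies must be positive")
    return mem_ns / add_ns


if __name__ == "__main__":
    print(f"adds per memory access: {adds_per_memory_access():.1f}")
```

With the slide's numbers, one access that goes all the way to memory costs about as much as several dozen adds, which is why a single miss can dominate a short loop.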

And the Programmer?
– Cache provided
– Methods to manipulate cache
– Tools provided to pre-fetch data
  – At risk of superfluous fetch, if control-flow change

Processor Stalls
– A stalled cycle is a cycle in which the processor cannot receive or schedule new instructions
  – Total Cycles = Total Stall Cycles + Productive Cycles
  – Stalls waste processor cycles
  – Perfmon, Linux ps, top, and other system tools show stalled cycles as busy CPU cycles
  – Intel® VTune Analyzer used to monitor stalls (HP* PFmon)
[Figure: total cycles split into unstalled and stalled portions]
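The cycle accounting above (Total Cycles = Total Stall Cycles + Productive Cycles) can be turned into a stall fraction. This is a minimal sketch; the function name and the sample counts are hypothetical, and real counts would come from a tool such as VTune.

```python
def stall_fraction(total_cycles: int, productive_cycles: int) -> float:
    """Fraction of cycles lost to stalls, from total = stall + productive."""
    if total_cycles <= 0 or not 0 <= productive_cycles <= total_cycles:
        raise ValueError("need 0 <= productive_cycles <= total_cycles, total > 0")
    stall_cycles = total_cycles - productive_cycles
    return stall_cycles / total_cycles


if __name__ == "__main__":
    # Hypothetical run: 350 productive cycles out of 1000 (35% utilization)
    print(f"stalled: {stall_fraction(1000, 350):.0%}")
```

Note that tools which report only "busy" CPU time cannot separate the two terms, because a stalled cycle still counts as busy.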

Why Stalls Occur!
– Stalls occur, because:
  – Instruction needs resource not available
  – Dependences [sic] (control- or data-) between instructions
  – Processor / instruction waits for some signal or event
– Sample resource limitations:
  – Registers
  – Execution ports
  – Execution units
  – Load / store ports
  – Internal buffers (ROBs, WOBs, etc.)
– Sample events:
  – Exceptions, cache misses, TLB misses, etc.
  – Common thing: they hold up compute progress

Control Dependences (CD)
– Change in flow of control causes stalls
– Processors handle control dependences:
  – Via branch prediction hardware
  – Conditional move to avoid branch & pipeline stall

Instruction stream (barrier: predict):
  mov [%ebp+8], %eax
  cmp 1, %eax
  jg bigger
  mov 1, %eax
  . . .
bigger:

Instruction stream (barrier: predict):
  dec %ecx
  push %eax
  call rfact
  mov %ecx, [%ebp+8]
  mul %ecx

Data Dependences (DD)
– Data dependence limits performance
– Programmer / compiler cannot solve
  – Xeon has register renaming to avoid false data dependences
  – Supports out of order execution to hide effects of dependences

Instruction stream (mem latency):
  . . .
  mov eax, [ebp+8]
  cmp eax, 1

Instruction stream (false DD):
  mov [temp], eax
  add eax, ebx
  mul ecx
  mov [prod], eax
  mov eax, [temp]
  . . .
bigger:

Xeon Processor Stalls
– D-side
  – DTLB misses
  – Memory hierarchy: L1, L2 and L3 misses
– Core
  – Store buffer stalls
  – Load/store splits
  – Store forwarding hazard
  – Loading partial/misaligned data
  – Branch mispredicts
– I-side
  – Streaming buffer misses
  – ITLB misses
  – TC misses
  – 64K aliasing conflicts
– Misc
  – Machine clears

And the Programmer?
– Reduce processor stalls by prefetching data
– Reduce control flow changes by conditional move
– Reduce false dependences by using register temps, from MMX (fp) and XMM pool

Partial Writes: WC buffers
[Figure: First Level Cache feeds the Fill/WC buffers, which drain over the FSB to Second Level Cache and Memory]
– Incomplete WC buffer (8B 8B 8B -): 3 - 8B "partial" bus transactions
– Complete WC buffer (8B 8B 8B 8B): 1 bus transaction
– Detection (VTune), event based sampling:
  – Ext. Bus Partial Write Trans.
– Causes:
  – L2 Cache Request
  – Ext. Bus Burst Read Trans.
  – Ext. Bus RFO Trans.
– Causes:
  1) Too many WC streams
  2) WB loads/stores contending for fill-buffers to access L2 cache or memory
– Partial writes reduce actual front-side bus bandwidth
  – ~3x lower for PIII
  – ~7x lower for Pentium 4 processor due to longer cache line
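The ~3x and ~7x figures follow from counting bus transactions per write-combining buffer. The sketch below is a simplified model, not a description of the actual hardware: it assumes 8-byte chunks as in the figure, a 32-byte line on the PIII and a 64-byte line on the Pentium 4, and one partial transaction per filled chunk when the buffer is evicted incomplete. All names are illustrative.

```python
CHUNK_BYTES = 8  # chunk size shown in the WC buffer figure


def bus_transactions(filled_chunks: int, line_bytes: int) -> int:
    """Bus transactions needed to evict one WC buffer (simplified model)."""
    chunks_per_line = line_bytes // CHUNK_BYTES
    if not 0 <= filled_chunks <= chunks_per_line:
        raise ValueError("filled_chunks out of range for this line size")
    if filled_chunks == chunks_per_line:
        return 1  # complete buffer: a single burst transaction
    return filled_chunks  # incomplete buffer: one partial write per chunk


def worst_case_slowdown(line_bytes: int) -> int:
    """Transactions for a buffer missing one chunk, relative to a full one."""
    chunks_per_line = line_bytes // CHUNK_BYTES
    return bus_transactions(chunks_per_line - 1, line_bytes)


if __name__ == "__main__":
    print("32-byte line (assumed PIII):", worst_case_slowdown(32), "x")
    print("64-byte line (assumed Pentium 4):", worst_case_slowdown(64), "x")
```

The practical consequence is to fill a whole line before the buffer is evicted, for example by writing to one stream at a time in address order.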

Store Forwarding Guidelines
[Figure: store/load address diagrams against 16-byte boundaries, in two columns, "Will Forward" and "Forwarding Penalty". Cases A: load aligned with store; load contained in store; 128-bit forwards must be 16-byte aligned. Case B: load contained in single store.]
Store forward: loading from an address recently stored can cause data to be fetched more quickly than via memory access. Large penalty for non-forwarding cases (1.1-1.3x).
MSVC < 7.0 and you generate these. Intel Compiler doesn't.

And the Programmer?
– Pick right compiler, for HLL programs
– Use VTune to check, for asm code
– In asm programs, ensure loads after stores are:
  – Contained in stored data, subset or proper subset
  – In single previous store, not in sum of multiple stores
  – Thus do store-combining: assemble together, then store
  – Both data start on same address
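The guidelines above can be written down as a small predicate. This is only a sketch of the rules as the slides state them (same start address, load contained in one store, 128-bit forwards aligned to 16 bytes); it is not a model of any specific processor, and the function and parameter names are hypothetical.

```python
def will_forward(store_addr: int, store_size: int,
                 load_addr: int, load_size: int) -> bool:
    """True if a load following a single store meets the slide's guidelines."""
    same_start = load_addr == store_addr          # both data start on same address
    contained = load_size <= store_size           # load is a subset of the store
    aligned_128 = load_size != 16 or load_addr % 16 == 0  # 128-bit case
    return same_start and contained and aligned_128


if __name__ == "__main__":
    print(will_forward(0x1000, 4, 0x1000, 2))    # subset at same address
    print(will_forward(0x1000, 4, 0x1002, 2))    # different start address
    print(will_forward(0x1000, 2, 0x1000, 4))    # load spans more than one store
    print(will_forward(0x1008, 16, 0x1008, 16))  # 128-bit, not 16-byte aligned
```

The third case is the one store-combining fixes: assemble the value in a register first, then issue one store that covers the later load.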

Misc Limitations
– Spin-Lock in Multi Thread
  – Don't use busy wait, just because you have (almost) a second processor for the second thread
– Misaligned data
  – Don't align data on an arbitrary boundary, just because the architecture can fetch from any address
– Dumb errors
  – Failure to use proper tool (library, compiler, performance analyzer)
  – Failure to use tiling (aka blocking) or SW pipelining
– Denormalized Floats
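Tiling (aka blocking) restructures a loop nest so that it works on one small block of data at a time. The sketch below shows the loop structure on a square matrix multiply; it is an illustration of the transformation only (Python will not show the cache benefit), and the tile size is a hypothetical parameter that in practice depends on the cache.

```python
def matmul_naive(a, b):
    """Reference n x n matrix multiply on lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_tiled(a, b, tile):
    """Same product, computed block by block (tile x tile sub-matrices)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):              # block row of c
        for kk in range(0, n, tile):          # block of the shared dimension
            for jj in range(0, n, tile):      # block column of c
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_tiled(a, b, tile=1) == matmul_naive(a, b))
```

The three outer loops pick a block; the three inner loops stay inside it, so the same few rows are reused while they are still cached.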

And the Programmer?
– Use pause, when applicable!
  – New NetBurst instruction
– Use compiler switches to align data on address divisible by greatest individual data object
  – Who cares about wasting 7 bytes to force 8-byte alignment?
– Be smart, pick right tools
  – Instruct compiler to SW pipeline
  – In asm, manually SW pipeline; note easier on EPIC than VLIW, lacking prologue, epilogue sometimes
  – Enable compiler to partition larger data structures into smaller suitable blocks, for improved locality
  – Cache parameter dependent
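The "7 bytes" in the alignment bullet is the worst-case padding for 8-byte alignment. A minimal sketch of that arithmetic, with illustrative function names and power-of-two alignments assumed:

```python
def padding_for(addr: int, alignment: int) -> int:
    """Bytes to skip so that addr lands on a multiple of alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (-addr) % alignment


def align_up(addr: int, alignment: int) -> int:
    """Smallest aligned address that is >= addr."""
    return addr + padding_for(addr, alignment)


if __name__ == "__main__":
    print(padding_for(0x1001, 8))   # worst case for 8-byte alignment
    print(hex(align_up(0x1001, 8)))
```

The padding never reaches the alignment itself, so the cost is bounded at alignment minus one bytes per object.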

And the Programmer?
Exercise for the first of 2 labs, this one being a "two-minute" exercise:
– Turn on your computer, verify Linux is alive
– Verify you have available:
  – Editor to modify program
  – Intel C++ compiler, text command icc, with -g
  – Debugger ddd, with disassembly ability
– Source program vscal.cpp
– Linux commands: ls, vi, icc, mkdir, etc.

Module Summary
Covered: key causes that render execution slower than possible:
– More registers at your disposal than it seems
– Von Neumann bottleneck can be softened via cache use and data pre-fetch
– Stalls can be reduced by conditional move, avoiding false dependences
– Use (time limited) capabilities, such as proper store forwarding
– Note new pause instruction

x86 Architecture Progression

Agenda: x86 Arch. Progression
– Abstract & Objectives
– x86 Nomenclature & Notation
– Intel® Architecture Progress
– Pentium 4 Abstract

Abstract & Objectives: x86 Architecture Progression
– Abstract: high-level introduction to history and evolution of increasingly powerful 16-bit and 32-bit x86 processors that are backwards compatible.
– Objectives: understand processor generations and architectural features, by learning
  – Progressive architectural capabilities
  – Names of corresponding Intel processors
  – Explanation, description of capabilities
  – FP incompatibility, minor

Non-Objectives
Objective is not introduction of:
– x86 assembly language, assumed known
– Itanium® processor family, now in 3rd generation
– Intel tools (C++, VTune)
– Performance tools: MS Perfmon, Linux ps, emon, HP PFMon, etc.
– Performance benchmarks, performance counters
– Differentiation Intel vs. competitor products
– CISC vs. RISC

x86 Nomenclature & Notation
Example:
  Pentium® II, 2H98, 450 MHz
  MMX, BX chipset
  Dynamic branch prediction enhanced
– Line 1: processor name, initial launch date, final clock speed
– Line 2: architecturally visible enhancement list, can be empty
– Line 3: architectural speedup technique, invisible except for higher speed

Intel® Architecture Progress
– 8086, 2H80, 4 MHz
  – Visible enhancements: (none)
  – Speedup technique: 8087
– 80486, 2H85, 10 MHz
  – Visible enhancements: (none)
  – Speedup technique: FP integrated
– Pentium®, 1988, 40 MHz
  – Visible enhancements: (none)
  – Speedup technique: D+I caches, static branch prediction
– Pentium® Pro, 2H95, 100 MHz
  – Visible enhancements: (none)
  – Speedup technique: dynamic branch prediction
– Pentium® II, 2H98, 450 MHz
  – Visible enhancements: MMX, BX chipset
  – Speedup technique: dynamic branch prediction enhanced
– Pentium® III, 2H99, 733 MHz
  – Visible enhancements: SSE, XMM regs
  – Speedup technique: large cache, L2 on chip
– Pentium® 4, 2H00, 3.06 GHz
  – Visible enhancements: SSE2, 144 WNI, NetBurst®
  – Speedup technique: L3 on chip cache

Intel® Pentium® 4 Processors
Processor | Family | Description
Northwood | Pentium® | Willamette shrink. Consumer and business desktop processor. HT not enabled, though capable.
NW E-Step | Pentium | HT errata corrected. Desktop processor.
Prescott | Pentium | Consumer and business desktop processor. Replaces NW. Offers 6 PNI: Prescott New Instructions. First processor with LaGrande technology (trusted computing).
Prestonia DP | Xeon™ | DP slated for workstations and entry-level servers. Based on NW core. HT enabled. 512 kB L2 cache. No L3. 3 GHz processor.
Nocona DP | Xeon | DP based on Prescott core. Targeted for 3.06 GHz. 533 MHz (quad-pumped) bus, i.e. bus speed is 133 MHz. 1 MB L2 cache. HT enabled. About to be launched.
Foster MP | Xeon | MP based on Willamette core. 1 MB L3 cache, 256 kB L2, HT enabled. For higher-end servers.
Gallatin MP | Xeon | MP based on NW core. 1 or 2 MB L3 cache, 512 kB L2 cache. For high-end servers. See 8-way HP DL 760, and IBM x440. HT enabled.
Potomac MP | Xeon | MP based on Prescott core. 533 MHz (quad-pumped) bus. 1 MB L2 cache, 8 MB L3 cache. HT enabled, yet to be launched.
Note: lower clock rates for MP versions, due to higher circuit complexity and bus load.
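"Quad-pumped" means four data transfers per base bus clock, so a 133 MHz bus is quoted as 533 MHz. The sketch below works out the resulting peak bandwidth; the 8-byte (64-bit) data bus width is an assumption not stated on this slide, and the function name is illustrative.

```python
def fsb_bandwidth_gb_s(base_clock_mhz: float,
                       transfers_per_clock: int = 4,
                       bus_bytes: int = 8) -> float:
    """Peak bus bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_per_clock=4 models a quad-pumped bus;
    bus_bytes=8 is an assumed 64-bit data bus.
    """
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_bytes / 1e9


if __name__ == "__main__":
    for clock in (100, 133):
        print(f"{clock} MHz base clock: {fsb_bandwidth_gb_s(clock):.2f} GB/s")
```

Under these assumptions a 100 MHz base clock gives the 3.2 GB/s that the deck quotes for the 400 MHz system bus.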

Processor Generation Comparison
Feature | Pentium® III Processor | Pentium® III Processor | Pentium® 4 Processor | Northwood
MHz | 450-600 MHz | 600 MHz – 1.13 GHz | 1.5 GHz | 2+ GHz
Execution Type | Dynamic | Dynamic | Intel® NetBurst™ Arch | Intel® NetBurst™ Arch
MMX™ Technology | Yes | Yes | Yes | Yes
Streaming SIMD Extensions | Yes | Yes | Yes | Yes
Streaming SIMD Extensions 2 | No | No | Yes | Yes
System Bus | 100 MHz | 133 MHz | 400 MHz (4x100 MHz) | 400/533 MHz (4x100/133 MHz)
L2 Cache | 512k off-die | 256k on-die | 256k on-die | 512k on-die
Manufacturing Process | .25 micron | .18 micron | .18 micron | .13 micron
Chipset | ICH-1 | ICH-2 | ICH-2 | ICH-2

Intel® Architecture Progress
– 8087 co-processor of 8086: off-chip FP computation, extended 80-bit FP format for DP
– MMX: multi-media extensions
  – MMX regs aliased with FP register stack
  – Needs context switch
  – FP regs also called ST(i) regs
– SSE: Streaming SIMD extension, already since Pentium III
– WNI: 144 new instructions, using additional data types for existing opcodes, using previously reserved opcodes

Intel® Architecture Progress
– XMM: 8 new 128-bit registers, in addition to MMX
– SSE2: multiple integer ops and multiple DP FP ops: part of 144 WNI
  – Regs unchanged in Pentium® 4 from P III
  – Ops added
– NetBurst: generic term for: HyperThreading & quad-pumped bus & new Trace Cache & etc.
Note: an architectural feature ages with the next generation, but survives, due to the compatibility requirement. Hence it is interesting not only for historical reasons: you need to know it!

Xeon™ MP Abstract
[Figure: Xeon™ MP processor "Gallatin", feature callouts]
– Pipeline stages: 20
– Logical CPU: 2X (Hyperthreading Technology)
– Physical addressing (36-bit since P Pro): 64 GB (PAE-36)
– Registers: 24 registers (126)
– Execution units: 8 integer, 1 multimedia, 2 floating point (figure label: "652xALU")
– Issue ports: 1 2 3 4
– Core frequency: 2.0+ GHz
– Instructions/clock-cycle: 3 instructions / cycle
– On-die cache: L3 1 or 2 MB, L2 512 KB, L1 12K TC, 8K D
– External cache: (no value shown)
– System bus bandwidth: 3.2 GB/s (400)
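The 64 GB figure is simply the size of a 36-bit physical address space. A short check, with an illustrative function name:

```python
GIB = 1 << 30  # bytes in one binary gigabyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    if address_bits < 0:
        raise ValueError("address_bits must be non-negative")
    return 1 << address_bits


if __name__ == "__main__":
    for bits in (32, 36):
        print(f"{bits}-bit addresses: {addressable_bytes(bits) // GIB} GB")
```

Each of the four extra address bits doubles the range, which takes the 4 GB of plain 32-bit addressing to 64 GB.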

Xeon™ Memory Hierarchy
[Figure: Xeon™ Processor MP memory hierarchy; a 12.8 GB/s path is shown on chip]
– TC: 12 KB, 64B lines, 2 CLKS
– L1 (DL0): 8 KB, 64B lines, 2 CLKS
– L2 (unified): 512 KB, 8-way, 128B lines, 7+ CLKS
– L3: 2 MB, 8-way, 128B lines, 21+ CLKS
– External memory: 64 GB, 3.2 GB/s
Note: Physical Address Extension, 36-bit PAE addresses, since Pentium® Pro
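The per-level latencies (2, 7+ and 21+ clocks) can be combined into an average access time once hit rates are known. The sketch below assumes each latency is the total load-to-use time for a hit at that level; the hit rates and the memory latency in the example are hypothetical values, not taken from the slides.

```python
def average_access_clocks(levels, memory_clocks):
    """Expected clocks per access.

    levels: list of (latency_clocks, hit_rate) pairs, ordered from L1 outward.
    memory_clocks: latency when every cache level misses.
    """
    expected = 0.0
    reach = 1.0  # probability that an access gets as far as this level
    for latency, hit_rate in levels:
        if not 0.0 <= hit_rate <= 1.0:
            raise ValueError("hit_rate must be within [0, 1]")
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected + reach * memory_clocks


if __name__ == "__main__":
    # Latencies from the slide; hit rates and memory latency are made up.
    xeon_levels = [(2, 0.90), (7, 0.80), (21, 0.70)]
    print(f"{average_access_clocks(xeon_levels, memory_clocks=200):.2f} clocks")
```

Even a small fraction of accesses that reach memory contributes a large share of the average, because the memory term is multiplied by a latency far above any cache level.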

Architecture Enhancements

Agenda: Architecture Enhancements
– Abstract & Objectives
– Faster Clock
– Caches: Advantage, Cost, Limitation
– Multi-Level Cache-Coherence in MP
– Register Renaming
– Speculative, Out of Order Execution
– Branch Prediction, Code Straightening

Abstract & Objectives: Architecture Enhancements
– Abstract: outline generic techniques that overcome performance limitations
– Objectives: understand cost of architectural techniques (tricks) in terms of resources (mil space) and of lost performance if incorrectly guessed
  – Caches: cost silicon, can slow down
  – Branch prediction: costs silicon, can be wrong
  – Prefetch: costs instruction, may be superfluous
  – Superscalar: may not find a second op

Non-Objectives
– Objective is not to explain detail of Intel processor architecture
– Not to claim Intel invented techniques; academia invented many
– Not to show all techniques; some apply mainly to EPIC or VLIW architectures
– No hype, no judgment, just the facts please!

Faster Clock
– CISC:
  – Decompose circuitry into multiple simple, sequential modules
– Resulting modules are smaller and thus can be fast:
  – High clock rate
  – Shorter speed-paths
– That's what we call: pipelined architecture
– More modules -> simpler modules -> faster clock -> super-pipelined
– Super-pipelining NOT goodness per se:
  – Saves no silicon
  – Execution time per instruction does not improve
  – May get worse, due to delay cycles
– But:
  – Instructions retired per unit time improves
  – Especially in absence of (large number of) control-flow stalls
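The trade-off described above (same latency per instruction, better throughput, sensitivity to control-flow stalls) shows up in a simple cycle count. This is an idealized sketch: one instruction enters per cycle, and each control-flow stall is assumed to cost a full pipeline refill of stages minus one cycles. The function names and the flush model are assumptions for illustration.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Cycles when each instruction runs all stages before the next starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int, flushes: int = 0) -> int:
    """Cycles on an ideal pipeline, plus an assumed refill cost per flush."""
    if instructions == 0:
        return 0
    fill_and_drain = stages + (instructions - 1)   # first result after `stages`
    flush_penalty = flushes * (stages - 1)         # assumed refill per flush
    return fill_and_drain + flush_penalty


if __name__ == "__main__":
    n, k = 100, 20
    print("unpipelined:        ", unpipelined_cycles(n, k))
    print("pipelined, no stall:", pipelined_cycles(n, k))
    print("pipelined, 10 flush:", pipelined_cycles(n, k, flushes=10))
```

Each instruction still takes all 20 stages from start to finish; the gain comes only from overlapping them, and every flush gives part of that gain back.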

Faster Clock
– Xeon™ processor pipeline has 20 stages
– Beautiful model breaks upon control transfer
Intel® NetBurst™ µarchitecture: 20 stage pipeline
  1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive
[Figure: two overlapped instructions in a generic six-stage pipeline: I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store]

Intel® x86 Architectures

Agenda: Intel x86 Architectures
– Abstract & Objectives
– High Speed, Long Pipe
– Multiprocessing
– MMX Operations
– SSE Operations
– SSE2 Operations
– Willamette New Instructions WNI
– Cacheability Instructions
– Pause Instruction
– NetBurst, Hyperthreading
– SW Tools

Abstract & Objectives: Intel® x86 Architectures
– Abstract: emphasizing Pentium® 4 processors, show progressively more powerful architectural features introduced in Intel processors. Refer to speed problems solved from module 2 and general solutions explained in module 3.
– Objective: you not only understand the various processor product names and supported features (Intel marketing names), but understand how they work, and what their limitations and costs are.

Non-Objectives
Objective is not to show Intel's techniques are the only ones, or best possible. They are just a good trade-off in light of conflicting constraints:
– Clock speed vs. small # of pipes
– Small transistor count vs. high performance
– Large caches vs. small mil. space
– Grandiose architecture vs. backward compatibility
– Need for large register file vs. register-starved x86
– Wish to have two full on-die processors vs. preserving silicon space
High Speed, Long NetBurst™ Pipe
Basic Pentium® Pro Pipeline (10 stages):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec
Basic NetBurst™ Micro-architecture Pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive
Figure labels: Intro at 733MHz, .18µ; 1.4 GHz, .18µ; 2.2GHz, .13µ
Hyper pipelined Technology enables industry leading performance and clock rate
Check Your Progress
Match pipe functions to clocks/stages (clocks 1 to 20):
– Execute: Execute the µops on the correct port; 1 clk
– Flags: Compute flags (0, negative, etc.); 1 clk
– Trace Cache Fetch: Read decoded µop from TC; 2 clks
– Register File: Read the register file; 2 clks
– Drive: Drive µops to the Allocator; 1 clk
– Trace Cache/Next IP: Read from Branch Target Buffer; 2 clks
– Dispatch: Send µops to appropriate execution unit; 2 clks
– Rename: Rename logical regs to physical regs; 2 clks
– Drive: Drive the branch result to BTB at front; 1 clk
– Allocate: Allocate resources for execution; 1 clk
– Branch Check: Compare actual branch to predicted; 1 clk
– Queue: Write µop into µop queue to wait for scheduling; 1 clk
– Schedule: Write to schedulers; compute dependencies; 3 clks
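One way to check an answer to this exercise is to expand the per-stage clock counts into a clock-by-clock table. The sketch below is a small Python model, not Intel tooling. It assumes the stage order shown on the NetBurst pipeline slide; the names `STAGES` and `clock_to_stage` are made up for this illustration.

```python
# Stage names and clock counts as listed on the slide. The ORDER is
# taken from the NetBurst pipeline figure (an assumption of this sketch).
STAGES = [
    ("TC Nxt IP", 2), ("TC Fetch", 2), ("Drive", 1), ("Alloc", 1),
    ("Rename", 2), ("Que", 1), ("Sch", 3), ("Disp", 2), ("RF", 2),
    ("Ex", 1), ("Flgs", 1), ("Br Ck", 1), ("Drive", 1),
]


def clock_to_stage(stages):
    """Expand (name, clocks) pairs into a dict: clock number -> stage name."""
    table = {}
    clock = 1
    for name, clocks in stages:
        for _ in range(clocks):
            table[clock] = name
            clock += 1
    return table


if __name__ == "__main__":
    for clock, name in clock_to_stage(STAGES).items():
        print(f"{clock:2d}  {name}")
```

The clock counts add up to the 20 stages of the pipeline, which is a quick sanity check on any proposed matching.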
Multiprocessing, SMP
– Def: Execution of 1 task by >= 2 processors
– Flynn Model (1960s):
  – Single-Instruction, Single-Data Stream (SISD) Architecture (PDP-11)
  – Single-Instruction, Multiple-Data Stream (SIMD) Architecture (Array Processors, Solomon, Illiac IV, BSP, TMC)
  – Multiple-Instruction, Single-Data Stream (MISD) Architecture (possibly: pipelined, VLIW, EPIC)
  – Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture (possibly: EPIC when SW-pipelined, true multiprocessor)
MP Scalability Caveat
Performance gain from doubling processors:
Number of processors | 2 | 4 | 8 | 16 | 32 | 64 | 128
Gain | 0.90 | 0.81 | 0.73 | 0.59 | 0.43 | 0.25 | 0.11
Gain Follows Law of Diminishing Returns
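To see what these per-doubling gains mean for total throughput, the sketch below multiplies them up. It rests on an assumption the slide does not state: that a gain of g for a doubling step multiplies throughput by (1 + g). The names `GAIN_PER_DOUBLING` and `cumulative_speedup` are hypothetical.

```python
# Gain per doubling as read from the chart: processor count -> gain.
GAIN_PER_DOUBLING = {2: 0.90, 4: 0.81, 8: 0.73, 16: 0.59,
                     32: 0.43, 64: 0.25, 128: 0.11}


def cumulative_speedup(gains):
    """Assumption: each doubling multiplies throughput by (1 + gain)."""
    speedup = 1.0
    result = {}
    for processors in sorted(gains):
        speedup *= 1.0 + gains[processors]
        result[processors] = speedup
    return result


if __name__ == "__main__":
    for processors, speedup in cumulative_speedup(GAIN_PER_DOUBLING).items():
        print(f"{processors:4d} processors: about {speedup:5.2f}x")
```

Under that reading, 128 processors deliver well under a 20x speedup over one, which is the diminishing-returns point of the slide.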
Intel® Xeon™ Processor Scaling: 1.39x Frequency
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other source of information to evaluate the performance of systems or components they are considering purchasing.
Source: Intel Corporation. Based on Intel internal projections. System configuration assumptions: 1) two Intel® Xeon™ processor 2.8GHz with 512KB L2 cache in an E7500 chipset-based server platform, 16GB memory, Hyperthreading enabled; 2) Four Intel® Xeon™ processor MP 1.6GHz with 1MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 3) Four Intel® Xeon™ processor MP 2.0GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 4) Four Intel® Xeon™ processor MP 2.8GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled
[Chart: relative performance for two benchmarks, OLTP and SPECint_rate_base2000, each on three configurations]
Bar values as extracted: 1.00, 1.25, 1.40 and 1.00, 1.31, 1.68
Configurations:
– (2P) 2.2GHz, 400MHz Bus, 512KB cache
– (2P) 3.06GHz, 533MHz Bus, 512KB cache
– (2P) 3.06GHz, 533MHz Bus, 1MB cache
Frequency scaling is more visible with large cache
Intel® Xeon™ MP vs. Xeon™ Relative OLTP Performance
Source: TPC.org
Relative OLTP performance:
– (2P) Intel® Xeon™ processor @ 2.8GHz, 533MHz Bus, 0 L3: 1.00
– (4P) Intel® Xeon™ processor MP 2.0GHz, 400MHz Bus, 2MB L3: 2.00
Which processor is better?
Xeon processor MP Targeted for OLTP
MMX Integer Operations
Add (saturation): paddusw mm0, mm3 ; packed add with unsigned saturation on words
  one operand, word lanes: a2, a1, a0 and F000h
  other operand, word lanes: b2, b1, b0 and 3000h
  mm0 (result): a2+b2, a1+b1, a0+b0 and FFFFh (F000h + 3000h saturates)
Add (wrap around): paddw mm0, mm3 ; packed add on words
  one operand, word lanes: a2, a1, a0 and F000h
  other operand, word lanes: b2, b1, b0 and 3000h
  mm0 (result): a2+b2, a1+b1, a0+b0 and 2000h (F000h + 3000h wraps around)
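The difference between the two adds can be reproduced with a few lines of Python. This is a behavioral model of the lane arithmetic only, written for illustration; the function names simply mirror the mnemonics, and lanes are plain Python lists of unsigned 16-bit values.

```python
MASK16 = 0xFFFF


def paddw(a, b):
    """Packed add on 16-bit words with wrap-around (modulo 2**16)."""
    return [(x + y) & MASK16 for x, y in zip(a, b)]


def paddusw(a, b):
    """Packed add on 16-bit words with unsigned saturation at FFFFh."""
    return [min(x + y, MASK16) for x, y in zip(a, b)]


if __name__ == "__main__":
    a = [0x0001, 0xF000]
    b = [0x0002, 0x3000]
    print([hex(v) for v in paddw(a, b)])    # wrap-around result
    print([hex(v) for v in paddusw(a, b)])  # saturated result
```

Lanes that do not overflow give the same result under both instructions; only the F000h + 3000h lane differs.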
MMX Arithmetic Operations
Multiply-low: pmullw mm0, mm3 ; multiply low, words
  mm0: a3 a2 a1 a0
  mm3: b3 b2 b1 b0
  products: a3*b3 a2*b2 a1*b1 a0*b0
  mm0 (result): c3 c2 c1 c0 (low 16 bits of each product)
Multiply-high: pmulhw mm1, mm4 ; multiply high, words
  mm1: a3 a2 a1 a0
  mm4: b3 b2 b1 b0
  products: a3*b3 a2*b2 a1*b1 a0*b0
  mm1 (result): c3 c2 c1 c0 (high 16 bits of each product)
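A minimal Python model of the two multiplies, assuming signed 16-bit lanes stored as raw 16-bit patterns. The helper `to_signed16` is an invention of this sketch; the other two functions mirror the mnemonics.

```python
def to_signed16(word):
    """Interpret a raw 16-bit pattern as a signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pmullw(a, b):
    """Low 16 bits of each signed 16 x 16 -> 32-bit product."""
    return [(to_signed16(x) * to_signed16(y)) & 0xFFFF for x, y in zip(a, b)]


def pmulhw(a, b):
    """High 16 bits of each signed 16 x 16 -> 32-bit product."""
    return [((to_signed16(x) * to_signed16(y)) >> 16) & 0xFFFF
            for x, y in zip(a, b)]


if __name__ == "__main__":
    a = [0x4000, 0xFFFE]   # 16384 and -2
    b = [0x0004, 0x0003]   # 4 and 3
    print([hex(v) for v in pmullw(a, b)])
    print([hex(v) for v in pmulhw(a, b)])
```

Pairing the two results lane by lane gives back the full 32-bit products, which is how the two instructions are typically used together.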
MMX Arithmetic Operations
Multiply Add: pmaddwd mm1, mm4 ; packed multiply and add, 4 words to 2 doublewords
  mm1: a3 a2 a1 a0
  mm4: b3 b2 b1 b0
  products: a3*b3 a2*b2 a1*b1 a0*b0
  mm1 (result): a3*b3+a2*b2 | a1*b1+a0*b0
Note: This instruction does not have a saturation option.
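The multiply-add can be sketched as follows. This is a Python model under two assumptions: lanes are given as signed Python integers, lowest lane first, and results wrap to 32 bits because there is no saturation option.

```python
def pmaddwd(a, b):
    """a, b: four signed 16-bit words, lowest lane first.

    Returns two signed 32-bit doublewords:
    [a0*b0 + a1*b1, a2*b2 + a3*b3], each wrapped to 32 bits.
    """
    out = []
    for i in (0, 2):
        total = (a[i] * b[i] + a[i + 1] * b[i + 1]) & 0xFFFFFFFF
        out.append(total - 0x100000000 if total & 0x80000000 else total)
    return out


if __name__ == "__main__":
    print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))
    # The only overflowing input: all four factors equal to -32768.
    print(pmaddwd([-32768, -32768, 0, 0], [-32768, -32768, 0, 0]))
```

The second call shows the consequence of the missing saturation option: the sum 2^31 does not fit in a signed doubleword and wraps.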
MMX Convert Operations
Unpack, interleaved merge:
punpcklwd mm0, mm1 ; unpack low words into doublewords
  mm0: a3 a2 a1 a0
  mm1: b3 b2 b1 b0
  mm0 (result): b1 a1 b0 a0
punpckhwd mm0, mm1 ; unpack high words into doublewords
  mm0: a3 a2 a1 a0
  mm1: b3 b2 b1 b0
  mm0 (result): b3 a3 b2 a2
Zero extend from small data elements to bigger data elements by using the unpack instruction, with zeros in one of the operands.
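The interleave and the zero-extension trick can be modeled in Python as below. Lanes are lists with the lowest lane first (so the slide's right-most element is index 0); `words_to_dwords` is a helper made up for this sketch.

```python
def punpcklwd(dst, src):
    """Interleave the two low words of dst and src (lowest lane first)."""
    return [dst[0], src[0], dst[1], src[1]]


def punpckhwd(dst, src):
    """Interleave the two high words of dst and src (lowest lane first)."""
    return [dst[2], src[2], dst[3], src[3]]


def words_to_dwords(words):
    """View four 16-bit words as two 32-bit doublewords (little-endian)."""
    return [words[0] | (words[1] << 16), words[2] | (words[3] << 16)]


if __name__ == "__main__":
    a = [0x1234, 0xFFFF, 0x0001, 0x8000]
    zero = [0, 0, 0, 0]
    # Unpacking against zeros zero-extends each word to a doubleword.
    print([hex(v) for v in words_to_dwords(punpcklwd(a, zero))])
    print([hex(v) for v in words_to_dwords(punpckhwd(a, zero))])
```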
MMX Convert Operations
Pack: packusdw mm0, mm1 ; pack with unsigned saturation (signed) doublewords into words
  mm0: A B (2 doublewords)
  mm1: C D (2 doublewords)
  mm0 (result): C' D' A' B' (4 words)
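A Python sketch of the pack operation as the slide describes it: signed doublewords in, saturated words out. One caveat on the mnemonic: in the MMX instruction set itself the doubleword-to-word pack is packssdw, which saturates to the signed range, so the model offers both modes. The names `saturate` and `pack_dwords_to_words` are hypothetical.

```python
def saturate(value, low, high):
    """Clamp value into the closed range [low, high]."""
    return max(low, min(high, value))


def pack_dwords_to_words(dst, src, unsigned=True):
    """Pack two signed dwords from dst and two from src into four words.

    Lists are lowest lane first: dst supplies the low half of the result
    and src the high half. unsigned=True clamps to 0..FFFFh, as the slide
    describes; unsigned=False clamps to -8000h..7FFFh.
    """
    low, high = (0, 0xFFFF) if unsigned else (-0x8000, 0x7FFF)
    return [saturate(v, low, high) for v in list(dst) + list(src)]


if __name__ == "__main__":
    print(pack_dwords_to_words([70000, -5], [1234, 65535]))
    print(pack_dwords_to_words([70000, -5], [1234, 65535], unsigned=False))
```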
MMX Shift Operations
psllw MM0, 8 ; packed shift left logical, words
  MM0 (before): 81DBh 007Fh 703Fh DF00h
  MM0 (after):  DB00h 7F00h 3F00h 0000h
psllq MM0, 8 ; packed shift left logical, quadword
  MM0 (before): 703F 0000 FFD9 4364h
  MM0 (after):  3F00 00FF D943 6400h
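The two shift examples on this slide can be checked with a short Python model. Bits shifted out of a lane are lost, and a count larger than the lane width clears the lane. The word values are listed in the order they appear in the extracted slide text.

```python
def psllw(words, count):
    """Shift each 16-bit word left; bits leaving the word are dropped."""
    if count > 15:
        return [0] * len(words)
    return [(w << count) & 0xFFFF for w in words]


def psllq(qword, count):
    """Shift the whole 64-bit quadword left as one unit."""
    if count > 63:
        return 0
    return (qword << count) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    print([f"{w:04X}h" for w in psllw([0x81DB, 0x007F, 0x703F, 0xDF00], 8)])
    print(f"{psllq(0x703F0000FFD94364, 8):016X}h")
```

The word shift never moves bits across lane boundaries, while the quadword shift carries them through the full 64 bits.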
MMX Compare Operations
pcmpgtw ; compare greater than, words (generate a mask)
  destination: 51 3 5 23
  source:      73 2 5 6
  result:      000...00 111...11 000...00 111...11 (all ones where destination > source)
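The mask generation can be modeled as below. Which row of the figure is the destination is inferred here from the result mask, so treat the operand order as an assumption of this sketch; lanes are signed Python integers.

```python
def pcmpgtw(dst, src):
    """Per-lane signed compare: FFFFh where dst > src, else 0000h."""
    return [0xFFFF if d > s else 0x0000 for d, s in zip(dst, src)]


if __name__ == "__main__":
    mask = pcmpgtw([51, 3, 5, 23], [73, 2, 5, 6])
    print([f"{m:04X}h" for m in mask])
```

The resulting mask is typically combined with logical AND / AND NOT / OR to select values without branching.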
SSE Registers
Streaming SIMD Extension Registers (128-bit integer): XMM0 ... XMM7, 128 bits wide
– Eight 128-bit registers
– Single-precision / Double-precision / 128-bit integer
– Direct access to registers
– Referred to as XMM0-XMM7
– Use simultaneously with FP / MMX™ Technology
– Data array only
IA-INT Registers: EAX ... EDI, 32 bits wide
– Fourteen 32-bit registers
– Direct register access
– Scalar data only
MMX™ Technology / IA-FP Registers: FP0 or MM0 ... FP7 or MM7, 64 / 80 bits wide
– Eight 64-bit registers
– Xor eight 80-bit FP regs
– Direct access to regs
– FP data / data array
– x87 remains aliased with SIMD integer registers
– Context-switch
SSE Arithmetic Operations
Full Precision:
– ADD, SUB, MUL, DIV, SQRT
  – Floating Point (Packed/Scalar)
  – Full 23-bit precision
Approximate Precision:
– RCP: Reciprocal
– RSQRT: Reciprocal Square Root
  – Perspective correction / projection
  – Vector normalization
  – Very fast
  – Return at least 11 bits of precision
SSE Arithmetic Operations
MULPS: Multiply Packed Single-FP
mulps xmm1, xmm2
  xmm1: X4 X3 X2 X1
  xmm2: Y4 Y3 Y2 Y1
  xmm1 (result): X4*Y4 X3*Y3 X2*Y2 X1*Y1
SSE Compare Operation
CMPPS: Compare Packed Single-FP
cmpps xmm0, xmm1, 1 ; predicate 1 = less than
  xmm0: 1.1 7.3 2.3 5.6
  xmm1: 8.6 2.3 3.5 1.2
  xmm0 (result): 111…11 000…00 111…11 000…00
SSE2 Registers (they look like the SSE registers because they are the same registers)
Streaming SIMD Extension Registers (scalar / packed SIMD-SP, SIMD-DP, 128-bit integer): XMM0 ... XMM7, 128 bits wide
– Eight 128-bit registers
– Single-precision array / Double-precision array / 128-bit integer
– Direct access to registers
– Referred to as XMM0-XMM7
– Use simultaneously with FP / MMX™ Technology
– Data array only
IA-INT Registers: EAX ... EDI, 32 bits wide
– Fourteen 32-bit registers
– Direct register access
– Scalar data only
MMX™ Technology / IA-FP Registers: FP0 or MM0 ... FP7 or MM7, 64 / 80 bits wide
– Eight 64-bit registers
– Xor eight 80-bit FP regs
– Direct access to regs
– FP data / data array
– x87 remains aliased with SIMD integer registers
– Context-switch
SSE2 Register Use
Backward compatible with all existing MMX™ & SSE code
[Figure: instruction types supported by the Pentium® III processor and by the Willamette processor, joined by -AND- / -OR-]
Instruction Type:
– Standard x87 (SP, DP, EP)
– 64-bit SIMD int. (4x16, 8x8)
– Single-precision SIMD FP (4x32)
– Double-precision SIMD FP (2x64)
– 128-bit SIMD int. (8x16, 16x8)
– Cache Management (Memory Streaming/Prefetch)
Notes:
– New 64-bit double-precision floating point instructions
– New / enhanced 128-bit wide SIMD integer
  – Superset of MMX™ technology instruction set
– No forced context switching on SSE registers (unlike MMX™/x87 registers)
Willamette New Instructions
– New Instructions
  – Extended SIMD Integer Instructions
  – New SIMD Double-precision FP Instructions
  – New Cacheability Instructions
– Fully Integrated into Intel Architecture
  – Use previously reserved opcodes
  – Same addressing modes as MMX™ / SSE ops
  – Several MMX™ / SSE mnemonics are repeated
  – New Extended SIMD functionality is obtained by specifying 128-bit registers (xmm0-xmm7) as src/dst.
SIMD Double-Precision FP Ops
– Same instruction categories as SIMD single-precision FP instructions
– Operate on both elements of packed data, in parallel -> SIMD
– Some instructions have scalar or packed versions
– IEEE 754 Compliant FP Arithmetic
  – Not bit exact with x87: 80 bit internal vs 64 bit mem
– Usable in all modes: real, virtual x86, SMM, and protected (16-bit & 32-bit)
Register layout: X2 | X1 (X1 is the scalar element)
Element format: S (bit 63) | Exponent (bits 62-52) | Significand (bits 51-0)
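The bit layout in the figure (sign in bit 63, exponent in bits 62-52, significand in bits 51-0) can be inspected with Python's standard `struct` module. `decode_double` is a name chosen for this sketch.

```python
import struct


def decode_double(x):
    """Split a double into (sign, biased exponent, significand field)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11 bits, bias 1023
    significand = bits & ((1 << 52) - 1)     # 52 stored fraction bits
    return sign, exponent, significand


if __name__ == "__main__":
    for value in (1.0, -2.5, 0.1):
        s, e, f = decode_double(value)
        print(f"{value:5}: sign={s} exponent={e} significand={f:013X}h")
```

This is the 64-bit memory format; the "not bit exact with x87" remark refers to x87 computing in an 80-bit internal format before rounding to it.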
FP Instruction Syntax
– Arithmetic FP Instructions can be:
  – Packed or Scalar
  – Single-Precision or Double-Precision
ASM | Intrinsics | Meaning
addps | _mm_add_ps() | Add Packed Single
addpd | _mm_add_pd() | Add Packed Double
addss | _mm_add_ss() | Add Scalar Single
addsd | _mm_add_sd() | Add Scalar Double
Suffix: first letter P(acked) or S(calar), second letter S(ingle) or D(ouble)
New SSE2 Data Types
– Packed & Scalar FP Instructions operate on packed single- or double-precision floating point elements
  – Packed instructions operate on 4 (sp) or 2 (dp) floats
  – Scalar instructions operate only on the right-most field
addpd: X2 X1 op Y2 Y1 -> X2opY2 X1opY1
addps: X4 X3 X2 X1 op Y4 Y3 Y2 Y1 -> X4opY4 X3opY3 X2opY2 X1opY1
addss: X4 X3 X2 X1 op Y4 Y3 Y2 Y1 -> Y4 Y3 Y2 X1opY1
addsd: X2 X1 op Y2 Y1 -> Y2 X1opY1
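The four variants in the figure can be contrasted with a small Python model. It follows the figure in letting the untouched upper elements of a scalar operation come from the Y operand. Lists hold the lowest element first (X1 is index 0), and `to_single` is a helper made up here to round to single precision.

```python
import struct


def to_single(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(x, y):
    """Packed single: all four lanes are added."""
    return [to_single(a + b) for a, b in zip(x, y)]


def addss(x, y):
    """Scalar single: only the lowest lane is added, the rest pass through."""
    return [to_single(x[0] + y[0])] + list(y[1:])


def addpd(x, y):
    """Packed double: both lanes are added."""
    return [a + b for a, b in zip(x, y)]


def addsd(x, y):
    """Scalar double: only the lowest lane is added."""
    return [x[0] + y[0]] + list(y[1:])


if __name__ == "__main__":
    print(addps([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]))
    print(addss([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]))
    print(addpd([1.0, 2.0], [10.0, 20.0]))
    print(addsd([1.0, 2.0], [10.0, 20.0]))
```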
Extended SIMD Integer Ops
– All MMX™/SSE integer instructions operate on 128-bit wide data in XMM registers
– Additionally, some new functionality
  – MOVDQA, MOVDQU: 128-bit aligned/unaligned moves
  – PADDQ, PSUBQ: 64-bit Add/Subtract for mm & xmm regs
  – PMULUDQ: Packed 32 * 32 bit Multiply
  – PSLLDQ, PSRLDQ: 128-bit byte-wise Shifts
  – PSHUFD: Shuffle four double-words in xmm register
  – PSHUFLW, PSHUFHW: Shuffle four words in lower/upper half of xmm reg
  – PUNPCKLQDQ, PUNPCKHQDQ: Interleave lower/upper quadwords
  – Full 128-bit Conversions: 4 Ints vs. 4 SP Floats
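As an illustration of one of the new operations, the sketch below models PSHUFD in Python: each destination doubleword is picked from the source by a 2-bit field of the 8-bit immediate. Lists hold the lowest doubleword first; this is a behavioral model only.

```python
def pshufd(src, imm8):
    """Destination lane i receives src[(imm8 >> 2*i) & 3]."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


if __name__ == "__main__":
    lanes = [10, 11, 12, 13]
    print(pshufd(lanes, 0x1B))  # fields 3,2,1,0: reverses the lanes
    print(pshufd(lanes, 0x00))  # broadcasts lane 0
    print(pshufd(lanes, 0xE4))  # fields 0,1,2,3: identity
```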
New SIMD Integer Data Formats
New 128-bit data-types for fixed-point integer data (bits 127-0):
– 16 Packed bytes: lowest element in bits 7-0
– 8 Packed words: lowest element in bits 15-0
– 4 Packed doublewords: lowest element in bits 31-0
– 2 Quadwords: lowest element in bits 63-0
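The four formats are different views of the same 128 bits. A short Python sketch using the standard `struct` module shows this, assuming the little-endian byte order of x86 memory; `views` is a hypothetical helper name.

```python
import struct


def views(raw16):
    """Reinterpret 16 raw bytes in the four packed integer formats."""
    if len(raw16) != 16:
        raise ValueError("expected exactly 16 bytes (128 bits)")
    return {
        "bytes": list(struct.unpack("<16B", raw16)),
        "words": list(struct.unpack("<8H", raw16)),
        "dwords": list(struct.unpack("<4I", raw16)),
        "qwords": list(struct.unpack("<2Q", raw16)),
    }


if __name__ == "__main__":
    for name, lanes in views(bytes(range(16))).items():
        print(name, [hex(v) for v in lanes])
```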
New DP Instruction Categories
Computation:
– ADD, SUB, MUL, DIV, SQRT, MAX, MIN
  – Full 52-bit precision mantissa (Packed & Scalar)
Logic:
– AND, ANDN, OR, XOR
  – Operate uniformly on entire 128-bit register
  – Must use DP instructions for double-precision data
Data Formatting:
– MOVAPD, MOVUPD
  – 128-bit DP moves (aligned/unaligned)
– MOVH/LPD, MOVSD
  – 64-bit DP moves
– SHUFPD
  – Shuffle packed doubles
  – Select data using 2-bit immediate operand
DP Packed & Scalar Operations
– The new Packed & Scalar FP Instructions operate on packed double-precision floating point elements
  – Packed instructions operate on 2 numbers
  – Scalar instructions operate on least-significant number
addsd: X2 X1 op Y2 Y1 -> Y2 X1opY1
addpd: X2 X1 op Y2 Y1 -> X2opY2 X1opY1
SHUFPD Instruction
SHUFPD: Shuffle Packed Double-FP
  XMM1: x2 x1
  XMM2: y2 y1
  XMM1 (result): upper element is y2 or y1 (immediate bit 1), lower element is x2 or x1 (immediate bit 0)
SHUFPD XMM1, XMM2, 3 // binary 11 -> XMM1: y2 x2
SHUFPD XMM1, XMM2, 2 // binary 10 -> XMM1: y2 x1
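The selection rule can be written out as a short Python model. Lists hold the lower element first, so the slide's "y2 x2" appears as `["x2", "y2"]`; strings stand in for the double values.

```python
def shufpd(dst, src, imm8):
    """Bit 0 picks the low result element from dst, bit 1 the high from src."""
    low = dst[imm8 & 1]
    high = src[(imm8 >> 1) & 1]
    return [low, high]


if __name__ == "__main__":
    xmm1 = ["x1", "x2"]
    xmm2 = ["y1", "y2"]
    for imm in range(4):
        print(f"imm = {imm:02b}: {shufpd(xmm1, xmm2, imm)}")
```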
New DP Instruction Categories, Cont'd
Branching:
– CMPPD, CMPSD
  – Compare & mask (Packed/Scalar)
– COMISD
  – Scalar compare and set status flags
– MOVMSKPD
  – Store 2-bit mask of DP sign bits in a reg32
Type Conversion:
– CVT
  – Convert DP to SP & 32-bit integer w/ rounding (Packed/Scalar)
– CVTT
  – Convert DP to 32-bit integer w/ truncation (Packed/Scalar)
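The difference between the CVT and CVTT families shows up on values with a fractional part. The Python sketch below models one scalar member of each family. It assumes the default rounding mode (round to nearest, ties to even), which happens to match Python's built-in `round`, and it assumes inputs that fit in a 32-bit integer.

```python
import math


def cvtsd2si(x):
    """Convert with rounding: nearest integer, ties go to the even one."""
    return round(x)


def cvttsd2si(x):
    """Convert with truncation: the fractional part is discarded."""
    return math.trunc(x)


if __name__ == "__main__":
    for value in (2.5, 3.5, 2.9, -2.7):
        print(f"{value:5}: rounded {cvtsd2si(value):3d}, "
              f"truncated {cvttsd2si(value):3d}")
```

Truncation matches the behavior C prescribes for float-to-integer casts, which is one common reason to prefer the CVTT forms in compiled code.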
Compare & Mask Operation
CMPPD: Compare Packed Double-FP
CMPPD XMM0, XMM1, 1 // 1 = less than
  XMM0: 1.1 12.3
  XMM1: 8.6 3.5
  XMM0 (result): 1111111….111 0000000….000
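A compare mask is usually consumed in one of two ways: reduced to a small integer for branching (MOVMSKPD), or combined with the logic instructions to select values without a branch. The Python sketch below models both on the values from this slide. Lists hold the lower element first, and `select`, `bits` and `from_bits` are helper names made up here.

```python
import struct

ALL_ONES = 0xFFFFFFFFFFFFFFFF


def bits(x):
    """Raw 64-bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(pattern):
    """Double value of a raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", pattern))[0]


def cmppd_lt(dst, src):
    """Per-lane mask: all ones where dst < src, else all zeros."""
    return [ALL_ONES if d < s else 0 for d, s in zip(dst, src)]


def movmskpd(lanes):
    """Collect the sign bit of each 64-bit lane into a 2-bit integer."""
    return sum(((lane >> 63) & 1) << i for i, lane in enumerate(lanes))


def select(mask, a, b):
    """(a AND mask) OR (b AND NOT mask), lane by lane."""
    return [from_bits((bits(x) & m) | (bits(y) & ~m & ALL_ONES))
            for m, x, y in zip(mask, a, b)]


if __name__ == "__main__":
    xmm0 = [12.3, 1.1]   # lower element first
    xmm1 = [3.5, 8.6]
    mask = cmppd_lt(xmm0, xmm1)
    print(f"movmskpd -> {movmskpd(mask):02b}")
    print("per-lane minimum:", select(mask, xmm0, xmm1))
```

Selecting with the less-than mask yields the per-lane minimum here, a common branch-free idiom.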
Cache Enhancements
– On-die trace cache for decoded µops (TC)
  – Holds 12K µops
– 8K on-die, 1st level data cache (L1)
  – 64-byte line size
    – Pentium Pro was 32 bytes
  – Ultrafast, multiple accesses per instruction
– 256K on-die, 2nd level write-back, unified data and instruction cache (L2)
  – 128-byte line size
  – operates at full processor clock frequency
– PREFETCH instructions return 128 bytes to L2
[Figure arrow label: Faster]
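The sizes above translate into cache geometry with a little arithmetic. The sketch below takes the sizes and line lengths from this slide and the associativity (4-way L1, 8-way L2) from the NetBurst overview slide later in this deck. It assumes power-of-two sizes, and the name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Derive line count, set count, and address field widths."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(set count)
    }


if __name__ == "__main__":
    print("L1:", cache_geometry(8 * 1024, 64, 4))
    print("L2:", cache_geometry(256 * 1024, 128, 8))
```

The offset and index widths show how an address is split when looking up a line; the remaining upper bits form the tag.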
New Cacheability Instructions
– MMX™/SSE cacheability instructions preserved
– New Functionality:
  – CLFLUSH: Cache line flush
  – LFENCE / MFENCE: Load Fence / Memory Fence
  – PAUSE: Pause execution
  – MASKMOVDQU: Mask move 128-bit integer data
  – MOVNTPD: Streaming store with 2 64-bit DP FP data
  – MOVNTDQ: Streaming store with 128-bit integer data
  – MOVNTI: Streaming store with 32-bit integer data
Streaming Stores
– Willamette implementation supports:
  – Writing to uncacheable buffer (e.g. AGP) with full line-writes
  – Re-reading same buffer with full line-reads
  – New in WNI, compared to Katmai/CuMine
– Integer streaming store
  – Operates on integer registers (e.g., EAX, EBX)
  – Useful for OS, by avoiding need to save FP state, just move raw bits
Detail: Cache Line Flush
– CLFLUSH: Cache line containing m8 flushed and invalidated from all caches in the coherency domain
– Linear address based; allowed by user code
– Potential usage:
  – Allows incoherent (AGP) I/O data to be mapped as WB for high read performance and flushed when updated
  – Example: video encode stream
  – Precise control of dirty data eviction may increase performance by scheduling @ idle memory cycles
Detail: Fences
– Capabilities introduced over time to enable software managed coherence:
  – Write combining with the Pentium Pro processor
  – SFence and memory streaming with Streaming SIMD Extensions
– New Willamette Fences complete the tool set to enable full software coherence management
  – LFence, strong load order
    – Blocks younger loads from passing a prior load instruction
    – All loads preceding an LFence will be completed before loads coming after the LFence
  – MFence
    – Achieves effect of LFence and SFence instructions executed at same time
    – Necessary, as issuing an SFence instruction followed by an LFence instruction does not prevent a load from passing a prior store
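Why a load passing a prior store matters can be shown with a toy store-buffer model. This is a deliberately simplified Python simulation, not a model of any real processor: each CPU's stores sit in a private buffer that is drained to memory only by an `mfence` step, and loads read memory unless the CPU's own buffer holds the variable. All names are hypothetical. Two CPUs each store to one variable and then load the other, and the code enumerates every interleaving.

```python
from itertools import combinations


def run(schedule, programs):
    """Execute one interleaving; return the value each CPU loaded."""
    memory = {"x": 0, "y": 0}
    buffers = [[], []]            # per-CPU FIFO store buffers
    pcs = [0, 0]
    loaded = [None, None]
    for cpu in schedule:
        op = programs[cpu][pcs[cpu]]
        pcs[cpu] += 1
        if op[0] == "store":
            buffers[cpu].append((op[1], op[2]))
        elif op[0] == "mfence":
            # Drain: all earlier stores become globally visible.
            for var, value in buffers[cpu]:
                memory[var] = value
            buffers[cpu].clear()
        elif op[0] == "load":
            # Forward from the CPU's own buffer, else read memory.
            own = [value for var, value in buffers[cpu] if var == op[1]]
            loaded[cpu] = own[-1] if own else memory[op[1]]
    return tuple(loaded)


def outcomes(programs):
    """Set of load results over all interleavings of the two programs."""
    n0, n1 = len(programs[0]), len(programs[1])
    seen = set()
    for slots in combinations(range(n0 + n1), n0):
        schedule = [0 if i in slots else 1 for i in range(n0 + n1)]
        seen.add(run(schedule, programs))
    return seen


NO_FENCE = [[("store", "x", 1), ("load", "y")],
            [("store", "y", 1), ("load", "x")]]
WITH_MFENCE = [[("store", "x", 1), ("mfence",), ("load", "y")],
               [("store", "y", 1), ("mfence",), ("load", "x")]]

if __name__ == "__main__":
    print("without fence:", sorted(outcomes(NO_FENCE)))
    print("with mfence:  ", sorted(outcomes(WITH_MFENCE)))
```

In this model both CPUs can load 0 when no fence separates the store from the load, and that outcome disappears once each store is followed by an `mfence`.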
Pause Instruction
– PAUSE is architecturally a NOP on earlier IA-32 processor generations
– Usable since Willamette!
– Not necessary to check processor type.
– PAUSE is a hint to the processor that the code is a spin-wait or non-performance-critical code. A processor which uses the hint can:
  – Significantly improve performance of spin-wait loops without negative performance impact, by inserting an implementation-dependent delay that helps processors with dynamic execution (a.k.a. out-of-order execution) exit from the spin-loop faster
  – Significantly reduce power consumption during spin-wait loops
NetBurst™ µArchitecture Overview
[Block diagram; blocks as labeled in the figure:]
– System Bus
– Bus Unit
– 2nd Level Cache, 8-way
– 1st Level Cache (Data), 4-way
– Front End: Fetch/Decode, Trace Cache, Microcode ROM
– Execution, Out-of-Order Core
– Retirement
– BTBs/Branch Prediction
Legend: frequently used paths; less frequently used paths
NetBurst™ µArchitecture
[Block diagram; blocks as labeled in the figure:]
– 3.2 GB/s System Interface
– L2 Cache and Control
– BTB & I-TLB
– Decoder
– Trace Cache
– BTB
– µCode ROM
– Rename/Alloc
– µop Queues
– Schedulers
– Integer RF
– ALU (four units)
– Load AGU, Store AGU
– L1 D-Cache and D-TLB
– FP RF
– FMul, FAdd, MMX, SSE
– FP move, FP store
Path labels in the figure: 3
NetBurst™ µArchitecture Summary
– Quad-pumps the bus to keep the caches loaded
– Stores most recent instructions as µops in TC to enhance instruction issue
– Improves Program Execution
  – Issues up to 3 µops per clock
  – Dispatches up to 6 µops to Execution Units per clock
  – Retires up to 3 µops per clock
– Feeds back branch and data information to have required instructions and data available
What is Hyperthreading?
– Ability of processor to run multiple threads
  – Duplicate architecture state creates illusion to SW of Dual Processor (DP)
  – Execution unit shared between two threads, but dedicated if one stalls
– Effect of Hyperthreading on Xeon Processor:
  – CPU utilization increases to 50% (from ~35%)
  – About 30% performance gain for some applications with the same processor frequency
– Hyperthreading Technology Results:
  1. More performance with enabled applications
  2. Better responsiveness with existing applications
Hyperthreading Implementation
– Almost two Logical Processors
– Architecture state (registers) and APIC* duplicated
– Share execution units, caches, branch prediction, control logic and buses
[Diagram: two Architecture State blocks, each with its own Adv. Programmable Interrupt Control, sharing the Processor Execution Resource, On-Die Caches, and System Bus]
*APIC: Advanced Programmable Interrupt Controller. Handles interrupts sent to a specified logical processor
Benefits to Xeon™ Processor: Hyperthreading Technology Performance for Dual Processor Servers
[Chart: relative performance, HT OFF = 1.00; bar values as extracted: 1.21 and 1.19, for WebBench / Web Server Performance and Trade2 / Java Apps Server Performance]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
Source: Veritest (Sep, 2002). Comparisons based on Intel internal measurements w/pre-production hardware.
1) HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v1.48 16MB EDO RAM- Dell PowerVault 210S disk array.
2) HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v1.48 16MB EDO RAM- Dell PowerVault 210S disk array.
Enhancements in bandwidth, throughput and thread-level parallelism with Hyperthreading Technology deliver an acceleration of performance
Hyper Threading Technology Performance Gains
Intel® Xeon™ processor 2.8GHz with 512KB cache, Microsoft Windows* 2000
Hyperthreading Technology increases performance by ~20% on Some Server Applications
Hyperthreading for Workstation
Source: Intel Corporation. With and without Hyperthreading Technology on the following system configuration: Intel Xeon Processor 2.80 GHz/533 MHz system bus with 512KB L2 cache, Intel® E7505 chipset-based Pre-Release platform, 1GB PC2100 DDR CL2 CAS2-2-2, (2) 18GB Seagate* Cheetah ST318452LW 15K Ultra160 SCSI hard drive using Adaptec 39160 SCSI adapter BIOS 3.10.0, nVidia* Quadro4 Pro 980XGL 128MB AGP 8x graphics card with driver version 40.52, Windows XP* Professional build 2600.
[Figure: Hyperthreading Technology performance gains, Intel® Xeon™ processor 2.8GHz with 512KB cache; bar chart of relative performance with HT Off = 1.00]
Performance gains whether running:
– Multiple tasks within one application
– Multiple applications running at once
Multi-Threaded Application (bar values in chart order: 1.15, 1.15, 1.26, 1.27, 1.27):
– CHARMm*
– 3DSM*5
– D2cluster*
– BLAST*
– Lightwave 3D*75
Multi-Tasking Application (bar values in chart order: 1.18, 1.19, 1.26, 1.29, 1.37):
– Patran* + Nastran*
– Multiple Compiles
– 3ds max* + Photoshop
– Compile + Regression
– Maya* multiple renderings + Animation
Hyperthreading Technology increases performance by 15-37% on Workstation Applications
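The 15-37% range quoted above can be checked against the bar values on the chart. The short sketch below does only that arithmetic; the two list names and the helper gain_percent are illustrative, and the values are copied from the chart with HT Off taken as 1.00.

```python
# Relative performance values read off the workstation chart (HT Off = 1.00).
multi_threaded = [1.15, 1.15, 1.26, 1.27, 1.27]
multi_tasking = [1.18, 1.19, 1.26, 1.29, 1.37]


def gain_percent(relative):
    """Convert a relative-performance value into a percentage gain over HT Off."""
    return round((relative - 1.0) * 100)


gains = [gain_percent(v) for v in multi_threaded + multi_tasking]
lowest, highest = min(gains), max(gains)
print(f"gain range: {lowest}% to {highest}%")
```

The smallest and largest bars bound the range stated on the slide.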
Hyperthreading Resources
Type | Description | Example
Shared | Each logical processor can use, evict or allocate any part of resource | Cache, WC Buffers, VTune reg., MS-ROM
Duplicated | Each logical processor has its own set of resources | APIC, registers, TSC, IP
Split | Resources are hard partitioned in half | Load/Store buffers, ITLB, ROB, IAQ
Tagged | Resource entries are tagged with processor ID | Trace Cache, DTLB
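To make the difference between Tagged and Split resources concrete, here is a small software model. It is a sketch only: the class names TaggedTable and SplitTable, the capacity values and the oldest-first eviction are assumptions for illustration, not a description of the actual hardware structures.

```python
class TaggedTable:
    """One pool of entries; each entry is tagged with the logical processor ID."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # (lp_id, key) -> value, kept in insertion order

    def insert(self, lp_id, key, value):
        tag = (lp_id, key)
        if tag not in self.entries and len(self.entries) >= self.capacity:
            # Assumed policy: evict the oldest entry, whoever owns it.
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[tag] = value

    def lookup(self, lp_id, key):
        # A hit requires both the key and the processor ID tag to match.
        return self.entries.get((lp_id, key))


class SplitTable:
    """Hard partition: each logical processor owns exactly half of the entries."""

    def __init__(self, capacity):
        self.halves = [TaggedTable(capacity // 2) for _ in range(2)]

    def insert(self, lp_id, key, value):
        self.halves[lp_id].insert(lp_id, key, value)

    def lookup(self, lp_id, key):
        return self.halves[lp_id].lookup(lp_id, key)


tagged = TaggedTable(4)
tagged.insert(1, "x", 1)
for k in range(4):
    tagged.insert(0, k, k)       # LP0 fills the pool and pushes out LP1's entry

split = SplitTable(4)
split.insert(1, "x", 1)
for k in range(3):
    split.insert(0, k, k)        # LP0 can only evict within its own half
```

In this model a busy logical processor can displace the other's entries in the tagged pool, while the split table protects each half at the cost of halving what a single thread can use.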
Xeon Processor Pipeline Simplified
Buffering queues separate major pipeline logic blocks
Buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block
[Figure: pipeline stages Fetch, Decode, TC/MSROM, Rename/Allocate, OOO Execute, Retirement, separated by queues; legend: buffering queues duplicated, buffering queues partitioned]
HT in NetBurst
Front End
– Execution Trace Cache
– Microcode Store ROM (MSROM)
– ITLB and Branch Prediction
– IA-32 Instruction Decode
– Micro-op Queue
[Figure: NetBurst block diagram with System Bus, Bus unit, 3rd level cache (optional, server product), 2nd level cache, 1st level cache, 4 way, Fetch/Decode, Trace Cache / MS ROM, OOO Execution, Retirement, BTBs / Branch Prediction]
Front End
Responsible for delivering instructions to the later pipe stages
Trace Cache Hit
– When the requested instruction trace is present in the trace cache
Trace Cache Miss
– Requested instruction is brought into the trace cache from L2 cache
Front End: Trace Cache Hit
Two separate instruction pointers
Two logical processors arbitrate for access to TC each cycle
If one logical processor stalls, the other uses full bandwidth of TC
[Figure: two instruction pointers (IP) feeding the Trace Cache, which feeds the Micro-Op Queue]
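The per-cycle arbitration described above can be sketched as a tiny simulation. The slide only says that the two logical processors arbitrate each cycle; the strict alternation used here when both are ready, and the function name arbitrate, are assumptions made for illustration.

```python
def arbitrate(cycles, stalled):
    """Return which logical processor (0 or 1) gets the trace cache each cycle.

    stalled maps a logical processor ID to the set of cycles in which it is
    stalled. None means neither processor could use the trace cache.
    """
    grants = []
    last = 1  # so that logical processor 0 wins the first contested cycle
    for cycle in range(cycles):
        ready = [lp for lp in (0, 1) if cycle not in stalled.get(lp, set())]
        if not ready:
            grants.append(None)
        elif len(ready) == 1:
            # Only one requester: it gets the full bandwidth this cycle.
            last = ready[0]
            grants.append(last)
        else:
            # Both ready: alternate (assumed policy).
            last = 1 - last
            grants.append(last)
    return grants


print(arbitrate(4, {}))                  # both running
print(arbitrate(4, {1: {0, 1, 2, 3}}))   # logical processor 1 stalled throughout
```

With both processors running the grants alternate; when one is stalled the other receives every cycle.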
Programming Models
Two major types of parallel programming models
– Domain decomposition
– Functional decomposition
Domain Decomposition
– Multiple threads working on subsets of the data
Functional Decomposition
– Different computation on the same data
– E.g. motion estimation vs. color conversion, etc.
Both models can be implemented on HT processors
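The two models above can be contrasted in a few lines. This is a structural sketch only: the function names, the sum-of-squares workload and the worker count are illustrative assumptions, and in CPython the global interpreter lock prevents pure-Python threads from running bytecode in parallel, so the code shows the shape of each decomposition rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(part):
    return sum(x * x for x in part)


def domain_decomposition(data, workers=2):
    """Same computation, each thread works on its own subset of the data."""
    chunk = max(1, (len(data) + workers - 1) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, parts))


def functional_decomposition(data):
    """Different computations, each thread works on the same data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, data)    # thread A: one function
        largest = pool.submit(max, data)  # thread B: another function
        return total.result(), largest.result()


print(domain_decomposition([1, 2, 3, 4]))
print(functional_decomposition([1, 2, 3, 4]))
```

Domain decomposition scales with the amount of data, while functional decomposition is limited by the number of distinct tasks available.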
Threading Implementation
O/S thread implementations may differ
Microsoft Win32
– NT threads (supports 1-1 O/S level threading)
– Fibers (supports M-N user level threading)
Linux
– Native Linux Thread (severely broken & inefficient)
– IBM Next Generation Posix Threads (NGPT): IBM's attempt to fix Linux native thread
– Red Hat Native Posix Thread Library (NPTL): supports 1-1 O/S level threading that is to be Posix compliant
Others
– Pthreads (generic Posix compliant thread)
– Sun Solaris Light Weight Processes (lwp), Sun Solaris user level threads
Thread Model Issues Somewhat Orthogonal to HT
OS Implications of HT
ALL UP OS
Legacy MP OS
– Backward compatible, will not take advantage of Hyperthreading Technology
Enabled MP OS
– OS with basic Hyperthreading Technology functionality
Optimized MP OS
– OS with optimized Hyperthreading Technology support
Fully compatible with ALL existing O/S… but only optimized O/S enables the most benefits
HT Optimized OS
Windows XP
– Windows XP
– Windows XP Professional
Windows 2003
– Enterprise
– Data Center
Enabled
– RedHat Enterprise Server (version 7.3, 8.0)
– RedHat Advanced Server 2.1
– Suse (8.0, 9.0)
OS Scheduling
HT enabled O/S sees two processors for each HT physical processor
– Enumerates first logical processor from all physical processors first
Schedules processors almost same as regular SMP
– Thread priority determines schedule, but CPU dispatch matters
– O/S independently submits code stream for thread to logical processors and can independently interrupt or halt each logical processor (no change)
[Figure: Physical Processor 1 and Physical Processor 0, each containing Logical Processor 1 and Logical Processor 0; each logical processor carries a CPUID label, shown as 00000011, 00000001, 00000000, 00000010]
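The enumeration rule above (first logical processor of every physical processor first) can be sketched with the four IDs from the figure. The sketch assumes that the lowest bit of each ID selects the logical processor and the remaining bits select the physical processor; the function names and the logical_bits parameter are illustrative.

```python
def split_id(cpu_id, logical_bits=1):
    """Split an ID into (physical, logical) parts.

    Assumption: the low logical_bits bits number the logical processor
    inside a physical processor; the higher bits number the package.
    """
    logical = cpu_id & ((1 << logical_bits) - 1)
    physical = cpu_id >> logical_bits
    return physical, logical


def enumeration_order(cpu_ids, logical_bits=1):
    """Order IDs so that logical processor 0 of every package comes first."""
    def key(cpu_id):
        physical, logical = split_id(cpu_id, logical_bits)
        return (logical, physical)
    return sorted(cpu_ids, key=key)


ids = [0b00000011, 0b00000001, 0b00000000, 0b00000010]
order = enumeration_order(ids)
print([format(i, "08b") for i in order])
```

Enumerating this way lets an O/S that is licensed for, or limited to, N processors pick up N distinct physical processors before it uses any second logical processor.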
Thread Management
Avoid coding practices that disable hyperthreaded processors, e.g.
– Avoid 64KB Aliasing
– Avoid processor serializing events (e.g. FP denormals, self-modifying code, etc.)
Avoid Spin Locks
– Minimize lock contention to less than two threads per lock
– Use “Pause” and “O/S synchronization” when Spin-Wait loops must be implemented
In addition, follow multi-threading best practices:
– Use O/S services to block waiting threads
– Spin as briefly as possible before yielding to O/S
– Avoid false sharing
– Avoid unintended synchronizations (C Runtime, C++ Template Library implementations)
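The advice to spin briefly and then block through the O/S can be sketched as follows. This is an illustration of the pattern only: Python cannot issue the x86 PAUSE instruction, so the spin loop here is a plain bounded poll, and the function name wait_for_flag and the max_spins value are assumptions.

```python
import threading


def wait_for_flag(flag, max_spins=100):
    """Poll a flag a bounded number of times, then block in the O/S.

    Returns "spun" if the flag was seen during the short spin phase and
    "blocked" if the thread had to sleep until the flag was set.
    """
    for _ in range(max_spins):
        if flag.is_set():
            return "spun"
        # In native code a PAUSE instruction would go here so the other
        # logical processor is not starved of shared execution resources.
    flag.wait()  # O/S service: the waiting thread consumes no execution units
    return "blocked"


ready = threading.Event()
ready.set()
print(wait_for_flag(ready))              # flag already set: no blocking needed

later = threading.Event()
threading.Timer(0.2, later.set).start()  # another thread sets the flag later
print(wait_for_flag(later, max_spins=10))
```

On a hyperthreaded processor a long spin loop competes with the other logical processor for the shared execution units, which is why the spin phase is kept short.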
Threading Tools
Intel ThreadChecker Tool
– Itemization of parallelization bugs and source
– ThreadChecker class
OpenMP
– Thread model in which programmer introduces parallelism or threading via directives or pragmas
Intel VTune Analyzer
– Provides analysis and drills down to source code
– ThreadChecker Integration
GuideView
– Parallel performance tuning
Software Tools
Intel C/C++ Compiler
– Support for SSE and SSE2 using C++ classes, intrinsics, and assembly
– Improved Vectorization and prefetch insertion
– Profile-guided optimizations
– G7 compiler switch for Pentium® 4 optimizations
Register Viewing Tool (RVT)
– Shows contents of XMM registers as they are updated
– Plugs into Microsoft* Visual Studio*
Microsoft* Visual Studio* 6.0 Processor Pack*
– Support for SSE and SSE2 instructions, including intrinsics
– Available for free download from Microsoft*
Microsoft* Visual Studio* .NET
– Provides improved support for Intel® NetBurst™ micro-architecture
– Recognizes XMM registers
Hyperthreading is NOT:
Hyperthreading is not a full, dual-core processor
Hyperthreading does not deliver multi-processor scaling
[Figure: side-by-side block diagrams labeled Dual Processor, Dual Core and Hyper-Threading, built from Processor core, APIC, Arch. State and On-Die Cache blocks]
Backup
TERMS
Branch: transfer of control to an address different from the next instruction. Unconditional or conditional.
Branch Prediction: Ability to guess target of conditional branch. Can be wrong, in which case we have a mispredict.
CISC: complex instruction set computer
Compiler: Tool translating high-level instructions into low-level machine instructions. Output can be asm source (ASCII) or binary machine code.
EPIC (Explicitly Parallel Instruction Computing): New architecture jointly defined by Intel® and HP. Is foundation of new 64-bit Instruction Set Architecture
TERMS
Explicit parallelism: Intended ability of two tasks to be executed by design (explicitly) at the same time. Task can be as simple as an instruction, or as complex as a complete program.
Implicit parallelism: Incidental ability of two or more tasks to be executed at the same time. Example: sequence of integer add and FP convert instructions without common registers or memory addresses, executed on a target machine that happens to have respective HW modules available.
TERMS
Instruction Set Architecture (ISA): Architecturally visible instructions that perform software functions and direct operations within the processor. HP and Intel® jointly developed a new 64-bit ISA. This ISA integrates technical concepts from the EPIC technology.
Memory latency: Time to move data from memory to the processor, at request of processor.
Mispredict: A wrong guess, where new flow of control will continue as a result of a branch (or similar control flow instruction).