TRANSCRIPT
High-Performance Microarchitecture Techniques
John Paul Shen, Director of Microarchitecture Research
Intel Labs
October 29, 2002, Microprocessor Research Forum
Intel's Microarchitecture Research Labs
! USA: California, Oregon, Texas (John Shen)
– High Frequency Superscalar Processors
– Helper Threads for SMT and CMP Machines
– Future Enterprise Server Processors
! Israel: Haifa (Ronny Ronen)
– Low Power Microarchitecture Techniques
– Future Mobile High-Performance Processors
! Spain: Barcelona (Antonio Gonzalez)
– Speculative Multithreading for SMT and CMP
– Clustered Microarchitecture Techniques
Microprocessor Performance Growth in Perspective
! Doubling every 18 months (1982-2000):
– Total of 3,200X
– Cars travel at 176,000 MPH; get 64,000 miles/gal.
– Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
– Wheat yield: 320,000 bushels per acre
! Doubling every 24 months (1971-2001):
– Total of 36,000X
– Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
– Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
– Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!
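The headline multipliers follow directly from compound doubling; a quick sketch of the arithmetic (helper name hypothetical, and the slide's totals are rounded):

```python
# Compound growth implied by a doubling period: 2^(years / period).
def growth(years, doubling_months):
    return 2 ** (years * 12 / doubling_months)

# 1982-2000, doubling every 18 months -> 2^12 = 4096 (quoted as ~3,200X)
print(round(growth(18, 18)))   # 4096
# 1971-2001, doubling every 24 months -> 2^15 = 32768 (quoted as ~36,000X)
print(round(growth(30, 24)))   # 32768
```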
"Iron Law" of Microprocessor Performance

Processor Performance = 1 / (Time / Program)

  Time      Instructions     Cycles          Time
--------- = ------------- x ------------- x -------
 Program      Program       Instruction      Cycle
            (inst. count)      (CPI)      (cycle time)

                           IPC x GHz
Processor Performance = --------------
                          inst. count
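The Iron Law in executable form; a minimal sketch with hypothetical numbers (the function name is not from the talk):

```python
# Iron Law: execution time = inst_count x CPI x cycle_time, and
# performance is its reciprocal, i.e. (IPC x frequency) / inst_count.
def perf(inst_count, cpi, cycle_time_s):
    return 1.0 / (inst_count * cpi * cycle_time_s)

# Hypothetical program: 1e9 instructions, CPI 1.0, 1 GHz clock (1 ns cycle).
base = perf(1e9, 1.0, 1e-9)
# Halving CPI doubles performance; so would halving the instruction count.
assert perf(1e9, 0.5, 1e-9) == 2 * base
```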
Performance Improvement Techniques
! Increase GHz
– Process Technology
– Circuit Techniques
– Pipelining and Caches
! Increase IPC (Reduce CPI)
– Superscalar Pipelines
– Out-of-order Execution
– Cache Miss Reduction
! Decrease Instruction Count
– Compiler Optimization
– Architecture Extensions
(Slide callout: "Microarchitecture Techniques")
SPECint92 Landscape
P6 vs. Pentium 4 Pipelines

Basic P6 Pipeline (10 stages):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic Pentium® 4 Processor Pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive
[Figure: frequency vs. time at introduction, by micro-architecture generation and pipe depth @ intro]
– P5 Micro-Architecture: 5 pipe stages; introduced at 60 MHz, scaling to 233 MHz
– P6 Micro-Architecture: 10 pipe stages; introduced at 166 MHz, scaling to 1 GHz (intro at 733 MHz in .18µ)
– NetBurst Micro-Architecture, Hyper Pipelined Technology: 20 pipe stages; intro at 1.5 GHz in .18µ
Deeper and Wider Pipelines

[Figure: pipeline Fetch → Decode → Dispatch → Execute → Memory → Retire, with Branch, Load, and ALU penalty loops feeding back into earlier stages]
Pipelining Penalty Loops
! Branch Penalty
– Branch predictor
– CPI overhead: Branch% x Misprediction% x PipeDepth
– Performance lost: CPI overhead x PipeWidth
! Load Penalty
– Cache hierarchy
– CPI overhead: Load% x AvgLoadLatency
– Average Load Latency: Σ Cache(i)Hit% x Cache(i)Latency
! ALU Penalty
– Forwarding paths and super-pipelining
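The penalty formulas above are simple products and sums; a sketch with hypothetical rates (function names are mine, not the talk's):

```python
# CPI overhead from branch mispredictions: Branch% x Misprediction% x PipeDepth.
def branch_cpi_overhead(branch_pct, mispredict_pct, pipe_depth):
    return branch_pct * mispredict_pct * pipe_depth

# CPI overhead from loads: Load% x AvgLoadLatency.
def load_cpi_overhead(load_pct, avg_latency):
    return load_pct * avg_latency

# Average load latency: sum over levels of Cache(i)Hit% x Cache(i)Latency.
def avg_load_latency(hit_pcts, latencies):
    return sum(h * l for h, l in zip(hit_pcts, latencies))

# e.g. 20% branches, 5% mispredicted, 20-stage penalty -> 0.2 CPI of overhead
print(branch_cpi_overhead(0.20, 0.05, 20))
```

Note how the branch term scales linearly with pipe depth, which is why deeper pipelines (slide above) lower IPC.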
Branch Prediction

[Figure: superscalar front end — Fetch → Decode Buffer → Decode → Dispatch Buffer → Dispatch → Reservation Stations → Issue → Execute → Finish → Completion Buffer. A Branch Predictor and BTB sit beside the I-cache: an FA-mux selects the next PC between nPC(seq.) = PC+4 and the speculative target, while the BTB is updated with the prediction (target addr. and history) and the speculative condition.]
Branch Prediction Technology
! Basic 2-bit Local History Predictor
– ~80% prediction accuracy
– ~25 instructions/mispredict
– ~5 cycles/25 instructions (0.2 CPI)
! Two-Level Correlated Predictor (P6)
– ~90% prediction accuracy
– ~50 instructions/mispredict
– ~10 cycles/50 instructions (0.2 CPI)
! Current State of the Art (Pentium 4)
– ~95% prediction accuracy
– ~100 instructions/mispredict
– ~20 cycles/100 instructions (0.2 CPI)
! Current Research Challenge (2008)
– ~98% prediction accuracy
– ~250 instructions/mispredict
– ~25 cycles/250 instructions (0.1 CPI)
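The CPI figures on this slide are just penalty cycles divided by instructions between mispredicts; a quick check (function name hypothetical):

```python
# CPI overhead of mispredictions: cycles lost per mispredict, amortized over
# the instructions executed between mispredicts.
def mispredict_cpi(penalty_cycles, inst_per_mispredict):
    return penalty_cycles / inst_per_mispredict

assert mispredict_cpi(5, 25) == 0.2    # 2-bit local history predictor
assert mispredict_cpi(10, 50) == 0.2   # two-level correlated (P6)
assert mispredict_cpi(20, 100) == 0.2  # Pentium 4
assert mispredict_cpi(25, 250) == 0.1  # research target
```

The pattern is the point: each generation's better accuracy has been eaten by a deeper pipeline, so net overhead has stayed near 0.2 CPI; only the research target actually cuts it.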
Data Cache and Prefetching

[Figure: out-of-order core — Fetch (I-cache, Branch Predictor) → Decode Buffer → Decode → Dispatch Buffer → Dispatch → Reservation Stations (branch, integer, integer, floating point, store, load) → Complete (Completion Buffer). Loads and stores pass through a Store Buffer to the Data Cache and Main Memory, with a Prefetch / Reference Prediction Queue alongside the Memory interface.]
Cache Hierarchy Technology
! Current Commercial Workload (6 cycles/load)
– L1 Hits: 80% x 2 cycles = 1.6
– L2 Hits: 15% x 10 cycles = 1.5
– L3 Hits: 4% x 30 cycles = 1.2
– Memory: 1% x 150 cycles = 1.5
! Future Commercial Workload (17 cycles/load)
– L1 Hits: 80% x 4 cycles = 3.2
– L2 Hits: 15% x 20 cycles = 3.0
– L3 Hits: 4% x 60 cycles = 2.4
– Memory: 1% x 800 cycles = 8.0
! Current Research Challenge (5 cycles/load)
– Efficient and judicious caches
– Load partitioning and specialized caching
– Aggressive memory prefetching
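The per-workload totals above are hit-rate-weighted sums; a quick check using the slide's numbers (helper name hypothetical):

```python
# Average load latency: sum over (hit fraction, latency in cycles) per level.
def avg_load_cycles(levels):
    return sum(hit * cycles for hit, cycles in levels)

# (L1, L2, L3, memory) from the slide.
current = [(0.80, 2), (0.15, 10), (0.04, 30), (0.01, 150)]
future  = [(0.80, 4), (0.15, 20), (0.04, 60), (0.01, 800)]

print(avg_load_cycles(current))  # ~5.8, the slide's "6 cycles/load"
print(avg_load_cycles(future))   # ~16.6, the slide's "17 cycles/load"
```

Note that the 1% of loads that miss to memory go from a quarter of the total to nearly half, which is why the research focus shifts to prefetching.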
Memory Latency Bottleneck

[Figure: cache latency in clocks rises roughly an order of magnitude per level — 1 (L1), 10 (L2), 100 (L3), 1000 (external memory). External memory latency, measured in instruction cost, climbs from near 0 toward 400-800 instructions across the Pentium® processor, Pentium Pro processor, Pentium III processor, and future processors.]
Cache Prefetching:
• Hardware: Limited by predictable patterns
• Software: Limited by single control flow
• Research Challenge: Pointer-intensive code
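To illustrate why hardware prefetchers are "limited by predictable patterns," here is a minimal stride-detection sketch (a generic textbook scheme, not any specific Intel design): it only issues a prefetch once consecutive addresses show a stable stride, so pointer-chasing code with irregular addresses never triggers it.

```python
# Minimal stride prefetcher sketch: remember the last address and last stride;
# when the same stride repeats, prefetch one stride ahead.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None

    def access(self, addr):
        """Observe a load address; return a prefetch address or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                prefetch = addr + stride  # stable pattern: prefetch ahead
            self.stride = stride
        self.last_addr = addr
        return prefetch

p = StridePrefetcher()
# Unit-stride walk over 64-byte lines: confirmed from the third access on.
print([p.access(a) for a in [0, 64, 128, 192]])  # [None, None, 192, 256]
```

On an irregular (pointer-intensive) address stream the stride never repeats, and the sketch issues no prefetches at all.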
Frequency vs. Parallelism
! Increase Frequency (GHz)
– Deeper Pipelines
– Increases Branch/Load penalties
– Lowers IPC
! Increase Instruction Parallelism (IPC)
– Wider Pipelines
– Increases Complexity
– Lowers GHz
Front-End Pipe-Depth Penalty

[Figure: two pipelines, each Fetch → Decode → Dispatch → Execute → Memory → Retire, with an Optimize stage illustrating Front-End Contraction and Back-End Optimization]
Alleviate Pipe-Depth Penalty
! Front-End Contraction
– Code Re-mapping and Caching
– Trace Construction, Caching, Optimization
– Leverage Back-End Optimizations
! Back-End Optimization
– Multiple-Branch, Trace, Stream Prediction
– Code Reordering, Alignment, Optimization
– Pre-decode, Pre-rename, Pre-scheduling
– Memory Pre-fetch Prediction and Control
Execution Core Improvement

[Figure: pipeline Fetch → Decode → Dispatch → Execute → Memory → Retire, with an Optimize stage]
• Super-pipelined ALU design
• Very high-speed arithmetic units
• Speculative OoO execution
• Criticality-based data caching
• Aggressive data pre-fetching
How Deep Can You Go?

[Figure: frequency, CPI, performance, and power plotted against pipeline depth (roughly 1 to 100 stages); the performance curve peaks near a depth of 57?]
[Ed Grochowski, 7/6/01]
Source: Intel Corporation
How Much ILP Is There?
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7
Kuck et al. [1972]            8
Riseman and Foster [1972]     51
Nicolau and Fisher [1984]     90
SPECint95 Landscape

[Figure: "Landscape of Microprocessor Families" — SPECint95/MHz (0 to 0.08) vs. frequency (80 to 980 MHz) for Alpha, AMD-x86, and Intel-x86 parts, with iso-SPECint95 contours from 5 to 60; labeled points include Athlon, PPro, P, PII, PIII, 064, 164, 264]
* Data source www.spec.org
[Bryan Black]
SPECint2000 Landscape

[Figure: "Landscape of Microprocessor Families" — SPECint2000/MHz (0 to 1) vs. frequency (0 to 2500 MHz) for Intel-x86, AMD-x86, Alpha, PowerPC, Sparc, and IPF parts, with iso-SPECint2000 contours from 50 to 800; labeled points include PIII-Xeon, P4, Athlon, 264A, 264B, 264C, Sparc-III, 604e, Itanium]
* Data source www.spec.org
[Bryan Black]
Parallelism in Transition

[Figure: MIPS on a log scale (1 to 1,000,000) vs. year (1980 to 2010)]
– Pentium® Architecture: Super Scalar
– Pentium® Pro Architecture: Speculative Out of Order
– Pentium® 4 Architecture: Trace Cache
– Future Xeon™ Architecture: Multi-Threaded, then Multi-Threaded, Multi-Core
The Era of Instruction Parallelism gives way to the Era of Thread Parallelism.
Summary
Performance Demand Continues
! 5-10 billion transistors by 2010
! 10-20 GHz by 2010
Challenge Is Power and Efficiency
! Power dissipation, delivery, density
! New clever/efficient implementations
New Frontiers to Explore
! Synergism of ILP, TLP, and MLP
! "Semi-Custom" Microarchitectures