why intel is designing multi-core processors...• few available, tied to old models, languages,...
TRANSCRIPT
Why Intel is designing multi-core processors
Geoff Lowney
Intel Fellow, Microprocessor Architecture and Planning
July, 2006
July, 20062
Transistor Count
1.E+03
1.E+05
1.E+07
1.E+09
1970 1980 1990 2000 2010 2020
2.0x
Frequency (MHz)
1
10
100
1000
10000
100000
1970 1980 1990 2000 2010 2020
1.5x
2.0x
CPU Power (W)
10
100
1000
1990 1995 2000 2005 2010 2015
1.4x
Feature Size (um)
0.01
0.1
1
10
1970 1980 1990 2000 2010 2020
0.70x
0.79x
Moore’s Law at Intel 1970-2005
Power trend not Power trend not sustainablesustainable
July, 20063
Pentium® 4 Processor
386 Processor
May 1986@16 MHz core
275,000 1.5µ transistors~1.2 SPECint2000
17 Years200x
200x/11x1000x
August 27, [email protected] GHz core55 Million 0.13µ transistors1249 SPECint2000
July, 20064
1
10
100
1000
10000
Jan-85 Jan-87 Jan-89 Jan-91 Jan-93 Jan-95 Jan-97 Jan-99 Jan-01 Jan-03 Jan-05
Introduction Date
SPEC
int2
000
1
10
100
1000
10000
Jan-85 Jan-87 Jan-89 Jan-91 Jan-93 Jan-95 Jan-97 Jan-99 Jan-01 Jan-03 Jan-05
Introduction Date
Clo
ck F
requ
ency
Performance: 1000x
Frequency: 200x
July, 20065
SPECint2000/MHz (normalized)
0
1
2
3
4
5
6
7
Jan-85 Jan-87 Jan-89 Jan-91 Jan-93 Jan-95 Jan-97 Jan-99 Jan-01 Jan-03 Jan-05
Introduction Date
Nor
mal
ized
SPE
Cin
t200
0/M
Hz
386486 486DX2
P5 P55-MMX
PPro/II
PIII
P4
July, 20066
0
1
2
3
4
5
Pipelined S-Scalar OOO-Spec
Deep Pipe
Incr
ease
(X)
Area XPerf X
-1
0
1
2
3
Pipelined S-Scalar OOO-Spec
DeepPipe
Incr
ease
(X)
Power XMips/W (%)
Performance scales with area**.5
Power efficiency has dropped
July, 20067
Pushing Frequency
Pipeline & Performance
02468
10
1 2 3 4 5 6 7 8 9 10Relative Frequency (Pipelining)
Performance
DiminishingReturn
Maximized frequency byMaximized frequency by•• Deeper pipelinesDeeper pipelines•• Pushed process to the limitPushed process to the limit
Process Technology
02468
10
1 2 3 4 5 6 7 8 9 10Relative Frequency
Sub-threshold Leakage increases
exponentially••Dynamic power increases Dynamic power increases with frequency with frequency ••Leakage power increases Leakage power increases with reducing with reducing VtVt
Diminishing return on performance. Increase in power
July, 20068
Transistor Count
1.E+03
1.E+05
1.E+07
1.E+09
1970 1980 1990 2000 2010 2020
2.0x
Frequency (MHz)
1
10
100
1000
10000
100000
1970 1980 1990 2000 2010 2020
1.5x
2.0x
CPU Power (W)
10
100
1000
1990 1995 2000 2005 2010 2015
1.4x
Feature Size (um)
0.01
0.1
1
10
1970 1980 1990 2000 2010 2020
0.70x
0.79x
Moore’s Law at Intel 1970-2005
Power trend not Power trend not sustainablesustainable
July, 20069
Reducing power with voltage scaling
Power = Capacitance * Voltage**2 * Frequency
Frequency ~ Voltage in region of interest
Power ~ Voltage ** 3
10% reduction of voltage yields
• 10% reduction in frequency
• 30% reduction in power
• Less than 10% reduction in performance
Rule of ThumbRule of Thumb
Voltage Frequency Power Performance
1% 1% 3% 0.66%
July, 200610
Dual core with voltage scaling
Area = 1Area = 1Voltage = 1Voltage = 1Freq = 1Freq = 1Power = 1Power = 1PerfPerf = 1= 1
Area = 2Area = 2Voltage = 0.85Voltage = 0.85Freq = 0.85Freq = 0.85Power = 1Power = 1PerfPerf = ~1.8= ~1.8
Frequency
Reduction
Power
Reduction
Performance
Reduction
15% 45% 10%
A 15% A 15%
ReductionReduction
In VoltageIn Voltage
YieldsYields
SINGLE CORESINGLE CORE DUAL COREDUAL CORE
RULE OF THUMBRULE OF THUMB
July, 200611
Multiple cores deliver more performance per watt
Big coreBig core
CacheCache PowerPowerPower = Power = ¼¼
Performance = 1/2
C1C1
C4C4
C2C2
C3C3
SmallSmallcorecore
CacheCache
11
22
33
44 Performance = 1/2
PerformancePerformance
11
22
11 11
Many core is more Many core is more power efficientpower efficient
Power ~ areaPower ~ area
Single thread Single thread performance ~ area**.5
11
22
33
44
11
22
33
44
performance ~ area**.5
July, 200612
Memory Gap
Growing Performance GapGrowing Performance Gap
0
100
200
300
400
500
600
700
Pentium66MHz
Pentium-Pro200MHz
PentiumIII1100MHz
Pentium4 2GHz
19921992 19941994 19961996 19981998 20002000 20022002
LOGICLOGIC
MEMORYMEMORY
GA
PG
AP
Peak InstructionsPeak InstructionsPer DRAM AccessPer DRAM Access
Reduce DRAM access with large cachesExtra benefit: power savings. Cache is lower power than logic
Tolerate memory latency with multiple threadsMultiple coresHyper-threading
July, 200613
Multi-threading tolerates memory latency
Serial Execution
Ai Idle Bi IdleAi+1 Bi+1
Multi-threaded Execution
Ai Ai+1Idle
Bi Bi+1
Execute thread B while thread A waits for memoryExecute thread B while thread A waits for memory
Multi-core has a similar effect
July, 200614
Multi-core tolerates memory latency
Serial Execution
Ai Ai+1 Bi IdleIdle Bi+1
Multi-core Execution
Ai Idle Ai+1
Bi Bi+1Idle
Execute thread A and B simultaneouslyExecute thread A and B simultaneously
July, 200615
0.0
50.0
100.0
150.0
200.0
250.0
300.0
350.0
1970 1975 1980 1985 1990 1995 2000 2005
Introduction Date
Cor
e A
rea
- mm
^2
PPro
Pentium Processor
Banias
Pentium 4 processor
Merom4004
386
Pentium II processor
486
McKinley
Prescott
Yonah
Core Area (with L1 caches) TrendDie area/core shrinking after peaking with PPro
Why shrinking? Diminishing returns on performance.
Interconnect. Caches. Complexity.
July, 200616
Reliability in the long term
Soft Error FIT/Chip (Logic & Mem)
0
50
100
150
180130 90 65 45 32 22 16
Rel
ativ
e
~8% degradation/bit/generation
Extreme device variations
0
50
100
100 120 140 160 180 200Vt(mV)
Rel
ativ
e
Wider
Time dependent device degradation
0
1
1 2 3 4 5 6 7 8 9 10Time
Ion
Rel
ativ
e
In future process generations, soft and hard errors will be more common.
July, 200617
Redundant multi-threading: an architecture for fault detection and recovery
Two copies of each architecturally visible thread
Compare results: signal fault if different
Memory System
Sphere of Replication
OutputComparison
InputReplication
LeadingThread
TrailingThread
Multi-core enables many possible designs for redundant threading.
July, 200618
Transistor Count
1.E+03
1.E+05
1.E+07
1.E+09
1970 1980 1990 2000 2010 2020
2.0x
Frequency (MHz)
1
10
100
1000
10000
100000
1970 1980 1990 2000 2010 2020
1.5x
2.0x
CPU Power (W)
10
100
1000
1990 1995 2000 2005 2010 2015
1.4x
Feature Size (um)
0.01
0.1
1
10
1970 1980 1990 2000 2010 2020
0.70x
0.79x
Moore’s Law at Intel 1970-2005
Multi-core addresses power, performance, memory, complexity, reliability
July, 200619
Moore’s Law will provide transistors
Intel process technology capabilitiesIntel process technology capabilities
High Volume Manufacturing 2004 2006 2008 2010 2012 2014 2016 2018
Feature Size 90nm 65nm 45nm 32nm 22nm 16nm 11nm
128
8nm
Integration Capacity(Billions of Transistors) 2 4 8 16 32 64 256
50nm
Transistor for Transistor for 90nm Process90nm Process
Source: IntelSource: Intel
Influenza VirusInfluenza VirusSource: CDCSource: CDC
Use transistors for multiple cores, caches, and new features.
July, 200620
Multi-Core Processors
CLOVERTOWNCLOVERTOWN
WOODCRESTWOODCRESTQuad CoreQuad Core
PENTIUMPENTIUM®® MMDual CoreDual Core
Single CoreSingle Core
July, 200621
Future Architecture: More Cores
Open issuesCores• How many?• What size?• Homogeneous?• Heterogeneous?
On-die Interconnect• Topology• Bandwidth
Cache hierarchy• Number of levels• Sharing• Inclusion
Scalability
Power delivery and managementPower delivery and management
High bandwidth memoryHigh bandwidth memory
Reconfigurable cacheReconfigurable cache
Scalable fabricScalable fabric
FixedFixed--function unitsfunction units
CoreCore CoreCore
CoreCore CoreCore
CoreCore CoreCore
CoreCore
CoreCore
CoreCore
CoreCore CoreCore
CoreCore CoreCore
CoreCore CoreCore
July, 200622
The Importance of Threading
Do Nothing: Benefits Still Visible• Operating systems ready for multi-processing• Background tasks benefit from more compute resources• Virtual machines
Parallelize: Unlock the Potential• Native threads• Threaded libraries• Compiler generated threads
July, 200623
Performance Scaling
0123456789
10
0 10 20 30Number of Cores
Perf
orm
ance
Amdahl’s Law: Parallel Speedup = 1/(Serial% + (1-Serial%)/N)
Serial% = 6.7%N = 16, N1/2 = 8
16 Cores, Perf = 8
Serial% = 20%N = 6, N1/2 = 3
6 Cores, Perf = 3
Parallel software key to Multi-core successParallel software key to MultiParallel software key to Multi--core successcore success
July, 200624
How does Multicore Change Parallel Programming?
No change in fundamental programming model
Synchronization and communication costs greatly reduced
• Makes it practical to parallelize more programs
Resources now shared
• Caches
• Memory interface
• Optimization choices may be different
P1
cache
P2 P3 P4
cache cache cache
Memory
SMP
C1
cache
Memory
C2 C3 C4
cache cache cache
CMP
July, 200625
Threading for Multi-Core
Introducing Introducing ThreadsThreads
Architectural Architectural AnalysisAnalysis IntelIntel®® VTuneVTune™™ AnalyzersAnalyzers
IntelIntel®® C++ CompilerC++ Compiler
DebuggingDebugging IntelIntel®® Thread CheckerThread Checker
PerformancePerformanceTuningTuning IntelIntel®® Thread ProfilerThread Profiler
Intel has a full set of tools for parallel programming
July, 200626
The Era Of TeraTerabytes of data. Teraflops of performance.
ImprovedImprovedMedical careMedical care Machine Machine
visionvision
TeleTele--present meetingspresent meetings
Interactive learning Interactive learning
Source: Steven K. Feiner, Columbia University.
Immersive 3DImmersive 3Dentertainmententertainment
Virtual Virtual realitiesrealities
Courtesy of the Electronic Visualization Laboratory, Univ. of Illinois at Chicago.
When personalcomputing finally
becomespersonal
Scientific SimulationScientific SimulationCourtesy TsinghuaUniversity HPC Center
Text Text MiningMining
Computationally intensive applications of the future will be highly parallel
TeleTele--present present meetingsmeetings
X
Y
July, 200627
Slide fromSlide from
Patterson 2006Patterson 200621st Century Computer Architecture
Old CW: Since cannot know future programs, find set of old programs to evaluate designs of computers for the future
• E.g., SPEC2006
What about parallel codes?
• Few available, tied to old models, languages, architectures, …
New approach: Design computers of future for numerical methods important in future
Claim: key methods for next decade are 7 dwarves (+ a few), so design for them!
• Representative codes may vary over time, but these numerical methods will be important for > 10 years
Patterson, UC Berkeley, also predicts importance of parallel applications.
July, 200628
Slide fromSlide from
Patterson 2006Patterson 2006Phillip Colella’s “Seven dwarfs”
High-end simulation in the physical sciences = 7 numerical methods:
Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
Unstructured Grids
Fast Fourier Transform
Dense Linear Algebra
Sparse Linear Algebra
Particles
Monte Carlo
If add 4 for embedded, covers all 41 EEMBC benchmarks
8. Search/Sort9. Filter
10. Combinational logic11. Finite State Machine
Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but algorithms the same
Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004
Well-defined targets from algorithmic, software, and architecture standpoint
Note scientific computing, media processing, machine learning, statistical computing share many algorithms
July, 200629
Workload Acceleration
RMS Workload Speedup (Simulated)RMS Workload Speedup (Simulated)
0
8
16
24
32
40
48
56
64
0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64
# of Cores
Spee
dup
Ray Tracing (Avg.)
Mod. Levelset
FB_Est
Body Tracker
Gauss-Seidel
Sparse Matrix (Avg.)
Dense Matrix (Avg.)
Kmeans
SVD
SVM_classification
Cholesky (watson)
Cholesky (mod2)
Forward Solver (pds-10)
Forward Solver (world)
Backward Solver (pds-10)Backward Solver(watson)
48x - 57x
19x - 33x
Recognition
Mining
Synthesis
Group I Group I –– Scale well with increasing core countScale well with increasing core countExamples: Ray tracing, body tracking, physical simulationExamples: Ray tracing, body tracking, physical simulation
Group II Group II –– WorstWorst--case scaling examples, yet still scalecase scaling examples, yet still scaleExamples: Forward & backward solversExamples: Forward & backward solvers
Source: Intel CorporationSource: Intel Corporation
July, 200630
Tera-Leap to Parallelism:EN
ERG
Y-E
FFIC
IEN
T P
ERFO
RM
AN
CE
TIME
Instruction level parallelism
Hyper-Threading
Dual Core
Quad-Core
Tera-ScaleComputing
SingleSingle--core chipscore chips
More performanceMore performanceUsing Using lessless energyenergy
ScienceSciencefictionfiction
becomes becomes A REALITYA REALITY
Energy Efficient PerformanceEnergy Efficient Performance
July, 200631
Summary
Technology is driving Intel to build multi-core processors.– Power– Performance– Memory latency– Complexity– Reliability
Parallel programming is a central issue.
Parallel applications will become mainstream