hc17.s8t1.intel 8xx series and paxville xeon-mp ... 8xx series and paxville xeon-mp microprocessors...
Intel 8xx series and Paxville Xeon-MP Microprocessors
Jonathan Douglas
Intel Corporation
Thanks to: Justin Marquart, James Vogeltanz, Mike Grassi, DEG/BCG
package design, Donald Parker & Benson Inkley for help in putting
together this presentation.
Outline
  Why the move to multi-core
  Overview of 8xx series Pentium 4
  Challenges in moving CPU infrastructure to multi-core
  Learnings from the 8xx series Pentium 4 design
  Overview of Paxville-MP processor
  Going forward with multi-core designs
  Conclusion
Why rapid move to Dual-Core
  Single core designs hitting power wall.
  Need more power efficient way to manage OS loading.
  Natural extension of software migration to multi-threaded apps.
  More threads in 1 core is complex and taxes core resources heavily.
  Competitive response.
Overview of 8xx series Pentium 4 processor
  Dual-Core/Multi-Threaded Pentium® 4 Processor on 90nm process
  2 x 1M caches, speeds to 3.2GHz, support for overclocking, up to 4 threads.
  Shared 800MHz quad-pumped FSB.
  Independent bus tuning per agent
  Enhanced auto-halt and 2-state speed step power management
  Independent events supported per core.
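As a rough worked example of what the shared 800MHz quad-pumped FSB means for the two cores, the sketch below computes a peak transfer rate. It assumes a 200MHz base clock with four transfers per clock and a 64-bit (8-byte) data path; the slide states none of these, and the function name is made up for illustration. The even split between cores is only an idealized upper bound.

```python
def fsb_peak_gb_per_s(base_clock_mhz: float = 200.0,
                      transfers_per_clock: int = 4,
                      bus_width_bytes: int = 8) -> float:
    """Peak FSB bandwidth in GB/s (1 GB = 1e9 bytes).

    Defaults are assumptions: 200 MHz x 4 transfers = the "800MHz"
    effective rate, over an assumed 64-bit data bus.
    """
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bytes / 1e9


if __name__ == "__main__":
    total = fsb_peak_gb_per_s()
    # Both cores share one bus, so an even split is the best case per core.
    print(f"peak bus bandwidth: {total:.1f} GB/s")
    print(f"ideal share per core: {total / 2:.1f} GB/s")
```

Under these assumptions the bus peaks at about 6.4 GB/s in total, so two busy cores each see at most about half of that.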
High level block diagram
[Figure: two identical Pentium 4 cores attached to the System Bus. Blocks labeled in each core: L2 Cache and Control, L1 D-Cache and D-TLB, Store AGU, Load AGU, Schedulers, Integer RF, ALUs, FP RF, FMul, FAdd, MMX, SSE, FP move, FP store, Trace Cache, Rename/Alloc, uop Queues, BTB, uCode ROM, Decoder, BTB & I-TLB. A Core-To-Core Communication block is also shown.]
Why the shared bus design
  Time to market a critical factor
    Leverages existing P4 core
    Uses existing 775-LGA socket
  P4 core already has right feature set
    P4 FSB already 4-way compliant.
    Already architected with thread independent power management.
    Already 'HT' so 2 cores = 4 threads
  Gives independent caches
    Plus no extra latency to external memory.
Dual core performance
[Performance chart not recoverable from the extraction. Configurations compared:]
1) Intel® Pentium® 4 Processor with HT Technology Extreme Edition 3.73GHz (2 MB L2 Cache, 1066 MHz FSB) and Intel® 925XE Express Chipset
2) Intel® Pentium® Processor Extreme Edition 840 (2x1 MB L2 Cache, 3.20 GHz, 800 MHz FSB, HT Technology) and Intel® 955X Express Chipset
Challenges in migrating to multi-core
  Rapid movement from single core design to multi-core design presented many complexities
    Already existing platform hardware
    Factory already populated with manufacturing hardware
    Test database developed for single core
    Tight package dimensions
    Little power headroom left
Package issue
  Package design a huge challenge
    More layers required (just address/data alone is >100 more signals)
    Same package cavity and pinout - couldn't grow.
    New IHS (Integrated Heat Spreader) required for thicker package
    Power cap placement can't be centered over both cores
    Existing signals on 4 sides of core cause power bus routing voids.
  No logic outside core. Any needed logic must be in core. Lots of 'special signal' headaches like thermal diode, ODT (On-Die Termination).
Power constraints
  Existing platform dictated 1 power plane for both cores
    Penalized for 2X leakage, required architecting a speed-step protocol
  2 cores powering up & fully active cause large di/dt events
    Required Voltage Regulator mods to grow headroom to 125A plus silver box restrictions
  Required BIOS change to boot to low voltage/frequency on performance parts.
    BIOS initiates SpeedStep event to all threads after completion
Test issues
  Thousands of hours invested in single core coverage database
    Copied core design a plus
    Needed to add 'core swap & kill' hardware to reuse database
  Existing single core test can't expose problems on core->core interaction
    Voltage transients, thermal gradient
    Some explicit dual core content required
Test flow example
[Flow chart; boxes in the order extracted:]
  1. Sort Die Independent
  2. Screen/Speed Grade Core0
  3. Switch Cores
  4. Screen/Speed Grade Core1
  5. Multi-Core Content Run
  6. Check Power of 2 cores
  7. Check Speed Step voltage
  8. Snap to lowest Speed
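The flow above can be sketched in code to show why the slower core sets the product grade. This is a minimal illustration, not Intel's test program: the `CoreGrade` type, the function name, the pass/fail flags, and the speed bins are all hypothetical. Only the ordering of steps and the "snap to lowest speed" rule come from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CoreGrade:
    """Result of screening and speed-grading one core on its own."""
    passed_screen: bool
    max_speed_ghz: float


def grade_dual_core_part(core0: CoreGrade,
                         core1: CoreGrade,
                         multicore_content_ok: bool,
                         power_ok: bool,
                         speedstep_voltage_ok: bool,
                         speed_bins_ghz: Sequence[float] = (2.8, 3.0, 3.2),
                         ) -> Optional[float]:
    """Return the speed bin for the part, or None if it is rejected.

    The bins are made-up values for illustration only.
    """
    # Steps 2-4: each core is screened and graded independently.
    if not (core0.passed_screen and core1.passed_screen):
        return None
    # Steps 5-7: checks that only make sense with both cores active.
    if not (multicore_content_ok and power_ok and speedstep_voltage_ok):
        return None
    # Step 8: the part is sold at the speed the slower core can sustain,
    # rounded down to the nearest available bin.
    slowest = min(core0.max_speed_ghz, core1.max_speed_ghz)
    eligible = [b for b in speed_bins_ghz if b <= slowest]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    fast = CoreGrade(passed_screen=True, max_speed_ghz=3.3)
    slow = CoreGrade(passed_screen=True, max_speed_ghz=3.05)
    print(grade_dual_core_part(fast, slow, True, True, True))
```

In this sketch one fast core cannot raise the grade of the part: a pairing of a 3.3GHz-capable core with a 3.05GHz-capable one lands in the 3.0GHz bin.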
Thermal issue
  Platforms support only 1 ADC for thermal monitoring
  2 cores can create many different thermal profiles
  Diode temp to junction hot spot delta can vary depending on workload & core utilized
  Required thermal protection to be independent on both cores
Thermal gradients
[Figure: two die temperature contour maps, labeled "Core0 only" and "Dual Core", with the thermal diode location marked. One map's legend spans 66.00 to 100.00 in 2.00 steps; the other spans 72.00 to 94.00 in 2.00 steps. The grid data and units are not recoverable from the extraction.]
Limitations of shared bus
  2 loads on bus = less bus speed.
    Plus 1M cache = more bus traffic. Double whammy.
  Difficult package design
    ~2x traces to same number of pins
    Thermal & electrical properties degrade.
    Slowdown penalizes both cores.
  Segregated die.
    Test overhead. Slowest die constrains final product.
Overview of Paxville-MP processor
  Dual-Core/Multi-Threaded Xeon Processor on 90nm process
  2 x 2M caches, 667MHz min FSB, up to 4 threads.
  Platform still 4-P compatible for up to 16 threads per platform
    Dual bus platform - 2 CPU agents per bus
    Only 1 load presented to system by CPU
  Enhanced auto-halt and 2-state speed step power management
  Independent events supported per core.
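The thread counts quoted above follow from simple multiplication. The short sketch below spells that out, reading "2 CPU agents per bus" as two processor packages on each of the two buses; that reading, and the function name, are assumptions made for this example.

```python
def platform_threads(buses: int = 2,
                     packages_per_bus: int = 2,
                     cores_per_package: int = 2,
                     threads_per_core: int = 2) -> int:
    """Total hardware threads on a platform.

    Defaults model a 4-P dual-bus platform populated with dual-core,
    two-thread-per-core processors (assumed reading of the slide).
    """
    packages = buses * packages_per_bus
    cores = packages * cores_per_package
    return cores * threads_per_core


if __name__ == "__main__":
    # One processor: 2 cores x 2 threads.
    print(platform_threads(buses=1, packages_per_bus=1))
    # Full 4-P platform: 4 packages x 2 cores x 2 threads.
    print(platform_threads())
```

With these inputs a single processor exposes 4 threads and the fully populated 4-P platform exposes 16, matching the figures on the slide.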
Advantages of new Paxville design
  Single CPU load on bus. Allows faster bus, less electrical load.
    8 agents (16 threads) on top end platform
    Larger cache = fewer FSB bottlenecks
  Better package design
    Fewer traces allow better power delivery
  Integrated die (monolithic)
    Consolidated bus logic allows test enhancements
Challenges with Paxville design
  Degraded I/O timing with shared bus
    Requires extra logic & routing but must be compatible with existing bus timing.
    Requires circuit tricks for quad pumped bus.
  Enhancements to validation tools
    8xx series treated as 2 independent CPUs.
    Paxville is integrated - 1 die.
  Additional complexity in test infrastructure.
    New test modes & consolidated bus logic.
Going forward with multi-core
  Solving bus bottlenecks.
  Integrate next level cache for less bus traffic.
    Downside is higher latency on cache misses.
    Upside is lower pin count & can stay with a flexible bus architecture
    Cache thrashing by multiple cores an issue if size isn't large enough - swamps bus again.
  'Point-to-point' busses & memory controllers
    Upside is no bus traffic collisions
    Downsides are being locked into memory protocol and a huge pin count increase.
Going forward with multi-core
  Solving power issues.
  Need better power state management
    Single voltage plane is an issue - can't drop leakage on inactive cores
    Need more intelligence in controller
  Segment products with power in mind
    Typically done more now on speed/feature set.
    Can microprocessor be 'tuned' for a power segment?
SpeedStep protocol
[Figure: core activity over time, with the resulting voltage. Sequences as extracted; the time alignment between rows is not recoverable.]
  Core0: high activity, asleep, low activity
  Core1: high activity, asleep, low activity, asleep, high activity
  Voltage: high, low, high, low
Limited opportunities to reduce power, much harder with even more cores
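One way to see why more cores make this harder is to model the shared voltage plane directly. The sketch below assumes the plane must stay at high voltage whenever any core is in its high-activity state, and that each core is busy independently with the same probability. Both are simplifying assumptions for illustration, not the actual SpeedStep protocol, and the function names are hypothetical.

```python
from typing import Sequence

HIGH, LOW, ASLEEP = "high", "low", "asleep"


def shared_plane_voltage(core_states: Sequence[str]) -> str:
    """Voltage on a single shared plane under the assumed rule:
    any core at high activity forces high voltage for all cores."""
    return "high" if any(state == HIGH for state in core_states) else "low"


def low_voltage_probability(p_high: float, cores: int) -> float:
    """Chance the plane can sit at low voltage when each core is
    independently at high activity with probability p_high."""
    return (1.0 - p_high) ** cores


if __name__ == "__main__":
    # A sleeping core still sits at high voltage if its neighbor is busy.
    print(shared_plane_voltage([HIGH, ASLEEP]))
    print(shared_plane_voltage([ASLEEP, LOW]))
    for n in (1, 2, 4, 8):
        print(n, "cores:", low_voltage_probability(0.5, n))
```

Under this model the chance of dropping to low voltage shrinks geometrically with core count, which is consistent with the slide's point that a single voltage plane leaves limited room to cut power.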
Going forward with multi-core
  Core counts will continue to increase.
    Higher threaded applications give opportunity to have better power / performance.
    Power is wasted when a core that isn't working on a thread is alive, but performance is wasted if OS has to continually swap out threads.
  Expect that logic to 'glue' cores together will become as critical as the core
    Need lots of sophistication to take full advantage of a high core count
    Need busses capable of handling the high traffic to memory