UltraSparc IV
Tolga TOLGAY
OUTLINE
Introduction
History
What is new?
Chip Multithreading
Pipeline
Cache
Branch Prediction
Conclusion
INTRODUCTION
Sparc = Scalable Processor Architecture
Open processor architecture
SUN UltraSparc v9:
  RISC Architecture
  64 bit address and data
  Superscalar
HISTORY
Sparc development begins – 1984
First Sparc Processor – 1986
SuperSparc – 1992
UltraSparc I – 1995
UltraSparc II – 1997
UltraSparc III – 2001
UltraSparc IV – 2004
UltraSparc IV+ – 2005
UltraSparc T1 – 2005
WHAT IS NEW?
What is new in UltraSparc IV:
  CMT (Chip Multithreading)
  New registers added for the CMT enhancement
  MCU registers and Sun Fireplan Interconnect registers are shared.
  Enhancements to the Floating Point Unit
  16 MB L2 cache with 128-byte line size, shared by the two processors.
  L2 cache uses an LRU replacement strategy
  New write-cache index-hashing feature
Chip Multithreading (CMT)
Two UltraSparc III cores on one die.
Two mirrored cores share:
  System bus
  DRAM controller
  Off-die L2 cache
  Fireplan registers
Also called Chip Multiprocessing
Chip Multithreading
The aim is to increase performance without increasing clock speed.
Mirroring the cores causes a hot spot at the floating point units.
How to avoid the hot spot: heat towers in the copper interconnect
Core
More core improvements:
  Improved instruction fetch and store bandwidth.
  Improved data prefetching
  The FPU can handle more unexpected and underflow cases, reducing exceptions.
  On-die cache enhanced with a hashed index to better handle multiple writes.
Pipeline
Because UltraSparc IV contains two UltraSparc III cores, it uses the same pipeline.
4-way superscalar architecture.
14-stage pipeline.
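To give a feel for what "4-way superscalar, 14-stage" implies, here is a minimal sketch of the best-case cycle count for a run of instructions. It assumes every issue group is full and nothing ever stalls, which is an idealization and not a measured figure. The function name `ideal_cycles` is hypothetical.

```python
import math

PIPELINE_STAGES = 14  # from the slide: 14-stage pipeline
ISSUE_WIDTH = 4       # from the slide: 4-way superscalar


def ideal_cycles(num_instructions: int) -> int:
    """Best-case cycles to complete num_instructions.

    Assumption: every group holds ISSUE_WIDTH instructions and the
    pipeline never stalls (no cache misses, no mispredicted branches).
    """
    if num_instructions <= 0:
        return 0
    groups = math.ceil(num_instructions / ISSUE_WIDTH)
    # The first group takes PIPELINE_STAGES cycles to pass through;
    # each later group finishes one cycle after the one before it.
    return PIPELINE_STAGES + groups - 1
```

Under these assumptions, long instruction streams approach 4 instructions per cycle, while very short ones are dominated by the 14-cycle pipeline depth.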
Pipeline Stages
Stage  Definition
A      Address Generation
P      Preliminary Fetch
F      Fetch Instructions from I-Cache
B      Branch Target Computation
I      Instruction Group Formation
J      Grouping
R      Register Access
E      Execute
C      Cache
M      Miss Detect
W      Write
X      Extend
T      Trap
D      Done
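The stage table can be written down as an ordered list, which makes it easy to ask which stage an instruction occupies in a given cycle. This is only a sketch: it assumes one stage per cycle with no stalls, and the helper name `stage_in_cycle` is hypothetical.

```python
# Stage letters and definitions, in pipeline order, as listed in the table.
STAGES = [
    ("A", "Address Generation"),
    ("P", "Preliminary Fetch"),
    ("F", "Fetch Instructions from I-Cache"),
    ("B", "Branch Target Computation"),
    ("I", "Instruction Group Formation"),
    ("J", "Grouping"),
    ("R", "Register Access"),
    ("E", "Execute"),
    ("C", "Cache"),
    ("M", "Miss Detect"),
    ("W", "Write"),
    ("X", "Extend"),
    ("T", "Trap"),
    ("D", "Done"),
]


def stage_in_cycle(start_cycle: int, cycle: int):
    """Return the stage letter occupied in `cycle` by an instruction that
    entered stage A in `start_cycle`, or None if it is not in the pipeline.

    Assumption: the instruction advances one stage per cycle, no stalls.
    """
    index = cycle - start_cycle
    if 0 <= index < len(STAGES):
        return STAGES[index][0]
    return None
```

For example, an instruction that enters stage A in cycle 0 reaches stage D in cycle 13 and has left the pipeline by cycle 14.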
Pipeline Stages
Stage A : Address Generation
  Generates and selects the fetch address
  Address can be selected from several sources
Stage P : Preliminary Fetch
  Starts fetching from I-Cache
  Accesses the Branch Predictor
Stage F : Fetch
  Second half of I-Cache access
  At the end of the stage, up to 4 instructions may be latched
Stage B : Branch Target Computation
  Analyzes the instructions
  Calculates the branch target address
Pipeline Stages
Stage I : Instruction Group Formation
  Instructions are grouped into the instruction queue.
Stage J : Instruction Group Staging
  A group of instructions is dequeued and sent to the R-stage
Stage R : Dispatch and Register Access
  Dependency calculation
  Dependency resolution
Pipeline Stages
Stage E : Integer Instruction Execution
  First stage of execution pipelines
  Integer instructions -> A0 and A1 pipelines
  Branch instructions -> Branch pipeline
  Other instructions -> MS pipeline
Stage C : Cache
  Integer pipelines write results back
  SIU results are produced
  First stage for Floating Point instructions
Pipeline Stages
Stage M : Miss
  Data cache misses are determined
  Second step for FP instructions
Stage W : Write
  MS pipeline results are written
  Third step for FP instructions
  D-cache miss requests are sent to the L2 cache
Stage X : Extend
  Final step for Floating Point instructions
  Results from FP instructions are ready for bypass
Pipeline Stages
Stage T : Trap
  Traps are signalled
  After a trap, instructions invalidate their results
Stage D : Done
  Integer results are written into the architectural register file
  Floating point results are written to the floating point register file.
  Results become visible to any traps generated by younger instructions.
Pipeline Rules
Grouping rules:
  Group: a collection of instructions that do not prevent each other from executing in parallel
  Groups are formed before the R-stage
  Grouping is needed because:
    The execution order must be maintained
    Each pipeline runs only a subset of instructions
    Instructions may require helpers
Execution order: in-order execution
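The grouping idea can be sketched as a greedy, in-order pass over the instruction stream. This is a deliberately simplified model, not the chip's actual rule set: it assumes a group closes when the next instruction touches a register written earlier in the same group, when its pipeline class is already full, or when the group holds 4 instructions. Only the pipelines named on the slides (A0, A1, Branch, MS) are modelled, helpers are ignored, and the names `Instr` and `form_groups` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_GROUP_SIZE = 4  # 4-way superscalar, per the slides

# Pipelines named on the slides; an "int" instruction may use A0 or A1.
SLOTS_PER_KIND = {"int": 2, "branch": 1, "ms": 1}


@dataclass(frozen=True)
class Instr:
    name: str
    kind: str                     # "int", "branch", or "ms"
    dest: Optional[str] = None    # register written, if any
    srcs: Tuple[str, ...] = ()    # registers read


def form_groups(program):
    """Split `program` into groups, preserving program order."""
    groups, current = [], []
    written, used = set(), {}
    for ins in program:
        # Simplified dependency check: reads or rewrites a register that
        # an earlier instruction in the same group writes.
        depends = any(s in written for s in ins.srcs) or ins.dest in written
        pipe_full = used.get(ins.kind, 0) >= SLOTS_PER_KIND[ins.kind]
        if current and (len(current) == MAX_GROUP_SIZE or depends or pipe_full):
            groups.append(current)
            current, written, used = [], set(), {}
        current.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
        used[ins.kind] = used.get(ins.kind, 0) + 1
    if current:
        groups.append(current)
    return groups


example_program = [
    Instr("add", "int", "r1", ("r2", "r3")),
    Instr("sub", "int", "r4", ("r5", "r6")),
    Instr("ld", "ms", "r7", ("r8",)),
    Instr("and", "int", "r9", ("r1", "r4")),  # needs r1 and r4 from group 1
    Instr("bne", "branch", None, ("r9",)),    # needs r9 from the "and"
]
```

With `example_program`, the first three instructions form one group, and the dependent `and` and `bne` each start a new group, so program order is never violated.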
Cache Organization
Doubled cache size because of dual core.
Data Cache : 64 KB x 2
Instruction Cache : 32 KB x 2
L2 Cache : 16 MB, off-chip, shared
No L3 Cache
Cache Organization
Data Cache
  64 KB Level 1 cache per core
Instruction Cache
  32 KB Level 1 cache per core
  4-way associative
Cache Organization
Prefetch Cache
  One of the L1 caches
  2 Kbyte SRAM : 32 x 64 bytes
  Uses LRU replacement algorithm
  Aim is to fetch data before it is needed
  Reduces main memory access latency
  2 ports read 8 bytes, 1 port writes 16 bytes per cycle.
  Hardware prefetch
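The prefetch cache figures (32 lines of 64 bytes, LRU replacement) can be illustrated with a small software model. This is a sketch only: it treats the 32 lines as one fully associative set, which the slides do not state, and the class name `PrefetchCacheModel` is hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64  # from the slide
NUM_LINES = 32   # from the slide
assert LINE_BYTES * NUM_LINES == 2 * 1024  # 2 Kbyte, as stated


class PrefetchCacheModel:
    """Toy LRU cache. Assumption: fully associative (not in the slides)."""

    def __init__(self, num_lines=NUM_LINES, line_bytes=LINE_BYTES):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self._lines = OrderedDict()  # line number -> True, least recent first

    def access(self, address: int) -> bool:
        """Return True on a hit. On a miss, fill the line, evicting the
        least recently used line if the cache is full."""
        line = address // self.line_bytes
        if line in self._lines:
            self._lines.move_to_end(line)  # mark as most recently used
            return True
        if len(self._lines) >= self.num_lines:
            self._lines.popitem(last=False)  # evict the LRU line
        self._lines[line] = True
        return False
```

In this model, two addresses inside the same 64-byte line hit after the first miss, and the line that has gone unused longest is the first one replaced.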
Cache Organization
Write Cache
  Reduces the bandwidth consumed by store traffic
  2 Kbyte cache
  Handles multiprocessor and on-chip cache consistency
  Improves error recovery
  Optionally uses a hashed index
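The slides do not give the write cache's hash function, so the following is only a generic illustration of why a hashed index helps. It assumes 64-byte lines and 32 entries (2 KB / 64 B), neither of which is stated for the write cache, and it uses a simple XOR fold. The function names are hypothetical.

```python
LINE_BYTES = 64    # assumed line size (not stated on the slides)
NUM_ENTRIES = 32   # 2 KB / 64 B under that assumption
OFFSET_BITS = 6    # log2(LINE_BYTES)
INDEX_BITS = 5     # log2(NUM_ENTRIES)


def plain_index(address: int) -> int:
    """Index taken directly from the address bits above the line offset."""
    return (address >> OFFSET_BITS) & (NUM_ENTRIES - 1)


def hashed_index(address: int) -> int:
    """Illustrative hash: XOR the plain index bits with the next-higher
    bits. This is NOT the documented UltraSparc IV hash."""
    line = address >> OFFSET_BITS
    low = line & (NUM_ENTRIES - 1)
    high = (line >> INDEX_BITS) & (NUM_ENTRIES - 1)
    return low ^ high


# Stores that are exactly 2 KB apart collide without hashing.
stride_addresses = [0, 2048, 4096, 6144]
```

With `stride_addresses`, the plain index sends every store to entry 0, while the hashed index spreads them over entries 0, 1, 2 and 3, so repeated writes at a power-of-two stride no longer evict one another.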
Cache Organization
L2 Cache
  16 MB SRAM shared by two processors
  Separate L2 cache tags
  Two-way set associative
  LRU replacement policy
  128-byte line size
UltraSparc IV+ has an on-die Level 2 cache with an off-die Level 3 cache
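The L2 parameters above (16 MB, two-way set associative, 128-byte lines) fix the cache geometry. The sketch below works out the number of sets and splits an address into tag, set index and line offset. It assumes the set index comes from the low-order address bits just above the offset, which the slides do not spell out. `split_address` is a hypothetical helper.

```python
L2_BYTES = 16 * 1024 * 1024  # 16 MB, from the slide
LINE_BYTES = 128             # from the slide
WAYS = 2                     # two-way set associative, from the slide

NUM_LINES = L2_BYTES // LINE_BYTES         # 131072 lines
NUM_SETS = NUM_LINES // WAYS               # 65536 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 7 bits select a byte in a line
INDEX_BITS = NUM_SETS.bit_length() - 1     # 16 bits select a set


def split_address(address: int):
    """Return (tag, set_index, offset) for a physical address.

    Assumption: plain low-order-bit indexing.
    """
    offset = address & (LINE_BYTES - 1)
    set_index = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset
```

Under this indexing, two addresses exactly 8 MB apart fall into the same set with different tags, and the two ways let both stay resident until a third such line forces an LRU replacement.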
Branch Prediction
Branch Predictor:
  Small SRAM with single-cycle access
  Output is connected to the P-stage
Branch determination is made in the B-stage
On a misprediction, fetch returns to the A-stage.
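The slides describe the predictor only as a small single-cycle SRAM. As a generic illustration of how such a table can work, here is a sketch of a classic 2-bit saturating-counter predictor indexed by the branch address. The table size, the indexing scheme and the class name `TwoBitPredictor` are assumptions and are not taken from the UltraSparc IV documentation.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken,
    2-3 predict taken). Illustrative only."""

    def __init__(self, entries: int = 1024):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # SPARC instructions are 4 bytes, so drop the two low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        """Return True if the branch at `pc` is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A counter needs two wrong outcomes in a row to flip a strong prediction, which keeps a loop branch predicted taken across the single not-taken outcome at loop exit.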
Conclusion
UltraSparc IV is a milestone, as it is the first dual-core chip of the UltraSparc family
Sun continues to develop UltraSparc:
  UltraSparc IV+
  UltraSparc T1
References
UltraSparc IV User’s Manual, Sun Microsystems
UltraSparc IV Whitepaper, Sun Microsystems
UltraSparc IV Mirrors Predecessor, Kevin Krewell
Implementation and Productization of a 4th Generation 1.8GHz Dual-Core SPARC V9 Microprocessor, Anand Dixit, Jason Hart, ...
UltraSparc III User’s Manual, Sun Microsystems
Web Sites:
  http://web.cs.unlv.edu/cs219/group3/index.html
  http://bwrc.eecs.berkeley.edu/CIC/archive/cpu_history.html#SPARC
  http://www.arcade-eu.org/overview/2005/sparcIV.html
  http://www.top500.org/orsc/2006/sparcIV.htm
  http://www.sparc.org/history.html
Questions...