UltraSparc IV

Tolga TOLGAY

OUTLINE

- Introduction
- History
- What is new?
- Chip Multithreading
- Pipeline
- Cache
- Branch Prediction
- Conclusion


INTRODUCTION

- Sparc = Scalable Processor Architecture
  - Open processor architecture
- Sun UltraSparc (SPARC V9):
  - RISC architecture
  - 64-bit address and data
  - Superscalar


HISTORY

- 1984: Sparc development begins
- 1986: First Sparc processor
- 1992: SuperSparc
- 1995: UltraSparc I
- 1997: UltraSparc II
- 2001: UltraSparc III
- 2004: UltraSparc IV
- 2005: UltraSparc IV+
- 2005: UltraSparc T1


WHAT IS NEW?

What UltraSparc IV adds:

- CMT (Chip Multithreading)
  - New registers added to support CMT
  - MCU registers and Sun Fireplan Interconnect registers are shared
- Enhancements to the Floating Point Unit
- 16 MB L2 cache with a 128-byte line size, shared by the two processors
- L2 cache uses an LRU replacement strategy
- New write-cache index hashing feature


Chip Multithreading (CMT)

- Two UltraSparc III cores on one die
- The two mirrored cores share:
  - System bus
  - DRAM controller
  - Off-die L2 cache
  - Fireplan registers
- Also called Chip Multiprocessing


Chip Multithreading

- The aim is to increase performance without increasing clock speed.
- Mirroring the cores creates a hot spot at the floating-point units.
- How to avoid the hot spot: heat towers in the copper interconnect


Core

- More core improvements:
  - Improved instruction fetch and store bandwidth
  - Improved data prefetching
  - The FPU handles more unexpected and underflow cases, which reduces exceptions
- On-die cache enhanced with a hashed index to better handle multiple writes


Pipeline

- Because UltraSparc IV contains two UltraSparc III cores, it uses the same pipeline.
- 4-way superscalar architecture
- 14-stage pipeline


Pipeline Stages

Stage  Definition
A      Address Generation
P      Preliminary Fetch
F      Fetch Instructions from I-Cache
B      Branch Target Computation
I      Instruction Group Formation
J      Grouping
R      Register Access
E      Execute
C      Cache
M      Miss Detect
W      Write
X      Extend
T      Trap
D      Done


Stage A: Address Generation
- Generates and selects the fetch address
- The address can be selected from several sources

Stage P: Preliminary Fetch
- Starts fetching from the I-Cache
- Accesses the Branch Predictor

Stage F: Fetch
- Second half of the I-Cache access
- At the end of the stage, 4 instructions may be latched

Stage B: Branch Target Computation
- Analyzes the instructions
- Calculates the branch target address



Stage I: Instruction Group Formation
- Instructions are grouped into the instruction queue

Stage J: Instruction Group Staging
- A group of instructions is dequeued and sent to the R-stage

Stage R: Dispatch and Register Access
- Dependency calculation
- Dependency resolution



Stage E: Integer Instruction Execution
- First stage of the execution pipelines
- Integer instructions -> A0 and A1 pipelines
- Branch instructions -> Branch pipeline
- Other instructions -> MS pipeline

Stage C: Cache
- Integer pipelines write results back
- SIU results are produced
- First stage for floating-point instructions



Stage M: Miss
- Data cache misses are determined
- Second step for floating-point instructions

Stage W: Write
- MS pipeline results are written
- Third step for floating-point instructions
- D-cache miss requests are sent to the L2 cache

Stage X: Extend
- Final step for floating-point instructions
- Results from floating-point instructions are ready for bypass



Stage T: Trap
- Traps are signalled
- After a trap, instructions invalidate their results

Stage D: Done
- Integer results are written into the architectural register file
- Floating-point results are written to the floating-point register file
- Results become visible to any traps generated from younger instructions


Pipeline Rules

- Grouping rules:
  - Group: a collection of instructions that do not prevent each other from executing in parallel
  - Groups are formed before the R-stage
  - Needed because:
    - The execution order is maintained
    - Each pipeline runs a subset of instructions
    - Instructions may require helpers
- Execution order: in-order execution
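
The grouping idea can be illustrated with a small sketch. It is a simplification, not the processor's actual rule set: it only checks register dependencies and the 4-wide limit, and ignores which pipeline (A0, A1, Branch, MS) each instruction needs. The names `Instr` and `form_groups` and the register names are made up for the example.

```python
from typing import List, NamedTuple, Optional, Tuple

MAX_GROUP_SIZE = 4  # 4-way superscalar: at most four instructions per group


class Instr(NamedTuple):
    name: str
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def form_groups(program: List[Instr]) -> List[List[Instr]]:
    """Greedy in-order grouping: close the current group at the first conflict."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    written = set()  # registers written by instructions already in the group
    for ins in program:
        reads_pending = any(s in written for s in ins.srcs)       # read-after-write
        rewrites = ins.dest is not None and ins.dest in written   # write-after-write
        full = len(current) == MAX_GROUP_SIZE
        if current and (full or reads_pending or rewrites):
            groups.append(current)
            current, written = [], set()
        current.append(ins)  # program order is never changed
        if ins.dest is not None:
            written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r5", "r6")),
    Instr("or", "r7", ("r1", "r4")),   # needs r1 and r4, so it starts a new group
    Instr("and", "r8", ("r2", "r5")),
]
groups = form_groups(program)
print([[i.name for i in g] for g in groups])  # [['add', 'sub'], ['or', 'and']]
```

Because instructions are only ever appended in program order and a group is closed rather than reordered, the sketch keeps the in-order property stated on the slide.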


Cache Organization

- Doubled cache size because of dual core
- Data Cache: 64 KB x 2
- Instruction Cache: 32 KB x 2
- L2 Cache: 16 MB, off-chip, shared
- No L3 Cache



Data Cache
- 64 KB Level 1 cache per core

Instruction Cache
- 32 KB Level 1 cache per core
- 4-way associative



Prefetch Cache
- One of the L1 caches
- 2 KB SRAM: 32 x 64 bytes
- Uses the LRU replacement algorithm
- The aim is to fetch data before it is needed
- Reduces main memory access latency
- 2 ports read 8 bytes and 1 port writes 16 bytes per cycle
- Hardware prefetch
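
To make the 32 x 64-byte organization and LRU replacement concrete, here is a minimal software model. It assumes a fully associative cache, which the slide does not state, and the class name `LruLineCache` is invented for the example; it models replacement only, not the ports or the hardware prefetch engine.

```python
from collections import OrderedDict

PREFETCH_LINE_BYTES = 64  # from the slide
PREFETCH_NUM_LINES = 32   # from the slide: 32 x 64 bytes = 2 KB


class LruLineCache:
    """Line cache with LRU replacement (assumed fully associative)."""

    def __init__(self, num_lines=PREFETCH_NUM_LINES, line_bytes=PREFETCH_LINE_BYTES):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self._lines = OrderedDict()  # line number -> True, least recently used first

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and evict the LRU line."""
        line = addr // self.line_bytes
        if line in self._lines:
            self._lines.move_to_end(line)     # now the most recently used
            return True
        if len(self._lines) >= self.num_lines:
            self._lines.popitem(last=False)   # drop the least recently used line
        self._lines[line] = True
        return False


print(PREFETCH_NUM_LINES * PREFETCH_LINE_BYTES)  # 2048

small = LruLineCache(num_lines=2)
trace = [small.access(a) for a in (0, 64, 0, 128, 0, 64)]
print(trace)  # [False, False, True, False, True, False]
```

In the two-line example, the access to address 128 evicts the line holding address 64, because address 0 was touched more recently.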



Write Cache
- Reduces the bandwidth consumed by store traffic
- 2 KB cache
- Handles multiprocessor and on-chip cache consistency
- Improves error recovery
- Optionally uses a hashed index
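
Why a hashed index helps can be shown with a toy comparison. Everything specific here is an assumption for illustration: the slide gives neither the line size nor the number of sets nor the hash function, so the sketch uses 64-byte lines, 8 sets and a simple XOR fold. The real hash is defined in the user's manual and may differ.

```python
WC_LINE_BYTES = 64  # assumed line size (not given on the slide)
WC_INDEX_BITS = 3   # assumed: 8 sets
WC_NUM_SETS = 1 << WC_INDEX_BITS


def plain_index(addr):
    """Conventional index: the low bits of the line number."""
    return (addr // WC_LINE_BYTES) % WC_NUM_SETS


def hashed_index(addr):
    """Hypothetical hash: XOR the low line-number bits with the next-higher bits."""
    line = addr // WC_LINE_BYTES
    return (line ^ (line >> WC_INDEX_BITS)) % WC_NUM_SETS


# Stores with a 512-byte stride all map to the same set under plain indexing.
stride = WC_NUM_SETS * WC_LINE_BYTES
addrs = [i * stride for i in range(8)]

print([plain_index(a) for a in addrs])   # [0, 0, 0, 0, 0, 0, 0, 0]
print([hashed_index(a) for a in addrs])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With plain indexing the strided stores keep evicting each other from one set; folding higher address bits into the index spreads them across all sets.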



L2 Cache
- 16 MB SRAM shared by two processors
- Separate L2 cache tags
- Two-way set associative
- LRU replacement policy
- 128-byte line size
- UltraSparc IV+ has an on-die Level 2 cache with an off-die Level 3 cache
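
The L2 parameters above determine how an address splits into tag, set index and line offset. The arithmetic below is the textbook decomposition for a set-associative cache with those numbers; the function name `split_address` is illustrative, and the actual hardware may select its index bits differently.

```python
L2_SIZE_BYTES = 16 * 1024 * 1024  # 16 MB, from the slide
L2_LINE_BYTES = 128               # from the slide
L2_WAYS = 2                       # two-way set associative, from the slide

L2_NUM_SETS = L2_SIZE_BYTES // (L2_LINE_BYTES * L2_WAYS)
L2_OFFSET_BITS = L2_LINE_BYTES.bit_length() - 1  # log2 of a power of two
L2_INDEX_BITS = L2_NUM_SETS.bit_length() - 1


def split_address(addr):
    """Return (tag, set index, byte offset within the line)."""
    offset = addr & (L2_LINE_BYTES - 1)
    index = (addr >> L2_OFFSET_BITS) & (L2_NUM_SETS - 1)
    tag = addr >> (L2_OFFSET_BITS + L2_INDEX_BITS)
    return tag, index, offset


print(L2_NUM_SETS, L2_OFFSET_BITS, L2_INDEX_BITS)  # 65536 7 16
print([hex(x) for x in split_address(0x12345678)])  # ['0x24', '0x68ac', '0x78']
```

So 16 MB with two ways and 128-byte lines gives 65,536 sets: 7 offset bits and 16 index bits, with the remaining upper bits forming the tag.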


Branch Prediction

- Branch Predictor: a small SRAM with single-cycle access
- Its output is connected to the P-stage
- Branch determination is made in the B-stage
- On a misprediction, fetch returns to the A-stage
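
The slide does not say what the predictor SRAM stores. As an illustration only, the sketch below assumes a table of 2-bit saturating counters indexed by PC bits, a common scheme; the table size of 1024 entries and the names `TwoBitPredictor` and `count_mispredictions` are invented for the example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (assumed scheme, not from the slides)."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.table = [1] * entries  # start weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # SPARC instructions are 4 bytes wide

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Replay branch outcomes for one PC and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1  # in the pipeline, this is where fetch restarts at the A-stage
        predictor.update(pc, taken)
    return misses


# A loop branch taken nine times, then falling through once.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(TwoBitPredictor(), 0x4000, loop_outcomes))  # 2
```

The two mispredictions are the first iteration, before the counter has warmed up, and the final loop exit.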


Conclusion

- UltraSparc IV is a milestone: it is the first dual-core chip of the UltraSparc family
- Sun continues to develop UltraSparc:
  - UltraSparc IV+
  - UltraSparc T1


References

- UltraSparc IV User’s Manual, Sun Microsystems
- UltraSparc IV Whitepaper, Sun Microsystems
- UltraSparc IV Mirrors Predecessor, Kevin Krewell
- Implementation and Productization of a 4th Generation 1.8GHz Dual-Core SPARC V9 Microprocessor, Anand Dixit, Jason Hart, ...
- UltraSparc III User’s Manual, Sun Microsystems


References

Web sites:
- http://web.cs.unlv.edu/cs219/group3/index.html
- http://bwrc.eecs.berkeley.edu/CIC/archive/cpu_history.html#SPARC
- http://www.arcade-eu.org/overview/2005/sparcIV.html
- http://www.top500.org/orsc/2006/sparcIV.htm
- http://www.sparc.org/history.html


Questions...
