UC Regents Fall 2013 © UCB CS 250 L3: Timing
2013-9-5 Professor Jonathan Bachrach
today’s lecture by John Lazzaro
CS 250 VLSI System Design
Lecture 3 – Timing
www-inst.eecs.berkeley.edu/~cs250/
TA: Ben Keller
... everything doesn’t happen at once.
Timing, the 10,000 ft view. Locally synchronous, globally asynchronous.
On the same page. Minimal set of timing concepts you need for the project.
Break
RTL Examples. Better timing through micro-architecture.
Electrical details. Just so you know ...
View from 10,000 Ft.
Google I/O, 2012
Moore's Law: 2 thousand, then 1 million, then 2.6 billion transistors.
Synchronous logic on a single clock domain is not practical for
a 2.6 billion transistor design
GALS: Globally Asynchronous, Locally Synchronous
The basic GALS method focuses on point-to-point communication between blocks.

FIFO solutions. Another approach to interfacing locally synchronous blocks is using specially designed asynchronous FIFO buffers [8-10] and hiding the system synchronization problem within the FIFO buffers. Such a system can tolerate very large interconnect delays and is also robust with regard to metastability. Designers can use this method to interconnect asynchronous and synchronous systems and also to construct synchronous-synchronous and asynchronous-asynchronous interfaces. Figure 2 diagrams a typical FIFO interface, which achieves an acceptable data throughput [8]. In addition to the data cells, the FIFO structure includes an empty/full detector and a special deadlock detector.

The advantage of FIFO synchronizers is that they don't affect the locally synchronous module's operation. However, with very wide interconnect data buses, FIFO structures can be costly in silicon area. Also, they require specialized complex cells to generate the empty/full flags used for flow control. The introduced latency might be significant and unacceptable for high-speed applications.
As an alternative, Beigne and Vivet designed a synchronous-asynchronous FIFO based on the bisynchronous classical FIFO design using gray code, for the specific case of an asynchronous network-on-chip (NoC) interface [10]. Their aim was to maintain compatibility with existing design solutions and to use standard CAD tools. Thus, even with some performance degradation or a suboptimal architecture, designers can achieve the main goal of designing GALS systems in the standard design environment.
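The gray coding mentioned above is what makes a bisynchronous FIFO's read and write pointers safe to sample from the other clock domain: successive gray-coded counts differ in exactly one bit, so a pointer caught mid-transition resolves to either the old or the new value, never to an unrelated count. A minimal Python sketch of that property (illustrative only, not from the lecture):

    # Sketch: why bisynchronous FIFO pointers are gray-coded.
    # Function names are illustrative, not from any particular library.

    def binary_to_gray(n: int) -> int:
        """Convert a binary count to its gray-code equivalent."""
        return n ^ (n >> 1)

    def gray_to_binary(g: int) -> int:
        """Convert a gray code back to binary with a cascade of XORs."""
        n = 0
        while g:
            n ^= g
            g >>= 1
        return n

    def bits_changed(a: int, b: int) -> int:
        """Hamming distance between two pointer encodings."""
        return bin(a ^ b).count("1")

    # Successive gray-coded pointer values differ in exactly one bit, so a
    # pointer sampled in the other clock domain while it is changing settles
    # to either the old count or the new count, never to a garbled mix.
    for count in range(16):
        g_now, g_next = binary_to_gray(count), binary_to_gray(count + 1)
        assert bits_changed(g_now, g_next) == 1
        assert gray_to_binary(g_now) == count
    print("each pointer increment changes exactly one bit")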
Boundary synchronization. A third solution is to perform data synchronization at the borders of the locally synchronous island, without affecting the inner operation of locally synchronous blocks and without relying on FIFO buffers. For this purpose, designers can use standard two-flop, one-flop, predictive, or adaptive synchronizers for mesochronous systems, or locally delayed latching [1, 11]. This method can achieve very reliable data transfer between locally synchronous blocks. On the other hand, such solutions generally increase latency and reduce data throughput, resulting in limited applicability for high-speed systems. Table 1 summarizes the properties of GALS systems' synchronization methods.
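For the two-flop synchronizers mentioned above, a behavioral Python sketch (purely illustrative, not from the lecture) shows the cost they impose: the signal crossing into the destination domain is delayed by an extra destination-clock cycle, which is what gives a metastable first flop time to settle before the rest of the logic sees the value.

    # Sketch: behavioral model of a standard two-flop synchronizer.
    # It models only the added latency, not metastability itself.

    class TwoFlopSynchronizer:
        def __init__(self):
            self.ff1 = 0   # first flop samples the asynchronous input
            self.ff2 = 0   # second flop gives a metastable ff1 a cycle to settle

        def clock(self, async_in: int) -> int:
            """Advance one rising edge of the destination-domain clock."""
            self.ff2, self.ff1 = self.ff1, async_in
            return self.ff2

    sync = TwoFlopSynchronizer()
    # A level that rises just before destination edge 1 is not visible to
    # destination-domain logic until after edge 2.
    for cycle, level in enumerate([0, 1, 1, 1, 1]):
        print(f"edge {cycle}: async_in={level} synchronized_out={sync.clock(level)}")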
Advantages and limitations of GALS solutions. The scientific community has shown great interest in GALS solutions and architectures in the past two decades. However, this interest hasn't culminated in many commercial applications, despite all reported advantages. There are several reasons why standard design practice has not adopted GALS techniques.
Design and system integration issues. Many proposed solutions require programmable ring oscillators. This is an inexpensive solution that allows full control of the local clock. However, it has significant drawbacks. Ring oscillators are impractical for industrial use. They need careful calibration because they are very sensitive to process, voltage, and temperature variations. Moreover, embedded ring oscillators consume additional power through continuous switching of the chained inverters. On the other hand, careful design of the delay line can reduce its power consumption to a level below that of a corresponding clock tree.
Figure 2. Typical FIFO-based GALS system.
(Source: "Globally Asynchronous, Locally Synchronous Design and Test," IEEE Design & Test of Computers.)
Synchronous modules typically 50K-1M gates, so that the synchronous logic approach works well without requiring heroics. Examples ...
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions [6, 7]. The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction [7]. If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value.
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
IBM Power 5 CPU - Dynamically Scheduled
Stars denote FIFOs that create separate synchronous domains. An example of how architecture and circuits work together.
Rocket uses GALS for accelerator interface
Your project interfaces with the RISC-V pipeline and the memory system using FIFOs.
Your timing closure is independent of the CPU logic domain.
Today: Timing insights for your project
What we're not doing. If this class was EE 241 and your project was an SRAM:
You could see through, down to the layout. Timing? Use SPICE on this hand-drawn schematic.
Technology X: The CS 250 timing challenge.
What we are doing --->
© Synopsys 2012
1986: Logic Compiler. Optimal Solutions, Inc. (aka Synopsys, Inc.)
Technology X – Provide automation and increase productivity for gate level designers
Logic Synthesis
If your accelerator is too slow ... two options:
Bottom-up: Take control away from logic synthesis. Use HDL as textual schematic. Also, use command-line tool flags.
Top-down: Rework high-level micro-architecture. Let Technology X keep its job.
Sometimes necessary. Ben is the expert, ask in discussion section.
Today.
A Logic Circuit Primer
“Models should be as simple as possible, but no simpler ...” Albert Einstein.
Inverters: A simple transistor model
Design Refinement
Informal System Requirement
Initial Specification
Intermediate Specification
Final Architectural Description
Intermediate Specification of Implementation
Final Internal Specification
Physical Implementation
refinement: increasing level of detail
Logic Components
° Wires: Carry signals from one point to another
  • Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
  • Data goes in, results come out after some propagation delay
° Flip-Flops: Storage elements
  • After a clock edge, input copied to output
  • Otherwise, the flip-flop holds its value
  • Also: a "Latch" is a storage element that is level triggered
(Symbols: a 1-bit flip-flop D -> Q, an 8-bit flip-flop D[8] -> Q[8], and an 8-bit combinational logic block.)
Elements of the design zoo
Basic Combinational Elements + DeMorgan Equivalence

Wire: Out = In
  In | Out
   0 |  0
   1 |  1

Inverter: Out = In'
  In | Out
   0 |  1
   1 |  0

NAND Gate: Out = (A • B)'
  A B | Out
  0 0 |  1
  0 1 |  1
  1 0 |  1
  1 1 |  0

NOR Gate: Out = (A + B)'
  A B | Out
  0 0 |  1
  0 1 |  0
  1 0 |  0
  1 1 |  0

DeMorgan's Theorem: (A • B)' = A' + B' and (A + B)' = A' • B', so a NAND is equivalent to an OR with inverted inputs, and a NOR is equivalent to an AND with inverted inputs.
Delay Model: CMOS
Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior: truth-table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x load
° Linear model composes
(Figure: combinational logic cell with inputs A, B, ..., X driving a load Cout; plot of delay from Va to Vout versus Cout, with slope = delay per unit load and intercept = internal delay.)
Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a "conductor"
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a "conductor"
(Vdd = 5V, GND = 0V)
Basic Components: CMOS Inverter
° Inverter Operation
(Figure: inverter circuit and symbol. A PMOS transistor connects Out to Vdd and an NMOS transistor connects Out to GND. With In = "0", the NMOS is open and the PMOS charges Out up to Vdd; with In = "1", the PMOS is open and the NMOS discharges Out to GND.)
pFET: a switch. "On" if gate is grounded.
nFET: a switch. "On" if gate is at Vdd.
Correctly predicts logic output for simple static CMOS circuits.
Extensions to model subtler circuit families, or to predict timing, have not worked well ...
Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...
An "on" p-FET fills up the capacitor with charge.
An "on" n-FET empties the bucket.
(Plots of water level versus time: the level rises from "0" toward "1" as the on p-FET fills the bucket, and falls from "1" toward "0" as the on n-FET empties it.)
This model is often good enough ...
(Cartoon physics)
What is the bucket? A gate’s “fan-out”.
Driving other gates slows a gate down.
Gate Switching Behavior (waveforms for an inverter and for a NAND gate).
Driving wires slows a gate down.
“Fan-out”: The number of gate inputs driven by a gate’s output.
Driving its own parasitics slows a gate down.
Fanout
A closer look at fan-out ...
Series Connection
(Figure: gate G1 drives gate G2 through node V1 with capacitance C1; G2 drives Vout with load Cout. Waveforms of Vin, V1, and Vout crossing Vdd/2 define the delays d1 and d2.)
° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
  • Capacitance of the wire connecting the two gates
  • Input capacitance of the second inverter
Calculating Aggregate Delays
(Figure: gate G1 drives node V1 with capacitance C1; V1 fans out to gates G2 and G3, whose outputs are V2 and V3.)
° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
  • Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
  • Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
Characterize a Gate
° Input capacitance for each input
° For each input-to-output path, for each output transition type (H->L, L->H, H->Z, L->Z ... etc.):
  - Internal delay (ns)
  - Load-dependent delay (ns / fF)
° Example: 2-input NAND Gate
(Figure: plot of Delay A -> Out, Out: Low -> High, versus Cout; intercept 0.5 ns, slope = 0.0021 ns / fF.)
For A and B: Input Load (I.L.) = 61 fF
For either A -> Out or B -> Out:
  Tlh = 0.5 ns, Tlhf = 0.0021 ns / fF
  Thl = 0.1 ns, Thlf = 0.0020 ns / fF
A Specific Example: 2-to-1 MUX
Y = (A and !S) or (B and S)
(Figure: the mux built from an inverter (Gate 1) on S and NAND gates, connected by Wire 0, Wire 1, and Wire 2; inputs A, B, S, output Y.)
° Input Load (I.L.)
  • A, B: I.L. (NAND) = 61 fF
  • S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load-Dependent Delay (L.D.D.): Same as Gate 3
  • TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
  • TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
  • TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
Linear model works for reasonable fan-out.
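As a concrete use of the linear model, the following Python sketch composes two stages of the mux's A -> Y path using the NAND numbers from the slides (61 fF input load, 0.5 ns internal low-to-high delay, 0.0021 ns/fF load-dependent delay). The wire and output load capacitances are hypothetical placeholders, and the low-to-high coefficients are reused for both stages just to keep the arithmetic simple.

    # Sketch: linear gate-delay model from the slides,
    #   delay = internal_delay + load_dependent_delay * load_capacitance.
    # NAND parameters are from the example; wire/output loads are made up.

    def gate_delay(t_internal_ns: float, t_per_fF: float, load_fF: float) -> float:
        return t_internal_ns + t_per_fF * load_fF

    NAND_IL_fF  = 61.0     # input load of each NAND input (slide value)
    NAND_TLH_ns = 0.5      # internal low->high delay (slide value)
    NAND_TLHF   = 0.0021   # load-dependent delay in ns per fF (slide value)

    wire_fF     = 10.0     # hypothetical wire capacitance between the NANDs
    output_fF   = 100.0    # hypothetical external load on the mux output Y

    stage1 = gate_delay(NAND_TLH_ns, NAND_TLHF, wire_fF + NAND_IL_fF)  # NAND driving the next NAND input
    stage2 = gate_delay(NAND_TLH_ns, NAND_TLHF, output_fF)             # NAND driving the load on Y
    print(f"A -> Y path estimate: {stage1 + stage2:.3f} ns")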
Gate Delay
• Fan-out: The delay of a gate is proportional to its output capacitance, because gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)
(Figure: gate #1 driving the inputs of gates #2 and #3.)
Delay time of an inverter driving 4 inverters.
FO4: Fanout of four delay.
Driving more gates adds delay.
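A fan-out-of-four (FO4) delay is just the linear model evaluated with four identical gate inputs as the load. In the sketch below, only the inverter's 50 fF input load comes from the mux example; its internal and per-fF delays are hypothetical placeholders.

    # Sketch: FO4 delay under the linear model. Only the 50 fF inverter
    # input load comes from the slides; the other numbers are hypothetical.

    INV_IL_fF = 50.0     # inverter input load (from the mux example)
    INV_T_ns  = 0.3      # hypothetical internal delay
    INV_TF    = 0.0018   # hypothetical load-dependent delay, ns per fF

    def inverter_delay(fanout: int) -> float:
        return INV_T_ns + INV_TF * (fanout * INV_IL_fF)

    for n in (1, 4, 16):
        print(f"driving {n:2d} identical inverters: {inverter_delay(n):.3f} ns")
    # inverter_delay(4) is the FO4 delay; each added gate input adds the same
    # increment, which is why FO4 is a convenient process-normalized unit.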
Propagation delay graphs ...
Gate Delay
• Cascaded gates
(Figure: a chain of inverters; each stage's Vout-versus-Vin transfer function is shown, and the transitions alternate 1 -> 0 and 0 -> 1 along the chain.)
inverter transfer function
Worst-case delay through combinational logic
Gate Delay
• "Fan-in"
• What is the delay in this circuit?  x = g(a, b, c, d, e, f)
• Critical Path: the path with the maximum delay, from any input to any output.
  - In general, we include register set-up and clk-to-Q times in critical path calculation.
• Why do we care about the critical path?
T2 might be the worst-case delay path (critical path).
If d going 0-to-1 switches x 0-to-1, delay is T1. If a going 0-to-1 switches x 0-to-1, delay is T2. It would be surprising if T1 > T2.
Why “might”? Wires have delay too ...
Wire Delay
• Even in those cases where the transmission line effect is negligible:
  - Wires possess distributed resistance and capacitance
  - The time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important.
  - Typically around half of the C of the gate load is in the wires.
• For long wires on ICs:
  - busses, clock lines, global control signals, etc.
  - Resistance is significant, therefore the distributed RC effect dominates.
  - Signals are typically "rebuffered" to reduce delay.
(Figure: a long wire rebuffered at nodes v1, v2, v3, v4, with waveforms of v1 through v4 versus time.)
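The square-law behavior above is why long wires get rebuffered. A small Python sketch (per-unit resistance, capacitance, and buffer delay are hypothetical placeholders; the quadratic form of the distributed-RC time constant is the point):

    # Sketch: distributed-RC delay grows as the square of wire length
    # (roughly 0.5 * r * c * L^2 for a distributed line), so splitting a long
    # wire into buffered segments wins even after paying for the buffers.
    # All numbers are hypothetical placeholders.

    r_per_mm = 200.0      # ohms per mm of wire
    c_per_mm = 0.2e-12    # farads per mm of wire
    t_buffer = 30e-12     # seconds per inserted buffer

    def unbuffered_delay(length_mm: float) -> float:
        return 0.5 * r_per_mm * c_per_mm * length_mm ** 2

    def rebuffered_delay(length_mm: float, segments: int) -> float:
        seg = length_mm / segments
        return segments * unbuffered_delay(seg) + (segments - 1) * t_buffer

    for n in (1, 2, 4, 8):
        print(f"10 mm wire in {n} segment(s): {rebuffered_delay(10.0, n) * 1e12:6.0f} ps")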
Looks benign, but ...
Clocked Logic Circuits
From Delay Models to Timing Analysis

(Embedded page: IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001, p. 1600. Fig. 1 shows a process SEM cross section.)

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the [...] versus [...] dependence, and source-to-body bias is used to electrically limit transistor [...] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high-performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re[...]
Example
• Parallel-to-serial converter (two flip-flops a and b, a 2-to-1 mux, clock clk):
  T ≥ time(clk->Q) + time(mux) + time(setup)
  T ≥ τclk->Q + τmux + τsetup

  f        T
  1 MHz    1 μs
  10 MHz   100 ns
  100 MHz  10 ns
  1 GHz    1 ns
Timing Analysis: What is the smallest T that produces correct operation?
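A small calculation of that smallest T, straight from the relation on the slide; the three component delays are hypothetical placeholders:

    # Sketch: smallest correct clock period for the parallel-to-serial
    # converter, T >= t_clk_to_q + t_mux + t_setup. Numbers are hypothetical.

    t_clk_to_q = 0.15   # ns, flip-flop clock-to-Q delay
    t_mux      = 0.30   # ns, 2-to-1 mux propagation delay
    t_setup    = 0.10   # ns, flip-flop setup time

    T_min = t_clk_to_q + t_mux + t_setup
    print(f"minimum clock period: {T_min:.2f} ns")
    print(f"maximum clock rate:   {1e3 / T_min:.0f} MHz")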
Timing Analysis and Logic Delay
If our clock period T > worst-case delay through CL, does this ensure correct operation?
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior: truth-table, logic equation, VHDL
  • Input load factor of each input
  • Propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x load
° Linear model composes
(Figure: combinational logic cell with inputs A, B, ..., X driving a load Cout; plot of delay from Va to Vout versus Cout, with slope = delay per unit load and intercept = internal delay.)
Storage Element's Timing Model
(Figure: D flip-flop clocked by Clk; waveforms show the setup and hold window around the clock edge, the "don't care" regions outside it, and the clock-to-Q delay before Q becomes valid.)
° Setup Time: Input must be stable BEFORE the trigger clock edge
° Hold Time: Input must REMAIN stable after the trigger clock edge
° Clock-to-Q time:
  • Output cannot change instantaneously at the trigger clock edge
  • Similar to delay in logic gates, two components:
    - Internal Clock-to-Q
    - Load-dependent Clock-to-Q
Clocking Methodology
(Figure: registers separated by blocks of combinational logic, all clocked by the same Clk.)
° All storage elements are clocked by the same clock edge
° The combinational logic blocks:
  • Inputs are updated at each clock tick
  • All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
(Figure: register-to-register paths through combinational logic, clocked by Clk.)
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° Cycle time must be greater than:
  Clock-to-Q + Longest Path through Combinational Logic + Setup
Register: An Array of Flip-Flops
Combinational Logic
Flip Flops have internal delays ...
(Figure: D flip-flop with clock CLK.) The value of D is sampled on the positive clock edge; Q outputs the sampled value for the rest of the cycle. (Waveforms of D and Q show t_setup before the edge and t_clk-to-Q after it.)
Flip-Flop delays eat into “time budget”
(Parallel-to-serial converter example again: T ≥ τclk->Q + τmux + τsetup.)
ALU “time budget”
General Model of Synchronous Circuit
(Figure: input -> reg -> CL -> reg -> CL -> output, with a common clock input and an optional feedback path.)
• In general, for correct operation:
  T ≥ time(clk->Q) + time(CL) + time(setup)
  T ≥ τclk->Q + τCL + τsetup
  for all paths.
• How do we enumerate all paths?
  - Any circuit input or register output to any register input or circuit output.
  - The "setup time" for circuit outputs depends on what it connects to.
  - The "clk-Q time" for circuit inputs depends on where it comes from.
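The "enumerate all paths" rule is what a static timing tool does on the synthesized netlist: find the latest arrival time at every register input and compare it against the period. A toy Python sketch over a hypothetical four-gate netlist (all names and delays are made up):

    # Sketch: longest-path (critical-path) search over a tiny hypothetical
    # netlist, following T >= t_clk_to_Q + t_CL + t_setup.

    from functools import lru_cache

    T_CLK_TO_Q = 0.15   # ns, hypothetical
    T_SETUP    = 0.10   # ns, hypothetical

    # gate name -> (gate delay in ns, fan-in nodes); "regA"/"regB" are register outputs
    NETLIST = {
        "n1": (0.40, ("regA", "regB")),
        "n2": (0.25, ("n1",)),
        "n3": (0.30, ("regB",)),
        "n4": (0.20, ("n2", "n3")),   # n4 drives a register's D input
    }

    @lru_cache(maxsize=None)
    def arrival(node: str) -> float:
        """Latest arrival time at a node, measured from the launching clock edge."""
        if node.startswith("reg"):
            return T_CLK_TO_Q
        delay, fanin = NETLIST[node]
        return delay + max(arrival(f) for f in fanin)

    print(f"critical-path arrival at n4: {arrival('n4'):.2f} ns")
    print(f"minimum clock period:        {arrival('n4') + T_SETUP:.2f} ns")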
Combinational Logic
Clock skew also eats into “time budget”
Clock Skew
(Figure: register -> CL -> register, with the clock reaching the two registers at different times; CLK' is a delayed version of CLK. Clock skew = delay in distribution.)
• If the clock period T = TCL + Tsetup + Tclk->Q, the circuit will fail.
• Therefore:
  1. Control clock skew
     a) Careful clock distribution. Equalize path delay from the clock source to all clock loads by controlling wire delay and buffer delay.
     b) Don't "gate" clocks.
  2. T ≥ TCL + Tsetup + Tclk->Q + worst-case skew.
• Most modern large high-performance chips (microprocessors) control end-to-end clock skew to a few tenths of a nanosecond.
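A quick look at how much the skew term costs, using the constraint on the slide; all numbers are hypothetical placeholders:

    # Sketch: clock skew eats into the cycle-time budget,
    #   T >= T_CL + T_setup + T_clk_to_Q + worst_case_skew.
    # All numbers are hypothetical.

    T_clk_to_Q = 0.15   # ns
    T_setup    = 0.10   # ns
    T_CL       = 3.00   # ns through the combinational logic
    skew       = 0.20   # ns of worst-case clock skew

    T_no_skew   = T_CL + T_setup + T_clk_to_Q
    T_with_skew = T_no_skew + skew
    print(f"min period, zero skew:   {T_no_skew:.2f} ns ({1e3 / T_no_skew:.0f} MHz)")
    print(f"min period, 0.2 ns skew: {T_with_skew:.2f} ns ({1e3 / T_with_skew:.0f} MHz)")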
Clock Skew (cont.)
(Figure: the same register -> CL -> register path, but with the clock buffer reversed so the delayed clock CLK' now reaches the other register.)
• Note the reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!
As T → 0, which circuit fails first?
... the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15,200 clock pins on the chip, regardless of proximity.

From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.

Design methodology and results. This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid.

Figure 6. Schematic diagram of global clock generation and distribution (PLL, bypass, reference clock in/out, clock distribution, clock out).
Figure 7. 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.
Figure 8. Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.
Clock Tree Delays, IBM “Power” CPU
Clock Tree Delays, IBM Power
At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved, to guarantee feasibility and speed the design process.

Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a "correct-by-design" clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.

Circuit design. The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic, an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.

The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology, including uncertainties associated with the modeling of the floating-body effect [21-23] and its impact on noise immunity [22, 24-27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.

Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.

The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description [...]
Figure 9: Global clock waveforms showing 20 ps of measured skew. (Axes: Volts (V), 0.0 to 1.5, versus Time (ps), 0 to 2500.)
26Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Some Flip Flops have “hold” time ...
(Timing diagram: D must stay stable from t_setup before the CLK edge until t_hold after it. Circuit: a D flip-flop clocked by CLK, with its Q output fed back to its D input through an inverter.)
Does flip-flop hold time affect operation of this
circuit? Under what conditions?
t_inv
What is the intended function of this circuit?
For correct operation: t_clk-to-Q + t_inv > t_hold.
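A minimal sketch of this check in Python (not from the slides; the delay values below are made-up examples, only the inequality above comes from the slide):

def hold_ok(t_clk_to_q, t_logic_min, t_hold, t_skew=0.0):
    # Data launched by a clock edge must not arrive before the capturing
    # flip-flop's hold window ends: t_clk_to_q + t_logic(min) > t_hold + skew.
    return t_clk_to_q + t_logic_min > t_hold + t_skew

print(hold_ok(t_clk_to_q=45.0, t_logic_min=30.0, t_hold=20.0))   # True: safe
print(hold_ok(t_clk_to_q=45.0, t_logic_min=30.0, t_hold=90.0))   # False: hold violation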
27Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Searching for processor critical path
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Timing Analysis: What is the smallest T that produces correct operation? Must consider all connected register pairs. Why might I suspect this one?
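A sketch of the check a static timing tool performs, in Python; the register pairs and delay numbers below are invented placeholders, only the form of the setup constraint is assumed:

# Each path: (name, launch FF clk-to-Q, max combinational delay, capture FF setup), in ps.
paths = [
    ("regfile -> ALU -> regfile", 50, 610, 40),
    ("PC -> I-mem -> IR",         50, 480, 40),
    ("IR -> decode -> regfile",   50, 350, 40),
]

def min_period(paths, clock_skew=0.0):
    # T must satisfy T >= t_clk_to_q + t_logic_max + t_setup + skew for every pair.
    return max(c2q + logic + setup + clock_skew for _, c2q, logic, setup in paths)

T = min_period(paths)
print(f"smallest T = {T} ps  ({1e6 / T:.0f} MHz)")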
28Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Combinational paths for IBM Power 4 CPU
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121,713 were top-level chip global nets, and 21,711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor, 1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project's success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing. (Flow diagram: core or chip wiring → extraction → Chipbench/EinsTimer timing, covering non-uplift timing, noise impact on timing, uplift analysis, and capacitance adjust → timer files, asserts, reports, Spice, GL/1; step turnaround budgets range from < 12 hr to < 48 hr. Notes: executed 2–3 months prior to tape-out; fully extracted data from routed designs; hierarchical extraction; custom logic handled separately (Dracula, Harmony); extraction done for early and late corners.)
Figure 26: Histogram of the POWER4 processor path delays. (Axes: late-mode timing checks (thousands), 0 to 200, versus timing slack (ps), -40 to 280.)
Most wires have hundreds of picoseconds to spare. The critical path sits at the low-slack edge of the histogram.
29Thursday, September 5, 13
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek (UC Berkeley, Berkeley, CA)
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine-grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids—Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
*Please address any correspondence to [email protected]
1. Introduction
Leiserson's retiming algorithm [7] offers a polynomial-time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools [10][2] and commercial synthesis tools [13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can't benefit from retiming.
Also proposed by Leiserson et al. to meet the constraints of systolic computation is C-slow retiming.1 In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip-flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task-level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson's retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
1 This was originally defined to meet systolic slowdown requirements.
How to retime logic
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5. (Diagram: IN and OUT connected through nodes with delays 1, 1, 1, 1, 2, 2.)
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand-designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task-level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson's retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson's original paper [7].
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of
Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.
constraints to calculate the lag on each node. All these constraints are of the form x − y ≤ k, which can be solved in O(n²) time by using the Bellman/Ford shortest path algorithm. The primary constraints insure correctness: no edge will have a negative number of registers while every cycle will always contain the original number of registers. All I/O passes through an intermediate node insuring that input and output timings do not change. These constraints can be modified to insure that a particular line will contain no registers or a mandatory minimum number of registers to meet architectural constraints.
A second set of constraints attempt to insure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge, a new register weight w′ is calculated, with w′(e) = w(e) − r(u) + r(v).
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is now excluded from retiming purposes. Local synchronous resets and enables just produce small, self loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can't have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Circles are combinational logic, labelled with delays.
Critical path is 5. We want to improve it without changing circuit semantics.
Add a register, move one circle. Performance improves by 20%.
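To make the bookkeeping concrete, here is a minimal Python sketch on a hypothetical graph (the node delays, edge register counts, and lag values are invented for illustration; only the update rule w'(u,v) = w(u,v) - r(u) + r(v) and the legality conditions come from the text above):

from functools import lru_cache

delay = {"IN": 0, "A": 1, "B": 2, "C": 2, "OUT": 0}
edges = {("IN","A"): 1, ("A","B"): 0, ("B","C"): 0, ("C","OUT"): 1, ("C","B"): 1}
lag   = {"IN": 0, "A": -1, "B": 0, "C": 0, "OUT": 0}   # produced by the Bellman/Ford constraint step

def retime(edges, lag):
    # New register count on every edge: w'(u,v) = w(u,v) - r(u) + r(v)
    return {(u, v): w - lag[u] + lag[v] for (u, v), w in edges.items()}

def critical_path(edges, delay):
    # Longest chain of combinational delay crossing only register-free edges.
    # A legal retiming keeps at least one register on every cycle, so this subgraph is acyclic.
    zero = [(u, v) for (u, v), w in edges.items() if w == 0]
    @lru_cache(maxsize=None)
    def longest_from(n):
        return delay[n] + max((longest_from(v) for u, v in zero if u == n), default=0)
    return max(longest_from(n) for n in delay)

new_edges = retime(edges, lag)
assert all(w >= 0 for w in new_edges.values()), "illegal retiming: negative edge weight"
print("before:", critical_path(edges, delay), "after:", critical_path(new_edges, delay))  # 5 -> 4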
“Technology X” can often do this.
30Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Power 4: Timing Estimation, Closure
Timing Estimation: Predicting a processor's clock rate early in the project.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
31Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Power 4: Timing Estimation, Closure
Timing Closure: Meeting (or exceeding!) the timing estimate.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
32Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Floorplanning: essential to meet timing.
(Intel XScale 80200)
33Thursday, September 5, 13
34Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Break
35Thursday, September 5, 13
Simple exercises for gaining intuition about timing for your process + EDA tools.
Thanks to Bhupesh Dasila, Open-Silicon Bangalore
36Thursday, September 5, 13
Bhupesh Dasila
Synthesize gate chains using hand-specified library cells
Exercises cell library and place-and-route tools.
Lets you know how many levels of logic you can use in the best case.
Helps you “see through” ... “Technology X”.
Synthesis constrained to 2ns clock.
Spring 2003 EECS150 – Lec10-Timing, Page 11: Gate Delay
• Cascaded gates (Vin driving a chain of gates to Vout).
Delay of a chain of 3 inverters with strongest strength. “Guaranteed not to exceed” speed.
weak NANDs
Chain lengths ...
40 nm process: 29 ps/gate average.
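A back-of-the-envelope sketch using the 29 ps/gate figure; the flip-flop clk-to-Q and setup numbers are assumed, not from the slide:

def max_levels(clock_ps, ps_per_gate=29.0, t_clk_to_q=50.0, t_setup=40.0):
    # Levels of logic that fit between launch and capture flops in one cycle (best case).
    return int((clock_ps - t_clk_to_q - t_setup) // ps_per_gate)

print(max_levels(2000.0))   # the 2 ns synthesis constraint -> roughly 65 levels
print(max_levels(500.0))    # a 2 GHz target leaves far fewer levels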
37Thursday, September 5, 13
Spring 2003 EECS150 – Lec10-Timing, Page 16: Wire Delay
• Even in those cases where the transmission line effect is negligible:
– Wires possess distributed resistance and capacitance
– Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
– Typically around half of C of gate load is in the wires.
• For long wires on ICs:
– busses, clock lines, global control signals, etc.
– Resistance is significant, therefore distributed RC effect dominates.
– Signals are typically "rebuffered" to reduce delay (waveform: v1 → v2 → v3 → v4 over time).
Force P&R to drive a long wire with a known buffer cell.
Vary driver strength, wire length, metal layer.
Shows the maximum distance two gates can be placed and still meet your clock period.
Distributed RC delay growing as the square of the length is clearly seen!
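A sketch of why rebuffering helps, using the standard ~0.38·R·C estimate for the 50% delay of a distributed RC line; the per-mm resistance, per-mm capacitance, and buffer delay are assumed example values:

r_per_mm = 200.0     # ohms per mm (assumed intermediate-layer value)
c_per_mm = 0.20e-12  # farads per mm (assumed)
t_buf    = 30e-12    # seconds per inserted buffer (assumed)

def unbuffered_delay(length_mm):
    # Distributed RC: delay grows with the square of the length.
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)

def rebuffered_delay(length_mm, n_segments):
    # Break the wire into segments and pay a buffer delay per segment.
    seg = length_mm / n_segments
    return n_segments * (t_buf + unbuffered_delay(seg))

for L in (1, 2, 4, 8):
    print(f"{L} mm: {unbuffered_delay(L)*1e12:6.1f} ps unbuffered, "
          f"{rebuffered_delay(L, L)*1e12:6.1f} ps with one buffer per mm")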
Bhupesh Dasila
38Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CS250, UC Berkeley Fall '12, Lecture 04, Timing: Turning Rise/Fall Delay into Gate Delay
• Cascaded gates: "transfer curve" for inverter.
Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus large gate delay)
‣ Strategy: staged buffers
‣ Optimal trade-off between delay per stage and total number of stages ⇒ fanout of ∼4-6 per stage
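A sketch of sizing such a buffer chain under the fanout-of-~4 rule; the capacitance numbers are illustrative:

import math

def buffer_chain(c_load_ff, c_in_ff=1.0, fanout=4.0):
    # Number of stages so that each stage drives roughly `fanout` times its input capacitance.
    total = c_load_ff / c_in_ff
    n = max(1, round(math.log(total) / math.log(fanout)))
    stage_fanout = total ** (1.0 / n)
    return n, stage_fanout

n, f = buffer_chain(c_load_ff=1000.0)   # e.g., driving a load 1000x the first gate's input
print(f"{n} stages, actual fanout per stage ~{f:.1f}")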
39Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Register file: Synthesize, or use SRAM?
(Schematic: registers R1–R31 plus R0 hardwired to the constant 0, all clocked by clk; each register's 32-bit Q output feeds two 32-to-1 muxes selected by sel(rs1) and sel(rs2) to form read ports rd1 and rd2 ("two read ports"); a 5-bit sel(ws) drives a demux that steers the write enable WE to one register's En input, with wd as the 32-bit write data on every D input.)
Speed will depend on how large it lays out ...
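A behavioral sketch (a software model, not RTL) of the register file drawn above, with R0 hardwired to zero and one write port gated by WE:

class RegFile:
    def __init__(self):
        self.regs = [0] * 32          # R0..R31; R0 is the constant 0

    def read(self, rs1, rs2):
        # Two read ports: each is conceptually a 32-to-1 mux on the Q outputs.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # Write port: the demux steers WE to one register's enable input.
        if we and ws != 0:
            self.regs[ws] = wd & 0xFFFFFFFF

rf = RegFile()
rf.clock_edge(we=1, ws=3, wd=42)
print(rf.read(3, 0))   # (42, 0)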
40Thursday, September 5, 13
Figure 3: Using the raw area data, the physical implementation team can get a more accurate area estimation early in the RTL development stage for floorplanning purposes. This shows an example of this graph for a 1-port, 32-bit-wide SRAM.
Synthesized, custom, and SRAM-based register files, 40nm
For small register files, logic synthesis is competitive.
Not clear if the SRAM data points include area for register control, etc.
Registerfile compiler
Synthesis
SRAMS
Bhupesh Dasila
41Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Techniques
42Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Pipelining
43Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Starting point: A single-cycle processor
(Datapath: PC register and +0x4 adder feed the instruction memory (Addr → Data); the register file (rs1, rs2, ws, wd, WE, rd1, rd2) and sign-extender Ext feed a 32-bit ALU with an op control; the data memory (Addr, Din, Dout, WE) and a MemToReg mux select the writeback value.)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
CPI == 1: This is good. Slow clock: This is bad.
Challenge: Speed up clock while keeping CPI == 1
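A worked example of the performance equation with made-up numbers, comparing a CPI == 1 machine with a slow clock against a faster-clocked pipeline whose CPI is slightly above 1:

def runtime_s(instructions, cpi, clock_hz):
    # Seconds/Program = (Instructions/Program) * (Cycles/Instruction) * (Seconds/Cycle)
    return instructions * cpi / clock_hz

insns = 1_000_000
print(runtime_s(insns, cpi=1.0, clock_hz=100e6))   # single-cycle, 100 MHz: 0.010 s
print(runtime_s(insns, cpi=1.2, clock_hz=400e6))   # pipelined,   400 MHz: 0.003 s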
44Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Reminder: How data flows after posedge
(Datapath: after the rising edge, PC drives the instruction memory, the fetched fields select the register file read ports (rs1, rs2 → rd1, rd2), and the ALU computes on the read data; PC + 0x4 forms the next PC.)
45Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Next posedge: Update state and repeat
(Datapath: on the next rising edge the register file write port (ws, wd, WE) and the PC register capture their new values.)
46Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Observation: Logic idle most of cycle
(Single-cycle datapath as above: PC, instruction memory, register file, Ext, ALU, data memory, MemToReg.)
For most of cycle, ALU is either “waiting” for its inputs, or “holding” its output
Ideal: a CPU architecture where each part is always “working”.
47Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. The assembly line moves on a steady clock.
Each station does the same task on each car.
(Diagram: a car body shell and a car chassis arrive at a merge station, then a bolting station, all advancing on the clock.)
48Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. Simpler station tasks → more cars per hour. Simple tasks take less time, clock is faster.
49Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. Line speed limited by slowest task.
Most efficient if all tasks take same time to do
50Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. Simpler tasks, complex car → long line!
These lines go 24 x 7, and rarely shut down.
51Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Lessons from car assembly lines
Faster line movement yields more cars per hour off the line.
Faster line movement requires more stages, each doing simpler tasks.
To maximize efficiency, all stages should take same amount of time (if not, workers in fast stages are idle)
"Filling", "flushing", and "stalling" the assembly line are all bad news.
52Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Key Analogy: The instruction is the car
(Diagram: PC register and +0x4 adder feed the instruction memory; the fetched instruction is captured in an IR and copied down a chain of IR pipeline registers.)
Instruction Fetch is Pipeline Stage #1. The IR in Stage #2 controls the hardware in stage 2, the IR in Stage #3 controls the hardware in stage 3, and likewise for Stages #4 and #5.
"Data-stationary control"
53Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Example: Decode & Register Fetch Stage
(Diagram: Stage #1, Instr Fetch: PC, +0x4, instruction memory, IR. Stage #2, Decode & Reg Fetch: register file read ports (rs1, rs2, rd1, rd2), sign-extender Ext, and pipeline registers A, B, M, IR. Stage #3 follows.)
ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8
A sample program
R’s chosen so that instructions are
independent - like cars on the line.
54Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Hazards: An instruction is not a car ...
(Pipelined datapath as above: Stage #1 Instr Fetch, Stage #2 Decode & Reg Fetch with pipeline registers A, B, M, IR, then Stage #3.)
ADD R4,R3,R2
OR R5,R4,R2
An example of a “hazard” -- we must
(1) detect and (2) resolve all hazards
to make a CPU that matches ISA
R4 not written yet ... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!
New sample program: ADD R4,R3,R2 followed by OR R5,R4,R2.
55Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Decode & Reg Fetch
Performance Equation and Hazards
(Pipelined datapath as above: Instr Fetch, Decode & Reg Fetch, Stage #3.)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
"Software slows the machine down." (Seymour Cray)
Some ways to cope with hazards: "stalling the pipeline" makes CPI > 1; added logic to detect and resolve hazards increases the clock period.
56Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Superpipelining
57Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Superpipelining: Add more stages.
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Reduce critical path by adding more pipeline stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay goes to 0, FF clk-to-Q and setup.
Example: 8-stage ARM XScale: extra IF, ID, and data cache stages.
Also, power!
58Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CS 152 L10 Pipeline Intro (9), Fall 2004 © UC Regents: Graphically Representing MIPS Pipeline. Can help with answering questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed? (Diagram: IM, Reg, ALU, DM, Reg datapath with stages IF, ID+RF, EX, MEM, WB separated by IR pipeline registers.)
5 Stage
8 Stage
IF now takes 2 stages (pipelined I-cache). ID and RF each get a stage. ALU split over 3 stages. MEM takes 2 stages (pipelined D-cache).
Note: Some stages now overlap, some instructions
take extra stages.
59Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Superpipelining techniques ...
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove “rarely-used” forwarding networks that are on critical path.
Pipeline the wires of frequently used forwarding networks.
Creates stalls, affects CPI.
Also: Clocking tricks (example: use posedge and negedge registers)
60Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Hardware limits to superpipelining?
(Chart by Francois Labonte, Stanford University, 4/23/2003: cycle time in FO4, 0 to 100, for processors introduced 1985–2005: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64.)
Thanks to Francois Labonte, Stanford
FO4 Delays: CPU Clock Periods, 1985–2005. Historical limit: about 12 FO4s.
MIPS 2000: 5 stages. Pentium Pro: 10 stages. Pentium 4: 20 stages.
Power wall: Intel Core Duo has 14 stages.
FO4: How many fanout-of-4 inverter delays in the clock period.
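A small sketch converting a clock frequency into FO4 units; the 30 ps FO4 delay used here is an assumed example value, not a measured number:

def fo4_per_cycle(clock_ghz, fo4_delay_ps):
    # Clock period expressed as a multiple of one fanout-of-4 inverter delay.
    period_ps = 1e3 / clock_ghz
    return period_ps / fo4_delay_ps

# e.g., a hypothetical part whose measured FO4 delay is ~30 ps:
print(f"{fo4_per_cycle(2.0, 30.0):.1f} FO4 at 2.0 GHz")   # ~16.7 FO4/cycle
print(f"{fo4_per_cycle(1.0, 30.0):.1f} FO4 at 1.0 GHz")   # ~33.3 FO4/cycle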
61Thursday, September 5, 13
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years.
Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University
In November 1971, Intel introduced the world’s first single-chip microprocessor, the Intel 4004. It had 2,300 transistors, ran at a clock speed of up to 740 KHz, and delivered 60,000 instructions per second while dissipating 0.5 watts. The following four decades witnessed exponential growth in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today’s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.
Where did these incredible gains come from? This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford’s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.
In addition, we provide a methodology to separate the effect of technology scaling from improvements on other frontiers (e.g., architecture and software), allowing the comparison of machines built in different technologies. To demonstrate the utility of this data and analysis, we use it to decompose processor improvements into contributions from the physical scaling of devices, and from improvements in microarchitecture, compiler, and software technologies.
AN OPEN REPOSITORY OF PROCESSOR SPECSWhile information about current processors is easy to find, it is rarely arranged in a manner that is useful to the research community. For example, the data sheet may contain the processor’s power, voltage, frequency, and cache size, but not the pipeline depth or the technology minimum feature size. Even then, these specifications often fail to tell the full story: a laptop processor operates over a range of frequencies and voltages, not just the 2 GHz shown on the box label.
Not surprisingly, specification data gets harder to find the older the processor becomes, especially for those that are no longer made, or worse, whose manufacturers no longer exist. We have been collecting this type of data for three decades and are now releasing it in the form of an open repository of processor specifications. The goal of CPU DB is to aggregate detailed processor specifications into a convenient form and to encourage community participation, both to leverage this information and to keep it accurate and current. CPU DB (cpudb. stanford.edu) is populated with desktop, laptop, and server processors, for which we use SPEC13 as our performance-measuring tool. In addition, the database contains limited data on embedded cores, for which we are using the CoreMark benchmark for performance.5 With time and help from the community, we hope to extend the coverage of embedded processors in the database.
(Chart: FO4 Delays Per Cycle for Processor Designs, 1985–2015; y-axis FO4/cycle, 0 to 140.)
FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.
62Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Multithreading
63Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Krste
November 10, 2004
6.823, L18--3
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on same pipeline
Pipeline diagram (one instruction enters the F D X M W pipe each cycle, threads rotating T1, T2, T3, T4, timeline t0 through t9):
t0: T1: LW r1, 0(r2) (F at t0, W at t4)
t1: T2: ADD r7, r1, r4 (F at t1, W at t5)
t2: T3: XORI r5, r4, #12 (F at t2, W at t6)
t3: T4: SW 0(r7), r5 (F at t3, W at t7)
t4: T1: LW r5, 12(r1) (F at t4, W at t8)
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
Last instruction in a thread always completes writeback before next instruction in same thread reads regfile.
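A behavioral sketch (not RTL) of the interleaving idea under the stated assumption of four threads on a non-bypassed five-stage pipe: because a thread issues only once every four cycles, its previous instruction has written back before its next instruction reaches decode.

```python
# Sketch: round-robin interleaving of 4 threads on a 5-stage (F D X M W) pipe.
# With 4 threads and a 5-stage pipe, an instruction's W (cycle t+4) finishes
# before the same thread's next instruction reads registers in D (cycle t+5),
# so no bypassing or interlocks are needed between instructions of one thread.

STAGES = ["F", "D", "X", "M", "W"]
NUM_THREADS = 4

def schedule(num_cycles=12):
    rows = {}  # (thread, instruction index) -> list of (cycle, stage)
    for t in range(num_cycles):
        thread = t % NUM_THREADS          # thread select rotates every cycle
        instr = t // NUM_THREADS          # which instruction of that thread
        for s, stage in enumerate(STAGES):
            rows.setdefault((thread, instr), []).append((t + s, stage))
    return rows

for (thread, instr), occupancy in sorted(schedule().items()):
    cells = " ".join(f"t{c}:{s}" for c, s in occupancy)
    print(f"T{thread + 1} instr {instr}: {cells}")
```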
KrsteNovember 10, 2004
6.823, L18--5
Simple Multithreaded Pipeline
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
[Diagram: a 2-bit thread select chooses among four per-thread PCs (each with a +1 incrementer); the selected PC indexes the I$ and IR, and the thread select also chooses among four per-thread register files (GPR1 x 4) feeding the X and Y operand registers and the D$.]
Multithreading of Static Pipelines
4 CPUs, each running at 1/4 clock.
Many variants ...
64Thursday, September 5, 13
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver*, UC Berkeley, Berkeley, CA
Yury Markovskiy, UC Berkeley, Berkeley, CA
Yatish Patel, UC Berkeley, Berkeley, CA
John Wawrzynek, UC Berkeley, Berkeley, CA
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids—Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
*Please address any correspondence to [email protected]
1. Introduction
Leiserson’s retiming algorithm [7] offers a polynomial time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools [10][2] and commercial synthesis tools [13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can’t benefit from retiming.
Also proposed by Leiserson et al to meet the constraints of systolic computation is C-slow retiming.1 In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson’s retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
1 This was originally defined to meet systolic slowdown requirements.
At the logic level ...
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson’s retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson’s original paper [7].
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints
Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.
to calculate the lag on each node. All these constraints are of the form x − y ≤ k and can be solved in O(n²) time using the Bellman-Ford shortest path algorithm. The primary constraints ensure correctness: no edge will have a negative number of registers, while every cycle will always contain the original number of registers. All I/O passes through an intermediate node, ensuring that input and output timings do not change. These constraints can be modified to ensure that a particular line will contain no registers, or a mandatory minimum number of registers to meet architectural constraints.
A second set of constraints attempts to ensure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge, a new register weight w′ is calculated, with w′(e) = w(e) − r(u) + r(v).
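Since the constraints are all of the difference form r(u) − r(v) ≤ k, one standard way to solve them is Bellman-Ford over a constraint graph; the sketch below illustrates that step only, and the three-node constraint set at the bottom is made up for illustration rather than taken from the paper's figures.

```python
# Sketch: solve difference constraints r(u) - r(v) <= k with Bellman-Ford.
# Each constraint becomes a graph edge v -> u with weight k; shortest-path
# distances from a virtual source (distance 0 to every node) give one valid
# assignment of lags r(.), or report infeasibility via a negative cycle.

def solve_lags(nodes, constraints):
    """constraints: list of (u, v, k) meaning r(u) - r(v) <= k."""
    dist = {n: 0 for n in nodes}           # virtual source: 0-weight edge to all
    edges = [(v, u, k) for (u, v, k) in constraints]
    for _ in range(len(nodes)):            # enough rounds for |V|+1 vertices
        for v, u, k in edges:
            if dist[v] + k < dist[u]:
                dist[u] = dist[v] + k
    for v, u, k in edges:                  # still relaxable => negative cycle
        if dist[v] + k < dist[u]:
            return None                    # constraints are infeasible
    return dist                            # dist[n] is a valid lag r(n)

# Made-up example: r(a) - r(b) <= 0, r(b) - r(c) <= -1, r(c) - r(a) <= 2
print(solve_lags(["a", "b", "c"], [("a", "b", 0), ("b", "c", -1), ("c", "a", 2)]))
```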
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is excluded from retiming. Local synchronous resets and enables just produce small, self loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can’t have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Normal edge from u → v: r(u) − r(v) ≤ w(e)
Edge from u → v that must be registered: r(u) − r(v) ≤ w(e) − 1
Edge from u → v that can never be registered: r(u) − r(v) ≤ 0 and r(v) − r(u) ≤ 0
Critical paths that must be registered: r(u) − r(v) ≤ W(u, v) − 1, for all u, v such that D(u, v) > P
Table 1: The constraint system used by the retiming process.
Figure 3: The example in Figure 1, 2-slowed. This design now operates on 2 independent data streams.
left unchanged or switched to operate on the positive edge with constraints to mandate the placement of registers.2
Initial register conditions can also be calculated if desired, but this process is NP hard in the general case. Cong and Wu [3] have an algorithm that computes initial states by restricting the design to forward retiming only, so it propagates the information and registers forward throughout the computation. This is because solving initial states for all registers moved forward is straightforward, but backwards movement is NP hard3 as it reduces to satisfiability.
An important question is how to deal with multiple clocks. If the interfaces between the clock domains are registered by clocks from both domains, and with all signals being unidirectional, each clock domain can be treated as an independent block with all signals crossing the domain treated as I/O. Due to the retiming-imposed constraints on I/O, the logical view of each input will not change. However, constraints may be needed to ensure that the physical registers remain in position to prevent asynchronous conditions from occurring on this interface.
3. C-slow retiming
C-slowing enhances retiming by simply replacing every register with a sequence of C separate registers before retiming occurs. The resulting design operates on C distinct execution tasks. Since all registers are duplicated, the computation proceeds in a round-robin fashion. The easiest way to utilize a C-slowed block is to simply multiplex
2 For some cases, this may produce a set of unsolvable constraints, thus requiring that the memory remain a negative edge device.
3 And may not possess a valid solution for nonsensical cases.
Figure 4: The example in Figure 3 after retiming. The combination of C-slowing and retiming reduced the critical path from 5 to 2.
and demultiplex C separate data streams, but a more sophisticated interface may be desired depending on the application.
One possible interface is to register all inputs and outputs of a C-slowed block. Because of the additional edges retiming creates to track I/Os and to ensure a consistent interface, every stream of execution presents all outputs at the same time, with all inputs being registered on the next cycle. If part of the design is C-slowed, but all operate on the same clock, the resulting design can be retimed as a complete whole while preserving all other semantics. We use these observations later when discussing the effects of C-slowing on a microprocessor core.
However, C-slowing imposes some more significant FPGA design constraints, as summarized in Table 2. Register clock enables and resets must be expressed as logic features, since each independent thread must see a different view of the reset or enable. Thus, they can remain features in the design but can’t be implemented by current FPGAs using the native enables and resets. Other specialized features, such as Xilinx SRL16s,4 can’t be utilized in a C-slow design for the same reasons.
One important issue is how to properly C-slow memory blocks. In most cases, one desires the complete illusion that each stream of execution is completely independent and unchanged. To create this illusion, the memory must be increased by a factor of C, with additional address lines driven by a counter. This ensures that each stream of execution enjoys a completely separate memory space.
For dual ported memories, this potentially enables a greater freedom in retiming: the two ports can have different lags, as long as the difference in lag is C − 1 or less. After retiming, the difference in lag is added to the appropriate port’s thread counter. This ensures that each
4 A mode where the LUT can act as a 16-bit shift register
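A small behavioral sketch of the memory trick just described: the array grows by a factor of C and a round-robin thread counter supplies the extra address bits, giving each execution stream a private memory space. The class and its interface are illustrative, not the tool's implementation.

```python
# Sketch: C-slowed memory model. Capacity grows by a factor of C and a
# round-robin thread counter supplies the extra (high) address bits, so each
# of the C interleaved execution streams sees a private copy of the memory.

class CSlowedMemory:
    def __init__(self, depth, c):
        self.depth = depth                    # words visible to each thread
        self.c = c                            # number of interleaved threads
        self.data = [0] * (depth * c)         # physical array is C x larger
        self.thread = 0                       # advances one thread per cycle

    def tick(self):
        self.thread = (self.thread + 1) % self.c

    def _phys(self, addr):
        return self.thread * self.depth + addr   # thread counter = high bits

    def read(self, addr):
        return self.data[self._phys(addr)]

    def write(self, addr, value):
        self.data[self._phys(addr)] = value

mem = CSlowedMemory(depth=16, c=2)
mem.write(3, 111)          # thread 0 writes its address 3
mem.tick()
mem.write(3, 222)          # thread 1 writes its own address 3
mem.tick()
print(mem.read(3))         # thread 0 still sees 111
```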
Synchronous logic we want to “multithread”. Critical path is 5.
2X multi-threading: double each register.
Modern synthesis will retime this as shown: critical path is now 2.
65Thursday, September 5, 13
Good fit for GALS
Two input queues (red and green). The mux control logic implements turn-taking.
Outputs placed into two output queues.
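A behavioral sketch of the turn-taking idea; the queue names, the 'process' stand-in for the synchronous block, and the give-up-the-turn-when-empty rule are all illustrative assumptions.

```python
# Sketch: turn-taking mux between two input queues feeding one shared block,
# with results demuxed into two output queues. Queue names and the 'process'
# function are illustrative placeholders.

from collections import deque

in_q = {"red": deque([1, 2, 3]), "green": deque([10, 20])}
out_q = {"red": deque(), "green": deque()}

def process(x):
    return x * 2            # stand-in for the shared synchronous block

turn = "red"
for _ in range(8):          # a few cycles of operation
    other = "green" if turn == "red" else "red"
    src = turn if in_q[turn] else other       # give up the turn if empty
    if in_q[src]:
        out_q[src].append(process(in_q[src].popleft()))
    turn = other                              # alternate turns each cycle

print(dict(out_q))
```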
66Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Crossbar Networks
67Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
When register files get big, they get slow.
[Diagram: a 32 x 32-bit register file built from flip-flops. R0 is the constant 0; R1–R31 are enabled registers clocked by clk. Two 32-input, 32-bit muxes, selected by the 5-bit sel(rs1) and sel(rs2), drive the read ports rd1 and rd2. A demux on the 5-bit sel(ws), gated by WE, steers the write enable so that the 32-bit wd is written into the selected register.]
Even worse: adding ports slows down as O(N²) ...
Why? Number of loads on each Q goes as O(N), and the wire length to the port mux goes as O(N).
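A back-of-the-envelope model of that argument: treat the read-path delay as the product of an O(N) fanout term and an O(N) wire term, so it grows roughly as O(N²). The constants are arbitrary normalized units, not circuit data.

```python
# Sketch: register-file read-path delay vs. number of ports, in arbitrary
# normalized units. load_delay and wire_delay constants are illustrative.

def read_delay(num_ports, load_delay=1.0, wire_delay=1.0):
    fanout = num_ports * load_delay       # each Q drives one mux per read port
    wiring = num_ports * wire_delay       # wire length to the port mux grows too
    return fanout * wiring                # product grows as O(N^2)

for ports in (1, 2, 4, 8, 16):
    print(f"{ports:2d} ports -> relative delay {read_delay(ports):6.1f}")
```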
68Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 7
Fig. 2. Niagara2 block diagram.
Fig. 3. Niagara2 die micrograph.
two FBDIMM channels. These three major I/O interfaces are serializer/deserializer (SerDes) based and provide a total pin bandwidth in excess of 1 Tb/s. All the SerDes are on chip. The high levels of system integration truly make Niagara2 a “server-on-a-chip”, thus reducing system component count, complexity and power, and hence improving system reliability.
B. SPARC Core Architecture
Fig. 4 shows the block diagram of the SPARC Core. Each SPARC core (SPC) implements the 64-bit SPARC V9 instruction set while supporting concurrent execution of eight threads. Each SPC has one load/store unit (LSU), two Execution units (EXU0 and EXU1), and one Floating Point and Graphics Unit (FGU). The Instruction Fetch unit (IFU) and the LSU contain an 8-way 16 kB Instruction cache and a 4-way 8 kB Data cache respectively. Each SPC also contains a 64-entry Instruction-TLB (ITLB), and a 128-entry Data-TLB (DTLB). Both the TLBs are fully associative. The Memory Management Unit (MMU) supports 8 K, 64 K, 4 M, and 256 M page sizes and has Hardware
Fig. 4. SPC block diagram.
Fig. 5. Integer pipeline: eight stages.
Fig. 6. Floating point pipeline: 12 stages.
TableWalk to reduce TLB miss penalty. “TLU” in the block diagram is the Trap Logic Unit. The “Gasket” performs arbitration for access to the Crossbar. Each SPC also has an advanced Cryptographic/Stream Processing Unit (SPU). The combined bandwidth of the eight Cryptographic units from the eight SPCs is sufficient for running the two 10 Gb Ethernet ports encrypted. This enables Niagara2 to run secure applications at wire speed.
Fig. 5 and Fig. 6 illustrate the Niagara2 integer and floating point pipelines, respectively. The integer pipeline is eight stages long. The floating point pipeline has 12 stages for most operations. Divide and Square-root operations have a longer pipeline.
Crossbar networks: general case of this problem
Each DRAM channel: 50 GB/s Read, 25 GB/s Write BW. Crossbar BW: 270 GB/s total (Read + Write).
(Also shared by an I/O port, not shown)
Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels
69Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 9
Fig. 9. L2 cache row redundancy scheme.
2-cycle latency. Addresses can be hashed to distribute accesses across different sets in case of hot cache sets caused by reference conflicts. All arrays are protected by single error correction, double error detection ECC, and parity. Data from different ways and different words is interleaved to improve soft error rates.
The L2 cache used a unique row-redundancy scheme. It is implemented at the 32 kB level and is illustrated in Fig. 9. Spare rows for one array are located in the adjacent array as opposed to the same array. In other words, spare rows for the top array are located in the bottom array and vice versa. When redundancy is enabled, the incoming address is compared with the address of the defective row and if it matches, the adjacent array (which is normally not enabled) is enabled to read from or write into the spare row. Using this kind of scheme enables a large (~30%) reduction in X-decoder area. The area reduction is achieved because the multiplexing required in the X-decoder to bypass the defective row/rows in the traditional row redundancy scheme is no longer needed in this scheme.
N-well power for the Primary and L2 cache memory cells is separated out as a test hook. This allows weakening of the pMOS loads of the SRAM bit cells by raising their threshold voltage, thus enabling screening cells with marginal static noise margin. This significantly reduces defective parts per million (DPPM) and improves reliability.
Fig. 10 shows the Niagara2 Crossbar (CCX). CCX serves as a high bandwidth interface between the eight SPARC Cores, shown on top, and the eight L2 cache banks, and the non-cacheable unit (NCU) shown at the bottom. CCX consists of two blocks: PCX and CPX. PCX (“Processor-to-Cache-Transfer”) is an 8-input 9-output multiplexer (mux). It transfers data from the eight SPARC cores to the eight L2 cache banks and the NCU. Likewise, CPX (“Cache-to-Processor Transfer”) is a
Fig. 10. Crossbar.
9-input 8-output mux, and it transfers data in the reverse direction. The PCX and CPX combined provide a Read/Write bandwidth of 270 GB/s. All crossbar data transfer requests are processed using a four-stage pipeline. The pipeline stages are: Request, Arbitration, Selection, and Transmission. As can be seen from the figure, there are many possible source–destination pairs for each data transfer request. There is a two-deep queue for each source–destination pair to hold data transfer requests for that pair.
IV. CLOCKING
Niagara2 contains a mix of many clocking styles—synchronous, mesochronous and asynchronous—and hence a large number of clock domains. Managing all these clock domains and domain crossings between them was one of the biggest challenges the design team faced. A subset of synchronous methodology, ratioed synchronous clocking (RSC), is used extensively. The concept works well for functional mode while being equally applicable to at-speed test of the core using the SerDes interfaces.
A. Clock Sources and Distribution
An on-chip phase-locked loop (PLL) uses a fractional divider [8], [9] to generate Ratioed Synchronous Clocks with support for a wide range of integer and fractional divide ratios. The distribution of these clocks uses a combination of H-trees and grids. This ensures they meet tight clock skew budgets while keeping power consumption under control. Clock Tree Synthesis is used for routing the asynchronous clocks. Asynchronous clock domain crossings are handled using FIFOs and meta-stability hardened flip-flops. All clock headers are designed to support clock gating to save clock power.
Fig. 11 shows the block diagram of the PLL. Its architecture is similar to the one described in [8]. It uses a loop filter capacitor referenced to a regulated 1.1 V supply (VREG). VREG is generated by a voltage regulator from the 1.5 V supply coming
Sun Niagara II 8 x 9 Crossbar
8 ports on CPU side (one per core)
8 ports for L2 banks, plus one for I/O
4 cycle latency (715 ps/cycle).
Cycles 1-3 are for arbitration.
Transmit data on cycle 4.
100-200 wires/port (each way).
Pipelined.
70Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
A complete switch transfer (4 epochs)
Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.
Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, for different sets of requests.
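A software sketch of that pipelining: each cycle a new batch of requests enters while older batches advance toward transfer, so one transfer completes per cycle after the initial fill. The "lowest-numbered input wins" arbitration rule is a placeholder, not Niagara2's actual allocator.

```python
# Sketch: 4-stage pipelined crossbar allocation. Each cycle, a new set of
# (input, output) requests enters stage 1 while older sets advance through
# arbitration, grant notification, and data transfer. Losing requests are
# simply dropped in this sketch; "lowest input wins" is a placeholder policy.

from collections import deque

def arbitrate(requests):
    """Pick at most one winning input per requested output."""
    winners = {}
    for inp, out in sorted(requests):           # placeholder: lowest input wins
        winners.setdefault(out, inp)
    return winners                               # {output: winning input}

pipe = deque([None, None, None], maxlen=3)      # batches in stages 2-4

# One batch of requests per cycle: (input port, requested output port)
traffic = [
    [(0, 5), (1, 5), (2, 7)],
    [(3, 5), (4, 6)],
    [(0, 7), (2, 7)],
    [], [], [],
]

for cycle, new_requests in enumerate(traffic):
    done = pipe[-1]                              # batch finishing data transfer
    if done:
        for out, inp in sorted(done.items()):
            print(f"cycle {cycle}: transfer input {inp} -> output {out}")
    pipe.appendleft(arbitrate(new_requests))     # stages shift; new batch enters
```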
71Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Sun Niagara II 8 x 9 Crossbar
Every cross of blue and purple is a pass gate with a unique control signal.
72 control signals (if distributed unencoded).
72Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
73Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Sun Niagara II Crossbar Notes
Crossbar defines floorplan: all port devices should be equidistant to the crossbar.
Uniform latency between all port pairs.
Low latency: 4 cycles (less than 3 ns).
Did not scale up for 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port with two cores.
Design alternatives to crossbar?
74Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CLOS networks: from telecom world ...
Build a high-port switch by tiling fixed-sized shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.
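A quick sketch of the scaling argument behind tiling: a three-stage Clos built from small crossbar tiles needs far fewer crosspoints than a flat crossbar with the same port count, at the cost of extra stages (and hence latency). The tile sizes below are illustrative.

```python
# Sketch: crosspoint count of a flat N x N crossbar vs. a 3-stage Clos(m, n, r)
# network built from small crossbar tiles (N = n * r external ports).
# m = n gives a rearrangeably nonblocking fabric; m = 2n - 1 gives a strictly
# nonblocking one. Parameters are illustrative.

def flat_crossbar(n_ports):
    return n_ports * n_ports

def clos(m, n, r):
    ingress = r * (n * m)       # r tiles of size n x m
    middle = m * (r * r)        # m tiles of size r x r
    egress = r * (m * n)        # r tiles of size m x n
    return ingress + middle + egress

n, r = 8, 8                     # 64 external ports built from 8-port tiles
N = n * r
print("flat crossbar crosspoints:", flat_crossbar(N))
print("Clos, m = n:              ", clos(n, n, r))
print("Clos, m = 2n - 1:         ", clos(2 * n - 1, n, r))
```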
75Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CLOS networks: an example route
Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration still needed to prevent blocking.
76Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Electrical Details
77Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Flip Flops Revisited
78Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Recall: Static RAM cell (6 Transistors)
[Diagram: 6T SRAM cell, two cross-coupled inverters storing x and x̄, with noise sources and Gnd, Vdd, and Vth labels.]
79Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Recall: Positive edge-triggered flip-flop
D Q A flip-flop “samples” right before the edge, and then “holds” value.
[Figure (EECS150 Spring 2003, Lec10-Timing): a positive edge-triggered flip-flop built from two latches in series, plus a timing diagram of D, clk, and Q annotated with setup time and clock-to-Q delay. The first latch is the sampling circuit; the second latch holds the value. Setup time results from delay through the first latch; clock-to-Q delay results from delay through the second latch.]
16 Transistors: Makes an SRAM look compact! What do we get for the 10 extra transistors?
Clocked logic semantics.80Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Sensing: When clock is low
D Q: A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: the same two-latch flip-flop with clk = 0, clk’ = 1. The first latch (the sampling circuit) is sensing D; the second latch holds the previously captured value.]
Will capture new value on posedge.
Outputs last value captured.
81Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Capture: When clock goes high
D Q: A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: the same two-latch flip-flop with clk = 1, clk’ = 0. The first latch has captured D; the second latch drives the captured value onto Q.]
Remembers value just captured.
Outputs value just captured.
82Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Flip Flop delays:
D Q
[Figure: the two-latch flip-flop and a timing diagram of D, clk, and Q, annotated with setup time and clock-to-Q delay.]
CLK == 0: sense D, but Q outputs the old value.
CLK 0 -> 1: capture D, pass the captured value to Q.
Where in this circuit do the clk-to-Q, setup, and hold times come from?
83Thursday, September 5, 13
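A small sketch of how these three parameters enter a register-to-register timing check: clk-to-Q plus worst-case logic plus setup bounds the clock period, while clk-to-Q plus best-case logic must cover the hold requirement. The delay numbers are made-up example values, and clock skew is ignored.

```python
# Sketch: setup and hold checks for a register -> logic -> register path.
# Setup:  clk_to_q + max logic delay + setup  <= clock period
# Hold:   clk_to_q + min logic delay          >= hold
# All values below are made-up example numbers in nanoseconds; skew ignored.

clk_to_q = 0.15
setup = 0.10
hold = 0.05
logic_max = 0.90      # worst-case combinational delay between the flops
logic_min = 0.20      # best-case (contamination) delay

min_period = clk_to_q + logic_max + setup
print(f"min clock period = {min_period:.2f} ns "
      f"(max frequency ~ {1.0 / min_period:.2f} GHz)")

hold_slack = clk_to_q + logic_min - hold
print(f"hold slack = {hold_slack:+.2f} ns "
      f"({'OK' if hold_slack >= 0 else 'hold violation'})")
```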
UC Regents Fall 2013 © UCBCS 250 L3: Timing
More Detailed Gate Models
84Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverters: Circuits and Layout
[Schematic: CMOS inverter, input Vin, output Vout, powered from Vdd, alongside its logic symbol.]
85Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverter: Die Cross Section
[Cross section: nMOS (n+ regions in the p- substrate) and pMOS (p+ regions in an n-well) under gate oxide, with both gates driven by Vin and both drains tied to Vout.]
86Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverters with Vin = Gnd, Vout = Vdd
[Schematic: the inverter with the pMOS current Isd and nMOS current Ids annotated, and each device's source (Vs) and drain (Vd) labeled.]
Is Vsg > Vt ?
Isd = k (W/L) [Vsg -Vt] [Vsd]
Ids ≈0, but really a small leakage current
Is Vsd > Vsg - Vt once Vout is Vdd?
This goes as close to 0 as it can while still supplying the leakage current.
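A numeric sketch of that last point: with Vin = Gnd the pMOS only has to source the nMOS leakage current, so solving the linear-region equation above for Vsd shows Vout sitting essentially at Vdd. All device parameters are made-up example values.

```python
# Sketch: with Vin = Gnd, the on pMOS only supplies the nMOS leakage current.
# Solving the linear-region equation Isd = k (W/L) (Vsg - Vt) Vsd for Vsd shows
# how close Vout gets to Vdd. All device parameters are made-up example values.

vdd = 1.0          # V
vt = 0.3           # V
k_wl = 1e-3        # k * (W/L), in A/V^2
i_leak = 1e-9      # nMOS leakage current the pMOS must supply, in A

vsg = vdd          # gate at Gnd, source at Vdd
vsd = i_leak / (k_wl * (vsg - vt))   # from Isd = k (W/L) (Vsg - Vt) Vsd
vout = vdd - vsd

print(f"Vsd  ~= {vsd:.2e} V")
print(f"Vout ~= {vout:.9f} V (essentially Vdd)")
```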
87Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverters with Vin = Vdd, Vout = Gnd
[Schematic: the inverter with the nMOS current Ids and pMOS current Isd annotated, and each device's source (Vs) and drain (Vd) labeled.]
Is Vgs > Vt ?
Ids = k (W/L) [Vgs - Vt] [Vds]
Is Vds > Vgs - Vt once Vout is Gnd?
Isd ≈0, but really a small leakage current
This goes as close to 0 as it can while still supplying the leakage current.
88Thursday, September 5, 13
On Tuesday ... Power and Energy
Heat Sink
Heat Source
89Thursday, September 5, 13