UC Regents Fall 2013 © UCB CS 250 L3: Timing
2013-9-5 Professor Jonathan Bachrach
today’s lecture by John Lazzaro
CS 250 VLSI System Design
Lecture 3 – Timing
www-inst.eecs.berkeley.edu/~cs250/
TA: Ben Keller
... everything doesn’t happen at once.
Timing, the 10,000 ft view. Locally synchronous, globally asynchronous.
On the same page. Minimal set of timing concepts you need for the project.
Break
RTL Examples. Better timing through micro-architecture.
Electrical details. Just so you know ...
View from 10,000 Ft.
Google I/O, 2012
Moore's Law: 2 thousand, then 1 million, then 2.6 billion transistors.
Synchronous logic on a single clock domain is not practical for
a 2.6 billion transistor design
GALS: Globally Asynchronous, Locally Synchronous
The basic GALS method focuses on point-to-point communication between blocks.

FIFO solutions. Another approach to interfacing locally synchronous blocks is using specially designed asynchronous FIFO buffers [8-10] and hiding the system synchronization problem within the FIFO buffers. Such a system can tolerate very large interconnect delays and is also robust with regard to metastability. Designers can use this method to interconnect asynchronous and synchronous systems and also to construct synchronous-synchronous and asynchronous-asynchronous interfaces. Figure 2 diagrams a typical FIFO interface, which achieves an acceptable data throughput [8]. In addition to the data cells, the FIFO structure includes an empty/full detector and a special deadlock detector.

The advantage of FIFO synchronizers is that they don't affect the locally synchronous module's operation. However, with very wide interconnect data buses, FIFO structures can be costly in silicon area. Also, they require specialized complex cells to generate the empty/full flags used for flow control. The introduced latency might be significant and unacceptable for high-speed applications.
As an alternative, Beigne and Vivet designed a synchronous-asynchronous FIFO based on the bisynchronous classical FIFO design using gray code, for the specific case of an asynchronous network-on-chip (NoC) interface [10]. Their aim was to maintain compatibility with existing design solutions and to use standard CAD tools. Thus, even with some performance degradation or a suboptimal architecture, designers can achieve the main goal of designing GALS systems in the standard design environment.
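The gray coding mentioned above is what makes a bisynchronous FIFO's read and write pointers safe to sample from the other clock domain: successive gray-coded counts differ in exactly one bit, so a pointer caught mid-transition resolves to either the old or the new value, never to an unrelated count. A minimal Python sketch of that property (illustrative only, not from the lecture):

    # Sketch: why bisynchronous FIFO pointers are gray-coded.
    # Function names are illustrative, not from any particular library.

    def binary_to_gray(n: int) -> int:
        """Convert a binary count to its gray-code equivalent."""
        return n ^ (n >> 1)

    def gray_to_binary(g: int) -> int:
        """Convert a gray code back to binary with a cascade of XORs."""
        n = 0
        while g:
            n ^= g
            g >>= 1
        return n

    def bits_changed(a: int, b: int) -> int:
        """Hamming distance between two pointer encodings."""
        return bin(a ^ b).count("1")

    # Successive gray-coded pointer values differ in exactly one bit, so a
    # pointer sampled in the other clock domain while it is changing settles
    # to either the old count or the new count, never to a garbled mix.
    for count in range(16):
        g_now, g_next = binary_to_gray(count), binary_to_gray(count + 1)
        assert bits_changed(g_now, g_next) == 1
        assert gray_to_binary(g_now) == count
    print("each pointer increment changes exactly one bit")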
Boundary synchronization. A third solution is to perform data synchronization at the borders of the locally synchronous island, without affecting the inner operation of locally synchronous blocks and without relying on FIFO buffers. For this purpose, designers can use standard two-flop, one-flop, predictive, or adaptive synchronizers for mesochronous systems, or locally delayed latching [1, 11]. This method can achieve very reliable data transfer between locally synchronous blocks. On the other hand, such solutions generally increase latency and reduce data throughput, resulting in limited applicability for high-speed systems. Table 1 summarizes the properties of GALS systems' synchronization methods.
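For the two-flop synchronizers mentioned above, a behavioral Python sketch (purely illustrative, not from the lecture) shows the cost they impose: the signal crossing into the destination domain is delayed by an extra destination-clock cycle, which is what gives a metastable first flop time to settle before the rest of the logic sees the value.

    # Sketch: behavioral model of a standard two-flop synchronizer.
    # It models only the added latency, not metastability itself.

    class TwoFlopSynchronizer:
        def __init__(self):
            self.ff1 = 0   # first flop samples the asynchronous input
            self.ff2 = 0   # second flop gives a metastable ff1 a cycle to settle

        def clock(self, async_in: int) -> int:
            """Advance one rising edge of the destination-domain clock."""
            self.ff2, self.ff1 = self.ff1, async_in
            return self.ff2

    sync = TwoFlopSynchronizer()
    # A level that rises just before destination edge 1 is not visible to
    # destination-domain logic until after edge 2.
    for cycle, level in enumerate([0, 1, 1, 1, 1]):
        print(f"edge {cycle}: async_in={level} synchronized_out={sync.clock(level)}")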
Advantages and limitations of GALS solutions. The scientific community has shown great interest in GALS solutions and architectures in the past two decades. However, this interest hasn't culminated in many commercial applications, despite all reported advantages. There are several reasons why standard design practice has not adopted GALS techniques.
Design and system integration issues. Many proposed solutions require programmable ring oscillators. This is an inexpensive solution that allows full control of the local clock. However, it has significant drawbacks. Ring oscillators are impractical for industrial use. They need careful calibration because they are very sensitive to process, voltage, and temperature variations. Moreover, embedded ring oscillators consume additional power through continuous switching of the chained inverters. On the other hand, careful design of the delay line can reduce its power consumption to a level below that of a corresponding clock tree.
Figure 2. Typical FIFO-based GALS system.
(Source: "Globally Asynchronous, Locally Synchronous Design and Test," IEEE Design & Test of Computers.)
Synchronous modules typically 50K-1M gates, so that the synchronous logic approach works well without requiring heroics. Examples ...
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions [6, 7]. The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction [7]. If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value.
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
IBM Power 5 CPU - Dynamically Scheduled
Stars denote FIFOs that create separate synchronous domains. An example of how architecture and circuits work together.
Rocket uses GALS for accelerator interface
Your project interfaces with the RISC-V pipeline and the memory system using FIFOs.
Your timing closure is independent of the CPU logic domain.
Today: Timing insights for your project
What we're not doing. If this class was EE 241 and your project was an SRAM:
You could see through, down to the layout. Timing? Use SPICE on this hand-drawn schematic.
Technology X: The CS 250 timing challenge.
What we are doing --->
© Synopsys 2012
1986: Logic Compiler. Optimal Solutions, Inc. (aka Synopsys, Inc.)
Technology X – Provide automation and increase productivity for gate level designers
Logic Synthesis
If your accelerator is too slow ... two options:
Bottom-up: Take control away from logic synthesis. Use HDL as textual schematic. Also, use command-line tool flags.
Top-down: Rework high-level micro-architecture. Let Technology X keep its job.
Sometimes necessary. Ben is the expert, ask in discussion section.
Today.
A Logic Circuit Primer
“Models should be as simple as possible, but no simpler ...” Albert Einstein.
Inverters: A simple transistor model
Design Refinement
Informal System Requirement
Initial Specification
Intermediate Specification
Final Architectural Description
Intermediate Specification of Implementation
Final Internal Specification
Physical Implementation
refinement: increasing level of detail
Logic Components
° Wires: Carry signals from one point to another
  • Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
  • Data goes in, results come out after some propagation delay
° Flip-Flops: Storage elements
  • After a clock edge, input copied to output
  • Otherwise, the flip-flop holds its value
  • Also: a "Latch" is a storage element that is level triggered
(Symbols: a 1-bit flip-flop D -> Q, an 8-bit flip-flop D[8] -> Q[8], and an 8-bit combinational logic block.)
Elements of the design zoo
Basic Combinational Elements + DeMorgan Equivalence

Wire: Out = In
  In | Out
   0 |  0
   1 |  1

Inverter: Out = In'
  In | Out
   0 |  1
   1 |  0

NAND Gate: Out = (A • B)'
  A B | Out
  0 0 |  1
  0 1 |  1
  1 0 |  1
  1 1 |  0

NOR Gate: Out = (A + B)'
  A B | Out
  0 0 |  1
  0 1 |  0
  1 0 |  0
  1 1 |  0

DeMorgan's Theorem: (A • B)' = A' + B' and (A + B)' = A' • B', so a NAND is equivalent to an OR with inverted inputs, and a NOR is equivalent to an AND with inverted inputs.
Delay Model: CMOS
Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior: truth-table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x load
° Linear model composes
(Figure: combinational logic cell with inputs A, B, ..., X driving a load Cout; plot of delay from Va to Vout versus Cout, with slope = delay per unit load and intercept = internal delay.)
Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a "conductor"
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a "conductor"
(Vdd = 5V, GND = 0V)
Basic Components: CMOS Inverter
° Inverter Operation
(Figure: inverter circuit and symbol. A PMOS transistor connects Out to Vdd and an NMOS transistor connects Out to GND. With In = "0", the NMOS is open and the PMOS charges Out up to Vdd; with In = "1", the PMOS is open and the NMOS discharges Out to GND.)
pFET: a switch. "On" if gate is grounded.
nFET: a switch. "On" if gate is at Vdd.
Correctly predicts logic output for simple static CMOS circuits.
Extensions to model subtler circuit families, or to predict timing, have not worked well ...
Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...
An "on" p-FET fills up the capacitor with charge.
An "on" n-FET empties the bucket.
(Plots of water level versus time: the level rises from "0" toward "1" as the on p-FET fills the bucket, and falls from "1" toward "0" as the on n-FET empties it.)
This model is often good enough ...
(Cartoon physics)
What is the bucket? A gate’s “fan-out”.
Driving other gates slows a gate down.
Gate Switching Behavior (waveforms for an inverter and for a NAND gate).
Driving wires slows a gate down.
“Fan-out”: The number of gate inputs driven by a gate’s output.
Driving its own parasitics slows a gate down.
Fanout
A closer look at fan-out ...
Series Connection
(Figure: gate G1 drives gate G2 through node V1 with capacitance C1; G2 drives Vout with load Cout. Waveforms of Vin, V1, and Vout crossing Vdd/2 define the delays d1 and d2.)
° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
  • Capacitance of the wire connecting the two gates
  • Input capacitance of the second inverter
Calculating Aggregate Delays
(Figure: gate G1 drives node V1 with capacitance C1; V1 fans out to gates G2 and G3, whose outputs are V2 and V3.)
° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
  • Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
  • Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
Characterize a Gate
° Input capacitance for each input
° For each input-to-output path, for each output transition type (H->L, L->H, H->Z, L->Z ... etc.):
  - Internal delay (ns)
  - Load-dependent delay (ns / fF)
° Example: 2-input NAND Gate
(Figure: plot of Delay A -> Out, Out: Low -> High, versus Cout; intercept 0.5 ns, slope = 0.0021 ns / fF.)
For A and B: Input Load (I.L.) = 61 fF
For either A -> Out or B -> Out:
  Tlh = 0.5 ns, Tlhf = 0.0021 ns / fF
  Thl = 0.1 ns, Thlf = 0.0020 ns / fF
A Specific Example: 2-to-1 MUX
Y = (A and !S) or (B and S)
(Figure: the mux built from an inverter (Gate 1) on S and NAND gates, connected by Wire 0, Wire 1, and Wire 2; inputs A, B, S, output Y.)
° Input Load (I.L.)
  • A, B: I.L. (NAND) = 61 fF
  • S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load-Dependent Delay (L.D.D.): Same as Gate 3
  • TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
  • TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
  • TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
Linear model works for reasonable fan-out.
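As a concrete use of the linear model, the following Python sketch composes two stages of the mux's A -> Y path using the NAND numbers from the slides (61 fF input load, 0.5 ns internal low-to-high delay, 0.0021 ns/fF load-dependent delay). The wire and output load capacitances are hypothetical placeholders, and the low-to-high coefficients are reused for both stages just to keep the arithmetic simple.

    # Sketch: linear gate-delay model from the slides,
    #   delay = internal_delay + load_dependent_delay * load_capacitance.
    # NAND parameters are from the example; wire/output loads are made up.

    def gate_delay(t_internal_ns: float, t_per_fF: float, load_fF: float) -> float:
        return t_internal_ns + t_per_fF * load_fF

    NAND_IL_fF  = 61.0     # input load of each NAND input (slide value)
    NAND_TLH_ns = 0.5      # internal low->high delay (slide value)
    NAND_TLHF   = 0.0021   # load-dependent delay in ns per fF (slide value)

    wire_fF     = 10.0     # hypothetical wire capacitance between the NANDs
    output_fF   = 100.0    # hypothetical external load on the mux output Y

    stage1 = gate_delay(NAND_TLH_ns, NAND_TLHF, wire_fF + NAND_IL_fF)  # NAND driving the next NAND input
    stage2 = gate_delay(NAND_TLH_ns, NAND_TLHF, output_fF)             # NAND driving the load on Y
    print(f"A -> Y path estimate: {stage1 + stage2:.3f} ns")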
Gate Delay
• Fan-out: The delay of a gate is proportional to its output capacitance, because gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)
(Figure: gate #1 driving the inputs of gates #2 and #3.)
Delay time of an inverter driving 4 inverters.
FO4: Fanout of four delay.
Driving more gates adds delay.
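A fan-out-of-four (FO4) delay is just the linear model evaluated with four identical gate inputs as the load. In the sketch below, only the inverter's 50 fF input load comes from the mux example; its internal and per-fF delays are hypothetical placeholders.

    # Sketch: FO4 delay under the linear model. Only the 50 fF inverter
    # input load comes from the slides; the other numbers are hypothetical.

    INV_IL_fF = 50.0     # inverter input load (from the mux example)
    INV_T_ns  = 0.3      # hypothetical internal delay
    INV_TF    = 0.0018   # hypothetical load-dependent delay, ns per fF

    def inverter_delay(fanout: int) -> float:
        return INV_T_ns + INV_TF * (fanout * INV_IL_fF)

    for n in (1, 4, 16):
        print(f"driving {n:2d} identical inverters: {inverter_delay(n):.3f} ns")
    # inverter_delay(4) is the FO4 delay; each added gate input adds the same
    # increment, which is why FO4 is a convenient process-normalized unit.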
Propagation delay graphs ...
Gate Delay
• Cascaded gates
(Figure: a chain of inverters; each stage's Vout-versus-Vin transfer function is shown, and the transitions alternate 1 -> 0 and 0 -> 1 along the chain.)
inverter transfer function
Worst-case delay through combinational logic
Gate Delay
• "Fan-in"
• What is the delay in this circuit?  x = g(a, b, c, d, e, f)
• Critical Path: the path with the maximum delay, from any input to any output.
  - In general, we include register set-up and clk-to-Q times in critical path calculation.
• Why do we care about the critical path?
T2 might be the worst-case delay path (critical path).
If d going 0-to-1 switches x 0-to-1, delay is T1. If a going 0-to-1 switches x 0-to-1, delay is T2. It would be surprising if T1 > T2.
Why “might”? Wires have delay too ...
Wire Delay
• Even in those cases where the transmission line effect is negligible:
  - Wires possess distributed resistance and capacitance
  - The time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important.
  - Typically around half of the C of the gate load is in the wires.
• For long wires on ICs:
  - busses, clock lines, global control signals, etc.
  - Resistance is significant, therefore the distributed RC effect dominates.
  - Signals are typically "rebuffered" to reduce delay.
(Figure: a long wire rebuffered at nodes v1, v2, v3, v4, with waveforms of v1 through v4 versus time.)
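The square-law behavior above is why long wires get rebuffered. A small Python sketch (per-unit resistance, capacitance, and buffer delay are hypothetical placeholders; the quadratic form of the distributed-RC time constant is the point):

    # Sketch: distributed-RC delay grows as the square of wire length
    # (roughly 0.5 * r * c * L^2 for a distributed line), so splitting a long
    # wire into buffered segments wins even after paying for the buffers.
    # All numbers are hypothetical placeholders.

    r_per_mm = 200.0      # ohms per mm of wire
    c_per_mm = 0.2e-12    # farads per mm of wire
    t_buffer = 30e-12     # seconds per inserted buffer

    def unbuffered_delay(length_mm: float) -> float:
        return 0.5 * r_per_mm * c_per_mm * length_mm ** 2

    def rebuffered_delay(length_mm: float, segments: int) -> float:
        seg = length_mm / segments
        return segments * unbuffered_delay(seg) + (segments - 1) * t_buffer

    for n in (1, 2, 4, 8):
        print(f"10 mm wire in {n} segment(s): {rebuffered_delay(10.0, n) * 1e12:6.0f} ps")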
Looks benign, but ...
Clocked Logic Circuits
From Delay Models to Timing Analysis

(Embedded page: IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001, p. 1600. Fig. 1 shows a process SEM cross section.)

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the [...] versus [...] dependence, and source-to-body bias is used to electrically limit transistor [...] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high-performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re[...]
Example
• Parallel-to-serial converter (two flip-flops a and b, a 2-to-1 mux, clock clk):
  T ≥ time(clk->Q) + time(mux) + time(setup)
  T ≥ τclk->Q + τmux + τsetup

  f        T
  1 MHz    1 μs
  10 MHz   100 ns
  100 MHz  10 ns
  1 GHz    1 ns
Timing Analysis: What is the smallest T that produces correct operation?
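A small calculation of that smallest T, straight from the relation on the slide; the three component delays are hypothetical placeholders:

    # Sketch: smallest correct clock period for the parallel-to-serial
    # converter, T >= t_clk_to_q + t_mux + t_setup. Numbers are hypothetical.

    t_clk_to_q = 0.15   # ns, flip-flop clock-to-Q delay
    t_mux      = 0.30   # ns, 2-to-1 mux propagation delay
    t_setup    = 0.10   # ns, flip-flop setup time

    T_min = t_clk_to_q + t_mux + t_setup
    print(f"minimum clock period: {T_min:.2f} ns")
    print(f"maximum clock rate:   {1e3 / T_min:.0f} MHz")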
Timing Analysis and Logic Delay
If our clock period T > worst-case delay through CL, does this ensure correct operation?
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior: truth-table, logic equation, VHDL
  • Input load factor of each input
  • Propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x load
° Linear model composes
(Figure: combinational logic cell with inputs A, B, ..., X driving a load Cout; plot of delay from Va to Vout versus Cout, with slope = delay per unit load and intercept = internal delay.)
Storage Element's Timing Model
(Figure: D flip-flop clocked by Clk; waveforms show the setup and hold window around the clock edge, the "don't care" regions outside it, and the clock-to-Q delay before Q becomes valid.)
° Setup Time: Input must be stable BEFORE the trigger clock edge
° Hold Time: Input must REMAIN stable after the trigger clock edge
° Clock-to-Q time:
  • Output cannot change instantaneously at the trigger clock edge
  • Similar to delay in logic gates, two components:
    - Internal Clock-to-Q
    - Load-dependent Clock-to-Q
Clocking Methodology
(Figure: registers separated by blocks of combinational logic, all clocked by the same Clk.)
° All storage elements are clocked by the same clock edge
° The combinational logic blocks:
  • Inputs are updated at each clock tick
  • All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
(Figure: register-to-register paths through combinational logic, clocked by Clk.)
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° Cycle time must be greater than:
  Clock-to-Q + Longest Path through Combinational Logic + Setup
Register: An Array of Flip-Flops
Combinational Logic
Flip Flops have internal delays ...
(Figure: D flip-flop with clock CLK.) The value of D is sampled on the positive clock edge; Q outputs the sampled value for the rest of the cycle. (Waveforms of D and Q show t_setup before the edge and t_clk-to-Q after it.)
Flip-Flop delays eat into “time budget”
(Parallel-to-serial converter example again: T ≥ τclk->Q + τmux + τsetup.)
ALU “time budget”
General Model of Synchronous Circuit
(Figure: input -> reg -> CL -> reg -> CL -> output, with a common clock input and an optional feedback path.)
• In general, for correct operation:
  T ≥ time(clk->Q) + time(CL) + time(setup)
  T ≥ τclk->Q + τCL + τsetup
  for all paths.
• How do we enumerate all paths?
  - Any circuit input or register output to any register input or circuit output.
  - The "setup time" for circuit outputs depends on what it connects to.
  - The "clk-Q time" for circuit inputs depends on where it comes from.
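The "enumerate all paths" rule is what a static timing tool does on the synthesized netlist: find the latest arrival time at every register input and compare it against the period. A toy Python sketch over a hypothetical four-gate netlist (all names and delays are made up):

    # Sketch: longest-path (critical-path) search over a tiny hypothetical
    # netlist, following T >= t_clk_to_Q + t_CL + t_setup.

    from functools import lru_cache

    T_CLK_TO_Q = 0.15   # ns, hypothetical
    T_SETUP    = 0.10   # ns, hypothetical

    # gate name -> (gate delay in ns, fan-in nodes); "regA"/"regB" are register outputs
    NETLIST = {
        "n1": (0.40, ("regA", "regB")),
        "n2": (0.25, ("n1",)),
        "n3": (0.30, ("regB",)),
        "n4": (0.20, ("n2", "n3")),   # n4 drives a register's D input
    }

    @lru_cache(maxsize=None)
    def arrival(node: str) -> float:
        """Latest arrival time at a node, measured from the launching clock edge."""
        if node.startswith("reg"):
            return T_CLK_TO_Q
        delay, fanin = NETLIST[node]
        return delay + max(arrival(f) for f in fanin)

    print(f"critical-path arrival at n4: {arrival('n4'):.2f} ns")
    print(f"minimum clock period:        {arrival('n4') + T_SETUP:.2f} ns")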
Combinational Logic
Clock skew also eats into “time budget”
Clock Skew
(Figure: register -> CL -> register, with the clock reaching the two registers at different times; CLK' is a delayed version of CLK. Clock skew = delay in distribution.)
• If the clock period T = TCL + Tsetup + Tclk->Q, the circuit will fail.
• Therefore:
  1. Control clock skew
     a) Careful clock distribution. Equalize path delay from the clock source to all clock loads by controlling wire delay and buffer delay.
     b) Don't "gate" clocks.
  2. T ≥ TCL + Tsetup + Tclk->Q + worst-case skew.
• Most modern large high-performance chips (microprocessors) control end-to-end clock skew to a few tenths of a nanosecond.
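A quick look at how much the skew term costs, using the constraint on the slide; all numbers are hypothetical placeholders:

    # Sketch: clock skew eats into the cycle-time budget,
    #   T >= T_CL + T_setup + T_clk_to_Q + worst_case_skew.
    # All numbers are hypothetical.

    T_clk_to_Q = 0.15   # ns
    T_setup    = 0.10   # ns
    T_CL       = 3.00   # ns through the combinational logic
    skew       = 0.20   # ns of worst-case clock skew

    T_no_skew   = T_CL + T_setup + T_clk_to_Q
    T_with_skew = T_no_skew + skew
    print(f"min period, zero skew:   {T_no_skew:.2f} ns ({1e3 / T_no_skew:.0f} MHz)")
    print(f"min period, 0.2 ns skew: {T_with_skew:.2f} ns ({1e3 / T_with_skew:.0f} MHz)")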
Clock Skew (cont.)
(Figure: the same register -> CL -> register path, but with the clock buffer reversed so the delayed clock CLK' now reaches the other register.)
• Note the reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!
As T → 0, which circuit fails first?
... the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15,200 clock pins on the chip, regardless of proximity.

From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.

Design methodology and results. This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid.

Figure 6. Schematic diagram of global clock generation and distribution (PLL, bypass, reference clock in/out, clock distribution, clock out).
Figure 7. 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.
Figure 8. Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.
Clock Tree Delays, IBM “Power” CPU
Clock Tree Delays, IBM Power
At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved, to guarantee feasibility and speed the design process.

Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a "correct-by-design" clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.

Circuit design. The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic, an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.

The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology, including uncertainties associated with the modeling of the floating-body effect [21-23] and its impact on noise immunity [22, 24-27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.

Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.

The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description [...]
Figure 9: Global clock waveforms showing 20 ps of measured skew. (Axes: Volts (V), 0.0 to 1.5, versus Time (ps), 0 to 2500.)
26Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Some Flip Flops have “hold” time ...
(Timing diagram: D must stay stable from t_setup before the CLK edge until t_hold after it. Circuit: a D flip-flop clocked by CLK, with its Q output fed back to its D input through an inverter.)
Does flip-flop hold time affect operation of this
circuit? Under what conditions?
t_inv
What is the intended function of this circuit?
For correct operation: t_clk-to-Q + t_inv > t_hold.
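A minimal sketch of this check in Python (not from the slides; the delay values below are made-up examples, only the inequality above comes from the slide):

def hold_ok(t_clk_to_q, t_logic_min, t_hold, t_skew=0.0):
    # Data launched by a clock edge must not arrive before the capturing
    # flip-flop's hold window ends: t_clk_to_q + t_logic(min) > t_hold + skew.
    return t_clk_to_q + t_logic_min > t_hold + t_skew

print(hold_ok(t_clk_to_q=45.0, t_logic_min=30.0, t_hold=20.0))   # True: safe
print(hold_ok(t_clk_to_q=45.0, t_logic_min=30.0, t_hold=90.0))   # False: hold violation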
27Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Searching for processor critical path
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Timing Analysis: What is the smallest T that produces correct operation? Must consider all connected register pairs. Why might I suspect this one?
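A sketch of the check a static timing tool performs, in Python; the register pairs and delay numbers below are invented placeholders, only the form of the setup constraint is assumed:

# Each path: (name, launch FF clk-to-Q, max combinational delay, capture FF setup), in ps.
paths = [
    ("regfile -> ALU -> regfile", 50, 610, 40),
    ("PC -> I-mem -> IR",         50, 480, 40),
    ("IR -> decode -> regfile",   50, 350, 40),
]

def min_period(paths, clock_skew=0.0):
    # T must satisfy T >= t_clk_to_q + t_logic_max + t_setup + skew for every pair.
    return max(c2q + logic + setup + clock_skew for _, c2q, logic, setup in paths)

T = min_period(paths)
print(f"smallest T = {T} ps  ({1e6 / T:.0f} MHz)")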
28Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Combinational paths for IBM Power 4 CPU
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121,713 were top-level chip global nets, and 21,711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor, 1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project's success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing. (Flow diagram: core or chip wiring → extraction → Chipbench/EinsTimer timing, covering non-uplift timing, noise impact on timing, uplift analysis, and capacitance adjust → timer files, asserts, reports, Spice, GL/1; step turnaround budgets range from < 12 hr to < 48 hr. Notes: executed 2–3 months prior to tape-out; fully extracted data from routed designs; hierarchical extraction; custom logic handled separately (Dracula, Harmony); extraction done for early and late corners.)
Figure 26: Histogram of the POWER4 processor path delays. (Axes: late-mode timing checks (thousands), 0 to 200, versus timing slack (ps), -40 to 280.)
Most wires have hundreds of picoseconds to spare. The critical path sits at the low-slack edge of the histogram.
29Thursday, September 5, 13
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek (UC Berkeley, Berkeley, CA)
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine-grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids—Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
*Please address any correspondence to [email protected]
1. Introduction
Leiserson's retiming algorithm [7] offers a polynomial-time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools [10][2] and commercial synthesis tools [13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can't benefit from retiming.
Also proposed by Leiserson et al. to meet the constraints of systolic computation is C-slow retiming.1 In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip-flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task-level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson's retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
1 This was originally defined to meet systolic slowdown requirements.
How to retime logic
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5. (Diagram: IN and OUT connected through nodes with delays 1, 1, 1, 1, 2, 2.)
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand-designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task-level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson's retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson's original paper [7].
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of
Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.
constraints to calculate the lag on each node. All these constraints are of the form x − y ≤ k, which can be solved in O(n²) time by using the Bellman/Ford shortest path algorithm. The primary constraints insure correctness: no edge will have a negative number of registers while every cycle will always contain the original number of registers. All I/O passes through an intermediate node insuring that input and output timings do not change. These constraints can be modified to insure that a particular line will contain no registers or a mandatory minimum number of registers to meet architectural constraints.
A second set of constraints attempt to insure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge, a new register weight w′ is calculated, with w′(e) = w(e) − r(u) + r(v).
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is now excluded from retiming purposes. Local synchronous resets and enables just produce small, self loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can't have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Circles are combinational logic, labelled with delays.
Critical path is 5. We want to improve it without changing circuit semantics.
Add a register, move one circle. Performance improves by 20%.
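To make the bookkeeping concrete, here is a minimal Python sketch on a hypothetical graph (the node delays, edge register counts, and lag values are invented for illustration; only the update rule w'(u,v) = w(u,v) - r(u) + r(v) and the legality conditions come from the text above):

from functools import lru_cache

delay = {"IN": 0, "A": 1, "B": 2, "C": 2, "OUT": 0}
edges = {("IN","A"): 1, ("A","B"): 0, ("B","C"): 0, ("C","OUT"): 1, ("C","B"): 1}
lag   = {"IN": 0, "A": -1, "B": 0, "C": 0, "OUT": 0}   # produced by the Bellman/Ford constraint step

def retime(edges, lag):
    # New register count on every edge: w'(u,v) = w(u,v) - r(u) + r(v)
    return {(u, v): w - lag[u] + lag[v] for (u, v), w in edges.items()}

def critical_path(edges, delay):
    # Longest chain of combinational delay crossing only register-free edges.
    # A legal retiming keeps at least one register on every cycle, so this subgraph is acyclic.
    zero = [(u, v) for (u, v), w in edges.items() if w == 0]
    @lru_cache(maxsize=None)
    def longest_from(n):
        return delay[n] + max((longest_from(v) for u, v in zero if u == n), default=0)
    return max(longest_from(n) for n in delay)

new_edges = retime(edges, lag)
assert all(w >= 0 for w in new_edges.values()), "illegal retiming: negative edge weight"
print("before:", critical_path(edges, delay), "after:", critical_path(new_edges, delay))  # 5 -> 4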
“Technology X” can often do this.
30Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Power 4: Timing Estimation, Closure
Timing Estimation: Predicting a processor's clock rate early in the project.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
31Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Power 4: Timing Estimation, Closure
Timing Closure: Meeting (or exceeding!) the timing estimate.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
32Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Floorplanning: essential to meet timing.
(Intel XScale 80200)
33Thursday, September 5, 13
34Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Break
35Thursday, September 5, 13
Simple exercises for gaining intuition about timing for your process + EDA tools.
Thanks to Bhupesh Dasila, Open-Silicon Bangalore
36Thursday, September 5, 13
Bhupesh Dasila
Synthesize gate chains using hand-specified library cells
Exercises cell library and place-and-route tools.
Lets you know how many levels of logic you can use in the best case.
Helps you “see through” ... “Technology X”.
Synthesis constrained to 2ns clock.
Spring 2003 EECS150 – Lec10-Timing, Page 11: Gate Delay
• Cascaded gates (Vin driving a chain of gates to Vout).
Delay of a chain of 3 inverters with strongest strength. “Guaranteed not to exceed” speed.
weak NANDs
Chain lengths ...
40 nm process: 29 ps/gate average.
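A back-of-the-envelope sketch using the 29 ps/gate figure; the flip-flop clk-to-Q and setup numbers are assumed, not from the slide:

def max_levels(clock_ps, ps_per_gate=29.0, t_clk_to_q=50.0, t_setup=40.0):
    # Levels of logic that fit between launch and capture flops in one cycle (best case).
    return int((clock_ps - t_clk_to_q - t_setup) // ps_per_gate)

print(max_levels(2000.0))   # the 2 ns synthesis constraint -> roughly 65 levels
print(max_levels(500.0))    # a 2 GHz target leaves far fewer levels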
37Thursday, September 5, 13
Spring 2003 EECS150 – Lec10-Timing, Page 16: Wire Delay
• Even in those cases where the transmission line effect is negligible:
– Wires possess distributed resistance and capacitance
– Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
– Typically around half of C of gate load is in the wires.
• For long wires on ICs:
– busses, clock lines, global control signals, etc.
– Resistance is significant, therefore distributed RC effect dominates.
– Signals are typically "rebuffered" to reduce delay (waveform: v1 → v2 → v3 → v4 over time).
Force P&R to drive a long wire with a known buffer cell.
Vary driver strength, wire length, metal layer.
Shows the maximum distance two gates can be placed and still meet your clock period.
Distributed RC delay growing as the square of the length is clearly seen!
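A sketch of why rebuffering helps, using the standard ~0.38·R·C estimate for the 50% delay of a distributed RC line; the per-mm resistance, per-mm capacitance, and buffer delay are assumed example values:

r_per_mm = 200.0     # ohms per mm (assumed intermediate-layer value)
c_per_mm = 0.20e-12  # farads per mm (assumed)
t_buf    = 30e-12    # seconds per inserted buffer (assumed)

def unbuffered_delay(length_mm):
    # Distributed RC: delay grows with the square of the length.
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)

def rebuffered_delay(length_mm, n_segments):
    # Break the wire into segments and pay a buffer delay per segment.
    seg = length_mm / n_segments
    return n_segments * (t_buf + unbuffered_delay(seg))

for L in (1, 2, 4, 8):
    print(f"{L} mm: {unbuffered_delay(L)*1e12:6.1f} ps unbuffered, "
          f"{rebuffered_delay(L, L)*1e12:6.1f} ps with one buffer per mm")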
Bhupesh Dasila
38Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CS250, UC Berkeley Fall '12, Lecture 04, Timing: Turning Rise/Fall Delay into Gate Delay
• Cascaded gates: "transfer curve" for inverter.
Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus large gate delay)
‣ Strategy: staged buffers
‣ Optimal trade-off between delay per stage and total number of stages ⇒ fanout of ∼4-6 per stage
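A sketch of sizing such a buffer chain under the fanout-of-~4 rule; the capacitance numbers are illustrative:

import math

def buffer_chain(c_load_ff, c_in_ff=1.0, fanout=4.0):
    # Number of stages so that each stage drives roughly `fanout` times its input capacitance.
    total = c_load_ff / c_in_ff
    n = max(1, round(math.log(total) / math.log(fanout)))
    stage_fanout = total ** (1.0 / n)
    return n, stage_fanout

n, f = buffer_chain(c_load_ff=1000.0)   # e.g., driving a load 1000x the first gate's input
print(f"{n} stages, actual fanout per stage ~{f:.1f}")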
39Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Register file: Synthesize, or use SRAM?
(Schematic: registers R1–R31 plus R0 hardwired to the constant 0, all clocked by clk; each register's 32-bit Q output feeds two 32-to-1 muxes selected by sel(rs1) and sel(rs2) to form read ports rd1 and rd2 ("two read ports"); a 5-bit sel(ws) drives a demux that steers the write enable WE to one register's En input, with wd as the 32-bit write data on every D input.)
Speed will depend on how large it lays out ...
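A behavioral sketch (a software model, not RTL) of the register file drawn above, with R0 hardwired to zero and one write port gated by WE:

class RegFile:
    def __init__(self):
        self.regs = [0] * 32          # R0..R31; R0 is the constant 0

    def read(self, rs1, rs2):
        # Two read ports: each is conceptually a 32-to-1 mux on the Q outputs.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # Write port: the demux steers WE to one register's enable input.
        if we and ws != 0:
            self.regs[ws] = wd & 0xFFFFFFFF

rf = RegFile()
rf.clock_edge(we=1, ws=3, wd=42)
print(rf.read(3, 0))   # (42, 0)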
40Thursday, September 5, 13
Figure 3: Using the raw area data, the physical implementation team can get a more accurate area estimation early in the RTL development stage for floorplanning purposes. This shows an example of this graph for a 1-port, 32-bit-wide SRAM.
Synthesized, custom, and SRAM-based register files, 40nm
For small register files, logic synthesis is competitive.
Not clear if the SRAM data points include area for register control, etc.
Registerfile compiler
Synthesis
SRAMS
Bhupesh Dasila
41Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Techniques
42Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Pipelining
43Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Starting point: A single-cycle processor
(Datapath: PC register and +0x4 adder feed the instruction memory (Addr → Data); the register file (rs1, rs2, ws, wd, WE, rd1, rd2) and sign-extender Ext feed a 32-bit ALU with an op control; the data memory (Addr, Din, Dout, WE) and a MemToReg mux select the writeback value.)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
CPI == 1: This is good. Slow clock: This is bad.
Challenge: Speed up clock while keeping CPI == 1
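A worked example of the performance equation with made-up numbers, comparing a CPI == 1 machine with a slow clock against a faster-clocked pipeline whose CPI is slightly above 1:

def runtime_s(instructions, cpi, clock_hz):
    # Seconds/Program = (Instructions/Program) * (Cycles/Instruction) * (Seconds/Cycle)
    return instructions * cpi / clock_hz

insns = 1_000_000
print(runtime_s(insns, cpi=1.0, clock_hz=100e6))   # single-cycle, 100 MHz: 0.010 s
print(runtime_s(insns, cpi=1.2, clock_hz=400e6))   # pipelined,   400 MHz: 0.003 s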
44Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Reminder: How data flows after posedge
(Datapath: after the rising edge, PC drives the instruction memory, the fetched fields select the register file read ports (rs1, rs2 → rd1, rd2), and the ALU computes on the read data; PC + 0x4 forms the next PC.)
45Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Next posedge: Update state and repeat
(Datapath: on the next rising edge the register file write port (ws, wd, WE) and the PC register capture their new values.)
46Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Observation: Logic idle most of cycle
(Single-cycle datapath as above: PC, instruction memory, register file, Ext, ALU, data memory, MemToReg.)
For most of cycle, ALU is either “waiting” for its inputs, or “holding” its output
Ideal: a CPU architecture where each part is always “working”.
47Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. The assembly line moves on a steady clock.
Each station does the same task on each car.
(Diagram: a car body shell and a car chassis arrive at a merge station, then a bolting station, all advancing on the clock.)
48Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. Simpler station tasks → more cars per hour. Simple tasks take less time, clock is faster.
49Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. Line speed limited by slowest task.
Most efficient if all tasks take same time to do
50Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inspiration: Automobile assembly line. Simpler tasks, complex car → long line!
These lines go 24 x 7, and rarely shut down.
51Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Lessons from car assembly lines
Faster line movement yields more cars per hour off the line.
Faster line movement requires more stages, each doing simpler tasks.
To maximize efficiency, all stages should take same amount of time (if not, workers in fast stages are idle)
"Filling", "flushing", and "stalling" the assembly line are all bad news.
52Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Key Analogy: The instruction is the car
(Diagram: PC register and +0x4 adder feed the instruction memory; the fetched instruction is captured in an IR and copied down a chain of IR pipeline registers.)
Instruction Fetch is Pipeline Stage #1. The IR in Stage #2 controls the hardware in stage 2, the IR in Stage #3 controls the hardware in stage 3, and likewise for Stages #4 and #5.
"Data-stationary control"
53Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Example: Decode & Register Fetch Stage
(Diagram: Stage #1, Instr Fetch: PC, +0x4, instruction memory, IR. Stage #2, Decode & Reg Fetch: register file read ports (rs1, rs2, rd1, rd2), sign-extender Ext, and pipeline registers A, B, M, IR. Stage #3 follows.)
ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8
A sample program
R’s chosen so that instructions are
independent - like cars on the line.
54Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Hazards: An instruction is not a car ...
(Pipelined datapath as above: Stage #1 Instr Fetch, Stage #2 Decode & Reg Fetch with pipeline registers A, B, M, IR, then Stage #3.)
ADD R4,R3,R2
OR R5,R4,R2
An example of a “hazard” -- we must
(1) detect and (2) resolve all hazards
to make a CPU that matches ISA
R4 not written yet ... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!
New sample program: ADD R4,R3,R2 followed by OR R5,R4,R2.
55Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Decode & Reg Fetch
Performance Equation and Hazards
(Pipelined datapath as above: Instr Fetch, Decode & Reg Fetch, Stage #3.)
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
"Software slows the machine down." (Seymour Cray)
Some ways to cope with hazards: "stalling the pipeline" makes CPI > 1; added logic to detect and resolve hazards increases the clock period.
56Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Superpipelining
57Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Superpipelining: Add more stages.
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Goal: Reduce critical path by adding more pipeline stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay goes to 0, FF clk-to-Q and setup.
Example: 8-stage ARM XScale: extra IF, ID, and data cache stages.
Also, power!
58Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CS 152 L10 Pipeline Intro (9), Fall 2004 © UC Regents: Graphically Representing MIPS Pipeline. Can help with answering questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed? (Diagram: IM, Reg, ALU, DM, Reg datapath with stages IF, ID+RF, EX, MEM, WB separated by IR pipeline registers.)
5 Stage
8 Stage
IF now takes 2 stages (pipelined I-cache). ID and RF each get a stage. ALU split over 3 stages. MEM takes 2 stages (pipelined D-cache).
Note: Some stages now overlap, some instructions
take extra stages.
59Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Superpipelining techniques ...
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove “rarely-used” forwarding networks that are on critical path.
Pipeline the wires of frequently used forwarding networks.
Creates stalls, affects CPI.
Also: Clocking tricks (example: use posedge and negedge registers)
60Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Hardware limits to superpipelining?
(Chart by Francois Labonte, Stanford University, 4/23/2003: cycle time in FO4, 0 to 100, for processors introduced 1985–2005: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64.)
Thanks to Francois Labonte, Stanford
FO4 Delays: CPU Clock Periods, 1985–2005. Historical limit: about 12 FO4s.
MIPS 2000: 5 stages. Pentium Pro: 10 stages. Pentium 4: 20 stages.
Power wall: Intel Core Duo has 14 stages.
FO4: How many fanout-of-4 inverter delays in the clock period.
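A small sketch converting a clock frequency into FO4 units; the 30 ps FO4 delay used here is an assumed example value, not a measured number:

def fo4_per_cycle(clock_ghz, fo4_delay_ps):
    # Clock period expressed as a multiple of one fanout-of-4 inverter delay.
    period_ps = 1e3 / clock_ghz
    return period_ps / fo4_delay_ps

# e.g., a hypothetical part whose measured FO4 delay is ~30 ps:
print(f"{fo4_per_cycle(2.0, 30.0):.1f} FO4 at 2.0 GHz")   # ~16.7 FO4/cycle
print(f"{fo4_per_cycle(1.0, 30.0):.1f} FO4 at 1.0 GHz")   # ~33.3 FO4/cycle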
61Thursday, September 5, 13
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years.
Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University
In November 1971, Intel introduced the world’s first single-chip microprocessor, the Intel 4004. It had 2,300 transistors, ran at a clock speed of up to 740 KHz, and delivered 60,000 instructions per second while dissipating 0.5 watts. The following four decades witnessed exponential growth in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today’s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.
Where did these incredible gains come from? This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford’s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.
In addition, we provide a methodology to separate the effect of technology scaling from improvements on other frontiers (e.g., architecture and software), allowing the comparison of machines built in different technologies. To demonstrate the utility of this data and analysis, we use it to decompose processor improvements into contributions from the physical scaling of devices, and from improvements in microarchitecture, compiler, and software technologies.
AN OPEN REPOSITORY OF PROCESSOR SPECSWhile information about current processors is easy to find, it is rarely arranged in a manner that is useful to the research community. For example, the data sheet may contain the processor’s power, voltage, frequency, and cache size, but not the pipeline depth or the technology minimum feature size. Even then, these specifications often fail to tell the full story: a laptop processor operates over a range of frequencies and voltages, not just the 2 GHz shown on the box label.
Not surprisingly, specification data gets harder to find the older the processor becomes, especially for those that are no longer made, or worse, whose manufacturers no longer exist. We have been collecting this type of data for three decades and are now releasing it in the form of an open repository of processor specifications. The goal of CPU DB is to aggregate detailed processor specifications into a convenient form and to encourage community participation, both to leverage this information and to keep it accurate and current. CPU DB (cpudb. stanford.edu) is populated with desktop, laptop, and server processors, for which we use SPEC13 as our performance-measuring tool. In addition, the database contains limited data on embedded cores, for which we are using the CoreMark benchmark for performance.5 With time and help from the community, we hope to extend the coverage of embedded processors in the database.
(Chart: FO4 Delays Per Cycle for Processor Designs, 1985–2015; y-axis FO4/cycle, 0 to 140.)
FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.
62Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Multithreading
63Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Krste
November 10, 2004
6.823, L18--3
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on same pipeline
Pipeline diagram (one instruction enters the F D X M W pipe each cycle, threads rotating T1, T2, T3, T4, timeline t0 through t9):
t0: T1: LW r1, 0(r2) (F at t0, W at t4)
t1: T2: ADD r7, r1, r4 (F at t1, W at t5)
t2: T3: XORI r5, r4, #12 (F at t2, W at t6)
t3: T4: SW 0(r7), r5 (F at t3, W at t7)
t4: T1: LW r5, 12(r1) (F at t4, W at t8)
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
Last instruction in a thread always completes writeback before next instruction in same thread reads regfile.
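A behavioral sketch (not RTL) of the interleaving idea under the stated assumption of four threads on a non-bypassed five-stage pipe: because a thread issues only once every four cycles, its previous instruction has written back before its next instruction reaches decode.

```python
# Sketch: round-robin interleaving of 4 threads on a 5-stage (F D X M W) pipe.
# With 4 threads and a 5-stage pipe, an instruction's W (cycle t+4) finishes
# before the same thread's next instruction reads registers in D (cycle t+5),
# so no bypassing or interlocks are needed between instructions of one thread.

STAGES = ["F", "D", "X", "M", "W"]
NUM_THREADS = 4

def schedule(num_cycles=12):
    rows = {}  # (thread, instruction index) -> list of (cycle, stage)
    for t in range(num_cycles):
        thread = t % NUM_THREADS          # thread select rotates every cycle
        instr = t // NUM_THREADS          # which instruction of that thread
        for s, stage in enumerate(STAGES):
            rows.setdefault((thread, instr), []).append((t + s, stage))
    return rows

for (thread, instr), occupancy in sorted(schedule().items()):
    cells = " ".join(f"t{c}:{s}" for c, s in occupancy)
    print(f"T{thread + 1} instr {instr}: {cells}")
```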
KrsteNovember 10, 2004
6.823, L18--5
Simple Multithreaded Pipeline
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
[Diagram: a 2-bit thread select chooses among four per-thread PCs (each with a +1 incrementer); the selected PC indexes the I$ and IR, and the thread select also chooses among four per-thread register files (GPR1 x 4) feeding the X and Y operand registers and the D$.]
Multithreading of Static Pipelines
4 CPUs, each running at 1/4 clock.
Many variants ...
64Thursday, September 5, 13
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver*, UC Berkeley, Berkeley, CA
Yury Markovskiy, UC Berkeley, Berkeley, CA
Yatish Patel, UC Berkeley, Berkeley, CA
John Wawrzynek, UC Berkeley, Berkeley, CA
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids—Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
*Please address any correspondence to [email protected]
1. Introduction
Leiserson’s retiming algorithm [7] offers a polynomial time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools [10][2] and commercial synthesis tools [13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can’t benefit from retiming.
Also proposed by Leiserson et al to meet the constraints of systolic computation is C-slow retiming.1 In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson’s retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
1 This was originally defined to meet systolic slowdown requirements.
At the logic level ...
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson’s retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson’s original paper [7].
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints
Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.
to calculate the lag on each node. All these constraints are of the form x − y ≤ k and can be solved in O(n²) time using the Bellman-Ford shortest path algorithm. The primary constraints ensure correctness: no edge will have a negative number of registers, while every cycle will always contain the original number of registers. All I/O passes through an intermediate node, ensuring that input and output timings do not change. These constraints can be modified to ensure that a particular line will contain no registers, or a mandatory minimum number of registers to meet architectural constraints.
A second set of constraints attempts to ensure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge, a new register weight w′ is calculated, with w′(e) = w(e) − r(u) + r(v).
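Since the constraints are all of the difference form r(u) − r(v) ≤ k, one standard way to solve them is Bellman-Ford over a constraint graph; the sketch below illustrates that step only, and the three-node constraint set at the bottom is made up for illustration rather than taken from the paper's figures.

```python
# Sketch: solve difference constraints r(u) - r(v) <= k with Bellman-Ford.
# Each constraint becomes a graph edge v -> u with weight k; shortest-path
# distances from a virtual source (distance 0 to every node) give one valid
# assignment of lags r(.), or report infeasibility via a negative cycle.

def solve_lags(nodes, constraints):
    """constraints: list of (u, v, k) meaning r(u) - r(v) <= k."""
    dist = {n: 0 for n in nodes}           # virtual source: 0-weight edge to all
    edges = [(v, u, k) for (u, v, k) in constraints]
    for _ in range(len(nodes)):            # enough rounds for |V|+1 vertices
        for v, u, k in edges:
            if dist[v] + k < dist[u]:
                dist[u] = dist[v] + k
    for v, u, k in edges:                  # still relaxable => negative cycle
        if dist[v] + k < dist[u]:
            return None                    # constraints are infeasible
    return dist                            # dist[n] is a valid lag r(n)

# Made-up example: r(a) - r(b) <= 0, r(b) - r(c) <= -1, r(c) - r(a) <= 2
print(solve_lags(["a", "b", "c"], [("a", "b", 0), ("b", "c", -1), ("c", "a", 2)]))
```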
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is excluded from retiming. Local synchronous resets and enables just produce small, self loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can’t have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Normal edge from u → v: r(u) − r(v) ≤ w(e)
Edge from u → v that must be registered: r(u) − r(v) ≤ w(e) − 1
Edge from u → v that can never be registered: r(u) − r(v) ≤ 0 and r(v) − r(u) ≤ 0
Critical paths that must be registered: r(u) − r(v) ≤ W(u, v) − 1, for all u, v such that D(u, v) > P
Table 1: The constraint system used by the retiming process.
Figure 3: The example in Figure 1, 2-slowed. This design now operates on 2 independent data streams.
left unchanged or switched to operate on the positive edge with constraints to mandate the placement of registers.2
Initial register conditions can also be calculated if desired, but this process is NP hard in the general case. Cong and Wu [3] have an algorithm that computes initial states by restricting the design to forward retiming only, so it propagates the information and registers forward throughout the computation. This is because solving initial states for all registers moved forward is straightforward, but backwards movement is NP hard3 as it reduces to satisfiability.
An important question is how to deal with multiple clocks. If the interfaces between the clock domains are registered by clocks from both domains, and with all signals being unidirectional, each clock domain can be treated as an independent block with all signals crossing the domain treated as I/O. Due to the retiming-imposed constraints on I/O, the logical view of each input will not change. However, constraints may be needed to ensure that the physical registers remain in position to prevent asynchronous conditions from occurring on this interface.
3. C-slow retiming
C-slowing enhances retiming by simply replacing every register with a sequence of C separate registers before retiming occurs. The resulting design operates on C distinct execution tasks. Since all registers are duplicated, the computation proceeds in a round-robin fashion. The easiest way to utilize a C-slowed block is to simply multiplex
2 For some cases, this may produce a set of unsolvable constraints, thus requiring that the memory remain a negative edge device.
3 And may not possess a valid solution for nonsensical cases.
Figure 4: The example in Figure 3 after retiming. The combination of C-slowing and retiming reduced the critical path from 5 to 2.
and demultiplex C separate data streams, but a more sophisticated interface may be desired depending on the application.
One possible interface is to register all inputs and outputs of a C-slowed block. Because of the additional edges retiming creates to track I/Os and to ensure a consistent interface, every stream of execution presents all outputs at the same time, with all inputs being registered on the next cycle. If part of the design is C-slowed, but all operate on the same clock, the resulting design can be retimed as a complete whole while preserving all other semantics. We use these observations later when discussing the effects of C-slowing on a microprocessor core.
However, C-slowing imposes some more significant FPGA design constraints, as summarized in Table 2. Register clock enables and resets must be expressed as logic features, since each independent thread must see a different view of the reset or enable. Thus, they can remain features in the design but can’t be implemented by current FPGAs using the native enables and resets. Other specialized features, such as Xilinx SRL16s,4 can’t be utilized in a C-slow design for the same reasons.
One important issue is how to properly C-slow memory blocks. In most cases, one desires the complete illusion that each stream of execution is completely independent and unchanged. To create this illusion, the memory must be increased by a factor of C, with additional address lines driven by a counter. This ensures that each stream of execution enjoys a completely separate memory space.
For dual ported memories, this potentially enables a greater freedom in retiming: the two ports can have different lags, as long as the difference in lag is C − 1 or less. After retiming, the difference in lag is added to the appropriate port’s thread counter. This ensures that each
4 A mode where the LUT can act as a 16-bit shift register
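A small behavioral sketch of the memory trick just described: the array grows by a factor of C and a round-robin thread counter supplies the extra address bits, giving each execution stream a private memory space. The class and its interface are illustrative, not the tool's implementation.

```python
# Sketch: C-slowed memory model. Capacity grows by a factor of C and a
# round-robin thread counter supplies the extra (high) address bits, so each
# of the C interleaved execution streams sees a private copy of the memory.

class CSlowedMemory:
    def __init__(self, depth, c):
        self.depth = depth                    # words visible to each thread
        self.c = c                            # number of interleaved threads
        self.data = [0] * (depth * c)         # physical array is C x larger
        self.thread = 0                       # advances one thread per cycle

    def tick(self):
        self.thread = (self.thread + 1) % self.c

    def _phys(self, addr):
        return self.thread * self.depth + addr   # thread counter = high bits

    def read(self, addr):
        return self.data[self._phys(addr)]

    def write(self, addr, value):
        self.data[self._phys(addr)] = value

mem = CSlowedMemory(depth=16, c=2)
mem.write(3, 111)          # thread 0 writes its address 3
mem.tick()
mem.write(3, 222)          # thread 1 writes its own address 3
mem.tick()
print(mem.read(3))         # thread 0 still sees 111
```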
Synchronous logic we want to “multithread”. Critical path is 5.
2X multi-threading: double each register.
Modern synthesis will retime this as shown: critical path is now 2.
65Thursday, September 5, 13
Good fit for GALS
Two input queues (red and green). The mux control logic implements turn-taking.
Outputs placed into two output queues.
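A behavioral sketch of the turn-taking idea; the queue names, the 'process' stand-in for the synchronous block, and the give-up-the-turn-when-empty rule are all illustrative assumptions.

```python
# Sketch: turn-taking mux between two input queues feeding one shared block,
# with results demuxed into two output queues. Queue names and the 'process'
# function are illustrative placeholders.

from collections import deque

in_q = {"red": deque([1, 2, 3]), "green": deque([10, 20])}
out_q = {"red": deque(), "green": deque()}

def process(x):
    return x * 2            # stand-in for the shared synchronous block

turn = "red"
for _ in range(8):          # a few cycles of operation
    other = "green" if turn == "red" else "red"
    src = turn if in_q[turn] else other       # give up the turn if empty
    if in_q[src]:
        out_q[src].append(process(in_q[src].popleft()))
    turn = other                              # alternate turns each cycle

print(dict(out_q))
```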
66Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Crossbar Networks
67Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
When register files get big, they get slow.
[Diagram: a 32 x 32-bit register file built from flip-flops. R0 is the constant 0; R1–R31 are enabled registers clocked by clk. Two 32-input, 32-bit muxes, selected by the 5-bit sel(rs1) and sel(rs2), drive the read ports rd1 and rd2. A demux on the 5-bit sel(ws), gated by WE, steers the write enable so that the 32-bit wd is written into the selected register.]
Even worse: adding ports slows down as O(N²) ...
Why? Number of loads on each Q goes as O(N), and the wire length to the port mux goes as O(N).
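A back-of-the-envelope model of that argument: treat the read-path delay as the product of an O(N) fanout term and an O(N) wire term, so it grows roughly as O(N²). The constants are arbitrary normalized units, not circuit data.

```python
# Sketch: register-file read-path delay vs. number of ports, in arbitrary
# normalized units. load_delay and wire_delay constants are illustrative.

def read_delay(num_ports, load_delay=1.0, wire_delay=1.0):
    fanout = num_ports * load_delay       # each Q drives one mux per read port
    wiring = num_ports * wire_delay       # wire length to the port mux grows too
    return fanout * wiring                # product grows as O(N^2)

for ports in (1, 2, 4, 8, 16):
    print(f"{ports:2d} ports -> relative delay {read_delay(ports):6.1f}")
```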
68Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 7
Fig. 2. Niagara2 block diagram.
Fig. 3. Niagara2 die micrograph.
two FBDIMM channels. These three major I/O interfaces are serializer/deserializer (SerDes) based and provide a total pin bandwidth in excess of 1 Tb/s. All the SerDes are on chip. The high levels of system integration truly make Niagara2 a “server-on-a-chip”, thus reducing system component count, complexity and power, and hence improving system reliability.
B. SPARC Core Architecture
Fig. 4 shows the block diagram of the SPARC Core. Each SPARC core (SPC) implements the 64-bit SPARC V9 instruction set while supporting concurrent execution of eight threads. Each SPC has one load/store unit (LSU), two Execution units (EXU0 and EXU1), and one Floating Point and Graphics Unit (FGU). The Instruction Fetch unit (IFU) and the LSU contain an 8-way 16 kB Instruction cache and a 4-way 8 kB Data cache respectively. Each SPC also contains a 64-entry Instruction-TLB (ITLB), and a 128-entry Data-TLB (DTLB). Both the TLBs are fully associative. The Memory Management Unit (MMU) supports 8 K, 64 K, 4 M, and 256 M page sizes and has Hardware
Fig. 4. SPC block diagram.
Fig. 5. Integer pipeline: eight stages.
Fig. 6. Floating point pipeline: 12 stages.
TableWalk to reduce TLB miss penalty. “TLU” in the block diagram is the Trap Logic Unit. The “Gasket” performs arbitration for access to the Crossbar. Each SPC also has an advanced Cryptographic/Stream Processing Unit (SPU). The combined bandwidth of the eight Cryptographic units from the eight SPCs is sufficient for running the two 10 Gb Ethernet ports encrypted. This enables Niagara2 to run secure applications at wire speed.
Fig. 5 and Fig. 6 illustrate the Niagara2 integer and floating point pipelines, respectively. The integer pipeline is eight stages long. The floating point pipeline has 12 stages for most operations. Divide and Square-root operations have a longer pipeline.
Crossbar networks: general case of this problem
Each DRAM channel: 50 GB/s Read, 25 GB/s Write BW. Crossbar BW: 270 GB/s total (Read + Write).
(Also shared by an I/O port, not shown)
Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels
69Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 9
Fig. 9. L2 cache row redundancy scheme.
2-cycle latency. Addresses can be hashed to distribute accesses across different sets in case of hot cache sets caused by reference conflicts. All arrays are protected by single error correction, double error detection ECC, and parity. Data from different ways and different words is interleaved to improve soft error rates.
The L2 cache used a unique row-redundancy scheme. It is implemented at the 32 kB level and is illustrated in Fig. 9. Spare rows for one array are located in the adjacent array as opposed to the same array. In other words, spare rows for the top array are located in the bottom array and vice versa. When redundancy is enabled, the incoming address is compared with the address of the defective row and if it matches, the adjacent array (which is normally not enabled) is enabled to read from or write into the spare row. Using this kind of scheme enables a large (~30%) reduction in X-decoder area. The area reduction is achieved because the multiplexing required in the X-decoder to bypass the defective row/rows in the traditional row redundancy scheme is no longer needed in this scheme.
N-well power for the Primary and L2 cache memory cells is separated out as a test hook. This allows weakening of the pMOS loads of the SRAM bit cells by raising their threshold voltage, thus enabling screening cells with marginal static noise margin. This significantly reduces defective parts per million (DPPM) and improves reliability.
Fig. 10 shows the Niagara2 Crossbar (CCX). CCX serves as a high bandwidth interface between the eight SPARC Cores, shown on top, and the eight L2 cache banks, and the non-cacheable unit (NCU) shown at the bottom. CCX consists of two blocks: PCX and CPX. PCX (“Processor-to-Cache-Transfer”) is an 8-input 9-output multiplexer (mux). It transfers data from the eight SPARC cores to the eight L2 cache banks and the NCU. Likewise, CPX (“Cache-to-Processor Transfer”) is a
Fig. 10. Crossbar.
9-input 8-output mux, and it transfers data in the reverse direction. The PCX and CPX combined provide a Read/Write bandwidth of 270 GB/s. All crossbar data transfer requests are processed using a four-stage pipeline. The pipeline stages are: Request, Arbitration, Selection, and Transmission. As can be seen from the figure, there are many possible source–destination pairs for each data transfer request. There is a two-deep queue for each source–destination pair to hold data transfer requests for that pair.
IV. CLOCKING
Niagara2 contains a mix of many clocking styles—synchronous, mesochronous and asynchronous—and hence a large number of clock domains. Managing all these clock domains and domain crossings between them was one of the biggest challenges the design team faced. A subset of synchronous methodology, ratioed synchronous clocking (RSC), is used extensively. The concept works well for functional mode while being equally applicable to at-speed test of the core using the SerDes interfaces.
A. Clock Sources and Distribution
An on-chip phase-locked loop (PLL) uses a fractional divider [8], [9] to generate Ratioed Synchronous Clocks with support for a wide range of integer and fractional divide ratios. The distribution of these clocks uses a combination of H-trees and grids. This ensures they meet tight clock skew budgets while keeping power consumption under control. Clock Tree Synthesis is used for routing the asynchronous clocks. Asynchronous clock domain crossings are handled using FIFOs and meta-stability hardened flip-flops. All clock headers are designed to support clock gating to save clock power.
Fig. 11 shows the block diagram of the PLL. Its architecture is similar to the one described in [8]. It uses a loop filter capacitor referenced to a regulated 1.1 V supply (VREG). VREG is generated by a voltage regulator from the 1.5 V supply coming
Sun Niagara II 8 x 9 Crossbar
8 ports on CPU side (one per core)
8 ports for L2 banks, plus one for I/O
4 cycle latency (715 ps/cycle).
Cycles 1-3 are for arbitration.
Transmit data on cycle 4.
100-200 wires/port (each way).
Pipelined.
70Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
A complete switch transfer (4 epochs)
Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.
Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, for different sets of requests.
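A software sketch of that pipelining: each cycle a new batch of requests enters while older batches advance toward transfer, so one transfer completes per cycle after the initial fill. The "lowest-numbered input wins" arbitration rule is a placeholder, not Niagara2's actual allocator.

```python
# Sketch: 4-stage pipelined crossbar allocation. Each cycle, a new set of
# (input, output) requests enters stage 1 while older sets advance through
# arbitration, grant notification, and data transfer. Losing requests are
# simply dropped in this sketch; "lowest input wins" is a placeholder policy.

from collections import deque

def arbitrate(requests):
    """Pick at most one winning input per requested output."""
    winners = {}
    for inp, out in sorted(requests):           # placeholder: lowest input wins
        winners.setdefault(out, inp)
    return winners                               # {output: winning input}

pipe = deque([None, None, None], maxlen=3)      # batches in stages 2-4

# One batch of requests per cycle: (input port, requested output port)
traffic = [
    [(0, 5), (1, 5), (2, 7)],
    [(3, 5), (4, 6)],
    [(0, 7), (2, 7)],
    [], [], [],
]

for cycle, new_requests in enumerate(traffic):
    done = pipe[-1]                              # batch finishing data transfer
    if done:
        for out, inp in sorted(done.items()):
            print(f"cycle {cycle}: transfer input {inp} -> output {out}")
    pipe.appendleft(arbitrate(new_requests))     # stages shift; new batch enters
```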
71Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Sun Niagara II 8 x 9 Crossbar
Every cross of blue and purple is a pass gate with a unique control signal.
72 control signals (if distributed unencoded).
72Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
73Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Sun Niagara II Crossbar Notes
Crossbar defines floorplan: all port devices should be equidistant to the crossbar.
Uniform latency between all port pairs.
Low latency: 4 cycles (less than 3 ns).
Did not scale up for 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port with two cores.
Design alternatives to crossbar?
74Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CLOS networks: from telecom world ...
Build a high-port switch by tiling fixed-sized shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.
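A quick sketch of the scaling argument behind tiling: a three-stage Clos built from small crossbar tiles needs far fewer crosspoints than a flat crossbar with the same port count, at the cost of extra stages (and hence latency). The tile sizes below are illustrative.

```python
# Sketch: crosspoint count of a flat N x N crossbar vs. a 3-stage Clos(m, n, r)
# network built from small crossbar tiles (N = n * r external ports).
# m = n gives a rearrangeably nonblocking fabric; m = 2n - 1 gives a strictly
# nonblocking one. Parameters are illustrative.

def flat_crossbar(n_ports):
    return n_ports * n_ports

def clos(m, n, r):
    ingress = r * (n * m)       # r tiles of size n x m
    middle = m * (r * r)        # m tiles of size r x r
    egress = r * (m * n)        # r tiles of size m x n
    return ingress + middle + egress

n, r = 8, 8                     # 64 external ports built from 8-port tiles
N = n * r
print("flat crossbar crosspoints:", flat_crossbar(N))
print("Clos, m = n:              ", clos(n, n, r))
print("Clos, m = 2n - 1:         ", clos(2 * n - 1, n, r))
```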
75Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
CLOS networks: an example route
Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration still needed to prevent blocking.
76Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Electrical Details
77Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Flip Flops Revisited
78Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Recall: Static RAM cell (6 Transistors)
[Diagram: 6T SRAM cell, two cross-coupled inverters storing x and x̄, with noise sources and Gnd, Vdd, and Vth labels.]
79Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Recall: Positive edge-triggered flip-flop
D Q A flip-flop “samples” right before the edge, and then “holds” value.
[Figure (EECS150 Spring 2003, Lec10-Timing): a positive edge-triggered flip-flop built from two latches in series, plus a timing diagram of D, clk, and Q annotated with setup time and clock-to-Q delay. The first latch is the sampling circuit; the second latch holds the value. Setup time results from delay through the first latch; clock-to-Q delay results from delay through the second latch.]
16 Transistors: Makes an SRAM look compact! What do we get for the 10 extra transistors?
Clocked logic semantics.80Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Sensing: When clock is low
D Q: A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: the same two-latch flip-flop with clk = 0, clk’ = 1. The first latch (the sampling circuit) is sensing D; the second latch holds the previously captured value.]
Will capture new value on posedge.
Outputs last value captured.
81Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Capture: When clock goes high
D Q: A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: the same two-latch flip-flop with clk = 1, clk’ = 0. The first latch has captured D; the second latch drives the captured value onto Q.]
Remembers value just captured.
Outputs value just captured.
82Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Flip Flop delays:
D Q
[Figure: the two-latch flip-flop and a timing diagram of D, clk, and Q, annotated with setup time and clock-to-Q delay.]
CLK == 0: sense D, but Q outputs the old value.
CLK 0 -> 1: capture D, pass the captured value to Q.
Where in this circuit do the clk-to-Q, setup, and hold times come from?
83Thursday, September 5, 13
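A small sketch of how these three parameters enter a register-to-register timing check: clk-to-Q plus worst-case logic plus setup bounds the clock period, while clk-to-Q plus best-case logic must cover the hold requirement. The delay numbers are made-up example values, and clock skew is ignored.

```python
# Sketch: setup and hold checks for a register -> logic -> register path.
# Setup:  clk_to_q + max logic delay + setup  <= clock period
# Hold:   clk_to_q + min logic delay          >= hold
# All values below are made-up example numbers in nanoseconds; skew ignored.

clk_to_q = 0.15
setup = 0.10
hold = 0.05
logic_max = 0.90      # worst-case combinational delay between the flops
logic_min = 0.20      # best-case (contamination) delay

min_period = clk_to_q + logic_max + setup
print(f"min clock period = {min_period:.2f} ns "
      f"(max frequency ~ {1.0 / min_period:.2f} GHz)")

hold_slack = clk_to_q + logic_min - hold
print(f"hold slack = {hold_slack:+.2f} ns "
      f"({'OK' if hold_slack >= 0 else 'hold violation'})")
```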
UC Regents Fall 2013 © UCBCS 250 L3: Timing
More Detailed Gate Models
84Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverters: Circuits and Layout
[Schematic: CMOS inverter, input Vin, output Vout, powered from Vdd, alongside its logic symbol.]
85Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverter: Die Cross Section
[Cross section: nMOS (n+ regions in the p- substrate) and pMOS (p+ regions in an n-well) under gate oxide, with both gates driven by Vin and both drains tied to Vout.]
86Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverters with Vin = Gnd, Vout = Vdd
[Schematic: the inverter with the pMOS current Isd and nMOS current Ids annotated, and each device's source (Vs) and drain (Vd) labeled.]
Is Vsg > Vt ?
Isd = k (W/L) [Vsg -Vt] [Vsd]
Ids ≈0, but really a small leakage current
Is Vsd > Vsg - Vt once Vout is Vdd?
This goes as close to 0 as it can while still supplying the leakage current.
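A numeric sketch of that last point: with Vin = Gnd the pMOS only has to source the nMOS leakage current, so solving the linear-region equation above for Vsd shows Vout sitting essentially at Vdd. All device parameters are made-up example values.

```python
# Sketch: with Vin = Gnd, the on pMOS only supplies the nMOS leakage current.
# Solving the linear-region equation Isd = k (W/L) (Vsg - Vt) Vsd for Vsd shows
# how close Vout gets to Vdd. All device parameters are made-up example values.

vdd = 1.0          # V
vt = 0.3           # V
k_wl = 1e-3        # k * (W/L), in A/V^2
i_leak = 1e-9      # nMOS leakage current the pMOS must supply, in A

vsg = vdd          # gate at Gnd, source at Vdd
vsd = i_leak / (k_wl * (vsg - vt))   # from Isd = k (W/L) (Vsg - Vt) Vsd
vout = vdd - vsd

print(f"Vsd  ~= {vsd:.2e} V")
print(f"Vout ~= {vout:.9f} V (essentially Vdd)")
```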
87Thursday, September 5, 13
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Inverters with Vin = Vdd, Vout = Gnd
[Schematic: the inverter with the nMOS current Ids and pMOS current Isd annotated, and each device's source (Vs) and drain (Vd) labeled.]
Is Vgs > Vt ?
Ids = k (W/L) [Vgs - Vt] [Vds]
Is Vds > Vgs - Vt once Vout is Gnd?
Isd ≈0, but really a small leakage current
This goes as close to 0 as it can while still supplying the leakage current.
88Thursday, September 5, 13
On Tuesday ... Power and Energy
Heat Sink
Heat Source
89Thursday, September 5, 13