EE141
EECS 151/251A Spring 2018 Digital Design and Integrated Circuits
Instructors: Nick Weaver & John Wawrzynek
Lecture 10
What do ASIC/FPGA Designers need to know about physics?
‣ Physics effect: Area ⇒ cost; Delay ⇒ performance; Energy ⇒ performance & cost
• Ideally, zero delay, area, and energy. However, the physical devices occupy area, take time, and consume energy.
• CMOS process lets us build transistors, wires, connections, and we get capacitors, inductors, and resistors whether or not we want them.
Spring 2018 EECS151 Page
Performance, Cost, Power
• How do we measure performance? operations/sec? cycles/sec?
• Performance is directly proportional to clock frequency. Although it may not be the entire story:
Ex: CPU execution time = # instructions × CPI × clock period (lower execution time means higher performance)
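A minimal sketch of the execution-time product above, to show why clock frequency alone is not the entire story. The function name and the workload numbers are hypothetical, chosen only for illustration.

```python
def cpu_time_seconds(instruction_count, cpi, clock_period_s):
    """Execution time = instructions x cycles per instruction x seconds per cycle."""
    return instruction_count * cpi * clock_period_s

# Hypothetical workload: 1e9 instructions (illustrative numbers only).
slow_clock = cpu_time_seconds(1e9, 1.5, 1e-9)    # 1 GHz clock, CPI 1.5
fast_clock = cpu_time_seconds(1e9, 1.5, 0.5e-9)  # 2 GHz clock, same CPI
worse_cpi = cpu_time_seconds(1e9, 2.0, 0.5e-9)   # 2 GHz clock, CPI 2.0

print(slow_clock, fast_clock, worse_cpi)
```

Doubling the clock rate halves the execution time only if CPI stays the same; in the third case part of the gain is lost to the higher CPI.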
Limitations on Clock Rate
1. Logic gate delay
2. Delays in flip-flops
Both times contribute to limiting the clock period. What are typical delay values?
• What must happen in one clock cycle for correct operation?
  – All signals connected to FF (or memory) inputs must be ready and “setup” before the rising edge of the clock.
  – For now we assume perfect clock distribution (all flip-flops see the clock at the same time).
Example
Parallel to serial converter circuit
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τclk→Q + τmux + τsetup
[Figure: parallel-to-serial converter circuit with signals a, b, and clk.]
In General ...
For correct operation:
T ≥ τclk→Q + τCL + τsetup
for all paths.
• How do we enumerate all paths?
  – Any circuit input or register output to any register input or circuit output?
• Note:
  – “setup time” for outputs is a function of what it connects to.
  – “clk-to-q” for circuit inputs depends on where it comes from.
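The "for all paths" condition can be sketched as a maximum over path delays. This is only an illustration: the path names, delay values, and the function `min_clock_period` are hypothetical, and a real timing analyzer would derive the paths from the netlist.

```python
def min_clock_period(path_delays, t_clk_to_q, t_setup):
    """Return (critical path name, smallest T satisfying T >= clk-to-Q + CL + setup).

    path_delays maps a path name to its combinational-logic delay.
    All times use the same unit (picoseconds here).
    """
    critical = max(path_delays, key=path_delays.get)
    return critical, t_clk_to_q + path_delays[critical] + t_setup

# Hypothetical register-to-register paths and delays in ps.
paths_ps = {"regA->regB": 320.0, "regA->regC": 410.0, "regB->regC": 150.0}
critical_path, t_min_ps = min_clock_period(paths_ps, t_clk_to_q=60.0, t_setup=40.0)

print(critical_path, t_min_ps)
```

Only the slowest path sets the clock period; the faster paths simply have slack.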
“Gate Delay”
‣ Modern CMOS gate delays on the order of a few picoseconds. (However, highly dependent on gate context.)
‣ Often expressed as FO4 delays (fan-out of 4), a process-independent delay metric:
  ‣ the delay of an inverter, driven by an inverter 4x smaller than itself, and driving an inverter 4x larger than itself.
‣ For a 90nm process FO4 is around 20ps. Less than 10ps for a 32nm process.
“Path Delay”
‣ For correct operation: Total Delay ≤ clock_period - FFsetup_time - FFclk_to_q on all paths.
‣ In high-speed processors, critical paths have around 20 FO4 delays.
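A rough sketch of how an FO4 count translates into a clock rate, using the FO4 figures quoted above (about 20 ps at 90 nm, under 10 ps at 32 nm). It assumes the entire clock period is 20 FO4 delays, with flip-flop overhead folded into that count; the function name is hypothetical.

```python
def clock_frequency_ghz(fo4_per_cycle, fo4_delay_ps):
    """Clock frequency in GHz for a cycle of fo4_per_cycle FO4 delays."""
    period_ps = fo4_per_cycle * fo4_delay_ps
    return 1000.0 / period_ps  # a 1000 ps period corresponds to 1 GHz

f_90nm = clock_frequency_ghz(20, 20.0)  # 400 ps period
f_32nm = clock_frequency_ghz(20, 10.0)  # 200 ps period, using 10 ps as an upper bound

print(f_90nm, f_32nm)
```

Because the FO4 count stays fixed while the FO4 delay shrinks, the same design point runs faster in the newer process.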
[Figure: “FO4 Delays per clock period”, CPU clock periods 1985-2005. Vertical axis: cycle in FO4 (0 to 100); horizontal axis: year (85 to 05). Processors plotted: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; Power PC; AMD K6, K7, x86-64. Annotations: MIPS 2000, 5 pipeline stages; Pentium Pro, 10 pipeline stages; Pentium 4, 20 pipeline stages; historical limit: about 12 FO4 delays. Thanks to Francois Labonte, Stanford University, 4/23/2003.]
PROCESSORS
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years.
Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University
In November 1971, Intel introduced the world’s first single-chip microprocessor, the Intel 4004. It had 2,300 transistors, ran at a clock speed of up to 740 KHz, and delivered 60,000 instructions per second while dissipating 0.5 watts. The following four decades witnessed exponential growth in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today’s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.
Where did these incredible gains come from? This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford’s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.
In addition, we provide a methodology to separate the effect of technology scaling from improvements on other frontiers (e.g., architecture and software), allowing the comparison of machines built in different technologies. To demonstrate the utility of this data and analysis, we use it to decompose processor improvements into contributions from the physical scaling of devices, and from improvements in microarchitecture, compiler, and software technologies.
AN OPEN REPOSITORY OF PROCESSOR SPECS
While information about current processors is easy to find, it is rarely arranged in a manner that is useful to the research community. For example, the data sheet may contain the processor’s power, voltage, frequency, and cache size, but not the pipeline depth or the technology minimum feature size. Even then, these specifications often fail to tell the full story: a laptop processor operates over a range of frequencies and voltages, not just the 2 GHz shown on the box label.
Not surprisingly, specification data gets harder to find the older the processor becomes, especially for those that are no longer made, or worse, whose manufacturers no longer exist. We have been collecting this type of data for three decades and are now releasing it in the form of an open repository of processor specifications. The goal of CPU DB is to aggregate detailed processor specifications into a convenient form and to encourage community participation, both to leverage this information and to keep it accurate and current. CPU DB (cpudb.stanford.edu) is populated with desktop, laptop, and server processors, for which we use SPEC [13] as our performance-measuring tool. In addition, the database contains limited data on embedded cores, for which we are using the CoreMark benchmark for performance [5]. With time and help from the community, we hope to extend the coverage of embedded processors in the database.
[Figure: “FO4 Delays Per Cycle for Processor Designs”. Vertical axis: FO4 / cycle (0 to 140); horizontal axis: year (1985 to 2015). FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.]
[Figure: “Libquantum Score Versus SPEC Score” (SPEC 2006). Series: libquantum and overall SPEC; vertical axis: score (16 to 1024, in powers of two); horizontal axis: year (2008 to 2012). This figure shows how compiler optimizations have led to performance boosts in Libquantum.]
“Gate Delay”
‣ What determines the actual delay of a logic gate?
‣ Transistors are not perfect switches - cannot change terminal voltages instantaneously.
‣ Consider the NAND gate:
  ‣ Current (I) value depends on: process parameters, transistor size
  ‣ CL models gate output, wire, inputs to next stage (Cap. of Load)
  ‣ C “integrates” I creating a voltage change at output
∆ ∝ CL / I
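The proportionality ∆ ∝ CL / I follows from I = C dV/dt: a constant current I moves the voltage on CL by ΔV in time CL·ΔV / I. A minimal sketch under that constant-current assumption; the component values and the function name are hypothetical, not taken from the slides.

```python
def switching_delay_s(c_load_f, delta_v, current_a):
    """Time for a constant current to change the voltage on c_load_f by delta_v."""
    return c_load_f * delta_v / current_a

# Hypothetical numbers: 2 fF load, 0.5 V swing to the switching threshold, 50 uA drive.
base = switching_delay_s(2e-15, 0.5, 50e-6)
double_load = switching_delay_s(4e-15, 0.5, 50e-6)
double_drive = switching_delay_s(2e-15, 0.5, 100e-6)

print(base, double_load, double_drive)
```

Doubling the load doubles the delay, and doubling the drive current halves it, which is the trade-off that transistor sizing exploits.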
More on transistor Current
‣ Transistors act like a cross between a resistor and “current source”
‣ ISAT depends on process parameters (higher for nFETs than for pFETs) and transistor size (layout):
ISAT ∝ W/L
Physical Layout determines FET strength
‣ “Switch-level” abstraction gives a good way to understand the function of a circuit.
  ‣ nFET (g=1 ? short circuit : open)
  ‣ pFET (g=0 ? short circuit : open)
‣ Understanding delay means going below the switch-level abstraction to transistor physics and layout details.
UC Regents Spring 2016 © UCB CS 250 L4: Timing
Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...
An “on” p-FET fills up the capacitor with charge.
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz, Lec3.29 to Lec3.32

Delay Model: CMOS

Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth-table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ..., X and output load Cout; plot of delay (Va -> Vout) versus Cout showing the internal delay, the delay per unit load, and Ccritical.]

Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a “conductor”
[Figure: NMOS and PMOS transistor symbols with Vdd = 5V and GND = 0v.]

Basic Components: CMOS Inverter
° Inverter Operation
[Figure: inverter symbol (In, Out) and circuit (PMOS connected to Vdd, NMOS connected to ground); one case with the NMOS open and Out charging to Vdd, one case with the PMOS open and Out discharging; plot of Vout versus Vin with both axes marked at Vdd.]
An “on” n-FET empties the bucket.
[Figure: two plots of water level versus time, with the levels labeled “0” and “1”.]
(Cartoon physics)
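The Switch – Dynamic Model (Simplified)
[Figure: a switch with terminals S, D, and G controlled by |VGS|. When |VGS| ≥ |VT|, the switch is modeled as a resistance Ron between S and D, with capacitances CG, CS, and CD on the gate, source, and drain.]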
Switch Sizing
What happens if we make a switch W times larger (wider)?
[Figure: the same switch model scaled by W. When |VGS| ≥ |VT|, the on-resistance becomes Ron/W and the capacitances become W·CG, W·CS, and W·CD.]
Switch Parasitic Model
The pull-down switch (NMOS)
• Minimum-size switch: input (Vin) capacitance CG; output (Vout) capacitance CD; on-resistance RN.
• Sizing the transistor (factor W): input capacitance W·CG; output capacitance W·CD; on-resistance RN/W.
We assume transistors of minimal length (or at least constant length). R’s and C’s are in units of per unit width.
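Switch Parasitic Model
The pull-up switch (PMOS)
• Minimum-size switch: input (Vin) capacitance CG; output (Vout) capacitance CD; on-resistance RP = 2RN.
• Sized for symmetry: input capacitance 2CG; output capacitance 2CD; on-resistance RN.
• General sizing: input capacitance 2W·CG; output capacitance 2W·CD; on-resistance RN/W.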
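Inverter Parasitic Model
[Figure: inverter with input Vin and output Vout; the pull-up and pull-down switches each have on-resistance RN/W.]
Cin = 3WCG
Cint = 3WCD = 3WγCG
Drain and gate capacitance of transistor are directly related by process (γ≈1): CD = γCG
$t_p = 0.69\left(\frac{R_N}{W}\right)(3W\gamma C_G) = 0.69\,(3\gamma)\,R_N C_G$
Intrinsic delay of inverter independent of size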
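A small numerical check of the claim that the intrinsic delay does not depend on W, using the parasitic model above (drive resistance RN/W, internal capacitance 3WγCG). The values for RN and CG are hypothetical placeholders, as is the function name.

```python
def intrinsic_delay(w, r_n, c_g, gamma=1.0):
    """Unloaded inverter delay: 0.69 * (R_N / W) * (3 * W * gamma * C_G)."""
    r_drive = r_n / w            # wider transistors have lower on-resistance
    c_int = 3 * w * gamma * c_g  # but proportionally more drain capacitance
    return 0.69 * r_drive * c_int

# Hypothetical unit-transistor values: R_N = 10 kOhm, C_G = 1 fF.
delays = [intrinsic_delay(w, 10e3, 1e-15) for w in (1, 2, 4, 16)]
print(delays)
```

Every entry equals 0.69 · 3γ · RN · CG, because the factor W cancels between the resistance and the capacitance.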
Turning Rise/Fall Delay into Gate Delay
• Cascaded gates:
[Figure: chain of cascaded gates annotated with logic values 1 and 0, and the “transfer curve” for an inverter.]
The Switch Inverter: Transient Response
[Figure: switch model of the inverter with input Vin driving a load Cin, (a) low-to-high and (b) high-to-low.]
tpHL = f(Ron CL) = 0.69 Rn CL
V(t) = V0 e^(–t/RC), t1/2 = ln(2) × RC
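The 0.69 factor is ln(2) rounded: it is the time for an RC discharge to reach half of its starting voltage. A minimal sketch that checks this numerically; the R, C, and V0 values are hypothetical.

```python
import math

def v_out(t, v0, r, c):
    """Voltage on a capacitor c discharging through resistance r from v0."""
    return v0 * math.exp(-t / (r * c))

def half_time(r, c):
    """Time for the exponential to fall to half of its starting value."""
    return math.log(2) * r * c

# Hypothetical values: 10 kOhm on-resistance, 5 fF load, 1.0 V supply.
R, C, V0 = 10e3, 5e-15, 1.0
t_half = half_time(R, C)
v_at_half = v_out(t_half, V0, R, C)

print(t_half, v_at_half)
```

Evaluating the exponential at t1/2 returns V0/2, which is the usual 50% point used to define propagation delay.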
More on CL
‣ Everything that connects to the output of a logic gate (or transistor) contributes capacitance:
  ‣ Transistor drains
  ‣ Interconnection (wires/contacts/vias)
  ‣ Transistor gates
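Inverter with Load Capacitance
[Figure: inverter with input Vin and output Vout driving a load CL; the pull-up and pull-down switches each have on-resistance RN/W.]
Cin = 3WCG
Cint = 3WγCG
$$
\begin{aligned}
t_p &= 0.69\,\frac{R_N}{W}\,(C_{int} + C_L) \\
    &= 0.69\,\frac{R_N}{W}\,(3W\gamma C_G + C_L) \\
    &= 0.69\,(3 R_N C_G)\left(\gamma + \frac{C_L}{C_{in}}\right) \\
    &= t_{inv}\left(\gamma + \frac{C_L}{C_{in}}\right) = t_{inv}\,(\gamma + f)
\end{aligned}
$$
f = fanout = ratio between load and input capacitance of gate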
Inverter Delay Model
tp = tinv(γ + f)
[Figure: delay versus fanout f, a straight line with intercept γ.]
tinv is a technology constant
▪ Can be dropped from expression
▪ Delay becomes a unit-less variable (expressed in unit delays): t’p = γ + f
Question: how does transistor sizing (W) impact delay?
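One way to explore the sizing question is to evaluate tp = tinv(γ + f) for a fixed load while W varies. This is a sketch under the inverter model above (Cin = 3WCG); the numeric values and function names are hypothetical.

```python
def normalized_delay(f, gamma=1.0):
    """Unit-less delay t'p = gamma + f."""
    return gamma + f

def inverter_delay(w, c_load, c_g, t_inv, gamma=1.0):
    """tp = t_inv * (gamma + C_L / C_in), with C_in = 3 * W * C_G."""
    c_in = 3 * w * c_g
    return t_inv * (gamma + c_load / c_in)

fo4 = normalized_delay(4)  # fan-out of 4: delay of gamma + 4 unit delays

# Hypothetical fixed 48 fF load, C_G = 1 fF, t_inv = 5 ps.
for w in (1, 2, 4, 8):
    print(w, 3 * w * 1e-15, inverter_delay(w, 48e-15, 1e-15, 5e-12))
```

A wider inverter drives the fixed load faster, approaching tinv·γ, but its input capacitance grows with W, so the delay of the stage driving it increases.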
Wire Delay
• Ideally, wires behave as “transmission lines”:
  – signal wave-front moves close to the speed of light
    • ~1ft/ns
  – Time from source to destination is called the “transit time”.
  – In ICs most wires are short, and the transit times are relatively short compared to the clock period and can be ignored.
  – Not so on PC boards.
Wires
‣ As parallel plate capacitors: C ∝ Area = width ∗ length
‣ Wires have finite resistance, so have distributed R and C:
with r = res/length, c = cap/length, ∆ ∝ rcL² ≅ rc + 2rc + 3rc + ...
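The sum rc + 2rc + 3rc + ... can be read as a lumped approximation: split the wire into short segments, and charge the k-th segment's capacitance through the k segments of resistance in front of it. The sketch below assumes that reading; the function name and unit values are hypothetical.

```python
def distributed_rc_delay(r_per_len, c_per_len, length, segments):
    """Sum over segments of (upstream resistance) x (segment capacitance)."""
    d = length / segments
    return sum((k * r_per_len * d) * (c_per_len * d) for k in range(1, segments + 1))

# Unit r and c so that the result is in multiples of r*c*L^2.
one_l = distributed_rc_delay(1.0, 1.0, 1.0, 1000)
two_l = distributed_rc_delay(1.0, 1.0, 2.0, 1000)

print(one_l, two_l, two_l / one_l)
```

The sum tends to rcL²/2 as the segments get finer, and doubling the length quadruples the delay, which is the quadratic dependence stated above.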
Wire Delay (Spring 2003 EECS150 – Lec10-Timing Page 16)
• Even in those cases where the transmission line effect is negligible:
  – Wires possess distributed resistance and capacitance.
  – Time constant associated with distributed RC is proportional to the square of the length.
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
  – Typically around half of C of gate load is in the wires.
• For long wires on ICs:
  – busses, clock lines, global control signals, etc.
  – Resistance is significant, therefore distributed RC effect dominates.
  – Signals are typically “rebuffered” to reduce delay.
[Figure: a long wire with nodes v1, v2, v3, v4, and plots of v1, v2, v3, v4 versus time.]
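Why rebuffering helps can be sketched with a deliberately simplified model: cutting a wire into n equal pieces reduces the quadratic wire term by a factor of n, while each piece adds one buffer delay. The model ignores buffer input capacitance and drive resistance, and the numbers and function name are hypothetical.

```python
def rebuffered_delay(unbuffered_wire_delay, n_segments, t_buffer):
    """Total delay of n_segments wire pieces, each driven by one buffer.

    Each piece has 1/n of the length, so 1/n^2 of the distributed RC delay;
    n pieces in series give unbuffered_wire_delay / n in total.
    """
    return unbuffered_wire_delay / n_segments + n_segments * t_buffer

# Hypothetical long wire: 800 ps of distributed RC delay, 50 ps per buffer.
delays_ps = {n: rebuffered_delay(800.0, n, 50.0) for n in range(1, 9)}
best_n = min(delays_ps, key=delays_ps.get)

print(delays_ps)
print(best_n, delays_ps[best_n])
```

With these numbers the total falls from 850 ps with a single driver to 400 ps with four segments, then rises again as buffer delay starts to dominate.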
UC Regents Spring 2018 © UCB EECS 151
Recall: Positive edge-triggered flip-flop
A flip-flop “samples” right before the edge, and then “holds” value.
Delay in Flip-flops (Spring 2003 EECS150 – Lec10-Timing Page 14)
• Setup time results from delay through first latch.
• Clock to Q delay results from delay through second latch.
[Figure: flip-flop (D, clk, Q) built from two latches clocked by clk and clk’. The first latch is the sampling circuit; the second latch holds the value. The timing diagram marks the setup time and the clock to Q delay.]
Sensing: When clock is low
A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: the two-latch flip-flop with clk = 0, clk’ = 1.]
• Will capture new value on posedge.
• Outputs last value captured.
Capture: When clock goes high
A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: the two-latch flip-flop with clk = 1, clk’ = 0.]
• Remembers value just captured.
• Outputs value just captured.
Flip Flop delays:
[Figure: the two-latch flip-flop (D, Q, CLK) with a timing diagram labeled setup, hold, and clk-to-Q.]
• CLK == 0: Sense D, but Q outputs old value.
• CLK 0->1: Capture D, pass value to Q.
• clk-to-Q? setup? hold?
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb in-
structions. A simplified diagram of the processor pipeline is
Fig. 2. Microprocessor pipeline organization.
shown in Fig. 2, where the state boundaries are indicated by
gray. Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
Timing Analysis and Logic Delay
If our clock period T > worst-case delay through CL, does this ensure correct operation?
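A small sketch of why T > worst-case CL delay is not sufficient by itself: the clock-to-Q and setup times also come out of the period. The delay values and the function name are hypothetical, and hold-time requirements are not modeled here.

```python
def setup_slack(period, t_clk_to_q, t_logic, t_setup):
    """Positive or zero slack means the setup constraint is met."""
    return period - (t_clk_to_q + t_logic + t_setup)

# Hypothetical delays in ps: the period exceeds the logic delay in both cases.
too_tight = setup_slack(period=500.0, t_clk_to_q=60.0, t_logic=450.0, t_setup=40.0)
just_met = setup_slack(period=550.0, t_clk_to_q=60.0, t_logic=450.0, t_setup=40.0)

print(too_tight, just_met)
```

The 500 ps clock is longer than the 450 ps logic delay yet misses setup by 50 ps; the period has to reach 550 ps before the constraint is met.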
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz, Lec3.9 to Lec3.12

General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth-table, logic equation, VHDL
  • Input load factor of each input
  • Propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ..., X and output load Cout; plot of delay (Va -> Vout) versus Cout showing the internal delay, the delay per unit load, and Ccritical.]

Storage Element’s Timing Model
° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
  • Output cannot change instantaneously at the trigger clock edge
  • Similar to delay in logic gates, two components:
    - Internal Clock-to-Q
    - Load dependent Clock-to-Q
[Figure: timing diagram of Clk, D, and Q. D is “Don’t Care” outside the Setup and Hold window around the clock edge; Q is Unknown during the Clock-to-Q interval.]

Clocking Methodology
° All storage elements are clocked by the same clock edge
° The combination logic blocks:
  • Inputs are updated at each clock tick
  • All outputs MUST be stable before the next clock tick
[Figure: combination logic between storage elements that share the clock Clk.]

Critical Path & Cycle Time
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° must be greater than: Clock-to-Q + Longest Path through Combination Logic + Setup
[Figure: logic paths between storage elements that share the clock Clk.]

Register: An Array of Flip-Flops
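The linear cell delay model (fixed internal delay plus load-dependent delay times load) composes along a path by summing the per-cell delays. A minimal sketch of that composition; the cell parameters and function names are hypothetical.

```python
def cell_delay(internal_delay, delay_per_unit_load, load):
    """Linear model: fixed internal delay + load-dependent delay x load."""
    return internal_delay + delay_per_unit_load * load

def path_delay(stages):
    """Sum the linear-model delay of each cell along one path.

    stages is a list of (internal_delay, delay_per_unit_load, load) tuples.
    """
    return sum(cell_delay(*stage) for stage in stages)

# Hypothetical two-cell path, delays in ps and loads in unit loads.
stages = [(30.0, 10.0, 2), (45.0, 15.0, 4)]
logic_ps = path_delay(stages)
cycle_ps = 60.0 + logic_ps + 40.0  # clock-to-Q + longest logic path + setup

print(logic_ps, cycle_ps)
```

If this were the slowest path between two storage elements, the cycle time would have to exceed the printed sum of clock-to-Q, logic delay, and setup.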
Combinational Logic
32
UC Regents Spring 2018 © UCBEECS 151
Flip-Flop delays eat into “time budget”
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Spring 2003 EECS150 – Lec10-Timing Page 7
Example
• Parallel to serial converter:
[Figure: parallel-to-serial converter with data inputs a and b and clock clk.]
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τclk→Q + τmux + τsetup
ALU “time budget”
Spring 2003 EECS150 – Lec10-Timing Page 8
General Model of Synchronous Circuit
• In general, for correct operation:
T ≥ time(clk→Q) + time(CL) + time(setup)
T ≥ τclk→Q + τCL + τsetup
for all paths.
• How do we enumerate all paths?
– Any circuit input or register output to any register input or circuit output.
– “setup time” for circuit outputs depends on what they connect to.
– “clk-Q time” for circuit inputs depends on where they come from.
[Figure: general synchronous circuit with input, CL, reg, CL, reg, and output stages, a common clock input, and optional feedback.]
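One way to picture the path enumeration is a depth-first walk over a small netlist graph. The sketch below is a toy illustration: the netlist, the node names (`r1.Q`, `in_a`, `out_y`, and so on), and every delay are hypothetical, and a single clk-to-Q and setup value is assumed for all start and end points, although the notes above say these depend on what the circuit connects to.

```python
# Enumerate every start-to-end path in a small synchronous circuit and find
# the clock period bound. Netlist, names, and delays (ns) are hypothetical.

CLK_TO_Q = 0.2   # assumed for all registers and for circuit inputs
SETUP = 0.1      # assumed for all registers and for circuit outputs

# Combinational nodes and their delays.
delay = {"mux": 0.4, "add": 1.0, "and": 0.3}

# Start points (register outputs, circuit inputs) feed combinational nodes,
# which eventually feed end points (register inputs, circuit outputs).
fanout = {
    "in_a": ["mux"],
    "r1.Q": ["mux", "add"],
    "mux": ["add", "r1.D"],
    "add": ["and"],
    "and": ["r2.D", "out_y"],
}
starts = ["in_a", "r1.Q"]
ends = {"r1.D", "r2.D", "out_y"}

def enumerate_paths(node, prefix=()):
    """Yield every path from node to an end point (depth-first, acyclic logic assumed)."""
    prefix = prefix + (node,)
    if node in ends:
        yield prefix
        return
    for nxt in fanout.get(node, []):
        yield from enumerate_paths(nxt, prefix)

def path_bound(path):
    """T >= clk-to-Q + sum of combinational delays + setup for one path."""
    return CLK_TO_Q + sum(delay.get(n, 0.0) for n in path) + SETUP

paths = [p for s in starts for p in enumerate_paths(s)]
critical = max(paths, key=path_bound)
print(len(paths))  # 8
print(critical, round(path_bound(critical), 3))
```

Explicit enumeration grows quickly with circuit size, so this only illustrates the definition of "all paths" on a toy example.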
33
Clock skew also eats into “time budget”
Spring 2003 EECS150 – Lec10-Timing Page 18
Clock Skew (cont.)
• If clock period T = TCL + Tsetup + Tclk→Q (no margin for skew), the circuit will fail.
• Therefore:
1. Control clock skew
a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wire delay and buffer delay.
b) Don’t “gate” clocks.
2. T ≥ TCL + Tsetup + Tclk→Q + worst-case skew.
• Most modern large high-performance chips (microprocessors) control end-to-end clock skew to a few tenths of a nanosecond.
[Figure: registers on either side of CL clocked by CLK and CLK’; waveforms of CLK and CLK’ showing clock skew (delay in distribution).]
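The skew-adjusted bound can be checked with a short calculation. The delay values below (in ns) are assumptions picked for the example.

```python
# Clock period bound with worst-case skew (illustrative values in ns).

def min_period_with_skew(t_cl, t_setup, t_clk_to_q, worst_case_skew):
    """T >= TCL + Tsetup + Tclk->Q + worst-case skew."""
    return t_cl + t_setup + t_clk_to_q + worst_case_skew

no_skew = min_period_with_skew(2.0, 0.25, 0.25, 0.0)
with_skew = min_period_with_skew(2.0, 0.25, 0.25, 0.25)
print(no_skew)    # 2.5
print(with_skew)  # 2.75
```

In this example a quarter of a nanosecond of skew costs 10% of the period, which is why skew is controlled so tightly.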
Spring 2003 EECS150 – Lec10-Timing Page 19
Clock Skew (cont.)
• Note reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!
[Figure: registers on either side of CL clocked by CLK and CLK’, with the buffer reversed; waveforms of CLK and CLK’ showing clock skew (delay in distribution).]
As T → 0, which circuit fails first?
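A rough sketch of why this is "risky business", using signed skew. The hold-time check is a standard companion constraint that is assumed here and does not appear on the slide; all values are illustrative picoseconds.

```python
# Setup and hold slack with signed clock skew (illustrative values in ps).
# skew = (clock arrival at capturing register) - (clock arrival at launching register).
# The hold check is an assumed standard constraint, not taken from the slide.

def setup_slack(T, skew, clk_to_q, cl_max, setup):
    """Positive skew delays the capture edge and so adds to the usable period."""
    return (T + skew) - (clk_to_q + cl_max + setup)

def hold_slack(skew, clk_to_q, cl_min, hold):
    """Positive skew also gives new data more time to race through to the capture register."""
    return (clk_to_q + cl_min) - (hold + skew)

T = 1000
print(setup_slack(T, skew=0, clk_to_q=100, cl_max=900, setup=50))    # -50: fails
print(setup_slack(T, skew=100, clk_to_q=100, cl_max=900, setup=50))  # 50: passes
print(hold_slack(skew=0, clk_to_q=100, cl_min=20, hold=30))          # 90: passes
print(hold_slack(skew=100, clk_to_q=100, cl_min=20, hold=30))        # -10: fails
```

In this example the same 100 ps of skew that rescues the setup check breaks the hold check on a short path, and a hold failure cannot be fixed by slowing the clock.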
34
UC Regents Fall 2013 © UCBCS 250 L3: Timing
the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15 200 clock pins on the chip, regardless of proximity.
From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.
Design methodology and results
This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed
Figure 6. Schematic diagram of global clock generation and distribution. [Diagram labels: PLL, bypass, reference clock in, reference clock out, clock distribution, clock out.]
Figure 7. 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines. [Plot labels: delay, x, y; grid, tuned sector trees, sector buffers, buffer level 2, buffer level 1.]
Figure 8. Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale. [Plot labels: delay, x, y; multiple-fingered transmission line.]
J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002
32
Clock Tree Delays, IBM “Power” CPU
35
Clock Tree Delays, IBM Power
clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid. At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved to guarantee feasibility and speed the design process.
Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a “correct-by-design” clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.
Circuit design
The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic, an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.
The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology, including uncertainties associated with the modeling of the floating-body effect [21–23] and its impact on noise immunity [22, 24–27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.
Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.
The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description
Figure 9. Global clock waveforms showing 20 ps of measured skew. [Plot: volts (V), 0.0 to 1.5, versus time (ps), 0 to 2500; annotation: 20 ps skew.]
IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002 J. D. WARNOCK ET AL.
33
36
Spring 2018 EECS151 Page
Clock Skew (cont.)
• Note reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!
[Figure: registers on either side of CL clocked by CLK and CLK’, with the buffer reversed; waveforms of CLK and CLK’ showing clock skew (delay in distribution).]
37
Components of Path Delay
1. # of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength
38
Who controls the delay?
Delay component | Foundry engineer (TSMC) | Library developer (Artisan) | CAD tools (DC, IC Compiler) | Designer (you!)
1. # of levels | | | synthesis | RTL
2. Internal cell delay | physical parameters | cell topology, trans. sizing | cell selection |
3. Wire delay | physical parameters | | place & route | layout generator
4. Cell input capacitance | physical parameters | cell topology, trans. sizing | cell selection | instantiation
5. Cell fanout | | | synthesis | RTL
6. Cell drive strength | physical parameters | transistor sizing | cell selection | instantiation
39
From Delay Models to Timing Analysis
f        T
1 MHz    1 μs
10 MHz   100 ns
100 MHz  10 ns
1 GHz    1 ns
Timing Analysis: What is the smallest T that produces correct operation? Or, can we meet a target T?
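Both questions reduce to comparing a period against the path bound. A small sketch, with hypothetical path delays in ns:

```python
# Convert clock frequency to period, and test a target period against a path.
# The path delays (ns) are hypothetical.

def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency given in MHz."""
    return 1e3 / freq_mhz

def meets_target(target_ns, clk_to_q, cl, setup):
    """True if T >= clk-to-Q + CL + setup holds for the target period."""
    return target_ns >= clk_to_q + cl + setup

for f in (1, 10, 100, 1000):
    print(f, "MHz ->", period_ns(f), "ns")

print(meets_target(period_ns(100), 0.5, 8.0, 0.5))   # True: 9 ns fits in 10 ns
print(meets_target(period_ns(1000), 0.5, 8.0, 0.5))  # False: 9 ns does not fit in 1 ns
```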
40
Timing Closure: Searching for and beating down the critical path
Must consider all connected register pairs and paths, plus paths from input to register and from register to output.
• Design tools help in the search.
• Synthesis tools work to meet the clock constraint and report delays on paths.
– Special static timing analyzers accept a design netlist and report path delays,
– and, of course, simulators can be used to determine timing performance.
Tools that are expected to do something about the timing behavior (such as synthesizers) also include provisions for specifying input arrival times (relative to the clock) and output requirements (set-up times of the next stage).
41
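To make the idea of a static timing analyzer concrete, here is a minimal sketch that propagates arrival times through a tiny combinational netlist and reports slack at the outputs. It does not reproduce any real tool: the gate names, delays, input arrival times, and required times are all hypothetical.

```python
# Minimal static timing analysis sketch over a combinational netlist (a DAG).
# Gate names, delays, arrival times, and required times are hypothetical (ps).

gate_delay = {"g1": 100, "g2": 150, "g3": 80}
# fanin[g] lists the signals feeding gate g (primary inputs or other gates).
fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g1"]}

# Input arrival times relative to the clock edge (e.g. clk-to-Q of the source).
arrival = {"a": 50, "b": 120, "c": 300}
# Required times at outputs: clock period minus setup time of the next stage.
required = {"g2": 500, "g3": 500}

def arrival_time(sig):
    """Latest arrival at sig: worst fan-in arrival plus the gate's own delay (memoized)."""
    if sig not in arrival:
        arrival[sig] = max(arrival_time(s) for s in fanin[sig]) + gate_delay[sig]
    return arrival[sig]

slack = {out: required[out] - arrival_time(out) for out in required}
print(arrival["g2"], arrival["g3"])  # 450 300
print(slack)                         # {'g2': 50, 'g3': 200}
```

Unlike explicit path enumeration, this visits each gate once, which is roughly why static analysis scales to netlists with millions of checks; negative slack on any output would mean the target period is not met.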
Timing Analysis, real example
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.
[Flowchart; recoverable labels: core or chip wiring; extraction; extracted units (flat or hierarchical), incrementally extracted RLMs, custom NDRs, VIMs; Chipbench/EinsTimer (two runs); non-uplift timing; uplift analysis; capacitance adjust; noise impact on timing; Spice; GL/1; timer files, reports, asserts; analysis/update (wires, buffers). Turnaround annotations: < 12 hr, < 24 hr, < 48 hr.]
Notes:
• Executed 2–3 months prior to tape-out
• Fully extracted data from routed designs
• Hierarchical extraction
• Custom logic handled separately
• Dracula
• Harmony
• Extraction done for: early, late
Figure 26: Histogram of the POWER4 processor path delays.
[Plot: x-axis is timing slack (ps), from −40 to 280; y-axis is late-mode timing checks (thousands), from 0 to 200.]
Most paths have hundreds of picoseconds to spare.
The critical path
42
Timing Optimization
As an ASIC/FPGA designer you get to choose:
‣ The algorithm
‣ The microarchitecture (block diagram)
‣ The RTL description of the CL blocks (number of levels of logic)
‣ Where to place registers and memory (the pipelining)
‣ Overall floorplan and relative placement of blocks
43
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver*, UC Berkeley, Berkeley, CA
Yury Markovskiy, UC Berkeley, Berkeley, CA
Yatish Patel, UC Berkeley, Berkeley, CA
John Wawrzynek, UC Berkeley, Berkeley, CA
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
*Please address any correspondence to [email protected]
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FPGA’03, February 23–25, 2003, Monterey, California, USA.
Copyright 2003 ACM 1-58113-651-X/03/0002 ...$5.00.
1. Introduction
Leiserson’s retiming algorithm[7] offers a polynomial time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools[10][2] and commercial synthesis tools[13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can’t benefit from retiming.
Also proposed by Leiserson et al. to meet the constraints of systolic computation is C-slow retiming.¹ In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip-flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson’s retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
¹ This was originally defined to meet systolic slowdown requirements.
How to retime logic
[Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5. Nodes: IN, OUT, and logic nodes with delays 1, 1, 1, 1, 2, 2.]
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson’s retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson’s original paper[7].
[Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4. Nodes: IN, OUT, and logic nodes with delays 1, 1, 1, 1, 2, 2.]
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints to calculate the lag on each node. All these constraints are of the form $x - y \le k$, which can be solved in $O(n^2)$ time by using the Bellman/Ford shortest path algorithm. The primary constraints ensure correctness: no edge will have a negative number of registers, while every cycle will always contain the original number of registers. All I/O passes through an intermediate node, ensuring that input and output timings do not change. These constraints can be modified to ensure that a particular line will contain no registers or a mandatory minimum number of registers to meet architectural constraints.
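A system of constraints of the form x − y ≤ k maps onto a shortest-path problem: each constraint becomes an edge from y to x with weight k, and Bellman/Ford either yields a satisfying assignment or finds a negative cycle, meaning the system is infeasible. The following is a minimal Python sketch of that idea only; the function name and the toy constraints are illustrative assumptions and are not taken from the paper or its tool.

```python
def solve_difference_constraints(variables, constraints):
    """Solve a system of constraints of the form x - y <= k.

    variables:   iterable of variable names (hypothetical node labels).
    constraints: list of (x, y, k) tuples, each meaning x - y <= k.
    Returns a dict mapping each variable to a value that satisfies every
    constraint, or None if the system is infeasible.
    """
    # A virtual source with a zero-weight edge to every variable gives
    # each variable an initial distance of 0.
    dist = {v: 0 for v in variables}
    # Each constraint x - y <= k is treated as an edge y -> x of weight k.
    for _ in range(len(dist) + 1):
        changed = False
        for x, y, k in constraints:
            if dist[y] + k < dist[x]:
                dist[x] = dist[y] + k
                changed = True
        if not changed:
            return dist  # shortest-path distances satisfy all constraints
    return None  # still relaxing after |V|+1 passes: negative cycle


# Toy system (assumed values): a - b <= 1, b - c <= -2, c - a <= 3.
toy = [("a", "b", 1), ("b", "c", -2), ("c", "a", 3)]
solution = solve_difference_constraints("abc", toy)
```

In a retiming setting the variables would be the node lags r, and the constants k would come from the register counts and path delays described in the text.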
A second set of constraints attempts to ensure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge $e$ from node $u$ to node $v$, a new register weight $w'$ is calculated, with $w'(e) = w(e) - r(u) + r(v)$.
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
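To make the edge-weight update concrete, here is a small Python sketch that applies a given set of lags to a graph and recomputes the critical path. The four-node ring used here is a hypothetical circuit chosen for illustration; it is not the graph of Figures 1 and 2, and the lag values are picked by hand rather than computed by the retiming algorithm.

```python
from functools import lru_cache


def retime(edges, lag):
    """Apply w'(e) = w(e) - r(u) + r(v) to every edge (u, v, w)."""
    new_edges = []
    for u, v, w in edges:
        w_new = w - lag.get(u, 0) + lag.get(v, 0)
        if w_new < 0:
            raise ValueError(f"edge {u}->{v} would get {w_new} registers")
        new_edges.append((u, v, w_new))
    return new_edges


def critical_path(delay, edges):
    """Longest total node delay along any path of register-free edges.

    Assumes there is no cycle made only of zero-register edges
    (no combinational loop); otherwise the recursion would not end.
    """
    succ = {n: [] for n in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(n):
        return delay[n] + max((longest_from(m) for m in succ[n]), default=0)

    return max(longest_from(n) for n in delay)


# Hypothetical circuit: host -> a -> b -> c -> host.
delay = {"host": 0, "a": 1, "b": 2, "c": 2}
edges = [("host", "a", 2), ("a", "b", 0), ("b", "c", 0), ("c", "host", 1)]

before = critical_path(delay, edges)      # path a -> b -> c
retimed = retime(edges, {"a": -1})        # move one register across a
after = critical_path(delay, retimed)     # path b -> c
```

With these assumed numbers the critical path drops from 5 to 4, and the number of registers around the cycle is the same before and after, which matches the correctness condition stated in the text. The host node keeps a lag of 0 so that input and output timing is untouched.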
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is now excluded from retiming purposes. Local synchronous resets and enables just produce small, self loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can’t have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Circles are combinational logic, labelled with delays.
Critical path is 5. We want to improve it without changing circuit semantics.
Add a register, move one circle. Performance improves by 20%.
Logic Synthesis tools can do this in simple cases.
44
UC Regents Spring 2018 © UCBEECS 151
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Floorplanning: essential to meet timing.
(Intel XScale 80200)
45
EECS151, UC Berkeley Sp18
Timing Analysis Tools
‣ Static Timing Analysis: Tools use delay models for gates and interconnect. Traces through circuit paths.
  ‣ Cell delay models capture:
    ‣ For each input/output pair, internal delay (output load independent)
    ‣ Output-load-dependent delay
‣ Standalone tools (PrimeTime) and part of logic synthesis.
‣ Back-annotation takes information from results of place and route to improve accuracy of timing analysis.
‣ DC in “topographical mode” uses preliminary layout information to model interconnect parasitics.
  ‣ Prior versions used a simple fan-out model of gate loading.
[Plot: delay versus output load]
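The cell delay model in the bullets (a load-independent internal delay plus a term that grows with output load) can be sketched in a few lines of Python, together with the path tracing a static timing analyzer performs. This is a simplified illustration under stated assumptions: it uses a linear load term and one delay per gate rather than one per input/output pair, and every gate name, delay, and capacitance value is made up for the example.

```python
def gate_delay_ps(intrinsic_ps, slope_ps_per_ff, load_ff):
    """Linear cell delay model: internal delay plus load-dependent delay."""
    return intrinsic_ps + slope_ps_per_ff * load_ff


def arrival_times(gates, input_arrivals):
    """Propagate worst-case arrival times through a combinational netlist.

    gates: list of (output_net, input_nets, intrinsic_ps, slope_ps_per_ff,
           load_ff) tuples, listed in topological order.
    input_arrivals: dict of primary-input net -> arrival time in ps.
    """
    arrival = dict(input_arrivals)
    for out, ins, intrinsic, slope, load in gates:
        latest_input = max(arrival[i] for i in ins)
        arrival[out] = latest_input + gate_delay_ps(intrinsic, slope, load)
    return arrival


# Hypothetical three-gate netlist (all numbers are assumptions).
gates = [
    ("n1", ["a", "b"], 10, 2, 5),    # NAND2 driving 5 fF:  10 + 2*5  = 20 ps
    ("n2", ["n1"], 8, 2, 10),        # INV driving 10 fF:    8 + 2*10 = 28 ps
    ("y", ["n2", "c"], 12, 3, 4),    # NOR2 driving 4 fF:   12 + 3*4  = 24 ps
]
arrival = arrival_times(gates, {"a": 0, "b": 0, "c": 30})
```

Here the latest arrival at `y` comes through `n2` rather than from input `c`, so the path a/b → n1 → n2 → y is the one a timing report would flag. Back-annotation would replace the assumed load values with ones extracted from layout.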
46
EECS151, UC Berkeley Sp18
Hold-time Violations
‣ Some state elements have positive hold time requirements.
  ‣ How can this be?
‣ Fast paths from one state element to the next can create a violation. (Think about shift registers!)
‣ CAD tools do their best to fix violations by inserting delay (buffers).
  ‣ Of course, if the path is delayed too much, then cycle time suffers.
  ‣ Difficult because buffer insertion changes layout, which changes path delay.
[Figure: flip-flop (FF) with inputs d and clk and output q]
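A rough Python sketch of the check a tool performs on a fast path is given here. It assumes the usual hold condition (minimum clk-to-q delay plus minimum logic delay must be at least the hold time) and the setup condition T ≥ τclk→Q + τCL + τsetup from earlier in the lecture; the function names and all delay values are hypothetical.

```python
import math


def hold_slack_ps(clk_to_q_min, logic_min, hold):
    """Positive slack means the hold requirement is met."""
    return clk_to_q_min + logic_min - hold


def buffers_needed(slack_ps, buffer_delay_ps):
    """Number of delay buffers to insert to bring hold slack up to zero."""
    if slack_ps >= 0:
        return 0
    return math.ceil(-slack_ps / buffer_delay_ps)


def setup_slack_ps(period, clk_to_q_max, logic_max, setup):
    """Positive slack means the path still fits in the clock period."""
    return period - (clk_to_q_max + logic_max + setup)


# Shift-register stage: no logic between flip-flops (assumed numbers, in ps).
slack = hold_slack_ps(clk_to_q_min=30, logic_min=0, hold=50)
n_buf = buffers_needed(slack, buffer_delay_ps=15)
fixed_slack = hold_slack_ps(clk_to_q_min=30, logic_min=n_buf * 15, hold=50)
# The inserted delay also lengthens the path seen by the setup check.
setup_after_fix = setup_slack_ps(period=1000, clk_to_q_max=60,
                                 logic_max=n_buf * 15, setup=40)
```

The sketch ignores the point made in the last bullet: in a real flow the inserted buffers change the layout, so the path delays have to be re-extracted and the check repeated.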
47
Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus large gate delay)
‣ Strategy: staged buffers
‣ How to optimally scale drivers?
‣ Optimal trade-off between delay per stage and total number of stages?
48
EE141
Inverter Chain
[Figure: chain of inverters from In to Out driving load CL]
❑ For some given CL:
  ▪ How many stages are needed to minimize delay?
  ▪ How to size the inverters?
❑ Anyone want to guess the solution?
49
EE141
Careful about Optimization Problems
❑ Get fastest delay if build one very big inverter
  ▪ So big that delay is set only by self-loading
❑ Likely not the problem you’re interested in
  ▪ Someone has to drive this inverter…
[Figure: single large inverter driving Cload]
50
EE141
Delay Optimization Problem #1
❑ You are given:
  ▪ A fixed number of inverters
  ▪ The size of the first inverter
  ▪ The size of the load that needs to be driven
❑ Your goal:
  ▪ Minimize the delay of the inverter chain
❑ Need model for inverter delay vs. size
51
EE141
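Apply to Inverter Chain
[Figure: chain of N inverters, numbered 1, 2, ..., N, from In to Out, driving load $C_L$]
$t_p = t_{p,1} + t_{p,2} + \dots + t_{p,N}$
$t_{p,j} = t_{inv}\left(\gamma + \frac{C_{in,j+1}}{C_{in,j}}\right)$
$t_p = \sum_{j=1}^{N} t_{p,j} = t_{inv} \sum_{j=1}^{N} \left(\gamma + \frac{C_{in,j+1}}{C_{in,j}}\right), \quad C_{in,N+1} = C_L$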
52
EE141
Optimal Sizing for Given N
❑ Delay equation has N-1 unknowns, $C_{in,2} \dots C_{in,N}$:
$t_p = \dots + t_{inv}\frac{C_{in,j}}{C_{in,j-1}} + t_{inv}\frac{C_{in,j+1}}{C_{in,j}} + \dots$
❑ To minimize the delay, find N-1 partial derivatives:
$\frac{\partial t_p}{\partial C_{in,j}} = t_{inv}\frac{1}{C_{in,j-1}} - t_{inv}\frac{C_{in,j+1}}{C_{in,j}^2} = 0$
53
EE141
Optimal Sizing for Given N (cont’d)
❑ Result: every stage has equal fanout (f):
$\frac{C_{in,j}}{C_{in,j-1}} = \frac{C_{in,j+1}}{C_{in,j}}$
❑ Size of each stage is geometric mean of two neighbors:
$C_{in,j} = \sqrt{C_{in,j-1}\, C_{in,j+1}}$
❑ Equal fanout ⇒ every stage will have same delay
54
EE141
Optimum Delay and Number of Stages
❑ When each stage has same fanout f:
$f^N = F = C_L / C_{in,1}$
❑ Fanout of each stage:
$f = \sqrt[N]{F}$
❑ Minimum path delay:
$t_p = N t_{inv}\left(\gamma + \sqrt[N]{F}\right)$
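The equal-fanout result for a fixed number of stages is easy to evaluate numerically. The Python sketch below computes the per-stage fanout, the input capacitance of each inverter, and the chain delay in units of t_inv; the function name and the example values (a load of 64 unit capacitances, three stages, γ = 1) are assumptions chosen for illustration.

```python
def size_chain(c_in, c_load, n_stages, gamma=1.0):
    """Equal-fanout sizing for a chain with a fixed number of inverters.

    Returns (f, sizes, delay) where
      f     is the per-stage fanout, the N-th root of F = c_load / c_in,
      sizes is the input capacitance of each stage, starting at c_in,
      delay is the total delay in units of t_inv: N * (gamma + f).
    """
    total_fanout = c_load / c_in
    f = total_fanout ** (1.0 / n_stages)
    sizes = [c_in * f ** j for j in range(n_stages)]
    delay = n_stages * (gamma + f)
    return f, sizes, delay


# Assumed example: first inverter has unit input capacitance,
# the load is 64 units, and exactly three inverters are available.
f, sizes, delay = size_chain(c_in=1.0, c_load=64.0, n_stages=3)
```

For these values each stage is about four times larger than the one before it (sizes near 1, 4, 16), and the delay comes out near 15 t_inv, compared with 65 t_inv for a single unit inverter driving the same load under the same model.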
55
EE141
Example
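CL = 8 C1
[Figure: three-inverter chain from In to Out with stage sizes 1, f, f^2; the first inverter has input capacitance C1 and the chain drives CL]
CL/C1 has to be evenly distributed across N = 3 stages:
$f = \sqrt[3]{C_L / C_1} = \sqrt[3]{8} = 2$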
56
EE141
Delay Optimization Problem #2
❑ You are given:
  ▪ The size of the first inverter
  ▪ The size of the load that needs to be driven
❑ Your goal:
  ▪ Minimize delay by finding optimal number and sizes of gates
❑ So, need to find N that minimizes:
$t_p = N t_{inv}\left(\gamma + \sqrt[N]{C_L / C_{in}}\right)$
57
EE141
Untangling the Optimization Problem
❑ Rewrite N in terms of fanout/stage f:
$f^N = C_L / C_{in} \;\rightarrow\; N = \frac{\ln(C_L / C_{in})}{\ln f}$
$t_p = N t_{inv}\left((C_L / C_{in})^{1/N} + \gamma\right) = t_{inv} \ln(C_L / C_{in}) \left(\frac{f + \gamma}{\ln f}\right)$
$\frac{\partial t_p}{\partial f} = t_{inv} \ln(C_L / C_{in}) \cdot \frac{\ln f - 1 - \gamma / f}{\ln^2 f} = 0$
$f = \exp(1 + \gamma / f)$ (no explicit solution)
For γ = 0, f = e, N = ln(CL/Cin)
58
EE141
Optimum Effective Fanout f
$f = \exp(1 + \gamma / f)$
❑ Optimum f for given process defined by γ
[Plot: f_opt (axis from 2.5 to 5) versus γ (axis from 0 to 3); the curve starts at f_opt = e for γ = 0]
f_opt = 3.6 for γ = 1
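Since f = exp(1 + γ/f) has no closed-form solution, the optimum fanout has to be found numerically. One simple option, sketched here in Python, is fixed-point iteration starting from f = e; the function name, tolerance, and iteration limit are assumptions, and other root finders would work equally well.

```python
import math


def optimal_fanout(gamma, tol=1e-12, max_iter=1000):
    """Solve f = exp(1 + gamma / f) by fixed-point iteration.

    gamma is the process-dependent self-loading factor (gamma >= 0).
    """
    f = math.e  # exact answer when gamma == 0
    for _ in range(max_iter):
        f_next = math.exp(1.0 + gamma / f)
        if abs(f_next - f) < tol:
            return f_next
        f = f_next
    raise RuntimeError("fixed-point iteration did not converge")


f_gamma0 = optimal_fanout(0.0)
f_gamma1 = optimal_fanout(1.0)
```

For γ = 1 the iteration settles near 3.59, which agrees with the value of 3.6 quoted on the slide, and for γ = 0 it returns e. The iteration converges for the range of γ shown in the plot because the magnitude of the update's slope at the solution, γ/f, stays below 1 there.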
59
EE141
In Practice: Plot of Total Delay
❑ Why the shape?
❑ Curves very flat for f > 2
  ▪ Simplest/most common choice: f = 4
60
EE141
Normalized Delay As a Function of F
$t_p = N t_{inv}\left(\gamma + \sqrt[N]{F}\right), \quad F = C_L / C_{in}$
(γ = 1)
[Rabaey: page 210]
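The delay expression can be tabulated directly to see how much buffering helps as F grows. The Python sketch below evaluates the normalized delay t_p / t_inv for a given number of stages and searches for the best integer stage count; the function names and the search limit of 20 stages are assumptions, and the sketch ignores signal polarity (an odd number of inverters inverts the signal).

```python
def normalized_delay(total_fanout, n_stages, gamma=1.0):
    """t_p / t_inv = N * (gamma + F ** (1 / N)) for an equal-fanout chain."""
    return n_stages * (gamma + total_fanout ** (1.0 / n_stages))


def best_stage_count(total_fanout, gamma=1.0, max_stages=20):
    """Integer number of stages that minimizes the normalized delay."""
    return min(range(1, max_stages + 1),
               key=lambda n: normalized_delay(total_fanout, n, gamma))


for F in (10, 100, 1000, 10000):
    n_best = best_stage_count(F)
    print(F,
          round(normalized_delay(F, 1), 1),       # unbuffered
          round(normalized_delay(F, 2), 1),       # two stages
          n_best,
          round(normalized_delay(F, n_best), 1))  # best integer chain
```

With γ = 1, a single inverter driving F = 10 has a normalized delay of 11, while two stages bring it to roughly 8.3; the gap between the unbuffered case and the best chain widens quickly as F increases.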
61
EECS151, UC Berkeley Sp18
Conclusion
‣ Timing Optimization: You start with a target on clock period. What control do you have?
‣ Biggest effect is RTL manipulation.
  ‣ i.e., how much logic to put in each pipeline stage.
  ‣ We will be talking later about how to manipulate RTL for better timing results.
‣ In most cases, the tools will do a good job at logic/circuit level:
  ‣ Logic level manipulation
  ‣ Transistor sizing
  ‣ Buffer insertion
  ‣ But some cases may be difficult and you may need to help
‣ The tools will need some help at the floorplan and layout
62