A Survey of Clock Distribution Techniques Including Optical and RF Networks
by
Sachin Chandran
A report submitted to the Graduate Faculty of Auburn University
in partial fulfillment of the requirements for the Degree of
Master of Electrical Engineering
Auburn, Alabama
December 14, 2013
Keywords: Clock distribution networks, clock trees, clock skew, clock jitter, H-trees, optical clock, RF clock, wireless clock
Copyright 2013 by Sachin Chandran
Approved by
Vishwani Agrawal, Chair, James J. Danaher Professor of Electrical and Computer Engineering
Victor Nelson, Professor of Electrical and Computer Engineering
Adit Singh, James B. Davis Professor of Electrical and Computer Engineering
Abstract
Clock distribution has become an increasingly challenging problem for VLSI designs, consuming a growing fraction of resources such as wiring, power, and design time. Clock distribution has been reported to consume up to 40%-50% of total chip power. Unwanted differences or uncertainties in clock network delays degrade performance or cause functional errors. Although many other clock distribution techniques exist and continue to be researched, this report presents a detailed study of wireless and optical clock distribution and compares them with the conventional clock distribution technique. These techniques employ either transmitter/receiver pairs or optical fibers to distribute the clock signal across the entire chip, reducing the interconnect wiring and hence the total load capacitance of the clock distribution network. Finally, some future work on these techniques is discussed.
Acknowledgments
I would like to express my gratitude to people who helped me complete this project.
First, I must thank my advisor Dr. Vishwani Agrawal who lent me so much help during
the course of the project. When I met with problems, he would guide me to solve those
problems, either through emails or face to face meetings.
I am also deeply grateful to my committee members, Dr. Adit Singh and Dr. Victor
Nelson for their valuable courses and project which were extremely useful in understanding
the core concepts in VLSI.
I must thank my parents Anusha and Chandran for their love, prayers, and support, and for giving me more than they ever had. I am greatly indebted to my brother Sajan
Chandran, without whom I would not have pursued my graduate studies. I would like to
thank my manager Xin Ma, Intel, Hudson, MA for giving me an opportunity to work on
the Haswell Server microprocessor clock distribution team, where I worked on building clock spines and calculating the load on the clock network. I also want to thank my manager
Bill Maghielse, ARM Inc, Austin, TX for giving me an opportunity to work on gate-level simulation for the ATLAS 64-bit microprocessor and on minimizing power in the decode stage of the pipeline using various techniques. Finally and most importantly, I must thank God for
sustaining me each and every day and for being my ultimate source of strength, renewal,
and hope.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Clock Skew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Power Dissipation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.3 Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 The Clock Distribution Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Clock Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Global Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Clock Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 H-Tree Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Wireless Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Clock Transmitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Clock Receiver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Comparison of Power Consumption with Conventional Methods . . . . . . . 23
3.3.1 Grid Based System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 H-Tree Based Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Wireless System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Potential Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Areas of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Optical Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1 Intra-Chip Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Index-Guided Optical Interconnects . . . . . . . . . . . . . . . . . . . 32
4.1.2 Free-Space Optical Interconnects . . . . . . . . . . . . . . . . . . . . 34
4.2 Essential Components of an On-Chip Optical Signaling System . . . . . . . . 37
4.2.1 Optical Clock Source . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Optical Clock Distribution Tree . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 Optical Receiver: Photo Detector . . . . . . . . . . . . . . . . . . . . 40
4.3 Comparison of Power Consumption with Electrical Distribution System . . . 41
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 Wireless Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Optical Clock Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
List of Figures
1.1 Delay components of a local data path. . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Illustrations of positive and negative clock skews. . . . . . . . . . . . . . . . . . 7
2.1 A generic phase locked loop (PLL) block diagram. . . . . . . . . . . . . . . . . . 12
2.2 Buffer-tree clock distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 DEC Alpha 21064 clock buffer tree. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Symmetric H-tree clock distribution network. . . . . . . . . . . . . . . . . . . . 16
2.5 Symmetric X-tree clock distribution network. . . . . . . . . . . . . . . . . . . . 16
2.6 Half-swing clock signaling methodology. . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Intra-chip wireless interconnect system. . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Inter-chip wireless interconnect system. . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Block diagram of integrated clock transmitter. . . . . . . . . . . . . . . . . . . . 22
3.4 Block diagram of integrated clock receiver. . . . . . . . . . . . . . . . . . . . . . 23
4.1 Distribution of the clock by means of fiber. . . . . . . . . . . . . . . . . . . . . . 33
4.2 Distribution of the clock by means of integrated optical waveguides. . . . . . . . 34
4.3 Unfocused broadcast of the clock to the chip. . . . . . . . . . . . . . . . . . . . 35
4.4 Focused optical distribution of the clock using a holographic optical element. . . 36
4.5 Configuration for focused clock distribution that minimizes alignment problems. 37
4.6 An optical clock distribution system. . . . . . . . . . . . . . . . . . . . . . . . . 38
4.7 Line diagram of an optical H-tree. . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.8 Traveling wave photo detector. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.9 General approach to optical global clock distribution network. . . . . . . . . . . 41
4.10 Power consumption of optical and electrical clock distribution networks versus
the number of H-tree nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.11 Chip structure with temperature gradient. . . . . . . . . . . . . . . . . . . . . . 43
4.12 Clock skew of a 64-output-node optical H-tree compared to the clock period as
a function of technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
List of Tables
3.1 Global capacitive loading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 1
Introduction
For a sequential circuit to operate correctly, the processing of data must occur in an
orderly fashion. In a synchronous system, the order in which data are processed is coordi-
nated by a clock signal. (To be more general, the synchronization could be performed by
a collection of globally distributed clock signals). The clock signal, in the form of a pe-
riodic square wave that is globally distributed to control all sequential elements (flip-flops
or latches), achieves synchronization of the circuit operation when all data are allowed to
pass through the sequential elements simultaneously. A clock network is required to deliver
the clock signal to all sequential elements. As the distributive nature of long interconnects
becomes more pronounced because of technology scaling, the control of arrival times of the
same clock edge at different sequential elements, which are scattered over the entire chip,
becomes more difficult. If not properly controlled, the clock skew, defined as the difference
in the clock signal delays to sequential elements, can adversely affect the performance of the
systems and even cause erratic operations of the systems (e.g., latching of an incorrect signal
within a sequential element).
The design of a clock poses a formidable challenge because of stringent requirements.
A well designed clock must also account for variations in device and interconnect param-
eters. What complicates matters even further is the interplay between the clock network
and the power/ground network. For example, in a zero-skew clock network, all sequential
elements are triggered almost simultaneously. Because these elements draw current from
the power network or sink current to the ground network almost simultaneously when the
clock switches, the zero-skew design often leads to severe power supply noise, resulting in
unacceptable degradation in performance and reliability. On the other hand, power supply
noise affects the clock jitter (this refers to the time shift in the clock pulse, as well as the
variations in the pulse width that arise from the time-varying operating conditions such as
supply voltage), which, in turn, affects the arrival times of clock signal at different sequential
elements.
Currently, a digital (and electrical) clock signal is distributed using metallic intercon-
nects (e.g. Cu) throughout the entire die. There are many approaches to distribute digital
clock signals, such as H-trees, grids and some combinations of them [42, 46]. In this project,
we will consider buffered H-trees driving local grids as representative of present clock dis-
tribution, to which we will refer as the conventional approach. Advanced active de-skewing
techniques, which further improve the quality of standard clock distribution, are not consid-
ered in this simple analysis [46].
Several interconnect solutions have been proposed to mitigate the increasingly difficult
clock distribution, 3-D, optical, and RF being the most important ones. 3-D interconnects,
which refers to two or more active Si strata that are integrated together (for example by
bonding), take advantage of the vertical dimension to decrease the die size, therefore allevi-
ating clock distribution (see, for example, [5, 12, 41, 28]). Optical interconnects have also
been proposed for clock distribution since they are immune to crosstalk noise from adjacent
electrical interconnects, and because of their speed-of-light propagation. Optical intercon-
nects also have the potential for large bandwidths, which are mostly relevant for signalling
[32, 44]. RF approaches using Cu interconnects have been proposed as a low-power clock dis-
tribution alternative at the package level [43]. The main difference between the RF approach
and conventional approach is that RF uses a narrow-band sinusoidal wave for transmitting
the clock signal as opposed to the digital square signal that is employed in conventional
clock distribution. An interesting wireless RF clock distribution, in which the clock signal is
broadcast by a source and received by on-die antennae, has also been proposed and investi-
gated in [16]. This technique is attractive since it does not require interconnects. However,
noise considerations as well as the size and position of the antennae are technical and cost
issues that have to be addressed. Most of the benchmarking research has been confined to
global clock distribution. However, in order to truly assess the overall benefits of alternative
clock distribution approaches, it is also important to include the local clock distribution in
the analysis.
1.1 Definitions
1.1.1 Clock Skew
A synchronous digital system typically consists of a series of registers, between which
are blocks of combinational logic. The functional requirements of the system are met by
these blocks of combinational logic, while the registers provide the global performance and
satisfy the local timing requirements. The registers are typically inserted in such a way that
the combinational logic is partitioned into several time windows of ideally equal duration.
This partitioning, known as pipelining, enables the system to run at higher frequencies than if the entire logic path were implemented without interruption.
The operation of a single stage in a pipelined synchronous system can be broken down
as follows. Each data signal is stored in a register. The arrival of the clock signal latches a
new value of the data into the register. At this time the data propagates through the register
and into the combinational block of logic. The signal passes through this network of logic,
arrives at the input of the next register, and, for a properly functioning system, is stored on
the next latching clock edge. Thus, the components of delay through a local data path are
the delays associated with the register, the delay through the logic and interconnect between
registers, and the delays involved with the clock signal distribution.
One of the primary concerns in the design of a clock distribution network is the differ-
ence in the arrival times of the clock edge between two registers. This difference is called
clock skew, and the goal is to minimize, manage, or eliminate this term altogether. Clock
skew is most important between sequentially adjacent registers. A sequentially adjacent
pair of registers is defined as two registers between which lie only combinational logic and
interconnect. A formal definition for clock skew given by Friedman in [18] is repeated below.
Given two sequentially-adjacent registers, RI and RJ, and an equipotential clock distri-
bution network, the clock skew between these two registers is defined as TSKEW(I,J) = TCI -
TCJ, where TCI and TCJ are the clock delays from the clock source to the registers RI and
RJ, respectively.
Though the definition specifically states that clock skew is a characteristic of sequentially
adjacent registers, it exists, in fact, between any two registers, but is only truly significant
between registers that share direct communication. The presence of clock skew can severely
limit the performance of a synchronous system, and even create a race condition that causes
incorrect data to be latched in a register.
In order to minimize clock skew, it is first necessary to understand the sources of clock
skew. Wann and Franklin reported the following sources of clock skew [51]:
• differences in line lengths from the clock source to the clocked register;
• differences in delays of any active buffers within the clock distribution network;
• differences in passive interconnect parameters, such as line resistivity, dielectric con-
stant and thickness, via/contact resistance, line and fringing capacitance, and line
dimensions; and
• differences in active device parameters, such as MOS threshold voltages and channel
mobilities, which affect the delay of the active buffers.
Layout techniques and careful clock distribution design can minimize the effects of items
one and three above. In well designed systems, the distributed active clock buffers contribute
most significantly to the overall clock skew.
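The claim that the distributed buffers dominate can be made concrete with a small Monte Carlo sketch. The tree depth, nominal buffer delay, and per-stage spread below are hypothetical illustration values, not parameters of any design discussed in this report:

```python
import random

def simulate_skew(levels=5, nominal_delay_ps=100.0, sigma_ps=5.0, trials=10000):
    """Estimate the worst observed clock skew between two leaf registers
    whose clock paths each pass through `levels` independent buffer stages.
    Buffer delays are modeled as Gaussian around a nominal value; the
    interconnect is assumed perfectly matched, so only active-device
    variation contributes to skew (the dominant source noted above)."""
    worst = 0.0
    for _ in range(trials):
        path_a = sum(random.gauss(nominal_delay_ps, sigma_ps) for _ in range(levels))
        path_b = sum(random.gauss(nominal_delay_ps, sigma_ps) for _ in range(levels))
        worst = max(worst, abs(path_a - path_b))
    return worst

random.seed(0)
print(f"worst-case skew over trials: {simulate_skew():.1f} ps")
```

With the interconnect matched, the skew spread is set entirely by the per-stage buffer variation, which is why buffer matching receives so much design attention.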
The delay components of a local data path, briefly mentioned earlier, can be formalized
with the following equation. This relation details the components of the path delay, from an
initial register RI to a final register RF and is illustrated in Figure 1.1.
Figure 1.1: Delay components of a local data path.
TPD = TC-Q + TLOGIC + TINT + TSETUP     (1)
This relation states that the total delay of a path is equal to the sum of the clock to
output delay of the register, TC-Q, the total delay of the logic path between registers, TLOGIC,
the total delay of the interconnect, TINT, and the necessary setup time that must be met
at the final register for correct operation, TSETUP. This total path delay is related to the
maximum clock frequency possible between two registers by the relation:
1/fMAX = TCP(MIN) ≥ TPD + TSKEW     (2)
where fMAX is the maximum clock frequency, TCP(MIN) is the minimum cycle time, and TSKEW
is the clock skew between the two registers. Though it has not yet been mentioned, it should
be noted that clock skew can be a negative or positive value depending on whether the clock
signal at the final register lags or leads that at the initial register. This discussion leads
naturally into the effects of clock skew on system performance.
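Equations (1) and (2) translate directly into a small calculation. The delay values below are illustrative, not figures taken from any design in this report:

```python
def max_clock_frequency(t_cq, t_logic, t_int, t_setup, t_skew):
    """Equations (1) and (2): the minimum cycle time is the total path
    delay plus the (positive) clock skew, and fMAX is its reciprocal.
    All times are in seconds."""
    t_pd = t_cq + t_logic + t_int + t_setup   # equation (1)
    t_cp_min = t_pd + t_skew                  # equation (2)
    return 1.0 / t_cp_min

# Illustrative numbers: 2.5 ns of path delay plus 80 ps of skew
# limits the clock to roughly 388 MHz.
f = max_clock_frequency(t_cq=0.2e-9, t_logic=2.0e-9, t_int=0.1e-9,
                        t_setup=0.2e-9, t_skew=80e-12)
print(f"fMAX = {f/1e6:.0f} MHz")
```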
Depending on the magnitude and polarity of the clock skew, system performance can
be either enhanced or degraded. The presence of skew defines a set of equations describing
the constraints that clock skew imposes on the system timing.
Ignoring the clock skew term in equation (2), it is apparent that the maximum attainable
frequency in a system is that given by the inverse of the longest path delay between any
two sequentially adjacent registers RF and RI [50]. If the arrival of the clock signal at RF
leads that of RI then positive clock skew exists. Positive clock skew serves to increase the
path delay as shown in equation (2). This increase in path delay decreases the maximum
clock frequency, decreasing performance. Thus, positive clock skew is the amount of time
that must be added to the minimum clock period to reliably transfer data through a system.
The tolerance of a system to positive clock skew can be described by combining equations
(1) and (2).
TSKEW ≤ TCP − TPD(MAX) = TCP − (TC-Q + TLOGIC(MAX) + TINT + TSETUP)     (3)
At any particular clock frequency, this equation must be satisfied for the system to
function correctly. For paths that have low tolerance to skew (TCP - TPD is small), the
clock signal should be routed such that the initial register receives the signal before the final
register, imposing negative clock skew on the particular path.
As opposed to positive clock skew, negative clock skew can improve performance of
a synchronous system. Between two sequentially adjacent registers, RI and RF , negative
clock skew implies that the clock signal at RF lags that of RI. Considering equation (2) it is
apparent that negative clock skew decreases the amount of time needed between two registers
given a particular path delay, TPD. However, the existence of negative clock skew in a path
imposes a constraint on the minimum path delay. If the amount of skew is greater than the
time required for a signal to propagate from RI through the interconnect and logic to RF,
then the data from the current clock period has overwritten that from the previous period
before it was latched into RF. This creates a race condition which is a catastrophic error in
a synchronous system. What is even worse is that this is a delay-dependent condition that
may only surface under certain operating conditions (ambient temperature, system activity,
etc.). For operation to continue correctly, the final register RF must latch the previous data
from RI before the new data is allowed to propagate through to RF. This constraint is given
by the following equation:
|TSKEW| ≤ TPD(MIN) = TC-Q + TLOGIC(MIN) + TINT − THOLD     (4)
Figure 1.2: Illustrations of positive and negative clock skews.
In equation (4), TPD(MIN) is the minimum path delay between sequentially adjacent registers,
and THOLD is the time that the input signal must be stable at the final register after the
clock signal has arrived.
An obvious use for negative clock skew is the reduction of critical path delays, thus
increasing system performance. If a designer forces the arrival of the clock at RI to lead
that of RF, the data from RI has a longer time to propagate to RF. This shifts time from
the previous local path (the path where RI is the final register), to the presumably more
critical path. This technique allows the designer to optimize the clock distribution to the
path delays, allowing a means to compensate for uneven partitioning of the local data paths
within a global data path.
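Equations (3) and (4) together bound the skew that any one local data path can tolerate, which can be sketched as follows (the path delays below are hypothetical illustration values):

```python
def skew_bounds(t_cp, t_cq, t_logic_max, t_logic_min, t_int, t_setup, t_hold):
    """Valid clock-skew window for one local data path.
    Upper bound, equation (3): positive skew must leave enough cycle time
    for the longest path.  Lower bound, equation (4): the magnitude of
    negative skew must not exceed the minimum path delay, or the race
    condition described above occurs.  All times in seconds."""
    upper = t_cp - (t_cq + t_logic_max + t_int + t_setup)   # equation (3)
    lower = -(t_cq + t_logic_min + t_int - t_hold)          # equation (4)
    return lower, upper

# Hypothetical 500 MHz (2 ns period) path: any skew inside the window
# satisfies both constraints.
lo_b, up_b = skew_bounds(t_cp=2.0e-9, t_cq=0.2e-9, t_logic_max=1.4e-9,
                         t_logic_min=0.3e-9, t_int=0.1e-9,
                         t_setup=0.15e-9, t_hold=0.1e-9)
print(f"skew window: [{lo_b*1e12:.0f} ps, {up_b*1e12:.0f} ps]")
```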
1.1.2 Power Dissipation
The focus of most contemporary clock distribution network designs is the power dissi-
pation associated with the clock signal. Increasing die sizes and operating frequencies have
resulted in a significant, and sometimes inordinate, percentage of a chip's power dedicated
to the clock distribution. This cause and effect can be simply represented by the dynamic
power equation:
P = CV²f
where C is the capacitive load, f is the switching frequency and V is the voltage swing of the
signal.
The DEC Alpha 21164 microprocessor illustrates the extent to which the clock distribu-
tion network power dissipation can become a problem. The 21164 is a 16.5 mm × 18.1 mm
chip with 3.75 nF of capacitance on the clock net. With an operating frequency of 300 MHz
and a power supply of 3.3 V, the clock distribution dissipates 65% of the 50 W consumed
by the chip [8, 33].
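As a sanity check, the 21164 figures above can be plugged into P = CV²f. Note that this first-order estimate covers only the switching power of the final clock net itself, not the buffer tree driving it or short-circuit currents, which is presumably why it comes out well below the reported total clock power:

```python
def dynamic_power(c_load, v_dd, freq):
    """First-order dynamic power P = C * V^2 * f for a net that switches
    rail-to-rail every cycle (a clock net switches every cycle, so no
    activity factor is applied)."""
    return c_load * v_dd**2 * freq

# Switching power of the 21164's 3.75 nF clock load at 3.3 V, 300 MHz.
p = dynamic_power(c_load=3.75e-9, v_dd=3.3, freq=300e6)
print(f"P = {p:.1f} W")  # → P = 12.3 W
```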
In microprocessor designs of today the total power dissipation is a major concern. The
present trends quickly result in power dissipations that complicate system-level design. For
this reason a lot of attention is being focused on reducing the processor’s total power dissi-
pation, including that of the clock distribution.
Examination of the above equation reveals one of the most effective ways to reduce a
chip's total power dissipation. Decreasing the power supply voltage reduces the squared
term in the power equation. Since this affects all logic gates, including those associated with
the clock distribution, it has a significant effect on the power dissipation. Recent trends
have seen the power supply voltage decrease from 5.0 V, to 3.3 V, to 2.8 V, to 2.0 V for
the core logic of DEC Alpha 21264. Obviously, this trend represents a significant decrease
in the dynamic power dissipation, but there are limits on how low power supply voltages
can go. Noise tolerance issues will keep the power supply voltage from falling much lower,
while increasing die area and operating frequency will continue to contribute to the chip’s
dynamic power dissipation.
The sharp transitions required on the clock net are another cause for increased power
dissipation. As the RC product of the clock network increases, the rise and fall times of
the clock signal will increase unless there exists a corresponding increase in power. Typical
efforts to reduce the RC product include routing the clock net in low-resistivity metal layers,
minimizing the capacitive load presented by the register elements, and detailed design of the
clock network to trim off excess length and width (thus minimizing capacitance). Another
common method is to use multiple levels of clock buffering. This has the effect of separating
the high resistive load of the long clock nets from the high capacitive load of the register
elements. More detail on clock distribution methods will be covered in a following chapter.
A third method of reducing the power dissipation in the clock network is to shut off
the clock signal to unused portions of the chip. This method was used to significant advantage on the DEC StrongARM microprocessor. It is estimated that conditional clocking reduced the
power dissipation of the StrongARM’s clock network almost four-fold [33].
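The saving from conditional clocking can be sketched as a weighted sum over clock domains. The capacitances and enable fractions below are invented for illustration and do not describe the StrongARM:

```python
def gated_clock_power(domains, v_dd, freq):
    """Dynamic clock power with conditional clocking: `domains` is a list
    of (capacitance_farads, fraction_of_time_enabled) pairs.  A domain
    whose clock is shut off does not switch, so it contributes no C*V^2*f
    power while disabled."""
    return sum(c * duty * v_dd**2 * freq for c, duty in domains)

# Hypothetical chip: one always-on domain plus two mostly idle ones.
domains = [(1.0e-9, 1.0), (0.5e-9, 0.2), (0.5e-9, 0.1)]
p_gated = gated_clock_power(domains, v_dd=2.0, freq=500e6)
p_ungated = gated_clock_power([(c, 1.0) for c, _ in domains], 2.0, 500e6)
print(f"{p_ungated / p_gated:.1f}x reduction")  # → 1.7x reduction
```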
1.1.3 Jitter
Increasing clock frequencies reveal a problem that had previously been considered negligible. A general design rule of thumb is that 10% of the clock period is allocated to clock
skew. With designs such as the DEC Alpha 21164 reporting a peak skew of 80 ps [8] at a
clock period of 3.33 ns this rule seems easy to meet. However, future generation processors
will see clock periods on the order of 1 ns. While clock skew itself may still be manageable
at these frequencies (assuming die sizes remain the same), the quantity called jitter has yet
to be considered.
Jitter represents the time-varying behavior of the clock signal. Noise from various
sources causes perturbations on the clock network that can lead any receiver of the clock
signal to perceive a transition at a different time. Since the noise events are typically random
in nature, their effect on system timing is also random.
From the clock generator itself, down to the register elements, jitter affects system timing
in an unpredictable fashion. The jitter performance of a typical phase-locked loop clock
generator is 150 ps. This implies nothing about the jitter introduced along the network in
the buffers or registers. Obviously, this quantity will become significant for future generation
systems. The noise sources that contribute to jitter most significantly are the following:
• Noise coupled through the circuit’s power and ground connections;
• Noise coupled through adjacent or intersecting traces;
• Noise inherent to the circuit’s transistors themselves.
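Using the numbers quoted above (80 ps of peak skew on the 21164, 150 ps of typical PLL jitter), the fraction of the cycle lost to timing uncertainty can be computed directly:

```python
def timing_overhead(period_ps, skew_ps, jitter_ps):
    """Fraction of the clock period consumed by skew plus jitter."""
    return (skew_ps + jitter_ps) / period_ps

# At the 21164's 3.33 ns period the overhead is modest; at a future
# 1 ns period, the same absolute skew and jitter consume 23% of the
# cycle -- well past the 10% rule of thumb.
print(f"{timing_overhead(3330, 80, 150):.0%}")   # → 7%
print(f"{timing_overhead(1000, 80, 150):.0%}")   # → 23%
```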
Chapter 2
The Clock Distribution Network
This chapter discusses some design issues and methodologies for each of three general
stages in a clock distribution network:
• the clock generator;
• the global clock distribution;
• the local clock distribution.
Where appropriate, the sections will cover design issues relating to the topics of clock
skew, power dissipation, and jitter.
2.1 Clock Generation
There are many issues to consider in the design of a clock generator. Both power
dissipation and jitter play a major role in determining the type of generation system to
utilize. Obviously, operating frequency also plays a major role, especially in present-day
systems. The large discrepancy between system clock frequency and processor clock
frequency inevitably results in the need for a phase-locked loop (PLL) clock generator.
Figure 2.1 illustrates the basic block diagram of a phase locked loop (PLL)-based clock
generator. The benefits of such a clock generator are two-fold. First, the frequency of the
clock generator output can be a multiple of the main system clock frequency. This allows
the integrated circuit to run at a higher frequency than the rest of the system, while main-
taining synchronization. This is important for microprocessors, because the advancement of
the system bus frequency has traditionally lagged the advancement of the microprocessor
Figure 2.1: A generic phase locked loop (PLL) block diagram.
frequency. Use of an on-chip PLL has allowed microprocessors to continually increase in
performance while only marginal gains in system bus performance were made. Second, in
a synchronous system, use of a PLL clock generator can eliminate clock skew from active
device variation between chips or even on the same chip. Since the output of the PLL is
in-phase with the input, if the clock distribution buffers are included within the loop (as
illustrated in Figure 2.1), their delay is essentially removed from the system. This does
not eliminate the chip-to-chip or on-chip clock skew due to passive device variations and
interconnect delay [4].
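The frequency-multiplying behavior follows from placing a divider in the feedback path: in lock, the PLL forces the divided output to match the reference, so the output runs N times faster. The divide ratio below is hypothetical; Figure 2.1 need not show this divider block explicitly:

```python
def pll_output_frequency(f_ref, feedback_divider):
    """In lock, the phase detector drives the divided-down output to the
    reference frequency and phase, so with a divide-by-N block in the
    feedback path the VCO settles at f_out = N * f_ref."""
    return f_ref * feedback_divider

# A 66 MHz system bus multiplied up toward a ~300 MHz core clock would
# need a (hypothetical) feedback ratio near 4.5.
print(pll_output_frequency(66e6, 4.5))  # → 297000000.0 (297 MHz)
```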
These important advantages to PLL clock generation do not come without a price. A
PLL, unless designed completely from digital circuitry, is essentially an analog circuit, and
requires very careful design. The noise inherent in a digital system, especially a high-speed
microprocessor, can cause many problems for a PLL clock generator. Furthermore, PLLs
typically require relatively large passive elements for stable operation. Some integrated
circuit technologies do not lend themselves well to the creation of such structures. Last,
but certainly not least, even a well-designed PLL exhibits variation in its output phase and
frequency over time. This quantity is the jitter term introduced in Section 1.1.3.
The components of a PLL that are most susceptible to noise are the phase detector,
charge pump, and voltage-controlled oscillator (VCO). With careful layout and circuit design
the jitter due to noise in the phase detector and charge pump can be largely eliminated, but
the VCO is a very sensitive circuit that bears closer inspection.
2.1.1 Global Clock Distribution
The global clock distribution concerns the means by which the clock signal is relayed
from the clock generator to the register loads. The routing, buffering, and signaling method-
ologies are major design decisions and have significant impact on overall system perfor-
mance. The timing requirements involved in the clock distribution network make its design
a very difficult one. In fact, system level trade-offs between system speed, physical die
area, and power dissipation are significantly affected by the design of the clock distribution
network [18]. This section discusses some traditional methods for global clock distribution
design and their characteristics in terms of clock skew, power and jitter.
2.1.2 Clock Tree
By far the most common method for distributing clock signals in VLSI applications is
the clock tree method. This tree structure is so called because buffers are placed at
the clock source and along the clock paths as they branch out towards the clock loads.
Figure 2.2 illustrates the organization of a clock tree distribution network. The distributed
buffers amplify the clock signal to remove any degradation of the signal due to interconnect
resistance between the clock source and the registers. In this scheme, it should be noted,
the distributed buffers are the primary source of the total clock skew because active device
characteristics typically vary more than passive device characteristics.
An extension of this scheme is demonstrated by the DEC Alpha 21064 microprocessor
(Figure 2.3). In this five stage buffer tree design, one of the intermediate clock tree stages
was made into a mesh by strapping metal lines across each of the branches. The mesh
structure places the interconnect resistance in parallel, reducing the effective resistance seen
by the buffers. This minimizes both the delay through the clock distribution and the total
skew within it [15]. Furthermore, the inputs to previous buffer stages were strapped together
to smooth any asymmetric arrival times of the clock signal.
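The benefit of the mesh can be approximated as identical branch resistances placed in parallel; the values below are illustrative, not the 21064's actual numbers:

```python
def mesh_effective_resistance(branch_resistance, n_branches):
    """Strapping N identical branches together places their interconnect
    resistances in parallel, so the buffers see R_eff = R / N (assuming
    identical branches and negligible strap resistance)."""
    return branch_resistance / n_branches

# Four 8-ohm branches strapped into a mesh look like 2 ohms to the
# driving buffers, reducing both delay and branch-to-branch skew.
print(mesh_effective_resistance(8.0, 4))  # → 2.0
```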
Figure 2.2: Buffer-tree clock distribution.
If the interconnect resistance from the clock source to the registers is negligible, then
the clock tree scheme can be reduced to a single buffer driving the entire network. The
advantages of this strategy are the removal of the skew introduced by the distributed buffers,
and the reduced area obtained by eliminating the distributed buffers. However, the single
buffer must be capable of providing enough current to drive the network capacitance of the
interconnect and register loads, while maintaining sharp, consistent transitions. A recent,
and somewhat notorious, example of this clock distribution strategy is the DEC Alpha 21164
microprocessor. In this clock distribution network, a tree of buffers converges on a single
buffer which drives the entire clock load of 3.75 nF. This final inverter comprises 58 cm of
total transistor width and is required to drive the clock signal at 300 MHz [8]. This design resulted in an
incredibly low maximum clock skew of 80 ps across the 16.5 mm x 18.1 mm die. However,
the price for this performance was the dissipation of nearly 65% of the total processor power
of 50W in the clock distribution network [8, 33].
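As a rough sanity check on these figures, the dynamic power of the final driver stage can be estimated from P = CV²f. The load capacitance and frequency are from the text; the 3.3 V supply is an assumption typical of that process generation, not a figure stated here, and the estimate covers only the final-stage load (the upstream buffer tree and short-circuit currents account for the rest of the reported clock power):

```python
# Dynamic power of the final clock driver stage: P = C * V^2 * f.
# C = 3.75 nF and f = 300 MHz are from the text; the 3.3 V supply
# is an assumed value, not stated in this report.
C = 3.75e-9      # total clock load (F)
V = 3.3          # assumed supply voltage (V)
f = 300e6        # clock frequency (Hz)

P_load = C * V**2 * f
print(f"Final-stage clock load power: {P_load:.2f} W")  # ~12.25 W
```

The ~12 W dissipated in the final load alone is consistent in magnitude with the reported ~32.5 W (65% of 50 W) for the whole clock network.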
Figure 2.3: DEC Alpha 21064 clock buffer tree.
2.1.3 H-Tree Clock Distribution
The primary goal in clock distribution design has traditionally been to transmit the clock
signal to every register in the system at precisely the same time. Many routing algorithms
exist for attaining zero clock skew, some more effective than others. In all cases, routing
for zero clock skew results in a larger clock distribution network. The necessity to match
delays forces increased route lengths, and often increases complexity. A common zero skew
routing strategy is the symmetric H-tree clock distribution network. This method aims to
produce zero skew clock routing by matching the length of every path from clock source to
register load. This is accomplished by creating a series of H-shaped routes from the center
of the chip as illustrated by Figure 2.4. At the corners of each “H” the nearly identical
clock signals provide the inputs to the next level of smaller “H” routes. At each junction the
impedance of the interconnect is scaled to minimize reflections. For an H-tree network, each
conductor leaving a junction must have twice the impedance of the source conductor. This
is accomplished by decreasing the interconnect width of each successive level. This continues
until the final points of the H-tree structure are used to drive either the register loads, or
local buffers which drive the register loads. Thus, the path length from the clock source to
each register is practically identical in an H-tree distribution. A variant of the H-tree is the
X-tree clock distribution network, illustrated in Figure 2.5.
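The impedance-scaling rule above implies a simple width schedule: treating characteristic impedance as inversely proportional to wire width, each branch leaving a junction must be half as wide as its source segment so that the two parallel branches together match the source. A minimal sketch, with a hypothetical 10 um root width:

```python
# Impedance-matched H-tree: each branch leaving a junction presents
# twice the impedance of its source segment, so the two parallel
# branches together match the source.  With impedance taken as
# inversely proportional to width, the width halves at each level.
# The 10 um root width is an illustrative value only.
def htree_widths(root_width_um, levels):
    """Return the wire width (um) at each H-tree level."""
    return [root_width_um / 2**level for level in range(levels)]

widths = htree_widths(10.0, 4)
print(widths)  # [10.0, 5.0, 2.5, 1.25]
```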
Figure 2.4: Symmetric H-tree clock distribution network.
Figure 2.5: Symmetric X-tree clock distribution network.
Obviously, this method does not truly produce zero clock skew. At some point the clock
signal must be routed to the registers. Within this local block, clock skew will exist, but the
block size is chosen such that the skew is insignificant. Other sources of skew include the
variation in interconnect parameters, and the variation in active parameters for any buffers
distributed through the H-tree network.
Since the magnitude of the clock skew is only significant for sequentially adjacent reg-
isters, the symmetry provided by an H-tree distribution is largely unnecessary. Considering
the additional interconnect capacitance (which leads to increased power dissipation), the
benefits of zero clock skew may not be worth the price. In fact, the use of an H-tree
structure is often overkill for even current generation microprocessors [23].
Even if sequentially adjacent registers are located on opposite sides of a chip, the concept
of tolerable skew routing seems more effective. Allowing a measure of clock skew between
registers significantly relaxes wiring constraints. This relaxation reduces overall interconnect
capacitance and power dissipation. A method for routing a two level clock network with
tolerable clock skew is reported in [54]. Though this project details only a planar, equal
path length routing algorithm for the first level of the clock network, it demonstrates that a
tolerable skew methodology can yield a lower total interconnect length. A method that not
only uses tolerable clock skew, but takes advantage of optimal skew scheduling is reported
in [36]. Part of a four step design process, this method inserts buffers with tuned delays to
achieve an optimal clock skew schedule. This not only reduces the overall clock interconnect
length, but increases system performance by allowing a higher frequency.
Routing and distributing the clock signal are only part of the design process. The
method of clock signaling must also be determined. Typically, the clock signal is not treated
any differently than any other signal in this respect. However, alternative clock signaling
methods can achieve significant power savings, often at minimal cost to performance
and area.
Figure 2.6: Half-swing clock signaling methodology.
Figure 2.6 illustrates the difference between a conventional, full-swing clock signaling
methodology, and a half-swing methodology proposed by Kojima et al. [26]. Recalling that
dynamic power dissipation is proportional to V², it is readily apparent that routing two
half-swing pairs of clock signals results in approximately 50% the power dissipation of one
pair of full-swing clock signals.
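This 50% figure follows directly from P ∝ n·C·V², where n is the number of clock lines; C and V cancel in the ratio, so unit values suffice for the comparison:

```python
# Relative dynamic power, P ∝ n_lines * C * Vswing^2, comparing one
# full-swing clock pair against two half-swing pairs (per Figure 2.6).
# C and V cancel in the ratio, so arbitrary unit values are used.
C, V = 1.0, 1.0

full_swing = 2 * C * V**2          # one pair = two full-swing lines
half_swing = 4 * C * (V / 2)**2    # two pairs = four half-swing lines

print(half_swing / full_swing)     # 0.5 -> ~50% of full-swing power
```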
Reduced-swing methodologies often require additional clock signals, because differential
signaling is needed to offset the reduced noise margins. This obviously complicates the
clock routing, since additional signals must be routed.
Unfortunately, this has the secondary effect of increasing clock load, since the differential
clock nets must be routed close to one another. This prevents the power savings from
achieving the theoretical 50% reduction.
If the clock distribution were designed around noise-tolerant circuits, it might be possible
to utilize a reduced-swing methodology without the additional clock signals. This would
harness all the benefits of voltage swing reduction.
Power reduction is also possible through conditional clocking. By using a conditional
buffer to drive the register elements (which comprise the majority of capacitive clock load)
the system is able to disable the clock signal in idle sections. This effectively reduces the
capacitive loading seen by the clock distribution and directly reduces the power dissipation
of the network. The StrongARM RISC microprocessor from DEC used this technique to
help achieve a very impressive power-performance point. The 2.5 million transistor device
operates at a frequency of 160 MHz, while only dissipating 450 mW of power. Of the five
clocking regimes on the chip, four are sourced by conditional clock buffers. The use of gated
clocks does not come without a price. The logic paths associated with enabling/disabling
the particular clock buffers were some of the most difficult to design. However, the DEC
design team made the decision that the power savings justified the increase in complexity
and potential degradation of performance. In fact, it is estimated that the 15% of the chip
power that is dissipated in the clock network could have quadrupled without the use of gated
clocks [33].
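The mechanism can be captured in a toy model: gating the clock to idle registers removes their capacitance from the switched load, so clock power scales with the active fraction. The idle fraction and electrical values below are illustrative assumptions, not StrongARM data:

```python
# Toy model of conditional (gated) clocking: disabling the clock to
# idle registers removes their capacitance from the switched load.
# All numeric values below are hypothetical, for illustration only.
def clock_power(C_total, V, f, idle_fraction):
    """Dynamic clock power with a fraction of the load gated off."""
    return C_total * (1.0 - idle_fraction) * V**2 * f

C, V, f = 1.0e-9, 1.5, 160e6            # assumed load, supply, clock
always_on = clock_power(C, V, f, 0.0)
gated     = clock_power(C, V, f, 0.75)  # assume 75% of registers idle

print(f"savings: {1 - gated / always_on:.0%}")  # prints "savings: 75%"
```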
Recent research has reported other applications for gated clocks. In 1995, Benini and
De Micheli [6] reported a method of designing finite state machines that takes advantage of
self-loops within the FSM to gate the clock. For a set of benchmark circuits they achieved
an average power savings of 25% with an area cost of only 5%. Digital signal processors
also benefit from the use of conditional clocking. The multi-hundred MHz multipliers used
in FIR filters dissipate a large fraction of their power in the clock network. Nagendra and
Irwin [35] propose a method for reducing the activity factor of a multiplier, thus decreasing
the power dissipation.
Chapter 3
Wireless Clock Distribution
Clock distribution is a crucial aspect of modern multi-GHz microprocessor design. Con-
ventional distribution schemes are more or less monolithic in that a single clock source is fed
through hierarchies of clock buffers to eventually drive almost the entire chip. This raises
a number of challenges. First, due to irregular logic, the load of the clock network is non-
uniform, and the increasing process and device variations in deep sub-micron semiconductor
technologies further add to the spatial timing uncertainties known as clock skew. Second,
the load of the entire chip is substantial, and sending a high quality clock signal to every
corner of the chip necessarily requires driving the clock distribution network “hard”, usually
in full swing of the power supply voltage. Not only does this mean high power expenditure,
but it also requires a chain of clock buffers to deliver the ultimate driving capability. These
active elements are subject to power supply noise and add delay uncertainty (jitter), which
also eats into the usable clock cycle. Jitter and skew combined represent about 18% of cycle
time currently [21], and that results in indirect energy waste as well. For a fixed cycle time
budget, any increase in jitter and skew reduces the time left for the logic. To compensate
and make the circuitry faster, the supply voltage is raised, therefore increasing energy con-
sumption. Conversely, any improvement in jitter and skew generates timing slack that can
be used to allow the logic circuit to operate more energy efficiently.
As commercial microprocessors are rapidly becoming multi-core systems, monolithic
clock distribution will be even less applicable. In the era of billion-transistor microprocessors,
a single chip is really a complex system with communicating components and should be
treated as such. In communication systems, synchronizing clocks is also a rudimentary and
crucial task. In this section, we will discuss the concept of a wireless clock distribution
network in a microprocessor.
Wireless interconnect should provide an additional means for global communications,
freeing up conventional wires for other uses. This new interconnect technology is being
developed utilizing a clock distribution system as the driver. Using wireless interconnects
in a clock distribution system should reduce the latency in the clock tree which should
help reduce clock skew [1] and should eliminate the frequency dispersion problem that may
ultimately limit the maximum clock frequency [14].
A conceptual illustration of a single-chip or intra-chip wireless interconnect system for
clock distribution is shown in Figure 3.1. An approximately 20-GHz signal is generated
on-chip and applied to an integrated transmitting antenna which is located at one part of
the IC. Clock receivers distributed throughout the IC detect the transmitted 20-GHz signal
using integrated antennas, and then amplify and synchronously divide it down to a ∼2.5-
GHz local clock frequency. These local clock signals are then buffered and distributed to
adjacent circuitry. Figure 3.2 shows an illustration of a multi-chip or inter-chip wireless clock
distribution system. Here the transmitter is located off-chip, utilizing an external antenna,
potentially with a reflector. Integrated circuits located on either a board or a multi-chip
module each have integrated receivers which detect the transmitted global clock signal and
generate synchronized local clock signals.
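The transmitted-to-local frequency relationship implied above (20 GHz down to ~2.5 GHz) corresponds to a divide-by-8, which in practice would be three cascaded divide-by-2 flip-flop stages; the specific divider depth is inferred from the frequencies, not stated in the text:

```python
# The receivers divide the ~20 GHz transmitted clock down to the local
# clock frequency.  20 GHz / 2.5 GHz implies a divide-by-8, i.e. three
# cascaded divide-by-2 stages (2**3); the stage count is inferred.
f_tx = 20e9                   # transmitted global clock (Hz)
divide_ratio = 2**3           # three flip-flop divider stages
f_local = f_tx / divide_ratio

print(f"{f_local / 1e9} GHz")  # prints "2.5 GHz"
```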
3.1 Clock Transmitter
Figure 3.3 shows a simplified block diagram for an integrated clock transmitter. The
20-GHz signal is generated using a voltage-controlled oscillator (VCO). The signal from the
VCO is then amplified using a power amplifier (PA), and fed to the transmitting antenna.
The VCO is phase-locked to an external reference using a phase-locked loop (PLL), providing
frequency stability. The PLL consists of a phase-frequency detector (PFD), a loop filter, the
VCO, and a frequency divider.
Figure 3.1: Intra-chip wireless interconnect system.
Figure 3.2: Inter-chip wireless interconnect system.
Figure 3.3: Block diagram of integrated clock transmitter.
Figure 3.4: Block diagram of integrated clock receiver.
3.2 Clock Receiver
Figure 3.4 shows a block diagram for an integrated clock receiver. The global clock
signal is detected with a receiving antenna, amplified using a low noise amplifier (LNA), and
divided down to the local clock frequency. These local clock signals are then buffered and
distributed to adjacent circuits. The amplifier is tuned to the clock transmission frequency to
reduce interference and noise. Since the microprocessor is extremely noisy at the local clock
frequency and its harmonics, transmitting the global clock at a frequency higher than the
local clock frequency provides a level of noise immunity for the system [30]. Also, operating
at a higher frequency decreases the required antenna size. The receiver is implemented in a
fully differential architecture, which rejects common-mode noise (e.g., substrate noise) [9, 30],
obviates the need for a balanced-to-unbalanced conversion between the antenna and the LNA,
and provides dual-phase clock signals to the frequency divider.
3.3 Comparison of Power Consumption with Conventional Methods
Kenneth et al. [17] have discussed the power consumption of a wireless clock distribution
network and compared it with conventional clock distribution networks. To compare
the power requirements between different types of global clock distribution systems, the
system voltage and frequencies are assumed to be equal. Also, an equal capacitive load
representing the local clock generators or distribution system is assumed for each type of
global distribution system. Under these assumptions, the power dissipation can be converted
to capacitances and these can be used to compare the power dissipation of different global
distribution systems, similar to an approach taken in [42].
The total global capacitance can be allocated among three components: CG, CW, and
CL. CG is the equivalent capacitance of the highest level network which delivers the clock
from its source to spine locations distributed throughout the chip. CG includes the total
capacitance of the final driver stage, herein termed the sector buffer, plus any buffers leading
up to the sector buffer. The sector buffers are assumed to be exponentially tapered for
minimum delay [29]. CW is the capacitance of the interconnecting wires for delivering the
clock from the spine locations to the local distribution system. CL is the load capacitance
representing the input capacitance of the local clock generators.
Two cases are used in comparing these clock distribution systems for 0.1-um generation
microprocessors. Case 1 is for aluminum interconnects and conventional dielectrics; case 2
is for copper interconnects and low-K dielectrics.
3.3.1 Grid Based System
The grid-based system, based on DEC 21264 [19], consists of a global tree supplying the
clock to different spine/buffer locations, and the capacitance of this network is CG. These
buffers drive the clock grid, which has capacitance CW. The local clock generators tap off
of the grid. Due to the amount of wiring used to form the grid, CW is large. Consequently,
large sector buffers are needed, increasing CG.
3.3.2 H-Tree Based Systems
The H-tree system, based on IBM S/390 [52], consists of a global tree supplying the
clock to different spine/buffer locations. Each buffer drives a balanced H-tree, which drives
the local clock generators. CW is smaller for the H-tree, therefore a smaller sector buffer is
required.
Table 3.1: Global capacitive loading.
3.3.3 Wireless System
The wireless system consists of a clock transmitter broadcasting a microwave signal to
a grid of distributed receivers. A receiver corresponds to a spine location. Due to their low
capacitance, balanced H-trees are used to distribute the signal from the receivers to the local
clock generators. Thus, CW is the same for the H-tree and wireless schemes. In wireless
clock distribution systems, long interconnects for delivering the clock from its source to
spine locations are not present and the associated component of CG is zero. However, since
the wireless system contains components with static power dissipation, an equivalent global
capacitance, representing this power dissipation, is needed. This capacitance is then used to
make a comparison to the grid and H-tree systems. To obtain this equivalent capacitance, the
total power dissipation of the clock transmitter and receivers is divided by the factor V²f.
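This conversion can be sketched directly; the power, voltage, and frequency values below are illustrative placeholders, not the figures behind Table 3.1:

```python
# Equivalent global capacitance representing the wireless system's
# static power: C_eq = P_total / (V^2 * f), so it can be compared
# against the CG of the wired schemes.  All values are hypothetical.
def equivalent_capacitance(P_total, V, f):
    """Convert static power to an equivalent switched capacitance."""
    return P_total / (V**2 * f)

C_eq = equivalent_capacitance(P_total=0.1, V=1.5, f=1e9)  # 100 mW example
print(f"C_eq = {C_eq * 1e12:.1f} pF")  # prints "C_eq = 44.4 pF"
```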
Table 3.1 shows a breakdown of the global capacitive loading for the three distribution
systems for cases 1 and 2, and includes the initial 0.25-um data. All capacitance units are
in pF. The final row gives the power consumed by the global clock distribution systems as a
percentage of total microprocessor power. This percentage represents the relative amount of
power dissipated in the global system to drive a load of CL. The results show that for both
cases, the wireless system is comparable in performance to the H-tree system and better in
performance than the grid-based system in terms of power dissipation. The results also show
that technology developments such as Cu or low-k will have the greatest positive impact on
systems whose total equivalent capacitance is dominated by CW, such as the grid. Finally,
the results show that the power dissipated in the clock receivers, given by CG, should be a
small fraction (2.3%) of the total power dissipated in the microprocessor. These results show
that power dissipation does not impose limitations for wireless clock distribution systems.
However, additional work focused on the overall feasibility and implementation of this system
is ongoing.
3.4 Potential Benefits
The wireless clock distribution system would address the interconnect needs of the
semiconductor industry in providing high-frequency clock signals with short propagation
delays. These needs would be met while providing multiple benefits. First, signal propagation
occurs at the speed of light, shortening the global interconnect delay without requiring
integrated optical components. Second, the global interconnect wires used in conventional
clock distribution systems are eliminated, freeing up these metal layers for other applications.
Third, referring to Figure 3.2, the inter-chip clock distribution system can provide global
clock signals with a small skew to an area much greater than the projected IC size. This is
an additional benefit, possibly allowing synchronization of an entire PC board or a multi-
chip module (MCM). Fourth, in the wireless system, dispersive effects are minimized since
a monotone global clock signal is transmitted. Fifth, another benefit is a more uniformly
distributed power load equalizing temperature gradients across the chip. Sixth, by adjusting
the division ratio in the receiver, higher frequency local clock signals [1] can be obtained, while
maintaining synchronization with a lower frequency system clock. Seventh, an intangible
benefit of wireless interconnect systems is the effect they could have on microprocessor or
system implementations, potentially allowing paradigm shifts such as drastically increased
chip size. Finally, compared to other potential breakthrough interconnect techniques, such
as optical, superconducting, or organic, a wireless approach based on silicon seems to be
a potential solution which is compatible with the technology trends of the semiconductor
industry.
3.5 Areas of Research
The main areas of research for wireless clock distribution are as follows:
• integrating compact, power-efficient antenna structures,
• identifying noise-coupling mechanisms for the wireless clock distribution system and
estimating the signal-to-noise ratio that can be achieved on a working microprocessor,
• implementing the required 20-GHz circuits in a CMOS process consistent with the
ITRS [1], and
• characterizing a wireless clock distribution system in terms of skew and power
consumption and estimating the overall feasibility of the system.
Chapter 4
Optical Clock Distribution
Designing clock distribution networks is a big challenge for future microprocessors due
to increasing frequency, power, transistor counts and process variations. As technology
scales, implementing conventional clock distribution networks that meet low power and skew
requirements is becoming more difficult. On the other hand, optical interconnects are being
proposed as an alternative to electrical interconnects due to their speed-of-light transmission,
high bandwidth and low power dissipation. In the future, interconnects between chips,
between cores on a chip and within components on a processor core could be made optical
to achieve lower power and higher performance.
Optical interconnects for clock distribution were first studied by Goodman et al. [20].
They postulate that interconnect delays will be the limiting factor for performance in future
MOS circuits and suggest moving to optical and electro-optic technologies. With near-
infrared optical sources, modulators and detectors with media such as free space, optical
fibers and integrated optical waveguides, they state the interconnect scaling problem could
be solved.
They list five advantages in moving to photonics: freedom from capacitive loading
effects which allows greater fan-in and fan-out, immunity to mutual interference effects,
lack of planar constraints resulting in reduced cross-coupling for criss-crossing waveguides,
reconfigurability of free space focused interconnects and possibility of direct injection of
optical signals into electronic devices without the need for optical to electrical conversion.
Four ways of optical clocking are described next:
• Index-based with waveguides - light is carried from a single source generating the optical
clock signal to the other parts of the chip using waveguides which are integrated on a
suitable substrate.
• Index-based with fiber optics - light is transmitted similar to the above case except
that fibers are used instead of waveguides in a separate core.
• Unfocused free space interconnect - optical clock signal broadcast to the entire chip by
focusing light (through a lens or diffusers) perpendicular to the chip from above.
• Focused free space interconnect - an optical element like a hologram (which acts as a
grating) sends the optical signals onto a multitude of detection sites simultaneously.
Miller et al. [31, 32] discuss various opportunities for optics in on-chip interconnects
and note that one of the main reasons to move away from long electrical on-chip wires
is to avoid growing transmission line and inductance effects. Power requirements of long
point-to-point optical interconnects are much lower (with no need for repeaters for optical
clock transmission), and optical signals, whether long or short, perform equally well. Other
benefits include the ability to send signals across in the 3rd dimension and the possibility
of integration with electronic devices. Debaes et al. [13] and Bhatnagar [7] discuss the
advantages of optical clocking and propose a receiver-less optical clocking scheme which
reduces the latency of transmitting the clock signal to the local network. They conclude
that such a latency-reducing scheme reduces skew and jitter but does not reduce the power
consumed compared to an electrical clocking scheme.
Recent studies comparing electrical and optical clocking schemes have analysed the
power, skew and area usage for different technologies and estimated the potential of photonics
in clock distribution [11, 25]. Based on the 2001 ITRS roadmap [2], using analytical models,
they estimate that most of the power dissipation is associated with local clock distribution.
Since alternatives to electrical clock distribution have been proposed for replacing the global
clock distribution only, they conclude that there are no significant power benefits to replacing
electrical distribution with optical distribution. These studies also show that low skew can be
obtained with optical as well as non-scaled electrical interconnects (130nm) and concludes
that skew benefits of optical transmission are also not significant if non-scaled electrical
interconnects are used. They consider a global balanced and buffered H-Tree driving local
grids as representative of current clock distribution techniques. For optical distribution, they
replace the electrical H-tree by an H-tree structure built using waveguides with detectors at
the end points to convert the light pulses to a clock signal. The local grids and buffers
driving the local grids remain the same in both implementations.
Mule et al. [34] discuss the pros and cons of electrical and optical clock distribution
systems. Among the different schemes used to transmit the optical clock onto different
parts of the chip, they find the waveguide-based approach most feasible, since free-space
approaches work in 3 dimensions (and hence are not compact) and this would complicate
power distribution and cooling. To accomplish local clock routing with optical technology,
the fanout should correspond to the total number of latches on the chip. Since optical
signals do not use repeaters and rely on the original source strength to drive all the loads,
it is extremely difficult to make the fanout more than a few tens or hundreds of loads at
the most. Hence for the regions which require short wires like the local distribution regions,
current systems can only use electrical routing.
A possible implementation of optical clocking is to replace the global transmission net-
work with optical waveguides and use electrical local distribution. At the end of the global
transmission network, when the global clock signal is converted to electrical signals using a
detector, it has to be buffered and amplified and sent to the local network. However, given
expected optical technologies, the number of fanouts on the global optical transmission must
be relatively small (less than 100). This means that the electrical fanout driven by each op-
tical receiver will be quite large (thousands of latches). In order to make the optical receiver
fast and to support large optical fanouts, each optical receiver must be physically small and
have a very small capacitance. However, taking a signal on a very small capacitance and
buffering it to drive a very large capacitance is a classical logical effort problem [45]. This
will require many stages of buffering, which introduces additional skew due to different pro-
cess, thermal, and voltage conditions between different local network buffers. In contrast,
a similar sized local distribution network driven from an electrical input will be connected
to a global electrical transmission spine with very large capacitance due to wire parasitics.
Thus the additional delay caused by connecting large electrical buffers to the global spine
is modest. Hence a smaller number of buffer stages are needed to drive a local clock dis-
tribution network given an electrical input in comparison to an optical input. This could
give local distribution buffers with electrical inputs lower skew than local distribution buffers
with optical inputs.
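The stage-count argument above can be made concrete with the logical-effort rule of thumb that the number of inverter stages grows as ln(C_load/C_in), with an optimum stage effort of about 4. The capacitance values below are assumed orders of magnitude, not figures from the cited work:

```python
import math

# Logical-effort view of the buffering problem: stages needed to drive
# C_load from an input capacitance C_in, assuming the usual optimum
# stage effort of ~4.  The capacitance values are illustrative.
def buffer_stages(C_in, C_load, stage_effort=4.0):
    """Approximate number of buffer stages for a given fanout ratio."""
    return max(1, round(math.log(C_load / C_in) / math.log(stage_effort)))

# Small optical receiver (~1 fF) driving thousands of latches (~10 pF):
print(buffer_stages(1e-15, 10e-12))  # 7 stages
# Large electrical spine tap (~1 pF) driving the same load:
print(buffer_stages(1e-12, 10e-12))  # 2 stages
```

The optical receiver's tiny input capacitance forces several extra buffer stages, each adding skew from process, thermal, and voltage variation, which is exactly the disadvantage the text describes.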
Because of the extremely large fanout of the clock, high capacitive loads, and extremely
low effective load impedance, the clock distribution problem is primarily a problem in power
amplification. Electrical power amplification technologies are more power and skew efficient
than expected optical technologies. For example, since the CMOS buffers act as non-linear
amplifiers, their power added efficiency can easily be over 60%.
In this section we consider several possible approaches to using optics for distribution
of the clock, with the aim of minimizing or eliminating clock skew. Attention is focused on
the problem of distributing the clock within a single chip.
4.1 Intra-Chip Clock Distribution
The interconnections responsible for clock distribution are characterized by the facts
that they must convey signals to all parts of the chip and to many different devices. These
requirements imply long interconnect paths and high capacitive loading. Hence the propa-
gation delays are large and depend on the particular configuration of devices on the chip.
Here we consider methods for using optics to send the clock to various parts of the chip.
It is assumed that optics is used in conjunction with electronic interconnects, in the sense
that optical signals might be used to carry the clock to various major sites on the chip, from
which the signals would be further distributed, on a local basis, by a conventional electronic
interconnection system.
The clocks used in MOS technology are generally two phase [29]. Presumably only one
of these phases will be distributed optically, the other being generated on the chip after the
detection of the optical timing signal.
A variety of optical techniques can be envisioned for accomplishing the task at hand.
The main distinction between these approaches occurs in the method used to convey light
to the desired locations on the chip.
4.1.1 Index-Guided Optical Interconnects
The first major category of optical interconnect techniques is referred to as “index guided”.
Light is assumed to be carried from some single source generating an optical signal modulated
by the clock to many other sites by means of waveguides. The waveguides could be of either
of two types. One type could use optical fibers for carrying the optical signals. The second
type could use optical waveguides integrated on a suitable substrate.
If fibers are chosen as the interconnect technology, then the following approach, illus-
trated in Figure 4.1, might be used. A bundle of fibers is fused together at one end, yielding
a core into which light from the modulated optical source (probably a lasing diode) must be
coupled. Light coupled in at the fused end is split as the cores separate, and transmitted to
the ends of each of the fibers in the bundle. Each fiber end must now be carefully located
over an optical detector that will convert the optical clock to an electrical one. Alignment
of the fiber and the detector might be accomplished with the help of micropositioners (anal-
ogous to a wire bonding machine), and UV-hardened epoxy could be used to hold the fiber
in its proper place permanently. The difficulties associated with the fiber-optic approach
stem from the alignment requirements for the fibers and detectors, and from the uniformity
requirements for the fused-fiber splitter. It should also be noted that the fibers cannot be
allowed to bend too much, for bends will cause radiation losses that may become severe.
Figure 4.1: Distribution of the clock by means of fiber.
Lastly, we should mention that the use of fibers, and the requirements regarding allowable
degrees of bending, imply that this interconnect technology will occupy a three-dimensional
volume, rather than being purely planar, and this property could be a disadvantage in some
applications.
If integrated optical waveguides are chosen as the interconnect technology, then the
geometry might be that shown in Figure 4.2. The waveguides might be formed by sputtering
of glass onto a silicon dioxide film on the Si substrate. These guides are shown as straight
in the figure, a configuration chosen again because of the large losses anticipated if this
type of light guide is bent at a large angle. Optical signals must be coupled into each of
the separate guides. Such signals might be generated by a single laser diode and carried
to the waveguides by fibers, or separate sources might drive each of the guides, with the
clock distributed to the different sources electrically. Presumably light must be coupled
out of each of the straight waveguides at several sites along its length, with a detector
converting the optical signal to electronic form at each such site. The difficulties associated
with the waveguide approach, setting aside the bending problem that the straight-guide
geometry intentionally avoids, stem primarily from the requirement to efficiently couple into and out
of the guides. Careful alignment of the sources or fibers with the integrated waveguides is
required, and couplers with short lengths are desired to remove the light from the guides and
place it onto the appropriate detectors. Present waveguide technology requires distributed
Figure 4.2: Distribution of the clock by means of integrated optical waveguides.
couplers with rather large dimensions (5 µm × 1 mm) compared with the feature sizes
normally thought of in electronic IC technology. A major advantage of the integrated optics
distribution system lies in its planar character and the small excess volume it requires. A
disadvantage is the comparatively inflexible geometry dictated by the necessity to avoid large
bends of the waveguides.
4.1.2 Free-Space Optical Interconnects
A second major category of optical interconnects can be referred to as “free-space”
techniques. For such interconnects, the light is not guided to its destination by refractive
index discontinuities, but rather by the laws that govern the propagation of light in free
space. It is helpful to distinguish between two types of free-space interconnect techniques,
“unfocused” and “focused”.
Unfocused interconnections are established simply by broadcasting the optical signals
carrying the clock to the entire electronic chip. One such approach is shown in Figure 4.3. A
modulated optical source is situated at a focal point of a lens that resides above the chip. The
signal transmitted by that source is collimated by the lens, and illuminates the entire chip at
normal incidence. Detectors integrated in the chip receive the optical signals with identical
delays, due to the particular location of the source at the focal point of the lens. Hence in
principle there is no clock skew whatever associated with such a broadcast system. However,
the system is very inefficient, for only a small fraction of the optical energy falls on the
Figure 4.3: Unfocused broadcast of the clock to the chip.
photosensitive areas of the detectors, and the rest is wasted. Inefficient use of optical energy
may require extra amplification of the detected clock signals on the chip, with a
concomitant loss of area for realizing the other electronic circuitry
required for the functioning of the chip. Moreover, the optical energy falling on areas of the
chip where it is not wanted may induce stray electronic signals that interfere with the proper
operation of the chip. Therefore, it is likely that an opaque dielectric blocking layer would
be needed on the chip to prevent coupling of optical signals at places where they are not
wanted. Openings in this blocking layer would be provided to allow the optical signals to
reach the detectors. Alternate unfocused interconnection techniques could be imagined that
use diffusers rather than a lens. Note that all such techniques require a three-dimensional
volume in order to transport the signals to the desired locations.
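The inefficiency of the unfocused broadcast can be made concrete with a back-of-the-envelope estimate; all numbers below are illustrative assumptions, not values from the text:

```python
# Collection efficiency of an unfocused broadcast system: only light that
# lands on the photosensitive detector areas is used; the rest is wasted.
# All parameter values are illustrative assumptions.

chip_area_mm2 = 10.0 * 10.0        # assumed 10 mm x 10 mm die
n_detectors = 64                   # assumed number of receiver sites
detector_area_um2 = 10.0 * 10.0    # assumed 10 um x 10 um photosensitive area each

detector_area_mm2 = n_detectors * detector_area_um2 * 1e-6  # um^2 -> mm^2
efficiency = detector_area_mm2 / chip_area_mm2

print(f"collection efficiency = {efficiency:.1e}")  # about 6.4e-05
```

Even with generously sized detectors, less than a hundredth of a percent of the emitted light is collected, which is why extra on-chip amplification is anticipated.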
The last category of optical interconnections is free-space “focused” interconnections,
which can also be called “imaging” interconnections. For such interconnections, the optical
source is actually imaged by an optical element onto a multitude of detection sites simulta-
neously. As indicated in Figure 4.4, the required optical element can be realized by means
of a hologram, which acts as a complex grating and lens to generate focused grating com-
ponents at the desired locations. The efficiency of such a scheme can obviously exceed that
Figure 4.4: Focused optical distribution of the clock using a holographic optical element.
of the unfocused case, provided the holographic optical elements have suitable efficiency.
Using dichromated gelatin as a recording material, efficiencies in excess of 99 percent can
be achieved for a simple sine wave grating. When a multitude of focused spots are to be
produced, the efficiency will presumably be lower, but should be well in excess of 50 percent.
The flexibility of the method is great, for nearly any desired configuration of connections
can be realized.
The chief disadvantage of the focused interconnect technique is the very high degree of
alignment precision that must be established and maintained to assure that the focused spots
are striking the appropriate places on the chip. Of course, the spots might be intentionally
defocused, decreasing the efficiency of the system, but easing the alignment requirements.
Thus there exists a continuum of compromises between efficiency and alignment difficulty.
Figure 4.5 illustrates a possible configuration that retains high efficiency but minimizes
alignment problems. The imaging operation is provided by two two-element lenses, in the
form of a block with a gap between the elements. A Fourier hologram can be inserted
between the lenses, and it establishes the desired set of focused spots. The hologram itself
consists of a series of simple sinusoidal gratings, and as such the position of the diffracted
spots is invariant under simple translations of the hologram. The source is permanently fixed
Figure 4.5: Configuration for focused clock distribution that minimizes alignment problems.
on the top of the upper lens block after it has been aligned with a detector at the edge of the
chip, thereby establishing a fixed optical axis. The only alignment required for the hologram
is rotation. The position of the image spots is determined by the spatial frequencies of the
gratings in the hologram, which could be established very precisely if the hologram were
written, for example, by electron-beam lithography.
Focused interconnect systems, like the unfocused ones, require a three-dimensional vol-
ume above the chip. If holographic elements are used, thought must be given to the effects of
using a comparatively non-monochromatic source such as an LED. A spread of the spectrum
of the source results in a spread of the focused energy, so the primary effect is to reduce the
efficiency with which light can be delivered to the desired detector locations.
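The magnitude of this spectral spread can be estimated from the grating equation: a first-order component of spatial frequency ν lands at x ≈ fλν in the focal plane (small-angle approximation), so a source of spectral width Δλ smears the spot by roughly fνΔλ. A sketch with assumed, illustrative parameters:

```python
# Focused-spot smear due to a non-monochromatic source.  First-order
# diffraction from a grating of spatial frequency nu lands at
# x ≈ f * wavelength * nu, so a spectral width d_lambda smears the spot
# by f * nu * d_lambda.  All parameter values are illustrative assumptions.

f = 20e-3            # assumed focal length, 20 mm
nu = 1e5             # assumed grating spatial frequency, 100 lines/mm
wavelength = 850e-9  # assumed centre wavelength, 850 nm

x = f * wavelength * nu        # spot offset from the optical axis (~1.7 mm)
smear_laser = f * nu * 1e-9    # ~1 nm laser linewidth  -> ~2 um smear
smear_led = f * nu * 40e-9     # ~40 nm LED linewidth   -> ~80 um smear

print(f"spot at {x*1e3:.2f} mm, LED smear {smear_led*1e6:.0f} um")
```

Under these assumptions an LED would spread the focused spot by tens of micrometers, comparable to or larger than a detector, while a laser diode keeps the smear negligible.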
4.2 Essential Components of an On-Chip Optical Signaling System
Electrical clock distribution networks utilize geometrical or electrical matching or grid
routing to minimize clock skew [18, 39]. To minimize clock delay, large, power-hungry
clock buffers are inserted in the clock distribution network. These buffers can consume up
to 30–40% of the total chip power [27]. Most existing optical clocking solutions involve
non-CMOS-compatible exotic materials or processing steps that are expensive to integrate
into existing manufacturing flows, preventing their widespread adoption [34].
Figure 4.6: An optical clock distribution system.
A truly CMOS-compatible solution will reduce manufacturing cost and facilitate easier
integration into mainstream manufacturing. Such an approach was introduced in [38, 40, 53].
Figure 4.6 shows the various stages in an optical clocking system which is compatible with
the proposed approach in [40] and [38]. The optical clock source is optically coupled to the
distribution network, which is optoelectronically coupled to an optical detector that converts
incident optical energy into current pulses. The recovery and signal conditioning stage then
amplifies the current pulses to generate a corresponding rail-to-rail electrical clock signal for
local distribution. The clock signal is distributed to the entire chip by dividing the chip
into clock domains and placing a clock recovery resource or transimpedance amplifier (TIA)
station in each domain.
4.2.1 Optical Clock Source
The optical source providing the optical clock is a laser diode that is external to the
chip. The laser diode is attached to a single mode optical fiber. The optical energy at the
end of this fiber is the optical input to the clock distribution network. The optical coupling
into the clock distribution network is achieved by positioning and gluing the fiber to the
polished edge of the die.
Figure 4.7: Line diagram of an optical H-tree.
4.2.2 Optical Clock Distribution Tree
On-chip waveguides can be used to construct a planar optical H-tree. A 16 leaf-node,
balanced optical H-tree (as shown in Figure 4.7) is used as the optical clock distribution
network in the on-chip optical clock distribution and recovery system.
Seamless integration of the optical distribution network into standard CMOS processing
is of paramount importance, especially since the distribution network is spatially the largest
component of the system and the place where the advantages of optical signaling are most pronounced.
The planar waveguide core is constructed from silicon nitride (SiN), which is normally used
for copper encapsulation in a standard CMOS process. The cladding layers are made of
silicon dioxide (SiO2) and low-k oxides such as phospho-silicate glass (PSG) and tetra ethyl
ortho silane (TEOS), normally used as inter-metal dielectric material in standard CMOS
processes. The silicon nitride is deposited using a plasma-enhanced chemical vapor
deposition (PECVD) technique onto a PSG/TEOS/SiO2 sandwich, which serves as the bottom
cladding. The top cladding is a TEOS/SiO2 layer. The high index difference between the
waveguide core and cladding provides excellent optical confinement in the waveguide. To
improve optical coupling between the H-tree and the external optical source, i.e., the optical
fiber, the waveguide core is wider at the die edge and slowly tapered down toward the detectors.
Figure 4.8: Traveling wave photo detector.
4.2.3 Optical Receiver: Photo Detector
The electrical clock is recovered from the optical clock by using a photo detector which
is placed at the end of each H-tree leaf node. In addition to being truly CMOS compatible,
the photo detector must have high bandwidth and photo sensitivity. Conventional photo
detectors such as P-I-N photodiodes have current “tails” (i.e., long temporal response) in
their time domain impulse response, effectively limiting incident optical energy switching
speed [22]. This behavior is attributed to the thickness of the photo detector and the greater
distance between the electrodes and where the carriers are generated within the detector.
Conventional photo detector designs therefore are insufficient for optical clock distribution
purposes where clock frequencies are high and constantly increasing. A more suitable photo
detector should have high speed and fast impulse response with minimal photo current tail.
A thin polysilicon layer, when used as the detector material, generates carriers in the
high field region with minimal photo current tail. However, a thin polysilicon layer results
in low detector responsivity. Thus, a traveling wave or lateral incidence detector (as shown
in Figure 4.8 [53]), which greatly enhances photo detector responsivity without the need
for a thick polysilicon layer, is used. The thin polysilicon layer ensures that the carriers are
generated close to the contacts (electrodes), greatly improving the response time of the photo
detector. Furthermore, for a lateral incidence photo detector with a thin polysilicon layer,
Figure 4.9: General approach to optical global clock distribution network.
the photocurrent has lower dependence on the bias voltage as the carriers are generated close
to the contacts and are less likely to undergo recombination.
The lateral incidence photo detector, which is placed at the leaf node of the optical H-
tree, when properly biased generates a current pulse for every incident laser pulse. The
photo detector initiates the clock recovery process by converting the incident optical clock
to current pulses (photo current) which are then processed to generate electrical clock signals
for local distribution. Due to the low magnitude of the photo current, it cannot be used
directly to drive clock sink nodes. A TIA is used to convert, amplify, and condition the photo
current to generate a rail-to-rail electrical clock signal.
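As a rough illustration of why this conversion stage is needed, an ideal single-stage TIA produces V_out = I_photo · R_f, so the transimpedance required for a full-rail swing can be estimated; the photocurrent and supply values below are assumptions for illustration only:

```python
# Required transimpedance for a rail-to-rail swing from a weak photocurrent.
# An ideal single-stage TIA gives V_out = I_photo * R_f.
# The photocurrent and supply-rail values are illustrative assumptions.

i_photo = 10e-6   # assumed 10 uA peak photocurrent
vdd = 1.2         # assumed supply rail, volts

r_f = vdd / i_photo   # feedback resistance for a full-rail output swing
print(f"required transimpedance ≈ {r_f / 1e3:.0f} kΩ")  # 120 kΩ
```

In practice such a large single-stage transimpedance would limit bandwidth, which is why the TIA is typically followed by further gain and limiting stages to square up the clock edges.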
4.3 Comparison of Power Consumption with Electrical Distribution System
Tosik et al. [49] compare the power consumption of optical and conventional electrical
clock distribution systems. Figure 4.9 shows a low-power vertical cavity surface emitting
laser (VCSEL) used as an off-chip photonic source. The VCSEL is coupled to the H-tree
symmetrical passive waveguide structure and provides the clock signal to N optical receivers.
The number and placement of the receivers in an optical clock system is equivalent to the
Figure 4.10: Power consumption of optical and electrical clock distribution networks versusthe number of H-tree nodes.
number and placement of the output nodes in the electrical H-tree. At the receivers, the
high speed optical signal is converted to an electrical signal and subsequently distributed by
the local electrical networks. The number of optical to electrical converters is a particularly
crucial parameter in the overall system since optoelectronic interface circuits at these points
are, of course, necessary and consume power. The methodology and the assumptions used to
properly design the optical H-tree are presented in detail in [47, 48].
First, the power consumption of both systems is compared. The comparison is based
on the ITRS technology roadmap [3] in the case of electrical clock systems and on the state-
of-the-art device parameters in the case of optical clocks. This assumption may result in a
pessimistic estimate of the performance of the optical clock distribution network for future
technology nodes. The results presented in Figure 4.10 show the comparison between power
consumption of electrical and optical clock systems, both designed for the 70 nm technology
node. For a small number of H-tree nodes the power consumed by the optical H-tree is
more than one order of magnitude lower than in the electrical one. However, along with the
growth of circuit complexity, the advantage of the optical system tends to decrease. Finally,
with 8172 output nodes in the considered case, the power consumed by the optical system
becomes higher than that consumed by the electrical one. This fact is easily explained by
considering the optical power budget [47, 48]. With each doubling of the number
Figure 4.11: Chip structure with temperature gradient.
of H-tree output nodes, the optical power that must be emitted by the VCSEL to meet
the overall system quality requirement increases by at least 3.2 dB, which in turn increases
the electrical power consumed by the VCSEL by more than 100%. Additionally, since the
number of receivers equals the number of H-tree output nodes, the power consumed by the
receivers also doubles.
case of an electrical system the power consumption increases much less rapidly. These results
clearly show that the advantages of optics drastically decrease with the number of output
nodes.
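This scaling argument can be sketched numerically: each doubling of output nodes adds at least 3.2 dB of required optical power, a factor of about 2.09 and hence the quoted >100% electrical-power increase, while receiver power grows only linearly. The baseline powers below are assumptions; only the 3.2 dB-per-doubling figure comes from [47, 48]:

```python
# Optical power-budget scaling: each doubling of H-tree output nodes costs
# at least 3.2 dB of emitted optical power (a factor of ~2.09), while
# receiver power grows only linearly with the node count.
# Baseline powers are assumptions; only the 3.2 dB figure is from the text.

p_vcsel_mw = 1.0    # assumed emitted optical power for a 2-node tree, mW
p_rx_mw = 0.05      # assumed electrical power per receiver, mW
nodes = 2

for _ in range(5):                    # scale from 2 to 64 output nodes
    nodes *= 2
    p_vcsel_mw *= 10 ** (3.2 / 10)    # +3.2 dB per doubling (factor ~2.09)

print(f"{nodes} nodes: VCSEL {p_vcsel_mw:.1f} mW, receivers {nodes * p_rx_mw:.1f} mW")
```

Because the VCSEL power grows super-linearly while the electrical tree grows far more slowly, the optical advantage must vanish at some node count, as Figure 4.10 shows.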
In the next step the clock skew of the optical system is calculated. There are several
sources of clock skew in an optical system. Apart from process parameter variations, which
are mainly tolerances of device and waveguide physical parameters, system-level fluctuations
such as temperature variations have to be considered. In their analysis only the impact of
temperature variation on optical signal speed has been taken into account. As the chip
temperature rises, the refractive index of the waveguide core increases, thus reducing the
speed of the clock signal. Typical temperature gradients over the entire chip reported in the
literature are less than 50 K [37]. The calculation has been performed for the chip structure
where the temperature of one part is lower (350 K), while that of the other part is higher
(400 K) as presented in Figure 4.11. This represents the worst-case scenario.
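The waveguide-propagation contribution to this skew follows from the thermo-optic effect: a temperature difference ΔT changes the core index by Δn = (dn/dT)·ΔT, delaying the clock on the hotter branch by Δt = L·Δn/c. The sketch below uses assumed values for the thermo-optic coefficient and branch length, and illustrates only this one mechanism, not the full skew budget analyzed in [49]:

```python
# Waveguide-propagation skew between the hot (400 K) and cool (350 K)
# branches: a temperature rise dT increases the core index by
# dn = (dn/dT) * dT, so a branch of length L is slowed by dt = L * dn / c.
# The thermo-optic coefficient and branch length are illustrative assumptions.

c = 3.0e8        # speed of light in vacuum, m/s
dn_dT = 2.5e-5   # assumed core thermo-optic coefficient, 1/K
delta_T = 50.0   # 400 K - 350 K temperature difference
L = 1.0e-2       # assumed branch length, 1 cm

delta_n = dn_dT * delta_T    # index difference between the two branches
skew_s = L * delta_n / c     # extra transit time on the hot branch
print(f"propagation skew ≈ {skew_s * 1e15:.0f} fs")
```

Even a small index change therefore translates directly into a deterministic arrival-time difference, and the effect grows with both path length and clock frequency.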
Figure 4.12: Clock skew of a 64-output-node optical H-tree compared to the clock period asa function of technology.
Figure 4.12 shows the clock skew of a 64-output-node optical H-tree compared to the
clock period as a function of technology. It is clear from this figure that already at the 32 nm
technology node (33 GHz) the clock skew exceeds 10% of the clock period. This would
result in serious system failures.
4.4 Conclusion
The use of optics to make connections within and between chips could solve many of
the problems experienced in current electrical systems. Many of the physical reasons for the
use of optics are well understood and indicate many potential quantitative and qualitative
benefits. Though there are, and will continue to be, electrical solutions that stretch the
capabilities of electrical interconnects, optics is arguably the only physical solution available
to solve the underlying problems of interconnects, and has the potential to continue to scale
with future generations of silicon integrated circuits.
Chapter 5
Future Work
5.1 Wireless Clock Distribution
• External power amplifiers (PAs) are used to increase the power level of the transmitted
clock signal because the on-chip PA is mistuned. External PAs increase system complexity
and cost, and should be replaced by an on-chip PA. In this
system, the transmitted global clock signal is an amplitude-modulated sine wave with
periodic no-signal-transmission, which is generated by switches at output nodes. A PA
with high output power and power efficiency should be designed and fabricated.
• The local clock signal frequency is limited by the operating frequency of clock trans-
mitter and receiver. There is much room for increasing the clock frequency using this
technology. In an intra-chip wireless clock distribution system, microwaves propagate
at the speed of light to distribute the clock signals across a chip. The RC delay and
dispersion associated with conventional metal-line interconnections are almost elimi-
nated. Using this technology, the maximum local clock signal frequency is set by the
maximum operating frequency of clock transmitter and clock receiver, and the clock
skew and jitter at very high frequencies. According to recent publications, CMOS ICs
using a UMC 0.13 µm process can reach an operating frequency of ∼100 GHz [10, 24].
Therefore, the maximum clock frequency will be limited only by the clock skew and
jitter. As discussed earlier, the skew and jitter performances have the potential of
being excellent, which could give more margin for the clock frequency increase. To
further reduce the clock skew and jitter, the free-running VCO in the clock trans-
mitter should be replaced by a PLL, and a clock receiver with lower noise should be
developed. As the clock frequency increases, the on-chip receiving antenna can be made
smaller, reducing the antenna area, which is a major overhead in the system. With a
smaller receiving antenna, the width and height
of rectangular apertures in heat sinks can be reduced, which should improve the heat
sink performance.
5.2 Optical Clock Distribution
• Optoelectronic devices require continued development to meet the yield, tolerance,
and drive voltage requirements for practical systems with future generations of silicon
CMOS.
• Work will be required in the interface circuits between optics and electronics. Though
there appears to be no current fundamental difficulty in making such circuits in CMOS,
research is needed in circuits that, i) avoid issues such as crosstalk and susceptibility
to digital noise, ii) have appropriately low power dissipation and latency, and iii) are
tolerant to process variations.
• The technology for integrating optoelectronics with silicon integrated circuits is still
at an early stage, though there have been key demonstrations of substantial working
integrations. Likely first introductions of optical interconnect to chips will use hybrid
approaches, such as solder bonding; such hybrid approaches require no modifications to
the current silicon integrated circuit fabrication process, apart from additional processing
steps applied to already-fabricated wafers.
• It will be important to research the systems and architectural benefits of optics for
interconnects. Optics can likely enable kinds of architectures that are not well suited
to electrical interconnect systems (e.g., architectures with many long connections, ar-
chitectures with large “aspect ratios,” architectures requiring synchronous operation
over large domains), and can likely also allow continued use of current architectures
that otherwise would have to be abandoned in the future because of the limitations of
wired interconnects.
Bibliography
[1] “The International Technology Roadmap for Semiconductors,” 1999. Semiconductor Industries Association, San Jose, California.
[2] “The International Technology Roadmap for Semiconductors,” 2001.
[3] “The International Technology Roadmap for Semiconductors,” 2005.
[4] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. New York: Addison-Wesley, 1990.
[5] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration,” Proc. IEEE, vol. 89, pp. 602–633, 2001.
[6] L. Benini and G. De Micheli, “Transformation and Synthesis of FSMs for Low-Power Gated Implementation,” in Proceedings International Symposium on Low Power Design, Apr. 1995, pp. 21–26.
[7] A. Bhatnagar, Low Jitter Clocking of CMOS Electronics Using Mode-Locked Lasers. PhD thesis, Stanford University, Palo Alto, California, 2005.
[8] W. J. Bowhill, R. L. Allmon, S. L. Bell, E. M. Cooper, D. R. Donchin, J. H. Edmondson, T. C. Fischer, P. E. Gronowski, A. K. Jain, P. L. Kroesen, B. J. Loughlin, R. P. Preston, P. I. Rubinfeld, M. J. Smith, S. C. Thierauf, and G. M. Wolrich, “A 300-MHz 64-b Quad-Issue CMOS RISC Microprocessor,” IEEE Jour. Solid-State Circuits, vol. 30, no. 11, pp. 1203–1211, Nov. 1995.
[9] D. Bravo, Investigation of Background Noise in Integrated Circuits Relating to the Design of an On-Chip Wireless Interconnect System, Master’s thesis, University of Florida, Gainesville, Florida, 2000.
[10] C. Cao and K. K. O, “A 90-GHz Voltage-Controlled Oscillator with a 2.2-GHz Tuning Range in a 130-nm CMOS Technology,” in Symp. VLSI Circuits Dig. Tech. Papers, (Kyoto, Japan), 2005.
[11] K.-N. Chen, M. J. Kobrinsky, B. C. Barnett, and R. Reif, “Comparisons of Conventional, 3-D, Optical, and RF Interconnects for On-Chip Clock Distribution,” IEEE Transactions on Electron Devices, vol. 51, no. 2, Feb. 2004.
[12] T.-Y. Chiang, K. Banerjee, and K. C. Saraswat, “Effect of Via Separation and Low-k Dielectric Materials on the Thermal Characteristics of Cu Interconnects,” in Proc. IEEE Electron Device Meeting, 2000, pp. 261–264.
[13] C. Debaes, A. Bhatnagar, D. Agarwal, R. Chen, G. A. Keeler, N. C. Helman, H. Thienpont, and D. A. B. Miller, “Receiver-Less Optical Clock Injection for Clock Distribution Networks,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, no. 2, March/April 2003.
[14] A. Deutsch, H. Harrer, C. W. Surovic, G. Hellner, D. C. Edelstein, R. D. Goldblatt, G. A. Biery, N. A. Greco, D. M. Foster, E. Crabbe, L. T. Su, and P. W. Coteus, “Functional High-Speed Characterization and Modeling of a Six-Layer Copper Wiring Structure and Performance Comparison With Aluminum On-Chip Interconnections,” in IEEE Electron Device Meeting Tech. Digest, Dec. 1998, pp. 295–298.
[15] D. W. Dobberpuhl, R. T. Witek, R. Allmon, R. Anglin, D. Bertucci, S. Britton, L. Chao, R. A. Conrad, D. E. Dever, B. Gieseke, S. M. N. Hassoun, W. Hoeppner, K. Kuchler, M. Ladd, B. M. Leary, L. Madden, E. J. McLellan, D. R. Meyer, J. Montanaro, D. A. Priore, V. Rajagopalan, S. Samudrala, and S. Santhanam, “A 200-MHz 64-b Dual Issue CMOS Microprocessor,” IEEE Jour. Solid-State Circuits, vol. SC-27, no. 11, pp. 1555–1565, Nov. 1992.
[16] B. Floyd, C.-M. Hung, and K. O. Kenneth, “Intra-Chip Wireless Interconnect for Clock Distribution Implemented With Integrated Antennas, Receivers, and Transmitters,” IEEE Jour. Solid-State Circuits, vol. 37, pp. 543–552, 2002.
[17] B. A. Floyd and K. O. Kenneth, “The Projected Power Consumption of a Wireless Clock Distribution System and Comparison to Conventional Distribution Systems,” in Proc. IITC, 1999.
[18] E. Friedman, Clock Distribution Networks in VLSI Circuits and Systems. New York: IEEE Press, 1995.
[19] B. A. Gieseke, R. L. Allmon, D. W. Bailey, B. J. Benschneider, S. M. Britton, J. D. Clouser, H. R. F. III, J. A. Farrell, M. K. Gowan, C. L. Houghton, J. B. Keller, T. H. Lee, D. L. Leibholz, S. C. Lowell, M. D. Matson, R. J. Matthew, V. Peng, M. D. Quinn, D. A. Priore, M. J. Smith, and K. E. Wilcox, “A 600-MHz Superscalar RISC Microprocessor with Out-Of-Order Execution,” in ISSCC Dig. Tech. Papers, Feb. 1997, pp. 176–177.
[20] J. W. Goodman, F. Leonberger, S.-Y. Kung, and R. A. Athale, “Optical Interconnections for VLSI Systems,” Proceedings of the IEEE, vol. 72, no. 7, pp. 850–866, July 1984.
[21] X. Guo, D. Yang, R. Li, and K. K. O, “A Receiver with Start-up Initialization and Programmable Delays for Wireless Clock Distribution,” in IEEE Int. Solid-State Circuits Conf. Digest of Tech. Papers, 2006, pp. 386–387.
[22] U. Hilleringman and K. Goser, “Optoelectronic System Integration on Silicon: Waveguides, Photodetectors, and VLSI CMOS Circuits on One Chip,” IEEE Trans. Electron Devices, vol. 42, no. 5, pp. 841–846, May 1995.
[23] M. Horowitz, “Clocking Strategies in High Performance Processors,” in Proceedings of the IEEE Symposium on VLSI Circuits, June 1992, pp. 50–53.
[24] P. Huang, M. Tsai, H. Wang, C. Chen, and C. Chang, “A 114 GHz VCO in 0.13 µm CMOS Technology,” in Proc. ISSCC, 2005. Paper 21.8.
[25] M. J. Kobrinsky, B. A. Block, J. F. Zheng, B. C. Barnett, E. Mohammed, M. Reshotko, F. Robertson, S. List, I. Young, and K. Cadien, “On-Chip Optical Interconnects,” Intel Technology Journal, vol. 8, no. 2, May 2004.
[26] H. Kojima, S. Tanaka, and K. Sasaki, “Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry,” in Proceedings of the IEEE Symposium on VLSI Circuits, June 1994, pp. 23–24.
[27] J. Lillis, C.-K. Cheng, and T. Lin, “Optimal and Efficient Buffer Insertion and Wire Sizing,” in Proc. IEEE Custom Integr. Circuits Conf., May 1995, pp. 259–262.
[28] S. List, C. Webb, and S. Kim, “3D Wafer Stacking Technology,” in Proc. Advanced Metallization Conf., Oct. 2002, pp. 29–36.
[29] C. Mead and L. Conway, Introduction to VLSI Systems. Reading, Massachusetts: Addison-Wesley, 1980.
[30] J. Mehta, An Investigation of Background Noise in ICs and Its Impact on Wireless Clock Distribution, Master’s thesis, University of Florida, Gainesville, Florida, 1998.
[31] D. Miller, A. Bhatnagar, S. Palermo, A. Emami-Neyestanak, and M. Horowitz, “Opportunities for Optics in Integrated Circuits Applications,” in Proceedings of the International Solid State Circuits Conference, 2005.
[32] D. A. B. Miller, “Rationale and Challenges for Optical Interconnects to Electronic Chips,” Proc. IEEE, vol. 88, pp. 728–749, 2000.
[33] J. Montanaro, R. T. Witek, K. Anne, A. J. Black, E. M. Cooper, D. W. Dobberpuhl, P. M. Donahue, J. Eno, W. Hoeppner, D. Kruckemyer, T. H. Lee, P. C. M. Lin, L. Madden, D. Murray, M. H. Pearce, S. Santhanam, K. J. Snyder, R. Stehpany, and S. C. Thierauf, “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE Jour. Solid-State Circuits, vol. 31, no. 11, pp. 1703–1714, Nov. 1996.
[34] A. Mule, E. Glytsis, T. Gaylord, and J. Meindl, “Electrical and Optical Clock Distribution Networks for Gigascale Microprocessors,” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 10, no. 5, pp. 582–594, Oct. 2002.
[35] C. Nagendra and M. J. Irwin, “Design Trade-Offs in CMOS FIR Filters,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, May 1996.
[36] J. Neves and E. Friedman, “Circuit Synthesis of Clock Distribution Networks Based on Non-Zero Clock Skew,” in Proceedings of IEEE International Symposium on Circuits and Systems, May/June 1994, pp. 4.175–4.178.
[37] F. J. Pollack, “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies,” in Proc. 32nd Ann. ACM/IEEE Int. Symp. Microarchitecture, (Haifa, Israel), 1999, pp. 2–4.
[38] R. Pownall, G. Yuan, T. Chen, P. Nikkel, and K. Lear, “Geometry Dependence of CMOS-Compatible, Polysilicon, Leaky-Mode Photodetectors,” IEEE Photon. Technol. Lett., vol. 19, no. 7, pp. 513–515, Apr. 2007.
[39] P. Ramanathan, A. Dupont, and K. Shin, “Clock Distribution in General VLSI Circuits,” IEEE Trans. Circuits Syst. I: Fundam. Theory Appl., vol. 41, no. 5, pp. 395–404.
[40] A. Raza, G. W. Yuan, C. Thangaraj, T. Chen, and K. Lear, “Waveguide Coupled CMOS Photodetector for On-Chip Optical Interconnects,” Lasers Electro-Optics Soc., vol. 1, pp. 152–153, Nov. 2004.
[41] R. Reif, A. Fan, K.-N. Chen, and S. Das, “Fabrication Technologies for Three-Dimensional Integrated Circuits,” in Proc. IEEE International Symp. Quality Electronic Design, 2002, pp. 33–37.
[42] P. J. Restle and A. Deutsch, “Designing the Best Clock Distribution Network,” in Proc. Symp. VLSI Circuits, 1998, pp. 2–5.
[43] W. Ryu, J. Lee, H. Kim, S. Ahn, N. Kim, B. Choi, D. Kam, and J. Kim, “RF Interconnect for Multi-Gbit/s Board-Level Clock Distribution,” IEEE Trans. on Adv. Packaging, vol. 23, pp. 398–407, 2000.
[44] A. Z. Shang and F. A. P. Tooley, “Digital Optical Interconnects for Networks and Computing Systems,” Lightwave Technology, vol. 18, pp. 2086–2094, 2000.
[45] I. Sutherland, B. Sproul, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. Morgan-Kaufmann, 1999.
[46] S. Tam, S. Rusu, U. N. Desai, R. Kim, J. Zhang, and I. Young, “Clock Generation and Distribution for the First IA-64 Microprocessor,” IEEE Jour. Solid-State Circuits, vol. 35, pp. 1545–1552, 2000.
[47] G. Tosik, F. Gaffiot, Z. Lisik, I. O’Connor, and F. Tissafi-Drissi, “Optical Versus Electrical Interconnections for Clock Distribution Networks in New VLSI Technologies,” in Proc. Integrated Circuit and System Design, Power and Timing Modeling, Optimization and Simulation, volume 2799, Sept. 2003, pp. 461–470.
[48] G. Tosik, F. Gaffiot, Z. Lisik, I. O’Connor, and F. Tissafi-Drissi, “Metallic Clock Distribution Networks in New VLSI Technologies,” IEE Electron. Lett., vol. 40, no. 3, Feb. 2004.
[49] G. Tosik, Z. Lisik, and F. Gaffiot, “Optical Interconnections in Future VLSI Systems,” Journal of Telecommunications and Information Technology, pp. 105–108, 2007.
[50] S. H. Unger and C.-J. Tan, “Clocking Schemes for High-Speed Digital Systems,” IEEE Trans. Computers, vol. C-35, no. 10, pp. 880–895, Oct. 1986.
[51] D. Wann and M. Franklin, “Asynchronous and Clocked Control Structures for VLSI Based Interconnection Networks,” IEEE Trans. Computers, vol. C-32, no. 3, pp. 284–293, Mar. 1983.
[52] C. Webb, C. J. Anderson, L. Sigal, K. L. Shepard, J. S. Liptay, J. D. Warnock, B. Curran, B. W. Krumm, M. D. Mayo, P. J. Camporese, E. M. Schwarz, M. S. Farrell, P. J. Restle, R. M. A. III, T. J. Slegel, W. V. Huott, Y. H. Chan, B. Wile, T. N. Nguyen, P. G. Emma, D. K. Beece, C.-T. Chuang, and C. Price, “A 400-MHz S/390 Microprocessor,” IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1665–1675, Nov. 1997.
[53] G. Yuan, R. Pownall, P. Nikkel, C. Thangaraj, T. Chen, and K. Lear, “Characterization of CMOS Compatible Waveguide-Coupled Leaky-Mode Photodetectors,” IEEE Photon. Technol. Lett., vol. 18, no. 15, pp. 1657–1659, Aug. 2006.
[54] Q. Zhu and W. Dai, “Planar Clock Routing for High Performance Chip and Package Co-Design,” IEEE Transactions on VLSI Systems, vol. 4, no. 2, pp. 210–226, June 1996.