
Department of Electrical Engineering (Institutionen för systemteknik)

Degree project

Adaptive TDC Implementation and Evaluation of an FPGA

Degree project carried out in Electronics at the Institute of Technology, Linköping University

by

Simon Andersson Holmström

LiTH-ISY-EX-ET-15/0428–SE

Linköping 2015

Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden


Supervisor: Andreas Ehliar, ISY, Linköping University

Examiner: Jan-Åke Larsson, ISY, Linköping University

Linköping, 29 April 2015

Division, Department: Division of Computer Engineering (Avdelningen för Datorteknik), Department of Electrical Engineering, SE-581 83 Linköping

Date: 2015-04-29

Language: English

Report category: Degree project (Examensarbete)

URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-134424

ISBN: -

ISRN: LiTH-ISY-EX-ET-15/0428–SE

Title of series, numbering: -

ISSN: -

Title: Adaptive TDC

Author: Simon Andersson Holmström
Abstract

A time-to-digital converter (TDC) is a digital unit that measures the time interval between two events. This is useful for determining the characteristics and patterns of a signal or an event. In this thesis a hybrid TDC is presented that combines a tapped delay line with a clock counter.

The TDC is used to measure the time between received data in a QKD application. If the measured time does not exceed a certain value, the data has been sent without any interception. TDCs can also be used in other fields, such as laser-ranging and time-of-flight applications.

The TDC consists of two carry chains, an encoder, a FIFO and a counter for each channel, together with an AXI module and a control unit that generates command signals to all implemented channels. The time is measured by sampling the signal that has propagated through the carry chain and encoding the propagation length from this sample.

The TDC implemented in this thesis has a 10 ns dead time and a resolution below 28 ps in four-channel mode. During testing the propagation variation is approximately two percent of the total value. The implementation uses an FPGA board with a Zynq XC7Z020 SoC and is written in SystemVerilog, a hardware description language (HDL).

Keywords: TDC, Carry-chain, FPGA, Zynq, Delay


Abbreviations

ALU: Arithmetic Logic Unit
AXI: Advanced eXtensible Interface
CLB: Configurable Logic Block
CSR: Control Status Register
DFF: Data Flip-Flop
DSP: Digital Signal Processor
DUT: Device Under Test
FIFO: First In, First Out
FPGA: Field Programmable Gate Array
FSM: Finite State Machine
HDL: Hardware Description Language
LSB: Least Significant Bit
LUT: Look-Up Table
PG: Pattern Generator
PL: Programmable Logic
PLL: Phase Locked Loop
PS: Processing System
RAM: Random Access Memory
SIMD: Single Instruction Multiple Data
SoC: System on a Chip
T_LSB: Lowest measurable time interval for the TDC
TDC: Time to Digital Converter
QKD: Quantum Key Distribution
UART: Universal Asynchronous Receiver/Transmitter

List of Figures

2.1 Overview of a CLB
2.2 CARRY4 element
2.3 Basic delay chain
2.4 Principle for a synchronous TDC
2.5 Operating principle of a delay chain in the TDC
2.6 A delay chain with a clock skew
3.1 The top level overview
3.2 Overview of one TDC module
3.3 Screenshots from PlanAhead
3.4 The FSM in the control module
4.1 An overview of the soft test bench setup
4.2 A comparison between maximum and minimum propagation values
4.3 A comparison between maximum and minimum propagation variance

List of Tables

4.1 Differences in maximum value between carry chains
4.2 Differences between some start signals
4.3 Differences in propagation between some runs with some added logic
4.4 A collection of maximum and minimum values for the trace encoder
4.5 Resource utilization on the FPGA for four channels TDC
4.6 Distribution of the used resources
4.7 Resource utilization on the FPGA for four channels TDC with calibration
4.8 Resource utilization on the FPGA for the four channel TDC with trace encoder
4.9 A table over the clock path skew

Contents

Abbreviations
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Problem-formulation and limitations
  1.3 Method
2 Theory
  2.1 FPGA
    2.1.1 Configurable Logic Block
    2.1.2 Slice
    2.1.3 DSP48E1
    2.1.4 Block RAM
  2.2 Time to Digital Converter
    2.2.1 Principle
    2.2.2 Characteristic sources
  2.3 Encoding
3 Design and implementation
  3.1 Hardware
  3.2 System
    3.2.1 TDC
    3.2.2 Control module
    3.2.3 AXI-int module
  3.3 Software
4 Results
  4.1 Software Test set-up
  4.2 Hardware Test set-up
  4.3 Measurements
  4.4 Resource usage
  4.5 Error sources
5 Discussion
6 Conclusions
7 Future work
Bibliography
A Plots

1 Introduction

This section introduces the thesis and briefly presents the background, the problem formulation, the limitations and the method used.

1.1 Background

This thesis investigates the possibilities and challenges of designing a time-to-digital converter (TDC). The TDC measures the time interval between a start and a stop signal and converts the measured time into a digital value. This is useful in measuring tools in various fields.

Provided that the precision is sufficiently high, the TDC can be used for measuring disturbances on a data channel. One application that takes advantage of this is Quantum Key Distribution (QKD), which is provably secure for sharing information over a public channel without any third party gaining insight into the information. One popular protocol for this application is BB84 (Bennett and Brassard, 1984); see for example [Michael and Isaac, 2000, Qi and Weiyue, 2013].

The provable security of QKD is based on the no-cloning theorem, and the only requirement is that qubits can be communicated through the public channel with an error rate below a certain threshold. The basic principle of QKD is that Eve cannot intercept Alice and Bob's transmission without affecting the signal [Qi and Weiyue, 2013, Michael and Isaac, 2000].

Qubits are the information carriers in quantum systems. A difference between qubits and ordinary data bits is that a qubit, besides representing a one or a zero, can also be in a superposition, that is, a linear combination of zero and one [Michael and Isaac, 2000].

TDCs are also used in other fields such as time-of-flight and laser ranging applications.

TDCs can be implemented with different techniques such as oscillators, counters and delay lines. In this thesis the TDC is implemented on an FPGA and therefore uses a combination of a delay line and a counter, which makes it flexible and cost-efficient to implement.

1.2 Problem-formulation and limitations

The goal of this thesis is to design and evaluate a TDC based on an FPGA. The main purpose is to use this design in a QKD system. For the system to work correctly in this application, its resolution must be finer than 100 ps.

The system is designed for a Xilinx Zynq, so some changes may be needed if the system is to be implemented on another platform.

Due to the limited amount of time, certain limitations had to be made. For instance, there is no study of how strongly temperature affects the given design; this is left for future work.

1.3 Method

To obtain the necessary knowledge in the given area, the thesis started with a literature study.

After that, a model of the system was formulated and divided into smaller modules that were implemented separately. Each module was tested individually in the software test bench and its defects were corrected. When no more defects were found, the modules were assembled into larger modules and retested in the test bench for errors that could arise from merging them. This proceeded until the entire system had been assembled and verified.

After the design had been verified it was implemented on the development board, a Zedboard. The board was also used together with a pattern generator (PG) to verify the design. During the verification phase, some minor modifications were made to make the system more stable and predictable. Some functions were also added to ease testing and measuring.

Once the functionality of the design had been confirmed on the board, the board was used to evaluate different configurations of the system.

2 Theory

This chapter introduces some of the theory behind the TDC. It starts with the basic building blocks of an FPGA and continues with a short description of the theory of the TDC and how it can be implemented. This is followed by a review of some characteristics of the TDC that can affect the delay time. The chapter ends with the data encoding of the TDC.

2.1 FPGA

FPGAs are the most common type of reconfigurable logic. Reconfigurable means that the hardware is constructed in such a way that the logic can be programmed after fabrication [Marwedel, 2011, Wolf, 2004].

The subsections below explain some of the building blocks of FPGAs. The details are based on the Xilinx 7-series FPGAs, so there may be some differences from other suppliers or FPGA models.

2.1.1 Configurable Logic Block

The Configurable Logic Block (CLB) is the main logic resource of the FPGA, and each CLB consists of two slices. CLBs are used for implementing sequential and/or combinatorial circuits. The CLBs are connected through a switch matrix; an illustration of a CLB with its connections is shown in fig. 2.1 [Xil, 2014a].

Figure 2.1: Overview of a CLB, based on a figure from [Xil, 2014a]. The figure shows a switch matrix and a CLB containing SLICE(0) and SLICE(1), each with carry logic and its own CIN and COUT.

These two slices are not connected to each other; instead, each is connected to the slice with the same orientation in the CLBs above and below [Xil, 2014a].

2.1.2 Slice

Xilinx 7-series FPGAs have two types of slices, SLICEL and SLICEM, each specialized for a typical application purpose. The difference is that SLICEM has added functions for storing and shifting data. There are between 2.6K and 305.4K slices depending on the model [Xil, 2014a].

Each slice contains:

• Four logic-function generators or look-up tables

• Eight storage elements

• Wide-function multiplexers

• Carry logic

The wide-function multiplexers can be used to form a 27-input combinational function or a 16:1 multiplexer in one slice, and even wider multiplexers can be created over multiple slices [Xil, 2014a].

The carry logic in each slice consists of four carry logic elements connected in a chain. Each element consists of one MUX and one XOR gate. Figure 2.2 illustrates this carry chain block (CARRY4) [Xil, 2014a].

Figure 2.2: The CARRY4 element, based on a figure from [Xil, 2014a]. The figure shows four cascaded stages, each with an XOR gate and a multiplexer and the signals O, D, S and CO, plus the chain inputs CI and CINIT.

The carry logic is usually used for arithmetic functions. When carry logic elements are cascaded, the propagation delay increases linearly with the number of bits in the operand. The number of carry logic elements that can be cascaded is limited by the column height of slices on the FPGA [Xil, 2014a].

2.1.3 DSP48E1

The DSP48E1 is a slice suited to DSP applications. It therefore has added functions for multiplication, accumulation, pattern detection and a SIMD ALU [Xil, 2014c].

2.1.4 Block RAM

FPGAs regularly have integrated block RAMs, which are RAMs distributed across the chip. This building block can be used to construct First-In, First-Out (FIFO) registers, large shift registers and ROMs. Xilinx 7-series FPGAs have between 25 and 1880 block RAMs of 36 Kb each [Xil, 2014b].

2.2 Time to Digital Converter

A Time to Digital Converter (TDC) is an electronic component that converts a time interval to a digital code. It can be used to measure disturbances in a signal flow, for instance in the QKD application mentioned in section 1.1.

2.2.1 Principle

There are different ways to implement a TDC. One way is to use a counter that counts the number of clock cycles that elapse between a start and a stop signal. The main drawback of this design is that the resolution is limited by the highest clock speed the system can achieve.
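To make this drawback concrete, the following minimal Python sketch models a counter-only TDC. The function name `counter_tdc` and the 32 ns example interval are illustrative assumptions, and the start signal is assumed to coincide with a clock edge. With a 100 MHz clock, any remainder shorter than one 10 ns period is lost.

```python
def counter_tdc(interval_ps, f_clk_hz):
    """Return the interval as seen by a counter-only TDC, in picoseconds."""
    t_clk_ps = round(1e12 / f_clk_hz)    # clock period in ps (10 000 ps at 100 MHz)
    cycles = interval_ps // t_clk_ps     # only whole clock cycles are counted
    return cycles * t_clk_ps

true_interval_ps = 32_000                # hypothetical 32 ns interval
measured_ps = counter_tdc(true_interval_ps, 100e6)
# Print the measured value and the remainder that the counter cannot resolve.
print(measured_ps, true_interval_ps - measured_ps)
```

In this example the counter reports 30 ns and the remaining 2 ns cannot be resolved, which is the gap a finer interpolation technique has to fill.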

A higher resolution can be obtained by dividing each clock interval into smaller time intervals, usually with a tapped delay chain. The technique is illustrated in fig. 2.3 [Henzler, 2010].

Figure 2.3: A basic delay chain, based on a figure from [Henzler, 2010]. The figure shows a row of DFFs with the inputs Start and Clock and the outputs Q[1] to Q[n].

Here every clock interval is divided into smaller intervals by a chain of digital elements through which the signal propagates. Analyzing fig. 2.3 gives the following equations for the model [Henzler, 2010].

$$\Delta T = N T_{clk} - (T_{clk} - \Delta T_{start}) + (T_{clk} - \Delta T_{stop}) \tag{2.1}$$

$$\Delta T_{start} = N_1 \frac{T_{clk}}{k} - \varepsilon_1 \tag{2.2}$$

$$\Delta T_{stop} = N_2 \frac{T_{clk}}{k} - \varepsilon_2, \qquad \varepsilon_1, \varepsilon_2 \in \left[0;\ T_{LSB} = \frac{T_{clk}}{k}\right] \tag{2.3}$$

Inserting eq. (2.2) and eq. (2.3) into eq. (2.1) gives the following equations, which describe the properties of a time interval [Henzler, 2010].

$$\Delta T = N T_{clk} + \left(N_1 \frac{T_{clk}}{k} - \varepsilon_1\right) - \left(N_2 \frac{T_{clk}}{k} - \varepsilon_2\right) = N T_{clk} + (N_1 - N_2)\frac{T_{clk}}{k} + \varepsilon_T \tag{2.4}$$

$$\varepsilon_T = \varepsilon_2 - \varepsilon_1 \in \left[-\frac{T_{clk}}{k};\ \frac{T_{clk}}{k}\right] \tag{2.5}$$

Here ε_T is the resulting quantization error. The events these equations describe are illustrated in fig. 2.4, where the key values N T_clk, ΔT_start and ΔT_stop are marked [Henzler, 2010].

Figure 2.4: Principle for a synchronous TDC, based on a figure from [Henzler, 2010]. The timing diagram shows Clk, Start, Stop, the clock count and the delay line count, with N T_clk, ΔT_start, ΔT_stop and ΔT marked.
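As a rough numerical sketch of eq. (2.1) to eq. (2.5), the snippet below combines a coarse clock count N with the two fine delay-line counts N1 and N2. The function name `hybrid_interval` and the example values (k = 400 steps per clock period, N = 3, N1 = 120, N2 = 40) are illustrative assumptions and not figures from this thesis; the quantization errors ε1 and ε2 are ignored.

```python
def hybrid_interval(n_clk, n1, n2, t_clk, k):
    """Estimate dT = N*Tclk + (N1 - N2)*Tclk/k, ignoring eps1 and eps2."""
    t_lsb = t_clk / k            # fine step: one clock period split into k parts
    coarse = n_clk * t_clk       # contribution of the clock counter
    fine = (n1 - n2) * t_lsb     # contribution of the two delay-line samples
    return coarse + fine

# 100 MHz clock (10 ns period) and a hypothetical k = 400, i.e. T_LSB = 25 ps
dt = hybrid_interval(n_clk=3, n1=120, n2=40, t_clk=10e-9, k=400)
print(f"{dt * 1e9:.3f} ns")
```

With these assumed values the coarse part contributes 30 ns and the fine part 80 steps of 25 ps, so the estimate is 32 ns.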

Figure 2.5 illustrates how the start signal propagates through a group of delay elements. From this figure we can see that the number of delay elements needed depends on the time range to be measured.

If we assume that the delay elements are symmetric, the number of delay elements is given by eq. (2.6) [Henzler, 2010].

$$N_{Delaysteps} = \left\lfloor \frac{\Delta T}{T_{LSB}} \right\rfloor \tag{2.6}$$

Here N_Delaysteps is the number of delay elements, ΔT is the time period and T_LSB is the delay of each element.
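A small worked example of eq. (2.6), written as a Python sketch: the helper name `delay_steps` is hypothetical, and integer picoseconds are used to avoid floating-point rounding. The inputs are a 10 ns clock period and an element delay of 28 ps, taken here as an assumed value at the resolution bound stated in the abstract.

```python
def delay_steps(delta_t_ps, t_lsb_ps):
    """Number of delay elements per eq. (2.6): floor(dT / T_LSB)."""
    return delta_t_ps // t_lsb_ps    # integer floor division

# Covering one 10 ns clock period with elements of an assumed 28 ps each
steps = delay_steps(10_000, 28)
print(steps)
```

A smaller element delay raises the count, so a chain that must span a whole clock period grows as the resolution improves.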

Figure 2.5: Operating principle of a delay chain in the TDC. The figure shows the Start signal, delay elements 1 to n and the Stop signal; the interval between the vertical lines is ΔT.

2.2.2 Characteristic sources

There are multiple sources that can affect the precision and characteristics of the TDC. Some of these sources and their possible impact are described below.

Process Variations

The TDC behavior may differ due to variations in the manufacturing process. Small structural variations in production can give each logic gate a slightly different size, which affects the characteristics of the component [Henzler, 2010].

Furthermore, temperature affects how the TDC performs; for instance, resistance is temperature dependent. Xi and Qi have studied how the resolution changes at different temperatures [Xi and Qi, 2013]. In this thesis that has been left out due to time constraints.

Logic structure

The logic structure may differ depending on the FPGA architecture, creating differences in path length that can cause differences in delay time between logic elements. In the Zynq device there are four carry logic elements in each slice, and the interconnect length inside a slice may differ from that between slices [Harmen and Edoardo, 2011].

Clock distribution and interconnect

In an ideal delay chain all elements would be sampled simultaneously, but that is difficult to achieve. FPGAs are usually divided into a number of clock regions to compensate for the delay of the clock signal across the circuit and thereby minimize the clock skew [Henzler, 2010, Harmen and Edoardo, 2011, Xil, 2013].

Even if these regions are perfectly balanced, the clock can still be unbalanced due to local process variations. This clock skew is illustrated in fig. 2.6 below, with a shimming delay representing the clock skew [Henzler, 2010, Harmen and Edoardo, 2011].

Figure 2.6: A delay chain with a clock skew, based on a figure from [Henzler, 2010]. The figure shows a row of DFFs with the signals Start and Clock and a delay T.

2.3 Encoding

There are different techniques for encoding the delay line. One technique is to count the number of ones that have been sampled by the DFFs. It is implemented by summing the ones from Q in fig. 2.3 or fig. 2.6, as in eq. (2.7) [Claudio and Edoardo, 2009].

$$\mathrm{Delay} = \sum_{i=0}^{n} Q[i] \tag{2.7}$$

This technique is relatively simple to implement and fast to run because it only sums the ones. One drawback is that clock skew or other defects can produce regions inside the chain where there are zeros instead of ones. This can give the encoded value a small error [Claudio and Edoardo, 2009].

A solution is to use an encoder that only tracks the one that has propagated farthest. This technique is not affected by bubbles of zeros inside the sampled value from the delay chain, but it makes the encoder more computationally intensive [Claudio and Edoardo, 2009].
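The difference between the two encoders can be sketched in a few lines of Python. The function names `ones_count` and `farthest_one` and the 8-element sample vectors are illustrative assumptions; a real chain is far longer and the encoder is implemented in hardware.

```python
def ones_count(q):
    """Sum encoder, eq. (2.7): count every sampled one."""
    return sum(q)

def farthest_one(q):
    """Trace-style encoder: 1-based position of the one that propagated farthest."""
    for i in range(len(q) - 1, -1, -1):
        if q[i] == 1:
            return i + 1
    return 0                          # no one was sampled at all

clean = [1, 1, 1, 1, 1, 0, 0, 0]     # ideal thermometer code
bubble = [1, 1, 0, 1, 1, 0, 0, 0]    # same reach, but with a zero bubble inside

print(ones_count(clean), farthest_one(clean))
print(ones_count(bubble), farthest_one(bubble))
```

On the clean sample both encoders agree. On the sample with a bubble the sum encoder comes out one step short, while the trace-style encoder still reports the full reach at the cost of a search over the chain.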

3 Design and implementation

This chapter reviews the design and its implementation. It starts with an overview of the hardware that the system is implemented on and then goes through the different parts of the design. The chapter ends with an overview of the software that has been added to run the design.

3.1 Hardware

During the implementation of the system, an FPGA development board named Zedboard was used. This board is based on the Xilinx Zynq™-7000 All Programmable SoC. The Zynq SoC combines a dual-core ARM® Cortex™-A9 processing system (PS) and programmable logic (PL) on the same chip, made in 28 nm technology [Xil, 2013].

FPGAs are well suited to experimental and prototype development because of their low cost for small series, their flexibility in implementation and the ability to make upgrades and corrections later on [Xil, 2013].

Some features of the development board:

• Xilinx XC7Z020-1CLG484C

• Memory

– 512 MB DDR3

– 256 Mb Quad-SPI Flash

• Interfaces

– USB-JTAG

– USB 2.0 FS USB-UART Bridge

– Five Digilent Pmod™ compatible headers

The system is implemented in the PL part of the Zynq and transmits the data to the PS over the AXI bus. The software runs on the PS and handles some simple data operations and transmission through the UART bridge.

The clock frequency for this system is set to 100 MHz, but higher clock speeds can be used if necessary; see [Xil, 2013].

3.2 System

The top-level design consists of a communication module, a control module and one TDC module for each channel to be implemented. The TDC system is of a hybrid type that combines a tapped delay chain and a clock counter. The top-level design with four receiver channels is illustrated in fig. 3.1.

Figure 3.1: The top level overview. The figure shows the ARM core, the AXI module, the control unit and the four modules TDC1 to TDC4 with their inputs Start1 to Start4.

The implementation uses a hardware description language called SystemVerilog. This language is an extension of Verilog-2005 and has a C-like syntax. Its main advantage over other hardware description languages, such as VHDL and classical Verilog, is the added support for object-oriented programming techniques, which makes it easier to develop test benches.

The system is designed to be adaptable: the number of channels can be adjusted by adding the desired number of TDC modules and making minor changes to the control unit.

Each TDC module works separately, and a signal needs to last at least one clock cycle for the system to be sure to detect it. The time between signals also needs to be at least one clock cycle. With the system running at 100 MHz, each of these time periods must be at least 10 ns.

However, the total speed of the system is limited by the data transfer rate through the UART bridge. In addition, the address width limits the number of modules that the implementation can communicate with, and the available logic limits the number of modules that can be implemented.

3.2.1 TDC

Each TDC module contains an encoder, a counter, a FIFO register and two carry chains. Figure 3.2 illustrates how these parts are connected and how data flows through the TDC module.

Figure 3.2: Overview of one TDC module. The blocks are the delay line, the encoder, the counter and the FIFO; the signals are Start, Data (from AXI), Address (from AXI), Read/Write and control signals (from the control unit). In addition to the signals included in the figure, reset and clk go to all modules.

The delay chain contains two fast carry chains implemented with the CARRY4 primitive illustrated in fig. 2.2. Using this primitive forces the synthesis tool to implement the chain in a specific pattern, which results in a chain with a minimized and more predictable path. Figure 3.3.a shows a chain implemented without the primitive and fig. 3.3.b one that uses it. In the first figure the delay elements are spread out over the circuit with no clear pattern, while in the second they are lined up in the pattern desired for this application.

The two carry chains in each TDC module make it possible to receive signals at the same rate as the hardware clock. Each of these carry chains has 600 carry elements, which is the maximum number of elements that fits in a column. The top of the column is connected to the bottom of the next column, which results in a larger delay when the signal propagates to the next column.

Figure 3.3: Screenshots from PlanAhead with different routings of the carry chain. (a) A carry chain that does not have the correct route placement. (b) A carry chain that has the correct route placement.

During run mode a one is propagated through the delay chain if there is no start signal on the channel, and a zero is propagated if there is a start signal. While the system is running, one chain is active for measuring while the other is sampled and restored. In the next clock cycle the chains swap roles.

For each detected start signal the encoder takes two sampled values from the delay chain, ΔT_start and Clk − ΔT_stop from fig. 2.4. The first sampled value, Clk − ΔT_stop, is the time interval between the latest positive clock edge and the positive edge of the start signal, which indicates the end of a transfer. The second sampled value, ΔT_start, is the time interval between the negative edge of the start signal and the latest positive clock edge, which indicates the start of a transfer.

ΔT_start and Clk − ΔT_stop are decoded by counting the ones in the sampled value and using the count as the address into a storage register. For each transfer, these two values are then added to the value from the counter, which counts the number of clock cycles between start signals, as in eq. (2.4).
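The decoding step described above can be modelled behaviourally as follows. This is only a sketch under stated assumptions: the names `decode` and `measurement`, the 8-element chain, the linear 25 ps table and the scaling of the counter value to picoseconds are all hypothetical and are not taken from the SystemVerilog implementation.

```python
def decode(sample, table):
    """Count the ones in a sampled chain and use the count as a table address."""
    return table[sum(sample)]

def measurement(first_sample, second_sample, clk_count, table, t_clk_ps):
    """Add the two decoded fine values to the coarse counter value (all in ps)."""
    coarse_ps = clk_count * t_clk_ps
    return coarse_ps + decode(first_sample, table) + decode(second_sample, table)

# Hypothetical storage register: entry i holds the time for i counted ones
table = [i * 25 for i in range(9)]   # 8-element chain, assumed 25 ps per element

first = [1, 1, 1, 0, 0, 0, 0, 0]     # three ones -> 75 ps
second = [1, 1, 1, 1, 1, 0, 0, 0]    # five ones  -> 125 ps
total_ps = measurement(first, second, clk_count=2, table=table, t_clk_ps=10_000)
print(total_ps)
```

Keeping the conversion in a table rather than a multiplier means that calibrated, non-uniform delays can be stored per count without changing the datapath.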

After summation the value is sent to the FIFO, which is implemented with the FIFO18E1 primitive. This primitive uses the built-in block RAM to make a FIFO and can be configured for a data width of 4, 9, 18 or 36 bits with a total size of 18 Kb. This system uses a data width of 36 bits, which gives a depth of 512 words.

The value is stored in the FIFO until it is requested by the PS over the AXI bus. If the FIFO becomes full, the encoder overwrites the last value, which results in data loss.
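The full-FIFO behaviour can be illustrated with a small behavioural model. The class name `OverwritingFifo` is hypothetical, and the sketch assumes that "the last value" means the most recently written entry; the text does not settle this, and the model does not describe the internals of the FIFO18E1 primitive.

```python
class OverwritingFifo:
    """Behavioural model: when full, a new value replaces the newest stored one."""

    def __init__(self, depth=512):
        self.depth = depth
        self.items = []

    def push(self, value):
        """Store a value; return False if an older value had to be overwritten."""
        if len(self.items) < self.depth:
            self.items.append(value)
            return True
        self.items[-1] = value       # full: overwrite the last value (data loss)
        return False

    def pop(self):
        """Return the oldest value, or None if the FIFO is empty."""
        return self.items.pop(0) if self.items else None

fifo = OverwritingFifo(depth=2)      # tiny depth to make the effect visible
results = [fifo.push(v) for v in (10, 20, 30)]
print(results, fifo.pop(), fifo.pop(), fifo.pop())
```

In this model the third write succeeds only by discarding the second value, so the reader on the PS side has to drain the FIFO faster than the channels fill it to avoid losing measurements.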

The FIFO and the carry chain use explicitly instantiated primitives rather than logic generated by the Xilinx tools, so for other vendors they should be replaced with a similar primitive, or with another primitive that suits the desired application better.

At synthesis the decoding can be changed to a trace technique which, instead of summing, traces the first one and the last one to estimate the propagation value.

3.2.2 Control module

This module is implemented as a finite state machine (FSM) that keeps track of the state of the system, i.e. idle, initiate, calibration or run time. The FSM is a Moore machine, i.e. its output depends only on the current state. Figure 3.4 gives an overview of the FSM in the control module with its states, transitions and transition conditions. This module also houses the control status register, which holds fundamental information about the condition of the system.

Figure 3.4: The FSM in the control module. The figure shows Start, Initiate, Idle, Calibrate and Run, with transitions labelled by CSR[0], CSR[1] and CSR[4] being 0 or 1. The jumps depend on the Control Status Register (CSR).

Initiate

This is the initial state, which runs during start-up and sets the start-up preferences for the system. The system can also be forced into this state by setting the control status register.

Calibration

This state calibrates the TDC module, provided that hardware calibration was included at synthesis.

The calibration is based on ΔT being known and N being measurable for the system. These two values are used in eq. (3.1), which is a rewritten form of eq. (2.6).

$$T_{LSB} = \frac{\Delta T}{N} + \varepsilon \tag{3.1}$$

This equation gives the average delay of a delay element by dividing the known value of ΔT, the clock cycle time, by the number of elements that the signal has propagated through.

The average value is used to calculate the product for each possible encoder case, and the products are written to the storage register in the encoder. When calibration is done for each channel, the calibration flag in the status register is cleared and the FSM jumps to the idle state.
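The calibration arithmetic can be sketched as below. The function name `calibrate`, the measured reach of N = 400 elements and the rounding to whole picoseconds are assumptions made for illustration; only the 10 ns clock period and the chain length of 600 elements come from the text, and the error term ε of eq. (3.1) is ignored.

```python
def calibrate(t_clk_ps, n_elements, chain_length):
    """Average element delay per eq. (3.1), then the table of count * T_LSB."""
    t_lsb = t_clk_ps / n_elements                     # average delay per element
    table = [round(count * t_lsb) for count in range(chain_length + 1)]
    return t_lsb, table

# 10 ns clock period, hypothetical reach of 400 elements, 600-element chain
t_lsb, table = calibrate(t_clk_ps=10_000, n_elements=400, chain_length=600)
print(t_lsb, table[1], table[400], table[600])
```

Each table entry corresponds to one possible encoder count, so the entry at the measured reach reproduces the clock period that was used as the reference.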

Runtime

In this state the system is active and measures the time intervals on each of the receiver channels. The measured values are stored in the FIFO and are sent to the ARM core on request.
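The state machine of the control module can be sketched as a Moore machine in Python. The mapping of CSR bits to transitions is an assumption: the text names the states and says that the jumps depend on the CSR, but it does not state which bit drives which jump. The bit assignment (bit 0 for run, bit 1 for calibrate, bit 4 for forced initiation) and the output names `measure_enable` and `calibrating` are therefore hypothetical.

```python
IDLE, INITIATE, CALIBRATE, RUN = "idle", "initiate", "calibrate", "run"

# Hypothetical CSR bit assignment (an assumption, not taken from the thesis)
RUN_BIT, CAL_BIT, INIT_BIT = 0, 1, 4

def bit(csr, n):
    """Return bit n of the control status register value."""
    return (csr >> n) & 1

def next_state(state, csr):
    """Transition function: depends on the current state and the CSR."""
    if bit(csr, INIT_BIT):
        return INITIATE                        # forced re-initialisation
    if state == INITIATE:
        return IDLE
    if state == IDLE:
        if bit(csr, CAL_BIT):
            return CALIBRATE
        return RUN if bit(csr, RUN_BIT) else IDLE
    if state == CALIBRATE:
        return CALIBRATE if bit(csr, CAL_BIT) else IDLE   # flag cleared when done
    if state == RUN:
        return RUN if bit(csr, RUN_BIT) else IDLE
    raise ValueError(f"unknown state: {state}")

def outputs(state):
    """Moore outputs: a function of the current state only."""
    return {"measure_enable": state == RUN, "calibrating": state == CALIBRATE}

state = INITIATE
for csr in (0b00000, 0b00010, 0b00000, 0b00001):
    state = next_state(state, csr)
print(state, outputs(state))
```

Because the outputs are computed from the state alone, a change in the CSR only becomes visible after the next state transition, which is the defining property of a Moore machine.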


3.2.3 AXI-int module

This module handles reads and writes on the AXI bus, which is a multi-master, multi-slave bus. The AXI interface consists of five channels [ARM, 2011, Xil, 2012]:

• Read Address Channel

• Write Address Channel

• Read Data Channel

• Write Data Channel

• Write Response Channel

The Zynq processor uses the second version of AXI, AXI4. AXI4 has three types of interfaces: AXI4, AXI4-Stream and AXI4-Lite.

The properties of each interface are:

• AXI4 has data burst support and a traditional memory-mapped address and data interface.

• AXI4-Stream has data-only bursts.

• AXI4-Lite supports only a single data cycle and has a traditional memory-mapped address and data interface.

The AXI-int module is in the PL part of the Zynq and communicates with the AXI interconnect. This application uses AXI4-Lite because it is simple to implement and use and can transmit data to the ARM core at a higher rate than the UART bus. In this design the PS acts as master and actively requests data from the TDC, which is connected as a slave. The module is based on a design from ISY.

3.3 Software

Some simple and practical functions are written in C to ease the communication between the system and the computer connected through UART. Some functions initialize the system or put it in a specific state. There are also functions for reading values other than the measured value from the system, and it is possible to write to the state register and to the registers in the encoder.

4 Results

This chapter presents the test setup used during this thesis and the results from the tests. It starts with an overview of the software test bench, followed by a description of the hardware test bench. The chapter ends with the results from these test benches.

4.1 Software Test set-up

During the implementation phase, a software test bench is used to verify the functionality of the system. A basic sketch of the outline is shown in fig. 4.1.

The test bench is written in SystemVerilog and simulated in ModelSim, a simulation environment for multiple HDL languages. The simulation environment and test bench are also practical during hardware evaluation when searching for error sources.

Other software used during evaluation is Xilinx PlanAhead, for verifying structure and placement and that the correct logic elements are used after synthesis. As an example of how this is useful, compare fig. 3.3.a and fig. 3.3.b, where the placement is essential for the TDC function.

4.2 Hardware Test set-up

To simplify estimates and measurements, some registers are added to the design. These registers record the maximum reach of the start signal during one clock cycle for each of the eight carry chains. To be able to do a functional test and verify the control functions in the system, a pattern generator (Tektronix TLA7PG2) is used to generate the start signal for each implemented channel.


[Figure: block diagram with the blocks DUT, PG, Reader, Testbench and Testmanager]

Figure 4.1: An overview of the software test bench setup.


4.3 Measurements

To estimate the precision of the system, the propagation through the carry elements can be traced for a given time. During testing the 100 MHz system clock is used as the reference time, and during one clock cycle the signal propagated through more than 500 carry steps. This is based on a single-channel design with no internal calibration and a minimum of surrounding logic.

Using eq. (2.6), the resolution is estimated to be less than 20 ps, which meets the 100 ps required for the QKD to work. This is a simplification due to time limitations. The Zynq SoC has carry accelerators that speed up the propagation over time. This acceleration has to be measured to estimate the impact of the speed-up.
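The estimate can be reproduced with a few lines of Python. This is a sketch that assumes eq. (2.6) has the form clock period divided by number of steps (as eq. (3.1) without the error term) and that the propagation is linear; the function names `lsb_ps` and `min_steps` are chosen for illustration only.

```python
import math


def lsb_ps(clock_hz: float, steps: int) -> float:
    """Average delay per carry step in ps, assuming linear propagation."""
    period_ps = 1e12 / clock_hz
    return period_ps / steps


def min_steps(clock_hz: float, required_ps: float) -> int:
    """Smallest step count per clock cycle that meets a required resolution."""
    return math.ceil(1e12 / clock_hz / required_ps)


print(lsb_ps(100e6, 500))       # 20.0 ps for 500 steps at 100 MHz
print(min_steps(100e6, 100.0))  # 100 steps are needed for 100 ps
```

Since more than 500 steps were observed, the estimated resolution is below 20 ps, a factor of five better than the 100 ps requirement.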

To achieve this resolution the carry chain must be aligned correctly, with a short path between each element; otherwise the resolution degrades drastically. For instance, the carry chain in fig. 3.3.a only has a resolution of 0.25 ns, while a chain lined up as in fig. 3.3.b could have a resolution of less than 20 ps for a single channel.

The placement of the carry chains is not locked, due to time limits. The number of carry steps that the signal propagates through depends on the placement of the carry chain and may change between syntheses because of small shifts in the surrounding logic.

The resolution degraded when more channels were added, as can be seen in, for instance, fig. 4.2.

During measurements it is possible to see the effects of the differences in the chip mentioned in subsection 2.2.2. These effects can be observed by comparing the maximum number of propagated carry elements between the various channels of the same implementation. An example of the difference between maximum and minimum propagation for different numbers of channels and designs is shown in fig. 4.2.

The extra calibration hardware in the plot is logic that has been added and is used during calibration instead of functions running in the PS.

The plot illustrates the maximum and minimum propagation through the delay line during one clock cycle in a system running at 100 MHz. These values give the expected range of resolution that can be obtained from each configuration.


[Figure: plot of propagation (number of elements, axis from about 400 to 500) against number of channels (1, 2 and 4). Series: max and min without extra calibration hardware, max and min with extra calibration hardware]

Figure 4.2: A comparison between maximum and minimum propagation values depending on the number of channels and design.

An estimate of the resolution that can be obtained from a four-channel implementation with eight carry chains is shown in table 4.1.

| | Ch. 1, Line 1 | Ch. 1, Line 2 | Ch. 2, Line 1 | Ch. 2, Line 2 | Ch. 3, Line 1 | Ch. 3, Line 2 | Ch. 4, Line 1 | Ch. 4, Line 2 |
|---|---|---|---|---|---|---|---|---|
| Steps | 439 | 394 | 420 | 376 | 421 | 380 | 437 | 393 |
| ≈ Delay (ps)/step | 22.8 | 25.4 | 23.8 | 26.6 | 23.8 | 26.3 | 22.9 | 25.4 |

Table 4.1: Differences in maximum propagation value between carry chains and the approximate delay time between each step.

To examine the fluctuation of the propagation, a test with multiple readings is done. This test is done in the same way as the previous tests, by measuring the propagation through the delay line during one clock cycle. A sample from one of these tests is shown in table 4.2. The same data is plotted in fig. A.1.


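| Read order | Ch. 1, Line 1 | Ch. 1, Line 2 | Ch. 2, Line 1 | Ch. 2, Line 2 | Ch. 3, Line 1 | Ch. 3, Line 2 | Ch. 4, Line 1 | Ch. 4, Line 2 |
|---|---|---|---|---|---|---|---|---|
| 1 read | 442 | 394 | 420 | 375 | 424 | 380 | 434 | 390 |
| 2 read | 439 | 390 | 420 | 376 | 424 | 380 | 434 | 394 |
| 3 read | 439 | 394 | 418 | 373 | 424 | 380 | 434 | 392 |
| 4 read | 441 | 391 | 418 | 374 | 422 | 380 | 434 | 394 |
| 5 read | 442 | 397 | 420 | 376 | 424 | 382 | 430 | 394 |
| 6 read | 437 | 394 | 420 | 373 | 421 | 380 | 434 | 394 |
| 7 read | 442 | 394 | 420 | 376 | 420 | 380 | 433 | 390 |
| 8 read | 442 | 390 | 418 | 376 | 422 | 380 | 432 | 394 |
| 9 read | 438 | 397 | 420 | 376 | 421 | 380 | 435 | 398 |
| 10 read | 442 | 397 | 420 | 376 | 422 | 380 | 433 | 394 |
| ≈ Delay (ps)/step | 22.7 | 25.4 | 23.8 | 26.7 | 23.7 | 26.3 | 23.1 | 25.4 |

Table 4.2: Differences between some start signals.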

This test does show a fluctuation of the signal propagation through the delay chain during run mode. However, it does not exceed the requirements, as shown by the Delay(ps)/step row in table 4.2, which is calculated according to eq. (4.1).

$$\text{Delay(ps)/step} = \frac{\Delta T_{clk}}{\left(\dfrac{\sum_{i=1}^{N_{read}} read[i]}{N_{read}}\right)} \qquad (4.1)$$

Here $\Delta T_{clk}$ is the clock cycle time, $N_{read}$ is the number of readings and $read[i]$ is the read-out value.
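As a check of eq. (4.1), the following sketch applies it to the ten readings of table 4.2. The names `READS` and `delay_per_step` are chosen for this illustration, and the clock period of 10 000 ps follows from the 100 MHz system clock.

```python
CLOCK_PERIOD_PS = 10_000.0  # one cycle of the 100 MHz system clock

# Rows are the ten reads of table 4.2; columns are the eight delay lines
# (channel 1 line 1, channel 1 line 2, ..., channel 4 line 2).
READS = [
    [442, 394, 420, 375, 424, 380, 434, 390],
    [439, 390, 420, 376, 424, 380, 434, 394],
    [439, 394, 418, 373, 424, 380, 434, 392],
    [441, 391, 418, 374, 422, 380, 434, 394],
    [442, 397, 420, 376, 424, 382, 430, 394],
    [437, 394, 420, 373, 421, 380, 434, 394],
    [442, 394, 420, 376, 420, 380, 433, 390],
    [442, 390, 418, 376, 422, 380, 432, 394],
    [438, 397, 420, 376, 421, 380, 435, 398],
    [442, 397, 420, 376, 422, 380, 433, 394],
]


def delay_per_step(reads, period_ps=CLOCK_PERIOD_PS):
    """Eq. (4.1): clock period divided by the mean read-out value."""
    return period_ps / (sum(reads) / len(reads))


columns = list(zip(*READS))  # one tuple of ten readings per delay line
delays = [round(delay_per_step(col), 1) for col in columns]
print(delays)  # [22.7, 25.4, 23.8, 26.7, 23.7, 26.3, 23.1, 25.4]
```

The printed values agree with the Delay(ps)/step row of table 4.2.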

Tests of a configuration with a simple internal calibration resulted in a difference in precision and fluctuation. Table 4.3 shows the propagation in the design with the added logic. The same data is plotted in fig. A.2.

| Read order | Ch. 1, Line 1 | Ch. 1, Line 2 | Ch. 2, Line 1 | Ch. 2, Line 2 | Ch. 3, Line 1 | Ch. 3, Line 2 | Ch. 4, Line 1 | Ch. 4, Line 2 |
|---|---|---|---|---|---|---|---|---|
| 1 read | 406 | 474 | 410 | 468 | 394 | 468 | 408 | 474 |
| 2 read | 410 | 474 | 410 | 468 | 394 | 470 | 408 | 474 |
| 3 read | 408 | 475 | 408 | 468 | 394 | 474 | 410 | 474 |
| 4 read | 410 | 476 | 410 | 468 | 394 | 470 | 409 | 474 |
| 5 read | 406 | 476 | 412 | 468 | 394 | 468 | 408 | 476 |
| 6 read | 408 | 476 | 411 | 466 | 393 | 470 | 407 | 474 |
| 7 read | 406 | 476 | 410 | 467 | 390 | 470 | 408 | 474 |
| 8 read | 409 | 476 | 410 | 466 | 398 | 470 | 408 | 474 |
| 9 read | 406 | 476 | 412 | 468 | 398 | 472 | 407 | 474 |
| 10 read | 406 | 476 | 410 | 466 | 394 | 473 | 406 | 474 |
| ≈ Delay (ps)/step | 24.5 | 21.0 | 24.4 | 21.4 | 25.4 | 21.3 | 24.5 | 21.1 |

Table 4.3: Differences in propagation between some runs with some added logic for a hardware calibration.

A collection of plots of the calculated maximum and minimum values of the variance is shown in fig. 4.3. The data consists of 10 readings during each of 10 restarts of the system, giving a total of 100 measured values.

[Figure: plot of propagation variation (number of elements, axis from 0 to about 10) against number of channels (1, 2 and 4). Series: max and min without extra calibration hardware, max and min with extra calibration hardware]

Figure 4.3: A comparison between maximum and minimum propagation variance depending on number of channels.

The maximum variance is about 10 steps for the four-channel configuration, which is approximately two to three percent of the total propagation value.
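How such fluctuation figures can be derived from a series of readings is sketched below for one delay line of table 4.2. It is an assumption of this sketch that the population variance (`statistics.pvariance`) is used; the text does not state which variance definition was applied. The helper names are illustrative.

```python
import statistics

# Ten readings of channel 1, line 1, taken from table 4.2.
reads = [442, 439, 439, 441, 442, 437, 442, 442, 438, 442]


def fluctuation_summary(values):
    """Variance, median and peak-to-peak spread of a series of readings."""
    return {
        "variance": statistics.pvariance(values),  # assumed definition
        "median": statistics.median(values),
        "spread": max(values) - min(values),
    }


def relative_percent(steps: float, total_steps: float) -> float:
    """Express a fluctuation in steps as a percentage of the propagation."""
    return 100.0 * steps / total_steps


summary = fluctuation_summary(reads)
print(summary["median"], summary["spread"])  # 441.5 5
print(round(relative_percent(10, 440), 1))   # 2.3
```

A fluctuation of 10 steps on a propagation of roughly 370 to 440 steps corresponds to about 2.3 to 2.7 percent, in line with the two to three percent stated above.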

An alternative encoder has also been tested that, instead of counting the ones, traces the first one. Some values gained from this measurement are shown in table 4.4, with values from the summation encoder as a comparison.

| | Summation encoder, Min | Summation encoder, Max | Trace encoder, Min | Trace encoder, Max |
|---|---|---|---|---|
| Variance | 0 | 7.73 | 0 | 23.51 |
| Median | 374 | 440 | 368 | 495 |

Table 4.4: A collection of maximum and minimum values for the trace encoder.

As in the previous test, this is the result of 10 readings during each of 10 restarts of the system, giving a total of 100 measured values.
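The difference between the two encoder principles can be illustrated with a small behavioural sketch. The exact logic of the encoders is not reproduced here; it is assumed that tap 0 is nearest the input of the delay line, that a sampled one means the signal has passed the tap, and that "tracing the first one" means locating the one furthest along the line. The function names are hypothetical.

```python
def summation_encode(taps):
    """Summation principle: count the number of ones in the sampled taps."""
    return sum(taps)


def trace_encode(taps):
    """Trace principle (assumed): position of the one furthest along the line.

    Scans from the far end of the delay line and returns the index of the
    first one found, plus one, or 0 if no tap has been reached.
    """
    for index in range(len(taps) - 1, -1, -1):
        if taps[index] == 1:
            return index + 1
    return 0


clean = [1, 1, 1, 1, 1, 0, 0, 0]   # ideal thermometer code
bubble = [1, 1, 1, 1, 0, 1, 0, 0]  # same number of ones, one tap out of order

print(summation_encode(clean), trace_encode(clean))    # 5 5
print(summation_encode(bubble), trace_encode(bubble))  # 5 6
```

Under these assumptions a single out-of-order tap changes the traced value but not the sum, which would be one possible explanation for the larger variance of the trace encoder in table 4.4.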


4.4 Resource usage

An overview of how much logic is used by the design is given in table 4.5.

| Resource | Used | Available | Utilization |
|---|---|---|---|
| Registers | 5702 | 106400 | 5% |
| LUTs | 14576 | 53200 | 27% |
| Slices | 4384 | 13300 | 32% |
| FIFO18E1 | 4 | 140 | 1% |
| IOs | 15 | 200 | 7% |
| BUFGs | 1 | 32 | 3% |

Table 4.5: Resource utilization on the FPGA for the four-channel TDC (part xc7z020clg484-1).

Some of this logic overlaps, e.g. the LUTs are located in Slices, but it gives an overview of which resources are used for this application. BUFGs are global buffers that are usually used to suppress clock skew between logic domains or to give quick access for control signals.

Unfortunately it is somewhat problematic to estimate how much of these resources is used by each part of the design, because the tool does not always keep the hierarchy intact during optimization. However, the size of some parts could be estimated by reading the MRP file, a report file from PlanAhead, and some others could be estimated by hand. A collection of these values is given in table 4.6.

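| Resource | Encoder[0] | Encoder[1] | Encoder[2] | Encoder[3] |
|---|---|---|---|---|
| Registers | 181 | 148 | 148 | 148 |
| LUTs | 2490 | 2353 | 2351 | 2340 |
| Slices | 862 | 735 | 786 | 795 |

| Resource | Delay line | FIFO |
|---|---|---|
| Registers | 306 | 0 |
| LUTs | 1206 | 1 |
| Slices | 1202 | 1 |
| FIFO18E1 | 0 | 1 |

Table 4.6: Distribution of the used resources.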

Table 4.7 shows the size with added calibration logic and the difference in percent compared to the system without calibration logic.

| Resource | Used | Available | Utilization | Difference |
|---|---|---|---|---|
| Registers | 5931 | 106400 | 5% | 3.86% |
| LUTs | 15418 | 53200 | 28% | 5.46% |
| Slices | 4492 | 13300 | 33% | 2.4% |
| FIFO18E1 | 4 | 140 | 1% | 0% |
| DSP48E1 | 8 | 220 | 3% | 100% |
| IOs | 15 | 200 | 7% | 0% |
| BUFGs | 1 | 32 | 3% | 0% |

Table 4.7: Resource utilization on the FPGA for the four-channel TDC with calibration and the size difference.

From table 4.7 it can be seen that, despite the added logic, the system still does not utilize more than a third of the available logic.
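The Difference column can be reproduced from tables 4.5 and 4.7 with the sketch below. The listed percentages match when the increase is taken relative to the design with calibration logic; this is an observation from the numbers, not a stated definition. The DSP48E1 count of zero for the design without calibration is an assumption based on its absence from table 4.5.

```python
# (used without calibration, used with calibration), tables 4.5 and 4.7.
USAGE = {
    "Registers": (5702, 5931),
    "LUTs": (14576, 15418),
    "Slices": (4384, 4492),
    "DSP48E1": (0, 8),  # assumed 0 without calibration (not in table 4.5)
}


def difference_percent(without: int, with_cal: int) -> float:
    """Increase in percent, relative to the design with calibration logic."""
    if with_cal == 0:
        return 0.0
    return 100.0 * (with_cal - without) / with_cal


for name, (without, with_cal) in USAGE.items():
    print(name, round(difference_percent(without, with_cal), 2))
# Registers 3.86, LUTs 5.46, Slices 2.4, DSP48E1 100.0
```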

However, with a trace encoder the usage of LUTs and Slices doubles compared to the previous design, as can be seen in table 4.8.

| Resource | Used | Available | Utilization |
|---|---|---|---|
| Registers | 6408 | 106400 | 6% |
| LUTs | 29878 | 53200 | 56% |
| Slices | 8810 | 13300 | 66% |
| FIFO18E1 | 4 | 140 | 1% |
| IOs | 15 | 200 | 7% |
| BUFGs | 1 | 32 | 3% |

Table 4.8: Resource utilization on the FPGA for the four-channel TDC with trace encoder.


4.5 Error sources

In this thesis there are some simplifications. One of them is that there is no analysis of how the carry accelerator in the carry logic could affect the propagation time and give it a nonlinear character.

There is also skew in the clock signal that has not been considered in this thesis, which could have an impact on the actual results. From the timing report in the synthesis tool, the clock uncertainty is approximately 0.035 ns, and the clock path skew is summarized in table 4.9.

| Number of channels | 1 | 2 | 4 |
|---|---|---|---|
| Without calibration | 0.019 | 0.002 | 0.067 |
| With calibration | 0.060 | 0.030 | 0.007 |

Table 4.9: Clock path skew in ns, retrieved through the timing report in synthesis.

In the timing report the maximum data path time from CIN to COUT in the CARRY4 element, $T_{BYP}$, is estimated to 0.114-0.117 ns.

During testing there was no control of or compensation for the temperature of the chip, which could affect the delay time. There was simple monitoring using ChipScope. During testing the temperature fluctuated within a span of less than two degrees Celsius for all implementations; however, the chip got hotter with each increase in the number of channels and size.

5 Discussion

Through testing and evaluation it is found that the placement of the carry chain cells is crucial: the precision depends on where the carry chain is placed.

One finding is that the TDC achieves higher resolution when fewer channels are implemented on the chip. This could be because there is less surrounding logic that interferes with the signals.

The number of unknown parameters and their effect on the delay time make it difficult to estimate the correct value by calculation in an FPGA. It is, however, possible to measure the delay time and parameters during testing.

Although a number of parameters affect the delay time through the delay elements, they do not appear to cause excessive fluctuation in the propagation time. The system does not compensate for temperature changes, which could have an impact on the fluctuations.

In fig. A.1 and fig. A.2 in appendix A it can be seen that the signal propagation does change between readings. This could be caused by a combination of clock uncertainty inside the FPGA logic and other sources such as interference from the surrounding logic.

The size of the clock uncertainty is between 1 and 2 delay steps, which is the most common fluctuation size, as can be seen in table 4.2 and table 4.3. There is also a tendency for the propagation variance to increase when more logic is utilized, see fig. 4.3 and table 4.4.
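The relation between clock uncertainty and delay steps can be checked by dividing the reported uncertainty of 0.035 ns by the delay per step from table 4.2 and table 4.3. The sketch below does this; the names are illustrative and the conversion assumes linear propagation.

```python
CLOCK_UNCERTAINTY_PS = 35.0  # 0.035 ns from the timing report

# Delay (ps)/step for all eight delay lines in table 4.2 and table 4.3.
DELAYS_PS = [22.7, 25.4, 23.8, 26.7, 23.7, 26.3, 23.1, 25.4,
             24.5, 21.0, 24.4, 21.4, 25.4, 21.3, 24.5, 21.1]


def uncertainty_in_steps(delay_ps_per_step: float,
                         uncertainty_ps: float = CLOCK_UNCERTAINTY_PS) -> float:
    """Number of delay steps covered by the clock uncertainty."""
    return uncertainty_ps / delay_ps_per_step


steps = [uncertainty_in_steps(d) for d in DELAYS_PS]
print(round(min(steps), 2), round(max(steps), 2))  # 1.31 1.67
```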

The system achieves a resolution of 20-30 ps. Although this does not match the 10 ps resolution of [Harmen and Edoardo, 2011], it is within the required resolution for the given application, even though the system is implemented on lower-performance hardware than a Virtex-series FPGA. However, there is no compensation for errors due to clock skew and temperature variations.

[Harmen and Edoardo, 2011] conclude that the surrounding logic influences the TDC and demonstrate the benefits of having guarding slices around the carry chain. Since surrounding logic affects the performance here as well, a similar procedure could be tested on this system and would probably improve it.

The hardware-based calibration is not recommended for synthesis in its current state, due to the decrease in resolution. Instead, calibration through a software algorithm is recommended, because the hardware version tends to increase the fluctuation of the propagation time in implementations with many channels.

6 Conclusions

The placement and implementation of the carry chain have a high impact on the resolution that can be obtained from the chip. The logic implemented around the carry chain also affects its behaviour, even if it has no direct link to the carry chain during operation.

It does not require a large amount of logic to implement a TDC. This system could also be implemented on a smaller Zynq than the one used in this thesis, such as the Z7010 that is used on smaller development boards like the MicroZed or PicoZed.

Even though it is possible to add many modules and much functionality, this should be done sparingly, given the decrease in accuracy of the TDC for each added module.


7 Future work

• One issue that needs to be investigated is how the design and technology depend on temperature.

• The delay line could be locked to a specified place in every synthesis. It is also possible to lock the routing so that no undesired connection to the carry chain appears during future changes. The carry acceleration should be estimated in order to decode the delay line more accurately, since the linear propagation assumed in this work could deviate from the actual propagation time.

• Create a calibration function that compensates for skew between delay elements. This may be possible by using the internal PLL to feed the delay line with a 2 GHz signal for 20-100 ns and from this determine where the transitions between one and zero are. This would show whether there are sections inside the delay line that have a different propagation speed.

• For the system to work together with a QKD system, a module needs to be added that puts time stamps on the measured values.
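The calibration idea in the third item can be sketched as follows. This is not an implemented function of the system; it assumes a 2 GHz signal with 50 % duty cycle, so that each complete run of equal tap values in a snapshot of the delay line spans one half period of 250 ps. The function names and the example snapshot are hypothetical.

```python
def transitions(taps):
    """Indices where the sampled value changes between adjacent taps."""
    return [i for i in range(1, len(taps)) if taps[i] != taps[i - 1]]


def section_delays_ps(taps, half_period_ps=250.0):
    """Average delay per tap for each complete run between two transitions.

    Every complete run of equal values is assumed to span one half period
    of the calibration signal (250 ps at 2 GHz with 50 % duty cycle).
    """
    edges = transitions(taps)
    return [half_period_ps / (b - a) for a, b in zip(edges, edges[1:])]


# Hypothetical snapshot: runs of 10, 12 and 10 taps between partial runs.
snapshot = [1] * 3 + [0] * 10 + [1] * 12 + [0] * 10 + [1] * 2
print(transitions(snapshot))  # [3, 13, 25, 35]
print([round(d, 1) for d in section_delays_ps(snapshot)])  # [25.0, 20.8, 25.0]
```

A section where more taps fit into one half period, such as the middle run in the example, would indicate a part of the delay line with faster propagation.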


Bibliography

AMBA AXI and ACE Protocol Specification. ARM, www.AMBA.com, d edition, October 2011. ID102711.

Claudio Favi and Edoardo Charbon. A 17ps time-to-digital converter implemented in 65nm FPGA technology. FPGA'09, February 22-24, Monterey, California, USA, pages 113–120, 2009.

Harmen Menninga, Claudio Favi, Matthew W. Fishburn, and Edoardo Charbon. A multi-channel, 10ps resolution, FPGA-based TDC with 300MS/s throughput for open-source PET applications. IEEE Nuclear Science Symposium Conference Record, (31-2):1515–1522, 2011.

Stephan Henzler. Time-to-digital converters. Springer, 2010.

Peter Marwedel. Embedded System Design - Embedded Systems Foundations of Cyber-Physical Systems. Springer, second edition, 2011.

Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information. Cambridge, 2000.

Qi Shen, Shengkai Liao, Shubin Liu, Jinhong Wang, and Weiyue Liu. An FPGA-based TDC for free space quantum key distribution. IEEE Transactions on Nuclear Science, 60(5):3570–3577, 2013.

Wayne Wolf. FPGA-Based System Design. Prentice Hall Modern Semiconductor Design Series. Prentice Hall, 2004.

Xi Qin, Changqing Feng, Deliang Zhang, Bin Miao, Lei Zhao, Xinjun Hao, Shubin Liu, and Qi An. Development of a high resolution TDC for implementation in flash-based and anti-fuse FPGAs for aerospace application. IEEE Transactions on Nuclear Science, 60(5):3550–3556, 2013.

AXI Reference Guide. Xilinx, www.xilinx.com, v14.3 edition, November 2012. UG761.

Zynq-7000 All Programmable SoC Overview. Xilinx, www.xilinx.com, v1.6 edition, December 2013. DS190.


7 Series FPGAs Configurable Logic - User Guide. Xilinx, www.xilinx.com, v1.6 edition, August 2014a. UG474.

7 Series FPGAs Memory Resources - User Guide. Xilinx, www.xilinx.com, v1.11 edition, November 2014b. UG473.

7 Series DSP48E1 Slice - User Guide. Xilinx, www.xilinx.com, v1.8 edition, November 2014c. UG479.

Appendix

A Plots

This appendix presents two plots that are the results of a test with ten read-outs for each channel, from two different designs. Both designs have four input channels and a summation encoder, but the design in fig. A.2 has additional logic for hardware calibration. The time period between each read-out is one clock cycle at 100 MHz.

[Figure: plot of propagation (number of elements, axis from about 380 to 440) against read order (1 to 10). Series: Channel 1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1 and 4.2]

Figure A.1: Propagation without extra calibration hardware.


[Figure: plot of propagation (number of elements, axis from about 400 to 480) against read order (1 to 10). Series: Channel 1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1 and 4.2]

Figure A.2: Propagation with extra calibration hardware.


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Simon Andersson Holmström
