
Lecture Notes in Computer Science 6448
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany


René van Leuken, Gilles Sicard (Eds.)

Integrated Circuit and System Design

Power and Timing Modeling, Optimization and Simulation

20th International Workshop, PATMOS 2010
Grenoble, France, September 7-10, 2010
Revised Selected Papers


Volume Editors

René van Leuken
Delft University of Technology
2628 CD Delft, The Netherlands
E-mail: [email protected]

Gilles Sicard
TIMA Laboratory
38031 Grenoble, France
E-mail: [email protected]

Library of Congress Control Number: 2010940964

CR Subject Classification (1998): C.4, I.6, D.2, C.2, F.3, D.3

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-642-17751-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-17751-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2011
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180


Preface

Welcome to the proceedings of the 20th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2010. Over the years, PATMOS has evolved into an important European event, where researchers from both industry and academia discuss and investigate the emerging challenges in future and contemporary applications, design methodologies, and tools required for the development of the upcoming generations of integrated circuits and systems. PATMOS 2010 was organized by the TIMA Laboratory, France, with the sponsorship of Joseph Fourier University, CEA LETI, Minalogic, CNRS, Grenoble Institute of Technology and the technical co-sponsorship of the IEEE France Section. Further information about the workshop is available at: http://patmos2010.imag.fr.

The technical program of PATMOS 2010 contained state-of-the-art technical contributions, three invited keynotes, a special session organized by the “Beyond DREAMS (Catrene 2A717)” project on “High-Level Modeling of Power-Aware Heterogeneous Designs in SystemC-AMS” and a special session organized by Minalogic presenting the results of four projects.

The technical program focused on timing, performance, and power consumption, as well as architectural aspects, with particular emphasis on modeling, design, characterization, analysis, and optimization in the nanometer era.

The Technical Program Committee, with the assistance of additional expert reviewers, selected the 24 papers presented at PATMOS. The papers were organized into six oral sessions. As is customary for the PATMOS workshops, full papers were required for review, and a minimum of three reviews was obtained for each manuscript.

Beyond the presentations of the papers, the PATMOS technical program was enriched by a series of talks offered by world-class experts on important emerging research issues of industrial relevance. Kiyoo Itoh, Fellow of the Central Research Laboratory, Hitachi, Ltd., spoke about “Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs,” Marc Belleville of CEA, LETI, MINATEC, spoke about “3D Integration for Digital and Imagers Circuits: Opportunities and Challenges,” and Sebastien Marchal of STMicroelectronics spoke about “Signing off Industrial Designs on Evolving Technologies.”

We would like to thank our colleagues who voluntarily worked to make this edition of PATMOS possible: the expert reviewers; the members of the Technical Program and Steering Committees; the invited speakers; and last but not least, the local personnel who offered their skill, time, and extensive knowledge to make PATMOS 2010 a memorable event.

September 2010

René van Leuken
Gilles Sicard


Organization

Organizing Committee

Rene van Leuken, TU Delft, The Netherlands (Program Chair)
Gilles Sicard, TIMA Laboratory, France (General Chair)
Anne-Laure Fourneret-Itie, TIMA Laboratory, France
Laurent Fesquet, TIMA Laboratory, France
Katell Morin-Allory, TIMA Laboratory, France
Florent Ouchet, TIMA Laboratory, France
Julie Correard, TIMA Laboratory, France

Technical Program Committee

Atila Alvandpour, Linköping University, Sweden
David Atienza, EPFL, Switzerland
Nadine Azemard, University of Montpellier, France
Peter Beerel, USC, USA
Davide Bertozzi, University of Ferrara, Italy
Naehyuck Chang, Seoul University, Korea
Jorge Juan Chico, University of Seville, Spain
Joan Figueras, University of Catalonia, Spain
Eby Friedman, University of Rochester, USA
Costas Goutis, University of Patras, Greece
Eckhard Grass, IHP, Germany
José Luís Güntzel, University of Santa Catarina, Brazil
Oscar Gustafsson, Linköping University, Sweden
Shiyan Hu, Michigan Technical University, USA
Nathalie Julien, University of Bretagne-Sud, France
Domenik Helms, OFFIS Research Institute, Germany
Rene van Leuken, TU Delft, The Netherlands
Philippe Maurine, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal
Vasily Moshnyaga, University of Fukuoka, Japan
Tudor Murgan, Infineon, Germany
Wolfgang Nebel, University of Oldenburg, Germany
Dimitris Nikolos, University of Patras, Greece
Antonio Nunez, University of Las Palmas, Spain
Vojin Oklobdzija, University of Texas at Dallas, USA
Vassilis Paliouras, University of Patras, Greece
Davide Pandini, ST Microelectronics, Italy
Antonis Papanikolaou, NTUA, Greece


Christian Piguet, CSEM, Switzerland
Massimo Poncino, Politecnico di Torino, Italy
Ricardo Reis, University of Porto Alegre, Brazil
Donatella Sciuto, Politecnico di Milano, Italy
Gilles Sicard, TIMA Laboratory, France
Dimitrios Soudris, NTUA, Athens, Greece
Zuochang Ye, Tsinghua University, Beijing, China
Robin Wilson, ST Microelectronics, France

Steering Committee

Antonio J. Acosta, University of Seville, Spain
Nadine Azemard, University of Montpellier, France
Joan Figueras, University of Catalonia, Spain
Reiner Hartenstein, TU Kaiserslautern, Germany
Jorge Juan-Chico, University of Seville, Spain
Enrico Macii, Politecnico di Torino, Italy
Philippe Maurine, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal
Wolfgang Nebel, OFFIS, Germany
Vassilis Paliouras, University of Patras, Greece
Christian Piguet, CSEM, Switzerland
Dimitrios Soudris, NTUA, Athens, Greece
Rene Van Leuken, TU Delft, The Netherlands
Diederik Verkest, IMEC, Belgium
Roberto Zafalon, ST Microelectronics, Italy

Executive Steering Committee

Vassilis Paliouras, University of Patras, Greece
Nadine Azemard, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal


Table of Contents

Session 1: Design Flows

A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC ... 1
Tanguy Sassolas, Nicolas Ventroux, Nassima Boudouani, and Guillaume Blanc

An Automated Framework for Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software ... 11
Christian Bachmann, Andreas Genser, Christian Steger, Reinhold Weiß, and Josef Haid

System Level Power Estimation of System-on-Chip Interconnects in Consideration of Transition Activity and Crosstalk ... 21
Martin Gag, Tim Wegner, and Dirk Timmermann

Residue Arithmetic for Designing Low-Power Multiply-Add Units ... 31
Ioannis Kouretas and Vassilis Paliouras

Session 2: Circuit Techniques 1

An On-chip Flip-Flop Characterization Circuit ... 41
Abhishek Jain, Andrea Veggetti, Dennis Crippa, and Pierluigi Rolandi

A Low-Voltage Log-Domain Integrator Using MOSFET in Weak Inversion ... 51
Lida Ramezani

Physical Design Aware Comparison of Flip-Flops for High-Speed Energy-Efficient VLSI Circuits ... 62
Massimo Alioto, Elio Consoli, and Gaetano Palumbo

A Temperature-Aware Time-Dependent Dielectric Breakdown Analysis Framework ... 73
Dimitris Bekiaris, Antonis Papanikolaou, Christos Papameletis, Dimitrios Soudris, George Economakos, and Kiamal Pekmestzi


Session 3: Low Power Circuits

An Efficient Low Power Multiple-Value Look-Up Table Targeting Quaternary FPGAs ... 84
Cristiano Lazzari, Jorge Fernandes, Paulo Flores, and Jose Monteiro

On Line Power Optimization of Data Flow Multi-core Architecture Based on Vdd-Hopping for Local DVFS ... 94
Pascal Vivet, Edith Beigne, Hugo Lebreton, and Nacer-Eddine Zergainoh

Self-Timed SRAM for Energy Harvesting Systems ... 105
Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev

L1 Data Cache Power Reduction Using a Forwarding Predictor ... 116
P. Carazo, R. Apolloni, F. Castro, D. Chaver, L. Pinuel, and F. Tirado

Session 4: Self-Timed Circuits

Statistical Leakage Power Optimization of Asynchronous Circuits Considering Process Variations ... 126
Mohsen Raji, Alireza Tajary, Behnam Ghavami, Hossein Pedram, and Hamid R. Zarandi

Optimizing and Comparing CMOS Implementations of the C-Element in 65nm Technology: Self-Timed Ring Case ... 137
Oussama Elissati, Eslam Yahya, Sebastien Rieubon, and Laurent Fesquet

Hermes-A – An Asynchronous NoC Router with Distributed Routing ... 150
Julian Pontes, Matheus Moreira, Fernando Moraes, and Ney Calazans

Practical and Theoretical Considerations on Low-Power Probability Codes for Networks-on-Chip ... 160
Alberto Garcia-Ortiz and Leandro S. Indrusiak

Session 5: Process Variation

Logic Architecture and VDD Selection for Reducing the Impact of Intra-die Random VT Variations on Timing ... 170
Bahman Kheradmand-Boroujeni, Christian Piguet, and Yusuf Leblebici

Impact of Process Variations on Pulsed Flip-Flops: Yield Improving Circuit-Level Techniques and Comparative Analysis ... 180
Marco Lanuzza, Raffaele De Rose, Fabio Frustaci, Stefania Perri, and Pasquale Corsonello


Transistor-Level Gate Modeling for Nano CMOS Circuit Verification Considering Statistical Process Variations ... 190
Qin Tang, Amir Zjajo, Michel Berkelaar, and Nick van der Meijs

White-Box Current Source Modeling Including Parameter Variation and Its Application in Timing Simulation ... 200
Christoph Knoth, Irina Eichwald, Petra Nordholz, and Ulf Schlichtmann

Session 6: Circuit Techniques 2

Controlled-Precision Pure-Digital Square-Wave Frequency Synthesizer ... 211
Abdelkrim Kamel Oudjida, Ahmed Liacha, Mohamed Lamine Berrandjia, and Rachid Tiar

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis ... 218
Oliver Schrape, Frank Winkler, Steffen Zeidler, Markus Petri, Eckhard Grass, and Ulrich Jagdhold

Clock Network Synthesis with Concurrent Gate Insertion ... 228
Jingwei Lu, Wing-Kai Chow, and Chiu-Wing Sham

Modeling Time Domain Magnetic Emissions of ICs ... 238
Victor Lomne, Philippe Maurine, Lionel Torres, Thomas Ordas, Mathieu Lisart, and Jerome Toublanc

Special Session 1: High-Level Modeling of Power-Aware Heterogeneous Designs in SystemC-AMS (Abstracts)

Power Profiling of Embedded Analog/Mixed-Signal Systems ... 250
Jan Haase and Christoph Grimm

Open-People: Open Power and Energy Optimization PLatform and Estimator ... 251
Daniel Chillet

Early Power Estimation in Heterogeneous Designs Using SoCLib and SystemC-AMS ... 252

Francois Pecheux, Khouloud Zine El Abidine, and Alain Greiner

Special Session 2: Minalogic (Abstracts)

ASTEC: Asynchronous Technology for Low Power and Secured Embedded Systems ... 253

Pr. Marc Renaudin


OPENTLM and SOCKET: Creating an Open EcoSystem for Virtual Prototyping of Complex SOCs ... 254

Laurent Maillet-Contoz

Keynotes (Abstracts)

Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs ... 255

Kiyoo Itoh

3D Integration for Digital and Imagers Circuits: Opportunities and Challenges ... 256

Marc Belleville

Signing off Industrial Designs on Evolving Technologies ... 257
Sebastien Marchal

Author Index ... 259


A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC

Tanguy Sassolas, Nicolas Ventroux, Nassima Boudouani, and Guillaume Blanc

CEA, LIST, Embedded Computing Laboratory,
91191 Gif-sur-Yvette CEDEX, France
[email protected]

Abstract. As application complexity grows, embedded systems move to multiprocessor architectures to cope with the computation needs. The issue for multiprocessor architectures is to optimize processing resource usage and power consumption to reach a higher energy efficiency. These optimizations are handled by scheduling techniques. To tackle this issue we propose a global online scheduling algorithm for streaming applications. It takes into account data dependencies between pipeline tasks to optimize processor usage and reduce power consumption through the use of DPM and DVFS modes. An implementation of the algorithm on a virtual platform, executing a WCDMA application, demonstrates up to 45% power consumption gain while guaranteeing regular data throughput.

Index Terms: scheduling, low-power, multiprocessor, streaming applications.

1 Introduction

As embedded applications become more complex, future embedded architectures will have to provide higher computing performance while respecting strong area and consumption constraints. Embedded devices will not only execute more computing-intensive applications but also cross-domain ones, including telecom and video processing applications. To cope with these demands, an emerging trend in embedded system design lies in the conception of MultiProcessor Systems-on-Chip (MPSoC).

These new architectures, with a high density of processing elements, have a strong energy dissipation. This dissipation must be taken into account to match an embedded-compliant power budget and to limit aging phenomena. To handle these thermal and energy issues, MPSoC designers integrate DVFS and DPM capabilities in their platforms.

To leverage MPSoC processing capabilities, applications need to be highly parallelized. A simple way to increase application parallelism and data throughput is to pipeline sequential applications into streaming ones. This applies to the WCDMA application, whose parallelism can be drastically increased. The pipeline stages must then be efficiently allocated to the processing resources while taking into account the data dependencies between them. As applications become more prone to execution time variation, online control solutions are needed to dynamically schedule tasks and increase processor load. These variations can stem from differences in the input data of a data processing application, or from the application structure itself. For instance, the WCDMA application processes a pilot frame differently from a user frame.

Only a global scheduler with a complete view of the computation resources and task states can perform an optimal scheduling. This choice of global scheduling pushes forward the use of a central control solution. In addition, an online central control solution must react quickly to platform events. Therefore, online scheduling must remain simple and must find a balance between accuracy and execution speed. In this article, we propose an online power-aware scheduling algorithm that matches these conditions. This algorithm focuses on the scheduling of streaming applications. Our scheduling algorithm also tackles power consumption issues through an efficient use of the Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Power Management (DPM) modes of the processing resources.

This paper is organized as follows: Section 2 will study existing solutions in the field of power-aware streaming application scheduling. Then, Section 3 will describe the proposed power-aware scheduling algorithm. Section 4 will detail implementation issues, focusing on the simulation framework and the targeted MPSoC platform. Results will be presented in Section 5, where the impact of our scheduling algorithm in terms of Quality of Service (QoS) and power consumption gain will be evaluated. Finally, Section 6 will discuss the capabilities of this new streaming application scheduling algorithm and its future improvements.

2 Related Work

We focus our study on power-aware scheduling algorithms that rely on DVFS and DPM techniques [1]. First of all, we will briefly present the DPM and DVFS techniques and their impact on energy consumption. Then we will present a survey of previous works in the field of offline power-aware scheduling techniques for streaming processing. Finally we will expose online low-power scheduling techniques for dependent tasks.

The power dissipated in a CMOS design can be divided into two major sources: dynamic power consumption and static power consumption. The dynamic part is mainly due to transistor state switching, and it can be drastically reduced by lowering the supply voltage. As the transistor delay is a function of the supply voltage, lowering the supply voltage imposes an adapted frequency reduction. This technique is called DVFS.
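For reference, the standard first-order CMOS relations behind this trade-off (textbook formulas, not stated explicitly in the paper) can be written as

\[
P_{\mathrm{dyn}} \approx \alpha\, C_{\mathrm{eff}}\, V_{dd}^{2}\, f,
\qquad
f_{\max} \propto \frac{(V_{dd}-V_{th})^{\gamma}}{V_{dd}}, \quad 1 \le \gamma \le 2,
\]

so scaling the supply voltage and the frequency down together reduces the dynamic power roughly cubically, at the cost of a proportionally longer execution time.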

The static consumption is due to various current leakages in the transistors. The DVFS technique has some impact on the static power consumption thanks to the supply voltage reduction. Nonetheless, this is not sufficient to drastically reduce static power consumption. To cut down static power consumption, the only viable solution consists in switching off unused parts of a circuit. This technique is called DPM. Contrary to the DVFS technique, the resource is made unavailable.

The main drawback of these two techniques lies in the timing and consumption-mode switching penalties. While the timing penalties for DVFS are rather limited, the same does not hold for DPM, where the wake-up time can reach a hundred milliseconds (136 ms for the PXA270 [2]). Therefore, for a processor implementing both techniques, the issue is to determine when reducing the voltage and frequency couple is more energy efficient than running at full speed and then switching off the processor. This matter is summarized in Fig. 1. For a given technological process, the issue is thus to evaluate the duration of future inactivity periods of the resource. Having introduced the DVFS and DPM techniques and the optimization problem they imply, we will now present offline low-power scheduling techniques for streaming applications.

Fig. 1. DPM (left) and DVFS (right) technique timing issues
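To make the trade-off of Fig. 1 concrete, the following sketch (not from the paper; all parameter names and values are illustrative placeholders) compares the energy of the two options for a predicted interval before the next activation: stretching the work at a reduced DVFS point versus racing at full speed and then entering a DPM sleep state, accounting for the sleep/wake transition cost.

#include <stdbool.h>

typedef struct {
    double p_full;        /* active power at full speed (W)                 */
    double p_slow;        /* active power at the reduced DVFS point (W)     */
    double slow_factor;   /* execution time stretch at the DVFS point (>1)  */
    double p_idle;        /* power in a light idle mode (W)                 */
    double p_sleep;       /* power in the DPM deep-sleep state (W)          */
    double e_transition;  /* energy cost of one sleep/wake cycle (J)        */
    double t_wakeup;      /* DPM wake-up latency (s)                        */
} power_params_t;

/* Decide between "slow down with DVFS" and "race at full speed, then sleep with
 * DPM" for work lasting t_work at full speed, due before the next activation
 * t_interval seconds away. Returns true if the DVFS option costs less energy. */
static bool prefer_dvfs(const power_params_t *p, double t_work, double t_interval)
{
    double t_slow = t_work * p->slow_factor;
    if (t_slow > t_interval)              /* DVFS would miss the next activation */
        return false;

    /* Option 1: stretch the work, spend the remainder in light idle. */
    double e_dvfs = p->p_slow * t_slow + p->p_idle * (t_interval - t_slow);

    /* Option 2: run at full speed, then switch off if the gap allows it. */
    double t_gap = t_interval - t_work;
    double e_dpm = (t_gap > p->t_wakeup)
        ? p->p_full * t_work + p->p_sleep * t_gap + p->e_transition
        : p->p_full * t_work + p->p_idle * t_gap;

    return e_dvfs < e_dpm;
}

In practice, this is exactly the evaluation of future inactivity periods mentioned above: the longer and more predictable the idle gap, the more attractive the DPM option becomes despite its transition cost.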

Given that scheduling in a multiprocessor environment is an NP-complete problem [3], adding power consumption optimization makes power-aware multiprocessor scheduling even harder to solve. A streaming application can be seen as a set of tasks linked by their data dependencies. Thus, scheduling dependent tasks makes it possible to schedule streaming applications. Many offline solutions have been proposed to solve this optimality issue, assuming task dependencies and their execution lengths are available. They mainly vary in the way they describe the problem, changing which parameters have to be taken into account, and in the computing optimization method used to solve the problem, as in [4].

To the authors' knowledge, no previous work has addressed offline low-power multiprocessor scheduling dedicated to streaming applications. Nonetheless, an interesting line of work has been developed with the same scope but for monoprocessor environments. In [5] the authors study power optimization using the DVFS technique on a streaming application described as a directed acyclic graph with a constant output rate. Their solution makes it possible to find the lowest-consumption schedule given a buffer size, or to find the buffer size given a power budget. A similar approach is taken in [6] with DPM utilization. To model more realistic applications, they describe the production rate as a random variable following a given probability law. Nonetheless, variations in the effective execution time limit the performance of offline solutions. To handle this dynamism, online low-power solutions have been proposed for streaming applications.


Many online solutions have been designed for the case of independent tasks [7,8], but they cannot be applied to streaming applications. Online schedulers that handle task dependency issues are uncommon. Interesting solutions for dependent-task scheduling have been proposed in [9,10]. Nonetheless, these solutions rely on a partitioning of resources. Partitioning solutions are necessarily sub-optimal, as they only handle resources separately. A global scheduling can potentially reach a better resource usage.

For the reader's knowledge, we recall a few online power management techniques used for monoprocessor architectures in the case of streaming applications described with a Directed Acyclic Graph (DAG). In [11] the authors take into account potential blocking communication between tasks, always running the data producer at full speed in that case but lowering the energy consumption otherwise. [12] presents another example of inter-task communication buffer size optimization, this time with an online scheduler handling the slack time accumulated through buffer use. None of the strategies listed above addresses the online scheduling of streaming applications in an MPSoC environment, which allows a pipelined execution and potential output rate improvements.

3 Power-Aware Streaming Application Scheduling

We believe that a more power-efficient scheduling for dynamic streaming applications can be found through the use of an online global scheduling. In this section, we will first recall the application description used by our algorithm. Then we will explain the grounds of our algorithm, before presenting it in detail.

Our scheduling algorithm has been written to handle streaming applications described in a specific way. An application is a set of tasks with consumer/producer relationships. Data is transferred from a producer task to a consumer task through a circular buffer. Only one task can write to a buffer, while it can be read by multiple consumer tasks. This creates a divergence in the data flow. A consumer task can also read multiple input buffers, creating a convergence in the data flow. This allows the description of parallelism in the processing flow of a given dataset.

Given the previously described application model, one can make a few observations. A streaming application's throughput is constrained by the duration of its slowest stage. As a result, other pipeline stages can be slowed down to meet the same output rate as the slowest stage. This can be performed by using a slower DVFS mode for the resources with a too-high output rate. Besides, tasks that are further along the pipeline than the slowest task will be blocked waiting for data. These tasks should be preempted if other tasks can execute instead, or the resource should be shut down if not. This implies the use of DPM functionalities. Given these observations, our algorithm uses DVFS to balance the pipeline stage lengths and DPM to shut down unused resources. Our objective is to maintain the same data throughput as if the tasks were executing at full speed, while making substantial energy savings.

To be able to balance an application pipeline, we need additional information on the dynamic output rate of a task. Thus we introduce monitors on every communication buffer. For every buffer we specify how many datasets it can contain. We also specify two thresholds. When the higher threshold is reached, we assume that the producer is executing too fast. When the lower threshold is reached, we assume that the producer is not executing fast enough. A specific event is sent to the scheduler when a threshold is crossed. It contains the writing task identifier. An event is also sent when a task is blocked reading from an empty buffer, as well as when a task is blocked writing to a full buffer. The buffer monitors are summarized in Fig. 2. One objective of balancing pipeline stage lengths is to prevent buffers from getting full, which would block the producer, and from ever getting empty, which would block the consumer and could result in an increase of the data processing length.

Fig. 2. Summary of buffer monitors and scheduling implications
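The buffer monitoring summarized in Fig. 2 can be pictured with a small sketch. The structure, event names and single-consumer simplification below are illustrative assumptions (the paper gives no implementation); the sketch only shows how threshold crossings on a circular buffer would be turned into events carrying the writing task identifier.

#include <stdint.h>
#include <stdio.h>

typedef enum {
    EVT_HIGH_THRESHOLD,   /* producer is running too fast        */
    EVT_LOW_THRESHOLD,    /* producer is not running fast enough */
    EVT_BLOCKED_EMPTY,    /* reader blocked on an empty buffer   */
    EVT_BLOCKED_FULL      /* writer blocked on a full buffer     */
} buffer_event_t;

typedef struct {
    uint32_t capacity;     /* number of datasets the buffer can hold */
    uint32_t fill;         /* current number of datasets             */
    uint32_t high_thresh;  /* "producer too fast" threshold          */
    uint32_t low_thresh;   /* "producer too slow" threshold          */
    uint32_t writer_task;  /* identifier of the (single) writer task */
} stream_buffer_t;

/* Stub standing in for the event sent to the central scheduler. */
static void notify_scheduler(buffer_event_t evt, uint32_t writer_task)
{
    printf("event %d from buffer written by task %u\n", (int)evt, writer_task);
}

/* Called after the producer pushed one dataset (simplified: the full/empty
 * events fire when the buffer becomes full or empty, rather than when a task
 * actually blocks on it). */
static void on_push(stream_buffer_t *b)
{
    b->fill++;
    if (b->fill == b->capacity)
        notify_scheduler(EVT_BLOCKED_FULL, b->writer_task);
    else if (b->fill == b->high_thresh)
        notify_scheduler(EVT_HIGH_THRESHOLD, b->writer_task);
}

/* Called after the consumer popped one dataset. */
static void on_pop(stream_buffer_t *b)
{
    b->fill--;
    if (b->fill == 0)
        notify_scheduler(EVT_BLOCKED_EMPTY, b->writer_task);
    else if (b->fill == b->low_thresh)
        notify_scheduler(EVT_LOW_THRESHOLD, b->writer_task);
}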

To keep our scheduling algorithm as simple as possible, the task priorities are made of a static and a dynamic part. We list the different priority parts by level of importance. First, we check the blocked-task status, as we do not want to give priority to a blocked task. Then the application priority is taken into account. After that, we consider the pipeline position priority: every task is given a priority depending on its position in the streaming pipeline. This gives priority to tasks handling older datasets, i.e., the ones that are deeper in the pipeline. Finally, for tasks that have the same pipeline position priority, we give priority to the task with the emptier buffer. The complete scheduling loop is described in Algorithm 1.
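As an illustration of this priority ordering, a comparator such as the one below (a sketch under stated assumptions, not the authors' code; the field names are hypothetical) could drive the sort step of Algorithm 1: blocked status first, then application priority, then pipeline position, then the fill level of the output buffer.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     blocked;          /* task currently blocked on a buffer       */
    uint32_t app_priority;     /* priority of the owning application       */
    uint32_t pipeline_pos;     /* deeper pipeline stages get larger values */
    uint32_t out_buffer_fill;  /* fill level of the task's output buffer   */
} task_info_t;

/* Returns a negative value if a should run before b (qsort-style comparator). */
static int compare_tasks(const task_info_t *a, const task_info_t *b)
{
    if (a->blocked != b->blocked)                 /* ready tasks first          */
        return a->blocked ? 1 : -1;
    if (a->app_priority != b->app_priority)       /* then application priority  */
        return (int)b->app_priority - (int)a->app_priority;
    if (a->pipeline_pos != b->pipeline_pos)       /* then deeper pipeline stage */
        return (int)b->pipeline_pos - (int)a->pipeline_pos;
    /* Finally, the task with the emptier output buffer wins. */
    return (int)a->out_buffer_fill - (int)b->out_buffer_fill;
}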

4 Implementation

To study and validate our algorithm, we implemented it on a virtual MPSoC. In this section we will first present the SESAM simulation framework. Then, we will describe the specificities of the simulated MPSoC. Finally, we will briefly present the WCDMA application used for our performance analysis.

SESAM [13] is a tool that has been specifically built to ease the design of asymmetric multiprocessor architectures. This framework is described with the SystemC description language, and allows MPSoC exploration at the TLM level with fast and accurate simulation. Besides, SESAM uses approximate-timed TLM with explicit time to provide a fast and accurate simulation of complex NoC communications [14]. It performs simulations with an accuracy of 90% compared to fully cycle-accurate models.

Page 20: Lecture Notes in Computer Science 6448 - CAS · Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University

6 T. Sassolas et al.

Algorithm 1. The Power-Aware Streaming Application Scheduling Loop

procedure scheduling(task_to_schedule[nb_tasks], status_proc[nb_proc])
    ♦ First we take into account buffer events
    for all tasks to schedule do
        if task is waiting for data then
            remove task from task_to_schedule
        else if task output buffer reached Higher Threshold then
            reset task's buffer priority bit
        else if task output buffer reached Lower Threshold then
            set task's buffer priority bit
        end if
    end for
    ♦ Then we order the tasks by priority
    ordered_tasks[nb_proc] ← sort_task_by_priority(task_to_schedule)
    ♦ We handle tasks already in execution to limit preemption/migration
    for all task already in execution in ordered_tasks do
        remove task from ordered_tasks
        remove proc executing task from free_proc
    end for
    ♦ We allocate tasks not yet in execution on the free processors
    for all task left in ordered_tasks do
        execute task on free_proc
    end for
    ♦ Finally we handle the consumption modes
    for all proc do
        if proc is free then
            proc_mode ← idle mode
        else if task on proc reached lower threshold then
            proc_mode ← turbo mode
        else if task on proc reached higher threshold then
            proc_mode ← half mode
        end if
    end for
end procedure

In addition, the programming model of SESAM is specifically adapted to dynamic applications and global scheduling methods. It is based on the explicit separation of the control and the computation parts.

The processing elements of the SESAM simulator are functional Instruction Set Simulators (ISS) generated by the ArchC tool. Thus, we extended the ArchC ISS to integrate DVFS and DPM models into the SESAM environment. To avoid multiple context switches and accelerate simulation, every ArchC ISS executes multiple instructions at a time and then waits for the time it should have spent executing them. For every DVFS mode, we calculate the smallest couple (a, b) such that a/b equals the DVFS mode slowing factor. Then, we multiply the number of instructions to be executed by a and the time to wait for these instructions by b.


We also calculate the energy spent during the execution of a set of instructions and keep the total energy consumption for each ISS. A DVFS mode switch is modelled as an interrupt for the ISS. When it occurs, the ISS computes the time and energy spent in its previous mode. Then, it waits for the adequate switching latency, takes into account its switching energy penalty, and finally resumes its execution with the (a, b) couple of the new DVFS mode. So as to model realistic processors, we used the PXA270 Power State Machine (PSM) values [2]. We chose to use only two DVFS modes, Turbo and Half-Turbo, and one DPM mode, Deep Idle, as they have acceptable switching latencies compared to our task execution times.
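The (a, b) couple mentioned above can be obtained by a simple reduction with the greatest common divisor. The sketch below is an illustration of that bookkeeping, not the SESAM source; it assumes the slowing factor is expressed as the ratio of the scaled frequency to the reference (full-speed) frequency.

#include <stdint.h>

static uint32_t gcd_u32(uint32_t x, uint32_t y)
{
    while (y != 0) { uint32_t t = x % y; x = y; y = t; }
    return x;
}

/* DVFS mode expressed as the smallest couple (a, b) with a/b equal to the
 * ratio between the scaled frequency and the reference frequency. */
typedef struct { uint32_t a, b; } dvfs_ratio_t;

static dvfs_ratio_t make_ratio(uint32_t f_scaled_khz, uint32_t f_ref_khz)
{
    uint32_t g = gcd_u32(f_scaled_khz, f_ref_khz);
    dvfs_ratio_t r = { f_scaled_khz / g, f_ref_khz / g };
    return r;
}

/* The ISS then executes (n * a) instructions per simulation quantum and waits
 * (t * b) time units, so the average instruction rate is scaled by a/b. */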

To perform a realistic analysis of our scheduling algorithm, we modelled an asymmetric MPSoC platform with the SESAM simulator. This platform is built of a set of Processing Elements (PEs), each made of a processor equipped with a TLB, a 1 KB instruction cache and a 1 KB data cache. They are connected to a set of shared 2 ns-latency L2 memories through a 2 ns-latency multibus. Communication between tasks is made possible thanks to HAL functions. Data coherency is guaranteed by a Memory Management Unit (MMU). The buffers used for our algorithm are modelled using a specific HAL, and the buffer thresholds are handled by the MMU. Preemption and migration of tasks are possible, and their cost is reduced thanks to the shared memory and the virtualization of the memory space enabled by the use of TLBs [13].

The central controller is made of a processor with its own caches and memory. It is connected to the PEs and the MMU through another timed multibus. Its specific HAL enables it to send configuration, execution, preemption or consumption-mode switch orders. It can also be interrupted by any PE to be informed of a task execution end. The MMU also interrupts the controller whenever a task is blocked (or no longer blocked) waiting for input data or output space, as well as when a buffer threshold is crossed. We did not fix the number of PEs, so as to study how our scheduling algorithm copes with different processor loads.

To evaluate our algorithm's impact on a streaming application, we used a well-known telecommunication application: a WCDMA encoder/decoder [15]. The application was pipelined and implemented on the simulated target MPSoC. The WCDMA application integrates an encoder followed by a decoder and is consequently built of 13 tasks. This allows having more tasks than resources on the SCMP platform, to stress potential scheduling anomalies. This application is characterized by an unbalanced pipeline whose slowest tasks are the FIR filters. In addition, dynamism is found in the task execution length, as pilot frames get processed instead of actual data.

5 Results

To study the impact of our scheduling algorithm, we chose to compare it to two simpler versions of the algorithm. The first version does not handle power issues. It simply schedules tasks relying on pipeline stage position and blocked states. All processors are kept in Turbo mode. It is referred to as the no-energy-handling scheduling.


Fig. 3. Figures (a), (b), (c) and (d) were obtained with the same WCDMA application sending 256 frames. The communication buffers were 8 frames long and had a higher threshold identical to the lower one, set to 2 frames. (a) Total execution time for the WCDMA application as a function of the number of processing resources and the scheduling algorithm used; execution time overhead of our solution compared to the no-energy-handling algorithm. (b) Total effective processor occupancy and energy saving as a function of the number of processing resources and the scheduling algorithm used. (c) Average time spent in Deep Idle mode compared to the time spent in the unused state or waiting for data for a processor when using our proposed algorithm. (d) Comparison of the average time a processor spends waiting for data in the case of the no-power-saving algorithm and of our solution (DPM+DVFS): influence of Half-Turbo mode usage on blocking states.

The second version is called the DPM-only scheduling. This corresponds to a naive power-aware approach: unused resources and resources executing blocked tasks are put into Deep Idle mode. Finally, our proposed algorithm is referred to as the DPM + DVFS scheduling.

As shown in Fig. 3(a), the total execution time of the WCDMA application is not affected by our scheduling algorithm, no matter how many processing resources there are. The variation in execution time is always maintained below 1.2%. In addition, our algorithm allows a good acceleration of the processing for streaming applications.

While we maintained the execution time of the scheduling without energy awareness, Fig. 3(b) shows that substantial energy savings were made.

Page 23: Lecture Notes in Computer Science 6448 - CAS · Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University

A Power-Aware Online Scheduling Algorithm for Streaming Applications 9

As soon as the effective processor occupancy drops, it is directly compensated by our power-saving method. With 13 processors, we reduced the power consumption by 45%. In addition, our method obtains better results than the DPM-only scheduling, which only reaches 37% energy saving in that case.

Fig. 3(c) illustrates how our scheduling algorithm uses the DPM mode in a real application case. The figure shows that when processors spend little time waiting for data or in the unused state (below 17%), the Deep Idle mode is seldom used. When the wasted time increases, the DPM usage curve follows the unused-or-blocked processor curve as planned. In fact, when the number of processing elements is small, there is often another task ready to be executed immediately. For low PE numbers the wasted time corresponds to the control overhead: the controller lacks the reactivity to reach higher computing performance or power saving.

Finally, Fig. 3(d) studies the impact of DVFS mode usage on the application execution. We compare the execution of our algorithm to the no-energy-handling scheduling. The analysis shows that when DVFS modes are used, they drastically reduce the amount of time spent in blocking states (42% reduction for 13 processors). Thus, our algorithm succeeds in balancing the streaming pipeline stage execution lengths efficiently when the processor usage drops. As a result, the processor load is increased with our algorithm compared to the no-energy-handling scheduling, as shown in Fig. 3(b).

6 Conclusion

In this paper we presented a new power-aware scheduling algorithm for pipelined applications in MPSoC environments. The algorithm was implemented on a virtual MPSoC platform simulated with the SESAM environment. Substantial energy consumption gains were made compared to a classic data-dependency scheduling that only takes blocking states into account. For a WCDMA application executing on a platform with 13 PEs, our scheduling algorithm reduced the processing resources' power consumption by 45%. In addition, the use of DVFS and DPM did not impact the application execution speed: the variations in execution speed were maintained below 2%. Moreover, our algorithm succeeded in maintaining a high processor load. As a result, our algorithm allows a good acceleration of the execution speed of streaming applications in MPSoCs while efficiently managing power consumption issues through the use of DVFS and DPM capabilities. In addition, as our algorithm is fully online and can handle the scheduling of more tasks than processors, we can manually shut down some processing resources to lower the power budget while guaranteeing a correct execution.

Acknowledgements

Part of the research leading to these results has received funding from the ARTEMIS Joint Undertaking under grant agreement no. 100029.


References

1. Venkatachalam, V., Franz, M.: Power Reduction Techniques For Microprocessor Systems. ACM Computing Surveys (CSUR) 37(3), 195–237 (2005)

2. Intel PXA27x Processor Family, Electrical, Mechanical, and Thermal Specification (2005)

3. Dertouzos, M.L., Mok, A.K.: Multiprocessor Online Scheduling of Hard-Real-Time Tasks. IEEE Transactions on Software Engineering 15(12), 1497–1506 (1989)

4. Benini, L., Bertozzi, D., Guerri, A., Milano, M.: Allocation, Scheduling and Voltage Scaling on Energy Aware MPSoCs. In: Beck, J.C., Smith, B.M. (eds.) CPAIOR 2006. LNCS, vol. 3990, pp. 44–58. Springer, Heidelberg (2006)

5. Lu, Y.-H., Benini, L., De Micheli, G.: Dynamic Frequency Scaling with Buffer Insertion for Mixed Workloads. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21(5), 1284–1305 (2002)

6. Pettis, N., Cai, L., Lu, Y.-H.: Statistically Optimal Dynamic Power Management for Streaming Data. IEEE Transactions on Computers 55(7), 800–814 (2006)

7. Kim, K.H., Buyya, R., Kim, J.: Power Aware Scheduling of Bag-of-Tasks Applications with Deadline Constraints on DVS-enabled Clusters. In: IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 541–548 (2007)

8. Zhang, F., Chanson, S.T.: Power-Aware Processor Scheduling under Average Delay Constraints. In: IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 202–212 (2005)

9. Choudhury, P., Chakrabarti, P.P., Kumar, R.: Online Dynamic Voltage Scaling using Task Graph Mapping Analysis for Multiprocessors. In: International Conference on VLSI Design (VLSID), pp. 89–94 (2007)

10. Hua, S., Qu, G., Bhattacharyya, S.S.: Energy-Efficient Embedded Software Implementation on Multiprocessor System-on-Chip with Multiple Voltages. ACM Transactions on Embedded Computing Systems (TECS) 5(2), 321–341 (2006)

11. Zhang, F., Chanson, S.T.: Blocking-Aware Processor Voltage Scheduling for Real-Time Tasks. ACM TECS 3(2), 307–335 (2004)

12. Im, C., Kim, H., Ha, S.: Dynamic Voltage Scheduling Technique for Low-Power Multimedia Applications Using Buffers. In: ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 34–39 (2001)

13. Ventroux, N., Guerre, A., Sassolas, T., Moutaoukil, L., Bechara, C., David, R.: SESAM: an MPSoC Simulation Environment for Dynamic Application Processing. In: IEEE International Conference on Embedded Software and Systems, ICESS (2010)

14. Guerre, A., Ventroux, N., David, R., Merigot, A.: Approximate-Timed Transactional Level Modeling for MPSoC Exploration: A Network-on-Chip Case Study. In: IEEE Euromicro Symposium on Digital Systems Design (DSD), pp. 390–397 (2009)

15. Richardson, A.: WCDMA Design Handbook (2006)


An Automated Framework for Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software

Christian Bachmann¹, Andreas Genser¹, Christian Steger¹, Reinhold Weiß¹, and Josef Haid²

¹ Institute for Technical Informatics, Graz University of Technology, Austria
² Infineon Technologies Austria AG, Design Center Graz, Austria

Abstract. In power-constrained mobile systems such as RF-powered smart-cards, power consumption peaks can lead to supply voltage drops threatening the reliability of these systems. In this paper we focus on the automated detection and reduction of power consumption peaks caused by embedded software. We propose a complete framework for automatically profiling embedded software applications by means of the power emulation technique and for identifying the power-critical software source code regions causing power peaks. Depending on the power management features available on the given device, an optimization strategy is chosen and automatically applied to the source code. In comparison to the manual optimization of power peaks, the automatic approach decreases the execution time overhead while only slightly increasing the required code size.

1 Introduction

The power consumption of embedded systems is increasingly dependent on software applications determining the utilization of system components and peripherals. Furthermore, the embedded software actuates power management features such as voltage and frequency scaling as well as dedicated sleep or hibernation states. Hence, software applications impact the average as well as the peak power consumption, which in turn affects the reliability, stability and security of embedded systems. Especially for RF-powered devices such as contactless smart-cards, power peaks threaten the system reliability by impacting the power supply circuit and leading to supply voltage drops [1]. These supply voltage drops can in turn result in system resets or, even worse, in erroneous system states. Therefore, power peak reduction and elimination methods for embedded software have been proposed [2–4]. Furthermore, power peak reduction techniques have been studied for the purpose of power profile flattening in hardware implementations [5–7]. For security applications, the profile flattening serves as a countermeasure against power analysis attacks.

In this paper we propose an automated methodology for profiling a software application's power consumption and deriving a power peak optimized implementation. Based on an integrated supply voltage simulation, critical code regions are detected and optimized. While existing software optimization methods employ either instruction-level power simulators [2–4] or physical on-chip power measurements [5–7] to obtain power profiles, our approach utilizes the high-level power emulation technique previously introduced in [8]. Using this technique, cycle-accurate run-time power estimates are derived from the system-under-test's functional emulation. In comparison to measurement-based approaches, the joint functional and power emulation offers the advantage of inherent power profile to functional execution trace correspondence, i.e., a power consumption value can be determined for each executed instruction. Furthermore, the emulation is cycle-accurate while still allowing for rapid profiling of long program sequences. This constitutes an advantage over simulation-based approaches, which lack either simulation detail, and hence accuracy, or simulation speed.

In contrast to hardware power profile flattening approaches, no additional on-chip measurement and control hardware is required. Furthermore, as opposed to power peak reduction methods modifying intermediate language representations of the given software application [2, 3], our approach operates on and modifies the original C or assembler source code. The resulting power peak optimized source code can afterwards still be manually modified by the software engineer if required. In the context of embedded software power peak optimization, the novel contributions of this paper are as follows:

– We present a framework for detecting source code regions causing power peaks by analyzing the power consumption as well as the functional debug information obtained during software execution.

– We derive an optimization algorithm, actuating power management features for these power-critical source code regions and hence reducing the number of power peaks.

– Finally, we illustrate the feasibility of our approach on a power-constrained deep-submicron smart-card controller system.

This paper is structured as follows. In Section 2 we discuss related work on power peak optimization and power profile flattening. Section 3 presents our automated framework for power-critical code region detection and optimization. We illustrate the effectiveness of our approach in Section 4. Finally, conclusions drawn from our current work are summarized in Section 5.

2 Related Work

Due to the large influence of software on both the average and the peak power consumption of embedded systems, numerous works have studied power- and energy-aware software optimization methods. With regard to power-constrained devices, power profile flattening and the optimization of power consumption peaks are of increased interest. These power peaks are often caused by the occurrence of power-critical events during software execution. Especially in battery- and RF-powered devices, these peaks can severely impact the power supply circuit and can lead to supply voltage drops [1]. These supply voltage drops seriously jeopardize the stability and hence the reliability of the given system. Power profile flattening hardware implementations have been studied in the context of security-related applications. In the security domain, the reduction of profile variability is of increased interest as a countermeasure against power analysis attacks [9].

For the purpose of reliability enhancement, the reduction of power peaks has been investigated in [3] by means of a simulation-based peak elimination framework using iterative compilation. Other attempts at power peak reduction have focused on instruction reordering to minimize the switching activity due to circuit state changes [2], as well as on non-functional instruction (NFI) insertion [4].

Power profile flattening in security applications, aiming at hindering power analysis attacks by means of NFI insertion, was studied in [5]. Both software and hardware implementations were shown. In [6] a current-injection-based real-time flattening method has been proposed. This approach has been extended in [7] by a voltage scaling capability for improved flattening performance.

3 Automated Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software

Our automated power profiling and power-critical code region detection methodology, as depicted in Figure 1, builds upon a standard software development flow (A) and our run-time power profiling approach (B). The power estimates, alongside the functional traces, are analyzed to detect power-critical code regions (C). After these regions have been detected, an optimization algorithm is used to reduce the power consumption, and hence the power peaks, during these critical code regions (D).

Fig. 1. Automated flow for power profiling, power-critical code region detection and optimization: (A) standard software development flow (source code, SW development toolchain, binaries, memory map, debug info); (B) run-time power profiling (functional emulation and power emulation with a power model); (C) detection of power-critical code regions (trace-source correlation, supply voltage simulation, critical code region report); (D) power peak code optimization producing the optimized source code.


3.1 Run-Time Power Profiling Based on Power Emulation

For the purpose of detecting power-critical code regions, power profiling of the given software application has to be performed in the first place. In contrast to existing software power peak optimization approaches, we employ the power emulation technique previously introduced in [8] to obtain power profiles for the software application's execution. The principle of power emulation, as depicted in Figure 2, is to augment the functionally emulated system-under-test with special power estimation hardware. This power estimation hardware monitors the state of the system and its subcomponents. Based on these state data, the power estimator derives cycle-accurate run-time power estimates according to an integrated high-level power model.
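To illustrate the kind of high-level power model such an estimator evaluates, the sketch below accumulates per-component, state-dependent power contributions each cycle. It is an assumption-based illustration (component names, states and coefficients are invented for the example), not the power model of [8], which is realized in hardware on the FPGA.

#include <stddef.h>

/* Observable states of one monitored component (illustrative). */
typedef enum { STATE_IDLE = 0, STATE_ACTIVE = 1, STATE_STALL = 2, NUM_STATES = 3 } comp_state_t;

typedef struct {
    const char *name;                /* e.g. "cpu", "coproc", "ram"                  */
    double      coeff[NUM_STATES];   /* power per state, from characterization (mW)  */
} component_model_t;

/* One estimation step: sum the state-dependent contributions of all monitored
 * components for the current cycle; the caller accumulates energy over time. */
static double estimate_cycle_power(const component_model_t *models,
                                   const comp_state_t *states,
                                   size_t n_components)
{
    double p = 0.0;
    for (size_t i = 0; i < n_components; i++)
        p += models[i].coeff[states[i]];
    return p;                        /* instantaneous power estimate (mW) */
}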

Fig. 2. Embedded software power profiling utilizing power emulation: run-time power estimation and functional execution trace generation (adapted from [8]). On the FPGA board, the functionally emulated system (CPU, co-processors, functional units, ROM/RAM/NVM memories) is augmented with power sensors feeding per-component state-based power models; a power estimator with averaging and debug trace generation produces a trace of power estimates alongside the trace of functional execution, both of which are transferred to the host PC for functional and power verification.

As compared to low-level simulation-based power profiling, the power emulation technique largely reduces profiling time. This allows for the profiling of complex software applications and elaborate program sequences, such as the booting process of an operating system. In contrast to high-level simulators, power emulation offers the benefit of cycle-accuracy that instruction- or system-level simulators fail to deliver. Furthermore, power emulation offers the advantage of inherent power profile to functional execution trace correspondence as compared to measurement-based approaches.

3.2 Power-Critical Code Region Detection

Our power-critical code region detection approach as depicted in Figure 1 consists of multiple stages. First, the functional execution trace obtained in the joint functional and power emulation step is used to establish the source code correlation, i.e., identifying the source code region corresponding to each execution trace message. Second, using the power emulation trace as input data, a supply voltage simulation employing a numerical model of the RF supply is performed1. Third, the resulting supply voltage profile is utilized to identify power peaks leading to critical voltage drops and to find the source code regions causing these drops.

1 Due to the limited computational complexity of the numerical RF-supply model, a simulation-based implementation is adequate.

Figure 3 depicts the inductively coupled power supply of a contact-less smart-card device. The impact of power peaks on the supply voltage level, however, depends on the duration, power level and rate of these peaks, as shown in Figure 4. We define power-critical source code regions as parts of an embedded software application resulting in power peaks that lead to supply voltage drops below a critical limit. These peaks can be caused by, e.g., phases of high processor activity, a number of consecutive memory read or write accesses, and co-processor as well as power-intensive peripheral activity. In order to identify power peaks that actually lead to critical supply voltage drops on the given system, a supply voltage simulation based on the emulated power profile is performed.

Fig. 3. Inductively coupled power supply of RF-powered smart-card embedded system (adapted from [10])

Fig. 4. Impact of different power peaks on the supply voltage (voltage drops)

3.3 Optimization of Power-Critical Source Code Regions

The subsequent power-critical code region optimization algorithm as shown in Algorithm 1 aims at applying code modifications for power peak reduction to the original C or assembler source code. Depending on the power management features available on the given system, the frequency scaling and the NFI insertion techniques are applied to these power-critical regions. Listing 1.1 illustrates the insertion of frequency scaling control instructions around the call-site² of a function causing power peaks, whereas Listing 1.2 shows the use of NFI insertion within a loop causing short power peaks.

The algorithm operates in three major stages: (1) The power-critical code regions for each function are determined. If a large part of a function constitutes the power-critical code region, the algorithm chooses to optimize the entire function. In this case the call-sites of the function are searched and marked for modification instead of the function itself.

2 The source code line calling a particular function.


start_f_scaling();
power_critical_function();
stop_f_scaling();

Listing 1.1. f-scaling example

while(loop_condition)
{
    short_loop_instruction;
    nop(); // NFI
}

Listing 1.2. NFI insertion example

(2) Consecutive source code lines marked for modification are grouped into modification clusters. For each of those clusters, the algorithm chooses an optimization strategy based on the cluster's number of power peaks and their respective duration: short power peaks are likely to be resolved by NFI insertion, whereas longer power peaks or longer groups of peaks can be reduced by applying frequency scaling. (3) Each of the found source code clusters is then modified in the chosen way and the modified code is written back to the source files.

Algorithm 1. Power-Critical Source Code Region Optimization

Input: Set of application source code S, list of power-critical code regions L, threshold of max. percentage of power-critical lines per function Th_clpf, threshold of f-scaling time penalty Th_f-scale
Output: Set of optimized application source code So

Step 1, group by function:
  List of affected source code lines Lsl := {}
  foreach Function f in S do
    Find source code lines of f in L
    if Found source code lines > 0 then
      Calculate percentage of power-critical code region in function
      if Percentage > Th_clpf then
        Find call-sites of function f, add source code lines of call-sites to Lsl
      else
        Add source code lines to Lsl

Step 2, cluster lines to modify & choose optimization strategy:
  Lslc := Cluster consecutive source code lines in Lsl
  foreach Source code cluster C in Lslc do
    if Duration of C > Th_f-scale then
      Mark cluster C for f-scaling
    else
      Mark cluster C for NFI insertion

Step 3, perform modification:
  So := S
  foreach Source code cluster C in Lslc do
    Modify So by inserting selected optimization instructions
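For illustration, the strategy selection of Step 2 can be stated compactly in C as follows. The cluster data structure and the threshold unit are assumptions of this sketch and not part of the original framework's implementation.

/* Strategy selection of Step 2: longer peaks are handled by frequency
 * scaling, short peaks by NFI insertion (cluster layout is hypothetical). */
typedef enum { OPT_F_SCALING, OPT_NFI_INSERTION } opt_strategy_t;

typedef struct {
    int first_line, last_line;   /* source lines spanned by the cluster          */
    double peak_duration_us;     /* accumulated duration of its power peaks (us) */
    opt_strategy_t strategy;
} code_cluster_t;

static void choose_strategies(code_cluster_t *clusters, int n, double th_f_scale_us)
{
    for (int i = 0; i < n; i++)
        clusters[i].strategy = (clusters[i].peak_duration_us > th_f_scale_us)
                                   ? OPT_F_SCALING : OPT_NFI_INSERTION;
}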


4 Experimental Results

For evaluating our framework, a smart-card microcontroller test-system supplied by our industrial partner was employed. For different benchmarking applications, power profiles were recorded using the power emulation technique. Afterwards, these benchmarks were optimized both in a manual as well as in an automated way utilizing the presented framework. This allows for evaluating the effectiveness of our method.

4.1 Test System for Power Peak Optimization

The smart-card microcontroller test system used consists of a 16-bit pipelined cache architecture. It comprises volatile and non-volatile memories as well as a number of peripherals, e.g., cryptographic coprocessors, timers, and random number generators. The system has been augmented with a power emulation unit as depicted in Figure 5 to allow for the generation of run-time power estimates.

For detecting power peaks leading to problematic supply voltage drops, we have implemented an RF power supply equivalent circuit model as proposed in [1] and depicted in Figure 6. Based on power consumption changes in the microcontroller test-system, the load current il(t) changes and affects the load voltage vl(t). In phases of high power consumption and thus high load currents, when the required load current is higher than the supplied source current is(t), the energy storage capacitor delivers the missing fraction ic(t). However, for longer power peaks or a longer series of short power peaks, the capacitor fails to deliver the required current, resulting in a critical supply voltage drop.

Fig. 5. 16-bit smart-card microcontroller test system augmented by power emulation unit (adapted from [11])

Fig. 6. Equivalent circuit of the RF power supply of the test system (adapted from [1])
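The following minimal C sketch illustrates such a supply voltage simulation on an equivalent circuit of the kind shown in Fig. 6: the load current is derived from a toy power profile, the source and capacitor currents follow from the circuit relations, and the load voltage is integrated over time to detect drops below a critical limit. All component values and the power profile are illustrative assumptions, not the calibrated model of [1].

#include <stdio.h>

/* Toy simulation: a source Vs with internal resistance Ri feeds a storage
 * capacitor C and the test system, whose load current is il = P/vl. */
int main(void)
{
    const double Vs = 3.0, Ri = 200.0, C = 100e-9;  /* V, Ohm, F          */
    const double v_limit = 2.2;                     /* critical level (V) */
    const double dt = 1e-7;                         /* time step (s)      */

    double vl = Vs, v_min = Vs;                     /* load voltage       */
    for (int step = 0; step < 2000; step++) {
        /* assumed power profile: a 50 us power peak in the middle of the run */
        double p_load = (step >= 800 && step < 1300) ? 0.012 : 0.004;  /* W */
        double il = p_load / vl;                    /* load current        */
        double is = (Vs - vl) / Ri;                 /* source current      */
        double ic = is - il;                        /* capacitor current   */
        vl += (ic / C) * dt;                        /* dvl/dt = ic / C     */
        if (vl < v_min) v_min = vl;
    }
    printf("minimum supply voltage: %.2f V (%s the limit of %.2f V)\n",
           v_min, v_min < v_limit ? "below" : "above", v_limit);
    return 0;
}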

4.2 Comparison of Original and Optimized Power Consumption and Supply Voltage Profiles

We illustrate the optimization result by comparing the power consumption and the respective supply voltage profiles of a given software application. Figure 7 shows the results obtained during profiling of the original application. After the power-critical code region detection and optimization, the power profiling and supply voltage simulation were repeated, yielding the profiles depicted in Figure 8.

Fig. 7. Unoptimized power consumption and resulting supply voltage profiles of authentication benchmarking application3

Fig. 8. Optimized power consumption and resulting supply voltage profiles of authentication benchmarking application3

The results illustrate how a number of power peaks result in supply voltage drops below the critical limit. By applying frequency scaling and NFI insertion to the code regions causing these peaks, their power consumption and hence their supply voltage impact can be diminished. Note that this modification, while improving system stability and reliability, comes at the cost of a slightly increased execution time. However, as illustrated in the subsequent section, the additionally required execution time is smaller for the automatically than for the manually optimized version because the frequency scaling and the NFI insertion are applied more selectively.

4.3 Impact of Power Peak Optimization on Execution Time and Code Size

We have applied the power peak optimization algorithm to various benchmarking applications in order to evaluate its impact on the execution time and the code size. For comparison we have also manually optimized the given benchmarking applications by applying frequency scaling to the entire benchmark. For both the manual and the automatic approach, all power peaks resulting in critical supply voltage drops have been eliminated. Figure 9 illustrates these results for two general purpose microcontroller benchmarks (Coremark [12] and Dhrystone) as well as for two domain-specific ones (Authentication and Crypto).

3 Data normalized due to existing NDA.


Fig. 9. Execution time and code size of original, manually as well as automatically modified benchmarks4

The results show that in terms of execution time the automatic approach outperforms the manual optimization due to the finer granularity of code modifications. For the manual optimization approach the execution time increases by ∼10% due to the minimally required frequency reduction of ∼10% for eliminating all critical supply voltage drops. However, for the automatic approach this increase is in the range of only 1.2% (Crypto) up to 6.8% (Authentication), depending on the number and duration of power peaks. Note that the increase in execution time also depends on the ratio of code regions affected by power peaks that need to be optimized to regions requiring no optimization.

Furthermore, we compare the increase in code size caused by the insertion of frequency scaling control instructions and NFIs. This increase is almost negligible for the manual approach (∼1% or less for all testcases). For the automatic approach, the increase is slightly higher and in the range of 0.2% (Crypto) up to 3.2% (Dhrystone).

5 Conclusions

The power consumption of embedded systems is to a large extent determined by software applications, actuating power management features as well as controlling the overall system activity. Power peaks, caused by power-critical software events, can seriously impact the supply voltage and lead to critical supply voltage drops. These voltage drops pose a threat to the reliability of power-constrained mobile devices such as RF-powered smart cards.

In this paper we have outlined an automated framework aimed at power peak detection utilizing the emulation-based power profiling of given embedded software applications. By identifying the software code regions causing power peaks, the framework is able to selectively apply power reduction strategies, such as frequency scaling and non-functional instruction insertion, to the affected regions. Furthermore, we have evaluated the effectiveness of this automated power peak optimization framework on a number of benchmarking applications. For these benchmarks the inherent execution time increase is in the range of only 1.2% up to 6.8% for the automatic modifications as compared to ∼10% for the manual ones.

4 Data normalized due to existing NDA.

Acknowledgements

We would like to thank the Austrian Federal Ministry for Transport, Innovation, and Technology for providing us with funding for the POWERHOUSE project under FIT-IT contract FFG 815193, as well as our industrial partners Infineon Technologies Austria AG and Austria Card GmbH for their enduring support.

References

1. Haid, J., Kargl, W., Leutgeb, T., Scheiblhofer, D.: Power management for RF-powered vs. battery-powered devices. In: TMCS (2005)
2. Grumer, M., Wendt, M., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: Automated software power optimization for smart card systems with focus on peak reduction. In: AICCSA (2007)
3. Grumer, M., Wendt, M., Lickl, S., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: Software power peak reduction on smart card systems based on iterative compiling. Emerging Directions in Embedded and Ubiquitous Computing (2007)
4. Wendt, M., Grumer, M., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: System level power profile analysis and optimization for smart cards and mobile devices. In: SAC (2008)
5. Muresan, R., Gebotys, C.: Current flattening in software and hardware for security applications. In: CODES+ISSS (2004)
6. Li, X., Vahedi, H., Muresan, R., Gregori, S.: An integrated current flattening module for embedded cryptosystems. In: ISCAS (2005)
7. Vahedi, H., Muresan, R., Gregori, S.: On-chip current flattening circuit with dynamic voltage scaling. In: ISCAS (2006)
8. Genser, A., Bachmann, C., Haid, J., Steger, C., Weiss, R.: An emulation-based real-time power profiling unit for embedded software. In: SAMOS (2009)
9. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, p. 388. Springer, Heidelberg (1999)
10. Finkenzeller, K.: RFID Handbook. John Wiley & Sons Ltd., Chichester (2003)
11. Bachmann, C., Genser, A., Steger, C., Weiss, R., Haid, J.: Automated power characterization for run-time power emulation of SoC designs. In: 13th Euromicro DSD (2010) (in press)
12. http://www.coremark.org/


System Level Power Estimation of System-on-Chip Interconnects in Consideration of Transition Activity and Crosstalk

Martin Gag, Tim Wegner, and Dirk Timmermann

Institute of Applied Microelectronics and Computer Engineering, University of Rostock

[email protected]

www.networks-on-chip.com

Abstract. As technology reaches nanoscale order, interconnection systems account for the largest part of power consumption in Systems-on-Chip. Hence, an early and sufficiently accurate power estimation technique is needed for making the right design decisions.

In this paper we present a method for system-level power estimation of interconnection fabrics in Systems-on-Chip. Estimations with simple average assumptions regarding the data stream are compared against estimations considering bit level statistics in order to include low level effects like activity factors and crosstalk capacitances. By examining different data patterns and traces of a video decoding system as a realistic example, we found that the data dependent effects are not negligible influences on power consumption in the interconnection system of nanoscale chips. Due to the use of statistical data there is no degradation of simulation speed in our approach.

1 Introduction

Lowering the power consumption of microsystems is one of the main topics in chip design and technology development. This challenge has to be tackled not only due to the demand for energy saving and extended run times of mobile devices, but also to avoid problems concerning cooling and reliability.

Shrinking and further enhancements of technology structures are lowering especially the dynamic power consumption and the size of transistors. As logic devices are getting smaller and less energy dissipative, the integration density is raised. Therefore, more interconnects between these elements are needed. The power consumption of the wires largely remains at a certain level because they cannot be made smaller and need to be at a low distance to each other, raising the capacitances even under the use of ultra low-k materials. The share of energy consumed in the interconnection system thus increases compared to the overall energy dissipation. In the Intel 80-core, e.g., the communication system is responsible for over 28% of the overall power budget [1]. Hence, the importance of energy consumed in the interconnection system of microchips is growing.


During the design process, power consumption has to be estimated in every design step to be sure to meet the constraints of every part of the system as well as the whole system. The early phases of architectural, algorithmic and system design are very important parts of the whole process. Precise high level power estimation leads to better designs, as high level design changes are known to have more significant effects than enhancements at lower levels.

At early design stages wire-mappings and cycle-accurate behavior are mostly not known, making system level power estimations difficult. We tackle this problem with a mixture of well accepted assumptions regarding technology parameters and statistical information that represents the characteristics of the data transmitted on-chip. For this matter, different data patterns are evaluated to get significant statistics of transition probabilities and crosstalk effects. The resulting statistical data is provided to a power model. This mixture of high level information and low level assumptions facilitates more accurate power estimation than just relying on high level design information.

In the following section this paper is related to the state of the art. Then the used power model is described. Our simulations are explained and the results are discussed before the paper closes with a short conclusion.

2 Related Work

System level power estimation is already recognized as an important aspect in the field of chip design and system simulation. For design space exploration of Networks-on-Chip (NoCs), Kahng et al. give a high level power model of routers and links called Orion 2.0 [2]. This work is based on the Predictive Technology Model (PTM) [3] and calculations of capacitances by Wong et al. [4].

The inclusion of low level power models in system level NoC simulation is part of the work of Xi et al. [5]. Transition activity was included in their simulation framework, which is crucial for the correct treatment when transition encoding is utilized [6–9]. Nevertheless, no crosstalk effects were included in their simulation framework. This could be fatal, as influences of coupling capacitances on on-chip buses are not negligible. Sotiriadis et al. derived a new low level bus model to take such deep submicron effects into account [10].

There is much work on so-called crosstalk avoidance codes [11–14] and even the combination of transition and crosstalk avoidance [15] that would benefit from a system level power estimation technique respecting actual transition counts and cross coupling effects.

Using signal statistics to estimate transition activity and even crosstalk [16] is considered to claim many resources during simulation. In [17] the utilization of word level statistics was proposed as a solution. In this paper we will show that even bit level statistics are suitable to enhance the high level power estimations of on-chip interconnects at no simulation performance cost.


3 Modeling of Dynamic Power Dissipation on Links

The power consumed by communication links can be divided into static and dynamic dissipation. Here we concentrate on the dynamic power dissipation because the static part is not influenced by the transmitted data. The well-known formula

P_{dyn} = \frac{1}{2} \cdot a \cdot f \cdot V^{2} \cdot C    (1)

where a is the transition probability, f the frequency, V the operating voltage and C the switched load capacitance, represents the dynamic power model of every logic element in CMOS systems. In the case of wires, energy consumption originates from charging ground and cross coupling capacitances. In general, capacitances to the ground and top plates are constant. The coupling capacitances are created by the left and right neighbors of a wire, which are parallel wires building a bus in most cases. The signal changes on those neighboring wires affect the effective capacitance seen by the driver through capacitive coupling. This can be considered a special case of the Miller effect.

The calculation of the effective capacitance is a combination of ground and coupling capacitance:

C_{eff} = C_g + \sigma \cdot C_c    (2)

where σ in this combination depends on the switching directions of the right and left neighbors of the wire and is called the Miller Coupling Factor (MCF). There are different possible combinations which can raise but also lower the value of the effective capacitance compared to a static MCF, which is 2 on average (Tab. 1). The MCF can be calculated using the following equation, where v_i^f is one when the final value of the voltage on the i-th line is high and zero if it is low, and v_i^i stands for the initial value of that line:

\sigma = \begin{bmatrix} -1 & 2 & -1 \end{bmatrix} \cdot \begin{bmatrix} v_{i-1}^{f} - v_{i-1}^{i} \\ v_{i}^{f} - v_{i}^{i} \\ v_{i+1}^{f} - v_{i+1}^{i} \end{bmatrix}    (3)

The resulting dynamic power consumption can be calculated with Eq. (4), where V is the initial or final voltage:

P_{dyn} = a \cdot f \cdot V_{i}^{f} \cdot \begin{bmatrix} -\lambda & 1+2\lambda & -\lambda \end{bmatrix} \cdot \begin{bmatrix} V_{i-1}^{f} - V_{i-1}^{i} \\ V_{i}^{f} - V_{i}^{i} \\ V_{i+1}^{f} - V_{i+1}^{i} \end{bmatrix} \cdot C_g    (4)

Similar to the Predictive Technology Model (PTM) [3] and Orion 2.0 [2], we are using the models of Wong et al. [4] to calculate the technology dependent values of ground and coupling capacitances. Together with the gathered MCF, these values are used for dynamic power calculation. In addition, a component of static power is added to include leakage, like it is done in Orion 2.0.
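The following minimal C sketch illustrates Eqs. (2) and (3) on two successive bus words: for every rising victim line (the convention of Table 1 below) the MCF is evaluated from the neighboring transitions, and an effective capacitance and transition energy are accumulated via Eq. (1). Capacitance and voltage values are placeholder assumptions rather than values from [4], and falling transitions, which the full model of Eq. (4) also covers, are left out for brevity.

#include <stdio.h>

#define BUS_WIDTH 8

/* Signed transition of wire j between two successive words; wires outside
 * the bus are treated as static. */
static int delta(unsigned prev, unsigned curr, int j)
{
    if (j < 0 || j >= BUS_WIDTH) return 0;
    return ((curr >> j) & 1) - ((prev >> j) & 1);
}

/* MCF of wire i for the transfer prev -> curr, Eq. (3):
 * sigma = 2*delta_i - delta_{i-1} - delta_{i+1}. */
static int mcf(unsigned prev, unsigned curr, int i)
{
    return 2 * delta(prev, curr, i)
             - delta(prev, curr, i - 1)
             - delta(prev, curr, i + 1);
}

int main(void)
{
    const double Cg = 50e-15, Cc = 100e-15;  /* assumed ground/coupling capacitance (F) */
    const double Vdd = 1.0;                  /* assumed supply voltage (V)              */
    unsigned prev = 0x0F, curr = 0x33;       /* two successive example bus words        */

    double energy = 0.0;
    for (int i = 0; i < BUS_WIDTH; i++) {
        if (delta(prev, curr, i) != 1) continue;       /* rising victims only            */
        double ceff = Cg + mcf(prev, curr, i) * Cc;    /* Eq. (2) with sigma = MCF       */
        energy += 0.5 * ceff * Vdd * Vdd;              /* Eq. (1) for a single transition */
    }
    printf("estimated dynamic energy for this transfer: %.3f fJ\n", energy * 1e15);
    return 0;
}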


Table 1. Possible Miller Coupling Factors of a wire (i) switching from 0 to 1

i-1 \ i+1 | 0→0 | 0→1 | 1→0 | 1→1
0→0       |  2  |  1  |  3  |  2
0→1       |  1  |  0  |  2  |  1
1→0       |  3  |  2  |  4  |  3
1→1       |  2  |  1  |  3  |  2

4 Bit Level Statistics

To get the most exact values for effective coupling capacitances and transition counts, it is necessary to evaluate every bit that traverses the data bus in the system and analyze its correlation to the previous bit of this position. This is possible for all signals in gate level simulations, because all signals are known and their probable mapping to wires can be estimated. Even at system level this is possible for links connecting main modules (e.g. a bus in SoCs or the interconnection network in NoCs), if a few assumptions concerning bus mappings are made.

The evaluation of every bit transmitted through the communication system takes time during the simulation process. This may reverse the speed gain achieved through high level abstractions if done during system level simulations. However, we propose to use signal statistics to account for transition activity and crosstalk effects on links. The necessary signal statistics can be obtained from a sample of data characterizing traffic on the actual link before the system simulation starts.

The time required to create offline statistics depends on the evaluated system and signal parameters but usually should be much lower than the time taken to process the whole real data stream. The acquisition of such signal data can be achieved by deploying cycle accurate system models or architectural models and exploiting knowledge of the algorithms used in the system modules. It has to be known whether the data is mostly random, like compressed data, or whether there are inter-word correlations, as often found in uncompressed data. Of course, signal traces of lower level models could be used as well, if they are available.

In our experiments we generally used two ways to gather the bit level statistics of the data. In the first method, stream based evaluation software is used to examine the characteristics of general data. At first, the incoming data from a file is divided into chunks corresponding to the expected word width on the later bus structure. Then transitions between two successive words are counted and the MCF is calculated for every bit position in the data word in order to consider crosstalk. In the middle of the bus the needed energy is affected by two aggressors, while the victim lines at the fringes have only one aggressor (Fig. 1). When the stream comes to an end, the arithmetic averages of transitions and MCFs of all bit positions are determined.
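A minimal C sketch of this stream-based evaluation is given below: a small buffer stands in for the chunked input file, transitions between successive words are counted per bit position, and the MCF of Eq. (3) is accumulated per position. The buffer contents, the 8-bit word width and the averaging over all word pairs are assumptions of this sketch; the exact averaging convention of the actual tool is not spelled out here.

#include <stdio.h>

#define WORD_BITS 8

static int delta(unsigned prev, unsigned curr, int j)
{
    if (j < 0 || j >= WORD_BITS) return 0;
    return ((curr >> j) & 1) - ((prev >> j) & 1);
}

int main(void)
{
    /* stand-in for a data stream already chunked into 8-bit words */
    const unsigned char stream[] = "example payload for bit-level statistics";
    const int n_words = (int)(sizeof(stream) - 1);

    long transitions[WORD_BITS] = {0};
    double mcf_sum[WORD_BITS] = {0.0};

    for (int k = 1; k < n_words; k++) {
        unsigned prev = stream[k - 1], curr = stream[k];
        for (int i = 0; i < WORD_BITS; i++) {
            if (delta(prev, curr, i) != 0)
                transitions[i]++;                       /* transition count per bit  */
            mcf_sum[i] += 2 * delta(prev, curr, i)      /* MCF of Eq. (3) per bit    */
                          - delta(prev, curr, i - 1)
                          - delta(prev, curr, i + 1);
        }
    }

    for (int i = 0; i < WORD_BITS; i++)
        printf("bit %d: transition probability %.3f, average MCF %.3f\n",
               i, (double)transitions[i] / (n_words - 1),
               mcf_sum[i] / (n_words - 1));
    return 0;
}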

The second method is based on the interpretation of signal traces in Value Change Dump (VCD) format. A gate level simulation of a hardware design is used to generate the trace files. Our software extracts the interesting signals out of the signal dump, i.e., the signals that run between main modules and are possible candidates for relatively long wires claiming high capacitances in the data bus. These signals are analyzed as in the stream based evaluation.

In our simulations we used the first method for general investigations of bit level statistics of common data. The second approach was used to evaluate our estimation technique for an implemented SoC.

Fig. 1. Crosstalk estimation in two successive cycles at fringes and in the middle of a bus [16]

5 Simulation Results

To estimate the accuracy gain concerning power estimation with bit level statistics, different types of data were analyzed by our stream based program. As representatives of compressed data, JPEG- and H.264-compressed image and video files as well as MPEG-Layer 3 encoded audio files were used. As a group of uncompressed data, decoded image, audio, video and text files were used. A more practical data stream with a mixture of compressed and uncompressed data is represented by a network stream captured while browsing a webpage. Characteristic content of such a stream dump are uncompressed packet headers and compressed HTML text plus a few compressed graphics files. For comparison, we included a data pattern that maximizes crosstalk and transition probability to 100%, representing the worst case of data patterns.

To get indications of the applicability of using bit level statistics, the model of an application was investigated. The H.264 decoder [18] was simulated at register transfer level to extract signal dumps of the global connections of functional blocks like memories, entropy decoder, prediction unit, etc. Those trace dumps were analyzed to extract the bit level signal statistics.


5.1 Simulation Accuracy

Traditional data independent power estimation assumes a transition probability of 50%. In Fig. 2 the results of our system level power estimation compared to a traditional one are shown. In addition, we determined the estimated power values with the actual gathered transition probability but without calculating crosstalk effects, to rule out the influence of the MCF.

As expected, the highly compressed data mostly consists of uncorrelated patterns. This corresponds to random data. The resulting power estimation with consideration of bit level statistics hardly differs from the traditional approach of assuming 50% transition probability. This applies to random data as well as compressed images (JPEG), videos (H.264) and audio (MP3). The estimation error with respect to the most accurate method, using the real transition count and the crosstalk calculation, shows relatively low values of up to 7.1% (Tab. 2).

The most accurate calculation, respecting the crosstalk capacitances including the MCF, shows slightly lower power values even in the case of completely random data. That is because the fringe capacitances, which are considered to be very much lower than the coupling capacitances, were included only in this estimation mode, where the deep submicron bus model was used. The other two estimation modes assume coupling capacitances on both sides of the wire even at the fringes of the bus.

The uncompressed data shows higher autocorrelation. This results in lower power values due to fewer transitions on the wires in the cases of uncompressed video as well as images (BMP), audio (WAVE) and text files. The effect arises because the most significant bits switch more infrequently compared to the less significant ones. In these cases it is very important to choose the right word width to exploit the data characteristics. This decision is mostly implied by the application, but information about this aspect can also be provided by our data analysis software. As Fig. 3 shows, the transition probability of uncompressed data depends on the used word width. The optimal width for uncompressed image and video data is 24 bit because typically there are 3 bytes of color information per pixel in such a data structure. Our audio example consists of a 16 bit stereo wave file and shows an optimal word width of 32 bit. The text file would be optimally segmented at every multiple of 8 bit because ASCII encoding is used, which utilizes 1 byte of data per character.

The highest difference between the power estimation values was reached by uncompressed video, which consists of a scene of an animated comic in 1080p format. The method of considering realistic transition counts and calculating the crosstalk activity differs by about 432.5% from the estimation with a simple assumption of 50% transition activity. Just considering transitions and ignoring the MCFs of crosstalk shows a deviation of only 2.2%.

To get more realistic data patterns, a SoC was examined. This hardware design implements an H.264 decoder and is divided into functional blocks. The signals connecting those modules are considered to be intermediate wires that are long enough to produce high capacitances and make a remarkable contribution to the overall energy consumption. The extracted signal statistics lead to power estimations that are significantly lower (deviation of 84.6%) than assuming an average transition rate of 50%. Therefore, the average transition rates between the main modules of the SoC resemble those of uncompressed data rather than compressed data. This leads to a better power estimation when using real signal statistics.

As the simulation results show, the accuracy of the system level power estimation is raised by our approach of using signal statistics to predict transition probability. By doing so, the error of up to 432.5% in simulations using a general assumption of 50% transition probability is avoided. The amount of such estimation errors depends on the data itself and is higher the less compressed the data is. As our worst case data sample shows, the simple estimation could be too low by about 64.9%; in cases of practical data it is consistently too high. Crosstalk effects are not that important to the power estimation, as can be seen from the small deviations of the method using real transition counts without the application of crosstalk estimation. That is because the average MCF is mostly met by the data characteristics.

Table 2. Relative deviation of energy estimation techniques related to the method of considering real transition rate and crosstalk

method               using real tr. rate   50% tr. rate
worst case           0.313                 0.649
random               0.026                 0.027
JPEG                 0.007                 0.022
H.264                0.028                 0.043
MP3                  0.013                 0.071
web surfing          0.032                 0.187
text (ASCII)         0.052                 0.520
BMP                  0.016                 1.266
video unenc.         0.022                 4.325
WAVE                 0.021                 0.422
H.264 decoder SoC    0.059                 0.846

5.2 Simulation Performance

During simulation, the method of using signal statistics reduces to evaluating the power equation. In this step the general time complexity of the simulation is not affected, so there is no speed penalty and system level power estimation finishes in fractions of a second.

The statistical data of possible signals must be gathered prior to the simulation. This step takes time and depends on the method of statistics acquisition. In our experiment with general data files, the data stream analysis lasts up to 5 seconds when processing up to 100 MB on an Intel Core2Duo workstation PC. It has to be mentioned that we did not optimize for runtime, as we assume to gather the statistics offline and then simulate high level models with few design possibilities in seconds.


Fig. 2. Estimated average energy (in fJ per bit) for transmitting one bit on an intermediate wire of 200 μm length (single spaced) in 65 nm technology for different data files, evaluated by 3 different estimation techniques (50% transition rate, real transition rate, real transition rate and crosstalk)

Fig. 3. Transition probability using different word widths for transmission (data sets: WAVE, video, BMP, random, text)

6 Conclusion

In this paper we showed how wrong system level power estimation can be when it is not aware of the data that will pass through the interconnection system between the main modules. Our proposed technique takes bit level statistical data of a possible data stream in the system and makes it available to commonly accepted low level power models of interconnection links. By using this approach, the actual transition activity of the interconnections and low level phenomena like cross coupling effects can be considered. It turns out that, if mainly uncompressed data is transmitted between the system components, the deviations between the power estimations are not negligible. In consequence, the consideration of bit level statistics promises to facilitate more accurate estimations. As the investigation on a realistic system showed, our technique was 84.6% more accurate than if a general transition activity of 50% were assumed.

The crosstalk feature of our power estimation technique showed no notable effect when realistic data was used. The difference to the method considering real transition activities was 6.5%. As we plan to integrate this work into a bigger simulation kit with different link level encoding features to exploit transition and crosstalk avoidance codes, the feature of cross coupling estimation is going to be essential for correct power estimations.

References

1. Vangal, S., Howard, J., Ruhl, G., Dighe, S., et al.: An 80-tile sub-100-W teraflops processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits 43(1), 29–41 (2008)
2. Kahng, A., Li, B., Peh, L., Samadi, K.: Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In: Design, Automation, and Test in Europe, pp. 423–428 (2009)
3. Predictive Technology Model, http://ptm.asu.edu/
4. Wong, S.C., Lee, G.Y., Ma, D.J.: Modeling of Interconnect Capacitance, Delay, and Crosstalk in VLSI. IEEE Transactions on Semiconductor Manufacturing 13, 108–111 (2000)
5. Xi, J., Zhong, P.: A System-level Network-on-Chip Simulation Framework Integrated with Low-level Analytical Models. In: 2006 International Conference on Computer Design, pp. 383–388 (October 2006)
6. Kretzschmar, C., Siegmund, R., Muller, D.: Adaptive bus encoding technique for switching activity reduced data transfer over wide system buses. In: Soudris, D.J., Pirsch, P., Barke, E. (eds.) PATMOS 2000. LNCS, vol. 1918, pp. 66–75. Springer, Heidelberg (2000)
7. Sotiriadis, P., Chandrakasan, A.: Bus energy minimization by transition pattern coding (TPC) in deep sub-micron technologies. In: Proceedings of the 2000 IEEE/ACM International Conference on Computer-Aided Design, pp. 322–328. IEEE Press, Los Alamitos (2000)
8. Ramprasad, S., Shanbhag, N., Hajj, I.: A coding framework for low-power address and data busses. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 7(2), 212–221 (1999)
9. Benini, L., Micheli, G., Macii, E., Sciuto, D., Silvano, C.: Address bus encoding techniques for system-level power optimization. In: Design, Automation, and Test in Europe, pp. 275–289. Springer, Heidelberg (1998)
10. Sotiriadis, P.P., Chandrakasan, A.: A Bus Energy Model For Deep Sub-Micron Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 10, 341–350 (2002)
11. Pande, P., Ganguly, A., Zhu, H., Grecu, C.: Energy reduction through crosstalk avoidance coding in networks on chip. Journal of Systems Architecture 54(3-4), 441–451 (2008)
12. Rahaman, M., Chowdhury, M.: Crosstalk Avoidance and Error-Correction Coding for Coupled RLC Interconnects. Crosstalk, 141–144 (2009)
13. Duan, C., Cordero Calle, V.H., Khatri, S.P.: Efficient On-Chip Crosstalk Avoidance CODEC Design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17(4), 551–560 (2009)
14. Sankaran, H., Katkoori, S.: On-chip dynamic worst-case crosstalk pattern detection and elimination for bus-based macro-cell designs. In: 2009 10th International Symposium on Quality of Electronic Design, pp. 33–39 (March 2009)
15. Palesi, M., Fazzino, F., Ascia, G., Catania, V.: Data Encoding for Low-Power in Wormhole-Switched Networks-on-Chip. In: 2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, pp. 119–126 (2009)
16. Gupta, S., Katkoori, S.: Intra-bus crosstalk estimation using word-level statistics. In: 17th International Conference on VLSI Design, Proceedings, pp. 449–454 (2004)
17. Ramprasad, S., Shanbhag, N., Hajj, I.: Analytical estimation of transition activity from word-level signal statistics. In: Proceedings of the 34th, vol. 16(7), pp. 718–733 (1997)
18. Fleming, K., Dave, C., Arvind, N., Raghavan, G., Jamey, M.: H.264 Decoder: A Case Study in Multiple Design Points. In: 6th ACM/IEEE International Conference on Formal Methods and Models for Co-Design, MEMOCODE, pp. 165–174 (2008)


Residue Arithmetic for Designing Low-Power Multiply-Add Units

Ioannis Kouretas and Vassilis Paliouras

Electrical and Computer Engineering Dept., University of Patras, Greece

Abstract. In this paper an efficient way to exploit multi-Vdd standard-cell libraries is quantitatively investigated as a means to reduce power consumption of multiply-add units. It is shown that multi-Vdd library-based design is suitable for RNS systems due to their inherent modular organization. In particular, the paths defined by the isolated moduli channels are clearly distinguished and the designer can easily and efficiently determine high- and low-voltage areas in the design. Three-, four- and five-moduli RNS bases have been used for the design of the RNS multiply-add units. Comparisons to synthesized circuits that do not use multi-Vdd libraries revealed power reductions of up to 38%.

1 Introduction

A main challenge for the electronics industry is to provide extremely efficient and powerful devices for communications, video and network applications that meet the strict power constraints of portable battery-operated devices. This requires effective design techniques to address both the power constraints and the increase of computational complexity.

The use of alternative number representations, such as the Logarithmic Number System (LNS) and the Residue Number System (RNS), is a promising technique for the implementation of computationally-intensive low-power systems [1, 17] using special-purpose dedicated circuits. In particular, RNS has been investigated as a possible choice for number representation in DSP applications [14, 15], since it offers parallel multiplication or addition and error correction properties [18]. Recently RNS has been proved to provide solutions in the field of wireless telecom applications [12]. RNS architectures for basic arithmetic circuits can be distinguished into memory table lookup-based ones, combinatorial logic-based ones, or combinations of both approaches [2]. Combinatorial RNS circuits are efficient especially for large moduli, and for moduli of the form 2^n − 1 [7], 2^n, and 2^n + 1 [7, 8, 19]. Moduli of the form 2^n − 1 and 2^n + 1 offer low-complexity circuits for arithmetic operations due to the end-around carry property, while moduli of the form 2^n lead to simple and regular architectures due to the carry-ignore property.

Recent publications have shown that RNS can offer significant power savings when applied to the design of VLSI FIR digital filters [3–5]. In [13] it is theoretically shown that power minimization is possible in the RNS domain by using multiple supply voltages. The particular study focuses on Polynomial RNS for the implementation of low-power convolvers. In this paper a multi-voltage library is exploited to reduce power dissipation of RNS multiply-add units and a quantitative analysis is offered. In particular, low-voltage cells are employed to implement specific paths, i.e., paths that are not maximum-delay critical for the circuit.

The remainder of the paper is organized as follows. Section 2 offers RNS basics, while Section 3 reviews power dissipation basics. Section 4 describes the proposed multi-Vdd multiply-add units and a quantitative analysis takes place in Section 5. Section 6 ends with some conclusions.

2 Review of RNS Basics

The RNS maps an integer X to an N-tuple of residues x_i, as follows:

X \xrightarrow{RNS} \{x_1, x_2, \ldots, x_N\},    (1)

where x_i = ⟨X⟩_{m_i}, ⟨·⟩_{m_i} denotes the mod m_i operation, and m_i is a member of a set of pair-wise co-prime integers {m_1, m_2, ..., m_N}, called the base. Co-prime integers have the property that gcd(m_i, m_j) = 1, i ≠ j. The modulo operation ⟨X⟩_m returns the integer remainder of the integer division X div m, i.e., a number k such that X = m · l + k, where l is an integer. Mapping (1) offers a unique representation of the integer X when 0 ≤ X < \prod_{i=1}^{N} m_i.

RNS is of interest because basic arithmetic operations can be performed in a carry-free manner. In particular, the operation Z = X ∘ Y, where Y and Z map under (1) to {y_1, y_2, ..., y_N} and {z_1, z_2, ..., z_N} respectively, and the symbol ∘ stands for addition, subtraction, or multiplication, can be implemented in RNS as z_i = ⟨x_i ∘ y_i⟩_{m_i}, for i = 1, 2, ..., N. According to the above, each residue result z_i does not depend on any of the x_j, y_j, j ≠ i, thus allowing fast data processing in N parallel independent residue channels. Inverse conversion is accomplished by means of the Chinese Remainder Theorem (CRT) or mixed-radix conversion [16].
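As a small worked illustration of mapping (1) and of channel-wise operation, the following C sketch reduces the operands of a multiply-add a ∗ b + c modulo each element of an example base {255, 256, 257} and performs the operation independently per channel. The base and operand values are chosen only for illustration.

#include <stdio.h>

#define N_MODULI 3

static const long base[N_MODULI] = {255, 256, 257};   /* pair-wise co-prime example base */

/* forward conversion: X -> {<X>_m1, ..., <X>_mN} */
static void to_rns(long x, long r[N_MODULI])
{
    for (int i = 0; i < N_MODULI; i++)
        r[i] = x % base[i];
}

int main(void)
{
    long a = 12345, b = 678, c = 910;
    long ra[N_MODULI], rb[N_MODULI], rc[N_MODULI], rz[N_MODULI];

    to_rns(a, ra);
    to_rns(b, rb);
    to_rns(c, rc);

    /* z_i = <a_i * b_i + c_i>_mi, computed independently in each channel */
    for (int i = 0; i < N_MODULI; i++)
        rz[i] = (ra[i] * rb[i] + rc[i]) % base[i];

    /* sanity check against the binary result (valid while a*b+c < m1*m2*m3) */
    long z = a * b + c;
    for (int i = 0; i < N_MODULI; i++)
        printf("channel m=%ld: z_i = %ld (expected %ld)\n",
               base[i], rz[i], z % base[i]);
    return 0;
}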

3 Low-Power in RNS

Dynamic power Pdyn of a circuit is given by [10]

P_{dyn} = C_L \cdot V_{dd}^{2} \cdot f \cdot \alpha,    (2)

where C_L is the load capacitance, V_dd is the supply voltage, f is the frequency of transitions and α is the switching activity in each clock cycle. Eq. (2) shows that power is quadratically related to voltage. Therefore, by reducing the power supply (V_dd), dynamic power decreases dramatically. The penalty for the reduction of V_dd is that cells operating at a lower voltage are slower. Hence, the designer should identify the non-critical paths (i.e., the paths that do not define the maximum-delay critical path) and power the respective gates with a lower voltage.

Page 47: Lecture Notes in Computer Science 6448 - CAS · Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University

Residue Arithmetic for Designing Low-Power Multiply-Add Units 33

For the case of a multi-Vdd system, power dissipation is given by

Pdyn =∑p

i=1 CLi · V 2dd,i · fi · αi, (3)

where p is the number of power domains employed. The proposed techniquebuilds on the modular organization of residue-based systems. In particular, it ishere proposed that each independent moduli channel of an RNS architecture ismapped to an appropriate supply voltage. According to the proposed techniquemoduli channels that contain the longest path are mapped to higher supplyvoltages. It is noted that power minimization is achieved without any impact onthe delay. Due to its modular organization, RNS is ideally suited for the simpleand efficient application of the aforementioned low-power design technique.

Assume an L-moduli RNS base {m_1, m_2, ..., m_L} implemented by an L-channel residue architecture, as shown in Fig. 2. Each modulo m_i defines the complexity of the corresponding modulo channel; the channel delays are {d_1, d_2, ..., d_L}, respectively, assuming a high-voltage power supply denoted as Vdd(H). Here we focus on the case of two power domains, i.e., p = 2, with two voltage values, Vdd(H) and Vdd(L). The maximum delay d_max = max(d_1, d_2, ..., d_L) determines the critical maximum delay of the design. Assume that d_max = d_k; for the delays d_l, l ≠ k, without loss of generality, it holds that

d_{k_1} < d_{k_2} < \ldots < d_{k_{L-1}} < d_{max},    (4)

where k_i, i = 1, 2, ..., L − 1 is an ordering of the integers j, 1 ≤ j ≤ L, j ≠ k. Without violating design constraints, replacement of the high-voltage gates (Vdd(H)) that compose each one of the moduli channels m_{k_i} with low-voltage gates (Vdd(L)) is permissible, provided that the imposed delay penalty in non-critical circuits does not affect the overall critical delay d_max, i.e., d_max = d_k ≥ max{d_{k_i}}.
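The following C sketch illustrates this assignment rule under assumed per-channel delay values: a channel is moved to Vdd(L) only if its low-voltage delay still fits under the critical delay d_max set by the slowest channel at Vdd(H). The delay numbers are illustrative, not synthesis results.

#include <stdio.h>

#define N_CHANNELS 3

int main(void)
{
    /* assumed per-channel delays (ns) at Vdd(H) and at Vdd(L) */
    const double d_high[N_CHANNELS] = {1.60, 1.90, 1.40};
    const double d_low[N_CHANNELS]  = {1.85, 2.25, 1.65};

    /* dmax is set by the slowest channel at the high supply voltage */
    double dmax = d_high[0];
    for (int i = 1; i < N_CHANNELS; i++)
        if (d_high[i] > dmax) dmax = d_high[i];

    for (int i = 0; i < N_CHANNELS; i++) {
        int low_ok = d_low[i] <= dmax;   /* no impact on the critical path */
        printf("channel %d: %s (d_low = %.2f ns, dmax = %.2f ns)\n",
               i, low_ok ? "Vdd(L)" : "Vdd(H)", d_low[i], dmax);
    }
    return 0;
}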

Subsequently, the proposed multiply-add units are described and quantitative power dissipation and complexity results are derived. Comparisons are offered to both binary structures and residue multiply-add units without multi-voltage supply, in terms of power dissipation and complexity.

4 RNS and Binary Multiply-Add Units

This section describes the organization of RNS and binary multiply-add units. In the case of RNS, three-, four- and five-moduli bases of the form {2^{n1} − 1, 2^{n2}, 2^{n3} + 1}, {2^{n1}, 2^{n2} − 1, 2^{n3} − 1, 2^{n4} + 1} and {2^{n1}, 2^{n2} − 1, 2^{n3} − 1, 2^{n4} + 1, 2^{n5} + 1} are used, respectively. The binary multiply-add unit comprises a Wallace multiplier augmented by a step for the addition of a third operand. Figs. 1 and 3 depict the organization of a binary and a three-moduli RNS-based multiply-add unit respectively, while Fig. 4 shows possible 4-bit implementations for a modulo-(2^n − 1) MAC (Fig. 4(a)), a modulo-2^n MAC (Fig. 4(c)) and a binary MAC (Fig. 4(b)). Both architectures implement the multiply-add operation a ∗ b + c.


Fig. 1. Organization of the binary multiply-add unit

Fig. 2. Architecture of multi-voltage RNS system (binary-to-RNS conversion, parallel modulo-m_i processors supplied with Vdd(H) or Vdd(L), and RNS-to-binary conversion)

It is noted that in the case of RNS, binary-to-RNS and RNS-to-binary converters are required. Forward conversion is required at the start and reverse conversion at the end of a MAC-intensive operation, such as the computation of an N-point Fourier transform [11].

To illustrate this point, assume the FIR filter operation y(n) = b_0 x(n) + b_1 x(n−1) + b_2 x(n−2) + ... + b_M x(n−M), where x(n) is the input signal, b(n) are the coefficients and y(n) is the output signal. Let the RNS base be of the form {m_1, m_2, m_3, ..., m_N}. Then for the kth sample y(k) of the filter output, it holds that y(k) = b_0 x(k) + b_1 x(k−1) + b_2 x(k−2) + ... + b_M x(k−M). In the RNS domain the same operation is performed in N parallel modulo-m_i channels as

\langle y(k) \rangle_{m_i} = \left\langle \sum_{l=0}^{M} \langle b_l \cdot x(k-l) \rangle_{m_i} \right\rangle_{m_i},    (5)

where m_i denotes the ith modulus, i = 1, 2, ..., N. The procedure for the computation of y(n) is as follows. Initially the multiplication c(0) = ⟨b_0 x(k)⟩_{m_i} is computed. Then the modulo-m_i result c(0) is added to the residue product ⟨b_1 x(k−1)⟩_{m_i} to derive the intermediate quantity c(1) = ⟨c(0) + b_1 x(k−1)⟩_{m_i}. The result ⟨y(k)⟩_{m_i} is recursively derived after M additions and multiplications. Hence the final result y(k) is generated by the residue-to-binary conversion of the RNS result {⟨y(k)⟩_{m_1}, ..., ⟨y(k)⟩_{m_N}} after M multiply-add operations. For this reason the backward residue-to-binary conversion is performed only every M multiply-add operations. Furthermore, x and b are forward converted once and are recursively used for the computation of y. Therefore, for a sufficiently large amount of processing, the conversion cost can be compensated by savings achieved due to more efficient processing. Due to the conversion overhead, applications suitable for RNS include multiply-add-intensive kernels such as digital filtering or discrete transforms.
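A minimal C sketch of the per-channel recursion of Eq. (5) is shown below for a single modulo-m channel, with illustrative filter length, coefficients and samples; the residue-domain accumulation is checked against the binary reference reduced modulo m.

#include <stdio.h>

#define TAPS 4

/* c <- <c + b_l * x(k-l)>_m, one multiply-add step in the channel */
static long mod_mac(long acc, long coeff, long sample, long m)
{
    return (acc + (coeff % m) * (sample % m)) % m;
}

int main(void)
{
    const long m = 255;                        /* one channel of the RNS base  */
    const long b[TAPS] = {3, 7, 11, 13};       /* example filter coefficients  */
    const long x[TAPS] = {40, 25, 60, 5};      /* x(k), x(k-1), x(k-2), x(k-3) */

    long acc = 0, ref = 0;
    for (int l = 0; l < TAPS; l++) {
        acc = mod_mac(acc, b[l], x[l], m);     /* residue-domain accumulation  */
        ref += b[l] * x[l];                    /* binary reference             */
    }
    printf("<y(k)>_%ld = %ld (binary y(k) mod %ld = %ld)\n", m, acc, m, ref % m);
    return 0;
}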


Fig. 3. Organization of the RNS-based multiply-add unit: modulo 2^{n1} − 1 channel [20], modulo 2^{n2} channel, and modulo 2^{n3} + 1 channel [6], each comprising an AND array, an adder array and a final modulo adder

Fig. 4. Implementations of RNS and binary MAC units: (a) modulo 2^n − 1 MAC, (b) binary MAC, (c) modulo 2^n MAC

5 Results and Comparisons

In this section a quantitative analysis and comparisons of residue circuits to the equivalent binary multiply-add unit are offered, in the case of three-, four- and five-moduli bases.

In particular, as a test case, a 50th-order FIR low-pass filter is used, with a cut-off frequency of 0.3 rad/sec. A zero-mean uncorrelated Gaussian random sequence is used as stimulus. The experiment assumes 1000 input data samples. For each modulo channel of the RNS circuit the corresponding input vectors are derived by the modulo operation on the input data samples and the coefficients of the FIR filter. Hence the inputs of the modulo circuits assume the values that a forward converter would generate.


Subsequently, the binary multiply-add unit equivalent to the RNS one is defined. The signal-to-noise ratio (SNR) is used as a metric to define the binary structure which is equivalent to the RNS one. The SNR is estimated by using the filter and the input data described above. It is found that the 30-bit data range RNS FIR filter exhibits almost the same SNR as the binary FIR filter with 20-bit wordlength operands (SNR_BIN = 64.71, SNR_RNS = 65.38).

In this paper a multi-Vdd 90 nm TSMC library, characterized for 1.2 V (high-voltage) and 1.0 V (low-voltage) power supply, and Prime Time of Synopsys [9] have been used. Power is estimated by using the stimuli derived by the FIR filter defined above with annotated switching activity, assuming a 5 ns clock period for the simulation.

It is noted that high-voltage gates exhibit a shorter delay compared to the low-voltage gates. The proposed multi-Vdd based design technique distinguishes parts of the circuit that are not critical and may operate at reduced speed. There, a low-voltage power supply can be used without affecting the critical path delay.

In the following, the residue number system is used for multi-Vdd design. Assume an L-moduli RNS base {m_1, m_2, ..., m_L}, the channel delays of which are {d_1, d_2, ..., d_L}, respectively, for the high-voltage power supply. The maximum delay d_max = max(d_1, d_2, ..., d_L) determines the critical delay of the design. Now assume that d_max = d_k and that for the delays d_p, d_{p−1} and d_{p−2} of the moduli channels p, p − 1, and p − 2, respectively, it holds that

d_p < d_{p-1} < d_{p-2} < d_{max}.    (6)

Regarding design constraints, replacement of the high-voltage gates that compose each one of the moduli channels p, p − 1 and p − 2 with low-voltage gates is legal, provided that the resulting delay penalty keeps the critical delay d_max unchanged, i.e., d_max = d_k ≥ max{d_p, d_{p−1}, d_{p−2}}.

Several RNS circuits have been synthesized using the multi-Vdd library, and the obtained results are presented in Tables 1, 2 and 3. The moduli followed by (*) denote low-voltage (1.0 V) power-supply synthesis; the lack of (*) means that the particular moduli circuits have been synthesized with the high-voltage (1.2 V) power supply. The column labeled "power" contains the power results for the RNS system before and after the application of the multi-Vdd low-power technique. The power savings percentage is computed as (Power_before − Power_after)/Power_before · 100%.
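For instance, the 28.71% figure in the first row of Table 1 follows directly from the reported before/after power values; a one-line check in Python:

# Power-savings percentage as defined above, applied to the first row of Table 1.
power_before = 3.0577  # mW, base {256, 2047, 1025*} with high Vdd everywhere
power_after = 2.1797   # mW, after supplying the modulo-1025 channel at 1.0 V

savings = (power_before - power_after) / power_before * 100
print(f"power savings = {savings:.2f}%")  # prints 28.71%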

More specifically, Table 1 depicts results in the case of three-moduli RNS bases of the form {2^n1 − 1, 2^n2, 2^n3 + 1}. It is shown that power savings range from 8.11% to 37.96%, for the bases {256∗, 2047, 2049} and {64, 8191∗, 1025∗}, respectively. In the case of the base {256, 2047, 1025}, it is shown that by low-voltage supplying modulo-1025, deriving the base {256, 2047, 1025∗}, a 28.71% power saving is achieved, while in the case of low-Vdd application to both modulo-1025 and modulo-256 the power saving is increased to 33.35%.

Regarding four-moduli bases of the form {2^n1, 2^n2 − 1, 2^n3 − 1, 2^n4 + 1}, Table 2 shows that power savings reach up to 38.63% in the case of the base {16, 31∗, 2047∗, 1025∗}. Table 2 also demonstrates that the bases


Table 1. Power, delay and area results in the case of multi-Vdd application to three-moduli RNS bases

base | power before (mW) | power after (mW) | area (μm²) | delay (ns) | power savings
{256, 2047, 1025∗} | 3.0577 | 2.1797 | 11427.6623 | 2 | 28.71%
{256, 511∗, 8193} | 3.3128 | 2.5658 | 12513.1888 | 2 | 22.55%
{64, 8191∗, 1025∗} | 3.2168 | 1.9957 | 20874.1566 | 2 | 37.96%
{256∗, 2047, 2049} | 1.7488 | 1.607 | 7166.2304 | 2 | 8.11%
{256∗, 2047, 1025∗} | 3.0577 | 2.0379 | 11190.0319 | 2 | 33.35%
{256∗, 1023∗, 4097} | 3.0823 | 2.2485 | 12060.9775 | 2 | 27.05%

Table 2. Power, delay and area results in the case of multi-Vdd application to four-moduli RNS bases

base | power before (mW) | power after (mW) | area (μm²) | delay (ns) | power savings
{16, 31, 2047, 1025∗} | 3.1598 | 2.282 | 12507.1519 | 2 | 27.79%
{32, 15, 511∗, 4097} | 3.1058 | 2.359 | 12056.0384 | 2 | 24.05%
{16, 31, 2047∗, 1025∗} | 3.1598 | 2.069 | 14390.0846 | 2 | 34.52%
{32, 511∗, 2047, 17} | 2.9866 | 2.240 | 11585.7168 | 2 | 25.01%
{16, 31∗, 2047, 1025∗} | 3.1598 | 2.152 | 12301.9007 | 2 | 31.90%
{32, 511∗, 2047∗, 17} | 2.9866 | 2.027 | 13468.6495 | 2 | 32.14%
{16, 31∗, 2047∗, 1025∗} | 3.1598 | 1.939 | 14184.8334 | 2 | 38.63%
{256∗, 31, 4095, 17} | 1.8238 | 1.682 | 7958.6976 | 2 | 7.77%
{16∗, 31, 2047, 1025∗} | 3.1598 | 2.247 | 12265.1311 | 2 | 28.89%
{32∗, 15, 511∗, 4097} | 3.1058 | 2.327 | 12082.3808 | 2 | 25.08%
{16∗, 31, 2047∗, 1025∗} | 3.1598 | 2.034 | 14148.0638 | 2 | 35.63%
{32∗, 511∗, 2047, 17} | 2.9866 | 2.208 | 11612.0592 | 2 | 26.08%
{16∗, 31∗, 2047, 1025∗} | 3.1598 | 2.117 | 12059.8799 | 2 | 33.00%
{32∗, 511∗, 2047∗, 17} | 2.9866 | 1.995 | 13494.9919 | 2 | 33.20%

{16, 31, 2047, 1025∗} and {16, 31, 2047∗, 1025∗} achieve 27.79% and 34.52% power savings, respectively.

In Table 3, similar results are shown for the case of five-moduli RNS multiply-add units. In particular, the base {64, 31, 511∗, 17, 33}, which applies the low-voltage supply to modulo-511, achieves a 23.66% power reduction, while the base {64, 31∗, 511∗, 17∗, 33}, with three low-Vdd moduli channels, namely modulo-511, -31 and -17, exhibits 30.34% power consumption gains. Power savings range from 9.54% up to 38.03% for the bases {512∗, 15, 31, 17, 257} and {16, 31∗, 63∗, 17∗, 1025∗}, respectively.

Table 4 shows the results for the binary FIR filter with 20-bit wordlength operands. The power consumption in the binary domain is 4.432mW, while the maximum power result in the RNS domain is 3.373mW and 2.415mW in the case of high-Vdd and low-Vdd supply voltages, respectively.

Results reveal that multi-Vdd design is highly suited for RNS design of multiply-add units and hence for the implementation of low-power FIR VLSI filters.


Table 3. Power, delay and area results in the case of multi-Vdd application to five-moduli RNS bases

base | power before (mW) | power after (mW) | area (μm²) | delay (ns) | power savings
{16, 31, 63, 17, 1025∗} | 3.3735 | 2.4955 | 13912.0799 | 1.67 | 26.03%
{64, 31, 127, 33∗, 65} | 2.1642 | 2.0944 | 10590.7424 | 1.30 | 3.23%
{16, 31, 63, 17∗, 1025∗} | 3.3735 | 2.4145 | 14082.7567 | 1.67 | 28.43%
{64, 31, 511∗, 17, 33} | 3.1573 | 2.4103 | 12844.1152 | 1.73 | 23.66%
{16, 31, 63∗, 17, 1025∗} | 3.3735 | 2.3016 | 13527.3711 | 1.67 | 31.77%
{64, 31, 511∗, 17∗, 33} | 3.1573 | 2.3293 | 13014.792 | 1.73 | 26.22%
{16, 31, 63∗, 17∗, 1025∗} | 3.3735 | 2.2206 | 13698.0479 | 1.67 | 34.18%
{64, 63∗, 127, 17, 65} | 2.2674 | 2.0735 | 10667.5744 | 1.47 | 8.55%
{16, 31∗, 63, 17, 1025∗} | 3.3735 | 2.3656 | 13706.8287 | 1.67 | 29.88%
{64, 63∗, 127, 17∗, 65} | 2.2674 | 1.9925 | 10838.2512 | 1.47 | 12.12%
{16, 31∗, 63, 17∗, 1025∗} | 3.3735 | 2.2846 | 13877.5055 | 1.67 | 32.28%
{64, 31∗, 511∗, 17, 33} | 3.1573 | 2.2804 | 12638.864 | 1.73 | 27.77%
{16, 31∗, 63∗, 17, 1025∗} | 3.3735 | 2.1717 | 13322.1199 | 1.67 | 35.62%
{64, 31∗, 511∗, 17∗, 33} | 3.1573 | 2.1994 | 12809.5408 | 1.73 | 30.34%
{16, 31∗, 63∗, 17∗, 1025∗} | 3.3735 | 2.0907 | 13492.7967 | 1.67 | 38.03%
{512∗, 15, 31, 17, 257} | 2.498 | 2.2598 | 11028.6848 | 2.23 | 9.54%
{16∗, 31, 63, 17, 1025∗} | 3.3735 | 2.46056 | 13670.0591 | 1.67 | 27.06%
{512∗, 15, 31, 17∗, 257} | 2.498 | 2.1788 | 11199.3616 | 2.23 | 12.78%
{16∗, 31, 63, 17∗, 1025∗} | 3.3735 | 2.37956 | 13840.7359 | 1.67 | 29.46%
{64∗, 31, 511∗, 17, 33} | 3.1573 | 2.3588 | 13127.8448 | 1.73 | 25.29%
{16∗, 31, 63∗, 17, 1025∗} | 3.3735 | 2.26666 | 13285.3503 | 1.67 | 32.81%
{32∗, 31, 511∗, 17∗, 65} | 3.275 | 2.41517 | 13784.7584 | 1.73 | 26.25%
{16∗, 31, 63∗, 17∗, 1025∗} | 3.3735 | 2.18566 | 13456.0271 | 1.67 | 35.21%
{256∗, 31∗, 127, 17, 33} | 2.0767 | 1.805 | 9426.7376 | 1.87 | 13.08%
{16∗, 31∗, 63, 17, 1025∗} | 3.3735 | 2.33066 | 13464.8079 | 1.67 | 30.91%
{256∗, 31∗, 127, 17∗, 33} | 2.0767 | 1.724 | 9597.4144 | 1.87 | 16.98%
{16∗, 31∗, 63, 17∗, 1025∗} | 3.3735 | 2.24966 | 13635.4847 | 1.67 | 33.31%
{64∗, 31∗, 511∗, 17, 33} | 3.1573 | 2.2289 | 12922.5936 | 1.73 | 29.40%
{16∗, 31∗, 63∗, 17, 1025∗} | 3.3735 | 2.13676 | 13080.0991 | 1.67 | 36.66%
{64∗, 31∗, 511∗, 17∗, 33} | 3.1573 | 2.1479 | 13093.2704 | 1.73 | 31.97%

Table 4. Power, delay and area results for the binary 20-bit wordlength multiply-add unit with high-Vdd supply voltage

power (mW) | area (μm²) | delay (ns)
4.432 | 19550.451 | 4.41

6 Conclusions

In this paper the low-power technique of multi-Vdd design has been applied to the design of multiply-add units in the residue number system. It is shown that this technique can be used in RNS systems because the paths defined by the moduli channels are clearly distinguished and the designer can easily assign high- and low-voltage areas in the design.


Furthermore, binary and residue multiply-add units are quantitatively compared. RNS is shown to offer substantial power savings, due to the parallel structure of RNS and to the simple and effective application of the multi-Vdd design technique.

References

1. Basetas, C., Kouretas, I., Paliouras, V.: Low-Power Digital Filtering Based on the Logarithmic Number System. In: Azemard, N., Svensson, L. (eds.) PATMOS 2007. LNCS, vol. 4644, pp. 546–555. Springer, Heidelberg (2007)

2. Bayoumi, M.A., Jullien, G.A., Miller, W.C.: A VLSI implementation of residue adders. IEEE Transactions on Circuits and Systems 34, 284–288 (1987)

3. Bernocchi, G.L., Cardarilli, G.C., Re, A.D., Nannarelli, A., Re, M.: Low-power adaptive filter based on RNS components. In: ISCAS, pp. 3211–3214 (2007)

4. Cardarilli, G., Re, A.D., Nannarelli, A., Re, M.: Impact of RNS coding overhead on FIR filters performance. In: Proc. of 41st Asilomar Conference on Signals, Systems, and Computers (November 2007), http://www2.imm.dtu.dk/pubdb/p.php?5566

5. Cardarilli, G., Nannarelli, A., Re, M.: Reducing Power Dissipation in FIR Filters using the Residue Number System. In: Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, vol. 1, pp. 320–323 (August 2000)

6. Efstathiou, C., Vergos, H.T., Dimitrakopoulos, G., Nikolos, D.: Efficient diminished-1 modulo 2^n + 1 multipliers. IEEE Transactions on Computers 54(4), 491–496 (2005)

7. Efstathiou, C., Vergos, H.T., Nikolos, D.: Modulo 2^n ± 1 adder design using select-prefix blocks. IEEE Transactions on Computers 52(11) (November 2003)

8. Hiasat, A.A.: High-speed and reduced area modular adder structures for RNS. IEEE Transactions on Computers 51(1), 84–89 (2002)

9. http://www.synopsys.com

10. Keating, M., Flynn, D., Aitken, R., Gibbons, A., Shi, K.: Low Power Methodology Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, Heidelberg (2007)

11. Kouretas, I., Paliouras, V.: Mixed radix-2 and high-radix RNS bases for low-power multiplication. In: Svensson, L., Monteiro, J. (eds.) PATMOS 2008. LNCS, vol. 5349, pp. 93–102. Springer, Heidelberg (2009)

12. Madhukumar, A.S., Chin, F.: Enhanced architecture for residue number system-based CDMA for high-rate data transmission. IEEE Transactions on Wireless Communications 3(5), 1363–1368 (2004)

13. Paliouras, V., Skavantzos, A., Stouraitis, T.: Multi-Voltage Low Power Convolvers Using the Polynomial Residue Number System. In: Proceedings of the 12th ACM Great Lakes Symposium on VLSI, GLSVLSI 2002, pp. 7–11. ACM, New York (2002)

14. Ramirez, J., Fernandez, P., Meyer-Base, U., Taylor, F., Garcia, A.: Index-Based RNS DWT architecture for custom IC designs. In: IEEE Workshop on Signal Processing Systems, pp. 70–79 (2001)

15. Ramirez, J., Garcia, A., Lopez-Buedo, S., Lloris, A.: RNS-enabled digital signal processor design. Electronics Letters 38, 266–268 (2002)

16. Soderstrand, M.A., Jenkins, W.K., Jullien, G.A., Taylor, F.J.: Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. IEEE Press, Los Alamitos (1986)


17. Stouraitis, T., Paliouras, V.: Considering the alternatives in low-power design. IEEE Circuits and Devices 17(4), 23–29 (2001)

18. Szabo, N., Tanaka, R.: Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, New York (1967)

19. Wang, Z., Jullien, G.A., Miller, W.C.: An algorithm for multiplication modulo (2^n + 1). In: Proceedings of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pp. 956–960 (1996)

20. Zimmermann, R.: Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication. In: Proceedings of the 14th IEEE Symposium on Computer Arithmetic, ARITH 1999, p. 158 (1999)


An On-Chip Flip-Flop Characterization Circuit

Abhishek Jain1, Andrea Veggetti2, Dennis Crippa2, and Pierluigi Rolandi2

1 STMicroelectronics Noida, India 2 STMicroelectronics Agrate, Italy

[email protected], [email protected], [email protected], [email protected]

Abstract. The performance of a sequential digital circuit (speed, power consumption, etc.) depends upon the performance of the flip-flops used in the design. ASIC design flows use characterized data of flip-flops for final signoff. Therefore it is critical to know precisely the accuracy of the characterized data with respect to the actual behavior of flip-flops on silicon. An on-chip flip-flop characterization circuit (FCC) is presented here which gives an accurate estimation of various flip-flop parameters such as CP-Q delay, setup time, hold time and power consumption. The system consists of a digital controller and a characterization circuit based upon a configurable oscillator, which can be programmed to oscillate in different configurations or can be operated in functional mode for functional verification. The delay values are calculated by processing the oscillator time period in the different modes. The system was fabricated in 40nm CMOS technology and the flip-flop parameters are extracted from it.

Keywords: Flip-flop, CMOS, delay measurement, characterization, silicon validation, on-chip, setup-hold.

1 Introduction

Flip-flops and latches are the basic sequential logic elements used in ASIC design. These elements take a significant portion of the critical path timing in a high-speed digital circuit, and they also contribute heavily to the total system power, both dynamic and static. The performance and complexity of modern designs make these components a vital part of the design. Therefore, there exists a need for studying the behavior of these components.

In general, the characteristics are measured using SPICE models and circuit simulators at the CAD level, and the data obtained is put into different packaging formats. This data is used in the final signoff of the chip and thus is required to be validated against actual measured results on silicon. A direct off-chip measurement of the delay between waveforms at flip-flop/latch ports [1] can be used to validate the simulation models. However, an off-chip measurement approach has serious limitations, since the on-chip delays of flip-flops/latches in deep-submicron technologies are typically much smaller than those of the circuitry connecting the ports to the instrumentation. The measurement errors incurred by this circuitry can be comparable to the measured quantity. Other methods for on-chip delay estimation are the dummy path method and the ring oscillator method [2]. The dummy path method is again limited in accuracy since it is based upon off-chip measurements, whereas the ring oscillator method, which involves measuring a square-wave time period, gives accurate results. The ring oscillator method is well suited for delay measurement of combinational cells and latches, but it is not well explored for measurement of flip-flop parameters. Some other systems have also been proposed [3], [4], [5] involving complete characterization of flip-flops/latches, but they are based on multiple circuits for characterization of the different parameters.

In this paper we present a single on-chip measurement system for complete characterization of sequential elements, which is based upon a ring oscillator configuration for estimation of data-to-output and clock-to-output delays and setup-hold timings, and a shift register configuration for estimation of power. In Section 2 flip-flop/latch characteristics and parameters are explained, followed by the description of the measurement apparatus in Section 3. Section 4 explains how parameter extraction is done based upon the apparatus of Section 3. Section 5 contains measurement results based upon CAD simulations, silicon results obtained from a test circuit implemented in 40nm CMOS technology, and error analysis. Section 6 concludes the paper.

2 Sequential Element Characterization Parameters

In this section we describe the key parameters of a positive edge triggered D flip-flop circuit. These parameters are also valid for other configurations of sequential elements.

2.1 Timing Parameters

The functionality of a flip-flop circuit depends upon the time at which a change in the data input D of the flip-flop occurs with respect to the positive edge of the clock input CP. If the signal at the D input is stable within a window around the positive transition of the clock CP, then some time later the D value will propagate to the output Q of the flip-flop. As shown in Fig. 1, the time before the clock edge for which the D input has to be stable is called the setup time (ts) and the time after the clock edge for which the D input has to be stable is called the hold time (th). The delay from the positive clock edge to the new value of the Q output is called the clock-to-Q delay or propagation delay (tCP-Q) [6].

Timing verification tools issue a timing violation if the data input D changes inside the setup and hold time window described above. This is a failure case for the flip-flop, since the flip-flop circuit could enter a meta-stable state. In Fig. 2, the clock-to-Q delay has been plotted with respect to the time difference between the data and clock inputs of the flip-flop. For large values of delay between data and clock the clock-to-Q delay is constant, but as the delay approaches the setup and hold time window the clock-to-Q delay starts increasing, since internally the flip-flop circuit takes more time to resolve its state. There also exists a failure window wherein a change in the data input does not have any effect on the flip-flop output. The setup and hold time are therefore defined at the point where the slope of the curve is equal to 1 [3].

In the presented measurement system, we have exploited this relation between the clock-to-Q delay and the data-to-clock input delay to measure the timing parameters. The clock-to-Q delay is measured where it is constant, and the setup/hold times are measured at the points defined in Fig. 2.

Fig. 1. D Flip-flop Timing Parameters

2.2 Dynamic Power

Flip-flops are used in a wide variety of circuits targeting different applications where the data rates can differ. Therefore, it is important to study the power consumption of the flip-flop with respect to the switching activity of the data input, or data rate (which also results in a change of the output state). Here, dynamic power is measured with respect to different data rates and a constant clock frequency.

2.3 Static Power

As leakage power has become quite significant in submicron technologies, it is also important to know what current the flip-flop draws in the inactive state. Leakage power estimation is also useful for retention flip-flops, which are used in power-down applications. Here the leakage power of the flip-flop can be measured under different configurations of inputs and outputs.


Fig. 2. Clock-to-Q delay v/s Delay between Data and Clock Inputs of D Flip-flop [3]

3 Measurement System

The measurement system consists of two main blocks, the controller circuit and the characterization circuit (FCC). The controller circuit is based upon a digital state machine generating the control signals for the FCC to operate in different configurations. The FCC can be operated with or without the controller circuit.

Fig. 3. FCC BASECELL Circuit Diagram


3.1 Characterization Circuit (FCC)

It is a purely digital circuit which can be implemented using a basic standard cell library. It is based upon N stages of FCC BASECELL units, as shown in Figure 4. The Basecell circuit consists of MUXes, programmable delay cells PDD and PDC, and the DUT (Device Under Test, in the present case any D flip-flop), connected as shown in Figure 3. The signals to the Clock and Data inputs of the DUT can be configured through the 4X1 mux select lines, and their respective path delays can be varied through the PDC and PDD cells. The output of the Basecell can also be programmed to select the output of the DUT, the D input of the DUT or the CP input of the DUT. Depending upon the mode of operation, these inputs and outputs are configured accordingly, either by the controlling circuit or by external IO.

The PDD and PDC cells used in the data and clock path, respectively, are based upon the programmable delay cell circuit shown in Figure 5. These cells are used to introduce delay between the data and clock inputs of the DUT for timing measurements. The PDD and PDC cells are made of different drives of a BUF cell, which form a vernier delay line between the clock and data paths, selectable through the SDD and SDCP select lines. The select lines are chosen in order to have a minimum delay difference between the two. The delay introduced by these cells can be characterized in the oscillator mode of the system, which is explained later. These two blocks are implemented with a full-custom flow, in order to have minimum delay variation between different cells.

To minimize the delay variation due to the different rise and fall delays of the cells in PDC and PDD, the positive edge of the signal is propagated through the PDD and PDC cells for every even-stage Basecell and the negative edge for every odd-stage Basecell.

The DUT in the circuit is connected to a different power domain, which is done by separating the rail connection of the DUT from the rest of the circuit and connecting it to a different power supply. The number of stages N of the system is limited by the minimum current measurement value of the tester: the N flip-flops should be able to produce a leakage current of that order.

Fig. 4. FCC Characterization Circuit


Fig. 5. Programmable Delay Cell Circuit Diagram (PDD and PDC)

3.2 Characterization Circuit Configurations

The system is based upon two different configurations: Oscillator and Shifter. The Oscillator configuration is used for extraction of timing parameters, and the Shifter configuration is used for extraction of static and dynamic power, and for functional verification.

Oscillator Configuration: In this configuration the inputs and output of the Basecell are configured to form a ring oscillator. The oscillator can be configured in three different modes to include or exclude the delay of certain paths.

(a) The delay of the clock path is characterized in this mode. The output BOUT of the Basecell passes the signal at the CP input of the DUT to the next-stage Basecell. The delay of a single unit equals 1/(2*N*Frequency of Oscillation at System Output).

(b) The delay of the clock path plus the clock-to-Q path of the DUT is characterized in this mode. The select lines of the MUXes are set to send the signal at the Q output of the DUT to the CP input of the next-stage DUT. Here, a single edge (rise or fall) is propagated through the N stages and the DUTs are reset (for rise delay measurement) or set (for fall delay measurement). The delay of a single unit equals 1/(N*Frequency of Oscillation at System Output).

(c) The delay of the data path is characterized. The BOUT output passes the signal at the D input of the DUT to the next stage. The delay of a single unit equals 1/(2*N*Frequency of Oscillation at System Output).

Shifter Configuration: In this configuration the clock input of the DUT is controlled with an external clock signal and its Q output goes to the D input of the next-stage cell. In this way the signal available at the D input of the first stage is available at the Q output of the Nth stage after N clock cycles. This configuration is useful for dynamic and leakage power estimation.

4 Measurements

4.1 Clock-to-Q Delay Measurement

The circuit is operated in the Oscillator configuration in modes (a) and (b) as explained above. Here, the setup and hold constraints of the DUT are respected in order to have a stable value of clock-to-Q delay. The clock-to-Q delay is given by the difference of the per-unit delays measured in modes (b) and (a).
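A minimal numerical sketch of this extraction, assuming N = 100 Basecell stages (as in the CAD setup of Section 5) and hypothetical oscillation frequencies for modes (a) and (b):

# Clock-to-Q delay from the ring-oscillator measurements of modes (a) and (b).
# The oscillation frequencies below are hypothetical placeholders for measured values.
N = 100            # number of Basecell stages
f_mode_a = 20.0e6  # Hz, mode (a): clock path only (two edges per stage)
f_mode_b = 26.2e6  # Hz, mode (b): clock path + CP-Q path (one edge per stage)

delay_a = 1.0 / (2 * N * f_mode_a)  # per-unit clock-path delay
delay_b = 1.0 / (N * f_mode_b)      # per-unit clock-path + CP-Q delay
t_cp_q = delay_b - delay_a          # clock-to-Q delay = (b) - (a)

print(f"t_CP-Q ~ {t_cp_q * 1e12:.0f} ps")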


4.2 Setup and Hold Time Measurement

The circuit is operated in the Oscillator configuration in modes (a), (b) and (c) as explained above. The data path selects the clock path signal to pass through instead of the signal coming to the 4X1 data MUX. The three measurements are performed for all combinations of polarity and delays of the clock and data paths. The clock-to-Q delay is given by (b)-(a) and the delay between the clock and data signals is given by (c)-(a). These values are plotted and the optimized setup and hold time values are extracted from the graph as explained in Section 2.
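A hedged sketch of the final extraction step: given sampled pairs of clock-data path delay and the corresponding clock-to-Q delay, the setup (or hold) point is taken where the magnitude of the curve's slope reaches 1, as defined in Section 2. The sample data below are hypothetical.

import numpy as np

# Hypothetical samples: clock-data path delay (ps) vs. measured clock-to-Q delay (ps).
cd_delay = np.array([200.0, 150.0, 120.0, 100.0, 90.0, 80.0, 75.0, 72.0])
t_cpq = np.array([132.0, 132.5, 134.0, 138.0, 143.0, 152.0, 160.0, 168.0])

# Numerical slope of the T_CP-Q curve with respect to the clock-data delay.
slope = np.abs(np.gradient(t_cpq, cd_delay))

# Timing point: the clock-data delay where the slope is closest to 1.
idx = int(np.argmin(np.abs(slope - 1.0)))
print(f"extracted timing point ~ {cd_delay[idx]:.0f} ps (|slope| = {slope[idx]:.2f})")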

4.3 Dynamic Power Measurement

The circuit is operated in the Shifter configuration. Data with different activity rates with respect to the clock frequency is passed through the shifter, and power measurements are performed on the two power supplies, i.e., the one supplying power to the DUTs and the one supplying power to the rest of the circuit.

4.4 Leakage Power Measurement

The circuit is again operated in Shifter configuration. The DUTs are first fixed to constant state and then leakage measurements are performed on two power supplies.

4.5 Sources of Error and Improvements

The main sources of error in the timing measurements at the circuit level are the different path delays of the MUX used in the PDD and PDC cells, and the difference in rise and fall delays of the cells used in the circuit. These errors can be minimized using the implementation method suggested in Section 3.1, but they cannot be eliminated completely.

For the power measurements, since the power domain of the DUT is separated from the rest of the interface and control circuit, the results show the actual power dissipation without any external component. However, error in this case can be introduced by the apparatus used for current measurement, since such instruments have a limited minimum measurable value. To overcome this limitation, a sufficient number of Basecell stages should be put in the circuit, especially in the case of static currents.

Further, the present measurement system targets characterization at a particular load and slope only. In order to perform characterization at different loads and clock transitions, additional MUX stages could be added at the output of the DUT and at the clock input of the DUT, which would give the programmability to select different loads and clock signal slopes.

5 Measurement Results

5.1 CAD Results

The analysis of the complete system is done at the CAD level using the XA simulator from Synopsys and device models from a 40nm CMOS technology process. The circuit is implemented with a tristate-buffer master-slave D flip-flop circuit [7] as DUT, and 100 stages of Basecell are used to make the complete system. The simulation results shown are based on typical models. The misalignment between measured and actual values of clock-to-Q delay, shown in Fig. 6 and Fig. 8, is due to the error introduced by the different path delays of the MUX lines for the different values of the selection inputs required for enabling oscillation in the different modes, and due to the difference in rise and fall delays of internal cells. The estimated error introduced by the MUX is approximately 8-10ps and that due to the difference in rise and fall delays is 5-7ps, as obtained from the characterized library database. The measured hold time at 1V and 25C is around 5ps and the setup time is around 70ps. The measured clock-to-Q delay is 132ps. More analysis across different PVT corners is required for complete validation of the circuit.

The power values shown in Fig. 7 and Fig. 9 are obtained by calculating the average current flowing through the power supply of the DUT. For dynamic power, the circuit is operated in shift register mode, wherein the input data rate is varied with respect to the clock frequency; for the static measurement, the average of the static current in different clock, data and output configurations has been plotted. The power values give the actual power through the DUT, excluding the power dissipation in the interface circuit, and are therefore expected to be accurate.
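In other words, the reported power is simply the DUT supply voltage times the average supply current; a minimal sketch (with made-up current samples) of that post-processing step:

# Average-power estimate from DUT supply-current samples (hypothetical values).
vdd = 1.0  # V, DUT supply voltage
i_samples = [4.1e-6, 4.3e-6, 3.9e-6, 4.2e-6]  # A, current samples over the run

i_avg = sum(i_samples) / len(i_samples)
p_avg = vdd * i_avg
print(f"average DUT power ~ {p_avg * 1e6:.2f} uW")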

Fig. 6. Clock-to-Q Delay (ps) vs. Clock-Data Path Delay (ps) for Hold Time Estimation (measured and actual TCP-Q curves)

Fig. 7. Dynamic Current in amps through hundred DUT stages vs. Data Activity Rate w.r.t. Clock at 1V, 25C and 10MHz Clock Frequency

Fig. 8. Clock-to-Q Delay (ps) vs. Clock-Data Path Delay (ps) for Setup Time Estimation (measured and actual TCP-Q curves)

Fig. 9. Leakage Current in amps through hundred DUT stages vs. Applied Voltage at 150C


Fig. 10. Mercury_C40LP test chip (3mm X 3mm), containing Ultra Low Voltage IPs, Fourtune Memory Cuts, BISC for Access Time Characterization, Fourtune ALLCELL structures, Low Power Blocks 1 and 2, and Ring Oscillator Structures

5.2 Silicon Results

A subset of the system, for the measurement of the clock-to-Q delay of a tri-state latch based master-slave D flip-flop circuit [6], is implemented on the Mercury test-chip in a 40nm CMOS process from SAMSUNG. The results, extracted across different voltages and temperatures on multiple dies at package level, are shown in Fig. 11. At lower voltage levels there is a higher misalignment between CAD and silicon values, which is due to model misalignment. At lower voltages the average error percentage is around 12%, which reduces to 2% towards the higher voltage side.
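The quoted error percentages correspond to the relative deviation between CAD and silicon delay values at each corner; a small sketch of that comparison (the numbers are placeholders, not the Fig. 11 data):

# Relative CAD-vs-silicon error for the clock-to-Q delay at a few (T, V) corners.
# The delay pairs are hypothetical placeholders, not the measured Fig. 11 data.
corners = {
    ("-40C", "0.90V"): (2.60e-10, 2.30e-10),  # (silicon, CAD) in seconds
    ("25C", "1.00V"): (1.50e-10, 1.45e-10),
    ("125C", "1.10V"): (1.10e-10, 1.08e-10),
}

for (temp, volt), (si, cad) in corners.items():
    err = abs(cad - si) / si * 100
    print(f"{temp} {volt}: CAD-vs-silicon error = {err:.1f}%")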

Fig. 11. CAD vs. Silicon Results for Clock-to-Q Delay (sec): CP-Q rise and fall arcs across temperature (−40, 25, 125 °C) and voltage (0.90, 1.00, 1.10 V) corners


6 Conclusion

An accurate on-chip measurement system has been presented for the characterization of flip-flops and latches, which is also useful for SPICE model validation and comparative analysis of different structures. Silicon results obtained from a 40nm CMOS process test-chip have been presented for a subset of the measurement apparatus, which validates the principle of the measurements; the analysis of the complete system has been shown at the CAD level based on SPICE simulations and is to be further validated on silicon. The silicon and CAD results show that the measurement apparatus gives accurate results for delay and power, and the error in the measurements is within acceptable limits. The system could be improved further for characterization at different output loads and clock transitions.

References

[1] Nikolic, B., et al.: Improved sense-amplifier-based flip-flop: Design and measurements. IEEE J. Solid-State Circuits 35, 876–884 (2000)

[2] Singh, A.P., Panwar, N.S., et al.: On Silicon Timing Validation of Digital Logic Gates - A Study of Two Generic Methods. In: 25th International Conference on Microelectronics, pp. 424–427 (2006)

[3] Nedovic, N., et al.: A Test Circuit for Measurement of Clocked Storage Element Characteristics. IEEE Journal of Solid State Circuits 39(8) (August 2004)

[4] Rosenberger, F., et al.: Flip-flop Resolving Time Test Circuit. IEEE Journal of Solid State Circuits SC-17(4) (August 1982)

[5] Veggetti, A., et al.: Random sampling for on-chip characterization of standard-cell propagation delay. In: Fourth International Symposium on Quality Electronic Design, pp. 41–45 (2003)

[6] Weste, N., Eshraghian, K.: Principles of CMOS VLSI Design, pp. 317–324. Pearson Education Asia

[7] Yuan, J., et al.: New Single-Clock CMOS Latches and Flipflops with Improved Speed and Power Savings. IEEE Journal of Solid State Circuits 32(1), 62–69 (1997)


A Low-Voltage Log-Domain Integrator Using MOSFET in Weak Inversion

Lida Ramezani

Electrical & Computer Engineering Dept., Ryerson University, George Vari Engineering and Computing Center, 245 Church St.,

Toronto, Ontario, Canada, M5B 2K3 [email protected], [email protected]

Abstract. In this paper a low-voltage integrator circuit using MOSFETs in the sub-threshold region is presented. The integrator is a current-mode log-domain circuit. The EKV MOSFET model is used for sub-threshold region simulations, with model parameters of the IBM CMOS 130nm technology. The integrator works with a 500 mV single supply voltage and its input current range is as high as the bias current of the input transistor. According to CADENCE simulation results, for a 1 pF integrating capacitor and a bias current of 20nA, the cutoff frequency is 113.4 kHz and the power consumption is 45.44 nW. The integrator cutoff frequency is tuned from 1.083 kHz to 1.023 MHz using a variable integrating capacitor in the range of 10 pF-0.1 pF.

Keywords: Nonlinear electronics; Sub-threshold CMOS; Log-domain integrator; Companding method; low voltage; low power.

1 Introduction

Low-power integrated filters are required in portable systems such as telecommunication receivers and implanted biomedical integrated circuits. Transconductor-capacitor (Gm/C) filters are a kind of current-mode active filter which can be used in a wide range of frequencies, from a few Hz in biomedical systems to several MHz in the baseband or IF part of telecommunication receivers.

In active Gm/C filters, passive inductors are replaced by active gyrator-C circuits. Active filters have smaller silicon area in comparison to passive filters. The pass-band gain, cutoff and centre frequency, and quality factor of active filters are easily tuned, and it is possible to obtain higher quality factors in active filters. However, active filters consume power and have limited dynamic range. In most applications, the design of low-voltage, low-power active filters with sufficient dynamic range and bandwidth is intended.

Low-voltage and low-current techniques are used in low-power circuits. Rail-to-rail designs, use of supply multipliers, multistage circuit designs and use of bulk-driven transistors are among the low-voltage strategies. Adaptive biasing and sub-threshold biasing are kinds of low-current design methods [1]. In [2] continuous-time low-voltage current-mode filters are discussed. Low-voltage circuits suffer from dynamic range limitations. The maximum input signal is limited by the linear range of the input circuit, and the minimum acceptable input signal is limited by the noise level. The input signal should be several times less than the bias level to reduce the harmonic distortion caused by the nonlinearity of the input circuit. At the same time, the input noise level should be kept as low as possible. For a higher dynamic range, we need a large bias level, which causes large power consumption. There are several linearization techniques such as source degeneration, nonlinear term cancellation, adaptive biasing, and class AB implementation. In these linearization methods several transistors are added to the circuit. Each transistor adds several parasitic capacitances and leads to a more limited bandwidth. Also the power consumption increases with the transistor count. As we intend to design a high-frequency and low-power circuit, we need simple circuits with a low transistor count.

In companding theory, externally linear, internally nonlinear (ELIN) circuits are used to improve the dynamic range. The companding method is useful for improving the dynamic range with a low transistor count [1]. The companding method is used in log-domain circuits, where trans-linear devices are the key elements.

In this paper a low-voltage, current-mode, log-domain integrator using MOSFETs biased in the sub-threshold region is presented. In Part 2, the CMOS transistor in sub-threshold or weak inversion mode is discussed and used as a trans-linear element; the companding method and log-domain filters are also introduced. In Part 3, a MOSFET realization of a first-order log-companding filter, i.e., an integrator, is presented and CADENCE simulation results are given. Finally, comparison and conclusions are given in Part 4.

2 MOSFET Biased in Weak Inversion as a Trans-linear Element

In this part, the behavior of the MOSFET in weak inversion is reviewed. Then the trans-linear element and the trans-linear principle are described, and a trans-linear loop using MOSFETs in sub-threshold is presented. The companding method and log-domain filters are also introduced. These concepts and definitions are used in the log-domain MOSFET integrator circuit described in Part 3.

2.1 MOSFET in Weak Inversion

When the gate-source voltage of a MOS transistor is less than the threshold voltage but high enough to create a depletion region at the surface of the silicon, the device operates in weak inversion. This is called the sub-threshold region, and the MOS transistor has exponential voltage-current characteristics. The drain current in weak inversion or the sub-threshold region is given in (1) [3].

I_D = I_t (W/L) exp[(V_GS − V_th)/(n V_T)] [1 − exp(−V_DS/V_T)].  (1)


In (1), W and L are the transistor channel width and length, respectively. I_spec = I_t × (W/L) is called the specific current and depends on physical parameters and technology. The specific current relation is given in (2) [4].

I_spec = I_t (W/L) = 2 n μ C_ox (W/L) V_T^2 = 2 n β V_T^2.  (2)

V_GS is the gate-to-source voltage, V_DS is the drain-to-source voltage, V_th is the threshold voltage and V_T is the thermal voltage, i.e., 25 mV at room temperature. When V_DS >> 3V_T, the drain current is independent of V_DS. The drain current in sub-threshold is less than I_t × (W/L) [3]. In weak inversion, there is a voltage divider between the oxide capacitance (C_ox) and the depletion region capacitance (C_js). In (1), n is the coefficient of this voltage divider, as given in (3).

n = 1 + C_js/C_ox ≈ 1.5.  (3)

MOSFET trans-conductance gain in weak inversion is given in (4) and transition frequency in weak inversion is according to (5). [3]

g_m = ∂I_D/∂V_GS = I_D/(n V_T).  (4)

f_T = (1/2π) (I_D/V_T) (1/(W L C_js)).  (5)
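A small numerical sketch of (1) and (4): with an assumed specific current of 100 nA and a gate overdrive of about −60 mV (both hypothetical, chosen so that the current lands near the 20nA bias used later in the paper), the weak-inversion current and transconductance evaluate to:

import math

# Weak-inversion drain current (1) and transconductance (4); illustrative values only.
n, VT = 1.5, 0.025      # assumed slope factor and thermal voltage (V)
I_spec = 100e-9         # A, specific current It*(W/L): hypothetical
Vgs_minus_Vth = -0.060  # V, example bias point below threshold
Vds = 0.3               # V, large enough that the drain term is ~1

Id = I_spec * math.exp(Vgs_minus_Vth / (n * VT)) * (1 - math.exp(-Vds / VT))
gm = Id / (n * VT)      # transconductance from (4)

print(f"Id ~ {Id * 1e9:.1f} nA, gm ~ {gm * 1e9:.0f} nS")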

2.2 Trans-linear Principle

A trans-linear element is a physical device whose trans-conductance gain and current through the device are linearly related. In trans-linear elements, the current depends exponentially on the controlling voltage. Considering (1) and (4), a MOSFET biased in the sub-threshold region is a trans-linear element. A closed loop containing an equal number of oppositely connected trans-linear elements is called a trans-linear loop. According to the trans-linear principle [2], in a trans-linear loop the product of the current densities in the elements connected in the clockwise (CW) direction is equal to the corresponding product for the elements connected in the counter-clockwise (CCW) direction.

∏_{n∈CW} I_n = ∏_{m∈CCW} I_m.  (6)

A CMOS trans-linear loop composed of MOS transistors biased in weak inversion is shown in Fig. 1. The relation between the transistor drain currents in Fig. 1 is given in (7).


i_D1 · i_D2 = i_D3 · i_D4.  (7)

2.3 Companding Method and Log-Domain Filters

In the companding method, compressor and expander circuits are used. The compressor circuit compresses the dynamic range of the input; it amplifies weak signals so that they can be transmitted with noise immunity. The expander circuit expands the dynamic range; it reduces the amplitude of the amplified signals and thus of the noise picked up during transmission [1]. The logarithm is a compressor function and the exponential is an expander function. A block diagram of a companding circuit is shown in Fig. 2.

Fig. 1. A trans-linear loop with CMOS in sub-threshold

Fig. 2. Block diagram of a companding circuit

In 1990 Seevinck invented a circuit using bipolar junction transistors (BJTs) which he called a companding current-mode integrator. That circuit was effectively a first-order log-domain filter [5]. In a log-domain integrator, currents with an inherently large dynamic range are compressed logarithmically when transformed into voltages (prior to the integration on a capacitor) and expanded exponentially afterwards when transformed back to currents [6]. Companding can be used in filters to enable supply voltage reduction without signal-to-noise ratio degradation [6].

Log-domain filters are a type of externally linear, internally nonlinear (ELIN) filters. Log-domain and companded filter synthesis methods are discussed in [7]. Log-domain filters have the advantages of reduced circuit complexity, wider bandwidth, wider dynamic range and lower power consumption [7]. Different types of log-domain filters, including class A, class AB and syllabic companding, are described in [7]. One filter synthesis method is cascading, in which first-order and second-order building blocks are used. The integrator is a first-order filter, and in Part 3 the design and simulation results of a first-order log-companding filter using MOSFETs in weak inversion are given, which is a low-voltage and low-power integrator.

3 Circuit Design and Simulation Results

In this section, the CMOS realization of a log-domain integrator and its transfer function are presented. Then CADENCE simulation results using the EKV MOSFET model in the sub-threshold or weak inversion region are given.

3.1 Circuit Design

The MOSFET realization of a log-domain integrator (first-order filter) using MOSFET transistors biased in the sub-threshold region is shown in Fig. 3.

Fig. 3. MOSFET realization of CMOS log-domain integrator with ideal current sources

In Fig. 3, M1 is used as the log compressor that converts the input current to the compressed voltage VGS1, M2 is a level shifter, M3 and C are the integrator circuit core elements and M4 is the expander transistor. M1, M2, M3 and M4 form a trans-linear loop, and according to (6) the relationship between their drain currents is given in (8).

(i_in(t) + I_1) I_2 = (i_C(t) + I_3) i_out(t).  (8)

The capacitor voltage is equal to VGS4, i.e., the gate-source voltage of M4. M4 is biased in sub-threshold, and according to (1) VGS4 is a logarithmic function of the drain current of M4, as given in (9).

v_C(t) = V_GS4 = n V_T ln[i_out(t)/(I_t (W/L))] + V_th.  (9)

The capacitor current is given in (10).

i_C(t) = C dv_C(t)/dt = (C n V_T / i_out(t)) di_out(t)/dt.  (10)

From (8), (9) and (10), the first-order differential equation between the input current and the output current is obtained, as shown in (11).

(i_in(t) + I_1) I_2 = [(C n V_T / i_out(t)) di_out(t)/dt + I_3] i_out(t).  (11)

The first-order differential equation of the integrator circuit of Fig. 3 is therefore as given in (12).

(C n V_T / I_3) di_out(t)/dt + i_out(t) = (I_2/I_3)(i_in(t) + I_1).  (12)
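The algebra leading from (8)-(10) to (12) can be checked symbolically; the SymPy sketch below uses the trans-linear relation (8) in the product form reconstructed above, so it should be read as a verification of that reading rather than a definitive restatement of the original derivation.

import sympy as sp

t = sp.symbols('t')
C, n, VT, I1, I2, I3, Ispec, Vth = sp.symbols('C n V_T I_1 I_2 I_3 I_spec V_th', positive=True)
i_in = sp.Function('i_in')(t)
i_out = sp.Function('i_out', positive=True)(t)

# (9): the capacitor voltage is the log-compressed output current.
v_C = n * VT * sp.log(i_out / Ispec) + Vth

# (10): capacitor current i_C = C * dv_C/dt = C*n*VT/i_out * d(i_out)/dt.
i_C = sp.simplify(C * sp.diff(v_C, t))

# (8): trans-linear loop relation, (i_in + I1)*I2 = (i_C + I3)*i_out.
res8 = (i_in + I1) * I2 - (i_C + I3) * i_out

# (12): (C*n*VT/I3)*d(i_out)/dt + i_out = (I2/I3)*(i_in + I1).
res12 = C * n * VT / I3 * sp.diff(i_out, t) + i_out - I2 / I3 * (i_in + I1)

# (8) and (12) differ only by the constant factor -I3, so this prints 0.
print(sp.simplify(sp.expand(res8 + I3 * res12)))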

The integrator transfer function is given in (13) and its cutoff frequency and pass-band gain (kPB) are given in (14), (15).

H(s) = i_out(s)/i_in(s) = k_PB / (1 + s/ω_0).  (13)

ω_0 = I_3 / (C n V_T).  (14)

k_PB = I_2 / I_3.  (15)

For a higher cutoff frequency, a smaller capacitor (C) and a larger bias current (I3) are needed. Cutoff frequency tuning can be done by changing the bias current (I3) and the capacitor value (C).
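As a numerical illustration of (14), the sketch below evaluates the cutoff frequency for the bias current and capacitor used in the paper (I3 = 20nA, C = 1 pF) under a few assumed slope-factor values; the reported simulated value of 113.4 kHz falls in this range, the exact figure depending on the effective n and on parasitics at the integrating node.

import math

# Cutoff frequency from (14): w0 = I3 / (C * n * VT), f0 = w0 / (2*pi).
VT = 0.025  # V, thermal voltage at room temperature
I3 = 20e-9  # A, integrator core bias current
C = 1e-12   # F, integrating capacitor

for n in (1.0, 1.3, 1.5):  # slope factor: assumed values, not given explicitly
    f0 = I3 / (C * n * VT) / (2 * math.pi)
    print(f"n = {n:.1f} -> f0 ~ {f0 / 1e3:6.1f} kHz")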


In Fig. 4, the MOSFET log-domain integrator with non-ideal current sources is shown. M1, M2, M3 and M4 are the log-domain integrator elements as described in Fig. 3, while M5, M6 and M7 are current-source transistors mirrored from the main bias current branch composed of M9, M10 and RBIAS. M8 is an active load. All transistors except M8 are biased in sub-threshold.

3.2 CADENCE Simulation Results

In this section, CADENCE simulation results of the circuit shown in Fig. 4 are given. The EKV model, which is a precise model for MOSFETs in weak inversion, is used by CADENCE.

Fig. 4. CMOS log-domain integrator circuit

All transistors have the minimum size, with a channel length of 480nm and a channel width of 120nm. The bias currents of the non-ideal current sources in Fig. 4 are 20nA, provided by M9-M10 with a 50 kΩ bias resistor (RBIAS). The supply voltage is 500 mV and, according to the simulation results, all transistors are biased in region 3, i.e., the sub-threshold region. The input signal is a sine-wave current with a frequency of 1 kHz and an amplitude of 20nA. The maximum amplitude of the input signal is equal to the bias current, and it should be less than I_t × (W/L) to keep the input transistor in sub-threshold. For a higher input range, a larger bias current for the input/compressor transistor is needed and the power consumption increases. Also a larger compressor transistor ratio (W1/L1) is needed to stay in sub-threshold, and the parasitic capacitances of the MOSFETs increase. According to (5), the transition frequency of the transistor decreases when it has a large size; therefore the maximum applicable cutoff frequency of the integrator circuit decreases. Naturally, the trade-off between power consumption, bandwidth and input range exists, but in this nonlinear integrator the maximum input range is as high as the compressor transistor bias current, and small-signal limitations and distortion issues do not exist when the input transistor works in sub-threshold.

The transient and frequency response of the integrator circuit of Fig. 4 with a 1 pF integrating capacitor are shown in Fig. 5. In the right-side waveforms, the input signal (/Iin/MINUS), which is a 1 kHz sine wave with 20nA amplitude, and the output signal, which is the drain current of the expander transistor (/T4/D), are shown. Also the voltages at the gates of the compressor transistor M1 (/net012) and the expander transistor M4 (/net032) are shown. In the transient response, the gate-source voltages of M1 and M4 are logarithmic functions of their drain currents, as given in (16).

V_gs(t) ∝ log(I_max sin(ωt) + I_DC).  (16)

I_max and I_DC are 20nA and ω corresponds to 1 kHz. In Fig. 6 the integrator cutoff frequency is tuned from 1.089 kHz to 1.023 MHz by varying the capacitor in the range of 100 pF-0.1 pF, respectively. In Fig. 7, the integrator core transistor width and its current-mirror transistor width (W3 = W7 = k×480nm) are changed from 480nm to 4.8μm; the integrator pass-band gain increases from 0dB to 4.5dB, and the cutoff frequency increases from 113.4 kHz to 974.2 kHz.

Fig. 5. Frequency response and transient waveforms of circuit in Fig.4


Fig. 6. Tuning of -3dB/cutoff frequency using variable capacitor in integrator circuit of Fig.4

Fig. 7. Cutoff frequency and pass-band gain in the circuit of Fig. 4 with a 1 pF capacitor for 3 different width sizes of M3, M7 (W3 = W7 = K×480nm)


4 Discussion and Conclusions

Low-voltage current-mode filters were motivated by the need for high-frequency filters with a low supply voltage in portable equipment applications. Low-voltage designs suffer from dynamic range limitations due to the nonlinear behavior of transistors. In the companding method, externally linear, internally nonlinear (ELIN) circuits are used. This method is useful in low-power and low-voltage circuits to improve the maximum input range.

In this paper a new log-domain integrator circuit using MOSFETs in the sub-threshold region is introduced. MOSFETs in sub-threshold act as trans-linear elements, and the designed circuit is a first-order ELIN filter. CADENCE simulation results with IBM 130nm technology parameters are given. This low-voltage circuit works with a 500 mV supply voltage.

The cutoff frequency of the integrator can be tuned in two ways. By changing the capacitor value from 0.1 pF to 10 pF, the cutoff frequency changes from 1.023 MHz to 1.089 kHz; this approach is suggested for coarse frequency tuning, and the power consumption remains nearly constant, between 45.4 nW and 54 nW. Alternatively, by changing RBIAS, the bias current in the integrator core changes and the cutoff frequency can be tuned; in this case the power consumption will increase, and this approach is suggested for fine frequency tuning.

In this design, the bias current and a proper size of the compressor transistor (M1) are chosen with regard to the maximum input range. Also the bias current and size of the integrator core transistor (M3) are chosen with regard to the desired cutoff frequency together with an appropriate capacitor value. Considering the low-power constraints, the bias currents should be as low as possible. Considering the high-frequency constraints, the parasitic capacitances and transistor sizes should be as small as possible. Trade-offs between power consumption, bandwidth and input range exist. A summary of the simulation results is given in Table 1.

Table 1. Summary

Integrator core bias current (I3) | Integrator capacitor | Power consumption | Pass-band gain | Cutoff frequency
20 nA | 1 pF | 50 nW | 0 dB | 113.4 kHz
20 nA | 0.1 pF–10 pF | 45.44 nW–54 nW | 0 dB | 1.023 MHz–1.089 kHz
20 nA–0.20 μA | 1 pF | 50 nW–133.8 nW | 0 dB–4.5 dB | 113.4 kHz–974.2 kHz

Acknowledgment

The author wishes to thank the Department of Electrical and Computer Engineering at Ryerson University for their support in using the workstations at the Microsystems research laboratory. Furthermore, I wish to thank Professor Fei Yuan, supervisor of the ICS research group at Ryerson University, for his useful comments.


References

[1] Serra-Graells, F., Rueda, A., Huertas, J.L.: Low-Voltage CMOS Log Companding Analog Design. Kluwer Academic Publishers, Dordrecht (2003)

[2] Sanchez-Sinencio, E., Andreou, A.G.: Low Voltage/Low Power Integrated Circuits and Systems, Low Voltage Mixed Signal Circuits. IEEE Press series in microelectronic systems, ch. 3, pp. 68–72 (1998)

[3] Gray, P.R., Meyer, R.G.: Analysis and Design of Analog Integrated Circuits, 5th edn. John Wiley & Sons Ltd., Chichester (2000)

[4] Enz, C.C., Vittoz, E.A.: Charge-based MOS Transistor Modeling: The EKV Model for Low-Power and RF IC Design. John Wiley & Sons Ltd., Chichester (2006)

[5] Seevinck, E.: Companding Current-mode Integrator: A New Circuit Principle for Continuous-Time Monolithic Filters. Electronics Letters 26(24), 2046–2047 (1990)

[6] Fried, R., Python, D., Enz, C.C.: Compact Log-Domain Current-Mode Integrator with High Transconductance-to-Bias Current Ratio. Electronics Letters 32(11), 952–953 (1996)

[7] Frey, D.: Future Implications on the Log Domain Paradigm. IEE Proc. Circuits Devices Syst. 147(1), 65–72 (2000)


Physical Design Aware Comparison of Flip-Flops for High-Speed Energy-Efficient VLSI Circuits

Massimo Alioto1,2, Elio Consoli3, and Gaetano Palumbo3

1 DIE, University of Siena, 53100 Siena, Italy 2 Currently also with BWRC, UC Berkeley, 94704-1302 Berkeley, California, USA

[email protected], [email protected] 3 DIEES, University of Catania, 95100 Catania, Italy {econsoli,gpalumbo}@diees.unict.it

Abstract. In this paper, an extensive comparison of flip-flop (FF) topologies for high-speed applications is carried out in a 65-nm CMOS technology. This work goes beyond previous analyses in that traditional rankings do not include layout parasitics, which strongly affect both speed and energy and lead to drastic changes in the optimum transistor sizing. For this reason, in this work layout parasitics are included in the circuit design loop by adopting a novel strategy. The obtained results show that the energy efficiency and the performance of FFs are mainly determined by the regularity of their topology and layout. Finally, the area-delay tradeoff is also analyzed for the first time.

Keywords: Energy Efficiency, Clocking, Flip-Flops, High Speed, Energy-Delay, Nanometer CMOS, Interconnects, Layout Impact.

1 Introduction

The selection of flip-flop (FF) topologies is essential for the design of both high-speed and energy-efficient microprocessors [1]. Indeed, in fast micro-architectures with low logic depth, FF delay occupies a significant fraction of the clock cycle [2]. Moreover, together with the circuits devoted to clock generation and distribution, FFs are responsible for a large fraction of the whole chip energy budget [3]-[4].

Various high-speed FFs have been proposed in the past, mainly belonging to the Pulsed and Differential classes [2]. Usually, they feature a transparency window, leading to clock-uncertainty absorption properties but also to a reduced race immunity [2]. However, both setup and hold time values can be arranged regardless of the FF delay value, since they depend on the sizing of gates that do not belong to the FF critical path. Therefore, the real figure of merit concerning the timing of such FFs is the minimum data-to-output delay, measuring the impact of FF speed on the clock cycle [5]-[6]. Given the presence of precharged nodes and the high switching activity in the pulse generator stages, high-speed FFs are distinguished by a high dissipation (e.g., compared to low-energy FFs, such as Master-Slave ones) [5]. Therefore, given that CMOS technology has entered a power-limited regime, identifying the most energy-efficient high-speed FFs is nowadays a decisive issue.


However, the most significant previous comparisons [5]-[10] have not considered nanometer technologies, thereby neglecting the increasing impact of layout parasitics associated with local interconnects, which severely degrade both speed and energy.

In this paper, the ranking of the most representative high-speed FFs in a 65-nm CMOS technology is reconsidered by including the above issue since the early design phases, in order to reach the truly optimal FF sizings corresponding to energy-efficient designs in the Energy-Delay (E-D) space. The framework for FF analysis and design and the considered topologies are briefly presented in Section 2. The ranking of Pulsed and Differential topologies in the E-D space is discussed in Section 3, where the main differences with respect to previous results are pointed out. Section 4 considers FF area and its tradeoff with delay. Finally, conclusions are drawn in Section 5.

2 Framework for FFs Comparison and Selected FF Topologies

2.1 Adopted Analysis/Design Strategies and Inclusion of Layout Impact

As previously stated, the FF delay is identified with the minimum data-to-output delay, which measures the impact of FF timing on the speed of pipelined systems [2],[5]-[6]. The FF energy is extracted by summing the transient (i.e., dynamic and short-circuit) and static (i.e., leakage) contributions, weighted according to the data input switching activity and to the clock period duration (set to 10 times the FF delay), respectively. The test bench adopted to evaluate the FF energy is similar to that in [2],[5]-[6] and is summarized in Appendix D of [11]. Various application conditions [9]-[10] are considered in terms of small, medium and large load C_L, equal to 4, 16 and 64 minimum symmetrical inverters (sized with the minimum channel width of 120 nm and exhibiting about 410 aF of input capacitance), and small, medium and large data input activity, i.e. α = 0.10, 0.25 and 0.50, respectively. In the rest of the paper, we assume a load of C_L = 16 minimum inverters and a switching activity α = 0.25 as the "reference case".
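As a rough illustration of this energy bookkeeping, the sketch below (Python, with purely illustrative numbers and helper names, not the paper's actual test bench) combines the transient contribution, weighted by the data switching activity, with the leakage contribution integrated over a clock period set to 10 times the FF delay.

```python
# Minimal sketch of the per-cycle FF energy bookkeeping described above.
# All numbers and names are illustrative, not the paper's data.

def ff_energy_per_cycle(e_transient_per_toggle, p_leakage, ff_delay, alpha=0.25):
    """Combine transient and static energy over one clock cycle.

    e_transient_per_toggle : dynamic + short-circuit energy of one data
                             transition (J)
    p_leakage              : static (leakage) power of the FF (W)
    ff_delay               : minimum data-to-output delay of the FF (s)
    alpha                  : data input switching activity
    """
    t_clk = 10.0 * ff_delay           # clock period set to 10x the FF delay
    e_dynamic = alpha * e_transient_per_toggle
    e_static = p_leakage * t_clk      # leakage integrated over one cycle
    return e_dynamic + e_static

# Example with made-up values: 20 fJ per toggle, 50 nW leakage, 40 ps delay.
print(ff_energy_per_cycle(20e-15, 50e-9, 40e-12, alpha=0.25))
```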

The comparison is carried out by analyzing the energy-efficient curves (EECs) of the FFs in the E-D space. Such curves are extracted by minimizing various figures of merit (FOMs) as described in [11] (due to lack of space, please refer to that paper for the procedures and examples concerning the detailed FF design strategy).

To gain an intuitive understanding of the results independently of technology, they are normalized to reference values typical of the considered 65-nm CMOS technology. In particular, delays are normalized to the fanout-of-4 inverter delay FO4 = 18.27 ps, energies are normalized to 0.202 fJ (the energy dissipated by an unloaded symmetrical minimum inverter during a complete 0→1→0 transition cycle at its output), and areas are normalized to χ², where χ = 200 nm is the minimum pitch of the Metal2 layer. For all the analyses, a 1 V supply voltage is adopted.

The sizing strategy in [11] also accounts for the capacitive parasitics due to local interconnects since the early design phases, for the first time in the literature on FF analysis and design. Indeed, among previous works [2],[5]-[10], only a few consider the layout impact, and merely a posteriori, while most neglect it altogether.


This leads to strong differences between the adopted design strategies and the truly optimal ones, and to the unreliability of previously reported results, given the huge influence that local wires have on both the energy and the delay of FFs.

The detailed methodology to extract capacitive parasitics is based on geometrical calculations performed on stick diagrams and on a realistic modeling of the per-unit-length capacitances of the various interconnecting Poly and Metal layers (thereby including the effect of capacitive coupling between adjacent and stacked wires). Such a methodology is accurately described in Appendix A of [11] and has been validated through the realization of several actual layouts of the considered FFs, corresponding to the minimum energy-delay-product designs in the reference case.
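A minimal sketch of this kind of geometry-based estimate is shown below; the per-unit-length coefficients are placeholders rather than the characterized values of [11], and the sketch only illustrates how length, overlap and spacing enter the calculation.

```python
# Illustrative geometry-based estimate of a local wire's capacitance, in the
# spirit of the stick-diagram calculation described above. The per-unit-length
# coefficients below are placeholders, not the paper's characterized values.

C_GROUND_PER_UM = {"poly": 0.08e-15, "metal1": 0.06e-15, "metal2": 0.05e-15}
C_COUPLE_PER_UM = {"poly": 0.05e-15, "metal1": 0.09e-15, "metal2": 0.10e-15}

def wire_capacitance(layer, length_um, neighbors):
    """Estimate wire capacitance (F) from its length and its same-layer
    neighbors, given as (overlap_length_um, spacing_um) pairs."""
    c = C_GROUND_PER_UM[layer] * length_um
    for overlap_um, spacing_um in neighbors:
        # Coupling grows with overlap and shrinks with spacing (0.2 um is an
        # arbitrary reference spacing chosen for this sketch).
        c += C_COUPLE_PER_UM[layer] * overlap_um * (0.2 / spacing_um)
    return c

# A 12 um Metal2 wire with two neighbors, 8 um and 5 um of overlap.
print(wire_capacitance("metal2", 12.0, [(8.0, 0.2), (5.0, 0.14)]))
```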

Local interconnect parasitic capacitances are estimated with an error of 10-25%, while the error in the delay-energy estimation is lower (5-10%). It is worth noting that the values of such capacitances are quite similar to those of the transistor-related (gate and drain) capacitances at the various FF nodes, i.e. they introduce extremely significant branching and parasitic effects. As a consequence, the optimization leads to larger transistor sizes (up to 2X) in order to compensate the resulting speed degradation, and hence energy increases both because of the additional interconnect capacitances themselves and because of the larger transistor sizes. This confirms the huge impact that such parasitics have on both energy and delay.

2.2 High-Speed FF Topologies: Pulsed and Differential Classes

In this paper we focus on the comparison of high-speed FFs and hence consider the Pulsed and Differential topological classes, which feature small delays. On the whole, 11 of the most representative and best-known FFs are selected.

The analyzed Pulsed topologies are the Hybrid Latch FF [12] (HLFF), the Semi-Dynamic FF [13] (SDFF), the UltraSPARC Semi-Dynamic FF [14] (USDFF), the Implicit Push-Pull FF [6] (IPPFF), the Conditional Precharge FF [15] (CPFF), the Static Explicit Pulsed FF [16] (SEPFF) and the Transmission Gate Pulsed Latch [17] (TGPL). The latter two are Explicit Pulsed (EP) circuits, i.e. they employ a pulse generator (PG) providing an actual pulsed clock, whereas the remaining ones are Implicit Pulsed (IP), i.e. they emulate a pulsed clock through the temporary enabling of some (typically two) transistors according to the delay of an inverter chain [2].

The Differential FFs investigated are the Modified Sense-Amplifier FF [18] (MSAFF), the Skew-Tolerant FF [19] (STFF), the Conditional Capture FF [20] (CCFF) and the Variable Sampling Window FF [21] (VSWFF). The operation of the latter two resembles that of Pulsed FFs, since they employ a transparency window.

The FF schematics are reported in Fig. 1, together with the widths of the transistors in the data-to-output paths, which are optimized as independent design variables [11].

3 Energy-Delay Tradeoff and Energy-Efficient Curves

3.1 Pulsed FFs

The EEC of the IP-EP FFs, derived in the reference case, is reported in Fig. 2a. From this figure, the TGPL is clearly the most energy-efficient Pulsed FF in the high-speed region and in part of the low-energy one. This is expected from the simplicity of the basic latch structure of TGPL (and hence the low impact of layout parasitics).


Fig. 1. Schematics of the analyzed FFs: HLFF (a), SDFF (b), USDFF (c), IPPFF (d), CPFF (e), SEPFF (f), TGPL (g), MSAFF (h), STFF (i), CCFF (j), VSWFF (k)


This good energy efficiency of TGPL is remarkable since here every FF is considered with its own Pulse Generator (PG), whereas energy could be further reduced by sharing the PG among several FFs. From Fig. 2, in the deep low-energy region, the CPFF and IPPFF are the best Pulsed FFs. Indeed, both are Implicit Pulsed and hence do not require a PG. In addition, the CPFF employs a conditional technique to avoid unnecessary precharge [15], while the IPPFF reduces the load on the precharged node by using a push-pull second stage.

SEPFF is fast, but dissipates more than TGPL in all conditions and is hence less energy-efficient. Its average delay is also nearly 1.2X greater than that of TGPL. This is somewhat different from previous works [8], which predicted the same speed for a medium load (such as C_L = 16). Again, this is due to the heavier impact of interconnects, since SEPFF has a slightly more complex layout than TGPL.

Among all the Pulsed FFs, the semi-dynamic ones (SDFF and USDFF) exhibit the worst performance in the whole E-D space. The reason is again related to layout complexity. In contrast with [5],[8],[13], where it is stated that such FFs have E-D features very similar to those of the HLFF, we find that the latter is significantly more energy-efficient throughout the whole E-D space (except in the very high-speed region, where they are similar). Indeed, HLFF has a much simpler schematic and hence its layout has much shorter interconnects, thus reducing energy consumption.

Moreover, in contrast to previous results [6],[14], USDFF does not outperform SDFF, again because of its more complex routing. Given the mirror-like structure of the two circuits, the local wire capacitances can be compared by averaging the results over all the nodes and all the considered sizing strategies. On average, we find that parasitics are nearly 60% larger for USDFF than for SDFF.

All single-edge-triggered (SET) IP FFs are slower than EP FFs. In particular, by averaging the delays corresponding to the various optimized FOMs, IP FF delays are nearly 1.3X greater than those of EP FFs. This happens mainly because IP FFs need stages with three stacked transistors in their critical path, whereas EP FFs exploit a real pulsed signal and need stages with only two stacked transistors. In particular, IPPFF has the worst minimum delay among IP FFs, since it exhibits three- and four-stage paths for the rising and falling data transitions, which outweighs the advantage provided by the push-pull stage [6].

To understand the dependence of the above results on the load value, the EECs of Pulsed FFs for C_L = 64 and C_L = 4 are reported in Fig. 2b-c (in both cases α = 0.25). The ranking of IP FFs does not change significantly, except for IPPFF which, having a greater number of stages in its data-to-output paths, becomes relatively faster for a large load. As concerns EP FFs, unlike [9], where the speed of a two-stage FF (TGPL) is overcome by that of a three-stage topology (SEPFF) when the load is large enough (64 minimum inverters), the SEPFF still shows an average 1.1X (1.3X) delay increment even for C_L = 64 (C_L = 4). When the load is small (C_L = 4), TGPL is the most energy-efficient Pulsed FF in practically all of the E-D space.

To understand the effect of switching activity, the EECs for α = 0.1 and α = 0.5 are reported in Fig. 2d-e (in both cases C_L = 16). The main changes occur in the low-energy region, where the CPFF becomes more energy-efficient for α = 0.1, since it takes advantage of the conditional precharge. Conversely, for α = 0.5, the IPPFF becomes the most energy-efficient Pulsed FF in the deep low-energy region, whereas CPFF and SEPFF (both exhibiting pseudo-static first stages) experience a considerable dissipation increase due to the high data activity rate.


As a final remark, the overall superiority of EP over IP FFs is explained by considering that, in nanometer technologies, IP FFs suffer from a complex routing between the stages involved in the data-to-output paths, which thus need to be oversized to avoid a speed penalty. This is all the more relevant since EP FFs can benefit from a further energy reduction when the PG is shared among various FFs.

Fig. 2. Implicit-Explicit Pulsed FFs: reference case (a), C_L = 64 (b), C_L = 4 (c), α = 0.1 (d), α = 0.5 (e). In (b)-(c), α = 0.25. In (d)-(e), C_L = 16.



3.2 Differential FFs

The EECs of the SET Differential FFs in the reference case are reported in Fig. 3a. From this figure, the E-D space is split into two regions: the high-speed one, where the STFF is the most energy-efficient, and the low-energy one, where the MSAFF is the best Differential FF. In particular, STFF is the fastest among all the analyzed FFs. For instance, the average delay of TGPL is 1.1X greater than that of the STFF, whereas those of MSAFF, CCFF and VSWFF are 1.8X, 1.3X and 1.4X greater, respectively.

These differences in the speed of the Differential FFs can be explained as follows: all of them have equal second (skewed inverter) and third (push-pull) stages, which are very fast. As regards the first stage, the speed of MSAFF is affected by the load imposed by the cross-coupled inverters, whose NMOS transistors belong to the complementary critical paths (although the sense-amplifier nature is useful for level restoring). The first stage of CCFF and VSWFF does not have this drawback and is significantly faster, but not as fast as the first stage of STFF, where only two stacked NMOS transistors are employed thanks to the use of additional driving NOR gates.

The high energy efficiency of MSAFF in the low-energy region is due to its relatively simpler layout and to the lower impact of layout parasitics, which allows transistors to be downsized with minor performance loss with respect to STFF, CCFF and VSWFF. For analogous reasons, CCFF and VSWFF, which have an extremely complex routing, are never the most energy-efficient. This is in contrast to what is claimed in many papers [2],[15],[20]-[21], where the conditional capture property is praised as a very efficient technique to reduce energy at a negligible speed penalty. This is no longer true in nanometer technologies, where the impact of local wires is considerable (to maintain a good speed, such FFs need to be strongly oversized).

Given the similar topology of the considered Differential FFs, the same ranking is obtained regardless of the load C_L. Instead, switching activity has a significant impact on the comparison, as shown in Fig. 3b-c, where the EECs derived for α = 0.1 and α = 0.5 are plotted (in both cases C_L = 16). In detail, for α = 0.1, CCFF and VSWFF become the most energy-efficient in the region around the minimum-energy point. For α = 0.5, their EECs move far away from the MSAFF and STFF ones, in contrast to [20], where it is stated that conditional capture FFs have a reasonable energy consumption even for such a data transition rate. Note that some of the considered Differential FFs [19]-[20] have complex IP single-ended counterparts, whose energy efficiency is always worse than that of the other single-ended topologies.

4 Area and Tradeoff with Delay

The silicon area occupied by the FFs can be accurately estimated by using the same procedure adopted to estimate the interconnect lengths (previous works did not analyze this aspect [2],[4]-[10],[12]-[21]). Table 1 reports the absolute and normalized area of the various FFs under three typical optimum sizings, ranging from the most speed-oriented to the most energy-oriented FOM.

Area is mostly dictated by the topological complexity and we can draw the following main conclusions, which roughly hold for all the considered sizings:

− Conditional Differential FFs (CCFF and VSWFF) have the greatest area;
− HLFF and MSAFF have very small area. Indeed, MSAFF (despite its Differential nature) takes advantage of its regularity, and HLFF is the simplest considered FF.


As concerns EP FFs, the values in Table 1 are somewhat pessimistic. Indeed, when the PG is shared among an increasing number of FFs, the area increase due to the PG becomes small.

Fig. 3. Differential FFs: reference case (a), α = 0.1 (b), α = 0.5 (c) (C_L = 16)

Table 1. Absolute and normalized area of the considered FFs for various optimum sizings

         Speed-oriented sizing   Intermediate sizing   Energy-oriented sizing
         Area [χ²]               Area [χ²]             Area [χ²]
HLFF     681.6  (1.00x)          462.4  (1.00x)        462.4 (1.00x)
SDFF     869.6  (1.28x)          703.2  (1.52x)        588.0 (1.27x)
USDFF    983.2  (1.44x)          816.8  (1.77x)        644.8 (1.39x)
IPPFF    816.8  (1.20x)          624.0  (1.35x)        603.2 (1.30x)
CPFF     912.0  (1.34x)          704.0  (1.52x)        541.6 (1.17x)
SEPFF    946.4  (1.39x)          759.2  (1.64x)        644.0 (1.39x)
TGPL     780.8  (1.15x)          635.2  (1.37x)        552.0 (1.19x)
MSAFF    691.2  (1.01x)          504.0  (1.09x)        504.0 (1.09x)
STFF     1202.4 (1.76x)          765.6  (1.57x)        724.0 (1.57x)
CCFF     1397.6 (2.05x)          1106.4 (1.74x)        804.0 (1.74x)
VSWFF    1397.6 (2.05x)          1106.4 (1.74x)        804.0 (1.74x)



The area-delay tradeoff is illustrated for the reference case in Fig. 4. From this figure, the area-delay tradeoff closely resembles the energy-delay tradeoff, since the overall energy dissipation is strongly related to the area and the size of the circuits. Note the very good tradeoff offered by the HLFF in the delay range of 3-6 FO4.

We also analyze the area degradation versus sizing (i.e., when optimizing FOMs where more emphasis is given to speed). The results in Fig. 5 (Differential and Pulsed FFs are depicted with dotted and dashed lines, respectively) refer to the reference case and are normalized with respect to the minimum area of each FF, obviously achieved when simply minimizing the energy.

Differential FFs see the highest relative increase in their area (up to 1.8X) when they are progressively up-sized for smaller delays. Indeed, their complex layouts and the high branching effects due to local wire parasitics and additional gates (not lying in the data-to-output paths) require a significant transistor oversizing of their critical stages. Pulsed FFs (both IP and EP) show area increments up to 1.4-1.7X.

Fig. 4. Area-Delay tradeoff in the reference case

Fig. 5. Area degradation when considering the optimum sizings minimizing various FOMs

(Fig. 4 plots (Area)/χ² versus D/FO4 for all eleven FFs, with the TGPL, HLFF and STFF curves highlighted; Fig. 5 plots (Area)/(Area)_Emin versus the optimized FOM, from ED⁵ down to E_min.)


5 Conclusion

In this paper, a thorough comparison in the energy-delay-area space of several high-speed FFs (Pulsed and Differential) in a nanometer (65-nm) CMOS technology has been carried out. The analysis showed that, in many cases, the results differ from previous papers because the impact of local interconnect parasitics has been explicitly included since the early design phases. As a general remark, simpler basic structures are rewarded in nanometer technologies because of the strong impact of layout parasitics. In particular, EP topologies, and specifically the TGPL, have been recognized as the best high-speed FF topologies in a very wide range of applications.

References

1. Kurd, N., et al.: A Family of 32nm IA Processors. In: 2010 IEEE ISSCC (2010)
2. Oklobdzija, V., et al.: Digital System Clocking: High-Performance and Low Power Aspects. Wiley-IEEE Press (2003)
3. Alioto, M., et al.: Flip-Flop Energy/Performance versus Clock Slope and Impact on the Clock Network Design. IEEE TCAS-I (in press)
4. Nedovic, N., et al.: Dual-Edge Triggered Storage Elements and Clocking Strategy for Low-Power Systems. IEEE TVLSI 13(5), 577-590 (2005)
5. Stojanovic, V., et al.: Comparative Analysis of Master-Slave Latches and Flip-Flops for High-Performance and Low-Power Systems. IEEE JSSC 34(4), 536-548 (1999)
6. Giacomotto, C., et al.: The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements. IEEE JSSC 42(6), 1392-1404 (2007)
7. Markovic, D., et al.: Analysis and Design of Low-Energy Flip-Flops. In: 2001 ISLPED, pp. 52-55 (2001)
8. Tschanz, J., et al.: Comparative Delay and Energy of Single Edge-Triggered and Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors. In: 2001 ISLPED, pp. 147-152 (2001)
9. Heo, S., et al.: Load-Sensitive Flip-Flop Characterization. In: 2001 IEEE CSW-VLSI, pp. 87-92 (2001)
10. Heo, S., et al.: Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy. IEEE TVLSI 15(9), 1060-1064 (2007)
11. Alioto, M., et al.: General Strategies to Design Nanometer Flip-Flops in the Energy-Delay Space. IEEE TCAS-I (in press)
12. Partovi, H., et al.: Flow-Through Latch and Edge-Triggered Flip-Flop Hybrid Elements. In: 1996 IEEE ISSCC, pp. 138-139 (1996)
13. Klass, F., et al.: A New Family of Semidynamic and Dynamic Flip-Flops with Embedded Logic for High-Performance Processors. IEEE JSSC 34(5), 712-716 (1999)
14. Heald, R., et al.: A Third Generation SPARC V9 64-b Microprocessor. IEEE JSSC 35(11), 1526-1538 (2000)
15. Nedovic, N., et al.: Conditional Techniques for Low Power Consumption Flip-Flops. In: 2001 IEEE ICECS, vol. 2, pp. 803-806 (2001)
16. Zhao, P., et al.: Low Power and High Speed Explicit-Pulsed Flip-Flops. In: 2002 IEEE MSCS, pp. 477-480 (2002)
17. Naffziger, S., et al.: The Implementation of the Itanium 2 Microprocessor. IEEE JSSC 37(11), 1448-1460 (2002)
18. Nikolic, B., et al.: Improved Sense-Amplifier-Based Flip-Flop: Design and Measurements. IEEE JSSC 35(6), 876-884 (2000)
19. Nedovic, N., et al.: A Clock Skew Absorbing Flip-Flop. In: 2003 IEEE ISSCC, pp. 342-344 (2003)
20. Kong, B., et al.: Conditional-Capture Flip-Flop for Statistical Power Reduction. IEEE JSSC 36(8), 1263-1271 (2001)
21. Shin, S., et al.: Variable Sampling Window Flip-Flops for Low-Power High-Speed VLSI. IEE Proc. Circuits Devices Syst. 152(3), 266-271 (2005)


A Temperature-Aware Time-Dependent Dielectric Breakdown Analysis Framework

Dimitris Bekiaris, Antonis Papanikolaou, Christos Papameletis, Dimitrios Soudris, George Economakos, and Kiamal Pekmestzi

Microprocessors and Digital Systems Lab, National Technical University of Athens 157 80, Zografou, Athens, Greece

{mpekiaris,antonis,xristos86,dsoudris,geconom,pekmes}@microlab.ntua.gr

Abstract. The shrinking of interconnect width and thickness due to technology scaling, along with the integration of low-k dielectrics, reveals novel reliability wear-out mechanisms that progressively affect the performance of complex systems. These phenomena gradually deteriorate the electrical characteristics, and therefore the delay, of interconnects, leading to violations in timing-critical paths. This work estimates the timing impact of Time-Dependent Dielectric Breakdown (TDDB) between wires of the same layer, considering temperature variations. The proposed framework is evaluated on a Leon3 MP-SoC design implemented in a 45nm CMOS technology. The results evaluate the system's performance drift due to TDDB, considering different physical implementation scenarios.

Keywords: Reliability, Time-Dependent Dielectric Breakdown, Inter-Metal Dielectric Leakage, Timing.

1 Introduction

The current trend of CMOS technology scaling aggressively reduces the physical dimensions of devices and interconnects, leading to effects which form novel threats to the reliability of modern integrated circuits. The shrinking of the transistor channel length incurs an exponential growth of sub-threshold leakage, which increases power density and creates hot spots in congested areas of the chip. The reduction of gate oxide thickness in technology nodes beyond 65nm enhances the gate tunneling current and gives rise to Negative-Bias Temperature Instability (NBTI) in PMOS transistors, which manifests itself as a gradual rise of the threshold voltage.

Similar effects of a progressive nature also appear in interconnect structures. They are caused by the shrinking of geometrical dimensions and by the saturation of the operating voltage at around 1V in sub-micron technologies [1]. The reduction of wire width and thickness increases current density, while the smaller pitch and spacing enhance the electric field between interconnects of the same metal layer.


Thus, Back-End-of-Line (BEOL) reliability phenomena like Electro-Migration (EM), Stress Migration (SM) and Time-Dependent Dielectric Breakdown (TDDB) gain in significance with technology scaling and progressively degrade the electrical characteristics and structure of the affected interconnects.

The recent move from silica-based to porous low-k dielectrics between copper lines in the interconnect stack comes along with the advent of nanoscale technologies and has further aggravated the potential TDDB problems. Copper tends to "leak" into the dielectric and create conductive paths between wires of the same metal layer, leading to breakdowns in the dielectric and to leakage current between wires. Moreover, the evolution of this leakage current is not abrupt: it is a rather smooth function of operating time, until the magnitude of the current is large enough to create an electrical short between wires, which affects the functionality of the circuit.

In this paper, we present an analysis flow that captures the impact of Time-Dependent Dielectric Breakdown of the low-k dielectrics of the interconnect stack on the delay of individual wires and, furthermore, propagates this impact to the timing of the entire chip. Hence, we can estimate when the chip will present timing violations due to reliability problems on the interconnects.

The rest of the paper starts by presenting the related work in the literature and continues with the model used for the TDDB estimations. Section 4 presents the proposed reliability analysis framework and Section 5 demonstrates the experimental results, based on the application of this framework to layouts of an MP-SoC platform. Finally, a discussion of the results and hints for future work conclude the paper.

2 Related Work

Time-Dependent Dielectric Breakdown of the low-k dielectrics has been identified as a potential reliability threat by many independent researchers since the decision to move from aluminum to copper wires for standard CMOS processes [2][3][4]. Significant effort is being invested at the process technology development level in order to determine the process steps and materials that can alleviate this phenomenon [5][6]. Up to now, however, no solution at the level of process technology seems to solve the problem completely. Hence, TDDB must be taken into account at the design stage as a potential threat, not only to the reliability of interconnects but also to the circuit's performance, as the flow of inter-metal leakage through the dielectric increases the wire delay and may cause the design's critical path delay to drift over time.

This has also been implicitly understood by the process technology community, which has started working on modeling the impact of TDDB on the electrical properties of interconnects [7][8][9]. Although the design community has not yet taken up any of these models to evaluate the impact of TDDB at the level of an entire system, recent works present methodologies and tools estimating the system's performance drift over time [10][11], based on the extrapolation of accelerated inter-metal leakage measurements to normal operating conditions. In this work, we take these approaches a step further by exploring the impact of different place-and-route styles on the system's timing degradation due to TDDB, while considering the entire layout's temperature profile, which is of course dependent on the application.


3 Time-Dependent Dielectric Breakdown Mechanism

Time-Dependent Dielectric Breakdown (TDDB) of inter-metal dielectrics refers to the progressive destruction of the material insulating interconnects of the same metal layer, leading to the formation of “leaky” paths and therefore increasing the time required for charging and discharging of wire capacitances. This mechanism is similar to the ones appearing in gate oxide structures and parallel plate capacitors of high-k dielectrics, also used in DRAMs.

However, TDDB becomes more significant for interconnects with the advent of the low-k porous dielectric materials mostly used in sub-micron manufacturing processes to reduce interconnect delay, improve crosstalk and minimize interconnect power dissipation. These gains come hand in hand with worse reliability characteristics, due to the porous nature of this type of dielectric. The gradual breakdown of low-k materials is aggravated as the electric field between neighboring wires rises, the wire pitch scales down and the operating voltage saturates around 1V [1]. Hence, the inter-metal electric field grows stronger with technology scaling and is the main cause of the formation of conductive paths through the dielectric, along with imperfections appearing in the interconnects.

These defects appear in the low-k materials used in current nanometer technology processes and their formation is mainly due to the dominant dielectric deposition methods employed during manufacturing. Therefore, considering, under accelerated conditions of voltage and temperature, an electric field lower than 6 MV/cm, which is a usual stress value for low-k metal-insulator-metal structures [7], free charges (holes) are trapped in the areas of the dielectric where these defects exist. The number of trapped holes rises progressively until a critical value is reached. Then, the flow of inter-metal leakage becomes significantly stronger, leading to the dielectric's breakdown and finally resulting in a short circuit.

The TDDB mechanism can be modeled either by the Schottky or by the Frenkel-Poole emission, both of which have similar mathematical expressions for the inter-metal leakage current density [7] and are exponentially dependent on temperature. However, mainly because of the nature of this specific wear-out mechanism and of the recent shift of interconnect technology to low-k dielectrics, there is little convergence on a specific model. Therefore, a common practice for the estimation of inter-metal leakage in operating conditions is the extrapolation of leakage measurements from experimental data, where wires are stressed for a certain number of hours under high voltage and temperature, resulting in strong electric fields.
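For reference, the generic textbook forms of the two emission mechanisms read as follows (these are the standard expressions, not the specific fitted model of [7]):

```latex
% Generic textbook forms of the two emission mechanisms (not the fitted model of [7]):
\begin{align*}
  J_{\mathrm{Schottky}} &\propto T^{2}\,
    \exp\!\left[-\,\frac{q\left(\phi_{B}-\sqrt{qE/(4\pi\varepsilon_{i})}\right)}{kT}\right],\\[4pt]
  J_{\mathrm{Frenkel\text{-}Poole}} &\propto E\,
    \exp\!\left[-\,\frac{q\left(\phi_{t}-\sqrt{qE/(\pi\varepsilon_{i})}\right)}{kT}\right],
\end{align*}
% E: inter-metal electric field, phi_B / phi_t: barrier height / trap depth,
% eps_i: permittivity of the low-k dielectric, T: absolute temperature.
```

Both expressions share the exponential dependence on temperature and on the square root of the electric field, which is why the two models are hard to distinguish from limited stress data and why the extrapolation approach described next is preferred here.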

The extrapolation approach has also been adopted in this work, where the wires have been stressed for about one hour. The leakage in operating conditions is extracted by performing linear extrapolation from the experimental measurements, and the derived values form the basis for the estimation of the delay impact of TDDB on individual interconnects. In the proposed reliability analysis framework, presented in the following section, we demonstrate how the information from the inter-metal leakage characterization libraries in stress conditions is used to guide the estimation of the additional delay in wires due to TDDB. This was a necessary step for the development of the proposed reliability framework, which predicts the design's performance drift over time due to TDDB and therefore the shortening of the system's operating lifetime under the required performance.


4 The Proposed Interconnect Reliability Framework

The proposed reliability analysis flow, which captures the impact of TDDB in interconnects with low-k dielectrics on a design's timing, is illustrated in Fig. 1. Even though its structure is generic enough, we have customized this instance of the flow to capture the impact of TDDB on the delay of interconnects.

The flow of Fig. 1 takes four main inputs: (i) the layout of the circuit, which includes all the geometrical information of the interconnect stack, (ii) the timing constraints of the design, (iii) the standard-cell technology libraries, which include the information about the timing of the cells in the design's post-layout netlist and the dimensions of cells and interconnects, and (iv) the layout's power profile, which is needed to extract the temperature profile and establish the actual temperature on each net.

The first steps of the flow estimate the temperature and timing profile of the layout. For the temperature estimation, we used HotSpot [12], an open-source academic tool that produces the thermal map of the chip, by taking as inputs the floorplan of the target design and the power consumption of the floorplan’s units. The power profile required for the temperature estimation is obtained via power analysis of the post-layout Verilog netlist in Synopsys PrimeTime PX, using an activity trace obtained through logic simulation, based on a testbench of a real application, in ModelSim.

Static Timing Analysis (STA) is performed on the design's post-layout Verilog netlist, using the SoC Encounter Timing System (ETS) tool, which finds the most timing-critical paths in the design. In our framework, we extract the nets from the 50 most timing-critical paths. These nets are the "key" interconnects, as they belong to timing paths likely to suffer from TDDB. These paths have a minimal slack (less than 2ns) and thus a delay overhead due to TDDB may lead to timing violations.

After these nets are identified, their geometrical properties are extracted, including the dimensions of the wires themselves and of their neighbors, as well as the spacing between them. This is performed through a Tcl script executed in the SoC Encounter environment, which reads the layout database through the SoC Encounter Database Access [13] command set. Hence, the script extracts the wires of the nets of the examined critical path, as well as their length, width and thickness, and finds the neighboring wires of the same metal layer, along with their physical dimensions and the distance between them and the wires of the examined net. All this information, which will be used in the computation of the additional delay due to TDDB, is dumped to an output file, named wire.report in our toolchain.

After extracting the physical information about the wires of the examined nets, the next step is to estimate the impact that TDDB is expected to have on the delay of these wires individually, based on the model outlined in the previous section, and to annotate the generated delay overhead due to TDDB on the design's Standard Delay Format (SDF) file. Finally, the additional delay of each wire is taken into account in a chip-level timing analysis, in order to estimate the impact of TDDB on the timing of the entire layout, in a similar way as in the second step.


Fig. 1. The proposed temperature-aware interconnect reliability framework

4.1 Estimation of Delay Impact on Interconnects

For each of the wires identified in the 50 most timing-critical paths, our flow estimates the delay overhead due to TDDB, based on pre-computed inter-metal dielectric (IMD) leakage look-up table libraries, given in operating and accelerated conditions. This delay, computed for each of the nets of the examined path, is annotated to the Standard Delay Format (SDF) file of the design, to update the specific net delay with the new value. The computation of the additional delay due to TDDB is performed in three steps, as shown in Fig. 1. It is carried out through a Matlab script, based on the wire-extraction information for the nets of the examined path, while taking into account the proper temperature, depending on the units through which the specific path passes.


The final SDF file, including the new net delays, is then back-annotated, along with the post-layout netlist, to the static timing analyzer of ETS, to evaluate the impact of TDDB on the design's performance. The analytic description of the three steps required for the TDDB impact annotation is given below:

Step 1 – IMD Leakage Extrapolation: Based on the neighboring wire information, a Matlab script performs the additional delay computation due to inter-metal leakage and annotates the shifted delay to the SDF file of the design, for the TDDB timing impact evaluation. The additional delay calculation is divided into two steps. First, the script reads all the wires of the examined net from wire.report, as well as their neighboring wires, and obtains the IMD leakage from accelerated to operating conditions by performing linear extrapolation, based on experimental look-up table libraries. These libraries contain IMD leakage information after the wires have been stressed for up to one hour at voltages of 35V, 40V and 45V, under temperatures of 323K, 398K and 448K, respectively.
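A minimal sketch of this extrapolation step is given below in Python; the library layout, the measured values and the operating point are assumptions made purely for illustration.

```python
# Sketch of Step 1: extrapolate inter-metal (IMD) leakage from accelerated
# stress conditions to operating conditions. The stress library below and
# the operating point are assumptions for illustration only.
import numpy as np

# Hypothetical stress library: leakage (A) of one wire pattern measured at
# (voltage, temperature) stress points after one hour of stress.
stress_points = np.array([[35.0, 323.0],
                          [40.0, 398.0],
                          [45.0, 448.0]])
leakage_meas = np.array([2.0e-6, 8.0e-6, 2.5e-5])

def extrapolate_leakage(v_op, t_op):
    """Linear extrapolation of leakage versus stress voltage and temperature
    down to an operating point (v_op in V, t_op in K)."""
    # Fit I = a*V + b*T + c through the stress measurements.
    a_mat = np.column_stack([stress_points, np.ones(len(stress_points))])
    coeffs, *_ = np.linalg.lstsq(a_mat, leakage_meas, rcond=None)
    i_op = float(np.array([v_op, t_op, 1.0]) @ coeffs)
    # Floor at zero: with real characterization data the extrapolated value
    # would be a small positive current.
    return max(i_op, 0.0)

# Operating point: 0.9 V supply, 350 K hot-spot temperature (illustrative).
print(extrapolate_leakage(0.9, 350.0))
```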

Step 2 – Delay Increment Computation: The extrapolated leakage is used to estimate the additional delay on the net due to TDDB, based on another look-up table library, which provides the delay increment ratio for charging or discharging a wire, depending on the inter-metal leakage between two adjacent wires of varying length, spacing and overlap. For the construction of such a library, we simulated the behavior of two neighboring wires in Synopsys HSPICE, in order to find the ratio of delay increment when charging and discharging a wire due to IMD leakage, for various possible adjacent wire patterns. This library was created once and is used in all the conducted experiments, since the on-the-fly extraction and simulation of adjacent wire patterns for all the timing paths of each layout would be time-consuming.

In the conducted experiments, the wire length ranges between 10um and 600um, in order to include wire patterns with length equal to or greater than those met in the layouts of our case study. The spacing ranges between 0.06um and 0.5um, covering the range defined in the design rules of the 45nm standard-cell library used for the implementation of the layouts. Moreover, in order to measure the delay of wires which are not totally overlapped, we simulated wire patterns where the starting point of the neighboring wire was not aligned with that of the wire for which the delay was measured. Hence, the neighboring wire's starting point ranged from zero (total overlap of wires) to 75% of the target wire's length (smallest overlap of wires). In order to simulate the inter-metal leakage in HSPICE, we used current sources distributed across the target wire at each R-C (Resistance-Capacitance) segment.

Fig. 2. The distributed RC model simulating inter-metal leakage in HSPICE (pre-overlap, overlap and post-overlap regions of the target wire)
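The sketch below (Python) generates a SPICE deck of this kind of distributed R-C wire, with the total leakage split over the segments that overlap the neighboring wire; the element values and the driving pulse are placeholders, not the characterized 45nm wire parasitics.

```python
# Sketch of the distributed R-C wire model of Fig. 2 as a SPICE deck: the wire
# is split into equal segments and the total IMD leakage is divided over the
# segments facing the neighboring wire. All element values are placeholders.

def rc_deck(n_seg=4, overlap_segs=(1, 2), r_seg=50.0, c_seg=5e-15, i_leak=10e-6):
    lines = ["* distributed RC wire with IMD leakage sources",
             "Vdrv in 0 PULSE(0 1 0 10p 10p 1n 2n)"]
    i_per_seg = i_leak / len(overlap_segs)   # leakage split over overlap region
    prev = "in"
    for k in range(1, n_seg + 1):
        node = "n%d" % k
        lines.append("R%d %s %s %g" % (k, prev, node, r_seg))
        lines.append("C%d %s 0 %g" % (k, node, c_seg))
        if k - 1 in overlap_segs:            # this segment faces the neighbor
            lines.append("I%d %s 0 %g" % (k, node, i_per_seg))
        prev = node
    lines.append(".tran 1p 2n")
    lines.append(".end")
    return "\n".join(lines)

# Two of the four R-C stages overlap the neighbor, as in the Fig. 2 example.
print(rc_deck())
```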

The total leakage current for each wire in our simulations depends on the wire's length and varies between 0 and 50uA, in order to cover a wide enough range. The value of each current source, given in uA, depends on the overlap length between the target wire and its adjacent one. In our approach, it is computed by dividing the total leakage current for the wire by the number of R-C stages corresponding to the overlap length between the target wire and the adjacent one. In Fig. 2, we demonstrate an example of an equivalent distributed R-C model of a wire that has two of its four R-C stages overlapping with its neighboring one (overlap region), at the same metal layer.

Step 3 – Interconnect Delay Computation: Based on the delay increment ratios extracted from the wire simulations, we constructed a look-up table library including the length of the wire for which the delay is computed, along with the neighboring wire's length, the wires' spacing and the starting point of the neighboring wire, all given in um. Based on this library, namely TDDB_LUT.lib in our flow, as well as on the leakage extrapolated from accelerated to operating conditions in the first step, we perform a linear interpolation in Matlab to compute the delay ratio for each wire of the examined net in the current timing path. It must be noted that only wires longer than 10um are considered in the additional delay calculation script, in correspondence with the range of wire lengths included in TDDB_LUT.lib.

In the interpolation performed in Matlab, we obtain the wires of each net in the examined path of the design by reading wire.report, which contains the physical dimension information about the net's wires and their neighboring ones in the same metal stack, from the initial layout extraction step. The arguments passed for the interpolation are the wire's width, thickness and length, as well as the starting point of each neighboring wire, its length and the distance between them, all obtained from the layout extraction. Hence, in order to find the additional delay for the specific wire, considering the neighboring ones from the layout, we perform a linear interpolation between these values and those of the wire patterns simulated in HSPICE, for which we have already computed the delay increment ratios and dumped them into TDDB_LUT.lib, as mentioned above. The delay overhead for individual wires is shown in Fig. 3, as a function of the wires' length and distance (right plot), as well as of temperature and operation time, given in years (left plot).
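A simplified sketch of this table look-up is shown below; the real TDDB_LUT.lib also spans spacing and overlap, whereas this illustration collapses the table to wire length and leakage, and all table entries are made up.

```python
# Sketch of Step 3: look up the delay-increment ratio by linear interpolation
# over a pre-characterized table. The real TDDB_LUT.lib also spans wire
# spacing and overlap; this sketch keeps only length and leakage, and the
# table values are illustrative.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

length_um  = np.array([10.0, 100.0, 300.0, 600.0])
leakage_ua = np.array([0.0, 10.0, 25.0, 50.0])
ratio = np.array([[0.0, 0.001, 0.003, 0.006],     # 10 um wire
                  [0.0, 0.004, 0.010, 0.022],     # 100 um wire
                  [0.0, 0.012, 0.030, 0.065],     # 300 um wire
                  [0.0, 0.025, 0.060, 0.130]])    # 600 um wire
lut = RegularGridInterpolator((length_um, leakage_ua), ratio)

def delay_increment_ratio(l_um, i_ua):
    if l_um < 10.0:               # short wires are skipped, as in the text
        return 0.0
    return float(lut([[l_um, i_ua]])[0])

print(delay_increment_ratio(120.0, 13.0))
```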

Fig. 3. Delay impact on a wire due to TDDB depending on: temperature (left) and wire length and distance (right)


The calculated delay ratio is then multiplied by the quotient of the target wire's length and the total net length, and the result is added to the wire's initial delay due to IMD leakage (which is of course zero). Thus, the additional delay on the specific wire due to TDDB is computed. The same process is performed for all the wires of the examined net in the current path of the design. The total additional delay of the whole net due to TDDB is the weighted summation of all the net's wire delays, where the weights are computed by dividing each wire's length by the total net length. The updated net delay is then annotated into the design's SDF file, by finding the specific net and adding the extra delay to the existing one. The SDF file thus produced, containing the delay overhead of all nets of the path, is then annotated in ETS to evaluate the total impact of TDDB on the design's performance. The aforementioned process is repeated for all the selected register-to-register paths in the design, and it is applicable to any other design with a reasonable number of gates.
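One plausible reading of this net-level bookkeeping is sketched below; the wire records, base delays and ratios are illustrative, and the actual SDF parsing and rewriting is omitted.

```python
# Sketch of the net-level bookkeeping described above: each wire's extra delay
# is weighted by its share of the total net length and the weighted sum is
# added to the net's existing SDF delay. The wire records are illustrative.

def net_delay_overhead(wires):
    """wires: list of (length_um, base_wire_delay_ps, delay_increment_ratio)."""
    total_len = sum(w[0] for w in wires)
    overhead = 0.0
    for length, base_delay, ratio in wires:
        extra = base_delay * ratio            # extra delay of this wire
        overhead += extra * (length / total_len)
    return overhead

# Three wires of one net (lengths in um, base delays in ps, ratios from Step 3)
net = [(120.0, 14.0, 0.012), (40.0, 5.0, 0.004), (15.0, 2.0, 0.0)]
extra_ps = net_delay_overhead(net)
print("net delay overhead: %.3f ps" % extra_ps)
# The annotated SDF value would then be: new_delay = old_delay + extra_ps
```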

5 Evaluation of the TDDB Framework on a LEON3-Based MP-SoC

The presented TDDB analysis flow is applied to an MP-SoC design based on two LEON3 SPARC processor cores, both attached to the AMBA Advanced High-Performance Bus (AHB). Each processor has seven pipeline stages, while the internal caches include 2 sets of 4K bytes.

The design's RTL description is given in parameterized VHDL, configured via the Gaisler Research automated tools [14]. It is synthesized in Synopsys Design Compiler with the TSMC 45nm standard-cell library (0.9V, 25°C) and at a clock period constraint of 2ns, resulting in about 30K gates. The floorplanning and place-and-route steps are carried out in Cadence SoC Encounter, while ETS is employed for Static Timing Analysis. The post-layout Verilog netlist simulation is performed in ModelSim [15], where we obtained switching activity from a matrix multiplication application running on both processors, as well as from an MP-SoC benchmark initializing the two cores and the system's peripherals, included in the Gaisler suite. The power analysis is performed in PrimeTime PX, by annotating the .vcd (Value Change Dump) file with the design's activity, derived from the ModelSim framework.

In the proposed case study, we explore how the impact of TDDB on the performance of a LEON3-based MP-SoC design may change by selecting different placement and routing scenarios, considering the gate-level netlist obtained from synthesis. The dependence of inter-metal leakage on the length and distance of wires motivated us to look at different place-and-route strategies favoring either timing or congestion, to find out which scenario minimizes the timing impact of TDDB.

5.1 Experimental Results and Discussion

The main parameters that affect TDDB in the interconnect dielectrics are temperature, wire length and the distance between adjacent wires. Temperature is mostly affected by the switching activity of the design. In our LEON3 layouts, which were implemented based on five different place-and-route scenarios, we observed minor temperature differences between the two application benchmarks of different computational effort mentioned above.


This is due to the similar power traces extracted from the power analysis of the two executed application benchmarks, as well as to the fact that the same floorplan was used to implement the different placement and routing strategies.

On the other hand, interconnect stack geometrical parameters, like lengths and distances, are mainly impacted by how the circuit is placed and routed. In principle, a timing-optimized placement and routing approach will tend to lead to shorter wires, while a congestion-oriented physical implementation strategy will tend to result in longer wires, due to coarser placement as well as to the detouring of wires during routing to avoid the formation of over-congested areas. Therefore, it is likely that such a strategy will incur larger distances between wires in the same metal layer.

However, the results depicted in Fig. 4 indicate that when placement is congestion-aware (CPl-NR & CPl-CR), the delay overhead due to TDDB is very high. Such a placement scenario spreads out the standard cells and inevitably leads to longer wires at the routing stage, compared to the timing-driven approach and irrespective of the routing strategy. At the other extreme, timing-aware placement and routing results in the minimum delay impact, because the wire lengths are minimal.

Combining these remarks with those of Fig. 3 (delay impact on a wire due to TDDB versus temperature, wire length and distance), we can draw interesting conclusions. Even though at the individual wire level the distance between wires is the most critical parameter for the delay impact of TDDB, at the entire chip level wire length is the only important parameter for our LEON3-based layouts. However, in the presented case study, the different routing strategies, favoring timing or congestion, tend to leave the distances between wires almost unaffected. Hence, a timing-optimal placement and routing approach will also lead to the best layout for TDDB. Since timing is usually the major design spec, the resulting layouts will be optimal for TDDB when a totally timing-driven place-and-route approach is selected.

Fig. 4. Chip-level timing overhead due to TDDB for different layout styles (C: congestion-aware, T: timing-aware, N: normal, Pl: placement, R: routing)

This does not imply, however, that designers need not worry about TDDB.


Timing-optimal layouts tend to have minimal slack between the data arrival time and the required time, so that the designs can run at the highest possible clock frequency. Even after 3 years of operation, the layout we used incurs a critical-path delay overhead of about 40ps, which might be enough to cause a timing violation. There is a trade-off between the actual clock frequency and the operating lifetime of the chip. If enough timing slack is left for TDDB tolerance, the expected operating lifetime will be longer, while the design's operating frequency, and consequently its performance, will degrade, and vice versa.
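A back-of-the-envelope view of this trade-off, using the 2 ns clock constraint and the roughly 40 ps three-year drift quoted above:

```latex
% Guardband implied by the numbers above (2 ns clock constraint,
% ~40 ps critical-path drift after 3 years of operation):
\[
  f_{\max}(t{=}0) = \frac{1}{2.00\ \mathrm{ns}} = 500\ \mathrm{MHz}, \qquad
  f_{\max}(3\ \mathrm{yr}) \approx \frac{1}{2.00\ \mathrm{ns} + 40\ \mathrm{ps}} \approx 490\ \mathrm{MHz}.
\]
% i.e. a roughly 2% frequency guardband buys about three years of TDDB tolerance.
```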

6 Conclusion and Hints for Future Work

In this work, we introduced a reliability analysis framework that estimates the impact of Time-Dependent Dielectric Breakdown on a system's performance, considering an MP-SoC design implemented in a nanometer CMOS technology with different place-and-route strategies. The proposed flow captures the timing violations induced by the inter-metal leakage of low-k interconnects on the examined paths and predicts the gradual performance degradation for each implementation scenario, considering the layout's temperature profile based on a specific application. Future work may focus on the framework's automation, as well as on the selection of paths depending on the temperature of design units, the congestion and the length of wires.

References

1. ITRS 2005 public reports (2005), http://public.itrs.net
2. Chen, F., et al.: Critical low-k reliability issues for advanced CMOS technologies. In: Proc. of the 2009 IRPS Symposium, Montreal, Canada, May 26-30, pp. 464-475 (2009)
3. Nitta, S., et al.: Copper BEOL interconnects for silicon CMOS logic technology. In: Davis, J.A., Meindl, J.D. (eds.) Interconnect Technology and Design for Gigascale Integration. Springer, Heidelberg (2003)
4. Gonella, R.: Key reliability issues for copper integration in damascene architecture. Journal of Microelectronic Engineering 55(1-4), 245-255 (2001)
5. Tan, T.L., Gan, C.L., Du, A.Y., Cheng, C.K., Gambino, J.P.: Dielectric degradation mechanism for copper interconnects capped with CoWP. Applied Physics Letters 92, 201916 (2008)
6. Takeda, K.-i., Ryuzaki, D., Mine, T., Hinode, K., Yoneyama, R.: Copper-induced dielectric breakdown in silicon oxide deposited by plasma-enhanced chemical vapor deposition using trimethoxysilane. Journal of Applied Physics 94, 2572 (2003)
7. Chen, F., et al.: Line-edge roughness and spacing effect on low-k TDDB characteristics. In: Proceedings of the 2008 International Reliability Physics Symposium (IRPS), April 27-May 1, pp. 132-138 (2008)
8. Chen, F., Shinosky, M.: Addressing Cu/Low-k Dielectric TDDB Reliability Challenges for Advanced CMOS Technologies. IEEE Transactions on Electron Devices 56(1), 2-12 (2009)
9. Li, Y.: Low-k dielectric reliability in copper interconnects. PhD Dissertation, Katholieke Universiteit Leuven (2007)
10. Guo, J., et al.: A Tool Flow for Predicting System-Level Timing Failures due to Interconnect Reliability Degradation. In: Proc. of the 2008 GLSVLSI International Symposium, Orlando, Florida, USA, May 4-6, pp. 291-296 (2008)
11. Guo, J., et al.: The Analysis of System Level Timing Failures due to Interconnect Reliability Degradation. IEEE Transactions on Device and Materials Reliability (2009)
12. Huang, W., Ghosh, S., Velusamy, S., Sankaranarayanan, K., Skadron, K., Stan, M.R., Brown, C.L.: HotSpot: A Compact Thermal Modeling Methodology for Early-Stage VLSI Design. IEEE Transactions on VLSI Systems 14(5) (May 2006)
13. Cadence SoC Encounter Database Access command reference, http://www.cadence.com
14. Aeroflex Gaisler Research, http://www.gaisler.com
15. Mentor Graphics ModelSim, http://www.model.com


An Efficient Low Power Multiple-Value Look-Up Table Targeting Quaternary FPGAs

Cristiano Lazzari1, Jorge Fernandes2, Paulo Flores2, and Jose Monteiro2

1 INESC-ID, Lisbon, Portugal   2 INESC-ID / IST, TU Lisbon, Lisbon, Portugal

{lazzari,jorge.fernandes,pff,jcm}@inesc-id.pt

Abstract. FPGA structures are widely used as they enable early time-to-market and reduced non-recurring engineering costs in comparison to ASIC designs. Interconnections play a crucial role in modern FPGAs, because they dominate delay, power and area. Multiple-valued logic allows the reduction of the number of interconnections in the circuit and hence can serve as a means to effectively curtail the impact of interconnections. In this work we propose a new look-up table structure based on a low-power high-speed quaternary voltage-mode device. The most important characteristics of the proposed architecture are that it is a voltage-mode structure, which allows reduced power consumption, and that it is implemented in a standard CMOS technology. Our quaternary implementation overcomes previously proposed techniques with simple and efficient CMOS structures. Moreover, results show significant reductions in power consumption and timing in comparison to binary implementations with similar functionality.

Keywords: Multiple-value Logic, Quaternary Logic, Look-up Tables, FPGAs, Standard CMOS Technology.

1 Introduction

Designers face new challenges in modern systems on a chip (SoCs) due to the large number of components. The high integration of different systems increases the number and length of interconnections, which are becoming the dominant aspect of the circuit delay for state-of-the-art circuits due to the advent of deep sub-micron technologies (DSM). This fact is even more significant with each new technology generation [1]. In DSM technologies, the gate speed, density and power scaling follow Moore's law. On the other hand, the interconnection resistance-capacitance product increases with the technology node, leading to an increase of network delay. Even after modifications in interconnections, from aluminum to copper and low-k inter-metal dielectric materials, the problem remains and is getting more significant [2].

In particular, interconnections play a crucial role in Field Programmable Gate Arrays (FPGAs), because they not only dominate the delay, but also have a significant impact on power consumption [3] and occupied area [4].


suggests that in modern million-gate FPGAs, as much as 90% of chip area is dedicated to interconnections [5].

In order to keep the wide range of FPGA applications in the market, one must deal with their excessive power dissipation, and this must be reduced without compromising computational power. One way to deal with this problem is to reduce the area occupied by the interconnections, by reducing not only the number of interconnections but also their length.

Multiple-valued logic (MVL) has received increased attention in recent years because of the possibility of representing information with more than two discrete levels on a single wire. Hence, the number of interconnections can be significantly reduced, with major impact on all design parameters: less area dedicated to interconnections; more compact and shorter interconnections, leading to increased performance; lower interconnect switched capacitance, and hence lower global power dissipation [6].

MVL has been successfully applied to several types of devices, such as adders [7] and multipliers [8], and programmable devices [5,9] have also been proposed. The main drawbacks of these previous MVL implementations are that they are either based on current-mode devices or demand extra steps in the fabrication process (for the generation of transistors with different Vths). Current-mode circuits achieve successful area reductions, but their excessive power consumption and implementation complexity have prevented, until now, MVL systems from being a viable alternative to standard CMOS designs. On the other hand, while it is true that technologies with multiple Vths deal very well with the power dissipation problem, as stated in [5,10], the additional phases in the fabrication process make their implementation more difficult, more susceptible to variability problems and more expensive.

In this work we present a new implementation of a multiple-valued look-up table based on the quaternary representation, taking advantage of the analog nature of the multiple-valued representation. We implemented the quaternary look-up table using a simple and efficient analog structure able to deal with the quaternary signals. Results show that our implementation overcomes the drawbacks of previous implementations and is competitive when compared to binary LUTs with the same functionality.

This paper is organized as follows. Section 2 discusses the differences between binary and quaternary look-up table implementations. Section 3 presents the new quaternary look-up table, giving details about the proposed structure. A comparison between the binary and quaternary look-up tables is presented in Section 4. Variability and the reduced noise margin effects in quaternary circuits are discussed in Section 5, and finally, Section 6 concludes the paper and outlines future work.

2 Binary and Quaternary Look-Up Tables Overview

General Look-Up Tables (LUTs) are basically memories which implement a logic function according to their configuration. Configuration values C = (c0, · · · , ci, · · · , ck−1) are initially stored in the look-up table structure and, once inputs are applied, the logic value in the addressed position is assigned to the output. The capacity |C| of a LUT is given by

|C| = n × b^k    (1)

where n is the number of outputs, k is the number of inputs and b is the number of logic values. For example, a 4-input binary look-up table with one output is able to store 1 × 2^4 = 16 Boolean values. For the purposes of this work, only 1-output LUTs (n = 1) are discussed.

A binary function implemented by a Binary Look-Up Table (BLUT) is defined as f : B^k → B, over a set of variables X = (x0, · · · , xi, · · · , xk−1), where each variable xi represents a Boolean value. The total number of different functions |F| that can be implemented in a BLUT with k input variables is given by

|F| = b^|C|    (2)

where b = |B| (b = 2 in the binary case). For example, a look-up table with 4 inputs (k = 4) can implement one of |F| = 65,536 different functions.

Quaternary functions are basically generalizations of binary functions. A quaternary function implemented by a quaternary look-up table (QLUT) is defined as g : Q^k → Q, over a set of quaternary variables Y = (y0, · · · , yi, · · · , yk−1), where the values of a variable yi, like the values of the function g(Y), lie in Q = {0, 1, 2, 3}. As in the binary case, the number of possible functions in a QLUT is given by (2), with b = 4. In this case, the number of functions that can be represented is around 4.3 × 10^9 for a QLUT with only two quaternary inputs (k = 2), which is much larger than for the BLUT.

It is important to highlight that the function g(Y) performs exactly the same function as two BLUTs, f0(Y) and f1(Y), where f0 represents the least significant Boolean value and f1 the most significant one. Following the same idea, the configuration values of the QLUT are also quaternary, each representing the values of two binary configuration values.

Since a quaternary variable y is capable of representing twice as much information as a binary variable x, we use |Q| = 2 × |B| in our experiments. In other words, two binary variables driven by the same inputs can be grouped to form one quaternary variable. This procedure aims at reducing both the total number of connections and the number of gates.
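As a quick illustration of Eqs. (1) and (2) and of the pairing of two binary configurations into one quaternary configuration, the short Python sketch below (purely illustrative, not part of the original work) computes the capacity and function count for a 4-input BLUT and a 2-input QLUT, and packs two 16-entry binary configurations into a single quaternary one.

```python
# Illustrative sketch of Eqs. (1) and (2), plus the packing of two BLUT
# configurations (f1 = most significant bit, f0 = least significant bit)
# into one quaternary QLUT configuration.

def lut_capacity(n_outputs, k_inputs, b_levels):
    """Eq. (1): |C| = n * b^k configuration entries."""
    return n_outputs * b_levels ** k_inputs

def lut_function_count(k_inputs, b_levels):
    """Eq. (2): |F| = b^|C| distinct functions for a 1-output LUT."""
    return b_levels ** lut_capacity(1, k_inputs, b_levels)

assert lut_capacity(1, 4, 2) == 16          # 4-input BLUT stores 16 bits
assert lut_function_count(4, 2) == 65536    # 2^16 binary functions
assert lut_capacity(1, 2, 4) == 16          # 2-input QLUT also stores 16 digits
print(lut_function_count(2, 4))             # 4294967296 ~ 4.3e9 functions

def pack_blut_configs(c_msb, c_lsb):
    """Each quaternary entry is the digit 2*f1 + f0 in {0, 1, 2, 3}."""
    return [2 * b1 + b0 for b1, b0 in zip(c_msb, c_lsb)]

print(pack_blut_configs([0, 1] * 8, [1, 0, 1, 1] * 4))
```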

3 Look-Up Tables Implementation

Binary and quaternary look-up tables were implemented with transmission gates. For the binary version, the transmission gates are controlled by the BLUT inputs, while the QLUT is composed of transmission gates controlled by a new quaternary-to-binary device.

Fig. 1a shows a 4-input BLUT implementation (b = 2, k = 4, |C| = 16), where xi ∈ X are the inputs, ci ∈ C form the look-up table configuration and z is the output. The BLUT is composed of four stages, one per input. Multiplexers (implemented with transmission gates) are responsible for propagating the configuration values to the BLUT output. The transmission gates receive selection signals from the four BLUT inputs and their associated inverters.

Fig. 1. Binary and quaternary look-up table implementations: (a) 4-input BLUT; (b) 2-input QLUT

A quaternary look-up table (QLUT) follows the same structure as the BLUT. Fig. 1b illustrates the implementation of a 2-input QLUT (b = 4, k = 2, |C| = 16). As in the binary case, ci ∈ C are the look-up table configuration values, yi ∈ Y are the inputs and w is the output. Due to the quaternary representation, only two stages of transmission gates are required.
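Behaviourally, the two stages of transmission gates simply index the configuration array with the two quaternary inputs. The sketch below is a functional model of that selection, not the transistor-level circuit; the configuration used in the example (a quaternary "max" function) is an arbitrary illustration.

```python
# Functional model of the 2-input QLUT of Fig. 1b: two quaternary inputs select
# one of 16 quaternary configuration values, mirroring the two transmission-gate
# stages of the circuit.  Illustrative only.

def qlut_eval(config, y1, y0):
    """config: 16 quaternary values (0..3); y1, y0: quaternary inputs."""
    assert len(config) == 16 and all(v in range(4) for v in config)
    stage1 = [config[4 * group + y0] for group in range(4)]  # first stage: y0
    return stage1[y1]                                        # second stage: y1

# Example configuration: w = max(y1, y0), stored at index 4*y1 + y0.
max_config = [max(a, b) for a in range(4) for b in range(4)]
assert qlut_eval(max_config, 2, 3) == 3
assert qlut_eval(max_config, 1, 0) == 1
```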

The transmission gates are controlled by binary signals. Therefore, we need a special circuit to convert the quaternary inputs y0 and y1 into the corresponding control signals: the quaternary-to-binary converter (Q-decoder).


Table 1. The Q-decoder behavior as a function of the quaternary logic value at the input

Q     Q0    Q1    Q2    Q3
0₄    1₂    0     0     0
1₄    0     1₂    0     0
2₄    0     0     1₂    0
3₄    0     0     0     1₂

3.1 Quaternary-to-Binary Converter

Table 1 shows the Q-decoder binary output logic values as a function of the quaternary input Q. Outputs Q0 to Q3 determine which transmission gates (in Fig. 1b) propagate the configuration value ci ∈ C to the QLUT output w. Note that the values of the controlling signals Q0, Q1, Q2 and Q3 are binary values, meaning 0 (0 V) or 1₂ (VDD).

The Q-decoder outputs may be seen as flags that indicate which quaternary value is applied to the Q-decoder input. Once we are able to determine the quaternary value at the Q-decoder input Q, the transmission gates connected to the Q-decoder outputs can be properly controlled. In other words, with the Q-decoder structure we are able to convert a quaternary input into a 4-bit one-hot word and its inverted value.
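A behavioural model of this conversion takes only a few lines; the sketch below reproduces the truth table of Table 1, with logic 1 standing for VDD (illustrative only, not the circuit).

```python
# Behavioural model of the Q-decoder (Table 1): a quaternary input produces a
# one-hot 4-bit word Q0..Q3 plus its complement, which drive the QLUT
# transmission gates.  Logic 1 stands for VDD.

def q_decoder(q):
    """q in {0, 1, 2, 3} -> (one-hot Q0..Q3, complemented outputs)."""
    one_hot = [1 if q == i else 0 for i in range(4)]
    return one_hot, [1 - bit for bit in one_hot]

for q in range(4):
    print(q, q_decoder(q))
# 0 ([1, 0, 0, 0], [0, 1, 1, 1])
# 1 ([0, 1, 0, 0], [1, 0, 1, 1])  ... and so on, matching Table 1.
```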


Fig. 2. The Q-decoder logic structure

The Q-decoder structure is shown in Fig. 2. The main advantage of this structure over previously proposed implementations is that it uses standard CMOS structures. The Q-decoder is composed of two comparators, CP and CN, and other traditional digital circuits such as inverters, NANDs and NORs.

CP and CN are self-referenced analog comparators, shown in Fig. 3. With these structures we are able to detect the four possible voltage levels. In a binary implementation, an inverter may be seen as a comparator whose voltage reference is VDD/2. For our quaternary device, we need three voltage references in order to determine a quaternary value, at 1/6 VDD, 3/6 VDD and 5/6 VDD, as depicted in Fig. 3a.


Fig. 3. Quaternary logic levels and comparator details: (a) logic levels; (b) CP and CN transfer functions; (c) CP structure; (d) CN structure

One way to obtain this comparator behavior is to design inverters with unbalanced PMOS and NMOS transistor widths. The main drawback of this technique is that it leads to large transistor widths with large gate capacitances, penalizing speed and power. Furthermore, in technologies with low VDD, the reference voltage values are below Vth, which makes this sizing technique impracticable.

To overcome this problem, we propose the use of the comparator circuits in Fig. 3c and Fig. 3d, which add an extra transistor connected as a "diode" to shift the supply voltage by Vth.

In a first-order approach, we consider simplified transistor models and assume that the transistors are equally sized (k1 = k2 = k'2, i.e., µn(W/L)1 = µp(W/L)2), with equal threshold voltages (Vth1 = Vth2 = V'th2 = Vth). This simplified analysis is confirmed by simulations with more accurate models, presented in the next sections.


Reference points are defined by calculating vx for vi = 0 (Eq. 3) and the transition points (Eq. 4), leading to the transfer function curves represented in Fig. 3b.

vx |(vi = 0):  iD2 = 0  ⇒  k2 (VDD − vx − Vth2)² = 0  ⇒  vx = VDD − Vth2    (3)

iD1 = iD2  ⇒  k1 (vi − Vth1)² = k2 (vx − vi − Vth2)²,  with vx = VDD − Vth
⇒  vi − Vth1 = VDD − vi − 2 Vth  ⇒  2 vi = VDD − Vth  ⇒  vi = (VDD − Vth) / 2    (4)
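A quick numeric check of these reference points is given below; VDD = 1.2 V is the supply used later in the paper, while Vth = 0.3 V is an assumed, illustrative threshold value (the paper does not quote the UMC 130 nm figure).

```python
# Numeric check of Eqs. (3) and (4) under the first-order model.
# VDD = 1.2 V matches the technology used later in the paper;
# Vth = 0.3 V is an assumed value for illustration only.

VDD = 1.2
VTH = 0.3

v_x = VDD - VTH            # Eq. (3): internal node voltage with vi = 0
v_i = (VDD - VTH) / 2.0    # Eq. (4): comparator switching point

print(v_x)   # 0.9 (V)
print(v_i)   # 0.45 (V)
```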

The Q-decoder was implemented in the UMC 130 nm technology. Simulation waveforms are shown in Fig. 4, where the Q-decoder outputs behave as expected and as described in Table 1. The largest propagation delay from the Q-decoder input to the outputs (Q → Q2) is 196 ps for this technology. This result is very important because an inverter connected to the same transmission gates (i.e., the same output load) presents an 81 ps propagation delay, and the transmission gates are the main contributors to the look-up table propagation delay. More details about the comparison of binary and quaternary LUTs are given in the next section.

4 Binary vs Quaternary Look-Up Tables

We also implemented the complete binary and quaternary look-up tables in the UMC 130 nm technology in order to evaluate their performance and power consumption. The binary and quaternary LUTs were developed according to Fig. 1. Transistor widths were kept to the minimum value in order to have a fair comparison between the binary and quaternary versions.

We inserted buffers in the binary structure in order to reduce the impact of the gate capacitances. According to Fig. 1a, a cell connected to the BLUT input x0 would have to drive 16 transistors. We balanced these gate capacitances by inserting 4 buffers, thus improving the propagation delay. The power consumption was also reduced due to the faster transitions and, as a consequence, shorter short-circuit times.

Experimental results are shown in Table 2, where the quaternary structure proposed in this paper outperforms the binary implementation in both power consumption and propagation delay. These results were obtained through CADENCE Spectre simulation [11]. The propagation delay is simply the largest delay from an input to the output of each LUT. The average power consumption is obtained from the simulation of 1024 random input vectors, with the circuits running at 100 MHz.


Fig. 4. The Q-decoder input and output waveforms: Q and Q0–Q3 (V) versus t (ns)

For the quaternary circuits, we carefully took into consideration every single voltage source (e.g., those used to drive the ci values of the QLUT), so that the results shown in Table 2 reflect the real power consumption (i.e., currents flowing from one voltage source to another are accounted for).

Results highlight that the quaternary look-up table proposed in this paper is very promising. In terms of delay, the quaternary LUT presents a very similar behavior, with better results obtained when the load capacitance is 0.5 pF or larger.

The power consumption is the most important result. According to Table 2, the quaternary LUT presents gains ranging from 22% (Cl = 0.2 pF) to 39% (Cl = 1 pF) in terms of power consumption. Note that, as for the propagation delay, the gains become larger as the load capacitance increases.

It is clear that these power consumption gains are obtained thanks to the reduced voltage levels. While binary transitions range from 0 V to 1.2 V (for this technology), quaternary transitions may vary from 0 V → 0.44 V up to 0 V → 1.2 V, demanding different current flows. Considering that all possible transitions have the same probability, quaternary transitions have a smaller average voltage swing, reducing the average current flow and consequently the power dissipation.
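The argument can be checked with a short back-of-the-envelope computation. The sketch below assumes ideal, evenly spaced quaternary levels (0, 0.4, 0.8, 1.2 V) and equiprobable transitions; the actual circuit levels differ slightly (e.g. the 0.44 V level quoted above).

```python
# Back-of-the-envelope check: average voltage swing per transition for binary
# vs. quaternary signalling, assuming equiprobable transitions between distinct
# levels.  Levels are idealized (evenly spaced); illustrative only.

from itertools import permutations

def mean_swing(levels):
    pairs = list(permutations(levels, 2))        # all ordered distinct transitions
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

print(mean_swing([0.0, 1.2]))                    # 1.2 V for binary
print(mean_swing([0.0, 0.4, 0.8, 1.2]))          # ~0.67 V for quaternary
```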


Table 2. Delay and power consumption comparison of two 4-input BLUTs and one 2-input QLUT, both implemented in the UMC 130 nm process technology

Output load (Cl)   Two 4-input binary LUTs        2-input quaternary LUT
                   Delay     Power @ 100 MHz      Delay     Power @ 100 MHz
0.2 pF             0.91 ns   45 µW                0.95 ns   35 µW
0.5 pF             1.9 ns    68 µW                1.7 ns    43 µW
1.0 pF             3.4 ns    94 µW                3.0 ns    57 µW

In a practical FPGA implementation there will be fewer interconnections due to the quaternary representation, and hence we will also be able to reduce the wire length; as a consequence, the parasitic capacitance will be smaller. For this reason, we expect even better results than the ones presented in this paper when a complete FPGA, based on the proposed circuits, is developed to implement quaternary logic.

5 Variability and Noise Margin in Quaternary Circuits

In current sub-micron and future technologies, process variability and reduced noise margins are important challenges for the development of multiple-valued devices. Voltage-mode multiple-valued logic devices use more closely spaced voltage levels to represent logic values than binary circuits, and for this reason they may be, in theory, more susceptible to errors.

Nevertheless, we performed a Monte Carlo simulation with 500 runs showing that our quaternary LUT is robust to process variations when random process and mismatch variations are considered. In these simulations, voltage variations are kept below 90 mV for all the critical transition points (Q0 and Q3). Even with this variation range, we still have a 100 mV gap between logic level transitions for other sources of noise or perturbation.

Noise margins are indeed reduced in quaternary circuits, since four voltage levels share the same supply voltage. However, we may argue from a different perspective: over recent years, supply voltages have been reduced from 5 V to 3.3 V and recently to 1 V. This is a huge reduction in noise margin, and circuits have successfully coped with it.

It is important to highlight that perturbations in quaternary devices should be smaller than in binary ones because of the smaller average voltage transitions; therefore, noise coupling between lines is lower.

In summary, we may see quaternary devices as a specific type of analog device. The knowledge and experience acquired by analog designers, applied to the development of these devices in sub-micron technologies, may be very useful in the effort to develop new multiple-valued devices.

6 Conclusions

This work presents important advances in the development of multi-valued circuits through the implementation of a quaternary look-up table targeting multiple-valued FPGAs. Results show that the proposed structure is competitive with the binary one, with significant reductions in power consumption and propagation delay. The technique proposed in this paper is simpler to implement than previously proposed multiple-valued circuits. Furthermore, as far as we know, no other proposed work is more efficient than our technique when compared to binary circuits.

As future work, we are developing a complete FPGA (logic block, switch matrix, etc.). A functional quaternary FPGA will allow the study of viability and the comparison with current binary circuits. We are also planning to implement our quaternary device in more recent technologies, such as 45 nm and below.

Acknowledgments. This work was supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds and by the FCT project PTDC/EEA-ELC/72933/2006.

References

1. Gupta, A.K., Dally, W.J.: Topology optimization of interconnection networks. IEEE Comput. Archit. Lett. 5(1), 3 (2006)

2. Banerjee, K., Souri, S.J., Kapur, P., Saraswat, K.C.: 3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration. Proceedings of the IEEE 89(5), 602–633 (2001)

3. Li, F., Lin, Y., He, L., Chen, D., Cong, J.: Power modeling and characteristics of field programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(11), 1712–1724 (2005)

4. Singh, A., Marek-Sadowska, M.: Efficient circuit clustering for area and power reduction in FPGAs. In: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, FPGA 2002, pp. 59–66. ACM, New York (2002)

5. da Silva, R., Lazzari, C., Boudinov, H., Carro, L.: CMOS voltage-mode quaternary look-up tables for multi-valued FPGAs. Microelectronics Journal 40(10), 1466–1470 (2009)

6. Dubrova, E.: Multiple-valued logic in VLSI: challenges and opportunities. In: Proceedings of NORCHIP 1999, pp. 340–350 (1999)

7. Gonzalez, A., Mazumder, P.: Multiple-valued signed digit adder using negative differential resistance devices. IEEE Transactions on Computers 47(9), 947–959 (1998)

8. Hanyu, T., Kameyama, M.: A 200 MHz pipelined multiplier using 1.5 V-supply multiple-valued MOS current-mode circuits with dual-rail source-coupled logic. IEEE Journal of Solid-State Circuits 30(11), 1239–1245 (1995)

9. Zilic, Z., Vranesic, Z.: Multiple-valued logic in FPGAs. In: Proceedings of the 36th Midwest Symposium on Circuits and Systems, vol. 2, pp. 1553–1556 (August 1993)

10. Cunha, R., Boudinov, H., Carro, L.: Quaternary look-up tables using voltage-mode CMOS logic design. In: 37th International Symposium on Multiple-Valued Logic, ISMVL 2007, pp. 56–56 (May 2007)

11. Cadence Design Systems Inc.: Virtuoso spectre simulator user guide (2010)


On Line Power Optimization of Data Flow Multi-core Architecture Based on Vdd-Hopping for Local DVFS

Pascal Vivet1, Edith Beigne1, Hugo Lebreton1, and Nacer-Eddine Zergainoh2

1 CEA-Leti, Minatec, Grenoble, France 2 TIMA, Grenoble, France

{pascal.vivet,edith.beigne,hugo.lebreton}@cea.fr, [email protected]

Abstract. With growing integration, power consumption is becoming a major issue for multi-core chips. At system level, per-core DVFS is expected to save substantial energy, provided an adequate control is available. In this paper we propose a local on-line optimization technique to reduce energy in data-flow architectures, thanks to a Local Power Manager (LPM) using Vdd-Hopping for efficient local DVFS. The proposed control is a hybrid global and local scheme which respects throughput and latency constraints. The approach has been fully validated on a real MIMO telecom application using a SystemC platform instrumented with power estimates. Local DVFS brings 45% power reduction compared to idle mode. When local on-line optimization benefits from computation time variations, 30% extra energy savings can be achieved.

Keywords: Low Power, DVFS, VDD-Hopping.

1 Introduction

In today's System on Chip, power consumption is becoming a major issue. Dedicated mechanisms have been proposed in order to reduce both static and dynamic power consumption at different levels, from technology up to system level. At system level, Dynamic Power Management (DPM) techniques are classically used, such as advanced standby modes or efficient Dynamic Voltage and Frequency Scaling (DVFS). The main difficulty of DPM techniques is to design efficient dedicated control up to the application level. Power management is often specific to the low-power design techniques and must take into account both architecture and application.

In future multi-cores, the Globally Asynchronous Locally Synchronous (GALS) paradigm is a natural enabler to help architecture partitioning and facilitate clock and power management [1][12]. In a GALS scheme, each IP unit has its own frequency and communicates asynchronously through a global interconnect. The GALS scheme enables local power management: each IP unit is an independent Voltage and Frequency Island (VFI). This is also commonly called "per-core DPM": further energy savings are obtained, since the power optimum is not limited by the most constrained IP core but can be reached independently on each IP core.


Considering that energy depends quadratically on Vdd, DVFS is the most promising technique in terms of overall energy reduction. Due to the use of external DC-DC converters, today's DVFS techniques are mostly CPU-centric and not applied at IP level. Recently, a low-cost and efficient DVFS technique, called Vdd-Hopping, has been proposed [2][3]. By using only two external voltages and a dynamic voltage selector switch, DVFS can be efficiently offered locally to each IP core.

In this paper, we target heterogeneous data-flow-like architectures, with telecom applications as an example [14]. Regarding the application, execution time variations are decisive. In non-real-time systems, voltage and frequency selection usually consists in a tradeoff between performance and energy. In a real-time system with a data-flow application, timing constraints must be met and are twofold: a throughput constraint for each IP and an overall latency constraint on the whole data flow [4][5][6]. Heuristic algorithms can be used, based on worst-case application scenarios [7][8]. In a heterogeneous architecture using dedicated IP engines, contrary to homogeneous multi-cores, task allocation is static and directly driven by the architecture. In that case, to reduce energy in a multi-application context and benefit from all available dynamic slack time, on-line optimization associated with a fast hardware DPM controller is required [10][11]. The Vdd-Hopping technique was introduced early on by T. Sakurai's group, which proposed software control techniques [16] not yet adapted to heterogeneous hardware architectures.

In this paper, we propose an on-line optimization technique to reduce energy in a data-flow heterogeneous architecture, using a dedicated DPM controller which relies on the efficient Vdd-Hopping technique for local DVFS. This is a hybrid global and local technique, as in [11], which respects throughput and latency constraints using only two voltage/frequency points. The proposed technique has been applied to a real GALS NoC architecture targeting MIMO telecommunication applications [14]. Energy savings have been estimated on a SystemC simulation platform instrumented with power estimates [15]. The outline of the paper is as follows: Section 2 introduces the targeted low-power GALS NoC architecture, and Section 3 describes the proposed Vdd-Hopping control for local DVFS. The local on-line optimization is described in Section 4. Finally, the experimental results are given in Section 5.

2 Low Power GALS NoC Architecture

The low power overall architecture is organized within a complex GALS NoC fully implemented in asynchronous logic [14]. As shown in Figure 1, each synchronous IP unit of the SoC is integrated with advanced low-power mechanisms, such as in [12]. A programmable Local Clock Generator is implemented within each unit to generate a variable frequency F in a predefined applicative range. A local Power Supply Unit (PSU) manages the local unit voltage V, sharing a power switch between a Vdd-hopping technique and a classical MTCMOS technique. The PSU uses two external voltages with two power switches: VHIGH and VLOW which are automatically switched during DVFS phases. The Network Interface (NI) is in charge of communications with respect to the NoC protocol.


Fig. 1. Low Power GALS NoC overall Architecture

The Local Power Manager (LPM) implements the proposed DPM and on-line optimization techniques. The LPM is activated by the NI in a data-flow manner according to NoC traffic and HW tasks. The NoC architecture targets data-flow applications, where task control and complex data flows are handled by the NI. For each executed task, the NI loads a configuration for the IP core and the associated input/output data flows, and then computation starts.

2.1 IP Unit Integration for Power Optimization

Each synchronous IP unit is defined as an independent power domain (using its dedicated local voltage V) and an independent frequency domain (using its dedicated local clock frequency F). Each IP unit can be set in one of 4 power supply modes:

• HIGH mode, local supply voltage V is VHIGH and core clock is on. This is the “nominal” high performance working mode.

• LOW mode, core clock is on, but supply is switched to VLOW. Frequency is lower than nominal, energy per cycle decreases. This is “low power” mode.

• IDLE mode, core clock is off and leakage power is further reduced thanks to VLOW supply voltage. This is the “low-power dormant” mode.

• OFF mode, the unit is switched off when not used in the application, to further reduce the leakage power.

For each unit, all power modes can be programmed through the Network Interface and the Local Power Manager, except the OFF mode which is programmed through top level signals (main CPU).
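A compact way to summarize these modes is as a small data structure; the sketch below (illustrative only, not the actual controller code) captures the supply and clock setting of each mode described above.

```python
# Data-model sketch of the four power-supply modes described above.  The
# supply/clock settings mirror the text; names and structure are illustrative.

from enum import Enum

class PowerMode(Enum):
    HIGH = ("VHIGH", True)    # nominal high-performance mode
    LOW  = ("VLOW",  True)    # lower frequency, lower energy per cycle
    IDLE = ("VLOW",  False)   # clock gated at VLOW: low-power dormant mode
    OFF  = (None,    False)   # unit switched off when unused by the application

    def __init__(self, supply, clock_on):
        self.supply = supply
        self.clock_on = clock_on

print(PowerMode.IDLE.supply, PowerMode.IDLE.clock_on)   # VLOW False
```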

2.2 Local DVFS Using Two Voltage Set Points

In order to perform efficient local Dynamic Voltage Scaling (DVS), the main objective is to avoid low-level software control as much as possible, to ensure a minimal latency cost. Within the Power Supply Unit, a hardware controller called Vdd-Hopping automatically switches between VHIGH and VLOW (Figure 2).


Fig. 2. LPM control, Vdd-Hopping sequence example (voltage Vhigh/Vlow and frequency Fhigh/Flow waveforms under LPM control)

During smooth DVFS transitions, the synchronous IP can continue its own computations or communications. To obtain an average value between VHIGH and VLOW, the LPM controls the target performance by switching between these two values. The power efficiency of the proposed Vdd-Hopping [2] is more than 95%: at a given VHIGH or VLOW voltage there are no losses except those of a standard power switch, and energy is lost only during the transitions (less than 100 ns). There is no latency cost and no need for real-time software; fast and robust transitions are ensured by hardware. The Vdd-Hopping mechanism has been implemented and validated in a 65 nm test chip [13], which proved its high reliability. In order to minimize energy per operation, the IP unit should run at the maximum achievable frequencies fh and fl. The LPM objective is then to spend more time at VLOW to decrease energy, while respecting timing constraints. The proposed hybrid local and global DVFS principle and the associated LPM schemes are introduced in the next section.

3 Local DVFS Control

On data-flow architectures with a latency constraint on the whole chain, a global management is required to ensure the deadline. In order to guarantee latency, in the presence of dynamic variations of the computation on each core, centralized or software control cannot be used, since it would not respond fast enough to handle all the dynamic variations. We choose a Worst Case Execution Cycle (WCEC) based static management to select a set point for each task. A heuristic-based algorithm, as in [7], can be used. To benefit from the dynamic slack time induced when the actual number of cycles to complete a task is less than the WCEC, a local control is implemented. Such a hybrid (local and global) approach has also been adopted in [11].

Based on the worst case, a global power manager (such as the host processor) dispatches the available latency among tasks. Hence, each core is given a timeslot to complete its task. For each IP core, its Local Power Manager (LPM) controls the Vdd-Hopping by spreading the computation over the given timeslot. The LPM is activated by the NI in a data-flow manner according to NoC traffic and HW tasks. Two control schemes are proposed and presented below, with NI task or IP core task synchronization. One must note that the NoC bandwidth must be sufficient to tolerate uncorrelated IP frequency variations and to smooth applicative traffic, a hypothesis which holds for the addressed application and the corresponding NoC (see Section 5).


3.1 NI Task Synchronization

The first proposed solution interacts with the NoC platform programming model to control the power modes, in a generic way. As soon as a new task is loaded in the NI, the Vdd-Hopping transitions can start. The LPM control of the IP is thus activated in a data-flow manner according to the NoC incoming traffic and task.

Given the WCEC Nwcec, the number of cycles to spend at high voltage, Nh, and at low voltage, Nl, can be derived from the timeslot τ given for the task. Let fh and fl be the maximum available frequencies at the high and low set points, respectively; we have:

τ = Nh/fh + Nl/fl = Nh/fh + (Nwcec − Nh)/fl    (1)

For the task computation, the number of cycles at high and low level is given by:

Nh = fh/(fh − fl) × (Nwcec − τ·fl)   and   Nl = Nwcec − Nh    (2)

The timeslot is equivalent to a mean frequency ftarget = Nwcec/τ.
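The following Python sketch implements this static split; the frequencies and the timeslot used in the example are arbitrary illustrative values, not figures from the paper.

```python
# Illustrative implementation of Eqs. (1)-(2): split the worst-case cycle budget
# Nwcec between the high (fh) and low (fl) set points so that the task exactly
# fills its timeslot.  All numeric values in the example are made up.

def split_budget(n_wcec, timeslot, f_h, f_l):
    """Return (Nh, Nl): cycles to run at fh and fl within `timeslot` seconds."""
    n_h = f_h * (n_wcec - timeslot * f_l) / (f_h - f_l)
    n_h = min(max(n_h, 0.0), n_wcec)     # clamp to all-low / all-high if needed
    return n_h, n_wcec - n_h

n_h, n_l = split_budget(n_wcec=100e3, timeslot=150e-6, f_h=1e9, f_l=500e6)
print(n_h, n_l)          # 50000.0 cycles at Vhigh, 50000.0 cycles at Vlow
print(100e3 / 150e-6)    # equivalent mean frequency ftarget ~ 667 MHz
```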

Fig. 3. NI task synchronization (task-loaded and core-active timelines with Vhigh–Vlow hopping)

The LPM switches periodically from high to low while the task is loaded in the NI, so that the target frequency is reached while the core is actually computing (Figure 3). If the hopping frequency is increased while keeping the Nh to Nl ratio, the mean frequency is not modified and the NoC traffic is smoothed. Since extra energy is consumed during transitions, a tradeoff is required between the number of transitions, NoC traffic regularity and energy.

Lastly, if the targeted frequency is lower than the fastest frequency at Vlow, the frequency is simply decreased (this is DFS at Vlow). Finally, as seen in Figure 3, task loading in the NI may not match the actual computation phase, because the IP core may wait for additional data before starting. In that case, extra energy could be saved thanks to a tighter control.

3.2 Core Task Synchronization

Better control is obtained if the LPM is synchronized with the actual IP core computation. In this case, a dedicated signal must be generated by the IP core to indicate its own activity/inactivity. The number of cycles Nh and Nl are still calculated as described in Section 3.1.

Fig. 4. Core task synchronization

Instead of being controlled by the NI task activity, the LPM performs the Vdd-Hopping transitions according to the IP core task activity. An atomic task is defined when the number of cycles and the number of input/output data are known. In order to balance the frequency of hops, the LPM is able to perform switching over several atomic tasks or within a single task.

In case the actual number of cycles of the atomic task is less than the worst case, it is possible to start the computation at low level [17]. In Figure 4, the NI task consists of five atomic tasks, with only one transition low to high done within each task. The unit gets back to low level as soon as the task is completed; most of the computation is spent at low level.

4 Local On-Line Optimization

The Actual number of Execution Cycles (AEC) needed by a task may be less than the WCEC. The computation time may depend on the data, the communication time is variable, and the architecture can have unpredictable events such as cache misses, all leading to dynamic slack time. The LPM can exploit this dynamic slack time by reducing the speed of the unit. Even though it is possible to predict the number of cycles of the next task from the execution history, such an approach may not meet the timing constraints: a prediction mistake will induce a timing violation. We rather assume that the current task still runs at the WCEC and benefit from the dynamic slack time of the previous task. The cycle budgets at high and low levels are updated according to the remaining cycles at high and low levels.

Fig. 5. Local on-line optimization principle (cycle budgets Nh, Nl and updated budgets N'h, N'l over consecutive tasks k−1, k, k+1)


Figure 5 presents the on-line optimization principle. The first chronogram shows the LPM control without on-line optimization; the second uses it. The first task, k−1, runs at the WCEC, while the following tasks do not use as many cycles. In this case, the third task is slowed down while still respecting the deadline.

When a task k is over and cycles are remaining, respectively nh at high level and nl at low level, the unit switches to low level and keeps on counting the number of elapsed cycles. Hence, when the next task starts, the remaining cycles nh and nl reflect the dynamic slack time. The timeslot for the following task is extended to:

τ + τ' = τ + nh/fh + nl/fl    (3)

The updated number of cycles Nh’ is then given by:

N'h = fh/(fh − fl) × (Nwcec − (τ + τ')·fl) = Nh − fl/(fh − fl)·nh − fh/(fh − fl)·nl    (4)

Thus, before the next task (k+1) begins, we compute its parameters with the extended time:

N'h = Nh − fh/(fh − fl) × (fl/fh·nh + nl),    N'l = Nwcec − N'h    (5)

The above equations provide the main principles of the on-line optimization algorithm. In order to implement such control efficiently in the hardware LPM controller, some simplifications are required. The LPM mainly requires two counters to keep track of the number of elapsed cycles at high and at low voltage. In order to keep the hardware simple, the computation of the two ratios fh/(fh − fl) and fl/fh must either be done in software or be simplified so that it can be done in hardware. The new budget N'h should not be underestimated; otherwise the deadline might be violated. If those ratios are underestimated, only the efficiency is reduced. Assuming fh = 2·fl, we obtain the following simplified equations for the updated cycle budgets at VHIGH and VLOW with respect to the dynamic slack time of the previous task:

N'h = Nh − 2·(nl + 0.5·nh),    N'l = Nwcec − N'h    (6)

The LPM controller is then programmed with two input parameters: the timeslot τ and the target Nwcec. It implements two counters, and it can be realized as a simple state machine to control any of the AEC, NI or CORE modes.

The LPM controller has been fully modeled in SystemC. From the algorithmic complexity and the number of registers, the LPM is estimated to be less than 2Kgates. The area cost of the PSU including the Vdd-Hopping is 3% of the core area for a 200Kgates IP core.
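To make the bookkeeping concrete, the sketch below implements the simplified update of Eq. (6) around the two cycle counters; the class name, parameters and numbers are illustrative assumptions, not the authors' RTL or SystemC model.

```python
# Sketch of the on-line budget update of Eq. (6), assuming fh = 2*fl as in the
# text.  n_h_left / n_l_left are the leftover cycles (the two counters) when the
# previous task finished early; they shrink the next task's high-voltage budget.
# Class name, fields and example numbers are illustrative only.

class LocalPowerManager:
    def __init__(self, n_wcec, n_h_nominal):
        self.n_wcec = n_wcec              # worst-case cycle count of a task
        self.n_h_nominal = n_h_nominal    # Nh from the static (WCEC-based) split

    def next_budget(self, n_h_left, n_l_left):
        """Eq. (6): N'h = Nh - 2*(nl + 0.5*nh), N'l = Nwcec - N'h."""
        n_h_new = max(self.n_h_nominal - 2 * (n_l_left + 0.5 * n_h_left), 0)
        return n_h_new, self.n_wcec - n_h_new

lpm = LocalPowerManager(n_wcec=100_000, n_h_nominal=50_000)
print(lpm.next_budget(0, 0))            # no slack:        (50000, 50000)
print(lpm.next_budget(10_000, 5_000))   # slack available: (30000.0, 70000.0)
```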


5 Case Study on a 3GPP LTE Telecom Application

The targeted application and circuit [14] are based on the 3GPP LTE telecommunication protocol; we focus on the baseband demodulation of the downlink. Once the application is mapped onto the NoC architecture, it is divided into several sequential phases; a whole frame consists of 14 OFDM symbols. There are three main phases, and each phase is separated by memory buffering. The IP core tasks are periodic and sequenced in a data-flow manner (Figure 6).

Fig. 6. Task mapping on the Low Power GALS NoC architecture (OFDM demodulation, CFO/channel estimation, SME memory engines, MIMO decoding, deinterleaving/demodulation, turbo decoding, MC8051 and ARM control units around the NoC interfaces)

The GALS NoC architecture is built with dedicated hardware engines, such as a TurboCode decoder, RX/TX bit engines, OFDM modulation/demodulation, MEP engines (advanced configurable VLIW-like cores) and, finally, some SMEs (Smart Memory Engines) used to handle memory buffers. Each IP core is encapsulated with a PSU, an LPM and an LCG providing 16 frequencies in the [400 MHz–1 GHz] range, with additional scaling factors.

5.1 Simulation Platform and Applicative Scenarios

The simulation platform used to qualify the energy savings is based on an existing timed SystemC/TLM platform. Power consumption has been included in the simulation platform, along with DVFS modeling [15]. The SystemC model takes into account leakage current, dynamic power, the consumption of inactivity phases, and the variation of energy per operation due to Vdd-Hopping. For each IP block, power consumption values have been extracted from post place-and-route gate-level simulation using the PrimePower® tool. As a result, fast power estimation and exploration at high level can be performed on a real application. The tool provides power profile traces (in VCD format) and power statistics (per core, per mode, ...).
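The underlying bookkeeping idea — accumulate energy as the per-mode power times the time spent in each mode, with per-mode powers calibrated from gate-level simulation — can be sketched as follows; the class and the numbers are illustrative and are not the authors' SystemC instrumentation.

```python
# Sketch of transaction-level energy accounting: energy is accumulated as
# (power of the current mode) x (time spent in it), with per-mode powers
# calibrated offline from gate-level simulation.  Values are illustrative.

class EnergyProbe:
    def __init__(self, power_mw_per_mode):
        self.power = power_mw_per_mode     # e.g. {"HIGH": 3.0, "LOW": 1.1, "IDLE": 0.1}
        self.energy_mj = 0.0

    def account(self, mode, duration_ms):
        self.energy_mj += self.power[mode] * duration_ms * 1e-3   # mW*ms -> mJ

probe = EnergyProbe({"HIGH": 3.0, "LOW": 1.1, "IDLE": 0.1})
probe.account("HIGH", 0.2)    # 0.2 ms spent at the high set point
probe.account("LOW", 0.5)
probe.account("IDLE", 0.3)
print(probe.energy_mj)        # accumulated energy in mJ
```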

For the targeted 3GPP-LTE application, the global constraints (timeslot and Nwcec) have been derived manually for each IP, to enforce the throughput and latency constraints. For all the proposed LPM scenarios (Table 1), except the first two, IDLE mode is used as soon as the end of a task is reached. All scenarios respect the application timing constraints, except the first one (Low level), which is given as a reference.


Table 1. Power Mode Scenarios

Low         LOW mode at the maximal achievable flow
High        HIGH mode at the maximal achievable fhigh
On/Off      HIGH mode at fh max, and IDLE when tasks complete
DFS         HIGH mode using only Dynamic Frequency Scaling
DVFS NI     DVFS synchronized with the NI
DVFS Core   DVFS synchronized with the CORE
DVFS AEC    DVFS synchronized with the CORE, plus on-line optimization using the Actual Execution Cycle count

5.2 Obtained Energy Savings

For each LPM scenario, power profiling has been performed; the achieved energy savings are presented per IP in Figure 7.

Fig. 7. Energy consumption per IP core (in mJ) for each power-mode scenario

The On/Off scenario exhibits substantial energy savings thanks to the efficiency of the IDLE mode (recall that IDLE is implemented with IP clock gating at Vlow). When only DFS is used, there is almost no gain, since the computation is only spread over time, reducing peak power but not energy. When DVFS is used, we observe that the energy savings clearly depend on the core profile. For under-constrained cores (the trx_ofdm cores) with a low target frequency, DVFS enables high energy savings compared to the On/Off scenario. Synchronization with task loading is relevant, as these units do not spend time waiting for data; they have a steady number of computation cycles, so on-line optimization brings no benefit there. Synchronization of DVFS with the core computation brings benefits when the IP cores wait a long time for incoming data (mep_10, mep_21s). For more constrained cores (mep_22, mep_23) with a high target frequency, local optimization is relevant when they require fewer cycles than the predicted WCEC to complete their task. For tasks with a target frequency close to fh, up to 30% energy savings have been achieved compared to simple core synchronization. We obtain 45% extra energy savings with DVFS AEC compared to the On/Off scenario.


Fig. 8. Energy savings for NoC, SME and IP cores (total energy in mJ per scenario, split into NoC, total IP and total SME)

Figure 8 gives the consumption of the whole SoC, considering the HW IPs, the SME IPs and the NoC. The NoC represents only 5% of the total power consumption and is roughly the same for each scenario. The advanced IDLE mode of the On/Off scenario brings a 35% power reduction on the whole chip. As a global result, the power reductions obtained on the IP cores (Figure 7) are mitigated by the inefficient power reduction on the Smart Memory Engines. Because SMEs do not actually perform computation but must run fast enough to handle data traffic, a power control based on traffic arrival, as in [10], should be effective for them. Finally, the total chip power budget is reduced from 340 mW at full speed (High mode) to 160 mW using the DVFS scheme with on-line optimization.

6 Conclusions

In this paper, we presented a new Local Power Manager unit to reduce energy in a data-flow heterogeneous architecture by using the Vdd-Hopping technique. Vdd-Hopping is an efficient DVFS technique with only two set points and zero overhead, which can easily be integrated for per-core DVFS. In the proposed LPM, we use a hybrid local and global scheme to enforce timing constraints, an LPM synchronization scheme with the core computation to benefit from all inactivity phases, and an on-line optimization technique to distribute dynamic slack time. Energy savings have been qualified on a real application, using a SystemC platform instrumented with power estimates. Results show that the advanced idle mode achieves significant energy savings (35%). As expected, DFS achieves little energy saving. DVFS reduces energy by 45% compared to the IDLE mode. Finally, when the number of cycles per task varies, 30% additional energy savings are achieved by the local on-line optimization. Future work will address the design of an efficient DVFS control for SMEs, the RTL design of the LPM, as well as automatic HW task profiling.

References

1. Bhunia, S., Datta, A., Banerjee, N., Roy, K.: GAARP: A Power-Aware GALS Architecture for Real-Time Algorithm-Specific Tasks. IEEE Transactions on Computers, Special Issue on Low-Power Design (99), 752–766 (June 2005)

2. Miermont, S., Vivet, P., Renaudin, M.: A Power Supply Selector for Energy- and Area-Efficient Local Dynamic Voltage Scaling. In: Azémard, N., Svensson, L. (eds.) PATMOS 2007. LNCS, vol. 4644, pp. 556–565. Springer, Heidelberg (2007)

3. Truong, D., et al.: A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling. In: Proc. Symposium on VLSI Circuits (June 2008)

4. Mishra, R., Rastogi, N., Zhu, D., Mosse, D., Melhem, R.: Energy aware scheduling for distributed real-time systems. In: Proc. of Parallel and Distributed Processing Symposium (April 2003)

5. Watanabe, R., Kondo, M., Imai, M., Nakamura, H., Nanya, T.: Task Scheduling under Performance Constraints for Reducing the Energy Consumption of the GALS Multi-Processor SoC Design. In: DATE 2007 (2007)

6. Xian, C., Lu, Y., Li, Z.: Energy-Aware Scheduling for Real-Time Multiprocessor Systems with Uncertain Task Execution Time. In: DAC 2007, pp. 664–669 (2007)

7. Grosse, P., Durand, Y., Feautrier, P.: Methods for Power Optimization in SoC-based Data Flow Systems. ACM Transactions on Design Automation of Electronic Systems (TODAES) 14(3), Article No. 38 (2009)

8. Niyogi, K., Marculescu, D.: Speed and voltage selection for GALS systems based on voltage/frequency islands. In: Proceedings of ASP-DAC 2005 (2005)

9. Puschini, D., Clermidy, F., Benoit, P., Sassatelli, G., Torres, L.: Temperature-Aware Distributed Run-Time Optimization on MP-SoC Using Game Theory. In: Proceedings of the IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2008, pp. 375–380 (2008)

10. Alimonda, A., Acquaviva, A., Carta, S., Pisano, A.: A Control Theoretic Approach to Run-Time Energy Optimization of Pipelined Processing in MPSoCs. In: Proceedings of Design Automation and Test in Europe, DATE 2006 (2006)

11. Maxiaguine, A., Chakraborty, S., Thiele, L.: DVS for buffer-constrained architectures with predictable QoS-energy tradeoffs. In: 3rd International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2005, pp. 111–116 (2005)

12. Beigné, E., Clermidy, F., Miermont, S., Vivet, P.: Dynamic Voltage and Frequency Scaling Architecture for Units Integration within a GALS NoC. In: Proceedings of NOCS 2008 (2008)

13. Beigné, E., et al.: An Asynchronous Power Aware and Adaptive NoC based Circuit. IEEE Journal of Solid-State Circuits 44, 1167–1177 (2009)

14. Clermidy, F., et al.: A 477mW NoC-Based Digital Baseband for MIMO 4G SDR. In: Proceedings of the IEEE International Solid-State Circuits Conference, ISSCC 2010 (2010)

15. Lebreton, H., Vivet, P.: Power Modeling in SystemC at Transaction Level, Application to a DVFS Architecture. In: Proc. of Int. Symposium on VLSI, ISVLSI 2008, pp. 463–466 (2008)

16. Lee, S., Sakurai, T.: Run-time Voltage Hopping for Low-Power Real-time Systems. In: Proc. of 37th Design Automation Conference, DAC 2000, pp. 806–809 (June 2000)

17. Zhang, Y., Lu, Z., Lach, J., Skadron, K., Stan, M.R.: Optimal procrastinating voltage scheduling for hard real-time systems. In: DAC 2005, pp. 905–909 (June 2005)


Self-Timed SRAM for Energy Harvesting Systems

Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev

Microelectronic System Design Group, School of EECE, Newcastle University, Newcastle upon Tyne, NE1 7RU, England, United Kingdom

{Abdullah.baz,delong.shang,fei.xia,alex.yakovlev}@ncl.ac.uk

Abstract. Portable digital systems tend to be not just low power but power efficient, as they are powered by low-capacity batteries or energy harvesters. Energy harvesting systems tend to provide nondeterministic, rather than stable, power over time. Existing memory systems use delay elements to cope with the resulting problems under different Vdds. However, this introduces huge performance penalties, as the delay elements need to follow the worst-case timing assumption under the worst environment. In this paper, the latency mismatch between memory cells and the corresponding controller using typical delay elements is investigated and found to be highly variable for different Vdd values. A Speed Independent (SI) SRAM memory is then developed which can help avoid such mismatch problems. It can also be used to replace typical delay lines in bundled-data memory banks. A 1 Kb SI memory bank is implemented based on this method and analysed in terms of latency and power consumption.

1 Introduction

With the wide advancement of remote and mobile fields such as wireless-sensor-based applications, microelectronic system design is becoming more energy conscious. This is mainly because of limited energy supply (scavenged energy or low battery) and excessive heat, with the associated thermal stress and device wear-out. At the same time, the high density of devices per die and the ability to operate with a high degree of parallelism, coupled with environmental variations, create almost permanent instability in the voltage supply (cf. Vdd droop), making systems highly power variant. In the not so distant past, low-power design was targeted merely at the reduction of capacitance, Vdd and switching activity, whilst maintaining the required system performance. In many current applications, the design objectives are changing to maximizing the performance within the dynamic power constraints imposed by the energy supply and consumption regimes. Such systems can no longer be regarded simply as low-power systems, but rather as power-adaptive or power-resilient systems.

Normally, this kind of system has the following properties: 1) it is power efficient, not just low power; 2) it has a non-deterministic supply voltage (probably within a known range, which tends to be low) that varies over time. Recently, a possible solution has been proposed for this kind of system: a power-elastic system which treats power and energy as dynamic resources [13]. For example, when power is not sufficient, some of the subsystems can either be powered off or be executed under lower supply voltages (Vdds); when power is plentiful, the system can provide high performance. This means


that all tasks in a system are managed based on the power resources, performance requirements, and thermal constraints.

When systems are subjected to varying environmental conditions, with voltage and thermal fluctuations, timing tends to be the first issue affected. Most systems are still designed with global clocking and the design is often made overly pessimistic to avoid failures due to Vdd (timing) variations.

Along with the advent of nanometre CMOS technology, the continuation of the scaling process is vital to the future development of the digital industries. The International Technology Roadmap for Semiconductors (ITRS) [1] predicts poorer scaling for wires than for transistors in future technology nodes. This makes the above worst-case timing assumption even worse, along with power supply voltage drooping [17].

Asynchronous techniques may provide solutions to all these problems. Unlike synchronous systems, asynchronous designs can completely remove global clocking. As a result, asynchronous designs may be more tolerant to timing variations.

The ITRS also predicts that asynchrony will increase with the complexity of on-chip systems. The power, design effort, and reliability cost of global clocks will also make increased asynchrony more attractive. Increasingly complex asynchronous systems or subsystems will thus become more prevalent in future VLSI systems.

In order to fully realize the potential of asynchrony in an environment of variable supply voltage and latencies, system memories may need to be asynchronous together with the computation parts. In this paper, we concentrate on asynchronous SRAM. Our main contributions include: analysing the behaviour of latency in SRAM memory systems under different Vdds, developing an asynchronous SRAM memory, and proposing a new method to build delay elements for bundled SRAM memory. We develop a fully Speed Independent (SI) [16] SRAM cell and a bundled SRAM bank technology which uses such SI SRAM cells as delay elements.

The remainder of the paper is organized as follows. Section 2 introduces existing asynchronous SRAM memory structures. Section 3 analyses the effects of different Vdds on the latency of the SRAM memory and its controller. Section 4 gives our asynchronous SRAM solutions and implementations, and proposes a new method to build SI delay elements for SRAM memory. Section 5 demonstrates a memory bank and its measurements in terms of latency and power consumption. Section 6 gives the conclusions and future work.

2 Existing Asynchronous SRAM Memory

Several asynchronous SRAM methods have been reported [5,6,7,8,9].

In [5], a methodology was developed for designing and verifying low power asynchronous SRAM, and an SI SRAM cell was alluded to. This memory cell differs from the conventional six-transistor cell [15] and provides the possibility of checking that the data has been stored in memory. The paper, however, does not explain how the cell needs to be controlled, nor does it include a controller design.

[6,7,8,9] focus on asynchronous SRAM memory designs. [6] presents a four-phase handshake asynchronous SRAM design for self-timed systems. It proposes an SI circuit to realize completion detection for read operations. However, the paper claims that completion detection is not suitable for write operations.


Because the critical circuit is the memory cell, it is said to be impractical to add a monitoring sensor to each memory cell to generate completion detection signals. Instead, the paper proposes a delay-based solution, which uses several delay lines for different delay regions to account for variation. [8] presents an asynchronous SRAM with an SI implementation of reading. Writing works under relative timing assumptions whereby the control path takes more transitions than the data path; this is implemented with circuits which behave similarly to classical delay elements such as chains of inverters. The other works [7,9] abandon SI altogether and adopt bundled data methods based on delays. Noting that the delay of the inverter chains commonly used in conventional SRAM to generate the required timings for the precharge and data access phases hardly matches the timing variations of the bit line activities across a wide range of supply voltages [11,12], the authors of [9] used a duplicated column of memory cells in place of inverter chains to serve as delay elements. Although in theory this offers potentially correct delay matching for memory under variable Vdd, so long as process variation [3] is kept under control, the method requires voltage references for precharge and data sensing. The voltage reference is assumed to be adjustable to accommodate the process, voltage, and temperature conditions.

In summary, most existing solutions work under worst-case timing assumptions, and some of them also require adjustable and known reference voltages. However, in the energy harvesting environment there may not be any stable reference voltages in the system at all, so anything based on comparators will not work. All voltages in the system may be non-deterministic, and all delays may therefore be non-deterministic.

3 Latency Investigation on SRAM Cells under Different Vdds

SRAM memory is constructed from SRAM cells, address decoders, a precharge driver, a write driver, a read driver, and a controller. Although different SRAM cell structures exist, here we focus only on the simplest 6T cell [15], which offers the best prospect for use in energy harvesting systems.

Normally, memory works based on timing assumptions. However, energy harvesting systems work under a wide range of non-deterministic power. It is therefore necessary to know how timing assumptions are affected under different Vdds.

Here we investigate the difference between the latency of the SRAM (including the bit line driver) and that of its corresponding controllers, typically implemented with inverter-chain delay elements, under different Vdds. This potential mismatch has already been pointed out in [11,12]. [11] concludes that the latency of inverter chains gets progressively worse with reducing Vdd. [12] concludes that the bit line drive time becomes a significantly greater percentage of the total access time as Vdd is reduced. But do both types of delay increase at the same rate under the same Vdd reduction?

To emphasize the mismatch, we directly show the difference between the reading/writing times of the memory and the latency of the delay elements under various Vdds in the right hand side of Figure 1.

The experiment bundles an SRAM with one cell and an inverter chain, with both operating under the same variable Vdd, as shown in the left hand side of Figure 1. A start signal triggers a reading/writing operation of the cell. This start signal is also connected to the inverter chain as its input signal.


Fig. 1. Investigation on delay elements in various Vdd: Block diagram (left) and Results (right)

We measure the number of inverters the start signal has passed through when the reading/writing operation finishes. In reading, under the lowest Vdd the memory is about 3 times slower than under the normal Vdd in terms of the number of inverters; in writing, it is about 2 times slower. Interestingly, this mismatch is quite small when Vdd is above 700mV, which coincidentally was the lowest voltage investigated in some of the previous work (e.g. [8]). In other words, both reading from and writing to memory become slower at a much higher rate than inverter chains when Vdd is reduced below 700mV, and inverter-chain delays do not track memory operation delays when both are under the same variable Vdd. This demonstrates that using standard inverter chains for memory delay bundling would require precise design-time delay characterization and conservative worst-case provisions, which could be 2-3 times more wasteful in some cases. Other conventional methods, such as schedulable or programmable delay chains, will not be useful without knowledge of the Vdd in real time, which we do not assume.
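The consequence for a bundled design can be illustrated with a back-of-the-envelope calculation (a sketch of our own; the delay values below are purely illustrative and are not measurements from Figure 1): a fixed inverter chain must be sized for the worst ratio of memory delay to inverter delay over the whole Vdd range, and at nominal Vdd the resulting bundle then waits far longer than the memory actually needs.

```python
import math

# (Vdd, memory access delay, single-inverter delay) -- illustrative values only
profile = [
    (1.0,  1.0, 0.10),   # nominal Vdd: memory delay ~ 10 inverter delays
    (0.7,  3.0, 0.28),   # ratio still ~ 11 inverters
    (0.3, 40.0, 1.60),   # deep sub-threshold: ratio ~ 25 inverters
]

# The chain length is fixed at design time, so it must cover the worst ratio.
chain_len = max(math.ceil(mem / inv) for _, mem, inv in profile)

for vdd, mem, inv in profile:
    margin = chain_len * inv / mem   # how much longer the bundle waits than needed
    print(f"Vdd={vdd:.1f} V: chain of {chain_len} inverters waits "
          f"{chain_len * inv:.2f} units for a {mem:.2f}-unit access "
          f"({margin:.1f}x margin)")
```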

4 Asynchronous SRAM Solutions

The characteristics of energy harvesting systems lead to non-deterministic Vdd and delays across the entire system. To deal with this, it is possible to employ asynchrony in the form of memory bundling or completion detection.

For bundling, the above discussion has established that normal delay elements built from inverter chains are unsuitable for memory. A natural extension is to use dummy SRAM cells as delay elements [9], but that method carries too many assumptions and requirements, such as known and adjustable reference voltages, which may not be available in energy harvesting systems.


Fig. 2. Intuitive SI SRAM cell (a), write driver (b), and standard 6T cell (c)


In this section, two fully Speed Independent (SI) SRAM solutions are proposed. SI circuits are not affected by gate delays, but wire delays are assumed to be zero or very small. This is generally not a problem for circuits of small size, such as an individual 6T SRAM cell. However, fully SI solutions for memory banks can be expensive in terms of power and circuit size, and also reduce performance [16]. A new method in which an asynchronous SRAM memory is bundled with SI SRAM cells serving as delay elements is proposed as an alternative.

4.1 Intuitive Speed Independent SRAM

As discussed in [6], reading completion detection can be built by monitoring the bit lines. For a 6T cell (Figure 2 (c)), in reading, the precharge pulls the two bit lines up to high. The read then sets WL high to open the two pass transistors, after which one bit line is discharged to low. This means that the data is ready for reading.

However, the writing operation writes each bit of data to its corresponding cell, and it is impractical to monitor all cells. Instead, we still monitor the bit lines. Figure 2 (a) shows a straightforward SI SRAM cell based on the normal 6T cell. It duplicates the bit lines and uses six extra transistors to control the two discharge channels. Reading completion can be checked in the same way as for the normal 6T cell. To check writing completion, the writing operation is arranged as: 1) precharge the four bit lines to high; 2) enable the write data on BL and BLb; 3) set WL high to write the data into the cell; 4) monitor CD and CDb; 5) when one of them goes low, the write is done. The write driver used is shown in Figure 2 (b).

After the four bit lines are precharged to high, the write driver is enabled. One of BL and BLb is low and the other is floating. If the new data is the same as the data stored in the cell, for example D=1, CD will be discharged (Qb goes to CD). If the new data and the data stored inside the cell are not the same, for example Q=1 and D=0, BL is low and CDb is discharged only once Qb goes high. In this situation, BL is low and is written to Q, but only after Q has propagated to Qb is the discharging path opened. CD or CDb being discharged means that the write is finished. However, this SI SRAM is impractically large and power hungry. It may also cause complicated write contention.

4.2 More Practical Speed Independent SRAM

In fact, the above proposed SI SRAM introduces a read into the write operation, with the execution order "precharging, writing, reading". However, unlike the normal read operation, it uses the duplicated bit lines as a read port to guarantee that the write data has been stored in the cell, and it has the problems discussed above.

We optimize this completion detection method based on ideas borrowed from [14]. By changing the execution order of the write operation to "precharging, reading, writing", the duplicated bit lines in Figure 2 (a) can be removed. The normal 6T SRAM cell in Figure 2 (c) can be used instead, with considerable savings, resulting in a new SI SRAM based on the standard 6T SRAM cell and an intelligent controller.

SRAM cells depend on control signals. In existing asynchronous SRAMs, the control signals PreCharge, WL, and WE are issued based on timing assumptions.


Fig. 3. Block diagram of the proposed SI RAM

An intelligent controller is designed to manage these control signals based on the new execution order. To completely remove timing assumptions, Delay Insensitive (DI) circuits would be the best choice. However, DI circuits are limited in practice [2]; SI circuits suffice here. The block diagram of the controller is shown in Figure 3.

Two handshake protocols ((Wr,Wa) and (Rr,Ra)) connect with the processing unit and three protocols ((Pre,Dn), (WL,Dn), and (WE,Dn)) connect with the memory system. The signals (Wr,Wa) are the writing request and acknowledgement. The (Rr,Ra) pair is the reading request and acknowledgement. The (Pre,Dn) handshake is the precharge request and done. “WL” and “WE” are defined in Figure 2. All “Dn” signals are hidden inside the SI controllers.

Reading: Rr+ → Pre− → (BL,BLb)=(1,1) → Pre+ → WL+ → (BL,BLb)=(1,0) or (0,1) → Ra+ → Rr− → WL− → Ra−

Writing: Wr+ → Pre− → (BL,BLb)=(1,1) → Pre+ → WL+ → (BL,BLb)=(1,0) or (0,1) → WE+ → (Q,Qb)=(BL,BLb) → Wa+ → Wr− → WL− → WE− → Wa−

Fig. 4. STG specifications

The STG specifications of the reading and writing operations are shown in Figure 4. The bit lines are monitored to form a "Dn" signal. For example, after precharging is triggered, the "Dn" signal is generated when (BL,BLb) equals (1,1).
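To make the control flow of Figure 4 easier to follow, the sketch below (a behavioural model of our own in Python, not the gate-level controller of Figure 5) walks through the two cycles as ordered event sequences; the "wait" steps stand for the Dn-style completion conditions derived from the bit lines, and the signal names follow the paper.

```python
READ_CYCLE = [
    "Rr+", "Pre-", "wait (BL,BLb)=(1,1)",             # precharge both bit lines high
    "Pre+", "WL+", "wait (BL,BLb)=(1,0) or (0,1)",    # one bit line discharges: data ready
    "Ra+", "Rr-", "WL-", "Ra-",
]

WRITE_CYCLE = [
    "Wr+", "Pre-", "wait (BL,BLb)=(1,1)",             # precharge
    "Pre+", "WL+", "wait (BL,BLb)=(1,0) or (0,1)",    # read phase exposes the stored value
    "WE+", "wait (Q,Qb)=(BL,BLb)",                    # write until the cell matches the bit lines
    "Wa+", "Wr-", "WL-", "WE-", "Wa-",
]

def run(cycle, bitline_event):
    """Step through a cycle; 'wait' steps block until the memory reports the condition."""
    for ev in cycle:
        if ev.startswith("wait"):
            bitline_event(ev)     # in hardware, this is the bit-line monitoring logic
        print(ev)

run(READ_CYCLE, lambda cond: print(f"  ...memory reports {cond[5:]}"))
```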

We combine the two STG specifications. The controller shown in Figure 5 is obtained by optimizing the Petrify solution of the combined specification.

Initially, Wr, Rr, x2, and x3 are 0, 0, 1, 0. Consequently Wa, Ra, PreCharge, WL, WE, x1, x5, and x6 are 0, 0, 1, 0, 0, 0, 1, 0. x4 initially holds a "don't care" value.

We use the writing operation as an example to show how the controller works. After the address and data are ready, the Wr signal is issued. Wr goes through gate 7 and then on to gate 10. As x2 is 1, x1 becomes 1, which drives PreCharge to 0. The low PreCharge signal opens the P-type transistors in the precharge drivers. PreCharge also goes to the SR latch formed by gates 6 and 8, resetting the latch while PreCharge is low. After the bit lines are 1 and the SR latch is reset, x1 changes to 0 and PreCharge is then removed. After PreCharge is removed, WL is generated, which opens the pass transistors in the 6T cell, and the data stored in the cell is read onto the bit lines. This makes x4 equal to 1. As the SR latch has been reset, x6 becomes 1, and then WE becomes 1, which opens the write driver. If the new data is the same as the data stored in the cell, either (D,BL)=(1,1) or (Db,BLb)=(1,1), Wa is generated to notify the data processing unit that the data has been written into the cell. If, for example, the new data is 1 and the stored data is 0, then after the write driver is opened BLb goes low,


Qb is discharged to 0, and Q is charged to 1. That 1 transfers to BL, after which the write is finished. After Wa is generated, Wr is removed, and only after the controller has returned to its initial state is Wa withdrawn, ready for new reading/writing operations. Here, data is assumed to be withdrawn only after Wa is removed. Clearly, there is no need for duplicated bit lines in the memory cell with this method.


Fig. 5. Possible implementation of the controller

Fig. 6. Waveforms under variable Vdd

As for memory banks, gate 1 is duplicated. The number of duplicated gates equals the number of bits in the memory word. The inputs of each gate are the pair of bit lines corresponding to one bit of the memory word.


All outputs of the duplicated gates are collected in a C-element, whose output replaces x4. Gate 5 is also duplicated; all outputs of its duplicates are collected in a C-element whose output forms the new Wa signal.

Here an SI SRAM cell is investigated under variable Vdd. In this experiment, we use a sinusoidal Vdd starting at a low level as an example. The lowest Vdd level is 300mV, the highest is 1V, and the sinusoid's frequency is 700kHz. Figure 6 shows the obtained waveforms.

This experiment consists of a write-0 operation followed by a read, and then a write-1 operation followed by a read. As Vdd is variable, each operation takes a different amount of time. For example, the first write runs under a low Vdd: precharging, writing the data, and then generating the Wa (WAck) signal takes a long time. The second write runs under the highest Vdd; it proceeds very quickly and generates the WAck signal very quickly as well. This experiment also demonstrates that the SI SRAM structure works under continuously variable Vdd, as expected.

4.3 A Possible Bundled SRAM Based on SI Delay Elements

However, a fully SI solution for large memory banks incurs penalties in performance, area and power because it requires a large completion detection overhead. Here a new bundled method is proposed to overcome these problems.

We can choose a worst-case column in a memory bank, usually the far-end column [18], and fill it with SI SRAM cells for completion monitoring. This means that gate 1 and gate 5 in the SI controller are connected to the bit lines of this column. The memory cells of the other columns use the same control signals generated by the controller but do not provide feedback information. In effect, the far-end column is used as a delay element and the other columns are bundled with it.

Compared to the existing method, which duplicates a column of SRAM cells, the new method employs neither duplicated cells nor reference voltages. And the delay elements, being SI SRAM cells of the same kind as those used elsewhere in the bank, should provide correct delay tracking over a wide Vdd range.

However, to actually employ such a bundling method, issues such as the dependency of delay on the data values stored and written need to be investigated in the future.

5 1Kb Memory Bank Design and Measurements

Using the proposed circuit, a 1k-bit (64x16) SI SRAM is implemented with the Cadence toolkit in the UMC 90nm CMOS technology. The design is verified with analogue simulations using SPECTRE, provided in the toolkit. The chip is fully functional from as low as 190mV up to 1V. The SRAM chip was simulated by writing 16 bits to the chip, then reading them and latching the data into SI latches.

Meanwhile the energy consumption and the worst case latency under different Vdds from 190mV to 1V are measured.

Figure 7 shows the energy consumption of the chip during reading and writing when the data is 1 and 0. The four curves show that the minimum energy point of the chip is at 400mV-500mV. The SRAM consumes 5.8pJ at 1V when writing a 16-bit word to the SRAM memory, and 1.9pJ at 400mV.


Fig. 7. Energy consumption of SRAM

Figure 8 shows the access time of the SRAM. The access time is the latency from the reading/writing request to the done signal. For example, under 1V, the worst access times for writing and reading are 5.4ns and 3.0ns; under 190mV, they are 1.6μs and 4.0μs respectively.

Fig. 8. Access time of SRAM

6 Conclusions and Future Work

In this paper, we focus on SRAM memory design for energy harvesting systems. Normally, this kind of system works under a variable power supply and requires high power efficiency, not just low power. Under such a non-deterministic power supply assumption, existing asynchronous SRAMs based on bundled delays have huge penalties or are impractical because they need voltage references.

The latency mismatch between SRAM memory and its controller under different Vdds is investigated. As Vdd goes down, the mismatch grows if traditional delays are used. Under 190mV, the mismatch is more than twice that under the normal 1V Vdd in the UMC 90nm technology.

An SI SRAM is proposed and designed. The SRAM has a simple interface, similar to that of a normal SRAM, including data, address, read request, read acknowledgement, write request, and write acknowledgement. The internal signals for memory control are fully triggered by the corresponding events of the memory system, which works by monitoring the bit lines of the memory.


A new method is proposed to implement SI writing based on ideas from [14]. This solves the problem of completion detection for write operations, previously considered impractical or impossible.

A 1Kb (64x16) SI SRAM is implemented using the Cadence toolkit. The simulation results show the SRAM working as expected from 190mV to 1V. The energy consumption and the worst-case performance are also measured; the measurements show that the SRAM cell has acceptable characteristics.

However, the completion detection logic in SI SRAM is expensive in terms of area, performance, and power. A simplified SRAM is therefore possible based on the bundled delay principle. Unlike existing asynchronous SRAM solutions, a column (the worst column, if it can be identified, or a dedicated column) of SI SRAM cells acts as a delay element. This column should in any case be slower than the other columns because of its completion detection overhead. The other columns of memory cells are bundled with this column.

This bundled SI SRAM method requires further investigation, e.g. of the effect of data values. In addition, we have only investigated basic asynchronous SRAM design. Other issues, such as static noise margin, readability, stability, and failure rates, need further study. These are the targets of our future research. We will also investigate multi-port asynchronous SRAM in the context of variable and non-deterministic Vdd.

Acknowledgement

This work is supported by the EPSRC project Holistic (EP/G066728/1) at Newcastle University. During this work we had very helpful discussions with our colleagues, Dr Alex Bystrov and other members of the MSD research group. The authors would like to express their thanks to them.

References

[1] International Technology Roadmap for Semiconductors, http://public.itrs.net/
[2] Martin, A.J.: The limitations to delay-insensitivity in asynchronous circuits. In: Dally, W.J. (ed.) Advanced Research in VLSI, pp. 263–278. MIT Press, Cambridge (1990)
[3] Sylvester, D., Agarwal, K., Shah, S.: Variability in nanometer CMOS: Impact, analysis, and minimization. Integration, the VLSI Journal 41, 319–339 (2008)
[4] Saito, H., Kondratyev, A., Cortadella, J., Lavagno, L., Yakovlev, A.: What is the cost of delay insensitivity? In: Proc. ICCAD 1999, San Jose, CA, pp. 316–323 (November 1999)
[5] Nielsen, L.S., Staunstrup, J.: Design and verification of a self-timed RAM. In: Proc. of the IFIP International Conference on VLSI 1995 (1995)
[6] Sit, V.W.-Y., et al.: A four phase handshaking asynchronous static RAM design for self-timed systems. IEEE Journal of Solid-State Circuits 34(1), 90–96 (1999)
[7] Soon-Hwei, T., et al.: A 160 MHz 45 mW asynchronous dual-port 1Mb CMOS SRAM. In: Proc. of IEEE Conference on Electron Devices and Solid-State Circuits (2005)
[8] Dama, J., Lines, A.: GHz asynchronous SRAM in 65nm. In: Proc. of 15th IEEE Symposium on Asynchronous Circuits and Systems (2009)
[9] Chang, M.F., Yang, S.M., Chen, K.T.: Wide Vdd embedded asynchronous SRAM with dual-mode self-timed technique for dynamic voltage systems. IEEE Trans. on Circuits and Systems I 56(8), 1657–1667 (2009)
[10] Wang, A., Chandrakasan, A.: A 180 mV subthreshold FFT processor using a minimum energy design methodology. IEEE Journal of Solid-State Circuits 40(1), 310–319 (2005)
[11] Sekiyama, A., et al.: A 1-V operating 256 Kb full CMOS SRAM. IEEE Journal of Solid-State Circuits 27(5), 776–782 (1992)
[12] Amrutur, B.S., Horowitz, A.: A replica technique for wordline and sense control in low power SRAMs. IEEE Journal of Solid-State Circuits 33(8), 1208–1219 (1998)
[13] Mokhov, A., et al.: Power elastic systems: Discrete event control, concurrency reduction and hardware implementation. Tech. Report NCL-EECE-MSD-TR-2009-151, School of EECE, Newcastle University
[14] Varshavsky, V., et al.: CMOS-based SRAM Cell. USSR Patent Application 4049181/24/52011 (favourable decision made 10.10.86)
[15] Zhai, B., et al.: A Sub-200mV 6T SRAM in 0.13um CMOS. In: Proc. of ISSCC (2007)
[16] Sparsø, J., Furber, S.: Principles of asynchronous circuit design: a systems perspective. Kluwer Academic Publishers, Boston (2001)
[17] Reddi, V., Gupta, M., Holloway, G., et al.: Voltage emergency prediction: a signature-based approach to reducing voltage emergencies. In: Proc. of International Symposium on High-Performance Computer Architecture, HPCA-15 (2009)
[18] Amelifard, B., Fallah, F., Pedram, M.: Leakage minimization of SRAM cells in a dual-Vt and dual-Tox technology. IEEE Trans. on VLSI 16(7), 851–860 (2008)


L1 Data Cache Power Reduction Using a Forwarding Predictor

P. Carazo1, R. Apolloni2, F. Castro3, D. Chaver3, L. Pinuel3, and F. Tirado3

1 Universidad Politecnica de Madrid, Spain
2 Universidad Nacional de San Luis, Argentina
3 Universidad Complutense de Madrid, Spain

Abstract. In most modern processor designs the L1 data cache has become a major consumer of power due to its increasing size and high frequency access rate. In order to reduce this power consumption, we propose in this paper a straightforward filtering technique. The mechanism is based on a highly accurate forwarding predictor that determines if a load instruction will take its corresponding data via forwarding from the load-store structure – thus avoiding the data cache access – or it should catch it from the data cache. Our simulation results show that 36% data cache power savings can be achieved on average, with a negligible performance penalty of 0.1%.

1 Introduction

Power dissipation in an out of order microprocessor is spread across different structures including caches, register files, the branch predictor, etc. Specifically, on-chip caches by themselves consume a significant part of the overall power. In this paper we intend to reduce the L1 data cache (DL1) power consumption in an out of order processor. It can be argued that this research problem is not a major concern now due to the industry trend towards multi-core architectures, in which the pipelines employed are in some cases simpler. However, homogeneous multi/many-core architectures with in-order pipelines will only provide substantial benefits for scalable applications/workloads, and some researchers have recently highlighted that future designs will benefit from asymmetric architectures that combine simple and power-efficient cores with a few complex and power-hungry cores [1]. The local inefficiencies of a complex core can translate into global performance-per-watt improvements, since a complex core could accelerate the serial phases of applications when the power-efficient cores are idle. This way, a single chip will be able to provide good scalability for parallel applications as well as ensure high serial performance. In summary, as promoted in [2], researchers should still investigate methods of improving sequential performance even though we have entered the multicore era. Furthermore, if several out-of-order cores are employed – either in an asymmetric or a homogeneous multi-core design – our technique can be applied to each private DL1 cache, leading to a higher benefit.



The mechanism that we propose in this paper for reducing the DL1 power consumption is based on an efficient usage of the LSQ (load-store queue), a structure responsible for keeping all in-flight memory instructions and for detecting and enforcing memory dependences in an out of order processor. One of the main LSQ tasks is to supply the correct data to load instructions via a forwarding process – store to load forwarding – ruling out the cache data and therefore making the cache access unnecessary. Taking advantage of Nicolaescu's CLSQ [3], in which the number of loads that receive their data from a previous store increases considerably, and using an accurate forwarding predictor that indicates whether a load instruction is likely to receive its data through forwarding, we manage to significantly reduce the number of accesses to the data cache in an x86 architecture. The small misprediction rate obtained translates into an IPC that remains largely unchanged.

The rest of the paper is organized as follows. Section 2 recaps related work. Section 3 reviews the conventional implementation and brings in our new mechanism. Section 4 details our experimental environment, while Section 5 outlines experimental results and analyses. Finally, Section 6 concludes.

2 Background

Many techniques for reducing the cache energy consumption have been explored recently. Next, we recap some of the more outstanding ones.

One alternative is to partition caches into several smaller caches [4], with the corresponding reduction in both access time and power cost per access. Another design, known as the filter cache [5], trades performance for power consumption by filtering cache references through an unusually small L1 cache. An L2 cache, which is similar in size and structure to a typical L1 cache, is placed after the filter cache to minimize the performance loss. A different alternative, named selective cache ways [6], provides the ability to disable a subset of the ways in a set-associative cache during periods of modest cache activity, whereas the full cache remains operational for more cache-intensive periods. Another approach takes advantage of the special behavior of memory references: the conventional unified data cache is replaced with multiple specialized caches, each handling a different kind of memory reference according to its particular locality characteristics [7]. These alternatives make it possible to improve in terms of performance or power efficiency. Finally, Jin et al. [8] obtain power savings in the L1 cache by exploiting the spatial locality of loads. In their technique, loads always bring a macro data block from the processor cache, allowing additional opportunities for load to load forwarding.

Nicolaescu et al. [3] propose to avoid the data cache access for those loads that receive their data through forwarding. To increase their number, they modify the LSQ design to retain load and store instructions after their commit. Thereby, a later load increases its chances of receiving its data from a previous instruction, either an in-flight store, a committed store, or a committed load.


The mechanism – named cached load store queue, CLSQ – is based on the low observed rates of LSQ occupancy during some program phases, which make it possible to earmark unoccupied entries for already committed load or store instructions. Our work extends and improves on this scheme.

Finally, as we are using a forwarding predictor in our design, we should mention that there are many proposals relying on memory dependence prediction, which provide techniques to know in advance which store-load pairs will depend on each other and to take appropriate actions [9] [10]. However, they are all overprovisioned for the purposes of this work.

3 Filtering DL1 Accesses Using a Forwarding Predictor

3.1 Rationale

In most conventional microprocessors, each load instruction consults the first level data cache (DL1) in order to move the required data into an available register. In parallel, the Store-Queue (SQ) is searched looking for a previous matching in-flight store. If one is found, the store forwards the corresponding data; otherwise, the data is provided by the cache (Figure 1, Original Architecture). The technique that we propose in this paper is based on the observation that if a load gets its data directly from an earlier store, the data cache access becomes completely unnecessary, and hence we could avoid it and save some power. Obviously, this is only useful if the percentage of loads that get their data from the SQ is high enough.
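As a concrete illustration of this load path (a simplified software model of our own, not simulator code), the sketch below searches a store queue for the youngest older store to the same address and forwards its data, falling back to the cache only on a miss; address granularity, partial overlaps and speculation are ignored.

```python
from collections import namedtuple

SQEntry = namedtuple("SQEntry", "seq addr data")   # program-order sequence number, address, value

def lookup_load(load_seq, load_addr, store_queue, dl1):
    """Return (data, forwarded) for a load; store_queue holds in-flight stores."""
    older = [s for s in store_queue if s.seq < load_seq and s.addr == load_addr]
    if older:
        match = max(older, key=lambda s: s.seq)    # youngest older matching store
        return match.data, True                    # store-to-load forwarding: DL1 not needed
    return dl1.get(load_addr), False               # fall back to the data cache

sq = [SQEntry(seq=3, addr=0x100, data=7), SQEntry(seq=8, addr=0x200, data=9)]
print(lookup_load(load_seq=10, load_addr=0x200, store_queue=sq, dl1={0x200: 1}))  # (9, True)
```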

In a RISC processor, the number of architectural registers is commonly 32 and a register-register architecture is generally implemented. With such a configuration, the number of store to load forwardings is relatively small (for example, in [11], less than 15% on average), and the benefit of avoiding the DL1 access on such rare occasions might be meaningless. However, in a register-memory architecture with only 16 architectural registers – as in the case of x86-64, the architecture employed in this work – the number of store to load forwardings is higher as a result of the extra operations due to register spilling.

In a complementary way, we can use Nicolaescu's CLSQ from [3], which significantly increases the number of loads that receive their data via forwarding, both due to store-load forwarding from the Cached-SQ and to load-load forwarding from the Cached-LQ.

In summary, on an x86-64 architecture using Nicolaescu's Cached-LSQ, the number of forwardings can be relatively high – up to 40% of the loads – which makes our initial intuition appealing. However, in order to be able to filter out these accesses, we need either to serialize the LSQ and DL1 cache searches, or to know in advance – i.e. make a prediction – whether the load will receive the data via forwarding or not. This is a key issue that has to be addressed.

3.2 Overall Structure

As we have just mentioned, an obvious implementation would be to serialize the accesses (as Nicolaescu does in [3]):


the load first scans the SQ, and then – only when necessary – the cache is accessed (Figure 1, Nicolaescu's Proposal). However, this design is not efficient: when a previous matching store is not found, the delay incurred in accessing the data cache results in a significant slowdown. In this paper we come up with a much more convenient approach.

The design that we propose (Figure 1, Proposed Architecture) is based on a forwarding predictor: for each load, we predict whether it will receive its data through forwarding. For convenience of discussion, we loosely refer to these loads as predicted-dependent loads and to the remainder as predicted-independent loads. For predicted-dependent loads, only the SQ and the Cached-LQ are accessed, omitting the DL1 access (of course, at the risk of being wrong, in which case the cache access is launched with a delay of 1 cycle). For the remaining loads, the SQ, the Cached-LQ and the DL1 are all accessed in parallel (note that in this case, if the predictor is wrong, the data cache access is unnecessary). A predictor with high accuracy provides significant power savings at the cost of a tiny performance degradation. This idea has been explored in similar, yet different contexts [12].

There is a large body of research in the field of memory dependence prediction (Section 2). However, these proposals all employ sophisticated predictor structures, which are excessive for our goal of predicting in advance whether a load will receive its data through forwarding. For this reason, we have not considered them in this work. Instead, we have evaluated two kinds of simple predictors: Bloom Filter based [13] and Branch Predictor based [14].

Bloom Filter based predictor. In this first kind of predictor, we implement a low-overhead hash table of counters. At issue time, every load and store hashes its memory address to a single entry and increments the corresponding counter; at commit, the entry is decremented. In addition, at issue time, loads read the counter (before it is incremented) to perform the prediction. If it is greater than zero, there is a likely (but not certain) address match with another memory instruction, and the load is predicted to receive its data via forwarding. On the other hand, if the counter is zero, the load is predicted-independent.¹

¹ As explained in [15], the SQ and LQ accesses could be avoided in this case. However, since a DL1 cache access is much more power consuming than an LQ-SQ access, in this paper we do not consider such LQ or SQ filtering capability, which would require a deeper study.
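A minimal software sketch of this counting-filter predictor is given below; the table size, the trivial modulo hash and the method names are our own illustrative choices (the paper evaluates 64-256 entry filters and does not prescribe an implementation).

```python
class BloomForwardingPredictor:
    """Hash table of counters indexed by memory address, as described above."""
    def __init__(self, entries=64):
        self.counters = [0] * entries

    def _index(self, addr):
        return addr % len(self.counters)           # simple stand-in for the hash function

    def issue_store(self, addr):
        self.counters[self._index(addr)] += 1

    def issue_load(self, addr):
        """Predict (reading the counter before incrementing), then account for the load."""
        idx = self._index(addr)
        predicted_dependent = self.counters[idx] > 0   # likely address match with another mem op
        self.counters[idx] += 1
        return predicted_dependent

    def commit(self, addr):
        """Called when a load or store commits: release its slot."""
        idx = self._index(addr)
        self.counters[idx] = max(0, self.counters[idx] - 1)
```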

Branch Predictor based. The second kind of predictor is based on the well-known bimodal branch predictor. Similarly to branch instructions, the majority of loads are usually strongly biased, so such a predictor works well. An advantage of this Bimodal Predictor over the Bloom Filter based one is that the prediction can be performed as soon as the load instruction is decoded, based on its PC. In contrast, a Bloom Filter is consulted with the memory address of the load, which needs to be calculated first, so in that case the prediction is delayed until the issue stage.

Combined Predictor. Finally, we should mention that we have also considered in our evaluation a combined predictor, merging a Bloom Filter with a Bimodal predictor.


Fig. 1. Original Architecture (with the Cached-LSQ), Nicolaescu's Architecture, and our Proposed Architecture

For extracting the final decision, we predict that a load will receive its data through forwarding only when both structures predict the load to be dependent. Such a structure benefits from both the past forwarding information of loads and memory address information, giving the best results, as we show in the evaluation section.
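The sketch below illustrates the bimodal table and the AND-combination rule just described; the 2-bit counter encoding, the initial counter value and the update-at-commit policy are our assumptions, and the Bloom-filter prediction is taken as an input (for instance from a counting filter like the one sketched earlier in this section).

```python
class BimodalForwardingPredictor:
    """PC-indexed 2-bit saturating counters, trained on actual forwarding outcomes."""
    def __init__(self, entries=256):
        self.counters = [1] * entries              # start weakly "not forwarded"

    def _index(self, pc):
        return pc % len(self.counters)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2     # predicted-dependent?

    def update(self, pc, forwarded):
        i = self._index(pc)
        self.counters[i] = min(3, self.counters[i] + 1) if forwarded \
            else max(0, self.counters[i] - 1)

def combined_prediction(bimodal_dependent, bloom_dependent):
    # Predicted-dependent only when BOTH structures agree; otherwise the DL1
    # is accessed in parallel, as in the original architecture.
    return bimodal_dependent and bloom_dependent
```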

3.3 Supporting Coherence and Consistency

The LSQ of the baseline architecture receives the invalidation requests from remote processors, so coherence and consistency functionalities can easily be supported in our technique.


However, we should highlight a conflict situation that arises in our design when it is implemented in a system with a MESI coherence protocol: if data is replaced from the DL1 but remains in the Cached-LSQ, the Shared line will not be activated in response to a remote read request, potentially putting the remote data in an erroneous Exclusive state (instead of a Shared state). A possible solution is to force the LSQ to activate the Shared line for every remote read that hits a load whose data was received via forwarding. As future work we intend to improve this management since – although straightforward – it is relatively inefficient.

4 Experimental Framework

We have evaluated our proposed design using PTLsim [16], a performance-oriented simulation tool. The microarchitecture models the default PTLsim configuration, which results from merging different features of an Intel Pentium 4 [17], an AMD K8 and an Intel Core 2 [18]. Some of the main simulation parameters are listed in Table 1.

Table 1. Simulation parameters for default PTLSim configuration

Branch predictor: Combined (Bim-2bits + Gshare), 2K BTAC
Instruction Fetch queue size: 32
ROB size: 128
LSQ size: 80 (LQ: 48, SQ: 32)
LSAP size: 16
Physical Registers: 256
Functional Units: 8 (4 ALU (2 INT, 2 FP), 2 Load, 2 Store)
Fetch/Decode/Issue/Commit width: 4/4/4
L1 Instruction Cache: 32KB (4-way, 64B line)
L1 Data Cache: 16KB (4-way, 64B line, 2-cycle latency)
L2 Data Cache: 256KB (16-way, 64B line, 6-cycle latency)
L3 Data Cache: 4MB (32-way, 64B line, 16-cycle latency)
Main memory latency: 140 cycles

The evaluation of our proposal has been performed using 24 benchmarks from the SPEC CPU2006 suite, compiled for the x86 instruction set. The technology parameters correspond to 45 nm, with a 1.0V Vdd. We simulate regions of 100M instructions after reaching a triggering point [19], which marks the beginning of a code area in which the application behavior is representative of the overall execution.

To evaluate the impact of our data cache filtering on the power consumption of the DL1, we use CACTI 5.3 [20] to model the cache of Table 1. Specifically, in order to estimate the cache power consumption, we multiply the number of reads and writes to the DL1 by the energy cost of each kind of access to this cache. Furthermore, the simulator has been modified to incorporate our predictors in the microarchitectural simulation, although their power consumption is considered negligible compared with the power savings obtained in the data cache.
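The accounting itself is simple; the sketch below restates it, with per-access energies and access counts that are placeholders rather than the CACTI 5.3 figures used in the paper.

```python
E_READ, E_WRITE = 0.020, 0.025   # nJ per DL1 access -- illustrative placeholders only

def dl1_energy(n_reads, n_writes):
    """Total DL1 dynamic energy: accesses multiplied by per-access energy."""
    return n_reads * E_READ + n_writes * E_WRITE

baseline = dl1_energy(n_reads=10_000_000, n_writes=4_000_000)
# Loads correctly predicted-dependent never touch the DL1, so their reads are filtered out.
filtered = dl1_energy(n_reads=10_000_000 - 3_600_000, n_writes=4_000_000)
print(f"DL1 energy saving: {100 * (1 - filtered / baseline):.1f}%")
```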

In the following, we perform some quantitative analysis to further understand the effectiveness of the proposed design.


5 Evaluation

5.1 Main Results

In this section we compare the data cache power and whole-system performance of the baseline and of our alternative. Figure 2 shows the power savings achieved in the data cache by our technique with respect to the Original Architecture. Figure 3 illustrates the performance impact of our proposal with respect to the Original Architecture. In these experiments we always employ the combined predictor, since it reports the highest accuracy values, as we show in the next subsection. We can draw the following conclusions.

First, by including our proposed scheme, a significant fraction of loads are correctly predicted-dependent, and the corresponding data cache accesses are therefore avoided. This eliminates a significant fraction of the DL1 dynamic power consumption, as Figure 2 shows. On average, for a Bloom Filter with 64 entries and a Bimodal Predictor of 256 entries, the DL1 power savings of our approach are around 36%.

Second, and more importantly, in our architecture average performance remains almost untouched (around 0.1% slowdown), something that would not happen with Nicolaescu's proposal.


Fig. 2. DL1 Power Savings


Fig. 3. Performance Impact


The reason is that in his case, a load that finds no previous dependent store in the LSQ (i.e. has no forwarding) incurs a delay of 1 cycle when accessing the DL1, while in our case the forwarding predictor prevents this by predicting most of these loads as independent.

5.2 Forwarding Predictors

In order to compare the accuracy of the forwarding predictors evaluated – Bloom Filter, Bimodal (with 1 and 2 bits per entry), and Bimodal (2 bits) plus Bloom Filter – we follow Grunwald et al. and employ the following metrics used in confidence estimation for speculation control [21]:

– Predictive Value of a Positive test (PVP). It identifies the probability that the prediction of a load as dependent is correct. It is computed as the ratio between the number of correctly dependent-predicted loads and the total number of loads predicted as dependent.

– Predictive Value of a Negative test (PVN). It identifies the probability that the prediction of a load as independent is incorrect. It is computed as the ratio between the number of mispredicted independent loads and the total number of loads predicted as independent.

In our case, using predictors with a high PVP avoids degrading performance. On the other hand, if many loads are incorrectly predicted as independent (high PVN), many cache accesses are carried out unnecessarily, resulting in missed opportunities to reduce the DL1 power consumption. Therefore, in our design, only very high PVP values and very low PVN values are acceptable.
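For reference, the two metrics reduce to the following ratios (a direct transcription of the definitions above; the example counts are illustrative only).

```python
def pvp(correct_dep_predictions, total_dep_predictions):
    """Probability that a load predicted as dependent really gets forwarded."""
    return correct_dep_predictions / total_dep_predictions

def pvn(wrong_indep_predictions, total_indep_predictions):
    """Probability that a load predicted as independent was actually dependent."""
    return wrong_indep_predictions / total_indep_predictions

print(f"PVP = {pvp(95, 100):.0%}, PVN = {pvn(6, 100):.0%}")  # cf. the ~95% / ~6% best case in Fig. 4
```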

In Figure 4, we present the measured PVP and PVN for different sizes of all the studied predictors.


Fig. 4. PVP and PVN values for the studied predictors. The results shown are the average values for all applications. For Bimodal Predictors (1 and 2 bits) the data points reflect sizes of 256, 512, 1K and 2K. For the Bloom Filter we show results for 64, 128 and 256 entries. Finally, the combined predictor uses a 64-entry Bloom Filter and a Bimodal Predictor (2 bits) with 256, 512, 1K and 2K entries.


Intuitively, as we increase the size of any predictor, its PVP increases and its PVN decreases, leading to better predictor behavior. Note that the PVN of the Bloom Filter is always zero, since no false negatives exist: when a load is predicted as independent, the predictor is never mistaken. From this figure we can conclude – in line with intuition – that combining the past forwarding information (Bimodal predictor) and memory addresses (Bloom Filter) results in the most accurate predictor (up to around 95% of hits for predicted-dependent loads and only around 6% of misses for predicted-independent loads).

6 Conclusions

The main contributions of this paper are:

– We implement and evaluate Nicolaescu's CLSQ [3] in a different and more common microarchitectural model, the widespread x86-64.

– We propose to include a forwarding predictor to know in advance whether a load will receive its data through forwarding, in which case the DL1 access can be avoided.

– We study the effectiveness of different predictors, choosing the optimal one based on a tradeoff between accuracy and hardware needs.

Overall, the proposed filtering mechanism translates into DL1 power savings of up to 36% on average for the studied predictor configuration (BF of 64 entries and Bimodal of 256 entries). Including this scheme leaves performance almost unchanged – less than 0.1% slowdown on average – with a minimal hardware cost of less than 100B.

References

1. Bower, F., Sorin, D., Cox, L.: The impact of dynamically heterogeneous multicore processors on thread scheduling. IEEE Micro 28(3), 17–25 (2008)
2. Hill, M.D., Marty, M.R.: Amdahl's law in the multicore era. IEEE Computer 41(7), 33–38 (2008)
3. Nicolaescu, D., Veidenbaum, A., Nicolau, A.: Reducing Data Cache Energy Consumption via Cached Load/Store Queue. In: ISLPED 2003, pp. 252–257 (2003)
4. Racunas, P., Patt, Y.N.: Partitioned First-Level Cache Design for Clustered Microarchitectures. In: ICS 2003, pp. 22–31 (2003)
5. Kin, J., Gupta, M., Mangione-Smith, W.: The Filter Cache: An Energy Efficient Memory Structure. In: MICRO 1997, pp. 184–193 (1997)
6. Albonesi, D.: Selective Cache Ways: On-Demand Cache Resource Allocation. Journal of Instruction-Level Parallelism 2 (2000)
7. Lee, H., Smelyanskiy, M., Newburn, C., Tyson, G.: Stack Value File: Custom Microarchitecture for the Stack. In: HPCA 2001, pp. 5–14 (2001)
8. Jin, L., Cho, S.: Reducing Cache Traffic and Energy with Macro Data Load. In: ISLPED 2006, pp. 147–150 (2006)
9. Subramaniam, S., Loh, G.: Store Vectors for Scalable Memory Dependence Prediction and Scheduling. In: HPCA 2006, pp. 65–76 (2006)
10. Park, I., Ooi, C., Vijaykumar, T.: Reducing Design Complexity of the Load/Store Queue. In: MICRO 2003, pp. 411–422 (2003)
11. Castro, F., Chaver, D., Pinuel, L., Prieto, M., Huang, M., Tirado, F.: A Load-Store Queue Design based on Predictive State Filtering. Journal of Low Power Electronics 2(1), 27–36 (2006)
12. Sha, T., Martin, M., Roth, A.: Scalable Store-Load Forwarding via Store Queue Index Prediction. In: MICRO 2005, pp. 159–170 (2005)
13. Bloom, B.: Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM 13(7), 422–426 (1970)
14. McFarling, S.: Combining Branch Predictors. Technical Report TN-36, Western Research Laboratory, Digital Equipment Corporation (June 1993)
15. Sethumadhavan, S., Desikan, R., Burger, D., Moore, C., Keckler, S.: Scalable Hardware Memory Disambiguation for High ILP Processors. In: MICRO 2003, pp. 399–410 (2003)
16. Yourst, M.T.: PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator. In: ISPASS 2007, pp. 23–34 (2007)
17. Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., Roussel, P.: The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal (Q1 2001)
18. Copenhagen University College of Engineering: The Microarchitecture of Intel and AMD CPUs: An Optimization Guide for Assembly Programmers and Compiler Makers (2009)
19. A hybrid timing-address oriented LSQ filtering for an x86 arch. Technical report
20. http://www.hpl.hp.com/research/cacti/
21. Grunwald, D., Klauser, A., Manne, S., Pleszkun, A.: Confidence Estimation for Speculation Control. In: ISCA 1998, pp. 122–131 (1998)



Statistical Leakage Power Optimization of Asynchronous Circuits Considering Process Variations

Mohsen Raji, Alireza Tajary, Behnam Ghavami, Hossein Pedram, and Hamid R. Zarandi

Department of Computer Engineering and Information Technology, Amirkabir University of Technology (Tehran Polytechnic), Tehran, I. R. Iran {raji,tajary,ghavamib,pedram,h_zarandi}@aut.ac.ir

Abstract. Increasing levels of process variability in the deep sub-micron era have become a critical concern for performance- and power-constrained designs. This paper introduces a framework for the statistical leakage power minimization of template-based asynchronous circuits under process variation. We propose a statistical dual-Vt assignment for asynchronous circuits that considers the variability of both the performance and the leakage power consumption of a circuit. The circuit model used is an extended Timed Petri-Net, named Variant-Timed Petri-Net, which captures the dynamic behavior of the circuit with statistical delay and leakage power values. We apply a genetic algorithm that uses a 2-dimensional graph to calculate the fitness of each threshold voltage assignment. Experimental results show that with this statistically aware optimization, leakage power can be reduced by 40.5% in the mean and 54.4% in the variance.

1 Introduction

In asynchronous circuits, local signalling eliminates the need for global synchronization, which offers some potential advantages in comparison with synchronous circuits [1] [2] [3] [4] [5]. Asynchronous design allows dynamic power consumption to be reduced because activity is triggered by requests, not by clock edges. On the other hand, the request-receiving and acknowledgment-emitting capabilities have a cost in the number of transistors. Moreover, in deep sub-micron technologies the leakage current is becoming more and more significant [6].

There are many techniques for designing dual threshold voltage (dual-Vth hereafter) synchronous circuits. However, dual-Vth cannot be applied directly to asynchronous circuits in the same way as it is to synchronous circuits. This is because it is difficult to define or identify a critical path in an asynchronous circuit – where it starts and where it ends – at least with CAD tools that have been designed for synchronous circuits. In [7], a method to synthesize a dual-Vth asynchronous design has been proposed.

As process geometries continue to shrink, the ability to control critical device parameters is becoming increasingly difficult, and significant variations in device length, doping concentrations, and oxide thicknesses have resulted. This issue is called process variation.


In deep submicron technologies, the variability of circuit features, such as delay or leakage power, due to process variations has become a significant concern. The tremendous impact of variability was demonstrated recently in [11], showing a 20X variation in leakage power for a 1.3X variation in delay between fast and slow dies. Wide spread in the leakage power distribution has emerged as an important cause of yield loss due to bounds on static power dissipation [12]. Statistical analysis is a practical approach in circuit design to tolerate process variation [21] [27] [28].

There is a substantial body of work applying statistical analysis to synchronous circuits to mitigate the impact of variation [27] [28]. For asynchronous circuits, a statistical performance analysis has been proposed in [23]. To the best of our knowledge, however, there is no proposed method that considers process variation in the power consumption analysis of asynchronous circuits. In this paper, we present a process variation-aware leakage power optimization framework for asynchronous circuits.

The remainder of the paper is organized as follows: Section 2 provides the background necessary for reading the paper. Section 3 introduces the statistical threshold voltage assignment framework. The Vth assignment algorithm is described in detail in Section 4, while in Section 5 we give our experimental results on a set of related benchmarks. Finally, conclusions are drawn in Section 6.

2 Background

2.1 Dual-Vth Circuit Design

The dual-Vth design technique uses two kinds of transistors in the same circuit. Some transistors have a high threshold voltage, while others have a low threshold voltage. The high-threshold-voltage transistors have less sub-threshold leakage power dissipation but also have a larger delay than the low-threshold-voltage transistors.

In dual threshold voltage implementations of custom VLSI designs, the gates on non-critical paths are assigned high Vth, and the gates on the critical path are assigned low Vth. The objective is to maximize the number of transistors having high threshold voltage without sacrificing the performance of the circuit. The impact of this approach relies heavily on the efficiency of the threshold voltage assignment algorithm. Researchers have recently proposed many design techniques for selecting and assigning threshold voltages to the gates of a circuit so as to reduce leakage power under performance constraints [14].

However, the dual-threshold-voltage design techniques proposed in the literature for custom VLSI designs cannot be used for asynchronous ones. This is because the performance analysis of an asynchronous circuit is completely different from that of a synchronous one, because of the dependencies between highly concurrent events. While synchronous performance estimation is based on a static critical path analysis affected only by the delay of components and interconnecting wires, it has been shown that the performance of an asynchronous circuit depends on dynamic factors such as the number of tokens in the circuit. In the clocked case, the critical path has a clear beginning and a clear end because all paths are broken by latches. Importantly, no such clear separation is available in asynchronous circuits. Therefore, a special approach is necessary to analyze the performance of asynchronous circuits.


2.2 Timed Petri Nets

Petri Nets are used as an elegant modelling formalism to capture concurrency and synchronization in many applications, including asynchronous circuit modelling [20]. A Petri Net is a four-tuple (P, T, F, m0), where P is a finite set of places, T is a finite set of transitions, F ⊆ (P × T) ∪ (T × P) is a flow relation, and m0 is the initial marking. A marking is a token assignment for the places and it shows the state of the system. A Timed Petri-Net (TPN in the sequel) is a Petri Net in which some transitions or places are annotated with delays. A Variant-Timed Petri-Net (VTPN) is a TPN in which the delays on the transitions or places are modelled statistically using probability density functions. In order to analyse asynchronous circuits statistically, we use a VTPN to model the circuit.
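
To make the abstraction concrete, a VTPN can be captured by a very small data structure. The sketch below is only a schematic Python rendering of the definitions above (all names are ours and hypothetical, not part of any tool discussed in this paper); each place is annotated with a normally distributed delay given by its mean and standard deviation.

    import random

    class Place:
        # A VTPN place annotated with a statistically modelled delay (normal law).
        def __init__(self, name, delay_mean, delay_std, tokens=0):
            self.name = name
            self.delay_mean = delay_mean    # mean delay, e.g. in ns
            self.delay_std = delay_std      # standard deviation of the delay
            self.tokens = tokens            # marking of this place

        def sample_delay(self):
            # One Monte-Carlo sample of this place's delay.
            return random.gauss(self.delay_mean, self.delay_std)

    class VTPN:
        # Variant-Timed Petri Net (P, T, F, m0) with statistical delays on places.
        def __init__(self):
            self.places = {}          # P: name -> Place (the marking lives in Place.tokens)
            self.transitions = set()  # T
            self.flow = set()         # F: arcs (place, transition) or (transition, place)

    # Toy example: a two-place cycle holding one token.
    net = VTPN()
    for p in (Place("p1", 0.5, 0.05, tokens=1), Place("p2", 0.7, 0.04)):
        net.places[p.name] = p
    net.transitions |= {"t1", "t2"}
    net.flow |= {("p1", "t1"), ("t1", "p2"), ("p2", "t2"), ("t2", "p1")}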

3 Statistical Dual Vth Asynchronous Circuits Design Framework

Fig. 1 shows the general structure of the proposed statistical leakage power optimization scheme and its interface with the asynchronous synthesis flow. To model the dual-threshold design of asynchronous circuits as an optimization problem, a suitable circuit and performance model of the asynchronous circuit is required. In this work, the output of the Decomposition step is translated into a Variant-Timed Petri-Net model for performance analysis, and a low or high Vth is assigned to each template. Then, a VTPN simulator runs the circuit model and provides the dynamic information of the original circuit, such as the token assignment.

The proposed optimizer includes a statistical static performance analyser, which provides performance information, and a Vth-assignment engine, which assigns a high or low Vth to the templates of the circuit. The assignment of Vth is done using a heuristic method. Then the optimized circuit is given as input to the Template Synthesizer to generate a netlist of standard-cell elements.

4 A Vth-Assignment Algorithm

The power optimization flow uses a genetic algorithm and is shown in Fig. 2, which shows the basic configuration of the GA. The genetic algorithm maintains a population of m individuals at each generation g. Each individual is a candidate solution for the dual-Vth assignment problem and has n chromosomes, i.e. the number of VTPN nodes. Each chromosome can take two values: '0' indicates that a low Vth has been assigned to the node and '1' indicates that a high Vth has been assigned to it. As there is a trade-off between the performance and the power consumption of the circuit in the dual-Vth technique, the proposed Vth assignment process has two optimization objectives. Once the performance and the leakage power are analyzed, the fitness of the individuals is evaluated. We apply a 2-dimensional fitness graph to assign a total fitness value to each individual. Genetic operations are then applied to reproduce the population for the new generation. This process continues until a termination criterion is met.
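
The encoding just described is easy to sketch. The fragment below is a schematic illustration only (hypothetical names, toy placeholder objectives instead of the real statistical analyses, and plain single-objective selection rather than the 2-dimensional fitness graph of Section 4.3): each individual is a vector of n bits, one per VTPN node, with '0' meaning low Vth and '1' meaning high Vth.

    import random

    N_NODES = 20          # n: number of VTPN nodes (chromosome length)
    POP_SIZE = 30         # m: individuals per generation

    def random_individual():
        # '0' = low Vth assigned to the node, '1' = high Vth assigned to the node.
        return [random.randint(0, 1) for _ in range(N_NODES)]

    def evaluate(ind):
        # Placeholder objectives: in the real flow these values come from the
        # statistical performance and leakage analyses of the VTPN.
        leakage = ind.count(0)      # more low-Vth nodes -> more leakage
        delay = ind.count(1)        # more high-Vth nodes -> more delay
        return leakage, delay

    def crossover(a, b):
        cut = random.randrange(1, N_NODES)
        return a[:cut] + b[cut:]

    def mutate(ind, rate=0.05):
        return [1 - bit if random.random() < rate else bit for bit in ind]

    population = [random_individual() for _ in range(POP_SIZE)]
    for generation in range(50):
        ranked = sorted(population, key=lambda ind: sum(evaluate(ind)))
        parents = ranked[:POP_SIZE // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children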


Fig. 1. Statistical Dual-Vth Asynchronous Circuit Design Framework

Fig. 2. The Vth assignment flow


4.1 Statistical Mathematical Operations

The delay and leakage power of each node in the VTPN are modelled as random variables with a normal distribution. So the delay and power of the nodes in the VTPN have a mean value μ and a set of parameter variations. The linear model used to approximate delay in the analysis is as follows:

d = μ + Σ_{i=1..m} s_i·Δp_i    (1)

where d is the delay of a gate, μ is the mean value of the delay, s_i is the delay sensitivity to process parameter p_i, Δp_i is the variation of p_i for this gate, and m is the number of process parameters.

As the computation will be done statistically, it is worth explaining the statistical operations first. The three operations used in our method are SUM, DIV and MAX. First of all, suppose there are two random variables modelled as below:

A = μ_A + Σ_i s_{A,i}·Δp_i    (2)
B = μ_B + Σ_i s_{B,i}·Δp_i    (3)

In order to make the problem simpler, it is assumed that the parameters are uncorrelated. So the standard deviation of such a random variable is calculated as:

σ_A = ( Σ_i s_{A,i}² )^(1/2)    (4)

It is interesting to notice that the covariance between two such variables (here between A and B) can be calculated easily through the equation below:

cov(A, B) = Σ_i s_{A,i}·s_{B,i}    (5)

4.1.1 SUM Operation
The sum of two random variables with normal distributions is a random variable with a normal distribution. The SUM operation (applied along each cycle) is computed as follows:

μ_{A+B} = μ_A + μ_B    (6)
s_{A+B,i} = s_{A,i} + s_{B,i}    (7)

4.1.2 DIV Operation
In calculating the SCM of a cycle, the sum of the delay values of the cycle is divided by the number of tokens in the cycle. As the sum of delays modelled by normal random variables is still a normal random variable, the parameters of the division by a constant n are calculated as follows:

μ_{A/n} = μ_A / n    (8)
s_{A/n,i} = s_{A,i} / n    (9)

4.1.3 MAX Operation
The maximum of two normal random variables is not necessarily a normal random variable. The MAX of two random variables A and B with normal distributions can be approximated by another normal random variable using the relationship proposed in [21], as follows:

θ = ( σ_A² + σ_B² − 2ρ·σ_A·σ_B )^(1/2)    (10)
α = ( μ_A − μ_B ) / θ    (11)
μ_max = μ_A·Φ(α) + μ_B·Φ(−α) + θ·φ(α)    (12)
s_{max,i} = s_{A,i}·Φ(α) + s_{B,i}·Φ(−α)    (13)

Here, ρ represents the correlation coefficient between A and B, and Φ and φ are the cumulative distribution function (CDF) and the probability density function (PDF) of a standard normal (i.e., mean 0, standard deviation 1) distribution, respectively.
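
These operations can be written down compactly for variables kept in the canonical form μ + Σ_i s_i·Δp_i. The Python fragment below is our own illustration of Equations (6)-(13) as reconstructed above (it assumes uncorrelated, unit-variance parameters, and uses the classical Clark/tightness-probability approximation for MAX); it is not code from the proposed framework.

    import math

    def pdf(x):                                # standard normal PDF, phi
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    def cdf(x):                                # standard normal CDF, Phi
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    class RV:
        # Canonical form: mu + sum_i s[i] * dp_i, with dp_i ~ N(0, 1), uncorrelated.
        def __init__(self, mu, s):
            self.mu, self.s = mu, list(s)
        @property
        def sigma(self):
            return math.sqrt(sum(si * si for si in self.s))

    def stat_sum(a, b):                        # SUM, Eqs. (6)-(7)
        return RV(a.mu + b.mu, [x + y for x, y in zip(a.s, b.s)])

    def stat_div(a, n):                        # DIV by a constant, Eqs. (8)-(9)
        return RV(a.mu / n, [x / n for x in a.s])

    def stat_max(a, b):                        # MAX, Eqs. (10)-(13)
        cov = sum(x * y for x, y in zip(a.s, b.s))
        theta = math.sqrt(max(a.sigma ** 2 + b.sigma ** 2 - 2.0 * cov, 1e-12))
        alpha = (a.mu - b.mu) / theta
        t = cdf(alpha)                         # tightness probability of A
        mu = a.mu * t + b.mu * (1.0 - t) + theta * pdf(alpha)
        return RV(mu, [x * t + y * (1.0 - t) for x, y in zip(a.s, b.s)])

    # Toy usage with two process parameters:
    A = RV(8.0, [0.20, 0.10])
    B = RV(7.5, [0.10, 0.30])
    print(stat_max(A, B).mu, stat_max(A, B).sigma)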

4.2 Performance and Leakage Power Analysis

The performance of any computation modeled with a VTPN is dictated by the cycle time of the VTPN and thus by the largest cycle metric. A cycle c in a VTPN is a sequence of places p1, p2, p3, …, p1 connected by arcs and transitions, in which the first and the last place are the same. The statistical cycle metric SCM(c) is the statistical sum of the delays of all places along the cycle c, d(c), divided by the number of tokens that reside in the cycle, m0(c), defined as:

SCM(c) = d(c) / m0(c)    (14)

The cycle time of a VTPN is defined as the largest cycle metric among all cycles in the VTPN, which must be computed statistically, i.e. CT = MAX_{c ∈ C} SCM(c), where C is the set of all cycles in the Variant-TPN.

As mentioned before, the delays and the power consumptions of the nodes in the VTPN are modeled statistically. So the algorithm has to use the statistical mathematical operations.

Performance analysis of asynchronous circuits modeled by VTPNs is comprehensively discussed in [8][9][23]. Power analysis, on the other hand, needs one main calculation: finding the statistical sum of the power consumptions of the nodes of the VTPN.
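
With these operations, the two quantities the optimizer needs reduce to a few statistical reductions over the VTPN. The fragment below is a schematic continuation of the previous sketch (it reuses the RV, stat_sum, stat_div and stat_max helpers defined there, and is again only our illustration of Equation (14) and of the cycle-time and total-leakage computations):

    def statistical_cycle_metric(place_delays, m0_c):
        # SCM(c) = statistical sum of the place delays along cycle c, divided by m0(c).
        d_c = place_delays[0]
        for d in place_delays[1:]:
            d_c = stat_sum(d_c, d)
        return stat_div(d_c, m0_c)

    def cycle_time(cycles):
        # CT = statistical MAX over all cycles c in C of SCM(c).
        scms = [statistical_cycle_metric(delays, m0) for delays, m0 in cycles]
        ct = scms[0]
        for scm in scms[1:]:
            ct = stat_max(ct, scm)
        return ct

    def total_leakage(node_powers):
        # Power analysis: statistical SUM of the leakage of all VTPN nodes.
        total = node_powers[0]
        for p in node_powers[1:]:
            total = stat_sum(total, p)
        return total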

4.3 Fitness Function

The fitness of a chromosome should be related to both the leakage power consumption and the performance metric of that particular configuration, since improving one causes the other to degrade. So we apply a 2-dimensional fitness evaluation to the individuals. In each step, the fitness weight of each configuration is calculated as the number of configurations for which both parameters are better than those of the current configuration. Fig. 3 shows an example of one step of the fitness evaluation. In this figure, for example, an individual with weight 4 means that there are four individuals with both a better leakage power and a better delay metric than that individual. As the power and performance analysis is performed statistically, we have to use a deterministic measure to place each configuration in the 2-dimensional graph. So, for each of the two parameters, we use a deterministic value derived from its statistics (Equations (15) and (16)): the delay measure is obtained from the mean and the standard deviation of the statistical cycle metric of the configuration, and the power measure is obtained from the mean and the standard deviation of its leakage power. In the last step, we have to choose one configuration as the result of the optimization.

Based on the application for which the optimization is done, the power and the performance of the desired configuration can be given specific weights in the last optimization step.
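
The ranking step itself is then a simple dominance count in the (power, delay) plane, once every statistical metric has been collapsed to a single number. The sketch below is our own illustration; in particular, the mean-plus-three-sigma collapse used here is only a placeholder for Equations (15)-(16), whose exact form is not reproduced above.

    def deterministic(mu, sigma, k=3.0):
        # Placeholder for Eqs. (15)-(16): collapse mean/std-dev into one number.
        return mu + k * sigma

    def fitness_weights(metrics):
        # metrics: list of (power_mu, power_sigma, delay_mu, delay_sigma), one per individual.
        # The weight of an individual is the number of individuals that beat it in
        # BOTH leakage power and delay (lower weight = fitter).
        pts = [(deterministic(pm, ps), deterministic(dm, ds)) for pm, ps, dm, ds in metrics]
        return [sum(1 for (p2, d2) in pts if p2 < p and d2 < d) for (p, d) in pts]

    # Example usage:
    example = [(32.0, 1.40, 8.54, 0.24), (30.0, 1.20, 8.10, 0.20),
               (28.0, 1.00, 7.90, 0.30), (27.0, 1.10, 8.00, 0.20)]
    print(fitness_weights(example))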

Fig. 3. An example of the fitness evaluation method

5 Experimental Results

To test our method, we constructed a multiple-Vth standard cell library using a 90 nm process. For NMOS (PMOS) transistors, the high threshold voltage and the low threshold voltage are 0.22V (-0.22V) and 0.12V (-0.12V), respectively. The library was characterized using the Berkeley 90 nm BSIM predictive model [26]. An asynchronous synthesis toolset (whose name is not cited here for the sake of blind review) was employed to synthesize the benchmarks. The circuits were optimized for maximum speed and lowest leakage power consumption simultaneously using the 2-dimensional fitness graph. It is observed that, on average, 86% of the leakage power of dual-Vth asynchronous circuits can be reduced in standby mode.


To verify the results of our statistical dual-Vth assignment method, we used Monte Carlo (MC) simulation for comparison. To balance the accuracy, we chose to run 10,000 iterations for the MC simulation. The runtime of the MC simulation ranges from 30 minutes to 48 hours, depending on circuit size and complexity. A comparison of these results with those of the statistical approach is shown in Tables 1 and 2. For each test case, the mean and standard deviation (SD) values of the leakage power consumption and of the performance metric are listed for both methods. The results of the proposed method can be seen to be close to the MC results: the average error is 3.56% and 52.08% for the mean and the variance of the delays, respectively, and the average error for the mean and the variance of the power is 5.23% and 48.39%, respectively. Although there is some error between the proposed method and the MC simulation, there is a considerable difference in the runtime of the two methods, as shown in Table 3.

Table 1. Result Comparison of the Statistical Dual-Vth Assignment and MC-based Dual and Single Vth Assignment Simulation (Delay Values)

Benchmark | # of Nodes | # of Cycles | Proposed Flow, Delay (ns): Mu / Sigma | Monte-Carlo Dual-Vth, Delay (ns): Mu / Sigma | Monte-Carlo Single-Vth, Delay (ns): Mu / Sigma
A | 6 | 17 | 8.540 / 0.243 | 8.091 / 1.607 | 8.102 / 1.986
B | 10 | 51 | 7.533 / 0.235 | 8.54 / 1.033 | 8.601 / 1.589
C | 16 | 1389 | 14.711 / 0.251 | 14.54 / 1.105 | 14.729 / 2.307
D | 26 | 1864 | 17.207 / 0.407 | 16.984 / 1.554 | 17.108 / 8.032
E | 35 | 7369 | 15.909 / 0.198 | 15.317 / 0.998 | 15.399 / 3.671
F | 20 | 276 | 13.724 / 0.247 | 14.79 / 2.193 | 14.84 / 1.903
G | 56 | 812 | 16.932 / 0.341 | 16.428 / 1.817 | 16.609 / 2.108

Table 2. Result Comparison of the Statistical Dual-Vth Assignment and MC-based Dual and Single Vth Assignment Simulation (Leakage Power Values)

Benchmark | # of Nodes | # of Cycles | Proposed Flow, Power (mW): Mu / Sigma | Monte-Carlo Dual-Vth, Power (mW): Mu / Sigma | Monte-Carlo Single-Vth, Power (mW): Mu / Sigma
A | 6 | 17 | 32.00 / 1.400 | 34.27 / 2.716 | 54.56 / 3.021
B | 10 | 51 | 81.00 / 2.5865 | 75.87 / 5.020 | 137.36 / 4.907
C | 16 | 1389 | 108.00 / 2.7893 | 103.30 / 6.049 | 183.85 / 8.145
D | 26 | 1864 | 186.00 / 3.6633 | 175.90 / 7.9210 | 318.28 / 6.0843
E | 35 | 7369 | 159.11 / 2.8184 | 152.51 / 5.0319 | 276.05 / 8.7823
F | 20 | 276 | 169.00 / 3.8458 | 157.85 / 10.675 | 263.91 / 8.134
G | 56 | 812 | 339.36 / 4.6304 | 344.91 / 6.2742 | 609.35 / 8.6292


The results of the dual-Vth technique are compared with the delay and power values of the single-Vth technique in Tables 1 and 2. As reported, the proposed method optimizes the leakage power consumption of the benchmarks at the expense of some performance overhead. The average optimization is 40.5% and 54.4% for the mean and the variance of the power, respectively.

Table 3 shows the runtime of each method for our benchmarks. It varies across the benchmarks depending on circuit size and timing constraints.

Table 3. The Runtime for the Statistical Dual-Vth Assignment in Comparison with MC-based Simulation

Benchmark | # of Nodes | # of Cycles | Runtime SDV (minutes) | Runtime MC (hours)
A | 6 | 17 | 2 | 0.5
B | 10 | 51 | 4 | 3.2
C | 16 | 1389 | 6 | 11.7
D | 26 | 1864 | 7 | 17.3
E | 35 | 7369 | 9 | 37.3
F | 20 | 276 | 7 | 16.4
G | 56 | 812 | 11 | 47.8

6 Conclusions

In this paper, an efficient method for statistically exploiting the dual-threshold voltage assignment technique to reduce the leakage power of asynchronous circuits, while maintaining the high performance of these circuits, has been presented. The issue of process variation is addressed by using a statistical approach to the timing and power analysis of asynchronous circuits. The decomposed circuit is used to generate a Variant-Timed Petri Net model. The proposed method for assigning high and low threshold voltages is based on a genetic algorithm. The experimental results show the efficiency of the proposed method.

We see many avenues for further investigation. In order to propose a more accurate framework and reduce the error of the method, we will consider the correlation of delay and leakage power values in our future work. In addition, the application of our method to a broader class of concurrent systems, such as GALS and embedded systems, is a promising topic for researchers in asynchronous circuit design, as it has been for synchronous design.

References

[1] Tang, C.K., Lin, C.Y., Lu, Y.C.: An Asynchronous Circuit Design with Fast Forwarding Technique at Advanced Technology Node. In: Proceedings of ISQED 2008. IEEE Computer Society, Los Alamitos (2008)

[2] Beerel, P.A.: Asynchronous Circuits: An Increasingly Practical Design Solution. In: Proceedings of ISQED 2002. IEEE Computer Society, Los Alamitos (2002)

[3] Martin, A.J., et al.: The Lutonium: A Sub-Nanojoule Asynchronous 8051 Microcontroller. In: Proceedings of ASYNC 2003 (2003)

[4] Yun, K.Y., Beerel, P.A., Vakilotojar, V., Dooply, A.E., Arceo, J.: A low-control-overhead asynchronous differential equation solver. In: Proceedings of ASYNC 1997 (1997)

[5] Garnica, O., Lanchares, J., Hermida, R.: Fine-grain asynchronous circuits for low-power high performance DSP implementations. In: Proceedings of SiPS (2000)

[6] Narendra, S.G., Chandrakasan, A. (eds.): Leakage in Nanometer CMOS Technologies. Springer, Heidelberg (2005)

[7] Ghavami, B., Pedram, H.: Design of dual threshold voltages asynchronous circuits. In: Proceedings of ISLPED 2008 (2008)

[8] Raji, M., Ghavami, B., Pedram, H.: Statistical Static Performance Analysis of Asynchronous Circuits Considering Process Variation. In: Proceedings of ISQED 2009, pp. 291–296 (2009)

[9] Raji, M., Ghavami, B., Pedram, H., Zarandi, H.R.: Process Variation Aware Performance Analysis of Asynchronous Circuits Considering Spatial Correlation. In: Monteiro, J., van Leuken, R. (eds.) PATMOS 2009. LNCS, vol. 5953, pp. 5–15. Springer, Heidelberg (2010)

[10] Orshansky, M., Nassif, S.R., Boning, D.: Design for Manufacturability and Statistical Design, A Constructive Approach, pp. 11–15. Springer, Heidelberg (2008)

[11] Borkar, S., et al.: Parameter Variation and Impact on Circuits and Microarchitecture. In: Proceedings of DAC 2003, pp. 338–342 (2003)

[12] Rao, R., et al.: Parametric yield estimation considering leakage variability. In: Proceedings of DAC 2004, pp. 442–447 (June 2004)

[13] Orshansky, M., Nassif, S.R., Boning, D.: Design for Manufacturability and Statistical Design, A Constructive Approach, pp. 11–15. Springer, Heidelberg (2008)

[14] Wei, L., Chen, Z., Roy, K., Johnson, M.C., Ye, Y., De, V.K.: Design optimization of dual-threshold circuits for low-voltage low-power applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 7(1), 16–24 (1999)

[15] Wong, C.G., Martin, A.J.: High-Level Synthesis of Asynchronous Systems by Data-Driven Decomposition. In: Proceedings of DAC (2003)

[16] Dinh Duc, A.V., Rigaud, J.B., Rezzag, A., Sirianni, A., Fragoso, J., Fesquet, L., Renaudin, M.: TASTCAD Tools: Tutorial. In: Proceedings of ASYNC (2002)

[17] Prakash, P., Martin, A.J.: Slack Matching Quasi Delay-Insensitive Circuits. In: Proceedings of ASYNC, pp. 195–204 (2006)

[18] Wong, C.G., Martin, A.J.: High-Level Synthesis of Asynchronous Systems by Data-Driven Decomposition. In: Proceedings of the 40th DAC, Anaheim, CA, USA (2003)

[19] Beerel, P.A., Kim, N.-H., Lines, A., Davies, M.: Slack Matching Asynchronous Designs. In: Proceedings of ASYNC, Washington, DC, USA (2006)

[20] Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs (1981)

[21] Li, X., Le, J., Pileggi, L.T.: Statistical Performance Modeling and Optimization. Foundations and Trends in Electronic Design Automation, vol. 1(4), pp. 331–480 (2003)

[22] Kuo, J.T., Cheng, W.C., Chen, L.: Multiobjective water resources systems analysis using genetic algorithms - application to Chou-Shui River Basin, Taiwan. Water Science and Technology 48(10), 71–77 (2003)


[23] Raji, M., et al.: Process variation-aware performance analysis of asynchronous circuits. Microelectron. J. (2010), doi:10.1016/j.mejo.2009.12.013

[24] Lane, B.: SystemC Language Reference Manual. Open SystemC Initiative, San Jose, CA (2003)

[25] Karp, R.M.: A characterization of the minimum cycle mean in a digraph. Discrete Mathematics 23, 309–311 (1978)

[26] Sheu, B.J., Scharfetter, D.L., Ko, P.K., Teng, M.C.: BSIM: Berkeley Short-Channel IGFET Model for MOS Transistors. IEEE Journal of Solid-State Circuits SC-22(4), 558–566 (1987)

[27] Chang, H., Sapatnekar, S.: Statistical timing analysis under spatial correlations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(9), 1467–1482 (2005)

[28] Agarwal, A., Blaauw, D., Zolotov, V.: Statistical timing analysis for intra-die process variations with spatial correlations. In: Proceedings of ICCAD, pp. 900–907 (2003)


Optimizing and Comparing CMOS Implementations of the C-Element in 65nm Technology: Self-Timed Ring Case

Oussama Elissati1,2, Eslam Yahya1,3, Sébastien Rieubon2, and Laurent Fesquet1

1 TIMA Laboratory, Grenoble, France {Oussama.Elissati,Eslam.Yahya,Laurent.Fesquet}@imag.fr

2 ST-Ericsson, Grenoble, France [email protected]

3 Banha High Institute of Technology, Banha, Egypt

Abstract. Self-timed rings are a promising approach for designing high-speed serial links or clock generators. This study focuses on the ring stage components (a C-element and an inverter) and compares the performance of different implementations of this component in terms of speed, power consumption and phase noise. We also propose a new self-timed ring stage, composed only of a C-element with complementary outputs, which allows us to increase the maximum speed by 25% and reduce the power consumption by 60% at the maximum frequency. All the electrical simulations and results have been performed using a CMOS 65nm technology from STMicroelectronics.

1 Introduction

Oscillators, and especially voltage controlled oscillators, are basic blocks in almost all designs. Indeed, they are employed for generating the clock synchronization signal, for modulating and demodulating signals, or for retrieving signals in noise. The oscillator features depend on the application; however, communication applications often embed their oscillators in Phase-Locked Loops (PLL) with strong requirements on stability, phase noise and power consumption. Moreover, with the advanced nanometric technologies, it is also required to deal with the process variability of the technology. Today many studies are oriented towards asynchronous ring oscillators, which present well-suited characteristics for managing process variability and offer an appropriate structure to limit the phase noise. Therefore self-timed rings are considered a promising solution for generating clocks.

In [1], self-timed rings are efficiently used to generate high-resolution timing signals. Their robustness against process variability in comparison to inverter rings is proven in [2]. They can be used to implement data-driven clocks [3]. Moreover, a self-timed ring can easily be configured to change its frequency by controlling its initialization at reset time, while, at the opposite, inverter rings are not programmable [4]. A fully programmable, stoppable oscillator based on self-timed rings is presented in [5].

The goal of this paper is to give the designer some guidelines for using self-timed rings as oscillators depending on the design requirements. The paper is mainly oriented towards phase noise reduction, speed and power consumption. The paper is organized as follows. Section 2 provides the paper background and gives some definitions. In Section 3, we present the C-element implementations, which are the main component of the ring. In order to target an optimal design of the stage, we used the logical effort method introduced by I. Sutherland et al. [10] and electrical simulations. We also propose a new self-timed ring stage composed only of a C-element with complementary outputs, and we compare the performance of the different implementations of the C-element in terms of speed and power consumption.

2 Self-Timed Rings

The C-element is the basic element in asynchronous circuit design, introduced by D. E. Muller. C-elements set their output to the input values if their inputs are equal and hold their output otherwise. Fig. 1 shows a possible CMOS implementation where the initialization circuit is omitted.

Fig. 1. Muller C-element

Each stage of the STR is composed of a C-element and an inverter connected to the input B. The input connected to the previous stage is marked F (Forward) and the input connected to the following stage is marked R (Reverse); C denotes the output of the stage, as shown in Fig. 2.

Fig. 2. Self-Timed Ring

Tokens and bubbles: This subsection introduces the notions of Tokens "T" and Bubbles "B", which are very important to understand the behavior of the STR. Stagei contains a token if its output Ci is not equal to the output Ci+1 of stagei+1. On the other hand, stagei contains a bubble if its output Ci is equal to the output Ci+1 of stagei+1.

Ci = Ci+1  ⇒  stagei contains a Bubble,       and       Ci ≠ Ci+1  ⇒  stagei contains a Token

The number of tokens and bubbles will be denoted NT and NB, respectively. For keeping the ring oscillating, NT must be an even number; the reader can think of this as the dual of designing an inverter ring with an odd number of stages. Each stage of the STR contains either a token or a bubble, so NT + NB = N, where N is the number of ring stages.

2.1 Propagation Rules

If a token is present in stagei, it will propagate to stagei+1 if and only if stagei+1 contains a bubble. The bubble of stagei+1 will then move backward to stagei. This implies a transition on the output of stagei+1. As an example, the token/bubble movements in a five-stage STR which contains 4 tokens and one bubble are shown hereafter (a simulation sketch of this rule follows the sequence).

TTBTT (01001) → TBTTT (01101) → BTTTT (00101) → TTTTB (10101) → TTTBT (10100) → TTBTT (01001)
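
This propagation rule can be checked with a few lines of simulation. The sketch below (our own illustration, not a tool from the paper) encodes the stage outputs as bits, derives the token/bubble pattern of each state, and repeatedly fires the rule "a token moves into the bubble ahead of it, which toggles that stage's output"; it reproduces the token/bubble sequence listed above (after one full rotation of the pattern the outputs come back complemented, as expected for a self-timed ring).

    def pattern(c):
        # Stage i holds a Token if C(i) != C(i+1) in the ring, otherwise a Bubble.
        n = len(c)
        return "".join("T" if c[i] != c[(i + 1) % n] else "B" for i in range(n))

    def step(c):
        # A token in stage i propagates to stage i+1 when stage i+1 holds a bubble,
        # which toggles the output C(i+1) of stage i+1.
        n, pat, nxt = len(c), pattern(c), list(c)
        for i in range(n):
            if pat[i] == "T" and pat[(i + 1) % n] == "B":
                nxt[(i + 1) % n] ^= 1
        return nxt

    # Five-stage ring with 4 tokens and 1 bubble, starting from outputs 01001 (TTBTT).
    state = [0, 1, 0, 0, 1]
    for _ in range(6):
        print(pattern(state), "".join(map(str, state)))
        state = step(state)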

2.2 Configurability

The oscillation frequency of an STR depends on its initialization (number of tokens and bubbles). The oscillation frequency of a self-timed ring can be approximated according to the number of tokens and bubbles by the formula [5]:

F_OSC = 1 / ( 2·D·(R + 1) ),   with   (R, D) = (NB/NT, Dff)  if  Dff/Drr ≥ NT/NB,   and   (R, D) = (NT/NB, Drr)  if  Dff/Drr ≤ NT/NB    (1)

where Dff is the static forward propagation delay from the input F to the output C, and Drr is the static reverse propagation delay from the input R to the output C.

The maximum frequency is achieved when Dff/Drr = NT/NB. This equality ensures the evenly spaced propagation mode.
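
Using the reconstruction of Equation (1) given above, the frequency estimate becomes a one-line computation. The small sketch below is our own illustration with purely hypothetical delay values (it is not data from this work):

    def str_frequency(n_t, n_b, d_ff, d_rr):
        # Approximate STR oscillation frequency, Eq. (1) as reconstructed above:
        # F = 1 / (2 * D * (R + 1)), where (R, D) depends on the limiting phenomenon.
        if d_ff / d_rr >= n_t / n_b:          # token (forward) propagation limits
            r, d = n_b / n_t, d_ff
        else:                                  # bubble (reverse) propagation limits
            r, d = n_t / n_b, d_rr
        return 1.0 / (2.0 * d * (r + 1.0))

    # Hypothetical delays: Dff = 30 ps, Drr = 45 ps.
    for n_t, n_b in [(2, 3), (2, 2), (4, 1)]:
        f = str_frequency(n_t, n_b, 30e-12, 45e-12)
        print(f"{n_t}T/{n_b}B -> {f / 1e9:.2f} GHz")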

2.3 Phase Noise

Noise in the MOS transistor is divided into two main contributors: thermal noise and flicker noise. The thermal noise is responsible for the noise floor at high frequencies, while the flicker noise is reflected by a rise in noise at low frequencies. The phenomenon of up-conversion of amplitude noise into phase noise is complex and has different origins. However, beyond the offset frequency f0/2Qch, HF thermal noise imposes a noise floor.

The phase noise is given by the semi-empirical Leeson formula [13]

L(fm) = 10·log[ ( 2·F·k·T0 / Ps ) · ( 1 + ( f0 / (2·Qch·fm) )² ) · ( 1 + fc/fm ) ]    (2)


where: Qch is the loaded Q-factor; F is the noise factor; f0 is the carrier frequency; k is Boltzmann's constant; fm is the frequency offset; T0 is the temperature (290 K); fc is the corner frequency; Ps is the signal power.

The Figure of Merit (FOM) is a parameter that allows the comparison of oscillators by normalizing the phase noise with respect to the oscillation frequency and the power consumption. It is calculated using the equation [14]:

FOM(fm) = L(fm) − 20·log( f0 / fm ) + 10·log( Ps / 1 mW )    (3)
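
Both expressions are straightforward to evaluate numerically. The sketch below is our own illustration of Equations (2) and (3) as reconstructed above, with arbitrary example values (not measurements from this work):

    import math

    K = 1.380649e-23     # Boltzmann's constant, J/K

    def leeson_L(fm, f0, q_ch, noise_factor, p_s, f_c, t0=290.0):
        # L(fm) in dBc/Hz, Eq. (2): semi-empirical Leeson model.
        x = (2.0 * noise_factor * K * t0 / p_s) \
            * (1.0 + (f0 / (2.0 * q_ch * fm)) ** 2) \
            * (1.0 + f_c / fm)
        return 10.0 * math.log10(x)

    def fom(l_fm, fm, f0, p_s):
        # Eq. (3); the signal power is referred to 1 mW.
        return l_fm - 20.0 * math.log10(f0 / fm) + 10.0 * math.log10(p_s / 1e-3)

    # Example values only:
    f0, fm, p_s = 2.5e9, 1e6, 1e-3
    L = leeson_L(fm, f0, q_ch=10.0, noise_factor=3.0, p_s=p_s, f_c=100e3)
    print(round(L, 1), "dBc/Hz ->", round(fom(L, fm, f0, p_s), 1), "dBc/Hz")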

Fig. 3. Up-conversion of noise in oscillators

3 C-Element Implementations

As the C-element is the main component of the self-timed ring, it seems essential to study it in order to find the most interesting implementation depending on the application and its specifications. This section presents different implementations of the C-element and compares them in terms of consumption, frequency and phase noise. The C-elements are also studied in order to find design rules to optimize these cells in terms of speed and phase noise, by applying the "logical effort" model introduced by I. Sutherland et al. [10] and by simulations using the CMOS 65 nm technology from STMicroelectronics.

In addition to the dynamic implementation [11], there are three different static implementations of the C-element in the literature: the weak-feedback one by Martin [7], the conventional one by Sutherland [8] and the symmetric one by Van Berkel [9].

The dynamic implementation (Fig. 4.a) is composed of the main tree of transistors of the C-element and an output inverter. These transistors, called "switchers", contribute to the switching of the output.

For the static implementations, in addition to the "switchers" we have a mechanism for memorizing the output value; these transistors are called "keepers". The "keepers" are not active during the switching; they provide the feedback that keeps the output state when the input values are different, so they are made as small as possible to reduce their load and limit the race problem [11].



Fig. 4. C-element implementations: Dynamic (a), weak feedback (b), conventional (c) and symmetric (d)

The weak feedback implementation of the C-element is shown in Fig. 4.b; this implementation is composed of the same "switchers" as the dynamic one, plus a weak feedback inverter (N4 and P4) to maintain the state of the output. This circuit suffers from a race problem at node C'.

In the conventional implementation (see Fig. 4.c), in addition to the weak-feedback inverter, there are four additional transistors (N5, N6, P5 and P6) to disconnect this weak-feedback inverter when the inputs are equal. N4, N5, N6, P4, P5 and P6 are sized at the minimal width allowed by the technology.

The C-element introduced by Van Berkel is illustrated in Fig. 4.d. This implementation is slightly different from the previous ones. The transistors are split in two parts. The "keepers" are N6 and P6, and the split transistors are also involved in the state holding.

4 Design of the Ring Stages

4.1 Designing with the Logical Effort Method

The first step is to find the most optimized way to design the stage of the self-timed ring, composed of the C-element and an inverter. To do this we applied the "logical effort" method [10] introduced by I. Sutherland et al. This method allows us to optimize the stage speed. We expect that this optimization of the speed will also optimize the phase noise.

Table 1. Key definitions of logical effort

Term | Stage expression | Path expression
Logical effort | g | G = ∏ g_i
Electrical effort | h = Cout / Cin | H = Cout-path / Cin-path
Branching effort | b = CTotal / Cused | B = ∏ b_i
Effort | f = g·h | F = G·B·H
Stage effort | | f̂ = F^(1/N)

The logical effort g captures the effect of the logic gate's topology on its ability to produce output current. The electrical effort h describes how the electrical environment of the logic gate affects its performance and how the size of the transistors in the gate determines its load-driving capability. The branching effort b describes the fan-out of the gate.

The output of a self-timed ring stage is connected to the F input of the following stage and to the R input of the previous stage. Therefore the output capacitance of the stage is:

Cout = CR + CF    (1)       where       CF = (1 + γ)·Wn    (2)       and       CR = U2·(1 + γ)·Wn    (3)

CF, CR and Cout are respectively the F-input, R-input and output capacitances of the stage; Wn is the NMOS transistor width; γ represents the PMOS/NMOS width ratio; U1 and U2 express the contribution of Wn to the input and output inverter capacitances of the stage.

Cout = (1 + U2)·(1 + γ)·Wn    (4)

We start with the path R → C. This path is composed of three sub-stages: the input inverter, the main tree of the C-element and the output inverter.

The electrical effort of the path is:

H(R→C) = Cout / Cin = (CR + CF) / CR = 1 + CF/CR = 1 + 1/U2    (5)

The branching effort is: B = ∏ b_i = 1 × 1 × 1 = 1

The logical effort is: G = ∏ g_i = 1 × 2 × 1 = 2

The effort of the path R → C (Drr) is:

F = G·B·H = 2·(1 + 1/U2)    (6)

Fig. 5. Self-Timed Ring Stage


The stage effort giving the minimum delay is:

f̂ = F^(1/N) = ( 2·(1 + 1/U2) )^(1/3)    (7)

To have the minimum delay we must respect the relation Cin_i · f̂ = g_i · Cout_i, i.e. Cin_i = g_i · Cout_i / f̂. Applying this rule to our circuit, we find:

Cin2 / Cout = 1 / ( 2 + 2/U2 )^(1/3) = U1·(1 + γ)·Wn / ( (1 + U2)·(1 + γ)·Wn ) = U1 / (1 + U2)    (8)

Cin1 / Cin2 = 2 / ( 2 + 2/U2 )^(1/3) = (1 + γ)·Wn / ( U1·(1 + γ)·Wn ) = 1 / U1    (9)

Cin / Cin1 = 1 / ( 2 + 2/U2 )^(1/3) = U2·(1 + γ)·Wn / ( (1 + γ)·Wn ) = U2    (10)

Cin, Cin1 and Cin2 are respectively the input capacitance of the input inverter, of the main tree of the C-element and of the output inverter. From equation (10) we find that U2 = 0.56, and from equations (8) and (9) we find that U1 = 0.89.

The path F → C (Dff) is composed of two sub-stages, the main tree of the C-element and the output inverter.

Cout = (1 + U2)·(1 + γ)·Wn    (11)

The electrical effort of the path is:

H(F→C) = Cout / Cin = (CR + CF) / CF = 1 + CR/CF = 1 + U2    (12)

The branching effort is B = ∏ b_i = 1 × 1 = 1 and the logical effort is G = ∏ g_i = 2 × 1 = 2.

The effort of the path F → C is F = G·B·H = 2·(1 + U2), and the stage effort giving the minimum delay is f̂ = F^(1/N) = ( 2·(1 + U2) )^(1/2).

Cin2 / Cout = 1 / ( 2·(1 + U2) )^(1/2) = U1·(1 + γ) / ( (1 + U2)·(1 + γ) ) = U1 / (1 + U2)    (13)

Cin1 / Cin2 = 2 / ( 2·(1 + U2) )^(1/2) = (1 + γ) / ( U1·(1 + γ) ) = 1 / U1    (14)

We found that U1 = 0.89 and U2 = 0.56 are solutions of these two equations. So we have the same constraints on the two paths.
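
As a quick numerical cross-check of the relations above (as reconstructed here; this is our own sketch, not part of the original design flow), solving U2 = 1/f̂ with f̂ = (2 + 2/U2)^(1/3) by fixed-point iteration and then taking U1 = f̂/2 lands very close to the values quoted in the text, and the F → C constraint is then satisfied as well:

    def solve_u2(iterations=200):
        # Fixed-point iteration for U2 = (2 + 2/U2) ** (-1/3)   (R -> C path, Eq. (10)).
        u2 = 0.5
        for _ in range(iterations):
            u2 = (2.0 + 2.0 / u2) ** (-1.0 / 3.0)
        return u2

    u2 = solve_u2()
    f_hat = (2.0 + 2.0 / u2) ** (1.0 / 3.0)   # optimal stage effort of the R -> C path
    u1 = f_hat / 2.0                          # from Cin1/Cin2 = 2/f_hat = 1/U1, Eq. (9)
    print(u1, u2)                             # ~0.885 and ~0.565, close to the 0.89 / 0.56 above
    # The F -> C path imposes U1 = sqrt((1 + U2)/2), Eqs. (13)-(14); it is met as well:
    print(abs(u1 - ((1.0 + u2) / 2.0) ** 0.5) < 1e-9)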

4.2 Designing with Electrical Simulations

To check the efficiency of the logical effort technique, we carried out simulations based on the Eldo RF simulator in the CMOS 65 nm technology from STMicroelectronics. The goal is to find the design rules for sizing the ring stage in order to optimize its speed. We simulated a few examples of rings with different implementations of the C-element and compared the performance of the four implementations presented in Section 3.

For a given current consumption and for each value of the pair (U1, U2), we extracted the frequency, the phase noise, the FOM and the area. Then we performed simulations for various token/bubble configurations.

Fig. 6. The frequency (U1, U2)

Fig. 6 shows the frequency simulation results as a function of the U1 and U2 parameters. We note that there is an optimal point for speed. The following table presents the optimal point for different token/bubble combinations.

Table 2. Optimal frequency

Ring | U1 (optimum frequency) | U2 (optimum frequency)
3 stages 1B/2T | 1 | 0.9
4 stages 2B/2T | 1 | 0.9
5 stages 1B/4T | 1 | 0.9
5 stages 3B/2T | 0.9 | 0.5

Note that the optimal point for the first three cases is U1 = 1 and U2 = 0.9. For the case 3B/2T, this optimal point is located at U1 = 0.9 and U2 = 0.5. It is the only ring which corresponds to the results obtained by the logical effort method. In the cases 1B/2T, 2B/2T and 1B/4T, the optimization is done on a single path, Drr.

The ratio NT/NB = Dff/Drr (which corresponds to the highest frequency) cannot be achieved, because it would require a Dff greater than or equal to Drr, which is impossible with the proposed structure. In these cases, the algorithm seeks to optimize the path R → C, taking the input F into account only as a capacitance, since Dff does not act on the oscillation frequency. This explains the different values of U1 and U2 compared to those obtained with the "logical effort" method. In the case 3B/2T, NT/NB = Dff/Drr (maximal frequency) is easily reached; the optimization is done on both paths R → C and F → C.


Fig. 7 represents the frequency vs. power consumption diagram in the optimal case for the four C-element implementations, for a five-stage ring with two tokens and three bubbles. The power has been computed with values of Wn between 0.12 μm and 3 μm. We performed this simulation with other rings and for different values of γ = Wp/Wn; the conclusions were the same. The symmetric implementation is a good compromise between low power consumption and a robust circuit behavior. For high-speed applications, the dynamic implementation is a good choice, while the conventional and weak feedback implementations allow us to reach lower frequencies.

Fig. 7. Power Vs. Freq. in STR (3B/2T)

We also extracted the phase noise and the FOM as functions of the U1 and U2 parameters; the optimal frequency corresponds to the optimal FOM for the four different rings. This confirms our initial hypothesis. Moreover, this optimal point always involves a very small area. The highest frequency that we can achieve with this structure of self-timed rings is around 6.6 GHz, with the dynamic implementation in the CMOS 65 nm technology from STMicroelectronics.

In order to improve the performance of the self-timed ring, we propose a modified ring stage. The modified stage is simply a C-element, without the R input inverter. We just interconnect the ring structure with the complementary outputs C and C’.

4.3 Modified Self-Timed Ring Stage

Fig. 8 represents our modified self-timed ring. For each stage, the output C is connected to the F input of the following stage and the complementary output C' is connected to the R input of the previous stage.

This modified self-timed ring stage allows us to improve the maximal speed by 25%, to reduce the power consumption by 55% at the maximum frequency, and to reduce the power consumption per bubble or token by 30%. With such a modified structure we can achieve a maximal frequency of 8.3 GHz with the symmetric implementation in CMOS 65 nm (see Table 3).


Fig. 8. Optimized self-timed ring stage

Table 3. Frequency and Power with various T/B configurations

Config. | Modified 2T/1B | Classical 2T/3B | Modified 2T/3B
Freq. (GHz) | 7.9 | 6.4 | 6.1
Power (μW) | 398 | 892 | 698

Fig. 9. Power vs. frequency (GHz) in the modified STR

Fig. 9 represents the frequency vs. consumption diagram with the modified STR stage. The behavior is the same compared to the classical STR, with one main difference: the performance of the symmetric implementation is very close to, or even better than, that of the dynamic one when Wn is large enough. This improvement is due to the symmetric implementation being divided into two sub-trees. Indeed, with the dynamic implementation, the PMOS and NMOS transistors reach their saturation delay earlier than the symmetric implementation transistors, and for large Wn the effect of the "keepers" on this delay becomes negligible. In addition, the symmetric implementation ensures, at lower speed, better operating conditions for the C-element.

4.4 Performance Comparison

Fig. 10 shows the Figure of Merit (FOM) as a function of the Wn value. This figure shows that the noise performance of the weak feedback implementation is less efficient compared to the other implementations. Notice that the conventional implementation is slightly better in most cases. We can also see in Fig. 11 that, for a given frequency, the phase noise is better in the conventional implementation than in the weak feedback implementation, even though it consumes more power.

Fig. 10. FOM vs. Wn in STR

Fig. 11. Phase noise vs. frequency in STR

Table 4. Comparison between the four implementations

Implementation | Speed | Power Consu. | Phase noise | FOM | Frequency range
Dynamic | High | Low | High | Low | Short
Symmetric | High | Low | High | Low | Short
Conventional | Medium | Medium | Low | Low | Medium
Weak feedback | Low | High | Medium | High | Large


As we can see from Fig. 7 and Fig. 9, the weak feedback implementation has a large frequency range. On the contrary, the symmetric and dynamic implementations have a short one. Moreover, the weak feedback implementation is able to reach low frequencies at a low area cost. Table 4 presents a summary comparison of the implementations. We note that this comparison holds for both the classical and the modified stages.

5 Conclusions

This paper addresses the difficult problem of designing self-timed ring oscillators targeting low phase noise applications. The self-timed ring is chosen as the oscillator core because of its known advantages from many points of view: configurability, accuracy, robustness against process variability, etc. A comparison of the C-element implementations in terms of speed, power consumption and phase noise has been carried out. We conclude that the symmetric implementation is a good trade-off between low power and a robust behavior of the C-element. For high-speed and low-power applications, the conventional and weak feedback implementations allow us to access lower frequencies with a low area cost. For low phase noise applications, we strongly recommend avoiding the usage of weak feedback implementations; in this respect, the conventional implementation seems to be the best choice. We also proposed a new self-timed ring stage, composed only of a simple C-element with its complementary output, which allows us to increase the maximum speed by 30% and reduce the power consumption by 60% at the maximal frequency. Moreover, these implementations (classical and modified) take advantage of the STR programmability, which gives more flexibility to the designer. We also suggested design rules to reduce the phase noise in STRs. This work will be completed by a circuit fabrication and test chip measurements.

References

[1] Ebergen, J.C., Fairbanks, S., Sutherland, I.E.: Predicting performance of micropipelines using Charlie diagrams. In: ASYNC 1998, San Diego, CA, USA, pp. 238–246. IEEE, Los Alamitos (April 1998)

[2] Fairbanks, S., Moore, S.: Analog micropipeline rings for high precision timing. In: ASYNC 2004, CRETE, Greece, pp. 41–50. IEEE, Los Alamitos (April 2004)

[3] Mullins, R., Moore, S.: Demystifying Data-Driven and Pausible Clocking Schemes. In: ASYNC 2007, Berkeley, California, USA, pp. 175–185. IEEE, Los Alamitos (March 2007)

[4] Hamon, J., Fesquet, L., Miscopein, B., Renaudin, M.: High-Level Time-Accurate Model for the Design of Self-Timed Ring Oscillators. In: ASYNC 2008, Newcastle, UK, pp. 29–38. IEEE, Los Alamitos (April 2008)

[5] Yahya, E., Elissati, O., Zakaria, H., Fesquet, L., Renaudin, M.: Programmable/Stoppable Oscillator Based on Self-Timed Rings. In: 15th IEEE Symposium on ASYNC 2009, Chapel Hill, USA, May 17-20, pp. 3–12 (2009)

[6] Winstanley, A., Greenstreet, M.R.: Temporal properties of self timed rings. In: CHARM 2001, London, UK, pp. 140–154. Springer, Heidelberg (April 2001)


[7] Martin, A.J.: Formal progress transformations for VLSI circuit synthesis. In: Dijkstra, E.W. (ed.) Formal Development of Programs and Proofs, pp. 59–80. Addison-Wesley, Reading (1989)

[8] Sutherland, I.E.: Micropipelines. ACM Commun. 32, 720–738 (1989)

[9] Berkel, K.v., Burgess, R., Kessels, J., Peeters, A., Roncken, M., Schalij, F.: A fully-asynchronous low-power error corrector for the DCC player. IEEE J. Solid-State Circuits 29, 1429–1439 (1994)

[10] Sutherland, I., Sproull, B., Harris, D.: Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann, San Francisco (1999)

[11] Shams, M., Ebergen, J.C., Elmasry, M.I.: Optimizing CMOS implementations of C-element. In: Proc. Int. Conf. Comput. Design (ICCD), pp. 700–705 (October 1997)

[12] Razavi, B.: A Study of Phase Noise in CMOS Oscillators. IEEE Journal of Solid-State Circuits 31(3) (March 1996)

[13] Leeson, D.B.: A simple model of feedback oscillator noise spectrum. Proc. IEEE 54, 329–330 (1966)

[14] Bunch, R.L.: A Fully Monolithic 2.5GHz LC Voltage Controlled Oscillator in 0.35 μm CMOS Technology. Master of Science in Electrical Engineering, Virginia Polytechnic Institute and State University, pp. 1–7 & 53–72 (April 2001)

[15] Hajimiri, A., Limotyrakis, S., Lee, T.H.: Jitter and phase noise in ring oscillators. IEEE Journal of Solid-State Circuits 34(6), 790–804 (1999)


Hermes-A – An Asynchronous NoC Router with Distributed Routing

Julian Pontes, Matheus Moreira, Fernando Moraes, and Ney Calazans

Faculty of Informatics, PUCRS, Porto Alegre, Brazil {julian.pontes,matheus.moreira,fernando.moraes,ney.calazans}@pucrs.br

Abstract. This work presents the architecture and ASIC implementation of Hermes-A, an asynchronous network on chip router. Hermes-A is coupled to a network interface that enables communication between the router and synchronous processing elements. The ASIC implementation of the router employed standard CAD tools and a specific library of components. Area and timing characteristics for a 180nm technology attest the quality of the design, which displays a maximum throughput of 3.6 Gbits/s.

Keywords: asynchronous circuits, network on chip.

1 Introduction

Interest in asynchronous circuits has increased due to the growing limitations faced during the design of synchronous System on a Chip (SoC) circuits, which often result in over-constrained design and operation [1]. However, asynchronous computer aided design (CAD) tools still have to undergo a long evolutionary path before being accepted by most designers. The lack of such tools makes it difficult for traditional circuit designers to access the full capabilities of asynchronous circuits.

Globally Asynchronous Locally Synchronous (GALS) design techniques may help overcome the limitations of synchronous design while maintaining a mostly synchronous design flow [2]. GALS techniques simplify the task of reaching the overall timing closure for SoCs, but typically require the addition of synchronization interfaces between each pair of communicating modules.

Synchronization interfaces bring a new set of design concerns, including metastability-free operation and keeping latency and throughput figures at acceptable levels when traversing several synchronization points. A good approach is to reduce the number of synchronization points as much as possible, to achieve better data transfer rates and improve overall robustness. One way to reduce this number in a complex GALS SoC is to employ fully asynchronous communication mechanisms.

Communication in current and future SoCs relies on the use of Networks on Chip (NoCs) [3]. Using a fully asynchronous NoC as the communication architecture for a SoC composed of synchronous processing elements (PEs), the number of synchronizations involved in a single point-to-point data transfer is reduced to two: one at the sender-NoC interface and another at the NoC-receiver interface. This paper describes the design and implementation of an asynchronous NoC router that can support the implementation of fully asynchronous NoCs.

The rest of this paper is divided into five sections. Section 2 describes related work and positions the new proposition with regard to it. Section 3 describes the architecture of the Hermes-A router, while Section 4 explores the characteristics of the router-to-PE interface. Section 5 discusses the ASIC implementation of Hermes-A and Section 6 presents conclusions and directions for further work.

2 Related Work

During this decade there has been a small, yet steady movement towards the research and implementation of fully asynchronous routers and corresponding NoCs. An encompassing review of the state of the art revealed ten relevant propositions of fully asynchronous interconnect architectures. Table 1 summarizes the main features of each of these, with the last row of the table presenting the features of the proposed Hermes-A router and NoC.

Table 1 is organized by the date of the first proposition of each interconnect architecture, in temporal order, although in some cases it cites later papers where updated data about the NoC is present.

Chain and RasP belong to a first generation of asynchronous interconnect frameworks, based on the careful design of point-to-point links using repeaters, pipelining and wire length control. To support implementation, both offer a set of asynchronous components (the so-called routers, arbiters and multiplexers) that permit sharing the point-to-point links from multiple sources to one destination. Nexus is a very efficient industrial implementation of an asynchronous (16x16) crossbar. Strictly speaking, none of these three architectures really agrees with the most accepted definition of NoCs as a network of multi-port routers and wires organized in a topology that forwards packets of information among processing elements. Accordingly, all three should display scalability problems as the number of PEs grows without bounds, which is expected for future technologies.

Another group of works includes the proposition of Quartana et al. and the asynchronous version of the Proteo NoC. These are experiments in prototyping asynchronous NoCs in FPGAs, with the corresponding lack of performance and prohibitive cost in area. Implementations of asynchronous devices in FPGAs more efficient than those cited in these works exist, as described in [14]. These rely on the use of FPGA layout and timing control tools to create asynchronous devices as FPGA hard macros that are compact and respect tight timing constraints. However, so far these have not been used for NoCs.

The remaining five NoCs/routers in Table 1 (QoS, MANGO, asynchronous QNoC, ANoC, ASPIN) and Hermes-A propose ASIC implementations of routers and links for 2D mesh topologies, although in some cases there is mention of adequacy to support other topologies as well. This is not the case for ASPIN, because of the chosen router organization. In this NoC, the router ports are distributed around the periphery of the PE, making inter-router links small compared to intra-router links. This facilitates the connection of PEs by abutment, but prevents easy use of topologies other than the 2D mesh. Even a similar 2D torus would be problematic to build in this case.


Table 1. A comparison of fully asynchronous interconnection networks and/or routers for GALS SoCs. Legend: A2S, S2A – Async. to Sync./Sync. to Async., As. – Asynchronous, BE – Best Effort service, DI – delay insensitive, GS – guaranteed service, Irreg/Reg – Irregular/Regular, N.A. – Information Not Available, OCP – Open Core Protocol, VC – virtual channel.

NoC | Topology | Routing / Flow Control | Network Interface | Asynchronous Style | Links and encoding | Implementation
Chain [4] | Framework / point-to-point (Irreg/Reg) | Source / EOP | Ad hoc | QDI / pipelined | Point-to-point 1-of-4 DI / 8-bit flits | 180nm, 1 Gbits/s per link, ASIC
QoS [5] | 2D Mesh, 4 VCs (3 GS / 1 BE) | XY / wormhole / credit-based | N.A. | QDI | 1-of-4 DI / 8-bit flits | Simulation only
Nexus [6] | Single 16x16 Crossbar | Source / BOP-EOP | A2S, S2A 1-clock converters | QDI | 1-of-4 DI / 36-bit phits | 130nm, 780 Gbits/s, ASIC
MANGO [7] | 2D Mesh (Irreg/Reg), 4 GS/1 BE VCs | Source | A2S, S2A, OCP | 4-phase bundled-data | Dual-rail, 2-ph. DI / 33-bit flits | 130nm, 650 Mflits/s, ASIC
As. QNoC [8] | 2D Mesh (Irreg/Reg), 8 VCs | Source / wormhole / credit-based with preemption | N.A. | 4-phase bundled-data | 10-bit flits | 180nm, 200 Mflits/s, ASIC
Quartana et al. [9] | Crossbar or Octagon | N.A. | Self-timed FIFOs | QDI | N.A. | FPGA, 56 Mflits/s
ANoC [10] | 2D Mesh (Irreg/Reg) / 2 VCs | Source / odd-even / wormhole | A2S, S2A FIFOs | QDI | 34-bit flits | 65nm, 550 Mflits/s, ASIC
As. Proteo [11] | Bidirectional Ring | Oblivious | OCP | QDI / 4-phase dual-rail | 32-bit flits | FPGA, 202 Kbits/s
RasP [12] | Framework / point-to-point (Irreg/Reg) | Source / bit serial | Ad hoc | Dual-rail | Point-to-point pipelined serial links | 180nm, 700 Mbits/s, Simulation
ASPIN [13] | 2D Mesh (Reg) | Distributed XY / wormhole / EOP | A2S, S2A FIFOs | Bundled-data | Dual-rail, 4-ph., 34-bit flits | 90nm, 714 Mflits/s
Hermes-A | 2D Mesh (Reg) | Distributed XY / wormhole / BOP-EOP | Dual-Rail SCAFFI [14] | Dual-rail / bundled data | Dual-Rail | 180nm, 727 Mbits/s, ASIC

Four of the NoCs (QoS, MANGO, asynchronous QNoC, ANoC) claim support for quality of service through the use of virtual channels and/or special circuits (GS routers). ANoC is the most developed of the proposals and presents the best overall performance. It has been successfully used to build at least two complete integrated circuits [15]. However, most of the characterization of ANoC (and of other asynchronous NoCs) derives from a detailed knowledge of the application in sight. If the application has unpredictable dynamic behavior, it is fundamental to employ a more flexible approach to topology choice and routing, and to incorporate the capacity to take decisions based on dynamic information from the network. These are some of the reasons behind the proposal of Hermes-A, described in the next sections.


3 The Hermes-A Router Architecture

Unlike most other asynchronous routers, Hermes-A employs a distributed routing scheme, where the router itself decides which path incoming packets will follow. This enables the use of adaptive routing algorithms and, more importantly, the router may employ these algorithms to solve network congestion problems in real time. Another characteristic of Hermes-A is that it uses independent arbitration at each router port. The reason for this design choice is to allow dynamic voltage level schemes to be used to assign distinct voltage levels to distinct paths along a NoC. Such a fine-grained voltage level resolution can be quite useful to fulfill the important power-performance constraints so frequent in SoCs. Distributed routing and scheduling are characteristics shared by Hermes-A and ASPIN. The differences between these NoCs are the lumped router design of Hermes-A, which facilitates the use of the router in topologies other than 2D meshes, and the concern for designing the router to support multiple voltage levels and adaptive routing algorithms.

A traditional 2D mesh topology NoC with wormhole packet switching is the test environment used to validate the Hermes-A router. Each router in the experimented setup comprises up to five ports: East, West, North, South and Local. As usual in direct NoCs, the Local port is responsible for the communication between the NoC and its local PE. All experiments described herein assume the use of 8-bit flits. The packet format is extremely simple: the first flit contains the XY address of the destination router and the subsequent flits contain the packet payload. Two sideband signals control the transfer of packets and support arbitrary-size packets: begin of packet (BOP), activated with the first flit of a packet, and end of packet (EOP), activated with the last flit. All intermediate flits display BOP=EOP=0.
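As a minimal illustration of this packet format, the Python sketch below builds a packet as a list of (data, BOP, EOP) tuples. The helper name is illustrative, and placing the destination X address in the four least significant bits of the header flit follows the Path Calculation description later in the paper; it is a sketch, not the authors' implementation.

```python
from typing import List, Tuple

Flit = Tuple[int, int, int]  # (8-bit data, BOP, EOP)

def build_packet(dest_x: int, dest_y: int, payload: List[int]) -> List[Flit]:
    """Hypothetical helper: header flit carries the destination XY address
    (X assumed in the 4 LSBs, Y in the 4 MSBs); BOP=1 on the first flit,
    EOP=1 on the last flit, BOP=EOP=0 on all intermediate flits.
    Assumes a non-empty payload."""
    header: Flit = (((dest_y & 0xF) << 4) | (dest_x & 0xF), 1, 0)
    body: List[Flit] = [(b & 0xFF, 0, 0) for b in payload[:-1]]
    tail: Flit = (payload[-1] & 0xFF, 0, 1)
    return [header] + body + [tail]

print(build_packet(2, 3, [0xAA, 0xBB, 0xCC]))
```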

Most of the router architecture employs a delay-insensitive, 4-phase, dual-rail encoding. Each input port interface consists of 21 wires: 16 wires carry the 8-bit dual-rail flit value (DR-Data), four wires carry the dual-rail BOP and EOP information, and the last one is the single-rail acknowledge signal. The router detects data availability when every pair of wires that defines each bit value in the DR-Data signal is distinct from “00”. Thus, the all-zeros value in DR-Data is the spacer of the DI code.
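A small behavioural sketch of this encoding and of the data-availability check, assuming the usual dual-rail convention (bit=1 as the pair (1,0), bit=0 as (0,1), spacer (0,0) on every pair); the function names are illustrative.

```python
def dual_rail_encode(value: int, width: int = 8):
    """Encode `width` bits as dual-rail (true, false) wire pairs."""
    bits = [(value >> i) & 1 for i in range(width)]
    return [(b, 1 - b) for b in bits]

def data_available(pairs) -> bool:
    """Data is available when every dual-rail pair is distinct from (0, 0)."""
    return all(pair != (0, 0) for pair in pairs)

SPACER = [(0, 0)] * 8  # the all-zeros spacer of the DI code
assert data_available(dual_rail_encode(0xA5)) and not data_available(SPACER)
```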

A. Input Port

Figure 1 depicts the Hermes-A input port structure as a simplified asynchronous data-flow diagram [16]. There are three alternative paths in this module, one used for the first flit (1), one for intermediate flits (2) and one for the last flit (3). In Figure 1 two wires represent each bit. Thus, a 10-bit path is in fact a 20-wire bus.

When BOP is signaled at the input port, the first demux selects the path that feeds the module responsible for computing the path to use. This module receives the ten information bits that are forwarded (8 data bits plus EOP and BOP), plus four destination bits using dual-rail one-hot encoding. Note that just the bit associated with the selected path is enabled in this 4-bit code. Since the routing decision must be kept for all flits in a packet, a loop was added to register the decision. The loop appears in Figure 1 as a chain of three asynchronous registers (4), in order to enable the data flow inside the 4-phase dual-rail loop. Every two successive asynchronous stages communicate using an individual handshake operation [16]. Thus, in this kind of circuit it is not possible for three successive stages to hold two data tokens simultaneously. Exactly three stages are the minimum necessary to propagate information circularly. Fewer than three



stages lead to a deadlock situation. This can be better understood by remembering that between every two valid data tokens there is always a spacer, and that before a spacer propagates, the first data token must be copied to the next stage.

Fig. 1. Hermes-A router input port architecture. All paths employ dual-rail encoding.

After computing the output port to which the incoming flit must be sent, the rightmost module in Figure 1 (Output demux) forwards the flit, based on the 4-bit routing information.

Subsequent flits in a packet go through the lower output of the leftmost demux and are input to a second demux after the fork element. This demux looks for the EOP bit before choosing the right direction for each flit. If there is no EOP indication, the flit follows path (2) to the first merge component. Otherwise, the S-Control module is used. The next subsections cover the behavior of the Path Calculation and S-Control modules.

a) Path Calculation

The basic route computation architecture is depicted in Figure 2. In direct 2D topologies like the 2D mesh or 2D torus, each router is identified by two values, its X and Y coordinates. The first flit of a packet carries the destination X address in the four least significant bits and the destination Y address in the four most significant bits. When a flit is accompanied by an active BOP signal it feeds the Path Calculation module. This

Fig. 2. Hermes-A Path Calculation circuit



flit arrives at the input of a completion detector (CD). Detection of a valid dual-rail data token causes the propagation of the destination X and Y coordinates to two subtraction circuits. The outputs of these circuits determine the path the packet must follow.

If both subtractions result in 0, the packet has reached the target router and it proceeds to the Local port. For the XY routing algorithm, if the X axis subtraction is different from zero, the packet will follow either to East or to West, depending only on the sign of the result (positive and negative, respectively). If the X subtraction result is 0 but the Y subtraction is not, the packet follows to North or South, depending again only on the sign of the result (positive and negative, respectively). The Routing Logic module is purely combinational logic that produces the resulting one-hot dual-rail 4-bit packet destination code. It indicates the output port to use.
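The decision just described can be summarised by a short sketch. It is only a behavioural model of the XY rule (the hardware uses dual-rail subtractors and a one-hot code), and the port names are illustrative.

```python
def xy_route(local_x: int, local_y: int, dest_x: int, dest_y: int) -> str:
    """Behavioural model of the XY path calculation."""
    dx = dest_x - local_x   # X-axis subtraction
    dy = dest_y - local_y   # Y-axis subtraction
    if dx == 0 and dy == 0:
        return "Local"                       # packet reached the target router
    if dx != 0:
        return "East" if dx > 0 else "West"  # resolve the X axis first
    return "North" if dy > 0 else "South"    # then the Y axis

assert xy_route(1, 1, 3, 1) == "East" and xy_route(2, 2, 2, 0) == "South"
```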

b) S-Control

When the last flit of a packet is received (EOP=1), it is directed to the S-Control module (see Figure 1). The S-Control protocol description appears in Figure 3.

Fig. 3. State machine for the S-control module

The function of this module is to send the last flit through the output marked A in Figure 1, and then send a kill token on the output marked B to indicate the end of a packet transmission. This has the effect of de-allocating the output currently reserved for this packet. To avoid defining a new dual-rail signal, the unused code BOP=EOP=1 is employed internally in the router to signal this situation. Two circuits interpret this code: the allocated output port and the one that controls the chain (4) of asynchronous registers (not explicit in Figure 1). The latter, upon receiving the code, empties the chain using spacers. Remembering that asynchronous circuits rely on explicit local handshakes between every pair of communicating modules, the S-Control only generates an acknowledge signal to the previous demux after receiving the acknowledge signals for both the A and B outputs. Completion detectors produce all request signals. The Petrify tool was used to synthesize the equations that implement a speed-independent controller operating as the state machine in Figure 3.



B. Output Port

In the Hermes-A router each output port receives four data flows. For instance, Figure 4 shows the Local output port structure, which receives data from input ports North, South, East and West.

Fig. 4. Local output port structure. Dashed lines represent actual wires. Solid lines represent dual-rail encoded lines.

Fig. 5. Output control structure. All paths employ dual-rail encoding.

An arbiter circuit controls the behavior of each output port. This arbiter achieves fairness with a structure of six 2-input, 2-output arbiters connected in a shuffle-exchange topology. Each atomic arbiter decides which of its two input requests to serve, using a first-come-first-served strategy. This allows the processing of up to four simultaneous input port requests. The bit used to produce the request to



the output port is produced by the logic that computes routing on the input port. Since this bit is a dual-rail representation, conversion to single-rail is necessary, because the arbiters are the only single-rail modules in the output port. A 2-input C-element with one negated input executes the conversion. Figure 5 details the structure of each output control circuit of an output port. This module receives data directly from some input port. Its role is to generate requests for the output port arbiter or to undo the internal connection between input and output ports after transmitting the last packet flit and receiving the kill token.

4 Network Interface

The synchronization mechanism is one of the crucial components of a GALS system. Traditional synchronizers like series-connected flip-flops do not guarantee elimination of metastability, and since synchronization latency is usually large in such synchronizers, these components often impose low throughput on the communication architecture. To overcome these limitations, this work employs clock stretching techniques, which do eliminate the risk of metastability. Also, this kind of synchronization can support higher throughput than traditional synchronizers.

The synchronization mechanism adopted here is based on the SCAFFI [14] asynchronous interface. SCAFFI is an asynchronous interface based on clock stretching that supports dual-rail communication schemes. The network interface between Local ports and PEs appears in Figure 6. More details on this interface are available in reference [14].

Fig. 6. SCAFFI network interface between a Hermes-A router and a synchronous PE. The interface employs clock stretching techniques to avoid metastability. The stretcher circuits are not represented in the picture.

5 ASIC Implementation

Since traditional design kits do not usually contain asynchronous components, the Hermes-A ASIC implementation started with the development of an asynchronous



digital cell library. The library includes several versions of C-elements, metastability filters and control circuits, like sequencers. The first version of the asynchronous library uses the XFab 180nm design rules and includes Liberty timing files (.lib), abstract views (.lef) and Verilog models using UDP primitives to enable timing-annotated simulations.

The asynchronous library is the basis for developing a set of data flow elements (fork, join, merge, mux, demux, half-buffer registers, validity detectors, etc.).

During the asynchronous router synthesis it is important to guarantee that the (synchronous) synthesis tool does not change the asynchronous components. For instance, in the Cadence RTL Compiler synthesis tool it is possible to ensure that this will not happen by using the PRESERVE property, which can be assigned to each module instance. This property instructs the tool not to touch the cell instance characteristics.

The results presented in Table 2 refer to the XFab 180nm ASIC implementation of the Hermes-A router. The operating conditions are 25°C and 1.8 V, and the library build employs typical transistor models. Power results were obtained with all router input and output ports operating at their highest rate of 727 Mbits/s on each router link. The throughput presented in Table 2 is for single-link operation. The router can sustain, in the best possible case, operation at this performance level on all of its five ports, totaling approximately 3.6 Gbits/s of maximum throughput for the whole router.

Table 2. ASIC Implementation results for a 180nm XFab technology

Throughput (Mbits/s) | Area (mm²), Cell – Total | Total Power (mW)
727 | 0.21 – 0.33 | 11.14

6 Conclusions and Future Work

The Hermes-A router demonstrates that asynchronous circuits are useful as a communication architecture for high-performance, complex GALS SoCs. Ongoing work proceeds in several directions, including: (1) providing support for adaptive routing algorithms in Hermes-A; (2) enabling Hermes-A to work with multiple supply voltages and power shutoff features, in order to reduce the power consumption mainly in idle ports; (3) implementing complete NoC topologies and applications for testing router operation, such as 2D meshes and 2D tori. It is important to note that in the case of a 2D torus, the routing module has to be modified, since a pure XY routing algorithm is not deadlock-free for this network topology.

Acknowledgements

The authors would like to acknowledge the support of CNPq through research grants 551473/2010-0, 309255/2008-2, and 301599/2009-2. They would also like to acknowledge the National Science and Technology Institute on Embedded Critical Systems (INCT-SEC) for its support of this research.



References

[1] Ho, R., Mai, K., Horowitz, M.: The future of wires. Proceedings of the IEEE 89(4), 490–504 (2001)

[2] Chapiro, D.: Globally-Asynchronous Locally Synchronous Systems. PhD th., Stanford University, 134 p. (October 1984)

[3] Marculescu, R., Ogras, U., Peh, L.-S., Jerger, N., Hoskote, Y.: Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 28(1), 3–21 (2009)

[4] Bainbridge, J., Furber, S.: Chain: A Delay-Insensitive Chip Area Interconnect. IEEE Micro 22(5), 16–23 (2002)

[5] Felicijan, T., Furber, S.: An Asynchronous On-Chip Router with Quality-of-Service (QoS) Support. In: 17th IEEE Int. SoC Conf. (SOCC 2004), pp. 274–277 (2004)

[6] Lines, A.: Asynchronous Interconnect for Synchronous SoC Design. IEEE Micro 24(1), 32–41 (2004)

[7] Bjerregaard, T., Stensgaard, M., Sparsø, J.: A Scalable, Timing-Safe, Network-on-Chip Architecture with an Integrated Clock Distribution Method. In: Design, Automation, and Test Europe (DATE 2007), pp. 1–6 (April 2007)

[8] Dobkin, R., Ginosar, R., Kolodny, A.: QNoC asynchronous router. Integration, the VLSI Journal 42(2), 103–115 (2009)

[9] Quartana, J., Renane, S., Baixas, A., Fesquet, L., Renaudin, M.: GALS systems prototyping using multiclock FPGAs and asynchronous network-on-chips. In: Int. Conf. on Field Programmable Logic and Applications (FPL 2005), pp. 299–304 (2005)

[10] Beigné, E., Clermidy, F., Vivet, P., Clouard, A., Renaudin, M.: An Asynchronous NoC Architecture Providing Low Latency Service and its Multi-level Design Framework. In: IEEE Int. Symp. on Asynchronous Circuits and Systems (ASYNC 2005), pp. 54–63 (2005)

[11] Wang, X., Ahonen, T., Nurmi, J.: Prototyping a Globally Asynchronous Locally Synchronous Network-On-Chip on a Conventional FPGA Device Using Synchronous Design Tools. In: Int. Conf. on Field Programmable Logic and Applications (FPL 2006), pp. 657–662 (2006)

[12] Hollis, S., Moore, S.: RasP: An Area-efficient, On-chip Network. In: Int. Conf. on Computer Design (ICCD 2006), pp. 63–69 (2006)

[13] Sheibanyrad, A., Greiner, A., Miro-Panades, I.: Multisynchronous and Fully Asynchronous NoCs for GALS Architectures. IEEE Design and Test of Computers 25(6), 572–580 (2008)

[14] Pontes, J., Soares, R., Carvalho, E., Moraes, F., Calazans, N.: SCAFFI: An intrachip FPGA asynchronous interface based on hard macros. In: Int. Conf. on Computer Design (ICCD 2007), pp. 541–546 (2007)

[15] Thonnart, Y., Vivet, P., Clermidy, F.: A Fully Asynchronous Low-Power Framework for GALS NoC Integration. In: Design, Automation, and Test Europe (DATE 2010), pp. 33–38 (2010)

[16] Sparsø, J., Furber, S.: Principles of Asynchronous Circuit Design – A Systems Perspective. 354 p. Kluwer Academic Publishers, Boston (2001)


Practical and Theoretical Considerations on Low-Power Probability-Codes for Networks-on-Chip

Alberto Garcia-Ortiz1 and Leandro S. Indrusiak2

1 Institute for Theoretical Electrical Eng. and Microelectronics (ITEM), University of Bremen, Otto-Hahn-Allee 1, NW1, 28359 Bremen, Germany
[email protected]
2 Dept. of Computer Science - Real-Time Systems Group (RTS), University of York, YO10 5DD York, UK
[email protected]

Abstract. Low-power coding represents an important technique to reduce consumption in modern interconnect architectures. In the case of Networks-on-Chip, and especially if they include virtual channels, the coding techniques need to be effective (large reduction of transition activity) and extremely efficient (reduced hardware resources). This work proposes a coding template called PM with those characteristics. Moreover, it shows with a detailed theoretical analysis and a number of experiments the good characteristics of the approach. Some relevant theoretical results on Exact Probability Coding are also developed in the paper.

1 Introduction

The increasing miniaturisation capabilities of nanometric technologies allow the integration of hundreds of processing units in a single chip. However, such systems demand an optimised communication architecture. Networks-on-Chip are emerging as a promising approach to address that problem [2]. Stringent constraints such as power, performance and latency must be observed, and requirements such as reliability, fault tolerance, correctness (data ordering) and completion (no data loss) must be met.

The power consumption of NoC interconnects is not negligible. The internal structure of a NoC router can be quite complex, with arbitration, routing and switching logic, as well as temporary storage. The wires between routers also contribute significantly to the dynamic power consumption [6]. One alternative to reduce the dynamic power consumption of Networks-on-Chip is the application of coding techniques that minimise the signal transition activity [7,4,5]. Crosstalk Avoidance Codes and Error Correction Codes have also been proposed [3] to allow a reduction in the transmitted voltage swings (and thus, the power) without sacrificing reliability.

For the relevant case of NoCs with virtual channels, standard low-power coding approaches [1,8,7] are not applicable. The packet multiplexing which occurs on the virtual channels destroys the low-transition characteristics introduced by the encoding. Novel approaches such as PMD [4] are required in this case.

A major challenge is to find coding architectures where the overhead of the coder/decoder does not eliminate the power savings in the interconnects achieved by the




coding procedure. This work aims at analysing the suitability of coding strategies simpler than PMD for NoC networks with virtual channels.

First, Sec. 2 investigates the possibility of removing the Correlator and Decorrelator from the switch. The resulting template (called PM Code) provides an interesting trade-off between coding complexity and activity reduction. Since its low-power coding efficiency is slightly smaller than that of PMD, we investigate in Sec. 3 the theoretical limits of Probability-Coding. The main focus is to understand, from a solid foundation, how the probabilistic characteristics of the signal to be coded translate into the efficiency (activity reduction ratio) which can be obtained. Finally, we validate the results of the work experimentally. The data are reported in Section 4.

2 Probability-Multiplex Coding

Since the Probability-Multiplex (PM) coding template is based on the Probability-Multiplex-Decorrelator (PMD) strategy, let us first describe PMD. The interested reader is referred to [4] for a complete description.

The goal of a standard low-power Transition-Code is to minimise the number of transitions in the wires (or the number of transitions in opposite directions for neighbour wires if coupling is considered). The goal of Probability-Coding is to minimise the number of ones at the output of the coder. A Transition-Code can be created by adding an XOR-Decorrelator to a Probability-Code [8]. As shown in [4], low-power transmission in NoCs with virtual channels cannot be achieved using a Transition-Code, but it can be obtained by using a Probability-Code and a distribution of XOR-Decorrelators and XOR-Correlators over the NoC links.

PMD is composed of three consecutive steps: first, a Probability Coder which minimises the number of ones; second, the time multiplexing intrinsic to the virtual channels of the NoC; and third, the XOR-Decorrelator which maps ones to transitions. The decoding applies an XOR-Correlator, a demultiplexing (intrinsic to the virtual channels) and a Probability Decoder. The Probability Coder and Decoder are located in the Network Interface of the NoC fabric, while the XOR-Correlator and XOR-Decorrelator are distributed over the Links.
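A minimal behavioural sketch of the per-wire XOR-Decorrelator (which maps ones to transitions) and its inverse XOR-Correlator may help fix ideas; the function names are illustrative and the bitstream shown is arbitrary.

```python
def xor_decorrelator(bits):
    """y[n] = x[n] XOR y[n-1]: every 1 at the input becomes a transition on the wire."""
    out, prev = [], 0
    for x in bits:
        prev ^= x
        out.append(prev)
    return out

def xor_correlator(bits):
    """Inverse operation: x[n] = y[n] XOR y[n-1]."""
    out, prev = [], 0
    for y in bits:
        out.append(y ^ prev)
        prev = y
    return out

stream = [1, 0, 1, 1, 0]
assert xor_correlator(xor_decorrelator(stream)) == stream
```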

Although different architectures can be used for the Probability Coder, the code “Corr-K0”, consisting of an XOR-Correlator followed by XORing the bus with the MSB, has been shown to provide a good compromise between hardware complexity and power reduction. The box P.Coder of Fig. 1 illustrates the circuit.

In this work we analyse the possibility of reducing the hardware complexity of PMD even further by removing the XOR-Correlators and XOR-Decorrelators located on the Links. Fig. 1 shows the proposed coding template, called Probability-Multiplex (PM).

The main advantage of PM with respect to PMD is that it does not require any modifications to the NoC Switch itself, but only to the Network Interface with the Processing Element. Thus, the critical timing path between the NoC Switches is not modified. The power and overhead of the procedure are also reduced.




Fig. 1. PM coding template with an example of a P.Coder (Corr-K0)

2.1 Dynamic Power Considerations

In order to analyse exactly the switching activity in the Links, we can consider two (temporally) uncorrelated signals X_a and X_b, which are time-multiplexed to generate a resulting X_m. This model describes the transmission of data over virtual channels in a NoC. Let us denote by p_{ai} the probability of being 1 for the i-th bit of signal X_a, and by p_{bi} the same for signal X_b. The probability of having a bit transition of the form S_{ai}=0 → S_{bi}=1 is Prob[S_{ai}=0, S_{bi}=1] = (1-p_{ai}) p_{bi}, where we have used the fact that the signals X_a and X_b are statistically independent. Adding the opposite transition:

t_{mi} = (1-p_{ai})\,p_{bi} + p_{ai}\,(1-p_{bi}) \qquad (1)

which is independent of the transition activity of X_a and X_b and depends only on the bit probabilities. If we assume that the bit probabilities of X_a and X_b are both equal to p_i, we obtain:

t_{mi} = 2\,p_i\,(1-p_i) \qquad (2)

Let us note that the activity for a PMD code is simply t_{mi} = p_i, while a “classical” low-power code in the context of virtual channels has t_{mi} = 1/2. Thus, PM is less efficient than PMD by a factor 2(1-p_i), but achieves a switching reduction of p_i(1-p_i). Fig. 2 shows the activity reduction factor of PM and PMD as a function of the entropy of each single wire. We observe that the reduction in coding efficiency of PM with respect to PMD can range from 0 to approximately 23%. Experimental results (see Section 4) confirm that a typical value of 10%-15% should be expected.

2.2 Static Power Considerations

Leakage is a major concern for current technologies. In this subsection we analyse the implications in terms of leakage of using PM instead of PMD.



Fig. 2. Coding activity reduction for PMD and PM as a function of the signal entropy (activity reduction [%] versus entropy [bit]; curves for the uncoded case, the PMD code, the PM code, and the penalty of PM)

Since the absence of the XOR-Correlator and XOR-Decorrelators does not change the signal characteristics inside the buffers of the NoC Switch, PM maintains the same savings in terms of static power inside the switch reported for PMD. The reductions are 21% in the average case and 32% for multimedia signals.

3 Analysis of Probability-Coding in NoCs with VC

Since PMD and PM use a Probability Coder, it is useful to analyse the exact (i.e., optimal) Probability-Coding.

In the context of Transition-Coding, Exact Transition-Coding has been proposed in [1]. The core of the technique is actually an Exact Probability-Coder, referred to as E. It provides the best possible Probability-Code for a coding scheme which employs only the current and previous value of the signal during the codification process. Although Exact Transition-Coding (and Exact Probability-Coding) are completely impractical for a real implementation (see [1]), they establish a theoretical limit on what is achievable by low-power coding.

Let us consider a B-bit Boolean stationary random variable X with a known Joint Probability Distribution P_{XY}(x,y) = P(X[n] = x, X[n-1] = y). The Exact Probability-Code can be viewed as a Boolean function E(x,y): B^B × B^B → B^B which is decodable and minimises the expected number of ones at the coder output (for the given JPD). The authors of [1] provide an algorithm to obtain the specification of such a coder E. The algorithm requires sorting 4^B probability values, and then visiting that list while keeping a table with some “forbidden” values.



Another point of view for the problem is to consider that the coding procedure is composed of two consecutive steps. The first is a coding function E_p which minimises the digital numeric value of the output rather than the number of ones. The second step is the value-based mapping (vbm) described in [8]. The vbm is a Boolean function vbm(x): B^B → B^B which maps the inputs with smaller digital value to the output codes with the smaller number of ones. The structure is shown in Fig. 3. It is straightforward to see that both approaches are equivalent.


Fig. 3. Exact-Coding from a probabilistic point of view

We can observe that the optimal coding values corresponding to the k-th row of E_p are the indexes used for sorting E_p(x,k) in decreasing order. Thus, E_p can be found just by 2^B sorts of sets of 2^B values (which is much simpler than the approach presented in [1]). The sort has to be done on a row-by-row basis to guarantee the decodability of the resulting low-power code. Fig. 3 shows graphically how E_p sorts the PDF of X.

Once E_p is performed, we can apply the vbm coder. After calculating the one-dimensional probability of each value, we can obtain the expected (average) number of ones, and thus the activity in the Links of the NoC after the XOR-Decorrelator. Using the Hamming weight function (number of ones), we can write E_Link = P_ones = \sum_i \mathrm{Prob}[Z = i]\,\mathrm{HammingWeight}(i). However, it turns out to be easier to define an equivalent cost function before the application of the vbm coder:

\mathrm{vbmCost}(k) = \mathrm{HammingWeight}[\mathrm{vbm}(k)] \qquad (3)



vbmCost(k) is a monotonous function composed of B+1 steps or cost-regions with values from 0 to B. The width of the k-th cost-region is \binom{B}{k}, corresponding to the \binom{B}{k} words of B bits with exactly k ones. Using vbmCost(k) we can write:

E_{Link}(X) = P_{ones} = \sum_k \mathrm{Prob}[W = k]\,\mathrm{vbmCost}(k) \qquad (4)

where the key point is that now the random variable W is used instead of Z. The advantage is that the distribution of W is easier to obtain than that of Z.
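The step-shaped cost function and Eq. (4) can be sketched directly in Python; the helper names are illustrative, and prob_w stands for an estimate of the distribution of W.

```python
from math import comb

def vbm_cost(k: int, B: int) -> int:
    """Number of ones of the k-th word when all 2^B words are listed by increasing
    Hamming weight (the value-based mapping): a step function with B+1 cost-regions,
    the j-th region being comb(B, j) words wide."""
    total = 0
    for j in range(B + 1):
        total += comb(B, j)
        if k < total:
            return j
    raise ValueError("k must be smaller than 2**B")

def expected_link_activity(prob_w, B: int) -> float:
    """Eq. (4): E_Link = sum_k Prob[W = k] * vbmCost(k)."""
    return sum(p * vbm_cost(k, B) for k, p in enumerate(prob_w))
```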

Because of their relevance for DSP and multimedia applications, we focus on normally distributed signals. Let us consider a B-bit Gaussian stationary signal with standard deviation σ and temporal correlation ρ. For the sake of simplicity, we assume a continuous signal instead of a 2^B-level discrete one. The PDF of the signal is:

f_{XY}(x,y) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left[-\frac{x^2 + y^2 - 2\rho x y}{2\sigma^2(1-\rho^2)}\right]

The expression for one “slice” of the PDF at the value Y = k is:

f_{XY}(x,k) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left[-\frac{(x-\rho k)^2}{2\sigma^2(1-\rho^2)}\right] \exp\left[-\frac{k^2}{2\sigma^2}\right]

We observe that, with respect to x, the shape of f_{XY}(x,k) is similar to a Gaussian bell with centre μ = ρk and standard deviation σ_k = σ\sqrt{1-\rho^2}.

The next step is to sort the “slice”. We can think of this step as first moving the shape of the slice to zero (which removes the mean), and then mirroring the negative side onto the positive side. Then, for x ≥ 0,

\mathrm{sort}(f_{XY}(x,k)) = \frac{2}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left[-\frac{x^2}{2\sigma^2(1-\rho^2)}\right] \exp\left[-\frac{k^2}{2\sigma^2}\right]

Finally, we have to add all the “slices”. Since \frac{1}{\sigma\sqrt{2\pi}}\int \exp\left[-\frac{x^2}{2\sigma^2}\right] dx = 1, we conclude that:

p_W(w) = \begin{cases} \frac{2}{\sqrt{2\pi}\,\sigma\sqrt{1-\rho^2}} \exp\left[\frac{-w^2}{2\sigma^2(1-\rho^2)}\right] & \text{if } w \ge 0 \\ 0 & \text{if } w < 0 \end{cases} \qquad (5)

Thus, p_W(w) is twice the positive side of a Gaussian PDF with zero mean and standard deviation σ_p = σ\sqrt{1-\rho^2}. In summary, using Eq. (4):

E_{Link}(X) = \sum_w \frac{2}{\sqrt{2\pi}\,\sigma\sqrt{1-\rho^2}} \exp\left[\frac{-w^2}{2\sigma^2(1-\rho^2)}\right] \mathrm{vbmCost}_B(w) \qquad (6)

A key point is that Gaussian random signals with different standard deviation and correlation, but equal σ_p, will have the same power cost after Exact-Coding. The remarkable fact is that the entropy of a Markov Gaussian random variable is given by:

H_G(\sigma,\rho) = \frac{1}{2}\log_2(2\pi e) + \log_2\left(\sigma\sqrt{1-\rho^2}\right) \qquad (7)



which is a function of the same parameter σ_p = σ\sqrt{1-\rho^2}. Thus, σ\sqrt{1-\rho^2} can be obtained as a function of H_G(σ,ρ). Moreover, since vbmCost_B(w) is a function of B, it is straightforward to define a function φ_G such that:

E_{Link}(X) = \varphi_G(H_G, B) \qquad (8)

We have proved the following theorem, which characterises the maximum dynamic power reduction that can be obtained in the presence of temporal correlation.

Theorem 1. The efficiency of the low-power Exact-Code for temporally correlated Markov Gaussian signals depends only on the bit-width and the entropy of the signal.

It is worth noting that this exact dependency on the entropy does not hold for other coding strategies such as Gray-Code, Bus-Invert, etc. However, it does hold for an ideal infinite code, as shown in [9].
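The following sketch evaluates Eq. (6) numerically (approximating the sum over w by sampling the continuous density at integer values) and illustrates Theorem 1: two Gaussian signals with different (σ, ρ) but equal σ√(1−ρ²) yield the same expected link activity. The numerical values are arbitrary examples, not results from the paper.

```python
import math

def vbm_cost(k: int, B: int) -> int:
    # Step cost function of Eqs. (3)-(4): Hamming weight after value-based mapping.
    total = 0
    for ones in range(B + 1):
        total += math.comb(B, ones)
        if k < total:
            return ones
    return B

def e_link_gaussian(sigma: float, rho: float, B: int) -> float:
    # Numerical evaluation of Eq. (6) for a B-bit Markov Gaussian signal.
    sigma_p = sigma * math.sqrt(1.0 - rho ** 2)
    norm = 2.0 / (math.sqrt(2.0 * math.pi) * sigma_p)
    return sum(norm * math.exp(-(w * w) / (2.0 * sigma_p ** 2)) * vbm_cost(w, B)
               for w in range(2 ** B))

# Same sigma * sqrt(1 - rho^2), hence (per Theorem 1) the same expected cost:
a = e_link_gaussian(50.0, 0.88, 8)
b = e_link_gaussian(50.0 * math.sqrt(1 - 0.88 ** 2), 0.0, 8)
print(a, b)  # the two values coincide
```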

4 Experimental Results

In order to compare PM with PMD experimentally, we have used the same simulation environment as [4]. It employs a simplified behavioural model of the NoC and emulates a 4x4 mesh topology. However, only three Processing Elements are actually active during the simulation. The flit bit-width is 8 bits. Four flits are used for the header, and 128 for the payload. The main focus of this paper is the coding strategy for NoCs. To isolate the issues related to NoC traffic and congestion from the coding itself, we employ a rather idealised network. It uses two Processing Elements working as idealised data producers, and an idealised data receiver. The internal buffers of the switch are assumed to be unlimited. Moreover, all Processing Elements are able to produce/consume a flit per clock cycle. For the analysis of the coding, we trace the signals in the switch connected to the receiver. The data from the two producers arrive at the switch through the same port, but through different virtual channels. The transmitted data correspond to the following signals:

Raw image: The red component of an 800x130x8b image. It corresponds to the welcome image from the PATMOS'08 web page.
Male voice: A male voice signal. It consists of 5000 samples with σ = 50 and ρ = 0.88.
Music: A short piece of classical music (Bach).
OFDM: FFT input in a HiperLan/2 OFDM receiver, using 64QAM modulation and a type C channel. It consists of 50000 samples, with σ = 42 and ρ = 0.22.
gzip: The gzip executable in ELF 32-bit format.

The experiment has been performed with and without the XOR-Correlator and Decorrelator in the Links, to simulate PMD and PM respectively. Tab. 1 summarises the results. Since PMD and PM are templates which can be used with different codes, we have analysed different alternatives, as shown in Tab. 1. Following the framework of [8], the difference-based coding (dbm), value-based coding (vbm), XOR-Correlator (corr), and XOR-Decorrelator (decor) are combined to produce different coding strategies. The K1 and K0 memoryless coders [4] are also used.



Table 1. Comparison of mean transition activity resulting from using PM and PMD coding templates with real signals in a virtual channel based 4x4 NoC

Code | Raw Image (PMD / PM) | Male voice (PMD / PM) | Music (PMD / PM) | GZIP exe (PMD / PM) | OFDM data (PMD / PM) | Mean (PMD / PM)
K1 | 3.84 / 3.10 | 2.48 / 3.20 | 2.90 / 3.23 | 2.92 / 3.68 | 2.73 / 3.05 | 2.97 / 3.25
K0 | 4.03 / 3.05 | 2.65 / 3.33 | 3.16 / 3.29 | 2.81 / 3.60 | 2.95 / 3.11 | 3.12 / 3.27
vbm | 5.69 / 3.15 | 3.60 / 3.96 | 4.06 / 4.00 | 3.02 / 3.74 | 4.01 / 4.00 | 4.08 / 3.77
corr+none | 1.21 / 1.93 | 2.41 / 3.10 | 2.13 / 2.49 | 2.98 / 3.70 | 3.79 / 3.97 | 2.50 / 3.04
corr+K1 | 1.20 / 1.92 | 1.94 / 2.60 | 1.93 / 2.24 | 3.14 / 3.80 | 2.93 / 3.20 | 2.23 / 2.75
corr+K0 | 1.10 / 1.78 | 1.78 / 2.42 | 1.63 / 1.98 | 3.08 / 3.75 | 2.98 / 3.22 | 2.11 / 2.63
corr+vbm | 0.87 / 1.53 | 1.92 / 2.89 | 1.45 / 2.32 | 2.93 / 3.71 | 3.67 / 3.97 | 2.17 / 2.88
dbm+none | 1.02 / 1.67 | 2.17 / 2.90 | 1.66 / 2.13 | 3.24 / 3.83 | 3.86 / 3.98 | 2.39 / 2.90
dbm+K1 | 1.03 / 1.70 | 1.63 / 2.28 | 1.43 / 1.81 | 3.15 / 3.79 | 2.73 / 3.15 | 2.00 / 2.55
dbm+K0 | 1.15 / 1.84 | 1.81 / 2.41 | 1.72 / 1.87 | 2.96 / 3.68 | 2.96 / 3.22 | 2.12 / 2.60
dbm+vbm | 0.80 / 1.42 | 1.82 / 2.78 | 1.27 / 2.09 | 3.13 / 3.81 | 3.75 / 3.98 | 2.15 / 2.82

It can be observed that the code corr+K0 is the most practical one, not only for PMD but also for PM. The only alternative which improves on corr+K0 is dbm+K1. Since the dbm requires three adders to be implemented, and K1 has a worst-case timing path proportional to the bit-width of the signal, the dbm+K1 code is much more expensive than corr+K0. As shown in Fig. 1, corr+K0 has a complexity of 2B flip-flops and 2B−1 XOR gates, while the worst-case timing path is only 2 XOR gates (around 210 ps in a 180nm technology).

We have compared PMD and PM with the Exact-Code in terms of activity reduction. The values have been obtained using a MATLAB script. The computation of E_p requires an estimation of the JPD, which has been calculated using a two-dimensional histogram. Once the JPD is known, the matrix E_p and the average cost are easily calculated employing the approach described in Section 3. The results are reported in Fig. 4.

For the real signals used in the experiment, the maximum reduction in activity that could be obtained is 70%. PMD provides a reduction of 47%, and thus it is close to the theoretical maximum. The PM code achieves about one half of the maximum possible reduction (34%). We observe that PM behaves quite well for multimedia signals. However, for random data, as in the case of the GZIP executable, the degradation is notable. As depicted in Fig. 2, when the entropy of the signal increases, the degradation of PM with respect to PMD becomes more relevant.

Finally, Tab. 2 compares the complexity of PMD and PM for the NoC Switch used in the current experimental setup (i.e., a NoC Switch with a bit-width of 8 bits and 4 Links for constructing a mesh). The results refer to a 180nm technology. To give better insight into the characteristics of PM and PMD, Tab. 2 reports the results corresponding to the Network Interface, the Links, and the overall NoC Switch.

The overhead of the encoder and decoder in the Network Interface is equal for PM and PMD, since both techniques use the same Probability Coder and Decoder. However, the 4 XOR-Correlators and 4 XOR-Decorrelators used in the Links by the PMD




Fig. 4. Comparison of activity reduction for Exact-Code, PMD, and PM code for real signals

Table 2. Comparison of the complexity of PMD and PM in terms of area and delay

 | Network Interface: Area [eq. gates] | Network Interface: Delay [ps] | Data Link: Area [eq. gates] | Data Link: Delay [ps] | Overall: Area [eq. gates] | Overall: Delay [ps]
PMD | 141 | 210 | 576 | 210 | 717 | 210
PM | 141 | 210 | 0 | 0 | 141 | 0

technique are not required for PM. Thus, the area is reduced approximately by a factor of 5 (from 717 to 141 equivalent gates). Moreover, PM does not incur the 210 ps timing degradation in the Link. Finally, it should be noted that the complexity and delay of Exact Coding are orders of magnitude larger than those of PM or PMD.

5 Conclusions

This work has presented a thorough study of some practical and theoretical aspects related to the incorporation of low-power coding techniques into NoC systems with virtual channels.

From a practical point of view, a major result of this work is a novel alternative for low-power coding called PM. The architecture is based on a Probability-Coder in the Network Interface. Although it can be customised for different coders, the work has focused on a “corr+K0” code, which requires a minimum number of gates while



providing a good switching reduction. The approach provides an average reduction in transitions at the data links of 34%, and 45% for multimedia signals. The technique maintains the same savings as PMD in terms of static power in the switch buffers. It achieves reductions of 22% in the average case and 32% for multimedia signals. Although PM is less effective than PMD (around 13%), the hardware complexity is reduced approximately by a factor of five.

From the theoretical point of view, this paper provides an analysis of Exact-Coding in probabilistic terms. It proves that for Markov Gaussian random variables the entropy is the key parameter determining the achievable reductions in switching activity. The results establish a link with the ideal case of entropic coding.

References

1. Benini, L., Macii, A., Macii, E., Poncino, M., Scarsi, R.: Architectures and synthesis algorithms for power-efficient bus interfaces. IEEE Trans. on CAD 19, 969–980 (2000)

2. de Micheli, G., Benini, L.: Networks on chip: A new paradigm for systems on chip design. In: DATE 2002, Washington, DC, USA, p. 418. IEEE Computer Society, Los Alamitos (2002)

3. Ganguly, A., Pande, P., Belzer, B.: Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NOC Interconnects. IEEE Trans. on VLSI 17(11), 1626–1639 (2009)

4. García Ortiz, A., Indrusiak, L.S., Murgan, T., Glesner, M.: Low-Power Coding for Networks-on-Chip with Virtual Channels. Journal of Low Power Electronics (JOLPE) 1(4), 77–84 (2009)

5. Lee, K., Lee, S., Yoo, H.: Low-Power Network-on-Chip for High-Performance SoC Design. IEEE Trans. on VLSI 14(02), 148–160 (2006)

6. Mullins, R.: Minimising dynamic power consumption in on-chip networks. In: Procs of the Intl. Symp. on System-on-Chip, Tampere, Finland (2006)

7. Palma, J.-C., Indrusiak, L., Moraes, F., García Ortiz, A., Glesner, M., Reis, R.: Adaptive Coding in Networks-on-Chip: Transition Activity Reduction Versus Power Overhead of the Codec Circuitry. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 603–613. Springer, Heidelberg (2006)

8. Ramprasad, S., Shanbhag, N., Hajj, I.: A coding framework for low-power address and data buses. IEEE Trans. on VLSI Systems 7, 212–221 (1999)

9. Sotiriadis, P.P., Tarokh, V., Chandrakasan, A.P.: Energy reduction in VLSI computation modules: an information-theoretic approach. IEEE Transactions on Information Theory 49(4), 790–808 (2003)



Logic Architecture and VDD Selection for Reducing the Impact of Intra-die Random VT Variations on Timing

Bahman Kheradmand-Boroujeni1,2, Christian Piguet1, and Yusuf Leblebici2

1 Integrated and Wireless Systems, Centre Suisse d’Electronique et de Microtechnique (CSEM), Neuchâtel, Switzerland

2 Microelectronic Systems Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

Abstract. We show that in logic circuits working at a supply voltage (VDD) below the nominal value, proper selection of the logic architecture and VDD together can reduce the impact of device-to-device random process variations (PV) on timing. First we show that the σ/μ of transistor current and delay strongly depends on VDD. Then we compare the PV sensitivity of Low-Power Slow (LP-S) and High-Power Fast (HP-F) architectures. The results support the idea that, for a given technology, equal power budget and delay, LP-S circuits working at a higher VDD are about 1.8X less PV sensitive compared to HP-F circuits working at a lower VDD.

Keywords: Low-Voltage, Low-Power, Process Variation, Random Variations, Statistical Variability, Flip-Flop, Digital VLSI.

1 Introduction

The primary motivation for low-voltage operation is to reduce energy per operation [1]. The nominal VDD is around 3×VT, where VT is the threshold voltage. In this work we consider the design of low-power logic systems that have VDD below the nominal value. This includes the subthreshold and moderate inversion regimes.

PV can be categorized into inter-die and intra-die variations. Inter-die variations are modeled by slow and fast process corners (SS, FF, ...). Intra-die variations can be systematic (correlated) or random (uncorrelated). For the short-channel, narrow-width transistors used in logic gates, intra-die random variations account for more than 50% of the total variability in sub-90nm nodes [2, 3] and are expected to have a significantly greater influence in future technology generations [3].

1.1 Intra-die Random Variability

Intra-die device-to-device random variations can be due to Random Dopant Fluctuation (RDF) in the channel and in the source/drain regions near the channel edge, channel length variations (line edge roughness), oxide thickness variations, poly gate granularity [4], Boron clustering, and stress variations. These result in device VT, COX, W, L, and mobility variations. For low-voltage operation the VT variation is the most pronounced, since the drain “on” current depends on (VDD−VT) more strongly. In the subthreshold



region this dependency is exponential, while in strong inversion it reduces to an α-power law.

Table 1 shows the measured random variability in several technology nodes. With scaling, VDD decreases while σVT generally increases, which results in higher performance variation. Here all of the transistors have a polysilicon gate and a doped channel, except the ultra-thin body FD-SOI (L=25 nm) device, which uses a new high-k metal-gate technology and has an undoped channel. While RDF in the channel is known to be the major contributor to device mismatch [5, 6], the σVT=25mV of this undoped device clearly shows the importance of the other variability sources as well.

Table 1. Intra-die random variability in small bulk NMOS transistors

Technology (L Drawn) | Data | VDD | W | TOXE | Mean VT | Sigma VT
340 nm | Foundry | 3.3 V | 360 nm | 7.2 nm | 439 mV | 18 mV
240 nm | Foundry | 2.5 V | 360 nm | 6.0 nm | 397 mV | 21 mV
180 nm | Foundry | 1.8 V | 240 nm | 3.90 nm | 366 mV | 18 mV
90 nm | Foundry | 1.2 V | 160 nm | 2.95 nm | 409 mV | 31 mV
80 nm | Foundry | 1.2 V | 120 nm | 2.25 nm | 300 mV | 27 mV
60 nm | Measurement [5] | 1.2 V | 140 nm | 2 nm | -- | 29 mV
45 nm | Measurement [6] | 1.1 V | -- | -- | -- | 45 mV
25 nm (UTB-SOI) | Measurement [2] | 1.0 V | 60 nm | 1.65 nm | 480 mV | 25 mV
35 nm | Simulation [3] | 0.85 V | -- | 0.88 nm | 226 mV | 30 mV
13 nm | Simulation [3] | 0.85 V | -- | 0.44 nm | 226 mV | 82 mV

1.2 Conventional PV Compensation Techniques

Chip-to-chip variations can, to some extent, be compensated by using circuit techniques like Adaptive Body Biasing (ABB) and Adaptive Supply Voltage (ASV). In [7] we have proposed a novel technique for compensating inter-die and regional variations in FPGA fabrics which does not use the body effect, is scalable, controls subthreshold and gate leakage together, and can be applied to all kinds of planar and emerging multi-gate devices. Unfortunately, none of these techniques can be used for compensating intra-die random variations. This is simply because it is not possible to measure the variations of each single transistor on the chip and to generate and apply the appropriate body, VDD, or source voltage to it. Increasing the size of the transistors is the most well-known technique for reducing device-to-device random variations. However, in digital gates this results in power and area overheads. In this paper we deal solely with intra-die device-to-device random variations.



2 Performance Degradation Due to Random PV versus VDD

It has been known that by decreasing VDD, PV becomes more pronounced [8]. Fig. 1(a) shows the ratio of the standard deviation (σ) over the average value (μ) of the on current (i.e., Ids at Vgs=Vds=VDD) versus VDD in an 80nm CMOS technology node. Here we have performed Monte Carlo simulations using device matching models provided by a well-known foundry; the model version is BSIM4.3. These simulations include all components of intra-die random variations. In new technologies PV in NMOS is larger than in PMOS. As we see in this figure, by increasing VDD the sensitivity to PV goes down. Fig. 1(a) agrees with the equation presented in [8] for calculating the sensitivity to VT variations:

\frac{\sigma_I}{\mu_I} = \frac{\alpha}{V_{DD} - V_T}\,\sigma_{V_T} \qquad (1)

Here they have assumed that σVT is quite small and that α does not depend on (VDD−VT). Both assumptions are inaccurate.

Fig. 1. Intra-die random process variation effects in the 80nm node versus VDD: (a) transistor on current, (b) minimum size inverter delay, (c) 19-Nand2 ring oscillator, (d) inverter leakage at Tj=65°C

Fig. 1(b) shows the σ/μ ratio of the inverter delay. These curves are quite similar to Fig. 1(a). Since PV in NMOS is larger than in PMOS, the variation in the output fall time is higher than the variation in the rise time. However, both decrease with increasing VDD. In Fig. 1(c) we can see a similar trend for the period of a ring oscillator consisting of 19 Nand2 gates. Dynamic power is always proportional to the square of VDD. Leakage current increases with VDD due to DIBL. This is shown in Fig. 1(d).

To the best of our knowledge, nobody has studied how this fact can be used for selecting the optimum VDD and logic architecture to minimize PV effects.




3 Proposed Idea

In VLSI design, engineers usually design chips using an available design kit (technology). The maximum power consumption and the required performance (clock frequency) are given by the specification. So in most cases, the logic architecture and the supply voltage (VDD) are the only degrees of freedom for the designers.

Several architectures are available for each logic function. For example, the Ripple Carry Adder (RCA) and the Carry Select adder (CSL) do the same job, while the CSL is much faster but the RCA consumes less power. Usually there is a tradeoff between power and delay. We may thus say that at the design level we can usually select between Low-Power Slow (LP-S) architectures and High-Power Fast (HP-F) architectures.

On the other hand, in the low-voltage domain, speed can be significantly improved by increasing VDD. This means that an RCA working at a higher VDD value can work as fast as a CSL adder working at a lower VDD. Clearly, increasing VDD for the RCA increases its leakage and dynamic power as well.

In summary, we may expect that using LP-S architectures at a higher VDD can result in almost equal speed and power compared to using HP-F architectures at a lower VDD. Now the question is: which one will be less sensitive to intra-die random PV? Fig. 1 suggests that the answer is LP-S architectures at a higher VDD.

Clearly, if for a particular function structure A results in lower power and higher speed compared to structure B, the choice will always be A. On the other hand, for some simple gates like the inverter or the Nand gate, different architectures do not exist. Fortunately, when we go from gate level to top level, e.g. Nand, flip-flop, state machine, CPU design, the number of design choices and options increases rapidly.

4 Simulation Results

To verify this idea we selected three different logic blocks, a 16-bit equality comparator, a flip-flop, and a 16-bit adder, and one synthesis-level example, Finite State Machine (FSM) encoding. The HP-F architectures are the parallel comparator, the Sense Amplifier flip-flop (SA), the CSL, and one-hot encoding. The LP-S circuits are the Pre-Evaluation comparator, the Conditional Charge flip-flop (CC), the RCA, and binary encoding. For detailed information about these circuits see Section 6.

Monte-Carlo simulation results for the gates are shown in Fig. 2. Since the gate delay σ/μ decreases with increasing critical path length, the adder delay σ/μ is smaller than the comparator delay σ/μ, and that of the comparator is smaller than the inverter delay σ/μ in Fig. 1(b). Similarly, LP-S gates have a smaller σ/μ than HP-F gates at equal VDD because LP-S gates have a longer critical path. In Figs. 1 and 2 the y-axis is logarithmic.

Table 2 compares the dynamic energy per operation (Dynamic Eng.), leakage power, maximum delay, and random PV sensitivity (σ/μ of delay) of HP-F and LP-S architectures at 500mV and 600mV. Values are normalized to HP-F at 500mV. The dynamic power shown here for the flip-flop is the power of the flip-flop itself plus that of the clock tree. We have assumed that 10% of the logic area is occupied by flip-flops and that the switching activity of the Din input is 10%. For the 16-bit comparator and adder we applied random input patterns.



Fig. 2. Impact of intra-die random process variations on various logic block delays in 80nm: (a) 16-bit comparator, (b) flip-flop, (c) 16-bit adder

As we see in Table 2, at 500mV the LP-S architectures are about 2X slower than HP-F, but the σ/μ of delay is about 25% smaller due to the longer critical path. When we compare HP-F at 500mV and LP-S at 600mV, we see that the LP-S architectures are about 10% faster and less power hungry, and 1.8X less sensitive to intra-die random process variations. By comparing LP-S at 500mV and LP-S at 600mV it is clear that 28% of this improvement is due to the higher VDD.

It is not clear which design results in smaller area occupation. While the CC and the pre-evaluation comparator are slightly bigger than the SA and the parallel comparator, respectively, the RCA is much smaller than the CSL. The static leakage currents of the LP-S circuits RCA and pre-evaluation comparator are lower than those of the HP-F circuits CSL and parallel comparator, respectively, but the static leakage current of the CC is higher than that of the SA. Roughly, we can say that using LP-S at 600mV results in equal power and area compared to using HP-F at 500mV, while the PV sensitivity is reduced 1.8X.

On the other hand, comparing transistor sizing with the proposed method: if we wanted to apply transistor sizing to reduce the PV sensitivity of the HP-F architectures by 1.8X, we would have to increase the transistor width×length (W×L) by 3.24X, because σVT = Avt/(W×L)^0.5. Transistor sizing and gate sizing improve the performance and PV sensitivity of all logic circuits but increase the area, leakage, and dynamic power as well. This is orthogonal to our technique. Sizing can be applied to both LP-S and HP-F architectures to reduce σVT independently of VDD.
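The 3.24X figure follows directly from the Pelgrom relation quoted above; a two-line check with illustrative numbers:

```python
# sigma_VT = A_vt / sqrt(W * L): reducing sigma_VT by a factor s requires
# increasing the gate area W * L by s**2.
s = 1.8
print(s ** 2)  # ≈ 3.24, the 3.24X area increase mentioned in the text
```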



Table 2. Comparing PV Sensitivity of HP-F @500mV and LP-S @600mV

Equality Comparator | Parallel @500mV (HP-F) | Pre-Eval. @500mV (LP-S) | Pre-Eval. @600mV (LP-S)
Dynamic Eng. | 1 | 0.60 | 0.87
Leakage Pow. | 1 | 0.40 | 0.54
Area | 1 | 1.06 | 1.06
Delay | 1 | 1.97 | 0.90
σ/μ (Delay) | 9.1% | 7.5% | 5.1%

Flip-Flop | SA @500mV (HP-F) | CC @500mV (LP-S) | CC @600mV (LP-S)
Dynamic Eng. | 1 | 0.61 | 0.88
Leakage Pow. | 1 | 1.17 | 1.58
Area | 1 | 1.25 | 1.25
Delay | 1 | 1.89 | 0.82
σ/μ (Delay) | 15.0% | 10.8% | 7.4%

Adder | CSL @500mV (HP-F) | RCA @500mV (LP-S) | RCA @600mV (LP-S)
Dynamic Eng. | 1 | 0.67 | 0.94
Leakage Pow. | 1 | 0.65 | 0.87
Area | 1 | 0.68 | 0.68
Delay | 1 | 2.09 | 0.96
σ/μ (Delay) | 8.1% | 5.8% | 4.6%

Finite State Machine Encoding | One-hot & SA @500mV (HP-F) | Binary & CC @500mV (LP-S) | Binary & CC @600mV (LP-S)
Dynamic Eng. | 1 | 0.68 | 0.98
Leakage Pow. | 1 | 0.75 | 1.0
Area | 1 | ~1 | ~1
Delay | 1 | 1.78 | 0.84
σ/μ (Delay) | 7.2% | 5.4% | 3.94%

5 Discussions

It is not possible to prove or guarantee this idea for all possible logic circuits, because there is no general algorithm which can generate LP-S and HP-F architectures for all logic functions and predict the power and delay of each one. However, since the idea is based on the intrinsic characteristic of the transistor shown in Fig. 1(a), and transistors are the building blocks of all logic gates and blocks, the idea appears to be correct whenever a choice between LP-S and HP-F exists.



The idea proposed here is based on Fig. 1, in which we have Vgs=VDD in the on condition and Vgs=0 in the off condition. This is true for all logic styles except single-transistor-switch Pass Transistor Logic (PTL), in which the NMOS charges the internal nodes only up to (VDD−VT) in a source-follower configuration. This kind of PTL has never been used because it does not provide full swing, needs level restoration, cannot be modeled easily in HDL, and is very PV sensitive. But Transmission-Gate (TG) PTL, which has one NMOS and one PMOS in each switch, is compatible with Fig. 1 because the NMOS discharges the internal nodes and the PMOS charges them, both with Vgs=VDD. TG-PTL is especially interesting for multiplexer design. Today all of the available standard cell libraries are complementary logic, which has separate pull-down and pull-up networks (PDN and PUN), and both have Vgs=VDD in the on and Vgs=0 in the off condition, so they are compatible with the proposed idea.

The proposed method also reduces sensitivity to inter-die variations, although less than to random variability. The longer critical path of the LP-S architecture does not, by itself, reduce sensitivity to inter-die variations, because in that case all transistors shift in the same direction and the variations do not average out; however, the improvement due to the higher VDD still applies. Note that inter-die variations can be reduced in the future by better control of the fabrication process, whereas there is no theoretical solution for random variability as long as the channel and S/D junctions are doped and sub-wavelength lithography is used.

6 Details of the LP-S and HP-F Circuits

For the RCA and CSL adders please see [9]. In digital circuits with flip-flop-based registers, the minimum clock period is:

τclk,min = τcq + τlogic + τsu   (2)

where τcq is the flip-flop clock-to-output delay and τsu is the flip-flop setup time. Since τcq and τsu contribute to the total delay in the same way, the delay reported in Fig. 2(b) and Table 2 for the flip-flops is τcq+τsu of two successive flip-flops.

The SA flip-flop is shown in Fig. 3. Its setup time is very small (one inverter delay), but in every clock cycle XL and XR are charged and one of them is discharged, so the power consumption is quite high. N0 turns off M0 at the start of the evaluation phase to stop the race current between the left and right branches. In some older publications N0 is absent and M0 has a long channel length with its gate tied to VDD. This reduces the power consumption, but the flip-flop functionality then depends on sizing and on the transistors' on-resistance. This means that (without N0) SA fails to work at low voltage in the presence of intra-die PV, so we used the circuit shown in Fig. 3.

The Conditional Charge flip-flop (CC) is also shown in Fig. 3. We designed it by adding the conditional charge transistors (M3,4L/R) to the race-free NAND-based DFF that we had proposed in [10]. During the pre-charge phase (clk=0), the internal nodes are charged only if the input data has changed. Since in a typical digital system the switching activity of internal signals is much lower than that of the


clock, this simple idea can save a lot of power. However, the setup time of this flip-flop is quite long (inverter delay + charge time of the XL/R nodes + NAND delay).

One may think that these conditional charge transistors could also be applied to the SA flip-flop. If we do so and a short glitch occurs on the Din input during the pre-charge phase, the XL/R node can be charged to an intermediate voltage level (e.g. VDD/2) and a static short-circuit current flows through N0 for up to half a clock cycle. In the proposed CC flip-flop, however, no short-circuit current can flow through N1L/R during the pre-charge cycle, thanks to the Q/QB feedback. For example, during the pre-charge cycle, if XL=0 and XR=VDD, then Din was zero in the previous cycle, so Q=0, QB=VDD, and M4R keeps XR high. Since one zero is enough to turn off the NMOS stack in the NAND gate, even if a glitch charges XL to an intermediate voltage level, no short-circuit current flows in N1L. Since Q=clk=0 and XR=QB=VDD, no short-circuit current can flow in N1R either.

Fig. 3. Sense Amplifier flip-flop (SA) (HP-F) on the left and proposed Conditional Charge flip-flop (CC) (LP-S) on the right side

When the clock goes high, depending on the Din value, XL or XR goes low and the positive feedback loop (through N2L/R, M5L/R, and M6L/R) stores the Din value and prevents the XL/R nodes from changing further if Din changes again while clk=VDD.

The last example in Table 2 concerns Finite State Machine (FSM) synthesis. Different FSM encoding techniques (e.g. one-hot, Gray, Johnson, binary, ...) result in different performance and power consumption. One-hot is power hungry because each state is represented by one flip-flop. Since there is no extra hardware for decoding the present state or encoding the next-state signals in the one-hot combinational-logic


part, one-hot appears to be the fastest FSM style. On the other hand, highly encoded techniques, e.g. binary (sequential) encoding, use the minimum number of flip-flops, so they are low power; but they need wide functions in the combinational part, so they are slow. Here we assume that the flip-flops are the dominant source of power in the FSM.

In Table 2 we compare two generic FSMs with 14 states. The first one, which is HP-F, uses one-hot encoding and SA flip-flops, and we assumed a chain of 20 Nand2 gates for the longest signal path in the next-state logic. Since FSM power strongly depends on the application, we assumed that 50% of the total leakage and dynamic power is consumed in the flip-flops and 50% in the next-state and output combinational logic. For LP-S we used binary encoding, CC flip-flops, and the same chain of 20 Nand2 gates for the longest next-state path as in HP-F. Since the 14 states are coded into 4 flip-flops, we added a 4:16 decoder at the output of the state flip-flops to model the extra hardware required for identifying the present state, and a 16:4 encoder at the flip-flop inputs to model the extra hardware required for encoding the next state. Since we simply added the decoder/encoder to the output/input of the state flip-flops, the combinational part of both FSMs is the same. In reality, the decoder and encoder could be merged into the combinational logic to optimize the LP-S FSM; Table 2 therefore shows the worst case for LP-S.
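A tiny illustrative sketch (Python, not part of the paper) of the flip-flop-count arithmetic behind this comparison; only the 14-state figure comes from the text above, the rest is a generic rule of thumb:

```python
# One-hot needs one flip-flop per state; binary encoding needs ceil(log2(#states)),
# plus the decode/encode logic modeled above by the 4:16 decoder and 16:4 encoder.
from math import ceil, log2

n_states = 14
ff_one_hot = n_states             # 14 flip-flops (HP-F style, fast next-state logic)
ff_binary = ceil(log2(n_states))  # 4 flip-flops  (LP-S style, fewer clocked nodes)
print(ff_one_hot, ff_binary)      # -> 14 4
```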

Since in a one-hot FSM each flip-flop represents a single state, that flip-flop can be placed near the combinational logic related to that state, which results in short interconnects. In a binary FSM each flip-flop is linked to many states and logic cones, so the interconnects are longer and the capacitive loads larger, requiring buffer gates to drive them. We therefore added one buffer to each present-state and next-state signal in the binary FSM, each with a delay equal to two Nand2 delays.

The pre-evaluation equality comparator is shown in Fig. 4. It saves power based on the simple observation that, when comparing two 16-bit numbers, if A15:12 and B15:12 are not equal there is no need to compare A11:0 and B11:0. In this situation M0 turns off X11:0 and N2:0, while N4 still works properly because Eq3=0. If A15:12 and B15:12 are equal, all sixteen bits are compared. The parallel comparator (HP-F) has exactly

Fig. 4. Pre-evaluation comparator on the left. Submitted tape-out on the right side.


the same architecture; there is simply no M0, and the VSS terminal of X11:0 and N2:0 is connected directly to ground, so all XOR gates work concurrently. The parasitic capacitance of the A/B15:0 input interconnects is an important contributor to the total dynamic power and is not under M0's control; it is included in the values of Table 2.

Conclusion. Random variations increase with scaling. A careful joint selection of VDD and logic architecture can reduce intra-die PV sensitivity by about 1.8X. Our results suggest that, to reduce the effect of intra-die statistical VT variation on timing, designers should first look for very low-power architectures and then raise VDD to reach the desired performance.

Acknowledgement. This research has been supported in part by the CCMX program of the Swiss Confederation; under the project title “MMNS: Materials, devices, and design technologies for nanoelectronic systems beyond ultimately scaled CMOS”.

References

1. Vittoz, E.: Weak Inversion for Ultimate Low-Power Logic. In: Piguet, C. (ed.) Low-Power Electronics Design, ch. 16. CRC Press, Boca Raton (2004)

2. Weber, O., Faynot, O., Andrieu, F., Buj-Dufournet, C., Allain, F., Scheiblin, P., Foucher, J., Daval, N., Lafond, D., Tosti, L., Brevard, L., Rozeau, O., Fenouillet-Beranger, C., Marin, M., Boeuf, F., Delprat, D., Bourdelle, K., Nguyen, B.-Y., Deleonibus, S.: High immunity to threshold voltage variability in undoped ultra-thin FDSOI MOSFETs and its physical understanding. In: IEEE International Electron Devices Meeting (IEDM), pp. 1–4 (2008)

3. Reid, D., Millar, C., Roy, G., Roy, S., Asenov, A.: Analysis of Threshold Voltage Distribution Due to Random Dopants: A 100 000-Sample 3-D Simulation Study. IEEE Transactions on Electron Devices 56(10), 2255–2263 (2009)

4. Cathignol, A., Cheng, B., Chanemougame, D., Brown, A.R., Rochereau, K., Asenov, A.: Quantitative Evaluation of Statistical Variability Sources in a 45-nm Technological Node LP N-MOSFET. IEEE Electron Device Letters 29(6), 609–611 (2008)

5. Tsunomura, T., Nishida, A., Yano, F., Putra, A.T., Takeuchi, K., Inaba, S., Kamohara, S., Terada, K., Hiramoto, T., Mogami, T.: Analyses of 5σ Vth fluctuation in 65nm-MOSFETs using Takeuchi plot. In: Symposium on VLSI Technology, pp. 156–157. IEEE Press, Los Alamitos (2008)

6. Kuhn, K.J.: Reducing Variation in Advanced Logic Technologies: Approaches to Process and Design for Manufacturability of Nanoscale CMOS. In: IEEE International Electron Devices Meeting (IEDM), pp. 471–474 (2007)

7. Kheradmand-Boroujeni, B., Piguet, C., Leblebici, Y.: AVGS-Mux style: A novel technology and device independent technique for reducing power and compensating process variations in FPGA fabrics. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 339–344 (2010)

8. Abu-rahma, M.H., Anis, M.: Variability in VLSI Circuits: Sources and Design Considerations. In: Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 3215–3218 (2007)

9. Yeo, K.S., Roy, K.: Low-Voltage Low-Power Adders. In: Low-Voltage, Low-Power VLSI Subsystems, ch. 3, pp. 72–83. McGraw-Hill, New York (2005)

10. Piguet, C., Masgonty, J.M., Arm, C.: D-Type Master-Slave Flip-Flop. In: US Patent No. 6323710 B1, filed (November 1999)


Impact of Process Variations on Pulsed Flip-Flops: Yield Improving Circuit-Level Techniques and Comparative Analysis

Marco Lanuzza, Raffaele De Rose, Fabio Frustaci, Stefania Perri, and Pasquale Corsonello

Department of Electronics, Computer Science and Systems, University of Calabria, Arcavacata di Rende, 87036 Rende (CS)

{lanuzza,derose,ffrustaci,perri}@deis.unical.it, [email protected]

Abstract. Process variations cause unpredictability in the speed and power characteristics of nanometer CMOS circuits, impacting timing and energy yields. In this paper, transistor reordering and dual-Vth techniques are evaluated with respect to their efficiency in mitigating the impact of process variations on a set of pulsed flip-flops. It is shown that the joint use of these techniques can improve delay, energy and EDP yields by more than 1.98X, 1.62X and 1.99X, respectively. The yield-optimized flip-flop circuits are also comparatively analyzed to identify the best topologies.

1 Introduction

The rapid scaling of silicon technology has enabled designers to integrate millions and even billions of transistors into a single chip. This ability to achieve very high integration density has contributed to the success of integrated circuit (IC) design over the past few decades. Unfortunately, technology scaling has led to a significant increase in process variability due to random doping effects, imperfections in the lithographic patterning of small devices, and related effects [1]. Process variations can cause significant uncertainty in the speed and power characteristics of ICs. Due to the inverse relationship between power and delay, the fastest chips in a lot may present unacceptable power dissipation, whereas low-power chips can be too slow [2]. This significantly reduces the parametric yield in advanced process technologies (such as the 65-nm and 45-nm nodes) [3]. Moreover, yield loss will become even more critical in future technologies, where physical device parameters will approach the atomic scale and will hence be subject to atomic-level uncertainties [1].

In this paper, we address the influence of random process variations on the timing and energy yield of pulsed flip-flops (FFs). These were chosen as a case study since they are very critical elements in the design of high-speed microprocessors, due to their high impact on the delay and energy characteristics of the whole system [4], [5].

FFs targeted for high-speed applications in energy-constrained environments are conventionally sized to optimize the energy-delay product (EDP) [6]. However, due to random process variations, a large number of circuits might not meet the targeted


EDP constraint. This is illustrated in Fig. 1. Under process variations, the EDP distribution of a given circuit can be modeled as a normal distribution with mean value (μ) and standard deviation (σ) [1]. For FFs conventionally optimized for minimum EDP, only 50% of the fabricated circuits would meet the target constraint. In order to achieve a higher yield, statistical sizing approaches, which use statistical information to estimate sensitivity to process variations, can be used. In [7] a gate sizing algorithm is proposed to improve the timing yield of clocked storage elements: the desired timing yield is achieved by iteratively increasing transistor sizes on the basis of statistical simulation results. As shown in [7], this approach can lead to non-negligible power and area overheads.

Fig. 1. The EDP probability density function (pdf) due to process variations
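A minimal sketch (not part of the paper) of the yield argument behind Fig. 1: if EDP is assumed normally distributed, the parametric yield at a target is simply the normal cdf evaluated at that target, so a target placed at the nominal (mean) EDP gives ~50% yield. The μ/σ numbers below are illustrative placeholders.

```python
from scipy.stats import norm

mu_edp = 1000e-27     # hypothetical mean EDP
sigma_edp = 80e-27    # hypothetical standard deviation

def edp_yield(target, mu, sigma):
    """Fraction of circuits with EDP <= target, assuming a normal pdf."""
    return norm.cdf((target - mu) / sigma)

print(edp_yield(mu_edp, mu_edp, sigma_edp))                   # ~0.50 at the mean
print(edp_yield(mu_edp + 3 * sigma_edp, mu_edp, sigma_edp))   # ~0.9987 at mu + 3*sigma
```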

In this work, simple circuit-level techniques to mitigate the impact of process variations on pulsed FFs are evaluated, namely transistor reordering [8] and the use of dual threshold voltage transistors (dual-Vth) [9]. Both approaches can be applied at design time without requiring any extra devices or architectural modifications, so they can easily be used in conjunction with other techniques (such as that proposed in [7]). As will be demonstrated in the following, the timing and energy yield of FFs can be improved concurrently by the joint exploitation of transistor reordering and the dual-Vth technique, without any extra area requirement. Experiments have been performed on four state-of-the-art pulsed FF topologies, designed using STMicroelectronics 45-nm 1V CMOS technology.

Furthermore, a comparative analysis of the FF structures has been carried out. Differently from the study presented in [6], where the impact of process variability was analyzed for FF circuits conventionally optimized for minimum EDP, we performed a comparative analysis on the yield-improved circuit structures.

This paper is organized as follows. In Section 2, the analyzed pulsed FF topologies are briefly reviewed and the adopted simulation setup is discussed. Section 3 deals with the circuit-level techniques implemented to improve robustness against process variability.


A comparative analysis of the obtained results is provided in Section 4. Finally, the conclusions are drawn in Section 5.

2 Pulsed Flip-Flop Topologies and Simulation Methodology

In this work, four representative pulsed FF topologies widely used in high-performance processors were selected as case studies. Fig. 2.a shows the Hybrid-Latch Flip-Flop (HLFF), used in the AMD K6 and K7 processors. This hybrid circuit is particularly fast; however, due to its pre-charged structure, it is associated with considerable power consumption [4]. An improved design is the Conditional Precharge Flip-Flop (CPFF), depicted in Fig. 2.b. This circuit overcomes the problem of glitches at the output, thus reducing dynamic power consumption. This is accomplished by appropriate insertion of keeper elements and by introducing a conditional precharge technique to prevent unnecessary transitions [10]. Another interesting hybrid design is the Semi-Dynamic Flip-Flop (SDFF), shown in Fig. 2.c. This circuit achieves very high speed at the expense of considerable energy consumption, mainly due to the switching activity of the clock pulse generator and to the highly loaded dynamic internal node. A more advanced semi-dynamic flip-flop implementation is the UltraSPARC Semi-Dynamic Flip-Flop (USDFF), shown in Fig. 2.d. The improvement with respect to the SDFF topology mainly consists in using a conditional keeper on the dynamic internal node, which was demonstrated to significantly reduce the energy consumption [11].


Fig. 2. Analyzed flip-flops: (a) HLFF [4] (b) CPFF [10] (c) SDFF [11] (d) USDFF [11]

In a first phase, all the FF circuits were deterministically sized for optimal EDP. Since each topology contains between 22 and 26 transistors, appropriate circuit simplifications were introduced to manage the transistor sizing optimization.


Transistors that do not affect the FF performance (shown as * in Fig.2) were minimum sized to limit the energy consumption. The remaining devices were iteratively sized imposing equal width for series-connected transistors [12]. The iterations were performed until the optimum EDP was obtained.

Fig. 3 shows the simulation setup used in this work. Input buffers are placed between ideal voltage sources and the data and clock inputs to provide realistic input signals. The data input buffer is minimum sized, whereas the clock input buffer is symmetrically sized to keep a constant clock slope equal to FO2 [13], as adopted in real designs.

The output of a given FF is loaded with a 12 fF capacitance. This value was chosen by analyzing the capacitive loads optimally driven by FFs of different strengths belonging to the commercial STM 45-nm standard cell library. We assumed that the generic FF circuit should act as a standard cell with X9 drive strength; we therefore analyzed the behavior of FFs with the adjacent strengths X4 and X18 and determined the load capacitance range for which the X9 flip-flop is optimal. From Fig. 4 it can be seen that 12 fF represents the middle of the capacitive-load range for which the X9 strength is preferable to the adjacent ones. This choice allows realistic operating conditions to be examined.


Fig. 3. The simulation setup
Fig. 4. Load capacitance analysis (100°C, 1V)

The impact of process variations (including mismatch between transistors) was evaluated through Monte Carlo (MC) simulations performed on 1000 samples. In the MC simulations, the nominal 1V power supply voltage, a temperature of 100°C, a clock frequency of 1 GHz and pseudorandom input data with a 25% activity rate [4] were considered. In our tests, the data and clock buffers are not affected by random process variations.

The flip-flop delay considered in this study is the data-to-output delay (TDQb) [14], which includes both the worst clock-to-output delay (TCQb) and the setup time (Tsetup). The latter is usually defined as the data-to-clock offset that corresponds to a 10% increase in the clock-to-output delay [14]. Since the setup time can be deeply influenced by process variations, particular attention was paid to the determination of the data-to-clock offset used in the MC analysis. To this purpose, the mean value and the standard deviation of the setup time were evaluated through appropriate


parametric MC simulations. The data-to-clock offset was then set to the 3-sigma setup value (i.e. (μ+3σ)s) in the subsequent MC simulations used to evaluate the FF delay. In this way a setup-time margin is introduced, which ensures that more than 99.7% of the performed MC runs satisfy the constraint of having less than a 10% increase in the clock-to-output delay.
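A small sketch of how such a (μ+3σ)s margin could be extracted once per-run setup times are available; the sample values below are synthetic placeholders, not the paper's simulation data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder array standing in for per-run setup times [ps] collected from
# parametric Monte Carlo simulations (values are illustrative, not measured).
setup_samples = rng.normal(loc=5.0, scale=1.5, size=1000)

mu_s = setup_samples.mean()
sigma_s = setup_samples.std(ddof=1)
setup_margin = mu_s + 3.0 * sigma_s   # (mu + 3*sigma)_s used as the data-to-clock offset

print(f"setup margin = {setup_margin:.2f} ps "
      f"(>= 99.7% of runs stay within the 10% clock-to-output-delay bound)")
```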

3 Circuit-Level Techniques to Improve Yield

In this section, two different circuit-level techniques that can be useful to target the desired yield in terms of delay and/or energy without sacrificing area are evaluated.

Transistor reordering is a well-known technique that can be used to optimize circuit delay and power dissipation. Appropriate transistor ordering can minimize the switching activity at internal nodes, thus reducing dynamic power consumption [1]. Moreover, transistor reordering can reduce critical-path delay: placing the critical-path transistor (i.e. the transistor driven by the last input signal to assume a stable value) closer to the output of the gate can reduce the gate delay. It has been demonstrated that this approach also improves the delay yield of basic logic gates [8].

As part of this work, transistor reordering has been applied to the pull-down network (PDN) of both stages of the analyzed FF structures. Table 1 shows the six possible PDN transistor configurations; each transistor combination is listed in ascending order from the ground node to the output node.

Table 1. PDN transistor ordering (in brackets, the transistors belonging to the PDN of the second stage)

PDN transistor ordering   SDFF-USDFF                      HLFF-CPFF
Configuration 1 (C1)      MCLK-MD-MICLK (MCLK-MX)         MCLK-MD-MICLK (MCLK-MX-MICLK)
Configuration 2 (C2)      MCLK-MICLK-MD (MCLK-MX)         MCLK-MICLK-MD (MCLK-MICLK-MX)
Configuration 3 (C3)      MD-MCLK-MICLK (MX-MCLK)         MD-MCLK-MICLK (MX-MCLK-MICLK)
Configuration 4 (C4)      MD-MICLK-MCLK (MX-MCLK)         MD-MICLK-MCLK (MX-MICLK-MCLK)
Configuration 5 (C5)      MICLK-MCLK-MD (MCLK-MX)         MICLK-MCLK-MD (MICLK-MCLK-MX)
Configuration 6 (C6)      MICLK-MD-MCLK (MX-MCLK)         MICLK-MD-MCLK (MICLK-MX-MCLK)

Table 2 presents the obtained results in terms of setup-time margin and the mean and standard deviation of delay, energy and EDP. As expected, transistor reordering significantly influences the 3-sigma setup value. Comparing the worst and best delay values of the analyzed configurations, the mean delay of the USDFF improves by up to 20%, and that of the CPFF by up to 28%. At the same time, an average variation of about 30% in mean energy can be observed, except for the SDFF, which shows a mean energy variation of about 18%.


Considering the mean TDQb values, it can be observed that the most favorable configurations are those in which the data-related signals (i.e. D for the first stage and X for the second stage) drive the transistors closest to the output node. These configurations also achieve the minimum standard deviation of the delay.

From the results in Table 2, it can also be concluded that the best ordering in terms of energy mean and standard deviation is the one in which the input signals with the highest probability of being at logic one (in this case CLK and ICLK) are placed far from the output node, owing to the minimization of the switching activity of the internal nodes [15]. On the contrary, for the SDFF and USDFF circuits, which are more susceptible to leakage power (due to the reduced stack effect in the PDN of the second stage), the design rule given in [15] is not fully respected.

Table 2. Transistor reordering results

SDFF
      (μ+3σ)s [ps]   μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1    1.41           44.89     2.61      22.56     1.97      1012.7        85.1
C2    -3.23          38.99     2.32      23.19     1.99      904.2         72.75
C3    4.73           49.01     2.88      24.42     2         1196.8        98.4
C4    10.84          49.84     2.8       21.45     1.78      1069.1        85.25
C5    -1.52          39.43     2.35      21.01     1.95      828.4         72.1
C6    7.75           47.1      2.65      20.08     1.84      945.8         82.15

USDFF
      (μ+3σ)s [ps]   μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1    3.71           46.64     2.19      17.42     0.865     812.5         35.94
C2    -0.5           40.77     2.14      17.72     0.927     722.4         34.76
C3    7.01           50.36     2.59      24.79     1.26      1248.4        75.45
C4    12.91          51.08     2.56      22.9      1.08      1169.7        57.95
C5    0.99           41.11     2.15      22.95     1.62      943.5         63.65
C6    9.7            48.34     2.43      22.53     1.5       1089.1        73.85

HLFF
      (μ+3σ)s [ps]   μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1    12.47          45.02     2.34      34.46     2.17      1551.4        91.39
C2    3.57           37.42     2.09      25.36     1.65      949           28.56
C3    3.27           46.47     2.55      29.98     1.55      1393.2        68.26
C4    5.26           50.1      2.73      34.5      1.78      1728.5        59.56
C5    8.22           36.99     1.99      31.41     1.73      1161.9        34.04
C6    14.55          44.65     2.44      34.57     1.82      1543.6        50.12

CPFF
      (μ+3σ)s [ps]   μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1    12.56          46.52     2.78      24.57     2.07      1143          64.8
C2    3.54           37.11     2.2       17.1      1.14      634.6         23.43
C3    2.76           47.34     2.91      24.46     1.45      1157.9        47.74
C4    6.27           51.47     3.13      23.43     1.38      1205.9        40.37
C5    9.26           36.95     2.06      20.64     1.58      762.6         40.36
C6    15.04          45.95     2.74      23.36     1.36      1073.4        38.63

Another interesting circuit-level strategy is the dual-Vth (DVT) technique, which consists of using transistors with two different threshold voltages: lower-Vth devices are used in the critical paths to optimize performance, while higher-Vth devices are used in non-critical paths to reduce leakage power [1]. This approach was applied to the analyzed circuits in conjunction with transistor reordering, exploiting the 45-nm STM General Purpose transistor library, which includes devices with standard (SVT) and high (HVT) threshold voltages. SVT transistors were used to implement the delay-critical PDNs, whereas HVT transistors were used where device delay is not a concern.

The obtained results in terms of delay, energy and EDP mean and standard deviation are shown in Table 3. The setup-time margins are not significantly affected by this technique, so their values are not reported in Table 3.

A careful comparison of the results in Table 2 and Table 3 shows that the DVT technique has a minor impact on the delay mean and standard deviation, while it can lead to a significant decrease of the energy standard deviation, depending on the input vector. More precisely, comparing the best and the worst PDN


configurations in terms of energy consumption, the energy standard deviation improves from 10.9% (for the SDFF) to 15.8% (for the CPFF). As highlighted in Table 3, for each flip-flop topology the best transistor arrangements in terms of performance or energy consumption are the same as those shown in Table 2.

Table 3. Transistor reordering + dual-Vth results

SDFF
        μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1+DVT  45.12     2.74      22.18     1.78      1000.8        67
C2+DVT  39.1      2.49      22.67     1.72      886.4         65.75
C3+DVT  49.08     3         23.95     1.68      1175.5        80.15
C4+DVT  49.94     2.94      20.61     1.7       1029.3        79.3
C5+DVT  39.5      2.44      20.68     1.71      816.9         66.3
C6+DVT  47.21     2.79      19.72     1.64      931           77.25

USDFF
        μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1+DVT  46.68     2.14      16.8      0.742     784.2         33.37
C2+DVT  40.43     2.11      17.38     0.823     702.7         32.97
C3+DVT  50.47     2.5       23.91     0.94      1206.7        49.66
C4+DVT  51.41     2.44      21.08     0.705     1083.7        36.2
C5+DVT  41        2.06      21.34     1.2       874.9         44.18
C6+DVT  48.13     2.3       20.3      1.06      977           48.13

HLFF
        μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1+DVT  45.11     2.31      33.87     1.87      1527.9        78.57
C2+DVT  37.32     2.03      24.78     1.46      924.8         27.53
C3+DVT  46.41     2.45      28.75     1.41      1334.3        65.43
C4+DVT  50.12     2.7       33.8      1.69      1694.1        57.87
C5+DVT  36.94     1.97      30.92     1.64      1142.2        32.64
C6+DVT  44.71     2.4       33.82     1.73      1512.1        49.03

CPFF
        μD [ps]   σD [ps]   μE [fJ]   σE [fJ]   μEDP [e-27]   σEDP [e-27]
C1+DVT  46.24     2.75      24.03     1.73      1111.1        54.4
C2+DVT  36.76     2.09      16.52     0.96      607.3         18.47
C3+DVT  47.12     2.87      23.95     1.28      1128.5        43.77
C4+DVT  51.14     3.03      22.41     1.14      1146          34.72
C5+DVT  36.59     2.03      20.06     1.5       734           37.36
C6+DVT  45.53     2.65      22.18     1.09      1009.9        31.93

Fig. 5. Yield improvement obtained by comparing the C1 and C5 (dashed line) SDFF transistor arrangements (the yield data refer to the μ value of the C1 configuration)


Figure 5 shows the effect of the analyzed techniques on the SDFF topology. The results demonstrate that the joint use of transistor reordering and the DVT technique considerably improves the timing and energy yields concurrently. More precisely, comparing the C5 with the C1 transistor stack arrangement, an improvement of 1.98X, 1.62X and 1.99X is obtained in terms of delay, energy and EDP yield, respectively.

4 Comparative Analysis and Discussion

For each FF topology, the solution leading to the best trade-off between EDP and robustness to process variations has been selected. To this purpose, the simple cost function defined in [16] was used:

CF(C) = μEDP(C) · σEDP(C)   (1)

The CF is a relevant metric since it takes into account both the mean EDP and its spread caused by process variation effects. The optimal transistor configuration (Copt) is simply the one that minimizes the cost function, i.e. Copt = C : min{μEDP(C)·σEDP(C)}. As shown in Table 3, the optimal transistor arrangement is configuration C2, except for the SDFF, whose best solution is configuration C5.
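A minimal sketch of this selection rule, applied to the SDFF "reordering + dual-Vth" entries of Table 3 (μEDP, σEDP in units of 1e-27); it reproduces the C5 choice quoted above for the SDFF.

```python
# Pick the transistor ordering that minimizes CF(C) = mu_EDP(C) * sigma_EDP(C).
sdff_dvt = {            # (mu_EDP, sigma_EDP) from Table 3, SDFF columns
    "C1": (1000.8, 67.0),
    "C2": (886.4, 65.75),
    "C3": (1175.5, 80.15),
    "C4": (1029.3, 79.3),
    "C5": (816.9, 66.3),
    "C6": (931.0, 77.25),
}

c_opt = min(sdff_dvt, key=lambda c: sdff_dvt[c][0] * sdff_dvt[c][1])
print(c_opt)   # -> "C5", matching the configuration selected for the SDFF
```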

Comparative MC results are given in Table 4. The ratio between the maximum spread 3σ and the mean value µ is used in Table 4 as a measure of the variability induced by process variations on a particular parameter. All the FF topologies show similar results in terms of delay variability: the USDFF presents the lowest delay variability (about 15.66%), whereas the SDFF has the highest delay uncertainty (18.53%). Although the CPFF has a delay variability of 17.07%, it shows the best mean delay (see Table 3). A more differentiated susceptibility to process variations can be observed in terms of energy dissipation: the SDFF has the highest energy variability (more than 24.81%), while the USDFF has the lowest (about 14.21%).

Table 4. Comparative results

       (3σ/µ)D [%]   (µ+3σ)D [ps]   (3σ/µ)E [%]   (µ+3σ)E [fJ]   (3σ/µ)EDP [%]   (µ+3σ)EDP [e-27]
SDFF   18.53         46.82          24.81         25.81          24.35           1015.8
USDFF  15.66         46.76          14.21         19.85          14.08           801.61
HLFF   16.32         43.41          17.67         29.16          8.93            1007.39
CPFF   17.07         43.03          17.43         19.4           9.12            662.71

The 3-sigma value, defined as µ+3σ and provided in Table 4, gives practical information for evaluating the achievable yield. As illustrated in Fig. 6, 99.87% of fabricated circuits based on the CPFF topology would have a worst-case delay lower than 43.03 ps and an energy dissipation lower than 19.4 fJ. By comparison, 99.75%, 92.65% and 89.07% of fabricated HLFF, SDFF and USDFF circuits, respectively, would reach a speed performance similar to that obtained for the CPFF structure. At the 3-sigma energy value of


the CPFF, the USDFF and the SDFF achieve an energy yield of 99.29% and 23.52%, respectively, whereas the HLFF presents an energy yield of almost zero. As expected, the CPFF also shows the lowest 3-sigma EDP value, thus being the best solution among the four analyzed circuits. At the CPFF 3-sigma value, the USDFF and the SDFF show an EDP yield of 11.26% and 1%, respectively, whereas the HLFF presents an EDP yield equal to zero.

Fig. 6. Yield comparison
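A short sketch (assuming normally distributed delays, consistent with the analysis above) of how the delay-yield figures quoted in the text follow from the μ/σ values of the selected configurations in Table 3 and the CPFF (µ+3σ)D reference of Table 4.

```python
from scipy.stats import norm

def yield_at(ref, mu, sigma):
    """Fraction of circuits with parameter <= ref, assuming a normal distribution."""
    return norm.cdf((ref - mu) / sigma)

# delay (mu, sigma) in ps of the selected configurations (Table 3)
delay = {"SDFF": (39.5, 2.44), "USDFF": (40.43, 2.11),
         "HLFF": (37.32, 2.03), "CPFF": (36.76, 2.09)}
ref_delay = 43.03   # CPFF (mu + 3*sigma)_D from Table 4

for ff, (mu, sd) in delay.items():
    print(ff, f"{100 * yield_at(ref_delay, mu, sd):.2f}%")
# approximately reproduces the 99.87% / 99.75% / 92.65% / 89.07% figures in the text
```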

5 Conclusions

In this paper, the impact of process variations on the delay and energy performance of a set of high-speed FFs has been analyzed. Moreover, in order to reduce the unpredictability in speed and energy dissipation, transistor reordering and dual-Vth techniques have been applied and their effects studied. It was found that these techniques can significantly impact both the mean values and the standard deviations of the setup time, data-to-output delay and energy dissipation. The optimum transistor-reordered solution depends on the particular FF topology, the number of stacked transistors, and the relative position of the switching devices in the transistor network. The best delay mean and standard deviation were found for PDN configurations in which the data signals drive the devices closest to the output node. Moreover, for each kind of FF, better energy mean and standard deviation are obtained by using high-threshold transistors in the non-critical paths. The analyzed FF topologies were also compared to identify the best choice from the yield point of view. The comparative analysis clearly shows that the CPFF circuit assures the highest delay, energy and EDP yields.


References

1. Wong, B.P., et al.: Nano-CMOS Design For Manufacturability. John Wiley & Sons, Chichester (2009)

2. Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V.: Parameter variations and impact on circuits and microarchitecture. In: Proc. of the 40th Design Automation Conference, Anaheim, CA, USA, June 2-6 (2003)

3. Sylvester, D., Agarwal, K., Shah, S.: Variability in nanometer CMOS: Impact, analysis, and minimization. Integration, the VLSI Journal 41(3), 319–339 (2008)

4. Stojanovic, V., Oklobdzija, V.: Comparative Analysis of Master-Slave Latches and Flip-Flops for High-Performance and Low-Power Systems. IEEE J. Solid-State Circuits 34(4), 536–548 (1999)

5. Rebaud, B., Belleville, M., Bernard, C., Robert, M., Maurine, P., Azemard, N.: A comparative study of variability impact on static flip-flop timing characteristics. In: Proc. IEEE International Conference on Integrated Circuit Design and Technology (ICICDT), Austin, TX, June 2-4, pp. 167–170 (2008)

6. Hansson, M., Alvandpour, A.: Comparative Analysis of Process Variation Impact on Flip-Flop Power-Performance. In: Proceedings of the 2007 IEEE International Symposium on Circuits and Systems (ISCAS 2007), pp. 3744–3747 (2007)

7. Mostafa, H., Anis, M., Elmasry, M.: Comparative Analysis of Timing Yield Improvement under Process Variations of Flip-Flops Circuits. In: 2009 IEEE Computer Society Annual Symposium on VLSI (2009)

8. da Silva, D.N., et al.: CMOS Logic Gate Performance Variability Related to Transistor Network Arrangements. Microelectronics Reliability 49, 977–981 (2009)

9. Ashouei, M., Chatterjee, A., Singh, A.D., De, V.: A dual-Vt layout approach for statistical leakage variability minimization in nanometer CMOS. In: Proceedings of the 2005 IEEE International Conference on Computer Design (ICCD), pp. 567–573 (October 2005)

10. Nedovic, N., Oklobdzija, V.G.: Hybrid Latch Flip-Flop with Improved Power Efficiency. In: Proceedings of the 13th Symposium on Integrated Circuits and Systems Design, pp. 211–215 (2000)

11. Giacomotto, C., Nedovic, N., Oklobdzija, V.G.: The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements. IEEE J. Solid-State Circuits 42(6), 1392–1403 (2007)

12. Alioto, M., Consoli, E., Palumbo, G.: General Strategies to Design Nanometer Flip-Flops in the Energy-Delay Space. IEEE Transaction on Circuits and Systems (2009)

13. Alioto, M., Consoli, E., Palumbo, G.: Flip-Flop Energy/Performance Versus Clock Slope and Impact on the Clock Network Design. IEEE Transaction on Circuits and Systems (2009)

14. Markovic, D., Nikolic, B., Brodersen, R.: Analysis and Design of Low-Energy Flip-Flops. In: Proc. of the 2001 International Symposium on Low Power Electronics and Design, Huntington Beach, California, United States, pp. 52–55 (2001)

15. Hossain, R., et al.: Reducing Power Dissipation in CMOS Circuits by Signal Probability Based Transistor Reordering. IEEE Trans. Computer Aided Design Integrated Circuits Systems 15(3), 361–368 (1996)

16. Li, B., Peh, L., Patra, P.: Impact of Process and Temperature Variations on Network-on-Chip Design Exploration. In: Proc. of the Second ACM/IEEE International Symposium on Networks-on-Chip, NOCS, pp. 117–126 (2008)


Transistor-Level Gate Modeling for Nano CMOS Circuit Verification Considering Statistical Process Variations

Qin Tang, Amir Zjajo, Michel Berkelaar, and Nick van der Meijs

Circuits and Systems Group, Delft University of Technology
[email protected]

Abstract. Equation- or table-based gate-level models (GLMs) have been applied in static timing analysis (STA) for decades. In order to evaluate the impact of statistical process variabilities, Monte Carlo (MC) simulations are utilized with GLMs for statistical static timing analysis (SSTA), which requires a massive amount of CPU time. Driven by the challenges associated with CMOS technology scaling to 45nm and below, intensive efforts have been devoted to optimizing GLMs for higher accuracy at the expense of increased complexity. In order to maintain both accuracy and efficiency at the 45nm node and below, in this paper we present a gate model built from a simplified transistor model. Considering the increasing statistical process variabilities, the model is embedded in our new statistical simulation engine, which can perform both implicit non-MC statistical and deterministic simulations. Results of timing, noise and power grid analysis are presented using a 45nm PTMLP technology.

Keywords: gate modeling, transistor-level, non-Monte Carlo, statistical timing analysis.

1 Introduction

Nowadays, cell-based design flows are still dominant for circuit verification such as timing, noise or power grid analysis. Usually, due to the challenges associated with gate modeling, a separate GLM, such as a noise model or a power droop model, is developed to handle each effect. However, building on the recent invention of a current source model [8], a unified GLM for timing, noise and power analysis is in sight. Since the analysis is carried out using cell models, the models must accurately represent the behavior of the circuit that makes up the cell for timing, crosstalk, variability calculation, etc. However, conventional GLMs model every element as a function of input slew and a single output effective capacitance (Ceff), and rely on a single-input-switching (SIS) assumption.

Instead of optimizing GLMs for higher accuracy at the cost of increased complexity and characterization time, we make the case that transistor-level gate models can address most of the limitations of GLMs [5].


With increasing process variations at 45nm and below, the major challenge in timing gate modeling becomes the efficient construction of a parameterized timing model of a design, representing the design characteristics as a function of process variations [6]. The major approaches are Monte Carlo (MC) simulations and the computation and propagation of statistical arrival times. The MC method suffers from excessive pessimism and poor scalability as the number of process parameters increases. On the other hand, generating statistical arrival time models for all standard cells of a library takes a huge amount of CPU time due to the necessary MC-based simulation.

In this paper, we present a statistical simplified transistor model (SSTM) for cell modeling which is capable of simultaneously handling most of the issues described in Section 2. The new non-MC statistical simulation method is introduced in Section 4.

2 GLM Limitations and Optimization Trends

By using conventional GLMs, (S)STA provides delay and slew much faster without calculating accurate waveforms. In nanometer technology, however, the conventional GLMs become less accurate due to the following intrinsic limitations.

1. Simple saturated ramps can no longer represent the input signals, especially if they arise from a complicated driving stage with noise or a multiple-input switching (MIS) scenario, or are influenced by process variations or other sources of variability [7].

2. GLMs fail to work with a multi-port coupled interconnect load, since the load is only modeled as an effective capacitance (Ceff). Oversimplification of the interconnect coupling can lead to large errors during timing analysis [1]-[2].

3. GLMs are unable to capture MIS and internal charge effects for high-stack and complex cells. The SIS assumption is inherent in all timing tools. In reality, all multiple-input cells are subject to delay degradation (or delay improvement for min-delay STA) due to MIS. Not modeling MIS for timing can result in as much as 100% error in delay and slew calculation [2].

4. The modeling complexity required to handle voltage droop effects keeps increasing: in order to account for power supply variations, GLMs have to be characterized at different supply voltages.

There is a clear trend to optimize GLMs to deal with the limitations listed above. Croix and Wong introduced an input-waveform-independent current source model (CSM) [8], which is essentially a voltage-based, DC-transfer-derived current source with transient effects modeled by a linear capacitance at the output. Many optimized CSMs extend the Croix model to handle other limitations. The Miller capacitance is considered and voltage-based capacitance models are used in [1]-[3], while [9] focuses on waveform models. A non-linear Ceff model is described in [4], although its accuracy still needs further evaluation. The MIS issue is addressed by modeling every input and output port of the cell [1]-[2]. The internal nodes are also modeled to capture internal charge effects in [1] to obtain higher


accuracy. However, these approaches merely attempt to optimize GLMs to maintain acceptable accuracy for all types of gates. Unfortunately, the fact that GLMs are black-box models, in which the internal structure of the gates is hidden, is the essential root of all these issues. The increasing requirement for accuracy makes the trade-off between better accuracy and shorter runtime a real challenge [6].

At 45nm and below, the propagation of complex signals and accurate modeling of crosstalk effects require accurate cell models. A good cell model for SSTA should be independent of input waveform, output load and circuit structure; should provide high accuracy and efficiency compared to SPICE without increasing complexity; should have a much shorter characterization time; and should be able to capture process variations and be easy to embed in a SPICE-like engine to propagate statistical signal information. By using an efficient transistor model and simulation algorithm, transistor-level gate modeling for timing analysis is gaining popularity [10]-[12].

3 Statistical Simplified Transistor Model (SSTM)

One extreme approach to transistor-level timing analysis is to simply run Spice/Spectre. However, such an approach is computationally impractical due to the cost of transistor model (e.g. BSIM4 [13]) evaluation.

Our target is to develop a simplified transistor model which captures sufficient second-order effects and statistical process variations to allow accurate and efficient waveform and delay calculation for (S)STA.

Fig. 1. a) current-source model; b) proposed SSTM

Recently, optimized GLMs typically model every gate by several capacitors and a current source, as shown in Fig. 1a [3]. Although the CSM is less accurate as a whole-gate representation in nanometer technology, the simple model is, however, appropriate for transistor modeling. The proposed SSTM, shown in Fig. 1b, represents every transistor by a statistical current source Ids and five parasitic capacitances which also have statistical values as a function of the statistical process parameters of interest.


3.1 Current Source Modeling

Conventionally, without considering second-order effects of deep-submicron MOSFETs, the Shichman-Hodges model was gradually replaced by Deep Submicron MOSFET Models (DSMM) [14]. Although a DSMM substantially improves the accuracy for submicron MOSFET behavior, our experiments in 45nm technology still show significant errors: i) due to channel length modulation (CLM), DIBL and the substrate-current-induced body effect, the CLM parameter λ is a complicated function of Vgs and Vds; as a consequence, modeling the saturation current as a linear function of Vds with a constant slope starting from Ids(Vdsat) is not accurate enough; ii) in the linear region, Ids is no longer proportional to (Vgs − Vth − Vds/2); in fact the factor 1/2 should be replaced by a factor which depends on Vgs − Vth; iii) the cut-off current can no longer be ignored: simulation results show that when Vgs is smaller than Vth by a small amount, the current still has a similar shape to the current when Vgs > Vth, which cannot be modeled as zero if the input slew and load capacitance are both small.

Similarly, the α-power law MOSFET model [15] is also widely used in digital circuit simulation. This model assumes that near- and sub-threshold region modeling is not important in calculating the delay of digital circuits, so the linear region is simply approximated by straight lines and the saturation-region current is constant. However, if the load capacitance and input slew are both quite small, the inaccuracy of the linear-region current significantly impacts the output waveform at the end of the transition, which introduces a large error in the output slew. Taking these issues into consideration, the proposed BSIM4-based nominal current source Ids0 of SSTM is given in equation form as:

Ids0 = H · e^(Vgst/(n·Vt)) · (1 − e^(−Vds/Vt))                                                 for Vgs ≤ Vth
Ids0 = (W/L) · [ J·Vgst·Vdseff · (1 − Vds/(2Vb)) / (1 + Vds/Vc) ] · [1 + λ(Vds − Vdseff)]      for Vgs > Vth     (1)

where Vgst = Vgs − Vth, Vb = Vgst + 2Vt and Vt is the thermal voltage. The main components are described as:

Vdseff = Vdsat − (1/2)·( Vdsat − Vds − γ + √((Vdsat − Vds − γ)² + 4γ·Vdsat) )   (2)

Vdsat = Vc·(Vgst + 2Vt) / (Vc + Vgst + 2Vt)   (3)

In order to link the continuous linear-region current with the saturation current, the smoothing function (2), based on BSIM4, is used; Vdseff enables a unified expression for both the linear and saturation currents. The threshold voltage Vth divides the I-V plane into two parts, so accurate Vth modeling is important. According to the BSIM4 model, a linear dependence of Vth on Vds is a good approximation. We simplify the Vth model as:

Vth = Vth0 − α·Vds + K1·(√(Φs − Vbs) − √Φs) − K2·Vbs   (4)

where Vth0 is the zero-biased long-channel device Vth and α is a coefficient for drain/source charge sharing and DIBL effects on Vth. The coefficients K1, K2 and the surface potential Φs are obtained and derived from the technology file.


The model simplification focuses on the following items: i) instead of using complicated expressions, the parameter J lumps several effects, including mobility degradation; ii) the narrow-channel effect is not considered in the Vth model; iii) the Vgsteff model of BSIM4 [13] is replaced by Vgst, since the unified expression for the current from strong inversion to the linear region is not used. As a result, the Ids0 model and its derivative are dramatically simplified. It should be noted that the cut-off current can simply be modeled as zero if sharp input ramps and extremely small load capacitances rarely occur at the same time; the proposed model then simplifies further to the second expression in (1), where only J and λ are obtained in the characterization stage.

The statistical description of the I-V model is:

Ids = Ids0(t) + Σ_{k=1..m} (∂Ids/∂pk)|_{pk=pk0}(t) · ξk = Ids0(t) + Σ_{k=1..m} χk(t)·ξk   (5)

pk = pk0 + ξk   (k = 1 … m)   (6)

where pk is the kth random process parameter, which is the sum of the nominal value pk0 and a random variable ξk with zero mean (μ) and the same standard deviation (σ) as pk. χk(t) is the derivative of Ids with respect to pk.

3.2 Intrinsic Capacitance Modeling

The most accurate way to model non-linear capacitances is to represent them as voltage-dependent terminal charge sources [13]. Characterization of such a model would involve generating charge tables for a range of terminal voltages. All capacitances are derived from the charge to ensure charge conservation; each capacitance is computed as Cij = ∂Qi/∂Vj at every time step, where i and j denote the transistor terminals. Although this approach would be the most accurate, the massive amount of simulation time would be a problem for STA and SSTA.


Fig. 2. Cgd variation for a minimum-sized NMOS

Using a single value for all capacitors promises fast simulation, but it results in an overly simple model which produces errors in (S)STA for nanometer technology. Fig. 2 shows the variation of Cgd for a minimum-sized NMOS. Clearly,


at the 45nm node, the capacitances are too nonlinear to be accurately modeled by a constant value. In order to improve accuracy while maintaining good computational efficiency, SSTM treats the five capacitances differently. For the gate channel capacitances (GCC) Cgs, Cgd and Cgb, SSTM uses a constant value in the cut-off and saturation regions, respectively, and approximates them as a linear function of Vgs and Vds in the linear region. For the junction depletion capacitances Csb and Cdb, SSTM uses a single-value model, since they are 1-2 orders of magnitude smaller than the GCCs.

In the statistical extension of the capacitance model (7), Cj0 is the nominal value of the jth capacitance in Fig. 1 and the sensitivity ζ is characterized by perturbing the process variables of interest.

Cj(t, ξ) = Cj0 + Σ_{k=1..m} (∂Cj/∂pk)|_{pk=pk0} · ξk = Cj0 + Σ_{k=1..m} ζk·ξk   (7)

The characterization time of GLMs for SSTA is quite long, since standard cell libraries consist of hundreds of cells with different sizes and process corners. In contrast, by using transistor-based gate modeling like SSTM, the characterization time is significantly reduced, as only the unique transistors used in the cell library need to be characterized.
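A sketch of one way the sensitivity terms χk in (5) and ζk in (7) could be characterized, namely by finite-difference perturbation of each process parameter around its nominal value; the `model` function and the parameter vector here are purely illustrative stand-ins, not the characterization flow described in the paper.

```python
import numpy as np

def sensitivities(model, p0, delta=1e-3):
    """Return the nominal value and d(model)/d(p_k) at p0 for every parameter p_k."""
    base = model(p0)
    sens = np.zeros(len(p0))
    for k in range(len(p0)):
        p = p0.copy()
        p[k] += delta * abs(p0[k])                # small relative perturbation (p0[k] != 0)
        sens[k] = (model(p) - base) / (p[k] - p0[k])
    return base, sens

# toy stand-in for a nominal SSTM quantity as a function of two process parameters
model = lambda p: 1e-4 * (1.0 - p[0]) / p[1]
nominal, chi = sensitivities(model, np.array([0.4, 1.0]))
# first-order statistical model: value = nominal + sum_k chi_k * xi_k, xi_k ~ N(0, sigma_pk)
print(nominal, chi)
```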

4 Non-MC Statistical Simulator

The proposed SSTM is embedded in our non-MC statistical simulator [16] for fast statistical timing analysis. In general, for deterministic time-domain analysis, the modified nodal analysis (MNA) equations of any circuit can be expressed in compact form as:

F(x′, x, t, p0) = 0,   x(t0) = x0   (8)

where x is the vector of circuit state variables consisting of nodal voltages and branch currents, and p0 is the nominal process variable vector with elements pk0 introduced in (6). x′ denotes the time derivative of x. Let xs be the solution to (8). Transient analysis in a conventional simulator solves for xs using numerical integration methods. However, the existence and importance of process variations at 45nm and below result in a random MNA system which can be expressed as:

F(x′, x, t, p) = 0,   x(t0) = x0 + δx0   (9)

where p is the statistical process variable vector with elements pk introduced in (6), and δx0 denotes the initial variation caused by p.

It is computationally impracticable to solve (9) directly due to the large set of correlated random variables and the nonlinearity. Therefore, in order to make the problem manageable, we employ principal component analysis (PCA) to map the large set of m correlated parameters p in (6) to an n-dimensional (n ≪ m) vector of uncorrelated random variables, and we linearize (9) with a truncated Taylor expansion. To avoid notational clutter, the notation p is used in the rest of the paper for the uncorrelated process variables after PCA. The linear Taylor expansion is carried out at the point x′s, xs and p0. Let us define y(t) = x(t) − xs(t) as the variation of x(t) due to the process variation ξ with zero μ and finite σ mentioned in (6). Re-organizing the 1st-order Taylor expansion of (9), we obtain the compact form:

y′(t) = E(t)·y(t) + F(t)·ξ,   y(t0) = δx0   (10)

The nonlinear random equation (9) is thus converted into a linear random differential equation (RDE) in y. According to the mean square (m.s.) integral theorem [17], there exists a unique solution. Assuming the initial condition x0 is set to a fixed value, the solution has the form y(t) = α(t) · ξ. By substituting y(t) = α(t) · ξ into (10), α(t) is easy to calculate by solving the resulting ODE.

Then the mean, variance and covariance of x(t) can be calculated as:

E{x(t)} = xs(t),   Var{xj(t)} = Σ_{k=1..n} α²_{jk}(t)·Var{ξk}   (11)

Cov(xa, xb) = α(ta) · diag(Var{ξ1}, …, Var{ξn}) · α^T(tb)   (12)

where xj(t) is the jth element of the vector x(t). Once α(t) is calculated, y(t) is known; thus the covariance matrix of the solution y(t) at two different time points ta and tb can be calculated by (12).
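A small numerical sketch of (11)-(12): given the sensitivity matrices α(t) at two time points and the variances of the uncorrelated parameters ξ, the per-node voltage variance and the cross-time covariance follow by simple matrix algebra. The numbers below are placeholders.

```python
import numpy as np

# illustrative alpha(t) of shape (#state variables, n) at two time points ta and tb
alpha_ta = np.array([[0.02, -0.01], [0.005, 0.03]])
alpha_tb = np.array([[0.015, -0.012], [0.004, 0.028]])
var_xi = np.array([1.0e-3, 4.0e-4])                  # Var{xi_k} after PCA

var_x_ta = (alpha_ta ** 2) @ var_xi                  # (11): per-node variance at ta
cov_ta_tb = alpha_ta @ np.diag(var_xi) @ alpha_tb.T  # (12): cross-time covariance matrix
print(var_x_ta, cov_ta_tb, sep="\n")
```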

From the waveform modeling point of view, the waveform is modeled as a time-indexed voltage array for STA, while the mean, variance and covariance arrays are used for SSTA. Based on (11)-(12), the probability density function (pdf) of every crossing time for rising and falling transitions can be calculated straightforwardly by (13) and (14), respectively, assuming the voltage at any time point is Gaussian distributed [16].

Pr(trη = t) = Pr(Vo(t − Δt) ≤ Vη) − Pr(Vo(t − Δt) ≤ Vη ∩ Vo(t) ≤ Vη)   (13)
Pr(tfη = t) = Pr(Vo(t) ≤ Vη) − Pr(Vo(t − Δt) ≤ Vη ∩ Vo(t) ≤ Vη)   (14)

where the crossing time tη is the time when the node voltage crosses the corresponding voltage threshold Vη = η% · Vdd. Pr(Vo(t − Δt) ≤ Vη ∩ Vo(t) ≤ Vη) is the joint cdf of Vo at two time steps. Note that the proposed method calculates the pdf directly and considers the correlation of Vo at two time steps, in contrast to [18] and [19]. Given the mean and variance of the crossing time, the mean and variance of delay and slew can be calculated.
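Assuming Gaussian node voltages as stated above, (13)-(14) reduce to univariate and bivariate Gaussian cdf evaluations. The sketch below illustrates this with SciPy; it is not the MATLAB implementation of [16], and v_eta = (η/100) · Vdd as defined above.

import numpy as np
from scipy.stats import norm, multivariate_normal

def crossing_time_pmf(mu, var, cov_prev, v_eta, rising=True):
    """Probability that the eta% crossing happens in (t - dt, t], eqs. (13)-(14).

    mu, var: mean and variance of Vo at the two time points [t - dt, t];
    cov_prev: Cov(Vo(t - dt), Vo(t)); Vo is assumed Gaussian as in [16].
    """
    cdf_prev = norm.cdf(v_eta, loc=mu[0], scale=np.sqrt(var[0]))
    cdf_now = norm.cdf(v_eta, loc=mu[1], scale=np.sqrt(var[1]))
    cov = np.array([[var[0], cov_prev], [cov_prev, var[1]]])
    joint = multivariate_normal(mean=mu, cov=cov).cdf([v_eta, v_eta])
    return (cdf_prev - joint) if rising else (cdf_now - joint)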

5 Experimental Results

The proposed SSTM and non-MC statistical simulation method were evaluated using 45nm PTMLP technology [20] and implemented in MATLAB. For SSTM, the data for characterization were obtained from Spectre using a BSIM4 model and then imported to a characterization algorithm in MATLAB to acquire the required parameters described in Section 3.



Fig. 3. Delay and output slew evaluation: (a) relative error of rise delay and (b) relative error of fall delay vs. load capacitance (fF) and input slew (ns); (c) rising and (d) falling output slew vs. capacitive load (fF), SSTM vs. BSIM4 results

We present the accuracy evaluation of SSTM for minimum-sized cells, arbitrary inputs and MIS, as well as the applicability of SSTM to power grid and signal integrity verification. Finally, the statistical simulation results are presented.

We evaluated the nominal SSTM (i.e., without process variations) on minimum-sized inverter and NAND2 cells with different input slews (Sin) and capacitive loads (Cload). Sin ranges from 1ps to 500ps and Cload spans from 0.5fF to 40fF. In comparison with Spectre using the BSIM4 model, it is clear from Fig. 3 (a)-(b) that the relative error for delay calculation is within 5%; 99.2% of the output rise delays and 93.9% of the output fall delays are within 1.6%. The average relative error of the output slew calculation is 1.2%. Although the maximum relative error is 3.3% with zero Cload, Fig. 3 (c)-(d) show that the absolute error is nearly zero.

In essence, SSTM is input waveform independent, so it can handle arbitrary input waveforms. Certain cells may experience simultaneous MIS and internal charge sharing during some specific input-to-output transitions. The transistor-based SSTM is able to handle these since every node is considered at the same time. Fig. 4 illustrates the accuracy of the nominal SSTM used in a minimum-sized inverter with irregular input and in a NAND2 cell in a simultaneous MIS scenario. The results show a very good match between the nominal SSTM and the BSIM4 model.

Power supply integrity verification is an essential step in current design flows due to the large currents drawn through an increasingly resistive power supply network.

The models used in power grid analysis must capture the dynamic current characteristics of the cells. Fig. 5(a) shows the current drawn by a cell from the power supply at both rising and falling transitions.



Fig. 4. Left: irregular input; right: simultaneous MIS for a NAND2 cell (input, output-BSIM4 and output-SSTM voltage waveforms vs. time)

Fig. 5. SSTM's application to power grid and signal integrity verification: (a) scaled supply current vs. scaled time (SSTM vs. Spectre); (b) noisy input, aggressor and output voltages vs. scaled time (SSTM vs. Spectre)

It is easy for transistor-based gate models to capture the dynamic currents since the desired current is calculated during the simulation.

The primary modeling challenge for on-chip signal integrity verification has been the simulation of a driver (the victim), subject to an input noise, whose interconnect load is capacitively coupled to the output of another driver (the aggressor). In Fig. 5(b) we see that the SSTM captures this scenario well. All waveforms in Fig. 5 show that SSTM can be applied to power grid and signal integrity verification flows.

We combined SSTM with the proposed non-MC statistical simulation method for a large number of standard cells in a 45nm technology. The uncorrelated process variations are length and width variations with zero μ. The 3σ of length and width are 20% and 15% of the nominal length and the largest width of every cell, respectively. In comparison with 1000 Monte Carlo trials in Spectre, the proposed modeling and simulation method achieved relative error within 1.4% for μ and within 6.8% for σ with an average 40× speedup [16].

6 Conclusion

At 45nm and below, the gate models for circuit verification should account for increasing accuracy requirements and process variations. In this paper, a statistical simplified transistor model (SSTM) for transistor-level gate modeling, which is embedded in our non-MC statistical simulator, is presented. The SSTM-based



gate model is independent of input waveform and output load, easy to characterize and suitable for SSTA, and accurate compared to Spice/Spectre for standard cells. We show that, in addition to handling accuracy limitations associated with conventional gate-level models for STA, like arbitrary inputs and multi-input switching, it can also be applied to power grid verification and noise verification flows. The statistical results show that our transistor-level timing analysis methodology achieves both high accuracy and efficiency.

References

1. Menezes, N., Kashyap, C., Amin, C.: A "true" electrical cell model for timing, noise, and power grid verification. In: Proc. of DAC, pp. 462–467 (2008)
2. Amin, C., Kashyap, C., Menezes, N., Killpack, K.: A multi-port current source model for multiple-input switching effects in CMOS library cells. In: Proc. of DAC, pp. 247–252 (2006)
3. Goel, A., Vrudhula, S.: Statistical waveform and current source based standard cell models for accurate timing analysis. In: Proc. of DAC, pp. 227–230 (2008)
4. Li, P., Acar, E.: Waveform independent gate models for accurate timing analysis. In: Proc. of ICCD, pp. 617–622 (1996)
5. Tang, Q., Zjajo, A., Berkelaar, M., van der Meijs, N.: A simplified transistor model for CMOS timing analysis. In: Proc. of ProRISC, pp. 1–6 (2009)
6. Keller, I., Tarn, K.H., Kariat, V.: Challenges in gate level modeling for delay and SI at 65nm and below. In: Proc. of DAC, pp. 468–473 (2008)
7. Nazarian, S., Pedram, M., Tuncer, E., Lin, T.: Sensitivity-based gate delay propagation in static timing analysis. In: Proc. of ISQED, pp. 536–541 (2005)
8. Croix, J.F., Wong, D.F.: Blade and Razor: cell and interconnect delay analysis using current-based models. In: Proc. of DAC, pp. 386–389 (2003)
9. Amin, C.S., Dartu, F., Ismail, Y.I.: Weibull based analytical waveform model. IEEE Trans. on CAD 24, 1156–1168 (2005)
10. Raja, S., Varadi, F., Becer, M., Geada, J.: Transistor level gate modeling for accurate and fast timing, noise, and power analysis. In: Proc. of DAC, pp. 456–461 (2008)
11. Kulshrehtha, P., Palermo, R., Mortazavi, M.: Transistor-level timing analysis using embedded simulation. In: Proc. of ICCAD, pp. 344–348 (2000)
12. Li, Z., Chen, S.: Transistor level timing analysis considering multiple inputs simultaneous switching. In: Proc. of CADCG, pp. 315–320 (2007)
13. BSIM4 Home Page, http://www-device.eecs.berkeley.edu/bsim3/bsim4.hml
14. Rabaey, J.M.: Digital integrated circuit: A design perspective, pp. 96–100. Prentice Hall, Upper Saddle River (1996)
15. Sakurai, T., Newton, A.R.: Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE JSSC 25(2), 584–594 (1990)
16. Tang, Q., Zjajo, A., Berkelaar, M., van der Meijs, N.: RDE-based transistor-level gate simulation for statistical static timing analysis. In: Proc. of DAC, pp. 787–792 (2010)
17. Soong, T.T.: Random differential equations in science and engineering. Academic Press, New York (1973)
18. Fatemi, H., Nazarian, S., Pedram, M.: Statistical logic cell delay analysis using a current-based model. In: Proc. of DAC, pp. 253–256 (2006)
19. Liu, B., Kahng, A.B.: Statistical gate level simulation via voltage controlled current models. In: IEEE Proc. of MBAS, pp. 23–27 (2006)
20. Predictive Technology Model for Low-power Applications (PTMLP) (November 2008), http://www.eas.asu.edu/~ptm/modelcard/LP/45nm_LP.pm


White-Box Current Source Modeling Including Parameter Variation and Its Application in Timing Simulation

Christoph Knoth1, Irina Eichwald1, Petra Nordholz2, and Ulf Schlichtmann1

1 Institute for Electronic Design Automation, Technische Universität München, http://www.eda.ei.tum.de/

2 Infineon Technologies AG, Munich, http://www.infineon.com

Abstract. This paper presents a novel method for generating current source models (CSMs) for logic cells that efficiently captures the influences of parameter variation and supply voltage drops. The characterization exploits topological information from the transistor netlist, resulting in typically 80x faster CSM library generation. The parametric CSMs have been integrated into a commercial FastSPICE simulator to further accelerate path-based timing analysis with transistor level accuracy. Without loss of accuracy, simulation times were reduced by 4x to 98x.

1 Introduction

Timing validation is a crucial step during the design closure of digital circuits. The huge number of cell instances in modern IC designs requires abstract signal and delay models. The industry standard delay model, the nonlinear delay model (NLDM), therefore approximates the cell input behavior by capacitances and logic signals by linear ramps with arrival and transition times. Nonetheless, these idealizations do not account for the increasing impact of analog effects introduced by interconnects. Signal transitions are non-monotonic due to coupling noise, and the wire resistance causes long transition tails and reduces the load capacitance seen by the driver. Effective capacitance and piecewise constant input capacitances emerged as patches for NLDM to better account for the analog effects [21], but delay and slew errors are still larger than 10% [10]. EDA vendors recognized the importance of precise waveform modeling for correct delay modeling and introduced the new driver and delay models ECSM and CCS [1, 2, 24]. These models use more voltage-time points to describe logic signals but still assume monotonic transitions. The authors of [14] proposed to use a larger set of "typical" waveforms, including noisy ones, for cell delay characterization.

In contrast to simulating every possible scenario of input signal and output load during library generation, waveform and load independent CSMs have been proposed. They are pin compatible models of logic cells and provide the port currents as functions of port voltages to calculate the output waveform using




SPICE principles. CSMs are mainly used in dedicated timing or noise engines [6, 7, 9, 11] but can also be employed in SPICE simulators [17, 23].

For today's and future technology nodes the impact of parameter variation is of major concern. It is therefore not sufficient to improve model accuracy for nominal conditions. All enhancements must support statistical analyses. This also holds for CSMs. In [9], [19] and [25] CSMs are used in special statistical timing simulators to propagate the nominal voltage waveform and the sensitivities of voltage crossing points w.r.t. parameters.

Despite their accuracy benefits and reported applications, generating CSM libraries requires a significant effort. As will be shown in the next section, the problem arises from time consuming transient simulations for obtaining CSM components. Moreover, this leads to a prohibitively high simulation effort when the impact of parameter variation has to be considered.

This paper therefore presents a white-box modeling approach that allows much faster CSM library generation. To the best of our knowledge, it is the first method to build parametric CSMs that employs transistor netlist information from a topology analysis. Furthermore, the paper reports the first utilization of CSMs to accelerate the simulation performance of a commercial FastSPICE simulator. This makes it possible to reduce simulation times for digital and mixed signal circuits.

2 Current Source Modeling

Current source models imitate the nonlinear port currents of logic cells as functions of port voltages. Different CSMs have been proposed over the years [3, 6, 7, 9, 11–13, 15, 16, 18, 20, 22, 25]. All of them model the port current as a composite of a static current from a voltage controlled current source (VCCS) and an additional dynamic contribution realized by (non)linear charges or capacitors (see Fig. 1). These static and dynamic components are modeled as functions of the port voltages. Important internal nodes of complex cells might be treated as additional virtual ports [15].
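To make the composite structure concrete, the following Python sketch drives a purely capacitive load C_L from such a model: the output node obeys C_L · dvz/dt = −(Iz(va, vz) + dQz(va, vz)/dt). The functions I_z and Q_z stand for a characterized static current and port charge (LUT or fitted) and, like the sign conventions and the explicit treatment of the charge, are assumptions of this illustration rather than part of any particular published CSM.

def csm_output_waveform(I_z, Q_z, va_of_t, t, C_L, vz_init=0.0):
    """Drive a purely capacitive load C_L from one CSM stage (explicit Euler sketch).

    I_z(va, vz) is the static VCCS current and Q_z(va, vz) the output port
    charge; va_of_t(t) is the input waveform. The vz-dependence of Q_z is
    lagged by one step, so this only illustrates the model structure.
    """
    vz = [vz_init]
    for k in range(1, len(t)):
        dt = t[k] - t[k - 1]
        va_prev, va_now = va_of_t(t[k - 1]), va_of_t(t[k])
        # dynamic current: change of the port charge caused by the moving input
        i_dyn = (Q_z(va_now, vz[-1]) - Q_z(va_prev, vz[-1])) / dt
        i_port = I_z(va_prev, vz[-1]) + i_dyn
        # KCL at the output node: the port current (dis)charges the load
        vz.append(vz[-1] - dt * i_port / C_L)
    return vz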

Generating a CSM can cause a significant simulation effort. Only the authors of [6] propose a method to derive CSMs from already existing ECSM timing libraries. Unfortunately, the impact of parameter variation cannot be captured. In almost all other approaches, a set of time consuming simulations is performed. Obtaining the functions for the static port currents of a logic cell is conveniently realized by attaching DC voltage sources to the ports, sweeping their values, and measuring the resulting port currents. These values are stored in lookup tables (LUTs) or are approximated by polynomials or splines. The real challenge is in characterizing the dynamic components, for which different methods have been published.

In [7, 20] the capacitor values or functions are found by error minimization to match the transient output current for a set of typical input stimuli. In other approaches, step or ramp signals are applied and the differences between static and transient port currents are integrated to get equivalent port charges or



capacitances [3, 12, 18]. This is done for all combinations of port voltages in the LUT. In [18] a second order lowpass filter at the input accounts for additional gate delay. The filter parameters and all other model components are "tuned" by step-wise error minimization with typical input waveforms.

The authors of [3] pointed out the runtime problem of transient simulations for CSM characterization and reduced the number of data points in the LUTs. Therefore, in [15] AC simulations are used to obtain voltage controlled capacitors connecting the ports of a cell. Unfortunately this method leads to very complex CSMs.

It should be noted that, although being a one time effort, library characterization can be very expensive and time consuming. Several CSMs of a single cell have to be generated for different PVT corners. Inefficient methods block computational resources and software licenses and can delay the design process. The problem is even more severe when parameter variation is considered.

In [20] the CSM elements are determined by performing a number of Monte Carlo (MC) simulations with typical input waveforms and subsequent error minimization w.r.t. port voltages and parameters. In [9] many CSMs are generated for different parameter combinations of several MC runs. Subsequent linear fitting for every data point in the LUTs yields a first order sensitivity model. Similarly, the authors of [18] wrap parameter deflection and the calculation of finite differences for each model element around the whole characterization, which is based upon error minimization. In [13] the CSM capacitors are obtained from the difference of static and total port current for a sequence of transient simulations. This is repeated for every combination of parameters. The high-dimensional tables (port voltages and parameters) are approximated by the tensor product of polynomials which model the nominal values and the variation impact.

The proposed white-box approach avoids the plethora of transient simulations to match the port behavior of logic cells. Instead, physically motivated CSMs are generated based upon the original netlist elements. The additional information obtained from the transistor netlist enables very fast and accurate model generation. This efficiency is the key to capturing the influence of parameter variation within reasonable time. The model is applicable to stand-alone timing simulators. However, we implemented the parametric CSM for SPICE and FastSPICE simulators. This further improves the performance of existing and highly efficient tools. Moreover, CSMs can thus be utilized for simulating mixed signal circuits together with transistor models and behavioral descriptions in Verilog or VHDL. Each CSM can be adjusted to parameter variation and Vdd-drop during simulation. It is therefore compatible with MC methods and fits well into existing simulation, optimization, and verification methodologies.

3 White-Box CSM Characterization

3.1 Nominal Characterization

The aim is to replicate the nonlinear port behavior of the transistor level subcircuit description, such as in Fig. 2, by the much simpler circuit of Fig. 1. Hence,



Fig. 1. Current Source Model with low-pass filter and nonlinear current source and charges

Fig. 2. Subcircuit definition of CMOS inverter with parasitic elements

for any sequence of input voltages va and any arbitrary load attached to output port z, the model port currents ia and iz must match the original currents ia and iz.

Similar to other CSM approaches, the port current is modeled by the sum of a static current Iz(va, vz) and a dynamic current resulting from the time derivative of the associated port charge, dQ(va, vz)/dt. For efficiency, a CSM is provided for every timing arc. Therefore, the model components are functions of two node potentials. In cells with multiple stages (e.g., buffer, AND), internal node potentials affect the port behavior. Structure recognition is applied to partition these cells into channel connected blocks. These stages are then modeled individually by a CSM as in Fig. 1. In cells with significant parasitic input networks, a lowpass filter accounts for the additional cell delay.

While existing approaches treat the logic cell as a black box of which only the port currents are observable, the presented white-box approach uses the original netlist elements to derive the model components: voltage controlled current source and voltage controlled charges. The port charge is denoted as the sum of all node charges of resistively connected internal nodes [17]. A topological search is performed on the transistor netlist to obtain a symbolic expression that collects all charges associated with one port. Similarly, all static current contributions of the transistors are found. For the example of Fig. 2, the model components are related to the original currents and charges through

Iz(va, vz) = Id^M1(vdd, vz0, va1) + Id^M2(vss, vz0, va2)   (1)

Qz(va, vz) = Qd^M1(vdd, vz0, va1) + Qd^M2(vss, vz0, va2) + C4 · (vz0 − vss) + C0 · (vz0 − va0)   (2)

Qa(va, vz) = Qg^M1(vdd, vz0, va1) + Qg^M2(vss, vz0, va2) + C2 · (va0 − vss) + C0 · (va0 − vz0)   (3)

Qg^Mx denotes the gate pin charge of transistor Mx and Cx are the parasitic capacitances. Dynamic coupling between input and output (Miller effect) is implicitly modeled in (2). Similarly, the dependency of the input capacitance on the output voltage is captured by the last term of (3).



While the nonlinear transistor quantities depend on internal node potentials, the model components shall be functions of port voltages only. It has been observed that all internal node voltages have very small time constants. Hence, any particular solution decays quickly, usually within one time step of a transient simulation. The node potentials therefore have the same values as in a DC simulation with fixed port voltages. Consequently, also the node charge values will be identical. This observation is used to implement a very efficient characterization without transient simulations. DC voltage sources are attached to the active pins of the stage and swept from Vss to Vdd. Based on the topological search, measurement statements of (1)-(3) are executed and the data for the port quantities is obtained. In contrast to existing methods, there is no interdependence among the model components. Hence, the complete model comprising static and dynamic components can be characterized simultaneously in a single DC simulation.
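The characterization loop can be sketched as follows; dc_operating_point() is a placeholder for a single DC solve with fixed port voltages that returns the transistor currents, charges and internal node voltages needed by (1)-(3). It is not an actual simulator API, and the dictionary keys are illustrative only.

import numpy as np

def characterize_stage(dc_operating_point, v_grid, vss, C0, C2, C4):
    """Fill the nominal LUTs of one stage per (1)-(3) with a DC sweep."""
    n = len(v_grid)
    Iz, Qz, Qa = (np.zeros((n, n)) for _ in range(3))
    for i, va in enumerate(v_grid):
        for j, vz in enumerate(v_grid):
            op = dc_operating_point(va, vz)              # one DC solve per grid point
            Iz[i, j] = op['Id_M1'] + op['Id_M2']         # eq. (1)
            Qz[i, j] = (op['Qd_M1'] + op['Qd_M2']        # eq. (2)
                        + C4 * (op['vz0'] - vss) + C0 * (op['vz0'] - op['va0']))
            Qa[i, j] = (op['Qg_M1'] + op['Qg_M2']        # eq. (3)
                        + C2 * (op['va0'] - vss) + C0 * (op['va0'] - op['vz0']))
    return Iz, Qz, Qa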

Having multiple parallel transistors to increase driving strength results in a rather large linear parasitic input network. This causes a notable signal delay, which is accounted for by a lowpass filter. The model elements Ra and Ca are chosen to match the average cutoff frequency of the connected transistor gate pins. The filter is attached to a duplicate of the input voltage to preserve the receiver properties modeled by Qa. The delayed input voltage is used to control the nonlinear elements.

3.2 Handling Parameter Variations

Deviations of process or environmental parameters from their nominal values affect transistor quantities like saturation current or overlap capacitances, leading to altered cell delays. The CSM accounts for this by modeling the physical impact of variations on the model quantities port current and port charge.

Consistent with existing simulation methods, each parameter is described as the superposition of its nominal value pi^n and a deviation Δpi. The latter is composed of global, local, and random influences.

pi = pi^n + Δpi = pi^n + pi^g + pi^l + pi^r   (4)

This makes it possible to model correlation between local variations of parameters of closely placed cells. Consequently, every CSM instance faces an individual set of parameter deflections Δp. Intra-cell variation is not considered but could be modeled in the same way. Supply voltage drops are treated similarly to parameters, with expected deviations of up to 15%. An individual Vdd-drop can be assigned to each stage of a CSM.

Every parameter variation Δpi causes an additional static current and additional charges. If Δpi is sufficiently small, the first order approximation of the model components is given as

I = I^n + ΔI = I^n + Σi (dI/dpi) · Δpi   (5)

Q = Q^n + ΔQ = Q^n + Σi (dQ/dpi) · Δpi   (6)



The applicability of every CSM modeling method strongly depends on the cost of obtaining the linear sensitivity of a quantity w.r.t. a parameter, here dI/dpi and dQ/dpi. All methods which excessively employ transient simulation for model characterization run into severe complexity problems. The proposed white-box method based upon netlist information is very efficient since a complete stage is characterized in a single, very fast, DC simulation. Since the relation of netlist elements and CSM components is known from the nominal characterization, the sensitivities to parameter variations are also immediately assigned to the parametric CSM components. By reusing the symbolic equations (1)-(3), the linear sensitivities of the model components are given as

dIz/dpi = dId^M1/dpi + dId^M2/dpi   (7)

dQz/dpi = dQd^M1/dpi + dQd^M2/dpi + C4 · d(vz0 − vss)/dpi + C0 · d(vz0 − va0)/dpi   (8)

dQa/dpi = dQg^M1/dpi + dQg^M2/dpi + C2 · d(va0 − vss)/dpi + C0 · d(va0 − vz0)/dpi   (9)

The numerical values of (7)-(9) are obtained through simulation with subsequent calculation of finite differences. Each parameter is positively and negatively deflected by one standard deviation while all other parameters are kept constant. If cross dependencies are significant, more DC simulations can be performed to cover additional points in the parameter space. However, we observed that second order effects can be neglected. Hence, for N parameters with significant influence, (2N + 1) DC simulations are required, which takes a few minutes on standard computers. For illustration, generating the nominal CSMs for two timing arcs of a NAND gate was done in 32 seconds on a desktop machine. For comparison, CSM models have been generated according to the method proposed in [3]: 46 minutes and 20 seconds were needed to generate the two nominal CSMs. Therefore, our proposed approach is faster by a factor of 86. Similar factors have been observed for other cell types. The full model generation including the sensitivities w.r.t. six parameters and Vdd required 9 minutes and 5 seconds using our method but would take about 12 hours with the other approach.
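The finite-difference step itself is straightforward; in the sketch below, characterize(p) is a placeholder for one DC characterization run returning the LUTs at parameter vector p, and sigma holds one standard deviation per parameter (both names are assumptions of this illustration).

import numpy as np

def lut_sensitivities(characterize, p_nominal, sigma):
    """Central differences for the LUT sensitivities of (7)-(9): 2N+1 DC runs."""
    I_nom, Q_nom = characterize(np.asarray(p_nominal, dtype=float))
    dI, dQ = [], []
    for i, s in enumerate(sigma):
        p_plus = np.array(p_nominal, dtype=float)
        p_minus = np.array(p_nominal, dtype=float)
        p_plus[i] += s                       # parameter deflected by +1 sigma
        p_minus[i] -= s                      # parameter deflected by -1 sigma
        I_p, Q_p = characterize(p_plus)
        I_m, Q_m = characterize(p_minus)
        dI.append((I_p - I_m) / (2.0 * s))
        dQ.append((Q_p - Q_m) / (2.0 * s))
    return I_nom, Q_nom, dI, dQ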

3.3 Implementation

The characterization starts with the topology analysis of the transistor netlist files. SPICE simulations are conducted for each timing arc and the measured port values are stored in ASCII LUTs. Finally, the CSMs are generated either as Verilog-A modules or as subcircuits using compiled models for the nonlinear elements [5]. Verilog-A models are supported by many circuit simulators, but more speedup is gained with compiled models. The compiled model interface (CMI) allows CSMs to be used in simulators like Spectre or UltraSim. Similar interfaces exist for other simulators.

New circuit elements for voltage controlled current source and voltage controlled charge have been implemented. They support 2D LUTs of variable size provided as ASCII files. During an initialization phase, nominal and sensitivity tables are imported. In the case of a parameter alteration, the simulator



provides the numerical value of the deviation and the instance tables are updated according to (5). This is done prior to any transient analysis and for every entry in the LUTs. During the simulation, bilinear interpolation is applied to the final tables. It is preferred to multidimensional approximation functions since it flexibly supports the modified tables for parameter variation and is sufficiently accurate. Because only channel connected blocks are modeled, the functions are reasonably smooth. Hence, good convergence properties exist also for moderate discretization of the LUTs. It was also observed that the size of the 2D-LUTs was not runtime critical. In Verilog-A the variation is modeled by additional current and charge contributions. Hence, additional interpolations must be performed for each parameter in every iteration. Unfortunately, Verilog-A's interpolation function $tablemodel is rather slow, and almost no speedup was gained in the experiments.
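The per-instance table update of (5) followed by bilinear interpolation can be illustrated as below; this is a pure-Python sketch of the behaviour, not the CMI code itself.

import numpy as np

def update_instance_lut(nominal, sensitivities, delta_p):
    """Per-instance table shift per eq. (5): T = T_nom + sum_i dT/dp_i * delta_p_i."""
    table = nominal.copy()
    for dT_dpi, dpi in zip(sensitivities, delta_p):
        table = table + dT_dpi * dpi
    return table

def bilinear(table, v_grid, va, vz):
    """Bilinear interpolation of a 2D LUT indexed by (va, vz) during simulation."""
    i = int(np.clip(np.searchsorted(v_grid, va) - 1, 0, len(v_grid) - 2))
    j = int(np.clip(np.searchsorted(v_grid, vz) - 1, 0, len(v_grid) - 2))
    ta = (va - v_grid[i]) / (v_grid[i + 1] - v_grid[i])
    tz = (vz - v_grid[j]) / (v_grid[j + 1] - v_grid[j])
    return ((1 - ta) * (1 - tz) * table[i, j] + ta * (1 - tz) * table[i + 1, j]
            + (1 - ta) * tz * table[i, j + 1] + ta * tz * table[i + 1, j + 1])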

The integration into commercial SPICE and FastSPICE simulators makes it possible to perform timing and noise analyses as well as MC simulations. This really broadens the applicability of CSMs since it is now possible to efficiently simulate circuits containing transistor models, behavioral models, and CSMs. Still, the model can be used in dedicated timing and noise simulators. Especially the sensitivity tables are valuable data for statistical approaches such as [19] or [25].

4 Results

A CSM library has been automatically generated for 293 90nm CMOS gates with extracted parasitics. The cells have 1 to 10 input pins and consist of up to three stages. The influences of the six most dominant process parameters and static supply voltage drop have been considered. The complete structure recognition required less than one second. Generating the CSMs for each cell required 20 minutes on average using a 2GHz Linux machine with 4GB RAM.

To evaluate model accuracy and performance, the CSMs have been compared against the transistor level implementations (BSIM) of the logic cells using an in-house SPICE simulator. Models for every timing arc of every gate have been tested individually by performing 50 MC runs for different combinations of input waveforms, CRC Π-loads, and parameters. The histograms in Fig. 3 show relative delay and slew errors for these tests. For the majority of testcases the CSM delay prediction matches the BSIM reference. In 93.18% of the testcases the delay error was less than 2%, and 99.58% are within 5% of BSIM. The error of the output slew was less than 2% for 96.54% and less than 5% for 99.86% of all tests. The CSM therefore provides significantly more accuracy than NLDM [10] while already supporting parameter variation and non-ideal input waveforms. Fig. 4 demonstrates this capability using an inverter and a noisy input signal. In plot A the input waveform is depicted together with the two output waveforms predicted by BSIM and the CSM model, respectively. The same noisy input has been applied to the gate while different simulation modifications were made. Plot B shows the output waveforms if one parameter is altered. In plot C, all six parameters have been randomly deflected. In the scenario of plot D, arbitrary



Fig. 3. Relative delay and slew errors for all cells with different CRC Π-loads, input slews, and parameter variation (50 MC runs per timing arc)

Fig. 4. Accurate waveform prediction in the presence of noise for nominal conditions (A), one altered parameter (B), all altered parameters (C), additional Vdd-drop (D); input, BSIM and CSM output voltages vs. time (ps)

parameter variation and an additional supply voltage drop have been applied. For all cases the waveforms overlap almost completely. It also visualizes that first order sensitivities are suitable to capture parameter variations for the CSM components.

After studying each gate individually, critical paths of ISCAS85 circuits have been simulated with SPICE and FastSPICE using transistor models (BSIM) and the current source models (CSM). Table 1 compares the predicted path delays and simulation times for 50 MC runs in SPICE. Good accuracy is achieved, with most mean errors being less than 1%. The simulations could be accelerated by factors of 82 to 175. For the circuit c6288 this means a reduction from 3 days and 11 hours to 30 minutes! The correlation plot of path delays for c1355 in Fig. 5 shows that most errors are within 5% while the maximum error is 8.9%. Similar results are obtained for the other circuits.



Table 1. Simulation time and path delay errors of 50 MC runs in SPICE using transistors and CSMs

Circuit   Delay Error mean [%]   Delay Error max [%]   CPU-Time BSIM [s]   CPU-Time CSM [s]   Speedup
c17        0.044    -1.762       151.32       1.21     125.06
c1355      0.753     8.883     14332.32     107.18     133.72
c880       0.076     8.339     15343.50     121.27     126.52
c1908     -0.499     8.642     23176.59     202.32     114.55
c2670     -0.519     5.542     14838.57     180.74      82.10
c5315     -0.159     9.472     16739.05     226.45      73.92
c6288     -3.159    -9.215    299763.30    1715.30     174.76

Fig. 5. Correlation plot of path delay variation for c1355 (normalized CSM delay vs. normalized BSIM delay, ±5% and ±10% bands)

The above studies focused on verifying the CSM accuracy. It has been further investigated whether CSMs can improve existing tools used for timing analysis. FastSPICE simulators provide the necessary functions for timing verification with transistor level accuracy [4]. They apply circuit partitioning, use simpler device models and adaptively controlled explicit simulation [8]. CSMs further reduce the computational effort by combining several transistors of a logic cell into three LUTs. Table 2 compares the simulation times and speedup factors for different models and simulators. As expected, SPICE with BSIM models is prohibitively time consuming. Replacing the cells by CSMs causes a significant acceleration by factors of 50 to 80. Simulation times are now of the same order as the FastSPICE simulator with transistor models. These times can be further reduced by factors of 4 to 98 by using CSMs as cell models in FastSPICE. Especially remarkable are the simulation times and speedup for c6288. This circuit consists of many identical gates. Hence, in contrast to other circuits, only a few CSMs must be held in memory during simulation, resulting in fewer cache misses and higher speedup. This effect can be illustrated by reducing the circuit size: truncating the path to 50% or 25% decreases the speedup to 62.99 and 38.14, respectively.

Table 2 further lists the relative path delay errors compared to SPICE with BSIM models. Using a FastSPICE simulator has caused more error than using CSMs in SPICE. Furthermore, using CSMs in a FastSPICE simulator did not result in noticeable additional errors.

Table 2. Performance comparison for different simulators and models

Circuit    SPICE Runtime [s]                FastSPICE Runtime [s]            Relative delay error [%]
           BSIM       CSM      BSIM/CSM     BSIM      CSM     BSIM/CSM       SPICE CSM   FastSPICE BSIM   FastSPICE CSM
c17        41.05      0.70     58.6         3.06      0.32    9.6            0.00        −1.46            −1.46
c880       2180.24    27.01    80.7         42.37     5.22    8.1            −0.31       −2.02            −2.02
c1355      2008.78    22.96    87.5         40.95     4.62    8.9            0.26        −2.71            −2.71
c1908      3473.82    41.60    83.5         35.67     7.46    4.8            −1.49       −2.33            −2.33
c2670      2197.10    39.33    55.9         32.12     7.35    4.4            −0.84       −2.95            −3.01
c5315      2742.89    47.27    58.0         38.09     9.34    4.2            −1.08       −3.07            −2.90
c6288      30865.54   140.04   220.4        1725.58   17.57   98.0           −2.86       −2.56            −2.56



5 Conclusion

A current source modeling technique for logic gates has been presented. By obtaining additional information from the transistor netlist, very efficient model characterization based on DC simulations has been realized. This allows fast CSM library generation, including the sensitivities to process parameters and supply voltage. The CSMs have been realized as compiled circuit components and used in SPICE and FastSPICE timing analysis of ISCAS85 circuits. At the cost of 3% delay error, SPICE simulation times could be reduced to those of FastSPICE simulators. Alternatively, an additional speedup of 4-98x was realized when using CSMs in a FastSPICE simulator without additional error penalty.

Acknowledgement

This work has been supported by the German Ministry of Education and Research (BMBF) within the project 'Sigma65' (Project ID 01M3080A). The content is the sole responsibility of the authors.

References

1. Composite current source (December 2006), http://www.synopsys.com/products/solutions/galaxy/ccs/cc_source.html
2. ECSM - effective current source model (2007), http://www.cadence.com/Alliances/languages/Pages/ecsm.aspx
3. Amin, C., Kashyap, C., Menezes, N., Killpack, K., Chiprout, E.: A multi-port current source model for multiple-input switching effects in CMOS library cells. In: ACM/IEEE Design Automation Conference (DAC), pp. 247–252 (2006)
4. Cadence. UltraSim User's Manual (June 2003)
5. Cadence. Compiled-Model Interface Reference (November 2004)
6. Chopra, K., Kashyap, C., Su, H., Blaauw, D.: Current source driver model synthesis and worst-case alignment for accurate timing and noise analysis. In: ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 45–50 (2006)
7. Croix, J., Wong, M.: Blade and razor: cell and interconnect delay analysis using current-based models. In: ACM/IEEE Design Automation Conference (DAC), pp. 386–389 (June 2003)
8. Devgan, A., Rohrer, R.A.: ACES: A transient simulation strategy for integrated circuits. In: IEEE International Conference on Computer Design (ICCD), pp. 357–360 (1993)
9. Fatemi, H., Nazarian, S., Pedram, M.: Statistical logic cell delay analysis using a current-based model. In: ACM/IEEE Design Automation Conference (DAC), pp. 253–256 (July 2006)
10. Feldmann, P., Abbaspour, S., Sinha, D., Schaeffer, G., Banerji, R., Gupta, H.: Driver waveform computation for timing analysis with multiple voltage threshold driver models. In: ACM/IEEE Design Automation Conference (DAC), pp. 425–428 (2008)



11. Gandikota, R., Chopra, K., Blaauw, D., Sylvester, D.: Victim alignment in crosstalk-aware timing analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29(2), 261–274 (2010)
12. Goel, A., Vrudhula, S.: Current source based standard cell model for accurate signal integrity and timing analysis. In: Design, Automation and Test in Europe (DATE), pp. 574–579 (2008)
13. Goel, A., Vrudhula, S.: Statistical waveform and current source based standard cell models for accurate timing analysis. In: ACM/IEEE Design Automation Conference (DAC), pp. 227–230 (June 2008)
14. Jain, A., Blaauw, D., Zolotov, V.: Accurate delay computation for noisy waveform shapes. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 947–953 (2005)
15. Kashyap, C., Amin, C., Menezes, N., Chiprout, E.: A nonlinear cell macromodel for digital applications. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 678–685 (2007)
16. Keller, I., Tseng, K., Verghese, N.: A robust cell-level crosstalk delay change analysis. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 147–154 (2004)
17. Knoth, C., Kleeberger, V.B., Nordholz, P., Schlichtmann, U.: Fast and Waveform Independent Characterization of Current Source Models. In: IEEE/VIUF International Workshop on Behavioral Modeling and Simulation (BMAS), pp. 90–95 (September 2009)
18. Li, P., Feng, Z., Acar, E.: Characterizing multistage nonlinear drivers and variability for accurate timing and noise analysis. IEEE Transactions on VLSI Systems 15(11), 1205–1214 (2007)
19. Liu, B., Kahng, A.B.: Statistical gate level simulation via voltage controlled current source models. In: IEEE International Behavioral Modeling and Simulation Workshop (September 2006)
20. Mitev, A., Ganesan, D., Shanmugasundaram, D., Cao, Y., Wang, J.M.: A robust finite-point based gate model considering process variations. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 692–697 (2007)
21. Nassif, S., Li, Z.: A more effective ceff. In: IEEE International Symposium on Quality Electronic Design (ISQED), pp. 648–653 (2005)
22. Raja, S., Varadi, F., Becer, M., Geada, J.: Transistor level gate modeling for accurate and fast timing, noise, and power analysis. In: ACM/IEEE Design Automation Conference (DAC), Anaheim, California, USA, pp. 456–461 (June 2008)
23. Venkataraman, G., Feng, Z., Hu, J., Li, P.: Combinatorial algorithms for fast clock mesh optimization. IEEE Transactions on VLSI Systems 18(1), 131–141 (2010)
24. Wang, X., Kasnavi, A., Levy, H.: An Efficient Method for Fast Delay and SI Calculation Using Current Source Models. In: IEEE International Symposium on Quality Electronic Design, Washington, DC, USA, pp. 57–61. IEEE Computer Society, Los Alamitos (2008)
25. Zolotov, V., Xiong, J., Abbaspour, S., Hathaway, D.J., Visweswariah, C.: Compact modeling of variational waveforms. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Piscataway, NJ, USA, pp. 705–712. IEEE Press, Los Alamitos (2007)



Controlled-Precision Pure-Digital Square-Wave Frequency Synthesizer*

Abdelkrim Kamel Oudjida, Ahmed Liacha, Mohamed Lamine Berrandjia, and Rachid Tiar

Microelectronics and Nanotechnology Division, Centre de Développement des Technologies Avancées (CDTA), Baba-Hassen, BP. 17,

16303 Algiers, Algeria {a_oudjida,liacha,mberrandjia,rtiar}@cdta.dz

Abstract. In this paper, a new pure-digital frequency synthesizer Fout = (X/Y)•Fin for square-waves with controlled precision is described. Fin is the input reference frequency provided by a stable crystal oscillator, Fout is the synthesized frequency, and X and Y are two co-prime integer numbers.

The purpose is to demonstrate that with exclusively simple digital techniques, a frequency synthesizer with high precision, fast switching time and medium frequency bandwidth can be achieved.

In conformity with design-reuse methodology, the frequency synthesizer is implemented as a technology-independent and generic IP-core, easily adaptable to suit any particular need.

Keywords: Precision, Frequency Bandwidth, Switching Time, Double-Edge-Triggered Flip-Flops (DETFF).

1 Introduction

High precision, wide bandwidth and fast switching time are the main required specifications for modern frequency synthesizers [1][2]. In the literature, there exists a plethora of solutions, but roughly speaking, all fall into one of two categories: analog solutions or digital ones. While analog solutions deliver better results, they remain very expensive as they are more difficult to design (requiring careful control of all active components), implement (especially in modern low-cost processes optimized for digital systems), and maintain (there is no possibility of "patching" the circuit).

Compared to their analog counterparts, digital solutions are more stable, but suffer from a serious drawback: limited frequency bandwidth.

One of the most recent and effective digital solutions is described in [3]. While this solution is based on an interesting mathematical concept, its corresponding hardware implementation presents many weaknesses: an oversized solution (adaptive control) to handle the precision problem, varying switching time, an unoptimized solution for frequency bandwidth (use of a time-consuming parallel multiplier and divider), and unknown equations for error, jitter and duty-cycle.



* This work was supported by "Centre de Développement des Technologies Avancées" (CDTA), Algiers, Algeria.

Based on the mathematical concept developed in [3], this paper introduces a new implementation alternative that overcomes all of the above-mentioned shortcomings.

The paper is organized as follows. In this section we have outlined the main requirement specifications for modern frequency synthesizers. Section 2 introduces the functioning principle of our proposed architecture. Section 3 deals with the theoretical aspects of the solution. Implementation results are discussed in Section 4. Finally, some concluding remarks are given in Section 5.

2 Functioning Principle of the Solution

Our architecture (Figure 1) is essentially composed of two readable/writable registers to store the X and Y co-prime integer numbers, an up (C1) and a down (C2) counter, an adder and a subtractor, and a crystal oscillator that generates a stable standard frequency Fc. A host-side interface is also included to read/write the X & Y registers on the fly.

Fig. 1. Block Diagram of the Frequency Synthesizer

Fin is sampled during each Fc period, such that Fc = K•Fin and the accumulated result (K•Y) in C1 is loaded into C2. Then, at each clock cycle of Fc, the X value is subtracted from C2 until C2≤0, such that Fout = Fc / [K•(Y/X)]. When Fc is replaced by K•Fin, we obtain: Fout = (X/Y) • Fin.
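This principle can be checked with a small behavioural model. The Python sketch below is an idealized averaged model only (the remainder of C2 is carried into the next load, and the load instant is not quantized to Fc edges as in the real hardware); it is not the Verilog IP itself and merely illustrates why the average output frequency converges to (X/Y)•Fin.

def fout_over_fin(K, X, Y, n_out_periods=10000):
    """Averaged behavioural model of the synthesizer core.

    C2 receives the accumulated value K*Y and is drained by X once per Fc
    cycle; one Fout period elapses when C2 reaches zero or below, and the
    remainder is carried into the next load. Returns Fout/Fin, which tends
    to X/Y, i.e. Fout = (X/Y)*Fin with Fc = K*Fin.
    """
    c2 = 0
    fc_cycles = 0
    for _ in range(n_out_periods):
        c2 += K * Y                    # load the accumulated result from C1
        while c2 > 0:                  # subtract X at every Fc clock cycle
            c2 -= X
            fc_cycles += 1
    avg_period_in_fc_cycles = fc_cycles / n_out_periods
    return K / avg_period_in_fc_cycles

# Example: K = 100, X = 3, Y = 7 gives Fout/Fin close to 3/7.
print(fout_over_fin(100, 3, 7))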

3 Theoretical Aspect of the Solution

3.1 Precision

The error in the digital frequency synthesizer is due to the missing fractional part after the cumulative arithmetic operation in C1 is terminated (K•Y rather than K•Y + r, where 0 ≤ r < Y).



Error is calculated as |Ftout − Fsout| / Ftout, where Ftout and Fsout are the desired theoretical frequency and the synthesized frequency, respectively. To minimize the error, a simple double sampling technique on rising (↑) and falling (↓) edges of Fc during N (N = 2^n for easy shift operations) periods of Fin is used, as depicted in Figure 2. This has the benefit of not only doubling the frequency bandwidth, as 2Fc is used instead of Fc, but also of considerably reducing the error, jitter and duty-cycle deviation.

Fig. 2. Double Sampling Technique on N Cycles of Fin (special case N=1, K=3; jitter on Fout = (Tc/2) • Floor(Y/X))

The final value obtained in C1, which can be either [(2•N•K+1)•Y] or [(2•N•K+2)•Y], is loaded into C2 and decremented (−N•X) on both ↑ and ↓ edges of Fc until C2 ≤ 0. This technique yields a maximum error of:

1 / [(2•N•K+1) • (Y/X) − 1]

And the maximum jitter between successive Fout is equal to:

[Tc/2] • Floor[Y/(N•X)]

To assure a duty-cycle as close to 50% as possible, another simple technique based on the use of an up and a down counter (C2 duplicated) is described in Figure 3. This technique guarantees a maximum duty-cycle of:

50% + [(X/Y) / (2K + 1/N)]%

However, to guarantee a duty-cycle between 40% and 60% (which is the norm), the following condition must be satisfied:

K ≥ Ceil[5•(X/Y) − 1/(2•N)]



As N & K are two generic parameters in RTL code, they can be individually set to reach any desired precision.
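For a given (X, Y, N, K), the expressions above can be evaluated directly at design time. The following Python sketch computes the maximum jitter, the maximum duty-cycle and the minimum K guaranteeing a 40-60% duty-cycle; Fc = K•Fin is assumed, as above, and the function is an illustration rather than part of the IP deliverable.

import math

def precision_checks(X, Y, N, K, f_in):
    """Evaluate the closed-form bounds quoted above (Fc = K * f_in assumed)."""
    T_c = 1.0 / (K * f_in)                                # period of Fc
    jitter_max = (T_c / 2.0) * math.floor(Y / (N * X))    # [Tc/2] * Floor[Y/(N*X)]
    duty_max_pct = 50.0 + 100.0 * (X / Y) / (2.0 * K + 1.0 / N)
    k_min = math.ceil(5.0 * (X / Y) - 1.0 / (2.0 * N))    # 40-60% duty-cycle condition
    return jitter_max, duty_max_pct, k_min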

Fig. 3. 50% Duty-Cycle Technique (down counter loaded with [(2K+1)•Y − N•X], up counter counting [0 + N•X]; special case N=1, K=2, X=2, Y=7; duty cycle: 50% + [(X/Y)/(2K+1/N)]%)

3.2 Switching Time

Switching time is the latency between any variation of the X or Y value and the corresponding change of Fout. The maximum switching time is N•Tin + 2Tc, where Tin and Tc are the periods of Fin and Fc, respectively.

3.3 Frequency Bandwidth

The RTL code is technology-independent, highly reconfigurable, and written according to the rules and recommendations given in [5]. The frequency bandwidth (Fc_Max) as well as the area occupation exclusively depend on the precision factor N, the K ratio, and the bit size of the Y register. To prevent overflow, all bit sizes of internal counters and registers are set to:

Ceil[log2((2•N•K+2) • 2^(Y_reg_bit_size))]

Our RTL coding style requires that Fin & Fout ≤ Fc_Max, given that Fc is the master clock of the circuit (Figure 1). The actual maximal rate at which Fc can run (Fc_Max) depends on the physical characteristics of the chip, either ASIC or FPGA.
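The sizing rule above is easy to evaluate; for example (Y_reg_bit_size is chosen by the integrator, the values below are purely illustrative):

import math

def counter_width(N, K, y_reg_bit_size):
    """Ceil[log2((2*N*K + 2) * 2^Y_reg_bit_size)]: internal counter/register width."""
    return math.ceil(math.log2((2 * N * K + 2) * 2 ** y_reg_bit_size))

# Example: N = 1, K = 10**6 and 8-bit X/Y registers give 29-bit internal counters.
print(counter_width(1, 10**6, 8))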

4 Implementation Results

To implement the double sampling technique, we needed Double-Edge-Triggered Flip-Flops (DETFF), which allow data to be registered on both rising and falling edges of the clock [6]. Unfortunately, these types of flip-flops are not integrated within Xilinx FPGAs [7], which were the sole implementation devices available to us. Therefore, to circumvent this hurdle, we extracted all the mathematical equations (Table 1)



describing the main features of the proposed architecture depending on how the two counters C1 and C2 are triggered. It is important to note that whatever the implementation case (I, II, III or IV of Table 1), any precision can be attained, since the precision factor N is a generic parameter that can be set accordingly. However, compared to case I, the maximum frequency bandwidth is divided by two in the other cases.

We implemented the solution corresponding to case II, where only C1 is double-triggered. This was achieved by using two simple-edge-triggered counters (C1 duplicated), running on opposite edges. When the Fin/2N signal toggles (Figure 1), the accumulated results of the two counters are delayed one Tc cycle in order to stabilize before being summed and then loaded into the C2 counter. Such a trick simplifies the timing analysis of the architecture; otherwise it becomes more complicated, as two types of clock-to-setup paths exist in the architecture: rising-to-falling and falling-to-rising Fc edge paths. In this case, two timing constraints Ch and Cl must respectively be observed on the high level and the low level of the Fc clock, such that Tc ≥ 2•Max(Ch, Cl).

Without the use of the trick just mentioned, Fc_Max would be significantly degraded. The whole design code, either for synthesis or functional verification, was implemented in Verilog 2001 (IEEE 1364). The RTL code was simulated at both the RTL and gate level (post place & route netlist) with timing back-annotation using ModelSim SE 6.3f, and mapped onto Xilinx FPGAs using the Foundation ISE 10.1 version. It is noteworthy to mention that all results, either for slice occupation or delays, are obtained using the default options of the implementation software (Foundation ISE 10.1) with the selection of the smallest FPGA device in each family (Virtex5 & Virtex2) with the fastest speed grade.

The design has undergone a thorough functional software verification procedure according to the IP development methodology summarized in [8]. As for the physical test, the synthesizer was integrated around a Microblaze SoC environment using a V2MB1000 demonstration board [4] with Xilinx's EDK 9.1i version. The obtained errors (Table 2) are compared to those given in [3].

To characterize our synthesizer in terms of speed and area, RTL code with N=1, various K ratios and 8-bit X & Y register sizes was mapped on recent (Virtex-5) and old (Virtex-2) FPGAs. The results are summarized in Table 3. The slice utilization ratio between the two FPGA families is not only due to the number of slices included, which are 4800 for the Virtex-5 and 256 for the Virtex-2, but also to the difference in the number of look-up-tables (LUTs) per slice: 2 LUTs of 4 inputs each for Virtex-2 devices, and 4 LUTs of 6 inputs each for Virtex-5 devices.

To characterize our synthesizer in terms of speed and area, RTL-code with N=1, various K ratios and 8-bits X & Y register-size were mapped on recent (Virtex-5) and old (Virtex-2) FPGAs. The results are summarized in Table 3. The slice utilization ratio between the two FPGA families is not only due to the number of slices included, which are 4800 for the Viertex-5 and 256 for the Virtex-2, but also because of the difference in the number of look-up-tables (LUTs) per slice, which is: 2 LUTs of 4 inputs each for Virtex-2 devices, and 4 LUTs of 6 inputs each for Virtex-5 devices.

As for speed, there is almost no significant difference in terms of delay with regard to large variations of the K factor (Table 3). Delays were calculated for two types of paths: Clock-To-Setup and all paths together (Pad-To-Setup, Clock-To-Pad, and Pad-To-Pad). The Clock-To-Setup (Table 3) gives more precise information on the delays than the other remaining paths, which depend on the I/O Block (IOB) configuration (low/high fanout, CMOS, TTL, LVDS, …).



Table 1. Main Features of the Architecture

Case  C1   C2   Error (General Case)           Error (SC)   Jitter                   Duty-Cycle (%)                Frequency Bandwidth     Switching Time
I     ↑↓   ↑↓   1/[(2•N•K+1)•(Y/X) − 1]        1/(2K)       (Tc/2)•Floor[Y/(N•X)]    50% + [(X/Y)/(2K+1/N)]%       Fin & Fout ≤ Fc_Max     N•Tin + 2Tc
II    ↑↓   ↑    1/[(2•N•K+1)•(Y/X)/2 − 1]      1/(K−1/2)    (Tc/2)•Floor[Y/(N•X)]    50% + [(X/Y)/(K+1/(2N))]%     Fin & Fout ≤ Fc_Max/2   N•Tin + 2Tc
III   ↑    ↑    1/[N•K•(Y/X) − 1]              1/(K−1)      Tc•Floor[Y/(N•X)]        50% + [(X/Y)/(K+1/N)]%        Fin & Fout ≤ Fc_Max/2   N•Tin + 2Tc
IV    ↑    ↑↓   1/[2•N•K•(Y/X) − 1]            1/(2K−1)     Tc•Floor[Y/(N•X)]        50% + [(X/Y)/(2K+1/(2N))]%    Fin & Fout ≤ Fc_Max/2   N•Tin + 2Tc

SC: Special Case for N=1, X=Y=1; ↑↓: Double-Edge-Triggering; ↑: Positive or Negative Simple-Edge-Triggering.

Table 2. Error Comparison

Fin (Hz)     K       1/(K−1/2) (%)   Our Design (Case II)             Stork's Design [3]
                                     Fout (Hz)        Error (%)       Fout (Hz)    Error (%)
1502         20713   0.0048          1502.04          0.0027          1502         0
2010         15478   0.0064          2010.03          0.0015          2014         0.1990
4008         7762    0.0128          4008.01          0.0002          4016         0.1996
6004         5181    0.0193          6003.84          0.0003          6026         0.3664
10008        3108    0.0321          10009.60         0.0160          10068        0.5995
20004        1555    0.0643          20006.40         0.0120          20258        1.2694
40000        777     0.1287          40012.80         0.0320          40980        2.45
100000       311     0.3220          100160.25        0.1602          106600       6.6
1000000      31      3.2786          1003610.99       0.3610          NA           NA
10000000     3       40.0000         10370646.92      3.7064          NA           NA

Special case: Fc = 31.111 MHz; N=1; X=Y=1. 1/(K−1/2) is the maximum theoretical error of case II for N=1 & X=Y=1. A Tektronix TLA-714 logic analyser has been used for the physical measurements.

Table 3. FPGA Mapping Results

         Virtex 5 xc5vlx30-3ff324*            Virtex 2 xc2v40-6-cs144+
K        Fc_Max (MHz)   Slice Utilization     Fc_Max (MHz)   Slice Utilization
10^6     20.35          2.02%                 12.11          73%
10^5     22.95          1.85%                 12.72          67%
10^4     23.20          1.77%                 12.93          59%
10^3     23.45          1.62%                 13.03          50%
10^2     28.39          1.43%                 13.24          44%
10^1     28.67          1.33%                 14.85          36%

Special case: N=1; X & Y register size = 8 bits. *: Total number of slices: 4800. +: Total number of slices: 256.



5 Conclusion

We have demonstrated, both theoretically and experimentally on an FPGA, the design of an effective square-wave frequency-synthesizer with controlled precision by using simple digital techniques. However, as RTL-code is technology-independent, a map-ping on a deep-submicron standard-cell library with DET Flip-Flops delivers a much higher frequency bandwidth.

Compared to the Xilinx DCM (Digital Clock Manager), besides offering a higher controlled precision, our IP is not tied to a particular process technology.

As for applications, our IP can advantageously be incorporated into any design requiring a high level of synchronization (greater clock resolution) between source and destination for frame data transfer, such as serial communication protocols: UART, SPI, I2C, OneWire, ...

References

1. Staszewski, R.B., Balsara, P.T.: All-Digital Frequency Synthesizer Design in Deep-Submicron CMOS. John Wiley & Sons, Chichester (2006) ISBN 0-471-77255-0

2. Manassewitsch, V.: Frequency Synthesizers: Theory and Design. Wiley-Interscience, Hoboken (2005) ISBN 0-471-77263-1

3. Stork, M.: Digital Fractional Frequency Synthesizer Based on Counters. Turkish Journal of Electrical Engineering and Computer Sciences 14(3) (2006) TÜBITAK

4. Xilinx Inc.: Virtex-II V2MB1000 Development Board User's Guide

5. Keating, M., Bricaud, P.: Reuse Methodology Manual for System-on-a-Chip Designs, 3rd edn. Kluwer Academic Publishers, Dordrecht (2002) ISBN 1-4020-7141-8

6. Pedram, M., et al.: A New Design for Double Edge Triggered Flip-Flops. In: Proceedings of the Asia and South Pacific Design Automation Conference, pp. 417–421 (February 1998)

7. Xilinx Inc.: Doubling Counter/Timer Resolutions with CoolRunner-II, XAPP910 (v1.0) (October 27, 2005)

8. Oudjida, A.K., et al.: Front-End IP Development: Basic Know-How. Revue Internationale des Technologies Avancées, Algeria, vol. 20, pp. 23–30 (December 2008) ISSN 1111-0902


An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis

Oliver Schrape1, Frank Winkler2, Steffen Zeidler1, Markus Petri1, Eckhard Grass1, and Ulrich Jagdhold1

1 IHP GmbH, Frankfurt (Oder), Germany
{schrape,grass,jagdhold,petri,zeidler}@ihp-microelectronics.com
2 Humboldt University Berlin, Berlin, Germany
[email protected]

Abstract. In this paper an All-Digital Phase-Locked Loop (ADPLL) with a high resolution and a wide frequency range for local on-chip clock generation is described. The proposed ADPLL has an operating range from 250 MHz to 1.3 GHz and a resolution of 25 ps. In contrast to other designs, the Digitally Controlled Oscillator (DCO) combines three different development approaches to achieve the desired performance. The ADPLL provides four different algorithms to control the DCO. Depending on the selected algorithm and the desired frequency, the lock-in time varies between 54 and more than a hundred reference cycles. The output of the synthesized clock is directly connected to a Low-Voltage Differential Signaling (LVDS) interface to provide a high-frequency LVDS clock. Before their VHDL implementation, all components were simulated using an event-driven Matlab model. The proposed ADPLL uses standard cell library elements only and is implemented in an IHP 0.25 µm BiCMOS process. The overall power dissipation is less than 50 mW (@ 800 MHz) with a 2.5 V power supply. Due to its VHDL description the design can be ported to other processes within a short development time.

Keywords: All-Digital Phase-Locked Loop, ADPLL, Clock Generator, Event-driven Matlab Model, LVDS, PID Controller.

1 Introduction

For clocking digital synchronous integrated circuits, Phase-Locked Loops (PLLs) are most widely used for frequency synthesis. In [1], R.E. Best introduced the linear PLL (LPLL), digital PLL (DPLL), all-digital PLL (ADPLL) and software PLL (SPLL) architectures. The fundamental structure of an ADPLL contains the four components shown in Figure 1. A Phase Frequency Detector (PFD) compares the phase of a reference clock with the phase of a divided clock and sends control signals to a Control Unit. This unit evaluates the generated control signals and provides the oscillator with a signal to control the frequency. The purpose of the frequency divider is to divide the generated clock by a programmable constant.


In lock-in mode, the output of the frequency divider has the same phase and frequency as the reference clock. Traditionally, as published by F. Herzel et al. in [3], the Control Unit and the oscillator are implemented as analog IP blocks, which are sensitive to process variations. These components have to be redesigned for each new manufacturing process. Due to noise coupling and power-supply noise effects, ADPLLs have gained attraction in recent years, since they reduce integration problems in a noisy digital environment.

When designing an ADPLL, two problems have to be considered carefully. The first one is how to design a Digitally Controlled Oscillator (DCO) with a wide frequency range and a high resolution. A selectable inverter chain, as published by S. Moorthi et al. in [5], provides a wide operating range. In order to achieve a high resolution, one can use a ring oscillator with parallel-connected tri-state inverters, as announced by T. Olsson et al. in [8]. Furthermore, using bus-keeper components, as published by D. Sheng et al. in [10], is an alternative approach.

The second problem is how to accelerate the frequency and phase convergence of the ADPLL. Simple digital clock generators are proposed in [7] and by P. Nilsson et al. in [6]. A time-to-digital converter (TDC), as published by D. Sheng et al. in [10], can be used to solve this problem. An alternative search-step algorithm that allows control depending on the convergence mode is proposed by Ching-Che Chung and Chen-Yi Lee in [2]. Similarly, recursive filters are powerful, as published in [8] and presented by J. Zhuang et al. in [11].

In this paper, an implementation of an ADPLL with selectable control algorithms is presented. With the design of the proposed DCO, an operating range of 1050 MHz can be achieved. Depending on the chosen algorithm, a deterministic jitter of less than 25 ps is obtainable. All components are described in VHDL and use standard cell library elements only. The structure and the behavior of each component are designed and illustrated. The simulation results are compared to earlier published designs.

Fig. 1. Abstract block structure of the proposed ADPLL core


2 Digitally Controlled Oscillator

In contrast to other published Digitally Controlled Oscillators (DCOs), which use one approach only, our proposed DCO merges different design approaches and needs few resources to achieve a wide frequency range with a high resolution. This component has three different tuning stages. The first one, the coarse tuning stage, is a slight modification of the selectable inverter chain proposed by S. Moorthi et al. in [5]. To achieve a wider frequency range the standard cells used need to have a short gate delay. Therefore, the chain elements consist of nine serially connected multiplexer structures. They are of different drive types and can be initialized by a NAND gate, as shown in Figure 2. The different timing paths are selectable via a 9-bit one-hot-encoded input signal MUX[8:0] connected to the multiplexer selectors. Depending on the value of the select signal, either the shorter left timing path or the longer right timing path of the multiplexer is used. The transition time of each structure is about 320 ps. These time differences would cause a huge deterministic jitter if the DCO oscillated between two timing paths. As published by D. Sheng et al. in [10], bus-keeper components are used to solve this problem.


Fig. 2. Schematic of the DCO

These components are designed using inverters and tri-state inverters of different drive types. They are connected in parallel to the last net in the feedback only. If one or more bits of the 5-bit control word BusK[4:0] are set to logic 1, two effects influence the selected timing path. First, the output of the enabled tri-state inverter increases the load of the feedback net, so the last multiplexer structure needs more time to change its logical value; this effect influences the four other bus-keepers if the LSB of BusK, for the leftmost tri-state inverter, is set. Second, the steepness of each rising or falling edge is smoothed due to the transition delay of the enabled bus-keeper component. Consequently, a finer resolution of less than 40 ps is achieved, whereby a smaller deterministic jitter is possible. To eliminate the timing leak, a third tuning stage is added.


Nine parallel-connected tri-state inverters are bound to the initial NAND gate, similar to the solution proposed by T. Olsson et al. in [8]. If one or more tri-state inverters are enabled by the 9-bit control signal Tri[8:0], an additional current drive is added to the multiplexer structures. The slight speedup results in a change of the propagation delay down to 1 ps. A CDL testbench of the DCO was simulated with Spectre MDL for each valid control word to ensure the timing behavior. The measurement results were parsed by scripts to generate a digital VHDL behavioral model automatically. The combination of the three approaches mentioned above results in an operating range from 250 MHz to 1.3 GHz with an average resolution step of less than 5 ps. For this performance our proposed DCO requires only 46 logic gates (see Table 2). For easier control, all valid control codes in the frequency range from 250 MHz to 1.3 GHz are sorted linearly in a look-up table with a maximum timing difference of less than 5 ps where possible. Nevertheless, the largest timing difference between two neighboring control codes is about 25 ps.
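The construction of such a sorted look-up table can be illustrated with a small sketch. This is an assumption-based illustration, not the authors' Spectre-MDL/script flow: it presumes a list of (control word, period) pairs from characterization, and the example data are synthetic placeholders.

# Illustrative sketch: build a monotonic look-up table of DCO control codes from
# characterization data (control_word, period_ps); keep only strictly increasing
# periods and report the largest step between neighbouring entries.

def build_dco_lut(measurements):
    """Sort valid DCO control words by oscillation period into a monotonic LUT."""
    by_period = sorted(measurements, key=lambda m: m[1])
    lut = [by_period[0]]
    for code, period in by_period[1:]:
        if period > lut[-1][1]:          # keep strictly increasing periods only
            lut.append((code, period))
    worst_step = max(b[1] - a[1] for a, b in zip(lut, lut[1:]))
    return lut, worst_step               # about 25 ps in the paper's characterization

# Example with synthetic data (placeholder values, not measured ones):
codes = [({"MUX": m, "BusK": b}, 770.0 + 35.0 * m + 7.0 * b)
         for m in range(9) for b in range(5)]
lut, step = build_dco_lut(codes)
print(len(lut), "LUT entries, largest neighbouring step =", step, "ps")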

3 Clock Divider

The clock divider is used to divide the clock signal generated by the DCO by a programmable constant factor. The resulting clock frequency is compared with the reference clock to derive a control signal indicating whether to increase or decrease the clock frequency generated by the DCO. As proposed by Shenggao Li et al. in [4] and F. Herzel et al. in [3], the generated clock of the DCO is divided by a dual-modulus prescaler. This component divides the DCO clock by 4 or by 5 (clk_45) depending on the logic value of the selector ctrl, which is controlled by a 2-bit swallow counter S. The output clock of the prescaler is divided by a 7-bit main counter M, which starts at the initially programmed value and counts down to zero. If the value equals zero, the signal start_s of the swallow counter is set to logic 1. Consequently, that counter starts to count its initial value down to zero; once zero is reached, ctrl is set to logic 0. The necessary values of M and S are programmable over an SPI-like interface. With this structure, the number of different adjustable frequencies is increased.

Due to the maximum frequency of the 4/5 clock divider, an additional prescaler can be enabled to divide the generated oscillator clock by 2. For this case, the generated feedback clock is given in Equation 1.

clk_fb = clk_dco / ((4 · M + S) · 2)    (1)
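A minimal numeric sketch of Equation (1): the function below only evaluates the formula for programmed M and S values (parameter names follow the text; the example numbers are illustrative assumptions).

# Sketch of equation (1): feedback frequency of the divider for a given DCO
# frequency and programmed counter values.

def feedback_clock(f_dco_hz, M, S, extra_div2=True):
    """Dual-modulus 4/5 prescaler with swallow counter S and main counter M."""
    assert 0 <= S < 4 and 0 < M < 128        # 2-bit swallow, 7-bit main counter
    divider = (4 * M + S) * (2 if extra_div2 else 1)
    return f_dco_hz / divider

# Example: an 800 MHz DCO clock divided down towards a 5 MHz reference.
print(feedback_clock(800e6, M=20, S=0))      # 5.0 MHz with the extra /2 enabled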

4 Phase Frequency Detector

As published by Ching-Che Chung et al. in [2], the PFD is composed of a self-resetting flip-flop structure. The phases of the 5 MHz reference clock and of the divided feedback clock are compared. If the feedback clock is delayed with respect to the reference clock, the flag_d output signal is set to logic 1.


Conversely, if the feedback clock leads the reference clock, the output signal flag_u is set to logic 1. The simulated dead zone of the PFD is less than 210 ps, which is limited by the cell delays of the flip-flops used in the self-resetting structure. If the phase error of the two clocks is smaller than the dead zone, both flags remain at logic 1. To ensure that no pulse-width violation occurs, a digital pulse amplifier is implemented, which enlarges the pulse width of the reset signals.
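A behavioural reading of the flag generation can be sketched as follows. This is illustrative only, not the fabricated self-resetting flip-flop structure; the single-edge timestamps and the dead-zone handling are assumptions based on the description above.

# Behavioural sketch of the PFD flags for one pair of rising edges.

DEAD_ZONE_PS = 210.0     # simulated dead zone reported in the text

def pfd_flags(t_ref_ps, t_fb_ps):
    """Return (flag_u, flag_d) for one reference edge and one feedback edge."""
    error = t_fb_ps - t_ref_ps
    if abs(error) < DEAD_ZONE_PS:
        return 1, 1      # phase error below the dead zone: both flags stay at logic 1
    if error > 0:
        return 0, 1      # feedback delayed with respect to the reference -> flag_d
    return 1, 0          # feedback leads the reference -> flag_u

print(pfd_flags(0.0, 500.0))   # -> (0, 1)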

5 Control Unit

In general, analog PLLs use a simple loop filter to regulate the control voltage of the Voltage Controlled Oscillator (VCO), which mostly results in a longer lock-in time. A shorter lock-in time can only be achieved using a filter structure of higher order, which is difficult to implement in an analog design flow. A digital control unit has much more potential to control the PLL behavior automatically. Finally, it calculates the binary code word (w) that influences the timing paths of the DCO. The value that is controlled is an index into a sorted array that contains the binary code words for the inputs of the DCO. Our proposed ADPLL has four different control algorithms, initially selectable via the SPI-like interface. For applications that do not need a fast lock-in time, a linear search algorithm and a binary search algorithm are implemented that guarantee frequency acquisition only. The width of the phase error (pwidth) is not used for the calculation of the next index; in these algorithms only the sign of the phase error is interpreted. Due to the time constant of the ADPLL and the simple control algorithm, a larger jitter is produced. Therefore, a counter cntx is implemented that allows a frequency change, by changing the control word, only every x reference cycles. This simple solution decreases the jitter enormously while simultaneously increasing the lock-in time. Both the linear and the non-linear algorithm require few logic resources but do not support phase locking. In addition to these algorithms, a recursive filter is selectable to control the DCO, which allows a shorter lock-in time.

5.1 Linear – Non-linear Controller

If the linear algorithm is selected, the DCO is initialized with the mean value of the number (#f) of valid frequency code words w. Depending on flag_u and flag_d of the PFD, the next higher or lower code word is chosen every x cycles. The worst case is a lock on the lowest or the highest frequency of the look-up table; accordingly, more than x · #f reference cycles are necessary to reach the desired frequency. For a shorter lock-in time, a binary search algorithm can be used instead of the linear counter. The number of cycles is then reduced to log2(#f) · x. When the desired frequency is found, the control unit oscillates between two neighboring code words.
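A compact sketch of the binary-search variant follows. It is an assumption-laden model: the code look-up table is taken to be ordered from slowest to fastest, and the PFD outcome is reduced to a single "too slow" predicate evaluated once per update.

# Sketch of the frequency-acquisition-only binary search controller, using only
# the sign of the phase error, with one control-word update every x cycles.

def binary_search_lock(num_codes, too_slow, x=1, max_cycles=10000):
    """too_slow(index) -> True if the DCO output at `index` is below the target."""
    lo, hi = 0, num_codes - 1
    idx = num_codes // 2                 # start from the mean value of the #f codes
    cycles = 0
    while lo < hi and cycles < max_cycles:
        cycles += x                      # the control word changes every x reference cycles
        if too_slow(idx):
            lo = idx + 1                 # assume higher index = faster code word
        else:
            hi = idx - 1
        idx = (lo + hi) // 2
    return idx

# Example: 512 codes, the target frequency sits at code 389 of the sorted LUT.
print(binary_search_lock(512, lambda i: i < 389))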

5.2 Recursive Filter

In order to achieve a shorter lock-in time, a recursive filter can be used, as proposed by T. Olsson et al. in [8].


Alternatively, the use of time-to-digital converter (TDC) components offers better performance, as published by D. Sheng et al. in [10]. In contrast to the two algorithms mentioned above, two additional PID control algorithms were developed to achieve a faster lock-in time. For this purpose, a local ring oscillator is implemented to clock two simple counters (cnt1, cnt2). These counters increase their values while the pulse-width indication signal pwidth of the PFD (see Figure 1) is at logic 1. The sum of the two counter values represents the measured phase error (pe). The output of the ordinary PID controller – the new control index w(n) of clock cycle n – is the sum of a proportional term (KP), an integral term (KI) and a derivative term (KD). The proportional term, referred to as the proportional gain, is the phase error pe multiplied by a constant factor P. The integration of the phase error is done by summing it up over time; this gives an accumulated offset that should have been corrected previously, multiplied by the constant integral gain I. The derivative term is the product of the constant derivative gain D and the difference between the previous and the current measured phase error. The resulting sum of the three terms KP, KI and KD represents the next valid index into the look-up table that contains the corresponding numerical control word.

w(n) = P · pe(n) + I · Σ_{k=0..n} pe(k) + D · (pe(n−1) − pe(n))    (2)

where the first, second and third summands are the proportional, integral and derivative terms KP, KI and KD, respectively.
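Equation (2) translates directly into a small controller sketch. It is illustrative only: the gains P, I, D and the sequence of phase-error values are assumptions, not the fabricated design.

# Sketch of equation (2): compute the next LUT index from the measured phase error.

class PidIndexController:
    def __init__(self, P, I, D):
        self.P, self.I, self.D = P, I, D
        self.acc = 0.0          # running sum of phase errors (integral term)
        self.prev_pe = 0.0      # pe(n-1)

    def next_index(self, pe):
        kp = self.P * pe                    # proportional term KP
        self.acc += pe
        ki = self.I * self.acc              # integral term KI
        kd = self.D * (self.prev_pe - pe)   # derivative term KD, as in (2)
        self.prev_pe = pe
        return kp + ki + kd                 # index into the sorted code look-up table

ctrl = PidIndexController(P=0.8, I=0.05, D=0.2)
for pe in [12, 7, 3, 1, 0]:                 # example phase errors in counter ticks
    print(round(ctrl.next_index(pe), 2))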

Figure 3 shows the result of a simulation of an event-driven Matlab model, as proposed by J. Zhuang et al. in [11]. The PLL with this PID controller locks at the desired frequency of 800 MHz after 85 reference cycles. The DCO oscillates between 4 neighboring frequencies in lock-in mode. The calculated average deterministic jitter after lock-in is about 4.1 ps.


Fig. 3. Histogram and frequency variation of the general PID Controller

5.3 Smoothing Recursive Filter

To achieve a better performance, the calculation of Equation 2 is slightly modified.


The new control index w(n*) is the difference between the last stored control word w(n−1) and the currently calculated code word w(n), as illustrated in the following equation.

w(n∗) = w(n − 1) − w(n) (3)

As illustrated in Figure 4, the modified PID controller requires fewer reference cycles and has a smoother approximation. In addition, the smoothed PID controller oscillates only between two frequencies. The average deterministic jitter after lock-in is about 1.6 ps.


Fig. 4. Histogram and frequency variation of the smoothing PID Controller

Table 1 shows that, after logic synthesis, the approximate power consumption of the PID algorithms is 64 times larger than that of the linear/binary-search version. The reasons for this are the necessary local ring oscillator and the additional registers for the calculation parts. Nevertheless, depending on the desired frequency and the adjusted filter parameters P, I and D, a lock-in after less than 55 reference cycles is possible.

Table 1. Properties of the implemented control algorithms

Algorithm        Area [mm2]   Power [µW]   Lock Time [cycles]
(non-)linear     0.024        0.5          500 - more than 1000
(smoothed) PID   0.108        32.25        < 70

6 Chip Layout

Figure 5 illustrates the complete prototype chip including IO pads. The layout was done as a block design. The core area of the ADPLL without IO pads is about 0.7 mm2. The stand-alone DCO is placed in the upper right corner with a fully separated supply voltage. Additionally, two standard cells (converters), CMOS-to-ECL and ECL-to-CMOS, are placed on the right side of the core area.


These are necessary to convert the 2.5 V CMOS voltage levels of the oscillator output to standard LVDS levels, and to use an external fast LVDS clock (dco_ext_p, dco_ext_n) as an input clock for the clock divider for separate testability. The differential LVDS IO pads are placed to the left of the two converters. Using these pads, clock rates of up to 1.3 GHz are supported, as well as inputs supplying the clock divider with an external frequency of up to 1.7 GHz for separate testability. The LVDS converters have their own supply voltage to be fully decoupled from the CMOS components.

Fig. 5. Prototype chip, 1.6 mm x 1.6 mm: (a) layout view of the ADPLL; (b) fabricated ADPLL chip (blocks: PFD, SPI, Control Unit, LVDS interface, DIV, DCO)

7 Simulation and Experimental Results

The proposed ADPLL is designed using standard library cells only. It was synthesized with a bottom-up synthesis flow. Figure 6 shows a digital worst-case SDF timing simulation. The PLL locks at the desired frequency of 560 MHz after 54 reference cycles. In this case, the smoothed recursive filter was chosen to stabilize the system. For this kind of simulation, the generated VHDL behavioral model was used. The deterministic jitter after lock-in is about 5 ps. In comparison to the other ADPLL designs shown in Table 2, our proposed implementation has a frequency range of 1050 MHz with a lock-in performance of less than 70 reference cycles. Due to the four implemented control algorithms, the additional local ring oscillator for the recursive filters and the LVDS interface, the power dissipation is larger than that of other implementations using a similar kind of process. Nevertheless, we achieve a fine resolution of less than 25 ps. Furthermore, the maximum lock-in time of less than 70 reference periods for our implementation, compared to 46 cycles for the proposal in [2], is caused by the limited number of possible control codes in our look-up table. In contrast to the published ADPLLs in [2] and [8], our proposed DCO structure has a wider operating range and requires fewer logic resources.


Fig. 6. Digital worst case simulation of the ADPLL

Table 2. Properties Comparison

Performance Parameter   Proposed ADPLL        [2]                 [8]             [9]
Process                 0.25 µm BiCMOS        0.35 µm CMOS        0.35 µm CMOS    0.18 µm CMOS
Core Area               0.83 mm2              0.71 mm2            0.07 mm2        0.0025 mm2
DCO Gates               46                    > 100               128             –
Power Dissipation       < 50 mW (@ 800 MHz)   100 mW (@ 500 MHz)  –               6.4 mW
Min Freq                250 MHz               45 MHz              170 MHz         0.1 MHz
Max Freq                1.3 GHz               510 MHz             360 MHz         282 MHz
Lock-in Time            < 70 cycles           < 46 cycles         ~ 60 cycles     < 5 cycles
Resolution              < 25 ps               < 5 ps              < 55 ps         –

8 Conclusion

In this paper, an ADPLL with high resolution and a wide frequency range is proposed. All components are written in VHDL and use standard cell library elements. An event-driven Matlab model of the ADPLL was developed to simulate at design level. The prototype of the ADPLL was fabricated in an IHP 0.25 µm BiCMOS process. In contrast to other published DCO implementations, our design combines three different approaches, whereby a fine-tuning step down to 1 ps and, simultaneously, a frequency range of 1050 MHz are possible. The implemented recursive filters allow a short lock-in after less than 70 cycles.


For a smoother approximation resulting in a lower output jitter, a new, slightly modified PID controller is introduced. Furthermore, an LVDS clock is also provided. The developed ADPLL is suitable for high-performance applications requiring a wide frequency range with a very low clock jitter.

References

1. Best, R.: Phase-Locked Loops: Design, Simulation, and Applications. McGraw-Hill, New York (February 2000)

2. Chung, C.C., Lee, C.Y.: An all-digital phase-locked loop for high-speed clock generation. IEEE Journal of Solid-State Circuits 38(2), 679–682 (2003)

3. Herzel, F., Osmany, S.A., Hu, K., Schmalz, K., Jagdhold, U., Scheytt, J.C., Schrape, O., Winkler, W., Follmann, R., Kohl, D.K.T., Kersten, O., Podrebersek, T., Heyer, H.V., Winkler, F.: An integrated 8–12 GHz fractional-N frequency synthesizer in SiGe BiCMOS for satellite communications. In: Analog Integrated Circuits and Signal Processing (January 2010)

4. Li, S., Ismail, M.: A 7 GHz 1.5-V dual-modulus prescaler in 0.18 µm copper-CMOS technology. Analog Integrated Circuits and Signal Processing 32, 89–95 (2002)

5. Moorthi, S., Meganathan, D., Janarthanan, D., Kumar, P.P., Perinbam, J.R.P.: Low jitter ADPLL based clock generator for high speed SoC applications. In: Proceedings of World Academy of Science, Engineering and Technology, vol. 32 (August 2008)

6. Nilsson, P., Torkelson, M.: A monolithic digital clock-generator for on-chip clocking of custom DSPs. IEEE Journal of Solid-State Circuits 31(5), 700–706 (1996)

7. Olsson, T., Nilsson, P., Meincke, T., Hemam, A., Torkelson, M.: A digitally controlled low-power clock multiplier for globally asynchronous locally synchronous designs. In: IEEE International Symposium on Circuits and Systems, vol. 3, pp. 13–16 (2000)

8. Olsson, T., Nilsson, P.: A Digital PLL made from Standard Cells. In: European Conference on Circuit Theory and Design, ECCTD 2001 (August 2001)

9. Reddy, B.S.P., Krishnaparsad, N., Moorthi, S., Perinbam, J.R.P.: An All Digital Phase Locked Loop for Ultra Fast Locking. In: Proceedings of National Conference on Engineering Trends in Engineering and Technology (2008)

10. Sheng, D., Chung, C.C., Lee, C.Y.: A Fast-Lock-In ADPLL with High-Resolution and Low-Power DCO for SoC Applications. In: IEEE Asia Pacific Conference on Circuits and Systems (2006)

11. Zhuang, J., Du, Q., Kwasniewski, T.: Event-driven modeling and simulation of a digital PLL. In: Proceedings of the IEEE International Behavioral Modeling and Simulation Workshop (2006)


Clock Network Synthesis with Concurrent Gate Insertion

Jingwei Lu, Wing-Kai Chow, and Chiu-Wing Sham

The Hong Kong Polytechnic University
[email protected], [email protected], [email protected]

Abstract. In VLSI digital circuits, the clock network plays an important role in the total performance of the chip. Clock skew and power dissipation are two major concerns in clock network synthesis. During topology generation, the locations of buffer and gate insertion are usually not available; despite local optimization, the global performance is therefore limited. In this paper, a novel approach to topology generation with concurrent gate insertion is proposed. Meanwhile, a strict clock slew constraint is enforced with comprehensive buffer insertion techniques. By clock gating, the switched capacitance of the clock tree is reduced, at an acceptable extra cost in the controller tree. The experimental results show that our approach performs well in reducing both clock skew and power dissipation.

1 Introduction

Clock signals are employed in VLSI digital systems to synchronize the active components of a design. Clock skew minimization has been a popular research topic during the past decades. Some early works [1,2] mainly concentrated on the even distribution of wirelength between the source and each terminal to achieve actual delay equalization. Afterwards, delay balancing [3] using the Elmore delay model [4] became prevalent, to acquire more accurate timing-delay information. The deferred-merge embedding (DME) technique was proposed in [5]; it can achieve zero clock skew with minimal wirelength. In topology generation, algorithms were proposed for unbuffered and ungated clock trees in [6], and for buffered but ungated clock trees in [7]. In the ISPD 2009 clock network synthesis contest [8], a voltage-variation-related objective named Clock Latency Range (CLR) was formulated; subsequent research work was proposed accordingly [9].

Twenty to fifty percent of the power usage is contributed by the clock network [10]. For power reduction, the application of clock gating is an effective approach in sequential circuits. The principal idea is to turn off idle modules and tree sections in order to cut down unnecessary switching power. Clock gating can be applied at the logic level [11], the register-transfer level [12] and the architecture level [13]. Nevertheless, besides logical information, the physical location of the modules should also be taken into account; otherwise wirelength overhead leads to wasted power. Some achievements were proposed with both logical and physical concerns. The algorithm in [14] showed a clock tree topology construction taking advantage of the activity patterns of modules. Moreover, activity similarity was considered in [15].


Besides, a gating method for microprocessor design was proposed in [16]. The algorithm constructed the topology in a bottom-up procedure with the objective of switched-capacitance minimization. Further on, in [17] a comprehensive technique with a recursive computation of the effective switched capacitance and a solution sampling on the merging segment set was discussed.

In this paper, we propose a novel synthesizer that constructs a binary clock tree in a bottom-up manner. Simultaneous optimization of the clock skew and the power dissipation is applied. The topology generator is responsible for a buffered and gated clock tree, and the clock gates are inserted concurrently. The major advantage of our work is that the downstream masking information of the subtrees is taken into account during each merging step. An algorithm named dual-MST [9] for topology generation is employed in our work, and its cost function is improved for power awareness. Besides, we enforce a stricter slew constraint along the whole clock network; thus the constraint on buffer and gate locations is emphasized. The experimental results show that our method can greatly reduce the power consumption of the clock network with proper gate insertion. Meanwhile, the clock skew and PVT variation can still be maintained within an acceptable range.

The rest of the paper is organized as follows. Some preliminary knowledge on tree construction and capacitance is given in Section 2. The details of our approach are discussed in Section 3, where the technique of power-aware topology generation with concurrent buffer and gate insertion is presented in detail. Experimental results are shown in Section 4. Finally we reach our conclusion in Section 5.

2 Preliminaries

2.1 Clock Tree and Controller Tree

Let T = {V, E} denote the clock tree. V = {vi | i = 1, 2, ..., mv} is the set of nodes, and E = {ej | j = 1, 2, ..., mv − 1} is the set of clock edges between a node vj and its corresponding parent. Let |ej| denote the length of edge ej. Clearly, no edge is assigned to the root node. Let G = {gi | i = 1, 2, ..., mv − 1} denote the set of gates; gate gj is assigned to edge ej, masking node vj directly. We use S = {vk | k = 1, 2, ..., ms} (where ms < mv) to denote the set of modules (the sinks, or leaf nodes). The remaining (mv − ms) nodes are named internal nodes. The root is said to be at level 0, and node vi is said to be at level ni if there are ni edges on the path from vi to the root of the tree. Moreover, we assume that the topology of the clock tree is a full binary tree: every internal node has exactly two children. The skew of T is the difference between the longest and the shortest signal delay from the source to any sink. As proposed in [16], we assume that the control logic is located at the center of the chip. Star routing is applied in the controller tree, denoted T^ctr. A control edge ENi in T^ctr transmits the enable signal to the respective gate gi on edge ei of the clock tree T. An example of a clock tree T and its controller tree T^ctr is shown in Figure 1.

During the operating time of a circuit, each module has its active and idle times, usually specified as different activity patterns. The activity patterns can be obtained by simulation of the design at the behavioral level [14].


Let Ai denote the activity pattern of the node vi. It is a binary string with 1s indicating the active periods and 0s indicating the idle periods of a sink or an internal node. If vi is a sink node, we can directly obtain Ai from the benchmark file. Otherwise, suppose vi is an internal node with two children vL and vR. The clock signal at vi must be enabled whenever its left or right child is active. Therefore, Ai is calculated by performing the bitwise OR operation on the activity patterns of vL and vR; hence Ai = AL ∪ AR. An example of a bottom-up activity pattern transmission is shown in Figure 2.

Let P(Ai) denote the activity of the node vi, and Ptr(Ai) its transition probability. These two factors are calculated from the corresponding pattern of vi as follows:

P(Ai) = ATno(Ai) / Len(Ai),    Ptr(Ai) = TRno(Ai) / (2 × (Len(Ai) − 1))    (1)

where ATno(Ai) is the number of active periods (1s) in Ai, TRno(Ai) is the number of transitions (10 or 01) in Ai, and Len(Ai) denotes the stream length of Ai.
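The activity-pattern bookkeeping of Equation (1) and the bitwise OR merge can be sketched as follows. This is illustrative only; the bitstring representation and the example patterns are assumptions.

# Sketch of the activity-pattern operations: '1' = active period, '0' = idle period.

def merge_activity(a_left, a_right):
    """Bitwise OR of the children's patterns: Ai = AL | AR."""
    return "".join("1" if l == "1" or r == "1" else "0" for l, r in zip(a_left, a_right))

def activity(a):
    """P(Ai): fraction of active periods, equation (1)."""
    return a.count("1") / len(a)

def transition_probability(a):
    """Ptr(Ai): transitions ('01' or '10') over 2*(Len-1), equation (1)."""
    transitions = sum(1 for x, y in zip(a, a[1:]) if x != y)
    return transitions / (2 * (len(a) - 1))

ai = merge_activity("0011100", "0000110")
print(ai, activity(ai), transition_probability(ai))   # 0011110 0.571... 0.1666...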


Fig. 1. A gated clock binary tree

2.2 Switched Capacitance

The power consumed by CMOS circuits consists of two components: static and dynamic power. The static power is mostly determined by the feature size and other technology parameters; therefore, in this paper we only consider dynamic power minimization. The dynamic power is defined as P = (1/2) · α · C · f · Vdd^2, where C is the total load capacitance of the circuit, f is the frequency of the clock signal and Vdd is the power supply. α is the number of switching events per clock cycle: for the clock tree α = 2, because there is one rising and one falling edge in each clock period, and α = 1 in the controller tree. Since f and Vdd are constant parameters in digital circuits, we can use the switched capacitance as a measure of the power usage. Assume a subtree Ti rooted at vi with a gate gi inserted, and let its controller tree be denoted Ti^ctr. The unmasked load capacitance of Ti is Cu_vi, and that of Ti^ctr is Cu_Ti^ctr = C_ENi + Cg, where Cg denotes the input capacitance of a gate.



Fig. 2. An example of activity pattern transmission

The downstream switched capacitance of vi is then SC_vi = Cu_vi · P(Ai). Similarly, the corresponding switched capacitance of the controller tree Ti^ctr is SC_Ti^ctr = (C_ENi + Cg) · Ptr(Ai).

The power consumption of a clock network is directly proportional to the average switched capacitance per clock cycle. The total switched capacitance is contributed by a gated and buffered clock tree T and a controller tree T^ctr. In order to reduce the switching activity, modules and clock tree sections can be disabled by clock gates during their inactive clock periods. From the above example, the original capacitance of node vi is Cu_vi; with gate gi inserted at vi, the resulting switched capacitance is SC_vi + SC_Ti^ctr. If SC_vi + SC_Ti^ctr < Cu_vi, the capacitance is reduced by the insertion of gi. A power-aware clock tree topology with proper buffer and gate insertion efficiently reduces the switched capacitance and hence cuts down the power usage of the circuit. Given the physical locations of the modules together with the models of wire, buffer and gate, the objective of our work is to construct a buffered and gated clock network and a controller network such that, subject to the two constraints of nominal zero skew and maximal slew rate, the average switched capacitance is minimized.
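The gating payoff test can be written as a one-line comparison. This is a minimal sketch; the capacitance and probability values in the example are placeholders, not benchmark data.

# Sketch of the gating decision: insert gate gi at vi only if the gated switched
# capacitance (downstream cap scaled by activity, plus controller-tree cost) is
# smaller than the ungated capacitance Cu_vi.

def switched_cap_with_gate(cu_v, c_en, c_gate, p_act, p_tr):
    """SC_vi + SC_Ti^ctr for a gate inserted at vi."""
    return cu_v * p_act + (c_en + c_gate) * p_tr

def gate_pays_off(cu_v, c_en, c_gate, p_act, p_tr):
    return switched_cap_with_gate(cu_v, c_en, c_gate, p_act, p_tr) < cu_v

# Example: a subtree with 500 fF unmasked load that is active 30% of the time.
print(gate_pays_off(cu_v=500.0, c_en=40.0, c_gate=35.0, p_act=0.3, p_tr=0.05))  # True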

3 Methodology

We build our clock tree with the dual-MST construction method [9], and the resulting clock tree is close to fully symmetric. In this paper, it is improved with a new cost function that takes both distance and power saving into account; as a result, the obtained topology yields both low power usage and small clock skew. A recursive buffer/clock-gate insertion method is developed for the bottom-up merging. A blockage-handling technique is also involved, because buffers and gates cannot be placed inside blockage regions. The Elmore model [4] is applied for clock delay computation, and the DME technique [18] is applied for wirelength minimization: a segment is used instead of a point to represent the set of merging locations, and deferred embedding is applied to reduce the total wirelength.

3.1 Power Aware Topology Generation

In order to save power, the nodes with a greater similarity of activity patterns should have a higher priority to be matched.


Assume va and vb to be a pair of nodes, as shown in Figure 2. If the corresponding activity patterns Aa and Ab are similar, the resulting activity Ai will have a shorter active period, and a smaller power cost will be incurred. Besides the concerns on activity patterns, an estimation of the merging cost Pwr(va, vb) is also required. This can be determined in multiple ways; for instance, we could actually merge the two nodes to obtain the exact connection information, but this requires exact buffer insertion and wire balancing and therefore takes longer. Instead, we develop a new method for estimating the potential switched capacitance. The Manhattan distance between the nodes va and vb is denoted by D(va, vb), and the Elmore delay difference of these two nodes by DLY(va, vb). The delay and power consumption per unit wirelength are denoted by ρD and ρP respectively, which are computed in advance for simulation reference. If DLY(va, vb)/ρD is smaller than D(va, vb), the two nodes can be merged without snaking wire, and the power cost is computed as

Pwr(va, vb) = ρP × D(va, vb) × P(Ai)    (2)

Otherwise, snaking is included, as shown in the following equation:

Pwr(va, vb) = ρP × (DLY(va, vb)/ρD) × P(Ai)    (3)

An improved power-aware dual-MST geometric matching technique is developed for topology construction; a specific definition of the geometric matching of one iteration can be found in [2]. The detailed description is shown in Procedure 1. It is a weighted perfect matching approach. Given a set of nodes V = {v1, v2, ..., vm}, we first construct a complete graph G = {V, E}. Let |V| and |E| denote the number of nodes and edges in the graph G, so |V| = m. Since G is a complete graph, every pair of nodes vi, vj is connected by an edge ei,j, E = {e1,2, e1,3, ..., em−1,m} and |E| = m(m−1)/2. The cost of matching two nodes vi and vj is denoted as fc(ei,j). Let M denote the matching result of G; M is composed of a group of edges and is a subset of E. The maximal pairing cost of M is denoted as Cmax; we approach a symmetric clock tree by reducing Cmax at each level. The merging cost fc(va, vb) is shown below, where α and β are the weights of the Manhattan distance and the estimated power cost, respectively.

fc(va, vb) = α × D(va, vb) + β × Pwr(va, vb) (4)

By means of this weighted cost function, node pairs with a greater similarity of switching activity and a shorter distance have a higher priority to be matched. Our approach to topology generation is based on concurrent gate insertion; therefore the downstream information of the two merging nodes is accurate.
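Equations (2)-(4) combine into a short cost routine. This is a sketch under the assumption that ρD, ρP, α and β are known per-design constants and that D and DLY are supplied by the tree under construction; the numbers in the example are placeholders.

# Sketch of the matching cost fc(va, vb) of equation (4), with the power cost
# Pwr(va, vb) of equations (2)-(3): snaking is needed when the wire required to
# balance the delay (DLY/rho_d) exceeds the Manhattan distance.

def power_cost(dist, delay_diff, p_activity, rho_d, rho_p):
    wire_len = max(dist, delay_diff / rho_d)      # eq. (2) if dist dominates, else eq. (3)
    return rho_p * wire_len * p_activity

def matching_cost(dist, delay_diff, p_activity, rho_d, rho_p, alpha, beta):
    """fc(va, vb) = alpha * D + beta * Pwr, equation (4)."""
    return alpha * dist + beta * power_cost(dist, delay_diff, p_activity, rho_d, rho_p)

print(matching_cost(dist=1200.0, delay_diff=0.9, p_activity=0.4,
                    rho_d=0.001, rho_p=0.00016, alpha=2.0, beta=1.0))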

3.2 Concurrent Gate and Buffer Insertion

A recursive buffer and gate insertion technique is developed with three objectives: (1) meeting the slew-rate constraint, (2) clock skew minimization and (3) power usage reduction.


Procedure 1. Partition(G)
Require: G = {V, E} is a complete graph, E is sorted in ascending order of fc(ei,j).

  if |V| <= 1 then
    return;
  else if |V| = 2 then
    merge(v1, v2);
    return;
  else
    Build the dual-MST with |V| − 2 edges inserted.
    Two subgraphs G' = {V', E'} and G'' = {V'', E''} are generated.
    Two minimum spanning trees st' and st'' for V' and V'' are generated.
    if |V'| is odd and |V''| is odd then
      em,n = argmin over ei,j of { fc(ei,j) | for all vi in V', vj in V'' };
      merge(vm, vn);
      remove vm from V';
      remove vn from V'';
      remove em,x from E', for all x in V';
      remove en,y from E'', for all y in V'';
    end if
    Partition(G');
    Partition(G'');
    return;
  end if
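The control flow of Procedure 1 can be sketched as follows. The dual-MST bipartition itself is only outlined in the paper and detailed in [9], so the sketch takes the bipartition as a plug-in function (here a simple coordinate split is used as a stand-in assumption); only the odd/odd cross-merge and the recursion follow the procedure above.

# Simplified recursive sketch of Procedure 1 (not the full dual-MST algorithm).

def partition(nodes, cost, bipartition, merge):
    """nodes: list of node ids; bipartition(nodes) -> (V1, V2); merge(a, b) pairs two nodes."""
    if len(nodes) <= 1:
        return
    if len(nodes) == 2:
        merge(nodes[0], nodes[1])
        return
    v1, v2 = bipartition(nodes)                   # dual-MST split in the real flow
    if len(v1) % 2 == 1 and len(v2) % 2 == 1:
        # Both halves odd: merge the cheapest cross pair so each half becomes even.
        a, b = min(((x, y) for x in v1 for y in v2), key=lambda p: cost(*p))
        merge(a, b)
        v1.remove(a)
        v2.remove(b)
    partition(v1, cost, bipartition, merge)
    partition(v2, cost, bipartition, merge)

# Stand-in usage: split by x-coordinate instead of the dual-MST (assumption).
pts = {i: (x, y) for i, (x, y) in enumerate([(0, 0), (1, 0), (5, 1), (6, 0), (2, 3), (7, 3)])}
dist = lambda a, b: abs(pts[a][0] - pts[b][0]) + abs(pts[a][1] - pts[b][1])
split = lambda ns: (sorted(ns, key=lambda n: pts[n][0])[: len(ns) // 2],
                    sorted(ns, key=lambda n: pts[n][0])[len(ns) // 2:])
partition(list(pts), dist, split, lambda a, b: print("merge", a, b))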

Buffers are utilized to provide drive strength and restrict the signal transition time, while clock gate insertion can reduce the switched capacitance by disabling idle sections. Real-time simulation of the signal slew rate costs far too much time and is impractical; hence we build look-up tables in advance for slew reference, which estimate the driving ability under diverse circumstances. We model the buffer and the gate with corresponding attributes for Elmore delay computation. Some previous works [19] have already proposed the construction of a buffered clock tree with zero clock skew; in our work, we apply a similar approach for both buffer and clock gate insertion. The input/output capacitance and resistance of the buffers and clock gates are obtained first, and the delays of wires, buffers and clock gates are then computed based on the Elmore RC model.

In our work, we try to keep the level of buffers and gates of every source-to-sink clock path exactly the same. During the bottom-up binary merging procedure, we first examine the gate levels of the two downstream subtrees. If they differ by two or more, a penalty cost is applied, and such a matching result will probably be discarded due to the huge cost. Buffer levels are balanced accordingly. By means of this level balancing, the clock skew is reduced significantly, and the negative effect caused by signal variation is reduced.

Here we describe our technique of gate insertion based on a determined matching result. We first define three different kinds of gate insertion: virtual gate insertion at the upstream level, temporal gate insertion at the current level, and no gate insertion. Temporal insertion is controlled by the balancing of gate levels and is further divided into two kinds of single gate insertion and one kind of back-to-back double gate insertion.


The insertion of a gate is assumed to be closest to the internal merging node, for the sake of switched-capacitance minimization. Since the DME technique is applied in our work, we assume the middle point of the merging segment to be the gate location. The comparison among the three gate insertion options is based on the resulting switched capacitances, SCvir, SCtmp and SCnon, respectively. If the power consumption of the virtual insertion or of no insertion is the smallest, inserting no gate at the current level will result in less switched capacitance than the temporal gate insertion; therefore, we discard any insertion of gates at the current level. Otherwise, temporal gate insertion will probably reduce the switched capacitance more than the other options, and we accept the insertion of gates.

An example is shown in figure 2. The activity Ai equals to Aa ∪ Ab. The edgeconnection between each of the two nodes to the merging node are denoted as ea andeb. Cea and Ceb

are their corresponding capacitance cost. The equations to compute thethree resulting switched capacitance are shown as below

SCvir (va, vb) = (Cua + Cea + Cu

b + Ceb) × P (Ai) + Cu

T ctri

× Ptr (Ai) (5)

SCtmp (va, vb) = (Cua + Cea ) × P (Aa) + Cu

T ctra

× Ptr (Aa) + Cub + Ceb

(6)

SCnon (va, vb) = Cua + Cea + Cu

b + Ceb(7)

Notice that we only give the equation of SCtmp for a single gate insertion at node va; the equations for the other insertion variants can be derived in a similar way.
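The choice among Equations (5)-(7) can be sketched as follows. This is illustrative only: it shows the single temporal insertion at va, as in the text, and all numeric values are placeholders.

# Sketch of equations (5)-(7) and the resulting gate-insertion choice.

def sc_virtual(cu_a, ce_a, cu_b, ce_b, cu_ctrl, p_i, ptr_i):
    return (cu_a + ce_a + cu_b + ce_b) * p_i + cu_ctrl * ptr_i            # eq. (5)

def sc_temporal(cu_a, ce_a, cu_b, ce_b, cu_ctrl_a, p_a, ptr_a):
    return (cu_a + ce_a) * p_a + cu_ctrl_a * ptr_a + cu_b + ce_b          # eq. (6)

def sc_none(cu_a, ce_a, cu_b, ce_b):
    return cu_a + ce_a + cu_b + ce_b                                      # eq. (7)

def choose_insertion(cu_a, ce_a, cu_b, ce_b, cu_ctrl, cu_ctrl_a, p_i, ptr_i, p_a, ptr_a):
    vir = sc_virtual(cu_a, ce_a, cu_b, ce_b, cu_ctrl, p_i, ptr_i)
    tmp = sc_temporal(cu_a, ce_a, cu_b, ce_b, cu_ctrl_a, p_a, ptr_a)
    non = sc_none(cu_a, ce_a, cu_b, ce_b)
    # Accept the temporal gate only if it beats both alternatives.
    return ("temporal" if tmp < min(vir, non) else "none/virtual"), (vir, tmp, non)

print(choose_insertion(cu_a=300.0, ce_a=20.0, cu_b=250.0, ce_b=15.0,
                       cu_ctrl=75.0, cu_ctrl_a=75.0,
                       p_i=0.6, ptr_i=0.08, p_a=0.2, ptr_a=0.05))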

4 Experimental Results

In this section, our experimental results are presented. We implemented our clock network synthesizer in the C programming language. The binary is executed on a Linux machine with an Intel Core2 Quad 2.4 GHz CPU and 4 GB memory. The benchmark circuits used in the experiments are taken from the ISPD 2009 CNS contest [8]; detailed information on the benchmark circuits is shown in Table 2. In our experiments, one type of wire and one type of buffer is used in our clock tree synthesizer. The unit resistance of the wire is 0.0003 Ω/nm, and the unit capacitance of the wire is 0.00016 fF/nm. The specific configuration of the buffer in different sizes is shown in Table 1. In our synthesizer, the maximum buffer size is set to 6; hence we list the attributes of the different buffer sizes up to 6. The corresponding attributes of a gate are listed in the last row of Table 1. This table is generated from our SPICE simulation statistics.

Cb denotes the input capacitance, Rb the driver resistance and db the internal delay of a buffer. During the evaluation, the power supply is set to Vdd = 1.0 V. The PTM models applied in our simulation are of the 45-nanometer scale.

A summary of the performance of our clock tree after insertion of clock gates is given in Table 3. We run our program with different values of α and β for topology tuning. The clock skew (SKEW), total capacitance (TC), optimal capacitance (OSC), switched capacitance (SC) and CPU time are listed. The respective units are picoseconds (ps) for SKEW, seconds for CPU time and femtofarads (fF) for the capacitances.


Table 1. Buffer configuration

Buffer size   Cb (fF)   Rb (Ω)   db (ps)
1             35        66.9     4.92
2             70        40.5     5.63
3             105       31.3     6.13
4             140       26.4     6.52
5             175       25.0     6.95
6             210       20.7     7.20
gate          35        52.45    17.03

Table 2. Circuit information of the benchmarks from ISPD 2009

Circuits      Chip Size (mm x mm)   No. of sinks   No. of blockages (Area %)   CAP limit (fF)
ispd09f11     11.0 x 11.0           121            0 (0%)                      118000
ispd09f12     8.1 x 12.6            117            0 (0%)                      110000
ispd09f21     12.6 x 11.7           117            0 (0%)                      125000
ispd09f22     11.7 x 4.9            91             0 (0%)                      80000
ispd09f31     17.1 x 17.1           273            88 (24.38%)                 250000
ispd09f32     17.0 x 17.0           190            99 (34.26%)                 190000
ispd09fnb1    2.6 x 2.1             330            53 (37.69%)                 42000
ispd09f33     15.3 x 15.3           209            80 (27.68%)                 195000
ispd09f34     16.0 x 16.0           157            99 (38.67%)                 160000
ispd09f35     15.3 x 15.3           193            96 (33.22%)                 185000
ispd09fnb2    6.4 x 4.4             440            1346 (63.88%)               88000
avg.          12.1 x 11.6           203            169 (23.62%)                140273

TC denotes the original total capacitance cost of the clock tree without gate insertion. OSC denotes the resulting capacitance after disabling all the idle periods at each node. SC denotes the resulting switched capacitance of our gated clock tree. It can be seen that SC is mostly smaller than TC, which indicates an effective power reduction in our gated clock tree construction. The nominal skew of each clock tree is zero; additionally, we use NGSPICE for further evaluation and obtain the accurate skew estimations listed in the table. The activity patterns of all sinks are generated according to the instruction and RTL description used in [16]; the length of the activity pattern is 10000 for every benchmark. Previous works were proposed with loose constraints on slew or driving power supply, for instance ≤ 20 × Cg for a buffer or gate insertion in [17,16], and the work in [14] did not involve clock routing and synthesis. In our program, however, the transition time (slew rate) is kept under 100 ps throughout the whole clock network, so more buffers are inserted to obey this rule. A direct comparison with previous works is therefore very difficult in this paper. It can be expected that the power cost of our work is larger than that of the previous ones, but the signal transition time is more consistent, hence the work is more practical in use. Generally, in our work the switched capacitance can be reduced by around 10% with the insertion of clock gates. Meanwhile, the clock skew is only about 20 ps on average. The runtime of our program is less than 3 seconds, which represents good efficiency.


Table 3. Clock skew and switched capacitance with gate insertion

              Our approach (α = 1, β = 0)                Our approach (α = 2, β = 1)
Circuits      SKEW   TC      OSC     SC      CPU    SKEW   TC      OSC     SC      CPU
ispd09f11     20.0   103973  61868   78939   0.37   16.7   103851  61422   78261   0.37
ispd09f12     17.2   104874  65539   78970   0.34   16.6   103998  65090   79603   0.35
ispd09f21     20.0   118028  68813   89140   0.35   25.7   108116  67586   81043   0.35
ispd09f22     15.6   69810   43786   53173   0.32   8.5    69552   43938   53597   0.32
ispd09f31     33.7   221639  136596  179336  3.83   19.3   220522  128744  174024  5.60
ispd09f32     33.4   175122  101850  138156  0.51   21.7   162525  103658  123151  0.50
ispd09f33     20.6   171747  107773  139467  5.44   18.8   155995  100329  128386  6.30
ispd09f34     22.2   144688  92341   118570  0.49   20.3   139518  88924   109183  0.46
ispd09f35     16.9   165546  104232  134708  8.11   21.6   163376  102231  128963  8.13
ispd09fnb1    18.6   32635   23452   32635   0.70   29.6   34370   24869   34370   0.63
ispd09fnb2    19.7   67041   46550   66280   2.40   27.5   70478   50113   69788   1.90

avg.          21.6   125009  77527   100852  2.08   20.6   121118  76082   96397   2.26

5 Conclusion

In conclusion, power saving and clock skew are two major concerns in clock network synthesis. A power-aware topology generation with concurrent buffer/gate insertion is proposed in this paper, developed in order to optimize the clock skew and the power dissipation of a clock distribution network simultaneously. Experimental results show that our method can greatly reduce the switched capacitance, and hence the power consumption, of the clock network with proper clock gate insertion. Meanwhile, the clock skew can still be maintained within an acceptable range.

Acknowledgement

The work described in this article was partially supported by the RGC Direct Allocation Fund from The Hong Kong Polytechnic University (Project No. A-PC0W).

References

1. Jackson, M.A.B., Srinivasan, A., Kuh, E.S.: Clock Routing for High-Performance ICs. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 573–579 (June 1990)

2. Kahng, A., Cong, J., Robins, G.: High-Performance Clock Routing Based on Recursive Geometric Matching. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 322–327 (June 1991)

3. Tsay, R.S.: Exact Zero Skew. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 336–339 (November 1991)

4. Elmore, W.C.: The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers. Journal of Applied Physics 19(1), 55–63 (1948)

5. Boese, K.D., Kahng, A.B.: Zero-Skew Clock Routing Trees With Minimum Wirelength. In: Proceedings of the 5th Annual IEEE International ASIC Conference and Exhibit, pp. 17–21 (1992)

6. Edahiro, M.: A Clustering-Based Optimization Algorithm in Zero-Skew Routings. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 612–616 (June 1993)

7. Chaturvedi, R., Hu, J.: Buffered Clock Tree for High Quality IC Design. In: Proceedings of the International Symposium on Quality Electronic Design, pp. 381–386 (2004)

8. Sze, C.N., Restle, P., Nam, G.-J., Alpert, C.: ISPD 2009 Clock Network Synthesis Contest. In: Proceedings of the ACM International Symposium on Physical Design, pp. 149–150 (March 2009)

9. Lu, J., Chow, W.K., Sham, C.W., Young, E.F.Y.: A Dual-MST Approach for Clock Network Synthesis. In: Proceedings of the Asia and South Pacific Design Automation Conference, pp. 467–473 (January 2010)

10. Kitahara, T., Minami, F., Ueda, T., Usami, K., Nishio, S., Murakata, M., Mitsuhashi, T.: A Clock-Gating Method for Low-Power LSI Design. In: Proceedings of the Asia and South Pacific Design Automation Conference, pp. 307–312 (January 1998)

11. Chang, C.M., Huang, S.H., Ho, Y.K., Lin, J.Z., Wang, H.P., Lu, Y.S.: Type-Matching Clock Tree for Zero Skew Clock Gating. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 714–719 (June 2008)

12. Donno, M., Ivaldi, A., Benini, L., Macii, E.: Clock-Tree Power Optimization based on RTL Clock-Gating. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 622–627 (June 2003)

13. Luo, Y., Yu, J., Yang, J., Bhuyan, L.: Low Power Network Processor Design Using Clock Gating. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 712–715 (June 2005)

14. Farrahi, A.H., Chen, C., Srivastava, A., Tellez, G., Sarrafzadeh, M.: Activity-Driven Clock Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20(6), 705–714 (2001)

15. Chen, C., Kang, C., Sarrafzadeh, M.: Activity-Sensitive Clock Tree Construction for Low Power. In: International Symposium on Low Power Electronics and Design, pp. 279–282 (2002)

16. Oh, J., Pedram, M.: Gated Clock Routing for Low-Power Microprocessor Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20(6), 715–722 (2001)

17. Chao, W.C., Mak, W.K.: Low-Power Gated and Buffered Clock Network Construction. ACM Transactions on Design Automation of Electronic Systems 13(1) (January 2008)

18. Chao, T.H., Hsu, Y.C., Ho, J.M.: Zero Skew Clock Net Routing. In: Proceedings of the IEEE/ACM Design Automation Conference, pp. 518–523 (July 1992)

19. Chen, Y.P., Wong, D.F.: An Algorithm for Zero-Skew Clock Tree Routing with Buffer Insertion. In: European Design and Test Conference, pp. 230–236 (March 1996)


Modeling Time Domain Magnetic Emissions of ICs

Victor Lomne1, Philippe Maurine1, Lionel Torres1, Thomas Ordas1,2, Mathieu Lisart2, and Jerome Toublanc3

1 LIRMM, UMR 5506, University of Montpellier 2 / CNRS
161, rue Ada, 34095 Montpellier, France
{firstname.lastname}@lirmm.fr
2 STMicroelectronics
190 Avenue Celestin Coq, 13106 Rousset, France
{firstname.lastname}@st.com
3 Apache Design Solutions
300 route des Cretes, 06902 Sophia-Antipolis, France
{firstname}@apache-da.com

Abstract. ElectroMagnetic (EM) radiation of Integrated Circuits (ICs) has for many years been a major problem from an ElectroMagnetic Compatibility (EMC) point of view. With the increasing use of secure embedded systems and the appearance of new attacks based on the exploitation of the physical leakages of such secure ICs, it is now also a critical problem for secure IC designers. Indeed, the EM radiation of an IC, and more precisely its magnetic component, can be exploited to retrieve sensitive data such as the secret key of cryptographic algorithms. Within this context, this paper introduces a magnetic field simulation flow that predicts, with high spatial and time resolution, the magnetic radiation of IC cores. Such a flow is mandatory to predict the robustness of secure ICs against EM attacks before fabrication.

1 Introduction

With the ever-increasing speed and power consumption of ICs, the EM interference generated by chips is becoming a more and more challenging issue from an EMC point of view.

To prevent these problems, designers need to simulate these EM radiations during the IC design flow. Different simulation methods and tools have been developed to ensure that the EM radiations emitted by the different parts of an electronic system do not interfere with the others.

Most of these tools model the circuit, and more precisely its pads, its internal Power/Ground network and its digital macro-blocks, using passive RLC elements and current sources.

While these models (IBIS [1], ICEM [2] or IMIC [3]) and the related tools have proven efficient at predicting the EM radiation of a circuit as a whole (leads, bonding, package and IC core), they are too coarse-grained to address problems which are specific to the design of secure circuits.

Other tools, like CST Studio [4], can compute a complete 3D EM simulation of any electronic device with very high spatial and time resolutions. However, this kind of tool needs to solve the Maxwell equations at a large number of positions in the device, and the CPU time necessary to model complex ICs made of several hundred thousand gates is not reasonable for a designer.

From a hardware security point of view, with the ever-increasing use of embedded systems to manage sensitive data, a new kind of threat appeared at the end of the 20th century. These threats are called Side-Channel Attacks (SCA); they exploit physical leakages, such as the power consumption or the EM radiation emanating from an IC while it computes a cryptographic operation.

Among these threats, the major ones are the Simple ElectroMagnetic Analysis (SEMA) and the Differential ElectroMagnetic Analysis (DEMA) [5].

A SEMA consists in analysing a single EM trace of a cryptographic operation, measured with the Surface Scan method [6] using a small magnetic probe made of a coiled wire with a diameter between 50 μm and 500 μm. The measured trace is the evolution over time of the magnetic field radiated by the IC.

By applying a SEMA at different positions above the IC, it is thus possible to compute static and dynamic (time domain) EM cartographies [7]. Furthermore, advanced techniques based on signal processing have been proposed to localize the crypto module [8,9,10].

A DEMA exploits several EM traces corresponding to several cryptographic operations using the same key. It consists in a statistical processing of these traces in order to guess the key. More precisely, it exploits the variations in amplitude of the EM traces, which are correlated to the processed data.
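To make this statistical step concrete, the sketch below ranks key-byte guesses by correlating a hypothetical Hamming-weight leakage model with the measured traces. It is only an illustrative, simplified view of a correlation-based analysis under assumed names and leakage model; it is not the attack described in [5].

```python
import numpy as np

def dema_correlation(traces, data_bytes):
    """Rank key-byte guesses by correlating a hypothetical leakage model
    with measured EM traces (illustrative sketch, assumed leakage model).

    traces     : (n_traces, n_samples) array of EM amplitude samples
    data_bytes : (n_traces,) array of the processed data byte
    """
    scores = np.zeros(256)
    for guess in range(256):
        # Hypothetical leakage: Hamming weight of data XOR key guess
        hyp = np.array([bin(d ^ guess).count("1") for d in data_bytes])
        # Correlate the hypothesis with every time sample, keep the peak
        corr = [abs(np.corrcoef(hyp, traces[:, t])[0, 1])
                for t in range(traces.shape[1])]
        scores[guess] = max(corr)
    return np.argsort(scores)[::-1]  # most likely key bytes first
```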

Note that, although these methods are called ElectroMagnetic Analyses, it is usually the magnetic field which is measured, with the Surface Scan method [6] and a magnetic probe.

Considering these threats, the basic design guidelines to increase the robustness of an IC are:

– to reduce as far as possible the EM radiations of the cryptographic modules, or
– to hide them within the EM radiations of other blocks, or finally
– to design the circuit so as to obtain unintelligible EM radiations.

However, adopting these basic guidelines requires the development of a flow able to predict, at design time and with high time and spatial resolutions, the magnetic field generated by a circuit in the close vicinity of its surface.

Within this context, the main contribution of this paper is the proposal of an industrial flow that predicts the time domain evolution of the magnetic radiations with high accuracy and with high spatial and time resolutions.

The rest of this paper is organized as follows. Section 2 provides an overview of the simulation flow and then details its main features. Section 3 gives an experimental validation of our flow applied to two complex ICs. Finally, a conclusion is drawn in Section 4.

2 Magnetic Field Simulation Flow

Due to the ever-increasing demand for performance, industrial integrated products have moved from simple ICs to complex integrated systems, known as Systems-on-Chip (SoC), which consume a significant amount of power.

To distribute this power efficiently to the basic elements of a SoC, more or less complex Power/Ground (P/G) networks are designed according to specific design guidelines addressing different signal integrity problems such as IR drops.

As a result, the current consumed by a circuit typically flows from the top metal layers, characterized by a lower resistivity, down to the logic gates, regularly and hierarchically, in order to minimize static and dynamic IR drops.

2.1 Basic Concept

Consequently, the P/G networks of complex SoCs, and especially the part routed on the top-level metal wires, constitute the main sources of magnetic emissions, as experimentally observed in [7], since high-amplitude currents (several mA) flow within them. On the contrary, interconnect wires, which are much more resistive and driven by simple logic gates, are weaker sources of magnetic emissions.

From these considerations, supported by experimental results, it appears that modeling the magnetic radiations of a complex SoC mainly amounts to modeling the magnetic radiations of its P/G network.

Since our magnetic field simulation flow aims at being as general as possible, the backbone of our modeling approach, represented in Figure 1, follows these steps:

– cut the P/G network into small pieces of metal, considered as small electrical dipoles, and simulate the current within each of these pieces of wire
– compute the magnetic field generated by each dipole at several positions on a plane parallel to the IC surface (a grid), according to the Biot-Savart law
– obtain the magnetic emissions of the IC at each coordinate of this plane by summing the contributions of all these dipoles
– compute what can actually be seen by measurement, i.e. take into account the main characteristics of the measurement setup assumed to be used.

Although this approach is simple, it requires computing the time domain evolution of the current flowing in each dipole with a high time resolution.
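As a simple illustration of the first step, the sketch below cuts a straight P/G wire into electrical dipoles no longer than a chosen pitch. The data layout and the pitch value are assumptions made for illustration, not the format used by the authors' tools.

```python
import numpy as np

def cut_wire_into_dipoles(p_start, p_end, max_len_um=10.0):
    """Split a straight P/G wire (3D endpoints, in micrometres) into short
    segments that can each be treated as one electrical dipole."""
    p_start, p_end = np.asarray(p_start, float), np.asarray(p_end, float)
    length = np.linalg.norm(p_end - p_start)
    n_seg = max(1, int(np.ceil(length / max_len_um)))
    # n_seg + 1 evenly spaced points along the wire
    points = [p_start + (p_end - p_start) * k / n_seg for k in range(n_seg + 1)]
    # each consecutive pair of points is one dipole (A, B)
    return list(zip(points[:-1], points[1:]))

# Example: a 100 um top-metal rail cut into 10 um dipoles
dipoles = cut_wire_into_dipoles((0, 0, 0), (100, 0, 0), max_len_um=10.0)
```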

Fig. 1. Overview of the proposed magnetic field simulation flow

2.2 Current Extraction Step

While it is quite standard to compute the current consumed by a small digital block, using for example a SPICE-like tool, it is much more difficult to compute the current flowing along the entire P/G network of a complex SoC made of many digital and analogue blocks and memories.

Thus, to get this current, we use an efficient IR drop tool, RedHawk (from the Apache tools suite) [11], which computes, with a high time resolution, the voltage evolution along the entire P/G network. RedHawk allows designers to verify that their P/G network does not suffer any significant static or dynamic voltage drop before launching production.

Another key advantage of this tool is its ability to simulate, with a reduced CPU time and a high accuracy (see Section 3), SoCs integrating many different elements such as digital blocks, co-processors, memories and analogue blocks. More precisely, another tool from the Apache tools suite, called Totem [11], characterizes the current evolution of analogue blocks and memories for use within RedHawk.

Once this characterization step is achieved, the simulation can be launched according to a scenario specifying to the tool kernel which blocks are involved. This simulation provides different results, such as static and dynamic maps disclosing the IR drops along the P/G network. A map identifying the areas that have suffered the most important IR drops during a scenario is given in Figure 2. In that case, it corresponds to a memory decoder power rail (red part in Figure 2).

Among all its features, RedHawk offers a key advantage for the modeling of magnetic emissions. Indeed, it allows extracting, by positioning virtual probes (a specific instance of this tool), the time domain evolution of the voltage anywhere along the P/G network, i.e. it provides the ability to compute the evolution of the current flowing in any piece of the P/G network considered as an electrical dipole in our magnetic field simulation flow.

More precisely, for the magnetic field simulation of a given IC, the first step (Figure 1) is to place virtual probes regularly (every X μm) along the power and ground rails. The placement policy was to place a virtual probe:

– every X μm along a unidirectional wire
– at each intersection of vertical or horizontal wires
– at each intersection of a wire and a via, in order to guarantee that two successive virtual probes are connected by a single, unidirectional wire.

This last point is important since it makes it easy to compute the current flowing between two probes, knowing the resistivity of the considered metal layer and the voltage at both wire ends.
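As a hedged illustration of this computation, the sketch below derives the dipole current from the two probe voltages and the segment resistance. The sheet-resistance value and the geometry are placeholder assumptions, not data from the authors' designs.

```python
def dipole_current(v_a, v_b, length_um, width_um, r_sheet_ohm_sq=0.05):
    """Current flowing from probe A to probe B through a single
    unidirectional wire segment, using R = Rsheet * L / W.

    v_a, v_b       : voltage waveforms (same sampling) at both wire ends
    length_um      : segment length, width_um: wire width
    r_sheet_ohm_sq : sheet resistance of the metal layer (placeholder value)
    """
    r_segment = r_sheet_ohm_sq * length_um / width_um
    return [(va - vb) / r_segment for va, vb in zip(v_a, v_b)]
```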

Once the currents flowing in all the dipoles have been computed, the results are stored in a file gathering, for each dipole, the sampled current waveform (the sampling rate fixes the time domain resolution and the simulation speed) as well as the coordinates of the dipole.

2.3 Magnetic Field Calculation Step

The second step of our flow is based on the classical rules of EM wave theory [12]. As aforementioned, each piece of the P/G network in which a current flows radiates an EM field according to the Maxwell equations.

In our case, considering the distance between the magnetic sensor and the IC surface, and the typical frequency bandwidth scanned by a magnetic measurement setup operating in the time domain (from 1 MHz to 1 GHz), we may adopt the quasi-stationary regime approximation. This allows using the Biot-Savart law (1) for faster calculations, rather than more complex expressions deduced from the Maxwell equations.

To get an idea of what can be seen on the scope at a point m of a plane parallel to the IC surface, we first compute the magnetic field at this point.

Fig. 2. A static IR drop map obtained with RedHawk

More precisely, knowing the current I_AB(t) that flows in each piece of the P/G network, represented by a finite wire of length AB, its contribution B_i(t) to the magnetic field B(t) at the position m is first evaluated according to the Biot-Savart law (1), where μ is the permeability of the considered space and r is the distance between the wire AB and the point m. Then, the final value B(t) of the magnetic field at the position m is computed as the vector sum of the magnetic fields radiated by the N pieces of the P/G network (2).

\vec{B}_i(t) = \frac{\mu \, I_{AB}(t)}{4\pi} \cdot \frac{\vec{AB} \times \vec{r}}{r^3}    (1)

\vec{B}(t) = \sum_{i=1}^{N} \vec{B}_i(t)    (2)

Then, we compute the magnetic flux φB(t) flowing through the coiled magnetic sensor, according to its diameter, which defines a surface S (3). This is done assuming that the surface S is parallel to the IC surface. This assumption is important since it allows computing the magnetic flux by evaluating the magnetic field at several points inside the surface S and summing them.

\phi_B(t) = \int_S \vec{B}(t) \cdot d\vec{S}    (3)

Finally, we compute the electromotive force emf(t), measured at the pins of the coiled sensor, by differentiating the magnetic flux with respect to time (4).

emf(t) = -\frac{d\phi_B(t)}{dt}    (4)
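The following sketch shows one possible way to turn equations (1)-(4) into code: per-dipole Biot-Savart contributions are summed at a few points inside the loop surface, averaged into a flux, and differentiated to obtain the emf. The units, the grid of in-loop points and the vacuum permeability are illustrative assumptions; they do not reflect the authors' actual implementation.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum permeability (H/m), assumed for the medium

def b_field(point, dipoles, currents):
    """Sum of the Biot-Savart contributions (1)-(2) of all dipoles at `point`.
    dipoles : list of (A, B) endpoints in metres; currents : amps at one instant."""
    b = np.zeros(3)
    for (a, b_end), i_ab in zip(dipoles, currents):
        ab = np.asarray(b_end) - np.asarray(a)
        r_vec = np.asarray(point) - (np.asarray(a) + np.asarray(b_end)) / 2.0
        r = np.linalg.norm(r_vec)
        b += MU0 * i_ab / (4 * np.pi) * np.cross(ab, r_vec) / r**3
    return b

def emf_trace(loop_points, loop_area, dipoles, current_traces, dt):
    """Flux (3) through the sensor loop and emf (4) over time.
    loop_points : sample points inside the loop surface (parallel to the IC)."""
    n_t = len(current_traces[0])
    flux = np.empty(n_t)
    for k in range(n_t):
        currents_k = [trace[k] for trace in current_traces]
        bz = [b_field(p, dipoles, currents_k)[2] for p in loop_points]
        flux[k] = np.mean(bz) * loop_area   # approximate surface integral (3)
    return -np.gradient(flux, dt)           # emf = -dphi/dt, equation (4)
```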

2.4 Additional Mandatory Steps

While computing the magnetic field at all the points of a plane parallel to the IC surface is quite standard, it is not sufficient to get an accurate idea of what can actually be seen by measurement.

Indeed, to obtain by simulation a more accurate representation of the results provided by a near-field scan of the IC, the characteristics of the setup assumed to be used for the measurements have to be considered.

In our simulation flow, three main characteristics of the setup are considered: the probe size, assumed to be a small loop, the overall bandwidth of the acquisition chain, and the gain of the low-noise amplifier. More precisely, to increase the accuracy of the results obtained:

– we take into account the change in direction of a wave due to the refraction induced by the passivation layer. Thus, at a given position m of the sensor, the magnetic field radiated by a piece of the P/G network far from the sensor is not taken into account in the resulting magnetic field measured by the sensor. This characteristic can only be estimated, because it is hard to measure the distance between the passivation layer and the magnetic sensor with a precision better than 5 μm.

– we filter (band-pass filter) the time domain evolution of the computed emf according to the acquisition chain bandwidth (a sketch of this step is given after this list). More precisely, knowing the frequency bandwidths of the sensor, the low-noise amplifier and the oscilloscope, we can estimate the frequency bandwidth of the whole acquisition chain.

– we take into account the gain (in decibels) of the low-noise amplifier.
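A minimal sketch of these last two corrections, assuming SciPy is available; the 1 MHz-1 GHz band, the filter order and the 63 dB gain are example values taken from the measurement setup described in Section 3, not a normative part of the flow.

```python
from scipy.signal import butter, filtfilt

def model_acquisition_chain(emf, fs, f_low=1e6, f_high=1e9, gain_db=63.0):
    """Band-pass filter the simulated emf to the acquisition chain bandwidth
    and apply the low-noise amplifier gain (example values, see Section 3).
    fs is the sampling rate of the emf trace and must exceed 2 * f_high."""
    b, a = butter(4, [f_low, f_high], btype="band", fs=fs)
    filtered = filtfilt(b, a, emf)             # zero-phase band-pass filtering
    return filtered * 10 ** (gain_db / 20.0)   # voltage gain from dB
```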

3 Validation

To validate the proposed magnetic field simulation flow, static and dynamic cartographies of the magnetic field generated by two circuits have been obtained using:

– a near-field scan setup operating in the time domain, composed of a motorized X-Y stage with a minimal displacement step of 1 μm, a magnetic sensor made of a coiled metal wire with a diameter of 50 μm, a low-noise amplifier with a gain of 63 dB, an oscilloscope, and a computer controlling the whole setup (Figure 3);

– our magnetic field simulation flow, using the characteristics of the near-field scan setup, as described in Section 2.

The two considered ICs are microcontrollers designed in a 130 nm CMOS technology. They integrate different macro-blocks such as ROM, RAM, EEPROM, a CPU and small analogue blocks.

One processing scenario, previously stored in the RAM, is executed on each circuit. It consists in reading data from the RAM and passing them to the CPU.

During the execution of this scenario (several clock cycles), we measured the magnetic field radiated by the chips, using the 50 μm sensor and a 25 μm displacement step.

Fig. 3. Near-field scan setup used for experimental validation

Fig. 4. Measured (left) and simulated (right) maps disclosing the peak-to-peak amplitude of the magnetic field in the close vicinity of the IC1 surface (scanned area: 1.9 mm × 1.7 mm)

The scenario was repeated 100 times for each position of the scanned surface in order to increase the signal-to-noise ratio.

Figures 4 and 5 show the cartographies (revealing the peak-to-peak amplitude of the magnetic field) obtained respectively with the aforementioned near-field scan setup and with the proposed magnetic field simulation flow.

Note that data acquisition with the near-field scan setup takes 3 hours, while the simulation runs in 5 hours. Note also that these simulations were launched to obtain an emf value every 25 μm. During these simulations, the probe diameter and the frequency bandwidth were fixed to 50 μm and 1 GHz respectively, according to the characteristics of our near-field scan setup. The simulation time step was chosen according to the sampling rate of our scope. The distance separating the sensor from the IC surface was estimated to be roughly 30 μm, using a small micro camera with a ×100 zoom.
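For clarity, the sketch below shows how a peak-to-peak cartography such as those of Figures 4 and 5 can be assembled from per-position emf traces; the dictionary-based storage of traces is an assumption made for illustration only.

```python
import numpy as np

def peak_to_peak_map(emf_traces, nx, ny):
    """Build a peak-to-peak amplitude map from emf traces simulated or
    measured on an nx-by-ny grid of sensor positions.

    emf_traces[(i, j)] is the emf waveform at grid position (i, j)."""
    ptp = np.zeros((ny, nx))
    for (i, j), trace in emf_traces.items():
        ptp[j, i] = np.ptp(trace)   # max(trace) - min(trace)
    return ptp
```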

As shown, for IC1 the agreement between simulations and measurements is satisfactory, even if some discrepancies still exist.

Fig. 5. Measured (left) and simulated (right) maps disclosing the peak-to-peak amplitude of the magnetic field in the close vicinity of the IC2 surface (scanned area: 0.8 mm × 2.4 mm)

These discrepancies may be due to several factors. Among them, some may come from the modeling of the sensor. Indeed, it is assumed that:

– the sensor is perfectly horizontal
– the sensor has a perfectly circular shape
– the distance between the sensor and the IC is perfectly known

This latter point is critical. It is extremely difficult in practice, even with a micro camera, to measure the distance separating the sensor from the IC with a high accuracy (< 5 μm) due to the package shape.

Note also that a fabricated chip does not necessarily have typical characteristics, due to process variations.

Concerning IC2, one observes a significant difference (around the rectangles in Figure 5) between the measured and the calculated maps. However, this difference was expected, since the marked positions are above the clock generator, which was not considered during the simulation (our database related to this design being incomplete).

While these maps demonstrate the interest of the proposed magnetic field simulation flow for comparing, before fabrication, the efficiency of different P/G network routing strategies in terms of emissions, they do not provide any information about the accuracy of the simulator with respect to time.

To fill this gap, Figure 6 gives the measured and simulated time domain evolutions of the magnetic field at the position marked by dots in Figure 5. As shown, the waveforms are quite similar (without the application of any filtering to model the bandwidth of the near-field scan setup), demonstrating the interest of the magnetic simulation tool.

Fig. 6. Measured (continuous line) and simulated (dashed line) time domain waveforms of the electromotive force above a supply rail of the IC2 RAM (axes: time in ns, emf in 100 mV units)

Fig. 7. Measured (continuous line) and simulated (dashed line) time domain waveforms (with filtering) of the electromotive force above a supply rail of the IC2 RAM (axes: time in ns, emf in 100 mV units)

Figure 7 shows the same results as those represented in Figure 6, except that the simulated average emf trace has been filtered according to the frequency bandwidth of the acquisition chain. The comparison of Figures 6 and 7 demonstrates the interest of considering the impact of the acquisition chain.

4 Conclusion

In this paper, we have introduced an industrial flow for simulating the time domain evolution of the magnetic emissions of an IC in the close vicinity of its surface. The main ideas on which this flow is based are:

– the use of a dynamic IR drop simulator, RedHawk, which quickly provides the current flowing in all parts of the Power/Ground network
– the use of the Biot-Savart law for fast calculations
– the modeling of the magnetic sensor and the consideration of the near-field scan setup bandwidth

This flow has been validated by comparing the predicted emissions of two ICs designed in a 130 nm technology with measured emissions. This comparison has demonstrated the efficiency of the proposed flow, even if there is room for further improvements.

References

1. Technical Specification IEC 62014-1 (2001)
2. Technical Specification IEC 62014-3 (2002)
3. Technical Specification IEC 62404 (2007)
4. CST Studio suite, http://www.cst.com
5. Gandolfi, K., Mourtel, C., Olivier, F.: Electromagnetic Analysis: Concrete Results. In: Koc, C.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 251–261. Springer, Heidelberg (2001)
6. Technical Specification IEC 61967-3
7. Ordas, T., Lisart, M., Sicard, E., Maurine, P., Torres, L.: Near-Field Mapping System to Scan in Time Domain the Magnetic Emissions of Integrated Circuits. In: Svensson, L., Monteiro, J. (eds.) PATMOS 2008. LNCS, vol. 5349, pp. 229–236. Springer, Heidelberg (2009)
8. Sauvage, L., Guilley, S., Mathieu, Y.: Electromagnetic Radiations of FPGAs: High Spatial Resolution Cartography and Attack on a Cryptographic Module. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 2(1) (2009)
9. Real, D., Valette, F., Drissi, M.: Enhancing Correlation Electromagnetic Attack Using Planar Near-Field Cartography. In: International Conference on Design, Automation and Test in Europe (DATE), pp. 628–633 (2009)
10. Dehbaoui, A., Lomne, V., Maurine, P., Torres, L., Robert, M.: Enhancing Electromagnetic Attacks Using Spectral Coherence Based Cartography. In: International Conference on Very Large Scale Integration (VLSI-SoC) (2009)
11. Apache Design Solutions, http://www.apache-da.com
12. Ben Dhia, S., Ramdani, M., Sicard, E.: Electromagnetic Compatibility of Integrated Circuits: Techniques for Low Emissions and Susceptibility. Springer, Heidelberg (2006)

Power Profiling of Embedded Analog/Mixed-Signal Systems

Jan Haase and Christoph Grimm

TU Vienna, [email protected],[email protected]

Abstract. In order to optimize power consumption, it is important to know where and why power is consumed in a specific system. Power estimation gives a more or less accurate answer to the first question (where?). Knowing where power is consumed allows designers to optimize these specific components. However, the second question (why?), about the reason for the power consumption, is more difficult to answer: activities that are responsible for power consumption (e.g. addressing/routing in a WSN) are not located in a single component, but use a variety of components. Knowing the cost of activities would pave the path to more holistic power optimization. The presentation will introduce methods for "power profiling" that assist the analysis of power consumption, assigning power consumption to both components and activities.

Open-People: Open Power and Energy Optimization PLatform and Estimator

Daniel Chillet

ENSSAT/IRISA/CAIRN, [email protected]

Abstract. The presentation will explain the objectives of the ANR Open-People project and will focus on energy estimation based on high-level modelling. This project aims at developing a hardware platform for the consumption measurement of complex SoCs.

This platform will be accessible via the Internet for industrial and academic users and will provide a library of power consumption models for several hardware boards. We are currently working on the description of power models of components. These models are described in a high-level language and enable estimations and optimizations of the energy. The platform uses SystemC to ensure functional verification and validation in order to provide accurate estimations.

The consumption models developed in the Open-People project can be defined at different levels of abstraction, and the SystemC simulation can use these different levels in order to facilitate the exploration step during system design. In this presentation, we will show how the SystemC models can be used to extract the power consumption of a complex system.

Early Power Estimation in Heterogeneous Designs Using SoCLib and SystemC-AMS

Francois Pecheux, Khouloud Zine El Abidine, and Alain Greiner

UPMC/LIP6/SOC, [email protected],

[email protected],[email protected]

Abstract. The presentation will describe a use case that consists in the modeling and simulation of a genuine heterogeneous system composed of individually powered Wireless Sensor Network nodes. The models are written in SoCLib and SystemC-AMS, an open-source C++ extension to the OSCI SystemC standard dedicated to the description of AMS designs containing digital, analog and RF hardware as well as other disciplines. SoCLib is a library of digital IP simulation models dedicated to the design of shared-memory multiprocessor architectures. It is currently being extended to support power estimation at the bit-cycle-accurate level of abstraction.

Concretely, a power-aware system of WSN nodes will be detailed that can monitor a physical seismic perturbation, transmit information on this perturbation to other nodes by means of 2.4 GHz RF communication links, and finally compute the epicenter of the perturbation by asking the 32-bit processor embedded in a node to solve the system of nonlinear equations associated with the triangulation algorithm. Each node is powered by an autonomous kinetic battery model.

ASTEC: Asynchronous Technology for Low Power and Secured Embedded Systems

Pr. Marc Renaudin

CTO of TIEMPO SAS, 110 Rue Blaise Pascal, Bat. Viseo - Inovallee, 38330 Montbonnot St Martin, France

Abstract. The presentation highlights recent advances and results of the MINALOGIC ASTEC project in the domain of asynchronous microcontroller design and wireless sensor applications. The work carried out in the ASTEC project is focused on using the asynchronous technology industrialized by Tiempo to design low-power and secured embedded systems. In collaboration with the TIMA laboratory and CESTI/LETI, Tiempo fabricated and evaluated two versions of a fully asynchronous microcontroller, one without security features and one with security counter-measures against power and fault attacks. Sensaris and Tracedge are integrating the asynchronous microcontroller into their systems in order to take advantage of the technology and design competitive products for the low-power embedded systems market (wireless sensors, medical systems, RFIDs...).

OPENTLM and SOCKET: Creating an Open EcoSystem for Virtual Prototyping of Complex SOCs

Laurent Maillet-Contoz

STMicroelectronics, Grenoble, France

Abstract. The objective of the OpenTLM project is to offer embedded software developers a tool kit, available under an open source license, and based on the SystemC/TLM standard. It enables them to develop and test the embedded software ahead of the availability of hardware platforms (silicon, but also hardware emulators). It gives the opportunity to promote a broader use of the TLM methodology, already adopted by hardware teams, as well as a better concurrent development of the hardware and software parts of the system. Indeed, if the software is mature enough when silicon is available, the overall period for system integration is reduced, which accelerates the availability of the product and optimizes time-to-market.

The SoCKET project (SoC toolKit for critical Embedded sysTems) gathers industrial and academic partners to address the issue of design methodologies for critical embedded systems. The work targets the definition of a "seamless" design flow which integrates equipment qualification/certification, from the system level down to the Integrated Circuits (ICs) and the associated embedded software, compliant with the applicable norms (aeronautics: DO-178C, DO-254, ARP4754; space: ECSS Q60-02, Q80, E40).

This "seamless" flow requires some unification of formalisms (elimination of semantic holes in HW/SW interfaces), the availability of model transformation operators (skeleton generation, requirements traceability), and model and tool interoperability.

The main outcomes of the project will be:

* a design flow supporting critical embedded systems development
* a draft IDE implementing this flow, tested with partners' tools (adaptable to other tools and other applications)
* some return of experience through 4 industrial case studies
* some Certification/Qualification kits for IPs and SoCs in each application domain
* some recommendations to certification and normalization bodies.

Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs

Kiyoo Itoh

Fellow, Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan

Tel.: [email protected]

Abstract. The low-voltage scaling limitations of nanoscale CMOS LSIs are one of the major problems in the nanoscale era because they cause evermore-serious power crises with device scaling. The problems stem from two unscalable device parameters. The first is the high value of the lowest necessary threshold voltage Vt (that is, Vt0) of MOSFETs needed to keep the subthreshold leakage low. The second is the variation in Vt (that is, ΔVt), which becomes more prominent in the nanoscale era. The ΔVt caused by the intrinsic random dopant fluctuation is the major source of the various ΔVt components. It increases with device scaling and thus intensifies various detrimental effects such as variations in the speed and/or the voltage margins of circuits. Due to such inherent features of Vt0 and ΔVt, the operating voltage VDD is facing a 1-V wall at the 65-nm generation, and is expected to increase rapidly with further scaling of bulk MOSFETs, thereby worsening the power crisis. To reduce VDD, the minimum operating voltage Vmin, as determined by Vt0 and ΔVt, must be reduced.

In this talk the Vmin of memory-rich nanoscale CMOS LSIs is investigated in an effort to reduce it to below 0.5 V through variability-conscious device and circuit designs. First, Vmin, as a methodology to evaluate the low-voltage potential of MOSFETs, is proposed on the basis of a tolerable speed variation. Second, the Vmins of the logic, SRAM, and DRAM blocks are compared, and the SRAM block comprising the six-transistor (6-T) cell turns out to be particularly problematic because it has the highest Vmin. Third, new devices, such as the fully-depleted structure (FD-SOI) and the fin-type structure (FinFET) as ΔVt-immune MOSFETs, are investigated to further reduce the Vmins of the above-described blocks. New circuits to reduce the Vmin of each block are also investigated. For example, for the logic block, new dual-Vt0 and dual-VDD dynamic circuits enable the power-delay product to be reduced to 0.09 at a 0.2-V supply, owing to gate-source reverse biasing. For the SRAM block, repair techniques, shortening the data line, up-sizing the MOSFETs, control of the common-source line or the word line of the cell, and even the 8-T cell reduce the Vmin. For the DRAM block, if combined with FinFET DRAM cells, a dynamic sense amplifier minimizes the Vt0 and thus Vmin.

Finally, it is concluded that such variability-conscious circuit designs should lead to the achievement of 0.5-V nanoscale LSIs, if the relevant devices and fabrication processes are successfully developed.

3D Integration for Digital and Imagers Circuits: Opportunities and Challenges

Marc Belleville

CEA, LETI, MINATEC, France

Abstract. To cope with the market requirements of more functionalities and performance, while keeping a reasonable power consumption, the microelectronics industry has always relied extensively on 2D technology scaling. However, with the technical and economic challenges increasing dramatically for the most advanced nodes, 3D integration is now recognized as a very attractive alternative solution to sustain increased system integration. The key drivers towards 3D integration will first be introduced in this talk. Examples of the various 3D processes and their associated technological challenges and limitations will be given. At this stage, 3D design rules and 3D-specific CAD tools (industrial or at the research level) will be presented and discussed. Then, examples of 3D IPs and circuits will be detailed. Finally, a perspective on another type of 3D integration (stacking transistors instead of dies or wafers) will conclude this talk.

Signing Off Industrial Designs on Evolving Technologies

Sebastien Marchal

STMicroelectronics, France

Abstract. Many specific challenges need to be addressed in current SoC designs to offer a competitive product. Chip content becomes highly heterogeneous, performance has to be on the leading edge and robustness needs to be guaranteed. Besides the technical merits, the "date of availability" of the product plays a key role in its overall competitiveness. Therefore, as schedule pressure is increasing, moving to a new technology requires more parallelization of activities that used to be done serially. When the technology brick, which is the first link of the chain, moves through various maturity levels, the whole design process may be impacted. Traditionally, the impact of such evolutions was not anticipated. Layouts were updated as a consequence of design rule changes. "Brute force" timing margins were put in the models regardless of design specificities. Design For Variation techniques operate at the design flow, SoC design and library/IP design levels to anticipate those variations. Different examples of design rule changes or timing variations are discussed, and techniques to handle such changes are covered. Some aspects of silicon process corner variations are also presented. Part of the talk covers clock network building techniques which are variation friendly. The impact of such techniques on final design analysis, also called SignOff, is demonstrated. How DFV can simplify SignOff is finally discussed.

Author Index

Alioto, Massimo 62
Apolloni, R. 116

Bachmann, Christian 11
Baz, Abdullah 105
Beigne, Edith 94
Bekiaris, Dimitris 73
Belleville, Marc 256
Berkelaar, Michel 190
Berrandjia, Mohamed Lamine 211
Blanc, Guillaume 1
Boudouani, Nassima 1

Calazans, Ney 150
Carazo, P. 116
Castro, F. 116
Chaver, D. 116
Chillet, Daniel 251
Chow, Wing-Kai 228
Consoli, Elio 62
Corsonello, Pasquale 180
Crippa, Dennis 41

De Rose, Raffaele 180

Economakos, George 73
Eichwald, Irina 200
El Abidine, Khouloud Zine 252
Elissati, Oussama 137

Fernandes, Jorge 84
Fesquet, Laurent 137
Flores, Paulo 84
Frustaci, Fabio 180

Gag, Martin 21
Garcia-Ortiz, Alberto 160
Genser, Andreas 11
Ghavami, Behnam 126
Grass, Eckhard 218
Greiner, Alain 252
Grimm, Christoph 250

Haase, Jan 250
Haid, Josef 11

Indrusiak, Leandro S. 160
Itoh, Kiyoo 255

Jagdhold, Ulrich 218
Jain, Abhishek 41

Kheradmand-Boroujeni, Bahman 170
Knoth, Christoph 200
Kouretas, Ioannis 31

Lanuzza, Marco 180
Lazzari, Cristiano 84
Leblebici, Yusuf 170
Lebreton, Hugo 94
Liacha, Ahmed 211
Lisart, Mathieu 238
Lomne, Victor 238
Lu, Jingwei 228

Maillet-Contoz, Laurent 254
Marchal, Sebastien 257
Maurine, Philippe 238
Monteiro, Jose 84
Moraes, Fernando 150
Moreira, Matheus 150

Nordholz, Petra 200

Ordas, Thomas 238
Oudjida, Abdelkrim Kamel 211

Paliouras, Vassilis 31
Palumbo, Gaetano 62
Papameletis, Christos 73
Papanikolaou, Antonis 73
Pecheux, Francois 252
Pedram, Hossein 126
Pekmestzi, Kiamal 73
Perri, Stefania 180
Petri, Markus 218
Piguet, Christian 170
Pinuel, L. 116
Pontes, Julian 150

Raji, Mohsen 126
Ramezani, Lida 51
Renaudin, Pr. Marc 253
Rieubon, Sebastien 137
Rolandi, Pierluigi 41

Sassolas, Tanguy 1
Schlichtmann, Ulf 200
Schrape, Oliver 218
Sham, Chiu-Wing 228
Shang, Delong 105
Soudris, Dimitrios 73
Steger, Christian 11

Tajary, Alireza 126
Tang, Qin 190
Tiar, Rachid 211
Timmermann, Dirk 21
Tirado, F. 116
Torres, Lionel 238
Toublanc, Jerome 238

van der Meijs, Nick 190
Veggetti, Andrea 41
Ventroux, Nicolas 1
Vivet, Pascal 94

Wegner, Tim 21
Weiß, Reinhold 11
Winkler, Frank 218

Xia, Fei 105

Yahya, Eslam 137
Yakovlev, Alex 105

Zarandi, Hamid R. 126
Zeidler, Steffen 218
Zergainoh, Nacer-Eddine 94
Zjajo, Amir 190