

The Art of Differentiating Computer Programs

    An Introduction to Algorithmic Differentiation


    The SIAM series on Software, Environments, and Tools focuses on the practical implementation of

    computational methods and the high performance aspects of scientific computation by emphasizing

    in-demand software, computing environments, and tools for computing. Software technology development

    issues such as current status, applications and algorithms, mathematical software, software tools, languages

    and compilers, computing environments, and visualization are presented.

Editor-in-Chief
Jack J. Dongarra

University of Tennessee and Oak Ridge National Laboratory

Editorial Board

James W. Demmel, University of California, Berkeley

Dennis Gannon, Indiana University

Eric Grosse, AT&T Bell Laboratories

Jorge J. Moré, Argonne National Laboratory

Software, Environments, and Tools

Uwe Naumann, The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation

    C. T. Kelley, Implicit Filtering

    Jeremy Kepner and John Gilbert, editors, Graph Algorithms in the Language of Linear Algebra

    Jeremy Kepner, Parallel MATLAB for Multicore and Multinode Computers

    Michael A. Heroux, Padma Raghavan, and Horst D. Simon, editors, Parallel Processing for Scientific Computing

Gérard Meurant, The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations

Bo Einarsson, editor, Accuracy and Reliability in Scientific Computing

Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, Second Edition

Craig C. Douglas, Gundolf Haase, and Ulrich Langer, A Tutorial on Elliptic PDE Solvers and Their Parallelization

Louis Komzsik, The Lanczos Method: Evolution and Application

Bard Ermentrout, Simulating, Analyzing, and Animating Dynamical Systems: A Guide to XPPAUT for Researchers and Students

V. A. Barker, L. S. Blackford, J. Dongarra, J. Du Croz, S. Hammarling, M. Marinova, J. Wasniewski, and P. Yalamov, LAPACK95 Users' Guide

    Stefan Goedecker and Adolfy Hoisie, Performance Optimization of Numerically Intensive Codes

    Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, Templates for the Solution

    of Algebraic Eigenvalue Problems: A Practical Guide

    Lloyd N. Trefethen, Spectral Methods in MATLAB

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, Third Edition

Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval

Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Numerical Linear Algebra for High-Performance Computers

R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods

Randolph E. Bank, PLTMG: A Software Package for Solving Elliptic Partial Differential Equations, Users' Guide 8.0

L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide

Greg Astfalk, editor, Applications on Advanced Architecture Computers

Roger W. Hockney, The Science of Computer Benchmarking

Françoise Chaitin-Chatelin and Valérie Frayssé, Lectures on Finite Precision Computations


    Society for Industrial and Applied Mathematics

    Philadelphia

The Art of Differentiating Computer Programs

    An Introduction to Algorithmic Differentiation

Uwe Naumann
RWTH Aachen University

    Aachen, Germany


SIAM is a registered trademark.

Copyright © 2012 by the Society for Industrial and Applied Mathematics

    10 9 8 7 6 5 4 3 2 1

    All rights reserved. Printed in the United States of America. No part of this book may be

    reproduced, stored, or transmitted in any manner without the written permission of the

    publisher. For information, write to the Society for Industrial and Applied Mathematics,

    3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA.

    Trademarked names may be used in this book without the inclusion of a trademark symbol.

    These names are used in an editorial context only; no infringement of trademark is intended.

    Ampl is a registered trademark of AMPL Optimization LLC, Lucent Technologies Inc.

    Linux is a registered trademark of Linus Torvalds.

    Maple is a trademark of Waterloo Maple, Inc.

Mathematica is a registered trademark of Wolfram Research, Inc.

    MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information,

    please contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA,

508-647-7000, Fax: 508-647-7001, [email protected], www.mathworks.com.

    NAG is a registered trademark of the Numerical Algorithms Group.

    Library of Congress Cataloging-in-Publication Data

    Naumann, Uwe, 1969-

    The art of differentiating computer programs : an introduction to algorithmic differentiation/

    Uwe Naumann.

    p. cm. -- (Software, environments, and tools)

    Includes bibliographical references and index.

    ISBN 978-1-611972-06-1

    1. Computer programs. 2. Automatic differentiations. 3. Sensitivity theory (Mathematics) I.

    Title.

    QA76.76.A98N38 2011

    003.3--dc23 2011032262

    Partial royalties from the sale of this book are placed in a fund to help

    students attend SIAM meetings and other SIAM-related activities. This fund

    is administered by SIAM, and qualified individuals are encouraged to write

    directly to SIAM for guidelines.


    To Ines, Pia, and Antonia


Contents

2.4.4 Derivatives for Numerical Libraries
2.4.5 Call Tree Reversal

3 Higher Derivative Code
3.1 Notation and Terminology
3.2 Second-Order Tangent-Linear Code
3.2.1 Source Transformation
3.2.2 Overloading
3.3 Second-Order Adjoint Code
3.3.1 Source Transformation
3.3.2 Overloading
3.3.3 Compression of Sparse Hessians
3.4 Higher Derivative Code
3.4.1 Third-Order Tangent-Linear Code
3.4.2 Third-Order Adjoint Code
3.4.3 Fourth and Higher Derivative Code
3.5 Exercises
3.5.1 Second Derivative Code
3.5.2 Use of Second Derivative Models
3.5.3 Third and Higher Derivative Models

4 Derivative Code Compilers: An Introductory Tutorial
4.1 Overview
4.2 Fundamental Concepts and Terminology
4.3 Lexical Analysis
4.3.1 From RE to NFA
4.3.2 From NFA to DFA with Subset Construction
4.3.3 Minimization of DFA
4.3.4 flex
4.4 Syntax Analysis
4.4.1 Top-Down Parsing
4.4.2 Bottom-Up Parsing
4.4.3 A Simple LR Language
4.4.4 A Simple Operator Precedence Language
4.4.5 Parser for SL2 Programs with flex and bison
4.4.6 Interaction between flex and bison
4.5 Single-Pass Derivative Code Compilers
4.5.1 Attribute Grammars
4.5.2 Syntax-Directed Assignment-Level SAC
4.5.3 Syntax-Directed Tangent-Linear Code
4.5.4 Syntax-Directed Adjoint Code
4.6 Toward Multipass Derivative Code Compilers
4.6.1 Symbol Table
4.6.2 Parse Tree
4.7 Exercises
4.7.1 Lexical Analysis
4.7.2 Syntax Analysis


C.2 Chapter 2
C.2.1 Exercise 2.4.1
C.2.2 Exercise 2.4.2
C.2.3 Exercise 2.4.3
C.2.4 Exercise 2.4.4
C.2.5 Exercise 2.4.5
C.3 Chapter 3
C.3.1 Exercise 3.5.1
C.3.2 Exercise 3.5.2
C.3.3 Exercise 3.5.3
C.4 Chapter 4
C.4.1 Exercise 4.7.1
C.4.2 Exercise 4.7.2
C.4.3 Exercise 4.7.3
C.4.4 Exercise 4.7.4

Bibliography

Index


    Preface

How sensitive are the values of the outputs of my computer program with respect to changes in the values of the inputs? How sensitive are these first-order sensitivities with respect to changes in the values of the inputs? How sensitive are the second-order sensitivities with respect to changes in the values of the inputs? . . .

Computational scientists, engineers, and economists as well as quantitative analysts in computational finance tend to ask these questions on a regular basis. They write computer programs in order to simulate diverse real-world phenomena. The underlying mathematical models often depend on a possibly large number of (typically unknown or uncertain) parameters. Values for the corresponding inputs of the numerical simulation programs can, for example, be the result of (typically error-prone) observations and measurements. If very small perturbations in these uncertain values yield large changes in the values of the outputs, then the feasibility of the entire simulation becomes questionable. Nobody should make decisions based on such highly uncertain data.

Quantitative information about the extent of this uncertainty is crucial. First- and higher-order sensitivities of outputs of numerical simulation programs with respect to their inputs (also first and higher derivatives) form the basis for various approximations of uncertainty. They are also crucial ingredients of a large number of numerical algorithms ranging from the solution of (systems of) nonlinear equations to optimization under constraints given as (systems of) partial differential equations. This book describes a set of techniques for modifying the semantics of numerical simulation programs such that the desired first and higher derivatives can be computed accurately and efficiently. Computer programs implement algorithms. Consequently, the subject is known as Algorithmic (also Automatic) Differentiation (AD).

AD provides two fundamental modes. In forward mode, a tangent-linear version of the original program is built. The sensitivities of all outputs of the program with respect to its inputs can be computed at a computational cost that is proportional to the number of inputs. The computational complexity is similar to that of finite difference approximation. At the same time, the desired derivatives are computed with machine accuracy. Truncation is avoided.

Reverse mode yields an adjoint program that can be used to perform the same task at a computational cost that is proportional to the number of outputs. For example, in large-scale nonlinear optimization a scalar objective that is returned by the given computer program can depend on a very large number of input parameters. The adjoint program allows for the computation of the gradient (the first-order sensitivities of the objective with respect to all parameters) at a small constant multiple R (typically between 3 and 30) of the cost of running the original program. It outperforms gradient accumulation routines that are based


on finite differences or on tangent-linear code as soon as the size of the gradient exceeds R. The ratio R plays a very prominent role in the evaluation of the quality of derivative code. It will reappear several times in this book.

The generation of tangent-linear and adjoint code is the main topic of this introduction to The Art of Differentiating Computer Programs by AD. Repeated applications of forward and reverse modes yield higher-order tangent-linear and adjoint code. Two ways of implementing AD are presented. Derivative code compilers take a source transformation approach in order to realize the semantic modification. Alternatively, run time support libraries can be developed that use operator and function overloading based on a redefined floating-point data type to propagate tangent-linear as well as adjoint sensitivities. Note that AD differentiates what you implement!¹

Many successful applications of AD are described in the proceedings of five international conferences [10, 11, 13, 18, 19]. The standard book on the subject by Griewank and Walther [36] covers a wide range of basic, as well as advanced, topics in AD. Our focus is different. We aim to present a textbook-style introduction to AD for undergraduate and graduate students as well as for practitioners in computational science, engineering, economics, and finance. The material was developed to support courses on Computational Differentiation and Derivative Code Compilers for students of Computational Engineering Science, Mathematics, and Computer Science at RWTH Aachen University. Project-style exercises come with detailed hints on possible solutions. All software is provided as open source. In particular, we present a fully functional derivative code compiler (dcc) for a (very) limited subset of C/C++. It can be used to generate tangent-linear and adjoint code of arbitrary order by reapplication to its own output. Our run time support library dco provides a better language coverage at the expense of less efficient derivative code. It uses operator and function overloading in C++. Both tools form the basis for the ongoing development of production versions that are actively used in a number of collaborative projects among scientists and engineers from various application areas.

Except for relatively simple cases, the differentiation of computer programs is not automatic despite the existence of many reasonably mature AD software packages.² To reveal their full power, AD solutions need to be integrated into existing numerical simulation software. Targeted application of AD tools and intervention by educated users is crucial. We expect AD to become truly automatic at some time in the (distant) future. In particular, the automatic generation of optimal (in terms of robustness and efficiency) adjoint versions of large-scale simulation code is one of the great open challenges in the field of High-Performance Scientific Computing. With this book, we hope to contribute to a better understanding of AD by a wider range of potential users of this technology. Combine it with the book of Griewank and Walther [36] for a comprehensive introduction to the state of the art in the field.

There are several reasonable paths through this book that depend on your specific interests. Chapter 1 motivates the use of differentiated computer programs in the context of methods for the solution of systems of nonlinear equations and for nonlinear programming. The drawbacks of closed-form symbolic differentiation and finite difference approximations are discussed, and the superiority of adjoint over tangent-linear code is shown if the


number of inputs exceeds the number of outputs significantly. The generation of tangent-linear and adjoint code by forward and reverse mode AD is the subject of Chapter 2. If you are a potential user of first-order AD exclusively, then you may proceed immediately to the relevant sections of Chapter 5, covering the use of dcc for the generation of first derivative code. Otherwise, read Chapter 3 to find out more about the generation of second- or higher-order tangent-linear and adjoint code. The remaining sections in Chapter 5 illustrate the use of dcc for the partial automation of the corresponding source transformation. Prospective developers of derivative code compilers should not skip Chapter 4. There, we relate well-known material from compiler construction to the task of differentiating computer programs. The scanner and parser generators flex and bison are used to build a compiler front-end that is suitable for both single- and multipass compilation of derivative code. Further relevant material, including hints on the solutions for all exercises, is collected in the Appendix.

The supplementary website for this book, http://www.siam.org/se22, contains sources of all software discussed in the book, further exercises and comments on their solutions (growing over the coming years), links to further sites on AD, and errata.

In practice, the programming language that is used for the implementation of the original program accounts for many of the problems to be addressed by users of AD technology. Each language deserves to be covered by a separate book. The given computing infrastructure (hardware, native compilers, concurrency/parallelism, external libraries, handling of data, i/o, etc.) and software policies (level of robustness and safety, version management) may complicate things even further. Nevertheless, AD is actively used in many large projects, each of them posing specific challenges. The collection of these issues and their structured presentation in the form of a book can probably only be achieved by a group of AD practitioners and is clearly beyond the scope of this introduction.

Let us conclude these opening remarks with comments on the book's title, which might sound vaguely familiar. While its scope is obviously much narrower than that of the classic by Knuth [45], the application of AD to computer programs still deserves to be called an art. Educated users are crucial prerequisites for robust and efficient AD solutions in the context of large-scale numerical simulation programs. In AD details really do matter.³

    With this book, we hope to set the stage for many more artists to enter this exciting field.

    Uwe Naumann

July 2011

¹Which occasionally differs from what you think you implement!
²See www.autodiff.org.
³Quote from one of the anonymous referees.


    Acknowledgments

    I could probably fill a few pages acknowledging the exceptional role of my family in my(professional) life including the evolution of this book. You know what I am talking about.

My wife Ines, who is by far the better artist, has helped me with the cover art. My own attempts never resulted in a drawing that adequately reflects the intrinsic joy of differentiating computer programs. I did contribute the code fragment, though.

I am grateful to Markus Beckers, Michael Förster, Boris Gendler, Johannes Lotz, Andrew Lyons, Viktor Mosenkis, Jan Riehme, Niloofar Safiran, Michel Schanen, Ebadollah Varnik, and Claudia Yastremiz for (repeatedly) proofreading the manuscript. Any remaining shortcomings should be blamed on them.

Three anonymous referees provided valuable feedback on various versions of the manuscript.

Last, but not least, I would like to thank my former Ph.D. supervisor Andreas Griewank for seeding my interest in AD. As a pioneer in this field, he has always been keen on promoting AD technology by using various techniques. One of them is music.


Optimality⁴

Music: Think of Fool's Garden's "Lemon Tree"
Lyrics: Uwe Naumann
Vocals: Be my guest

I'm sittin' in front of the computer screen.
Newton's second iteration is what I've just seen.
It's not quite the progress that I would expect
from a code such as mine - no doubt it must be perfect!
Just the facts are not supportive, and I wonder . . .

My linear solver is state of the art.
It does not get better wherever I start.
For differentiation is there anything else?
Perturbing the inputs - can't imagine this fails.
I pick a small Epsilon, and I wonder . . .

I wonder how, but I still give it a try.
The next change in step size is bound to fly.
'Cause all I'd like to see is simply optimality.

Epsilon, in fact, appears to be rather small.
A factor of ten should improve it all.
'Cause all I'd like to see is nearly optimality.

A DAD ADADA DAD ADADA DADAD.

⁴Originally presented in Nice, France on April 8, 2010 at the Algorithmic Differentiation, Optimization, and Beyond meeting in honor of Andreas Griewank's 60th birthday.


A few hours later my talk's getting rude.
The sole thing descending seems to be my mood.
How can guessing the Hessian only take this much time?

N squared function runs appear to be the crime.
The facts support this thesis, and I wonder . . .

Isolation due to KKT.
Isolation - why not simply drop feasibility?

The guy next door's been sayin' again and again:
An adjoint Lagrangian might relieve my pain.
Though I don't quite believe him, I surrender.

I wonder how but I still give it a try.
Gradients and Hessians in the blink of an eye.
Still all I'd like to see is simply optimality.
Epsilon itself has finally disappeared.
Reverse mode AD works, no matter how weird,
and I'm about to see local optimality.

Yes, I wonder, I wonder . . .

I wonder how but I still give it a try.
Gradients and Hessians in the blink of an eye.
Still all I'd like to see . . .

I really need to see . . . now I can finally see my cherished optimality :-)


    Chapter 1

    Motivation and Introduction

The computation of derivatives plays a central role in many numerical algorithms. First- and higher-order sensitivities of selected outputs of numerical simulation programs with respect to certain inputs as well as projections of the corresponding derivative tensors may be required. Often the computational effort of these algorithms is dominated by the run time and the memory requirement of the computation of derivatives. Their accuracy may have a dramatic effect on both convergence behavior and run time of the underlying iterative numerical schemes. We illustrate these claims with simple case studies in Section 1.1, namely the solution of systems of nonlinear equations using the Newton method in Section 1.1.1 and basic first- and second-order nonlinear programming algorithms in Section 1.1.2. The use of derivatives with numerical libraries is demonstrated in the context of the NAG Numerical Library as a prominent representative for a number of similar commercial and noncommercial numerical software tools in Section 1.1.3.

This first chapter aims to set the stage for the following discussion of Algorithmic Differentiation (AD) for the accurate and efficient computation of first and higher derivatives. Traditionally, numerical differentiation has been performed manually, possibly supported by symbolic differentiation capabilities of modern computer algebra systems, or derivatives have been approximated by finite difference quotients. Neither approach turns out to be a serious competitor for AD. Manual differentiation is tedious and error-prone, while finite differences are often highly inefficient and potentially inaccurate. These two techniques are discussed briefly in Section 1.2 and Section 1.3, respectively.

    1.1 Motivation: Derivatives for . . .

Numerical simulation enables computational scientists and engineers to study the behavior of various kinds of real-world systems in ways that are impossible (or at least extremely difficult) in reality. The quality of the results depends largely on the quality of the underlying mathematical model F : R^n → R^m. Computer programs are developed to simulate the functional dependence of one or more objectives y ∈ R^m on a potentially very large number of parameters x ∈ R^n. For a given set of input parameters, the corresponding values of the objectives can be obtained by a single run of the simulation program as y = F(x). This


simulation of the studied real-world system can be extremely useful. However, it leaves various questions unanswered.

One of the simpler open questions is about sensitivities of the objective with respect to the input parameters with the goal of quantifying the change in the objective values for slight (infinitesimal) changes in the parameters. Suppose that the values of the parameters are defined through measurements within the simulated system (for example, an ocean or the atmosphere). The accuracy of such measurements is less important for small sensitivities as inaccuracies will not translate into significant variations of the objectives. Large sensitivities, however, indicate critical parameters whose inaccurate measurement may yield dramatically different results. More accurate measuring devices are likely to be more costly. Even worse, an adequate measuring strategy may be infeasible due to excessive run time or other hard constraints. Mathematical remodeling may turn out to be the only solution.

Sensitivity analysis is one of many areas requiring the Jacobian matrix of F,

\nabla F = \nabla F(x) \equiv \left( \frac{\partial y_j}{\partial x_i} \right)_{j=0,\dots,m-1,\; i=0,\dots,n-1},

whose rows contain the sensitivities of the outputs y_j, j = 0, ..., m-1, of the numerical simulation y = F(x) with respect to the input parameters x_i, i = 0, ..., n-1. Higher derivative tensors including the Hessian of F,

\nabla^2 F = \nabla^2 F(x) \equiv \left( \frac{\partial^2 y_j}{\partial x_i \, \partial x_k} \right)_{j=0,\dots,m-1,\; i,k=0,\dots,n-1},

are used in corresponding higher-order methods. This book is based on C/C++ as the underlying programming language; hence, vectors are indexed starting from zero instead of one. In the following, highlighted terminology is used without definition. Formal explanations are given in the subsequent chapters.

1.1.1 . . . Systems of Nonlinear Equations

The solution of systems of nonlinear equations F(x) = 0, where F : R^n → R^n, is a fundamental requirement of many numerical algorithms, ranging from nonlinear programming via the numerical solution of nonlinear partial differential equations (PDEs) to PDE-constrained optimization. Variants of the Newton algorithm are highly popular in this context.

The basic version of the Newton algorithm (see Algorithm 1.1) solves F(x) = 0 iteratively for k = 0, 1, ... and for a given starting point x^0 as

x^{k+1} = x^k - \nabla F(x^k)^{-1} \cdot F(x^k).

The Newton step dx^k \equiv -\nabla F(x^k)^{-1} \cdot F(x^k) is obtained as the solution of the Newton system of linear equations

\nabla F(x^k) \cdot dx^k = -F(x^k)

at each iteration. A good starting value x^0 is crucial for reaching the desired convergence behavior. Convergence is typically defined by some norm of the residual undercutting a


Algorithm 1.1 Newton algorithm for solving the nonlinear system F(x) = 0.

In:
  implementation of the residual y at the current point x ∈ R^n:
    F : R^n → R^n, y = F(x)
  implementation of the Jacobian A ≡ ∇F(x) of the residual at the current point x:
    ∇F : R^n → R^{n×n}, A = ∇F(x)
  solver for computing the Newton step dx ∈ R^n as the solution of the linear Newton system A · dx = -y:
    s : R^n × R^{n×n} → R^n, dx = s(y, A)
  starting point: x ∈ R^n
  upper bound on the norm of the residual ||F(x)|| at the approximate solution: ε ∈ R

Out:
  approximate solution of the nonlinear system F(x) = 0: x ∈ R^n

Algorithm:
 1: y = F(x)
 2: while ||y|| > ε do
 3:   A = ∇F(x)
 4:   dx = s(y, A)
 5:   x ← x + dx
 6:   y = F(x)
 7: end while

given bound. Refer to [44] for further details on the Newton algorithm. A basic version without error handling is shown in Algorithm 1.1. Convergence is assumed.
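A minimal C++ driver sketch of Algorithm 1.1 follows; the names newton, F, dF, and solve, the std::function interfaces, and the dense matrix type are illustrative assumptions, and the linear solver is left abstract (it is assumed to return dx with A · dx = -y).

  #include <cmath>
  #include <vector>
  #include <functional>

  using Vector = std::vector<double>;
  using Matrix = std::vector<std::vector<double>>;

  // Euclidean norm of the residual.
  double norm(const Vector& y) {
    double s = 0;
    for (double v : y) s += v * v;
    return std::sqrt(s);
  }

  // Algorithm 1.1: x is overwritten with an approximate root of F.
  void newton(const std::function<Vector(const Vector&)>& F,            // residual y = F(x)
              const std::function<Matrix(const Vector&)>& dF,           // Jacobian A of F at x
              const std::function<Vector(const Vector&, const Matrix&)>& solve, // dx with A*dx = -y
              Vector& x, double eps) {
    Vector y = F(x);                          // line 1
    while (norm(y) > eps) {                   // line 2
      Matrix A = dF(x);                       // line 3
      Vector dx = solve(y, A);                // line 4
      for (std::size_t i = 0; i < x.size(); ++i) x[i] += dx[i]; // line 5
      y = F(x);                               // line 6
    }
  }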

The computational cost of Algorithm 1.1 is dominated by the accumulation of the Jacobian A ≡ ∇F in line 3 and by the solution of the Newton system in line 4. The quality of the Newton step dx depends on the accuracy of the Jacobian ∇F. Traditionally, ∇F is approximated using finite difference quotients as shown in Algorithm 1.2, where e^i denotes the ith Cartesian basis vector in R^n, that is,

e^i = \left( e^i_j \right)_{j=0,\dots,n-1} \equiv \begin{cases} 1 & i = j, \\ 0 & \text{otherwise.} \end{cases}

The value of the residual is computed at no extra cost when using forward or backward finite differences, to be discussed in further detail in Section 1.3.

In Algorithm 1.2, a single evaluation of the residual at the current point x in line 1 is succeeded by n evaluations at perturbed points in line 4. The components of x are perturbed individually in line 3. The columns of the Jacobian are approximated separately in lines 5-7, yielding a computational cost of O(n) · Cost(F), where Cost(F) denotes the cost of a single


Algorithm 1.2 Jacobian accumulation by (forward) finite differences in the context of the Newton algorithm for solving the nonlinear system F(x) = 0.

In:
  implementation of the residual y = (y_k)_{k=0,...,n-1} at the current point x ∈ R^n:
    F : R^n → R^n, y = F(x)
  current point: x ∈ R^n
  perturbation: ε ∈ R

Out:
  residual at the current point: y = F(x) ∈ R^n
  approximate Jacobian of the residual at the current point:
    A = (a_{k,i})_{k,i=0,...,n-1} ≈ ∇F(x) ∈ R^{n×n}

Algorithm:
 1: y = F(x)
 2: for i = 0 to n - 1 do
 3:   x̃ ← x + ε · e^i
 4:   ỹ = F(x̃)
 5:   for k = 0 to n - 1 do
 6:     a_{k,i} ← (ỹ_k - y_k)/ε
 7:   end for
 8: end for

evaluation of the residual. Refer to Section 1.3 for further details on finite differences as well as on alternative approaches to their implementation. Potential sparsity of the Jacobian should be exploited to reduce the computational effort as discussed in Section 2.1.3.
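A corresponding C++ sketch of Algorithm 1.2; fd_jacobian and its interface are illustrative, and the residual F is assumed to have the callback signature used in the sketch after Algorithm 1.1.

  #include <vector>
  #include <functional>

  using Vector = std::vector<double>;
  using Matrix = std::vector<std::vector<double>>;

  // Algorithm 1.2: forward finite difference approximation of the Jacobian of F at x.
  // Returns A with A[k][i] ~ dF_k/dx_i; y is set to F(x) as a by-product.
  Matrix fd_jacobian(const std::function<Vector(const Vector&)>& F,
                     const Vector& x, Vector& y, double eps) {
    const std::size_t n = x.size();
    y = F(x);                                 // line 1
    Matrix A(n, Vector(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {     // line 2
      Vector xt = x;
      xt[i] += eps;                           // line 3: perturb the ith component only
      Vector yt = F(xt);                      // line 4
      for (std::size_t k = 0; k < n; ++k)     // lines 5-7: ith column of the Jacobian
        A[k][i] = (yt[k] - y[k]) / eps;
    }
    return A;
  }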

The inherent inaccuracy of the approximation of the Jacobian by finite differences may have a negative impact on the convergence of the Newton algorithm. Exact (up to machine accuracy) Jacobians can be computed by the tangent-linear mode of AD as described in Section 2.1. The corresponding part of the Newton algorithm is replaced by Algorithm 1.3. Columns of the Jacobian are computed in line 3 after setting x^(1) equal to the corresponding Cartesian basis vector in line 2. The value of the residual y is computed in line 3 by the given implementation of the tangent-linear residual F^(1) at almost no extra cost. Details on the construction of F^(1) are discussed in Chapter 2. The superscript (1) is used to denote first-order tangent-linear versions of functions and variables. This notation will be found advantageous for generalization in the context of higher derivatives in Chapter 3.

Example 1.1 For y = F(x) defined as

y_0 = 4 \cdot x_0 \cdot (x_0^2 + x_1^2),
y_1 = 4 \cdot x_1 \cdot (x_0^2 + x_1^2)


Algorithm 1.3 Jacobian accumulation by tangent-linear mode AD in the context of the Newton algorithm for solving the nonlinear system F(x) = 0.

In:
  implementation of the tangent-linear residual F^(1) for computing the residual y = (y_k)_{k=0,...,n-1} ≡ F(x) and its directional derivative y^(1) = (y^(1)_k)_{k=0,...,n-1} ≡ ∇F(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n at the current point x ∈ R^n:
    F^(1) : R^n × R^n → R^n × R^n, (y, y^(1)) = F^(1)(x, x^(1))
  current point: x ∈ R^n

Out:
  residual at the current point: y = F(x) ∈ R^n
  Jacobian of the residual at the current point: A = (a_{k,i})_{k,i=0,...,n-1} = ∇F(x) ∈ R^{n×n}

Algorithm:
 1: for i = 0 to n - 1 do
 2:   x^(1) ← e^i
 3:   (y, y^(1)) = F^(1)(x, x^(1))
 4:   for k = 0 to n - 1 do
 5:     a_{k,i} ← y^(1)_k
 6:   end for
 7: end for

and starting from x^T = (1, 1), a total of 21 Newton iterations are performed by the code in Section C.1.2 to drive the norm of the residual y below 10^{-9}:

 1  x_0 = 0.666667     x_1 = 0.666667     ||y|| = 11.3137
 2  x_0 = 0.444444     x_1 = 0.444444     ||y|| = 3.35221
 3  x_0 = 0.296296     x_1 = 0.296296     ||y|| = 0.993247
 . . .
20  x_0 = 0.000300729  x_1 = 0.000300729  ||y|| = 1.03849e-09
21  x_0 = 0.000200486  x_1 = 0.000200486  ||y|| = 3.07701e-10
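For Example 1.1, a tangent-linear version of F can also be written by hand. The following sketch mimics the calling convention of F^(1) in Algorithm 1.3; the body is a hand-written illustration, not the output of dcc.

  // Tangent-linear version of F from Example 1.1 (hand-written sketch):
  //   y0 = 4*x0*(x0^2 + x1^2),  y1 = 4*x1*(x0^2 + x1^2).
  // Inputs:  x[2] (point), t[2] (tangent-linear direction x^(1)).
  // Outputs: y[2] = F(x),  ty[2] = Jacobian of F at x times t.
  void t1_f(const double* x, const double* t, double* y, double* ty) {
    double s  = x[0] * x[0] + x[1] * x[1];          // s = x0^2 + x1^2
    double ts = 2 * x[0] * t[0] + 2 * x[1] * t[1];  // directional derivative of s
    y[0]  = 4 * x[0] * s;
    ty[0] = 4 * t[0] * s + 4 * x[0] * ts;           // product rule
    y[1]  = 4 * x[1] * s;
    ty[1] = 4 * t[1] * s + 4 * x[1] * ts;
  }

Calling this routine with t set to each Cartesian basis vector, as in Algorithm 1.3, yields the two columns of the Jacobian with machine accuracy.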

The solution of the linear Newton system by a direct method (such as Gaussian LU factorization or Cholesky LL^T factorization if the Jacobian is symmetric positive definite as in the previous example) is an O(n^3) algorithm. Hence, the overall cost of the Newton method is dominated by the solution of the linear system in addition to the accumulation of the Jacobian. The computational complexity of the direct linear solver can be decreased by exploiting possible sparsity of the Jacobian [24].

Alternatively, iterative solvers can be used to approximate the Newton step. Matrix-free implementations of Krylov subspace methods avoid the accumulation of the full Jacobian. Consider, for example, the Conjugate Gradient (CG) algorithm [39] in Algorithm 1.4


Algorithm 1.4 Matrix-free CG algorithm for computing the Newton step in the context of the Newton algorithm for solving the nonlinear system F(x) = 0.

In:
  implementation of the tangent-linear residual F^(1) for computing the residual y ≡ F(x) and its directional derivative y^(1) ≡ ∇F(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n at the current point x ∈ R^n:
    F^(1) : R^n × R^n → R^n × R^n, (y, y^(1)) = F^(1)(x, x^(1))
  starting point for the Newton step: dx ∈ R^n
  upper bound on the norm of the residual -y - ∇F(x) · dx at the approximate solution for the Newton step: ε ∈ R

Out:
  approximate solution for the Newton step: dx ∈ R^n

Algorithm:
 1: x^(1) ← dx
 2: (y, y^(1)) = F^(1)(x, x^(1))
 3: p ← -y - y^(1)
 4: r ← p
 5: while ||r|| ≥ ε do
 6:   x^(1) ← p
 7:   (y, y^(1)) = F^(1)(x, x^(1))
 8:   α ← r^T · r / (p^T · y^(1))
 9:   dx ← dx + α · p
10:   r_prev ← r
11:   r ← r - α · y^(1)
12:   β ← r^T · r / (r_prev^T · r_prev)
13:   p ← r + β · p
14: end while

for symmetric positive definite systems. It aims to drive during each Newton iteration the norm of the residual -y - ∇F · dx toward zero. Note that only the function value (line 3) and projections of the Jacobian (Jacobian-vector products in lines 3, 8, and 11) are required. These directional derivatives are delivered efficiently by a single run of the implementation of the tangent-linear residual F^(1) in lines 2 and 7, respectively. The exact solution is obtained in infinite precision arithmetic after a total of n steps. Approximations of the solution in floating-point arithmetic can often be obtained much sooner provided that a suitable preconditioner is available [58]. For notational simplicity, Algorithm 1.4 assumes that no preconditioner is required.
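A C++ sketch of the matrix-free CG loop of Algorithm 1.4; the callback t1F stands for an implementation of the tangent-linear residual F^(1) (returning F(x) and the Jacobian-vector product), and all names and interfaces are illustrative.

  #include <cmath>
  #include <vector>
  #include <functional>

  using Vector = std::vector<double>;

  static double dot(const Vector& a, const Vector& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
  }

  // Algorithm 1.4: matrix-free CG for the Newton system with right-hand side -F(x).
  // t1F(x, v, y, yd) is assumed to set y = F(x) and yd = Jacobian of F at x times v.
  void newton_step_cg(const std::function<void(const Vector&, const Vector&, Vector&, Vector&)>& t1F,
                      const Vector& x, Vector& dx, double eps) {
    const std::size_t n = x.size();
    Vector y(n), yd(n);
    t1F(x, dx, y, yd);                        // lines 1-2: yd = F'(x)*dx
    Vector r(n), p(n);
    for (std::size_t i = 0; i < n; ++i) r[i] = p[i] = -y[i] - yd[i];  // lines 3-4
    while (std::sqrt(dot(r, r)) >= eps) {     // line 5
      t1F(x, p, y, yd);                       // lines 6-7: yd = F'(x)*p
      double alpha = dot(r, r) / dot(p, yd);  // line 8
      for (std::size_t i = 0; i < n; ++i) dx[i] += alpha * p[i];      // line 9
      double rr_old = dot(r, r);              // line 10 (store ||r||^2 instead of r_prev)
      for (std::size_t i = 0; i < n; ++i) r[i] -= alpha * yd[i];      // line 11
      double beta = dot(r, r) / rr_old;       // line 12
      for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];  // line 13
    }
  }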

Example 1.2 Solutions of systems of nonlinear equations play an important role in various fields of Computational Science and Engineering. They result, for example, from the


discretization of nonlinear PDEs used to model many real-world phenomena. We consider a very simple example for illustration.

The two-dimensional Solid Fuel Ignition (SFI) problem (also known as the Bratu problem) from the MINPACK-2 test problem collection [5] simulates a thermal reaction process in a rigid material. It is given by the elliptic PDE

-\Delta y - \lambda \cdot e^{y} = 0,   (1.1)

where y = y(x_0, x_1) is computed over some bounded domain Ω ⊂ R^2 with boundary ∂Ω ⊂ R^2 and Dirichlet boundary conditions y(x_0, x_1) = g(x_0, x_1) for (x_0, x_1) ∈ ∂Ω. For simplicity, we focus on the unit square Ω = [0,1]^2 and we set

g(x_0, x_1) = \begin{cases} 1 & \text{if } x_0 = 1, \\ 0 & \text{otherwise.} \end{cases}

We use finite differences as a basic discretization method. Its aim is to replace the differential

\Delta y \equiv \frac{\partial^2 y}{\partial x_0^2} + \frac{\partial^2 y}{\partial x_1^2}

with a set of algebraic equations, thus transforming (1.1) into a system of nonlinear equations that can be solved by Algorithm 1.1.

Let Ω be discretized using central finite differences with step size h = 1/s in both the x_0 and x_1 directions. The second derivative with respect to x_0 at some point (x_0^i, x_1^j) (for example, (x_0^2, x_1^2) in Figure 1.1 where s = 4) is approximated based on the finite difference approximation of the first derivative with respect to x_0 at points a = (x_0^i - h/2, x_1^j) and b = (x_0^i + h/2, x_1^j). Similarly, the second derivative with respect to x_1 at the same point

Figure 1.1. Finite difference discretization of the SFI equation (uniform grid over the unit square with mesh width h = 1/s, shown for s = 4; axes x_0 and x_1; the half-step points a, b, c, d are used in the central difference approximations).


is approximated based on the finite difference approximation of the first derivative with respect to x_1 at points c = (x_0^i, x_1^j - h/2) and d = (x_0^i, x_1^j + h/2). As the result of

\frac{\partial y(x_0,x_1)}{\partial x_0}(a) \approx \frac{y_{i,j} - y_{i-1,j}}{h}, \qquad \frac{\partial y(x_0,x_1)}{\partial x_0}(b) \approx \frac{y_{i+1,j} - y_{i,j}}{h},

and

\frac{\partial^2 y(x_0,x_1)}{\partial x_0^2}(x_0^i, x_1^j) \approx \frac{\frac{\partial y(x_0,x_1)}{\partial x_0}(b) - \frac{\partial y(x_0,x_1)}{\partial x_0}(a)}{h},

we have

\frac{\partial^2 y(x_0,x_1)}{\partial x_0^2}(x_0^i, x_1^j) \approx \frac{y_{i+1,j} - 2 \cdot y_{i,j} + y_{i-1,j}}{h^2}.

Similarly,

\frac{\partial^2 y(x_0,x_1)}{\partial x_1^2}(x_0^i, x_1^j) \approx \frac{y_{i,j+1} - 2 \cdot y_{i,j} + y_{i,j-1}}{h^2}

follows from

\frac{\partial^2 y(x_0,x_1)}{\partial x_1^2}(x_0^i, x_1^j) \approx \frac{\frac{\partial y(x_0,x_1)}{\partial x_1}(d) - \frac{\partial y(x_0,x_1)}{\partial x_1}(c)}{h}

with

\frac{\partial y(x_0,x_1)}{\partial x_1}(c) \approx \frac{y_{i,j} - y_{i,j-1}}{h}, \qquad \frac{\partial y(x_0,x_1)}{\partial x_1}(d) \approx \frac{y_{i,j+1} - y_{i,j}}{h}.

Consequently, the system of nonlinear equations to be solved becomes

-4 \cdot y_{i,j} + y_{i+1,j} + y_{i-1,j} + y_{i,j+1} + y_{i,j-1} = -\lambda \cdot h^2 \cdot e^{y_{i,j}}

for i, j = 1, ..., s - 1. Discretization of the boundary conditions yields y_{s,j} = 1 and y_{i,0} = y_{0,j} = y_{i,s} = 0 for i, j = 1, ..., s - 1. A possible implementation is the following:

void f(int s, double *y, double l, double *r) {
  for (int i = 1; i < s; i++)
    for (int j = 1; j < s; j++)
      ...
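A possible completion of the truncated listing, as a sketch: it assumes row-major storage of the (s+1)×(s+1) grid values in y, the parameter l playing the role of λ, the residual of the (s-1)×(s-1) interior points returned in r, and the sign convention of the discretized system above; the index mapping is an assumption, not the book's original code.

  #include <math.h>

  // Residual of the discretized SFI (Bratu) problem on the unit square.
  // y: grid values y[i*(s+1)+j], 0 <= i,j <= s (boundary values included).
  // l: parameter lambda; r: residual at the (s-1)*(s-1) interior points.
  void f(int s, double *y, double l, double *r) {
    double h = 1.0 / s;  // mesh width
    for (int i = 1; i < s; i++)
      for (int j = 1; j < s; j++)
        r[(i - 1) * (s - 1) + (j - 1)] =
            -4 * y[i * (s + 1) + j]
            + y[(i + 1) * (s + 1) + j] + y[(i - 1) * (s + 1) + j]
            + y[i * (s + 1) + j + 1] + y[i * (s + 1) + j - 1]
            + l * h * h * exp(y[i * (s + 1) + j]);
  }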


Table 1.1. Run time statistics for the SFI problem. The solution is computed by the standard Newton algorithm and by a matrix-free implementation of the Newton-CG algorithm. For different resolutions of the mesh (defined by the step size s) and for varying levels of accuracy of the Newton iteration (defined by ε), we list the run time of the Newton algorithm t, the number of Newton iterations i, and the number of Jacobian-vector products computed by calling the tangent-linear routine t1_f.

                      Newton                  Newton-CG
   s     ε         t    i    t1_f          t    i    t1_f
  25   10^-5     6,0    8    4.608       0,0    8     281
  30   10^-5    24,6    8    6.720       0,0    8     333
  35   10^-5    71,4    8    9.248       0,0    8     384
  25   10^-10    6,8    9    5.184       0,0    9     528
  30   10^-10   27,1    9    7.569       0,0    9     629
  35   10^-10   79,6    9   10.404       0,1    9     724
 300   10^-5      -     -       -       31,0    8   2.963
 300   10^-8      -     -       -       44,4    9   4.905
 300   10^-10     -     -       -       53,3    9   5.842

For example, for s = 3 the only unknowns are the values of y at the four interior grid points y_{1,1}, y_{1,2}, y_{2,1}, and y_{2,2}, and hence we obtain the following system of nonlinear equations:

-4 \cdot y_{1,1} + y_{2,1} + y_{1,2} + \lambda \cdot h^2 \cdot e^{y_{1,1}} = 0,
-4 \cdot y_{1,2} + y_{2,2} + y_{1,1} + \lambda \cdot h^2 \cdot e^{y_{1,2}} = 0,
-4 \cdot y_{2,1} + y_{1,1} + y_{2,2} + 1 + \lambda \cdot h^2 \cdot e^{y_{2,1}} = 0,
-4 \cdot y_{2,2} + y_{1,2} + y_{2,1} + 1 + \lambda \cdot h^2 \cdot e^{y_{2,2}} = 0.

The SFI problem will be used within various exercises throughout this book. Its solution using Algorithm 1.1 is discussed in Section 1.4.2.

The Jacobian of the residual used in Section 1.4.2 turns out to be symmetric positive definite. Hence, Algorithm 1.4 can be used for the solution of the linear Newton system. Run time statistics for Algorithm 1.1 using a direct linear solver as well as for a matrix-free implementation based on Algorithm 1.4 are shown in Table 1.1. We start from y(x_0, x_1) = 10 for (x_0, x_1) ∈ Ω \ ∂Ω and λ = 0.5. The CG solver is converged to the same accuracy ε as the enclosing Newton iteration. Its substantial superiority over direct solvers for the given problem is mostly due to the matrix-free implementation and the missing preconditioning. Moreover, sparsity of the Jacobian is not exploited within the direct solver. Exploitation of sparsity speeds up the Jacobian accumulation by performing fewer evaluations of the tangent-linear model and reduces both the memory requirement and the run time of the solution of the linear Newton system when using a direct method. Refer to [24] for details on sparse direct solvers. Existing software includes MUMPS [3], PARDISO [55], SuperLU [23], and UMFPACK [22].

1.1.2 . . . Nonlinear Programming

The following example has been designed to compare the performance of finite difference approximation of first and second derivatives with that of derivative code that computes


exact values in the context of basic unconstrained optimization algorithms. Adjoint code exceeds the efficiency of finite differences by a factor at the order of n. In many cases, this factor makes the difference between derivative-based methods being applicable to large-scale optimization problems or not.

Consider the nonlinear programming problem

\min_{x \in R^n} f(x),

where, for example, the objective function

f(x) = \left( \sum_{i=0}^{n-1} x_i^2 \right)^2    (1.2)

is implemented as follows:

void f(int n, double *x, double &y) {
  y = 0;
  for (int i = 0; i < n; i++) y = y + x[i] * x[i];
  y = y * y;
}

Derivative-based algorithms for the iterative solution of this problem compute a sequence of iterates

x^{k+1} = x^k - \alpha_k \cdot B_k^{-1} \cdot \nabla f(x^k)    (1.3)

for k = 0, 1, ... with a matrix B_k ∈ R^{n×n} and a step length α_k > 0, where

∇f(x^k) denotes the gradient (the transposed single-row Jacobian) of f at the current iterate. Simple first- (B_k is equal to the identity I_n in R^{n×n}) and second-order (B_k is equal to the Hessian ∇²f(x^k) of f at point x^k) methods are discussed below. We aim to find a local minimizer by starting at x_i^0 = 1 for i = 0, ..., n - 1.

    Steepest Descent Algorithm

In the simplest case, (1.3) becomes

x^{k+1} = x^k - \alpha_k \cdot \nabla f(x^k).

The step length α_k > 0 is, for example, chosen by recursive bisection on α_k starting from α_k = 1 (0.5, 0.25, ...) and such that a decrease in the objective value is ensured. This simple method is known as the steepest descent algorithm. It is stated formally in Algorithm 1.5, where it is assumed that a suitable α can always be found. Refer to [50] for details on exceptions.


Algorithm 1.5 Steepest descent algorithm for solving the unconstrained nonlinear programming problem min_{x∈R^n} f(x).

In:
  implementation of the objective y ∈ R at the current point x ∈ R^n:
    f : R^n → R, y = f(x)
  implementation of ∇f for computing the objective y ≡ f(x) and its gradient g ≡ ∇f(x) at the current point x:
    ∇f : R^n → R × R^n, (y, g) = ∇f(x)
  starting point: x ∈ R^n
  upper bound on the gradient norm ||g|| at the approximate minimal point: ε ∈ R

Out:
  approximate minimal value of the objective: y ∈ R
  approximate minimal point: x ∈ R^n

Algorithm:
 1: repeat
 2:   (y, g) = ∇f(x)
 3:   if ||g|| > ε then
 4:     α ← 1
 5:     ỹ ← y
 6:     while ỹ ≥ y do
 7:       x̃ ← x - α · g
 8:       ỹ = f(x̃)
 9:       α ← α/2
10:     end while
11:     x ← x̃
12:   end if
13: until ||g|| ≤ ε
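A C++ driver sketch of Algorithm 1.5; grad_f is assumed to return the objective value together with its gradient (by hand-coding, finite differences, or AD), and all names are illustrative.

  #include <cmath>
  #include <vector>
  #include <functional>

  using Vector = std::vector<double>;

  // Algorithm 1.5: steepest descent with recursive bisection on the step length.
  // grad_f(x, y, g) is assumed to set y = f(x) and g = gradient of f at x.
  void steepest_descent(const std::function<void(const Vector&, double&, Vector&)>& grad_f,
                        const std::function<double(const Vector&)>& f,
                        Vector& x, double eps) {
    const std::size_t n = x.size();
    double y;
    Vector g(n), xt(n);
    while (true) {
      grad_f(x, y, g);                                  // line 2
      double gnorm = 0;
      for (double gi : g) gnorm += gi * gi;
      if (std::sqrt(gnorm) <= eps) break;               // lines 3 / 13
      double alpha = 1, yt;                             // line 4
      do {                                              // lines 6-10: halve alpha until descent
        for (std::size_t i = 0; i < n; ++i) xt[i] = x[i] - alpha * g[i];
        yt = f(xt);
        alpha /= 2;
      } while (yt >= y);
      x = xt;                                           // line 11
    }
  }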

For a given implementation of f, the only nontrivial ingredient of Algorithm 1.5 is the computation of the gradient in line 2. For the given simple example, hand-coding of

\nabla f(x) = \left( 4 \cdot x_i \cdot \sum_{j=0}^{n-1} x_j^2 \right)_{i=0,\dots,n-1}

is certainly an option. This situation will change for more complex objectives implemented as computer programs with many thousand lines of source code. Typically, the efficiency of handwritten derivative code is regarded as close to optimal. While this is a reasonable assumption in many cases, it still depends very much on the author of the derivative code.
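For the objective in (1.2), such a hand-coded routine might look as follows (a sketch; the name g_f and the value-plus-gradient interface are chosen to match the calling convention of ∇f in Algorithm 1.5):

  // Hand-coded value and gradient of f(x) = (sum_i x_i^2)^2 in O(n) operations.
  void g_f(int n, const double *x, double &y, double *g) {
    double s = 0;                        // s = sum_i x_i^2
    for (int i = 0; i < n; i++) s += x[i] * x[i];
    y = s * s;                           // objective value
    for (int i = 0; i < n; i++)
      g[i] = 4 * x[i] * s;               // dy/dx_i = 4 * x_i * s
  }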


Algorithm 1.6 Gradient approximation by (forward) finite differences in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).

In:
  implementation of the objective function: f : R^n → R, y = f(x)
  current point: x ∈ R^n
  perturbation: ε ∈ R

Out:
  objective at the current point: y = f(x) ∈ R
  approximate gradient of the objective at the current point:
    g ≡ (g_i)_{i=0,...,n-1} ≈ ∇f(x) ∈ R^n

Algorithm:
 1: y = f(x)
 2: for i = 0 to n - 1 do
 3:   x̃ ← x + ε · e^i
 4:   ỹ = f(x̃)
 5:   g_i ← (ỹ - y)/ε
 6: end for

Moreover, hand-coding may be infeasible within the time frame allocated to the project. Debugging is likely to occupy the bigger part of the development time. Hence, we aim to build up a set of rules that will allow us to automate the generation of derivative code to the greatest possible extent. All of these rules can be applied manually. Our ultimate goal, however, is the development of corresponding software tools in order to make this process less tedious and less error-prone.

As in Section 1.1.1, the gradient can be approximated using finite difference quotients (see Algorithm 1.6), provided that the computational complexity of O(n) · Cost(f) remains feasible. Refer to Section 1.3 for further details on finite differences. This approach has two major disadvantages. First, the approximation may be poor. Second, a minimum of n + 1 function evaluations are required. If, for example, a single function evaluation takes one minute on the given computer architecture, and if n = 10^6 (corresponding, for example, to a temperature distribution in a very coarse-grain discretization of a global three-dimensional atmospheric model), then a single evaluation of the gradient would take almost two years. Serious climate simulation would not be possible.

The computational complexity is not decreased when using tangent-linear AD as outlined in Algorithm 1.7. Nevertheless, the improved accuracy of the computed gradient may lead to faster convergence of the steepest descent algorithm.

Large-scale and long-term climate simulations are performed by many researchers worldwide. A single function evaluation is likely to run for much longer than one minute,


Algorithm 1.7 Gradient accumulation by tangent-linear mode AD in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).

In:
  implementation of the tangent-linear objective function f^(1) for computing the objective y ≡ f(x) and its directional derivative y^(1) ≡ ∇f(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n at the current point x ∈ R^n:
    f^(1) : R^n × R^n → R × R, (y, y^(1)) = f^(1)(x, x^(1))
  current point: x ∈ R^n

Out:
  objective at the current point: y = f(x) ∈ R
  gradient of the objective at the current point: g ≡ (g_i)_{i=0,...,n-1} = ∇f(x) ∈ R^n

Algorithm:
 1: for i = 0 to n - 1 do
 2:   x^(1) ← e^i
 3:   (y, y^(1)) = f^(1)(x, x^(1))
 4:   g_i ← y^(1)
 5: end for

even on the latest high-performance computer architectures. It may take hours or days to perform climate simulations at physically meaningful spatial discretization levels and over relevant time intervals. Typically only a few runs are feasible. The solution to this problem comes in the form of various flavors of so-called adjoint methods. In particular, adjoint AD allows us to generate for a given implementation of f an adjoint program that computes the gradient ∇f at a computational cost of O(1) · Cost(f). As opposed to finite differences and tangent-linear AD, adjoint AD thus makes the computational cost independent of n. It enables large-scale sensitivity analysis as well as high-dimensional nonlinear optimization and uncertainty quantification for practically relevant problems in science and engineering. This observation is worth highlighting even at this early stage, and it serves as motivation for the better part of the remaining chapters in this book.

Algorithm 1.8 illustrates the use of an adjoint code for f. The adjoint objective is called only once (in line 2). We use the subscript (1) to denote adjoint functions and variables. The advantages of this notation will become obvious in Chapter 3, in the context of higher-order adjoints.

Table 1.2 summarizes the impact of the various differentiation methods on the run time of the steepest descent algorithm when applied to our example problem in (1.2). These results were obtained on a standard Linux PC running the GNU C++ compiler with optimization level 3, which will henceforth be referred to as the reference platform. The numbers illustrate the superiority of adjoint over both tangent-linear code and finite difference


Algorithm 1.8 Gradient accumulation by adjoint mode AD in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).

In:
  implementation of the adjoint objective function f_(1) for computing the objective y ≡ f(x) and the product x_(1) ≡ y_(1) · ∇f(x) of its gradient at the current point x ∈ R^n with a factor y_(1) ∈ R:
    f_(1) : R^n × R → R × R^n, (y, x_(1)) = f_(1)(x, y_(1))
  current point: x ∈ R^n

Out:
  objective at the current point: y = f(x) ∈ R
  gradient of the objective at the current point: g = ∇f(x) ∈ R^n

Algorithm:
 1: y_(1) ← 1
 2: (y, x_(1)) = f_(1)(x, y_(1))
 3: g ← x_(1)
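For the objective in (1.2), a hand-written adjoint routine matching the calling convention of f_(1) in Algorithm 1.8 could look like the following sketch; the name a1_f and the forward/reverse structure are illustrative, and Chapter 2 develops the systematic construction of such code.

  // Adjoint of f(x) = (sum_i x_i^2)^2 (hand-written sketch).
  // Inputs:  x (point), ya (adjoint of the output, y_(1)).
  // Outputs: y = f(x), xa = ya * gradient of f at x.
  void a1_f(int n, const double *x, double &y, double ya, double *xa) {
    // forward sweep: evaluate the function and keep the intermediate s
    double s = 0;
    for (int i = 0; i < n; i++) s += x[i] * x[i];
    y = s * s;
    // reverse sweep: propagate the adjoint from y back to x
    double sa = 2 * s * ya;                // adjoint of s (from y = s*s)
    for (int i = 0; i < n; i++)
      xa[i] = 2 * x[i] * sa;               // adjoint of x_i (from s += x_i^2)
  }

Setting ya = 1 returns the full gradient in xa at a cost of O(1) · Cost(f), independent of n.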

Table 1.2. Run time of the steepest descent algorithm (in seconds and starting from x = 1). The gradient (of size n) is approximated by central finite differences (FD; see Section 1.3) or computed with machine accuracy by a tangent-linear code (TLC; see Section 2.1) or an adjoint code (ADC; see Section 2.2). The tangent-linear and adjoint codes are generated automatically by the derivative code compiler (dcc) (see Chapter 5).

    n      FD    TLC    ADC
  100      13      8    < 1
  200      47     28      1
  300     104     63      2
  400     184    113    2.5
  500     284    173      3
 1000    1129    689      6

approximations in terms of the overall computational effort that is dominated by the cost of the gradient evaluation. Convergence of the steepest descent algorithm is defined as the L2-norm of the gradient falling below 10^{-9}. The steepest descent algorithm is expected to perform a large number of iterations (with potentially very small step sizes) to reach this high level of accuracy. Similar numbers of iterations (over 3 · 10^5) are performed independently of the method used for the evaluation of the gradient. As expected, the step size α_k is reduced to values below 10^{-4} to reach convergence while ensuring strict descent in the objective function value.


    Newton Algorithm

Second-order methods based on the Newton algorithm promise faster convergence in the neighborhood of the minimum by taking into account second derivative information. We consider the Newton algorithm discussed in Section 1.1.1 extended by a local line search to determine α_k for k = 0, 1, ... in (1.3) as

x^{k+1} = x^k - \alpha_k \cdot \nabla^2 f(x^k)^{-1} \cdot \nabla f(x^k).

As in Section 1.1.1, the Newton method is applied to find a stationary point of f by solving the nonlinear system ∇f = 0. The Newton step

dx^k \equiv \nabla^2 f(x^k)^{-1} \cdot \nabla f(x^k)

is obtained as the solution of the linear Newton system

\nabla^2 f(x^k) \cdot dx^k = \nabla f(x^k)

at each iteration. If x^k is far from a solution, then sufficient descent in the residual can be obtained using a local line search to determine α_k such that the L2-norm of the residual at the next iterate is minimized, that is, the scalar nonlinear optimization problem

\min_{\alpha_k} \| \nabla f(x^k - \alpha_k \cdot dx^k) \|_2

needs to be solved. The first and, potentially also required, second derivatives of the objective with respect to α_k can be computed efficiently using the methods discussed in this book. Alternatively, a simple recursive bisection algorithm similar to that used in the steepest descent method can help to improve the robustness of the Newton method.

A formal description of the Newton algorithm for unconstrained nonlinear optimization is given in Algorithm 1.9. Again, convergence is assumed. The computational cost is dominated by the accumulation in lines 1, 3, and 14 of gradient and Hessian and by the solution of the linear Newton system in line 4. Both the gradient and the Hessian should be accurate in order to ensure the expected convergence behavior. Approximation by finite differences may not be good enough due to an inaccurate Hessian in particular.

An algorithmic view of the (second-order) finite difference method for approximating the Hessian is given in Algorithm 1.10. Refer to Section 1.3 for details on first- and second-order finite differences as well as for a description of alternative approaches to their implementation. The shortcomings of finite difference approximation become even more apparent in the second-order case. The inaccuracy is likely to become more significant due to the limitations of floating-point arithmetic. Moreover, O(n^2) function evaluations are required for the approximation of the Hessian.

The first problem can be overcome by applying tangent-linear AD to Algorithm 1.7, yielding Algorithm 1.11. Each of the n calls of the second-order tangent-linear function G^(1), where G ≡ ∇f is defined as in Algorithm 1.7, involves n calls of the tangent-linear code. The overall computational complexity of the Hessian accumulation adds up to O(n^2) · Cost(f). Both the gradient and the Hessian are obtained with machine accuracy. Symmetry of the Hessian is exploited in neither Algorithm 1.10 nor Algorithm 1.11.


Algorithm 1.9 Newton algorithm for solving the unconstrained nonlinear programming problem min_{x∈R^n} f(x).

In:
  implementation of the objective y ∈ R at the current point x ∈ R^n:
    f : R^n → R, y = f(x)
  implementation of the differentiated objective function ∇f for computing the objective y ≡ f(x) and its gradient g ≡ ∇f(x) at the current point x:
    ∇f : R^n → R × R^n, (y, g) = ∇f(x)
  implementation of the differentiated objective function ∇²f for computing the objective y, its gradient g, and its Hessian H ≡ ∇²f(x) at the current point x:
    ∇²f : R^n → R × R^n × R^{n×n}, (y, g, H) = ∇²f(x)
  solver to determine the Newton step dx ∈ R^n as the solution of the linear Newton system H · dx = g:
    s : R^n × R^{n×n} → R^n, dx = s(g, H)
  starting point: x ∈ R^n
  upper bound on the gradient norm ||g|| at the approximate solution: ε ∈ R

Out:
  approximate minimal value: y ∈ R
  approximate minimal point: x ∈ R^n

Algorithm:
 1: (y, g) = ∇f(x)
 2: while ||g|| > ε do
 3:   (y, g, H) = ∇²f(x)
 4:   dx = s(g, H)
 5:   α ← 1
 6:   ỹ ← y
 7:   x̃ ← x
 8:   while ỹ ≥ y do
 9:     x̃ ← x - α · dx
10:     ỹ = f(x̃)
11:     α ← α/2
12:   end while
13:   x ← x̃
14:   (y, g) = ∇f(x)
15: end while


Algorithm 1.10 Hessian approximation by (forward) finite differences in the context of the unconstrained nonlinear programming problem min_{x ∈ R^n} f(x).

In:
• implementation of ∇f for computing the objective y ≈ f(x) and its approximate gradient g = (g_j)_{j=0,...,n−1} ≈ ∇f(x) at the current point x ∈ R^n as defined in Algorithm 1.6: ∇f : R^n → R × R^n, (y, g) = ∇f(x)
• current point: x ∈ R^n
• perturbation: h ∈ R

Out:
• objective function value at the current point: y = f(x)
• approximate gradient of the objective at the current point: g ≈ ∇f(x) ∈ R^n
• approximate Hessian of the objective at the current point: H = (h_{j,i})_{j,i=0,...,n−1} ≈ ∇²f(x) ∈ R^{n×n}

Algorithm:
1: (y, g) = ∇f(x)
2: for i = 0 to n − 1 do
3:   x̃ ← x + h · e_i
4:   (ỹ, g̃) = ∇f(x̃)
5:   for j = 0 to n − 1 do
6:     h_{j,i} ← (g̃_j − g_j)/h
7:   end for
8: end for
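Spelled out in code, Algorithm 1.10 amounts to n + 1 gradient evaluations. The following sketch assumes a gradient routine g_f (itself a finite difference or tangent-linear code) and stores the Hessian row-major; names and interfaces are our assumptions.

#include <vector>

// assumed gradient routine: returns y ~ f(x) and g ~ grad f(x)
void g_f(int n, const double *x, double &y, double *g);

// forward finite difference approximation of the Hessian, one column per call of g_f
void h_ffd(int n, double *x, double &y, double *g, double *H, double h) {
  std::vector<double> gp(n);
  double yp;
  g_f(n, x, y, g);                        // gradient at the base point
  for (int i = 0; i < n; ++i) {
    const double xi = x[i];
    x[i] = xi + h;                        // perturb the ith coordinate
    g_f(n, x, yp, gp.data());
    for (int j = 0; j < n; ++j)
      H[j * n + i] = (gp[j] - g[j]) / h;  // column i of the Hessian approximation
    x[i] = xi;                            // restore
  }
}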

Substantial savings in the computational cost result from performing the gradient computation in adjoint mode. The savings are due to each of the n calls of the second-order adjoint function G^(1), where G ≡ ∇f is defined as in Algorithm 1.8, now involving merely a single call of the adjoint code. The overall computational complexity becomes O(n) · Cost(f) instead of O(n²) · Cost(f).

The tangent-linear version of an adjoint code is referred to as second-order adjoint code. It can be used to compute both the gradient as w · ∇f(x) as well as the projections of the Hessian in direction v ∈ R^n as w · ∇²f(x) · v by setting w = 1. A second-order tangent-linear code (tangent-linear version of the tangent-linear code) can be used to compute single entries of the Hessian as u^T · ∇²f(x) · w by letting u and w range independently over the Cartesian basis vectors in R^n. Consequently, the computational complexity of Hessian approximation using finite differences or the second-order tangent-linear model is O(n²) · Cost(f). Second-order adjoint code delivers the Hessian at a computational cost of O(n) · Cost(f). Savings at the order of n are likely to make the difference between second-order methods being applicable or not. Refer to Table 1.3 for numerical results that support these findings. Note that further combinations of tangent-linear and adjoint AD are possible when computing second derivatives. Refer to Chapter 3 for details.


Algorithm 1.11 Hessian accumulation by second-order AD in the context of the unconstrained nonlinear programming problem min_{x ∈ R^n} f(x).

In:
• implementation of the tangent-linear version G^(1) of the differentiated objective function G ≡ ∇f defined in Algorithm 1.7 (yielding second-order tangent-linear mode AD) or Algorithm 1.8 (yielding second-order adjoint mode AD) for computing the objective y ≈ f(x), its directional derivative y^(1) ≈ ∇f(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n, its gradient g ≈ ∇f(x), and its second directional derivative g^(1) = (g^(1)_j)_{j=0,...,n−1} ≈ ∇²f(x) · x^(1) in direction x^(1) at the current point x ∈ R^n: G^(1) : R^n × R^n → R × R × R^n × R^n, (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
• current point: x ∈ R^n

Out:
• objective function value at the current point: y = f(x)
• gradient at the current point: g = ∇f(x) ∈ R^n
• Hessian at the current point: H = (h_{j,i})_{j,i=0,...,n−1} = ∇²f(x) ∈ R^{n×n}

Algorithm:
1: for i = 0 to n − 1 do
2:   x^(1) ← e_i
3:   (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
4:   for j = 0 to n − 1 do
5:     h_{j,i} ← g^(1)_j
6:   end for
7: end for
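As a driver, Algorithm 1.11 is a loop over the Cartesian basis directions. The sketch below assumes a second-order adjoint (or second-order tangent-linear) routine t2_a1_f with the interface of G^(1); both the routine name and the row-major Hessian storage are our assumptions.

#include <vector>

// assumed second-order model of f: for a direction x1 it returns
// y ~ f(x), y1 ~ grad f(x) * x1, g ~ grad f(x), g1 ~ Hess f(x) * x1
void t2_a1_f(int n, const double *x, const double *x1,
             double &y, double &y1, double *g, double *g1);

// Hessian accumulation: column i is the Hessian-vector product with e_i
void hessian(int n, const double *x, double &y, double *g, double *H) {
  std::vector<double> x1(n, 0.0), g1(n);
  double y1;
  for (int i = 0; i < n; ++i) {
    x1[i] = 1.0;                          // x1 = e_i
    t2_a1_f(n, x, x1.data(), y, y1, g, g1.data());
    for (int j = 0; j < n; ++j)
      H[j * n + i] = g1[j];
    x1[i] = 0.0;
  }
}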

As already observed in Section 1.1.1, the solution of the linear Newton system by a direct method yields a computational complexity at the order of O(n³). Hence, the overall cost of the Newton method applied to the given implementation of (1.2) is dominated by the solution of the linear system in columns SOFD, SOTLC, and SOADC of Table 1.3. Exploitation of possible sparsity of the Hessian can reduce the computational complexity as discussed in Chapter 3. The run time of the Hessian accumulation is higher when using the second-order tangent-linear model (column SOTLC) or, similarly, a second-order finite difference approximation (column SOFD). Use of the second-order adjoint model reduces the computational cost (column SOADC). The Hessian may become indefinite very quickly when using finite difference approximation.

Matrix-free implementations of the Conjugate Gradients solver avoid the accumulation of the full Hessian. Note that in Algorithm 1.12 only the gradient g (line 3) and projections of the Hessian g^(1) (lines 3, 8, and 11) are required. Both are delivered efficiently by a single run of G^(1) (in lines 2 and 7, respectively). Again, preconditioning has been omitted for the sake of notational simplicity. If a second-order adjoint code is used (see Algorithm 1.11), then the computational complexity of evaluating the gradient and a Hessian-vector product is O(1) · Cost(f).


Table 1.3. Run time of the Newton algorithm (in seconds and starting from x = 1). The gradient and Hessian of the given implementation of (1.2) are approximated by second-order central finite differences (SOFD; see Section 1.3) or computed with machine accuracy by a second-order tangent-linear (SOTLC; see Section 3.2) or adjoint (SOADC; see Section 3.3) code. The Newton system is solved using a Cholesky factorization that dominates both the run time and the memory requirement for increasing n due to the relatively low cost of the function evaluation itself. The last column shows the run times for a matrix-free implementation of a Newton–Krylov algorithm that uses the CG algorithm to approximate the Newton step based on the second-order adjoint model. As expected, the algorithm scales well beyond the problem sizes that could be handled by the other three approaches. A run time of more than 1 second is observed only for n ≥ 10^5.

      n      SOFD     SOTLC    SOADC    SOADC (CG)
    100       < 1      < 1      < 1      < 1
    200         2        1      < 1      < 1
    300         7        3        1      < 1
    400        17        9        4      < 1
    500        36       21       10      < 1
   1000       365      231      138      < 1
    ...       ...      ...      ...      ...
   10^5    > 10^4   > 10^4   > 10^4        1

We take this result as further motivation for an in-depth look into the generation of first- and higher-order adjoint code in the following chapters.

    Nonlinear Programming with Constraints

Practically relevant optimization problems are most likely subject to constraints, which are often nonlinear. For example, the solution may be required to satisfy a set of nonlinear PDEs as in many data assimilation problems in the atmospheric sciences. Discretization of the PDEs yields a system of nonlinear algebraic equations to be solved by the solution of the optimization problem.

The core of many algorithms for constrained optimization is the solution of the equality-constrained problem

\[ \min_x \; o(x) \quad \text{subject to} \quad c(x) = 0 , \]

where both the objective o : R^n → R and the constraints c : R^n → R^m are assumed to be twice continuously differentiable within the domain. The first-order Karush–Kuhn–Tucker (KKT) conditions yield the system

\[ \begin{pmatrix} \nabla o(x) - (\nabla c(x))^T \cdot \lambda \\ c(x) \end{pmatrix} = 0 \]

of n + m nonlinear equations in n + m unknowns x ∈ R^n and λ ∈ R^m. The Newton algorithm can be used to solve the KKT system subject to the following conditions: The Jacobian of


Algorithm 1.12 CG algorithm for computing the Newton step in the context of the unconstrained nonlinear programming problem min_{x ∈ R^n} f(x).

In:
• implementation of the tangent-linear version G^(1) of the differentiated objective function G ≡ ∇f defined in Algorithm 1.7 (yielding second-order tangent-linear mode AD) or Algorithm 1.8 (yielding a potentially matrix-free implementation based on second-order adjoint mode AD) for computing the objective y ≈ f(x), its directional derivative y^(1) ≈ ∇f(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n, its gradient g ≈ ∇f(x), and its second directional derivative g^(1) = (g^(1)_j)_{j=0,...,n−1} ≈ ∇²f(x) · x^(1) in direction x^(1) at the current point x ∈ R^n: G^(1) : R^n × R^n → R × R × R^n × R^n, (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
• starting point: dx ∈ R^n
• upper bound on the norm of the residual −g − H · dx, where H ≈ ∇²f(x), at the approximate solution for the Newton step: ε ∈ R

Out:
• approximate solution for the Newton step: dx ∈ R^n

Algorithm:
 1: x^(1) ← dx
 2: (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
 3: p ← −g − g^(1)
 4: r ← p
 5: while ‖r‖ ≥ ε do
 6:   x^(1) ← p
 7:   (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
 8:   α ← r^T · r / (p^T · g^(1))
 9:   dx ← dx + α · p
10:   r_prev ← r
11:   r ← r − α · g^(1)
12:   β ← r^T · r / (r_prev^T · r_prev)
13:   p ← r + β · p
14: end while
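In code, the matrix-free Newton–CG step might look as follows; the only derivative oracle is the second-order adjoint routine t2_a1_f assumed in the sketch after Algorithm 1.11 (one call per Hessian-vector product), and no preconditioning is applied.

#include <cmath>
#include <vector>

// assumed second-order adjoint model of f: returns y, y1,
// g ~ grad f(x), and g1 ~ Hess f(x) * x1 at O(1) * Cost(f)
void t2_a1_f(int n, const double *x, const double *x1,
             double &y, double &y1, double *g, double *g1);

// CG iteration for Hess f(x) * dx = -grad f(x); only Hessian-vector products are formed
void newton_step_cg(int n, const double *x, double *dx, double eps) {
  std::vector<double> g(n), g1(n), p(n), r(n);
  double y, y1;
  auto dot = [n](const std::vector<double> &a, const std::vector<double> &b) {
    double s = 0; for (int i = 0; i < n; ++i) s += a[i] * b[i]; return s;
  };
  t2_a1_f(n, x, dx, y, y1, g.data(), g1.data());         // g1 = H * dx
  for (int i = 0; i < n; ++i) r[i] = p[i] = -g[i] - g1[i];
  double rr = dot(r, r);
  while (std::sqrt(rr) >= eps) {
    t2_a1_f(n, x, p.data(), y, y1, g.data(), g1.data()); // g1 = H * p
    const double alpha = rr / dot(p, g1);
    for (int i = 0; i < n; ++i) { dx[i] += alpha * p[i]; r[i] -= alpha * g1[i]; }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
}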

the constraints ∇c(x) needs to have full row rank; the Hessian

\[ \nabla^2 L(x, \lambda) \equiv \frac{\partial^2 L}{\partial x^2}(x, \lambda) \]

of the Lagrangian L(x, λ) = o(x) − λ^T · c(x) with respect to x needs to be positive definite on the tangent space of the constraints, that is, v^T · ∇²L(x, λ) · v > 0 for all v ≠ 0 for which ∇c(x) · v = 0.


The iteration proceeds as

\[ \begin{pmatrix} x_{k+1} \\ \lambda_{k+1} \end{pmatrix} = \begin{pmatrix} x_k \\ \lambda_k \end{pmatrix} + \begin{pmatrix} dx_k \\ d\lambda_k \end{pmatrix} , \]

where the Newton step is computed as the solution of the linear system

\[ \begin{pmatrix} \nabla^2 L(x_k, \lambda_k) & -(\nabla c(x_k))^T \\ \nabla c(x_k) & 0 \end{pmatrix} \cdot \begin{pmatrix} dx_k \\ d\lambda_k \end{pmatrix} = \begin{pmatrix} (\nabla c(x_k))^T \cdot \lambda_k - \nabla o(x_k) \\ -c(x_k) \end{pmatrix} . \]

Note that

\[ \nabla^2 L(x, \lambda) = \nabla^2 o(x) - \langle \lambda, \nabla^2 c(x) \rangle , \]

where the notation for the projection ⟨λ, ∇²c(x)⟩ of the 3-tensor ∇²c(x) ∈ R^{m×n×n} in direction λ ∈ R^m is formally introduced in Chapter 3.

Many modern algorithms for constrained nonlinear optimization are based on the solution of the KKT system. See, for example, [50] for a comprehensive survey. Our focus is on the efficient provision of the required derivatives.

If a direct linear solver is used, then the following derivatives need to be computed:

• ∇o(x) and ∇²o(x) at O(n) · Cost(o) using a second-order adjoint model of o;

• ∇c(x) and ⟨λ, ∇²c(x)⟩ at O(n) · Cost(c) using a second-order adjoint model of c;

• ⟨λ, ∇c⟩ at O(1) · Cost(c) using the adjoint model of c.

A matrix-free implementation of a Newton–Krylov algorithm requires the following derivatives:

• ∇o(x) and ⟨∇²o(x), v⟩, where v ∈ R^n. Both can be computed at the cost of O(1) · Cost(o) using a second-order adjoint model of o;

• ⟨w, ∇c(x)⟩, ⟨∇c(x), v⟩, and ⟨λ, ∇²c(x), v⟩, where v ∈ R^n and w ∈ R^m. The first- and second-order adjoint projections can be computed at the cost of O(1) · Cost(c) using a second-order adjoint model of c. A tangent-linear model of c permits the evaluation of Jacobian-vector products at the same relative computational cost.

Refer to Section 3.1 for formal definitions of the projection operator ⟨·, ·⟩.
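For orientation, the projections used above admit the following componentwise reading; the formal definition is given in Chapter 3, so the formulas below are one consistent interpretation rather than the definitive notation.

\[
\langle w, \nabla c(x) \rangle = \left( \sum_{k=0}^{m-1} w_k \, \frac{\partial c_k}{\partial x_i}(x) \right)_{i=0,\ldots,n-1} \in \mathbb{R}^n ,
\qquad
\langle \nabla c(x), v \rangle = \nabla c(x) \cdot v \in \mathbb{R}^m ,
\]
\[
\langle \lambda, \nabla^2 c(x) \rangle_{j,i} = \sum_{k=0}^{m-1} \lambda_k \, \frac{\partial^2 c_k}{\partial x_j \, \partial x_i}(x) ,
\qquad
\langle \lambda, \nabla^2 c(x), v \rangle_{j} = \sum_{k=0}^{m-1} \sum_{i=0}^{n-1} \lambda_k \, \frac{\partial^2 c_k}{\partial x_j \, \partial x_i}(x) \cdot v_i .
\]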

A detailed discussion of constrained nonlinear optimization is beyond the scope of this book. Various software packages have been developed to solve this type of problem. Some packages use AD techniques or can be coupled with code generated by AD. Examples include AMPL [26], IPOPT [59], and KNITRO [14]. Both IPOPT and KNITRO can be accessed via the Network-Enabled Optimization Server (NEOS5) maintained by Argonne National Laboratory's Mathematics and Computer Science Division. NEOS uses a variety of AD tools. Refer to the NEOS website for further information.

A case study for the use of AD in the context of constrained nonlinear programming is presented in [41]. Moreover, we discuss various combinatorial issues related to AD and to the use of sparse direct linear solvers. Our derivative code compiler dcc is combined with IPOPT [59] and PARDISO [55] to solve an inverse medium problem.

    5neos.mcs.anl.gov


    1.1.3 . . .Numerical Libraries

The NAG C Library is a highly comprehensive collection of mathematical and statistical algorithms for computational scientists and engineers working with the programming languages C and C++. We use it as a case study for various commercial as well as noncommercial collections of derivative-based numerical algorithms. Their APIs (Application Programming Interfaces) are often very similar.

    Systems of Nonlinear Equations

Function c05ubc of the NAG C Library computes a solution of a system of nonlinear equations F(x) = 0, F : R^n → R^n, by a modification of the Powell hybrid method [53]. It is based on the MINPACK routine HYBRJ1 [47]. The user must provide code for the accumulation of the Jacobian A ≈ ∇F(x) as part of a function

void j_f(Integer n, const double x[], double F[], double A[], ...);

The library provides a custom integer data type Integer. Algorithm 1.2 or preferably Algorithm 1.3 can be used to approximate the Jacobian or to accumulate its entries with machine accuracy at a computational cost of O(n) · Cost(F), respectively.
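A callback in the spirit of Algorithm 1.2 (finite differences) might be sketched as follows. The residual routine F_eval, the row-major storage of A, the stand-in typedef for Integer, and the choice of perturbation are all our assumptions; a tangent-linear code could replace the perturbed evaluations to obtain machine-accurate columns.

#include <vector>

typedef long Integer;   // stand-in for the NAG Integer type (assumption)

// assumed user implementation of the residual F : R^n -> R^n
void F_eval(Integer n, const double x[], double F[]);

// Jacobian callback: fills F with the residual and A (row-major, A[i*n+j] ~ dF_i/dx_j)
// with a forward finite difference approximation, one perturbed evaluation per column
void j_f(Integer n, const double x[], double F[], double A[], ...) {
  const double h = 1e-7;                       // perturbation (assumed choice)
  std::vector<double> xp(x, x + n), Fp(n);
  F_eval(n, x, F);                             // residual at the current point
  for (Integer j = 0; j < n; ++j) {
    xp[j] = x[j] + h;
    F_eval(n, xp.data(), Fp.data());
    for (Integer i = 0; i < n; ++i)
      A[i * n + j] = (Fp[i] - F[i]) / h;
    xp[j] = x[j];
  }
}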

A similar approach can be taken for the integration of stiff systems of ordinary differential equations

\[ \frac{\partial x}{\partial t} = F(t, x) \]

using various NAG C Library routines. The API is similar to the above. The accumulation of the Jacobian of F(t, x) with respect to x is analogous.

    Unconstrained Nonlinear Optimization

The e04dgc section of the library deals with the minimization of an unconstrained nonlinear function f : R^n → R, y = f(x), where n is assumed to be very large. It uses a preconditioned, limited memory quasi-Newton CG method and is based upon algorithm PLMA as described in [33]. The user must provide code for the accumulation of the gradient g ≈ ∇f(x) as part of a function

void g_f(Integer n, const double x[], double *f, double g[], ...);

Both Algorithm 1.2 and Algorithm 1.3 can be used to approximate the gradient or to accumulate its entries with machine accuracy at a computational cost of O(n) · Cost(f), respectively. A better choice is Algorithm 1.8, which delivers the gradient with machine accuracy at a computational cost of O(1) · Cost(f).
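If an adjoint code for f is available, the callback reduces to a thin wrapper: one reverse sweep returns the value and all n gradient entries. The adjoint driver a1_f below is a hypothetical name, as is the Integer stand-in.

typedef long Integer;   // stand-in for the NAG Integer type (assumption)

// assumed adjoint-mode (reverse) driver of f: value y and gradient g at O(1) * Cost(f)
void a1_f(Integer n, const double x[], double &y, double g[]);

// objective/gradient callback: value and machine-accurate gradient from one adjoint sweep
void g_f(Integer n, const double x[], double *f, double g[], ...) {
  double y;
  a1_f(n, x, y, g);
  *f = y;
}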

    Bound-Constrained Nonlinear Optimization

The e04lbc section of the library provides a modified Newton algorithm for finding unconstrained or bound-constrained minima of twice continuously differentiable nonlinear functions f : R^n → R, y = f(x) [32]. The user needs to provide code for the accumulation of the gradient g ≈ ∇f(x) and for the computation of the function value y as part of a function

void g_f(Integer n, const double x[], double *y, double g[], ...);

Moreover, code is required to accumulate the Hessian H ≈ ∇²f(x) inside of

void H_(Integer n, const double x[], double H[], ...);

The gradient should be computed with machine accuracy by Algorithm 1.8 at a computational cost of O(1) · Cost(f). For the Hessian, we can choose second-order finite differences, the second-order tangent-linear model, or the second-order adjoint model. While the first two run at a computational cost of O(n²) · Cost(f), the second-order adjoint code delivers the Hessian with machine accuracy at a computational cost of only O(n) · Cost(f).

    1.2 Manual Differentiation

Closed-form symbolic as well as algorithmic differentiation are based on two key ingredients: First, expressions for partial derivatives of the various arithmetic operations and intrinsic functions provided by programming languages are well known. Second, the chain rule of differential calculus holds.

Theorem 1.3 (Chain Rule of Differential Calculus). Let

\[ y = F(x) = G_1(G_0(x)) \]

such that G_0 : R^n → R^k, z = G_0(x), is differentiable at x and G_1 : R^k → R^m, y = G_1(z), is differentiable at z. Then F is differentiable at x and

\[ \frac{\partial F}{\partial x} = \frac{\partial G_1}{\partial z} \cdot \frac{\partial G_0}{\partial x} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial x} . \]

Proof. See, for example, [4].

Example 1.4 Let y = F(x) = G_1(G_0(x)) such that z = G_0(x_0, x_1) = x_0 · x_1 and

\[ G_1(z) = \begin{pmatrix} \sin(z) \\ \cos(z) \end{pmatrix} . \]

Then,

\[ \frac{\partial F}{\partial x} = \begin{pmatrix} \cos(z) \\ -\sin(z) \end{pmatrix} \cdot \begin{pmatrix} x_1 & x_0 \end{pmatrix}
= \begin{pmatrix} \cos(z) \cdot x_1 & \cos(z) \cdot x_0 \\ -\sin(z) \cdot x_1 & -\sin(z) \cdot x_0 \end{pmatrix}
= \begin{pmatrix} \cos(x_0 \cdot x_1) \cdot x_1 & \cos(x_0 \cdot x_1) \cdot x_0 \\ -\sin(x_0 \cdot x_1) \cdot x_1 & -\sin(x_0 \cdot x_1) \cdot x_0 \end{pmatrix} . \]


The fundamental assumption we make is that at run time a computer program can be regarded as a sequence of assignments with arithmetic operations or intrinsic functions on their right-hand sides. The flow of control does not represent a serious problem as it is resolved uniquely for any given set of inputs.

Definition 1.5. The given implementation of F as a (numerical) program is assumed to decompose into a single assignment code (SAC) at every point of interest as follows: For j = n, ..., n + p + m − 1,

\[ v_j = \varphi_j (v_i)_{i \prec j} , \qquad (1.4) \]

where i ≺ j denotes a direct dependence of v_j on v_i. The transitive closure of this relation is denoted by ≺⁺. The result of each elemental function φ_j is assigned to a unique auxiliary variable v_j. The n independent inputs x_i = v_i, for i = 0, ..., n − 1, are mapped onto m dependent outputs y_j = v_{n+p+j}, for j = 0, ..., m − 1. The values of p intermediate variables v_k are computed for k = n, ..., n + p − 1.

Example 1.6 The SAC for the previous example becomes

v_0 = x_0;  v_1 = x_1
v_2 = v_0 · v_1
v_3 = sin(v_2)
v_4 = cos(v_2)
y_0 = v_3;  y_1 = v_4.
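Spelled out as straight-line code, this SAC might read as follows; the routine name is ours, but the statements mirror Example 1.6 one-to-one.

#include <cmath>

// straight-line code for the SAC of Example 1.6: one unique variable per elemental operation
void F_sac(const double x[2], double y[2]) {
  const double v0 = x[0], v1 = x[1];   // independent inputs
  const double v2 = v0 * v1;           // intermediate
  const double v3 = std::sin(v2);      // dependent outputs
  const double v4 = std::cos(v2);
  y[0] = v3; y[1] = v4;
}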

The SAC induces a directed acyclic graph (DAG) G = (V, E) with integer vertices V = {0, ..., n + p + m − 1} and edges E = {(i, j) | i ≺ j}. The vertices are sorted topologically with respect to variable dependence, that is, ∀ i, j ∈ V : (i, j) ∈ E ⇒ i < j.

The elemental functions φ_j are assumed to possess jointly continuous partial derivatives with respect to their arguments. Association of the local partial derivatives with their corresponding edges in the DAG yields a linearized DAG.

Example 1.7 The linearized DAG for the function F in Example 1.4 is shown in Figure 1.2.

[Figure 1.2. Linearized DAG of F in Example 1.4: inputs v_0 and v_1 feed v_2 = v_0 · v_1 over edges labeled v_1 and v_0; v_2 feeds the outputs v_3 = sin(v_2) and v_4 = cos(v_2) over edges labeled cos(v_2) and −sin(v_2).]


Let A = (a_{i,j}) ≡ ∇F(x). As an immediate consequence of the chain rule, the individual entries of the Jacobian can be computed as

\[ a_{i,j} = \sum_{[i \rightarrow n+p+j]} \; \prod_{(k,l)} c_{l,k} , \qquad (1.5) \]

where

\[ c_{l,k} \equiv \frac{\partial \varphi_l}{\partial v_k} (v_q)_{q \prec l} \]

and [i → n + p + j] denotes the set of all paths that connect the independent vertex i with the dependent vertex n + p + j [6].

Example 1.8 From Figure 1.2 we get immediately

\[ \nabla F = \begin{pmatrix} \cos(v_2) \cdot v_1 & \cos(v_2) \cdot v_0 \\ -\sin(v_2) \cdot v_1 & -\sin(v_2) \cdot v_0 \end{pmatrix}
= \begin{pmatrix} \cos(x_0 \cdot x_1) \cdot x_1 & \cos(x_0 \cdot x_1) \cdot x_0 \\ -\sin(x_0 \cdot x_1) \cdot x_1 & -\sin(x_0 \cdot x_1) \cdot x_0 \end{pmatrix} . \]

Linearized DAGs and the chain rule (for example, formulated as in (1.5)) can be useful tools for the manual differentiation of numerical simulation programs. Nonetheless, this process can be extremely tedious and highly error-prone.

When differentiating computer programs, one aims for a derivative code that covers the entire domain of the original code. Correct derivatives should be computed for any set of inputs for which the underlying function is defined. Manual differentiation is reasonably straightforward for straight-line code, that is, for sequences of assignments that are not interrupted by control flow statements or subprogram calls. In this case, the DAG is static, meaning that its structure remains unchanged for varying values of the inputs. Manual differentiation of computer programs becomes much more challenging under the presence of control flow.

Consider the given implementation of (1.2):

void f(int n, double *x, double &y) {
  y = 0;
  for (int i = 0; i < n; i++) y = y + x[i] * x[i];
  y = y * y;
}


[Figure 1.3. Linearized DAG of the given implementation of y = (Σ_{i=0}^{n−1} x_i²)²: edges labeled 2·x_i for the squaring of the inputs, 1 for the summation, and 2·y for the final squaring.]

Propagation of the local partial derivatives along the paths of the linearized DAG in Figure 1.3 yields the final gradient. While this simple example is certainly manageable, it still demonstrates how painful this procedure is likely to become for complex code involving nontrivial control flow and many subprograms.
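A hand-derived gradient routine for this implementation might look as follows; the helper variable v and the routine name g_f are ours, and the derivative statements simply combine the local partials 2 · x_i (squaring of the inputs), 1 (summation), and 2 · v (final squaring) shown in Figure 1.3.

void g_f(int n, double *x, double &y, double *g) {
  double v = 0;                                   // v = sum of squares
  for (int i = 0; i < n; i++) v = v + x[i] * x[i];
  y = v * v;                                      // y = v^2
  for (int i = 0; i < n; i++)
    g[i] = 2 * v * 2 * x[i];                      // dy/dx_i = 2*v * 2*x_i
}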

Repeating the above process for second derivatives computed by the differentiated gradient code yields the following handwritten Hessian code:

void h_g_f(int n, double *x, double &y, double *g, double *H) {
  y = 0;
  for (int i = 0; i < n; i++)
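For (1.2) the second derivatives can also be written down directly: with v = Σ_i x_i², the gradient is g = 4 · v · x and the Hessian is H = 4 · v · I + 8 · x · xᵀ. The following self-contained sketch (ours, with row-major storage of H) computes value, gradient, and Hessian from these closed-form expressions.

void h_g_f_sketch(int n, const double *x, double &y, double *g, double *H) {
  double v = 0;                                    // v = sum of squares
  for (int i = 0; i < n; i++) v = v + x[i] * x[i];
  y = v * v;                                       // y = v^2
  for (int i = 0; i < n; i++) {
    g[i] = 4 * v * x[i];                           // gradient entries
    for (int j = 0; j < n; j++)
      H[i * n + j] = 8 * x[i] * x[j] + (i == j ? 4 * v : 0);  // Hessian entries
  }
}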


[Figure 1.4. Linearized DAG of the given implementation of the gradient of y = (Σ_{i=0}^{n−1} x_i²)².]

Unfortunately, the symbolic differentiation capabilities of your favorite computer algebra system are unlikely to be of much help. They are not needed for the very simple local partial derivatives and are unable to cope with the control flow. Note that some computer algebra systems such as Maple [29] and Mathematica [61] have recently added AD capabilities to their list of functionalities.

    1.3 Approximation of Derivatives

Despite its obvious drawbacks, finite differences can be a useful tool for debugging and potentially verifying derivative code. If the computed values and their approximation match, then the tested derivative code is probably correct for the given set of inputs. If they do not match, then this may be an indication of an error in the derivative code. However, it may as well be the finite difference approximation that turns out to be wrong. There is no easy bullet-proof check for correctness of derivatives. In the extreme case, you may just have to debug your derivative code line by line. Finite differences applied to carefully selected parts of the original code may provide good support.

Definition 1.9. Let D ⊆ R^n be an open domain and let F : D → R^m,

\[ F = \begin{pmatrix} F_0 \\ \vdots \\ F_{m-1} \end{pmatrix} , \]


be continuously differentiable on D. A forward finite difference approximation of the ith column of the Jacobian

\[ \nabla F(x) = \left( \frac{\partial F_j}{\partial x_i}(x) \right)_{\substack{j=0,\ldots,m-1 \\ i=0,\ldots,n-1}} \]

at point x is computed as

\[ \frac{\partial F}{\partial x_i}(x) \equiv \begin{pmatrix} \frac{\partial F_0}{\partial x_i}(x) \\ \vdots \\ \frac{\partial F_{m-1}}{\partial x_i}(x) \end{pmatrix}
\approx_1 \frac{F(x + h \cdot e_i) - F(x)}{h} , \qquad (1.6) \]

for i = 0, ..., n − 1, and where the ith Cartesian basis vector in R^n is denoted by e_i. A backward finite difference approximation of the ith column of the same Jacobian is computed as

\[ \frac{\partial F}{\partial x_i}(x) \approx_1 \frac{F(x) - F(x - h \cdot e_i)}{h} . \qquad (1.7) \]

The first-order accuracy (denoted by ≈₁) of unidirectional finite differences follows immediately from the Taylor expansion of F at x. For the sake of simplicity in the notation, we consider the univariate scalar case where f : R → R. Without loss of generality, let f be infinitely often continuously differentiable at x ∈ R. Then, the Taylor expansion of f at x is given by

\[ f(\tilde{x}) = f(x) + \frac{\partial f}{\partial x}(x) \cdot (\tilde{x} - x)
+ \frac{1}{2!} \frac{\partial^2 f}{\partial x^2}(x) \cdot (\tilde{x} - x)^2
+ \frac{1}{3!} \frac{\partial^3 f}{\partial x^3}(x) \cdot (\tilde{x} - x)^3 + \cdots . \qquad (1.8) \]

For x̃ = x + h we get

\[ f(x + h) = f(x) + \frac{\partial f}{\partial x}(x) \cdot h
+ \frac{1}{2!} \frac{\partial^2 f}{\partial x^2}(x) \cdot h^2
+ \frac{1}{3!} \frac{\partial^3 f}{\partial x^3}(x) \cdot h^3 + \cdots , \qquad (1.9) \]

and similarly for x̃ = x − h

\[ f(x - h) = f(x) - \frac{\partial f}{\partial x}(x) \cdot h
+ \frac{1}{2!} \frac{\partial^2 f}{\partial x^2}(x) \cdot h^2
- \frac{1}{3!} \frac{\partial^3 f}{\partial x^3}(x) \cdot h^3 + \cdots . \qquad (1.10) \]

Truncation after the first derivative terms of (1.9) and (1.10) yields scalar univariate versions of (1.6) and (1.7), respectively. For 0 < h ≪ 1, the truncation error is dominated by the value of the h² term, which implies that only accuracy up to the order of h (first-order accuracy) can be expected.


Second-order accuracy (≈₂) follows immediately from (1.8). Subtraction of (1.10) from (1.9) yields

\[
\begin{aligned}
f(x + h) - f(x - h)
&= f(x) + \frac{\partial f}{\partial x}(x) \cdot h + \frac{1}{2!} \frac{\partial^2 f}{\partial x^2}(x) \cdot h^2 + \frac{1}{3!} \frac{\partial^3 f}{\partial x^3}(x) \cdot h^3 + \cdots \\
&\quad - f(x) + \frac{\partial f}{\partial x}(x) \cdot h - \frac{1}{2!} \frac{\partial^2 f}{\partial x^2}(x) \cdot h^2 + \frac{1}{3!} \frac{\partial^3 f}{\partial x^3}(x) \cdot h^3 - \cdots \\
&= 2 \cdot \frac{\partial f}{\partial x}(x) \cdot h + \frac{2}{3!} \cdot \frac{\partial^3 f}{\partial x^3}(x) \cdot h^3 + \cdots .
\end{aligned}
\]

Truncation after the first derivative term yields the scalar univariate version of (1.11). For small values of h, the truncation error is dominated by the value of the h³ term, which implies that only accuracy up to the order of h² (second-order accuracy) can be expected.

Example 1.11 The gradient of the given implementation of (1.2) is accumulated by forward finite differences with perturbation h = 10^{-9} as follows:

 1  void g_f_ffd(int n, double *x, double &y, double *g) {
 2    const double h = 1e-9;
 3    double y_ph;
 4    double *x_ph = new double[n];
 5    for (int i = 0; i < n; i++)


14    }
15    f(n, x, y);
16    delete [] x_ph;
17    delete [] x_mh;
18  }

The call of f at the original point x in line 15 ensures the return of the correct function value y.

Directional derivatives can be approximated by forward finite differences as

\[ y^{(1)} \approx \nabla F(x) \cdot x^{(1)} \approx_1 \frac{F(x + h \cdot x^{(1)}) - F(x)}{h} . \]

The correctness of this statement follows immediately from the forward finite difference approximation of the Jacobian of F(x + s · x^{(1)}) at point s = 0 as

\[ y^{(1)} \approx \nabla F(x) \cdot x^{(1)} \approx_1
\left. \frac{F(x + (s + h) \cdot x^{(1)}) - F(x + s \cdot x^{(1)})}{h} \right|_{s=0}
= \frac{F(x + h \cdot x^{(1)}) - F(x)}{h} . \]

Similarly, y^{(1)} can be approximated by backward finite differences

\[ y^{(1)} \approx \nabla F(x) \cdot x^{(1)} \approx_1 \frac{F(x) - F(x - h \cdot x^{(1)})}{h} \]

or by central finite differences

\[ y^{(1)} \approx \nabla F(x) \cdot x^{(1)} \approx_2 \frac{F(x + h \cdot x^{(1)}) - F(x - h \cdot x^{(1)})}{2 \cdot h} . \]

The classical definition of derivatives as limits of finite difference quotients suggests that the quality of an approximation is improved by making the perturbation smaller. Unfortunately, this assumption is not valid in finite precision arithmetic such as implemented by today's computers.

The way in which numbers are represented on a computer is defined by the IEEE 754 standard [1]. Real numbers x ∈ R are represented with base β, precision t, and exponent range [L, U] as

\[ x = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_{t-1}}{\beta^{t-1}} \right) \cdot \beta^e , \]

where 0 ≤ d_i ≤ β − 1 for i = 0, ..., t − 1, and L ≤ e ≤ U. The string of base-β digits m = d_0 d_1 ... d_{t−1} is called the mantissa and e is called the exponent. A floating-point system is normalized if d_0 ≠ 0 unless x = 0, that is, 1 ≤ m < β. We assume β = 2, as this is the case for almost any existing computer.


Example 1.12 Let β = 2, t = 3, and [L, U] = [−1, 1]. The corresponding normalized floating-point number system contains the following 25 elements:

0,
±1.00_2 · 2^{−1} = ±0.5_{10},   ±1.01_2 · 2^{−1} = ±0.625_{10},
±1.10_2 · 2^{−1} = ±0.75_{10},  ±1.11_2 · 2^{−1} = ±0.875_{10},
±1.00_2 · 2^{0}  = ±1_{10},     ±1.01_2 · 2^{0}  = ±1.25_{10},
±1.10_2 · 2^{0}  = ±1.5_{10},   ±1.11_2 · 2^{0}  = ±1.75_{10},
±1.00_2 · 2^{1}  = ±2_{10},     ±1.01_2 · 2^{1}  = ±2.5_{10},
±1.10_2 · 2^{1}  = ±3_{10},     ±1.11_2 · 2^{1}  = ±3.5_{10}.

The IEEE single precision floating-point number data type float uses 32 bits: 23 bits for its mantissa, 8 bits for the exponent, and one sign bit. In decimal representation, we get 6 significant digits with minimal and maximal absolute values of 1.17549e-38 and 3.40282e+38, respectively. The stored exponent is biased by adding 2^7 − 1 = 127 to its actual signed value.

The double precision floating-point number data type double uses 64 bits: 52 bits for its mantissa, 11 bits for the exponent, and one sign bit. In decimal representation we get 15 significant digits with minimal and maximal absolute values of 2.22507e-308 and 1.79769e+308, respectively. The stored biased exponent is obtained by adding 2^10 − 1 = 1023 to its actual signed value. Higher precision floating-point types are defined accordingly.
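These limits can be inspected directly with std::numeric_limits; the small program below prints the number of significant decimal digits and the smallest and largest positive normalized values for both types.

#include <iostream>
#include <limits>

int main() {
  std::cout << std::numeric_limits<float>::digits10 << " "     // 6 significant decimal digits
            << std::numeric_limits<float>::min() << " "        // ~1.17549e-38
            << std::numeric_limits<float>::max() << "\n"       // ~3.40282e+38
            << std::numeric_limits<double>::digits10 << " "    // 15 significant decimal digits
            << std::numeric_limits<double>::min() << " "       // ~2.22507e-308
            << std::numeric_limits<double>::max() << "\n";     // ~1.79769e+308
  return 0;
}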

If x ∈ R is not exactly representable in the given floating-point number system, then it must be approximated by a nearby floating-point number. This process is known as rounding. The default algorithm for rounding in binary floating-point arithmetic is rounding to nearest, where x is represented by the nearest floating-point number. Ties are resolved by choosing the floating-point number whose last stored digit is even, i.e., equal to zero in binary floating-point arithmetic.

Example 1.13 In the previously discussed (β = 2, t = 3, [L, U] = [−1, 1]) floating-point number system, the decimal value 1.126 is represented as 1.25 when rounding to nearest. Tie breaking toward the trailing zero in the mantissa yields 1 for 1.125.

Let x and y be two floating-point numbers that agree in all but the last few digits. If we compute z = x − y, then z may only have a few digits of accuracy due to cancellation. Subsequent use of z in a calculation may impact the accuracy of the result negatively. Finite difference approximation of derivatives is a prime example for potentially catastrophic loss in accuracy due to cancellation and rounding.

Example 1.14 Consider the approximation of the first derivative of y = f(x) = x in single precision IEEE floating-point arithmetic at x = 10^6 by the forward finite difference quotient

\[ \frac{\partial f}{\partial x}(x) \approx \frac{f(x + h) - f(x)}{h} \]


with h = 0.1. Obviously, ∂f/∂x(x) = 1 independent of x. Still, the code

...
float x = 1e6, h = 1e-1;
cout << ((x + h) - x) / h << endl;


Proof. (1.12) follows immediately from

\[
\frac{\partial^2 F_k}{\partial x_i \, \partial x_j}(x)
\approx \frac{\frac{\partial F_k}{\partial x_i}(x + h \cdot e_j) - \frac{\partial F_k}{\partial x_i}(x - h \cdot e_j)}{2 \cdot h}
= \left( \frac{F_k(x + h \cdot e_j + h \cdot e_i) - F_k(x + h \cdot e_j - h \cdot e_i)}{2 \cdot h}
       - \frac{F_k(x - h \cdot e_j + h \cdot e_i) - F_k(x - h \cdot e_j - h \cdot e_i)}{2 \cdot h} \right) / (2 \cdot h) .
\]

For f : R → R,