Basic Analysis II: A Modern Calculus in Many Variables

The cephalopods were eager to learn. Now both squid and octopi are coming to the lectures.

James K. Peterson
Department of Mathematical Sciences
Clemson University
email: [email protected]

© James K. Peterson
First Edition

Gneural Gnome Press

Version 07.10.2018 : Compiled July 10, 2018


Dedication

I dedicate this work to all of my students who have been learning these ideas of analysis through my courses. I have learned as much from them as I hope they have from me. I am a firm believer that all my students are capable of excellence and that the only path to excellence is through discipline and study. I am always proud of my students for doing so well on this journey. I hope these notes in turn make you proud of my efforts.

Abstract

This book introduces you to more ideas from multidimensional calculus. In addition to classical approaches, we also discuss things like one-forms, a little algebraic topology, and a reasonable coverage of the multidimensional versions of the one variable Fundamental Theorem of Calculus ideas. We also continue your training in the abstract way of looking at the world. We feel that is a most important skill to have when your life's work will involve quantitative modeling to gain insight into the real world.

Acknowledgements

Since we regained the ability to run in 2009, we are less grumpy than usual. Running on the local trails in the forest helps with the stress levels and counteracts our most important philosophical view: "There is always room for a donut!" Of course, now that Jim is a bit older, all those donuts, cookies and so forth have had to be cut back a bit. It turns out that as we age, our physiology slows to a crawl and even a cookie crumb expands into several pounds along the old waistline. However, we still look wistfully at all the sweets even though we have fewer samples!

Our daily coffee, half decaf now, is still good though and, holding a cup in our hands, we wish to thank all of you who have been students for helping us by listening to what we say in the lectures and finding our typographical errors and other mistakes. We are always hopeful that our efforts help students to some extent and also impart some of our obvious enthusiasm for the subject. Of course, the reality is that we have been damaging students for years by forcing them to learn these abstract things. This is why we tell people at parties we are a roofer or electrician. If we are identified as a mathematician, it could go badly given the terrible things we inflict on the students in the classes. Who knows whom they have told about our tortuous methods. Hence, anonymity is best.

Still, by writing these notes, we have gone public. Sigh. This could be bad. So before we are taken out with a very public hit, we, of course, want to thank our family for all their help in many small and big ways. It is amazing to me that we have been teaching this material since our children were in grade school!

We also want to acknowledge the great debt we have to our wife, Pauli, for her patience in dealing with those vacant stares and the long hours spent in typing and thinking. You are the love of my life.

The cover for this book is an original painting by me which was done in July 2017. It shows the moment when cephalopods first reached out to ask to be taught advanced mathematics. We were awed by their trust in picking us.

History: based on handwritten notes for Modern Analysis, MATH 9740 (2007) and MATH 4500 (2015).


Table Of Contents

I Introduction

1 Beginning Remarks
  1.1 Table of Contents

II Linear Mappings

2 Preliminaries
  2.1 The Basics
  2.2 Some Topology in $\Re^2$
    2.2.1 Homework
  2.3 Bolzano - Weierstrass in $\Re^2$
    2.3.1 Homework

3 Vector Spaces
  3.1 Introduction
  3.2 Vector Spaces over a Field
    3.2.1 The Span of a Set of Vectors
  3.3 Inner Products
    3.3.1 Homework
  3.4 Examples
    3.4.1 Two Dimensional Vectors in the Plane
    3.4.2 The Connection between Two Orthonormal Bases
    3.4.3 The Invariance of The Inner Product
    3.4.4 Two Dimensional Vectors As Functions
    3.4.5 Three Dimensional Vectors in Space
    3.4.6 The Solution Space of Higher Dimensional ODE Systems
  3.5 Best Approximation in a Vector Space With Inner Product
    3.5.1 Homework

4 Linear Transformations
  4.1 Organizing Point Cloud Data
    4.1.1 Homework
  4.2 Linear Transformations
    4.2.1 Homework
  4.3 Sequence Spaces Revisited
    4.3.1 Homework
  4.4 Linear Transformations Between Normed Linear Spaces
    4.4.1 Basic Properties
    4.4.2 Mappings between Finite Dimensional Vector Spaces
  4.5 Magnitudes of Linear Transformations

5 Symmetric Matrices
  5.1 The General Two by Two Symmetric Matrix
    5.1.1 Examples
    5.1.2 A Canonical Form
    5.1.3 Two Dimensional Rotations
    5.1.4 Homework
  5.2 Rotating Surfaces
    5.2.1 Homework
  5.3 A Complex ODE System Example
    5.3.1 The General Real and Complex Solution
    5.3.2 Rewriting The Real Solution
    5.3.3 Signed Definite Matrices
    5.3.4 Summarizing

6 Matrix Sequences and Ordinary Differential Equations
  6.1 Linear Systems of ODE
    6.1.1 The Characteristic Equation
    6.1.2 Finding The General Solution
    6.1.3 Homework
  6.2 Symmetric Systems of ODEs
    6.2.1 Writing The Solution Another Way
    6.2.2 Homework

7 Continuity and Topology
  7.1 Topology in n Dimensions
    7.1.1 Homework
  7.2 Cauchy Sequences
  7.3 Compactness
  7.4 Functions of Many Variables
    7.4.1 Limits and Continuity for Functions of Many Variables
    7.4.2 Homework

8 Abstract Symmetric Matrices
  8.1 Input - Output Ratios for Matrices
    8.1.1 Homework
  8.2 The Norm of a Symmetric Matrix
    8.2.1 Constructing Eigenvalues
  8.3 What Does This Mean?
    8.3.1 A Worked Out Example
    8.3.2 Homework
  8.4 Signed Definite Matrices Again
    8.4.1 Homework

9 Rotations and Orbital Mechanics
  9.1 Introduction
    9.1.1 Homework
  9.2 Orbital Planes
    9.2.1 Orbital Constants
    9.2.2 The Orbital Motion is a Conic
    9.2.3 The Constant B Vector
    9.2.4 The Orbital Conic
  9.3 Three Dimensional Rotations
    9.3.1 Homework
    9.3.2 Drawing Rotations
    9.3.3 Rotated Ellipses
  9.4 Drawing the Orbital Plane
    9.4.1 The Perifocal Coordinate System
    9.4.2 Orbital Elements
  9.5 Drawing the Orbital Plane In MatLab
    9.5.1 Homework

10 Determinants and Matrix Manipulations
  10.1 Determinants
    10.1.1 Consequences One
    10.1.2 Homework
    10.1.3 Consequences Two
    10.1.4 Homework
  10.2 Matrix Manipulation
    10.2.1 Elementary Row Operations and Determinants
    10.2.2 MatLab Implementations
    10.2.3 Matrix Inverse Calculations
  10.3 Back To Definite Matrices
    10.3.1 Matrix Minors

III Calculus of Many Variables

11 Differentiability
  11.1 Partial Derivatives
    11.1.1 Homework
  11.2 Tangent Planes
  11.3 Derivatives for Scalar Functions of n Variables
    11.3.1 The Chain Rule for Scalar Functions of n Variables
  11.4 Partials and Differentiability
    11.4.1 Partials Can Exist but not be Continuous
    11.4.2 Higher Order Partials
    11.4.3 When Do Mixed Partials Match?
  11.5 Derivatives for Vector Functions of n Variables
    11.5.1 The Chain Rule For Vector Valued Functions
  11.6 Tangent Plane Error
    11.6.1 The Mean Value Theorem
    11.6.2 Hessian Approximations
  11.7 A Specific Coordinate Transformation
    11.7.1 Homework

12 Multivariable Extremal Theory
  12.1 Differentiability and Extremals
  12.2 Second Order Extremal Conditions
    12.2.1 Positive and Negative Definite Hessians
    12.2.2 Saddles
    12.2.3 Expressing Conditions in Terms of Partials

13 The Inverse and Implicit Function Theorems
  13.1 Mappings
  13.2 Invertibility Results
    13.2.1 Homework
  13.3 Implicit Function Results
    13.3.1 Homework
  13.4 Constrained Optimization
    13.4.1 What Does the Lagrange Multiplier Mean?
    13.4.2 Homework

14 Linear Approximation Applications
  14.1 Linear Approximations to Nonlinear ODE
    14.1.1 An Insulin Model
    14.1.2 An Autoimmune Model
  14.2 Finite Difference Approximations in PDE
    14.2.1 First Order Approximations
    14.2.2 Second Order Approximations
    14.2.3 Homework
  14.3 FD Approximations
    14.3.1 Error Analysis
    14.3.2 Homework
  14.4 FD Diffusion Code
    14.4.1 Homework

IV Integration

15 Integration in Multiple Dimensions
  15.1 The Darboux Integral
    15.1.1 Homework
  15.2 The Riemann Integral in n Dimensions
    15.2.1 Homework
  15.3 Volume Zero and Measure Zero
    15.3.1 Measure Zero
    15.3.2 Volume Zero
  15.4 When is a Function Riemann Integrable?
    15.4.1 Homework
  15.5 Integration and Sets of Measure Zero

16 Change of Variables and Fubini's Theorem
  16.1 Linear Maps
    16.1.1 Homework
  16.2 The Change of Variable Theorem
    16.2.1 Homework
  16.3 Fubini Type Results
    16.3.1 Fubini on a Rectangle
    16.3.2 Homework

17 Line Integrals
  17.1 Paths
    17.1.1 Homework
  17.2 Conservative Force Fields
    17.2.1 Homework
  17.3 Potential Functions
    17.3.1 Homework
  17.4 Finding Potential Functions
    17.4.1 Homework
  17.5 Green's Theorem
    17.5.1 Homework
  17.6 Green's Theorem for Images of the Unit Square
    17.6.1 Homework
  17.7 Motivational Notation
    17.7.1 Homework

V Differential Forms

18 Differential Forms
  18.1 One Forms
    18.1.1 A Simple Example
  18.2 Exact and Closed 1 - Forms
  18.3 Two and Three Forms
  18.4 Exterior Derivatives
  18.5 Integration of Forms
  18.6 Green's Theorem Revisited

VI Applications

19 The Exponential Matrix
  19.1 The Exponential Matrix
    19.1.1 Homework
  19.2 The Jordan Canonical Form
    19.2.1 Homework
  19.3 Exponential Matrix Calculations
    19.3.1 Jordan Block Matrices
    19.3.2 General Matrices
    19.3.3 Homework
  19.4 Applications to Linear ODE
    19.4.1 The Homogeneous Solution
    19.4.2 Homework
  19.5 The Non homogeneous Solution
    19.5.1 Homework
  19.6 A Diagonalizable Test Problem
    19.6.1 Homework
  19.7 A Non diagonalizable Test Problem
    19.7.1 Homework
  19.8 A Non homogeneous Problem
    19.8.1 Liouville's Formula
    19.8.2 Finding the Particular Solution

20 Nonlinear Parametric Optimization Theory
  20.1 The More Precise and Careful Way
  20.2 Unconstrained Parametric Optimization
  20.3 Constrained Parametric Optimization
    20.3.1 Hessian Error Estimates
    20.3.2 A First Look at Constraint Satisfaction
    20.3.3 Constraint Satisfaction and the Implicit Function Theorem
  20.4 Lagrange Multipliers

VII Summing It All Up

21 Summing It All Up

VIII References

IX Detailed Indices


Part I

Introduction


Chapter 1

Beginning Remarks

The first primer essentially discusses the abstract concepts underlying the study of calculus on the real line. A few higher dimensional concepts are touched on, such as the development of rudimentary topology in $\Re^2$ and $\Re^3$, compactness and the tests for extrema for functions of two variables, but that is not a proper study of calculus concepts in two or more variables. A full discussion of the $\Re^n$ based calculus is quite complex and even this second primer cannot cover all the important things. The chosen focus here is on differentiation in $\Re^n$ and important concepts about mappings from $\Re^n$ to $\Re^m$ such as the inverse and implicit function theorems and change of variable formulae for multidimensional integration. These topics alone require much discussion and setup. These topics intersect nicely with many other important applied and theoretical areas which are no longer covered in mathematical science curricula. The knowledge here allows quantitatively inclined biologists and other scientists to more properly develop multivariable nonlinear ODE models for themselves, and physicists to learn the proper background to study differential geometry and manifolds, among many other applications.

However, this course is just not taught at all anymore. It is material that students at all levels must figure out on their own. Most of the textbooks here are extremely terse and hard to follow as they assume a lot of abstract sophistication from their readers. This text is designed to be a self-study guide to this material. It is also designed to be taught from, but at my institution, it would be very difficult to find the requisite 10 students to register so that the course could be taught. Students who are coming in as Master's students generally do not have the ability to allocate a semester course like this to learn this material. Instead, even if they have a solid introduction from the primer on analysis, they typically jump to Primer Three, which is an introduction to very abstract concepts such as metric spaces, normed linear spaces and inner product spaces along with many other needed deeper ideas. Such a transition is always problematic and the student is always trying to catch up on the holes in their background.

Also, many multivariable concepts and the associated theory are used in probability, operations research and optimization, to name a few. In those courses, $\Re^n$ based ideas must be introduced and used despite the students not having a solid foundational course in such things. Hence, good students are reading about this themselves. This primer is intended to give them a reasonable book to read and study from.

We also provide a nice introduction to line integrals and Green's Theorem in the plane and a re-branding of this material in terms of differential forms. New models of signals applied to a complex system use these sorts of ideas and it is a nice way to plant such a pointer to future things. We also have many examples of how we use these ideas in optimization and approximation algorithms.

Some key features of this approach to explaining this material are:

• A very careful discussion of some of the many important concepts, as we believe it is more important to teach an important subset of this material well rather than teaching a large quantity poorly. Hence, the coverage of the textbook is chosen to provide detailed coverage of an appropriate roadmap through this block of material.

• The discussion and associated online lecture material is also designed for self-study, as greater mathematical maturity and training is desperately needed in the biological sciences and physical sciences at this critical junior - senior college level.

• Many pointers to ideas and concepts that are extensions of what is covered in the text, so that students know this is just the first step of the journey.

• The emphasis is on learning how to think, as no one can take all the courses they need in a tidy linear order. Hence, the ability to read and assess on their own is an essential skill to foster.

• An emphasis on learning how to understand the consequences of assumptions using a variety of tools to provide the proofs of propositions.

• An emphasis on learning how to use abstraction as a key skill in solving problems.

In what follows, I will outline a set of five primers on analysis. Each course has a web site for it which contains presentations and other material, so feel free to download and use any material on those sites. These courses are:

• First Half of Primer One for Analysis: MATH 4530 here. The web site is at The Primer On Analysis I: Part One Home Page.

• Second Half of Primer One for Analysis: MATH 4540 here. The web site is at The Primer On Analysis I: Part Two Home Page.

• Second Primer on Analysis: this is material we do not teach: essentially calculus in $\Re^n$ ideas. The web site is at The Primer On Analysis II Home Page.

• Third Primer on Analysis: MATH 8210 here. The web site is at The Primer On Analysis III: Linear Analysis Home Page.

• Fourth Primer on Analysis: MATH 8220 here. The web site is at The Primer On Analysis IV: Measure Theory Home Page.

• Fifth Primer on Analysis: MATH 9270 here. The web site is at The Primer On Analysis V: Functional Analysis and Topology Home Page.

All of these books will be published in 2019 as the set (Peterson (8) 2019), (Peterson (10) 2019), (Peterson (9) 2019), (Peterson (7) 2019) and (Peterson (6) 2019).

1.1 Table of Contents

Part One is concerned with some preliminary material: Preliminaries, Chapter 2.

Part Two is devoted to Linear Mappings and begins with a careful explanation of the ideas of vectorspaces in Vector Spaces, Chapter 3. We use the ideas of vector spaces so much it is important to seethis structure.

In Linear Transformations, Chapter 4, we look in detail at how we manage point cloud data and what linear transformations between finite dimensional vector spaces mean. Then, since the special class of symmetric matrices is so important, we start a detailed discussion of their properties in Symmetric Matrices, Chapter 5, for the two dimensional case. This includes their use in rotating surfaces


and ODE models. We then look at how these ideas are applied in linear systems of ODEs in Matrix Sequences and Ordinary Differential Equations, Chapter 6.

In Continuity and Topology, Chapter 7, we discuss continuity and compactness for functions of many variables. We then look at the n dimensional symmetric matrix and derive its eigenvalue and eigenvector structure carefully. This is almost the same proof we would use for the case of a self-adjoint linear operator in functional analysis (see (Peterson (9) 2019) and (Peterson (6) 2019)), but the tools we have now are set in $\Re^n$, so we cannot do that more general case here.

In Rotations and Orbital Mechanics, Chapter 9, we go over a complicated example of how we use two dimensional spaces in a real setting which involves determining the motion of a satellite around the earth. We finish Part Two with a full treatment of the determinant function and include a fair bit of code to help you see nontrivial examples.

In Calculus of Many Variables, Part Three, we discuss the idea of derivative for functions of many variables in Differentiability, Chapter 11. This sets the stage for a treatment of higher dimensional extremal theory in Multivariable Extremal Theory, Chapter 12. We then cover the important topics of the inverse and implicit function theorems in The Inverse and Implicit Function Theorems, Chapter 13, and finish with how we apply these ideas to some applications in Linear Approximation Applications, Chapter 14.

In Integration, Part Four, we go over the theory of integration in $\Re^n$, which is understandably more difficult than the treatment of Riemann integration on an interval in $\Re$. This is done in Integration in Multiple Dimensions, Chapter 15. We must include a careful discussion of measure zero extended to $\Re^n$ and the idea of a multidimensional volume for a subset which is not a rectangle. We then go over Change of Variables and Fubini's Theorem in Chapter 16. We finish with a long discussion of line integrals and their connection to Green's Theorem in Line Integrals, Chapter 17.

The notion of a line integral is then redone as a differential form in Part Five, Differential Forms, as Chapter 18.

Applications, Part Six, is then about ODE applications again and optimization. In The Exponential Matrix, Chapter 19, we define $e^{A}$ for a matrix $A$ and show how it is useful as a matrix of solutions to a linear system of ODEs. We introduce and use the idea of the Jordan Canonical Form here. We finish the text with a proof of a standard result in constrained optimization in Nonlinear Parametric Optimization Theory, Chapter 20.

There is much more we could have done, but these topics are a nice introduction into the further use of abstraction in modeling.

Jim Peterson
Department of Mathematical Sciences
Clemson University
July 10, 2018


Part II

Linear Mappings


Chapter 2

Preliminaries

We begin with some preliminaries. We do discuss these things in (Peterson (8) 2019) but a recap is always good and it is nice to keep stuff reasonably self-contained.

2.1 The Basics

Let S be a set of real numbers. We say S is bounded if there is a number $M > 0$ so that $|x| \le M$ for all $x \in S$. For example, if $S = \{ y : y = \tanh(x), \ x \in \Re \}$, then many numbers can play the role of a bound; e.g. $M = 4, 3, 2$ and $1$, among others. Let's be more precise.

Definition 2.1.1 Bounded Sets

We say the set S of real numbers is bounded above if there is a number $M$ so that $x \le M$ for all $x \in S$. Any such number $M$ is called an upper bound of S. We usually abbreviate this by the letters u.b. or just ub. We say this same set S is bounded below if there is a number $m$ so that $x \ge m$ for all $x \in S$. Any such number $m$ is called a lower bound of S; we abbreviate this as l.b. or simply lb.
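Since the text uses MatLab elsewhere, a minimal MatLab/Octave sanity check of these definitions on the tanh example above may help; this is a sketch with our own choice of sampling grid, and a finite sample only approximates the set S:

% Numerical sanity check of bounds for S = { tanh(x) : x in R },
% sampled on a finite grid (the grid is our own choice).
x = linspace(-10, 10, 2001);
S = tanh(x);
% M is an upper bound of the sample when every element of S is <= M.
for M = [4 3 2 1]
  fprintf('M = %d an upper bound of the sample: %d\n', M, all(S <= M));
end
% Any m <= -1 serves as a lower bound for this sample.
fprintf('m = -1 a lower bound of the sample: %d\n', all(S >= -1));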

The next idea is more abstract: the idea of a least upper bound and a greatest lower bound. We have to be careful to define these precisely.

Definition 2.1.2 Least Upper Bound and Greatest Lower Bound

The least upper bound (lub), or supremum (sup), of a nonempty set S of real numbers is a number $U$ satisfying:

1. $U$ itself is an upper bound of S;

2. if $M$ is any other upper bound of S, then $U \le M$.

Similarly, the greatest lower bound (glb), or infimum (inf), of the nonempty set S of real numbers is a number $L$ satisfying:

1. $L$ itself is a lower bound of S;

2. if $m$ is any other lower bound of S, then $m \le L$.

We often use the notation $\inf(S)$ and $\sup(S)$ to denote these numbers.

Note, inf(S) and sup(S) need not be in the set S itself!


We use the following conventions:

1. If S has no lower bound, we set $\inf(S) = -\infty$. If S has no upper bound, we set $\sup(S) = +\infty$.

2. We can extend these definitions to the empty set, which is sometimes nice to do. If $S = \emptyset$, every real number is vacuously a lower bound of S, so there are arbitrarily large lower bounds and we can argue we should set $\inf(\emptyset) = +\infty$. Similarly, we set $\sup(\emptyset) = -\infty$.

When $\inf(S)$ and $\sup(S)$ belong to S, the situation is special. It is worthy of a definition of a new set of terms.

Definition 2.1.3 The Minimum and Maximum of a Set

Let S be a set of real numbers. We say $Q$ is a maximum of S if $x \le Q$ for all $x \in S$ and $Q \in S$. Similarly, we say $q$ is a minimum of S when $x \ge q$ for all $x \in S$ and $q \in S$. We also say $q$ is a minimal element of S and $Q$ is a maximal element of S.

Note if $Q$ is a maximum of S, by definition $Q$ is an upper bound of S. Hence, $\sup(S) \le Q$. On the other hand, since $Q \in S$, we must also have $Q \le \sup(S)$. We conclude $Q = \sup(S)$. So, when the supremum of a set is achieved by an element of the set, that element is the supremum. We can argue in a similar way to show a minimum $q$ satisfies $q = \inf(S)$.
Now, there are things we cannot prove about the set of real numbers $\Re$. Instead we must assume a certain property called the Completeness Axiom.

Axiom 1 The Completeness Axiom

Let S be a set of real numbers which is nonempty and bounded above. Then $\sup(S)$ exists and is finite. If S is nonempty and bounded below, then $\inf(S)$ exists and is finite.

So nonempty bounded sets of real numbers always have a finite infimum and a finite supremum. Of course, they need not possess a minimum or maximum! For example, $S = (0,1)$ has $\inf(S) = 0$ and $\sup(S) = 1$, but neither point belongs to S. We can prove some useful things now.

Theorem 2.1.1 S has a maximal element if and only if sup(S) is in S

Let S be a nonempty set of real numbers which is bounded above. Then S has a maximal element if and only if $\sup(S) \in S$.

Proof 2.1.1
(⇐=): Assume $\sup(S) \in S$. By definition, $\sup(S)$ is an upper bound of S and so $x \le \sup(S)$ for all $x \in S$. This shows $\sup(S)$ is the maximum.
(=⇒): Let $Q$ denote the maximum of S. Then, by definition, $x \le Q$ for all $x \in S$, which shows $Q$ is an upper bound of S. Then, from the definition of the supremum of S, we must have $\sup(S) \le Q$. However, $\sup(S)$ is also an upper bound of S and so we also have $Q \le \sup(S)$. Combining these inequalities, we see $Q = \sup(S)$.

We can prove a similar theorem about the minimum.

Theorem 2.1.2 S has a minimal element if and only if inf(S) is in S

Let S be a nonempty set of real numbers which is bounded below. Then S has a minimal element if and only if $\inf(S) \in S$.


Proof 2.1.2
The argument is similar to the one we did for the maximum, so we'll leave this one to you.

Now the next two lemmas are very important in modern analysis.

Lemma 2.1.3 Supremum Tolerance Lemma

Let S be a nonempty set that is bounded above. Then we know $\sup(S)$ exists and is finite by the Completeness Axiom. Moreover, for all $\epsilon > 0$, there is an element $y \in S$ so that $\sup(S) - \epsilon < y \le \sup(S)$.

Proof 2.1.3
We argue by contradiction. Suppose for a given $\epsilon > 0$ we cannot find such a $y$. Then we must have $y \le \sup(S) - \epsilon$ for all $y \in S$. This tells us $\sup(S) - \epsilon$ is an upper bound of S and so, by definition, $\sup(S) \le \sup(S) - \epsilon$, which is not possible. Hence, our assumption is wrong and we can find such a $y$.

Lemma 2.1.4 Infimum Tolerance Lemma

Let S be a nonempty set that is bounded below. Then we know $\inf(S)$ exists and is finite by the Completeness Axiom. Moreover, for all $\epsilon > 0$, there is an element $y \in S$ so that $\inf(S) \le y < \inf(S) + \epsilon$.

Proof 2.1.4
We argue by contradiction. Suppose for a given $\epsilon > 0$ we cannot find such a $y$. Then we must have $y \ge \inf(S) + \epsilon$ for all $y \in S$. This tells us $\inf(S) + \epsilon$ is a lower bound of S and so, by definition, $\inf(S) \ge \inf(S) + \epsilon$, which is not possible. Hence, our assumption is wrong and we can find such a $y$.

We often abbreviate these lemmas as the STL and the ITL, respectively.
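A small numerical illustration of the Supremum Tolerance Lemma may also help; this is a sketch with our own choice of set $S = \{1 - 1/n : n \ge 1\}$ and tolerance, not an example from the text:

% Supremum Tolerance Lemma, numerically: S = { 1 - 1/n : n >= 1 }
% has sup(S) = 1, which is not attained.  For any eps > 0 we can
% exhibit a y in S with sup(S) - eps < y <= sup(S).
n = 1:100000;
S = 1 - 1./n;
supS = 1;                          % sup(S) = 1, known analytically
eps0 = 1e-3;                       % any tolerance
y = S(find(S > supS - eps0, 1));   % first sample with sup - eps < y
fprintf('y = %.6f lies in (sup - eps, sup]\n', y);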

Homework

Exercise 2.1.1

Exercise 2.1.2

Exercise 2.1.3

Exercise 2.1.4

Exercise 2.1.5

There is also the notion of compactness, which we have discussed thoroughly in (Peterson (8) 2019). However, let's go over it again as this is an important concept. We begin with sequential compactness. Recall:

Definition 2.1.4 Sequential Compactness

Let S be a set of real numbers. We say S is sequentially compact if every sequence (xn)in S has at least one subsequence which converges to an element of S.

This definition has many consequences. The first is a way to characterize the sequentially compactsubsets of the real line.

Theorem 2.1.5 A Set S is Sequentially Compact if and only if it is Closed and Bounded


Let S be a set of real numbers. Then S is sequentially compact if and only if it is a boundedand closed set.

Proof 2.1.5
(=⇒): If we assume S is sequentially compact, we must show S is closed and bounded.

a: To show S is closed, we show S contains its limit points. Let $x$ be a limit point of S. Then there is a sequence $(x_n)$ in S so that $x_n \to x$. Since $(x_n)$ is in S, the sequential compactness of S says there is a subsequence $(x^1_n)$ of $(x_n)$ and a $y \in S$ so that $x^1_n \to y$. But the subsequential limits of a converging sequence must converge to the same limit. Hence, $y = x$. This tells us $x \in S$ and so the limit point $x$ is in S. Since the choice of limit point $x$ was arbitrary, this implies S is a closed set.

b: To show S is bounded, we assume it is not. Then there must be a sequence $(x_n)$ in S with $|x_n| > n$ for all positive integers $n$. Now apply the sequential compactness of S again. This sequence must contain a convergent subsequence $(x^1_n)$ which converges to an $x \in S$. But this implies the elements of the subsequence $(x^1_n)$ must be bounded, as there is a theorem which tells us convergent sequences must be bounded. But this subsequence comes from an unbounded sequence which is monotonic in absolute value. This is a contradiction. So our assumption that S is unbounded is wrong and we conclude S must be bounded.

(⇐=): We assume S is closed and bounded. We want to show S is sequentially compact. Let $(x_n)$ be any sequence in S. Since S is a bounded set, by the Bolzano - Weierstrass Theorem, there is at least one subsequence $(x^1_n)$ of $(x_n)$ and a real number $x$ with $x^1_n \to x$. This says $x$ is a limit point of S and since S is closed, we must have $x \in S$. Hence the sequence $(x_n)$ in S has a subsequence which converges to a point in S. Since our choice of sequence in S was arbitrary, this shows S is sequentially compact.

Homework

Exercise 2.1.6

Exercise 2.1.7

Exercise 2.1.8

Exercise 2.1.9

Exercise 2.1.10

Now add continuity to the mix. Here is a typical result.

Theorem 2.1.6 The range of a continuous function on a sequentially compact domain is also sequentially compact

Let S be a nonempty sequentially compact set of real numbers. Assume $f : S \to \Re$ is continuous on S. Then the range $f(S)$ is sequentially compact.

Proof 2.1.6
We know $f(S) = \{ f(x) : x \in S \}$. Let $(y_n)$ be a sequence in $f(S)$. Then there is an associated sequence $(x_n)$ in S so that $f(x_n) = y_n$. But S is sequentially compact, so there is a subsequence $(x^1_n)$ which converges to a point $x \in S$. Since $f$ is continuous at $x$, $f(x^1_n) \to f(x)$. Let $y$ denote the number $f(x)$. We have shown there is a subsequence $(y^1_n) = (f(x^1_n))$ with $y^1_n \to y$ and $y \in f(S)$. Hence, we have shown $f(S)$ is sequentially compact.


This leads to our first extremal theorem.

Theorem 2.1.7 Continuous functions on sequentially compact domains have a minimum and maximum value

Let S be a sequentially compact set of real numbers and let $f : S \to \Re$ be continuous on S. Then there are two sequences $(x^m_n)$ and $(x^M_n)$ which converge to points $x^m$ and $x^M$, respectively, and $f(x) \ge f(x^m)$ and $f(x) \le f(x^M)$ for all $x \in S$. Thus, $f(x^m)$ is the minimum value of $f$ on S, which is achieved at $x^m$. Moreover, $f(x^M)$ is the maximum value of $f$ on S, which is achieved at $x^M$.

We call the sequences $(x^m_n)$ and $(x^M_n)$ minimizing and maximizing sequences for $f$ on S.

Proof 2.1.7
By Theorem 2.1.6, $f(S)$ is sequentially compact and nonempty. Hence, $f(S)$ is a closed and bounded set of real numbers by Theorem 2.1.5. We thus know $M = \sup(f(S))$ and $m = \inf(f(S))$ both exist and are finite. By the STL and the ITL, we can find points satisfying

$\exists\, x^M_1 \in S \ni M - 1 < f(x^M_1) \le M$
$\exists\, x^M_2 \in S \ni M - 1/2 < f(x^M_2) \le M$
$\exists\, x^M_3 \in S \ni M - 1/3 < f(x^M_3) \le M$
$\vdots$
$\exists\, x^M_n \in S \ni M - 1/n < f(x^M_n) \le M$
$\vdots$

and

$\exists\, x^m_1 \in S \ni m \le f(x^m_1) < m + 1$
$\exists\, x^m_2 \in S \ni m \le f(x^m_2) < m + 1/2$
$\exists\, x^m_3 \in S \ni m \le f(x^m_3) < m + 1/3$
$\vdots$
$\exists\, x^m_n \in S \ni m \le f(x^m_n) < m + 1/n$
$\vdots$

We see $f(x^M_n) \to M$ and $f(x^m_n) \to m$. Hence $(f(x^M_n))$ and $(f(x^m_n))$ are convergent sequences in the closed set $f(S)$, so their limit values must be in $f(S)$. Hence, we know $m \in f(S)$ and $M \in f(S)$. Now apply Theorem 2.1.1 and Theorem 2.1.2 to see there are elements $x^m$ and $x^M$ in S so that $f(x^m) = m \le f(x)$ and $f(x^M) = M \ge f(x)$ for all $x \in S$.
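The tolerance construction in this proof is easy to imitate numerically. The following MatLab sketch builds a maximizing sequence for a function and compact set of our own choosing; it is an illustration, not an example from the text:

% Build a maximizing sequence for f(x) = x.*exp(-x) on S = [0, 4],
% mimicking the tolerance construction in the proof.
f = @(x) x .* exp(-x);            % our own choice of continuous f
x = linspace(0, 4, 40001);        % fine sample of the compact set S
fx = f(x);
M = max(fx);                      % numerical stand-in for sup f(S)
ns = [1 10 100 1000 10000];
xM = zeros(size(ns));
for k = 1:numel(ns)
  % pick a point with M - 1/n < f(x) <= M, as in the proof
  idx = find(fx > M - 1/ns(k), 1);
  xM(k) = x(idx);
end
disp(xM)    % the tail of the sequence approaches the maximizer x = 1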

Homework

Exercise 2.1.11

Exercise 2.1.12

Exercise 2.1.13

Exercise 2.1.14


Exercise 2.1.15

Now add differentiation.

Theorem 2.1.8 Derivative of f at an interior local extremum is zero

Let $f : [a,b] \to \Re$ where $[a,b]$ is a finite interval. Assume $f$ has a local extremum at an interior point $p \in [a,b]$. Then if $f$ is differentiable at $p$, $f'(p) = 0$.

Proof 2.1.8
Since $p$ is an interior point, there is an $r_1 > 0$ so that $|x - p| < r_1 \implies x \in [a,b]$. Also, since $f(p)$ is a local extreme value, there is another $r_2 > 0$ so that $f(p) \ge f(x)$ for $x \in B_{r_2}(p)$ with $B_{r_2}(p) \subset [a,b]$ if $f(p)$ is a local maximum, or $f(p) \le f(x)$ for $x \in B_{r_2}(p)$ with $B_{r_2}(p) \subset [a,b]$ if $f(p)$ is a local minimum. For convenience, we will assume $f(p)$ is a local maximum and let you argue the local minimum case for yourself. Combining these facts, we see

$|x - p| < r < \min(r_1, r_2) \implies f(x) \le f(p)$

Thus, for $p < x < p + r$, $f(x) - f(p) \le 0$ and $x - p > 0$, telling us the ratio $\frac{f(x)-f(p)}{x-p} \le 0$. Hence, $f'(p^+) = \lim_{x \to p^+} \frac{f(x)-f(p)}{x-p} \le 0$. Since $f$ is differentiable at $p$, we know $f'(p^+) = f'(p)$. Thus, we have $f'(p) \le 0$.
Next, for $p - r < x < p$, we have $f(x) - f(p) \le 0$ and $x - p < 0$. Thus, the ratio $\frac{f(x)-f(p)}{x-p} \ge 0$. Hence, $f'(p^-) = \lim_{x \to p^-} \frac{f(x)-f(p)}{x-p} \ge 0$. Since $f$ is differentiable at $p$, we know $f'(p^-) = f'(p)$. Thus, we have $f'(p) \ge 0$.
Combining, we see $f'(p) = 0$. You can do a similar argument for the case of a local minimum.
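A quick numeric check of the sign argument, with our own choice of $f$ and interior maximizer $p$:

% Sign of the difference quotient near an interior maximum:
% f(x) = -(x - 1).^2 has a local max at p = 1.
f = @(x) -(x - 1).^2;
p = 1;
h = [0.1 0.01 0.001];
right = (f(p + h) - f(p)) ./ h;     % quotients from the right: <= 0
left  = (f(p - h) - f(p)) ./ (-h);  % quotients from the left:  >= 0
disp([right; left])   % both rows squeeze toward f'(p) = 0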

Homework

Exercise 2.1.16

Exercise 2.1.17

Exercise 2.1.18

Exercise 2.1.19

Exercise 2.1.20

We will often be in situations where the classical limits fail to exist. However, even if this is the case, limit inferiors and limit superiors do exist! Let $(a_n)$ be a sequence of real numbers and let $(a^1_n)$ be any subsequence of $(a_n)$ which converges. If $(a_n)$ is a bounded sequence, the Bolzano - Weierstrass Theorem guarantees there is at least one such subsequence. Consider the set

$S = \{ \alpha \in \Re : \exists\, (a^1_n) \subset (a_n) \ni a^1_n \to \alpha \}$

The set S is called the set of subsequential limits of the sequence $(a_n)$. We define the limit inferior and limit superior of a sequence using S.

Definition 2.1.5 Limit Inferior and Limit Superior of a Sequence

The limit inferior of the sequence $(a_n)$ is denoted by $\liminf(a_n) = \underline{\lim}(a_n)$ and defined by $\liminf(a_n) = \inf(S)$. In a similar fashion, the limit superior of the sequence $(a_n)$ is denoted by $\limsup(a_n) = \overline{\lim}(a_n)$ and defined by $\limsup(a_n) = \sup(S)$.


Recall, it is possible for a sequence to converge to $\pm\infty$. For example, if $(a_n) = (n)$ for $n \ge 1$, we know the sequence is not bounded above and $\lim_n(a_n) = \lim_n(n) = \infty$. Similarly, if $(a_n) = (-n)$ for $n \ge 1$, we know the sequence is not bounded below and $\lim_n(a_n) = \lim_n(-n) = -\infty$. There are many other examples! Thus, it is possible to have sequences whose limits are not finite real numbers. The set of extended real numbers is denoted $\overline{\Re}$ and consists of all finite real numbers plus two additional symbols $+\infty$ and $-\infty$. Note, we use these symbols to connect with our intuition about converging to $\pm\infty$, but we could just as well have used the symbols $\alpha$ and $\beta$. But for us, we have defined

$\overline{\Re} = \Re \cup \{+\infty\} \cup \{-\infty\}$

We use the following conventions. For addition we have

$x + (+\infty) = +\infty, \ \forall x \in \Re$
$x - (+\infty) = -\infty, \ \forall x \in \Re$
$(+\infty) + x = +\infty, \ \forall x \in \Re$
$(-\infty) + x = -\infty, \ \forall x \in \Re$
$(-\infty) + (-\infty) = -\infty$
$(+\infty) + (+\infty) = +\infty$

For multiplication,

$(+\infty) \cdot x = x \cdot (+\infty) = +\infty, \ \forall x > 0$
$(+\infty) \cdot 0 = 0 \cdot (+\infty) = 0$
$(+\infty) \cdot x = x \cdot (+\infty) = -\infty, \ \forall x < 0$
$(-\infty) \cdot x = x \cdot (-\infty) = -\infty, \ \forall x > 0$
$(-\infty) \cdot 0 = 0 \cdot (-\infty) = 0$
$(-\infty) \cdot x = x \cdot (-\infty) = +\infty, \ \forall x < 0$
$(+\infty) \cdot (+\infty) = +\infty$
$(-\infty) \cdot (-\infty) = +\infty$
$(+\infty) \cdot (-\infty) = -\infty$
$(-\infty) \cdot (+\infty) = -\infty$

For division,

$\frac{x}{+\infty} = 0, \ \forall x \in \Re$
$\frac{x}{-\infty} = 0, \ \forall x \in \Re$
$\frac{+\infty}{x} = +\infty, \ \forall x > 0$
$\frac{+\infty}{x} = -\infty, \ \forall x < 0$
$\frac{-\infty}{x} = -\infty, \ \forall x > 0$
$\frac{-\infty}{x} = +\infty, \ \forall x < 0$

There are several operations that are not defined. These are $0/0$, $(+\infty) + (-\infty)$, $(+\infty)/(-\infty)$, $(-\infty)/(+\infty)$, $(+\infty)/(+\infty)$ and $(-\infty)/(-\infty)$. However, with the above conventions, arithmetic in $\overline{\Re}$ is well-defined.
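As an aside, MatLab's floating point arithmetic follows the IEEE-754 conventions, which agree with most of the rules above; one difference worth noting is that IEEE leaves $0 \cdot \infty$ undefined (NaN), whereas we have adopted the convention $0 \cdot (\pm\infty) = 0$. A quick check:

% Extended arithmetic in MATLAB (IEEE-754 semantics):
disp(5 + Inf)      % Inf,  matches x + (+inf) = +inf
disp(5 - Inf)      % -Inf, matches x - (+inf) = -inf
disp(5 / Inf)      % 0,    matches x / (+inf) = 0
disp(Inf + Inf)    % Inf
disp(Inf - Inf)    % NaN:  (+inf) + (-inf) is undefined, as in the text
disp(0 * Inf)      % NaN:  IEEE leaves this undefined; the text sets it to 0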


With these definitions, we can extend limit inferiors and limit superiors to the extended reals. Let $(a_n)$ be any sequence in $\overline{\Re}$. The set of subsequential limits is now

$S = \{ \alpha \in \overline{\Re} : \exists\, (a^1_n) \subset (a_n) \ni a^1_n \to \alpha \}$

where we now allow the possibility that $\pm\infty \in S$. We need to think carefully about limiting behavior here.

• If $(a^1_n)$ is a subsequence taking real values, then $a^1_n \to +\infty$ means: for any $\epsilon > 0$, there is an $N$ so that $n > N$ implies $|1/a^1_n| < \epsilon$, i.e. $|a^1_n| > 1/\epsilon$. As $\epsilon \to 0^+$, $1/\epsilon \to +\infty$, so this is the same as saying for all $R > 0$, there is an $N$ so that $|a^1_n| > R$ when $n > N$.

• If the subsequence $(a^1_n) \to -\infty$, this is handled like we did above.

• If the subsequence has an infinite number of $+\infty$ values, then our tests for convergence, which require us to look at the difference $a^1_n - \alpha$ where $\alpha$ is the limit, fail when we encounter the term $\infty - \infty$, which is not a defined operation. In this case, we will simply posit that $\alpha = +\infty$ or $\alpha = -\infty$ depending on the structure of the subsequence.

Here are some examples:

1. $(a_n) = (\tanh(n))$. Since $\tanh(n) \to 1$, every subsequence converges to $1$, so $\underline{\lim}(a_n) = \overline{\lim}(a_n) = 1$.

2. $(a_n) = (n^2)$. Then $\underline{\lim}(a_n) = +\infty$ and $\overline{\lim}(a_n) = +\infty$.

3. $(a_n) = (\cos(n\pi/3))$ for $n \ge 0$. Then the range of this sequence is a set of repeating blocks

$(a_n) = \text{Block}, \text{Block}, \ldots$

where

$\text{Block} = \{ 1\ (n=0),\ 1/2\ (n=1),\ -1/2\ (n=2),\ -1\ (n=3),\ -1/2\ (n=4),\ 1/2\ (n=5) \}$

We see for integers $k$, we have subsequences that converge to each element in this block:

$\cos((6k+0)\pi/3) = 1 \implies a_{6k+0} \to 1$
$\cos((6k+1)\pi/3) = 1/2 \implies a_{6k+1} \to 1/2$
$\cos((6k+2)\pi/3) = -1/2 \implies a_{6k+2} \to -1/2$
$\cos((6k+3)\pi/3) = -1 \implies a_{6k+3} \to -1$
$\cos((6k+4)\pi/3) = -1/2 \implies a_{6k+4} \to -1/2$
$\cos((6k+5)\pi/3) = 1/2 \implies a_{6k+5} \to 1/2$

A little thought then allows us to conclude $S = \{-1, -1/2, 1/2, 1\}$ and so $\underline{\lim}(a_n) = -1$ and $\overline{\lim}(a_n) = +1$.

We can extend these ideas to functional limits. Consider a function $f$ defined on a finite interval $[a,b]$ and let $p \in [a,b]$. We say $\alpha \in \overline{\Re}$ is a cluster point of $f$ as we approach $p$ if there is a sequence $(x_n)$ in $[a,b]$ with $x_n \to p$ and $f(x_n) \to \alpha$. We must now think of convergence in the sense of the extended reals, of course. We define the set of cluster points $S(p)$ then as

$S(p) = \{ \alpha \in \overline{\Re} : \exists\, (x_n) \subset [a,b] \ni x_n \to p \text{ and } f(x_n) \to \alpha \}$


and we define the limit inferior of $f$ as $x$ approaches $p$ to be $\underline{\lim}_{x \to p} f(x) = \inf(S(p))$. Similarly, the limit superior of $f$ as $x$ approaches $p$ is $\overline{\lim}_{x \to p} f(x) = \sup(S(p))$. In this version of these limits, we do not have an unbounded domain and so we cannot have a sequence $x_n \to \pm\infty$. The most general version has the domain of $f$ simply being a set $\Omega$ and

$S(p) = \{ \alpha \in \overline{\Re} : \exists\, (x_n) \subset \Omega \ni x_n \to p \text{ and } f(x_n) \to \alpha \}$

In this case, we can have the extended real valued limits $\pm\infty$, but the definitions of the limits above are still the same. We also use the notation $\underline{\lim}_{x \to p} f(x) = \liminf_{x \to p} f(x)$ and $\overline{\lim}_{x \to p} f(x) = \limsup_{x \to p} f(x)$.

Example 2.1.1 Find the cluster points of $f(x) = \cos(1/x)$ for $x \ne 0$.

Solution. Here $f$ is defined on $\Omega = \Re \setminus \{0\}$, not all of $\Re$. Since $-1 \le \cos(y) \le 1$, given any $\alpha \in [-1,1]$, the sequence $(x_n)$ defined by $x_n = 1/(2n\pi + \cos^{-1}(\alpha))$ satisfies $x_n \to 0$ and $f(x_n) = \alpha$ for all $n$. Hence, $S(0) = [-1,1]$ and $\underline{\lim}_{x \to 0} f(x) = \inf(S(0)) = -1$ and $\overline{\lim}_{x \to 0} f(x) = \sup(S(0)) = 1$. Because $f$ is continuous at all $p \ne 0$, $S(p) = \{\cos(1/p)\}$ there and $\underline{\lim}_{x \to p} f(x) = \inf(S(p)) = \cos(1/p)$ and $\overline{\lim}_{x \to p} f(x) = \sup(S(p)) = \cos(1/p)$. Hence, $\lim_{x \to p} f(x) = f(p)$ at those points.

We can prove an alternate definition of the limit inferior and limit superior of a sequence.

Theorem 2.1.9 Alternate Definition of the limit inferior and limit superior of a sequence

Let $(a_n)$ be a sequence of real numbers. Then

$\underline{\lim}(a_n) = \sup_k \inf_{n \ge k} a_n \quad \text{and} \quad \overline{\lim}(a_n) = \inf_k \sup_{n \ge k} a_n$

To see this more clearly, let

$z_k = \inf\{a_k, a_{k+1}, \ldots\} = \inf_{n \ge k} a_n \quad \text{and} \quad w_k = \sup\{a_k, a_{k+1}, \ldots\} = \sup_{n \ge k} a_n$

Then,

$z_1 \le z_2 \le \ldots \le z_k \le \ldots \le w_k \le w_{k-1} \le \ldots \le w_1$

The $(z_k)$ sequence is monotone increasing and the $(w_k)$ sequence is monotone decreasing. Hence, $\lim_k z_k \uparrow \underline{\lim}(a_n)$ and $\lim_k w_k \downarrow \overline{\lim}(a_n)$.

Proof 2.1.9
Careful proofs of these ideas are in (Peterson (8) 2019). You should go back and review them.
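The tail formulation is also easy to see numerically. Here is a sketch using the $\cos(n\pi/3)$ example from above; the truncation length is our own choice and only the tails containing a full repeating block are trusted:

% Theorem 2.1.9 numerically for a_n = cos(n*pi/3):
% z_k = inf over the tail {a_n : n >= k}, w_k = sup over the tail.
N = 60;
a = cos((1:N) * pi / 3);
z = zeros(1, N); w = zeros(1, N);
for k = 1:N
  z(k) = min(a(k:N));   % finite-tail stand-in for inf_{n >= k} a_n
  w(k) = max(a(k:N));
end
% tails of length >= 6 contain a whole repeating block
fprintf('liminf = %g, limsup = %g\n', max(z(1:N-6)), min(w(1:N-6)));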

Homework

Exercise 2.1.21

Exercise 2.1.22

Exercise 2.1.23

Exercise 2.1.24

Exercise 2.1.25


2.2 Some Topology in $\Re^2$

If you look at Figure 2.1, you can see the big difference between simple set topology in $\Re$ and in $\Re^2$.

Figure 2.1: Open and Closed Balls in One and Two D

The open and closed balls in $\Re$ are simply intervals on the x-axis. You draw the interval centered at $p$: it is an open ball if the endpoints are not included and a closed ball otherwise. The closed ball is $\overline{B}_r(p) = \{ x \in \Re : p - r \le x \le p + r \}$; the open ball $B_r(p)$ uses strict inequalities. In $\Re^2$, the intervals become circles centered at a point in $\Re^2$. Thus, $B_r(\boldsymbol{p}) = \{ \boldsymbol{x} \in \Re^2 : \|\boldsymbol{x} - \boldsymbol{p}\| < r \}$, where $\boldsymbol{x}$ is the vector $\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and the center is the vector $\boldsymbol{p} = \begin{bmatrix} p_1 \\ p_2 \end{bmatrix}$. Get familiar with this by drawing lots of pictures; a short MatLab sketch follows the list below. From our discussions of topology in $\Re$ in (Peterson (8) 2019), the important concepts to move into $\Re^2$ are:

1. If S is a subset of $\Re^2$, $S^C$ is the complement of S and consists of all $\boldsymbol{x}$ not in S.

2. $\boldsymbol{p}$ is an interior point of the set S in $\Re^2$ if there is a positive $r$ so that $B_r(\boldsymbol{p}) \subset S$. We define S to be an open set if all of its points are interior points. Note a curve in $\Re^2$ has no interior! Draw the graph of $y = x^2$ for $x \in [-1,1]$: this arc has no interior points.

3. $\boldsymbol{p}$ is a boundary point of S if all balls $B_r(\boldsymbol{p})$ intersect both S and $S^C$ in at least one point. In Figure 2.2, we see boundary point examples in both $\Re$ and $\Re^2$.

4. A set S is said to be closed if $S^C$ is open.
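As suggested above, drawing pictures helps. Here is a throwaway MatLab sketch (the center, radius and test point are arbitrary choices of ours) that draws the boundary circle of a two dimensional ball and tests membership in the open ball:

% Draw the open ball B_r(p) in R^2: the dashed circle is the set of
% boundary points; the points strictly inside form the open ball and
% adding the circle itself gives the closed ball.
p = [1; 2]; r = 0.75;
t = linspace(0, 2*pi, 400);
plot(p(1) + r*cos(t), p(2) + r*sin(t), 'b--'); hold on
plot(p(1), p(2), 'k.', 'MarkerSize', 12);   % the center p
axis equal; grid on
% membership test for a candidate point x
x = [1.3; 2.2];
fprintf('x in B_r(p): %d\n', norm(x - p) < r);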

A sequence in $\Re^2$ is a set of two dimensional vectors $\left( \begin{bmatrix} x_n \\ y_n \end{bmatrix} \right)$. We say this sequence converges to the vector $\begin{bmatrix} x \\ y \end{bmatrix}$ if

$\forall \epsilon > 0, \ \exists N \ni \|\boldsymbol{X}_n - \boldsymbol{X}\| < \epsilon, \ \forall n > N$


Figure 2.2: Boundary Points in One and Two D

where we let $\boldsymbol{X}_n = \begin{bmatrix} x_n \\ y_n \end{bmatrix}$ and $\boldsymbol{X} = \begin{bmatrix} x \\ y \end{bmatrix}$. Further, we use the norm

$\|\boldsymbol{X}_n - \boldsymbol{X}\| = \sqrt{(x_n - x)^2 + (y_n - y)^2}$

as our way of measuring the distance between vectors in $\Re^2$. Of course, other norms are possible, but we will simply use the standard Euclidean norm for now. Now that we know what a sequence of vectors is and what we mean by the sequence converging, we can define some other special points.

1. A point $\boldsymbol{p}$ is an accumulation point of S if every ball $B_r(\boldsymbol{p})$ contains a point of S different from $\boldsymbol{p}$. This means there is a sequence $(\boldsymbol{x}_n)$ with $\boldsymbol{x}_n \in B_{1/n}(\boldsymbol{p}) \setminus \{\boldsymbol{p}\}$ which converges to $\boldsymbol{p}$.

2. A point $\boldsymbol{p}$ is a cluster point of S if there is a sequence $(\boldsymbol{x}_n)$ from S with each $\boldsymbol{x}_n \ne \boldsymbol{p}$ so that $\boldsymbol{x}_n \to \boldsymbol{p}$. Note cluster points are accumulation points and accumulation points are cluster points.

3. A point $\boldsymbol{p}$ is a limit point of S if there is a sequence $(\boldsymbol{x}_n)$ from S that converges to $\boldsymbol{p}$. Note an isolated point of a set is a point which is a positive distance from any other point in the set; an isolated point is a limit point but not an accumulation point.

4. $S'$ is the set S together with its limit points.

Note a set S need not be either open or closed, and $\Re^2$ is both open and closed. We can prove useful things in essentially the same way as we did in $\Re$.

Theorem 2.2.1 Basic Topology Results in $\Re^2$


Given S in $\Re^2$:

1. S is closed if and only if $S = S'$; i.e. S is closed if and only if it contains its limit points.

2. S is closed if and only if S contains its boundary points.

3. Finite intersections of open sets are open.

4. Finite unions of open sets are open.

5. Countable unions of open sets are open. Countable intersections of open sets can be closed; e.g. $\cap_n \, (-1/n, 1/n) \times (-1/n, 1/n) = \{(0,0)\}$.

6. Finite intersections of closed sets are closed.

7. Finite unions of closed sets are closed.

8. Countable intersections of closed sets are closed. Countable unions of closed sets need not be closed.

9. DeMorgan's Laws hold: i.e. $\left( \cup_n S_n \right)^C = \cap_n S_n^C$ and $\left( \cap_n S_n \right)^C = \cup_n S_n^C$.

Proof 2.2.1 You can mimic the one dimensional proofs fairly easily.

2.2.1 Homework

Exercise 2.2.1

Exercise 2.2.2

Exercise 2.2.3

Exercise 2.2.4

Exercise 2.2.5

2.3 Bolzano - Weierstrass in $\Re^2$

We have discussed the Bolzano - Weierstrass theorem very carefully in $\Re$ and we mentioned casually how we would go about handling the proof in $\Re^n$ for $n \ge 2$ in (Peterson (8) 2019), but now we would like to show the details for the 2D proof. The proof is then very similar in 3D and so on, and we just refer to this one below and say, "well, it is similar..."!

Theorem 2.3.1 Bolzano - Weierstrass Theorem in 2D

Every bounded sequence in $\Re^2$ has at least one convergent subsequence.

Proof 2.3.1Let’s assume the range of this sequence is infinite. If it were finite, there would be subsequencesof it that converge to each of the values in the finite range. To make it easier to read this ar-gument, we will use boldface font to indicate vectors in <2 and it is assumed they will have two


components labeled with subscripts 1 and 2. So here the sequence is $(a_n)$ with $a_n = \begin{bmatrix} a_{n1} \\ a_{n2} \end{bmatrix}$. We assume the sequences start at $n = 1$ for convenience and, by assumption, there is a positive number $B$ so that $\|a_n\| \le B/2$ for all $n \ge 1$. Hence, all the components of this sequence live in the square $[-B/2, B/2] \times [-B/2, B/2] \subset \Re^2$. Let's use a special notation for a box of this sort. Let $J_0 = I_0^1 \times I_0^2$ where the interval $I_0^1 = [\alpha_{01}, \beta_{01}]$ and $I_0^2 = [\alpha_{02}, \beta_{02}]$ also. Here, $\alpha_{0i} = -B/2$ and $\beta_{0i} = B/2$. Note the area of $J_0$, denoted by $\ell_0$, is $B^2$.

Let $S$ be the range of the sequence, which has infinitely many points. For convenience, we will let the phrase infinitely many points be abbreviated to IMPs.

Step 1: Bisect each axis interval of $J_0$ into two pieces, giving four subregions of $J_0$, all of which have area $B^2/4$. Now at least one of the subregions contains IMPs of $S$, as otherwise each subregion would have only finitely many points and that contradicts our assumption that $S$ has IMPs. All of them may contain IMPs, so select one such subregion containing IMPs and call it $J_1$. Then $J_1 = I_1^1 \times I_1^2$ where the interval $I_1^1 = [\alpha_{11}, \beta_{11}]$ and $I_1^2 = [\alpha_{12}, \beta_{12}]$ also. Note the area of $J_1$, denoted by $\ell_1$, is $B^2/4$. We see $J_1 \subset J_0$ and

$$-B/2 = \alpha_{0i} \le \alpha_{1i} \le \beta_{1i} \le \beta_{0i} = B/2$$

Since $J_1$ contains IMPs, we can select a sequence vector $a_{n_1}$ from $J_1$.

Step 2: Now subdivide $J_1$ into four subregions just as before. At least one of these subregions contains IMPs of $S$. Choose one such subregion and call it $J_2$. Then $J_2 = I_2^1 \times I_2^2$ where the interval $I_2^1 = [\alpha_{21}, \beta_{21}]$ and $I_2^2 = [\alpha_{22}, \beta_{22}]$ also. Note the area of $J_2$, denoted by $\ell_2$, is $B^2/16$. We see $J_2 \subset J_1$ and

$$-B/2 = \alpha_{0i} \le \alpha_{1i} \le \alpha_{2i} \le \beta_{2i} \le \beta_{1i} \le \beta_{0i} = B/2$$

Since $J_2$ contains IMPs, we can select a sequence vector $a_{n_2}$ from $J_2$. It is easy to see this value can be chosen different from $a_{n_1}$, our previous choice, and with index $n_2 > n_1$, so we are building a genuine subsequence.

You should be able to see that we can continue this argument using induction.

Proposition: $\forall p \ge 1$, $\exists$ an interval $J_p = I_p^1 \times I_p^2$ with $I_p^i = [\alpha_{pi}, \beta_{pi}]$, with the area of $J_p$, $\ell_p = B^2/2^{2p}$, satisfying $J_p \subseteq J_{p-1}$, $J_p$ contains IMPs of $S$ and

$$\alpha_{0i} \le \ldots \le \alpha_{p-1,i} \le \alpha_{pi} \le \beta_{pi} \le \beta_{p-1,i} \le \ldots \le \beta_{0i}$$

Finally, there is a sequence vector $a_{n_p}$ in $J_p$, different from $a_{n_1}, \ldots, a_{n_{p-1}}$.

Proof: We have already established the proposition is true for the basis step $J_1$ and indeed also for the next step $J_2$.

Inductive: We assume the interval $J_q$ exists with all the desired properties. Since, by assumption, $J_q$ contains IMPs, bisect $J_q$ into four subregions as we have done before. At least one of these subregions contains IMPs of $S$. Choose one such subregion and call it $J_{q+1}$, and label $J_{q+1} = I_{q+1}^1 \times I_{q+1}^2$ where $I_{q+1}^i = [\alpha_{q+1,i}, \beta_{q+1,i}]$ for $i = 1, 2$. We see immediately $\ell_{q+1} = B^2/2^{2(q+1)}$ with $\alpha_{qi} \le \alpha_{q+1,i} \le \beta_{q+1,i} \le \beta_{qi}$. This shows the nested inequality we want is satisfied. Finally, since $J_{q+1}$ contains IMPs, we can choose $a_{n_{q+1}}$ distinct from the other $a_{n_i}$'s. So the inductive step is satisfied and by the POMI, the proposition is true for all $p$.


From our proposition, we have proven the existence of sequences $(\alpha_{pi})$, $(\beta_{pi})$ and $(\ell_p)$ which have various properties. The sequence $\ell_p$ satisfies $\ell_p = (1/4)\,\ell_{p-1}$ for all $p \ge 1$. Since $\ell_0 = B^2$, this means $\ell_1 = B^2/4$, $\ell_2 = B^2/16$, $\ell_3 = B^2/(2^2)^3$, leading to $\ell_p = B^2/(2^2)^p$ for $p \ge 1$. Further, we have the inequality chain

$$-B/2 = \alpha_{0i} \le \alpha_{1i} \le \alpha_{2i} \le \ldots \le \alpha_{pi} \le \ldots \le \beta_{pi} \le \ldots \le \beta_{2i} \le \beta_{1i} \le \beta_{0i} = B/2$$

The rest of this argument is almost identical to the one we did for the case of a bounded sequence in $\Re$. Note $(\alpha_{pi})_{p \ge 0}$ is bounded above by $B/2$ and $(\beta_{pi})_{p \ge 0}$ is bounded below by $-B/2$. Hence, by the completeness axiom, $\inf(\beta_{pi})$ exists and equals the finite number $\beta_i$; also $\sup(\alpha_{pi})$ exists and is the finite number $\alpha_i$.

So if we fix $p$, it should be clear the number $\beta_{pi}$ is an upper bound for all the $\alpha_{qi}$ values (look at our inequality chain again and think about this). Thus $\beta_{pi}$ is an upper bound for $(\alpha_{qi})$ and so, by the definition of a supremum, $\alpha_i \le \beta_{pi}$ for all $p$. Of course, we also know, since $\alpha_i$ is a supremum, that $\alpha_{pi} \le \alpha_i$. Thus, $\alpha_{pi} \le \alpha_i \le \beta_{pi}$ for all $p$. A similar argument shows that if we fix $p$, the number $\alpha_{pi}$ is a lower bound for all the $\beta_{qi}$ values, and so by the definition of an infimum, $\alpha_{pi} \le \beta_i \le \beta_{pi}$ for all $p$. This tells us $\alpha_i$ and $\beta_i$ are in $[\alpha_{pi}, \beta_{pi}]$ for all $p$. Next we show $\alpha_i = \beta_i$.

Let $\varepsilon > 0$ be arbitrary. Since $\alpha_i$ and $\beta_i$ are in an interval whose length is $\beta_{pi} - \alpha_{pi} = B/2^p$ (each bisection halves the side of the box, so the area $\ell_p = B^2/2^{2p}$ corresponds to a side length of $B/2^p$), we have $|\alpha_i - \beta_i| \le B/2^p$. Pick $P$ so that $B/2^P < \varepsilon$. Then $|\alpha_i - \beta_i| < \varepsilon$. But $\varepsilon > 0$ is arbitrary. Hence, $\alpha_i - \beta_i = 0$, implying $\alpha_i = \beta_i$. Finally, define the vector

$$\theta = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}.$$

We now must show $a_{n_k} \to \theta$; this shows we have found a subsequence which converges to $\theta$. We know $\alpha_{pi} \le a_{n_p, i} \le \beta_{pi}$ and $\alpha_{pi} \le \alpha_i \le \beta_{pi}$ for all $p$. Pick $\varepsilon > 0$ arbitrarily. Given any $p$, we have

$$|a_{n_p, i} - \alpha_i| = |a_{n_p, i} - \alpha_{pi} + \alpha_{pi} - \alpha_i| \quad \text{(add and subtract trick)}$$
$$\le |a_{n_p, i} - \alpha_{pi}| + |\alpha_{pi} - \alpha_i| \quad \text{(triangle inequality)}$$
$$\le |\beta_{pi} - \alpha_{pi}| + |\alpha_{pi} - \beta_{pi}| \quad \text{(both points lie in } [\alpha_{pi}, \beta_{pi}])$$
$$= 2\,|\beta_{pi} - \alpha_{pi}| = 2\,B/2^p.$$

Choose $P$ so that $B/2^P < \varepsilon/2$. Then $p > P$ implies $|a_{n_p, i} - \alpha_i| < 2\,\varepsilon/2 = \varepsilon$. Thus, $a_{n_p, i} \to \alpha_i$ for $i = 1, 2$. This shows the subsequence converges to $\theta$.

Note this argument is messy but quite similar to the one dimensional case. It should be easy for you to see how to extend this to $\Re^3$ and even $\Re^n$. It is more a problem of correct labeling than intellectual difficulty!
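To see the bisection mechanics concretely, here is a small Octave sketch; it is our own illustration, not part of the proof, and it uses the count of sample terms in a box as a finite stand-in for the phrase "contains IMPs".

% Repeatedly split the current box into four subregions, keep one holding the
% most terms of a bounded sequence, and read off the candidate limit theta as
% the point the nested boxes shrink down to.
n = 1:5000;
a = [cos(n); sin(sqrt(2)*n)];        % a bounded sequence in R^2
lo = [-1; -1];  hi = [1; 1];         % the starting box J_0
for p = 1:20
  mid = (lo + hi) / 2;
  bestcount = -1;
  for cx = 0:1
    for cy = 0:1
      slo = [lo(1)*(1-cx) + mid(1)*cx; lo(2)*(1-cy) + mid(2)*cy];
      shi = [mid(1)*(1-cx) + hi(1)*cx; mid(2)*(1-cy) + hi(2)*cy];
      inbox = (a(1,:) >= slo(1)) & (a(1,:) <= shi(1)) & ...
              (a(2,:) >= slo(2)) & (a(2,:) <= shi(2));
      if sum(inbox) > bestcount
        bestcount = sum(inbox);  bestlo = slo;  besthi = shi;
      end
    end
  end
  lo = bestlo;  hi = besthi;
end
theta = (lo + hi) / 2                % the candidate subsequential limit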

2.3.1 Homework

Exercise 2.3.1

Exercise 2.3.2

Exercise 2.3.3

Exercise 2.3.4

Exercise 2.3.5


Chapter 3

Vector Spaces

We are now going to explore vector spaces in some generality.

3.1 IntroductionLet’s go back and think about vectors in <2. As you know, we think of these as arrows with a tailfixed at the origin of the two dimensional coordinate system we call the x− y plane. They also havea length or magnitude and this arrow makes an angle with the positive x axis. Suppose we look attwo such vectors, E and F . Each vector has an x and a y component so that we can write

E =

[ab

], F =

[cd

]The cosine of the angle between them is proportional to the inner product < E,F >= ac + bd. Ifthis angle is 0 or π, the two vectors lie along the same line. In any case, the angle associated withE is tan−1( ba ) and for F , tan−1(dc ). Hence, if the two vectors lie on the same line, E must be amultiple of F . This means there is a number β so that

E = β F .

We can rewrite this as [ab

]= β

[cd

]Now let the number 1 in front of E be called −α. Then the fact that E and F lie on the same lineimplies there are 2 constants α and β, both not zero, so that

αE + β F = 0.

where 0 is the zero vector. Note we could argue this way for vectors in <3 and even in <n. Ofcourse, our ability to think of these things in terms of lying on the same line and so forth needs to beextended to situations we can no longer draw, but the idea is essentially the same. Instead of thinkingof our two vectors as lying on the same line or not, we can rethink what is happening here and try toidentify what is happening in a more abstract way. If our two vectors lie on the same line, they arenot independent things in the sense one is a multiple of the other. As we saw above, this implies therewas a linear equation connecting the two vectors which had to add up to 0. Hence, we might say thevectors were not linearly independent or simply, they are linearly dependent. Phrased this way, we


are on to a way of stating this idea which can be used in many more situations. We state this as a definition. We will be general here: we are looking at objects of some type which we can add and scalar multiply. For the moment, think of these as vectors in, say, $\Re^2$, but soon enough we will make this more abstract.

Definition 3.1.1 Two Linearly Independent Objects

Let $E$ and $F$ be two mathematical objects for which addition and scalar multiplication are defined. We say $E$ and $F$ are linearly dependent if we can find non zero constants $\alpha$ and $\beta$ so that

$$\alpha E + \beta F = 0.$$

Otherwise, we say they are linearly independent.

We can then easily extend this idea to any finite collection of such objects as follows.

Definition 3.1.2 Finitely many Linearly Independent Objects

Let $\{E_i : 1 \le i \le N\}$ be $N$ mathematical objects for which addition and scalar multiplication are defined. We say $E_1, \ldots, E_N$ are linearly dependent if we can find constants $\alpha_1$ to $\alpha_N$, not all $0$, so that

$$\alpha_1 E_1 + \ldots + \alpha_N E_N = 0.$$

Note we have changed the way we define the constants a bit. When there are more than two objects involved, we can't say, in general, that all of the constants must be non zero.

3.2 Vector Spaces over a Field

If we have a set of objects $u$ with a way to add them to create new objects in the set and a way to scale them to make new objects, this is formally called a Vector Space, with the set denoted by $V$. For our purposes, we scale such objects with either real or complex numbers. If the scalars are real numbers, we say $V$ is a vector space over the reals; otherwise, it is a vector space over the complex field. There are other fields, of course, but we will focus on these two.

Definition 3.2.1 Abstract Vector Space


Let $V$ be a set of objects $u$ with an additive operation $\oplus$ and a scaling method $\odot$. Formally, this means

VS1: Given any $u$ and $v$, the operation of adding them together is written $u \oplus v$ and results in the creation of a new object in the vector space. This operation is commutative, which means the order of the operation is not important; so $u \oplus v$ and $v \oplus u$ give the same result. Also, this operation is associative, as we can group any two objects together first, perform this addition $\oplus$ and then do the others, and the order of the grouping does not matter.

VS2: Given any $u$ and any number $c$ (either real or complex, depending on the type of vector space we have), the operation $c \odot u$ creates a new object. We call such numbers scalars.

VS3: The scaling and additive operations are nicely compatible in the sense that order and grouping are not important. These are called the distributive laws for scaling and addition. They are

$$c \odot (u \oplus v) = (c \odot u) \oplus (c \odot v)$$
$$(c + d) \odot u = (c \odot u) \oplus (d \odot u).$$

VS4: There is a special object called $o$ which functions as a zero, so we always have $o \oplus u = u \oplus o = u$.

VS5: There are additive inverses, which means to each $u$ there is a unique object $u^\dagger$ so that $u \oplus u^\dagger = o$.

Comment 3.2.1 These laws imply

$$(0 + 0) \odot u = (0 \odot u) \oplus (0 \odot u)$$

which tells us $0 \odot u = o$ (add the additive inverse of $0 \odot u$ to both sides). A little further thought then tells us that, since

$$o = (1 - 1) \odot u = (1 \odot u) \oplus (-1 \odot u),$$

we have the additive inverse $u^\dagger = -1 \odot u$.

Comment 3.2.2 We usually say this much more simply. The set of objects $V$ is a vector space over its scalar field if there are two operations, which we denote by $u + v$ and $c u$, which generate new objects in the vector space for any $u$, $v$ and scalar $c$. We then just add that these operations satisfy the usual commutative, associative and distributive laws and there are unique additive inverses.

Comment 3.2.3 The objects are often called vectors and sometimes we denote them by u althoughthis notation is often too cumbersome.

Comment 3.2.4 To give examples of vector spaces, it is usually enough to specify how the additiveand scaling operations are done.

• Vectors in $\Re^2$, $\Re^3$ and so forth are added and scaled by components.

• Matrices of the same size are added and scaled by components.

• A set of functions of similar characteristics uses as its additive operator pointwise addition. The new function $f \oplus g$ is defined pointwise by $(f \oplus g)(t) = f(t) + g(t)$. Similarly, the new


function $c \odot f$ is defined pointwise by $(c \odot f)(t) = c f(t)$. Classic examples are

1. $C([a, b])$ is the set of all functions whose domain is $[a, b]$ that are continuous on the domain.

2. $C^1([a, b])$ is the set of all functions whose domain is $[a, b]$ that are continuously differentiable on the domain.

3. $RI([a, b])$ is the set of all functions whose domain is $[a, b]$ that are Riemann integrable on the domain.

There are many more, of course.
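As a quick illustration of these pointwise operations, here is a small Octave sketch; the particular functions are made-up examples.

% Pointwise addition and scaling make function handles behave like vectors.
f = @(t) sin(t);  g = @(t) t.^2;
h = @(t) f(t) + g(t);     % the new function f "plus" g
k = @(t) 3 * f(t);        % the scaled function 3 "times" f
[h(0.5), k(0.5)]          % evaluate the new objects at t = 0.5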

Vector spaces have two other important ideas associated with them. We have already talked about linearly independent objects in $\Re^2$ or $\Re^3$. In general, for any vector space, we would use the same definitions with just a few changes:

Definition 3.2.2 Two Linearly Independent Objects in A Vector Space

Let $E$ and $F$ be two objects in a vector space $V$. We say $E$ and $F$ are linearly dependent if we can find non zero scalars $\alpha$ and $\beta$ so that

$$\alpha E + \beta F = 0.$$

Otherwise, we say they are linearly independent.

We can then easily extend this idea to any finite collection of such objects as follows.

Definition 3.2.3 Finitely many Linearly Independent Objects

Let $\{E_i : 1 \le i \le N\}$ be $N$ objects in a vector space $V$. We say $E_1, \ldots, E_N$ are linearly dependent if we can find constants $\alpha_1$ to $\alpha_N$, not all $0$, so that

$$\alpha_1 E_1 + \ldots + \alpha_N E_N = 0.$$

Again, note we have changed the way we define the constants. When there are more than two objects involved, we can't say, in general, that all of the constants must be non zero.

Homework:

Exercise 3.2.1

Exercise 3.2.2

Exercise 3.2.3

Exercise 3.2.4

Exercise 3.2.5

3.2.1 The Span of a Set of Vectors

The next concept is that of the span of a collection of vectors.


Definition 3.2.4 The Span Of A Set Of Vectors

Given a finite set of vectors in a vector space $V$, $W = \{u_1, \ldots, u_N\}$ for some positive integer $N$, the span of $W$ is the collection of all new vectors of the form $\sum_{i=1}^N c_i u_i$ for any choices of scalars $c_1, \ldots, c_N$. It is easy to see the span of $W$ is a vector space itself and, since it is a subset of $V$, we call it a vector subspace. The span of the set $W$ is denoted by $\mathrm{Sp}\,W$. If the set of vectors $W$ is not finite, the definition is similar, but we say the span of $W$ is the set of all vectors which can be written as $\sum_{i=1}^N c_i u_i$ for some finite set of vectors $u_1, \ldots, u_N$ from $W$.

Let’s talk about subspaces of a vector space in more detail. If we have a vector space V and we aregiven a finite collection of objects V1, . . . ,Vn in the vector space, we have already noted the spanof this collection satisfies the definition of a vector space itself. Since it is a subset of the originalvector space, we distinguish it from the vector space V it is inside and call it a subspace. There aremany examples and it is useful to look at some.

Example 3.2.1 If we look at the vector space $C([a, b])$ of continuous functions on the interval $[a, b]$, the span of the function $1$, i.e. the constant function $x(t) = 1$ on $[a, b]$, is the set of all constant functions, which we can identify with the set of all real numbers. This is a subspace of $C([a, b])$. Note $x(t) = t^0$.

The span of $\{t^0, t^1, t^2, \ldots, t^n\}$ is all functions of the form $p(t) = \sum_{i=0}^n a_i t^i$ for any values of the real numbers $a_i$. This is the set of all polynomials of degree at most $n$ over the real numbers. It is straightforward to show these powers are linearly independent in the vector space. The linear independence equation is

$$c_0 \odot 1 \oplus c_1 \odot t^1 \oplus \ldots \oplus c_n \odot t^n = 0$$

which means

$$c_0 + c_1 t^1 + \ldots + c_n t^n = 0, \quad a \le t \le b$$

Now take $n$ derivatives of this equation, which will tell us $c_n = 0$, and then back substitute up through the $n$ derivative equations to see the rest of the constants are zero also. Thus, this collection of functions is linearly independent. Note every polynomial of degree at most $n$ is therefore uniquely expressed as a member of the span of $\{t^0, t^1, t^2, \ldots, t^n\}$. Let $P_n$ denote the span of $\{t^0, t^1, t^2, \ldots, t^n\}$. Then $\{t^0, t^1, t^2, \ldots, t^n\}$ is a linearly independent set that spans $P_n$. For future reference, note it has $n + 1$ elements.

Example 3.2.2 The set of all solutions to the ordinary differential equation x′′ + 4x = 0 is the spanof the two linearly independent functions cos(2t), sin(2t). Note this linearly independent spanningset consists of two functions.

Comment 3.2.5 Let $V$ and $W$ be two linearly independent vectors in a vector space. Is it possible to find another vector $Q$, independent from $V$ and $W$, in the span of $V$ and $W$? Since $Q$ is in the span of these two vectors, we know we can find constants $a$ and $b$ so that $Q = a V + b W$. But this says the set $\{V, W, Q\}$ is a linearly dependent set. So it is not possible to find a third independent vector in the span.

Comment 3.2.6 The above comment can be extended to the case of n linearly independent vectorsand their associated span. We see the maximum number of linearly independent vectors in the spanis n.

The comments above lead to the notion of a basis for a vector space.

Definition 3.2.5 A Basis For A Finite Dimensional Vector Space


If the set of vectors $\{V_1, \ldots, V_n\}$ is a linearly independent spanning set for the vector space $V$, we say this set is a basis for $V$ and that the dimension of $V$ is $n$.

Comment 3.2.7 From our comments, we know the maximum number of linearly independent objects in a vector space of dimension $n$ is $n$.

We can extend this to vector spaces that do not have a finite basis. First, we extend the idea of linear independence to sets that are not necessarily finite.

Definition 3.2.6 Linear Independence For Non Finite Sets

Given a set of vectors in a vector space V , W , we say W is a linearly independent subsetif every finite set of vectors from W is linearly independent in the usual manner.

We have already defined the span of a set of vectors that is not finite. So as a final matter, we want to define what we mean by a basis for a vector space which does not have a finite basis.

Definition 3.2.7 A Basis For An Infinite Dimensional Vector Space

Given a set of vectors in an infinite dimensional vector space V , W , we say W is a basis forV if the span of W is all of V and if the vectors in W are linearly independent. Hence, abasis is a linearly independent spanning set for V . Recall, this means any vector in V canbe expressed as a finite linear combination of elements from W and any finite collection ofobjects from W is linearly independent in the usual sense for a finite dimensional vectorspace.

Comment 3.2.8 In a vector space like $\Re^n$, the maximum size of a set of linearly independent vectors is $n$, the dimension of the vector space.

Comment 3.2.9 Let’s look at the vector space C([0, 1]), the set of all continuous functions on [0, 1].Let W be the set of all powers of t, 1, t, t2, t3, . . .. We can use the derivative technique to showthis set is linearly independent even though it is infinite in size. Take any finite subset from W . Labelthe resulting powers as n1, n2, . . . , np. Write down the linear dependence equation

c1tn1 + c2t

n2 + . . .+ cptnp = 0.

Take np derivatives to find cp = 0 and then backtrack to find the other constants are zero also. HenceC[0, 1] is an infinite dimensional vector space. It is also clear that W does not span C[0, 1] as if thiswas true, every continuous function on [0, 1] would be a polynomial of some finite degree. This is nottrue as sin(t), e−2t and many others are not finite degree polynomials.

Homework

Exercise 3.2.6

Exercise 3.2.7

Exercise 3.2.8

Exercise 3.2.9

Exercise 3.2.10


3.3 Inner Products

Now there is an important idea that we use a lot in applied work. If we have an object $u$ in a Vector Space $V$, we often want to approximate $u$ using an element from a given subspace $W$ of the vector space. To do this, we need to add another property to the vector space. This is the notion of an inner product. A vector space with an inner product is called an Inner Product Space. We already know what an inner product is in a simple vector space like $\Re^2$. Many vector spaces can have an inner product structure added easily. For example, in $C([a, b])$, since each object is continuous, each object is Riemann integrable. Hence, given two functions $f$ and $g$ from $C([a, b])$, the real number given by $\int_a^b f(s) g(s)\, ds$ is well-defined. It satisfies all the usual properties that the inner product for finite dimensional vectors in $\Re^n$ does also. These properties are so common we will codify them into a definition for what an inner product for a vector space $V$ should behave like.

Definition 3.3.1 Real Inner Product

Let $V$ be a vector space with the reals as the scalar field. Then a mapping $\omega$ which assigns a pair of objects to a real number is called an inner product on $V$ if

IP1: $\omega(u, v) = \omega(v, u)$; that is, the order is not important for any two objects.

IP2: $\omega(c \odot u, v) = c\,\omega(u, v)$; that is, scalars in the first slot can be pulled out.

IP3: $\omega(u \oplus w, v) = \omega(u, v) + \omega(w, v)$ for any three objects.

IP4: $\omega(u, u) \ge 0$ and $\omega(u, u) = 0$ if and only if $u = o$.

These properties imply that $\omega(u, c \odot v) = c\,\omega(u, v)$ as well. A vector space $V$ with an inner product is called an Inner Product Space.

Comment 3.3.1 The inner product is usually denoted with the symbol < , > instead of ω( , ). Wewill use this notation from now on.

Comment 3.3.2 When we have an inner product, we can measure the size or magnitude of an object as follows. We define the analogue of the Euclidean norm of an object $u$, using the usual $\|\cdot\|$ symbol, as

$$\|u\| = \sqrt{<u, u>}.$$

This is called the norm induced by the inner product. In $C([a, b])$, with the inner product $<f, g> = \int_a^b f(s) g(s)\, ds$, the norm of a function $f$ is thus $\|f\| = \sqrt{\int_a^b f^2(s)\, ds}$. This has been discussed in the earlier volumes so you can go back and review there to get up to speed. This is called the $L^2$ norm of $f$ and is denoted $\|\cdot\|_2$.
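As a concrete check, here is a short Octave sketch computing this inner product and induced norm numerically; the functions and interval are made-up examples.

% <f,g> and ||f||_2 on C([a,b]) via numerical quadrature.
f = @(s) sin(s);  g = @(s) cos(s);
a = 0;  b = pi;
ip = quad(@(s) f(s) .* g(s), a, b)     % <f,g>, essentially 0 here
nf = sqrt(quad(@(s) f(s).^2, a, b))    % ||f||_2 = sqrt(<f,f>)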

It is possible to prove the Cauchy-Schwarz inequality in this more general setting also.

Theorem 3.3.1 General Cauchy-Schwarz Inequality

If $V$ is an inner product space with inner product $< , >$ and induced norm $\|\cdot\|$, then

$$|<u, v>| \le \|u\| \, \|v\|$$

with equality occurring if and only if $u$ and $v$ are linearly dependent.


Proof 3.3.1 If $u = t v$, then

$$|<u, v>| = |<t v, v>| = |t| <v, v> = |t| \, \|v\|^2 = (|t| \, \|v\|)(\|v\|) = \|u\| \, \|v\|$$

Hence, the result holds with equality if $u$ and $v$ are linearly dependent. Now suppose $u$ and $v$ are linearly independent and look at the subspace spanned by $v$. Any element in this subspace can be written as $a \frac{v}{\|v\|}$. Consider the function

$$f(a) = \left< u - a \frac{v}{\|v\|}, \ u - a \frac{v}{\|v\|} \right> = \|u\|^2 - 2a \left< u, \frac{v}{\|v\|} \right> + a^2$$

Then

$$f'(a) = -2 \left< u, \frac{v}{\|v\|} \right> + 2a$$

which implies the critical point is $a = \left< u, \frac{v}{\|v\|} \right>$. Since $f''(a) = 2 > 0$, the critical point is a global minimum. Thus, the vector in the subspace generated by $v$ of minimum norm distance to $u$ is $P_{uv} = \left< u, \frac{v}{\|v\|} \right> \frac{v}{\|v\|}$, which has length $\left| \left< u, \frac{v}{\|v\|} \right> \right|$. The vector $P_{uv}$ is called the projection of $u$ onto $v$. We can then use GSO to find the orthonormal basis for the subspace spanned by $u$ and $v$:

$$Q_1 = \frac{v}{\|v\|}, \quad W = u - <u, Q_1> Q_1, \quad Q_2 = \frac{W}{\|W\|}$$

Thus we have the decomposition $u = <u, Q_1> Q_1 + <u, Q_2> Q_2$. Hence, if we assumed $u$ and $v$ were linearly independent with

$$|<u, v>| = \|u\| \, \|v\| \implies \|P_{uv}\| = \left| \left< u, \frac{v}{\|v\|} \right> \right| = \|u\|,$$

this forces $<u, Q_2> = 0$, telling us $u = <u, Q_1> Q_1$. But, of course, this says $u$ is a multiple of $v$, which is not possible as they are assumed independent. Hence, we conclude that if $|<u, v>| = \|u\| \, \|v\|$, then $u$ and $v$ must be dependent.

Moreover, if $u$ and $v$ are linearly independent, we have the decomposition $u = <u, Q_1> Q_1 + <u, Q_2> Q_2$, which implies $\|u\|^2 = |<u, Q_1>|^2 + |<u, Q_2>|^2$. This immediately tells us

$$|<u, v/\|v\|>|^2 \le \|u\|^2 \implies |<u, v/\|v\|>| \le \|u\| \implies |<u, v>| \le \|u\| \, \|v\|$$

which is the result we wanted to show.

Comment 3.3.3 We can use the Cauchy-Schwarz inequality to define a notion of angle between objects exactly like we would do in $\Re^2$. We define the angle $\theta$ between $u$ and $v$ via its cosine as usual:

$$\cos(\theta) = \frac{<u, v>}{\|u\| \, \|v\|}.$$

Hence, objects can be perpendicular or orthogonal even if we cannot interpret them as vectors in $\Re^2$. We see two objects are orthogonal if their inner product is $0$.

Comment 3.3.4 If $W$ is a finite dimensional subspace, a basis for $W$ is said to be an orthonormal basis if each object in the basis has norm $1$ and all of the objects are mutually orthogonal. This means $<u_i, u_j>$ is $1$ if $i = j$ and $0$ otherwise. We typically let the Kronecker delta symbol


$\delta_{ij}$ be defined by $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise, so that we can say this more succinctly as $<u_i, u_j> = \delta_{ij}$.

Let's go back to $\Re^2$. You have gone over these ideas quite a bit in your last few courses, so we will just refresh your mind here. Grab three pencils for the $\Re^2$ examples coming up and four pencils for the $\Re^3$ examples. We will try to look at these examples in a new light so you can build on earlier levels of understanding.

3.3.1 Homework

Exercise 3.3.1

Exercise 3.3.2

Exercise 3.3.3

Exercise 3.3.4

Exercise 3.3.5

3.4 ExamplesLet’s work through a number of important examples ranging from finite dimensional to infinite di-mensional cases.

3.4.1 Two Dimensional Vectors in the Plane

Take two pencils and hold them in your hand. They don't have to be the same size. These are two vectors in the plane. It doesn't matter how you hold them relative to the room you are in either. These two pencils or vectors are linearly independent if they point in different directions. The only way they are linearly dependent is if they lie on top of one another or point directly opposite to one another. In the first case, the angle between the vectors is $0$ radians and in the second the angle is $\pi$. Now these two pencils can be thought of as sitting in a sheet of paper which is our approximation of the plane determined by the two vectors represented by the pencils. Note, as long as the vectors are linearly independent, we can represent any other vector (now use that third pencil!) $W$ in terms of the first two, which we now call $E$ and $F$. We want to find the scalars $a$ and $b$ so that $a E + b F = W$, where we have stopped representing scalar multiplication by $\odot$. Taking the usual representation of these vectors, we have

$$E = E_{11} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + E_{12} \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad F = F_{11} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + F_{12} \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad W = W_{11} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + W_{12} \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

We assume, of course, our vectors are non zero. Recall we usually just write this as

$$E = \begin{bmatrix} E_{11} \\ E_{12} \end{bmatrix}, \quad F = \begin{bmatrix} F_{11} \\ F_{12} \end{bmatrix}, \quad W = \begin{bmatrix} W_{11} \\ W_{12} \end{bmatrix}$$

and it is very important to notice that we have hidden our way of representing the components of our vectors inside these column vectors. Using matrix and vector notation, we can write our problem of


finding $a$ and $b$ as

$$\begin{bmatrix} E_{11} & F_{11} \\ E_{12} & F_{12} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} W_{11} \\ W_{12} \end{bmatrix}$$

Call this $2 \times 2$ matrix $T$. Now we all know how to take the determinant of a $2 \times 2$ matrix. Note $\det(T) = E_{11} F_{12} - E_{12} F_{11}$. If this determinant were zero, then that would tell us $\frac{E_{11}}{E_{12}} = \frac{F_{11}}{F_{12}}$ in the case all the components are non zero. But this tells us $E$ and $F$ lie along the same line. Hence, they can't be linearly independent. The argument going the other way is similar. So $\det(T) = 0$ if and only if these two vectors are linearly dependent. Since we have assumed they are the opposite, i.e. linearly independent, we must have $\det(T) \neq 0$. But then Cramer's rule tells us how to find $a$ and $b$. Note each vector must have at least one non zero component and it is easy to alter our argument above to handle the zeros we might see. We will let you do that yourself! Using Cramer's rule, we have

$$a = \frac{\det \begin{bmatrix} W_{11} & F_{11} \\ W_{12} & F_{12} \end{bmatrix}}{\det(T)}, \quad b = \frac{\det \begin{bmatrix} E_{11} & W_{11} \\ E_{12} & W_{12} \end{bmatrix}}{\det(T)}$$
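Here is a small Octave sketch of this computation; the vectors are made-up examples.

% Express W in the basis {E, F} by solving T*[a;b] = W, once with Cramer's
% rule and once with the backslash solver.
E = [2; 1];  F = [1; 3];  W = [4; 7];
T = [E, F];
detT = det(T);            % nonzero since E and F are independent
a = det([W, F]) / detT;
b = det([E, W]) / detT;
[a; b], T \ W             % the two answers agree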

Of course, if the vectors $E$ and $F$ were orthonormal, things are much simpler. First, note

$$a E + b F = W \implies <E, a E + b F> = <E, W> \implies a <E, E> + b <E, F> = <E, W>$$

But $<E, F> = 0$ and $<E, E> = 1$, so we have $a = <E, W>$. In a similar way, we find

$$a E + b F = W \implies <F, a E + b F> = <F, W> \implies a <F, E> + b <F, F> = <F, W>$$

which tells us $b = <F, W>$. So we have found the unique solution is

$$\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} <W, E> \\ <W, F> \end{bmatrix} = \begin{bmatrix} E_{11} W_{11} + E_{12} W_{12} \\ F_{11} W_{11} + F_{12} W_{12} \end{bmatrix}$$

Now using standard notation, we can also write this as

$$\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} E^T \\ F^T \end{bmatrix} W$$

It is easy to see, using the orthonormality of $E$ and $F$, that we also have

$$\begin{bmatrix} E^T \\ F^T \end{bmatrix} \begin{bmatrix} E, F \end{bmatrix} = I$$

where $I$ is the $2 \times 2$ identity matrix. Since we are in two dimensions, we note there is a nice relationship between the components of $E$ and $F$ because they are orthogonal. First, the slope of $F$ is the negative reciprocal of the slope of $E$, and hence we must have

$$E = \begin{bmatrix} E_{11} \\ E_{12} \end{bmatrix}, \quad F = \begin{bmatrix} -E_{12} \\ E_{11} \end{bmatrix}$$

with $E_{11}^2 + E_{12}^2 = 1$. We can thus identify $E_{11} = \cos(\theta)$ and $E_{12} = \sin(\theta)$ for some angle $\theta$. This shows us we can represent $E$ and $F$ as

$$E = \begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix}, \quad F = \begin{bmatrix} -\sin(\theta) \\ \cos(\theta) \end{bmatrix}$$


Thus, we see

$$\begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T \\ F^T \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix} = \begin{bmatrix} \cos^2(\theta) + \sin^2(\theta) & \cos(\theta)\sin(\theta) - \cos(\theta)\sin(\theta) \\ \cos(\theta)\sin(\theta) - \cos(\theta)\sin(\theta) & \cos^2(\theta) + \sin^2(\theta) \end{bmatrix} = I$$

You probably recognize this representation of $T$ as the usual $2 \times 2$ matrix which denotes the rotation of a vector by the angle $\theta$. We note the determinant here is always $+1$. From what we have just done, it is now clear that the matrix $\begin{bmatrix} E^T \\ F^T \end{bmatrix}$ is the inverse of the matrix $T$. Indeed, it is clear the inverse is $T^T$; i.e. the inverse of $T$ is its own transpose. Hence, any two dimensional vector in $\Re^2$ is a unique linear combination of $E$ and $F$.

What we have shown here is that any vector $W$ can be written as $a E + b F$ for a unique $a$ and $b$. This kind of representation is what we have called a linear combination of the vectors $E$ and $F$. Also note we can do this whether or not $E$ and $F$ form an orthonormal set. Clearly, the set of all possible linear combinations of $E$ and $F$ gives all of $\Re^2$; i.e. the span of $E$ and $F$ is $\Re^2$. Further, since $E$ and $F$ are linearly independent, we say $E$ and $F$ form a linearly independent spanning set or basis for $\Re^2$.

This method of attack will not be very useful for three dimensional vectors, of course, as we will lose our intuition of dependent vectors being a simple scalar multiple of one another. Another way of showing $\begin{bmatrix} E, F \end{bmatrix}^T$ is the inverse of $T$ is as follows. It does not use rotations at all. Let

$$A = \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T \\ F^T \end{bmatrix}$$

We already know that any vector $W$ can be expressed as the unique linear combination $a E + b F$ where

$$\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} E, F \end{bmatrix}^T W$$

Thus

$$A W = \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T \\ F^T \end{bmatrix} (a E + b F) = a \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T \\ F^T \end{bmatrix} E + b \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T \\ F^T \end{bmatrix} F$$

$$= a \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T E \\ F^T E \end{bmatrix} + b \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T F \\ F^T F \end{bmatrix} = a \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + b \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = a E + b F = W$$

Since $A W = W$ for all $W$, this tells us $A = I$. Combining, we have

$$\begin{bmatrix} E^T \\ F^T \end{bmatrix} \begin{bmatrix} E, F \end{bmatrix} = \begin{bmatrix} E, F \end{bmatrix} \begin{bmatrix} E^T \\ F^T \end{bmatrix} = I$$

And so the inverse of $T$ is $T^T$, but this time we proved the result without resorting to rotation angles and slopes of vectors.

Of course, if $E$ and $F$ were not orthonormal, we could apply Gram-Schmidt Orthogonalization to create the orthonormal basis $\{U_1, U_2\}$ as follows:


$$U_1 = \frac{E}{\|E\|}, \quad V = F - <F, U_1> U_1, \quad U_2 = \frac{V}{\|V\|}$$

and $\{U_1, U_2\}$ would be an orthonormal basis for $\Re^2$.
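A small Octave sketch of this GSO step, using made-up starting vectors:

% Orthonormalize an independent pair {E, F} in R^2.
E = [2; 1];  F = [1; 3];
U1 = E / norm(E);
V  = F - dot(F, U1) * U1;
U2 = V / norm(V);
[dot(U1, U2), norm(U1), norm(U2)]   % prints 0, 1, 1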

3.4.1.1 Homework

Exercise 3.4.1

Exercise 3.4.2

Exercise 3.4.3

Exercise 3.4.4

Exercise 3.4.5

3.4.2 The Connection between Two Orthonormal Bases

Given the orthonormal basis $\{E_1, E_2\}$, we know from our remarks above that the matrix we form by using these vectors as columns, $T_E = \begin{bmatrix} E_1, E_2 \end{bmatrix}$, has an inverse which is its own transpose; i.e. $T_E^{-1} = (T_E)^T$. For convenience, we now denote this orthonormal basis by $E$. Hence, the orthonormal basis $\{F_1, F_2\}$ is denoted by $F$. We now introduce notation we will use later when we think carefully about transformations from $\Re^n$ to $\Re^m$. Given a vector $A$, it has a representation relative to any choice of basis. We need to recognize that this representation will depend on the choice of basis, so given that $A = a E_1 + b E_2$ for unique scalars $a$ and $b$, the vector $\begin{bmatrix} a \\ b \end{bmatrix}_E$ is a convenient way to handle this representation. We use the subscript $E$ to remind us these components depend on our choice of basis $E$. When we write a two dimensional column vector $\begin{bmatrix} a \\ b \end{bmatrix}$, have you ever thought about what the numbers $a$ and $b$ mean? You need to always remember that the idea of a two dimensional vector space is quite abstract. We can't really see these objects until we make a choice of basis. Thus we can't calculate inner products until we choose a basis also. The simplest basis is the usual one

$$i = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad j = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

which is called the standard basis, which we will denote by $I$. So all of our discussions above were using the standard basis $I$. Now this is very important. With respect to the standard basis $I$, the representation of $E_1$ is $\begin{bmatrix} E_{11} \\ E_{12} \end{bmatrix}_I$, but with respect to the basis $E$, its components would be $\begin{bmatrix} 1 \\ 0 \end{bmatrix}_E$. This is what we are doing when we set up the cartesian coordinate system. We choose what we call the $x$ and $y$ axis oriented so they are $90^\circ$ apart. The vectors $i$ and $j$ are then chosen along the positive $x$ and $y$ axis. However, rotating this choice of $i$ and $j$ by $\theta$ gives us a new $x'$ and $y'$ axis, still $90^\circ$ apart. So when we want to be very explicit, we write

$$[A]_E = \begin{bmatrix} a \\ b \end{bmatrix}_E$$


to indicate this representation with respect to the basis $E$. Now $A$ also has a representation due to the basis $F$:

$$[A]_F = \begin{bmatrix} c \\ d \end{bmatrix}_F$$

From our first discussion, now explicitly realizing we were using representations relative to the standard basis, we should write

$$[A_I]_{F_I} = \begin{bmatrix} c \\ d \end{bmatrix}_{F_I}$$

This is very cumbersome. The subscript $I$ here reminds us that the components we see in the vectors are relative to the standard basis and that the components of $E$ and $F$ are also given with respect to the standard basis. If we are expressing everything in terms of the $E$ basis, remember $E_1$ and $E_2$ themselves have the usual standard basis components; i.e. we are finding components relative to a new $x'$ and $y'$ axis. So with all that said, given some component representation for the components of $E$, we can find the components of the vectors in $F$ with respect to the basis $E$ as follows. Remember, the basis $F$ corresponds to a new rotation to an $x''$ and $y''$ coordinate axis system. Using these ideas, we see we can also express $F$ in terms of $E$. We have

$$F_1 = <F_1, E_1> E_1 + <F_1, E_2> E_2 = T_{11} E_1 + T_{12} E_2$$
$$F_2 = <F_2, E_1> E_1 + <F_2, E_2> E_2 = T_{21} E_1 + T_{22} E_2$$

Thus, using our two representations for $A$, we have

$$A = c F_1 + d F_2 = c(T_{11} E_1 + T_{12} E_2) + d(T_{21} E_1 + T_{22} E_2) = (c T_{11} + d T_{21}) E_1 + (c T_{12} + d T_{22}) E_2 = a E_1 + b E_2$$

This tells us

$$a = c T_{11} + d T_{21}, \quad b = c T_{12} + d T_{22}$$

This implies

$$\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} T_{11} & T_{21} \\ T_{12} & T_{22} \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} \implies [A]_E = \begin{bmatrix} T_{11} & T_{21} \\ T_{12} & T_{22} \end{bmatrix} [A]_F$$

or

$$[A]_E = \begin{bmatrix} <F_1, E_1> & <F_2, E_1> \\ <F_1, E_2> & <F_2, E_2> \end{bmatrix} [A]_F$$

We can do the same sort of analysis and interchange the role of $E$ and $F$ to get

$$[A]_F = \begin{bmatrix} <E_1, F_1> & <E_2, F_1> \\ <E_1, F_2> & <E_2, F_2> \end{bmatrix} [A]_E$$

The coefficients $(T_{ij})$ define a matrix which we will call $T_{FE}$, as these coefficients are completely determined by the bases $E$ and $F$ and we are transforming $F$ components into $E$ components. Hence,


we let

$$T_{FE} = \begin{bmatrix} <F_1, E_1> & <F_2, E_1> \\ <F_1, E_2> & <F_2, E_2> \end{bmatrix} = \begin{bmatrix} T_{11} & T_{21} \\ T_{12} & T_{22} \end{bmatrix} = T_{EF}^T$$

and

$$T_{EF} = \begin{bmatrix} <E_1, F_1> & <E_2, F_1> \\ <E_1, F_2> & <E_2, F_2> \end{bmatrix} = \begin{bmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{bmatrix} = T_{FE}^T$$

So $T_{FE}$ and $T_{EF}$ are transposes of each other. We can easily show $T_{FE}^T$ is the inverse of $T_{FE}$. We note for all vectors in $\Re^2$,

$$A_E = T_{FE} A_F = T_{FE} T_{EF} A_E = T_{FE} T_{FE}^T A_E$$

and

$$A_F = T_{EF} A_E = T_{FE}^T T_{FE} A_F$$

Hence $T_{FE} T_{FE}^T = T_{FE}^T T_{FE} = I$ and $T_{FE}^T$ is the inverse of $T_{FE}$.

3.4.2.1 Homework

Exercise 3.4.6

Exercise 3.4.7

Exercise 3.4.8

Exercise 3.4.9

Exercise 3.4.10

3.4.3 The Invariance of The Inner Product

For two orthonormal bases $E$ and $F$, we now know the vector $A$ transforms as follows:

$$[A]_F = \begin{bmatrix} T_1, T_2 \end{bmatrix}^T [A]_E$$

Thus, for two vectors $A$ and $B$, we can calculate

$$<[A]_F, [B]_F> = \left< \begin{bmatrix} T_1, T_2 \end{bmatrix}^T [A]_E, \begin{bmatrix} T_1, T_2 \end{bmatrix}^T [B]_E \right>$$

It is easy to see that, using vector representations, $<A, B> = A^T B = B^T A$, where these are interpreted using the usual matrix multiplication routines we know. Thus,

$$<[A]_F, [B]_F> = \left( \begin{bmatrix} T_1, T_2 \end{bmatrix}^T [A]_E \right)^T \begin{bmatrix} T_1, T_2 \end{bmatrix}^T [B]_E = [A]_E^T \begin{bmatrix} T_1, T_2 \end{bmatrix} \begin{bmatrix} T_1, T_2 \end{bmatrix}^T [B]_E$$

But $\begin{bmatrix} T_1, T_2 \end{bmatrix}^T$ is the inverse of $\begin{bmatrix} T_1, T_2 \end{bmatrix}$. Thus, we find

$$<[A]_F, [B]_F> = [A]_E^T [B]_E = <[A]_E, [B]_E>$$


which tells us the inner product calculation is invariant under a transformation from one orthonormal basis to another. Of course, if one or both of these new bases are not orthonormal, the transpose of the matrix $T$ here is no longer its inverse and the value of the inner product could change.

So in general, not writing down all the subscripts about components with respect to various bases, we find unique scalars using the $\{E_1, E_2\}$ basis so that

$$A = a E_1 + b E_2, \quad B = c E_1 + d E_2$$

Thus,

$$<A, B> = <a E_1 + b E_2, c E_1 + d E_2> = ac + bd$$

Using the other basis, we have

$$A = \alpha F_1 + \beta F_2, \quad B = \gamma F_1 + \delta F_2$$

and so

$$<A, B> = <\alpha F_1 + \beta F_2, \gamma F_1 + \delta F_2> = \alpha\gamma + \beta\delta$$

And we now know these two values will be the same!
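A quick numerical check of this invariance (our own sketch with made-up data):

% The inner product computed from components is the same in any orthonormal
% basis.
t = 0.7;
E1 = [cos(t); sin(t)];  E2 = [-sin(t); cos(t)];   % an orthonormal basis
A = [3; 4];  B = [-1; 2];
AE = [dot(A,E1); dot(A,E2)];  BE = [dot(B,E1); dot(B,E2)];
dot(A, B), dot(AE, BE)                            % both print 5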

3.4.3.1 Homework

Exercise 3.4.11

Exercise 3.4.12

Exercise 3.4.13

Exercise 3.4.14

Exercise 3.4.15

3.4.4 Two Dimensional Vectors As Functions

Let’s review Linear Systems of first order ODE. These have the form

x′(t) = a x(t) + b y(t)

y′(t) = c x(t) + d y(t)

x(0) = x0, y(0) = y0

for any numbers $a$, $b$, $c$ and $d$ and initial conditions $x_0$ and $y_0$. The full problem is called, as usual, an Initial Value Problem, or IVP for short. The two initial conditions are just called the ICs for the problem to save writing.

• For example, we might be interested in the system

x′(t) = −2 x(t) + 3 y(t)

y′(t) = 4 x(t) + 5 y(t)

x(0) = 5, y(0) = −3

Here the IC’s are x(0) = 5 and y(0) = −3.


• Another sample problem might be the one below.

x′(t) = 14 x(t) + 5 y(t)

y′(t) = −4 x(t) + 8 y(t)

x(0) = 2, y(0) = 7

3.4.4.1 The Characteristic Equation

For linear first order problems like $u' = 3u$ and so forth, we find the solution has the form $u(t) = A e^{3t}$ for some number $A$. We then determine the value of $A$ to use by looking at the initial condition. To find the solutions here, we begin by rewriting the model in matrix-vector notation:

$$\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}.$$

The $2 \times 2$ matrix above is called the coefficient matrix of this model and is usually denoted by $A$. The initial conditions can then be redone in vector form as

$$\begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$$

Now it seems reasonable to believe that if a constant times $e^{rt}$ solves a first order linear problem like $u' = ru$, perhaps a vector times $e^{rt}$ will work here. Let's make this formal. So let's look at the problem below

x′(t) = 3 x(t) + 2 y(t)

y′(t) = −4 x(t) + 5 y(t)

x(0) = 2

y(0) = −3

Assume the solution has the form $V e^{rt}$. Let's denote the components of $V$ as follows:

$$V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}.$$

We assume the solution is

$$\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = V e^{rt}.$$

Then the derivative of $V e^{rt}$ is

$$(V e^{rt})' = r V e^{rt} \implies \begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = r V e^{rt}.$$

When we plug these terms into the matrix-vector form of the problem, we find

$$r V e^{rt} = \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt}$$

Rewrite using the identity matrix $I$ as

$$r V e^{rt} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt} = r I V e^{rt} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt}$$


$$= \left( r I - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \right) V e^{rt} = \left( \begin{bmatrix} r & 0 \\ 0 & r \end{bmatrix} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \right) V e^{rt} = \begin{bmatrix} r - 3 & -2 \\ 4 & r - 5 \end{bmatrix} V e^{rt}$$

Plugging this into our model, we find

$$\begin{bmatrix} r - 3 & -2 \\ 4 & r - 5 \end{bmatrix} V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

But $e^{rt}$ is never $0$, so we want $r$ satisfying

$$\begin{bmatrix} r - 3 & -2 \\ 4 & r - 5 \end{bmatrix} V = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

For each $r$, we get two equations in $V_1$ and $V_2$:

$$(r - 3) V_1 - 2 V_2 = 0, \quad 4 V_1 + (r - 5) V_2 = 0.$$

Let $A_r$ be this matrix. Any $r$ for which $\det A_r \neq 0$ tells us these two lines have different slopes and so cross only at the origin, implying $V_1 = 0$ and $V_2 = 0$. Thus

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$

which will not satisfy nonzero initial conditions. So reject these $r$. Any value of $r$ for which $\det A_r = 0$ gives an infinite number of solutions, which allows us to pick one that matches the initial conditions we have. The equation

$$\det(r I - A) = \det \begin{bmatrix} r - 3 & -2 \\ 4 & r - 5 \end{bmatrix} = 0$$

is called the characteristic equation of this linear system. The characteristic equation is a quadratic, so there are three possibilities: the roots are real and distinct, the real roots are the same, or the roots are a complex conjugate pair.

Example 3.4.1 Derive the characteristic equation for the system below

x′(t) = 8 x(t) + 9 y(t)

y′(t) = 3 x(t) − 2 y(t)

x(0) = 12

y(0) = 4

Solution The matrix-vector form is

$$\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}, \quad \begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \end{bmatrix}$$


The coefficient matrix $A$ is thus

$$A = \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix}$$

Assume the solution has the form $V e^{rt}$ and plug this into the system:

$$r V e^{rt} - \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Rewrite using the identity matrix $I$ and factor:

$$\left( r I - \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} \right) V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Since $e^{rt}$ is never $0$, we find $r$ and $V$ satisfy

$$\left( r I - \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} \right) V = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

If $r$ is chosen so that $\det(r I - A) \neq 0$, the only solution to this system of two linear equations in the two unknowns $V_1$ and $V_2$ is $V_1 = 0$ and $V_2 = 0$. This leads to $x(t) = 0$ and $y(t) = 0$ always, and this solution does not satisfy the initial conditions. Hence, we must find $r$ which gives $\det(r I - A) = 0$. The characteristic equation is thus

$$\det \begin{bmatrix} r - 8 & -9 \\ -3 & r + 2 \end{bmatrix} = (r - 8)(r + 2) - 27 = r^2 - 6r - 43 = 0$$
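We can confirm this with a short Octave check; eig and roots are standard built-ins.

% The eigenvalues of A are the roots of the characteristic equation.
A = [8, 9; 3, -2];
eig(A)                 % roots of det(rI - A) = 0
roots([1, -6, -43])    % roots of r^2 - 6r - 43, the same two numbers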

Homework

Exercise 3.4.16

Exercise 3.4.17

Exercise 3.4.18

Exercise 3.4.19

Exercise 3.4.20

3.4.4.2 Finding The General Solution

The roots of the characteristic equation are of course the eigenvalues of the coefficient matrix $A$. We usually organize the eigenvalues with the largest one first, although we don't have to. In the examples below, we list the eigenvalues from small to large but then take $r_1$ to be the largest!

• Example: The eigenvalues are $-2$ and $-1$. So $r_1 = -1$ and $r_2 = -2$. Since $e^{-2t}$ decays faster than $e^{-t}$, we say the root $r_1 = -1$ is the dominant part of the solution.

• Example: The eigenvalues are $-2$ and $3$. So $r_1 = 3$ and $r_2 = -2$. Since $e^{-2t}$ decays and $e^{3t}$ grows, we say the root $r_1 = 3$ is the dominant part of the solution.

• Example: The eigenvalues are $2$ and $3$. So $r_1 = 3$ and $r_2 = 2$. Since $e^{2t}$ grows slower than $e^{3t}$, we say the root $r_1 = 3$ is the dominant part of the solution.


For each eigenvalue $r$, we want to find nonzero vectors $V$ so that $(r I - A) V = 0$, where, to help with our writing, we let $0$ be the two dimensional zero vector. The nonzero $V$ are called the eigenvectors for eigenvalue $r$ and satisfy $A V = r V$.

For eigenvalue $r_1$, find $V$ so that $(r_1 I - A) V = 0$. There will be an infinite number of $V$'s that solve this; we pick one and call it eigenvector $E_1$.

For eigenvalue $r_2$, find $V$ so that $(r_2 I - A) V = 0$. There will again be an infinite number of $V$'s that solve this; we pick one and call it eigenvector $E_2$. The general solution to our model will be

$$\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = A E_1 e^{r_1 t} + B E_2 e^{r_2 t}.$$

where $A$ and $B$ are arbitrary. We use the ICs to find $A$ and $B$. It is best to show all this with some examples.

Example 3.4.2 For the system below

$$\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} -20 & 12 \\ -13 & 5 \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}, \quad \begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$$

• Find the characteristic equation

• Find the general solution

• Solve the IVP

Solution The characteristic equation is

$$\det\left( r \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} -20 & 12 \\ -13 & 5 \end{bmatrix} \right) = 0$$

$$0 = \det \begin{bmatrix} r + 20 & -12 \\ 13 & r - 5 \end{bmatrix} = (r + 20)(r - 5) + 156 = r^2 + 15r + 56 = (r + 8)(r + 7)$$

Hence, the eigenvalues or roots of the characteristic equation are $r_1 = -8$ and $r_2 = -7$. Note since this is just a calculation, we are not following our labeling scheme.

For eigenvalue $r_1 = -8$, substitute the value into

$$\begin{bmatrix} r + 20 & -12 \\ 13 & r - 5 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} \Rightarrow \begin{bmatrix} 12 & -12 \\ 13 & -13 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}$$

This system of equations should be collinear: i.e. the rows should be multiples; i.e. both give rise to the same line. Our rows are multiples, so we can pick any row to find $V_2$ in terms of $V_1$. Picking the top row, we get $12 V_1 - 12 V_2 = 0$, implying $V_2 = V_1$. Letting $V_1 = a$, we find $V_1 = a$ and $V_2 = a$; so

$$\begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = a \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$


Choose $E_1$: The vector

$$E_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

is our choice for an eigenvector corresponding to eigenvalue $r_1 = -8$. So one of the solutions is

$$\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = E_1 e^{-8t} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^{-8t}.$$

For eigenvalue $r_2 = -7$, substitute the value into

$$\begin{bmatrix} r + 20 & -12 \\ 13 & r - 5 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} \Rightarrow \begin{bmatrix} 13 & -12 \\ 13 & -12 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}$$

This system of equations should be collinear: i.e. the rows should be multiples; i.e. both give rise to the same line. Our rows are multiples, so we can pick any row to find $V_2$ in terms of $V_1$. Picking the top row, we get $13 V_1 - 12 V_2 = 0$, implying $V_2 = (13/12) V_1$. Letting $V_1 = b$, we find $V_1 = b$ and $V_2 = (13/12) b$; so

$$\begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = b \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix}$$

Choose $E_2$ to be

$$E_2 = \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix}$$

as our choice for an eigenvector corresponding to eigenvalue $r_2 = -7$. So the other solution is

$$\begin{bmatrix} x_2(t) \\ y_2(t) \end{bmatrix} = E_2 e^{-7t} = \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^{-7t}.$$

Note, it is easy to see $E_1$ and $E_2$ are linearly independent vectors in $\Re^2$. The general solution is then

$$\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = A E_1 e^{-8t} + B E_2 e^{-7t} = A \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^{-8t} + B \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^{-7t}$$

Now let's solve the initial value problem: we find $A$ and $B$.

$$\begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix} = A \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^0 + B \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^0 = A \begin{bmatrix} 1 \\ 1 \end{bmatrix} + B \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix}$$

So

$$A + B = -1, \quad A + \frac{13}{12} B = 2$$


Subtracting the bottom equation from the top equation, we get $-\frac{1}{12} B = -3$, or $B = 36$. Thus, $A = -1 - B = -37$. So

$$\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = -37 \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^{-8t} + 36 \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^{-7t}$$
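Here is a short Octave sketch solving this IVP numerically; it is our own check on the hand computation above.

% eig returns the eigenvalues -8 and -7 (in some order) and eigenvectors that
% are scaled versions of E1 and E2; c matches the initial conditions.
A = [-20, 12; -13, 5];
[V, D] = eig(A);
c = V \ [-1; 2];                 % solve V*c = [x(0); y(0)]
t = 0.5;                         % a sample time
V * diag(exp(diag(D)*t)) * c     % [x(t); y(t)] at t = 0.5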

Homework

Exercise 3.4.21

Exercise 3.4.22

Exercise 3.4.23

Exercise 3.4.24

Exercise 3.4.25

Let $C^1([0, 10])$ be the set of all functions $x$ on $[0, 10]$ which are continuously differentiable. This is clearly a vector space with scalar multiplication $\odot$ defined by $c \odot x$, the function whose value at $t$ is

$$(c \odot x)(t) = c\, x(t)$$

To define addition of vectors $\oplus$, we let $x \oplus y$ be the function whose value at $t$ is

$$(x \oplus y)(t) = x(t) + y(t)$$

With these operations, $C^1([0, 10])$ is a vector space over $\Re$. Now let $X = C^1([0, 10]) \times C^1([0, 10])$, which is the collection of all ordered pairs of continuously differentiable functions over $\Re$. It is easy to see we make this into a vector space by defining $\odot$ and $\oplus$ to act on pairs of objects in the usual way. We can define an inner product on this space by

$$<f, g> = \int_0^{10} \left< \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}, \begin{bmatrix} g_1 \\ g_2 \end{bmatrix} \right> dt$$

where $f = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}$ and $g = \begin{bmatrix} g_1 \\ g_2 \end{bmatrix}$ are arbitrary members of $X$. The two solutions $y_1(t) = E_1 e^{-8t}$ and $y_2(t) = E_2 e^{-7t}$ are members of $X$ which are linearly independent, as when we write down the linear dependence check, we have

$$a \odot y_1 \oplus b \odot y_2 = 0$$

This means

$$a E_1 e^{-8t} + b E_2 e^{-7t} = 0, \quad \forall t \ge 0$$

In particular, at $t = 0$, we have

$$a E_1 + b E_2 = 0$$

which is easily seen to force $a = b = 0$ since the eigenvectors $E_1$ and $E_2$ are linearly independent vectors in $\Re^2$. Hence, these functions are linearly independent objects in this vector space. They play the same role as the two dimensional vectors in the plane $\Re^2$ in our first section called $E$ and $F$. Their span $S$ is all linear combinations of $y_1$ and $y_2$, which is the solution space of this ODE.


Thus, $\{y_1, y_2\}$ is a basis for a two dimensional subspace inside $X$. We can then use Gram-Schmidt Orthogonalization to create an orthonormal basis for this solution space.

Listing 3.1: Computing the Orthonormal Basis

% Convert the eigenvectors to unit vectors
A = [-1; 2];
E1 = A / norm(A)
B = [1; 13/12];
E2 = B / norm(B);
% Now do GSO
G1 = E1;
V = E2 - dot(E2, G1) * G1;
G2 = V / norm(V);
>> dot(E1, E2)
ans = 0.35389
>> dot(G1, G2)
ans = 0

We check in this code fragment that $E_1$ and $E_2$ are not orthogonal but the new basis $G$ is. Then, the functions $u_1 = G_1 e^{-8t}$ and $u_2 = G_2 e^{-7t}$ also solve the ODE and we have

$$<u_1, u_2> = \int_0^{10} <G_1 e^{-8t}, G_2 e^{-7t}>\, dt = \int_0^{10} <G_1, G_2>\, e^{-15t}\, dt = 0$$

So we have found an orthogonal basis for the solution space. Note everything we did before in the two dimensional example in $\Re^2$ carries through exactly. In fact, if we define the mapping $\phi : \mathrm{span}\{G_1, G_2\} \to \mathrm{span}\{G_1 e^{-8t}, G_2 e^{-7t}\}$ by

$$\phi(G_1) = G_1 e^{-8t}, \quad \phi(G_2) = G_2 e^{-7t}$$

and extend linearly, then $\phi$ is a 1-1 and onto mapping of the span of $\{G_1, G_2\}$ to the span of $\{G_1 e^{-8t}, G_2 e^{-7t}\}$. Hence, we can identify the solution space of this ODE with the two dimensional space $\Re^2$. We can see the set of solutions to a second order linear ODE gives us a two dimensional vector space.

3.4.4.3 Homework

Exercise 3.4.26

Exercise 3.4.27

Exercise 3.4.28

Exercise 3.4.29

Exercise 3.4.30

3.4.5 Three Dimensional Vectors in Space

Now we will use four pencils to motivate what is happening in three dimensions. Pick any two and hold them in your hand. These two pencils determine a plane as long as the pencils are not collinear. This plane does not have to be oriented parallel to our standard planes. Based on what we have said


already, the standard basis for $\Re^3$ is

$$e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},$$

where we use these labels for the standard basis instead of the $i$, $j$ and $k$ you may have seen before. If we want to work in $\Re^4$ or higher dimensions, we need the $e_i$ notation. Pick any axis to orient the $e_1$ along. The starting point of the $e_1$ vector determines the origin of the three dimensional coordinate system we are constructing. The direction of $e_1$ determines the $+x$ axis. The $e_2$ vector then starts at the same origin and is positioned so it is orthogonal to $e_1$. The direction of $e_2$ fixes the $+y$ axis. We then use the right hand rule to determine the placement of $e_3$. This determines the $+z$ axis. We now have a standard $\Re^3$ coordinate system that uses the standard orthonormal basis. The first two pencils fit in this system. Their common origin doesn't have to match the origin of the $\Re^3$ system we have just constructed, but we want our pencils to determine a subspace in $\Re^3$. So think of their common origin as the same as the one we use for $\Re^3$ as constructed. These two pencils may not have the same length and may not be orthogonal, but they determine a plane. If the third pencil lies in this plane, then it is not independent from the first two. However, if it lies out of the plane, the three pencils are linearly independent vectors whose span is all three dimensional vectors. Call the first pencil $E_1$, the second $E_2$ and the third $E_3$. Then if the fourth pencil is $W$, we can find the unique numbers which give its linear combination in terms of the basis $\{E_1, E_2, E_3\}$ by solving

$$a E_1 + b E_2 + c E_3 = W \implies \begin{bmatrix} E_1 & E_2 & E_3 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = W$$

Since these three vectors are linearly independent,

$$a E_1 + b E_2 + c E_3 = 0 \implies \begin{bmatrix} E_1 & E_2 & E_3 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = 0$$

only has the solution $a = b = c = 0$. Hence, the kernel of this matrix $T = \begin{bmatrix} E_1, E_2, E_3 \end{bmatrix}$ is $\{0\}$ and so $T$ is invertible. The vectors $E_1$, $E_2$ and $E_3$ need not be orthogonal, so it is not so easy to find this inverse, but what counts here is that the equation

$$a E_1 + b E_2 + c E_3 = W$$

has a unique solution for each $W$.

Linear dependence of the vectors $E_1$, $E_2$ and $E_3$ is more interesting in $\Re^3$. If you pick any two vectors, linear dependence of the third on the other two means one of these:

• E3 is in the plane determined by E1 and E2

• E1 is in the plane determined by E2 and E3

• E2 is in the plane determined by E1 and E3

And if some of the vectors are collinear, the situation degrades

• E1 and E2 are collinear

• E2 and E3 are collinear

• E1 and E3 are collinear

Page 54: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 46 — #54

46 CHAPTER 3. VECTOR SPACES

So the span of E1, E2 and E3 can be one dimensional ( all vectors are collinear ), two dimensional( two of the three are linearly independent ) and three dimensional ( all three vectors are linearlyindependent ). In the last case, the vectors form a linearly independent spanning set and so are abasis.If the vectors are not mutually orthogonal, we can use Graham Schmidt Orthogonalization to createan orthonormal basis as follows:

Listing 3.2: Computing the 3D Orthonormal Basis

E1 = [ −1 ; 2 ; 3 ] ;E2 = [ 4 ; 5 ; 6 ] ;E3 = [ −3 ; 1 ; 7 ] ;% Now do GSOG1 = E1 / norm ( E1 ) ;V = E2 − dot ( E2 , G1) ∗G1;G2 = V/ norm (V) ;V = E3−dot ( E3 , G1) ∗G1−dot ( E3 , G2) ∗G2;G3 = V/ norm (V) ;

If E = E1,E2,E3 is an orthonormal basis and F = F1,F2,F3 is another orthonormal basis,the representation of a vector A can be given with respect to the standard basis, the basis E or thebasis F . Hence, we have the unique representation

AE = a1E1 + a2E2 + a3E3 =⇒[E1 E2 E3

] a1

a2

a3

= AE

It is easy to see ET1ET2ET3

[E1 E2 E3

]= I

Now let’s consider

[E1 E2 E3

] ET1ET2ET3

Ej =[E1 E2 E3

]ej = Ej

Thus, we see for any vector aE1 + bE2 + cE3, we have

[E1 E2 E3

] ET1ET2ET3

(aE1 + bE2 + cE3) = aE1 + bE2 + cE3

Hence, the inverse of[E1 E2 E3

]is its own transpose.

Finally, sinceA has a representation with respect to the basis F also, We have

F1 = < F1,E1 > E1+ < F1,E2 > E2+ < F1,E3 > E3 = T11E1 + T12E2 + T13E3

F2 = < F2,E1 > E1+ < F2,E2 > E2+ < F2,E3 > E3 = T21E1 + T22E2 + T23E3

F3 = < F3,E1 > E1+ < F3,E2 > E2+ < F3,E3 > E3 = T31E1 + T32E2 + T33E3

Page 55: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 47 — #55

3.4. EXAMPLES 47

Thus, using our two representations forA, we have

A = b1F1 + b2F2 + b3F3 = b1(T11E1 + T12E2 + T13E3) + b2(T21E1 + T22E2 + T23E3)

+b3(T31E1 + T32E2 + T33E3)

This tells us

a1 = b1T11 + b2T21 + b3T31

a2 = b1T12 + b2T22 + b3T32

a3 = b1T13 + b2T23 + b3T33

This impliesa1

a2

ae

=

T11 T21 T31

T12 T22 T32

T13 T23 T33

b1b2b3

=⇒[A]E

=

T11 T21 T31

T12 T22 T32

T13 T23 T33

[A]F

or

[A]E

=

< F1,E1 > < F2,E1 > F3,E1

< F1,E2 > < F2,E2 > F2,E3

< F1,E3 > < F2,E3 > F3,E3

[A]F

We can do the same sort of analysis and interchange the role of E and F to get

[A]F

=

< E1,F1 > < E2,F1 > E3,F1

< E1,F2 > < E2,F2 > E3,F2

< E1,F3 > < E2,F3 > E3,F3

[A]E

As before, the coefficients (Tij) define a matrix which we will call TFE as these coefficients arecompletely determined by the basis E and F and we are transforming F components into E com-ponents. Hence, we let

TFE =

< F1,E1 > < F2,E1 > F3,E1

< F1,E2 > < F2,E2 > F2,E3

< F1,E3 > < F2,E3 > F3,E3

=

T11 T21 T31

T12 T22 T23

T13 T23 T33

= TEFT

and

TEF =

< E1,F1 > < E2,F1 > E3,F1

< E1,F2 > < E2,F2 > E3,F2

< E1,F3 > < E2,F3 > E3,F3

=

T11 T21 T31

T12 T22 T23

T13 T23 T33

= TEFT

So TFE and TEF are transposes of each other. We can again show TFET is the inverse of TFE bya calculation similar to the one we did for <2. In fact, since the dimension of the underlying space isnot shown, the argument is identical. We note for all vectors in <3,

AE = TFE AF = TFE TEF AE = TFE TFETAE

and

AF = TEF AE = TFET TFE AF = TFE

T TFEAE

Page 56: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 48 — #56

48 CHAPTER 3. VECTOR SPACES

Hence TFE TFET = TFET TFE = I and TFET is the inverse of TFE .

The calculation that in inner product in <3 in invariant under a change of orthonormal basis is thenthe same. Also, the arguments we use here can be adapted with little change to the <n setting.

3.4.5.1 Homework

Exercise 3.4.31

Exercise 3.4.32

Exercise 3.4.33

Exercise 3.4.34

Exercise 3.4.35

3.4.6 The Solution Space of Higher Dimensional ODE Systems

Let’s do a final example to show you how <6 can be identified with the solution space of a linearsystem of ODEs. A general model of enhanced cytokine signalling production based on LRRK2mutation is as follows. The altered cytokine response starts with the activation of an allele called A,in a small compartment of cells. Initially, all cells have a correct version of the LRRK2 gene. Wewill denote this by A−/− where the superscript “−/−” indicates there are no LRRK2 mutations.One of the mutant alleles becomes activated at mutation rate u1 to generate a cell type denoted byA+/−. The superscript +/− tells us one allele is activated. The second allele becomes activatedat rate u2 to become the cell type A+/+. In addition, A−/− cells can also receive mutations thattrigger CIN. This happens at the rate uc resulting in the cell typeA−/− CIN . This kind of a cell canactivate the first allele of the LRRK2 gene with normal mutation rate u1 to produce a cell with oneactivated allele (i.e. a +/−) which started from a CIN state. We denote these cells as A+/− CIN .We can also get a cell of type A+/− CIN when a cell of type A+/− receives a mutation whichtriggers CIN. We will assume this happens at the same rate uc as before. The A+/− CIN cell thenrapidly undergoes LOH at rate u3 to produce cells having the second allele of LRRK2 which is oftype A+/+ CIN . Finally, A+/+ cells can experience CIN at rate uc to generate A+/+ CIN cells.We show this information in Figure 3.1.

A−/− A+/− A+/+

A−/− CIN A+/− CIN A+/+ CIN

Without CIN

With CIN

u1 u2

uc uc uc

u1 u3

Figure 3.1: The pathways for the LRRK2 allele gains.

Let N be the population size within which the LRRK2 mutations occur. We will assume a typicalvalue of N is 103 to 104. The first allele is activated by a point mutation. The rate at which thisoccurs is modeled by the rate u1 as shown in Figure 3.1. We make the following assumptions:

Page 57: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 49 — #57

3.4. EXAMPLES 49

• the mutations governed by the rates u1 and uc are neutral. This means that these rates do notdepend on the size of the population N .

• The events governed by u2 and u3 give what is called selective advantage. This means thatthe size of the population size does matter.

Using these assumptions, we will model u2 and u3 as u2 = N u2 and u3 = N u3. where u2 and u3

are neutral rates. We can thus redraw our figure as Figure 3.2.

A−/− A+/− A+/+

A−/− CIN A+/− CIN A+/+ CIN

Without CIN

With CIN

u1 N u2

uc uc uc

u1 N u3

Figure 3.2: The pathways for the LRRK2 allele gains rewritten using selective advantage.

The mathematical model is then setup as follows. Let

X0(t) is the probability a cell in in cell type A−/− at time t.

X1(t) is the probability a cell in in cell type A+/− at time t.

X2(t) is the probability a cell in in cell type A+/+ at time t.

Y0(t) is the probability a cell in in cell type A−/− CIN at time t.

Y1(t) is the probability a cell in in cell type A+/− CIN at time t.

Y2(t) is the probability a cell in in cell type A+/+ CIN at time t.

Looking at Figure 3.2, we can generate rate equations. First, let’s rewrite Figure 3.2 using ourvariables as Figure 3.3. To generate the equations we need, note each box has arrows coming into

X0 X1 X2

Y0 Y1 Y2

Without CIN

With CIN

u1 N u2

uc uc uc

u1 N u3

Figure 3.3: The pathways for the LRRK2 allele gains rewritten using mathematical variables.

Page 58: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 50 — #58

50 CHAPTER 3. VECTOR SPACES

it and arrows coming out of it. The arrows in are growth terms for the net change of the variablein the box and the arrows out are the decay or loss terms. We model growth as exponentialgrowth and loss as exponential decay. So X0 only has arrows going out which tells us it onlyhas loss terms. So we would say (X ′0)loss = −u1X0 − ucX0 which implies X ′0 = −(u1 + uc)X0.Further,X1 has arrows going in and out which tells us it has growth and loss terms. So we would say(X ′1)loss = −Nu2X1−ucX1 and (X ′1)growth = u1X0 which impliesX ′1 = u1X0−(Nu2+uc)X1.We can continue in this way to find all the model equations. We can then see the rate equations are

X ′0 = −(u1 + uc)X0 (3.1)X ′1 = u1 X0 − (uc + N u2)X1 (3.2)X ′2 = N u2 X1 − uc X2 (3.3)Y ′0 = uc X0 − u1 Y0 (3.4)Y ′1 = uc X1 + u1 Y0 − N u3 Y1 (3.5)Y ′2 = N u3 Y1 + uc X2 (3.6)

Initially, at time 0, all the cells are in the state X0, so we have

X0(0) = 1, X1(0) = 0, X2(0) = 0 (3.7)Y0(0) = 0, Y1(0) = 0, Y2(0) = 0. (3.8)

Now under what circumstances is the CIN pathway to LRRK2 mutations the dominant one? In orderto answer this, we need to analyze the trajectories of this model. We can rewrite this as a matrix -vector system:

X ′0X ′1X ′2Y ′0Y ′1Y ′2

=

−(u1 + uc) 0 0 0 0 0

u1 −(uc +Nu2) 0 0 0 00 Nu2 −uc 0 0 0uc 0 0 −u1 0 00 uc 0 u1 −Nu3 00 0 uc 0 Nu3 0

X0

X1

X2

Y0

Y1

Y2

The eigenvalues of this system are the roots of the polynomial p(λ) = det(λI −A) where A is the6× 6 coefficient matrix above. We find

p(λ) = det

λ+ (u1 + uc) 0 0 0 0 0

−u1 λ+ (uc +Nu2) 0 0 0 00 −Nu2 λ+ uc 0 0 0−uc 0 0 λ+ u1 0 0

0 −uc 0 −u1 λ+Nu3 00 0 −uc 0 −Nu3 λ

This is easily expanded using the properties of determinants (which we will discuss later!) to give

p(λ) = (λ+ (u1 + uc))

λ+ (uc +Nu2) 0 0 0 0

−Nu2 λ+ uc 0 0 00 0 λ+ u1 0 0−uc 0 −u1 λ+Nu3 0

0 −uc 0 −Nu3 λ

Page 59: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 51 — #59

3.4. EXAMPLES 51

= (λ+ (u1 + uc)) (λ+ (uc +Nu2))

λ+ uc 0 0 0

0 λ+ u1 0 00 −u1 λ+Nu3 0−uc 0 −Nu3 λ

= (λ+ (u1 + uc)) (λ+ (uc +Nu2)) (λ+ uc)

λ+ u1 0 0−u1 λ+Nu3 0

0 −Nu3 λ

= (λ+ (u1 + uc)) (λ+ (uc +Nu2)) (λ+ uc)(λ+ u1) (λ+Nu3) (λ)

Hence, the eigenvalues are λ1

λ2

λ3

λ4

λ5

λ6

=

−(u1 + uc)−(uc +Nu2)−uc−u1

−Nu3

0

Since this is a biological model, we get more insight into finding a solution with the critical parame-ters in this form rather than substituting numerical values and using a tool like MatLab. It is possibleto calculate the six needed eigenvectors by hand for this system which we enjoyed doing but weunderstand not all share this interest. We find the eigenvectors are

E1 =

Nu2 − u1

u1

−Nu2

−(Nu2 − u1)−u1

Nu2−u1−ucNu3−u1−uc

Nu3(Nu2−u1)−Nu2ucNu3−u1−uc

, E2 =

01−10−uc

Nu2+uc−Nu3uc

Nu2+uc−Nu3

, E3 =

001000

,

E4 =

0001u1

Nu3−u1Nu3

Nu3−u1

, E5 =

00001−1

, E6 =

000001

,

The general solution is thus any linear combination of the form

c1E1e−(u1+uc)t + c2E2e

−(uc+Nu2)t + c3E3e−uct + c4E4e

−u1t + c5E5e−Nu3t + c6E61

where 1 denotes the constant function e0t = 1. If we let these six solutions be denoted by yi(t) =Eie

λit, we see the general solution is a member of the span of y1, . . . ,y6 and since these functionsare linearly independent in the spaceX = C1([0, T ]×C1([0, T ]×C1([0, T ]×C1([0, T ]×C1([0, T ]×C1([0, T ]× for any appropriate T , we know this solution space is a six dimensional subspace of Xwith the basis y1, . . . ,y6. If F andG are two elements inX , it is easy to show the we can definean inner product onX to be ∫ T

0

< F (t),G(t) > dt

We can construct an orthonormal basis for the solution space by applying GSO to the eigenvectorsEi to create the orthonormal basis G. Then the new functions wi = Gie

λit are solutions to the

Page 60: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 52 — #60

52 CHAPTER 3. VECTOR SPACES

ODE system which are mutually orthogonal. So it is straightforward to find an orthonormal basishere.Note this solution space can be identified with any six dimensional subspace of any vector space bysimply mapping one orthonormal basis to another and extending linearly.It is not our intent here to pursue this problem further. It turns out looking at the solution space thisway is not so helpful to understand what is going on. Finding approximate solutions using standardTaylor series expansions is better. You can read about this in other places if you are interested.

3.4.6.1 Homework

Exercise 3.4.36

Exercise 3.4.37

Exercise 3.4.38

Exercise 3.4.39

Exercise 3.4.40

3.5 Best Approximation in a Vector Space With Inner ProductAn important problem is that of the idea of finding the best object in an subspace W that approximatesa given object . This is an easy theorem to prove.

Theorem 3.5.1 The Finite Dimensional Approximation Theorem

Let p be any object in the inner product space V with inner product<,> and induced norm‖·‖. Let W be a finite dimensional subspace with an orthonormal basisE = E1, . . .Enwhere n is the dimension of the subspace. Then there is an unique object p∗ in W whichsatisfies

||p− p∗|| = minu∈W

||u− p||

with

p∗ =

N∑i=1

< p,Ei > Ei.

Further, p− p∗ is orthogonal to the subspace W ; i.e. < p∗, u >= 0 for all u ∈ W .

Proof 3.5.1Any object in the subspace has the representation

∑Ni=1 aiEi for some scalars ai. Consider the

function of N variables

E(a1, . . . , aN ) =

⟨p−

N∑i=1

aiEi,p−N∑j=1

ajEj

= < p,p > −2

N∑i=1

ai < p,Ei >

+

N∑i=1

N∑j=1

aiaj < Ei,Ej > .

Page 61: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 53 — #61

3.5. BEST APPROXIMATION IN A VECTOR SPACE WITH INNER PRODUCT 53

Simplifying using the orthonormality of the basis, we find

E(a1, . . . , aN ) = < p,p > −2

N∑i=1

ai < p,Ei > +

N∑i=1

a2i .

This is a quadratic expression and setting the gradient of E to zero, we find the critical pointsaj =< p,Ej >. This is a global minimum for the function E. Hence, the optimal p∗ has theform

p∗ =

N∑i=1

< p,Ei > Ei.

Finally, we see

< p− p∗,Ej > = < p,Ej > −N∑k=1

< p,Ek >< Ek,Ej >

= < p,Ej > − < p,Ej >= 0,

and hence, p− p∗ is orthogonal of W .

This is an extension of the Cauchy - Schwartz Theorem in a way. It was easy to see how to handle thebest approximation idea when only two vectors were involved so we didn’t need all the machineryabove. But this is a powerful idea. We can put this into perspective by recalling some basic factsfrom Fourier Series.

Letting un(x) = sin(nπL x) and using the standard inner product onC([0, L]),< f, g >=∫ L

0f(t)g(t)dt,

we know

< ui, uj > =

L2 , i = j0, i 6= j.

We define the new functions un by√

2Lun. Then, < ui, uj >= δji and the sequence of functions

(un) are all mutually orthogonal. It is clear ||un||2 = 1 always. So the sequence of functions (un)are all mutually orthogonal and length one. Letting vn(x) = cos(nπL x), we also know

< v0, v0 > =1

L

< vi, vj > =

L2 , i = j0, i 6= j.

Hence, we can define the new functions

v0(x) =

√1

L

vn(x) =

√2

Lcos

(nπ

Lx

), n ≥ 1,

Then, < vi, vj >= δji and the sequence of functions (vn) are all mutually orthogonal with length||vn|| = 1.

If we start with a function f which is continuous on the interval [0, L], we can define the trigonometricseries associated with f as follows

Page 62: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 54 — #62

54 CHAPTER 3. VECTOR SPACES

S(x) =1

L< f,1 >

+

∞∑i=1

(2

L

⟨f(x), sin

(iπ

Lx

)⟩sin

(iπ

Lx

)+

2

L

⟨f(x), cos

(iπ

Lx

)⟩cos

(iπ

Lx

)).

In terms of the orthonormal families we have just defined, this can be rewritten as

S(x) = limN→∞

(< f, v01 > v0 +

N∑n=1

< f(x), vn > vn +

N∑n=1

< f(x), un > un

)

Note the projection of the data function f , PN (f), onto the subspace spanned by u1, . . . , uN isthe minimal norm solution to the problem inf

u∈Vn‖f − u‖2 where Vn is the span of u1, . . . , uN.

Further, projection of the data function f , QN (f), onto the subspace spanned by 1, v1, . . . , vN isthe minimal norm solution to the problem inf

v∈Wn

‖f − v‖2 where Wn is the span of v0, . . . , vN.The Fourier Series of f : [0, 2L] has partial sums that can be written as

SN = PN (f) +QNf =

N∑n=1

< f, un > un + < f, v0 > v0 +

N∑n=1

< f, vn > vn

and so the convergence of the Fourier Series is all about the conditions under which the projectionsof f to these subspaces converge pointwise to f .

3.5.1 HomeworkExercise 3.5.1

Exercise 3.5.2

Exercise 3.5.3

Exercise 3.5.4

Exercise 3.5.5

Page 63: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 55 — #63

Chapter 4

Linear Transformations

We are now going to study matrices and learn to interpret them in a more general way.

4.1 Organizing Point Cloud DataThe first place to start is what is called a point cloud of data. This is a collection in which eachdata point is itself a list of real numbers. So if x is such a data point, the list of numbers associatedwith it is x1, . . . , xn where n is the number of components in the data. For example, there is acomplicated system of inferring what individual molecules are doing in fluid flow such as in an arterycalled flow cytometry. Roughly speaking, lots of molecules are separated as they flow down a columnfilled with liquid. The separation can be due to mass, charge and other things. Then laser light ofvarious frequencies is shown through the liquid with the moving molecules and how the moleculesabsorb the light is collected and stored as lists of numbers. So a particular molecular complex that isof interest might have a list of 14 numbers associated with it and a list like this is obtained at regulartime points. So if you collect data every millisecond, you have 1000 sets of these lists per molecularcomplex per second. A typical complex we might wish to track in the context of how signals areused for information processing in biology would be one of several cytokine molecules. A givenflow cytometry experiment might collect such data on 10 different cytokines for 3 seconds. If welabel the cytokines as C1, . . . ,C10, we have data of the form

Xi1, . . . ,Xi,3000

for each cytokine of typeCi for time points 1 millisecond to 3000 milliseconds. This vast collectionof data is a set of lists with 14 components. Hence, we know Xij ∈ <14, the collection of realnumbers in 14 dimensions. Note, we can collect the data and store it as tables of numbers withoutreferring to <14 as a vector space with a specific orthonormal basis. The easiest way to organizethis is to think of the initial data collection phase as generating vectors in <14 using the standardorthonormal basis

e1 =

[0, i 6= 11, i = 1

], e2 =

[0, i 6= 21, i = 2

], . . . , e14 =

[0, i 6= 141, i = 14

],

To make sense of this collection of data, we want to find ways to transform the given point cloud intoa new version of itself which we can use to understand the data better. We usually do this by applyingwhat are called Linear Transformations which are mappings from <14 into itself which satisfy aproperty called linearity, We find the used of a change of basis from one orthonormal basis toanother and the use of specialized orthonormal basis constructed from eigenvalues and eigenvectorsto be very useful. So let’s look at this transformation problem in general.

55

Page 64: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 56 — #64

56 CHAPTER 4. LINEAR TRANSFORMATIONS

4.1.1 HomeworkExercise 4.1.1

Exercise 4.1.2

Exercise 4.1.3

Exercise 4.1.4

Exercise 4.1.5

4.2 Linear TransformationsGiven a basis E of <n, E = e1, . . . , en, we know any vector x in <n has a decomposition

[x]E

where the components ofX with respect to the basis E are xi with

x = x1e1 + . . .+ xnen

We often wish to measure the magnitude or size of a vector x. We have talked about doing thisbefore and have used the term norm to denote the mapping that does the job. Let’s be specific now.

Definition 4.2.1 Norm on a Vector Space

Let V be a vector space over the reals and let ρ : V → < satisfy

N1: ρ(x) ≥ 0 for all x ∈ V .

N2: ρ(x) = 0 if and only if x = 0

N3: ρ(αx) = |α|ρ(x) for all real numbers α and x ∈ V

N4: ρ(x+ y) ≤ ρ(x) + ρ(y) for all x,y ∈ V

Comment 4.2.1 We usually denote the norm ρ by the symbol ‖ · ‖ and sometimes add a subscript toremind us where the norm comes from. We will see many examples of this soon.

Comment 4.2.2 As we have mentioned before, if the vector space V has an inner product, the map-ping ρ(x) =< x,x > defines a norm on V . The vector space V plus an inner product < · > isdenoted as the pair (V , < · >) and is called an inner product space. The vector space V plus anorm ρ is denoted as the pair (V ,ρ) and is called an normed linear space. Hence, an inner productspace induces a norm on the vector space. It is known the not all norms are induced by some innerproduct. Infinite dimensional vector spaces are known where this is not true, but that is a story foranother time.

Comment 4.2.3 For the vector space C([a, b]),

• the inner product is < f, g >=∫ baf(t)g(t)dt with induced norm ‖f2 =

√∫ baf2(t)dt.

• we can also use the norm ‖f‖1 =∫ ba|f(t)|dt which is not induced from an inner product.

• we can also use the norm ‖f‖1 = maxa≤t≤b |f(t)|.

4.2.1 HomeworkExercise 4.2.1

Page 65: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 57 — #65

4.3. SEQUENCE SPACES REVISITED 57

Exercise 4.2.2

Exercise 4.2.3

Exercise 4.2.4

Exercise 4.2.5

4.3 Sequence Spaces Revisited

Let’s recall some ideas about sequence spaces. Given a sequence of real numbers (an), which weassume has indexing starting at n = 1 for convenience, the set of all sequences has a variety ofinteresting subsets which might even be subspaces. It is clear the set of all sequences is a vector spaceover the reals but whether or not a subset is a subspace depends on whether the subset is closed underscalar multiplication and vector addition. Now we say the positive numbers p and q are conjugateexponents if p > 1 and 1/p + 1/q = 1. If p = 1, we define its conjugate exponent to be q = ∞.Conjugate exponents satisfy some fundamental identities. Clearly, if p > 1, 1

p + 1q =⇒ 1 = p+q

p q

and also pq = p + q and (p− 1) (q − 1) = 1. We will use these identities quite a bit. We quote thefollowing lemma whose proof is in many texts such as (Peterson (8) 2019).

Lemma 4.3.1 The α− β Lemma

Let α and β be positive real numbers and p and q be conjugate exponents. Then α β ≤αp

p + βq

q .

Proof 4.3.1You should look at the proof in (Peterson (8) 2019) to refresh your memory of how this is done.

Important subsets of the set of all sequences can then be defined using conjugate exponents.

Definition 4.3.1 The `p Sequence Space

Let p ≥ 1. The collection of all sequence, (an)∞n=1 for which∑∞n=1 |an|p converges is

denoted by the symbol `p.(1) `1 = (an)∞n=1 :

∑∞n=1 |an| converges.

(2) `2 = (an)∞n=1 :∑∞n=1 |an|2 converges.

We also define `∞ = (an)∞n=1 : supn≥1 |an| <∞.

There is a fundamental inequality connecting sequences in `p and `q when p and q are conjugateexponents called Holder’s Inequality. It’s proof is straightforward but has a few tricks.

Theorem 4.3.2 Holder’s Inequality

Let p > 1 and p and q be conjugate exponents. If x ∈ `p and y ∈ `q , then

∞∑n=1

|xn yn| ≤( ∞∑n=1

|xn|p)1/p ( ∞∑

n=1

|yn|q)1/q

where x = (xn) and y = (yn).

Proof 4.3.2This inequality is clearly true if either of the two sequences x and y are the zero sequence. So we can

Page 66: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 58 — #66

58 CHAPTER 4. LINEAR TRANSFORMATIONS

assume both x and y have some nonzero terms in them. Then x ∈ `p, we know

0 < u =

( ∞∑n=1

|xn|p)1/p

<∞, 0 < v =

( ∞∑n=1

|yn|q)1/q

<∞

Now define new sequences, x and y by xn = xn/u and yn = yn/v. Then, we have

∞∑n=1

|xn|p =

∞∑n=1

|xn|p

up=

1

up

∞∑n=1

|xn|p =up

up= 1.

∞∑n=1

|yn|q =

∞∑n=1

|yn|q

vq=

1

vq

∞∑n=1

|yn|q =vq

vq= 1.

Now apply the α − β Lemma to α = |xn| and β = |yn| for any nonzero terms xn and yn. Then|xn yn| ≤ |xn|p/p+ |yn|q/q.This is also true, of course, if either xn or yn are zero although the α− β lemma does not apply!Now sum over N terms to get

N∑n=1

|xn yn| ≤1

p

N∑n=1

|xn|p +1

q

N∑n=1

|yn|q

Since we know x ∈ `p and y ∈ `q , we know

N∑n=1

|xn|p ≤∞∑n=1

|xn|p = 1

N∑n=1

|yn|q ≤∞∑n=1

|yn|q = 1

So we have

N∑n=1

|xn yn| ≤1

p+

1

q= 1

This is true for all N so the partial sums∑Nn=1 |xn yn| are bounded above. Hence, the partial sums

converge to this supremum which is denoted by∑∞n=1 |xn yn|. We conclude

∑∞n=1 |xn yn| ≤ 1. But

xnyn = 1/(u v) xnyn and so we have 1u v

∑∞n=1 |xn yn| ≤ 1 which implies the result as

∞∑n=1

|xn yn| ≤ u v =

( ∞∑n=1

|xn|p)1/p ( ∞∑

n=1

|yn|q)1/q

We can also do this inequality for the case p = 1 and q =∞.

Theorem 4.3.3 Holder’s Theorem for p = 1 and q =∞

If x ∈ `1 and y ∈ `∞, then∑∞n=1 |xnyn| ≤

(∑∞n=1 |xn|

)supn≥1 |yn|.

Page 67: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 59 — #67

4.3. SEQUENCE SPACES REVISITED 59

Proof 4.3.3We know since y ∈ `∞, |yn| ≤ supk≥1 |yk|. Thus,

N∑n=1

|xnyn| ≤( ∞∑n=1

|xn|)

supk≥1|yk|

Thus the sequence of partial sums∑Nn=1 |xnyn| is bounded above by

(∑∞n=1 |xn|

)supk≥1 |yk|.

This gives us our result.

Now we want to apply these ideas to <n which is a collection of numbers organized as vectors withn components. Here, we don’t really care what the underlying orthonormal basis E which gives usthese components is. Note the vector

x =

x1

...xn

can be identified with the sequence

(xj) = x1, . . . , xn, 0, . . .

It is easy to see that since (xj) has only n components, this sequence is in any `p. Holder’s Inequalityspecialized to <n gives

Theorem 4.3.4 Holder’s Inequality in <n

Let p > 1 and p and q be conjugate exponents. If x ∈ <n, then the associated sequence(xj) = x1, . . . , xn, 0, . . . is in `p for all p. We have for any two such sequences (xj) and(yj)

n∑i=1

|xj yj | ≤( n∑j=1

|xj |p)1/p ( n∑

j=1

|yj |q)1/q

and∑nj=1 |xjyj | ≤

(∑nj=1 |xj |

)max1≤j≤n |yj |.

There is an associated inequality called Minkowski’s Inequality which is also useful. The generalversion is

Theorem 4.3.5 Minkowski’s Inequality

Let p ≥ 1 and let x and y be in `p, Then,( ∞∑n=1

|xn + yn|p) 1p

≤( ∞∑n=1

|xn|p) 1p

+

( ∞∑n=1

|yn|p) 1p

and for x and y in `∞,

supn≥1|xn + yn| ≤ sup

n≥1|xn|+ sup

n≥1|yn|

Page 68: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 60 — #68

60 CHAPTER 4. LINEAR TRANSFORMATIONS

Proof 4.3.4(1): p =∞We know |xn + yn| ≤ |xn|+ |yn| by the triangle inequality. So we have |xn + yn| ≤ supn≥1 |xn|+supn≥1 |yn|. Thus, the right hand side is an upper bound for all the terms of the left side. We thencan say supn≥1 |xn + yn| ≤ supn≥1 |xn|+ supn≥1 |yn| which is the result for p =∞.(2): p = 1Again, we know |xn + yn| ≤ |xn| + |yn| by the triangle inequality. Sum the first N terms on bothsides to get

N∑n=1

|xn + yn| ≤N∑n=1

|xn|+N∑n=1

|yn| ≤∞∑n=1

|xn|+∞∑n=1

|yn|

The right hand side is an upper bound for the partial sums on the left. Hence, we have

∞∑n=1

|xn + yn| ≤∞∑n=1

|xn|+∞∑n=1

|yn|

(3) 1 < p <∞We have

|xn + yn|p = |xn + yn| |xn + yn|p−1 ≤ |xn| |xn + yn|p−1 + |yn| |xn + yn|p−1

N∑n=1

|xn + yn|p ≤N∑n=1

|xn| |xn + yn|p−1 +

N∑n=1

|yn| |xn + yn|p−1

Let an = |xn|, bn = |xn + yn|p−1, cn = |yn| and dn = |xn + yn|p−1. Holder’s Inequality appliesjust fine to finite sequences: i.e. sequences in <N . So we have

N∑n=1

anbn ≤( N∑n=1

apn

) 1p( N∑n=1

bqn

) 1q

But bqn = |xn + yn|q(p−1) = |xn + yn|p using the conjugate exponents identities we established. Sowe have found

N∑n=1

|xn| |xn + yn|p−1 ≤( N∑n=1

|xn|p) 1p( N∑n=1

|xn + yn|p) 1q

We can apply the same reasoning to the terms cn and dn to find

N∑n=1

|yn| |xn + yn|p−1 ≤( N∑n=1

|yn|p) 1p( N∑n=1

|xn + yn|p) 1q

We can use the inequalities we just figured out to get the next estimate

N∑n=1

|xn + yn|p ≤( N∑n=1

|xn|p) 1p( N∑n=1

|xn + yn|p) 1q

+

( N∑n=1

|yn|p) 1p( N∑n=1

|xn + yn|p) 1q

Page 69: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 61 — #69

4.3. SEQUENCE SPACES REVISITED 61

Now factor out the common term to get

N∑n=1

|xn + yn|p ≤

(( N∑n=1

|xn|p) 1p

+

( N∑n=1

|yn|p) 1p

)( N∑n=1

|xn + yn|p) 1q

Rewrite again as ( N∑n=1

|xn + yn|p)1− 1

q

≤( N∑n=1

|xn|p) 1p

+

( N∑n=1

|yn|p) 1p

But 1 − 1/q = 1/p, so we have(∑N

n=1 |xn + yn|p) 1p

≤(∑N

n=1 |xn|p) 1p

+

(∑Nn=1 |yn|p

) 1p

.

Now apply the final estimate to find(∑N

n=1 |xn+yn|p) 1p

≤(∑∞

n=1 |xn|p) 1p

+

(∑∞n=1 |yn|p

) 1p

.

This says the right hand side is an upper bound for the partial sums on the left side. Hence, we know( ∞∑n=1

|xn + yn|p) 1p

≤( ∞∑n=1

|xn|p) 1p

+

( ∞∑n=1

|yn|p) 1p

Homework

Exercise 4.3.1

Exercise 4.3.2

Exercise 4.3.3

Exercise 4.3.4

Exercise 4.3.5

When we specialize to <n we obtain

Theorem 4.3.6 Minkowski’s Inequality in <n

Let p ≥ 1 and let x and y be in <n with associated sequences (xj) and (yj) Then,( n∑j=1

|xj + yj |p) 1p

≤( n∑j=1

|xj |p) 1p

+

( n∑j=1

|yj |p) 1p

and for the case p =∞, we have

sup1≤j≤n

|xj + yj | ≤ sup1≤j≤n

|xj |+ sup1≤j≤n

|yj |

In <n, define

1. ‖x‖1 =∑nj=1 |xj |

2. ‖x‖2 =√∑n

j=1 |xj |2

Page 70: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 62 — #70

62 CHAPTER 4. LINEAR TRANSFORMATIONS

3. ‖x‖∞ = max1≤j≤n |xjand in general ‖x‖p = (

∑nj=1 |xj |p)1/p for any p ≥ 1. Minkowski’s Inequality tells in all cases

‖x+ y‖p ≤ ‖x‖p + ‖y‖p

This shows a number of things. Let V be the set of all sequences (xj) identified with vectors from<n. Then we can add them and scalar multiply them, so this is a vector space over <. If we lookat ‖x‖p, it clearly is a mapping from V to the reals which satisfies properties N1 to N3 for a normon V . Minkowski’s Inequality then proves property N4 for a norm. Hence, each ‖ · ‖p is a norm on<n and can speak of the distinct normed linear spaces (<n, ‖ · ‖1), (<n, ‖ · ‖2), (<n, ‖ · ‖∞) and ingeneral (<n, ‖ · ‖p).Convergence with respect to these norms is then defined in the usual way. We say xn converges in‖ · ‖p norm to x if

∀ ε > 0, ∃N 3 n > N =⇒ ‖xn − x‖p < ε

All of these normed linear spaces are complete. We proved this in the more general `p setting in(Peterson (8) 2019), but let’s specialize these proofs to <n here. They are a bit simpler since we donot have to worry about the convergence of series.Homework

Exercise 4.3.6

Exercise 4.3.7

Exercise 4.3.8

Exercise 4.3.9

Exercise 4.3.10

Theorem 4.3.7 The Completeness of (<n, ‖ · ‖∞)

(<n, ‖ · ‖∞) is complete; i.e. if (xk) is a Cauchy Sequence in this normed linear space,there is an x in (<n, ‖ · ‖∞) so that xk converges in ‖ · ‖∞ norm to x.

Proof 4.3.5Assume (xk) is a Cauchy Sequence. The element xk is a sequence itself which we denote by (xk,j)for 1 ≤ j ≤ n. Then for a given ε > 0, there is an N so that

max1≤j≤n

|xk,j − xm,j | < ε/2 when m > k > Nε

So for each fixed index j, we have

|xk,j − xm,j | < ε/2 when m > k > Nε

This says for fixed j, the sequence (xk,j) is a Cauchy sequence of real numbers and hence mustconverge to a real number we will call ak. This defines a new vector a ∈ <n. Does xn → a in the|| · ||∞ norm? We use the continuity of the function | · | to see for any k > Nε, we have

limm→∞

|xk,j − xm,j | ≤ ε/2 =⇒ |xk,j − limm→∞

xm,j | ≤ ε/2

This argument works for all j and so |xk,j − aj | ≤ ε/2 when k > Nε for all j which impliesmax1≤j≤n |xk,j − aj | ≤ ε/2 < ε or ‖xk − a‖∞ < ε when k > Nε. So xn → a in || · ||∞.

Page 71: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 63 — #71

4.3. SEQUENCE SPACES REVISITED 63

Next, let’s look at vector convergence using the ‖ · ‖p norm for p ≥ 1.

Theorem 4.3.8 (<n, ‖ · ‖p) is Complete

(<n, ‖ · ‖p) is complete; i.e. if (xk) is a Cauchy Sequence in this normed linear space,there is an x in (<n, ‖ · ‖p) so that xk converges in ‖ · ‖p norm to x.

Proof 4.3.6Let xk) be a Cauchy Sequence in ‖ · ‖p. Then given ε > 0, there is an Nε so that m > k > Nεimplies

||xk − xm||p =

( n∑j=1

|xk,j − xm,j |p) 1p

< ε/2.

Thus, ifm > k > Nε,∑nj=1 |xk,j−xm,j |p < (ε/2)p. Since this is a sum on non-negative terms, each

term must be less than (ε/2)p. So we must have |xk,j − xm,j |p < (ε/2)p or |xk,j − xm,j | < (ε/2)when m > k > Nε. This tells us immediately the sequence of real numbers (xk,j) is a Cauchysequence of real numbers and so must converge to a number we will call aj . This defines the vectora ∈ <n. Since | · |p is continuous, we can say

(ε/2)p ≥ limm→∞

( n∑j=1

|xk,j − xm,j |p)

=

( n∑j=1

|xk,j − limm→∞

xm,j |p)

=

n∑j=1

|xk,j − aj |p

This says immediately that ||xk − a||p < ε when k > Nε so xk → a in || · ||p.

Comment 4.3.1 It is easy to see <n has an inner product given by

< x,y > =

n∑j=1

xjyj

Hence, (<n, <,>) is an inner product space whose norm induces ‖·‖2. Moreover Holder’s Inequal-ity tells us

| < x,y > | ≤n∑j=1

|xjyj | ≤ ‖x‖2 ‖x‖2

which is our standard Cauchy - Schwartz Inequality in <n which we use to define the angle betweenvectors in <n as usual.

4.3.1 HomeworkExercise 4.3.11

Exercise 4.3.12

Exercise 4.3.13

Exercise 4.3.14

Exercise 4.3.15

Page 72: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 64 — #72

64 CHAPTER 4. LINEAR TRANSFORMATIONS

4.4 Linear Transformations Between Normed Linear SpacesLet’s start by defining linear transformations between vector spaces carefully

Definition 4.4.1 Linear Transformations Between Vector Spaces

LetX and Y be two vector spaces over <. We say T : X → Y is a linear transformationif

T (αx+ βy) = αT (x) + βT (y)

To discuss linear transformations between finite dimensional vector spaces, we need to have a formaldefinition of matrices. We have used them quite a bit in our explanations, but it is time to be formal.Note, we have already thought carefully about symmetric matrices in Chapter 5 but our interests aremuch more general here.

Definition 4.4.2 Real Matrices

LetM be a collection of real numbers organized in a table of m rows and n columns. Thenumber Mij refers to the entry in the ith row and jth column of this table. The table isorganized like this

M11 . . . M1n

M21 . . . M2n

......

...Mm1 . . . Mmn

We identify the matrixM with this table and write

M =

M11 . . . M1n

......

...Mm1 . . . Mmn

The set of all matrices with m rows and n columns is denoted by Mmn. We also say Mis a m× n matrix.

Comment 4.4.1 The n columns of M are clearly vectors in <m and the m rows of M are vectorsin <n. Thus, we have some additional identifications:

M =

M11 . . . M1n

......

...Mm1 . . . Mmn

=[C1 . . . Cn

]=

R1

...Rm

where column i is denoted byCj ∈ <m and row j is denoted byRj ∈ <n. Note the rows are vectorsdisplayed as the transpose of the usual column vector notation we use.

Comment 4.4.2 So for example the data pairs of time and temperature we measure in a coolingmodel experiment might turn out to be

Listing 4.1: Sample Data

0 205

Page 73: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 65 — #73

4.4. LINEAR TRANSFORMATIONS BETWEEN NORMED LINEAR SPACES 65

1 2013 1986 1909 18512 18215 17920 16825 16130 15440 14650 13560 12380 109100 87120 82140 79150 77

This table defines a matrix of 18 rows and 2 columns and so defines a matrix inM18,2.

4.4.0.1 Homework

Exercise 4.4.1

Exercise 4.4.2

Exercise 4.4.3

Exercise 4.4.4

Exercise 4.4.5

4.4.1 Basic PropertiesConsider a matrixM inMmn. We define the action ofM on the vector space <n by

M(x) =([R1 . . . Rm

])(x) =

< R1,x >...

< Rm,x >

We generally just write Mx = M(x) to save a few parenthesis. It is easy to see this definition ofthe action ofM on <n defines a linear transformation from <n to <m.The set of x in <n with Mx = 0 ∈ <m is called the kernel or nullspace of M . The span of thecolumns ofM is called the column space ofM . We can prove a fundamental result.

Theorem 4.4.1 IfM ∈Mmn, then dim( kernel) + dim(column space) = n

If M is a m × n matrix, if p = dim(ker(M) and q is the dimension of the span of thecolumns ofM , then p+ q = n.

Proof 4.4.1It is easy to see that ker(M) is a subspace of <n with dimension p ≤ n. Let K1, . . . ,Kp bean orthonormal basis of ker(M). We can use GSO to construct an additional n − p vectors in<n, L1, . . . ,Ln−p that are all orthogonal to ker(M) with length one. The subspace spanned by

Page 74: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 66 — #74

66 CHAPTER 4. LINEAR TRANSFORMATIONS

L1, . . . ,Ln−p is perpendicular to the subspace ker(M) and is called (ker(m)⊥ to denote it iswhat is called an orthogonal complement. We note <n = ker(M)⊕ spanL1, . . . ,Ln−p.Thus, K1, . . . ,Kp,L1, . . . ,Ln−p is an orthonormal basis for <n and hence x = a1K1 + . . .+apKp + b1L1 + . . .+ bn−pLn−p. It follows using the linearity ofM that

Mx = b1ML1 + . . . bn−pMLn−p

Now if b1ML1+. . . bn−pMLn−p = 0, this meansM(b1L1+. . .+bn−pLn−p) = 0 too. This saysb1L1 + . . . + bn−pLn−p is in both ker(M) and the spanL1, . . . ,Ln−p which forces b1L1 +. . . + bn−pLn−p = 0. But L1, . . . ,Ln−p is a linearly independent set and so all coefficientsbi = 0. This says the set ML1, . . . ,MLn−p is a linearly independent set in <m.Finally, note the column space of M is defined to be the spanC1, . . . ,Cn. A vector is this spanhas the look y =

∑mi=1 aiCi. This is the same asMa where a ∈ <n. However, we know <n has the

orthonormal basis K1, . . . ,Kp,L1, . . . ,Ln−p and so a =∑pi=1 biKi +

∑n−pi=1 ciLi. Applying

M we find

Ma = c1ML1 + . . .+ cn−pMLn−p

Since ML1, . . . ,MLn−p is a linearly independent, we see the range ofM , the column space ofM has dimension n− p. Thus, we have shown

dim(ker(M) + dim(span(C1, . . . ,Cn) = n

where, of course, we also have n− p ≤ m.

Comment 4.4.3 Thus, if n− p = m, the kernel ofM must be n−m dimensional.

4.4.1.1 Homework

Exercise 4.4.6

Exercise 4.4.7

Exercise 4.4.8

Exercise 4.4.9

Exercise 4.4.10

4.4.2 Mappings between Finite Dimensional Vector Spaces

Let X and Y be two finite dimensional vector spaces over <. Assume X has dimension n an Yhas dimension m. Let E be a basis for X and F , a basis for Y . Before we go further, let’s be clearabout some of our notation.

• We use x to denote an object in the finite dimensional vector spaceX .

• For a given basisE, we can write x = x1E1+. . .+En and the coefficients x1, . . . , xn are nreal numbers and so give a vector in<n. We let

[x]E

indicate this vector of real numbers. If Gwas another basis, there would be another set of n numbers given by x = u1G1 + . . .+unGnand this vector would be denoted by

[x]G

. Then, there will be a transformation law whichconverts

[x]E

into[x]G

and we will talk about that soon. The point is x is the same object inX which we can represent in different ways with a mechanism to map one representation intoanother.

Page 75: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 67 — #75

4.4. LINEAR TRANSFORMATIONS BETWEEN NORMED LINEAR SPACES 67

Theorem 4.4.2 The Change of Basis Mapping

Let E and F be basis for the finite dimensional vector space X which has dimension n.Then there is an invertible matrix AEF so that

[x]G

= AEG[x]E

. We read the matrixAEG as mapping E intoG.

Proof 4.4.2We know

x = c1E1 + . . .+ cnEn

We also know each Ej has an expansion in the G basis; i.e. Ei =∑nj=1 AjiGj . Using this, we

find

x =

n∑i=1

ci

n∑j=1

AjiGj

=

n∑j=1

(n∑i=1

ci Aji

)Gj

Bu x also has an expansion with respect to G: x =∑nj=1 djGj . Hence, equating coefficients, we

have

dj =

n∑i=1

Ajici

which can be organized into a familiar set of equations.

d1 = A11c1 +A12c2 + . . .+A1ncn...

dj = Aj1c1 +Aj2c2 + . . .+Ajncn...

dn = An1c1 +An2c2 + . . .+Anncn

or d1

...dn

=

A11 . . . A1n

...An1 . . . Ann

c1...cn

We then further rewrite as

[x]G

= AEG[x]E

. A similar argument shows there is a matrix BGEso that

[x]E

= BGE[x]G

. Thus[x]G

= AEG[x]E

= AEGBGE[x]G[

x]E

= BGE[x]G

= BGEAEG[x]E

This tells usAEGBGE = BGEAEG = I . Hence,AEG−1 = BGE .

Theorem 4.4.3 The Change of Basis Mapping In a Finite Dimensional Inner Product Space

Page 76: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 68 — #76

68 CHAPTER 4. LINEAR TRANSFORMATIONS

Let E and F be basis for the finite dimensional vector space X which has dimension n.We assume X has an inner product <,>. Then there is an invertible matrix AEF so that[x]G

= AEG[x]E

which has the following form:

AEG =

G1T

...Gn

T

[E1, . . . ,En]

with

AEG−1 = AEG

T =

E1T

...En

T

[G1, . . . ,Gn]

Proof 4.4.3LetG be another orthonormal basis forX . Then since x = c1E1 + . . . cnEn, we have

< x,G1 > = c1 < G1,E1 > + . . .+ cn < G1,En >

... =

< x,Gn > = c1 < Gn,E1 > + . . .+ cn < Gn,En >

which in matrix - vector form is

[x]G

=

G1T

...Gn

T

[E1, . . . ,En] [x]E

The orthonormality of E andG then tell us immediately

AEG−1 = AEG

T =

E1T

...En

T

[G1, . . . ,Gn]

Since there is an inner product onX , we know its value does not depend on the choice of orthonormalbasis forX .Homework

Exercise 4.4.11

Exercise 4.4.12

Exercise 4.4.13

Exercise 4.4.14

Exercise 4.4.15

Page 77: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 69 — #77

4.4. LINEAR TRANSFORMATIONS BETWEEN NORMED LINEAR SPACES 69

Theorem 4.4.4 The Inner Product on a Finite Dimensional Vector Space is Independent ofChoice of Orthonormal Basis

Let X be an n dimensional vector space with inner product <,>. Then its value is inde-pendent of the choice of orthonormal basis E.

Proof 4.4.4We know xE = x1

EE1 + . . . xnEEn and xG = x1GG1 + . . . xnGGn with a similar notation for the

representations of y. We have

[x]E

=

x1E...xnE

, [x]G

=

x1G...xnG

and we know

[x]G

=

G1T

...Gn

T

[E1, . . . ,En] [x]E

Thus, by orthonormality of E andG, we have

< xG,yG > =

n∑i=1

xiGyiG =

⟨[x]G,[y]G

=

⟨G1T

...Gn

T

[E1, . . . ,En] [x]E,

G1T

...Gn

T

[E1, . . . ,En] [y]E

=[x]TE

[E1, . . . ,En

]T [G1, . . . ,Gn

] G1T

...Gn

T

[E1, . . . ,En] [y]E

=⟨[x]E,[y]E

⟩=

n∑i=1

xiEyiE

which shows the value is independent of the choice of orthonormal basis.

Homework

Exercise 4.4.16

Exercise 4.4.17

Exercise 4.4.18

Exercise 4.4.19

Exercise 4.4.20

We define linear transformations the same really.

Theorem 4.4.5 Linear Transformations Between Finite Dimensional Vector Spaces

Page 78: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 70 — #78

70 CHAPTER 4. LINEAR TRANSFORMATIONS

Let X and Y be two finite dimensional vector spaces over <. Assume X has dimensionn and Y has dimension m. Let T be a linear transformation between the spaces. Given abasis E for X and a basis F for Y , we can identify <n with the span of E and <m withthe span of F . Then there is an n×m matrix TEF : <n → <m so that

[Tx]F

=

T11 . . . T1n

......

...Tm1 . . . Tmn

EF

[x]E

Note Tx is the same object in Y no matter its representation via a choice of basis F forY .

Proof 4.4.5Given x inX , we can write

x = x1E1 + . . .+ xnEn

By the linearity of T , we then have

Tx = x1TE1 + . . .+ xnTEn

But each TEi is in F , so we have the expansions

TE1 = β11F1 + . . .+ βm1Fm

TE2 = β12F1 + . . .+ βm2Fm... =

...

TEn = β1nF1 + . . .+ βmnFm

which implies since Tx is also in Y that

Tx = x1

m∑j=1

βj1Fj

+ . . .+ xn

m∑j=1

βjnFj

=

m∑j=1

γjFj

Now reorganize these sums to obtain

m∑j=1

(n∑i=1

βjixi

)Fj =

m∑j=1

γjFj

But expansions with respect to the F basis are unique, so we have

γ1

...γm

F

=

β11 β12 β13 . . . β1n

β21 β22 β23 . . . β2n

......

......

...βm1 βm2 βm3 . . . βmn

EF

x1

...xn

E

Let Tij = βij and we have found the matrix T so that[Tx]F

= TEF[x]E

Page 79: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 71 — #79

4.5. MAGNITUDES OF LINEAR TRANSFORMATIONS 71

Theorem 4.4.6 A Linear Transformation between Two Finite Dimensional Vector Spaces De-termines an Equivalence Class of Matrices

Let X and Y be two finite dimensional vector spaces over <. Assume X has dimensionn and Y has dimension m. Let T be a linear transformation between the spaces. Then Tis associated with an equivalence class in the set of m× n matrices under the equivalencerelation ∼ where A ∼ B if and only if there are invertible matrices P and Q so thatA = P−1BQ.

Proof 4.4.6First, it is easy to see defines an equivalence relation on the set of all m × n matrices. We leavethat to you. Any such linear transformation T is associated with an m× n matrix once the bases Eand F are chosen. What happens if we change to a new basis G for X and a new basis H for Y ?We know from earlier calculations that there are invertible matricesAEG andBFH so that[

x]E

= AGE[x]G,[y]F

= BHF[y]H

Using all this, we have[Tx]F

= TEF[x]E

=⇒ BHF[Tx]H

= TEF AGE[x]G

Since,BHF is invertible, this tells us[Tx]H

= BHF−1 TEF AGE

[x]G

We also know[Tx]H

= TGH[x]G

and thus we must have TGH = BHF−1 TEF AGE . This

tells us TGH ∼ TEF .

Comment 4.4.4 Thus any linear transformation between two finite dimensional vector spaces isidentified with an equivalence class of m× n matrices. When we pick basis E and F for X and Yrespectively, we choose a particular representative from this equivalence class we denote by TEF .Note we can say more about the structure of TEF if these bases are orthonormal.

Homework

Exercise 4.4.21

Exercise 4.4.22

Exercise 4.4.23

Exercise 4.4.24

Exercise 4.4.25

4.5 Magnitudes of Linear TransformationsFor the linear transformation T mapping the n dimensional vector space X to the m dimensionalvector space Y , we know given a basisE forX and a basis F for Y

[Tx]F

=[T]EF

[x]E

. Let’sadd a norm to the vector spaces <n and <m. For now, lets assume we now use the normed linearspaces (<m, ‖ · ‖2) and (<n, ‖ · ‖2). Of course, we could use a different norm on the domain spaceand the range space, but we won’t do that as it adds unnecessary confusion. Let the entries of

[T]EF

be called Tij where we will not label this entries with a EF in order to avoid too much clutter. But

Page 80: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 72 — #80

72 CHAPTER 4. LINEAR TRANSFORMATIONS

of course, they do depend on E and F ; just keep that in the back of your mind. Note

‖[Tx]F‖22 =

m∑j=1

([Tx]F

)j)2 =

m∑i=1

n∑j=1

Tijxj

Apply the Cauchy - Schwartz Inequality here: n∑

j=1

Tijxj

2

n∑j=1

T 2ij

n∑j=1

x2j

=

n∑j=1

T 2ij

‖xE‖22

Thus,

‖[Tx]F‖22 ≤

m∑i=1

n∑j=1

T 2ij

‖xE‖22

which implies

‖[Tx]F‖2 ≤

√√√√ m∑i=1

n∑j=1

T 2ij ‖xE‖2

This leads to the following definition:

Definition 4.5.1 The Frobenius Norm of a Linear Transformation

For the linear transformation T mapping the n dimensional vector space X to the mdimensional vector space Y , we know given a basis E for X and a basis F for Y[Tx]F

=[T]EF

[x]E

. The Frobenius Norm of T is denoted ‖[T]EF‖Fr and is

defined by

‖[T]EF‖Fr =

√√√√ m∑i=1

n∑j=1

T 2ij

where Tij are the entries in the matrix[T]EF

.

We have just proven the following result:

Theorem 4.5.1 The Frobenius Norm Fundamental Inequality

For the linear transformation T mapping the n dimensional vector space X to the mdimensional vector space Y , we know given a basis E for X and a basis F for Y[Tx]F

=[T]EF

[x]E

. The Frobenius Norm of this matrix satisfies

‖[Tx]F‖2 ≤ ‖

[T]EF‖Fr ‖xE‖2

Proof 4.5.1We have just gone over this argument.

Page 81: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 73 — #81

4.5. MAGNITUDES OF LINEAR TRANSFORMATIONS 73

4.5.0.1 Homework

Exercise 4.5.1

Exercise 4.5.2

Exercise 4.5.3

Exercise 4.5.4

Exercise 4.5.5

Page 82: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 74 — #82

74 CHAPTER 4. LINEAR TRANSFORMATIONS

Page 83: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 75 — #83

Chapter 5

Symmetric Matrices

Let’s specialize to symmetric matrices now as they are of great interest to us in many applications inapplied analysis.

5.1 The General Two by Two Symmetric Matrix

Let’s start with a general 2× 2 symmetric matrixA given by

A =

[a bb d

]where a, b and d are arbitrary non zero numbers. The characteristic equation here is r2− (a+ d)r+ad− b2. Note that the term ad− b2 is the determinant ofA. The roots are given by

r =(a+ d)±

√(a+ d)2 − 4(ad− b2)

2

=(a+ d)±

√a2 + 2ad+ d2 − 4ad+ 4b2)

2

=(a+ d)±

√a2 − 2ad+ d2 + 4b2

2

=(a+ d)±

√(a− d)2 + 4b2

2

It is easy to see the term in the square root here is always positive and so we have two real roots.Hence, for a general symmetric 2 × 2 matrix, the eigenvalues are always real. Note we can find the

eigenvectors with a standard calculation. For eigenvalue λ1 =(a+d)+

√(a−d)2+4b2

2 , we must find thevectors V so that (a+d)+

√(a−d)2+4b2

2 − a −b

−b (a+d)+√

(a−d)2+4b2

2 − d

[V1

V2

]=

[00

]

We can use the top equation to find the needed relationship between V1 and V2. We have((a+ d) +

√(a− d)2 + 4b2

2− a)V1 − bV2 = 0.

75

Page 84: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 76 — #84

76 CHAPTER 5. SYMMETRIC MATRICES

Thus, we have for V2 =d−a+

√(a−d)2+4b2

2 , V1 = b. Thus, the first eigenvector is

E1 =

[b

d−a+√

(a−d)2+4b2

2

]

The second eigenvector is a similar calculation. We must find the vector V so that (a+d)−√

(a−d)2+4b2

2 − a −b

−b (a+d)−√

(a−d)2+4b2

2 − d

[V1

V2

]=

[00

]

We find ((a+ d)−

√(a− d)2 + 4b2

2− a)V1 − bV2 = 0.

Thus, we have for V2 = d+(a+d)−

√(a−d)2+4b2

2 , V1 = b. Thus, the second eigenvector is

E2 =

[b

d−a−√

(a−d)2+4b2

2

]

Note that < E1,E2 > is

< E1,E2 > = b2 +

(d− a+

√(a− d)2 + 4b2

2

)(d− a−

√(a− d)2 + 4b2

2

)= b2 +

(d− a)2

4− (a− d)2 + 4b2

4b2 +

(d− a)2 − (a− d)2

4− b2 = 0.

Hence, these two eigenvectors are orthogonal to each other. Note, the two eigenvalues are

λ1 =(a+ d) +

√(a− d)2 + 4b2

2

λ2 =(a+ d)−

√(a− d)2 + 4b2

2

5.1.1 Examples

The only way both eigenvalues can be zero is if both a+ d = 0 and 4(a+ d)2 + 4b2 = 0. That onlyhappens if a = b = d = 0 which we explicitly ruled out at the beginning of our discussion becausewe said a, b and d were nonzero. However, both eigenvalues can be negative, both can be positive orthey can be of mixed sign as our examples show.

Example 5.1.1 For

A =

[3 22 6

]the eigenvalues are (9± 5)/2 and both are positive.

Example 5.1.2 For

A =

[3 55 6

]

Page 85: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 77 — #85

5.1. THE GENERAL TWO BY TWO SYMMETRIC MATRIX 77

the eigenvalues are (9±√

9 + 100)/2 giving λ1 = 9.72 and λ2 = −5.22 and the eigenvalues havemixed sign.

Example 5.1.3 For

A =

[3 3

√2

3√

2 6

]the eigenvalues are (9± 9)/2 giving λ1 = 9 and λ2 = 0.

Example 5.1.4 For

A =

[3 55 6

]the eigenvalues are (9±

√9 + 100)/2 giving λ1 = 9.72 and λ2 = −5.22 and the eigenvalues have

mixed sign.

Example 5.1.5 For

A =

[−3 55 −6

]the eigenvalues are (−9 ±

√34)/2 giving λ1 = −7.42 and λ2 = −1.58. So here, both eigenvalues

are negative.

However, in all cases the eigenvectors are still orthogonal.

5.1.1.1 Homework

Exercise 5.1.1

Exercise 5.1.2

Exercise 5.1.3

Exercise 5.1.4

Exercise 5.1.5

5.1.2 A Canonical Form

Now let’s look at 2 × 2 symmetric matrices more abstractly. Don’t worry, there is a payoff here inunderstanding! Let A be a general 2× 2 symmetric matrix. Then it has two distinct eigenvalues λ1

and another one λ2. Consider the matrix P given by

P =[E1 E2

]whose transpose is then

P T =

[ET1ET2

]


It is easy to find that $P^T P = P P^T = I$. Hence, $P^{-1} = P^T$. Thus,
\[
P^T A P = \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix} A\, [E_1 \; E_2] = \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix} [A E_1 \; A E_2] = \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix} [\lambda_1 E_1 \; \lambda_2 E_2]
\]
After we do the final multiplications, we have
\[
P^T A P = \begin{bmatrix} \lambda_1 \langle E_1, E_1 \rangle & \lambda_2 \langle E_1, E_2 \rangle \\ \lambda_1 \langle E_2, E_1 \rangle & \lambda_2 \langle E_2, E_2 \rangle \end{bmatrix}
\]
We know the eigenvectors are orthogonal, so we must have
\[
P^T A P = \begin{bmatrix} \lambda_1 \langle E_1, E_1 \rangle & 0 \\ 0 & \lambda_2 \langle E_2, E_2 \rangle \end{bmatrix}
\]
One last step and we are done! There is no reason we can't choose as our eigenvectors vectors of length one: just replace $E_1$ by the new vector $E_1/\|E_1\|$, where $\|E_1\|$ is the usual Euclidean length of the vector. Similarly, replace $E_2$ by $E_2/\|E_2\|$. Assuming this is done, we have $\langle E_1, E_1 \rangle = 1$ and $\langle E_2, E_2 \rangle = 1$. We are left with the identity
\[
P^T A P = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
\]
which can be rewritten as
\[
A = P \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} P^T
\]
This is an important thing. We have shown the $2 \times 2$ matrix $A$ can be decomposed into the product $A = P \Lambda P^T$, where $\Lambda$ is the diagonal matrix whose entries are the eigenvalues of $A$, with the most positive one in the $(1,1)$ position.

It is now clear how we solve an equation like $A X = b$. We rewrite it as $P \Lambda P^T X = b$, which leads to the solution
\[
X = P \Lambda^{-1} P^T b
\]
and we see the reciprocal eigenvalue sizes determine how large the solution can get. Another way to look at this is that the two eigenvectors can be used to find a representation of the data vector $b$ and the solution vector $X$ as follows:
\begin{align*}
b &= \langle b, E_1 \rangle E_1 + \langle b, E_2 \rangle E_2 = b_1 E_1 + b_2 E_2 \\
X &= \langle X, E_1 \rangle E_1 + \langle X, E_2 \rangle E_2 = X_1 E_1 + X_2 E_2
\end{align*}
and so $A X = b$ becomes
\begin{align*}
A \left( X_1 E_1 + X_2 E_2 \right) &= b_1 E_1 + b_2 E_2 \\
\lambda_1 X_1 E_1 + \lambda_2 X_2 E_2 &= b_1 E_1 + b_2 E_2.
\end{align*}
The only way this equation works is if the coefficients on the eigenvectors match. So we have $X_1 = \lambda_1^{-1} b_1$ and $X_2 = \lambda_2^{-1} b_2$. This shows very clearly how the solution depends on the size of the reciprocal eigenvalues. Thus, if our problem has a very small eigenvalue, we would expect our solution vector to be unstable. Also, if one of the eigenvalues is 0, we would have real problems! We can address this somewhat by finding a way to force all the eigenvalues to be positive.
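Here is a sketch of this solution process in MatLab. The matrix and data vector are sample choices of ours; for a symmetric input, eig returns orthonormal eigenvector columns, which is what the argument above requires.

A = [3 1; 1 4]; b = [1; 2];    % a sample symmetric system A X = b
[P, Lam] = eig(A);             % columns of P are orthonormal eigenvectors
bcomp = P' * b;                % components of b in the eigenvector basis
Xcomp = bcomp ./ diag(Lam);    % X_i = b_i / lambda_i
X = P * Xcomp;                 % reassemble the solution
norm(A*X - b)                  % should be essentially zero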

The eigenvectors here are independent vectors in $\Re^2$ and since they span $\Re^2$, they form a basis, indeed an orthonormal basis because the vectors are orthogonal and of length one. Hence, any vector $V$ in $\Re^2$ can be written as
\[
V = V_1 E_1 + V_2 E_2
\]
and the components $V_1$ and $V_2$ are known as the components of $V$ relative to the basis $E_1, E_2$. We often just refer to this basis as $E$. Hence, a vector $V$ has many possible representations. To refresh your mind here, the one you are most used to is the one which uses the basis vectors
\[
e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \text{and} \quad e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\]
which is called the standard basis. When we write
\[
V = \begin{bmatrix} 3 \\ 5 \end{bmatrix}
\]
unless it is otherwise stated, we assume these are the components of $V$ with respect to the standard basis. Now let's go back to our general vector $V$. Since the vectors $E_1$ and $E_2$ are orthogonal, we can take inner products on both sides of the representation of $V$ with respect to the basis $E$ to get
\[
\langle V, E_1 \rangle = \langle V_1 E_1, E_1 \rangle + \langle V_2 E_2, E_1 \rangle = V_1 \langle E_1, E_1 \rangle + V_2 \langle E_2, E_1 \rangle
\]
But $\langle E_2, E_1 \rangle = 0$ as the vectors are perpendicular, and $\langle E_1, E_1 \rangle = E_{11}^2 + E_{12}^2$, where $E_{11}$ and $E_{12}$ are the components of $E_1$ in the standard basis. This is one, as we have chosen our eigenvectors to have length one. So we have $V_1 = \langle V, E_1 \rangle$. A similar calculation with inner products gives $V_2 = \langle V, E_2 \rangle$. So we could also have written
\[
V = \langle V, E_1 \rangle E_1 + \langle V, E_2 \rangle E_2
\]
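As a sketch, here is the same representation computed numerically; the matrix and the vector $V$ are sample choices of ours.

A = [3 1; 1 4];                % sample symmetric matrix
[P, ~] = eig(A);               % orthonormal eigenvector basis E1, E2
V = [3; 5];
c = P' * V;                    % c(i) = < V, E_i >
norm(P*c - V)                  % V = V1 E1 + V2 E2 is recovered exactly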

From our discussions here, it should be easy to see that while we can do all of the needed calculations in this $2 \times 2$ case, we would have a lot of trouble if the symmetric matrix were $3 \times 3$, $4 \times 4$ or larger. Hence, let's explore another way. And yes, it is more abstract, and yes, you will probably be lying belly up on the floor with your legs and arms pointing straight up soon enough screaming "enough"! But if you keep at it, you'll see that when computation becomes too hard, the abstract approach is quite nice!

5.1.2.1 Homework

Exercise 5.1.6

Exercise 5.1.7

Exercise 5.1.8

Exercise 5.1.9

Exercise 5.1.10


5.1.3 Two Dimensional Rotations

Let's look closer at the matrices we have found using the eigenvectors of the symmetric matrix $A$. We know the rows and columns of the matrix
\[
P = [E_1 \; E_2]
\]
are mutually orthogonal and the rows and columns have length one. Thus,
\[
P = \begin{bmatrix} E_{11} & E_{21} \\ E_{12} & E_{22} \end{bmatrix}
\]
Since the eigenvectors have length one, we know $\sqrt{E_{11}^2 + E_{12}^2} = 1$ and $\sqrt{E_{21}^2 + E_{22}^2} = 1$. We also know $E_{11} E_{21} + E_{12} E_{22} = 0$ and $E_{11} E_{12} + E_{21} E_{22} = 0$. Hence, the first column defines an angle $\theta$ by $\cos(\theta) = E_{11}$ and $\sin(\theta) = E_{12}$. The second column defines the angle $\psi$ by $\cos(\psi) = E_{21}$ and $\sin(\psi) = E_{22}$. Since $E_{11} E_{21} + E_{12} E_{22} = 0$ and $E_{11} E_{12} + E_{21} E_{22} = 0$, we also find $\theta$ and $\psi$ are not independent, as $\cos(\theta)\sin(\theta) + \cos(\psi)\sin(\psi) = 0$ and $\cos(\theta)\cos(\psi) + \sin(\theta)\sin(\psi) = 0$. The second equation is useful as it tells us $\cos(\theta - \psi) = 0$. This means $\psi = \theta \pm \pi/2, \theta \pm 3\pi/2, \ldots$. For these choices of $\psi$, we have

• $\cos(\psi) = \cos(\theta + \pi/2) = -\sin(\theta)\sin(\pi/2) = -\sin(\theta)$ and $\sin(\psi) = \sin(\theta + \pi/2) = \cos(\theta)\sin(\pi/2) = \cos(\theta)$.

• $\cos(\psi) = \cos(\theta - \pi/2) = \sin(\theta)\sin(\pi/2) = \sin(\theta)$ and $\sin(\psi) = \sin(\theta - \pi/2) = -\cos(\theta)\sin(\pi/2) = -\cos(\theta)$.

Both of these solutions work. You can check that other choices such as $\psi = \theta \pm 3\pi/2$ do not give anything new. Hence, $P$ can be written two ways:
\[
P_1 = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \quad \text{and} \quad P_2 = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ \sin(\theta) & -\cos(\theta) \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\]
The matrix $P_1$ is a classical rotation matrix. The standard basis vectors $i$ and $j$, when rotated in a counterclockwise direction through the angle $\theta$, move to the new basis vectors
\[
I_1 = \begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix} \quad \text{and} \quad I_2 = \begin{bmatrix} -\sin(\theta) \\ \cos(\theta) \end{bmatrix}
\]
You should draw this in $\Re^2$! The matrix $P_2$ is a rotation with a reflection. The basis vector $i$ is rotated to the new basis vector $I_1$, but instead of getting the new basis vector $I_2$, we get the reflection of it through the $x$ axis, giving
\[
I_1^{\text{reflection}} = I_1 = \begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix} \quad \text{and} \quad I_2^{\text{reflection}} = -I_2 = \begin{bmatrix} \sin(\theta) \\ -\cos(\theta) \end{bmatrix}
\]
Note $\det P_1 = \cos^2(\theta) + \sin^2(\theta) = 1$ and $\det P_2 = -\cos^2(\theta) - \sin^2(\theta) = -1$ for any $\theta$. Clearly $P_1^{-1} = P_1^T$ and $P_2^{-1} = P_2^T$. You should draw this in $\Re^2$ also, as it is instructive! Thus, any $2 \times 2$ symmetric matrix $A$ has the decomposition
\[
A = P_1 \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} P_1^T
\]
or
\[
A = P_2 \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} P_2^T = P_1 \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}^T P_1^T
\]
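A short MatLab sketch verifies these facts for a sample angle of our choosing:

theta = 0.7;                   % any sample angle
P1 = [cos(theta) -sin(theta); sin(theta) cos(theta)];
P2 = P1 * [1 0; 0 -1];         % rotation followed by the reflection
[det(P1), det(P2)]             % should be 1 and -1
norm(P1*P1' - eye(2))          % 0: the inverse of P1 is its transpose
norm(P2*P2' - eye(2))          % 0: same for P2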


5.1.4 Homework

Exercise 5.1.11

Exercise 5.1.12

Exercise 5.1.13

Exercise 5.1.14

Exercise 5.1.15

5.2 Rotating Surfaces

Here is a nice example of these ideas. We will do this both by hand and using MatLab so you can see how both approaches work. Consider the surface $z = 3x^2 + 2xy + 4y^2$, which is a rotated paraboloid with elliptical cross sections. Let's find the angle $\theta$ so that the change of variables
\[
\begin{bmatrix} x \\ y \end{bmatrix} = R_\theta \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}
\]
to the new coordinates $\bar{x}$ and $\bar{y}$ leads to a surface equation with no cross term $\bar{x}\bar{y}$. First note
\[
3x^2 + 2xy + 4y^2 = \begin{bmatrix} x \\ y \end{bmatrix}^T \begin{bmatrix} 3 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
\]
where we let
\[
H = \begin{bmatrix} 3 & 1 \\ 1 & 4 \end{bmatrix}
\]
The matrix $H$ here is symmetric, so using the eigenvectors and eigenvalues of $H$, we can write
\[
[E_1 \; E_2]^T \begin{bmatrix} 3 & 1 \\ 1 & 4 \end{bmatrix} [E_1 \; E_2] = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
\]
The orthonormal basis $E_1, E_2$ can be identified with either a rotation $R_\theta$ or a reflection $R_\theta^r$. We will choose the rotation. Then using the change of variables
\[
\begin{bmatrix} x \\ y \end{bmatrix} = R_\theta \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}
\]
we find
\[
z = \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}^T R_\theta^T \begin{bmatrix} 3 & 1 \\ 1 & 4 \end{bmatrix} R_\theta \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} = \lambda_1 \bar{x}^2 + \lambda_2 \bar{y}^2.
\]
The characteristic equation for $H$ is
\[
\det(\lambda I - H) = \lambda^2 - 7\lambda + 11 = 0
\]
with roots
\[
\lambda_1 = \frac{7 + \sqrt{5}}{2} = 4.618, \qquad \lambda_2 = \frac{7 - \sqrt{5}}{2} = 2.382
\]


It is straightforward to find the corresponding eigenvectors
\[
W_1 = \begin{bmatrix} 1 \\ \frac{1+\sqrt{5}}{2} \end{bmatrix}, \qquad W_2 = \begin{bmatrix} 1 \\ \frac{1-\sqrt{5}}{2} \end{bmatrix}.
\]
Then
\[
\|W_1\|_2^2 = 1 + \left( \frac{1+\sqrt{5}}{2} \right)^2 = 3.618 \Longrightarrow \|W_1\|_2 = \sqrt{3.618} = 1.902
\]
\[
\|W_2\|_2^2 = 1 + \left( \frac{1-\sqrt{5}}{2} \right)^2 = 1.382 \Longrightarrow \|W_2\|_2 = \sqrt{1.382} = 1.176
\]
Then the orthonormal basis we need is
\[
E_1 = \frac{W_1}{\|W_1\|_2} = \begin{bmatrix} .526 \\ .8505 \end{bmatrix}, \qquad E_2 = \pm \frac{W_2}{\|W_2\|_2} = \pm \begin{bmatrix} .8505 \\ -.526 \end{bmatrix}
\]
If we choose the minus choice, we get
\[
E_2 = \begin{bmatrix} -.8505 \\ .526 \end{bmatrix}
\]
and $[E_1 \; E_2]$ is the rotation matrix $R_\theta$. If we choose the plus choice, we get
\[
E_2 = \begin{bmatrix} .8505 \\ -.526 \end{bmatrix}
\]
and $[E_1 \; E_2]$ is the reflected rotation matrix $R_\theta^r$. Our rotation matrix is then
\[
R_\theta = \begin{bmatrix} .526 & -.8505 \\ .8505 & .526 \end{bmatrix} = \begin{bmatrix} \cos(1.017) & -\sin(1.017) \\ \sin(1.017) & \cos(1.017) \end{bmatrix}
\]
where $\theta = 1.017$ radians. As a check,
\[
\begin{bmatrix} .526 & .8505 \\ -.8505 & .526 \end{bmatrix} \begin{bmatrix} 3 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} .526 & -.8505 \\ .8505 & .526 \end{bmatrix} = \begin{bmatrix} .526 & .8505 \\ -.8505 & .526 \end{bmatrix} \begin{bmatrix} 2.4285 & -2.0255 \\ 3.928 & 1.2535 \end{bmatrix} = \begin{bmatrix} 4.618 & 0 \\ 0 & 2.382 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
\]
The change of variables is
\[
\begin{bmatrix} x \\ y \end{bmatrix} = R_\theta \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} = \begin{bmatrix} .526 & -.8505 \\ .8505 & .526 \end{bmatrix} \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} = \begin{bmatrix} .526\, \bar{x} - .8505\, \bar{y} \\ .8505\, \bar{x} + .526\, \bar{y} \end{bmatrix}
\]
Now let's do the same computations in MatLab.

Listing 5.1: Surface Rotation: Eigenvalues and Eigenvectors

>> H = [3, 1; 1, 4]
H =

   3   1
   1   4

>> [W, D] = eig(H)
W =

  -0.85065   0.52573
   0.52573   0.85065

D =

Diagonal Matrix

   2.3820        0
        0   4.6180

The command eig returns the unit eigenvectors as columns of W and the corresponding eigenvalues as the diagonal entries of D. You can see by looking at the matrix W that the columns are not set up correctly to be a rotation matrix. We pick our orthonormal basis from W and hence our rotation matrix R as follows:

Listing 5.2: Surface Rotation: Setting the Rotation Matrix

>> E1 = W(:,2)
E1 =

   0.52573
   0.85065

>> E2 = W(:,1)
E2 =

  -0.85065
   0.52573

>> R = [E1, E2]
R =

   0.52573  -0.85065
   0.85065   0.52573

We then check the decomposition.

Listing 5.3: Surface Rotation: Checking the Decomposition

>> R'*H*R
ans =

   4.61803   0.00000
  -0.00000   2.38197
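We can also recover the rotation angle directly from R and confirm it matches the hand computation. This is a sketch continuing the session above:

theta = atan2(R(2,1), R(1,1))  % should be about 1.017 radians
% In the rotated coordinates the cross term is gone:
% z = 4.618*xbar^2 + 2.382*ybar^2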


5.2.1 Homework

Exercise 5.2.1

Exercise 5.2.2

Exercise 5.2.3

Exercise 5.2.4

Exercise 5.2.5

5.3 A Complex ODE System Example

Let's look at a system of ODEs with complex roots to give a nontrivial example of how to use rotation matrices and these matrix decompositions to understand a solution. First, let's do a quick review, as it has probably been a long time since you thought about this material. Let's begin with a theoretical analysis. If the real valued matrix $A$ has a complex eigenvalue $r = \alpha + i\beta$, then there is a nonzero vector $G$ so that
\[
A G = (\alpha + i\beta) G.
\]
Now take the complex conjugate of both sides to find
\[
\bar{A}\, \bar{G} = \overline{(\alpha + i\beta)}\, \bar{G}.
\]
However, since $A$ has real entries, its complex conjugate is simply $A$ again. Thus, after taking complex conjugates, we find
\[
A\, \bar{G} = (\alpha - i\beta)\, \bar{G}
\]
and we conclude that if $\alpha + i\beta$ is an eigenvalue of $A$ with eigenvector $G$, then the eigenvalue $\alpha - i\beta$ has eigenvector $\bar{G}$. Hence, letting $E$ be the real part of $G$ and $F$ be the imaginary part, we see $E + iF$ is the eigenvector for $\alpha + i\beta$ and $E - iF$ is the eigenvector for $\alpha - i\beta$.

5.3.1 The General Real and Complex Solution

We can write down the general complex solution immediately:
\[
\begin{bmatrix} \phi(t) \\ \psi(t) \end{bmatrix} = c_1 (E + iF) e^{(\alpha + i\beta)t} + c_2 (E - iF) e^{(\alpha - i\beta)t}
\]
for arbitrary complex numbers $c_1$ and $c_2$. We can reorganize this solution into a more convenient form as follows:
\[
\begin{bmatrix} \phi(t) \\ \psi(t) \end{bmatrix} = e^{\alpha t} \left( c_1 (E + iF) e^{i\beta t} + c_2 (E - iF) e^{-i\beta t} \right) = e^{\alpha t} \left( \left( c_1 e^{i\beta t} + c_2 e^{-i\beta t} \right) E + i \left( c_1 e^{i\beta t} - c_2 e^{-i\beta t} \right) F \right).
\]


The first real solution is found by choosing $c_1 = 1/2$ and $c_2 = 1/2$. This gives
\[
\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = e^{\alpha t} \left( \frac{1}{2} \left( e^{i\beta t} + e^{-i\beta t} \right) E + i\, \frac{1}{2} \left( e^{i\beta t} - e^{-i\beta t} \right) F \right).
\]
However, we know that $\frac{1}{2}\left( e^{i\beta t} + e^{-i\beta t} \right) = \cos(\beta t)$ and $\frac{1}{2}\left( e^{i\beta t} - e^{-i\beta t} \right) = i \sin(\beta t)$. Thus, we have
\[
\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = e^{\alpha t} \left( E \cos(\beta t) - F \sin(\beta t) \right).
\]
The second real solution is found by setting $c_1 = 1/(2i)$ and $c_2 = -1/(2i)$, which gives
\[
\begin{bmatrix} x_2(t) \\ y_2(t) \end{bmatrix} = e^{\alpha t} \left( \frac{1}{2i} \left( e^{i\beta t} - e^{-i\beta t} \right) E + i\, \frac{1}{2i} \left( e^{i\beta t} + e^{-i\beta t} \right) F \right) = e^{\alpha t} \left( E \sin(\beta t) + F \cos(\beta t) \right).
\]
The general real solution is therefore
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = e^{\alpha t} \left( a \left( E \cos(\beta t) - F \sin(\beta t) \right) + b \left( E \sin(\beta t) + F \cos(\beta t) \right) \right)
\]
for arbitrary real numbers $a$ and $b$.
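Here is a sketch of this construction in MatLab, using the coefficient matrix from Example 5.3.1 just below. The derivative check uses a centered difference, so the residual is only approximately zero.

A = [2 5; -1 4];               % the matrix from Example 5.3.1
[V, D] = eig(A);               % a complex eigenpair and its conjugate
r = D(1,1); al = real(r); be = imag(r);
G = V(:,1); E = real(G); F = imag(G);
x1 = @(s) exp(al*s)*(E*cos(be*s) - F*sin(be*s));  % first real solution
t = 0.37; h = 1e-6;            % sample time and difference step
norm((x1(t+h) - x1(t-h))/(2*h) - A*x1(t))  % should be tiny: x1' = A x1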

Example 5.3.1
\begin{align*}
x'(t) &= 2\, x(t) + 5\, y(t) \\
y'(t) &= -x(t) + 4\, y(t) \\
x(0) &= 6 \\
y(0) &= -1
\end{align*}
Solution The characteristic equation is $r^2 - 6r + 13 = 0$, which gives the eigenvalues $3 \pm 2i$. The eigenvalue equation for the first root, $3 + 2i$, leads to this system to solve for nonzero $V$:
\[
\begin{bmatrix} (3+2i) - 2 & -5 \\ 1 & (3+2i) - 4 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
This reduces to the system
\[
\begin{bmatrix} 1+2i & -5 \\ 1 & -1+2i \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
Although it is not immediately apparent, the second row is a multiple of row one. Multiply row two by $-1-2i$. This gives the row $[-1-2i, \; 5]$. Now multiply this new row by $-1$ to get $[1+2i, \; -5]$, which is row one. So even though it is harder to see, these two rows are equivalent, and hence we only need to choose one to solve for $V_2$ in terms of $V_1$. The first row gives $(1+2i)V_1 - 5 V_2 = 0$. Letting $V_1 = a$, we find $V_2 = \frac{1+2i}{5}\, a$. Hence, the eigenvectors have the form
\[
G = a \begin{bmatrix} 1 \\ \frac{1+2i}{5} \end{bmatrix} = a \left( \begin{bmatrix} 1 \\ \frac{1}{5} \end{bmatrix} + i \begin{bmatrix} 0 \\ \frac{2}{5} \end{bmatrix} \right)
\]


Hence,
\[
E = \begin{bmatrix} 1 \\ \frac{1}{5} \end{bmatrix} \quad \text{and} \quad F = \begin{bmatrix} 0 \\ \frac{2}{5} \end{bmatrix}
\]

The general real solution is therefore
\begin{align*}
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} &= e^{3t} \left( a \left( E \cos(2t) - F \sin(2t) \right) + b \left( E \sin(2t) + F \cos(2t) \right) \right) \\
&= e^{3t} \left( a \left( \begin{bmatrix} 1 \\ \frac{1}{5} \end{bmatrix} \cos(2t) - \begin{bmatrix} 0 \\ \frac{2}{5} \end{bmatrix} \sin(2t) \right) + b \left( \begin{bmatrix} 1 \\ \frac{1}{5} \end{bmatrix} \sin(2t) + \begin{bmatrix} 0 \\ \frac{2}{5} \end{bmatrix} \cos(2t) \right) \right) \\
&= e^{3t} \begin{bmatrix} a \cos(2t) + b \sin(2t) \\ \left( \frac{1}{5} a + \frac{2}{5} b \right) \cos(2t) + \left( \frac{1}{5} b - \frac{2}{5} a \right) \sin(2t) \end{bmatrix}
\end{align*}
Now apply the initial conditions at $t = 0$ to obtain
\[
\begin{bmatrix} 6 \\ -1 \end{bmatrix} = \begin{bmatrix} a \\ \frac{1}{5} a + \frac{2}{5} b \end{bmatrix}
\]
Thus, $a = 6$ and $b = -11/2$. The solution is therefore
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = e^{3t} \begin{bmatrix} 6 \cos(2t) - \frac{11}{2} \sin(2t) \\ -\cos(2t) - \frac{7}{2} \sin(2t) \end{bmatrix}
\]
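As a sketch, we can check this closed form against the exact matrix exponential flow of the system:

A = [2 5; -1 4]; X0 = [6; -1];
t = 0.5;                       % any sample time
xc = exp(3*t)*(6*cos(2*t) - (11/2)*sin(2*t));
yc = exp(3*t)*(-cos(2*t) - (7/2)*sin(2*t));
norm(expm(A*t)*X0 - [xc; yc])  % should be essentially zero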

5.3.1.1 Homework

Exercise 5.3.1

Exercise 5.3.2

Exercise 5.3.3

Exercise 5.3.4

Exercise 5.3.5

5.3.2 Rewriting The Real Solution

We know the real solution is written as
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = e^{\alpha t} \left( \left( a E + b F \right) \cos(\beta t) + \left( b E - a F \right) \sin(\beta t) \right)
\]
Now rewrite again in terms of the components of $E$ and $F$ to obtain
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = e^{\alpha t} \left( \left( a \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} + b \begin{bmatrix} F_1 \\ F_2 \end{bmatrix} \right) \cos(\beta t) + \left( b \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} - a \begin{bmatrix} F_1 \\ F_2 \end{bmatrix} \right) \sin(\beta t) \right) = e^{\alpha t} \begin{bmatrix} a E_1 + b F_1 & b E_1 - a F_1 \\ a E_2 + b F_2 & b E_2 - a F_2 \end{bmatrix} \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix}
\]
Finally, we can move back to the vector form and write
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = e^{\alpha t} \left[ a E + b F, \; b E - a F \right] \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix}.
\]


Now we want nonzero solutions, so our initial conditions will give us $a$ and $b$ with $\sqrt{a^2 + b^2} \neq 0$. We have so far
\[
\begin{bmatrix} x(t)\, e^{-\alpha t} \\ y(t)\, e^{-\alpha t} \end{bmatrix} = \begin{bmatrix} a E_1 + b F_1 & b E_1 - a F_1 \\ a E_2 + b F_2 & b E_2 - a F_2 \end{bmatrix} \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix} = [E \; F] \begin{bmatrix} a & b \\ b & -a \end{bmatrix} \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix} = \sqrt{a^2+b^2}\; [E \; F] \begin{bmatrix} \frac{a}{\sqrt{a^2+b^2}} & \frac{b}{\sqrt{a^2+b^2}} \\ \frac{b}{\sqrt{a^2+b^2}} & -\frac{a}{\sqrt{a^2+b^2}} \end{bmatrix} \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix}
\]
Now define the angle $\theta$ by $\cos(\theta) = a/\sqrt{a^2+b^2}$, so that $\sin(\theta) = b/\sqrt{a^2+b^2}$. Further, let $L = \sqrt{a^2+b^2}$. Then we can rewrite the solution again as
\[
\begin{bmatrix} \frac{x(t)}{L e^{\alpha t}} \\ \frac{y(t)}{L e^{\alpha t}} \end{bmatrix} = [E \; F] \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ \sin(\theta) & -\cos(\theta) \end{bmatrix} \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix}
\]
The matrix involving $\theta$ is a reflection matrix with angle $\theta$, which we will denote by $R_\theta^r$. Then we have
\[
\begin{bmatrix} \frac{x(t)}{L e^{\alpha t}} \\ \frac{y(t)}{L e^{\alpha t}} \end{bmatrix} = [E \; F]\, R_\theta^r \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix}
\]
which we can rewrite in terms of the rotation matrix $R_\theta$ as
\[
\begin{bmatrix} \frac{x(t)}{L e^{\alpha t}} \\ \frac{y(t)}{L e^{\alpha t}} \end{bmatrix} = [E \; F]\, R_\theta \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix} = [E \; F]\, R_\theta \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}
\]
Let $u(t) = x(t)/(L e^{\alpha t})$ and $v(t) = y(t)/(L e^{\alpha t})$. Then

\[
u^2(t) + v^2(t) = \left\langle [E \; F]\, R_\theta \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix},\; [E \; F]\, R_\theta \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix} \right\rangle = \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}^T R_\theta^T\, [E \; F]^T [E \; F]\, R_\theta \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}
\]
Now $[E \; F]^T [E \; F]$ is a $2 \times 2$ symmetric matrix. Hence, we have a representation of it in terms of its real eigenvalues $\lambda_1$ and $\lambda_2$ and associated orthonormal eigenvectors $G_1$ and $G_2$:
\[
[E \; F]^T [E \; F] = [G_1 \; G_2] \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} [G_1 \; G_2]^T
\]
We also know from our earlier discussions that $[G_1 \; G_2]^T$ is either a rotation matrix $R_\psi$ or a reflection $R_\psi J$, where
\[
J = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\]
Thus we have, in the case it is a rotation matrix,
\[
u^2(t) + v^2(t) = \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}^T R_\theta^T R_\psi^T \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R_\psi R_\theta \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}
\]


It is a straightforward calculation to show $R_\psi R_\theta = R_{\psi+\theta}$. Thus, for a rotation matrix,
\[
u^2(t) + v^2(t) = \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}^T R_{\psi+\theta}^T \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R_{\psi+\theta} \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}
\]
Now note we can rewrite this in terms of new coordinates based on new unit vectors.

• The reflection
\[
J \begin{bmatrix} \cos(\beta t) \\ \sin(\beta t) \end{bmatrix} = \begin{bmatrix} \cos(\beta t) \\ -\sin(\beta t) \end{bmatrix}
\]
corresponds to $i \to i' = i$ and $j \to j' = -j$. Call this the $x'-y'$ coordinate system. Note these vectors live on the unit circle in the $x'-y'$ coordinate system.

• Now apply the rotation $R_{\psi+\theta}$, which rotates the reflected system above counterclockwise through an angle of $\psi + \theta$. Call this the $x''-y''$ coordinate system. Note these vectors live on the unit circle in the $x''-y''$ coordinate system. The coordinates in this new system are $(x'', y'')$.

So we have
\[
u^2(t) + v^2(t) = \begin{bmatrix} x'' \\ y'' \end{bmatrix}^T \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} x'' \\ y'' \end{bmatrix} = \lambda_1 (x'')^2 + \lambda_2 (y'')^2,
\]
which is an ellipse if both eigenvalues are positive, a circle if they are positive and equal, and a hyperbola if they differ in algebraic sign. We get various degenerate possibilities if one of the eigenvalues is zero. The analysis in the reflected case is quite similar, and the matrix $J$ does not alter the result qualitatively. Rewriting again, we find
\[
x^2(t) + y^2(t) = L^2 e^{2\alpha t} \left( \lambda_1 (x'')^2 + \lambda_2 (y'')^2 \right),
\]
which shows we have spiral out, constant or spiral in trajectories depending on the sign of $\alpha$.
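A sketch that makes the spiral visible: integrate the Example 5.3.1 system with crude Euler steps (the step size and count are arbitrary choices of ours) and plot the trajectory. Since $\alpha = 3 > 0$ there, the trajectory spirals out.

A = [2 5; -1 4]; X = [6; -1];  % alpha = 3 > 0 for this system
h = 0.001; M = 2000;
pts = zeros(2, M);
for k = 1:M                    % crude Euler integration, illustration only
  pts(:,k) = X;
  X = X + h*(A*X);
end
plot(pts(1,:), pts(2,:))       % an outward spiral since alpha > 0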

5.3.2.1 Homework

Exercise 5.3.6

Exercise 5.3.7

Exercise 5.3.8

Exercise 5.3.9

Exercise 5.3.10

5.3.3 Signed Definite Matrices

A symmetric $2 \times 2$ matrix $A$ is said to be a positive definite matrix if $x^T A x > 0$ for all nonzero vectors $x$. If we multiply this out, we find the inequality below:
\[
a x_1^2 + 2b x_1 x_2 + d x_2^2 > 0.
\]
If we complete the square, we find


\[
a \left( \left( x_1 + \frac{b}{a} x_2 \right)^2 + \frac{ad - b^2}{a^2}\, x_2^2 \right) > 0
\]
Now if the leading term $a > 0$ and the determinant of $A$, $ad - b^2$, is positive, then the quadratic $\left( x_1 + (b/a) x_2 \right)^2 + \left( (ad - b^2)/a^2 \right) x_2^2$ is always positive for nonzero $x$. Note that since this determinant is positive, $ad > b^2$, which forces $d$ to be positive as well. So in this case, $a > 0$, $d > 0$ and $ad - b^2 > 0$, and the expression $x^T A x > 0$. Now recall what we found about the eigenvalues here. We had
\[
r = \frac{(a+d) \pm \sqrt{(a-d)^2 + 4b^2}}{2}
\]
Since $ad - b^2 > 0$, the term
\[
(a-d)^2 + 4b^2 = a^2 - 2ad + d^2 + 4b^2 < a^2 - 2ad + 4ad + d^2 = (a+d)^2.
\]
Thus, the square root is smaller than $a+d$, as $a$ and $d$ are positive. The first root is always positive, and the second root is too, as $(a+d) - \sqrt{(a-d)^2 + 4b^2} > (a+d) - (a+d) = 0$. So both eigenvalues are positive if $a$ and $d$ are positive and $ad - b^2 > 0$. Note the argument can go the other way: if we assume the matrix is positive definite, then we are forced to have $a > 0$ and $ad - b^2 > 0$, which gives the same result. We conclude our $2 \times 2$ symmetric matrix $A$ is positive definite if and only if $a > 0$, $d > 0$ and the determinant of $A$ is positive too. Note a positive definite matrix has positive eigenvalues.

A similar argument holds if we have the determinant of $A$ positive but $a < 0$. The determinant condition will then force $d < 0$ too. We find that $x^T A x < 0$ for all nonzero $x$. In this case, we say the matrix is negative definite. The eigenvalues are still
\[
r = \frac{(a+d) \pm \sqrt{(a-d)^2 + 4b^2}}{2}.
\]
But now, since $ad - b^2 > 0$, the term
\[
(a-d)^2 + 4b^2 = a^2 - 2ad + d^2 + 4b^2 < a^2 - 2ad + 4ad + d^2 = |a+d|^2.
\]
Since $a$ and $d$ are negative, $a + d < 0$, and so the second root is always negative. The first root's sign is determined by $(a+d) + \sqrt{(a-d)^2 + 4b^2} < (a+d) + |a+d| = 0$. So both eigenvalues are negative. We have found the matrix $A$ is negative definite if $a$ and $d$ are negative and the determinant of $A$ is positive. Note a negative definite matrix has negative eigenvalues.
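Here is a sketch of these tests in MatLab; the matrix is a sample choice of ours.

A = [3 2; 2 6];                  % sample symmetric matrix
a = A(1,1); detA = det(A);
isPosDef = (a > 0) && (detA > 0) % true here: both eigenvalues positive
isNegDef = (a < 0) && (detA > 0) % false here
eig(A)                           % compare against the eigenvalue signs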

5.3.4 Summarizing

Let's put all the information we have together. We have been studying a special type of $2 \times 2$ matrix which is symmetric. This matrix $A$ has the form
\[
A = \begin{bmatrix} a & b \\ b & d \end{bmatrix}
\]
where $a$, $b$ and $d$ are arbitrary nonzero numbers. The characteristic equation here is $r^2 - (a+d) r + ad - b^2 = 0$. Note that the term $ad - b^2$ is the determinant of $A$. The roots are given by


\[
r = \frac{(a+d) \pm \sqrt{(a-d)^2 + 4b^2}}{2}
\]
• This matrix always has two distinct real eigenvalues, and their respective eigenvectors $E_1$ and $E_2$ are mutually orthogonal. Hence, we usually normalize them so they form an orthonormal basis for $\Re^2$.

• We can use these eigenvectors to write $A$ in a canonical form:
\[
A = P \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} P^T
\]
where
\[
P = [E_1 \; E_2] \quad \text{giving} \quad P^T = \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix}
\]
• If we solve $A X = b$, it is most illuminating to rewrite both the unknown $X$ and the data $b$ with respect to the orthonormal basis determined by the eigenvectors of $A$. When we do this, we find the solution is expressed quite nicely: $X_1 = \lambda_1^{-1} b_1$ and $X_2 = \lambda_2^{-1} b_2$, where $b_1$ and $b_2$ are the components of $b$ in the eigenvector basis. This shows very clearly how the solution depends on the size of the reciprocal eigenvalues.

• If $a > 0$ and the determinant of $A = ad - b^2 > 0$, then $d > 0$ too and $A$ is positive definite. In this case, both eigenvalues are positive. If we started by assuming $A$ was positive definite, our chain of inference would lead us to the same conclusions about $a$, $b$ and $d$.

• If $a < 0$ and the determinant of $A = ad - b^2 > 0$, then $d < 0$ also and $A$ is negative definite. In this case, both eigenvalues are negative. And assuming $A$ is negative definite leads us backwards to $a < 0$, $d < 0$ and $ad - b^2 > 0$.

It seems obvious to ask if we can say similar things about symmetric matrices that are $3 \times 3$, $4 \times 4$ and so on. Clearly, we can't use our determinant route to find answers, as analyzing cubics and quartics is quite impossible. So the next section tries another approach: the dreaded road we call let's go abstract.

5.3.4.1 Homework

Exercise 5.3.11

Exercise 5.3.12

Exercise 5.3.13

Exercise 5.3.14

Exercise 5.3.15


Chapter 6

Matrix Sequences and Ordinary Differential Equations

Here is a nice application of symmetric matrices to systems of two linear differential equations. Let's do some review first, as the ODE material seems to always slip away right when you need it.

6.1 Linear Systems of ODE

Let's review linear systems of first order ODEs. These have the form
\begin{align*}
x'(t) &= a\, x(t) + b\, y(t) \\
y'(t) &= c\, x(t) + d\, y(t) \\
x(0) &= x_0 \\
y(0) &= y_0
\end{align*}
for any numbers $a$, $b$, $c$ and $d$ and initial conditions $x_0$ and $y_0$. The full problem is called, as usual, an Initial Value Problem, or IVP for short. The two initial conditions are just called the IC's for the problem to save writing.

• For example, we might be interested in the system
\begin{align*}
x'(t) &= -2\, x(t) + 3\, y(t) \\
y'(t) &= 4\, x(t) + 5\, y(t) \\
x(0) &= 5 \\
y(0) &= -3
\end{align*}
Here the IC's are $x(0) = 5$ and $y(0) = -3$.

• Another sample problem might be the one below.
\begin{align*}
x'(t) &= 14\, x(t) + 5\, y(t) \\
y'(t) &= -4\, x(t) + 8\, y(t) \\
x(0) &= 2 \\
y(0) &= 7
\end{align*}


6.1.1 The Characteristic Equation

For linear first order problems like $u' = 3u$ and so forth, we find the solution has the form $u(t) = A e^{3t}$ for some number $A$. We then determine the value of $A$ to use by looking at the initial condition. To find the solutions here, we begin by rewriting the model in matrix-vector notation:
\[
\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}.
\]
The $2 \times 2$ matrix is called the coefficient matrix of this model. The initial conditions can then be redone in vector form as
\[
\begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}.
\]
Now it seems reasonable to believe that if a constant times $e^{rt}$ solves a first order linear problem like $u' = ru$, perhaps a vector times $e^{rt}$ will work here. Let's make this formal. So let's look at the problem below:
\begin{align*}
x'(t) &= 3\, x(t) + 2\, y(t) \\
y'(t) &= -4\, x(t) + 5\, y(t) \\
x(0) &= 2 \\
y(0) &= -3
\end{align*}
Assume the solution has the form $V e^{rt}$. Let's denote the components of $V$ as follows:
\[
V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}.
\]
We assume the solution is
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = V e^{rt}.
\]
Then the derivative of $V e^{rt}$ is
\[
\left( V e^{rt} \right)' = \left( \begin{bmatrix} V_1 e^{rt} \\ V_2 e^{rt} \end{bmatrix} \right)' = \begin{bmatrix} V_1 (e^{rt})' \\ V_2 (e^{rt})' \end{bmatrix} = \begin{bmatrix} V_1\, r\, e^{rt} \\ V_2\, r\, e^{rt} \end{bmatrix} = r\, V e^{rt}
\]
Hence,
\[
\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = r\, V e^{rt}.
\]
When we plug these terms into the matrix-vector form of the problem, we find
\[
r\, V e^{rt} = \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt}
\]


Rewrite as
\[
r\, V e^{rt} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]
Recall that the $2 \times 2$ identity matrix $I$ has the form
\[
I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \Rightarrow I V = V.
\]
So
\begin{align*}
r\, V e^{rt} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt} &= r I\, V e^{rt} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} V e^{rt} \\
&= \left( r I - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \right) V e^{rt} \\
&= \left( \begin{bmatrix} r & 0 \\ 0 & r \end{bmatrix} - \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \right) V e^{rt} \\
&= \begin{bmatrix} r-3 & -2 \\ 4 & r-5 \end{bmatrix} V e^{rt}
\end{align*}
Plugging this into our model, we find
\[
\begin{bmatrix} r-3 & -2 \\ 4 & r-5 \end{bmatrix} V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]
But $e^{rt}$ is never 0, so we want $r$ satisfying
\[
\begin{bmatrix} r-3 & -2 \\ 4 & r-5 \end{bmatrix} V = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]
For each $r$, we get two equations in $V_1$ and $V_2$:
\[
(r-3) V_1 - 2 V_2 = 0, \qquad 4 V_1 + (r-5) V_2 = 0.
\]
Let $A_r$ be this matrix. Any $r$ for which $\det A_r \neq 0$ tells us these two lines have different slopes and so cross only at the origin, implying $V_1 = 0$ and $V_2 = 0$. Thus
\[
\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\]
which will not satisfy nonzero initial conditions. So we reject these $r$. Any value of $r$ for which $\det A_r = 0$ gives an infinite number of solutions, which allows us to pick one that matches the initial conditions we have. The equation
\[
\det(rI - A) = \det \begin{bmatrix} r-3 & -2 \\ 4 & r-5 \end{bmatrix} = 0
\]
is called the characteristic equation of this linear system. The characteristic equation is a quadratic, so there are three possibilities: two distinct real roots, two real roots that match, and a complex conjugate pair of roots.


Example 6.1.1 Derive the characteristic equation for the system below
\begin{align*}
x'(t) &= 8\, x(t) + 9\, y(t) \\
y'(t) &= 3\, x(t) - 2\, y(t) \\
x(0) &= 12 \\
y(0) &= 4
\end{align*}
Solution The matrix-vector form is
\[
\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}, \qquad \begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \end{bmatrix}
\]
The coefficient matrix $A$ is thus
\[
A = \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix}
\]
Assume the solution has the form $V e^{rt}$ and plug this into the system:
\[
r\, V e^{rt} - \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
Rewrite using the identity matrix $I$ and factor:
\[
\left( r I - \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} \right) V e^{rt} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
Since $e^{rt} \neq 0$ ever, we find $r$ and $V$ satisfy
\[
\left( r I - \begin{bmatrix} 8 & 9 \\ 3 & -2 \end{bmatrix} \right) V = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
If $r$ is chosen so that $\det(rI - A) \neq 0$, the only solution to this system of two linear equations in the two unknowns $V_1$ and $V_2$ is $V_1 = 0$ and $V_2 = 0$. This leads to $x(t) = 0$ and $y(t) = 0$ always, and this solution does not satisfy the initial conditions. Hence, we must find the $r$ which give $\det(rI - A) = 0$. The characteristic equation is thus
\begin{align*}
\det \begin{bmatrix} r-8 & -9 \\ -3 & r+2 \end{bmatrix} &= (r-8)(r+2) - 27 \\
&= r^2 - 6r - 43 = 0
\end{align*}
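A sketch checking this computation: poly returns the coefficients of the characteristic polynomial and roots finds the eigenvalues.

A = [8 9; 3 -2];
poly(A)                        % should print [1 -6 -43]
roots(poly(A))                 % the eigenvalues
eig(A)                         % should agree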

6.1.2 Finding The General Solution

The roots of the characteristic equation are called eigenvalues. For each eigenvalue $r$ we want to find nonzero vectors $V$ so that $(rI - A)V = \mathbf{0}$, where to help with our writing we let $\mathbf{0}$ be the two dimensional zero vector. These nonzero $V$ are called the eigenvectors for eigenvalue $r$ and satisfy $AV = rV$.


For eigenvalue $r_1$, find $V$ so that $(r_1 I - A)V = \mathbf{0}$. There will be an infinite number of $V$'s that solve this; we pick one and call it eigenvector $E_1$.

For eigenvalue $r_2$, find $V$ so that $(r_2 I - A)V = \mathbf{0}$. There will again be an infinite number of $V$'s that solve this; we pick one and call it eigenvector $E_2$.

The general solution to our model will be
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = A\, E_1 e^{r_1 t} + B\, E_2 e^{r_2 t},
\]
where $A$ and $B$ are arbitrary. We use the ICs to find $A$ and $B$. It is best to show all this with some examples.

Example 6.1.2 For the system below
\[
\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} -20 & 12 \\ -13 & 5 \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}, \qquad \begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}
\]
• Find the characteristic equation

• Find the general solution

• Solve the IVP

Solution The characteristic equation is
\[
\det \left( r \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} -20 & 12 \\ -13 & 5 \end{bmatrix} \right) = 0
\]
so
\begin{align*}
0 = \det \begin{bmatrix} r+20 & -12 \\ 13 & r-5 \end{bmatrix} &= (r+20)(r-5) + 156 \\
&= r^2 + 15r + 56 \\
&= (r+8)(r+7)
\end{align*}
Hence, the eigenvalues, or roots of the characteristic equation, are $r_1 = -8$ and $r_2 = -7$. Note since this is just a calculation, we are not following our labeling scheme.

For eigenvalue $r_1 = -8$, substitute the value into
\[
\begin{bmatrix} r+20 & -12 \\ 13 & r-5 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} \Rightarrow \begin{bmatrix} 12 & -12 \\ 13 & -13 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\]
This system of equations should be collinear: i.e. the rows should be multiples; i.e. both give rise to the same line. Our rows are multiples, so we can pick any row to find $V_2$ in terms of $V_1$. Picking the top row, we get $12 V_1 - 12 V_2 = 0$, implying $V_2 = V_1$. Letting $V_1 = a$, we find $V_1 = a$ and $V_2 = a$: so
\[
\begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = a \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]


Choose $E_1$: The vector
\[
E_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]
is our choice for an eigenvector corresponding to eigenvalue $r_1 = -8$. So one of the solutions is
\[
\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = E_1 e^{-8t} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^{-8t}.
\]
For eigenvalue $r_2 = -7$, substitute the value into
\[
\begin{bmatrix} r+20 & -12 \\ 13 & r-5 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} \Rightarrow \begin{bmatrix} 13 & -12 \\ 13 & -12 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\]
This system of equations should be collinear, and indeed the rows are multiples, so we can pick any row to find $V_2$ in terms of $V_1$. Picking the top row, we get $13 V_1 - 12 V_2 = 0$, implying $V_2 = (13/12) V_1$. Letting $V_1 = b$, we find $V_1 = b$ and $V_2 = (13/12) b$: so
\[
\begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = b \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix}
\]
Choose $E_2$: The vector
\[
E_2 = \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix}
\]
is our choice for an eigenvector corresponding to eigenvalue $r_2 = -7$. So the other solution is
\[
\begin{bmatrix} x_2(t) \\ y_2(t) \end{bmatrix} = E_2 e^{-7t} = \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^{-7t}.
\]
The general solution is then
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = A\, E_1 e^{-8t} + B\, E_2 e^{-7t} = A \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^{-8t} + B \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^{-7t}
\]
Now let's solve the initial value problem: we find $A$ and $B$.
\[
\begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix} = A \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^0 + B \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^0 = A \begin{bmatrix} 1 \\ 1 \end{bmatrix} + B \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix}
\]
So
\begin{align*}
A + B &= -1 \\
A + \frac{13}{12} B &= 2
\end{align*}


Subtracting the bottom equation from the top equation, we get $-\frac{1}{12} B = -3$, or $B = 36$. Thus, $A = -1 - B = -37$. So
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = -37 \begin{bmatrix} 1 \\ 1 \end{bmatrix} e^{-8t} + 36 \begin{bmatrix} 1 \\ \frac{13}{12} \end{bmatrix} e^{-7t}
\]
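As a sketch, this closed form can be compared with the matrix exponential flow of the system at a sample time:

A = [-20 12; -13 5]; X0 = [-1; 2];
t = 0.3;                       % any sample time
closed = -37*[1; 1]*exp(-8*t) + 36*[1; 13/12]*exp(-7*t);
norm(expm(A*t)*X0 - closed)    % should be essentially zero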

6.1.3 Homework

For these problems:

• Write the matrix-vector form.

• Derive the characteristic equation.

• Find the two eigenvalues. Label the largest one as $r_1$ and the other as $r_2$.

• Find the two associated eigenvectors as unit vectors.

• Define
\[
P = [E_1 \; E_2]
\]
• Compute $P^T A P$, where $A$ is the coefficient matrix of the ODE system.

• Show $A = P \Lambda P^T$ for an appropriate $\Lambda$.

• Write the general solution.

• Solve the IVP.

Exercise 6.1.1
\begin{align*}
x' &= x + 2y \\
y' &= 2x - 6y \\
x(0) &= 4 \\
y(0) &= -6.
\end{align*}
Exercise 6.1.2
\begin{align*}
x' &= 2x + 3y \\
y' &= 3x + 7y \\
x(0) &= 2 \\
y(0) &= -3.
\end{align*}

6.2 Symmetric Systems of ODEs

Let's look at a specific symmetric ODE system and find its solution.

Example 6.2.1 For the symmetric system below
\[
\begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} = \begin{bmatrix} -20 & 12 \\ 12 & 5 \end{bmatrix} \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}, \qquad \begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}
\]


• Find the characteristic equation

• Find the general solution

• Solve the IVP

Solution The characteristic equation is
\[
\det \left( r \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} -20 & 12 \\ 12 & 5 \end{bmatrix} \right) = 0
\]
Thus
\begin{align*}
0 = \det \begin{bmatrix} r+20 & -12 \\ -12 & r-5 \end{bmatrix} &= (r+20)(r-5) - 144 \\
&= r^2 + 15r - 244
\end{align*}
Hence, the eigenvalues, or roots of the characteristic equation, are $r_1 = 9.83$ and $r_2 = -24.83$.

For eigenvalue $r_1 = 9.83$, substitute the value into
\[
\begin{bmatrix} r+20 & -12 \\ -12 & r-5 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} \Rightarrow \begin{bmatrix} 29.83 & -12 \\ -12 & 4.83 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\]
Picking the top row, we get $29.83 V_1 - 12 V_2 = 0$, implying $V_2 = 2.486 V_1$. Letting $V_1 = a$, we find $V_1 = a$ and $V_2 = 2.486 a$: so
\[
V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = a \begin{bmatrix} 1 \\ 2.486 \end{bmatrix}
\]
Let $a = 1/\| [1 \;\; 2.486]^T \| = 1/2.680 = .373$ and choose the unit eigenvector
\[
E_1 = 0.373 \begin{bmatrix} 1 \\ 2.486 \end{bmatrix} = \begin{bmatrix} .373 \\ .928 \end{bmatrix}
\]
So one of the solutions is
\[
\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = E_1 e^{9.83 t} = \begin{bmatrix} .373 \\ .928 \end{bmatrix} e^{9.83 t}.
\]
For eigenvalue $r_2 = -24.83$, substitute the value into
\[
\begin{bmatrix} r+20 & -12 \\ -12 & r-5 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} \Rightarrow \begin{bmatrix} -4.83 & -12 \\ -12 & -29.83 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\]
Picking the top row, we get $-4.83 V_1 - 12 V_2 = 0$, implying $V_2 = -.4025 V_1$. Letting $V_1 = b$, we find $V_1 = b$ and $V_2 = -.4025 b$: so
\[
V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = b \begin{bmatrix} 1 \\ -.4025 \end{bmatrix}
\]
Let $b = 1/\| [1 \;\; -.4025]^T \| = 1/1.078 = .928$ and choose the unit eigenvector
\[
E_2 = .928 \begin{bmatrix} 1 \\ -.4025 \end{bmatrix} = \begin{bmatrix} .928 \\ -.373 \end{bmatrix}
\]


Note $\langle E_1, E_2 \rangle = (.373)(.928) + (.928)(-.373) = 0$. So the other solution is
\[
\begin{bmatrix} x_2(t) \\ y_2(t) \end{bmatrix} = E_2 e^{-24.83 t} = \begin{bmatrix} .928 \\ -.373 \end{bmatrix} e^{-24.83 t}.
\]
The general solution is
\[
\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = A\, E_1 e^{9.83 t} + B\, E_2 e^{-24.83 t} = A \begin{bmatrix} .373 \\ .928 \end{bmatrix} e^{9.83 t} + B \begin{bmatrix} .928 \\ -.373 \end{bmatrix} e^{-24.83 t}
\]
Finally, we find $A$ and $B$ using the initial conditions:
\[
\begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix} = A \begin{bmatrix} .373 \\ .928 \end{bmatrix} + B \begin{bmatrix} .928 \\ -.373 \end{bmatrix}
\]
Thus,
\[
[E_1 \; E_2] \begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix} \Rightarrow \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix} [E_1 \; E_2] \begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix} \begin{bmatrix} -1 \\ 2 \end{bmatrix} \Rightarrow \begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} E_1^T \\ E_2^T \end{bmatrix} \begin{bmatrix} -1 \\ 2 \end{bmatrix}
\]
Letting $D = [-1 \;\; 2]^T$, we get $A = \langle E_1, D \rangle$ and $B = \langle E_2, D \rangle$.
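Here is a sketch computing the two constants from these inner products; the decimals are the rounded eigenvector entries found above.

E1 = [0.373; 0.928]; E2 = [0.928; -0.373]; D = [-1; 2];
A = dot(D, E1)                 % about 1.483
B = dot(D, E2)                 % about -1.674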

6.2.1 Writing The Solution Another Way

The coefficient matrix is
\[
A = \begin{bmatrix} -20 & 12 \\ 12 & 5 \end{bmatrix}
\]
and the eigenvalues of $A$ are $\lambda_1 = 9.83$ and $\lambda_2 = -24.83$. Then we know if
\[
P = [E_1 \; E_2]
\]
then
\[
P^T A P = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \Longrightarrow A = P \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} P^T
\]
Let
\[
\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \Longrightarrow A = P \Lambda P^T
\]
Now let's calculate the powers of $A$:
\begin{align*}
A^2 &= \left( P \Lambda P^T \right) \left( P \Lambda P^T \right) = P \Lambda \left( P^T P \right) \Lambda P^T = P \Lambda^2 P^T \\
A^3 &= \left( P \Lambda P^T \right) A^2 = \left( P \Lambda P^T \right) \left( P \Lambda^2 P^T \right) = P \Lambda \left( P^T P \right) \Lambda^2 P^T = P \Lambda^3 P^T
\end{align*}
• It is easy to see by POMI that $A^n = P \Lambda^n P^T$.

Recall for the scalar $t$, $At = tA$ simply multiplies each entry of $A$ by the number $t$. Hence, from the above we have $A^n t^n = P \Lambda^n P^T t^n$. Consider
\[
A t = P \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} P^T t = P \begin{bmatrix} \lambda_1 t & 0 \\ 0 & \lambda_2 t \end{bmatrix} P^T
\]
A similar calculation shows
\[
A^2 t^2 = P \begin{bmatrix} \lambda_1^2 t^2 & 0 \\ 0 & \lambda_2^2 t^2 \end{bmatrix} P^T
\]
And the function
\begin{align*}
W_3(t) &= I + At + A^2 t^2/2! + A^3 t^3/3! \\
&= P \begin{bmatrix} 1 + \lambda_1 t + \lambda_1^2 t^2/2 + \lambda_1^3 t^3/6 & 0 \\ 0 & 1 + \lambda_2 t + \lambda_2^2 t^2/2 + \lambda_2^3 t^3/6 \end{bmatrix} P^T \\
&= P \begin{bmatrix} \sum_{k=0}^3 (\lambda_1^k t^k)/k! & 0 \\ 0 & \sum_{k=0}^3 (\lambda_2^k t^k)/k! \end{bmatrix} P^T
\end{align*}
In general,
\[
W_n(t) = P \begin{bmatrix} \sum_{k=0}^n (\lambda_1^k t^k)/k! & 0 \\ 0 & \sum_{k=0}^n (\lambda_2^k t^k)/k! \end{bmatrix} P^T
\]

We know $e^{\lambda_1 t} = \lim_{n \to \infty} \sum_{k=0}^n (\lambda_1^k t^k)/k!$ and $e^{\lambda_2 t} = \lim_{n \to \infty} \sum_{k=0}^n (\lambda_2^k t^k)/k!$. This suggests
\[
\lim_{n \to \infty} W_n(t) = P \begin{bmatrix} \lim_{n \to \infty} \sum_{k=0}^n (\lambda_1^k t^k)/k! & 0 \\ 0 & \lim_{n \to \infty} \sum_{k=0}^n (\lambda_2^k t^k)/k! \end{bmatrix} P^T = P \begin{bmatrix} e^{\lambda_1 t} & 0 \\ 0 & e^{\lambda_2 t} \end{bmatrix} P^T
\]

Define the matrix
\[
e^{\Lambda t} = \begin{bmatrix} e^{\lambda_1 t} & 0 \\ 0 & e^{\lambda_2 t} \end{bmatrix}
\]
The system of ODEs $X' = AX$ can be written as $X' = P \Lambda P^T X$. Let $Y = P^T X$. Then the system becomes $Y' = \Lambda Y$, with general solution
\[
Y(t) = \begin{bmatrix} \alpha e^{\lambda_1 t} \\ \beta e^{\lambda_2 t} \end{bmatrix} = \begin{bmatrix} e^{\lambda_1 t} & 0 \\ 0 & e^{\lambda_2 t} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}
\]
for arbitrary $\alpha$ and $\beta$. Define the vector
\[
\Theta = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}
\]
Then the general solution is $Y(t) = P^T X = e^{\Lambda t} \Theta$. This gives $X(t) = P e^{\Lambda t} \Theta$. Now define the matrix $e^{At} = P e^{\Lambda t} P^T$. Then $e^{At} P = P e^{\Lambda t}$ and so $X(t) = e^{At} P \Theta$. Note
\[
P \Theta = [E_1 \; E_2] \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \alpha E_1 + \beta E_2
\]
So for initial data $X_0 = [x_0^1 \;\; x_0^2]^T$, we have
\[
\begin{bmatrix} x_0^1 \\ x_0^2 \end{bmatrix} = \alpha E_1 + \beta E_2,
\]
implying $\alpha = \langle X_0, E_1 \rangle$ and $\beta = \langle X_0, E_2 \rangle$. Thus, the solution to the initial value problem is $X(t) = e^{At} X_0$, and the general solution to the dynamics has the form $X(t) = e^{At} C$, where $C$ is an arbitrary vector.

The general solution to the scalar ODE $x' = \lambda x$ is $x(t) = e^{\lambda t} c$, and we now know we can write the general solution to the vector system $X' = AX$ as $X(t) = e^{At} C$ also, as long as we interpret the exponential matrix $e^{At}$ right. Our argument was for $2 \times 2$ symmetric matrices, but essentially the same argument is used in the general $n \times n$ case. The canonical form of $A$ we need there is called the Jordan Canonical Form, which is discussed in more advanced classes.
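A sketch comparing this eigenvector construction of $e^{At}$ with MatLab's built in expm, for the symmetric matrix of Example 6.2.1:

A = [-20 12; 12 5];
[P, Lam] = eig(A);             % orthonormal eigenvectors of a symmetric matrix
t = 0.1;                       % any sample time
Eat = P * diag(exp(diag(Lam)*t)) * P';  % P e^{Lambda t} P^T
norm(Eat - expm(A*t))          % should be essentially zero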

6.2.2 Homework

Exercise 6.2.1 Prove that $A^n = P \Lambda^n P^T$ by POMI.

Exercise 6.2.2 Show via POMI
\[
A^n t^n = P \begin{bmatrix} \lambda_1^n t^n & 0 \\ 0 & \lambda_2^n t^n \end{bmatrix} P^T
\]
Exercise 6.2.3 Calculate $e^{At}$ for
\[
A = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix}
\]


Chapter 7

Continuity and Topology

We went over some ideas about topology in $\Re^2$ in Section 2.2, but we need to do this in $\Re^n$ now.

7.1 Topology in n Dimensions

Most of this should be familiar, although it is set in $\Re^n$.

Definition 7.1.1 Balls in $\Re^n$

• The ball about $p$ in $\Re^n$ of radius $r > 0$ is
\[
B(p; r) = \{ x \in \Re^n \mid \|x - p\| < r \},
\]
where $\| \cdot \|$ is any norm on $\Re^n$. This does not have to be the usual Euclidean norm.

• The closed ball about $p$ in $\Re^n$ of radius $r > 0$ is
\[
\overline{B}(p; r) = \{ x \in \Re^n \mid \|x - p\| \leq r \}
\]
• The punctured ball about $p$ in $\Re^n$ of radius $r > 0$ is
\[
B'(p; r) = \{ x \in \Re^n \mid 0 < \|x - p\| < r \}
\]
We can draw these sets in $\Re$, $\Re^2$ and $\Re^3$ but, of course, we cannot do that for $n > 3$. Hence, we must begin to think about these ideas abstractly. There is also the idea of a boundary point.

Definition 7.1.2 Boundary Points of a Subset in $\Re^n$

Let $D$ be a subset of $\Re^n$. We say $p$ is a boundary point of $D$ if for all positive radii $r$, $B(p; r)$ contains a point of $D$ and a point of its complement $D^C$. The set of all boundary points of $D$ is denoted by $\partial D$.

Comment 7.1.1 If $D = \{x_1, x_2\}$, then $D$ consists of two vectors only. It is clear that every ball about $x_1$ of radius $r$ smaller than $\|x_1 - x_2\|$ contains only $x_1$ from $D$. We can say the same for $x_2$. So each point in $D$ is a boundary point. In fact, we would call these points isolated points, as there is a ball about each which contains only that point.

Next, we define open and closed subsets.



Definition 7.1.3 Open and Closed Sets in $\Re^n$

Let $D$ be a subset of $\Re^n$. If $p$ is in $D$ and there is a positive $r$ so that $B(p; r) \subset D$, we say $p$ is an interior point of $D$.

• If all $p$ in $D$ are interior points, we say $D$ is an open subset of $\Re^n$.

• The set of points in $\Re^n$ not in $D$ is called the complement of $D$. We say $D$ is closed if its complement is open. We denote the complement of $D$ by $D^C$.

Note the complement of a closed set is open.

Now, as usual, we can talk about the convergence of sequences $(x_n)$ of points in $\Re^n$.

Definition 7.1.4 Norm Convergence in n Dimensions

We say the sequence $(x_n)$ in $\Re^n$ converges in norm $\| \cdot \|$ to $x$ in $\Re^n$ if given $\epsilon > 0$, there is a positive integer $N$ so that
\[
n > N \Longrightarrow \|x_n - x\| < \epsilon.
\]
We usually just write $x_n \to x$.

Comment 7.1.2 All of the usual things about scalar convergence of sequences still hold for norm convergence in $\Re^n$, but we will not clutter up the narrative by going through these proofs.

• If a sequence converges, all of its subsequences converge to the same limit.

• The usual algebra of limits theorem holds, except there are no products and quotients.

• Convergent sequences are bounded.

There are more and we will leave them as exercises for you.

Homework:

Exercise 7.1.1 Prove that if a sequence converges, all of its subsequences converge to the same limit.

Exercise 7.1.2 Prove the usual algebra of limits theorem. Note we do not have products and quotients though.

Exercise 7.1.3 Prove convergent sequences are bounded.

There are then special points in a set $D$ which can be accessed via sequences of points in $D$. We need to define several possibilities.

Definition 7.1.5 Limit Points, Cluster Points and Accumulation Points

Let $D$ be a subset of $\Re^n$.

• We say $p$ is a limit point of $D$ if there is a sequence $(x_n)$ in $D$ so that $x_n \to p$. Note $p$ need not be in $D$.

• We say $p$ is a cluster point of $D$ if there is a sequence $(x_n)$ in $D$ with $x_n \neq p$ for all $n$ so that $x_n \to p$.

• We say $p$ is an accumulation point of $D$ if, for any positive radius $r$, $B(p; r)$ contains a point $y \neq p$ in $D$. Note this clearly implies there is a sequence $(x_n)$ in $D$ with $x_n \neq p$ so that $x_n \to p$.

Comment 7.1.3 If $p$ is an isolated point of $D$, then the constant sequence $x_n = p$ in $D$ converges to $p$ and so $p$ is a limit point of $D$. However, an isolated point is not an accumulation point or a cluster point.

We can prove a nice theorem about this now.

Theorem 7.1.1 $D$ in $\Re^n$ is Closed if and only if it Contains All its Limit Points

Let $D$ be a subset of $\Re^n$. Let $D'$ be the set of limit points of $D$. Then $D$ is closed $\iff D' \subset D$.

Proof 7.1.1
($\Longrightarrow$): Assume $D$ is a closed set. Let $p$ be a limit point of $D$ which is not in $D$. Then there is a sequence $(x_n)$ in $D$ with $x_n \to p$. Since $p$ is not in $D$, it is in $D^C$, which is open. Thus $p$ is an interior point of $D^C$ and there is an $r > 0$ so that $B(p; r) \subset D^C$. However, for $\epsilon = r/2$, there is an $N$ so that $n > N \Longrightarrow \|x_n - p\| < \epsilon = r/2$. Thus, for $n > N$, $x_n$ is in $D^C$. This is a contradiction. Hence, $p$ must be in $D$. This shows $D' \subset D$.

($\Longleftarrow$): Assume $D$ contains $D'$ and consider $D^C$. If $D^C$ is not open, then there is a $p$ in $D^C$ which is not an interior point of $D^C$. Hence, for the sequence of radii $r_n = 1/n$, there is a point $x_n$ in $B(p; 1/n)$ with $x_n$ in $D$. Clearly, $x_n \to p$, and hence $p$ is a limit point of $D$ and so must be in $D$ by assumption. But $p$ is in $D^C$, which is not possible. So our assumption must be wrong and $D^C$ must be open. This tells us $D$ is closed.

This leads to the definition of the closure of a set.

Definition 7.1.6 The Closure of a Set

The closure of a set $D$ in $\Re^n$ is denoted by $\overline{D} = D \cup D'$.

We can then prove a useful characterization of the closure of a set.

Theorem 7.1.2 A Set $D$ is Closed if and only if $D = \overline{D}$

The set $D$ is closed if and only if $D = \overline{D}$.

Proof 7.1.2
($\Longrightarrow$): If $D$ is closed, $D' \subset D$ and so $D \cup D' = D$. So $\overline{D} = D$.

($\Longleftarrow$): If $D = \overline{D} = D \cup D'$, then $D'$ is contained in $D$, which tells us $D$ is closed.

There is a lot more $\Re^n$ topology we could discuss, but this is enough to get you started.

7.1.1 Homework

Exercise 7.1.4

Exercise 7.1.5

Exercise 7.1.6

Exercise 7.1.7

Exercise 7.1.8

7.2 Cauchy Sequences

We have seen Cauchy sequences frequently in (Peterson (8) 2019) and the same idea is easy to move to the $\Re^n$ setting.

Definition 7.2.1 Cauchy Sequences in $\Re^n$

A sequence $(x_n)$ in $\Re^n$ is a Cauchy sequence if it satisfies
\[
\forall\, \epsilon > 0,\; \exists\, N \ni n, m > N \Longrightarrow \|x_n - x_m\| < \epsilon.
\]
We know from the $\Re$ setting that Cauchy sequences of real numbers must converge to a real number due to the Completeness Axiom. We also know from discussions in (Peterson (8) 2019) that Cauchy sequences in general normed spaces need not converge to an element in the space. Normed spaces in which Cauchy sequences converge to an element of the space are called Complete Spaces. We can easily prove $\Re^n$ is a complete space.

Theorem 7.2.1 $\Re^n$ is Complete

$\Re^n$ is a complete normed space.

Proof 7.2.1
Let $(x_k)$ be a Cauchy sequence in $\Re^n$. Pick $\epsilon > 0$. Then there is an $N$ so that
\[
k, m > N \Longrightarrow \|x_k - x_m\| < \epsilon/2,
\]
or
\[
k, m > N \Longrightarrow \sum_{i=1}^n |x_{ki} - x_{mi}|^2 < \epsilon^2/4.
\]
Thus each piece of this sum satisfies
\[
k, m > N \Longrightarrow |x_{ki} - x_{mi}| < \epsilon/2.
\]
This says $(x_{ki})$ is a Cauchy sequence in $\Re$ and so must converge to a number $x_i$. Let $x$ be the vector whose components are these $x_i$. Then, since $| \cdot |$ is continuous, for $k > N$ we have
\[
\lim_{m \to \infty} \left( \sum_{i=1}^n |x_{ki} - x_{mi}|^2 \right) \leq \epsilon^2/4.
\]
So
\[
\|x_k - x\|^2 = \sum_{i=1}^n |x_{ki} - x_i|^2 \leq \epsilon^2/4,
\]
which implies $\|x_k - x\| < \epsilon$ when $k > N$. This shows $\Re^n$ is complete.

7.3 Compactness

Theorem 7.3.1 Bolzano-Weierstrass Theorem in $\Re^n$

Every bounded sequence in $\Re^n$ has at least one convergent subsequence.

Proof 7.3.1
Let's assume the range of this sequence is infinite. If it were finite, there would be subsequences of it that converge to each of the values in the finite range. We assume the sequence $(a_n)$ starts at $n = 1$ for convenience and, by assumption, there is a positive number $B$ so that $\|a_n\| \leq B/2$ for all $n \geq 1$. Hence, all the components of this sequence live in what could be called a hyperrectangle $\mathcal{B} = [-B/2, B/2] \times \cdots \times [-B/2, B/2] \subset \Re^n$. We can denote this by $\mathcal{B} = \prod_{i=1}^n [-B/2, B/2]$.

Let $J_0 = \prod_{i=1}^n I_0^i$ be the hyperrectangle $[\alpha_{01}, \beta_{01}] \times \cdots \times [\alpha_{0n}, \beta_{0n}]$. Here, $\alpha_{0i} = -B/2$ and $\beta_{0i} = B/2$ for all $i$; i.e. on each axis. The $n$ dimensional volume of $J_0$, denoted by $\ell_0$, is $B^n$.

Let $S$ be the range of the sequence, which has infinitely many points, and for convenience, we will let the phrase infinitely many points be abbreviated to IMPs.

Step 1: Bisect each axis interval of $J_0$ into two pieces, giving $2^n$ subregions of $J_0$, all of which have volume $B^n/2^n$. Now at least one of the subregions contains IMPs of $S$, as otherwise each subregion would have only finitely many points and that contradicts our assumption that $S$ has IMPs. Now all may contain IMPs, so select one such subregion containing IMPs and call it $J_1$. Label the endpoints of the component intervals making up the cross product that defines $J_1$ as $I_1^i = [\alpha_{1i}, \beta_{1i}]$. Hence, $J_1 = \prod_{i=1}^n I_1^i$ with $\ell_1 = B^n/2^n$ and each side of length $B/2$. We see $J_1 \subset J_0$, and on the $i^{th}$ axis the subinterval endpoints satisfy
\[
-B/2 = \alpha_{0i} \leq \alpha_{1i} \leq \beta_{1i} \leq \beta_{0i} = B/2.
\]
Since $J_1$ contains IMPs, we can select a sequence vector $a_{n_1}$ from $J_1$.

Step 2: Now subdivide $J_1$ into $2^n$ subregions just as before. At least one of these subregions contains IMPs of $S$. Choose one such subregion and call it $J_2$. Then $J_2$ is the cross product of the intervals $I_2^i = [\alpha_{2i}, \beta_{2i}]$. The volume of this subregion is now $\ell_2 = B^n/2^{2n}$ and each side has length $B/4$. We see $J_2 \subseteq J_1 \subseteq J_0$ and
\[
-B/2 = \alpha_{0i} \leq \alpha_{1i} \leq \alpha_{2i} \leq \beta_{2i} \leq \beta_{1i} \leq \beta_{0i} = B/2.
\]
Since $J_2$ contains IMPs, we can select a sequence vector $a_{n_2}$ from $J_2$. It is easy to see this value can be chosen different from $a_{n_1}$, our previous choice.

You should be able to see that we can continue this argument using induction.

Proposition: $\forall\, p \geq 1$, $\exists$ a hyperrectangle $J_p = \prod_{i=1}^n I_p^i$ with $I_p^i = [\alpha_{pi}, \beta_{pi}]$, with the volume of $J_p$, $\ell_p = B^n/2^{np}$, each side of length $B/2^p$, satisfying $J_p \subseteq J_{p-1}$, $J_p$ contains IMPs of $S$, and
\[
-B/2 \leq \alpha_{0i} \leq \alpha_{1i} \leq \cdots \leq \alpha_{pi} \leq \beta_{pi} \leq \cdots \leq \beta_{1i} \leq \beta_{0i} \leq B/2.
\]
Finally, there is a sequence vector $a_{n_p}$ in $J_p$, different from $a_{n_1}, \ldots, a_{n_{p-1}}$.

Proof We have already established the proposition is true for the basis steps $J_1$ and $J_2$.
Inductive: We assume the hyperrectangle $J_q$ exists with all the desired properties. Since, by assumption, $J_q$ contains IMPs, bisect $J_q$ into $2^n$ subregions as we have done before. At least one of these subregions contains IMPs of $S$. Choose one of these subregions and call it $J_{q+1}$, and label $J_{q+1} = \prod_{i=1}^n I_{q+1}^i$ where $I_{q+1}^i = [\alpha_{q+1,i}, \beta_{q+1,i}]$. We see immediately $\ell_{q+1} = B^n/2^{n(q+1)}$, the sides have length $B/2^{q+1}$, and
\[
\alpha_{qi} \leq \alpha_{q+1,i} \leq \beta_{q+1,i} \leq \beta_{qi}.
\]
This shows the nested inequality we want is satisfied. Finally, since $J_{q+1}$ contains IMPs, we can choose $a_{n_{q+1}}$ distinct from the other $a_{n_i}$'s. So the inductive step is satisfied and, by the POMI, the proposition is true for all $p$.

From our proposition, we have proven the existence of sequences on each axis, $(\alpha_{pi})$ and $(\beta_{pi})$, which have various properties. On each axis, the interval lengths satisfy $\beta_{pi} - \alpha_{pi} = B/2^p$ for all $p \geq 1$. Further, we have the inequality chain
\[
-B/2 = \alpha_{0i} \leq \alpha_{1i} \leq \alpha_{2i} \leq \cdots \leq \alpha_{pi} \leq \cdots \leq \beta_{pi} \leq \cdots \leq \beta_{2i} \leq \cdots \leq \beta_{0i} = B/2.
\]
The rest of this argument is very familiar. Note $(\alpha_{pi})$ is bounded above by $B/2$ and $(\beta_{pi})$ is bounded below by $-B/2$. Hence, by the completeness axiom, $\inf(\beta_{pi})$ exists and equals the finite number $\beta_i$; also $\sup(\alpha_{pi})$ exists and is the finite number $\alpha_i$.

So if we fix $p$, it should be clear the number $\beta_{pi}$ is an upper bound for all the $\alpha_{pi}$ values (look at our inequality chain again and think about this). Thus $\beta_{pi}$ is an upper bound for $(\alpha_{pi})$ and so, by definition of a supremum, $\alpha_i \leq \beta_{pi}$ for all $p$. Of course, we also know, since $\alpha_i$ is a supremum, that $\alpha_{pi} \leq \alpha_i$. Thus, $\alpha_{pi} \leq \alpha_i \leq \beta_{pi}$ for all $p$. A similar argument shows that if we fix $p$, the number $\alpha_{pi}$ is a lower bound for all the $\beta_{pi}$ values, and so, by definition of an infimum, $\alpha_{pi} \leq \beta_i \leq \beta_{pi}$ for all $p$. This tells us $\alpha_i$ and $\beta_i$ are in $[\alpha_{pi}, \beta_{pi}]$ for all $p$. Next we show $\alpha_i = \beta_i$.

Let $\epsilon > 0$ be arbitrary. Since $\alpha_i$ and $\beta_i$ are in an interval whose length is $B/2^p$, we have $|\alpha_i - \beta_i| \leq B/2^p$. Pick $P$ so that $B/2^P < \epsilon$. Then $|\alpha_i - \beta_i| < \epsilon$. But $\epsilon > 0$ is arbitrary. Hence, $\alpha_i - \beta_i = 0$, implying $\alpha_i = \beta_i$. Finally, define the vector
\[
\theta = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}
\]
We now must show $a_{n_p} \to \theta$, which shows we have found a subsequence which converges to $\theta$. We know $\alpha_{pi} \leq a_{n_p, i} \leq \beta_{pi}$ and $\alpha_{pi} \leq \alpha_i \leq \beta_{pi}$ for all $p$. Pick $\epsilon > 0$ arbitrarily. Given any $p$, we have
\[
|a_{n_p, i} - \alpha_i| = |a_{n_p, i} - \alpha_{pi} + \alpha_{pi} - \alpha_i| \leq |a_{n_p, i} - \alpha_{pi}| + |\alpha_{pi} - \alpha_i| \leq |\beta_{pi} - \alpha_{pi}| + |\beta_{pi} - \alpha_{pi}| = 2\, B/2^p.
\]
Choose $P$ so that $B/2^P < \epsilon/2$. Then $p > P$ implies $|a_{n_p, i} - \alpha_i| < 2\, \epsilon/2 = \epsilon$. Thus, $a_{n_p, i} \to \alpha_i$ for each $i$, and the subsequence converges to $\theta$.

Note this argument is messy but quite similar to the one and two dimensional cases. It is more a problem of correct labeling than intellectual difficulty!

There are several notions of compactness for a subset of $\Re^n$. There is topological compactness:

Definition 7.3.1 Topologically Compact Subsets of $\Re^n$

Let $D$ be a subset of $\Re^n$. We say the collection of open sets $\{U_\alpha\}$, where $\alpha \in I$ and $I$ is any index set, whether finite, countable or uncountable, is an open cover of $D$ if $D \subset \cup_\alpha U_\alpha$. We say this collection has a finite subcover if there are a finite number of sets in the collection, $U_{\beta_1}, \ldots, U_{\beta_p}$, so that $D \subset \cup_{i=1}^p U_{\beta_i}$. $D$ is topologically compact if every open cover of $D$ has a finite subcover.

and sequential compactness:

Definition 7.3.2 Sequentially Compact Subsets of $\Re^n$

Let $D$ be a subset of $\Re^n$. We say $D$ is sequentially compact if every sequence $(x_n)$ in $D$ has at least one subsequence which converges to a point of $D$.

We need to prove their equivalence through a sequence of results. Compare how we prove these to their one dimensional analogues in (Peterson (8) 2019).

Theorem 7.3.2 $D$ is Sequentially Compact if and only if it is Closed and Bounded

$D$ in $\Re^n$ is sequentially compact if and only if it is bounded and closed.

Proof 7.3.2
($\Longrightarrow$) We assume $D$ is sequentially compact. To show $D$ is closed, assume $(x_n)$ in $D$ converges to $x$. Since $(x_n)$ is a sequence in the sequentially compact set $D$, it has a subsequence $(x_n^1)$ which converges to some $y$ in $D$. Since limits of convergent sequences are unique, we must have $x = y$. So $x$ is in $D$ and hence $D$ is closed.

To show $D$ is bounded, we use contradiction. Assume it is not bounded. Then for each $n$, there is $x_n$ in $D$ so that $\|x_n\| > n$. But since $D$ is sequentially compact, there is a subsequence $(x_{n_k})$ which must converge to some $x$ in $D$. We know convergent sequences are bounded, and so there is a positive number $B$ so that $\|x_{n_k}\| < B$ for all elements of the subsequence. But eventually $\|x_{n_k}\| > n_k > B$, which is the contradiction. We conclude $D$ must be bounded.

(⇐=) Now we assume D is closed and bounded. We argue that D is sequentially compact using essentially the same procedure we used to prove the Bolzano - Weierstrass Theorem for ℜⁿ. First, if D is a finite set, it is sequentially compact, since any sequence in D repeats some point infinitely often. So we can assume D has infinitely many points. Since D is bounded, there is a positive number B so that D ⊂ ∏_{i=1}^n [−B/2, B/2]. Pick any x0 in D ⊂ D0 = ∏_{i=1}^n [−B/2, B/2].

Now bisect each of the coordinate axes, giving 2ⁿ pieces. Since D is infinite, at least one piece


contains infinitely many points of D. Call this piece D1 and pick x1 ≠ x0 with x1 ∈ D1. Since both points are in D0,

‖x1 − x0‖² = Σ_{i=1}^n |x1i − x0i|² < n B² ⟹ ‖x1 − x0‖ < √n B

Now do this again. We bisect D1 on each axis to create 2ⁿ pieces. Choose one piece containing infinitely many points of D and call it D2. The lengths of the sides of this piece are now B/4. Pick x2 ∈ D2 not the same as the previous two points. Then x2 and x1 are both in D1 and so ‖x2 − x1‖ < √n (B/2). Continuing this process, we obtain a sequence (xn) with xn ∈ Dn. The sides of Dn have length B/2ⁿ and both xn−1 and xn are in Dn−1. So ‖xn − xn−1‖ < √n (B/2^{n−1}). We claim this

sequence is a Cauchy sequence in ℜⁿ. For k > m,

‖xk − xm‖ ≤ Σ_{j=m}^{k−1} ‖x_{j+1} − xj‖ ≤ Σ_{j=m}^{k−1} √n B/2^j = √n B ( Σ_{j=0}^{k−1} 1/2^j − Σ_{j=0}^{m−1} 1/2^j )
= √n B ( (1 − 1/2^k)/(1 − 1/2) − (1 − 1/2^m)/(1 − 1/2) ) = 2 √n B ( 1/2^m − 1/2^k ) ≤ √n B/2^{m−1}

Then, given ε > 0, there is an N so that √n B/2^{m−1} < ε if m > N. Thus, ‖xk − xm‖ < ε if k > m > N, which shows (xn) is a Cauchy sequence. Since ℜⁿ is complete, there is x so that xn → x. Since D is closed, it follows x is in D. Now this same argument applies to any infinite sequence (wn) in D. This sequence will have a subsequence (w¹n) which converges to a point w ∈ D. Hence, D is sequentially compact.

Theorem 7.3.3 D is topologically compact implies it is closed and bounded

D in ℜⁿ is topologically compact implies it is closed and bounded.

Proof 7.3.4
Fix any r > 0. The collection O = {B(r, x) : x ∈ D} is an open cover of D. Since D is topologically compact, O has a finite subcover V = {B(r, xi) : 1 ≤ i ≤ N} for some positive integer N. If x ∈ D, there is an index i so that x ∈ B(r, xi). Thus ‖x − xi‖ < r, which implies ‖x‖ ≤ r + ‖xi‖. Let B = max_{1≤i≤N} ‖xi‖. Then, ‖x‖ ≤ r + B, which shows D is bounded.

Let's show D^C is open. Let x ∈ D^C. For any y in D, let d_{x,y} = ‖x − y‖, which is not zero as x is in the complement. The collection O = {B((1/2)d_{x,y}, y) : y ∈ D} is an open cover of D and so there is a finite subcover V = {B((1/2)d_{x,yi}, yi) : 1 ≤ i ≤ N}. Let W = ∩_{i=1}^N B((1/2)d_{x,yi}, x). By construction, each B((1/2)d_{x,yi}, x) is disjoint from B((1/2)d_{x,yi}, yi). So if z is in W, z is not in any B((1/2)d_{x,yi}, yi). But since V is an open cover for D, this says z ∈ D^C. Hence, if r = min_{1≤i≤N} (1/2)d_{x,yi}, then B(r, x) ⊂ W is contained in D^C. This shows x is an interior point of D^C. So D^C is open, which implies D is closed.

Theorem 7.3.4 A Hyperrectangle is topologically compact

The hyperrectangle D = ∏_{i=1}^n [ai, bi] is topologically compact when each segment [ai, bi] is finite in length.

Proof 7.3.5
This proof is similar to how we proved the Bolzano - Weierstrass Theorem (BWT), so look for the similarities! Assume this is false and let J0 = ∏_{i=1}^n [ai, bi]. Label this hyperrectangle as J0 = [α0, β0]; i.e.


α0 = (a1, . . . , an) and β0 = (b1, . . . , bn). The volume of this hyperrectangle is ℓ0 = ∏_{i=1}^n (bi − ai). Assume there is an open cover U of J0 which does not have a finite subcover (fsc).

Now divide J0 into 2ⁿ pieces by bisecting each segment [ai, bi]. At least one of these pieces cannot be covered by a fsc, as otherwise all of D would have a fsc, which we have assumed is not true. Call this piece J1; its sides are half as long as those of J0, so its volume is ℓ1 = ℓ0/2ⁿ. We can continue in this way by induction. Assume we have constructed the piece Jp in this fashion with volume ℓp = ℓ_{p−1}/2ⁿ = ℓ0/2^{np} and with no fsc of Jp. Now divide Jp into 2ⁿ equal pieces as usual. At least one of them cannot be covered by a fsc. Call this piece J_{p+1}, which has volume ℓ_{p+1} = ℓp/2ⁿ as required.

We can see each J_{p+1} ⊂ Jp by the way we have chosen the pieces; the sides of Jp have lengths (bi − ai)/2^p and the volumes satisfy ℓp → 0. Using the same sort of arguments that we used in the proof of the BWT for sequences, we see there is a unique point z which lies in all the Jp; i.e. z ∈ ∩_{p=1}^∞ Jp.

Since z is in J0, there is an O1 in U with z ∈ O1 because U covers D. Also, since O1 is open, there is a ball B(r1, z) ⊂ O1. Now choose K so that the diameter of JK, which is (1/2^K) √( Σ_{i=1}^n (bi − ai)² ), is less than r1. Then Jk ⊂ B(r1, z) ⊂ O1 for all k > K. This says {O1} is a fsc of Jk, which contradicts our construction process. Hence, our assumption that there is an open cover with no fsc is wrong and we have D is topologically compact.

Theorem 7.3.5 Closed Subsets of Hyperrectangles are topologically compact

Let C be a closed subset of the hyperrectangle D = ∏_{i=1}^n [ai, bi], where each segment [ai, bi] is finite in length. Then C is topologically compact.

Proof 7.3.6
Let U be an open cover of C. Since C is closed, C^C is open, so the collection of open sets O = U ∪ {C^C} is an open cover of D and hence has a fsc V = {O1, . . . , ON, C^C} for some N with each On ∈ U. Hence, C ⊂ ∪_{i=1}^N Oi and we have found a finite subcollection of U which covers C. This shows C is topologically compact.

Theorem 7.3.6 D is topologically compact if and only if it is closed and bounded

D in ℜⁿ is topologically compact if and only if it is closed and bounded.

Proof 7.3.7
(=⇒) If D is topologically compact, by Theorem 7.3.3, it is closed and bounded.
(⇐=) If D is bounded, D is inside a hyperrectangle with finite edge lengths. Since D is a closed subset of the hyperrectangle, D is topologically compact by Theorem 7.3.5.

Theorem 7.3.7 D is topologically compact if and only if sequentially compact if and only if closed and bounded

Let D be a subset of ℜⁿ. Then D is topologically compact ⇐⇒ D is sequentially compact ⇐⇒ D is closed and bounded.

Proof 7.3.8
(Topologically Compact ⇐⇒ Closed and Bounded) This is Theorem 7.3.6.


(Sequentially Compact ⇐⇒ Closed and Bounded) This is Theorem 7.3.2.

Comment 7.3.1 Since in these finite dimensional settings the notions of sequential compactness, topological compactness and closed and bounded are equivalent, we often just say a set is compact and then use whichever characterization is useful for our purposes.

7.4 Functions of Many Variables

In (Peterson (8) 2019), we developed a number of simple MatLab scripts and functions to help us visualize functions of two variables. Of course, these are of little use for functions of three or more variables. For example, if y = f(x), we know how to see this graphically: we graph the pairs (x, f(x)) in the usual xy plane. If we had z = f(x, y), we would graph the triples (x, y, f(x, y)) in three space. We would use the right hand rule for the positioning of the +z axis as usual. However, for w = f(x, y, z), the visualization would require us to draw in ℜ⁴, which we cannot do. So much of what we want to do now must be done abstractly. Now let's look at this in the n dimensional setting. Let D be a subset of ℜⁿ and assume z = f(x) ∈ ℜᵐ is defined locally on B(x0, r) for some positive r. This defines m component functions f1, . . . , fm by

f(x) = f(x1, . . . , xn) = (f1(x), . . . , fm(x))

The set D ⊂ ℜⁿ is at least an open set and we usually want it to be path connected as well. We will talk about this later, but a path connected subset of ℜⁿ is one in which there is a path connecting any two points of D that lies entirely in D. We discuss this much more carefully in Chapter 17, but for now we will be casual about it.

Example 7.4.1 Although x here is in ℜⁿ and so is usually thought of as a column vector, we typically write it as a row vector. Here is a typical function f : ℜ³ → ℜ³:

f(x1, x2, x3) = [ x1² + 3x3²
                  2x1 + 5x2 − sin(x3²)
                  14 x2⁴ x3² + 20 ] = [ x1² + 3x3², 2x1 + 5x2 − sin(x3²), 14 x2⁴ x3² + 20 ]

These notational abuses are quite common. For example, it is very normal to use row vector notation for the domain of f and column vector notation for the range of f. Go figure. For this example, the component functions are

f1(x1, x2, x3) = x1² + 3x3², f2(x1, x2, x3) = 2x1 + 5x2 − sin(x3²), f3(x1, x2, x3) = 14 x2⁴ x3² + 20

7.4.1 Limits and Continuity for Functions of Many Variables

Now let's look at continuity in this n dimensional setting. Let D be a subset of ℜⁿ and assume z = f(x) is defined locally on B(x0, r) for some positive r. Here is the many dimensional extension of the idea of a limit of a function.

Definition 7.4.1 The ℜⁿ to ℜᵐ Limit


If lim_{x→x0} f(x) exists, this means there is a vector L ∈ ℜᵐ so that

∀ε > 0 ∃ 0 < δ < r ∋ 0 < ‖x − x0‖ < δ ⇒ ‖f(x) − L‖ < ε

We say lim_{x→x0} f(x) = L. We use the same notation ‖·‖ for the norm in both the domain and the range. Of course, these norms are defined separately and we need to use the correct norms on both parts as needed.

We can now define new versions of limits and continuity for functions of many variables.

Definition 7.4.2 ℜⁿ to ℜᵐ Continuity

If limx→x0 f(x) exists and matches f(x0), we say f is continuous at x0. That is

∀ε > 0 ∃ 0 < δ < r ∋ ‖x − x0‖ < δ ⇒ ‖f(x) − f(x0)‖ < ε

We say lim_{x→x0} f(x) = f(x0).

Statements about limits and continuity of these functions are equivalent to statements about limits and continuity of the component functions.

Theorem 7.4.1 The behavior of f : D ⊂ ℜⁿ → ℜᵐ is equivalent to the behavior of its component functions

Let f : D ⊂ ℜⁿ → ℜᵐ. Then

• f has a limit at x0 of value L if and only if each component fi has a limit at x0 of value Li.

• f is continuous at x0 if and only if each component fi is continuous at x0.

Proof 7.4.1
We will leave this proof to you.

Example 7.4.2 Prove

f(x, y) = [ 2x² + 3y²
            3x² + 5y² ]

is continuous at (2, 3).

Solution We find

f(2, 3) = [ 8 + 27
            12 + 45 ] = [ 35
                          57 ]

We will use the Euclidean norm on ℜ² here. Note, if we choose r < 1, then ‖(x, y) − (2, 3)‖ < r implies |x − 2| < r and |y − 3| < r. Then

‖ [ 2x² + 3y² − 35
    3x² + 5y² − 57 ] ‖ = ‖ [ 2(x − 2 + 2)² + 3(y − 3 + 3)² − 35
                             3(x − 2 + 2)² + 5(y − 3 + 3)² − 57 ] ‖
                      = ‖ [ 2(x − 2)² + 8(x − 2) + 3(y − 3)² + 18(y − 3)
                            3(x − 2)² + 12(x − 2) + 5(y − 3)² + 30(y − 3) ] ‖


= ‖ [ 2(x − 2)²
      3(x − 2)² ] + [ 8(x − 2)
                      12(x − 2) ] + [ 3(y − 3)²
                                      5(y − 3)² ] + [ 18(y − 3)
                                                      30(y − 3) ] ‖

From the triangle inequality for ‖·‖, since this is the Euclidean norm, we have

‖ [ 2x² + 3y² − 35
    3x² + 5y² − 57 ] ‖ ≤ 3√2 |x − 2|² + 12√2 |x − 2| + 5√2 |y − 3|² + 30√2 |y − 3| ≤ 8√2 r² + 42√2 r < 50√2 r

Hence, given ε > 0, if we choose r < min(1, ε/(50√2)), we have ‖f(x, y) − f(2, 3)‖ < ε when (x, y) ∈ B((2, 3), r).
Note it is probably easier in practice to show each component is continuous separately. We get two different critical values of r and we just choose the minimum of them to prove continuity.

Example 7.4.3 Show

f(x, y) = [ 2x/√(x² + y²)
            x² + 3y² ]  if (x, y) ≠ (0, 0),   f(0, 0) = [ 0
                                                          0 ]

is not continuous at (0, 0).

Solution First, although it is difficult to graph surfaces with discontinuities, it is possible. Let's modify the code DrawMesh in (Peterson (8) 2019). The function DrawMesh starts at the base point (x0, y0) and moves outward from there, generating surface values that include the ones involving x0 and/or y0. So to draw the surface, all we have to do is add a conditional test in the setup phase for the surface values, like this

Listing 7.1: Adding Test to Avoid Discontinuity

for i = -nx:nx
  if (i ~= 0)
    u = [x0 + i*delx];
    x = [x, u];
  end
end

The test to avoid i = 0 is all we need. The full code is below:

Listing 7.2: Draw a Surface with a Discontinuity

function DrawDiscon(f, delx, nx, dely, ny, x0, y0)
% f is the function defining the surface
% delx is the size of the x step
% nx is the number of steps left and right from x0
% dely is the size of the y step
% ny is the number of steps left and right from y0
% (x0,y0) is the center of the rectangular domain
%
% set up x and y stuff
% we avoid x0 and y0 which is where the problem is
x = [];
for i = -nx:nx
  if (i ~= 0)
    u = [x0 + i*delx];
    x = [x, u];
  end
end
%
y = [];
for i = -ny:ny
  if (i ~= 0)
    u = [y0 + i*dely];
    y = [y, u];
  end
end
sx = 2*nx;
sy = 2*ny;
hold on
% plot the surface
% set up grid of x and y pairs (x(i), y(j))
[X, Y] = meshgrid(x, y);
% set up the surface
Z = f(X, Y);
mesh(X, Y, Z);
xlabel('x');
ylabel('y');
zlabel('z');
rotate3d on
axis on
grid on
hold off
end

If you look at the generated surface yourself, you can spin the plot around and really get a good feel for the surface discontinuity. At y = 0, the first surface value is +2 if x > 0 and −2 if x < 0, and you can see the attempt the plotted image makes to capture that sudden discontinuity in the strip of unplotted points at y = 0 on the plot. We captured one version of it in Figure 7.1.
We know the component f2 is continuous everywhere as the argument is similar to the last example. So we only have to show f1 does not have a limit at (0, 0), as this is enough to show f1 is not continuous at (0, 0) and so f is not continuous at (0, 0). If this limit exists, we should get the same value for the limit no matter what path we take to reach (0, 0).
Let the first path be given by x(t) = t and y(t) = 2t. We have two paths really: one for t > 0 and one for t < 0. We find for t > 0, f1(t, 2t) = 2t/√(t² + 4t²) = 2/√5 and hence the limit along this path is 2/√5.
We find for t < 0, f1(t, 2t) = 2t/√(t² + 4t²) = 2t/(|t| √5) = −2/√5 and hence the limit along this path is −2/√5. Since the limiting value differs on the two paths, the limit can't exist. Hence, f1 is not continuous at (0, 0).
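We can watch the two path limits numerically as well. Here is a small MatLab/Octave sketch; the handle f1 below is just the first component of f written out by hand:

f1 = @(x, y) 2*x ./ sqrt(x.^2 + y.^2);
t = [0.1, 0.01, 0.001];
disp(f1(t, 2*t))     % each entry is 2/sqrt(5) = 0.89443
disp(f1(-t, -2*t))   % each entry is -2/sqrt(5) = -0.89443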

Comment 7.4.1 The problem with visualization here is that we can never use this to help us with functions of more than two variables because we can't generate the plot. So we have to see it in our mind, so to speak.


Figure 7.1: The surface f(x, y) = 2x/√(x² + y²) near the discontinuity at (0, 0)

7.4.2 Homework

Exercise 7.4.1 Let

f(x, y) = [ 6x² + 9y²
            3xy ]

Prove f is continuous at (2, 5).

Exercise 7.4.2 Let

f(x, y) = [ 2x² − 3y²
            7x + 8y ]

Prove f is continuous at (−1, 2).

Exercise 7.4.3 Let

f(x, y) = [ 3x/√((x + 1)² + 2(y − 2)²)
            4x² + 2y² ]

Prove f is not continuous at (−1, 2). Draw the surface for f1 here as well.

Theorem 7.4.2 If f is continuous on a compact domain, its range is compact


Let f : D ⊂ ℜⁿ → ℜ be continuous. If D is compact, then f(D) is also compact.

Proof 7.4.2
If f(D) were not bounded, there would be a sequence (yn) in f(D) so that |yn| > n for all n. But since yn = f(xn) for some xn in D, and D is compact, there is a subsequence (x_{n_k}) which converges to some x in D. By the continuity of f, f(x_{n_k}) → f(x), and since convergent sequences are bounded, the sequence (f(x_{n_k})) must be bounded. Of course, it is not, as |f(x_{n_k})| > n_k, and this contradiction tells us our assumption that f(D) is not bounded is wrong. Hence f(D) is bounded.

Next, let (yn) be a convergent sequence in f(D), so yn → y for some y. There is then an associated sequence (xn) in D so that yn = f(xn). Since D is compact, there is a subsequence (x_{n_k}) which converges to some x in D. Since f is continuous on D, we have f(x_{n_k}) → f(x) ∈ f(D). But subsequences of convergent sequences converge to the same limit, so we must have y = f(x). Thus y ∈ f(D) and f(D) is closed.

Since f(D) is closed and bounded, it is compact.

Theorem 7.4.3 If f is continuous on a compact domain, it has global extrema

Let f : D ⊂ ℜⁿ → ℜ be continuous on the compact set D. Then f achieves a global maximum and a global minimum on D.

Proof 7.4.3
Since f(D) is compact, it is bounded. Thus there is a positive B so that |f(x)| ≤ B for all x in D. Hence, α = inf_{x∈D} f(x) and β = sup_{x∈D} f(x) both exist and are finite by the Completeness Axiom. Using both the Infimum and Supremum Tolerance Lemmas, we can find sequences (xn) and (yn) in D so that f(xn) → α and f(yn) → β. Since D is compact, there exist subsequences (x¹n) and (y¹n) with x¹n → xm ∈ D and y¹n → xM ∈ D. Then f(x¹n) → α and f(y¹n) → β too. By continuity of f, this tells us f(xm) = α and f(xM) = β. Thus α is the minimum of f on D and β is the maximum of f on D.


Chapter 8

Abstract Symmetric Matrices

Let's start with a few more ideas about matrices. We have already talked about magnitudes of linear transformations in Chapter 4.5, where the underlying spaces were n dimensional vector spaces. Now let's just look at these ideas for matrices whose underlying spaces are simply ℜⁿ.

8.1 Input - Output Ratios for Matrices

The size of a matrix can be measured in terms of its rows and columns, but a better way is to think about how it acts on vectors. We can think of a matrix as an engine which transforms inputs into outputs, and it is natural to ask about the ratio of output to input. Such a ratio gives a measure of how the matrix alters the data. Since we can't divide by vectors, we typically use a measure of vector size for the output and input sides. Recall the Euclidean norm of a vector x is ‖x‖ = √(x1² + · · · + xn²) where x1 through xn are the components of x with respect to the standard basis in ℜⁿ. This is also the ‖·‖2 norm we have discussed before. Let A be an n × n matrix. Then A transforms vectors in ℜⁿ to new vectors in ℜⁿ. We measure the ratio of output to input by calculating ‖A(x)‖/‖x‖ and, of course, this doesn't make sense if x = 0. We want to see how big this ratio can get, so we are interested in the maximum value of ‖A(x)‖/‖x‖. Now the matrix A is linear; i.e. A(cx) = cA(x). So in particular, if y ≠ 0, we can say A(y) = ‖y‖ A(y/‖y‖). Thus,

‖A(y)‖ / ‖y‖ = ‖ ‖y‖ A(y/‖y‖) ‖ / ‖y‖ = ‖A(y/‖y‖)‖ / ‖y/‖y‖‖

Now let x = y/‖y‖ and we have ‖A(y)‖/‖y‖ = ‖A(x)‖/‖x‖, where x has norm 1. So

max_{y≠0} ( ‖A(y)‖ / ‖y‖ ) = max_{‖x‖=1} ( ‖A(x)‖ / ‖x‖ )

In order to understand this, first note that the numerator ‖A(x)‖ = √⟨A(x), A(x)⟩ and it is easy to see this is a continuous function of its arguments x1 through xn. We can't use just any arguments either; here we can only use those for which x1² + . . . + xn² = 1. The set of such x in ℜⁿ is a bounded set which contains all of its boundary points. For example, in ℜ², this set is the unit circle x1² + x2² = 1 and in ℜ³, it is the surface of the sphere x1² + x2² + x3² = 1. These sets are bounded and contain their boundary points and are therefore compact subsets. We know any continuous function


must have a global maximum and a global minimum over a compact domain. Hence, the problem of finding the maximum of ‖A(x)‖ over the closed and bounded set ‖x‖ = 1 has a solution. We define this maximum ratio to be the norm of the matrix A. We denote the matrix norm as usual by ‖A‖ and so we have

Definition 8.1.1 The Norm of a Symmetric Matrix

Let A be an n × n symmetric matrix. The norm of A is

‖A‖ = max_{‖x‖=1} ( ‖A(x)‖ / ‖x‖ ) = max_{‖x‖=1} ‖A(x)‖
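Here is a minimal MatLab/Octave sketch of this definition, with a made-up matrix A for illustration: we sample random unit vectors and compare the largest ratio found against the built-in matrix 2-norm, which computes the same quantity.

A = [1, 3; 3, 2];              % a made-up matrix for illustration
best = 0;
for k = 1:20000
  x = randn(2, 1);
  x = x / norm(x);             % a random unit vector
  best = max(best, norm(A*x)); % the ratio ||A x||/||x|| with ||x|| = 1
end
disp([best, norm(A)])          % best approaches norm(A) from below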

8.1.1 Homework

Exercise 8.1.1

Exercise 8.1.2

Exercise 8.1.3

Exercise 8.1.4

Exercise 8.1.5

8.2 The Norm of a Symmetric Matrix

Now let's specialize to symmetric matrices. First, because of the symmetry, an easy calculation shows

⟨A(x), y⟩ = y^T A x = (A x)^T y = x^T A^T y.

But this matrix is symmetric and so its transpose is the same as itself. We conclude

⟨A(x), y⟩ = x^T A y = ⟨x, A(y)⟩

We will use this identity in a bit. Now, when the matrix is symmetric, we claim we also have

‖A‖ = max_{‖x‖=1} |⟨A(x), x⟩|

Let the maximum on the right hand side be denoted by J for convenience. Now we know how inner products behave: by the Cauchy - Schwarz inequality, for any vector y

|⟨A(y), y⟩| ≤ ‖A(y)‖ ‖y‖

Next, from the way ‖A‖ is defined, for any particular nonzero vector y, we have ‖A(y)‖ ≤ ‖A‖ ‖y‖. Thus, combining these ideas, we have for any vector x with norm 1,

|⟨A(x), x⟩| ≤ ‖A(x)‖ ‖x‖ ≤ ‖A‖ ‖x‖ ‖x‖ = ‖A‖.


Since this is true for all such vectors, we must have the maximum J satisfies J ≤ ‖A‖. Next, we will show the reverse inequality; the two pieces together then show J = ‖A‖.
Now for any nonzero y we have

⟨A(y/‖y‖), y/‖y‖⟩ ≤ J

which implies ⟨A(y), y⟩ ≤ J ‖y‖². Now do the following calculations. These are just matrix multiplications and they are pretty straightforward. We have for any x and y that

⟨A(x + y), x + y⟩ = ⟨A(x), x⟩ + ⟨A(y), y⟩ + 2 ⟨A(x), y⟩ ≤ J ‖x + y‖²

because the matrix is symmetric. We also have

⟨A(x − y), x − y⟩ = ⟨A(x), x⟩ + ⟨A(y), y⟩ − 2 ⟨A(x), y⟩ ≥ −J ‖x − y‖²

Now subtract the second inequality from the first to get

4 ⟨A(x), y⟩ ≤ J ( ‖x + y‖² + ‖x − y‖² )

Finally, for unit vectors x and y we can calculate that

‖x + y‖² + ‖x − y‖² = 2 ( ‖x‖² + ‖y‖² ) = 4

as these vectors have length one. We conclude ⟨A(x), y⟩ ≤ J for all unit vectors x and y. Now consider the case where y = A(x). If A(x) = 0, we would have

‖A(x)‖² = ⟨A(x), A(x)⟩ = ⟨0, 0⟩ = 0,

and so ‖A(x)‖ ≤ J. If A(x) ≠ 0, we can take the unit vector A(x)/‖A(x)‖ and say

⟨A(x), A(x)/‖A(x)‖⟩ ≤ J

which we can simplify to

⟨A(x), A(x)⟩ / ‖A(x)‖ ≤ J.

But ⟨A(x), A(x)⟩ = ‖A(x)‖² and so, dividing, we find ‖A(x)‖ ≤ J. This implies the maximum over all ‖x‖ = 1 also satisfies this inequality and so

‖A‖ = max_{‖x‖=1} ‖A(x)‖ ≤ J.

With the reverse inequality established, we have proven the result we want. Let's summarize our technical discussion. We have proven the following.

Theorem 8.2.1 Two Equivalent Ways To Calculate the Norm of a Symmetric Matrix


If A is a symmetric n × n matrix, then

‖A‖ = max_{‖x‖=1} ‖A(x)‖ = max_{‖x‖=1} |⟨A(x), x⟩| = J.

Proof 8.2.1
See the discussions above.
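A quick numerical sanity check of Theorem 8.2.1 (a sketch; any symmetric matrix will do in place of the made-up one below):

A = [1, 3; 3, 2];                 % symmetric, so the theorem applies
J = 0;
for k = 1:20000
  x = randn(2, 1);
  x = x / norm(x);                % a random unit vector
  J = max(J, abs(x' * A * x));    % sampled maximum of |<A(x), x>|
end
disp([J, norm(A)])                % both approach the same value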

Homework

Exercise 8.2.1

Exercise 8.2.2

Exercise 8.2.3

Exercise 8.2.4

Exercise 8.2.5

8.2.1 Constructing Eigenvalues

For a symmetric matrix, we can construct the eigenvalues using a procedure which also gives us useful estimates. Now let's find eigenvalues:

8.2.1.1 Eigenvalue One

Theorem 8.2.2 The First Eigenvalue

Either ‖A‖ or −‖A‖ is an eigenvalue for A. Letting the eigenvalue be λ1, then |λ1| =‖A‖ with an associated eigenvector of norm 1, E1.

Proof 8.2.2
We know that the maximum value of |⟨A(x), x⟩| over all ‖x‖ = 1 occurs at some unit vector. Call this unit vector E1. For convenience, let α = ‖A‖. Then we know |⟨A(E1), E1⟩| = α, so either ⟨A(E1), E1⟩ = −α or ⟨A(E1), E1⟩ = α.
Case I: We assume ⟨A(E1), E1⟩ = −α; the candidate eigenvalue is λ1 = −‖A‖. Then,

⟨A(E1) − (−α)E1, A(E1) − (−α)E1⟩ = ⟨A(E1) + αE1, A(E1) + αE1⟩
= ⟨A(E1), A(E1)⟩ + 2α ⟨A(E1), E1⟩ + α² ⟨E1, E1⟩
= ‖A(E1)‖² − 2α² + α²
= ‖A(E1)‖² − α²

since ‖E1‖ = 1 and ⟨A(E1), E1⟩ = −α. Then, overestimating, we have

⟨A(E1) + αE1, A(E1) + αE1⟩ ≤ ‖A‖² ‖E1‖² − α².

But α = ‖A‖, so we have

⟨A(E1) + αE1, A(E1) + αE1⟩ ≤ α² − α² = 0.

We conclude ‖A(E1) − (−α)E1‖ = 0, which tells us A(E1) = −αE1. Hence, −α is an eigenvalue of A and we can pick E1 as the associated eigenvector of norm 1.
Case II: We assume ⟨A(E1), E1⟩ = α. The argument is then quite similar. We conclude ‖A(E1) − αE1‖ = 0, which tells us A(E1) = αE1. Hence, α is an eigenvalue of A and we can pick E1 as the associated eigenvector of norm 1.

Homework

Exercise 8.2.6

Exercise 8.2.7

Exercise 8.2.8

Exercise 8.2.9

Exercise 8.2.10

8.2.1.2 Eigenvalue Two

We can now find the next eigenvalue using a similar argument.

Theorem 8.2.3 The Second Eigenvalue

There is a second eigenvalue λ2 which satisfies |λ2| ≤ |λ1|. Let the new symmetric matrix A2 be defined by

A2 = A − λ1 E1 E1^T.

Then, |λ2| = ‖A2‖ and the associated eigenvector E2 is orthogonal to E1, i.e., ⟨E2, E1⟩ = 0.

Proof 8.2.3
The reasoning here is virtually identical to what we did for Theorem 8.2.2. First, note, if x is a multiple of E1, we have x = μE1 for some μ, giving

A2(x) = A(μE1) − λ1 μ E1 E1^T E1 = μλ1 E1 − μλ1 E1 = 0,

since E1 is an eigenvector with eigenvalue λ1 and E1^T E1 = 1. Hence, it is easy to see that

max_{‖x‖=1} |⟨A2(x), x⟩| = max_{‖x‖=1, x∈W1} |⟨A2(x), x⟩|

where W1 is the set of vectors in ℜⁿ that are orthogonal to the eigenvector E1. Since this is a smaller set of vectors, the maximum we obtain will, of course, be possibly smaller. Hence, we know max_{‖x‖=1} |⟨A2(x), x⟩| ≥ max_{‖x‖=1, x∈W1} |⟨A2(x), x⟩|.
The rest of the argument, applied to the new operator A2, is identical. Thus, we find


|λ2| = ‖A2‖ = max_{‖x‖=1, x∈W1} |⟨A2(x), x⟩|.

Further, |λ2| ≤ |λ1| and the eigenvector E2 is orthogonal to E1 because it comes from W1.

Homework

Exercise 8.2.11

Exercise 8.2.12

Exercise 8.2.13

Exercise 8.2.14

Exercise 8.2.15

8.2.1.3 Eigenvalue Three

We can now find the next eigenvalue.

Theorem 8.2.4 The Third Eigenvalue

There is a third eigenvalue λ3 which satisfies |λ3| ≤ |λ2|. Let the new symmetric matrix A3 be defined by

A3 = A − λ1 E1 E1^T − λ2 E2 E2^T.

Then, |λ3| = ‖A3‖ and the associated eigenvector E3 is orthogonal to E1 and E2.

Proof 8.2.4
The reasoning here is virtually identical to what we did for the first two eigenvalues. First, note, if x is in the plane determined by E1 and E2, we have x = μ1E1 + μ2E2 for some constants μ1 and μ2, giving

A3(x) = A(μ1E1 + μ2E2) − λ1μ1 E1 E1^T E1 − λ2μ2 E2 E2^T E2
      = μ1λ1E1 + μ2λ2E2 − μ1λ1E1 − μ2λ2E2 = 0,

since E1 and E2 are orthonormal eigenvectors. Hence, it is easy to see that

max_{‖x‖=1} |⟨A3(x), x⟩| = max_{‖x‖=1, x∈W2} |⟨A3(x), x⟩|

where W2 is the set of vectors in ℜⁿ that are orthogonal to the plane determined by the first two eigenvectors. Since this is a smaller set of vectors, the maximum we obtain will, of course, be possibly smaller. Hence, we know max_{‖x‖=1} |⟨A3(x), x⟩| ≥ max_{‖x‖=1, x∈W2} |⟨A3(x), x⟩|.
The rest of the argument, applied to the new operator A3, is identical. Thus, we find

|λ3| = ‖A3‖ = max_{‖x‖=1, x∈W2} |⟨A3(x), x⟩|.


Further, |λ3| ≤ |λ2| and the eigenvector E3 is orthogonal to both E1 and E2 because it comes from W2.

Homework

Exercise 8.2.16

Exercise 8.2.17

Exercise 8.2.18

Exercise 8.2.19

Exercise 8.2.20

Since A is an n × n matrix, this process terminates after n steps. We can now state the full result.

Theorem 8.2.5 All Eigenvalues

There is a sequence of eigenvalues λi which satisfies |λ1| ≥ |λ2| ≥ . . . ≥ |λn| with associated eigenvectors Ei which form an orthonormal basis for ℜⁿ. Let A1 = A and, for each i, define the new matrix

A_{i+1} = A − Σ_{j=1}^{i} λj Ej Ej^T.

Then, |λi| = ‖Ai‖ and the associated eigenvector Ei is orthogonal to Ej for all j < i. Moreover, we can say

A = Σ_{j=1}^{n} λj Ej Ej^T.
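The deflation process in Theorem 8.2.5 is easy to mimic numerically. The sketch below uses eig on each deflated matrix to supply the maximizer, so it is a check of the theorem rather than an independent algorithm; the matrix A is a made-up example:

A = [1, 3; 3, 2];
Ai = A;                              % A_1 = A
for i = 1:size(A, 1)
  [V, D] = eig(Ai);
  [~, k] = max(abs(diag(D)));        % |lambda_i| = ||A_i|| for symmetric A_i
  lambda = D(k, k);
  E = V(:, k);                       % unit eigenvector
  fprintf('lambda_%d = %g\n', i, lambda);
  Ai = Ai - lambda * (E * E');       % deflate: A_{i+1} = A_i - lambda_i E_i E_i^T
end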

8.3 What Does This Mean?

Let's put all of this together. Our symmetric n × n matrix always has n real eigenvalues and the respective eigenvectors E1, . . . , En are mutually orthogonal; we normalize them so they form an orthonormal basis for ℜⁿ. We can use these eigenvectors to write A in a canonical form just like we did for the 2 × 2 case. We define the two n × n matrices

P = [ E1 E2 . . . En ]

which has transpose

P^T = [ E1^T
        E2^T
        ...
        En^T ]

We know the eigenvectors are mutually orthonormal, so we must have P^T P = I, P P^T = I and


P^T A P = [ λ1⟨E1, E1⟩ 0 . . . 0
            0 λ2⟨E2, E2⟩ . . . 0
            ...
            0 0 . . . λn⟨En, En⟩ ] = [ λ1 0 . . . 0
                                       0 λ2 . . . 0
                                       ...
                                       0 0 . . . λn ]

which can be rewritten as

A = P [ λ1 0 . . . 0
        0 λ2 . . . 0
        ...
        0 0 . . . λn ] P^T

And we have a nice representation for an arbitrary n × n symmetric matrix A. Now at this point, we know all the eigenvalues are real and we can write them in descending order of absolute value, but eigenvalues can be repeated and can even be zero. Also, we can represent the norm of the symmetric matrix A in terms of its eigenvalues.
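In MatLab/Octave, eig returns exactly this factorization, so the representation is easy to check (a sketch with a made-up A):

A = [1, 3; 3, 2];
[P, D] = eig(A);         % columns of P are orthonormal eigenvectors
disp(P' * P)             % the identity matrix
disp(P' * A * P)         % the diagonal matrix of eigenvalues
disp(norm(A - P*D*P'))   % essentially zero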

Theorem 8.3.1 The Eigenvalue Representation of the norm of a symmetric matrix

Let A be an n × n symmetric matrix with eigenvalues ordered so that |λ1| ≥ |λ2| ≥ . . . ≥ |λn|. Then

‖A‖ = |λ1| = max_{1≤i≤n} |λi|

This also implies

‖Ax‖ ≤ ‖A‖ ‖x‖

for all x.

Proof 8.3.1
We know ‖A‖ ≥ |⟨A E1, E1⟩| where E1 is the unit norm eigenvector corresponding to eigenvalue λ1. Thus,

‖A‖ ≥ |⟨λ1 E1, E1⟩| = |λ1|

Any x has representation x = Σ_{i=1}^n ci Ei in terms of the orthonormal basis of eigenvectors of A. If ‖x‖ = 1, this implies Σ_{i=1}^n ci² = 1. Then

|⟨Ax, x⟩| = | Σ_{i=1}^n Σ_{j=1}^n ci cj λi ⟨Ei, Ej⟩ |


= | Σ_{i=1}^n ci² λi | ≤ Σ_{i=1}^n ci² |λi| ≤ (1) |λ1|

Thus, ‖A‖ = max_{‖x‖=1} |⟨Ax, x⟩| ≤ |λ1|. Combining, we see ‖A‖ = |λ1|. Finally, for any x ≠ 0, ‖Ax‖ = ‖x‖ ‖A(x/‖x‖)‖ ≤ ‖A‖ ‖x‖, which is the second claim.

8.3.1 A Worked Out Example

Let A be defined to be

A = [ 1 3
      3 2 ]

Then to find the first eigenvalue and eigenvector, we need to find a certain maximum. We let

f(x, y) = [ x y ] [ 1 3
                    3 2 ] [ x
                            y ] = x² + 6xy + 2y²

and we want to find the maximum of |f(x, y)| subject to x² + y² = 1. To do this, let x = cos(θ) and y = sin(θ). Then our constrained optimization problem becomes: maximize |g(θ)| where

g(θ) = cos²(θ) + 6 cos(θ) sin(θ) + 2 sin²(θ)
     = 1 + 6 cos(θ) sin(θ) + sin²(θ)
     = 1 + 3 sin(2θ) + (1/2)(1 − cos(2θ))
     = (3/2) + 3 sin(2θ) − (1/2) cos(2θ).

over all θ ∈ [0, 2π]. The critical points here are the two endpoint values θ = 0 and θ = 2π and the points where g′(θ) = 0. We have

g′(θ) = 6 cos(2θ) + sin(2θ).

Hence, the derivative is zero when 6 cos(2θ) + sin(2θ) = 0, i.e. tan(2θ) = −6. Thus, 2θ = 1.7359, 4.8776 or 8.0192, which implies θ = 0.8680, 2.4388 or 4.0096. Since we will simply compare values, there is no need for second order tests; we'll just evaluate the function at all the critical points. We have g(0) = 1, g(2π) = 1, g(0.8680) = 4.5414, g(2.4388) = −1.5414 and g(4.0096) = 4.5414. The maximum of |g| occurs at θ1 = 0.8680 and has value 4.5414. This is our eigenvalue λ1. The maximum occurs at

E1 = [ cos(0.8680)
       sin(0.8680) ] = [ 0.64635
                         0.76304 ]

To find the second eigenvalue, we construct the new function h(x, y) defined by

h(x, y) = [ x y ] ( [ 1 3
                      3 2 ] − λ1 E1 E1^T ) [ x
                                             y ]

and maximize its absolute value over the vectors of length one. From our theoretical discussion, the extreme values here are found by maximizing f(x, y) over the vectors perpendicular to E1. These are the multiples of F given by

F = [ sin(θ1)
      −cos(θ1) ]


Hence, the extreme values occur at ±F. The value of f(x, y) at ±F is −1.5414, which in absolute value is 1.5414, so λ2 = −1.5414 and we choose E2 = F. These eigenvalues and their associated eigenvectors are exactly the same as the ones we would find by using the characteristic equation to find the eigenvalues and then solving the resulting eigenvector equations for a unit vector solution.
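We can check the worked out example against eig (a sketch):

A = [1, 3; 3, 2];
[P, D] = eig(A);
disp(diag(D))                      % -1.5414 and 4.5414
theta1 = 0.8680;
disp([cos(theta1); sin(theta1)])   % matches the eigenvector for 4.5414, up to sign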

8.3.2 Homework

Exercise 8.3.1

Exercise 8.3.2

Exercise 8.3.3

Exercise 8.3.4

Exercise 8.3.5

8.4 Signed Definite Matrices Again

An n × n symmetric matrix is said to be a positive definite matrix if x^T A x > 0 for all nonzero vectors x. We can't do our usual algebraic tricks now (thank goodness!) so let's sneak up on this a different way. Rewrite A using our representation to get x^T P D P^T x > 0, where D is the diagonal matrix having the eigenvalues λi along the main diagonal. Next, let y = P^T x. Then the inequality can be rewritten as

(P^T x)^T D P^T x = y^T D y > 0.

We can easily multiply this out to get λ1 y1² + λ2 y2² + . . . + λn yn² > 0. Since this is positive for all nonzero choices of the yi, set all but y1 to zero. Then, we have λ1 y1² > 0, which implies λ1 > 0. Similarly, setting all but y2 to zero, we find λ2 > 0. We can do this for all of the eigenvalues to conclude that A positive definite forces all the eigenvalues to be positive. Going the other direction, if all the eigenvalues were positive, that would force the matrix to be positive definite.
A similar argument shows that A is negative definite if and only if its eigenvalues are all negative. Note also, if we simply wanted x^T A x ≥ 0, we would say A is positive semidefinite, and our arguments would say this happens if and only if all the eigenvalues are nonnegative. And finally, if x^T A x ≤ 0, the matrix is negative semidefinite and all its eigenvalues are nonpositive.
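Since definiteness reduces to the signs of the eigenvalues, a numerical test is a one-liner (a sketch with a made-up symmetric test matrix):

A = [2, -1; -1, 2];       % a made-up symmetric test matrix
lams = eig(A);
disp(all(lams > 0))       % positive definite?
disp(all(lams < 0))       % negative definite?
disp(all(lams >= 0))      % positive semidefinite?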

8.4.1 Homework

Exercise 8.4.1

Exercise 8.4.2

Exercise 8.4.3

Exercise 8.4.4

Exercise 8.4.5


Chapter 9

Rotations and Orbital Mechanics

9.1 Introduction

It is easy to think that two dimensional vector spaces are somehow trivial and not worth a lot of our time. We think we can change your mind by looking at some problems in orbital mechanics. We are going to discuss how to write code to solve some problems in this field. This is a great example of how two dimensional vector spaces are used in practice. Our references for this material are

1. (Bate et al. (1) 1971): an old book on the basics behind interplanetary movement. It is a bit dated now.

2. (Thompson (15) 1961): a very nice book on the dynamics of space flight. It is old, but most of this kind of information was worked out very early, so don't be put off by the age of the text.

3. (Sutton and Biblarz (14) 2001): this is a standard senior level text in aerospace engineering undergraduate degree programs such as they have at the University of Illinois. It does not explain as much from the basics as (Bate et al. (1) 1971) and so it is a bit hard to read when you are starting out. But it has really interesting stuff on rocket design, which is dated of course. However, the basic ways to do rocket propulsion and the technology to implement it have not changed all that much.

4. (Roy (13) 1982): this is a book we used when we worked at Aerospace Corporation back in the day. It is a nice complement to (Bate et al. (1) 1971).

5. (Curtis (3) 2014): a much newer book which replaces (Bate et al. (1) 1971).

There are a number of three dimensional coordinate systems we need to pay attention to.

• The Heliocentric - Ecliptic coordinate system. The origin here is the center of the sun. The earth rotates around the sun and its position at any time is a point on a plane. The basis vectors for this plane are denoted by Xhe and Yhe, with the Xhe direction drawn from the center of the earth to the center of the sun on the first day of Spring. This is called the vernal equinox direction. The Yhe direction is then rotated 90° from Xhe in the direction of the rotation of the earth and, of course, Yhe is in the orbital plane determined by the movement of the earth around the sun. The right hand rule then determines the Zhe direction. Hence, we can write a vector P in the Heliocentric - Ecliptic system as

P = xhe Xhe + yhe Yhe + zhe Zhe

You can see this coordinate system in Figure 9.1.


Figure 9.1: Heliocentric - Ecliptic coordinate system. A vector P = xhe Xhe + yhe Yhe + zhe Zhe in the Heliocentric - Ecliptic coordinate system; the earth orbits around the sun in the Xhe − Yhe plane.

• The Geocentric - Equatorial coordinate system. The origin is now at the center of the earth. The equatorial circle determines a plane and we choose the unit vector for the I direction to be the vernal equinox direction we mentioned above. The J unit vector is then 90° from I, and its direction is determined by the fact that the unit vector orthogonal to the equatorial plane, K, is chosen to point toward the north pole. Hence, the right hand rule determines the orientation of J. We see J points east along the equator. A given vector P can then be written as

P = xge I + yge J + zge K

You can see this coordinate system in Figure 9.2.

• The Right Ascension - Declination coordinate system. This is similar to a spherical coordinate system. For any point in three space, draw a line from the center of the earth to the point. Drop a perpendicular from that point to the equatorial plane or, if the point is below the equatorial plane, the perpendicular will point up until it hits the equatorial plane. This projection determines a line in the equatorial plane whose initial point is the center of the earth and whose terminal point is the projection of the original point onto the equatorial plane. The angle from the vernal equinox line, which is the I direction, to this line is the angle α, which is called the right ascension. Note this is exactly the same as the angle we call θ in the usual spherical coordinate system. We don't use the angle we call φ in the spherical coordinate system though. Recall φ is the angle from the positive z axis to the line connecting the center of the xyz coordinate system to the tip of the point P in three dimensional space. There is another angle, measured up from the xy plane to this line, which is π/2 − φ. This angle is called the declination and is denoted by δ. So δ = π/2 corresponds to the north pole direction K and δ = −π/2 to the direction toward the south pole. The length of the line from the center of the earth to the point is given by R. Then a given vector P can be written as an ordered triple (R, α, δ). Note we are not writing P in terms of new unit vectors specialized

Figure 9.2: Geocentric - Equatorial coordinate system. A vector P in the Geocentric - Equatorial coordinate system; the earth rotates counterclockwise from I to J, but this coordinate system is fixed and does not rotate even though the earth rotates.

to the right ascension - declination coordinate system. You can see this coordinate system in Figure 9.3.

Figure 9.3: Right Ascension - Declination coordinate system. A vector P with length R, right ascension α and declination δ.

9.1.1 Homework

Exercise 9.1.1

Exercise 9.1.2

Exercise 9.1.3

Exercise 9.1.4


Exercise 9.1.5

9.2 Orbital Planes

The above coordinate systems are not specialized to the movement of a satellite in its orbit. The Perifocal coordinate system describes the position of points in space in terms of a coordinate system based on the orbital motion of the satellite. Let M be the mass of the earth and m the mass of the satellite. We will think of position and velocity as vectors in ℜ³ and so each position and associated velocity defines vectors

r(t) = [ x(t)
         y(t)
         z(t) ],    v(t) = r′(t) = [ x′(t)
                                     y′(t)
                                     z′(t) ],

and we can decide to express this information in any choice of coordinate system we wish. The vector r starts at the origin of the coordinate system and ends at the position of the satellite. So there is an obvious unit vector we can use here, E(t) = r(t)/‖r(t)‖2. The gravitational force between the earth and the satellite is modeled as if all the mass is concentrated at their individual centers and we can write down the usual gravitational force equations you know from basic physics. The force is proportional to the acceleration and is directed along the line between the earth and the satellite. The equation for the vector acceleration is then

d²r/dt² (t) = −( G(m + M) / ‖r(t)‖2² ) E(t) = −( G(m + M) / ‖r(t)‖2³ ) r(t)

where G is the universal gravitational constant. Since m is negligible compared to M, we approximate the vector acceleration by

d²r/dt² (t) = −( GM / ‖r(t)‖2³ ) r(t) = −( μ / ‖r(t)‖2³ ) r(t)

where μ = GM is called the gravitational parameter. To make our arguments a bit more concise, let's start using a common notation for ‖r‖2. This is often just written as |r(t)| even though we know we are not just taking an absolute value. For the rest of our arguments here, we will use this simpler notation. Hence, we have r″(t) = −(μ/|r(t)|³) r(t). We will also drop the (t) now: just remember it is still there!

9.2.1 Orbital Constants

Now let’s do some calculations: note v = r′ and r′′ = v′.

⟨v, v′⟩ = ⟨r′, r″⟩ = ⟨r′, −(μ/|r|³) r⟩ = ⟨v, −(μ/|r|³) r⟩

Thus

⟨v, v′⟩ + ⟨v, (μ/|r|³) r⟩ = 0

But we are using the ‖·‖2 norm here, which is the usual Euclidean norm in ℜ³. Now, it is easy to see

(|v|)′ = (1/2) (2v1v1′ + 2v2v2′ + 2v3v3′) / √(v1² + v2² + v3²) = ⟨v, v′⟩ / |v|


which implies |v| |v|′ = ⟨v, v′⟩. It is usual to find a way to distinguish between the scalar |v| and the vector v. We will let V = |v|. Hence, we have V V′ = ⟨v, v′⟩. We will also write R = |r| and a similar calculation shows R R′ = ⟨r, r′⟩. Thus,

⟨v, v′⟩ + ⟨v, (μ/|r|³) r⟩ = 0 ⟹ V V′ + (μ/R³) R R′ = V V′ + (μ/R²) R′ = 0

But

(1/2) (d/dt)(V²) = V V′ and (d/dt)(−μ/R) = (μ/R²) R′

Hence, we conclude

(d/dt)((1/2) V²) + (d/dt)(−μ/R) = 0

Our final equation is thus

(d/dt)( V²/2 − μ/R ) = 0

This says the term V²/2 − μ/R is a constant, which we call the specific mechanical energy and denote by E.

Next, since

d²r/dt² + (μ/|r|³) r = 0,

we can find the cross product of this with r. We find

r × r″ + (μ/|r|³) r × r = 0

Since r × r = 0, we have r × r″ = 0. By direct calculation, we can check

(d/dt)(r × r′) = r′ × r′ + r × r″

But r′ × r′ = 0, so

r × r″ = (d/dt)(r × r′)

We have shown

0 = r × r″ = (d/dt)(r × r′) = (d/dt)(r × v)

The angular momentum here is h = r × v. So we have shown the angular momentum h is constant. We let H = |h|. Thus, a satellite moves in an orbit around the center of the earth with constant angular momentum.
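These two conservation laws are easy to watch numerically. Here is a sketch using ode45 with a made-up gravitational parameter μ = 1 and made-up initial data; the energy V²/2 − μ/R and the angular momentum r × v stay (nearly) constant along the computed orbit:

mu = 1;                                        % made-up value for illustration
rhs = @(t, w) [w(4:6); -mu * w(1:3) / norm(w(1:3))^3];
w0 = [1; 0; 0; 0; 1.2; 0];                     % made-up initial r and v
[t, W] = ode45(rhs, [0, 20], w0);
R = sqrt(sum(W(:, 1:3).^2, 2));
V = sqrt(sum(W(:, 4:6).^2, 2));
E = V.^2 / 2 - mu ./ R;                        % specific mechanical energy
h = cross(W(:, 1:3), W(:, 4:6), 2);            % angular momentum along the orbit
disp(max(E) - min(E))                          % essentially zero
disp(max(h(:, 3)) - min(h(:, 3)))              % essentially zero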

9.2.1.1 Homework

Exercise 9.2.1

Exercise 9.2.2


Exercise 9.2.3

Exercise 9.2.4

Exercise 9.2.5

9.2.2 The Orbital Motion is a Conic

Since r and v determine the orbital plane, it is convenient to setup some conventions as shown inFigure 9.4.

Local Vertical

Local Horizontal

r

v

φ

γ

M

The radius vector from thecenter of the earth to thesatellite determines a localvertical and local horizon-tal. The angle between vand the local vertical is γand the angle between vand the local horizontal isφ.

Figure 9.4: The Orbital Plane

The flight path angle, φ, is the angle between the local horizontal and v. The zenith angle, γ, is the angle between v and the local vertical. We see γ = π/2 − φ.

9.2.2.1 Homework

Exercise 9.2.6

Exercise 9.2.7

Exercise 9.2.8

Exercise 9.2.9

Exercise 9.2.10

9.2.3 The Constant B Vector

We know the angular momentum h is a constant vector and so

H = |h| = |r × v| = R V sin(γ) = R V cos(φ)

Now we know r″ = −(μ/R³) r. Thus,

r″ × h = (μ/R³) h × r


Then, consider

(r′ × h)′ = r″ × h + r′ × h′ = r″ × h

as h is a constant vector. Combining, we have

(r′ × h)′ = (μ/R³) h × r

Next, we look at

(μ/R³) h × r = (μ/R³) ((r × v) × r)

You probably haven't worked much with all the identities associated with vector cross products, so let's go through this calculation.

r × v = det [ i  j  k
              r1 r2 r3
              v1 v2 v3 ] = (r2v3 − r3v2) i + (r3v1 − r1v3) j + (r1v2 − r2v1) k

Thus, after some simplifying we leave to you,

(r × v) × r = (−1) · det [ i  j  k
                           r1 r2 r3
                           (r2v3 − r3v2) (r3v1 − r1v3) (r1v2 − r2v1) ]
= −(r1r2v2 + r1r3v3 − r2²v1 − r3²v1) i − (r1r2v1 + r2r3v3 − r1²v2 − r3²v2) j − (r1r3v1 + r2r3v2 − r1²v3 − r2²v3) k

We can rewrite this as

(r × v) × r = −( r1(⟨r, v⟩ − r1v1) − v1(⟨r, r⟩ − r1²) ) i − ( r2(⟨r, v⟩ − r2v2) − v2(⟨r, r⟩ − r2²) ) j − ( r3(⟨r, v⟩ − r3v3) − v3(⟨r, r⟩ − r3²) ) k
= −( r1⟨r, v⟩ − v1⟨r, r⟩ ) i − ( r2⟨r, v⟩ − v2⟨r, r⟩ ) j − ( r3⟨r, v⟩ − v3⟨r, r⟩ ) k

We can reorganize the above to obtain our final formula:

(r × v) × r = −⟨r, v⟩ r + ⟨r, r⟩ v

Using this, we find

(μ/R³) h × r = (μ/R³) ( −⟨r, v⟩ r + ⟨r, r⟩ v ) = (μ/R³) ( −R R′ r + R² v ) = (μ/R) v − (μ/R²) R′ r

Recall, we know

(r′ × h)′ = (μ/R³) h × r = (μ/R) v − (μ/R²) R′ r


But,

μ (r/R)′ = μ ( r′/R − (R′/R²) r ) = μ ( v/R − (R′/R²) r )

This is what we had above! So we can say

(r′ × h)′ = μ (r/R)′

We conclude the derivative of r′ × h − μ(r/R) is the zero vector, so r′ × h − μ(r/R) is a constant vector. Hence, there is a constant vector B so that

r′ × h = μ (r/R) + B

9.2.3.1 Homework

Exercise 9.2.11

Exercise 9.2.12

Exercise 9.2.13

Exercise 9.2.14

Exercise 9.2.15

9.2.4 The Orbital Conic

We can use the equation giving us the constant B vector to derive the equation of motion we associate with the satellite's motion. Consider

⟨r, r′ × h⟩ = ⟨r, μ (r/R) + B⟩ = μ ⟨r, r⟩/R + ⟨r, B⟩ = μR + ⟨r, B⟩

B ×C = det

i j kB1 B2 B3

C1 C2 C3

= (B2C3 −B3C2)i+ (B3C1 −B1C3)j + (B1C2 −B2C1)k

and so

< A,B ×C > = A1(B2C3 −B3C2) +A2(B3C1 −B1C3) +A3(B1C2 −B2C1)

= C1(A2B3 −A3B2) + C2(A3B1 −A1B3) + C3(A1B2 −A2B1)

= < A×B,C >

So

< r, r′ × h > = < r × r′,h >=< h,h >= H 2

This means

H 2 = < r, r′ × h >= µR+ < r,B >= µR + RB cos(ν)
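You can spot check the identity ⟨A, B × C⟩ = ⟨A × B, C⟩ numerically (a sketch):

A = randn(3, 1); B = randn(3, 1); C = randn(3, 1);
disp(dot(A, cross(B, C)) - dot(cross(A, B), C))   % essentially zero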


where ν is the angle between r and B. Rewriting, we have

R = (H²/μ) / ( 1 + (B/μ) cos(ν) )

The scalar e = B/μ is called the eccentricity of the orbit and the vector e = B/μ is called the eccentricity vector. From what we know about the ellipse, the angle ν measures the angle between the current position vector and the direction of the point of the orbit closest to the focus. Hence, the constant vector B must point in the direction of periapsis, and therefore e points in the direction of periapsis.
We know μ e = B = r′ × h − μ (r/R). Thus,

μ e = v × (r × v) − μ r/R                (9.1)
    = ⟨v, v⟩ r − ⟨r, v⟩ v − μ r/R        (9.2)
    = (V² − μ/R) r − ⟨r, v⟩ v            (9.3)

where we expand the triple cross product just like we did the previous one. This equation tells us the direction of periapsis for the conic!
You are probably more used to the Cartesian coordinate version of an ellipse. Look at Figure 9.5.

where we expand the triple cross product just like we did the previous one. This equation tells us thedirection of periapsis for the conic!You are probably more used to the Cartesian Coordinate version of an ellipse. Look at Figure 9.5.

P

F1 F2x

y

cc

d2

d1

aa

p

The defining parametersfor an ellipse are d1 + d2 =2a and a2 = b2 + c2. Theeccentricity is e = c/a.

Figure 9.5: An Ellipse

There are two points called foci that are used to define an ellipse. These points are labeled F1 and F2 and are on the x axis, each a distance c from the origin. Hence, F1 = (−c, 0) and F2 = (c, 0). Let d1 be the distance from a point on the ellipse to focus F1 and d2 be the distance from a point on the ellipse to focus F2. The points on the ellipse must satisfy d1 + d2 = 2a, and we can use this to find certain relationships. At the top of the ellipse, we get d1 = d2 = a, which defines a right triangle as you can see in Figure 9.5. We let b be the distance from the center to the top of the ellipse: hence a² = b² + c². The ratio e = c/a is called the eccentricity, and clearly for an ellipse, 0 ≤ e < 1. The length of the vertical line through the focus F1 is 4p and half of this distance is called the latus rectum of the ellipse, which is thus 2p. The top part of the line defining the latus rectum gives us the


equation d1 + 2p = 2a, and the Pythagorean theorem for the triangle formed gives d1² = 4p² + 4c². Thus 4(a − p)² = 4p² + 4c², which implies a² − 2ap = c². Hence,

2ap = a² − c² = a²(1 − e²) ⟹ 2p = a(1 − e²)

With some work (and it is not so trivial!) we can derive the corresponding equation for an ellipse in polar coordinates. We will leave that to you. We find

R = 2p / ( 1 + e cos(ν) )
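A short sketch that plots the conic R(ν) = 2p/(1 + e cos(ν)) for a made-up p and several eccentricities; the focus sits at the origin:

p = 1;                                % made-up semi-parameter
nu = linspace(0, 2*pi, 400);
hold on
for e = [0, 0.3, 0.6]
  R = 2*p ./ (1 + e*cos(nu));
  plot(R .* cos(nu), R .* sin(nu));   % the orbit in the orbital plane
end
hold off
axis equal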

Comparing the equation of motion we found for the satellite, we see H²/μ = 2p = a(1 − e²) and B/μ = e. Now in the orbital plane

r(ν) = [ R cos(ν)
         R sin(ν) ] = [ 2p cos(ν)/(1 + e cos(ν))
                        2p sin(ν)/(1 + e cos(ν)) ]

Differentiating with respect to ν,

v(ν) = dr/dν = [ ( −2p sin(ν)(1 + e cos(ν)) + 2p e sin(ν) cos(ν) ) / (1 + e cos(ν))²
                 ( 2p cos(ν)(1 + e cos(ν)) + 2p e sin²(ν) ) / (1 + e cos(ν))² ]

Hence, at periapsis, ν = 0 and

rp = [ 2p/(1 + e)
       0 ],      vp = [ 0
                        2p/(1 + e) ]

and at apoapsis, ν = π,

ra = [ −2p/(1 − e)
       0 ],      va = [ 0
                        −2p/(1 − e) ]

Since angular momentum is constant on the orbit and the velocity is perpendicular to the radius vector at both periapsis and apoapsis, we know

H = R V cos(φ) = Rp Vp = Ra Va

where Rp = 2p/(1 + e) and Ra = 2p/(1 − e) are the periapsis and apoapsis radii. But this means the speeds there satisfy

Vp / Va = Ra / Rp = (1 + e) / (1 − e).

(Note the vectors vp and va above are derivatives with respect to ν; the physical velocities carry an extra factor of dν/dt, so Vp and Va are read off from H rather than from those formulas.)

9.2.4.1 Homework

Exercise 9.2.16

Exercise 9.2.17

Exercise 9.2.18

Exercise 9.2.19

Exercise 9.2.20


9.3 Three Dimensional Rotations

If we have a 3 × 3 symmetric matrix A, we know it has three mutually orthogonal eigenvectors corresponding to three real eigenvalues. Hence the unit eigenvectors form an orthonormal basis for ℜ³ and we can write

[ E1^T
  E2^T
  E3^T ] A [ E1 E2 E3 ] = [ λ1 0 0
                            0 λ2 0
                            0 0 λ3 ]

where E1, E2, E3 is an orthonormal basis for ℜ³ with corresponding eigenvalues λ1, λ2 and λ3. We know the rows and columns of the matrix P = [ E1 E2 E3 ] are mutually orthogonal. Hence, like what happened in the two dimensional case, P is often a rotation matrix which transforms an orthonormal system of mutually orthogonal axes labeled x, y and z into a new orthogonal system labeled x′, y′ and z′. Let's study some of these rotation matrices. There are three simple building blocks:

• The axis of rotation is the z axis and we rotate the x - y system about the z axis by θ degrees. The new system is thus x′, y′ and z.

• Now apply to this new system a rotation of φ degrees about the new y′ axis. So the x′ and z axes rotate into the new axes x′′ and z′. This gives a new orthogonal system with axes x′′, y′ and z′.

• Now apply to this new system a rotation of δ degrees about the new x′′ axis. This generates a new orthogonal system with axes x′′, y′′ and z′′.

Of course, we could organize three rotations in different ways, but to explain this it helps to have a concrete set of choices. We want to draw some of this to help you see it, so we use MatLab. It is a bit complicated to draw things in any programming language. Let's begin with drawing a simple vector using ThreeDToPlot.m shown below.

Listing 9.1: ThreeDToPlot.m

function [x, y, z] = ThreeDToPlot(A)
%
% A is the vector to plot using plot3
% x, y, z are the prepared vectors that plot3 uses
%
t = linspace(0, 1, 2);
VA = @(t) t*A;
for i = 1:2
  C = VA(t(i));
  x(i) = C(1);
  y(i) = C(2);
  z(i) = C(3);
end
end


This looks a little strange, but the plot3 MatLab function draws the line between two vectors A and B like this. If

$$\boldsymbol{A} = \begin{bmatrix} A_1 \\ A_2 \\ A_3 \end{bmatrix}, \quad \boldsymbol{B} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix}$$

then to draw the line between A and B, we collect x, y and z points as

$$x = \begin{bmatrix} A_1 \\ B_1 \end{bmatrix}, \quad y = \begin{bmatrix} A_2 \\ B_2 \end{bmatrix}, \quad z = \begin{bmatrix} A_3 \\ B_3 \end{bmatrix}$$

and then plot3 plots the line between the point (x(1), y(1), z(1)) and the point (x(2), y(2), z(2)). Since this is a pain to do for all the vectors we want to plot, we've rolled this into a convenience function so we don't have to type so much.

In addition, we need our rotation matrix functions, which we call RotateX.m, RotateY.m and RotateZ.m. Here is the code. The function RotateX rotates about the x axis by a positive angle θ to give the new coordinate system (x, y′, z′). This is a positive rotation, which means your thumb points in the direction of the +x axis and your fingers curl in the direction of positive θ. Hence, this is a counterclockwise (ccw) rotation from the +y axis giving the new coordinate system (x, y′, z′). The rotation matrix here is R_{x,θ} with

$$R_{x,\theta} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix} = \begin{bmatrix} 1 & \boldsymbol{0} \\ \boldsymbol{0} & R_\theta \end{bmatrix}$$

where $\boldsymbol{0}$ here is whatever size of zero vector or its transpose we need and $R_\theta$ is the usual $2 \times 2$ rotation matrix.

Listing 9.2: RotateX.m

function Rx = RotateX(theta)
% +theta means ccw which is what we usually want
% -theta means cw
ctheta = cos(theta);
stheta = sin(theta);
%
Rx = [1, 0, 0; 0, ctheta, -stheta; 0, stheta, ctheta];
end

The function RotateY rotates about the y axis by a positive angle θ to give the new coordinate system (x′, y, z′). The rotation matrix here is R_{y,θ} with

$$R_{y,\theta} = \begin{bmatrix} \cos(\theta) & 0 & -\sin(\theta) \\ 0 & 1 & 0 \\ \sin(\theta) & 0 & \cos(\theta) \end{bmatrix}$$

and you can clearly see $R_\theta$ buried inside this matrix.

Listing 9.3: RotateY.m

function Ry = RotateY(theta)
% +theta means cw
% -theta means ccw which is what we usually want
ctheta = cos(theta);
stheta = sin(theta);
%
Ry = [ctheta, 0, -stheta; 0, 1, 0; stheta, 0, ctheta];
end

The function RotateZ rotates about the z axis by a positive angle θ to give the new coordinate system (x′, y′, z). The rotation matrix here is R_{z,θ} with

$$R_{z,\theta} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & 0 \\ \sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and again you can clearly see $R_\theta$ buried inside this matrix.

Listing 9.4: RotateZ.m

function Rz = RotateZ(theta)
% +theta means ccw which is what we usually want
% -theta means cw
ctheta = cos(theta);
stheta = sin(theta);
Rz = [ctheta, -stheta, 0; stheta, ctheta, 0; 0, 0, 1];
end

9.3.1 Homework

Exercise 9.3.1

Exercise 9.3.2

Exercise 9.3.3

Exercise 9.3.4

Exercise 9.3.5

9.3.2 Drawing Rotations

The easiest way to understand the rotations is to look at some graphs. An even better way is to build a model out of cardboard: you should try it! We can draw the two sets of orthogonal axes we get from the application of rotations using the following code. This code is designed to work with drawing elliptical orbits, so we set up the size of the axis system based on the size of the ellipse we will want to draw. Since the apoapsis of the ellipse is R/(1 − e), where e is the eccentricity and R is the scaling factor, we use that to set the axis dimensions. Then later, any ellipse we draw will fit into the coordinate system nicely. The rest of the code simply draws straight lines and determines the line style of the drawn line.


Listing 9.5: DrawAxis.m

function DrawAxis(R, e, N, E1, E2, E3, A, style, color)
%
% Draw coordinate system
hold on;
Rmax = R/(1 - e);
% Graph the new axes
t = linspace(-1.5*Rmax, 1.5*Rmax, N);
W1 = E1;
vW1 = @(t) t*W1;
for i = 1:N
  s = t(i); D = vW1(s);
  x1(i) = D(1) + A(1);
  y1(i) = D(2) + A(2);
  z1(i) = D(3) + A(3);
end
%
W2 = E2;
vW2 = @(t) t*W2;
for i = 1:N
  s = t(i); D = vW2(s);
  x2(i) = D(1) + A(1);
  y2(i) = D(2) + A(2);
  z2(i) = D(3) + A(3);
end
%
W3 = E3;
vW3 = @(t) t*W3;
for i = 1:N
  s = t(i); D = vW3(s);
  x3(i) = D(1) + A(1);
  y3(i) = D(2) + A(2);
  z3(i) = D(3) + A(3);
end
if (style == 1)
  plot3(x1, y1, z1, '-', 'LineWidth', 2, 'Color', color);
  plot3(x2, y2, z2, '-', 'LineWidth', 2, 'Color', color);
  plot3(x3, y3, z3, '-', 'LineWidth', 2, 'Color', color);
elseif (style == 2)
  plot3(x1, y1, z1, 'o', 'LineWidth', 2, 'Color', color);
  plot3(x2, y2, z2, 'o', 'LineWidth', 2, 'Color', color);
  plot3(x3, y3, z3, 'o', 'LineWidth', 2, 'Color', color);
else
  disp('not one or two');
end
hold off;
end

Let’s look at a rotation about the +z axis of angle θ. The code is simple.


Listing 9.6: Rotation about the z axis

I = [1;0;0]; J = [0;1;0]; K = [0;0;1]; A = [0;0;0];
hold on;
DrawAxis(2, 0.6, 20, I, J, K, A, 1, 'black');
theta = pi/6;
Xp = RotateZ(theta)*I;
Yp = RotateZ(theta)*J;
Zp = RotateZ(theta)*K;
DrawAxis(2, 0.6, 20, Xp, Yp, Zp, A, 1, 'blue');
Xpn = RotateZ(-theta)*I;
Ypn = RotateZ(-theta)*J;
Zpn = RotateZ(-theta)*K;
DrawAxis(2, 0.6, 20, Xpn, Ypn, Zpn, A, 2, 'red');
xlabel('x');
ylabel('y');
zlabel('z');
hold off

In Figure 9.6, we look down on the plot so that the +z axis is pointing straight up and cannot be seen. The positive θ rotation moves the x axis counterclockwise. On the other hand, the negative rotation by −θ is shown with axes plotted by circles, and it moves the +x axis clockwise. So a positive rotation about the +z axis means the 2 × 2 rotation matrix seen inside should be R_θ. The standard way to pick the kind of rotation we want is to define a positive rotation about a positive axis as follows: line the thumb of your right hand up along the positive axis. Then your fingers will curl in the direction of what we call a positive rotation angle. So positive rotations about the +z axis give the usual embedded R_θ part of the three dimensional rotation matrix.
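As a quick numerical sanity check (a small sketch using the RotateZ function above), we can verify that a positive rotation by π/2 about the +z axis carries the +x axis into the +y axis:

% a positive z rotation by pi/2 sends +x to +y
I = [1;0;0]; J = [0;1;0];
Xp = RotateZ(pi/2)*I;
disp(norm(Xp - J));   % zero up to roundoff, about 1e-16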

Figure 9.6: A positive and negative rotation about the z axis


Homework

Exercise 9.3.6

Exercise 9.3.7

Exercise 9.3.8

Exercise 9.3.9

Exercise 9.3.10

Next, look at a rotation about the +y axis of angle θ. The code is simple.

Listing 9.7: Rotation about the y axis

I = [1;0;0]; J = [0;1;0]; K = [0;0;1]; A = [0;0;0];
hold on;
DrawAxis(2, 0.6, 20, I, J, K, A, 1, 'black');
theta = pi/6;
Xp = RotateY(theta)*I;
Yp = RotateY(theta)*J;
Zp = RotateY(theta)*K;
DrawAxis(2, 0.6, 20, Xp, Yp, Zp, A, 1, 'blue');
Xpn = RotateY(-theta)*I;
Ypn = RotateY(-theta)*J;
Zpn = RotateY(-theta)*K;
DrawAxis(2, 0.6, 20, Xpn, Ypn, Zpn, A, 2, 'red');
xlabel('x');
ylabel('y');
zlabel('z');
hold off

In Figure 9.7, we look down on the plot so that the +y axis is pointing straight up and cannot be seen. The positive θ rotation moves the z axis counterclockwise and the +x axis clockwise. This corresponds to a rotation by −θ. So a positive rotation about the +y axis means the 2 × 2 rotation matrix seen inside should be R_{−θ}. Hence, if we want the +x axis to rotate counterclockwise, we would apply R_θ.

Homework

Exercise 9.3.11

Exercise 9.3.12

Exercise 9.3.13

Exercise 9.3.14

Exercise 9.3.15

Figure 9.7: A positive and negative rotation about the y axis

Finally, look at a rotation about the +x axis of angle θ. The code is simple.

Listing 9.8: Rotation about the x axis

I = [1;0;0]; J = [0;1;0]; K = [0;0;1]; A = [0;0;0];
hold on;
DrawAxis(2, 0.6, 20, I, J, K, A, 1, 'black');
theta = pi/6;
Xp = RotateX(theta)*I;
Yp = RotateX(theta)*J;
Zp = RotateX(theta)*K;
DrawAxis(2, 0.6, 20, Xp, Yp, Zp, A, 1, 'blue');
Xpn = RotateX(-theta)*I;
Ypn = RotateX(-theta)*J;
Zpn = RotateX(-theta)*K;
DrawAxis(2, 0.6, 20, Xpn, Ypn, Zpn, A, 2, 'red');
xlabel('x');
ylabel('y');
zlabel('z');
hold off

In Figure 9.8, we look down on the plot so that the +x axis is pointing up at almost a vertical angle. The positive θ rotation moves the z axis counterclockwise and the +y axis counterclockwise. This corresponds to a rotation by θ. So a positive rotation about the +x axis means the 2 × 2 rotation matrix seen inside should be R_θ.

Homework

Exercise 9.3.16

Exercise 9.3.17

Exercise 9.3.18

Exercise 9.3.19


Figure 9.8: A positive and negative rotation about the x axis

Exercise 9.3.20

9.3.3 Rotated Ellipses

For a given ellipse in the x − y plane, we can rotate it into a given orbital plane using various rotations. The general formula for an ellipse is r = ρ/(1 + e cos(t)) and we use that to set up the plot. In this code, we make sure we move in the counterclockwise direction for the +x axis when we do a rotation with respect to the y axis by explicitly using RotateY(-phi). We show the resulting plot in Figure 9.9, but be warned these plots are hard to see! We rotated it around until you can sort of see the rotations, but with three axis rotations, this gets hard fast.

Listing 9.9: DrawEllipse.m

function DrawEllipse(rho, e, theta, phi, delta)
% draw orbit in the new x-y coordinate system
Phi = linspace(0, 2*pi, 51);
R = @(t) rho/(1 + e*cos(t));
u = @(t) R(t)*cos(t);
v = @(t) R(t)*sin(t);
X = @(t) [u(t); v(t); 0];
OP = @(t) RotateZ(theta)*X(t);
OPP = @(t) RotateY(-phi)*OP(t);
OPPP = @(t) RotateX(delta)*OPP(t);
conic = zeros(3, 51);
for i = 1:51
  t = Phi(i);
  conic(:,i) = OPPP(t);
end
x1 = conic(1,:);
x2 = conic(2,:);
x3 = conic(3,:);
hold on;
plot3(x1, x2, x3, '-', 'LineWidth', 3, 'Color', 'green');
hold off;
end

Figure 9.9: A Rotated Ellipse

9.3.3.1 Homework

Exercise 9.3.21

Exercise 9.3.22

Exercise 9.3.23

Exercise 9.3.24

Exercise 9.3.25


9.4 Drawing the Orbital Plane

Now let's put it all together and draw an orbital plane somewhat carefully. We do this by gluing together the various pieces of code we have written and we include a few functions we have not discussed. However, the code is available for viewing and it is easy enough to figure it out after all we have discussed so far.

Listing 9.10: DrawOrbit.m

function DrawOrbit(R, e, N, theta, phi, delta, E1, E2, E3, A, style, color)
%
% Draw coordinate system
clf;
hold on;
% Graph the original axes
DrawAxisGrid(R, e, N);
DrawAxis(R, e, N, E1, E2, E3, A, 1, color);
% Draw rotated system fixing z
Xp = RotateZ(theta)*E1;
Yp = RotateZ(theta)*E2;
Zp = E3;
%DrawAxis(Xp, Yp, Zp, A, 'blue');
Xpp = RotateY(-phi)*Xp;
Ypp = Yp;
Zpp = RotateY(-phi)*Zp;
%DrawAxis(Xpp, Ypp, Zpp, A, 'green');
Xppp = Xpp;
Yppp = RotateX(delta)*Ypp;
Zppp = RotateX(delta)*Zpp;
DrawAxis(R, e, N, Xppp, Yppp, Zppp, A, 1, 'red');
DrawEllipse(R, e, theta, phi, delta);
DrawAxisGrid(R, e, N);
xlabel('x');
ylabel('y');
zlabel('z');
hold off;
end

To run this code, we use the command

Listing 9.11: Plotting an Orbital Plane with Rotated Axes

DrawOrbit(2, 0.6, 20, pi/6, 3*pi/4, 3*pi/4, I, J, K, A, 1, 'black');

The plot we generate can be manipulated to get the image in Figure 9.10. If you look closely, you can see the orbital plane crosses the equatorial plane, and the details of this crossing are important to understanding this orbit. If you comment out the code line in DrawOrbit.m which generates the rotated axes, you can see this a little more clearly, as we show in Figure 9.11. The equatorial plane is drawn heavily crosshatched and the part of the orbital plane above the equatorial plane is reasonably easy to see.

Homework


Figure 9.10: A Rotated Ellipse

Exercise 9.4.1

Exercise 9.4.2

Exercise 9.4.3

Exercise 9.4.4

Exercise 9.4.5

9.4.1 The Perifocal Coordinate System

The position vector r and its velocity vector v determine a plane whose normal vector is h. The closest point on the orbit to the center is called the periapsis while the point farthest away is called the apoapsis. We let x_ω denote the axis pointing towards periapsis and y_ω is the axis rotated 90° from x_ω in the direction of the orbital motion. Hence, the position of the satellite in the orbit is given by an ordered pair (x, y). We let P and Q be unit vectors along the x_ω and y_ω axes respectively. Of course, P = e/E. Then, if the position of the satellite in the orbit is denoted by X, then X has a unique representation in the two dimensional vector space determined by the orthonormal basis P, Q given by X = xP + yQ. The positive z_ω axis is determined by the right hand rule and the unit vector along z_ω is called W. Of course, a vector aP + bQ + cW in this system needs to be converted to coordinates in other systems. You can see this coordinate system in Figure 9.12.

A vector X in the Perifocal coordinate system. X marks the position of the satellite in its orbit. The unit vector P is along the X_ω axis, the unit vector Q is along the Y_ω axis and the unit vector W lies on the Z_ω axis. The I − J − K axes for the Geocentric - Equatorial and Right Ascension - Declination coordinate systems are shown for reference.

Figure 9.12: The Perifocal coordinate system
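Since P, Q and W are orthonormal, converting perifocal coordinates to I − J − K coordinates is just a change of basis. Here is a minimal MATLAB sketch of the idea; the coordinates a, b, c and the matrix name M are ours for illustration, and we assume P, Q and W have already been computed as 3 × 1 column vectors expressed in the I − J − K frame:

% change of basis: perifocal (PQW) coordinates to IJK coordinates
a = 1; b = 2; c = 0;   % hypothetical perifocal coordinates of a vector
M = [P, Q, W];         % columns are the perifocal basis vectors
Xijk = M*[a; b; c];    % the vector a*P + b*Q + c*W in IJK coordinates
% since M is orthogonal, the inverse map is just the transpose
Xpqw = M'*Xijk;        % recovers [a; b; c]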

9.4.1.1 Homework

Exercise 9.4.6

Exercise 9.4.7

Exercise 9.4.8


Figure 9.11: The Orbital Plane Crosses the Equatorial Plane

Exercise 9.4.9

Exercise 9.4.10

9.4.2 Orbital Elements

The orbital elements of an orbit are enough to determine completely the look of the orbit. Let's describe them carefully.

• One half of the major axis of an ellipse is called a. You should look at Figure 9.5 to refresh your mind about this. From the origin of the ellipse at, say, (0, 0), the point (c, 0) is the focal point F2. The point closest to this focus is (a, 0) and so the distance from F2 to the periapsis is a − c.

• The eccentricity of the elliptical orbit is called E and often just e. We use a boldface e to denote the eccentricity vector, and the length of e, which is traditionally just denoted by e, looks a lot like e here. So to be clear, we will let the eccentricity be called E, although sometimes in code we just use e for convenience.

• The inclination angle is the angle between the unit vector K and the angular momentum vector h. Hence, if we know r = R₁I + R₂J + R₃K and v = V₁I + V₂J + V₃K, we can calculate

$$\boldsymbol{h} = \boldsymbol{r} \times \boldsymbol{v} = \det \begin{bmatrix} \boldsymbol{I} & \boldsymbol{J} & \boldsymbol{K} \\ R_1 & R_2 & R_3 \\ V_1 & V_2 & V_3 \end{bmatrix}$$

Once h has been calculated, we can use it to find the line of nodes vector. The orbital plane crosses the equatorial plane except in degenerate cases such as an equatorial orbit. Now the normal to the orbital plane is h. When the orbital plane crosses the equatorial plane, at those points, the line of intersection is in the equatorial plane and so must be perpendicular to its normal K too. So the line of intersection of the orbital plane and the equatorial plane is orthogonal to both h and K. The cross product then determines a vector called the node vector which is the direction vector of the line of nodes. We let n denote this vector and we see n = K × h.

• Since the line of nodes is in the equatorial plane, we can measure the angle between the line of nodes vector and the I vector. This angle is called the longitude of the ascending node, Ω, and we see Ω = cos⁻¹(⟨n, I⟩/‖n‖).

• The argument of periapsis is the angle in the orbital plane measured from the line of nodes to the periapsis vector. We denote this by ω.

• The time of periapsis passage is the time, T , when the satellite was at periapsis.

In Figure 9.13, we try to show you these in an easy to follow diagram.

9.4.2.1 Homework

Exercise 9.4.11

Exercise 9.4.12

Exercise 9.4.13

Exercise 9.4.14

Exercise 9.4.15


The orbital elements for a given orbit. All of these are straightforward to calculate given a measurement of r and v in the IJK system.

Figure 9.13: The Orbital Elements

9.5 Drawing Orbital Plane In MatLab

Now let’s find and graph the orbital plane for a specific example and try to use our knowledge oforbital mechanics along the way. We will explicitly calculate the P , Q and W orthonormal basisvectors of the perifocal coordinate system and the line of nodes vector n. Once that is done, we canplot these vectors as lines in 3D space. So we will not use rotation matrices to do the plots here. Weassume we are given the position vector r0 and velocity vector v0 of the satellite at the initial time0. These are given as vectors in the equatorial plane coordinate system so each has an I , J and Kcomponent. We need a cross product function, so for experience, let’s build one.

Listing 9.12: CrossProduct.m

function C = CrossProduct(A, B)
% A is a 3D vector
% B is a 3D vector
% returns C, the cross product of A and B
%
% AxB = det([I, J, K; A(1), A(2), A(3); B(1), B(2), B(3)])
C = [A(2)*B(3) - A(3)*B(2); A(3)*B(1) - A(1)*B(3); A(1)*B(2) - A(2)*B(1)];
end

First, we set up the standard equatorial frame unit vectors and find the angular momentum vector.


Listing 9.13: Find Angular Momentum Vector

% set standard I, J, K vectors
I = [1;0;0]; J = [0;1;0]; K = [0;0;1];
R0 = norm(r0);
V0 = norm(v0);
H = CrossProduct(r0, v0);
h = norm(H);

Now we can find H²/µ, the eccentricity vector and its norm.

Listing 9.14: Find the eccentricity vector

p = h^2;
% find the eccentricity vector
E = (1/mu)*((V0^2 - mu/R0)*r0 - dot(r0, v0)*v0);
% find the eccentricity
e = norm(E);

Then find the inclination angle, i, the line of nodes vector n and the longitude of the ascending node, Ω.

Listing 9.15: Find the inclination, line of nodes and Ω

% find the inclination
inc = acos(dot(H, K)/h);
% find the line of nodes vector
N = CrossProduct(K, H);
n = norm(N);
% find Omega
Omega = acos(dot(N, I)/n);

Then find the argument of periapsis, ω and the true anomaly, ν0.

Listing 9.16: Find the argument of periapsis and true anomaly

% find omega
omega = acos(dot(N, E)/(n*e));
% find the true anomaly
nu0 = acos(dot(E, r0)/(e*R0));
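One caveat worth flagging: acos always returns an angle in [0, π], so the formulas in the last few listings can land in the wrong quadrant for some orbits. A standard fix from the orbital mechanics literature, sketched here under the assumption that the variables above are in scope, is to check the sign of an appropriate component and reflect the angle when needed:

% resolve the acos quadrant ambiguities
if N(2) < 0            % node vector has negative J component
  Omega = 2*pi - Omega;
end
if E(3) < 0            % eccentricity vector points below the equatorial plane
  omega = 2*pi - omega;
end
if dot(r0, v0) < 0     % satellite is moving from apoapsis toward periapsis
  nu0 = 2*pi - nu0;
end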

Now we can get the orbital plane.

Listing 9.17: Find Perifocal Basis

% unit normal to the orbital plane
W = H/h;
% get the perigee direction vector in the orbital plane
P = E/e;
Q = CrossProduct(W, P);

Now get the semimajor axis.

Listing 9.18: Find the semimajor axis a

rp = (h^2)/(1 + e);
ra = (h^2)/(1 - e);
a = (rp + ra)/2.0;

Next, plot the standard geocentric - equatorial and perifocal coordinate axes. We start holding the plots now.

Listing 9.19: Plot The Geocentric and Perifocal Axes

hold on
% Draw IJK
[x, y, z] = ThreeDToPlot(ra*I);
plot3(x, y, z, 'linewidth', 2, 'color', 'black');
[x, y, z] = ThreeDToPlot(ra*J);
plot3(x, y, z, 'linewidth', 2, 'color', 'black');
[x, y, z] = ThreeDToPlot(ra*K);
plot3(x, y, z, 'linewidth', 2, 'color', 'black');
% Draw PQW
[x, y, z] = ThreeDToPlot(ra*P);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(-ra*P);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(ra*Q);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(-ra*Q);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(ra*W);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');

Draw the line of nodes.

Listing 9.20: Draw the line of nodes

[x, y, z] = ThreeDToPlot(ra*N);
plot3(x, y, z, 'linewidth', 2, 'color', 'green');
[x, y, z] = ThreeDToPlot(-ra*N);
plot3(x, y, z, 'linewidth', 2, 'color', 'green');

Plot the orbit and stop the hold on plots.


Listing 9.21: Plot the orbit

% Draw orbit
OrbitSize = 80;
rorbitplane = zeros(OrbitSize, 3);
nu = linspace(0, 2*pi, OrbitSize);
for i = 1:OrbitSize
  c = cos(nu(i)); s = sin(nu(i));
  rorbit = h^2/(1 + e*c);
  rorbitplane(i,:) = rorbit*c*P + rorbit*s*Q;
  T = rorbitplane(i,:);
  x(i) = T(1);
  y(i) = T(2);
  z(i) = T(3);
end
plot3(x, y, z, 'linewidth', 2, 'color', 'blue');
%
xlabel('x');
ylabel('y');
zlabel('z');
hold off

Print out the orbital elements.

Listing 9.22: Print The Orbital Elements

disp(sprintf('Omega: longitude of the ascending node = %8.4f', Omega*180/pi));
disp(sprintf('omega: argument of periapsis = %8.4f', omega*180/pi));
disp(sprintf('nu0: the true anomaly = %8.4f', nu0*180/pi));
disp(sprintf('inclination = %8.4f', inc*180/pi));
disp(sprintf('a: semimajor axis = %8.4f', a));
disp(sprintf('p = h^2/mu = %8.4f', p));
disp(sprintf('e: the eccentricity = %8.4f', e));
disp(sprintf('h: the angular momentum = %8.4f', h));

The full code is below. With some practice, you can read code documented this way pretty easily. Notice we try to add comment lines that are short but informative before most of the things we try to do.

Listing 9.23: DrawOrbitalPlane.m

function [omega, Omega, nu0, a, p, e, E, inc, P, Q, W, h, H, N] = DrawOrbitalPlane(r0, v0)
%
% here mu is normalized to 1
% r0 is the position vector at time 0
% v0 is the velocity vector at time 0
% Omega is the longitude of the ascending node
%   which determines the line of nodes direction
% omega is the argument of periapsis
% nu0 is the true anomaly
% p = h^2/mu = h^2
% E is the eccentricity vector
% N is the line of nodes vector
% P, Q, W is the perifocal basis
% H is the angular momentum vector
% inc is the inclination
%
% set standard I, J, K vectors
clf;
mu = 1;
I = [1;0;0]; J = [0;1;0]; K = [0;0;1];
% Find the angular momentum vector
R0 = norm(r0);
V0 = norm(v0);
H = CrossProduct(r0, v0);
h = norm(H);
p = h^2;
% find the eccentricity vector
E = (1/mu)*((V0^2 - mu/R0)*r0 - dot(r0, v0)*v0);
% find the eccentricity
e = norm(E);
% find the inclination
inc = acos(dot(H, K)/h);
% find the line of nodes vector
N = CrossProduct(K, H);
n = norm(N);
% find Omega
Omega = acos(dot(N, I)/n);
% find omega
omega = acos(dot(N, E)/(n*e));
% find the true anomaly
nu0 = acos(dot(E, r0)/(e*R0));
% to get the orbital plane:
% unit normal to the orbital plane
W = H/h;
% get the perigee direction vector in the orbital plane
P = E/e;
Q = CrossProduct(W, P);
rp = (h^2)/(1 + e);
ra = (h^2)/(1 - e);
a = (rp + ra)/2.0;
hold on
% Draw IJK
[x, y, z] = ThreeDToPlot(ra*I);
plot3(x, y, z, 'linewidth', 2, 'color', 'black');
[x, y, z] = ThreeDToPlot(ra*J);
plot3(x, y, z, 'linewidth', 2, 'color', 'black');
[x, y, z] = ThreeDToPlot(ra*K);
plot3(x, y, z, 'linewidth', 2, 'color', 'black');
% Draw PQW
[x, y, z] = ThreeDToPlot(ra*P);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(-ra*P);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(ra*Q);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(-ra*Q);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
[x, y, z] = ThreeDToPlot(ra*W);
plot3(x, y, z, 'linewidth', 2, 'color', 'red');
% Draw line of nodes
[x, y, z] = ThreeDToPlot(ra*N);
plot3(x, y, z, 'linewidth', 2, 'color', 'green');
[x, y, z] = ThreeDToPlot(-ra*N);
plot3(x, y, z, 'linewidth', 2, 'color', 'green');
% Draw orbit
OrbitSize = 80;
rorbitplane = zeros(OrbitSize, 3);
nu = linspace(0, 2*pi, OrbitSize);
for i = 1:OrbitSize
  c = cos(nu(i)); s = sin(nu(i));
  rorbit = h^2/(1 + e*c);
  rorbitplane(i,:) = rorbit*c*P + rorbit*s*Q;
  T = rorbitplane(i,:);
  x(i) = T(1);
  y(i) = T(2);
  z(i) = T(3);
end
plot3(x, y, z, 'linewidth', 2, 'color', 'blue');
%
xlabel('x');
ylabel('y');
zlabel('z');
axis on
grid on
box on;
hold off
disp(sprintf('Omega: longitude of the ascending node = %8.4f', Omega*180/pi));
disp(sprintf('omega: argument of periapsis = %8.4f', omega*180/pi));
disp(sprintf('nu0: the true anomaly = %8.4f', nu0*180/pi));
disp(sprintf('inclination = %8.4f', inc*180/pi));
disp(sprintf('a: semimajor axis = %8.4f', a));
disp(sprintf('p = h^2/mu = %8.4f', p));
disp(sprintf('e: the eccentricity = %8.4f', e));
disp(sprintf('h: the angular momentum = %8.4f', h));
end

Running this code generates a graph which you can grab with your mouse and rotate around to get it where you want it. Let's try it.

Listing 9.24: Runtime

r0 = [1.299; 0.75; 0.0]
r0 =
   1.29900
   0.75000
   0.00000

v0 = [.35; 0.61; 0.707]
v0 =
   0.35000
   0.61000
   0.70700

[omega, Omega, nu0, a, p, e, E, inc, P, Q, W, h, H, N] = DrawOrbitalPlane(r0, v0);
Omega: longitude of the ascending node =  30.0007
omega: argument of periapsis =  94.9983
nu0: the true anomaly =  94.9983
inclination =  63.4500
a: semimajor axis =   2.9506
p = h^2/mu =   1.4054
e: the eccentricity =   0.7237
h: the angular momentum =   1.1855

You can see our chosen snapshot of this graph in Figure 9.14. The line of nodes is plotted in green, the P, Q and W perifocal coordinate system is plotted in red and the geocentric - equatorial I, J, K coordinate system in black. It is hard to get a nice plot of this stuff, but as you can see, with a bit of work we can generate these automatically.

Figure 9.14: The orbit for r0 = [1.299; 0.75; 0.0], v0 = [0.35; 0.61; 0.707]

Here is another one:

Listing 9.25: Runtime

r0 = [1.299; 0.75; 1.2]
r0 =
   1.29900
   0.75000
   1.20000

v0 = [.35; 0.61; 0.707]
v0 =
   0.35000
   0.61000
   0.70700

[omega, Omega, nu0, a, p, e, E, inc, P, Q, W, h, H, N] = DrawOrbitalPlane(r0, v0);
Omega: longitude of the ascending node =  22.2954
omega: argument of periapsis =  74.2970
nu0: the true anomaly = 135.9555
inclination =  45.4007
a: semimajor axis =  19.9894
p = h^2/mu =   0.5578
e: the eccentricity =   0.9859
h: the angular momentum =   0.7469

The resulting orbit is seen in Figure 9.15. However, you can't see where the orbit crosses the equatorial plane very well because rp = 0.2827 while ra = 39.698, and so the orbit just looks like it starts at the origin.
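One small tweak (not part of the listings above) makes the crossing easier to see: zoom the axes in near the origin after the plot is drawn, for instance

% zoom in near the origin so the equatorial crossing is visible
axis([-2 2 -2 2 -2 2]);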

Figure 9.15: The orbit for r0 = [1.299; 0.75; 1.2], v0 = [0.35; 0.61; 0.707]

9.5.1 Homework

Exercise 9.5.1

Exercise 9.5.2


Exercise 9.5.3

Exercise 9.5.4

Exercise 9.5.5


Chapter 10

Determinants and Matrix Manipulations

It is time for you to look at a familiar tool more carefully. Behold, determinants!

10.1 Determinants

We have already found the determinant of a 6 × 6 coefficient matrix that arose from a model of LRRK2 mutations. We know you have all done such calculations before so there was nothing new there. But you probably have not discussed the determinant very carefully, which is what we will do now. We start with the determinant of a 3 × 3 matrix to set the stage. We let A be defined by

$$A = \begin{bmatrix} V_{11} & V_{12} & V_{13} \\ V_{21} & V_{22} & V_{23} \\ V_{31} & V_{32} & V_{33} \end{bmatrix}$$

Now think about the matrix A as made up of row vectors. Let these be denoted by V1, V2 and V3. We want a determinant function that acts on these three arguments and gives us back a real number. For the moment, call this function Φ. Then Φ(V1, V2, V3) is a number and we want Φ to satisfy the properties our usual 2 × 2 determinant function does. Let B be a 2 × 2 matrix with row vectors W1 and W2 which also has the form

$$B = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}$$

We know det(B) = W11W22 − W12W21. So another way of saying this is

$$\det(W_1, W_2) = W_{11}W_{22} - W_{12}W_{21}$$

What basic properties does our familiar determinant have?

P1: det is homogeneous: that is, constants come out: What if we multiply a row by a constant c? Note

$$\det(W_1, cW_2) = W_{11}\,cW_{22} - W_{12}\,cW_{21} = c\,\det(W_1, W_2).$$

A similar calculation shows that det(cW1, W2) = c det(W1, W2).


P2: det doesn’t change if we add rows : What is we add row two to row one? We have

det(W1 +W2,W2) = (W11 +W21)W22 − (W12 +W22)W21 = W11cW22 −W12cW21

det(W1,W2).

A similar computation shows that adding row one to row two doesn’t change the determinant.

P3: The determinant of the identity is one: This is an easy computation; det(I) = 1 × 1 = 1.

Let’s assume we can find a function Φ defined on n × n matrices which we will think of as afunction defined on n row vectors of size n that satisfies Properties P1, P2 and P3. Let the rowvectors now beW1 throughWn. We assume Φ satisfies

P1: Φ is homogeneous: that is, constants come out of slots:

$$\Phi(W_1, W_2, \ldots, cW_i, \ldots, W_n) = c\,\Phi(W_1, W_2, \ldots, W_n). \tag{10.1}$$

P2: Φ doesn't change if we add rows: As long as j ≠ i,

$$\Phi(W_1, \ldots, W_i + W_j, \ldots, W_n) = \Phi(W_1, \ldots, W_n). \tag{10.2}$$

P3:

$$\Phi(I) = 1. \tag{10.3}$$
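Before we go on, it is reassuring to check numerically that MatLab's built in det behaves like a Φ satisfying these properties. This is just a quick sketch on one random matrix, not a proof:

% check P1, P2 and P3 numerically for MatLab's det
A = rand(4,4); c = 3;
B = A; B(2,:) = c*A(2,:);         % P1: scale row 2 by c
disp(det(B) - c*det(A));          % zero up to roundoff
C = A; C(1,:) = A(1,:) + A(3,:);  % P2: add row 3 to row 1
disp(det(C) - det(A));            % zero up to roundoff
disp(det(eye(4)) - 1);            % P3: exactly 0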

10.1.1 Consequences One

Using these properties, we can figure out that others hold.

P4: If a row is zero, then Φ is zero: This follows from Property 10.1 by just setting c = 0.

If two rows are the same, then Φ is zero: This follows from Property 10.2. For convenience, assume row one and row two are equal. We have

$$\Phi(W_1, W_1, W_3, \ldots, W_n) = \Phi(W_1 - W_1, W_1, W_3, \ldots, W_n) = \Phi(0, W_1, W_3, \ldots, W_n) = 0$$

P5: Replacing a row by adding a multiple of another row doesn't change the determinant: This uses our properties: the case c = 0 is obvious, so let's just assume c ≠ 0. Using Property 10.1, we have for j ≠ i,

$$\Phi(W_1, \ldots, W_i + cW_j, \ldots, W_n) = (1/c)\,\Phi(W_1, \ldots, W_i + cW_j, \ldots, cW_j, \ldots, W_n)$$

But we now have a matrix whose jth row is cWj. Hence, using Property 10.2 (and the two rows the same argument above), removing that row from row i does not change the value of Φ. So we have

$$\Phi(W_1, \ldots, W_i + cW_j, \ldots, W_n) = (1/c)\,\Phi(W_1, \ldots, W_i + cW_j, \ldots, cW_j, \ldots, W_n)$$
$$= (1/c)\,\Phi(W_1, \ldots, W_i, \ldots, cW_j, \ldots, W_n) = \Phi(W_1, \ldots, W_i, \ldots, W_j, \ldots, W_n)$$

where we use Property 10.1 again.

P6: If the rows of B are linearly dependent, then Φ is zero: Let's do this for the case that W1 can be written in terms of the other n − 1 vectors. You'll get the gist of the argument and then you'll be able to see in your mind how to handle the other cases. In this case, we can say $W_1 = \sum_{j=2}^{n} c_j W_j$ for constants $c_j$ not all zero. To save typing, we'll just write $\Lambda_n$ for the arguments W2 through Wn, $\Lambda_{n-1}$ for the arguments W2 through Wn−1 and so forth. So using our properties, we find

$$\Phi(W_1, \Lambda_n) = \Phi\left(\sum_{j=2}^{n} c_j W_j, \Lambda_n\right) = (1/c_n)\,\Phi\left(\sum_{j=2}^{n-1} c_j W_j + c_n W_n, \Lambda_{n-1}, c_n W_n\right) = \Phi\left(\sum_{j=2}^{n-1} c_j W_j, \Lambda_{n-1}, W_n\right)$$

We can do this over and over again, until we peel off all the terms of the summand in the first slot to obtain

$$\Phi(W_1, \Lambda_n) = \Phi(W_2, W_2, \ldots, W_n) = \Phi(W_2 - W_2, W_2, \ldots, W_n) = \Phi(0, W_2, \ldots, W_n) = 0.$$

P7: Φ is linear in each argument: What if we have the sum of two vectors in row one, α + β? We will use the same Λn notation as before to save typing. There are lots of cases here, so we'll just do one and let you think about the others. We will assume the vectors W2 through Wn are linearly independent. Add another vector γ to them to make a basis and write down the expansions of both α and β in it: $\alpha = c\gamma + \sum_{j=2}^{n} c_j W_j$ and $\beta = d\gamma + \sum_{j=2}^{n} d_j W_j$. Then using our properties,

$$\Phi(\alpha + \beta, \Lambda_n) = \Phi\left(\left(c\gamma + \sum_{j=2}^{n} c_j W_j\right) + \left(d\gamma + \sum_{j=2}^{n} d_j W_j\right), \Lambda_n\right)$$
$$= \Phi\left(\left(c\gamma + \sum_{j=2}^{n} c_j W_j\right) - c_n W_n + \left(d\gamma + \sum_{j=2}^{n} d_j W_j\right) - d_n W_n, \Lambda_n\right)$$

We then do this again and again; after doing this for j = n − 1 down to j = 2, we have

$$\Phi(\alpha + \beta, \Lambda_n) = \Phi((c + d)\gamma, \Lambda_n) = (c + d)\,\Phi(\gamma, \Lambda_n)$$

Now working the other way, we have

$$\Phi(\alpha, \Lambda_n) + \Phi(\beta, \Lambda_n) = \Phi\left(c\gamma + \sum_{j=2}^{n} c_j W_j, \Lambda_n\right) + \Phi\left(d\gamma + \sum_{j=2}^{n} d_j W_j, \Lambda_n\right)$$


$$= \Phi(c\gamma, \Lambda_n) + \Phi(d\gamma, \Lambda_n) = \Phi((c + d)\gamma, \Lambda_n) = (c + d)\,\Phi(\gamma, \Lambda_n)$$

These match and so we are done with this case. The other cases are similar. If W2 through Wn are dependent, the argument changes, and so forth. We leave those details to your inquiring mind. We suggest a nice tall glass of stout will help with the pain.

P8: Interchanging two rows changes the sign of Φ: Let's do this for rows one and two to keep it simple. We have

$$\Phi(W_1, W_2, \Lambda_{n-1}) = \Phi(W_1, W_1 + W_2, \Lambda_{n-1}) = \Phi(W_1 - (W_1 + W_2), W_1 + W_2, \Lambda_{n-1})$$
$$= \Phi(-W_2, W_1 + W_2, \Lambda_{n-1}) = -\Phi(W_2, W_1 + W_2, \Lambda_{n-1}) = -\Phi(W_2, W_1, \Lambda_{n-1}).$$

Similar arguments, but messier notationally, work for the other rows. This implies a nice result. If we have a matrix whose rows are in the order (j1, j2, . . . , jn), then if it takes an even number of row swaps to get back to the order (1, 2, . . . , n), Φ doesn't change sign. But if it takes an odd number of interchanges, the sign flips.
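The sign attached to a row ordering can be computed by just counting swaps. Here is a minimal MatLab sketch (the function name PermSign is ours, not from the text) that sorts the ordering with swaps, flipping a sign each time:

function s = PermSign(j)
% j is a permutation of 1..n; s is +1 if an even number of
% swaps returns j to 1..n and -1 if an odd number is needed
s = 1;
n = length(j);
for i = 1:n
  while j(i) ~= i
    k = j(i);
    j([i k]) = j([k i]);   % swap entries i and k
    s = -s;
  end
end
end

For example, PermSign([1 3 2]) returns −1 and PermSign([2 3 1]) returns +1.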

10.1.2 Homework

Exercise 10.1.1

Exercise 10.1.2

Exercise 10.1.3

Exercise 10.1.4

Exercise 10.1.5

10.1.3 Consequences Two

Our three fundamental assumptions about Φ have additional consequences. We still don't know such a function Φ exists, although we do know we have a nice candidate, det, that works on 2 × 2 matrices. Let's work out some more consequences.

P9: Φ(AB) = Φ(A) Φ(B): This is a really useful thing. Let A have rows V1 through Vn and B have rows W1 through Wn. Let Yi be the columns of B. The usual matrix multiplication gives

$$AB = \begin{bmatrix} \langle V_1, Y_1\rangle & \ldots & \langle V_1, Y_n\rangle \\ \vdots & \vdots & \vdots \\ \langle V_n, Y_1\rangle & \ldots & \langle V_n, Y_n\rangle \end{bmatrix}$$

Each row of the product AB has a particular form. We'll show this for row one and you can figure out how to do it for the other rows from that. Do this on paper for a 3 × 3 and you'll see the pattern.

$$\begin{bmatrix} \langle V_1, Y_1\rangle & \ldots & \langle V_1, Y_n\rangle \end{bmatrix} = V_{11}W_1 + \ldots + V_{1n}W_n.$$


Call this row vector C1. We can do a similar expansion for the other rows of the product matrix. Hence, we want to calculate Φ(C1, . . . , Cn). Since Φ is additive, we have

$$\Phi(C_1, \ldots, C_n) = \Phi(V_{11}W_1 + \ldots + V_{1n}W_n, C_2, \ldots, C_n) = \sum_{j_1=1}^{n} V_{1,j_1}\,\Phi(W_{j_1}, C_2, \ldots, C_n)$$

Now do this again for C2. We have

$$\Phi(W_{j_1}, C_2, \ldots, C_n) = \Phi\left(W_{j_1}, \sum_{j_2=1}^{n} V_{2,j_2} W_{j_2}, C_3, \ldots, C_n\right) = \sum_{j_2=1}^{n} V_{2,j_2}\,\Phi(W_{j_1}, W_{j_2}, C_3, \ldots, C_n)$$

So after two steps, we have

$$\Phi(C_1, \ldots, C_n) = \sum_{j_1=1}^{n} V_{1,j_1} \sum_{j_2=1}^{n} V_{2,j_2}\,\Phi(W_{j_1}, W_{j_2}, C_3, \ldots, C_n)$$

Whew! Now finish up by handling the other rows. We obtain

$$\Phi(C_1, \ldots, C_n) = \sum_{j_1=1}^{n} V_{1,j_1} \sum_{j_2=1}^{n} V_{2,j_2} \ldots \sum_{j_n=1}^{n} V_{n,j_n}\,\Phi(W_{j_1}, W_{j_2}, \ldots, W_{j_n}) = \sum_{j_1=1}^{n} \ldots \sum_{j_n=1}^{n} V_{1,j_1} \ldots V_{n,j_n}\,\Phi(W_{j_1}, W_{j_2}, \ldots, W_{j_n}).$$

Now many of these terms are zero, as any time two rows match, Φ is zero. So we can say

$$\Phi(C_1, \ldots, C_n) = \sum_{j_1, j_2, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} V_{1,j_1} \ldots V_{n,j_n}\,\Phi(W_{j_1}, W_{j_2}, \ldots, W_{j_n}).$$

Next, let $\varepsilon_{j_1,\ldots,j_n}$ denote the −1 or +1 associated with the row swapping necessary to move $j_1, \ldots, j_n$ to $1, \ldots, n$. Then, we can rewrite again as

$$\Phi(C_1, \ldots, C_n) = \sum_{j_1, j_2, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n}\,V_{1,j_1} \ldots V_{n,j_n}\,\Phi(W_1, W_2, \ldots, W_n).$$

In particular, we can let B = I. Then we have AB = AI = A and

$$\Phi(V_1, \ldots, V_n) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n}\,V_{1,j_1} \ldots V_{n,j_n}\,\Phi(I_1, I_2, \ldots, I_n) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n}\,V_{1,j_1} \ldots V_{n,j_n}$$


since Φ(I) = 1. We usually just refer to Φ(V1, . . . , Vn) as Φ(A), so we have the identity

$$\Phi(A) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n}\,V_{1,j_1} \ldots V_{n,j_n}$$

Combining, we have shown Φ(AB) = Φ(A) Φ(B) since Φ(W1, W2, . . . , Wn) = Φ(B).
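Again, a quick numerical sanity check with MatLab's det (a sketch, not a proof):

% numerical check of det(AB) = det(A) det(B)
A = rand(5,5); B = rand(5,5);
disp(det(A*B) - det(A)*det(B));   % zero up to roundoff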

We have gotten pretty far in our discussions. We now know that if a function Φ satisfies Properties P1, P2 and P3, it must have the value

$$\Phi(A) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n}\,V_{1,j_1} \ldots V_{n,j_n}$$

This leads to our next consequence,

P10: Φ(A) = Φ(Aᵀ): Define a new function Ψ(A) = Φ(Aᵀ). If we multiply row i of A by the constant c, this is the same as multiplying A on the left by the matrix

$$R(c) = D(1, 1, \ldots, c, \ldots, 1)$$

where D is the diagonal matrix with ones on the main diagonal except in the (i, i) position, which has the value c. We can use Property P9 now. Since R(c) is symmetric,

$$\Psi(R(c)A) = \Phi((R(c)A)^T) = \Phi(A^T R(c)) = \Phi(A^T)\,\Phi(R(c)) = c\,\Phi(A^T) = c\,\Psi(A)$$

and so Ψ satisfies P1. Next, adding row j to row i corresponds to multiplying A on the left by

and so Ψ satisfies P1. Next, adding row j to row i corresponds to multiplyingAT by

R =

1 0 . . . 0 0 . . . 0 0 . . . 00 0 . . . 1 position (i,i) 0 . . . 0 1 position (i,j) . . . 00 0 . . . 0 1 0 0 0 . . . 0...

......

......

......

......

...0 0 . . . 0 0 . . . 0 0 . . . 1

Then

Ψ(RA) = Φ(RAT ) = Φ(R) Φ(AT ) = 1 Φ(AT ) = Ψ(A)

and so Property P2 holds. Finally, Ψ(I) = Φ(IT ) = Φ(I) = 1 and so Property P3 holds aswell. Since all three properties hold, we must also have

$$\Psi(A) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n}\,V_{1,j_1} \ldots V_{n,j_n} = \Phi(A).$$

We conclude Φ(A) = Φ(Aᵀ). This also implies another really important result: everything we have said about row manipulations applies equally to column manipulations.

We could use the formula above as our definition of how to find Φ. However, there is an easier way. The proper way to do this would be to define how to calculate a determinant using what are called cofactors in a recursive fashion. Let's go through how this works for a 3 × 3. We have


$$\Phi(A) = \sum_{i,j,k=1;\ i \ne j \ne k}^{3} \varepsilon_{i,j,k}\,V_{1i}V_{2j}V_{3k}$$
$$= \varepsilon_{123}V_{11}V_{22}V_{33} + \varepsilon_{132}V_{11}V_{23}V_{32} + \varepsilon_{213}V_{12}V_{21}V_{33} + \varepsilon_{231}V_{12}V_{23}V_{31} + \varepsilon_{312}V_{13}V_{21}V_{32} + \varepsilon_{321}V_{13}V_{22}V_{31}.$$

Now, the ε terms refer to the sign (either 1 or −1) associated with the rows being in that order. Remember, an odd number of row interchanges changes Φ by −1 and an even number of row interchanges leaves the sign unchanged.

ε123 = 1, because the rows are already in proper order.
ε132 = −1, because switching 3 with 2 gets 123 order.
ε213 = −1, because switching 2 with 1 gets 123 order.
ε231 = 1, because switching 3 with 1 gets 213 order; then 1 with 2 gives 123.
ε312 = 1, because switching 3 with 1 gets 132 order; then 3 with 2 gives 123.
ε321 = −1, because switching 3 with 1 gets 123 order.

Using these, we have

$$\Phi(A) = V_{11}V_{22}V_{33} - V_{11}V_{23}V_{32} - V_{12}V_{21}V_{33} + V_{12}V_{23}V_{31} + V_{13}V_{21}V_{32} - V_{13}V_{22}V_{31}.$$
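It is easy to check this six term formula against MatLab's det on a concrete matrix; a quick sketch:

% check the six term expansion against det for a 3x3 matrix
V = [1 2 3; 4 5 6; 7 8 10];
Phi = V(1,1)*V(2,2)*V(3,3) - V(1,1)*V(2,3)*V(3,2) - V(1,2)*V(2,1)*V(3,3) ...
    + V(1,2)*V(2,3)*V(3,1) + V(1,3)*V(2,1)*V(3,2) - V(1,3)*V(2,2)*V(3,1);
disp(Phi - det(V));   % prints 0; both give -3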

Now choose any row or column you want: we will use column 3. We will organize these calculations around the entries of column 3. We write

$$\Phi(A) = V_{13}(V_{21}V_{32} - V_{22}V_{31}) + V_{23}(V_{12}V_{31} - V_{11}V_{32}) + V_{33}(V_{11}V_{22} - V_{12}V_{21})$$
$$= V_{13}\det\begin{bmatrix} V_{21} & V_{22} \\ V_{31} & V_{32} \end{bmatrix} - V_{23}\det\begin{bmatrix} V_{11} & V_{12} \\ V_{31} & V_{32} \end{bmatrix} + V_{33}\det\begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}$$
$$= (-1)^{1+3}V_{13}\det\begin{bmatrix} V_{21} & V_{22} \\ V_{31} & V_{32} \end{bmatrix} + (-1)^{2+3}V_{23}\det\begin{bmatrix} V_{11} & V_{12} \\ V_{31} & V_{32} \end{bmatrix} + (-1)^{3+3}V_{33}\det\begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}.$$

Now look at the matrix A with column 3 singled out.

$$A = \begin{bmatrix} V_{11} & V_{12} & V_{13} \\ V_{21} & V_{22} & V_{23} \\ V_{31} & V_{32} & V_{33} \end{bmatrix}$$

Look at the entry V13 and mentally cross out row 1 and column 3 from the matrix. This leaves a 2 × 2 submatrix which is identical to the one after the V13 entry above. Call this the cofactor C13. Next, look at V23 and mentally cross out row 2 and column 3 from the matrix. This leaves a 2 × 2 submatrix which is identical to the one after the V23 entry above. Call this the cofactor C23. Finally, look at V33 and mentally cross out row 3 and column 3 from the matrix. This leaves a 2 × 2 submatrix which is identical to the one after the V33 entry above. Call this the cofactor C33. Hence, we can rewrite our computation as

$$\Phi(A) = (-1)^{1+3}V_{13}\det(C_{13}) + (-1)^{2+3}V_{23}\det(C_{23}) + (-1)^{3+3}V_{33}\det(C_{33})$$

What we have done here is to expand the computation of Φ using column 3 of the matrix. It is straightforward to choose a different row or column and do the same thing. The cofactors are defined similarly and we would find for a row expansion along row r

$$\Phi(A) = (-1)^{r+1}V_{r1}\det(C_{r1}) + (-1)^{r+2}V_{r2}\det(C_{r2}) + (-1)^{r+3}V_{r3}\det(C_{r3})$$

and for a column expansion along column c

$$\Phi(A) = (-1)^{1+c}V_{1c}\det(C_{1c}) + (-1)^{2+c}V_{2c}\det(C_{2c}) + (-1)^{3+c}V_{3c}\det(C_{3c})$$

This suggests Φ should be defined recursively, starting with 3 × 3 matrices using the usual 2 × 2 determinant det. If A were 4 × 4, we would define the determinant of a 3 × 3 matrix as above and define the determinant det(A) as follows: for a row expansion along row r

$$\det(A) = (-1)^{r+1}V_{r1}\det(C_{r1}) + (-1)^{r+2}V_{r2}\det(C_{r2}) + (-1)^{r+3}V_{r3}\det(C_{r3}) + (-1)^{r+4}V_{r4}\det(C_{r4})$$

where the det's are the 3 × 3 determinants we defined in the previous step and the cofactors are now 3 × 3 submatrices. For a column expansion along column c, we have

$$\Phi(A) = (-1)^{1+c}V_{1c}\det(C_{1c}) + (-1)^{2+c}V_{2c}\det(C_{2c}) + (-1)^{3+c}V_{3c}\det(C_{3c}) + (-1)^{4+c}V_{4c}\det(C_{4c}).$$

It is clear we can continue to do this as the size of A increases. Hence, instead of using the computational formula $\Phi(A) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n} V_{1,j_1} \ldots V_{n,j_n}$, we define Φ recursively as the scheme above indicates. If you call the function defined by this recursive scheme Θ, we can prove it also satisfies Properties P1, P2 and P3 using what is called an induction argument. This argument works by assuming the version we have defined at step n (i.e. for n × n matrices) satisfies Properties P1, P2 and P3 and then showing the new version for matrices of size (n + 1) × (n + 1) also satisfies the properties. Then Θ satisfies the usual Φ properties and so Θ and Φ must be the same function. As you can see, determinants of higher order matrices involve a lot of arithmetic! From this point on, we will simply call Φ by the name det! Let's state what we know for posterity.

Theorem 10.1.1 The Determinant

There is a unique real valued function det defined on n × n matrices which satisfies the following properties:

P1: det is homogeneous: if a row or a column is multiplied by a constant c, the value of det is multiplied by c.

P2: The value of det is unchanged if row a is replaced by the new row: row a plus row b, as long as a ≠ b. This is also true for columns.

P3: The determinant of the identity matrix is 1.

Proof 10.1.1
We had a very long discussion about this. We showed that there is only one function satisfying these three properties.


These three properties imply many others which we have discussed. Here they are in a more compact form.

Theorem 10.1.2 The Determinant Properties

The real valued function det defined on n × n matrices satisfies the following additional properties:

P4: If a row or column is zero, then det is zero.

P5: If row a is replaced by row a + c row b, det is unchanged. The same is true if this is done to columns.

P6: If the rows or columns are linearly dependent, then det is zero.

P7: det is linear in each row or column slot.

P8: Interchanging two rows or columns changes the sign of det.

P9: det(AB) = det(A) det(B).

P10: det(A) = det(Aᵀ)

Proof 10.1.2
We used the determinant's three properties to prove each of these additional ones in the arguments presented earlier.

Property P9 is very important. From it, we derived the fundamental formula for the value of det. Recall this is $\Phi(A) = \sum_{j_1, \ldots, j_n = 1;\ j_1 \ne j_2 \ldots \ne j_n}^{n} \varepsilon_{j_1,\ldots,j_n} V_{1,j_1} \ldots V_{n,j_n}$. From this, we know immediately some very important things which we state as another theorem.

Theorem 10.1.3 The Determinant Smoothness

The real valued function det defined on n × n matrices is a continuous function of its n² parameters. Moreover, its partials of all orders exist and are continuous.

Proof 10.1.3
The value of the determinant is just a large polynomial in n² variables. Hence, the smoothness follows.

Finally, we have a recursive algorithm for calculating the determinant of an n × n matrix.

Theorem 10.1.4 The Determinant Algorithm

We can calculate det using a row expansion along row r:

$$\det(A) = \sum_{j=1}^{n} (-1)^{r+j} A_{rj} \det(C_{rj}),$$

or via a column expansion along column c:

$$\det(A) = \sum_{i=1}^{n} (-1)^{i+c} A_{ic} \det(C_{ic}).$$


Proof 10.1.4
We showed this explicitly for a 4 × 4 matrix earlier. This is just a generalization of that formula.
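Here is a minimal recursive MatLab sketch of this algorithm, expanding along row one; the function name CofactorDet is ours, not from the text:

function d = CofactorDet(A)
% compute the determinant by recursive cofactor expansion along row one
n = size(A,1);
if n == 1
  d = A(1,1);
else
  d = 0;
  for j = 1:n
    C = A(2:n, [1:j-1, j+1:n]);   % cross out row 1 and column j
    d = d + (-1)^(1+j)*A(1,j)*CofactorDet(C);
  end
end
end

On a random matrix A, abs(CofactorDet(A) - det(A)) is zero up to roundoff, but as the comment below explains, this approach is only practical for small matrices.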

Comment 10.1.1 This calculation is expensive. A 2 × 2 takes 2 multiplies and 2 adds. A 3 × 3 has three 2 × 2 determinants in its computation. Hence, we have 6 multiplies and 6 adds for the determinants and then 3 more multiplies for the row or column entries and then 3 more for the −1 calculations. So, the total is 12 multiplies and 6 adds. What about a 4 × 4 matrix? We have four 3 × 3 submatrix determinants, which gives 48 multiplies and 24 adds for them. Then the row or column entries give an additional 8 multiplies. The total is then 56 multiplies and 24 adds. So far, letting M denote multiplies and A denote adds:

2 × 2: 2 M and 2 A, which is larger than 2!
3 × 3: 12 M and 6 A, which is larger than 3!.
4 × 4: 56 M and 24 A, which is larger than 4!.

Let's look at the 5 × 5 case. We have five 4 × 4 submatrix determinants, which gives 280 M and 120 A. We have an additional 2 × 5 multiplies for the row or column entries. The total is thus 290 M and 120 A, which is larger than 5!. Thus, as the size of the matrix increases, the number of arithmetic operations to calculate an n × n determinant is larger than n!.

10.1.4 Homework

Exercise 10.1.6

Exercise 10.1.7

Exercise 10.1.8

Exercise 10.1.9

Exercise 10.1.10

10.2 Manipulating Matrices

To understand our problem with finding conditions to check to see if a matrix is positive or negative definite, we need to dig deeper into how a matrix and another matrix which we have been altering by row combinations are related. To begin, we focus on what are called elementary row operations.

10.2.1 Elementary Row Operations and Determinants

Let's try to solve the system

$$\begin{bmatrix} 1 & -2 & 3 \\ -2 & 5 & 7 \\ 4 & 3 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \\ 8 \end{bmatrix}$$

We can manipulate this by adding multiples of one equation to another, and by doing this sort of thing systematically, we can solve the problem. This is usually done much less formally when the code to perform an LU decomposition of the matrix A is discussed. But here we want to make some new points, so bear with us! These sorts of manipulations don't really need us to write down the x, y and z variables, so it is customary to make up a new matrix by taking the coefficient matrix A above and


adding an extra column to it which consists of the data b. This new matrix is called the augmented matrix. Let's call it B. Then, we have

$$B = \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ -2 & 5 & 7 & 5 \\ 4 & 3 & -1 & 8 \end{array}\right]$$

Now if we multiply the original first equation by 2 and add it to the second equation, we get the new second equation

$$(2 - 2)x + (-4 + 5)y + (6 + 7)z = 4 + 5 \quad \Longrightarrow \quad y + 13z = 9$$

This is the same as multiplying the coefficient matrix A by the matrix R21(2)

$$R_{21}(2) = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

which is read as: replace row 2 of A by 2 times row 1 plus row 2. We see

$$R_{21}(2)\, A = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & -2 & 3 \\ -2 & 5 & 7 \\ 4 & 3 & -1 \end{bmatrix} = \begin{bmatrix} 1 & -2 & 3 \\ 0 & 1 & 13 \\ 4 & 3 & -1 \end{bmatrix}$$

We do this to the data also, giving

$$R_{21}(2)\, b = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \\ 8 \end{bmatrix} = \begin{bmatrix} 2 \\ 9 \\ 8 \end{bmatrix}$$

Hence, to save time, we can abuse notation and think of applying R21(2) to the augmented matrix B (just apply R21(2) to the fourth column separately!), giving

$$R_{21}(2)\, B = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ -2 & 5 & 7 & 5 \\ 4 & 3 & -1 & 8 \end{array}\right] = \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ 0 & 1 & 13 & 9 \\ 4 & 3 & -1 & 8 \end{array}\right]$$

Now apply a new transformation, R31(−4), to the new augmented matrix.

$$R_{31}(-4)\, B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -4 & 0 & 1 \end{bmatrix} \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ 0 & 1 & 13 & 9 \\ 4 & 3 & -1 & 8 \end{array}\right] = \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ 0 & 1 & 13 & 9 \\ 0 & 11 & -13 & 0 \end{array}\right]$$

Note we have zeroed out all of the entries below the A11 position. Now let's work on column two. Apply the transformation R32(−11). We have

$$R_{32}(-11)\, B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -11 & 1 \end{bmatrix} \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ 0 & 1 & 13 & 9 \\ 0 & 11 & -13 & 0 \end{array}\right] = \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ 0 & 1 & 13 & 9 \\ 0 & 0 & -156 & -99 \end{array}\right]$$


Next, apply R12(2). We find

$$R_{12}(2)\, B = \begin{bmatrix} 1 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \left[\begin{array}{rrr|r} 1 & -2 & 3 & 2 \\ 0 & 1 & 13 & 9 \\ 0 & 0 & -156 & -99 \end{array}\right] = \left[\begin{array}{rrr|r} 1 & 0 & 29 & 20 \\ 0 & 1 & 13 & 9 \\ 0 & 0 & -156 & -99 \end{array}\right]$$

Now we have zeroed out all the entries above and below the original A22. We have found that if we apply a sequence of elementary row operations, we can transform A into an upper triangular matrix. To summarize, we found constants c32, c31 and c21 so that

$$R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})\, A = \begin{bmatrix} 1 & -2 & 3 \\ 0 & 1 & 13 \\ 0 & 0 & -156 \end{bmatrix}$$

Now each of the matrices R32(c32), R31(c31) and R21(c21) is formed by adding a multiple of some row of the identity to another row of the identity. Hence, all of these elementary row operation premultiplier matrices have determinant 1. Thus, we know

$$\det(U) = \det(R_{32}(c_{32}))\, \det(R_{31}(c_{31}))\, \det(R_{21}(c_{21}))\, \det(A) = \det(A)$$

where U is the upper triangular matrix we have found by applying these transformations. The determinant of U is simple: applying our algorithm for the recursive computation of det here (just expand using column 1), we find det(U) is just the product of its diagonal entries. Hence, just 3 multiplies here. Now the elementary row operations don't need to be applied as real matrix multiplications. Instead, we just perform the alterations to the rows as needed. So R21(c21) needs 3 M and 3 A, but since we know the first column entry will be zeroed out, we don't really have to do that one. So just 2 M and 2 A. The next one, R31(c31), only needs 1 M and 1 A, as we don't really have to do the column two entry. Then we are done. We have found det(A) in just 2 + 1 M and 2 + 1 A to get U and then 3 M to get det. The total is 6 M and 3 A. The recursive algorithm requires 12 M and 6 A, so already we have saved 50% on the operation count! It gets better and better as the size of A goes up. Hence, calculating det using the recursive algorithm is very expensive and we prefer to use elementary row operations instead.

We have also converted our original system into the new system

$$U \begin{bmatrix} x \\ y \\ z \end{bmatrix} = R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})\, b$$

which we can readily solve using back substitution.
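Here is a minimal MATLAB sketch of this whole calculation on our example, applying the three row operations to the augmented matrix and then back substituting; the variable names are our own choices, not code from the text's libraries.

A = [1 -2 3; -2 5 7; 4 3 -1];
b = [2; 5; 8];
B = [A b];                       % augmented matrix
B(2,:) = 2*B(1,:) + B(2,:);      % R21(2)
B(3,:) = -4*B(1,:) + B(3,:);     % R31(-4)
B(3,:) = -11*B(2,:) + B(3,:);    % R32(-11)
U = B(:,1:3); c = B(:,4);
% back substitution on the upper triangular system U*[x;y;z] = c
z = c(3)/U(3,3);
y = (c(2) - U(2,3)*z)/U(2,2);
x = (c(1) - U(1,2)*y - U(1,3)*z)/U(1,1);
[x; y; z]                        % should agree with A\b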

10.2.1.1 Homework

Exercise 10.2.1

Exercise 10.2.2

Exercise 10.2.3

Exercise 10.2.4

Exercise 10.2.5


10.2.2 MatLab Implementations

Let's see if we can figure out how to find the multiplier matrices R using MatLab. In our code, we will call this matrix P, although we can return it as the matrix R if we want! We need some new code. First, let's introduce some needed utility functions. The first one, LMult(i,j,c,n), builds the left multiplier which replaces row j of a matrix by the new row j which is c row(i) + row(j).

Listing 10.1: The Fundamental Row Operation

function R = LMult(i,j,c,n)
%
% R is the left multiplier which replaces
% row j by c * row i + row j
%
R = eye(n,n);
R(j,i) = c;
end

We also need the code to do a row swap, given by SwapRows(i,j,n), which swaps row i and row j of A.

Listing 10.2: Swapping Rows

function P = SwapRows(i,j,n)
%
row = n;
P = eye(row,row);
C = P(i,:);
D = P(j,:);
P(i,:) = D;
P(j,:) = C;
end

The code to find the multiplier formed by all the row operations is a modification of our old code GePiv.m so that we can keep track of all the multipliers. This is not very efficient, but it will help you see how it works.

Listing 10.3: Finding the Multiplier

function [P,piv] = GetP(A)
%
% A is an nxn matrix
% piv is an integer permutation vector
%
[n,n] = size(A);
piv = 1:n;
R = eye(n,n);
RPrev = R;
Aorig = A;
for k = 1:n-1
  % find maximum absolute value entry
  % in column k of rows k through n
  [maxc,r] = max(abs(A(k:n,k)));
  q = r+k-1;
  % this line swaps the k and q positions in piv
  piv([k q]) = piv([q k]);
  % then swap row k and q in A
  W = SwapRows(k,q,n);
  A = W*A;
  % fold the swap into the accumulated multiplier
  RPrev = W*RPrev;
  R = RPrev;
  if A(k,k) ~= 0
    % get multipliers
    c = -A(k+1:n,k)/A(k,k);
    % find multiplier matrix
    for u = k+1:n
      % get multiplier
      S = LMult(k,u,c(u-k),n);
      % reset A; note A has been possibly row swapped
      A = S*A;
      R = S*RPrev;
      RPrev = R;
    end
  end
end
P = R;
end

Let's try this out. We are going to print out a lot of the intermediary calculations, so it will be a bit verbose! First, let's do column one.

Listing 10.4: Doing Column 1

>> A = [1,2,3;2,4,5;3,5,6];
>> [P,piv] = GetP(A);
% The original matrix
A =
   1   2   3
   2   4   5
   3   5   6
% we need to swap row 3 and row 1
W =
   0   0   1
   0   1   0
   1   0   0
% the swapped matrix is
A =
   3   5   6
   2   4   5
   1   2   3
% now zero out the second entry in column one
% the row operation matrix is
S =
   1.00000   0.00000   0.00000
  -0.66667   1.00000   0.00000
   0.00000   0.00000   1.00000
% the new matrix is then
A =
   3.00000   5.00000   6.00000
   0.00000   0.66667   1.00000
   1.00000   2.00000   3.00000
% The combined row operation is then S*W
% which swaps column 1 and column 3 of S
R =
   0.00000   0.00000   1.00000
   0.00000   1.00000  -0.66667
   1.00000   0.00000   0.00000
% now zero out the third entry in column one
% the new row operation matrix is
S =
   1.00000   0.00000   0.00000
   0.00000   1.00000   0.00000
  -0.33333   0.00000   1.00000
% the new matrix is
A =
   3.00000   5.00000   6.00000
   0.00000   0.66667   1.00000
   0.00000   0.33333   1.00000
% the combined row operation matrix is
% the new one * the previous one giving
R =
   0.00000   0.00000   1.00000
   0.00000   1.00000  -0.66667
   1.00000   0.00000  -0.33333

Next, column two.

Listing 10.5: Doing Column 2

% We don't need to do another row swap, so
% the second swap matrix is just the identity
W =
   1   0   0
   0   1   0
   0   0   1
% The new A is unchanged as there is no swap
A =
   3.00000   5.00000   6.00000
   0.00000   0.66667   1.00000
   0.00000   0.33333   1.00000
% now zero out the third entry in column 2
% the row operation matrix is
S =
   1.00000   0.00000   0.00000
   0.00000   1.00000   0.00000
   0.00000  -0.50000   1.00000
% the altered A is
A =
   3.00000   5.00000   6.00000
   0.00000   0.66667   1.00000
   0.00000   0.00000   0.50000
% the combined row operation matrix
% is the new one times the previous one
R =
   0.00000   0.00000   1.00000
   0.00000   1.00000  -0.66667
   1.00000  -0.50000   0.00000

Note just as a check, P*A gives

Listing 10.6: Checking PA

>> P*A
ans =
   3.00000   5.00000   6.00000
   0.00000   0.66667   1.00000
   0.00000   0.00000   0.50000

and our row operation matrix is

Listing 10.7: The Multiplier P

>> P
P =
   0.00000   0.00000   1.00000
   0.00000   1.00000  -0.66667
   1.00000  -0.50000   0.00000
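As a quick sanity check tying this back to the determinant discussion in Section 10.2.1: P is the product of one row swap and unit lower triangular multipliers, so det(P) = −1 and the pivots of P*A recover det(A). A short sketch, using MATLAB's built-in det only as a check:

% P*A is upper triangular, so its determinant is the product of its
% diagonal entries; dividing out det(P) = -1 recovers det(A)
prod(diag(P*A))/det(P)   % gives -1, which matches det(A) for this A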

10.2.2.1 Homework

Exercise 10.2.6

Exercise 10.2.7

Exercise 10.2.8

Exercise 10.2.9

Exercise 10.2.10


10.2.3 Matrix Inverse Calculations

Now let's zero out the entries above the original A33 position with R13(c) and R23(d) for suitable c and d. Thus, applying R13(29/156), we have

$$R_{13}(29/156)\, B = \begin{bmatrix} 1 & 0 & 29/156 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \left[\begin{array}{rrr|r} 1 & 0 & 29 & 20 \\ 0 & 1 & 13 & 9 \\ 0 & 0 & -156 & -99 \end{array}\right] = \left[\begin{array}{rrr|r} 1 & 0 & 0 & 1.5962 \\ 0 & 1 & 13 & 9 \\ 0 & 0 & -156 & -99 \end{array}\right]$$

Finally, applying R23(13/156), we have

$$R_{23}(13/156)\, B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 13/156 \\ 0 & 0 & 1 \end{bmatrix} \left[\begin{array}{rrr|r} 1 & 0 & 0 & 1.5962 \\ 0 & 1 & 13 & 9 \\ 0 & 0 & -156 & -99 \end{array}\right] = \left[\begin{array}{rrr|r} 1 & 0 & 0 & 1.5962 \\ 0 & 1 & 0 & 0.75 \\ 0 & 0 & -156 & -99 \end{array}\right]$$

The last row then tells us z = 99/156 = 0.6346. We had already found x = 1.5962 and y = 0.75, so we have solved the system. This is what we did in code with the LU decomposition of A and the use of upper and lower triangular solvers. You should go back and look at that again to get a better appreciation of what we have done here. Of course, not trusting our typing and arithmetic, we used our MatLab code to double check all of these calculations above. All we can say is this has been intense! But let's see what we have found. We wanted to solve the problem AX = b where the data b is given and X denotes the triple [x y z]^T. We found that if we applied a sequence of elementary row operations we could transform A into a diagonal matrix. To summarize, we found

$$R_{23}(c_{23})\, R_{13}(c_{13})\, R_{12}(c_{12})\, R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})\, A = \begin{bmatrix} r_1 & 0 & 0 \\ 0 & r_2 & 0 \\ 0 & 0 & r_3 \end{bmatrix}$$

for appropriate constants cij and some nonzero constants r1, r2 and r3. Once this was done, it is easy to find the unique x, y and z that solve the problem: we simply divide the entries of the transformed b by the appropriate ri. We can do this for any system of equations and, of course, it fails if A has problems.

Note, we can think of this a different way. Assume you are trying to solve three separate problems simultaneously: AX = e1, AX = e2 and AX = e3 where e1 = [1 0 0]^T, e2 = [0 1 0]^T and e3 = [0 0 1]^T. Form the new augmented matrix

$$\left[\begin{array}{ccc|ccc} A_{11} & A_{12} & A_{13} & 1 & 0 & 0 \\ A_{21} & A_{22} & A_{23} & 0 & 1 & 0 \\ A_{31} & A_{32} & A_{33} & 0 & 0 & 1 \end{array}\right]$$

Then apply the elementary row operations like usual to convert A into an upper triangular matrix U. If this system is solvable, we can always do this. Thus, we find

$$R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})\, A = \begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix}$$

while the identity columns of the augmented matrix have become R32(c32) R31(c31) R21(c21) I.

Now apply the other three elementary row operations:

$$R_{23}(c_{23})\, R_{13}(c_{13})\, R_{12}(c_{12})\, R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})\, A = \begin{bmatrix} r_1 & 0 & 0 \\ 0 & r_2 & 0 \\ 0 & 0 & r_3 \end{bmatrix}$$

while the identity columns have become

$$R_{23}(c_{23})\, R_{13}(c_{13})\, R_{12}(c_{12})\, R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})\, I$$

Now apply the diagonal matrix D(1/r1, 1/r2, 1/r3), whose main diagonal entries are 1/r1, 1/r2, 1/r3, with every other entry zero. Let B be defined by

$$B = \begin{bmatrix} 1/r_1 & 0 & 0 \\ 0 & 1/r_2 & 0 \\ 0 & 0 & 1/r_3 \end{bmatrix} R_{23}(c_{23})\, R_{13}(c_{13})\, R_{12}(c_{12})\, R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})$$

Then, we have

$$I = B\, A$$

We have just found something cool! For convenience, let R be given by

$$R = R_{23}(c_{23})\, R_{13}(c_{13})\, R_{12}(c_{12})\, R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})$$

Then note B = D(1/r1, 1/r2, 1/r3) R and BA = I, so that we have found the inverse of A using elementary row operations. It is

$$A^{-1} = D(1/r_1, 1/r_2, 1/r_3)\, R\,!$$
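A minimal MATLAB sketch of this augmented identity idea, using the built-in rref to do the elementary row operations on our earlier 3 × 3 example:

A = [1 -2 3; -2 5 7; 4 3 -1];
RA = rref([A eye(3)]);     % row reduce the augmented matrix [A | I]
Ainv = RA(:,4:6);          % the last three columns are the inverse
norm(A*Ainv - eye(3))      % should be essentially zero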

This is a computationally efficient way to find a matrix inverse, if it exists.

Homework

Exercise 10.2.11

Exercise 10.2.12

Exercise 10.2.13

Exercise 10.2.14

Exercise 10.2.15

10.2.3.1 MatLab Implementations

We can implement these ideas using our row operation matrices explicitly as follows. Again, we will implement the code using a matrix name different from the R's above. For this code, we will use Q, and the R we want will end up being QP. Now this is not so efficient as, normally, we would not store these matrices, but we wanted you to see how it works. We already know how to get the row operation matrix P that converts A into an upper triangular matrix. To finish the job, we don't have to worry about row swaps, so the code is cleaner. We only apply this code to symmetric matrices.

Consider the code for GetQ2(A) given below.

Listing 10.8: GetQ2.m: Diagonalize A

function Q = GetQ2(A)
%
% A is an nxn matrix in upper triangular form
%
[n,n] = size(A);
R = eye(n,n);
for k = 2:n
  if A(k,k) ~= 0
    % get multipliers: look at rows 1 to k-1 and
    % divide their column k entries by the pivot value A(k,k)
    c = -A(1:k-1,k)/A(k,k);
    % find multiplier matrix
    for u = 1:k-1
      S = LMult(k,u,c(u),n);
      A = S*A;
      R = S*R;
    end
  end
end
Q = R;
end

This code finds the row operation matrices that convert PA into diagonal form. Let's do a 4 × 4 example.

Listing 10.9: Upper Triangularizing A

>> A = [4,1,3,2;1,5,6,3;3,6,10,5;2,3,5,1]
% original A
A =
   4   1   3   2
   1   5   6   3
   3   6  10   5
   2   3   5   1
>> [P,piv] = GetP(A);
% PA is upper triangular
>> P*A
ans =
   4.00000   1.00000   3.00000   2.00000
   0.00000   5.25000   7.75000   3.50000
   0.00000   0.00000  -1.76190  -0.66667
   0.00000   0.00000   0.00000  -1.59459
% apply further row operations to diagonalize PA
>> Q = GetQ2(P*A);
>> Q
Q =
   1.00000  -0.19048   0.86486   0.47458
   0.00000   1.00000   4.39865   0.35593
   0.00000   0.00000   1.00000  -0.41808
   0.00000   0.00000   0.00000   1.00000
% note QP is the matrix which diagonalizes A
>> Q*P
ans =
   1.42373   0.81356  -1.15254   0.47458
   1.06780   4.36017  -3.11441   0.35593
   0.50767   1.04520  -0.74657  -0.41808
  -0.18919  -0.10811  -0.37838   1.00000
% Q*P*A gives us the diagonal matrix D
>> Q*P*A
ans =
   4.00000  -0.00000   0.00000  -0.00000
  -0.00000   5.25000  -0.00000  -0.00000
   0.00000   0.00000  -1.76190   0.00000
   0.00000   0.00000   0.00000  -1.59459

Hence, we have QPA = D and so D⁻¹QPA = I, which tells us A⁻¹ = D⁻¹QP! We can package this into a new piece of code, GetInvSymm(A).

Listing 10.10: Simple Matrix Inverse Code

function Ainv = GetInvSymm(A)
%
% A is an n x n symmetric matrix
%
n = size(A,1);
[P,piv] = GetP(A);
Q = GetQ2(P*A);
D = Q*P*A;
Dinv = D;
for k = 1:n
  Dinv(k,k) = 1/D(k,k);
end
Ainv = Dinv*Q*P;
end

For our example, we find

Listing 10.11: Computing and Checking the Inverse

>> Ainv = GetInvSymm(A);
% original matrix
A =
   4   1   3   2
   1   5   6   3
   3   6  10   5
   2   3   5   1
>> Ainv
Ainv =
   0.355932   0.203390  -0.288136   0.118644
   0.203390   0.830508  -0.593220   0.067797
  -0.288136  -0.593220   0.423729   0.237288
   0.118644   0.067797   0.237288  -0.627119
% Checks
>> A*Ainv
ans =
   1.0000e+00  -2.2482e-15   3.5527e-15  -2.2204e-16
  -1.7708e-14   1.0000e+00   1.8430e-14   4.2188e-15
  -1.9318e-14  -1.4599e-14   1.0000e+00   4.8850e-15
  -9.8949e-15  -7.8965e-15   9.5479e-15   1.0000e+00
>> Ainv*A
ans =
   1.00000  -0.00000   0.00000   0.00000
  -0.00000   1.00000   0.00000   0.00000
   0.00000   0.00000   1.00000   0.00000
   0.00000   0.00000  -0.00000   1.00000

We see the inverse checks out! Normally, we just do elementary row operations more efficiently. For example, our earlier LU decomposition works by using row operations.

Homework

Exercise 10.2.16

Exercise 10.2.17

Exercise 10.2.18

Exercise 10.2.19

Exercise 10.2.20

10.3 Back To Definite Matrices

Our interests are in symmetric matrices that are positive or negative definite. Recall, a symmetric matrix A can be written as

$$A = P \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} P^T$$

where P is the matrix of A's eigenvectors. This P is different from the generic P we used in the upper triangularization code introduced above. So in this section, we have the matrix of eigenvectors P and new matrices of the form R which we will use to develop some ideas about minors. For our discussions here, we will ignore row swapping issues to keep our explanations from becoming too messy. However, you should be able to see that adding the piv idea into the mixture is not terribly difficult. Note, letting Λ be the diagonal matrix D(λ1, λ2, λ3), we have

$$\det(A) = \det(P)\, \det(\Lambda)\, \det(P^T) = \lambda_1 \lambda_2 \lambda_3$$

as PP^T = I and so det(P) det(P^T) = 1. We can apply the ideas above to the matrix A to find a matrix R given by

$$R = R_{32}(c_{32})\, R_{31}(c_{31})\, R_{21}(c_{21})$$


for appropriate constants cij so that

$$R A = \begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix}$$

Now let U denote the upper triangular matrix above. From our discussions above, we know det(U) = U11 U22 U33. However, this determinant is also λ1λ2λ3. Hence, we know λ1λ2λ3 = U11 U22 U33. Now the next step is also pretty intense, so just pour a new cup of tea or coffee and settle down for some mind stretching.
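Before we do, here is a short MATLAB check of the claim λ1λ2λ3 = U11 U22 U33 on a hypothetical symmetric matrix, chosen so no row swaps are needed:

A = [4 1 3; 1 5 6; 3 6 10];
U = A;
for k = 1:2                      % zero out the entries below each pivot
  for j = k+1:3
    U(j,:) = U(j,:) - (U(j,k)/U(k,k))*U(k,:);
  end
end
[prod(diag(U)), prod(eig(A))]    % the two products should agree (both 37)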

10.3.1 Matrix Minors

Now let's study the transformed matrix RAR^T, which looks like

$$R A R^T = \big(R_{32} R_{31} R_{21}\big)\, A\, \big(R_{21}^T R_{31}^T R_{32}^T\big) = \Lambda$$

where we have dropped the cij terms for expositional clarity. You know they are there, right?

Now look at the innermost part, R21 A R21^T. The matrix A has three important determinants associated with it, called its principal minors. This transformation is applied with some constant c, so we have

$$R_{21}(c)\, A = \begin{bmatrix} 1 & 0 & 0 \\ c & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{12} & A_{22} & A_{23} \\ A_{13} & A_{23} & A_{33} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ cA_{11}+A_{12} & cA_{12}+A_{22} & cA_{13}+A_{23} \\ A_{13} & A_{23} & A_{33} \end{bmatrix}$$

Next, calculate

$$R_{21}(c)\, A\, R_{21}^T(c) = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ cA_{11}+A_{12} & cA_{12}+A_{22} & cA_{13}+A_{23} \\ A_{13} & A_{23} & A_{33} \end{bmatrix} \begin{bmatrix} 1 & c & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
$$= \begin{bmatrix} A_{11} & cA_{11}+A_{12} & A_{13} \\ cA_{11}+A_{12} & c^2A_{11}+2cA_{12}+A_{22} & cA_{13}+A_{23} \\ A_{13} & cA_{13}+A_{23} & A_{33} \end{bmatrix}$$

The principal minors of our A are defined as follows:

$$(PM)_1 = \det(A_{11}) = A_{11}, \quad (PM)_2 = \det\left(\begin{bmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{bmatrix}\right), \quad (PM)_3 = \det\left(\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{12} & A_{22} & A_{23} \\ A_{13} & A_{23} & A_{33} \end{bmatrix}\right)$$

Let (PMT)i denote the principal minors of the matrix A after the first pair of elementary row operations. From the above calculation, we see (PMT)1 = (PM)1 = A11 and (PMT)2 = (PM)2. What about (PMT)3? Let's do a sample calculation. We have


$$R_{21}(c)\, A\, R_{21}^T(c) = \begin{bmatrix} A_{11} & cA_{11}+A_{12} & A_{13} \\ cA_{11}+A_{12} & c^2A_{11}+2cA_{12}+A_{22} & cA_{13}+A_{23} \\ A_{13} & cA_{13}+A_{23} & A_{33} \end{bmatrix}$$

Let's expand by column 3. We have

$$\det(R_{21}(c)\, A\, R_{21}^T(c)) = A_{13} \det\left(\begin{bmatrix} cA_{11}+A_{12} & c^2A_{11}+2cA_{12}+A_{22} \\ A_{13} & cA_{13}+A_{23} \end{bmatrix}\right)$$
$$- (cA_{13}+A_{23}) \det\left(\begin{bmatrix} A_{11} & cA_{11}+A_{12} \\ A_{13} & cA_{13}+A_{23} \end{bmatrix}\right) + A_{33} \det\left(\begin{bmatrix} A_{11} & cA_{11}+A_{12} \\ cA_{11}+A_{12} & c^2A_{11}+2cA_{12}+A_{22} \end{bmatrix}\right)$$

After calculating the determinants and some simplifications (yes, we expect you to use pen and paper and verify!), we find

$$\det(R_{21}(c)\, A\, R_{21}^T(c)) = A_{13}\big(A_{12}A_{23} - A_{13}A_{22} + cA_{11}A_{23} - cA_{12}A_{13}\big) - (cA_{13}+A_{23})\big(A_{11}A_{23} - A_{12}A_{13}\big) + A_{33}\big(A_{11}A_{22} - A_{12}^2\big)$$

The term A11 A22 − A12² is the determinant of the cofactor C33, and buried in this messy expression are two other cofactors: A12 A23 − A13 A22 is the determinant of the cofactor C31 and A11 A23 − A12 A13 is the determinant of the cofactor C32. Hence, we have

$$\det(R_{21}(c)\, A\, R_{21}^T(c)) = A_{13}\big(\det(C_{31}) + cA_{11}A_{23} - cA_{12}A_{13}\big) - (cA_{13}+A_{23})\, \det(C_{32}) + A_{33}\, \det(C_{33})$$

But we have a symmetric matrix and so A31 = A13, and so expanding by the third row, we have

$$\det(A) = A_{31}\det(C_{31}) - A_{32}\det(C_{32}) + A_{33}\det(C_{33}) = A_{13}\det(C_{31}) - A_{23}\det(C_{32}) + A_{33}\det(C_{33})$$

We conclude

$$\det(R_{21}(c)\, A\, R_{21}^T(c)) = \det(A) + cA_{11}A_{13}A_{23} - cA_{12}A_{13}^2 - cA_{13}\big(A_{11}A_{23} - A_{12}A_{13}\big)$$
$$= \det(A) + cA_{11}A_{13}A_{23} - cA_{12}A_{13}^2 - cA_{13}A_{11}A_{23} + cA_{12}A_{13}^2 = \det(A)$$

Thus, for this one pair of elementary row operations applied to A, the principal minors of A and R21(c) A R21^T(c) are equal. Here, it is important that we apply transformations that reduce A to upper triangular form; if we don't, the determinants of the principal minors will not match. The arguments for the other transformations are similar. So we can say the principal minors of R32 R31 R21 A R21^T R31^T R32^T match the principal minors of A.

We conclude the principal minors of RAR^T match the principal minors of A. These principal minors are λ1, λ1λ2 and λ1λ2λ3.
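Here is a numerical sketch of this minor preservation claim for the full congruence, again on a hypothetical symmetric matrix; the multipliers are built exactly as in the calculation above:

A = [4 1 3; 1 5 6; 3 6 10];
% build the congruence R32*R31*R21 * A * R21'*R31'*R32'
R21 = eye(3); R21(2,1) = -A(2,1)/A(1,1);  B = R21*A*R21';
R31 = eye(3); R31(3,1) = -B(3,1)/B(1,1);  B = R31*B*R31';
R32 = eye(3); R32(3,2) = -B(3,2)/B(2,2);  B = R32*B*R32';
pm = @(M) [M(1,1), det(M(1:2,1:2)), det(M)];
[pm(A); pm(B)]      % the two rows of principal minors should match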


10.3.1.1 Homework

Exercise 10.3.1

Exercise 10.3.2

Exercise 10.3.3

Exercise 10.3.4

Exercise 10.3.5


Part III

Calculus of Many Variables


Chapter 11

Differentiability

We now consider the differentiability properties of mappings $f : D \subset \Re^n \to \Re^m$ in earnest. In (Peterson (8) 2019), we worked through a good introduction to functions of two variables. Now we want to look at higher dimensional analogues.

11.1 Partial Derivatives

If f is locally defined on $B(x_0, r)$ in $\Re^n$, we can look at the trace corresponding to $x_0$ in the direction of the unit vector E; call this $f_E(t) = f(x_0 + tE)$ for sufficiently small t so the function values are well defined. This means

$$f_E(t) = f(x_0 + tE) = f(x_{01} + tE_1, \ldots, x_{0n} + tE_n)$$

If we calculate the difference between $f_E(t)$ and $f_E(0)$ when $E = E_i$, we find for any $t \neq 0$

$$\frac{f_{E_i}(t) - f_{E_i}(0)}{t} = \frac{f(x_{01}, \ldots, x_{0,i-1}, \mathbf{x_{0i} + t}, \ldots, x_{0n}) - f(x_{01}, \ldots, x_{0,i-1}, \mathbf{x_{0i}}, \ldots, x_{0n})}{t}$$

where we have put in boldface the ith slot where all the action is taking place. It is useful to determine if $\lim_{t \to 0} (1/t)\big(f(x_0 + tE) - f(x_0)\big)$ exists. In the case of two variables, this turns out to be the directional derivative of f in the direction of E. But we need to do this carefully for n variables and we don't want to take advantage of two dimensional intuition. We start by looking at the standard basis $E_1, \ldots, E_n$ for $\Re^n$ where the components of $E_i$ are $(E_i)_j = \delta_{ij}$ as usual. Then, we can consider the limit

$$\lim_{t \to 0} \frac{f(x_0 + tE_i) - f(x_0)}{t}$$

If this limit exists, it is called the partial derivative of f with respect to the ith coordinate slot. There are many notations for this.

Definition 11.1.1


Let $z = f(x)$ be a function defined locally at x in $\Re^n$. The partial derivative of f with respect to its ith coordinate or slot at x is defined by the limit

$$\frac{\partial f}{\partial x_i}(x) = \lim_{t \to 0} \frac{f(x + tE_i) - f(x)}{t} = \lim_{y \to x_i} \frac{f(x + (y - x_i)E_i) - f(x)}{y - x_i}$$

If this limit exists, it is denoted by a variety of symbols: for functions of two variables x and y, for the partial derivatives at the point $(x_0, y_0)$, we often use $f_x(x_0, y_0)$, $z_x(x_0, y_0)$, $\frac{\partial f}{\partial x}(x_0, y_0)$ and $\frac{\partial z}{\partial x}(x_0, y_0)$, and $f_y(x_0, y_0)$, $z_y(x_0, y_0)$, $\frac{\partial f}{\partial y}(x_0, y_0)$ and $\frac{\partial z}{\partial y}(x_0, y_0)$. However, this notation is not very useful when x is in $\Re^n$. In the more general case, we use $D_i$ to indicate the partial derivative with respect to the ith slot instead of $\frac{\partial f}{\partial x_i}$.

Comment 11.1.1 It is easy to take partial derivatives. Just imagine all but the one variable held constant and take the derivative of the resulting function of one variable, just like you did in your earlier calculus courses.

Comment 11.1.2 Notation can be really bad here. Consider the following partial we need to take in a problem with line integrals in Chapter 17. This is a calculation for functions of two variables and uses the chain rule which we covered in (Peterson (8) 2019). We need to find

$$(A\, G_{1v} + B\, G_{2v})_u$$

where A and B are functions of two variables and G1 and G2 are also. This is a change of variable problem with $x = G_1(u, v)$ and $y = G_2(u, v)$. In the chain rule, we need the partials of A with respect to argument one and argument two. In the original A function, these would be $A_x$ and $A_y$, but that is really confusing here. That would give

$$(A G_{1v} + B G_{2v})_u = (A_x G_{1u} + A_y G_{2u}) G_{1v} + A G_{1vu} + (B_x G_{1u} + B_y G_{2u}) G_{2v} + B G_{2vu}$$

or we could use arg 1 and arg 2, which is ugly:

$$(A G_{1v} + B G_{2v})_u = (A_{\text{arg } 1} G_{1u} + A_{\text{arg } 2} G_{2u}) G_{1v} + A G_{1vu} + (B_{\text{arg } 1} G_{1u} + B_{\text{arg } 2} G_{2u}) G_{2v} + B G_{2vu}$$

But it is cleaner to just use $A_1$ and $A_2$ and so forth to indicate these partials with respect to slot one and slot two:

$$(A G_{1v} + B G_{2v})_u = (A_1 G_{1u} + A_2 G_{2u}) G_{1v} + A G_{1vu} + (B_1 G_{1u} + B_2 G_{2u}) G_{2v} + B G_{2vu}$$

At any rate, don't be surprised if, in a complicated partial calculation, it all gets confusing!

11.1.1 Homework

Exercise 11.1.1

Exercise 11.1.2

Exercise 11.1.3

Exercise 11.1.4

Exercise 11.1.5


11.2 Tangent Planes

It is very useful to have an analytical way to discuss tangent planes and their relationship to the real surface to which they are attached, even if we can't draw this.

Definition 11.2.1 Planes in n Dimensions

A plane in $\Re^n$ through the point $x_0$ is defined as the set of all vectors x so that the angle between the vectors D and N is 90°, where D is the vector we get by connecting the point $x_0$ to the point x. Hence, for

$$D = x - x_0 = \begin{bmatrix} x_1 - x_{01} \\ \vdots \\ x_n - x_{0n} \end{bmatrix} \quad \text{and} \quad N = \begin{bmatrix} N_1 \\ \vdots \\ N_n \end{bmatrix}$$

the plane is the set of points x so that $\langle D, N \rangle = 0$. The vector N is called the normal vector to the plane. If $n > 2$, we usually call this a hyperplane. Also note hyperplanes through the origin of $\Re^n$ are $n-1$ dimensional subspaces of $\Re^n$.

If N is given, then the span of N, sp(N), is a one dimensional subspace of $\Re^n$, and the orthogonal complement of it, which is

$$sp(N)^\perp = \{x \in \Re^n \mid \langle x, N \rangle = 0\}$$

is an $n-1$ dimensional subspace of $\Re^n$, and we can construct an orthonormal basis for it as follows:

• Pick any nonzero $w_1$ not equal to a multiple of N. Then $w_1$ is not in sp(N). Compute $F_1 = N/\|N\|$ and

$$v = w_1 - \langle w_1, F_1 \rangle F_1$$

Then v is orthogonal to $F_1$. Let $F_2 = v/\|v\|$.

• If $sp(F_1, F_2)$ is not $\Re^n$, we know there is a nonzero vector w in $(sp(F_1, F_2))^\perp$ from which we can construct a third vector $F_3$ mutually orthogonal to $F_1$ and $F_2$, just as we did before.

• If $sp(F_1, F_2, F_3)$ is not $\Re^n$, we do this again; otherwise, we are done.

This constructs $n-1$ mutually orthonormal vectors $F_2$ to $F_n$, which form a basis for $sp(N)^\perp$. Note any x in this subspace satisfies $\langle x, N \rangle = 0$.

On the other hand, if $A_1$ through $A_{n-1}$ is a linearly independent set in $\Re^n$, then $sp(A_1, \ldots, A_{n-1})$

is an $n-1$ dimensional subspace. To find the normal vector, note we are looking for a vector N so that

$$\begin{bmatrix} A_{11} & \ldots & A_{1n} \\ \vdots & \vdots & \vdots \\ A_{n-1,1} & \ldots & A_{n-1,n} \end{bmatrix} \begin{bmatrix} N_1 \\ \vdots \\ N_n \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$$

This is the same as

$$N_1 \begin{bmatrix} A_{11} \\ A_{21} \\ \vdots \\ A_{n-1,1} \end{bmatrix} + N_2 \begin{bmatrix} A_{12} \\ A_{22} \\ \vdots \\ A_{n-1,2} \end{bmatrix} + \ldots + N_n \begin{bmatrix} A_{1n} \\ A_{2n} \\ \vdots \\ A_{n-1,n} \end{bmatrix} = 0.$$


Letting

$$W_i = \begin{bmatrix} A_{1i} \\ A_{2i} \\ \vdots \\ A_{n-1,i} \end{bmatrix}$$

we see $W_1, \ldots, W_n$ is a set of n vectors in $\Re^{n-1}$ and so must be linearly dependent. Thus, the equation

$$N_1 W_1 + \ldots + N_n W_n = 0$$

must have a nonzero solution N, which will be the normal vector we seek.
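Here is a minimal MATLAB sketch of the orthonormal basis construction for $sp(N)^\perp$ described above; the random choice of w is our own device for getting a vector outside the current span, and the specific N is hypothetical.

n = 4;
N = [1; 2; -1; 3];            % a hypothetical normal vector
F = N/norm(N);                % the first column spans sp(N)
while size(F,2) < n
  w = rand(n,1);              % almost surely not in the current span
  v = w - F*(F'*w);           % subtract off the projections
  if norm(v) > 1e-8
    F = [F, v/norm(v)];       % keep the normalized remainder
  end
end
Fperp = F(:,2:end);           % an orthonormal basis for sp(N)^perp
norm(N'*Fperp)                % should be essentially zero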

Homework

Exercise 11.2.1

Exercise 11.2.2

Exercise 11.2.3

Exercise 11.2.4

Exercise 11.2.5

Now, given our function of n variables, we calculate the n partial derivatives at a given point $x_0$, giving $f_1(x_0)$ to $f_n(x_0)$. The vector

$$N = \begin{bmatrix} f_1(x_0) \\ \vdots \\ f_n(x_0) \end{bmatrix}$$

therefore determines a hyperplane like we discussed above, where the hyperplane is the set of all x in $\Re^n$ so that $\langle x, N \rangle = 0$.

Now what does this have to do with what we have called tangent planes for functions of two variables? Let's look at that case for some motivation. Recall the tangent plane to a surface $z = f(x, y)$ at the point $(x_0, y_0)$ was the plane determined by the tangent lines $T(x, y_0)$ and $T(x_0, y)$. The $T(x, y_0)$ line was determined by the vector

$$A = \begin{bmatrix} 1 \\ 0 \\ \frac{\partial f}{\partial x}(x_0, y_0) \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 2x_0 \end{bmatrix}$$

and the $T(x_0, y)$ line was determined by the vector

$$B = \begin{bmatrix} 0 \\ 1 \\ \frac{\partial f}{\partial y}(x_0, y_0) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 2y_0 \end{bmatrix}$$

(the specific values $2x_0$ and $2y_0$ here correspond to the model surface $z = x^2 + y^2$).


The vector perpendicular to both A and B is given by $A \times B$, but we can't use the cross product in dimension $n > 3$. The cross product gives

$$N = \begin{bmatrix} -f_x(x_0, y_0) \\ -f_y(x_0, y_0) \\ 1 \end{bmatrix}$$

The tangent plane to $z = f(x, y)$ at $(x_0, y_0, f(x_0, y_0))$ is defined to be the set of all $(x, y, z)$ in $\Re^3$ so that

$$\left\langle \begin{bmatrix} -f_x(x_0, y_0) \\ -f_y(x_0, y_0) \\ 1 \end{bmatrix}, \begin{bmatrix} x - x_0 \\ y - y_0 \\ z - f(x_0, y_0) \end{bmatrix} \right\rangle = 0$$

or, multiplying this out and rearranging,

$$z = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$$

We can use this as the motivation for defining the tangent plane to $z = f(x)$ at $x_0$ as follows:

Definition 11.2.2 The Tangent Plane to z = f(x) at x0 in $\Re^n$

The tangent plane in $\Re^n$ through the point $x_0$ for the surface $z = f(x)$ is defined as the set of all vectors x so that

$$\left\langle \begin{bmatrix} -f_1(x_0) \\ -f_2(x_0) \\ \vdots \\ -f_n(x_0) \\ 1 \end{bmatrix}, \begin{bmatrix} x_1 - x_{01} \\ x_2 - x_{02} \\ \vdots \\ x_n - x_{0n} \\ z - f(x_0) \end{bmatrix} \right\rangle = 0$$

or, multiplying this out and rearranging,

$$z = f(x_0) + f_1(x_0)(x_1 - x_{01}) + \ldots + f_n(x_0)(x_n - x_{0n})$$

We can use another compact definition at this point. We define the gradient of the function f to be the vector $\nabla f$, as follows.

Definition 11.2.3 The Gradient

The gradient of the function $z = f(x)$ at the point $x_0$ is defined to be the vector $\nabla f(x_0)$ where

$$\nabla f(x_0) = \begin{bmatrix} f_1(x_0) \\ f_2(x_0) \\ \vdots \\ f_n(x_0) \end{bmatrix}$$

Note the gradient takes a scalar function argument and returns a vector answer.

Using the gradient, the tangent plane equation can be rewritten as

$$z = f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle = f(x_0) + (\nabla f(x_0))^T (x - x_0)$$


It is also traditional to abbreviate $\nabla f(x_0)$ by $\nabla f_0$. Then, the tangent plane equation becomes

$$z = f(x_0) + (\nabla f_0)^T (x - x_0)$$

The obvious question to ask now is: how much of a discrepancy is there between the value f(x) and the value of the tangent plane?

Example 11.2.1 Find the gradient of $w = f(x, y, z) = x^2 + 4xyz + 9y^2 + 20z^4$ and the equation of the tangent plane to this surface at the point (1, 2, 1).

Solution

$$\nabla f(x, y, z) = \begin{bmatrix} 2x + 4yz \\ 4xz + 18y \\ 4xy + 80z^3 \end{bmatrix} \Longrightarrow \nabla f(1, 2, 1) = \begin{bmatrix} 2 + 8 \\ 4 + 36 \\ 8 + 80 \end{bmatrix} = \begin{bmatrix} 10 \\ 40 \\ 88 \end{bmatrix}$$

The equation of the tangent plane at (1, 2, 1) is then

$$w = f(1, 2, 1) + \left\langle \begin{bmatrix} 10 \\ 40 \\ 88 \end{bmatrix}, \begin{bmatrix} x - 1 \\ y - 2 \\ z - 1 \end{bmatrix} \right\rangle = 65 + 10(x - 1) + 40(y - 2) + 88(z - 1)$$

Homework

Exercise 11.2.6

Exercise 11.2.7

Exercise 11.2.8

Exercise 11.2.9

Exercise 11.2.10

11.3 Derivatives for Scalar Functions of n Variables

Let $f : D \subset \Re^n \to \Re$ be a function of n variables. As long as f(x) is defined locally at $x_0$, the partials $f_1(x_0)$ through $f_n(x_0)$ exist if and only if there are error functions $E_1$ through $E_n$ so that

$$f(x_0 + h_1 E_1) = f(x_0) + f_1(x_0)\, h_1 + E_1(x_0, h_1)$$
$$f(x_0 + h_2 E_2) = f(x_0) + f_2(x_0)\, h_2 + E_2(x_0, h_2)$$
$$\vdots$$
$$f(x_0 + h_n E_n) = f(x_0) + f_n(x_0)\, h_n + E_n(x_0, h_n)$$

with $E_j \to 0$ and $E_j/h_j \to 0$ as $h_j \to 0$ for all indices j. This suggests the correct definition for the differentiability of a function of n variables.

Definition 11.3.1 Error Form of Differentiability for Scalar Functions of n Variables


Let $f : D \subset \Re^n \to \Re$ be a function of n variables. If f(x) is defined locally at $x_0$, then f is differentiable at $x_0$ if there is a vector L in $\Re^n$ so that the error function $E(x_0, x) = f(x) - f(x_0) - \langle L, x - x_0 \rangle$ satisfies $\lim_{x \to x_0} E(x_0, x) = 0$ and $\lim_{x \to x_0} E(x_0, x)/\|x - x_0\| = 0$. We define the derivative of f at $x_0$ to be the linear map whose value at x is $L^T(x - x_0)$. We denote this map by $Df(x_0) = L^T$.

Note if f is differentiable at $x_0$, f must be continuous at $x_0$. The argument is simple:

$$f(x) = f(x_0) + \langle L, x - x_0 \rangle + E(x_0, x)$$

and as $x \to x_0$, we have $f(x) \to f(x_0)$, which is the definition of f being continuous at $x_0$. Hence, we can say

Theorem 11.3.1 Differentiable Implies Continuous: Scalar Function of n Variables

If f is differentiable at x0 then f is continuous at x0.

Proof 11.3.1
We have sketched the argument already.

Comment 11.3.1 At each point x where Df(x) exists, we get a vector L(x). Hence, the derivative of a function of n variables defines a mapping from $\Re^n$ to $\Re^n$ by

$$Df(x) = (L(x))^T$$

Comment 11.3.2 If f is differentiable at $x_0$, then there is a vector L so that

$$f(x) = f(x_0) + L^T(x - x_0) + E(x_0, x)$$

If we have chosen a basis E for $\Re^n$, then we have

$$f(x) = f(x_0) + [L]_E^T\, [x - x_0]_E + E([x_0]_E, [x]_E)$$

From this definition, we can show if the scalar function f is differentiable at the point $x_0$, then $L_i = f_i(x_0)$ for all indices i. The argument goes like this: since f is differentiable at $x_0$, we can focus on what happens as we move from $x_0$ to $x_0 + h_iE_i$. Hence, all the components of $x_0 + h_iE_i$ except the ith are the same as the ones in $x_0$. From the definition of differentiability, we then have

$$\lim_{h_i \to 0} \frac{f(x_0 + h_iE_i) - f(x_0) - L_i\, h_i}{|h_i|} = 0,$$

as $\|x_0 + h_iE_i - x_0\| = |h_i|$. For $h_i > 0$, we find

$$\lim_{h_i \to 0^+} \frac{f(x_0 + h_iE_i) - f(x_0) - L_i\, h_i}{h_i} = 0.$$

Hence $(f_i(x_0))^+ = L_i$. Similarly, if $h_i < 0$, we have

$$\lim_{h_i \to 0^-} \frac{f(x_0 + h_iE_i) - f(x_0) - L_i\, h_i}{-h_i} = 0.$$

Hence $(f_i(x_0))^- = L_i$ as well. Combining, we see $f_i(x_0) = L_i$. The arguments for the other indices are much the same. Hence, we can say if the scalar function of n variables f is differentiable


at $x_0$, then $f_i(x_0)$ exists for each i and

$$f(x) = f(x_0) + \sum_{i=1}^{n} f_i(x_0)(x_i - x_{0i}) + E_f(x_0, x)$$

where $E_f(x_0, x) \to 0$ and $E_f(x_0, x)/\|x - x_0\| \to 0$ as $x \to x_0$. Note this argument is a pointwise argument. It only tells us that differentiability at a point implies the existence of the partial derivatives at that point.

Theorem 11.3.2 The Scalar Function f is Differentiable Implies $Df = (\nabla f)^T$

If f is differentiable at $x_0$, then $Df(x_0) = (\nabla f(x_0))^T$.

Proof 11.3.2
We have just proven this.

Next, we look at the version of the chain rule for a scalar function of n variables.

11.3.1 The Chain Rule for Scalar Functions of n Variables

Let's review how you prove the chain rule. We assume there are two functions u(x, y) and v(x, y) defined locally about $(x_0, y_0)$ and that there is a third function f(u, v) which is defined locally around $(u_0 = u(x_0, y_0), v_0 = v(x_0, y_0))$. Now assume f(u, v) is differentiable at $(u_0, v_0)$ and u(x, y) and v(x, y) are differentiable at $(x_0, y_0)$. Then we can say

$$u(x, y) = u(x_0, y_0) + u_x(x_0, y_0)(x - x_0) + u_y(x_0, y_0)(y - y_0) + E_u(x_0, y_0, x, y)$$
$$v(x, y) = v(x_0, y_0) + v_x(x_0, y_0)(x - x_0) + v_y(x_0, y_0)(y - y_0) + E_v(x_0, y_0, x, y)$$
$$f(u, v) = f(u_0, v_0) + f_u(u_0, v_0)(u - u_0) + f_v(u_0, v_0)(v - v_0) + E_f(u_0, v_0, u, v)$$

where all the error terms behave as usual as $(x, y) \to (x_0, y_0)$ and $(u, v) \to (u_0, v_0)$. Note that as $(x, y) \to (x_0, y_0)$, $u(x, y) \to u_0 = u(x_0, y_0)$ and $v(x, y) \to v_0 = v(x_0, y_0)$, as u and v are continuous at $(x_0, y_0)$ since they are differentiable there. Let's consider the partial of f with respect to x. Let $\Delta u = u(x_0 + \Delta x, y_0) - u(x_0, y_0)$ and $\Delta v = v(x_0 + \Delta x, y_0) - v(x_0, y_0)$. Thus, $u_0 + \Delta u = u(x_0 + \Delta x, y_0)$ and $v_0 + \Delta v = v(x_0 + \Delta x, y_0)$. Hence,

$$\frac{f(u_0 + \Delta u, v_0 + \Delta v) - f(u_0, v_0)}{\Delta x} = \frac{f_u(u_0, v_0)(u - u_0) + f_v(u_0, v_0)(v - v_0) + E_f(u_0, v_0, u, v)}{\Delta x}$$
$$= f_u(u_0, v_0)\, \frac{u - u_0}{\Delta x} + f_v(u_0, v_0)\, \frac{v - v_0}{\Delta x} + \frac{E_f(u_0, v_0, u, v)}{\Delta x}$$

Continuing,

$$\frac{f(u_0 + \Delta u, v_0 + \Delta v) - f(u_0, v_0)}{\Delta x} = f_u(u_0, v_0)\, \frac{u_x(x_0, y_0)(x - x_0) + E_u(x_0, y_0, x, y)}{\Delta x} + f_v(u_0, v_0)\, \frac{v_x(x_0, y_0)(x - x_0) + E_v(x_0, y_0, x, y)}{\Delta x} + \frac{E_f(u_0, v_0, u, v)}{\Delta x}$$
$$= f_u(u_0, v_0)\, u_x(x_0, y_0) + f_v(u_0, v_0)\, v_x(x_0, y_0) + f_u(u_0, v_0)\, \frac{E_u(x_0, y_0, x, y)}{\Delta x} + f_v(u_0, v_0)\, \frac{E_v(x_0, y_0, x, y)}{\Delta x} + \frac{E_f(u_0, v_0, u, v)}{\Delta x}.$$


If f was locally constant, then

$$E_f(u_0, v_0, u, v) = f(u, v) - f(u_0, v_0) - f_u(u_0, v_0)(u - u_0) - f_v(u_0, v_0)(v - v_0) = 0$$

We know

$$\lim_{(a, b) \to (u_0, v_0)} \frac{E_f(u_0, v_0, a, b)}{\|(a, b) - (u_0, v_0)\|} = 0$$

and this is true no matter what sequence $(a_n, b_n)$ we choose that converges to $(u(x_0, y_0), v(x_0, y_0))$. If f is not locally constant, a little thought shows locally about $(u(x_0, y_0), v(x_0, y_0))$ there are sequences $(u_n, v_n) = (u(x_n, y_n), v(x_n, y_n)) \to (u_0, v_0) = (u(x_0, y_0), v(x_0, y_0))$ with $(u_n, v_n) \neq (u_0, v_0)$ for all n. Also, by continuity, as $\Delta x \to 0$, $(u_n, v_n) \to (u_0, v_0)$ too. For $(u_n, v_n)$ from this sequence, we then have

$$\lim_{\Delta x \to 0} \frac{E_f(u_0, v_0, u, v)}{\Delta x} = \left(\lim_{n \to \infty} \frac{E_f(u_0, v_0, u_n, v_n)}{\|(u_n, v_n) - (u_0, v_0)\|}\right) \left(\lim_{\Delta x \to 0} \frac{\|(u_n, v_n) - (u_0, v_0)\|}{\Delta x}\right)$$

We know the first term goes to zero by the properties of $E_f$. Expand the second term to get

$$\lim_{\Delta x \to 0} \frac{\|(u_n, v_n) - (u_0, v_0)\|}{\Delta x} = \lim_{\Delta x \to 0} \frac{\|(u_x(x_0, y_0)\Delta x + E_u(x_0, y_0, x, y),\ v_x(x_0, y_0)\Delta x + E_v(x_0, y_0, x, y))\|}{\Delta x} \leq |u_x(x_0, y_0)| + |v_x(x_0, y_0)|$$

Hence, the product limit is zero. The other error terms go to zero also as $(x, y) \to (x_0, y_0)$. Hence, we conclude

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x}.$$

A similar argument shows

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial y}.$$

This result is known as the Chain Rule, and we will find there is a better way to package this. Now assume there are more variables. We assume there are three functions $u_1(x_1, x_2, x_3)$, $u_2(x_1, x_2, x_3)$ and $u_3(x_1, x_2, x_3)$ defined locally about $(x_{10}, x_{20}, x_{30})$ and that there is another function $f(u_1, u_2, u_3)$ which is defined locally around $u_{i0} = u_i(x_{10}, x_{20}, x_{30})$. Now assume $f(u_1, u_2, u_3)$ is differentiable at $(u_{10}, u_{20}, u_{30})$ and the $u_i(x_1, x_2, x_3)$ are differentiable at $(x_{10}, x_{20}, x_{30})$. Then we can say

$$u_i(x_1, x_2, x_3) = u_i(x_{10}, x_{20}, x_{30}) + u_{i,x_1}(x_{10}, x_{20}, x_{30})(x_1 - x_{10}) + u_{i,x_2}(x_{10}, x_{20}, x_{30})(x_2 - x_{20}) + u_{i,x_3}(x_{10}, x_{20}, x_{30})(x_3 - x_{30}) + E_{u_i}(x_{10}, x_{20}, x_{30}, x_1, x_2, x_3)$$

where all the error terms behave as usual as $(x_1, x_2, x_3) \to (x_{10}, x_{20}, x_{30})$ and $(u_1, u_2, u_3) \to (u_{10}, u_{20}, u_{30})$. This is really messy to write down! To save space, let $x = (x_1, x_2, x_3)$, $u = (u_1, u_2, u_3)$, etc. Then we can rewrite the above as

$$u_i(x) = u_i(x_0) + u_{i,x_1}(x_0)(x_1 - x_{10}) + u_{i,x_2}(x_0)(x_2 - x_{20}) + u_{i,x_3}(x_0)(x_3 - x_{30}) + E_{u_i}(x_0, x)$$

Again, note as $x \to x_0$, $u_i(x) \to u_{i0} = u_i(x_0)$, as $u_i$ is continuous at $x_0$ since it is differentiable there. Let's consider the partial of f with respect to $x_j$. Let $\Delta u_i = u_i(x_0 + h_jE_j) - u_i(x_0)$. Thus, $u_{i0} + \Delta u_i = u_i(x_0 + h_jE_j)$. So we are using $h_j$ to denote the change in $x_j$ which induces changes in the $u_i$. We see

$$\frac{f(u_{10} + \Delta u_1, u_{20} + \Delta u_2, u_{30} + \Delta u_3) - f(u_0)}{h_j} = \frac{f_{u_1}(u_0)(u_1 - u_{10}) + f_{u_2}(u_0)(u_2 - u_{20}) + f_{u_3}(u_0)(u_3 - u_{30})}{h_j} + \frac{E_f(u_0, u)}{h_j}$$
$$= \sum_{i=1}^{3} \frac{f_{u_i}(u_0)(u_i - u_{i0})}{h_j} + \frac{E_f(u_0, u)}{h_j}$$

Now replace $u_i - u_{i0}$ by its expansion; since only the jth slot of x changes,

$$u_i - u_{i0} = \sum_{k=1}^{3} u_{i,x_k}(x_0)(x_k - x_{k0}) + E_{u_i}(x_0, x) = u_{i,x_j}(x_0)(x_j - x_{j0}) + E_{u_i}(x_0, x) = u_{i,x_j}(x_0)\, h_j + E_{u_i}(x_0, x)$$

This gives

$$\frac{f(u_{10} + \Delta u_1, u_{20} + \Delta u_2, u_{30} + \Delta u_3) - f(u_0)}{h_j} = \sum_{i=1}^{3} f_{u_i}(u_0)\, u_{i,x_j}(x_0) + \sum_{i=1}^{3} f_{u_i}(u_0)\, \frac{E_{u_i}(x_0, x)}{h_j} + \frac{E_f(u_0, u)}{h_j}$$

As $h_j \to 0$, $\Delta u_i \to 0$, and using the arguments we used in the two variable case, it is straightforward to show $E_f(u_0, u)/h_j \to 0$. The other two error terms go to zero also as $x \to x_0$. Hence, we conclude

$$\frac{\partial f}{\partial x_j} = \sum_{i=1}^{3} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x_j}$$

It is straightforward to see how to extend this result to n variables. Just a bit messy! The notation here is quite messy and confusing. Using $\frac{\partial}{\partial (\text{arg}: k)}$ doesn't work well, as we have arg terms for both the $x_j$ and $u_i$ variables. Also, in our argument above we had $u : D_u \subset \Re^3 \to \Re^3$ with u defined locally about $x_0$. Hence, u is a function of $(x_1, \ldots, x_3)$ with $u(x_1, \ldots, x_3) = (u_1, \ldots, u_3) \in \Re^3$. Further, we assumed there is another function $f : D_f \subset \Re^3 \to \Re$ which is defined locally around $u_0 = u(x_0) \in \Re^3$, and we assumed f(u) is differentiable at $u_0$ and u(x) is differentiable at $x_0$. However, we could be more general: we could assume $u : D_u \subset \Re^n \to \Re^m$ with u defined locally about $x_0$. Hence, u is a function of $(x_1, \ldots, x_n)$ with $u(x_1, \ldots, x_n) = (u_1, \ldots, u_m) \in \Re^m$. Further, assume there is another function $f : D_f \subset \Re^m \to \Re$ which is defined locally around $u_0 = u(x_0) \in \Re^m$. Further assume f(u) is differentiable at $u_0$ and u(x) is differentiable at $x_0$. The arguments are quite similar and we find

$$\frac{\partial f}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x_j}$$


This leads to our theorem.

Theorem 11.3.3 The Chain Rule for Scalar Functions of n Variables

Assume $u : D_u \subset \Re^n \to \Re^m$ with u defined locally about $x_0$. Hence, u is a function of $(x_1, \ldots, x_n)$ with $u(x_1, \ldots, x_n) = (u_1, \ldots, u_m) \in \Re^m$. Further, assume there is another function $f : D_f \subset \Re^m \to \Re$ which is defined locally around $u_0 = u(x_0) \in \Re^m$. Further assume f(u) is differentiable at $u_0$ and u(x) is differentiable at $x_0$. Then the partial of the composition with respect to $x_j$ exists at $x_0$ and is given by

$$\frac{\partial f}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x_j}$$

Proof 11.3.3
We have just gone over the argument.

Comment 11.3.3 With a little bit of work, these arguments also prove the differentiability of the composition. In the two variable case, let $g(x, y) = f(u(x, y), v(x, y))$, i.e. $g = f \circ U$ where U is the vector function with components u and v, and define the error

$$E_g = f_u(u_0, v_0)\, E_u(x_0, y_0, x, y) + f_v(u_0, v_0)\, E_v(x_0, y_0, x, y) + E_f(u_0, v_0, u, v).$$

It is clear from the assumptions that $E_g(x_0, y_0, x, y) \to 0$ as $\Delta x \to 0$ and $E_g(x_0, y_0, x, y) \to 0$ as $\Delta y \to 0$. The arguments above show $\frac{E_g(x_0, y_0, x, y)}{\Delta x} \to 0$ as $\Delta x \to 0$ and $\frac{E_g(x_0, y_0, x, y)}{\Delta y} \to 0$ as $\Delta y \to 0$. This shows $E_g$ is the error function for the composition $g = f \circ U$. Hence $g = f \circ U$ is differentiable at $(x_0, y_0)$ with

$$Dg(x_0, y_0) = \begin{bmatrix} f_u(u_0, v_0) \\ f_v(u_0, v_0) \end{bmatrix}^T \begin{bmatrix} u_x(x_0, y_0) & u_y(x_0, y_0) \\ v_x(x_0, y_0) & v_y(x_0, y_0) \end{bmatrix} = Df(u_0, v_0) \begin{bmatrix} u_x(x_0, y_0) & u_y(x_0, y_0) \\ v_x(x_0, y_0) & v_y(x_0, y_0) \end{bmatrix}$$

We will discuss this further.

Example 11.3.1 Let $f(x, y, z) = x^2 + 2x + 5y^4 + z^2$. Then if $x = r\sin(\phi)\cos(\theta)$, $y = r\sin(\phi)\sin(\theta)$ and $z = r\cos(\phi)$, using the chain rule, we find

$$\frac{\partial f}{\partial r} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial r} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial r} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial r}$$
$$\frac{\partial f}{\partial \theta} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial \theta} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial \theta} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial \theta}$$
$$\frac{\partial f}{\partial \phi} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial \phi} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial \phi} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial \phi}$$

This becomes

$$\frac{\partial f}{\partial r} = \frac{\partial f}{\partial x}\big(\sin(\phi)\cos(\theta)\big) + \frac{\partial f}{\partial y}\big(\sin(\phi)\sin(\theta)\big) + \frac{\partial f}{\partial z}\big(\cos(\phi)\big)$$
$$\frac{\partial f}{\partial \theta} = \frac{\partial f}{\partial x}\big(-r\sin(\phi)\sin(\theta)\big) + \frac{\partial f}{\partial y}\big(r\sin(\phi)\cos(\theta)\big) + \frac{\partial f}{\partial z}\,(0)$$
$$\frac{\partial f}{\partial \phi} = \frac{\partial f}{\partial x}\big(r\cos(\phi)\cos(\theta)\big) + \frac{\partial f}{\partial y}\big(r\cos(\phi)\sin(\theta)\big) + \frac{\partial f}{\partial z}\big(-r\sin(\phi)\big)$$

You can then substitute in for x, y and z to get the final answer in terms of r, θ and φ.
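If you have the Symbolic Math Toolbox available (an assumption on our part), you can confirm the ∂f/∂r calculation directly:

syms r theta phi
x = r*sin(phi)*cos(theta);
y = r*sin(phi)*sin(theta);
z = r*cos(phi);
f = x^2 + 2*x + 5*y^4 + z^2;
% chain rule pieces: fx = 2x + 2, fy = 20y^3, fz = 2z
chain = (2*x + 2)*diff(x,r) + 20*y^3*diff(y,r) + 2*z*diff(z,r);
simplify(diff(f,r) - chain)     % expect 0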

Example 11.3.2 Let $u = x^2 + 2y^2 + z^2$ and $v = 4x^2 - 5y^2 + 7z$ and $f(u, v) = 10u^2v^4$. Then, by the chain rule,

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x}, \quad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial y}, \quad \frac{\partial f}{\partial z} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial z} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial z}$$

This becomes

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}(2x) + \frac{\partial f}{\partial v}(8x), \quad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial u}(4y) + \frac{\partial f}{\partial v}(-10y), \quad \frac{\partial f}{\partial z} = \frac{\partial f}{\partial u}(2z) + \frac{\partial f}{\partial v}(7)$$

You can then substitute in for u and v to get the final answer in terms of x, y and z.
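A quick numerical spot check of the ∂f/∂x formula at the hypothetical test point (x, y, z) = (1, 1, 1), using a central difference with a step size of our own choosing:

u = @(p) p(1)^2 + 2*p(2)^2 + p(3)^2;
v = @(p) 4*p(1)^2 - 5*p(2)^2 + 7*p(3);
g = @(p) 10*u(p)^2*v(p)^4;
p0 = [1; 1; 1]; h = 1e-6; e1 = [1; 0; 0];
numeric = (g(p0 + h*e1) - g(p0 - h*e1))/(2*h);
% chain rule: fu*(2x) + fv*(8x) with fu = 20*u*v^4 and fv = 40*u^2*v^3
u0 = u(p0); v0 = v(p0);
exact = 20*u0*v0^4*(2) + 40*u0^2*v0^3*(8);
[numeric, exact]                % the two values should agree closely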

11.3.1.1 Homework

Exercise 11.3.1

Exercise 11.3.2

Exercise 11.3.3

11.4 Partials and Differentiability

First, let's talk about some issues with smoothness for partials. We cover the following examples in (Peterson (8) 2019) also, but it is worth seeing them again. These examples are intense, but you learn a lot by working through them.

11.4.1 Partials Can Exist but not be Continuous

We start with the function f defined by

$$f(x, y) = \begin{cases} \dfrac{2xy}{\sqrt{x^2 + y^2}}, & (x, y) \neq (0, 0) \\ 0, & (x, y) = (0, 0) \end{cases}$$

This function has a removeable discontinuity at (0, 0) because

$$\left|\frac{2xy}{\sqrt{x^2 + y^2}} - 0\right| = \frac{2|x|\,|y|}{\sqrt{x^2 + y^2}} \leq \frac{2\sqrt{x^2 + y^2}\,\sqrt{x^2 + y^2}}{\sqrt{x^2 + y^2}} = 2\sqrt{x^2 + y^2}$$

because $|x| \leq \sqrt{x^2 + y^2}$ and $|y| \leq \sqrt{x^2 + y^2}$. Pick $\varepsilon > 0$ arbitrarily. Then if $\sqrt{x^2 + y^2} < \delta$ with $\delta < \varepsilon/2$, we have

$$\left|\frac{2xy}{\sqrt{x^2 + y^2}} - 0\right| < 2\,\frac{\varepsilon}{2} = \varepsilon$$

which proves that the limit exists and equals 0 as $(x, y) \to (0, 0)$. This tells us we can define f(0, 0) to match this value. Thus, as said, this is a removeable discontinuity.

The first order partials both exist at (0, 0), as is seen by an easy limit:

$$f_x(0, 0) = \lim_{x \to 0} \frac{f(x, 0) - f(0, 0)}{x} = \lim_{x \to 0} \frac{0 - 0}{x} = 0,$$
$$f_y(0, 0) = \lim_{y \to 0} \frac{f(0, y) - f(0, 0)}{y} = \lim_{y \to 0} \frac{0 - 0}{y} = 0.$$

We can also calculate the first order partials at any $(x, y) \neq (0, 0)$:

$$f_x(x, y) = \frac{2y(x^2 + y^2)^{1/2} - 2xy\,(1/2)(x^2 + y^2)^{-1/2}\,2x}{x^2 + y^2} = \frac{2y(x^2 + y^2) - 2x^2y}{(x^2 + y^2)^{3/2}} = \frac{2y^3}{(x^2 + y^2)^{3/2}}$$

$$f_y(x, y) = \frac{2x(x^2 + y^2)^{1/2} - 2xy\,(1/2)(x^2 + y^2)^{-1/2}\,2y}{x^2 + y^2} = \frac{2x(x^2 + y^2) - 2xy^2}{(x^2 + y^2)^{3/2}} = \frac{2x^3}{(x^2 + y^2)^{3/2}}$$

We can show $\lim_{(x,y) \to (0,0)} f_x(x, y)$ and $\lim_{(x,y) \to (0,0)} f_y(x, y)$ do not exist as follows. We will find paths (x, mx) for $m \neq 0$ where we get different values for the limit depending on the value of m. We pick the path (x, 5x) for $x > 0$. On this path, we find

$$\lim_{x \to 0^+} f_x(x, 5x) = \lim_{x \to 0^+} \frac{250\,x^3}{(26)^{3/2}\,|x|^3} = \frac{250}{26\sqrt{26}}$$

Then, for example, using $m = -5$ on the path (x, −5x) for $x > 0$, we get

$$\lim_{x \to 0^+} f_x(x, -5x) = \lim_{x \to 0^+} \frac{-250\,x^3}{(26)^{3/2}\,|x|^3} = \frac{-250}{26\sqrt{26}}$$

Since these are not the same, this limit cannot exist. Hence $f_x$ is not continuous at the origin. A similar argument shows $\lim_{(x,y) \to (0,0)} f_y(x, y)$ does not exist either. So we have shown the first order partials exist locally around (0, 0), but $f_x$ and $f_y$ are not continuous at (0, 0).
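You can see this path dependence numerically; here is a short sketch evaluating $f_x$ along the two paths:

fx = @(x,y) 2*y.^3./(x.^2 + y.^2).^(3/2);
x = logspace(-1,-6,6)';
[fx(x,5*x), fx(x,-5*x)]   % the two columns sit at +250/(26*sqrt(26))
                          % and its negative for every x > 0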

Is f differentiable at (0, 0)? If so, there are numbers $L_1$ and $L_2$ so that two things happen:

$$\lim_{(x,y) \to (0,0)} \left(\frac{2xy}{\sqrt{x^2 + y^2}} - L_1 x - L_2 y\right) = 0$$

$$\lim_{(x,y) \to (0,0)} \frac{1}{\sqrt{x^2 + y^2}}\left(\frac{2xy}{\sqrt{x^2 + y^2}} - L_1 x - L_2 y\right) = 0$$

The first limit is 0 because

$$\left|\frac{2xy}{\sqrt{x^2 + y^2}} - L_1 x - L_2 y - 0\right| \leq \frac{2|x||y|}{\sqrt{x^2 + y^2}} + |L_1|\sqrt{x^2 + y^2} + |L_2|\sqrt{x^2 + y^2}$$
$$\leq \frac{2\sqrt{x^2 + y^2}\,\sqrt{x^2 + y^2}}{\sqrt{x^2 + y^2}} + |L_1|\sqrt{x^2 + y^2} + |L_2|\sqrt{x^2 + y^2} \leq (2 + |L_1| + |L_2|)\sqrt{x^2 + y^2}$$

which goes to 0 as $(x, y) \to (0, 0)$. So that shows the first requirement is met.

However, the second requirement is not. Look at the paths (x, mx) and consider $\lim_{x \to 0^+}$ like we did before. We find, since $|x| = x$ here,

$$\lim_{x \to 0^+} \frac{1}{x\sqrt{1 + m^2}}\left(\frac{2mx^2}{x\sqrt{1 + m^2}} - L_1 x - mL_2 x\right) = \frac{1}{\sqrt{1 + m^2}}\left(\frac{2m}{\sqrt{1 + m^2}} - L_1 - L_2 m\right) = \frac{2m}{1 + m^2} - \frac{L_1 + L_2 m}{\sqrt{1 + m^2}}$$

Now look at the path (x, −mx) for $x > 0$. This is the limit from the other side. We have

$$\lim_{x \to 0^+} \frac{1}{x\sqrt{1 + m^2}}\left(\frac{-2mx^2}{x\sqrt{1 + m^2}} - L_1 x + mL_2 x\right) = \frac{1}{\sqrt{1 + m^2}}\left(\frac{-2m}{\sqrt{1 + m^2}} - L_1 + L_2 m\right) = \frac{-2m}{1 + m^2} + \frac{-L_1 + L_2 m}{\sqrt{1 + m^2}}$$

If these limits were equal, we would have

$$\frac{2m}{1 + m^2} + \frac{-L_1 - L_2 m}{\sqrt{1 + m^2}} = \frac{-2m}{1 + m^2} + \frac{-L_1 + L_2 m}{\sqrt{1 + m^2}}$$

This implies $L_2 = 2/\sqrt{1 + m^2}$, which tells us $L_2$ depends on m. This value should be independent of the choice of path, and so this function cannot be differentiable at (0, 0). We see that just because the first order partials exist locally in some $B_r(0, 0)$ does not necessarily tell us f is differentiable at (0, 0). So differentiability at $(x_0, y_0)$ implies the first order partials exist at $(x_0, y_0)$ and $L_1 = f_x(x_0, y_0)$ and $L_2 = f_y(x_0, y_0)$. But the converse is not true in general.

We need to figure out the appropriate assumptions about the smoothness of f that will guarantee the existence of the derivative of f in this multivariable situation. We will do this for n dimensions, as it is just as easy as doing the proof in $\Re^2$.

Theorem 11.4.1 Sufficient Conditions to Guarantee Differentiability

Let $f : D \subset \Re^n \to \Re$ and assume $\nabla f$ exists and is continuous in a ball $B_r(p)$ of some radius $r > 0$ around the point $p = [p_1, \ldots, p_n]^T$ in $\Re^n$. Then f is differentiable at each point in $B_r(p)$.

Proof 11.4.1
By definition of $\frac{\partial f}{\partial x_i}$ at p,

$$f(p + hE_i) - f(p) = \frac{\partial f}{\partial x_i}(p)\, h + E_i(p, p + hE_i)$$

Now if x is in $B_r(p)$, all the first order partials of f are continuous. We have

$$x = p + (x_1 - p_1)E_1 + (x_2 - p_2)E_2 + \ldots + (x_n - p_n)E_n$$


Now let $r_i = x_i - p_i$. Then we can write (this is the telescoping sum trick!)

$$f(x) - f(p) = \left(f\Big(p + \sum_{i=1}^{n} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{n-1} r_iE_i\Big)\right) + \left(f\Big(p + \sum_{i=1}^{n-1} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{n-2} r_iE_i\Big)\right)$$
$$+ \left(f\Big(p + \sum_{i=1}^{n-2} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{n-3} r_iE_i\Big)\right) + \ldots + \left(f\Big(p + \sum_{i=1}^{2} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{1} r_iE_i\Big)\right) + \left(f\Big(p + \sum_{i=1}^{1} r_iE_i\Big) - f(p)\right)$$

To help you see this better, if n = 3, this expansion would look like

$$f(x) - f(p) = f\Big(p + \sum_{i=1}^{3} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{2} r_iE_i\Big) + f\Big(p + \sum_{i=1}^{2} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{1} r_iE_i\Big) + f\Big(p + \sum_{i=1}^{1} r_iE_i\Big) - f(p)$$

or

$$f(x) - f(p) = \underbrace{f(p_1 + r_1, p_2 + r_2, p_3 + r_3) - f(p_1 + r_1, p_2 + r_2, p_3)}_{\Delta \text{ slot } 3} + \underbrace{f(p_1 + r_1, p_2 + r_2, p_3) - f(p_1 + r_1, p_2, p_3)}_{\Delta \text{ slot } 2} + \underbrace{f(p_1 + r_1, p_2, p_3) - f(p_1, p_2, p_3)}_{\Delta \text{ slot } 1}$$

So we have

$$f(x) - f(p) = \sum_{j=n}^{1} \left(f\Big(p + \sum_{i=1}^{j} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{j-1} r_iE_i\Big)\right)$$

where we interpret the empty sum $\sum_{i=1}^{0}$ as 0. To see this done without all this notation, you can look at the argument for the two variable version of this theorem in (Peterson (8) 2019). That argument is messy too, though, and in some ways this more abstract form is better!

Now apply the Mean Value Theorem in one variable to get

$$f\Big(p + \sum_{i=1}^{j} r_iE_i\Big) - f\Big(p + \sum_{i=1}^{j-1} r_iE_i\Big) = \frac{\partial f}{\partial x_j}(q_j)\, r_j$$

where $q_j$ is between $p + \sum_{i=1}^{j-1} r_iE_i$ and $p + \sum_{i=1}^{j} r_iE_i$; i.e. $q_j = p + \sum_{i=1}^{j-1} r_iE_i + s_jE_j$ with $|s_j| < |r_j| < r$. So

$$f(x) - f(p) = \sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(q_j)\, r_j$$

Since $\frac{\partial f}{\partial x_j}$ is continuous on $B_r(p)$, given $\varepsilon > 0$ arbitrary, there are $\delta_j > 0$ so that for $1 \leq j \leq n$, $\|x - p\| < \delta_j$ implies $|\frac{\partial f}{\partial x_j}(x) - \frac{\partial f}{\partial x_j}(p)| < \varepsilon/n$. Now suppose $\|x - p\| < \delta = \min\{\delta_1, \ldots, \delta_n\}$. As we mentioned earlier, the jth coordinate of $q_j$ is between $p_j$ and $p_j + r_j$, which implies each $q_j$ is in $B_r(p)$.


So we have $|\frac{\partial f}{\partial x_j}(q_j) - \frac{\partial f}{\partial x_j}(p)| < \varepsilon/n$. Thus, we find

$$\frac{\partial f}{\partial x_j}(p) - \varepsilon/n < \frac{\partial f}{\partial x_j}(q_j) < \frac{\partial f}{\partial x_j}(p) + \varepsilon/n$$

Now $f(x) - f(p) = \sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(q_j)\, r_j$. Let I be the set of indices where $r_j > 0$ and J be the other indices. Then $\sum_{j=n}^{1} = \sum_{j \in I} + \sum_{j \in J}$. For indices $j \in I$, multiplying the inequality by $r_j > 0$ causes no changes:

$$\left(\sum_{j \in I} \frac{\partial f}{\partial x_j}(p)\, r_j\right) - (\varepsilon/n)\sum_{j \in I} r_j < \sum_{j \in I} \frac{\partial f}{\partial x_j}(q_j)\, r_j < \left(\sum_{j \in I} \frac{\partial f}{\partial x_j}(p)\, r_j\right) + (\varepsilon/n)\sum_{j \in I} r_j$$

If $j \in J$, $r_j < 0$, and multiplying reverses the inequalities. We get

$$\left(\sum_{j \in J} \frac{\partial f}{\partial x_j}(p)\, r_j\right) + (\varepsilon/n)\sum_{j \in J} r_j < \sum_{j \in J} \frac{\partial f}{\partial x_j}(q_j)\, r_j < \left(\sum_{j \in J} \frac{\partial f}{\partial x_j}(p)\, r_j\right) - (\varepsilon/n)\sum_{j \in J} r_j$$

We can combine these two forms by noting that we can replace $r_j$ by $\text{sign}(r_j)\, r_j$ in each inequality. This gives

$$\left(\sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(p)\, r_j\right) - \sum_{j=n}^{1} \text{sign}(r_j)\,(\varepsilon/n)\, r_j < \sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(q_j)\, r_j = f(x) - f(p) < \left(\sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(p)\, r_j\right) + \sum_{j=n}^{1} \text{sign}(r_j)\,(\varepsilon/n)\, r_j$$

Thus,

$$\left|f(x) - f(p) - \sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(p)\, r_j\right| < (\varepsilon/n)\sum_{j=n}^{1} |r_j|$$

But $(\varepsilon/n)\sum_{j=n}^{1} |r_j| \leq \sum_{j=n}^{1} (\varepsilon/n)\|x - p\| = \varepsilon\|x - p\|$. We therefore have the estimate

$$\left|f(x) - f(p) - \sum_{j=n}^{1} \frac{\partial f}{\partial x_j}(p)\, r_j\right| < \varepsilon\|x - p\|$$

The term inside the absolute values is the error function E(x, p), and so we have $E(x, p)/\|x - p\| < \varepsilon$ if $\|x - p\| < \delta$. This tells us both $E(x, p)/\|x - p\|$ and E(x, p) go to 0 as $x \to p$. Hence, f is differentiable at p.

11.4.2 Higher Order Partials

We define the second order partials of f as follows.

Definition 11.4.1 Second Order Partials


If f(x), fxi(x) are defined locally at x0, we can attempt to find following limits:

limh→0

fxi(x0 + hEj)− fxi(x0

h

If the limit exists, it is called the partial of fxi with respect to xj and is denoted by fxi xj .Note, the order is potentially important as

limh→0

fxj (x0 + hEi)− fxi(x0

h

if it exists is the partial of fxj with respect to xi which is denoted by fxj xi . Of course,these do not have to match in general.

Comment 11.4.1 When these second order partials exist at x0, we use the following notations inter-changeably: fij = ∂xj (fxi), fji = ∂xi(fxj , As usual, these notations can get confusing.

The second order partials for a scalar function of n variables can be organized organized into a matrixcalled the Hessian.

Definition 11.4.2 The Hessian Matrix

If f(x, y), fx and fy are defined locally at (x0, y0) and if the second order partials exist at(x0, y0), we define the Hessian,H(x0) at (x0) to be the matrix

H(x0) =

fx1 x1(x0) . . . fx1 xn(x0)

......

...fxn x1

(x0) . . . fxn xn(x0)

=

f11(x0) . . . f1n(x0)...

......

fn1(x0) . . . fnn(x0)

Hence for a scalar function of n variables, we have the gradient,∇f which is the derivative of f ,Df ,in some circumstances and is an n× 1 column vector, and the Hessian of f which is a n× n matrix.If we have a vector function, it is messier. The analogue of the gradient for f : D ⊂ <n → <m isthe Jacobian of f , Jf given by

Jf (x0) =

f1,x1(x0) . . . f1, xn(x0)...

......

fm,x1(x0) . . . fm xn(x0)

=

f11(x0) . . . f1n(x0)...

......

fm1(x0) . . . fmn(x0)

where fi are the component functions of f . Note the Jacobian of f is the derivative of f , Df , insome circumstances and is an n×m matrix. You can see why the notation fij is so confusing. If f isa scalar function this has a different meaning than if f is a vector valued function. So context is all.We can also define higher order derivatives such as fxi xj xk which have more permutations. We willlet you work through those notations as the need arises in your work.

Example 11.4.1 Let f(x, y, z) = 2x− 8xy+ 4yz2. Find the first and second order partials of f andits Hessian.

Solution The partials are

fx(x, y, z) = 2− 8y, fy(x, y, z) = −8x+ 4z2, fz(x, y, z) = 8yz

fxx(x, y, z) = 0, fxy(x, y, z) = −8, fxz(x, y, z) = 0

fyx(x, y, z) = −8, fyy(x, y, z) = 0, fyz(x, y, z) = 8z

Page 212: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 204 — #212

204 CHAPTER 11. DIFFERENTIABILITY

fzx(x, y, z) = 0, fzy(x, y, z) = 8z, fzz(x, y, z) = 8y

and so the Hessian is

H(x, y, z) =

fxx(x, y, z) fxy(x, y, z) fxz(x, y, z)fyx(x, y, z) fyy(x, y, z) fyz(x, y, z)fzx(x, y, z) fzy(x, y, z) fzz(x, y, z)

=

0 −8 0−8 0 8z0 8z 8y

11.4.2.1 Homework

Exercise 11.4.1

Exercise 11.4.2

Exercise 11.4.3

Exercise 11.4.4

Exercise 11.4.5

11.4.3 When Do Mixed Partials Match?

Next, we want to know under what conditions the mixed order partials match. Here is a standardexample with a function of two variables showing the two mixed partials don’t agree at a point.

Example 11.4.2 Let

f(x, y) =

xy(x2−y2)x2+y2 , (x, y) 6= (0, 0)

0, (x, y) = (0, 0)

Prove the mixed order partials do not match.

• Prove lim(x,y)→(0,0) f(x, y) = 0 thereby showing f has a removeable discontinuity at (0, 0)and so the definition for f above makes sense.

Solution

lim(x,y)→(0,0)

∣∣∣∣xy(x2 − y2)

x2 + y2

∣∣∣∣ = lim(x,y)→(0,0)

∣∣∣∣xy(x+ y)

x2 + y2

∣∣∣∣|x− y| = lim(x,y)→(0,0)

∣∣∣∣x2y + y2x)

x2 + y2

∣∣∣∣|x− y|≤ lim

(x,y)→(0,0)

(∣∣∣∣ x2y

x2 + y2

∣∣∣∣+

∣∣∣∣ y2x

x2 + y2

∣∣∣∣)|x− y|≤ lim

(x,y)→(0,0)(|y|+ |x|)|x− y| ≤ 4(x2 + y2) = 0

Thus, f has a removeable discontinuity at (0, 0) of value 0.

• Compute fx(0, 0) and fy(0, 0)

Solution

fx(0, 0) = limh→0

f(h, 0)− f(0, 0)

h=

0

h= 0

fy(0, 0) = limh→0

f(0, h)− f(0, 0)

h=

0

h= 0

Page 213: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 205 — #213

11.4. PARTIALS AND DIFFERENTIABILITY 205

Also, at any (x, y) 6= (0, 0), we have

fx(x, y) =x4y + 4x2y3 − y5

(x2 + y2)2

fy(x, y) =−xy4 + x5 − 4x3y2

(x2 + y2)2

Then

lim(x,y)→(0,0)

∣∣∣∣yx4 + 4x2y3 − y5

(x2 + y2)2

∣∣∣∣≤ lim

(x,y)→(0,0)

(|y| x4

(x2 + y2)2+ |y| y4

(x2 + y2)2+ 4|y| x2

x2 + y2

y2

x2 + y2

)≤ lim

(x,y)→(0,0)(2|y|+ 4|y| ≤ 5

√x2 + y2 = 0

Hence fx has a removeable discontinuity at (0, 0). A similar calculation shows fy has aremoveable discontinuity at (0, 0).

• Compute ∂y(fx)(0, 0) and ∂x(fy)(0, 0). Are these equal?

Solution

fxy(0, 0) = limh→0

fx(0, h)− fx(0, 0)

h=−h5/h4

h= −1

fyx(0, 0) = limh→0

fy(0, h)− fx(0, 0)

h=h5/h4

h= 1

Clearly, these mixed partials do not match.

• Compute ∂y(fx)(x, y) and ∂x(fy)(x, y) for (x, y) 6= (0, 0).

Solution At any (x, y) 6= (0, 0), we have

fxy(x, y) =x6 − y6 − 9y4x2 + 9y2x4

(x2 + y2)3

fyx(x, y) =x6 − y6 + 9y4x2 − 9y2x4

(x2 + y2)3

• Determine if ∂y(fx)(x, y) and ∂x(fy)(x, y) are continuous at (0, 0).

Solution If you look at the paths (t, 2t) and (t, 3t) for t > 0, you find you get different limitingvalues as t → 0+ for both fxy and fyx. Hence, these mixed partials are not continuous at(0, 0).

We see the lack of continuity of the mixed partials in a ball about the origin is the problem.

This leads to our next theorem.

Theorem 11.4.2 Sufficient Conditions for the Mixed Order Partials to Match

Page 214: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 206 — #214

206 CHAPTER 11. DIFFERENTIABILITY

Let f : D ⊂ <n → < and fix i 6= j for the set 1, . . . , n. Assume all the first order

partials ∂f∂xi

and ∂f∂xj

and the second order partials ∂xj

(∂f∂xi

)and ∂xi

(∂f∂xj

)all exist in

Br(p) for some r > 0. Also assume ∂xj

(∂f∂xi

)and ∂xi

(∂f∂xj

)are continuous in Br(p).

Then ∂xj

(∂f∂xi

(p)

)= ∂xi

(∂f∂xj

)at p.

Proof 11.4.2Let’s control this notational mess by noticing we are only looking at changes in xi and xj . LetX1 = xi, X2 = xj and

F (X1, X2) = f(x1, . . . , xi−1,xi = X1, xi+1, . . . ,

xj−1,xj = X2, xj+1, . . . , xn)

Also, let F1 = ∂xif , F2 = ∂xjf and F12 = ∂xj (∂xif) and F21 = ∂xi(∂xjf). Now with thesevariables defined, the argument to prove the mixed partials match is essentially the same as the onein (Peterson (8) 2019). So even though there are n variables involved, we just have to focus ourattention on the two that matter. Fix h small enough for that p + hei + hej is in Br(p). Notep+ hei = X1 + h and p+ hej = X2 + h.Step One:Define

G(h) = F (X1 + h,X2 + h)− F (X1 + h,X2)− F (X1, X2 + h)− F (X1, X2)

φ(x) = F (x,X2 + h)− F (x,X2), ψ(y) = φ(X1 + y)− φ(X1)

Now apply the Mean Value Theorem to φ. Note φ′(x) = ∂X1F (x,X2 + h) − ∂X1F (x,X2) =fxi(p+ xei + hej)− fxi(p+ xei). Thus,

φ(X1 + h)− φ(X1) = φ′(X1 + θ1 h) h, θ1 between X1 and X1 + h

= (fxi(p+ θ1hei + hej)− fxi(p+ θ1hei))h

= (FX1(X1 + θ1h,X2 + h)− FX1(X1 + θ1h,X2))h

where we use the definition of F . Now define for y between X2 and X2 + h, the function ξ by

ξ(z) = FX1(X1 + θ1h, z)

ξ′(z) = ∂X2FX1

(X1 + θ1h, z)

Apply the Mean Value Theorem to ξ: there is a θ2 so that X2 + θ2h is between X2 and X2 + h with

ξ(X2 + h)− ξ(X2) = ∂X2FX1(X1 + θ1h,X2 + θ2h) h

Thus,

h (ξ(X2 + h)− ξ(X2)) = h (FX1(X1 + θ1h,X2 + h)− FX1

(X1 + θ1h,X2))

= ∂X2FX1

(X1 + θ1h,X2 + θ2h) h2

But,

φ(X1 + h)− φ(X1) = (F (X1 + h,X2 + h)− F (X1 + h,X2))

−(F (X1, X2 + h)− F (X1, X2))

Page 215: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 207 — #215

11.4. PARTIALS AND DIFFERENTIABILITY 207

= G(h)

= (FX1(X1 + θ1h,X2 + h)− FX1

(X1 + θ1h,X2))h

= (ξ(X2 + h)− ξ(X2))h

= ∂X2FX1

(X1 + θ1h,X2 + θ2h) h2

So

G(h) = ∂X2FX1(X1 + θ1h,X2 + θ2h) h2

Step Two:Now start again with L(w) = F (X1 + h,w) − F (X1, w), noting G(h) = L(X2 + h) − L(X2).Apply the Mean Value Theorem to L like before to find

G(h) = L(X2 + h)− L(X2) = L′(X2 + θ3h)h

= (∂X2F (X1 + h,X2 + θ3h)− ∂X2

F (X1, X2 + θ3h)) h

for an intermediate value of θ3 with X2 + θ3h between X2 and X2 + h.

Now define for z between X2 and X2 + h, the function γ by

γ(z) = FX2(z,X2 + θ3h)

γ′(z) = ∂X1FX2(z,X2 + θ3h)

Now apply the Mean Value Theorem a fourth time: So there is a θ4 with X1 + θ4h between X1 andX1 + h so that

γ(X1 + h)− γ(X1) = γ′(X1 + θ4h) h

= ∂X1FX2(X1 + θ4h,X2 + θ3h) h

Using what we have found so far, we have

G(h) = (∂X2F (X1 + h,X2 + θ3h)− ∂X2

F (X1, X2 + θ3h)) h

= (γ(X1 + h)− γ(X1)) h = (∂X1FX2

(X1 + θ4h,X2 + θ3h)) h2

So we have two representations for G(h):

G(h) = ∂X2FX1

(X1 + θ1h,X2 + θ2h) h2 = (∂X1FX2

(X1 + θ4h,X2 + θ3h)) h2

Cancelling the common h2, we have

∂X2FX1(X1 + θ1h,X2 + θ2h) = (∂X1FX2(X1 + θ4h,X2 + θ3h))

Now let h→ 0 and since these second order partials are continuous at p, we have ∂X2FX1

(X1, X2) =∂X1FX2(X1, X2). This shows the result we wanted.

11.4.3.1 Homework

Let

f(x, y) =

4xy(x2−y2)x2+2y2 , (x, y) 6= (0, 0)

0, (x, y) = (0, 0)

Page 216: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 208 — #216

208 CHAPTER 11. DIFFERENTIABILITY

Exercise 11.4.6 Prove lim(x,y)→(0,0) f(x, y) = 0 thereby showing f has a removeable discontinuityat (0, 0) and so the definition for f above makes sense.

Exercise 11.4.7 Compute fx(0, 0) and fy(0, 0)

Exercise 11.4.8 Compute fx(x, y) and fy(x, y) for (x, y) 6= (0, 0). Simplify your answers to the bestpossible form.

Exercise 11.4.9 Compute ∂y(fx)(0, 0) and ∂x(fy)(0, 0). Are these equal?

Exercise 11.4.10 Compute ∂y(fx)(x, y) and ∂x(fy)(x, y) for (x, y) 6= (0, 0).

Exercise 11.4.11 Determine if ∂y(fx)(x, y) and ∂x(fy)(x, y) are continuous at (0, 0).

11.5 Derivatives for Vector Functions of n Variables

If f : D ⊂ <n → <m, then f determines m component functions fi : d ⊂ <n → <; i.e. at anypoint x0) where f is locally defined, we have

f(x0) =

f1(x0)...

fm(x0)

, f(x0 + ∆) =

f1(x0 + ∆1E1)...

fm(x0 + ∆nEm)

for suitable ∆. If we assume each fi is differentiable at x0, we know

f1(x0 + ∆) = f1(x0) +

n∑i=1

f1,xi(x0) (x0i −∆i) + E1(x0,∆)

... =...

fm(x0 + ∆) = f1(x0) +

n∑i=1

fm,xi(x0) (x0i −∆i) + Em(x0,∆)

where

lim∆→0

Ei(x0,∆) = 0, lim∆→0

Ei(x0,∆)

‖∆‖= 0

We can then write

f(x0 + ∆) =

f1(x0)...

fm(x0)

+

f1,x1(x0) . . . f1,xn(x0)...

...fm,x1(x0) . . . fm,xn(x0)

∆1

...∆n

+

E1(x0,∆)...

Em(x0,∆)

The matrix of partials above occurs frequently and is called the Jacobian of f .

Definition 11.5.1 The Jacobian

Page 217: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 209 — #217

11.5. DERIVATIVES FOR VECTOR FUNCTIONS OF N VARIABLES 209

Let f : D ⊂ <n → <m have partial derivatives at x0. The Jacobian of f is the m × nmatrix

Jf (x0) =

f1,x1(x0) . . . f1,xn(x0)...

...fm,x1

(x0) . . . fm,xn(x0)

If f is differentiable at x0, we can write

Jf (x0) =

(∇(f1))T )(x0

...(∇(fm))T )(x0

=

Df1(x0

...Dfm(x0)

Thus, we write

f(x0 + ∆) = f(x0) + Jf (x0) ∆ + E (x0,∆)

where E (x0,∆) is the vector error

E (x0,∆) =

E1(x0,∆)...

Em(x0,∆)

and we see

lim∆→0

‖E (x0,∆)‖ = lim∆→0

√((E1(x0,∆)2 + . . .+ (Em(x0,∆)2) = 0

lim∆→0

∥∥∥∥E (x0,∆)

‖∆‖

∥∥∥∥ = lim∆→0

√√√√(( (E1(x0,∆)

‖∆‖

)2

+ . . .+

((Em(x0,∆)

‖∆‖

)2)

= 0

The definition of differentiability for a vector function is thus

Definition 11.5.2 The Differentiability of a Vector Function of n variables

Page 218: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 210 — #218

210 CHAPTER 11. DIFFERENTIABILITY

Let f : D ⊂ <n → <m be defined locally at x0. We say f is differentiable at x0 is thereis a linear map L(x0) : <n → <m a error vector E (x0,∆) so that for given ∆,

f(x0 + ∆) = f(x0) +L(x0) ∆ + E (x0,∆)

where

lim∆→0

E (x0,∆) = 0

lim∆→0

E (x0,∆)

‖∆‖= 0

where we use norm convergence with respect to the norm ‖ · ‖ used in <n and <m. Forany given pair of orthonormal basis E of <n and F of <m, the linear map L(x0 has arepresentation

[L(x0)

]EF

so that[f(x0 + ∆)

]F

=[f(x0)

]F

+[L(x0)

]EF

[∆]E

+[E (x0,∆)

]F

We then say the derivative of f at x0) is L(x0). The derivative of f at x0) is denoted byDf(x0).

Comment 11.5.1 We usually are not so careful with the notation. It is understood that <n and <mhave some choice of orthonormal basis E and F and so we identify

L(x0) ≡[L(x0)

]E,F

, f(x0 + ∆) ≡[f(x0 + ∆)

]F, ∆ ≡

[∆]E

E (x0,∆) ≡[E (x0,∆)

]F

It is not so hard to mentally make the conversion back and forth and you should get used to it.

It is easy to prove the following result. Note we are using the identification of the linear mapping toits matrix representation here.

Theorem 11.5.1 If the Vector Function f is Differentiable at x0, L(x0) = Jf (x0)

Let f : D ⊂ <n → <m be defined locally at x0. If f is differentiable at x0, the(L(x0))ij = fi,xj (x0); i.e. L(x0) = Jf)(x0).

Proof 11.5.1This argument is very similar to the one we did before. Just apply the argument to each component.We leave this to you.

11.5.0.2 Homework

Exercise 11.5.1

Exercise 11.5.2

Exercise 11.5.3

Exercise 11.5.4

Exercise 11.5.5

Page 219: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 211 — #219

11.5. DERIVATIVES FOR VECTOR FUNCTIONS OF N VARIABLES 211

11.5.1 The Chain Rule For Vector Valued Functions

The next thing we want to do is to extend the chain rule to this setting. Note the proof here isdifferent than the one we did in the scalar case. We are using matrix manipulations here instead ofcomponents.

Theorem 11.5.2 The Chain Rule for Vector Functions

Let F1 : D1 ⊂ <n → R(F1) ⊂ <m and F2 : D2 ⊂ <m → R(F2) ⊂ <r. HereF denotes the range of Fi. We assume there is overlap: F ∩ D2 = S 6= ∅ and letE = F−1(S). Further, assume F1 is differentiable at Q0 in E and F2 is differentiable atP0 = F1(Q0) ∈ S. Then, F3 = F2 F1 is differentiable atQ0 with matrix representationgiven by [

DF3(Q0)]

r×n=

[DF2(P0)

]r×m

[DF1(Q0)

]m×n

or in linear mapping terms

DF3(Q0) = DF2(P0) DF1(Q0)

Proof 11.5.2You can see the rough flow of these mappings in Figure 11.1 which helps set the stage. We know

Figure 11.1: The Vector Chain Rule

F2(P ) = F2(P0) + J2(P0)(P − P0) + E2(P0,P )

where J2(P0) is a linear map from <m to <r and E2(P0,P → 0 as P → P0. Also,

F1(Q) = F1(Q0) + J1(Q0)(Q−Q0) + E1(Q0,Q)

where J1(Q0) is a linear map from <n to <m and E1(Q0,Q→ 0 asQ→ Q0. Let P = F1(Q) sothat P0 = F1(Q0). Then

F3(Q)− F3(Q0) = F2(F1(Q)− F2(F1(Q0)

= J2(F1(Q0)) (F1(Q)− F1(Q0)) + E2(F1(Q0), F1(Q))

Page 220: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 212 — #220

212 CHAPTER 11. DIFFERENTIABILITY

= J2(F1(Q0))

J1(Q0) (Q−Q0) + E1(Q0,Q)

+E2(F1(Q0), F1(Q))

Now reorganize:

F3(Q)− F3(Q0) = J2(F1(Q0)) J1(Q0) (Q−Q0) + J2(F1(Q0)) E1(Q0,Q)

+E2(F1(Q0), F1(Q))

or

F3(Q)− F3(Q0) = J2(F1(Q0)) J1(Q0) (Q−Q0) + E3(Q0,Q)

where

E3(Q0),Q) = J2(F1(Q0)) E1(Q0,Q) + E2(F1(Q0), F1(Q))

Since E1(Q0,Q) → 0 as Q → Q0, the first term in E3 goes to zero. For the second term, we knowsince F1 is differentiable atQ0, F1 is continuous atQ0. Thus,

limQ→Q0

E2(F1(Q0), F1(Q)) = limF1(Q)→F1(Q0

)E2(F1(Q0), F1(Q)) = 0

Next, we need to look at

E3(Q0),Q)

‖Q−Q0‖=

J2(F1(Q0)) E1(Q0,Q)

‖Q−Q0‖+

E2(F1(Q0), F1(Q))

‖Q−Q0‖

Again, the first term goes to zero as Q → Q0 since that is a property of E1. The second term ismore interesting. Now if F1 was locally constant at Q0, then J1((Q0) = 0 and we know the errorsatisfies E2(F1(Q0), F1(Q))/‖Q−Q0‖ is identically zero and we have shown what we want. If F1

is not locally constant, we can say

E2(F1(Q0), F1(Q))

‖Q−Q0‖=

0, Q 6= Q0, F1(Q) = F1(Q0)E2(F1(Q0),F1(Q))‖F1(Q)−F1(Q0)‖

‖F1(Q−F1(Q0)‖‖Q−Q0‖ , Q 6= Q0, F1(Q) 6= F1(Q0)

Hence, for allQ 6= Q0 with for all F1(Q) 6= F1(Q0), we have

E2(F1(Q0), F1(Q))

‖Q−Q0‖≤ E2(F1(Q0), F1(Q))

‖F1(Q)− F1(Q0)‖‖F1(Q)− F1Q0)‖‖Q−Q0‖

=E2(F1(Q0), F1(Q))

‖F1(Q)− F1(Q0)‖‖J1(Q0)‖Fr ‖Q−Q0‖+ ‖E1(Q0,Q)‖

‖Q−Q0‖

where we use the Frobenius norm inequalities from Chapter 4 in Theorem 4.5.1. So, simplifying abit,

E2(F1(Q0), F1(Q))

‖Q−Q0‖≤ E2(F1(Q0), F1(Q))

‖F1(Q)− F1(Q0)‖Term One

(‖J1(Q0)‖Fr +

‖E1(Q0,Q)‖‖Q−Q0‖

Term Two

)

As Q → Q0, F1(Q) → F1(Q0) also by continuity. The case where all F1(Q) = F1(Q0) has anerror bounded above by the case with F1(Q) 6= F1(Q0). Hence, from the estimates above, term onegoes to zero and the second piece goes to J1(Q0) as term two goes to zero. Hence, the limit is zeroand we have shown the second part of the condition for E3. We conclude F3 is differentiable at Q0

Page 221: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 213 — #221

11.6. TANGENT PLANE ERROR 213

and

DF3(Q0) = DF2(P0)DF1(Q0)

11.5.1.1 Homework

Exercise 11.5.6

Exercise 11.5.7

Exercise 11.5.8

Exercise 11.5.9

Exercise 11.5.10

11.6 Tangent Plane Error

Now that we have the chain rule, we can quickly develop other results such as how much errorwe make when we approximate our surface f(x, y) using a tangent plane at a point (x0, y0). Weapproximate nonlinear mappings for many purposes. We go through some examples in Chapter ??and they are well worth studying. The first thing we need is to know when a scalar function of nvariables is differentiable. Just because it’s partials exist at a point is not enough to guarantee that!But we know if the partials are continuous around that point, then the derivative does exist. And thatmeans we can write the function in terms of its tangent plane plus an error. The arguments to do thisare not terribly hard, but they are a bit involved. For now, let’s go back to the old idea of a tangentplane to a surface. For the surface z = f(x, y) if its partials are continuous functions (they usuallyare for our work!) then f is differentiable and hence we know that

f(x, y) = f(x0, y0) + fx(x0, y0)(x− x0) + fy(x0, y0)(y − y0) + E(x, y, x0, y0)

where E (x, y, x0, y0)/√

(x− x0)2 + (y − y0)2 → 0 andE(x, y, x0, y0) go to 0 as (x, y)→ (x0, y0).We can characterize the error must better if we have access to the second order partial derivativesof f . We have shown that if the second order partials are continuous locally near (x0, y0) then themixed order partials fxy and fyx must match at the point (x0, y0). Hence, for these smooth surfaces,the Hessian for a scalar function is a symmetric matrix!

11.6.1 The Mean Value Theorem

We can approximate a scalar function of many variables using their gradients. This is the first step indeveloping Taylor polynomial approximations of arbitrary order, but first step uses gradient informa-tion only. The approximations using second order partial information use both the gradient and theHessian and will be covered next. The following theorem is called the Mean Value Theorem.

Theorem 11.6.1 The Mean Value Theorem

Page 222: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 214 — #222

214 CHAPTER 11. DIFFERENTIABILITY

Let f : D ⊂ <n → < be differentiable locally at x0; i.e. f is differentiable in B(r,x0)for some r > 0. Then if x and y are in B(r,x0), there is a point u on the line connectingx and y so that

f(x)− f(y) = (∇(f))T (u) (x− y) = Df(u) (x− y)

The line connecting x and y is often denoted by [x,y] even though x and y are vectors.

Proof 11.6.1Let v = x− y and define P : < → <n by P (t) = y + t(x− y). Then

P ′(t) = limh→0

P (t+ h)− P (t)

h= v

Let F : < → < be defined by F (t) = f(P (t) = f(y+tv). By the Mean Value Theorem for functionsof one variable, there is a c between 0 and 1 so that

F (1)− F (0) = F ′(c) (1− 0)

Thus, by the chain rule

F ′(c) = Df(p(c)) P ′(c) = (∇(f))T (p(c)) v =

n∑i=1

fxi(p(c)) vi

Now F (1) = f(P (1)) = f(x) and F (0) = f(P (0)) = f(y). Also, let P (c) = y + c(x − y = uwhich is a point on [x,y]. So we have shown

f(x)− f(y) = Df(u) (x− y

11.6.1.1 Homework

Exercise 11.6.1

Exercise 11.6.2

Exercise 11.6.3

Exercise 11.6.4

Exercise 11.6.5

11.6.2 Hessian Approximations

We can now explain the most common approximation result for tangent planes.

Two Variables:Let h(t) = f(x0 + t∆x, y0 + t∆y). Then we know we can write h(t) = h(0) + h′(0)t+ h′′(c) t

2

2 .Using the chain rule, we find

h′(t) = fx(x0 + t∆x, y0 + t∆y)∆x+ fy(x0 + t∆x, y0 + t∆y)∆y

Page 223: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 215 — #223

11.6. TANGENT PLANE ERROR 215

and

h′′(t)

= ∂x

(fx(x0 + t∆x, y0 + t∆y)∆x+ fy(x0 + t∆x, y0 + t∆y)∆y

)∆x

+ ∂y

(fx(x0 + t∆x, y0 + t∆y)∆x+ fy(x0 + t∆x, y0 + t∆y)∆y

)∆y

= fxx(x0 + t∆x, y0 + t∆y)(∆x)2 + fyx(x0 + t∆x, y0 + t∆y)(∆y)(∆x)

+ fxy(x0 + t∆x, y0 + t∆y)(∆x)(∆y) + fyy(x0 + t∆x, y0 + t∆y)(∆y)2

We can rewrite this in matrix - vector form as

h′′(t) =[∆x ∆y

] [fxx(x0 + t∆x, y0 + t∆y) fyx(x0 + t∆x, y0 + t∆y)fxy(x0 + t∆x, y0 + t∆y) fyy(x0 + t∆x, y0 + t∆y)

] [∆x∆y

]Of course, using the definition ofH , this can be rewritten as

h′′(t) =

[∆x∆y

]TH(x0 + t∆x, y0 + t∆y)

[∆x∆y

]Thus, our tangent plane approximation can be written as

h(1) = h(0) + h′(0)(1− 0) + h′′(c)1

2

for some c between 0 and 1. Substituting for the h terms, we find

f(x0 + ∆x, y0 + ∆y) = f(x0, y0) + fx(x0, y0)∆x+ fy(x0, y0)∆y

+1

2

[∆x∆y

]TH(x0 + c∆x, y0 + c∆y)

[∆x∆y

]Clearly, we have shown how to express the error in terms of second order partials. There is a point cbetween 0 and 1 so that

E (x0, y0,∆x,∆y) =1

2

[∆x∆y

]TH(x0 + c∆x, y0 + c∆y)

[∆x∆y

]Note the error is a quadratic expression in terms of the ∆x and ∆y. We also now know for a functionof two variables, f(x, y), we can estimate the error made in approximating using the gradient at thegiven point (x0, y0) as follows: We have

f(x, y) = f(x0, y0) + < ∇(f)(x0, y0), [x− x0, y − y0]T >

+ (1/2)[x− x0, y − y0]H(x0 + c(x− x0), y0 + c(y − y0))[x− x0, y − y0]T

Three Variables:Let’s shift to using (x1, x2, x3) instead of (x, y, z). Define let h(t) = f(x01 + t∆1, y01 + t∆2, z01 +t∆3). For convenience, let

x =

x1

x2

x3

, x0 =

x01

x02

x03

, ∆ =

∆1

∆2

∆3

Page 224: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 216 — #224

216 CHAPTER 11. DIFFERENTIABILITY

Then. h(t) = f(x0 + t∆) and we know h(t) = h(0) + h′(0)t+ h′′(c) t2

2 . Using the chain rule, wefind

h′(t) =

3∑i=1

fxi(x0 + t∆)∆i

and

h′′(t) =

3∑i=1

( 3∑j=1

fxi,xj (x0 + t∆)∆i ∆j

)We can rewrite this in matrix - vector form as

h′′(t) =

∆1

∆2∆3

T fx1,x1(x0 + t∆) fx1,x2

(x0 + t∆) fx1,x3(x0 + t∆)

fx2,x1(x0 + t∆) fx2,x2

(x0 + t∆) fx2,x3(x0 + t∆)

fx3,x1(x0 + t∆) fx3,x2(x0 + t∆) fx3,x3(x0 + t∆)

∆1

∆2

∆3

Of course, using the definition ofH , this can be rewritten as∆1

∆2∆3

T H(x0 + t∆)

∆1

∆2

∆3

= ∆T H(x0 + t∆) ∆

Thus, our tangent plane approximation can be written as

h(1) = h(0) + h′(0)(1− 0) + h′′(c)1

2

for some c between 0 and 1. Substituting for the h terms, we find

f(x0 + ∆) = f(x0) +

3∑i=1

fxi(x0)∆i +1

2∆TH(x0 + c∆) ∆

Clearly, we have shown how to express the error in terms of second order partials. There is a point cbetween 0 and 1 so that

E (x0,∆) = ∆TH(x0 + c∆) ∆

You can see how the analysis does not change much if we did the case of n variables. We’ll leavethat to you!

Vector Valued Functions:We are now really doing a linear approximation to a vector valued function and this is not a tangentplane, but the basic idea is the same. We now have a vector function f : D ⊂ <n → <m, f wouldhave m component functions. Now ∆ and x0 have n components. A similar analysis uses

h(t) = h(t) = f(x0 + t∆) =

f1(x0 + t∆)...

fm(x0 + t∆)

Page 225: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 217 — #225

11.6. TANGENT PLANE ERROR 217

Using the chain rule, we find

h′(t) =

∑ni=1 f1,xi(x0 + t∆)∆i

...∑ni=1 fm,xi(x0 + t∆)∆i

= Jf (x0 + t∆) ∆

and

h′′(t) =

∑ni=1

∑nj=1 f1,xi,xj (x0 + t∆)∆i ∆j

...∑ni=1

∑nj=1 f1,xi,xj (x0 + t∆)∆i ∆j

Each component in the vector above can be written in a matrix - vector form as

h′′(t) =

∆TH1(x0 + t∆) ∆...

∆THn(x0 + t∆) ∆

whereHi is the Hessian for fi. Thus, our linear approximation is

h(1) = h(0) + h′(0)(1− 0) + h′′(c)1

2

for some c between 0 and 1. Substituting for the h terms, we find

f(x0 + ∆) = f(x0) + Jf (x0 ∆ +1

2

∆TH1(x0 + c∆) ∆...

∆THn(x0 + c∆) ∆

Each of these entries corresponds to the error Ei for component fi and the vector error is

E =

E1

...En

=⇒ ‖E ‖ =√

((E1)2 + . . .+ (En)2

Example 11.6.1 Let

f(x, y, z) =

[x2y4 + 2x+ 3yz2 + 10

4x2 + 5z3y2

]Estimate the linear approximation error about (0, 0, 0).

Solution We have

f1,x = 2xy4 + 2, f1,y = 4x2y3 + 3z2, f1,z = 6yz

f1,xx = 2y4, f1,xy = 8xy3, f1,xz = 0

f1,yx = 8xy3, f1,yy = 12x2y2, f1,yz = 6z,

f1,zx = 0, f1,zy = 6z, f1,zz = 6y,

f2,x = 8x, f2,y = 10z3, f2,z = 15y2z2

f2,xx = 8, f2,xy = 0, f2,xz = 0

f2,yx = 0, f2,yy = 0, f2,yz = 30z2,

Page 226: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 218 — #226

218 CHAPTER 11. DIFFERENTIABILITY

f2,zx = 0, f2,zy = 30yz2, f2,zz = 30y2z,

Hence,

Jf (x, y, z) =

[2xy4 + 2 4x2y3 + 3z2 6yz

8x 10z3 15y2z2

]

H1(x, y, z) =

2y4 8xy3 08xy3 12x2y2 6z

0 6z 6y

, H2(x, y, z) =

8 0 00 0 30z2

0 20yz2 30z2

This gives the errors

E1(0, 0,∆) =1

2∆TH1(c∆)∆

E2(0, 0,∆) =1

2∆TH2(c∆)∆

for a c between 0 and 1. To find how much error is made for at x0 + ∆, we would approximateeach error separately, for example, as we detail in (Peterson (8) 2019) for functions of two variables.All of these calculations are terribly messy! So at (0, 0, 0), letting (x∗, y∗, z∗) denote the inbetweenpoint x0 + c∆, we have

E1(x0,∆) =[∆1 ∆2 ∆3

] 2(y∗)4 8x∗(y∗)3 08x∗(y∗)3 12(x∗)2(y∗)2 6z∗

0 6z∗ 6y∗

∆1

∆2

∆3

If we restrict our attention to points in B(r, (0, 0, 0)), we see all the terms in ∆ satisfy ∆i| < r and|x∗|, |y∗| and |z∗| are all less than r. Thus

E1(x0,∆) =[∆1 ∆2 ∆3

] 2(y∗)4∆1 + 8x∗(y∗)3∆2

8x∗(y∗)3∆1 + 12(x∗)2(y∗)2∆2 + 6z∗∆3

6z∗∆2 + 6y∗∆3

= (2(y∗)4∆1 + 8x∗(y∗)3∆2)∆1

+(8x∗(y∗)3∆1 + 12(x∗)2(y∗)2∆2 + 6z∗∆3)∆2

+(6z∗∆2 + 6y∗∆3)∆3

Thus,

|E1(x0,∆)| ≤ (2r5 + 8r5)r + (8r5 + 12r5 + 6r2)r + (6r2 + 6r2)r

= 30r6 + 18r3

If we further restrict our attention to r < 1, then r6 < r2 and r3 < r2. Thus, our overestimate of theerror is |E1(x0,∆)| < 48r2. A similar analysis will show an overestimate |E1(x0,∆)| < Kr2 forsome integer K. Hence, the total overestimate of the error is

‖E ‖ =√

((E1(x0,∆)2 + (E1(x0,∆)2) ≤√

(48 +K)r2 =√

(48 +K) r

We can therefore determine the value of r which will guarantee the error in making our linear ap-proximation is as small as we like.

Comment 11.6.1 These calculations are horrible if the point involved is different from the origin. Wedo some for the two variable case in (Peterson (8) 2019) so, in principle, we know what to do. But

Page 227: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 219 — #227

11.7. A SPECIFIC COORDINATE TRANSFORMATION 219

really intense!

11.6.2.1 Homework

Exercise 11.6.6 For the following function find its gradient and hessian and the tangent plane ap-proximation plus error at the point (0, 0, 0).

f(x, y, z) = x2 + x4y3z2 + 14x2y3z4

Find the approximate error as a function of r.

Exercise 11.6.7 Find the linear approximation of

f(x, y, z) =

[2x2 + 3y4 + 5z4

−2x+ 5y − 8z2

]near (0, 0, 0). Find the approximate error E1 as a function of r.

Exercise 11.6.8 Find the linear approximation of

f(x, y) =

[2x2 + 3y4

−2x5y2 + 5y6

]near (0, 0). Find the approximate errors E1, E2 and the total error E as a function of r.

11.7 A Specific Coordinate Transformation

Given a function g(x, y), the Laplacian of g is

∇2 g(x, y) =∂2g

∂x2(x, y) +

∂2g

∂y2(x, y) = gxx(x, y) + gyy(x, y)

We usually leave off the (x, y) as it is understood and just write∇2 g = gxx+gyy. Let’s convert thisto polar coordinates. We assume g has continuous first and second order partials and so the mixedorder partials match. Define T : <2 → <2 by

T

([rθ

])=

[T1(r, θ)T2(r, θ)

]=

[r cos(θ)r sin(θ)

]=

[xy

]

We casually abuse the notation here all the time. We usually write the column vector[rθ

]as the

ordered pair (r, θ) which is a lot like identifying the column vector with its transpose[r θ

]. There

are similar confusions built into the notation for (x, y). This is because the vector space <2 canbe interpreted as ordered pairs, column vectors or row vectors and we generally just make thesetransformations without thinking about it. But, of course, they are always there. So we wouldnormally write the statements above as

T (r, θ)) =

[T1(r, θ)T2(r, θ)

]=

[r cos(θ)r sin(θ)

]= (x, y)

Now define f : <2 → < by

f(r, θ) = (g T )(r, θ) = g(T1(r, θ), T2(r, θ)) = g(r cos(θ), r sin(θ)) = g(x, y)

Page 228: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 220 — #228

220 CHAPTER 11. DIFFERENTIABILITY

We know

Df(r, θ) = (∇(f))T (r, θ) =[fr(r, θ) fθ(r, θ)

]Dg(x, y) = (∇(g))T (x, y) =

[gx(x, y) gy(x, y)

]and

DT (r, θ) = JT (r, θ) =

[T1,r(r, θ) T1,θ(r, θ)T2,r(r, θ) T2,θ(r, θ)

]=

[xr(r, θ) xθ(r, θ)yr(r, θ) yθ(r, θ)

]Hence, by the chain rule

Df(r, θ) = Dg(x, y)DT (r, θ)

or [fr(r, θ) fθ(r, θ)

]=

[gx(x, y) gy(x, y)

] [xr(r, θ) xθ(r, θ)yr(r, θ) yθ(r, θ)

]Let’s drop all of the (x, y) and (r, θ) terms as it just adds clutter. We have

[fr fθ

]=

[gx gy

] [xr xθyr yθ)

]which when multiplied out, gives our familiar expansions

fr = gx xr + gy yr

fθ = gx xθ + gy yθ

Thus, we have

fr = cos(θ)gx + sin(θ)gy

fθ = −r sin(θ)gx + r cos(θ)gy

This can be rewritten as [frfθ

]=

[cos(θ) sin(θ)−r sin(θ) r cos(θ)

] [gxgy

]Now we apply the chain rule again:

fθθ =∂

∂θ(−r sin(θ)gx + r cos(θ)gy)

= (−r sin(θ)(gxxxθ + gxyyθ) +∂

∂θ(−r sin(θ))gx

+(r cos(θ))(gyxxθ + gyyyθ)) +∂

∂θ(r cos(θ))gy

Thus,

fθθ = (−r sin(θ)(−gxxr sin(θ) + gxyr cos(θ)))− r cos(θ)gx

+(r cos(θ))(gyx(−r sin(θ)) + gyy(r cos(θ))− r sin(θ)gy

= r2(sin2(θ)gxx − sin(θ) cos(θ)gxy) + r2(− sin(θ) cos(θ)gyx + cos2(θ)gyy)

−r(cos(θ)gx + sin(θ)gy)

Page 229: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 221 — #229

11.7. A SPECIFIC COORDINATE TRANSFORMATION 221

So, we now know

1

r2fθθ = sin2(θ)gxx − 2 sin(θ) cos(θ)gxy + cos2(θ)gyy −

1

r(cos(θ)gx + sin(θ)gy)

= sin2(θ)gxx − 2 sin(θ) cos(θ)gxy + cos2(θ)gyy −1

rfr

Hence, we have shown

1

r2fθθ +

1

rfr = sin2(θ)gxx − 2 sin(θ) cos(θ)gxy + cos2(θ)gyy

Next, we need

frr =∂

∂r(cos(θ)gx + sin(θ)gy)

= cos(θ)(gxxxr + gxyyr) +∂

∂r(cos(θ))gx + sin(θ)(gyxxr + gyyyr))

+∂

∂r(sin(θ))gy

= cos(θ)(gxx cos(θ) + gxy sin(θ)) + sin(θ)(gyx cos(θ) + gyy sin(θ)))

Thus,

frr = cos2(θ)gxx + 2 cos(θ) sin(θ)gxy + sin2(θ)gyy

Combining

frr +1

r2fθθ +

1

rfr = sin2(θ)gxx − 2 sin(θ) cos(θ)gxy + cos2(θ)gyy

+ cos2(θ)gxx + 2 cos(θ) sin(θ)gxy + sin2(θ)gyy

= gxx + gyy

So the Laplacian in polar coordinates is

∇2rθf = frr +

1

r2fθθ +

1

rfr =∇)

2xyg

where f(r, θ) = g(T (r, θ).

11.7.1 Homework

Converting partial differential equations in cartesian coordinates into their equivalent forms under anonlinear coordinate transformation (these are called curvilinear transformations) is a really usefulidea. We will leave much of the detail of the following problems to you as they are nicely detailedand intense. Remember, going through this sort of thing on your own is a very useful exercise andhelps to build your personal skill sets. As a reference, any book on old fashioned advanced calculusand mathematical physics goes over this stuff. Stay away from the newer books and pick up an oldtext. A good choice is (Wrede and Speigel (16) 2010).

Exercise 11.7.1 Find the Laplacian in cylindrical coordinates.

Exercise 11.7.2 Find the Laplacian in spherical coordinates.

Exercise 11.7.3 Find the Laplacian in paraboidal coordinates.

Page 230: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 222 — #230

222 CHAPTER 11. DIFFERENTIABILITY

Exercise 11.7.4 Find the Laplacian in ellipsoidal coordinates.

Page 231: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 223 — #231

Chapter 12

Multivariable Extremal Theory

Now let’s look at the extremes of functions of many variables.

12.1 Differentiability and Extremals

We are now ready to connect the value of partial derivatives at an extremum such as a minimum ormaximum of a scalar function of n variables. We know continuous functions on compact domainspossess such extreme values, but we do not really know how to find them. Here is a start.

Theorem 12.1.1 At an Interior Extreme Point, The Partials Vanish

Let f : D ⊂ <n → < have an extreme value at the interior point x0 in D. Then if all thepartials fxi(x0) exist, they must be zero.

Proof 12.1.1Let x0 be an extreme value for f which is an interior point. For convenience of exposition, let’sassume the extremum is a maximum locally. So there is a radius r > 0 so that if y ∈ B(r,x0),f(y) ≤ f(x0). Then if each fxi(x0) exists, for small enough h > 0, we must have

f(x0 + hEi) ≤ f(x0)

Thus,

f(x0 + hEi)− f(x0)

h≤ 0 =⇒ lim

h→0+

f(x0 + hEi)− f(x0)

h≤ 0

This tells us (fxi(x0))+ ≤ 0. On the other hand, for sufficiently small h < 0, we have

f(x0 + hEi) ≤ f(x0)

Thus, because h is negative

f(x0 + hEi)− f(x0)

h≥ 0 =⇒ lim

h→0−

f(x0 + hEi)− f(x0)

h≥ 0

This tells us (fxi(x0))− ≥ 0. Since we assume fxi(x0) exists, we must have (fxi(x0))− =(fxi(x0))+ = fxi(x0). Thus 0 ≤ fxi(x0) ≤ 0 which tells us fxi(x0) = 0. We can do this foreach index i. Thus, we have shown that if x0 is an interior point which corresponds to a local max-

223

Page 232: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 224 — #232

224 CHAPTER 12. MULTIVARIABLE EXTREMAL THEORY

imum, then if the partials fxi(x0) exist they must be zero; i.e. ∇(f) = 0. We can do a similarargument for a local minimum.

Comment 12.1.1 Now if f : D ⊂ <n → < is differentiable, then Df = (∇(f))T and so theresult above tells that if f is differentiable at an extreme point which is an interior point x0 thenDf(x0) = 0.

Points where the gradient is zero are thus important. Points where extreme behavior occurs are calledcritical points of f . Once we have found them, the next question is to classify them as minima,maxima or something else. For completeness, we state a formal definition.

Definition 12.1.1 Critical Points

Let f : D ⊂ <n → <. The point x0 is called a critical point of f if∇(f)(x0 = 0, the∇(f)(x0 fails to exist or x0 is a boundary point of D. Recall this means B(r,x0) haspoints of D and DC for all choices of r > 0.

We have already discussed conditions for classifying critical points in the two variable case in (Pe-terson (8) 2019). For completeness, we will go over this again.

12.2 Second Order Extremal ConditionsAt a place where the extremum of a function of n variables might occur with ∇(f)(x0) = 0, itis clear the tangent plane will be flat as all of the partials will be zero. Of course, we can onlysee the flatness of this plane in three space but we will use that idea to help our intuition in higherdimensional spaces. From our discussion of differentiability, we know

f(x0 + ∆) = f(x0) +

n∑i=1

fxi(x0)∆i +1

2∆TH(x0 + c∆) ∆

as long as all the second order partials of f all exist locally at x0. Here,

1

2∆TH(x0 + c∆) ∆ =

1

2

n∑i=1

n∑j=1

fxixj (x0 + c∆)∆i∆j

Now since x0 is a critical point with∇(f)(x0) = 0, we can write

f(x0 + ∆) = f(x0) +1

2

n∑i=1

n∑j=1

fxixj (x0 + c∆)∆i∆j

= f(x0) +1

2∆TH(x0 + c∆) ∆

We can decide if we have a minimum or a maximum easily. The matrix H(x0 + c∆) is an n × nmatrix. If we assume all the second order partials are continuous locally at x0, then detH(x0 +c∆)is also continuous locally at x0. Hence, we can say there is a B(r,x0) where bothH and detH arecontinuous. We also know if a continuous function is positive at x0, it is strictly positive locally atx0. Similarly, if a continuous function is negative at x0, it is strictly negative locally at x0. Thus,there is a B(r1,x0) where r1 < r so that

• If detH(x0) > 0, then detH(y) > 0 if y ∈ B(r1,x0).

• If detH(x0) < 0, then detH(y) < 0 if y ∈ B(r1,x0).

Page 233: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 225 — #233

12.2. SECOND ORDER EXTREMAL CONDITIONS 225

So if we start with ‖∆‖ < r1, the c we obtain above will give us x0 + c∆ ∈ B(r1,x0). Thus, wewill be able to say

• If detH(x0) > 0, then detH(x0 + c∆) > 0

• If detH(x0) < 0, then detH(x0 + c∆) < 0

Since all the second order partials are continuous locally, we know the mixed order partials must bethe same. Hence the Hessian here is a symmetric matrix.

12.2.1 Positive and Negative Definite HessiansWe can get the algebraic sign we need for det H(x0) using the notion of positive and negativedefinite matrices. The conditions above become

• If H(x0) is positive definite, then ∆TH(x0) ∆ > 0 for any ∆. Since ∆TH(x0) ∆ iscontinuous at x0 and ∆TH(x0) ∆ > 0, this means there is a radius r2 < r1 so that∆TH(y) ∆ > 0 in B(r2,x0). By choosing ‖∆‖ < r2, we guarantee c is small enoughso that x0 + c∆ ∈ B(r2,x0). Hence, ∆TH(x0 + c∆) ∆ > 0. Thus, we have

f(x0 + ∆) = f(x0) + a positive number

implying f(x0 + ∆) > f(x0) locally. Hence, f has a local minimum at x0.

We also know sinceH(x0) is positive definite, all of its eigenvalues are positive. We discussedthe idea of the principle minors of a matrix in Chapter 10. You can go back there to review.Let λ1 through λn be the eigenvalues ofH(x0). We find for ourH(x0) that

– The first minor is the determinant of 1× 1 submatrixHij(x0) for i = 1 and j = 2. Thisis λ1. Call this determinant PM1.

– The second minor is the determinant of 2× 2 submatrixHij(x0) for 1 ≤ i, j ≤ 2. Thisis λ1 λ2. Call this determinant PM2.

– The third minor is the determinant of 3× 3 submatrixHij(x0) for 1 ≤ i, j ≤ 2. This isλ1 λ2 λ3. Call this determinant PM3.

– The kth minor is the determinant of k × k submatrix Hij(x0) for 1 ≤ i, j ≤ k. This isλ1 . . . λk. Call this determinant PMk.

Note we have a local minimum if the algebraic sign of the minors is + + + . . .+. We caneasily find the eigenvalues in MatLab and determine this. For n > 2, these minor calculationsare intense. Note we are not phrasing the conditions for the local minimum in terms of thesecond order partials at x0 although we can.

• IfH(xo) is negative definite, then we can analyze just as we did above. We have ∆TH(x0)∆ <0 for any ∆. Since ∆TH(x0)∆ is continuous at x0 and ∆TH(x0)∆ < 0, this means thereis a radius r2 < r1 so that ∆TH(y) ∆ < 0 in B(r2,x0). By choosing ‖∆‖ < r2, weguarantee c is small enough so that x0 + c∆ ∈ B(r2,x0). Hence, ∆TH(x0 + c∆) ∆ > 0.Thus, we have

f(x0 + ∆) = f(x0) + a negative number

implying f(x0 + ∆) > f(x0) locally. Hence, f has a local maximum at x0.

We also know since H(x0) is negative definite, all of its eigenvalues are negative. Let λ1

through λn be the eigenvalues ofH(x0). We find for ourH(x0) that

– PM1 = λ1 < 0.

Page 234: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 226 — #234

226 CHAPTER 12. MULTIVARIABLE EXTREMAL THEORY

– PM2 = λ1 λ2 > 0.– PM3 = λ1 λ2 λ3 < 0.– PMk = λ1 . . . λk. The algebraic sign is determined by (−1)k.

Note we have a local maximum if the algebraic sign of the minors is − + − . . . (−1)n; i.e.the algebraic sign of the minor oscillates. We can easily find the eigenvalues in MatLab anddetermine this.

Since the Hessian H(y) is symmetric, its eigenvectors determine an orthonormal basis for <n,E1(y), . . . ,En(y). Defining the matrix

P (y) =[E1(y) . . . En(y)

]whose ith column is Ei(y), we have

H(x0) = P (x0)D(x0)P T (x0)

H(x0 + c∆) = P (x0 + c∆)D(x0 + c∆)P T (x0 + c∆)

whereD(y) is the diagonal matrix of the eigenvalues ofH(y).IfH(x0) is positive or negative definite, its minors are either strictly positive or negative and hence,because detH(x0) is continuous at x0), there is a radius r2 < r1, where all the minors have thesame algebraic sign as long as x0 + c∆ ∈ B(r3,x0). By choosing ‖∆‖ sufficiently small we canguarantee this. Hence, the remarks above concerning the minors and eigenvalues still hold. We have

f(x0 + ∆) = f(x0) +1

2∆TH(x0 + c∆) ∆

= f(x0) +1

2∆TP (x0 + c∆)D(x0 + c∆)P T (x0 + c∆)∆

= f(x0) +1

2

(P T (x0 + c∆)∆

)TD(x0 + c∆)

(P T (x0 + c∆)∆

)Let ξc = P T (x0 + c∆)∆. Then we have

f(x0 + ∆) = f(x0) +1

2ξcTD(x0 + c∆)ξc

Let the eigenvalues ofH(x0 + c∆ be denoted by λci for convenience. Then we can rewrite as

f(x0 + ∆) = f(x0) +1

2(λc1(ξci )

2 + . . .+ λcn(ξcn)2)

and the signs of these eigenvalues are the same as the signs of the eigenvalues for H(x0 by ourchoice of ‖∆‖ < r3. Note the following: the equation above makes is very clear that

• If H(x0) is positive definite, so is H(x0 + c∆) and so λc1(ξci )2 + . . . + λcn(ξcn)2 > 0 gives

us a local minimum.

• If H(x0) is negative definite, so is H(x0 + c∆) and so λc1(ξc1)2 + . . . + λcn(ξcn)2 < 0 givesus a local maximum.

12.2.2 SaddlesSince

f(x0 + ∆) = f(x0) +1

2( λc1(ξc1)2 + . . .+ λcn(ξcn)2 )

Page 235: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 227 — #235

12.2. SECOND ORDER EXTREMAL CONDITIONS 227

if the eigenvalues are not all the same sign, we can have what is called a saddle structure. Supposeλci and λcj had different signs for some i 6= j. For concreteness, suppose λci > 0 and let λci = α > 0.We also have λcj < 0. Write λcj = −β where β > 0. Set all the other λck’s to zero. Then on thatsurface trace, we have

f(x0 + ∆) = f(x0) +1

2(α(ξci )

2 − β(ξcj )2)

This shows the function has the behavior on the additional trace obtained by setting ξci = 0

f(x0 + ∆) = f(x0)− 1

2β(ξcj )

2)

which shows f has a local maximum along this trace in <n. On the other hand, if we set ξcj = 0

f(x0 + ∆) = f(x0) +1

2α(ξci )

2)

which shows f has a local minimum along this trace. This type of behavior, where f behaves likeit has a minimum along some directions and a maximum along others is called saddle behavior. Weconclude ifH(x0) is not positive or negative definite, it will have eigenvalues of different sign. Thiswill imply a saddle at x0.

Of course, if detH(x0) = 0, we don’t know anything. It might have a local minimum, a localmaximum, a saddle or none of those things.

12.2.3 Expressing Conditions in Terms of PartialsThe minor conditions can be expressed in terms of the partials. We have

PM1 = det[fx1x1(x0)

]= fx1x1

(x0)

Then the second minor is

PM2 = det

[fx1x1

(x0) fx1x2(x0)

fx2x1(x0) fx2x2

(x0)

]= fx1x1

(x0)fx2x2(x0)− (fx1x2

(x0))2

The third minor requires a 3× 3 determinant

PM3 = det

fx1x1(x0) fx1x2

(x0) fx1x3(x0)

fx2x1(x0) fx2x2

(x0) fx2x3(x0)

fx3x1(x0) fx3x2

(x0) fx3x3(x0)

which is much more messy to write out and so we will not do that. We discuss the two variablecase in (Peterson (8) 2019) using different strategies of proof that involve a completing the square.Clearly that technique is limited to the two variable case and the extra theory we use here allows usto get a better grasp on the extremal structure of f . In the two variable case, we find

• f has a local minimum at x0 with∇(f)(x0) = 0 when

1. fx1x1)(x0) > 0 ( first minor condition).

2. fx1x1(x0)fx2x2

(x0)− (fx1x2(x0))2 > 0 (second minor condition)

• f has a local maximum at x0 with∇(f)(x0) = 0 when

1. fx1x1)(x0) < 0 ( first minor condition).

Page 236: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 228 — #236

228 CHAPTER 12. MULTIVARIABLE EXTREMAL THEORY

2. fx1x1(x0)fx2x2

(x0)− (fx1x2(x0))2 > 0 (second minor condition)

• f has a saddle at x0) if the two eigenvalues of H(x0) have different signs. This impliesfx1x1(x0)fx2x2(x0)− (fx1x2(x0))2 < 0

The standard two variable theorem is thus:

Theorem 12.2.1 Extreme Test

If the partials of f are zero at the point (x0, y0), we can determine if that point is a localminimum or local maximum of f using a second order test. We must assume the secondorder partials are continuous at the point (x0, y0).

• If f0xx > 0 and f0

xx f0yy − (f0

xy)2 > 0 then f(x0, y0) is a local minimum.

• If f0xx < 0 and f0

xx f0yy − (f0

xy)2 > 0 then f(x0, y0) is a local maximum.

• If f0xx f

0yy − (f0

xy)2 < 0 then f(x0, y0) is a local saddle.

We just don’t know anything if the test f0xx f

0yy − (f0

xy)2 = 0.

Example 12.2.1 Use our tests to show f(x, y) = x2 + 3y2 has a minimum at (0, 0).

Solution The partials here are fx = 2x and fy = 6y. These are zero at x = 0 and y = 0. TheHessian at this critical point is

H(x, y) =

[2 00 6

]= H(0, 0).

asH is constant here. Our second order test says the point (0, 0) corresponds to a minimum becausefxx(0, 0) = 2 > 0 and fxx(0, 0) fyy(0, 0)− (fxy(0, 0))2 = 12 > 0.

Example 12.2.2 Use our tests to show f(x, y) = x2 + 6xy + 3y2 has a saddle at (0, 0).

Solution The partials here are fx = 2x+6y and fy = 6x+6y. These are zero at when 2x+6y = 0and 6x+ 6y = 0 which has solution x = 0 and y = 0. The Hessian at this critical point is

H(x, y) =

[2 66 6

]= H(0, 0).

as H is again constant here. Our second order test says the point (0, 0) corresponds to a saddlebecause fxx(0, 0) = 2 > 0 and fxx(0, 0) fyy(0, 0)− (fxy(0, 0))2 = 12− 36 < 0.

Example 12.2.3 Show our tests fail on f(x, y) = 2x4+4y6 even though we know there is a minimumvalue at (0, 0).

Solution For f(x, y) = 2x4 + 4y6, you find that the critical point is (0, 0) and all the second orderpartials are 0 there. So all the tests fail. Of course, a little common sense tells you (0, 0) is indeedthe place where this function has a minimum value. Just think about how it’s surface looks. But thetests just fail. This is much like the curve f(x) = x4 which has a minimum at x = 0 but all the testsfail on it also.

Example 12.2.4 Show our tests fail on f(x, y) = 2x2+4y3 and the surface does not have a minimumor maximum at the critical point (0, 0).

Solution For f(x, y) = 2x2 +4y3, the critical point is again (0, 0) and fxx(0, 0) = 4, fyy(0, 0) = 0and fxy(0, 0) = fyx(0, 0) = 0. So fxx(0, 0) fyy(0, 0) − (fxy(0, 0))2 = 0 so the test fails. Note thex = 0 trace is 4y3 which is a cubic and so is negative below y = 0 and positive above y = 0. Not

Page 237: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 229 — #237

12.2. SECOND ORDER EXTREMAL CONDITIONS 229

much like a minimum or maximum behavior on this trace! But the trace for y = 0 is 2x2 which is anice parabola which does reach its minimum at x = 0. So the behavior of the surface around (0, 0)is not a maximum or a minimum.

Homework

Exercise 12.2.1

Exercise 12.2.2

Exercise 12.2.3

The theorem for n variables is this:

Theorem 12.2.2 Definiteness Extrema Test For n Variables

We assume the second order partials are continuous at the point x0 and∇(f)(x0) = 0.

• If the HessianH(x0) is positive definite, the critical point is a local minimum.

• If the HessianH(x0) is negative definite, the critical point is a local maximum.

• If the local HessianH(x0) has eigenvalues of different sign, the the critical point isa saddle.

Moreover, the HessianH(x0) is positive definite if its principle minors are all positive andhas positive eigenvalues andH(x0, y0) is negative definite if its principle minors alternateis sign starting at − and all of its eigenvalues are negative.

Proof 12.2.1We have gone over the proof of this results in our earlier discussions.

Example 12.2.5 Let f(x1, x2, x3, x4, x5) = 2x21 + 4x2

2 + 9x23 + 5x2

4 + 8x25. Classify the critical

points.

Solution The only critical point is (0, 0, 0, 0, 0). The Hessian is

H =

4 0 0 0 00 8 0 0 00 0 18 0 00 0 0 10 00 0 0 0 16

H is clearly positive definite with eigenvalues 4, 8, 18, 10, 6 and the eigenvectors are the standardbasic vectors. The minors are 4, 32, 576, 5760, 92160 and are all positive. Hence, this critical pointis a minimum.

Example 12.2.6 Let f(x1, x2, x3, x4, x5) = 2x21 − 4x2

2 + 9x23 + 5x2

4 + 8x25. Classify the critical

points.

Solution The only critical point is (0, 0, 0, 0, 0). The Hessian is

H =

4 0 0 0 00 −8 0 0 00 0 18 0 00 0 0 10 00 0 0 0 16

Page 238: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 230 — #238

230 CHAPTER 12. MULTIVARIABLE EXTREMAL THEORY

H is clearly not positive or negative definite with eigenvalues 4,−8, 18, 10, 6 and the eigenvectorsare the standard basic vectors. The minors are 4,−32,−576, 5760,−92160. Hence, this criticalpoint is a saddle.

Example 12.2.7 Let f(x1, x2, x3, x4, x5) = 100−2x21−4x2

2−9x23−5x2

4−8x25. Classify the critical

points.

Solution The only critical point is (0, 0, 0, 0, 0). The Hessian is

H =

−4 0 0 0 00 −8 0 0 00 0 −18 0 00 0 0 −10 00 0 0 0 −16

H is clearly not positive or negative definite with eigenvalues −4,−8,−18,−10,−6 and the eigen-vectors are the standard basic vectors. The minors are −4, 32,−576, 5760,−92160. Hence, thiscritical point is a maximum.

Homework

The following 2×2 matrices give the HessianH0 at the critical point (1, 1) of an extremal valueproblem.

• Determine ifH0 is positive or negative definite

• Determine if the critical point is a maximum or a minimum

• Find the two eigenvalues ofH0. Label the largest one as r1 and the other as r2

• Find the two associated eigenvectors as unit vectors

• Define

P =[E1 E2

]• Compute P T H0 P

• ShowH0 = PΛP T

• Find all the minors for an appropriate Λ.

Exercise 12.2.4

H0 =

[1 −3−3 12

]Exercise 12.2.5

H0 =

[−2 77 −40

]The following 3 × 3 matrices give the Hessian H0 at the critical point (1, 1, 2) of an extremal

value problem.

• Determine ifH0 is positive or negative definite

• Classify the critical point

Page 239: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 231 — #239

12.2. SECOND ORDER EXTREMAL CONDITIONS 231

• Find the three eigenvalues ofH0. Label the largest one as r1 and the other as r2

• Find the three associated eigenvectors as unit vectors

• Define the usual P and compute P T H0 P

• ShowH0 = PΛP T for an appropriate Λ.

• Find all the minors

Exercise 12.2.6

H0 =

1 −3 4−3 12 54 5 4

Exercise 12.2.7

H0 =

−2 7 −17 −4 5−1 5 12

The following 4×4 matrices give the HessianH0 at the critical point (1, 1, 2,−3) of an extremal

value problem.

• Determine ifH0 is positive or negative definite

• Classify the critical point

• Find the four eigenvalues ofH0. Label the largest one as r1 and the other as r2

• Find the three associated eigenvectors as unit vectors

• Define the usual P and compute P T H0 P

• ShowH0 = PΛP T for an appropriate Λ.

• Find all the minors

Exercise 12.2.8

H0 =

1 −3 4 8−3 12 5 24 5 4 −18 2 −1 8

Exercise 12.2.9

H0 =

−2 7 −1 37 −4 5 10−1 5 12 53 10 5 9

Page 240: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 232 — #240

232 CHAPTER 12. MULTIVARIABLE EXTREMAL THEORY

Page 241: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 233 — #241

Chapter 13

The Inverse and Implicit FunctionTheorems

Let’s look at mappings that transform one set of coordinates into another set.

13.1 Mappings

First, look at linear maps. This particular one could be replaced by any other with a nonzero deter-minant. So you can make up plenty of examples.

Example 13.1.1 Define T : <2 → <2 by T (x, y) = (u, v) where u = x− 3y and v = 2x+ 5y. Wecan write this in matrix/ vector form

T

([xy

])=

[T1(x, y)T2(x, y)

]=

[uv

]=

[x− 3y2x+ 5y

]=

[1 −32 5

] [xy

]So there are many ways to choose to express this mapping. The matrix representation of T withrespect to an orthonormal basis E is

[T]E,E

=

[1 −32 5

]We could also write the mapping simply as U =

[T]E,E

X where we let U be the vector withcomponents u and v and X , the vector with components x and y. However, this is too cluttered andit is easy to identify T and its matrix representation

[T]E,E

X which we will do from now on. Bydirect calculation, det(T ) = 11 6= 0, so T−1 exists. Hence if T (x1, y1) = T (x2, y2), we would have[

1 −32 5

] [x1

y1

]=

[1 −32 5

] [x2

y2

]=⇒

[1 −32 5

] [x1 − x2

y1 − y2

]=

[00

]=⇒ x1 = x2, y1 = y2

Thus, T is a 1− 1map. Further, if we chose a vectorB in <2, then[1 −32 5

] [xy

]=

[b1b2

]=⇒

[xy

]=

([1 −32 5

])−1 [b1b2

]233


and so we also know T is an onto map. Further, it is easy to see J_T = T, so since T is a linear map,

$$DT = J_T = \begin{bmatrix} 1 & -3 \\ 2 & 5 \end{bmatrix}$$

Therefore, in this case, det(DT(x0, y0)) = det(T) ≠ 0 on ℜ².
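As a quick sanity check, here is a small numerical sketch confirming the two facts the example establishes: det T ≠ 0, and every B ∈ ℜ² has a unique preimage. Python with numpy is assumed here; it is not part of the text.

```python
import numpy as np

# The matrix of the linear map T from Example 13.1.1.
T = np.array([[1.0, -3.0],
              [2.0,  5.0]])

print(np.linalg.det(T))   # 11.0, nonzero, so T is invertible

# For any right-hand side B, solve T [x, y]^T = B and verify the preimage.
B = np.array([4.0, -7.0])
xy = np.linalg.solve(T, B)
print(np.allclose(T @ xy, B))   # True: T is onto, and the solution is unique
```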

Now let’s do a nonlinear mapping.

Example 13.1.2 We are now going to study T : ℜ² → ℜ² defined by T(x, y) = (x² + y², 2xy). This mapping T is not linear and not 1-1 because T(x, y) = T(−x, −y). What is the range of T? If (u, v) is in the range, then u = x² + y², implying u ≥ 0 always. Further, v = 2xy. Note

$$u + v = x^2 + 2xy + y^2 = (x + y)^2 \ge 0$$
$$u - v = x^2 - 2xy + y^2 = (x - y)^2 \ge 0$$

Now the line u + v = 0 divides the u-v plane into three parts as seen in Figure 13.1(a): where u + v > 0, u + v = 0 and u + v < 0. The line u − v = 0 also divides the plane into three parts, as seen in Figure 13.1(b). Hence, to be in the range, both u + v and u − v must be non negative, and we already know

[Figure 13.1: The u + v and u − v lines. (a) The u + v plane subdivisions. (b) The u − v plane subdivisions.]

u ≥ 0. Thus, the range of T is the part of the u-v plane shown in Figure 13.2(a); i.e.

$$R(T) = \{(u, v) \mid u \ge 0,\ u + v \ge 0,\ u - v \ge 0\} = \{(u, v) \mid u \ge 0,\ -u \le v \le u\}$$

The domain of T is ℜ² and it can be divided into four regions as shown in Figure 13.2(b). Thus, the four indicated regions are the disjoint and open sets

$$S_1 = \{(x, y) \mid x > 0,\ -x < y < x\}, \quad S_2 = \{(x, y) \mid y > 0,\ -y < x < y\}$$
$$S_3 = \{(x, y) \mid x < 0,\ -|x| < y < |x|\}, \quad S_4 = \{(x, y) \mid y < 0,\ -|y| < x < |y|\}$$

Proposition 13.1.1 Finding Where (x² + y², 2xy) is a Bijection

The mapping T : ℜ² → ℜ² defined by T(x, y) = (x² + y², 2xy) is invertible on S2, and T|_{S2} is onto R(T).

[Figure 13.2: The range and domain of T. (a) The range of the mapping T. (b) The subdivided domain of T.]

Proof 13.1.1
Assume T(x0, y0) = T(x1, y1). Since (x0, y0) and (x1, y1) are in S2, we have y0 > 0, −y0 < x0 < y0 and y1 > 0, −y1 < x1 < y1. Now if T(x0, y0) = (u0, v0) and T(x1, y1) = (u1, v1),

$$u_0 + v_0 = (x_0 + y_0)^2, \quad u_0 - v_0 = (x_0 - y_0)^2$$
$$u_1 + v_1 = (x_1 + y_1)^2, \quad u_1 - v_1 = (x_1 - y_1)^2$$

Since T(x0, y0) = T(x1, y1), (x0 + y0)² = (x1 + y1)² and (x0 − y0)² = (x1 − y1)². Now in S2, x0 + y0 > 0 and x1 + y1 > 0. Thus, we can take the positive square root to find x0 + y0 = x1 + y1. Also, in S2, x0 − y0 < 0 and x1 − y1 < 0. Hence, in this case, we take the negative square root to get y0 − x0 = y1 − x1. We now know

$$x_0 + y_0 = x_1 + y_1$$
$$-x_0 + y_0 = -x_1 + y_1$$

Adding, we have y0 = y1, which then implies x0 = x1 as well. Hence, T is 1-1 on S2.

In fact, if x ≥ 0 and y = x, T(x, y) = T(x, x) = (2x², 2x²), which is on the line v = u in the range of T. Moreover, if x ≥ 0 and y = −x, T(x, y) = T(x, −x) = (2x², −2x²), which is on the line v = −u in the range of T. It is then easy to see T is 1-1 on these lines too. So we can say T is 1-1 on S2 together with these boundary lines. Hence, T⁻¹ exists there.

To show T : S2 → R(T) is onto is next. We'll do this by constructing the inverse. If T(x0, y0) = (u0, v0) with (x0, y0) ∈ S2, then

$$u_0 + v_0 = (x_0 + y_0)^2 \ge 0, \quad u_0 - v_0 = (x_0 - y_0)^2 \ge 0$$

In S2, x0 + y0 > 0 and x0 − y0 < 0. Taking square roots, we have

$$x_0 + y_0 = \sqrt{u_0 + v_0}, \quad x_0 - y_0 = -\sqrt{u_0 - v_0}$$

which implies

$$x_0 = \frac{1}{2}\left(\sqrt{u_0 + v_0} - \sqrt{u_0 - v_0}\right), \qquad y_0 = \frac{1}{2}\left(\sqrt{u_0 + v_0} + \sqrt{u_0 - v_0}\right)$$

Hence,

$$T^{-1}(u_0, v_0) = \left( \frac{1}{2}\left(\sqrt{u_0 + v_0} - \sqrt{u_0 - v_0}\right),\ \frac{1}{2}\left(\sqrt{u_0 + v_0} + \sqrt{u_0 - v_0}\right) \right)$$

On the line u = v, the corresponding (x, y) satisfy y ≥ 0, x + y ≥ 0 and x − y ≤ 0. Thus

$$u + v = (x+y)^2,\ u - v = (x-y)^2 \Longrightarrow 2u = (x+y)^2,\ 0 = (x-y)^2 \Longrightarrow x = \frac{1}{2}\sqrt{2u},\ y = x$$

Thus, if u = v, T⁻¹(u, u) = (½√(2u), ½√(2u)).

If v = −u, the argument is similar: we have

$$u + v = (x+y)^2,\ u - v = (x-y)^2 \Longrightarrow 0 = (x+y)^2,\ 2u = (x-y)^2 \Longrightarrow y = -x,\ y > 0,\ x = -\frac{1}{2}\sqrt{2u}$$

So if v = −u, T⁻¹(u, −u) = (−½√(2u), ½√(2u)). This completes the argument that T : S2 → R(T) is onto.

The map T is differentiable with

$$DT(x_0, y_0) = J_T(x_0, y_0) = \begin{bmatrix} 2x_0 & 2y_0 \\ 2y_0 & 2x_0 \end{bmatrix}$$

with det(DT(x0, y0)) = 4(x0² − y0²). Hence, if (x0, y0) ∈ S2, det(DT(x0, y0)) ≠ 0, and if (x0, y0) is on the boundary of S2, det(DT(x0, y0)) = 0. Thus, the determinant vanishes on the lines y = ±x bounding S2. Finally, in the interior of S2, x0 + y0 > 0 and x0 − y0 < 0, and so det(DT(x0, y0)) = 4(x0 − y0)(x0 + y0) < 0.
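The closed-form inverse above is easy to test numerically. The following sketch draws random points from S2, pushes them through T, and checks that the displayed formula for T⁻¹ recovers them. Python with numpy is assumed here; it is not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def T(x, y):
    return x**2 + y**2, 2.0 * x * y

def T_inv(u, v):
    # The inverse constructed in Proposition 13.1.1, mapping R(T) back into S2.
    return (0.5 * (np.sqrt(u + v) - np.sqrt(u - v)),
            0.5 * (np.sqrt(u + v) + np.sqrt(u - v)))

# Sample points in S2 = {(x, y) : y > 0, -y < x < y}.
y = rng.uniform(0.1, 5.0, size=1000)
x = rng.uniform(-1.0, 1.0, size=1000) * 0.99 * y

u, v = T(x, y)
x_back, y_back = T_inv(u, v)
print(np.allclose(x, x_back), np.allclose(y, y_back))   # True True
```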

These two examples give us some insight.

• Having det(DT(x0, y0)) ≠ 0 for T : ℜ² → ℜ² does not necessarily force T⁻¹ to exist on all of ℜ². Our second example shows us we may need to restrict T to a suitable domain for the inverse to exist. It also shows us T⁻¹ can exist even where det(DT(x0, y0)) = 0, as we found on the lines y = ±x.

• Given T : ℜ² → ℜ² which is linear, we have a mapping too nice to be interesting!

Homework

Exercise 13.1.1

Exercise 13.1.2

Exercise 13.1.3

Exercise 13.1.4

Exercise 13.1.5

We have been using the determinant of the Jacobian of a mapping, and we know DF = JF for our vector valued differentiable functions. In the literature, the determinant of the Jacobian is very useful, and some texts use the term Jacobian for the determinant of the Jacobian. We won't do that here and instead will always use det JF to denote this computation. Let's look at our first invertibility theorem.

Theorem 13.1.2 The First Inverse Function Theorem

Let f : D ⊂ ℜⁿ → ℜⁿ with D an open set. Assume the partials f_{i,x_j} are all continuous on D and assume det Jf(x0) ≠ 0 at x0 ∈ D. Then there is a radius r > 0 so that f is 1-1 on B(r, x0); i.e., f⁻¹ exists locally.

Proof 13.1.2
If n = 1, this is a simple scalar function of one variable. We have no partial derivatives, and Df(x0) = f′(x0). If f′(x0) ≠ 0, then given ε = |f′(x0)|/2, there is a δ > 0 so that |f′(x)| > |f′(x0)|/2 if x ∈ B(δ, x0). Pick any two distinct points p and q in B(δ, x0) and let [p, q] be the line segment between p and q. By the Mean Value Theorem, there is a point c in [p, q] so that f(p) − f(q) = f′(c)(p − q). Now if f(p) = f(q), this would mean 0 = f′(c)(p − q). But f′(c) ≠ 0 and p ≠ q, and so this is not possible. Hence in B(δ, x0), f must be 1-1 and so is invertible.

If n > 1, the argument gets more interesting. Before we do the general one, let n = 2. Define the 2 × 2 matrix-valued function M on ℜ² × ℜ² by

$$M(\mathbf{x}, \mathbf{y}) = \begin{bmatrix} f_{1,x_1}(\mathbf{x}) & f_{1,x_2}(\mathbf{x}) \\ f_{2,x_1}(\mathbf{y}) & f_{2,x_2}(\mathbf{y}) \end{bmatrix}$$

Then

$$\det M(\mathbf{x}, \mathbf{y}) = f_{1,x_1}(\mathbf{x})\, f_{2,x_2}(\mathbf{y}) - f_{1,x_2}(\mathbf{x})\, f_{2,x_1}(\mathbf{y})$$

Since f_{1,x_1}, f_{1,x_2}, f_{2,x_1} and f_{2,x_2} are continuous in D, so are the functions M(x, y) and det M(x, y). We assume det Jf(x0) = det M(x0, x0) ≠ 0 and hence, for ε = |det M(x0, x0)|/2, there is a positive δ so that if x and y are in B(δ, x0),

$$|\det M(\mathbf{x}, \mathbf{y})| > |\det M(\mathbf{x}_0, \mathbf{x}_0)|/2 > 0$$

Now choose any two distinct points p and q in B(δ, x0). By the Mean Value Theorem, there are points c1 and c2 on the line segment [p, q] between p and q so that

$$f_1(\mathbf{p}) - f_1(\mathbf{q}) = \langle \nabla f_1(\mathbf{c}_1), \mathbf{p} - \mathbf{q} \rangle$$
$$f_2(\mathbf{p}) - f_2(\mathbf{q}) = \langle \nabla f_2(\mathbf{c}_2), \mathbf{p} - \mathbf{q} \rangle$$

But if f(p) = f(q), then

$$\langle \nabla f_1(\mathbf{c}_1), \mathbf{p} - \mathbf{q} \rangle = 0$$
$$\langle \nabla f_2(\mathbf{c}_2), \mathbf{p} - \mathbf{q} \rangle = 0$$

But we know det M(c1, c2) ≠ 0, and so the only solution of this linear system is p − q = 0, which is not possible as p and q are distinct. So the assumption that f(p) = f(q) is not correct. We conclude f is 1-1 on B(δ, x0) and is invertible there.


To do the general theorem, we use the material on determinants developed in Chapter 10. By Theorem 10.1.3,

$$M(\mathbf{x}_1, \ldots, \mathbf{x}_n) = \det \begin{bmatrix} (\nabla f_1)^T(\mathbf{x}_1) \\ \vdots \\ (\nabla f_n)^T(\mathbf{x}_n) \end{bmatrix}$$

is a continuous function of the variables x1, ..., xn. Since det Jf(x0) = M(x0, ..., x0) ≠ 0, for ε = |M(x0, ..., x0)|/2 there is a positive δ so that if x1 through xn are in B(δ, x0),

$$|M(\mathbf{x}_1, \ldots, \mathbf{x}_n)| > |M(\mathbf{x}_0, \ldots, \mathbf{x}_0)|/2 > 0$$

Now choose any two distinct points p and q in B(δ, x0). By the Mean Value Theorem, there are points c1 through cn on the line segment [p, q] between p and q so that

$$f_1(\mathbf{p}) - f_1(\mathbf{q}) = \langle \nabla f_1(\mathbf{c}_1), \mathbf{p} - \mathbf{q} \rangle$$
$$\vdots$$
$$f_n(\mathbf{p}) - f_n(\mathbf{q}) = \langle \nabla f_n(\mathbf{c}_n), \mathbf{p} - \mathbf{q} \rangle$$

But if f(p) = f(q), then

$$\langle \nabla f_1(\mathbf{c}_1), \mathbf{p} - \mathbf{q} \rangle = 0$$
$$\vdots$$
$$\langle \nabla f_n(\mathbf{c}_n), \mathbf{p} - \mathbf{q} \rangle = 0$$

A determinant is zero if its rows are linearly dependent and nonzero if its rows are linearly independent. Since |M(c1, ..., cn)| > 0, the rows here are linearly independent, so the only solution to the above system is p − q = 0. But this is not possible as p and q are distinct, and hence f must be 1-1 on B(δ, x0) and is invertible there.

Homework

Exercise 13.1.6

Exercise 13.1.7

Exercise 13.1.8

Exercise 13.1.9

Exercise 13.1.10

13.2 Invertibility Results

We begin with a technical result.

Theorem 13.2.1 Interior Points of the Range of f

Let D ⊂ ℜⁿ be open and f : D ⊂ ℜⁿ → ℜⁿ have continuous partial derivatives on D. Assume x0 is in D and there is a positive r so that B(r, x0) ⊂ D with det Jf(x) ≠ 0 for all x in B(r, x0). Further, assume f is 1-1 on B(r, x0). Then f(x0) is an interior point of f(B(r, x0)).


Proof 13.2.1
For x in B(r, x0), let g(x) = ‖f(x) − f(x0)‖. Then g is a non negative real valued function whose domain is compact. We know the map ‖·‖ : ℜⁿ → ℜ is continuous, and f is continuous since it is differentiable; note the continuous partials imply its differentiability. Hence, the composition g(x) = ‖f(x) − f(x0)‖ is continuous on the compact set B(r, x0). We see g(x) = 0 implies f(x) = f(x0). Since we assume f is 1-1 on this domain, this means x = x0; that is, g vanishes only at x0. Finally, note the boundary S = {x : ‖x − x0‖ = r} is a compact set also. Since f is 1-1 and x0 is not on S, g cannot be zero on S; so g(y) > 0 if y ∈ S. Since g is continuous on the compact set S, g attains a global minimum value m > 0 at some y ∈ S.

Show B(m/2, f(x0)) ⊂ f(B(r, x0)):

Proof Let z be in B(m/2, f(x0)). We must find a p in B(r, x0) with z = f(p). Let φ be the real valued function on B(r, x0) defined by

$$\varphi(\mathbf{x}) = \|f(\mathbf{x}) - \mathbf{z}\| = \sqrt{\sum_{i=1}^n (f_i(\mathbf{x}) - z_i)^2}$$

We see φ is continuous on the compact set B(r, x0) and so attains a minimum value α at some point y0 in B(r, x0). Since z is in B(m/2, f(x0)), ‖f(x0) − z‖ < m/2 and so φ(x0) = ‖f(x0) − z‖ < m/2. This tells us

$$\alpha = \min_{\mathbf{x} \in B(r, \mathbf{x}_0)} \varphi(\mathbf{x}) = \varphi(\mathbf{y}_0) \le \varphi(\mathbf{x}_0) < m/2.$$

Is it possible for y0 ∈ S? If so,

$$\varphi(\mathbf{y}_0) = \|f(\mathbf{y}_0) - \mathbf{z}\| = \|f(\mathbf{y}_0) - f(\mathbf{x}_0) + f(\mathbf{x}_0) - \mathbf{z}\| \ge \|f(\mathbf{y}_0) - f(\mathbf{x}_0)\| - \|f(\mathbf{x}_0) - \mathbf{z}\| = g(\mathbf{y}_0) - \|f(\mathbf{x}_0) - \mathbf{z}\|$$

But ‖f(x0) − z‖ < m/2 and g(y0) ≥ m > 0 if y0 is in S. This implies

$$\varphi(\mathbf{y}_0) \ge m - m/2 = m/2 > 0$$

But we know the minimum of φ over B(r, x0) is α < m/2. This means φ(y0) ≥ m/2 is not possible. We conclude y0 is not in S. So the point where φ attains its minimum on B(r, x0) is not on the boundary S; it is an interior point, and y0 ∈ B(r, x0).

Since φ has a minimum at y0, so does φ². Since the minimum is at an interior point, we must have (φ²)_{x_i}(y0) = 0 for all i. Now

$$\varphi^2(\mathbf{x}) = \sum_{j=1}^n (f_j(\mathbf{x}) - z_j)^2 \Longrightarrow (\varphi^2)_{x_i}(\mathbf{x}) = 2 \sum_{j=1}^n (f_j(\mathbf{x}) - z_j)\, f_{j,x_i}(\mathbf{x})$$

Hence at y0,

$$(\varphi^2)_{x_i}(\mathbf{y}_0) = 2 \sum_{j=1}^n (f_j(\mathbf{y}_0) - z_j)\, f_{j,x_i}(\mathbf{y}_0) = 0$$

This is the same as J_f(y0)^T (f(y0) − z) = 0. Since det Jf(y0) ≠ 0, the unique solution is f(y0) − z = 0; i.e., f(y0) = z. We conclude z = f(y0) ∈ f(B(r, x0)) as y0 ∈ B(r, x0). Hence, B(m/2, f(x0)) ⊂ f(B(r, x0)).

This, of course, says f(x0) is an interior point of f(B(r,x0)) and completes the proof.

We are now ready to tackle the Inverse Function Theorem. First, a result about inverse images of open sets.

Theorem 13.2.2 Inverse Images of Open Sets under Continuous Maps are Open

Let f : D ⊂ ℜⁿ → ℜᵐ be a continuous mapping of the open set D. If V is open in the range of f, then f⁻¹(V) is open in D.

Proof 13.2.3
Let x0 ∈ f⁻¹(V). Then f(x0) ∈ V. Since V is open, f(x0) is an interior point of V, and so there is a radius r > 0 so that B(r, f(x0)) ⊂ V. Since f is continuous on D, for ε = r there is a δ > 0 so that ‖f(x) − f(x0)‖ < r if ‖x − x0‖ < δ. Thus, f(x) ∈ B(r, f(x0)) if x ∈ B(δ, x0) ∩ D. Since D is open, B(δ, x0) ∩ D is open, and so there is a δ1 > 0 with B(δ1, x0) ⊂ B(δ, x0) ∩ D. This tells us x ∈ B(δ1, x0) implies f(x) ∈ B(r, f(x0)) ⊂ V. We conclude B(δ1, x0) ⊂ f⁻¹(V), and so x0 is an interior point of f⁻¹(V). Since x0 is arbitrary, this shows f⁻¹(V) is open.

Theorem 13.2.3 The Inverse Function Theorem

Let D ⊂ ℜⁿ be an open set and let f : D ⊂ ℜⁿ → ℜⁿ have continuous first partial derivatives f_{i,x_j} in D. Let x0 ∈ D with det Jf(x0) ≠ 0. Then there is an open set U containing x0, an open set V ⊂ f(U) and a map g : V ⊂ ℜⁿ → U ⊂ ℜⁿ so that

• f : U → V is 1-1 and onto V.

• g : V → U is 1-1 and onto U.

• g ∘ f(x) = x on U.

• g_{i,x_j} is continuous on V, implying Dg exists on V.

Comment 13.2.1 If y ∈ V, g(y) ∈ U. So f ∘ g(y) = f(g(y)) = z ∈ V. But there is an x ∈ U so that f(x) = y. Hence, g(y) = g(f(x)) = x by the Theorem. We conclude f ∘ g(y) = f(x) = y. So we see both g ∘ f(x) = x and f ∘ g(y) = y on U and V respectively. So g = f⁻¹ and f = g⁻¹.

Proof 13.2.4
Since det Jf(x0) ≠ 0 and the f_{i,x_j} are continuous in D, the continuity of det Jf implies we can find a positive r0 so that B(r0, x0) ⊂ D, and by Theorem 13.1.2 there is an r1 < r0 so that f is 1-1 on B(r1, x0). Choose any r < r1. Then

$$B(r, \mathbf{x}_0) \subset B(r_1, \mathbf{x}_0) \Longrightarrow f \text{ is } 1\text{-}1 \text{ on } B(r, \mathbf{x}_0)$$

Now by Theorem 13.2.1, f(x0) is an interior point of f(B(r, x0)). This tells us there is an r2 > 0 so that B(r2, f(x0)) ⊂ f(B(r, x0)). Let V = B(r2, f(x0)) and U = f⁻¹(V). Note f⁻¹(V) is a subset of B(r, x0). You can see these sets in Figure 13.3. Since V is open, f⁻¹(V) is open also since

[Figure 13.3: Inverse function mappings and sets]

f is continuous. We now know f is 1-1 on B(r, x0), implying f is 1-1 on U with U ⊂ B(r, x0). Thus, g = f⁻¹ does exist on f(U) = V and g ∘ f(x) = x on U. This is part of what we need to prove. There is a lot more to do.

f is onto V:

Proof Let y ∈ V. Then y ∈ B(r2, f(x0)) ⊂ f(B(r, x0)). Hence, there is an x ∈ B(r, x0) with y = f(x), and so x ∈ f⁻¹(V) = U. So f is onto V.

g is onto U:

Proof If x ∈ U = f⁻¹(V), then there is a y ∈ V so that f(x) = y because f is onto V. Thus, g(f(x)) = g(y). However, we know g is f⁻¹ here, so g(y) = x and we have shown g is onto U.

g is continuous on V:

Proof We know f is continuous and 1-1 on the compact set B(r, x0). Hence, g = f⁻¹ is continuous on f(B(r, x0)), and this implies g is continuous on V. It is worth seeing this directly with sequences as well. Let (yn) in V converge to y ∈ V. We know f maps B(r, x0) 1-1 onto f(B(r, x0)), so there is a sequence (xn) in B(r, x0) with f(xn) = yn. Since B(r, x0) is compact, there is a subsequence (x^1_n) which converges to a point x in B(r, x0). Since f is continuous there, we must have f(x^1_n) → f(x). But f(x^1_n) = y^1_n, and since yn → y, we have y^1_n → y. Thus, f(x) = y.

If there were another subsequence (x^2_n) which converged to x* in B(r, x0), by the continuity of f we would have f(x^2_n) → f(x*). But f(x^2_n) = y^2_n and so y^2_n → y. Thus, f(x*) = y.

But f(x) = f(x*) means x = x* as f is 1-1. We have shown any convergent subsequence of (xn) must converge to the same point x. These vector sequences have n component sequences, and we have shown the set of subsequential limits for each component sequence consists of one number. Hence, each component sequence of (xn) converges, and so xn → x. But this says g(yn) → g(y).

g_{j,y_k} exists and is continuous on V:

Proof Fix y ∈ V and choose indices j and k from 1, ..., n. We need to show

$$\lim_{h \to 0} \frac{g_j(\mathbf{y} + h E_k) - g_j(\mathbf{y})}{h}$$

exists and that g_{j,y_k} is continuous. For small enough h, the line segment [y, y + hE_k] is in V. Let

$$\mathbf{x} = g(\mathbf{y}) = f^{-1}(\mathbf{y}) \in U, \qquad \mathbf{z} = g(\mathbf{y} + hE_k) = f^{-1}(\mathbf{y} + hE_k) \in U$$

Then x and z are distinct as f⁻¹ is 1-1. Note the line segment [x, z] is in B(r, x0). Since g is continuous on V, lim_{h→0} g(y + hE_k) = g(y). But this says lim_{h→0} z = g(y) = x.

Now f(z) − f(x) = y + hE_k − y = hE_k, so f_i(z) = f_i(x) if i ≠ k and f_k(z) − f_k(x) = h. Now apply the Mean Value Theorem to each f_i. Then there is a w_i ∈ [z, x] with

$$f_i(\mathbf{z}) - f_i(\mathbf{x}) = \langle \nabla f_i(\mathbf{w}_i), \mathbf{z} - \mathbf{x} \rangle = \langle \nabla f_i(\mathbf{w}_i), g(\mathbf{y} + hE_k) - g(\mathbf{y}) \rangle$$

This implies

$$\begin{bmatrix} (\nabla f_1)^T(\mathbf{w}_1) \\ \vdots \\ (\nabla f_n)^T(\mathbf{w}_n) \end{bmatrix} \begin{bmatrix} (g_1(\mathbf{y} + hE_k) - g_1(\mathbf{y}))/h \\ \vdots \\ (g_n(\mathbf{y} + hE_k) - g_n(\mathbf{y}))/h \end{bmatrix} = E_k$$

where E_k has a 1 in component k and zeros elsewhere. Since each w_i is in the line segment [z, x], which is in B(r, x0) ⊂ B(r0, x0), we know for

$$\Delta(\mathbf{w}_1, \ldots, \mathbf{w}_n) = \begin{bmatrix} (\nabla f_1)^T(\mathbf{w}_1) \\ \vdots \\ (\nabla f_n)^T(\mathbf{w}_n) \end{bmatrix}$$

that det ∆(w1, ..., wn) ≠ 0, and so there is a unique solution for each (g_j(y + hE_k) − g_j(y))/h. We can find this solution by Cramer's rule. Let ∆_j be the matrix we get by replacing column j with the data vector E_k:

$$\Delta_j(\mathbf{w}_1, \ldots, \mathbf{w}_n) = \begin{bmatrix} f_{1,x_1}(\mathbf{w}_1) & \cdots & f_{1,x_{j-1}}(\mathbf{w}_1) & 0 & f_{1,x_{j+1}}(\mathbf{w}_1) & \cdots & f_{1,x_n}(\mathbf{w}_1) \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ f_{k,x_1}(\mathbf{w}_k) & \cdots & f_{k,x_{j-1}}(\mathbf{w}_k) & 1 & f_{k,x_{j+1}}(\mathbf{w}_k) & \cdots & f_{k,x_n}(\mathbf{w}_k) \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ f_{n,x_1}(\mathbf{w}_n) & \cdots & f_{n,x_{j-1}}(\mathbf{w}_n) & 0 & f_{n,x_{j+1}}(\mathbf{w}_n) & \cdots & f_{n,x_n}(\mathbf{w}_n) \end{bmatrix}$$

From Cramer's rule, we know

$$\frac{g_j(\mathbf{y} + hE_k) - g_j(\mathbf{y})}{h} = \frac{\det \Delta_j(\mathbf{w}_1, \ldots, \mathbf{w}_n)}{\det \Delta(\mathbf{w}_1, \ldots, \mathbf{w}_n)}$$

We assume the continuity of the partials on D, so det ∆_j(w1, ..., wn) and det ∆(w1, ..., wn) are continuous functions of w1, ..., wn. As h → 0, z → x, and hence each w_i → x since w_i lies on [z, x]. Therefore we know

$$\lim_{h \to 0} \frac{\det \Delta_j(\mathbf{w}_1, \ldots, \mathbf{w}_n)}{\det \Delta(\mathbf{w}_1, \ldots, \mathbf{w}_n)} = \frac{\det \Delta_j(\mathbf{x}, \ldots, \mathbf{x})}{\det \Delta(\mathbf{x}, \ldots, \mathbf{x})}$$

Thus, g_{j,y_k}(y) = lim_{h→0} (g_j(y + hE_k) − g_j(y))/h exists and equals det ∆_j(x, ..., x)/det ∆(x, ..., x) with x = g(y). Now let (Y_p) be a sequence in V converging to y. By the formula just derived,

$$g_{j,y_k}(\mathbf{Y}_p) = \frac{\det \Delta_j(\mathbf{x}_p, \ldots, \mathbf{x}_p)}{\det \Delta(\mathbf{x}_p, \ldots, \mathbf{x}_p)}$$

where x_p = g(Y_p). The determinants are still continuous on D, and since g is continuous, x_p = g(Y_p) → g(y) = x. Hence

$$\lim_{p \to \infty} g_{j,y_k}(\mathbf{Y}_p) = \frac{\det \Delta_j(\mathbf{x}, \ldots, \mathbf{x})}{\det \Delta(\mathbf{x}, \ldots, \mathbf{x})} = g_{j,y_k}(\mathbf{y})$$

which tells us g_{j,y_k} is continuous. Note how useful Cramer's Rule is here. Cramer's Rule is not very good computationally for n > 2, but it is a powerful theoretical tool for the study of the smoothness of mappings.

This completes the proof.

Example 13.2.1 For (x1, x2) = x ∈ ℜ², let y = (y1, y2) = (x1² − x2², 2x1x2). This defines a map F. Then

$$J_F(x_1, x_2) = \begin{bmatrix} 2x_1 & -2x_2 \\ 2x_2 & 2x_1 \end{bmatrix}$$

and det J_F(x1, x2) = 4x1² + 4x2². This is positive unless (x1, x2) = (0, 0). Hence, at any p = (p1, p2) ≠ (0, 0), we can invoke the Inverse Function Theorem to find an open set U containing p and an open set V containing F(p) so that F : U → V is 1-1 and onto and F⁻¹ : V → U is also 1-1 and onto. Further, F⁻¹ has continuous partials on V. If you look back at Example 13.1.2, we figured out all of this for the slightly different map G(x1, x2) = (x1² + x2², 2x1x2) in great detail. We could have done that here. Of course, these kinds of details are really needed if you want to know the explicit functional form of these mappings.
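Since g ∘ F is the identity near p, the chain rule gives Dg(F(p)) = J_F(p)⁻¹, and this is easy to test numerically on the example above. In the sketch below (Python with numpy assumed; not part of the text), the local inverse g is evaluated by Newton's method, which is just an illustration device and not the text's construction.

```python
import numpy as np

def F(x):
    return np.array([x[0]**2 - x[1]**2, 2.0 * x[0] * x[1]])

def JF(x):
    return np.array([[2.0 * x[0], -2.0 * x[1]],
                     [2.0 * x[1],  2.0 * x[0]]])

p = np.array([1.0, 2.0])        # any point other than the origin works
y = F(p)

def g(target, x0=p, tol=1e-12):
    # Newton's method on F(x) = target, started at p; for small perturbations
    # of y this stays in the local patch where F is invertible.
    x = x0.copy()
    for _ in range(50):
        step = np.linalg.solve(JF(x), F(x) - target)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

h = 1e-6
# Column k of Dg(y) is approximately (g(y + h e_k) - g(y)) / h.
Dg = np.column_stack([(g(y + h * e) - g(y)) / h for e in np.eye(2)])
print(np.allclose(Dg, np.linalg.inv(JF(p)), atol=1e-5))   # True
```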

13.2.1 Homework

Exercise 13.2.1

Exercise 13.2.2

Exercise 13.2.3

Exercise 13.2.4

Exercise 13.2.5

13.3 Implicit Function Results

To motivate the implicit function theorem, we want to go over a long example which will illustrate the general approach to the problem. We often want to handle this kind of problem.

Example 13.3.1 We know x² + y² + z² = 10. Can we solve for z in terms of x and y? Let F(x, y, z) = x² + y² + z² − 10. Then F(x, y, z) = 0. We know z = ±√(10 − x² − y²), and for real solutions we must have √(x² + y²) ≤ √10. Fz(x, y, z) = 2z, which is not zero as long as z ≠ 0. Since x² + y² = 10 − z0², we can't pick just any z0 ≠ 0: we must choose −√10 ≤ z0 ≤ √10. So if 0 < z0 < √10, then the set S of viable choices of (x, y) is S = {(x, y) | x² + y² = 10 − z0²}, which is a circle about the origin in ℜ² of radius √(10 − z0²). For concreteness, pick a point (x0, y0) in Quadrant One with x0² + y0² + z0² − 10 = 0. Then on the open set V0 = {(x, y) : x² + y² < 10}, which contains (x0, y0), we can solve for z in terms of (x, y) as z = √(10 − x² − y²). This defines a mapping G(x, y) = √(10 − x² − y²) which has continuous partials, and F(x, y, G(x, y)) = x² + y² + 10 − x² − y² − 10 = 0. Finally, we have G(x0, y0) = √(10 − x0² − y0²) = z0. This is an example of finding a function defined implicitly by a constraint surface. Note at z = 0, Fz(x0, y0, 0) = 0 always, and the constraint surface reduces to x² + y² − 10 = 0. For that constraint, there is no way to find z as a function of (x, y) near such a point. So it seems a restriction like Fz ≠ 0 is important.

So this is the problem: given x² + y² + z² = 10, solve for z locally in terms of (x, y). Note we could also ask to solve for x in terms of (y, z) or y in terms of (x, z). We can do this explicitly here.
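Here is a tiny numerical check of the explicit solve (Python assumed; it is not part of the text):

```python
import math

def F(x, y, z):
    return x**2 + y**2 + z**2 - 10.0

def G(x, y):
    # The local solution z = G(x, y) on x^2 + y^2 < 10, positive root branch.
    return math.sqrt(10.0 - x**2 - y**2)

x0, y0 = 1.0, 2.0                 # any point with x0^2 + y0^2 < 10
print(F(x0, y0, G(x0, y0)))       # 0.0 up to rounding
```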


Example 13.3.2 The constraint surface is now F(x, y, z) = sin(x² + 2y⁴z²) + 5z⁴ − 25 = 0. This tells us, for positive z, that (24/5)^{1/4} < z < (26/5)^{1/4}, or roughly 1.48 < z < 1.51. However, there are no constraints on (x, y) at all. It is really hard to see how to solve for z in terms of (x, y). But notice Fz = 4y⁴z cos(x² + 2y⁴z²) + 20z³, which is not zero at many (x, y, z). It turns out we can prove there is a way to solve for z in terms of (x, y) locally around a point (x0, y0) where F(x0, y0, z0) = 0 with Fz(x0, y0, z0) ≠ 0. This result is called the implicit function theorem and it is our next task.
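To make the "local solve" concrete, here is a sketch (Python assumed; not part of the text) that, given a fixed (x, y), finds the nearby root z of F(x, y, z) = 0 with Newton's method. This is exactly the kind of local solution G(x, y) the implicit function theorem guarantees, and the iteration uses the fact that Fz ≠ 0 near the root.

```python
import math

def F(x, y, z):
    return math.sin(x**2 + 2.0 * y**4 * z**2) + 5.0 * z**4 - 25.0

def Fz(x, y, z):
    return 4.0 * y**4 * z * math.cos(x**2 + 2.0 * y**4 * z**2) + 20.0 * z**3

def G(x, y, z_guess=1.5, tol=1e-12):
    # Newton's method in the z variable alone; needs Fz != 0 near the root.
    z = z_guess
    for _ in range(50):
        step = F(x, y, z) / Fz(x, y, z)
        z -= step
        if abs(step) < tol:
            break
    return z

z = G(0.7, 1.1)
print(z, F(0.7, 1.1, z))   # the root and a residual near 0
```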

Example 13.3.3 Consider the following situation. We are given U ⊂ ℜ³ which is an open set and a mapping F(x, y, z) with domain U so that F has continuous first order partials in U. Assume there is a point (x0, y0, z0) in U with F(x0, y0, z0) = 0 and Fz(x0, y0, z0) ≠ 0. We claim we can find an open set V0 containing (x0, y0) in ℜ² and a mapping G with continuous partials on V0 so that G(x0, y0) = z0 and F(x, y, G(x, y)) = 0 for all (x, y) in V0. This means the constraint surface F(x, y, z) = 0 can be used to provide a local solution for z in terms of x and y. This is something we do all the time, as you can see from Example 13.3.1 and Example 13.3.2.

For any (x, y, z) ∈ U define the projections f1(x, y, z) = x, f2(x, y, z) = y and define f3(x, y, z) = F(x, y, z). Then letting f = (f1, f2, f3), we have

$$J_f(x, y, z) = \begin{bmatrix} f_{1x} & f_{1y} & f_{1z} \\ f_{2x} & f_{2y} & f_{2z} \\ f_{3x} & f_{3y} & f_{3z} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ F_x & F_y & F_z \end{bmatrix}$$

Thus det Jf(x, y, z) = Fz(x, y, z), and by our assumption det Jf(x0, y0, z0) = Fz(x0, y0, z0) ≠ 0. Apply the Inverse Function Theorem to f : ℜ³ → ℜ³. Then there is an open set U1 ⊂ U containing (x0, y0, z0), an open set V1 ⊂ f(U1) containing f(x0, y0, z0), and g = f⁻¹ so that

f : U1 → V1, 1-1, onto
g : V1 → U1, 1-1, onto

Let the components of g = f⁻¹ be denoted by (g1, g2, h). Then, if (X1, Y1, Z1) ∈ V1,

g(X1, Y1, Z1) = (g1(X1, Y1, Z1), g2(X1, Y1, Z1), h(X1, Y1, Z1))

Since f maps U1 to V1 1-1 and onto, there exists a unique (x, y, z) ∈ U1 with f(x, y, z) = (X1, Y1, Z1). But using the definition of f, this says (x, y, F(x, y, z)) = (X1, Y1, Z1). Thus, since g = f⁻¹,

(x, y, z) = (g ∘ f)(x, y, z) = g(f(x, y, z)) = g(X1, Y1, Z1) = (g1(X1, Y1, Z1), g2(X1, Y1, Z1), h(X1, Y1, Z1))

Since x = X1 and y = Y1, we see

x = g1(x, y, Z1), y = g2(x, y, Z1), z = h(x, y, Z1)

and recall (x, y, z) is the unique point satisfying f(x, y, z) = (X1, Y1, Z1). Since x = X1 and y = Y1, this can be rewritten as f(x, y, z) = (x, y, Z1).

Let

V0 = {(x, y) ∈ ℜ² | (x, y, 0) ∈ V1}

We claim V0 is an open set in ℜ² since V1 is open in ℜ³. Let (x, y) ∈ V0. Then (x, y, 0) ∈ V1, and since V1 is open, (x, y, 0) is an interior point. So there is a radius r > 0 so that B(r, (x, y, 0)) ⊂ V1. If (X, Y) ∈ B(r, (x, y)), then √((X − x)² + (Y − y)²) < r, which implies (X, Y, 0) ∈ B(r, (x, y, 0)) ⊂ V1. Thus, (X, Y, 0) ∈ V1, telling us (X, Y) ∈ V0. We conclude B(r, (x, y)) ⊂ V0, which says (x, y) is an interior point of V0. Thus, V0 is open in ℜ².

All of this works for any choice of (X1, Y1, Z1). In particular, if we take Z1 = 0, we get the unique point z so that f(x, y, z) = (x, y, 0). But that means z = h(x, y, 0). Define the mapping G by G(x, y) = h(x, y, 0) = z. Now g = f⁻¹ has continuous partials in V1, and since h is a component of g, h also has continuous partials in V1. We conclude G also has continuous partials on V0, since (x, y, 0) ∈ V1 whenever (x, y) ∈ V0.

Next, consider G(x0, y0) = h(x0, y0, 0). By construction, h(x0, y0, 0) = w where w is the unique point satisfying f(x0, y0, w) = (x0, y0, 0). But f(x0, y0, w) = (x0, y0, F(x0, y0, w)), so we see F(x0, y0, w) = 0. There is only one point which has this property, so w must be z0. Finally, if (x, y) ∈ V0, then (x, y, 0) ∈ V1 and h(x, y, 0) = z where (x, y, z) is the unique point with f(x, y, z) = (x, y, 0). So

(x, y, z) = g ∘ f(x, y, z) = g(f(x, y, z)) = g(x, y, 0) = (g1(x, y, 0), g2(x, y, 0), h(x, y, 0))

So x = g1(x, y, 0), y = g2(x, y, 0) and z = h(x, y, 0). Thus, g(x, y, 0) = (x, y, h(x, y, 0)) = (x, y, G(x, y)). Note G(x, y) = z is the unique point so that (x, y, z) ∈ U1 and f(x, y, z) = (x, y, 0). We also know that f(x, y, G(x, y)) = (x, y, F(x, y, G(x, y))) = (x, y, 0). Thus, F(x, y, G(x, y)) = 0 for (x, y) ∈ V0.

So you can see how we might go about proving a more general theorem. We will do two cases: first, we extend this argument to a scalar function of n + 1 variables and solve for the (n + 1)st variable in terms of the others. This is Theorem 13.3.1. Then, we extend to vector valued functions. We think of the domain as ℜ^{n+m} and we solve for the variables x_{n+1}, ..., x_{n+m} in terms of x1, ..., xn. This is Theorem 13.3.2. We have been trying to work through this slowly so that you can follow the arguments below. It is like the ones we have already done, but a bit more abstract.

Theorem 13.3.1 Implicit Function Theorem

Let U ⊂ ℜ^{n+1} be an open set. Let u ∈ U be written as u = (X, y) where X ∈ ℜⁿ. Assume F : U → ℜ has continuous first order partials in U and there is a point (X0, y0) ∈ U satisfying F(X0, y0) = 0 with Fy(X0, y0) ≠ 0. Then there is an open set V0 containing X0 and a function G : V0 → ℜ with continuous first order partials so that G(X0) = y0 and F(X, G(X)) = 0 on V0.

Proof 13.3.1
The argument is very similar to the one we used in the example at the start of this section. Basically, we replace the use of (x1, x2) with X and make some obvious notational changes. Let's get started.

Here we are using X to denote the components (x1, x2, ..., xn) and we are setting the last component x_{n+1} to be the variable y. For any (X, y) ∈ U, define the projections f_i(X, y) = x_i and define f_{n+1}(X, y) = F(X, y). Then letting f = (f1, ..., fn, F), we have

$$J_f(X, y) = \begin{bmatrix} f_{1,x_1} & \cdots & f_{1,x_n} & f_{1,y} \\ \vdots & & \vdots & \vdots \\ f_{n,x_1} & \cdots & f_{n,x_n} & f_{n,y} \\ F_{x_1} & \cdots & F_{x_n} & F_y \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ F_{x_1} & F_{x_2} & \cdots & F_{x_n} & F_y \end{bmatrix}$$

Thus det Jf(X, y) = Fy(X, y), and by our assumption det Jf(X0, y0) = Fy(X0, y0) ≠ 0. Apply the Inverse Function Theorem to f : ℜ^{n+1} → ℜ^{n+1}. Then there is an open set U1 ⊂ U containing (X0, y0), an open set V1 ⊂ f(U1) containing f(X0, y0), and g = f⁻¹ so that

f : U1 → V1, 1-1, onto
g : V1 → U1, 1-1, onto

Let the components of g = f⁻¹ be denoted by (g1, ..., gn, h). Then, if (Z, u) ∈ V1,

g(Z, u) = (g1(Z, u), ..., gn(Z, u), h(Z, u))

Since f maps U1 to V1 1-1 and onto, there exists a unique (X, y) ∈ U1 with f(X, y) = (Z, u). But using the definition of f, this says (X, F(X, y)) = (Z, u). Thus, since g = f⁻¹, X = Z and F(X, y) = u. Further,

(X, y) = (g ∘ f)(X, y) = g(f(X, y)) = g(Z, u) = g(X, u) = (g1(X, u), ..., gn(X, u), h(X, u))

We see x_i = g_i(X, u). Now recall (X, y) is the unique point satisfying f(X, y) = (Z, u). Since X = Z, this can be rewritten as f(X, y) = (X, u).

Let

V0 = {X ∈ ℜⁿ | (X, 0) ∈ V1}

Using an argument like before, it is straightforward to show V0 is an open set in ℜⁿ since V1 is open in ℜ^{n+1}.

All of this works for any choice of (X, y). In particular, if we take u = 0, we get the unique point y so that f(X, y) = (X, 0). But that means y = h(X, 0). Define the mapping G by G(X) = h(X, 0) = y. Now g = f⁻¹ has continuous partials in V1, and since h is a component of g, h also has continuous partials in V1. We conclude G also has continuous partials on V0.

Next, consider G(X0) = h(X0, 0). By construction, h(X0, 0) = w where w is the unique point satisfying f(X0, w) = (X0, 0). But f(X0, w) = (X0, F(X0, w)), so we see F(X0, w) = 0. There is only one point which has this property, so w must be y0. Finally, if X ∈ V0, then (X, 0) ∈ V1 and h(X, 0) = y where (X, y) is the unique point with f(X, y) = (X, 0). So

(X, y) = g ∘ f(X, y) = g(f(X, y)) = g(X, 0) = (g1(X, 0), ..., gn(X, 0), h(X, 0))

So x_i = g_i(X, 0) and y = h(X, 0). Thus, g(X, 0) = (X, h(X, 0)) = (X, G(X)). Note G(X) = y is the unique point so that (X, y) ∈ U1 and f(X, y) = (X, 0). We also know that f(X, G(X)) = (X, F(X, G(X))) = (X, 0). Thus, F(X, G(X)) = 0 for X ∈ V0.

Example 13.3.4 Let F(x, y, z) = tanh(xyz) − xyz − 10 = 0. Letting u = xyz, this is equivalent to tanh(u) − u − 10 = 0, or tanh(u) = u + 10. It is easy to see there is a solution by graphing tanh(u) and u + 10 and finding the intersection. So there is a value u* where tanh(u*) = u* + 10. In particular, for z0 = 1, there is a product value x0 y0 so that tanh(x0 y0) − x0 y0 − 10 = 0; i.e., F(x0, y0, 1) = 0. Note

$$F_z(x, y, z) = \operatorname{sech}^2(xyz)\, xy - xy = xy\, (\operatorname{sech}^2(xyz) - 1)$$

From the graph where we found the intersection (you should sketch this yourself!), we can see x0 y0 ≠ 0, so sech²(x0 y0) − 1 ≠ 0. We conclude Fz(x0, y0, 1) ≠ 0.

Applying the Implicit Function Theorem, there is an open set U ⊂ ℜ² containing (x0, y0) and a mapping G with continuous partials in U so that G(x0, y0) = 1 and

$$\tanh(xy\, G(x, y)) - xy\, G(x, y) - 10 = 0$$

for all (x, y) ∈ U.
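The intersection value u* is easy to pin down numerically; a minimal sketch (Python assumed, not part of the text) uses bisection, since tanh(u) − u − 10 is continuous, positive at u = −12 and negative at u = −9.

```python
import math

def h(u):
    return math.tanh(u) - u - 10.0

# h(-12) > 0 and h(-9) < 0, so a sign change brackets the root.
lo, hi = -12.0, -9.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if h(lo) * h(mid) <= 0:
        hi = mid      # root lies in [lo, mid]
    else:
        lo = mid      # root lies in [mid, hi]

u_star = 0.5 * (lo + hi)
print(u_star, h(u_star))   # u* is just above -11; residual near 0
```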

Theorem 13.3.2 An Extension of the Implicit Function Theorem

Let U ⊂ ℜ^{n+m} be an open set. Let u ∈ U be written as u = (X, Y) where X ∈ ℜⁿ and Y ∈ ℜᵐ. Assume F : U → ℜᵐ has continuous first order partials in U and there is a point (X0, Y0) ∈ U satisfying F(X0, Y0) = 0 with

$$\det \begin{bmatrix} F_{1,y_1}(X_0, Y_0) & \cdots & F_{1,y_m}(X_0, Y_0) \\ \vdots & & \vdots \\ F_{m,y_1}(X_0, Y_0) & \cdots & F_{m,y_m}(X_0, Y_0) \end{bmatrix} \ne 0$$

where the matrix above is a submatrix of the full J_F. Then there is an open set V0 containing X0 ∈ ℜⁿ and G : V0 → ℜᵐ with continuous partials so that G(X0) = Y0 and F(X, G(X)) = O, where O is the zero vector in ℜᵐ.

Proof 13.3.2
The argument is again very similar to the one we just used. This time, instead of a single scalar variable y, we have a vector Y. It is mostly a matter of notation. If you compare these proofs carefully, we mostly replace y by Y, u by U and so forth, as the part played by the variable x_{n+1} has now been expanded to m variables, and hence we must use vector notation. However, the structure of the argument is very much the same.

Here we are using X to denote the components (x1, x2, ..., xn) and we are setting the components x_{n+1}, ..., x_{n+m} to be the variable Y for the new variables y1, ..., ym. For any (X, Y) ∈ U, define f(X, Y) = (X, F(X, Y)). Thus, the first n components are simply the usual projections. Then, since F has m components F1 through Fm, we have

$$J_f(X, Y) = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 & \cdots & 0 \\ F_{1,x_1} & \cdots & F_{1,x_n} & F_{1,y_1} & F_{1,y_2} & \cdots & F_{1,y_m} \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ F_{m,x_1} & \cdots & F_{m,x_n} & F_{m,y_1} & F_{m,y_2} & \cdots & F_{m,y_m} \end{bmatrix}$$

where all of the partials are evaluated at (X, Y), although we have not put that evaluation point in, in order to save space. Thus, at (X0, Y0),

$$\det J_f(X_0, Y_0) = \det \begin{bmatrix} F_{1,y_1}(X_0, Y_0) & \cdots & F_{1,y_m}(X_0, Y_0) \\ \vdots & & \vdots \\ F_{m,y_1}(X_0, Y_0) & \cdots & F_{m,y_m}(X_0, Y_0) \end{bmatrix} \ne 0$$

by our assumption. Apply the Inverse Function Theorem to f : ℜ^{n+m} → ℜ^{n+m}. Then there is an open set U1 ⊂ U containing (X0, Y0), an open set V1 ⊂ f(U1) containing f(X0, Y0), and g = f⁻¹


so that

f : U1 → V1, 1-1, onto
g : V1 → U1, 1-1, onto

Let the components of g = f⁻¹ be denoted by (g1, ..., gn, h1, ..., hm). Then, if (Z, U) ∈ V1,

g(Z, U) = (g1(Z, U), ..., gn(Z, U), h1(Z, U), ..., hm(Z, U))

Since f maps U1 to V1 1-1 and onto, there exists a unique (X, Y) ∈ U1 with f(X, Y) = (Z, U). But using the definition of f, this says (X, F(X, Y)) = (Z, U). Thus, since g = f⁻¹, X = Z and F(X, Y) = U. Further,

(X, Y) = (g ∘ f)(X, Y) = g(f(X, Y)) = g(Z, U) = g(X, U) = (g1(X, U), ..., gn(X, U), h1(X, U), ..., hm(X, U))

We see x_i = g_i(X, U). Now recall (X, Y) is the unique point satisfying f(X, Y) = (Z, U). Since X = Z, this can be rewritten as f(X, Y) = (X, U).

Let

V0 = {X ∈ ℜⁿ | (X, O) ∈ V1}

Using an argument like before, it is straightforward to show V0 is an open set in ℜⁿ since V1 is open in ℜ^{n+m}.

All of this works for any choice of (X, Y). In particular, if we take U = O, we get the unique point Y so that f(X, Y) = (X, O). But that means Y = (h1(X, O), ..., hm(X, O)). For convenience, let

h(X, O) = (h1(X, O), ..., hm(X, O))

Then define the mapping G by G(X) = h(X, O) = Y. Now g = f⁻¹ has continuous partials in V1, and since the h_i are components of g, h also has continuous partials in V1. We conclude G also has continuous partials on V0.

Next, consider G(X0) = h(X0, O). By construction, h(X0, O) = W where W is the unique point satisfying f(X0, W) = (X0, O). But f(X0, W) = (X0, F(X0, W)), so we see F(X0, W) = O. There is only one point which has this property, so W must be Y0. Finally, if X ∈ V0, then (X, O) ∈ V1 and h(X, O) = Y where (X, Y) is the unique point with f(X, Y) = (X, O). So

(X, Y) = g ∘ f(X, Y) = g(f(X, Y)) = g(X, O) = (g1(X, O), ..., gn(X, O), h(X, O))

So x_i = g_i(X, O) and Y = h(X, O). Thus, g(X, O) = (X, h(X, O)) = (X, G(X)). Note G(X) = Y is the unique point so that (X, Y) ∈ U1 and f(X, Y) = (X, O). We also know that f(X, G(X)) = (X, F(X, G(X))) = (X, O). Thus, F(X, G(X)) = O for X ∈ V0.


Example 13.3.5 Let F(x, y, u, v) = (u² + 2xu − v² − 7y, u³ + 3 sin(u) + 15x² e^y + cos(v) + 10). Then

$$J_F(x, y, u, v) = \begin{bmatrix} 2u & -7 & 2u + 2x & -2v \\ 30x e^y & 15x^2 e^y & 3u^2 + 3\cos(u) & -\sin(v) \end{bmatrix}$$

At (0, 0, 0, 0),

$$J_F(0, 0, 0, 0) = \begin{bmatrix} 0 & -7 & 0 & 0 \\ 0 & 0 & 3 & 0 \end{bmatrix}$$

and F(0, 0, 0, 0) = (0, 11). The submatrix

$$A = \begin{bmatrix} -7 & 0 \\ 0 & 3 \end{bmatrix}$$

has a nonzero determinant. Hence, the Implicit Function Theorem, Theorem 13.3.2, applied to F − (0, 11), tells us we can solve for (y, u) in terms of (x, v) to find G(x, v) so that, by relabeling the order of the variables, F(x, v, G(x, v)) = (0, 11) in an open set containing (0, 0), with F(0, 0, 0, 0) = (0, 11).
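If you want to double check the Jacobian entries, a finite difference sketch is enough (Python with numpy assumed; not part of the text):

```python
import numpy as np

def F(w):
    x, y, u, v = w
    return np.array([u**2 + 2*x*u - v**2 - 7*y,
                     u**3 + 3*np.sin(u) + 15*x**2*np.exp(y) + np.cos(v) + 10])

w0 = np.zeros(4)
h = 1e-7
# Column j of J_F is approximately (F(w0 + h e_j) - F(w0)) / h.
J = np.column_stack([(F(w0 + h * e) - F(w0)) / h for e in np.eye(4)])
print(np.round(J, 5))   # close to [[0, -7, 0, 0], [0, 0, 3, 0]]
print(F(w0))            # [0, 11]
```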

13.3.1 Homework

Exercise 13.3.1

Exercise 13.3.2

Exercise 13.3.3

Exercise 13.3.4

Exercise 13.3.5

13.4 Constrained Optimization

Let's look at the problem of finding the minimum or maximum of a function f(x, y) subject to the constraint g(x, y) = c where c is a constant. Let's see how far we can get without explicitly using the implicit function theorem. Suppose we have a point (x0, y0) where an extremal value occurs, and assume f and g are differentiable at that point. Then, at another point (x, y) which satisfies the constraint, we have

$$g(x, y) = g(x_0, y_0) + g_x(x_0, y_0)(x - x_0) + g_y(x_0, y_0)(y - y_0) + E_g(x, y, x_0, y_0)$$

But g(x, y) = g(x0, y0) = c, so we have

$$0 = g^0_x (x - x_0) + g^0_y (y - y_0) + E_g(x, y, x_0, y_0)$$

where we let g⁰x = gx(x0, y0) and g⁰y = gy(x0, y0).

Now given an x, we assume we can find a y value so that g(x, y) = c in B_r(x) for some r > 0. Of course, this value need not be unique. Let φ(x) be the function defined by

$$\varphi(x) = \{y \mid g(x, y) = c\}$$

For example, if g(x, y) = c was the relation x² + y² = 25, then φ(x) = ±√(25 − x²) for −5 ≤ x ≤ 5. Clearly, we can't find a full circle B_r(x) when x = −5 or x = 5, so let's assume the


point (x0, y0) where the extremal value occurs does have such a local circle around it where we can find corresponding y values for all x values in B_r(x0). So locally, we have a function φ(x) defined around x0 so that g(x, φ(x)) = c for x ∈ B_r(x0). Then we have

$$0 = g^0_x (x - x_0) + g^0_y (\varphi(x) - \varphi(x_0)) + E_g(x, \varphi(x), x_0, \varphi(x_0))$$

Now divide through by x − x0 to get

$$0 = g^0_x + g^0_y \left( \frac{\varphi(x) - \varphi(x_0)}{x - x_0} \right) + \frac{E_g(x, \varphi(x), x_0, \varphi(x_0))}{x - x_0}$$

Thus,

$$g^0_y \left( \frac{\varphi(x) - \varphi(x_0)}{x - x_0} \right) = -g^0_x - \frac{E_g(x, \varphi(x), x_0, \varphi(x_0))}{x - x_0}$$

and assuming g⁰x ≠ 0 and g⁰y ≠ 0, we can solve to find

$$\frac{\varphi(x) - \varphi(x_0)}{x - x_0} = -\frac{g^0_x}{g^0_y} - \frac{1}{g^0_y} \frac{E_g(x, \varphi(x), x_0, \varphi(x_0))}{x - x_0}$$

Since g is differentiable at (x0, φ(x0)), as x → x0 we have E_g(x, φ(x), x0, φ(x0))/(x − x0) → 0. Thus, φ is differentiable and therefore also continuous, with

$$\varphi'(x_0) = \lim_{x \to x_0} \frac{\varphi(x) - \varphi(x_0)}{x - x_0} = -\frac{g^0_x}{g^0_y}$$

Now since both g⁰x ≠ 0 and g⁰y ≠ 0, the fraction −g⁰x/g⁰y ≠ 0 too. This means locally φ(x) is either strictly increasing or strictly decreasing; i.e., φ is a strictly monotone function.

Let’s assume φ′(x0) > 0 and so φ is increasing on some interval (x0−R, x0 +R). Let’s also assumethe extreme value is a minimum, so we know on this interval f(x, y) ≥ f(x0, y0) with g(x, y) = c.This means f(x, φ(x))− f(x0, φ(x0) ≥ 0 on (x0 −R, x0 +R). Now do an expansion for f to get

f(x, φ(x)) = f(x0, φ(x0)) + f0x(x− x0) + f0

y (φ(x)− φ(x0)) + Ef (x, φ(x), x0, φ(x0))

where f0x = fx(x0, φ(x0)) and f0

y = fy(x0, φ(x0)). This implies

f(x, φ(x))− f(x0, φ(x0)) = f0x(x− x0) + f0

y (φ(x)− φ(x0)) + Ef (x, φ(x), x0, φ(x0))

Now we have assumed f(x, φ(x))− f(x0, φ(x0)) ≥ 0, so we have

f0x(x− x0) + f0

y (φ(x)− φ(x0)) + Ef (x, φ(x), x0, φ(x0)) ≥ 0

If x− x0 > 0 we find

f0x + f0

y

(φ(x)− φ(x0)

x− x0

)+Ef (x, φ(x), x0, φ(x0))

x− x0≥ 0

Now take the limit as x→ x+0 to find

f0x + f0

yφ′(x0) ≥ 0


When x − x0 < 0, we argue similarly to find

$$f^0_x + f^0_y \varphi'(x_0) \le 0$$

Combining, we have

$$0 \le f^0_x + f^0_y \varphi'(x_0) \le 0$$

which tells us at this extreme value we have the equation f⁰x + f⁰y φ′(x0) = 0. This implies f⁰y = −f⁰x / φ′(x0). Next note

$$\begin{bmatrix} f^0_x \\ f^0_y \end{bmatrix} = \begin{bmatrix} f^0_x \\ -\dfrac{f^0_x}{\varphi'(x_0)} \end{bmatrix} = \begin{bmatrix} f^0_x \\ -\dfrac{f^0_x}{-\left(g^0_x / g^0_y\right)} \end{bmatrix} = \frac{f^0_x}{g^0_x} \begin{bmatrix} g^0_x \\ g^0_y \end{bmatrix}$$

This says there is a scalar λ = f⁰x / g⁰x so that

$$\nabla f(x_0, y_0) = \lambda\, \nabla g(x_0, y_0)$$

where g(x0, y0) = c. The value λ = f⁰x / g⁰x is called the Lagrange Multiplier for this extremal problem. This is the basis for the Lagrange Multiplier Technique for a constrained optimization problem.

We can do a similar sort of analysis in the case the extremum is a maximum too. Our analysis assumes that the point (x0, y0) where the extremum occurs is like an interior point of the set {(x, y) | g(x, y) = c}. That is, we assume for such an x0 there is an interval B_r(x0) with any x ∈ B_r(x0) having a corresponding value y = φ(x) so that g(x, φ(x)) = c. So the argument does not handle boundary points such as the ±5 in our previous example.

To implement the Lagrange Multiplier technique, we define a new function

$$H(x, y, \lambda) = f(x, y) + \lambda (g(x, y) - c)$$

The critical points we seek are the ones where the gradient of f is a multiple of the gradient of g, for the reasons we discussed above. Note the λ we use here is the negative of the Lagrange Multiplier. To find them, we set the partials of H equal to zero:

$$\frac{\partial H}{\partial x} = 0 \Longrightarrow \frac{\partial f}{\partial x} = -\lambda \frac{\partial g}{\partial x}$$

$$\frac{\partial H}{\partial y} = 0 \Longrightarrow \frac{\partial f}{\partial y} = -\lambda \frac{\partial g}{\partial y}$$

$$\frac{\partial H}{\partial \lambda} = 0 \Longrightarrow g(x, y) = c$$

The first two lines are the statement that the gradient of f is a multiple of the gradient of g, and the third line is the statement that the constraint must be satisfied. As usual, solving these three nonlinear equations gives the interior point solutions, and we must always also include the boundary points that solve the constraints as well.

Example 13.4.1 Extremize 2x + 3y subject to x² + y² = 25.


Solution Set H(x, y, λ) = (2x + 3y) + λ(x² + y² − 25). The critical point equations are then

$$\frac{\partial H}{\partial x} = 2 + \lambda (2x) = 0$$

$$\frac{\partial H}{\partial y} = 3 + \lambda (2y) = 0$$

$$\frac{\partial H}{\partial \lambda} = x^2 + y^2 - 25 = 0$$

Solving for λ, we find λ = −1/x = −3/(2y). Thus, x = 2y/3.

Using this relationship in the constraint, we find (2y/3)² + y² = 25, or 13y²/9 = 25. Thus, y² = 225/13, or y = ±15/√13. This means x = ±10/√13.

We find

f(10/√13, 15/√13) = 65/√13 ≈ 18.03
f(−10/√13, −15/√13) = −65/√13 ≈ −18.03
f(−5, 0) = −10
f(5, 0) = 10
f(0, −5) = −15
f(0, 5) = 15

So the extreme values here occur at the interior critical points ±(10/√13, 15/√13), not at the boundary points (±5, 0) or the points (0, ±5).
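A quick numerical cross-check of this example (Python with numpy assumed; not part of the text) is to parametrize the circle and maximize directly:

```python
import numpy as np

# Points on the constraint circle x^2 + y^2 = 25.
t = np.linspace(0.0, 2.0 * np.pi, 1_000_001)
x, y = 5.0 * np.cos(t), 5.0 * np.sin(t)
f = 2.0 * x + 3.0 * y

i = np.argmax(f)
print(f[i], x[i], y[i])
# about 18.028 = 65/sqrt(13), at roughly (10/sqrt(13), 15/sqrt(13)) = (2.77, 4.16)
```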

Example 13.4.2 Extremize sin(xy) subject to x² + y² = 1.

Solution Set H(x, y, λ) = sin(xy) + λ(x² + y² − 1). The critical point equations are then

$$\frac{\partial H}{\partial x} = y \cos(xy) + \lambda (2x) = 0$$

$$\frac{\partial H}{\partial y} = x \cos(xy) + \lambda (2y) = 0$$

$$\frac{\partial H}{\partial \lambda} = x^2 + y^2 - 1 = 0$$

Solving for λ in the first equation, we find λ = −y cos(xy)/(2x). Now use this in the second equation to find x cos(xy) − y² cos(xy)/x = 0. Thus, cos(xy)(x − y²/x) = 0.

The first case is cos(xy) = 0. This implies xy = π/2 + nπ. These are rectangular hyperbolas, and the closest ones to the origin have |xy| = π/2 ≈ 1.57; the closest point of xy = π/2 to the origin is (√(π/2), √(π/2)) ≈ (1.25, 1.25). These points are outside of the constraint set, so this case does not matter.

The second case is x² = y², which, using the constraint, implies x² = 1/2. Thus there are four possibilities: (1/√2, 1/√2), (1/√2, −1/√2), (−1/√2, 1/√2) and (−1/√2, −1/√2). The other possibilities are the boundary points (0, ±1) and (±1, 0). We find

f(1/√2, 1/√2) = f(−1/√2, −1/√2) = sin(1/2) ≈ 0.48
f(1/√2, −1/√2) = f(−1/√2, 1/√2) = −sin(1/2) ≈ −0.48
f(±1, 0) = 0, f(0, ±1) = 0

So the extreme values occur here at the interior points of the constraint.


13.4.1 What Does the Lagrange Multiplier Mean?

Let’s change our extremum problem by assuming the constraint constant is changing as a functionof x; i.e. the problem is now find the extreme values of f(x, y) subject to g(x, y) = C(x). We canuse our tangent plane analysis like before. We assume the extreme value for constraint value C(x0)occurs at an interior point of the constraint. We have

g(x, φ(x))− C(x) = 0

Then

0 = g0x + g0

y

(φ(x)− φ(x0)

x− x0

)+

(Eg(x, φ(x), x0, φ(x0)

x− x0

)−(C(x)− C(x0)

x− x0

)We assume the constraint bound function C is differentiable and so letting x→ x0, we find

g0x + g0

yφ′(x0) = C ′(x0)

Now assume (x1, φ(x1)) extremizes f for the constraint bound value C(x1). We have

$$f(x_1, \varphi(x_1)) - f(x_0, \varphi(x_0)) = f^0_x (x_1 - x_0) + f^0_y (\varphi(x_1) - \varphi(x_0)) + E_f(x_1, \varphi(x_1), x_0, \varphi(x_0))$$

or

$$\frac{f(x_1, \varphi(x_1)) - f(x_0, \varphi(x_0))}{x_1 - x_0} = f^0_x + f^0_y \left( \frac{\varphi(x_1) - \varphi(x_0)}{x_1 - x_0} \right) + \frac{E_f(x_1, \varphi(x_1), x_0, \varphi(x_0))}{x_1 - x_0}$$

Now, assuming C′(x) ≠ 0 locally around x0 so that C(x1) ≠ C(x0), we can write

$$\left( \frac{f(x_1, \varphi(x_1)) - f(x_0, \varphi(x_0))}{C(x_1) - C(x_0)} \right) \left( \frac{C(x_1) - C(x_0)}{x_1 - x_0} \right) = f^0_x + f^0_y \left( \frac{\varphi(x_1) - \varphi(x_0)}{x_1 - x_0} \right) + \frac{E_f(x_1, \varphi(x_1), x_0, \varphi(x_0))}{x_1 - x_0}$$

Now let Θ(x) be defined by Θ(x) = f(x, φ(x)) when (x, φ(x)) is the extreme value for the problem of extremizing f(x, y) subject to g(x, y) = C(x). This is called the Optimal Value Function for this problem. Rewriting, we have

$$\left( \frac{\Theta(x_1) - \Theta(x_0)}{C(x_1) - C(x_0)} \right) \left( \frac{C(x_1) - C(x_0)}{x_1 - x_0} \right) = f^0_x + f^0_y \left( \frac{\varphi(x_1) - \varphi(x_0)}{x_1 - x_0} \right) + \frac{E_f(x_1, \varphi(x_1), x_0, \varphi(x_0))}{x_1 - x_0}$$

Now let x1 → x0. We obtain

$$\lim_{x_1 \to x_0} \left( \frac{\Theta(x_1) - \Theta(x_0)}{C(x_1) - C(x_0)} \right) C'(x_0) = f^0_x + f^0_y \varphi'(x_0)$$


Thus, the rate of change of the optimal value with respect to change in the constraint bound, dΘ/dC, is well defined and

$$\frac{d\Theta}{dC}(C_0)\, C'(x_0) = f^0_x + f^0_y \varphi'(x_0)$$

where C0 is the value C(x0). Assuming C′(x0) ≠ 0, we have

$$\frac{d\Theta}{dC}(C_0) = \frac{f^0_x}{C'(x_0)} + \frac{f^0_y}{C'(x_0)} \varphi'(x_0)$$

But we know at the extreme value for x0 that f⁰y = (g⁰y / g⁰x) f⁰x. Thus,

$$\frac{d\Theta}{dC}(C_0) = \frac{f^0_x}{C'(x_0)} + \frac{g^0_y}{g^0_x} \frac{f^0_x}{C'(x_0)} \varphi'(x_0) = \frac{f^0_x}{C'(x_0)} \left( 1 + \frac{g^0_y}{g^0_x} \varphi'(x_0) \right) = \frac{f^0_x}{g^0_x} \frac{1}{C'(x_0)} \left( g^0_x + g^0_y \varphi'(x_0) \right)$$

But the Lagrange Multiplier λ0 for the extremal value for x0 is λ0 = f⁰x / g⁰x. So

$$\frac{d\Theta}{dC}(C_0) = \lambda_0 \left( g^0_x + g^0_y \varphi'(x_0) \right) \frac{1}{C'(x_0)}$$

But C′(x0) = g⁰x + g⁰y φ′(x0). Hence, we find dΘ/dC(C0) = λ0.

We see from our argument that the Lagrange Multiplier λ0 is the rate of change of optimal value with respect to the constraint bound. There are cases:

• sign(λ0) = sign(f⁰x / g⁰x) = (+)/(+) = +, which implies the optimal value goes up if the constraint bound is perturbed.

• sign(λ0) = sign(f⁰x / g⁰x) = (−)/(−) = +, which implies the optimal value goes up if the constraint bound is perturbed.

• sign(λ0) = sign(f⁰x / g⁰x) = (+)/(−) = −, which implies the optimal value goes down if the constraint bound is perturbed.

• sign(λ0) = sign(f⁰x / g⁰x) = (−)/(+) = −, which implies the optimal value goes down if the constraint bound is perturbed.

Hence, we can interpret the Lagrange Multiplier as a price: the change in optimal value due to a change in constraint is essentially the loss of value experienced due to the constraint change. For example, if |λ0| is very small, it says the rate of change of the optimal value with respect to constraint modification is small too. This implies the optimal value is insensitive to constraint modification.

13.4.2 Homework

Exercise 13.4.1 Find the extreme values of sin(xy) subject to x² + 2y² = 1 using the Lagrange Multiplier Technique.

Exercise 13.4.2 Find the extreme values of cos(xy) subject to x² + 2y² = 1 using the Lagrange Multiplier Technique.


Exercise 13.4.3 If aₙ → 1 and bₙ → 3, use an ε-N proof to show 5aₙ/bₙ → 5/3.

Exercise 13.4.4 If aₙ → 10 and bₙ → 2, use an ε-N proof to show 5aₙbₙ → 100.


Chapter 14

Linear Approximation Applications

Let’s look at how we use tangent plane approximations in models and computations.

14.1 Linear Approximations to Nonlinear ODE

Here, we look at several nonlinear ODE applications.

14.1.1 An Insulin Model

This is a model we have discussed in (Peterson (5) 2016) which was first presented in (Braun (2) 1978). This will show you how you can use the ideas of tangent plane approximations to build practical models of nonlinear phenomena.

In diabetes there is too much sugar in the blood and the urine. This is a metabolic disease and if a person has it, they are not able to use up all the sugars, starches and various carbohydrates because they don't have enough insulin. Diabetes can be diagnosed by a glucose tolerance test (GTT). If you are given this test, you do an overnight fast and then you are given a large dose of sugar in a form that appears in the bloodstream. This sugar is called glucose. Measurements are made over about five hours or so of the concentration of glucose in the blood. These measurements are then used in the diagnosis of diabetes. It has always been difficult to interpret these results as a means of diagnosing whether a person has diabetes or not. Hence, different physicians interpreting the same data can come up with a different diagnosis, which is a pretty unacceptable state of affairs!

In this chapter, we are going to discuss a criterion developed in the 1960s by doctors at the Mayo Clinic and the University of Minnesota that was fairly reliable. It showcases a lot of our modeling in this course and will give you another example of how we use our tools. We start with a simple model of the blood glucose regulatory system.

Glucose plays an important role in vertebrate metabolism because it is a source of energy. For each person, there is an optimal blood glucose concentration, and large deviations from this lead to severe problems, including death. Blood glucose levels are autoregulated via standard forward and backward interactions like we see in many biological systems. An example is the signal that is used to activate the creation of a protein, which we discussed earlier. The signaling molecules are typically either bound to another molecule in the cell or are free. The equilibrium concentration of free signal is due to the fact that the rate at which signaling molecules bind equals the rate at which they split apart from their binding substrate. When an external message comes into the cell, called a trigger, it induces a change in this careful balance which temporarily upgrades or degrades the equilibrium signal concentration. This then influences the protein creation rate. Blood glucose concentrations work like this too, although the details differ. The blood glucose concentration is influenced by a variety of signaling molecules just like the protein creation rates can be. Here are


some of them. The hormone that decreases blood glucose concentration is insulin. Insulin is a hormone secreted by the β cells of the pancreas. After we eat carbohydrates, our gastrointestinal tract sends a signal to the pancreas to secrete insulin. Also, the glucose in our blood directly stimulates the β cells to secrete insulin. We think insulin helps cells pull in the glucose needed for metabolic activity by attaching itself to membrane walls that are normally impenetrable. This attachment increases the ability of glucose to pass through to the inside of the cell where it can be used as fuel. So, if there is not enough insulin, cells don't have enough energy for their needs. The other hormones we will focus on all tend to change blood glucose concentrations also.

• Glucagon is a hormone secreted by the α cells of the pancreas. Excess glucose is stored in the liver in the form of glycogen. There is the usual equilibrium amount of storage caused by the rate of glycogen formation being equal to the rate of the reverse reaction that moves glycogen back to glucose. Hence the glycogen serves as a reservoir for glucose, and when the body needs glucose, the rate balance is tipped towards conversion back to glucose to release needed glucose to the cells. The hormone glucagon increases the rate of the reaction that converts glycogen back to glucose and so serves an important regulatory function. Hypoglycemia (low blood sugar) and fasting tend to increase the secretion of the hormone glucagon. On the other hand, if the blood glucose levels increase, this tends to suppress glucagon secretion; i.e., we have another back and forth regulatory tool.

• Epinephrine, also called adrenalin, is a hormone secreted by the adrenal medulla. It is part of an emergency mechanism to quickly increase the blood glucose concentration in times of extremely low blood sugar levels. Hence, epinephrine also increases the rate at which glycogen converts to glucose. It also directly inhibits how much glucose is able to be pulled into muscle tissue, because muscles use a lot of energy and this energy is needed elsewhere more urgently. It also acts on the pancreas directly to inhibit insulin production, which keeps glucose in the blood. There is also another way to increase glucose, by converting lactate into glucose in the liver. Epinephrine increases this rate also, so the liver can pump this extra glucose back into the blood stream.

• Glucocorticoids are hormones like cortisol which are secreted by the adrenal cortex and which influence how carbohydrates are metabolized, which in turn increases glucose if the metabolic rate goes up.

• Thyroxin is a hormone secreted by the thyroid gland, and it helps the liver form glucose from sources which are not carbohydrates, such as glycerol, lactate and amino acids. So another way to up glucose!

• Somatotrophin is called the growth hormone and it is secreted by the anterior pituitary gland. This hormone directly affects blood glucose levels (i.e. an increase in Somatotrophin increases blood glucose levels and vice versa) but it also inhibits the effect of insulin on muscle and fat cells' permeability, which diminishes insulin's ability to help those cells pull glucose out of the blood stream. These actions can therefore increase blood glucose levels.

Now net hormone concentration is the sum of insulin plus the others. Let H denote this net hormone concentration. At normal conditions, call this concentration H0. There have been studies performed that show that under close to normal conditions, the interaction of the one hormone insulin with blood glucose completely dominates the net hormonal activity. That is, normal blood sugar levels primarily depend on insulin-glucose interactions. So if insulin increases from normal levels, it increases net hormonal concentration to H0 + ∆H and decreases glucose blood concentration. On the other hand, if other hormones such as cortisol increased from base levels, this will make blood glucose levels go up. Since insulin dominates all activity at normal conditions, we can think of this increase in cortisol as a decrease in insulin with


a resulting drop in blood glucose levels. A decrease in insulin from normal levels corresponds to a drop in net hormone concentration to H0 − ∆H. Now let G denote blood glucose level. Hence, in our model an increase in H means a drop in G and a decrease in H means an increase in G! Note our lumping of all the hormone activity into a single net activity is very much like how we would model food fish and predator fish in the predator prey model.

Homework

Exercise 14.1.1

Exercise 14.1.2

Exercise 14.1.3

Exercise 14.1.4

Exercise 14.1.5

14.1.1.1 Model Details

The idea of our model for diagnosing diabetes from the GTT is to find a simple dynamical model of this complicated blood glucose regulatory system in which the values of two parameters would give a nice criterion for distinguishing normal individuals from those with mild diabetes or those who are pre-diabetic. Here is what we will do. We describe the model as

G′(t) = F1(G,H) + J(t)

H ′(t) = F2(G,H)

where the function J is the external rate at which blood glucose concentration is being increased. There are two nonlinear interaction functions F1 and F2 because we know G and H have complicated interactions. Let's assume G and H have achieved optimal values G0 and H0 by the time the fasting patient has arrived at the hospital. Hence, we don't expect to have any contribution to G′(0) and H′(0); i.e. F1(G0, H0) = 0 and F2(G0, H0) = 0. We are interested in the deviation of G and H from their optimal values G0 and H0, so let g = G − G0 and h = H − H0. We can then write G = G0 + g and H = H0 + h. The model can then be rewritten as

(G0 + g)′(t) = F1(G0 + g,H0 + h) + J(t)

(H0 + h)′(t) = F2(G0 + g,H0 + h)

or

g′(t) = F1(G0 + g,H0 + h) + J(t)

h′(t) = F2(G0 + g,H0 + h)

Now find the tangent plane approximations.

F1(G0 + g, H0 + h) = F1(G0, H0) + (∂F1/∂g)(G0, H0) g + (∂F1/∂h)(G0, H0) h + EF1

F2(G0 + g, H0 + h) = F2(G0, H0) + (∂F2/∂g)(G0, H0) g + (∂F2/∂h)(G0, H0) h + EF2


but the terms F1(G0, H0) = 0 and F2(G0, H0) = 0, so we can simplify to

F1(G0 + g, H0 + h) = (∂F1/∂g)(G0, H0) g + (∂F1/∂h)(G0, H0) h + EF1

F2(G0 + g, H0 + h) = (∂F2/∂g)(G0, H0) g + (∂F2/∂h)(G0, H0) h + EF2

Homework

Exercise 14.1.6

Exercise 14.1.7

Exercise 14.1.8

Exercise 14.1.9

Exercise 14.1.10

It seems reasonable to assume that since we are so close to ordinary operating conditions, the errors EF1 and EF2 will be negligible. Thus our model approximation is

g′(t) = (∂F1/∂g)(G0, H0) g + (∂F1/∂h)(G0, H0) h + J(t)

h′(t) = (∂F2/∂g)(G0, H0) g + (∂F2/∂h)(G0, H0) h

We can reason out the algebraic signs of the four partial derivatives to be

(∂F1/∂g)(G0, H0) = −        (∂F1/∂h)(G0, H0) = −

(∂F2/∂g)(G0, H0) = +        (∂F2/∂h)(G0, H0) = −

The arguments for these algebraic signs come from our understanding of the physiological processes that are going on here. Let's look at a small positive deviation g from the optimal value G0 while letting the net hormone concentration be fixed at H0. At this point, we are not adding an external input, so here J(t) = 0. Then our model approximation is

g′(t) = (∂F1/∂g)(G0, H0) g

At a state where we have an increase in blood sugar levels over optimal, i.e. g > 0, the other hormones such as cortisol and glucagon will try to regulate the blood sugar level down by increasing their concentrations and, for example, storing more sugar into glycogen. Hence, the term (∂F1/∂g)(G0, H0) should be negative, as here g′ is negative since g should be decreasing. So we model this as (∂F1/∂g)(G0, H0) = −m1 for some positive number m1. Now consider a positive change in h from


the optimal level while keeping G at the optimal level G0. Then the model is

g′(t) = (∂F1/∂h)(G0, H0) h

and since h > 0, this means the net hormone concentration is up, which we interpret as insulin above normal. This means blood sugar levels go down, which implies g′ is negative again. Thus, (∂F1/∂h)(G0, H0) must be negative, which means we model it as (∂F1/∂h)(G0, H0) = −m2 for some positive m2. Now look at the h′ model in these two cases. If we have a small positive deviation g from the optimal value G0 while letting the net hormone concentration be fixed at H0, we have

h′(t) = (∂F2/∂g)(G0, H0) g.

Again, since g is positive, this means we are above normal blood sugar levels, which implies mechanisms are activated to bring the level down. Hence h′ > 0 as we have increasing net hormone levels. Thus, we must have (∂F2/∂g)(G0, H0) = m3 for some positive m3. Finally, if we have a positive deviation h from optimal while blood sugar levels are optimal, the model is

h′(t) = (∂F2/∂h)(G0, H0) h.

Since h is positive, the concentrations of the hormones that pull glucose out of the blood stream are above optimal. This means that too much sugar is being removed and so the regulatory mechanisms will act to stop this action, implying h′ < 0. This tells us (∂F2/∂h)(G0, H0) = −m4 for some positive constant m4. Hence, the four partial derivatives at the optimal points can be defined by four positive numbers m1, m2, m3 and m4 as follows:

(∂F1/∂g)(G0, H0) = −m1
(∂F1/∂h)(G0, H0) = −m2
(∂F2/∂g)(G0, H0) = +m3
(∂F2/∂h)(G0, H0) = −m4

Homework

Exercise 14.1.11

Exercise 14.1.12

Exercise 14.1.13

Exercise 14.1.14

Exercise 14.1.15

14.1.1.2 Model Analysis

Our model dynamics are thus approximated by


g′(t) = −m1 g −m2 h+ J(t)

h′(t) = m3 g −m4 h

This implies

g′′(t) = −m1 g′ − m2 h′ + J′(t)

Now plug in the formula for h′ to get

g′′(t) = −m1 g′ −m2 (m3 g −m4 h) + J ′(t)

= −m1 g′ −m2m3g +m2m4h+ J ′(t).

But we can use the g′ equation to solve for h. This gives

m2 h = −g′(t)−m1 g + J(t)

which leads to

g′′(t) = −m1 g′ −m2m3g +m4 (−g′(t)−m1 g + J(t)) + J ′(t)

= −(m1 +m4) g′ − (m1m4 +m2m3)g +m4J(t) + J ′(t).

So our final model is

g′′(t) + (m1 +m4) g′ + (m1m4 +m2m3)g = m4J(t) + J ′(t).

Let α = (m1 + m4)/2 and ω² = m1 m4 + m2 m3 and we can rewrite this as

g′′(t) + 2α g′ + ω² g = S(t).

where S(t) = m4 J(t) + J′(t). Now the right hand side here is zero except for the very short time interval when the glucose load is being ingested. Hence, we can simply search for the solution to the homogeneous model

g′′(t) + 2α g′ + ω² g = 0.

The roots of the characteristic equation here are

r = ( −2α ± √(4α² − 4ω²) ) / 2 = −α ± √(α² − ω²).

The most interesting case is when we have complex roots. In that case, α² − ω² < 0. Let Ω² = |α² − ω²|. Then the general phase shifted solution has the form g = R e^{−αt} cos(Ωt − δ), which implies


G = G0 + R e^{−αt} cos(Ωt − δ).
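Before we fit data, it helps to see what this family of curves looks like. Here is a minimal Octave sketch that plots G(t) for one set of sample parameter values; the specific numbers are illustrative choices, not fitted values.

% plot the model response for illustrative (not fitted) parameter values
G0 = 100; R = 50; alpha = 0.8; Omega = 2.5; delta = -1.0;
t = linspace(0,6,201);
% damped oscillation returning to the equilibrium level G0
G = G0 + R*exp(-alpha*t).*cos(Omega*t - delta);
plot(t,G);
xlabel('time in hours');
ylabel('blood glucose');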

Hence, our model has five unknowns to find: G0, R, α, Ω and δ. The easiest way to do this is to measure G0, the patient's initial blood glucose concentration, when the patient arrives. Then measure the blood glucose concentration N more times giving the data pairs (t1, G1), (t2, G2) and so on out to (tN, GN). Then form the least squares error function

E = Σ_{i=1}^{N} ( Gi − G0 − R e^{−α ti} cos(Ω ti − δ) )²    (14.1)

and find the five parameter values that make this error a minimum. This can be done using standard MatLab tools; a minimal sketch using a canned optimizer appears after the exercises below. Numerous experiments have been done with this model and if we let T0 = 2π/Ω, it has been found that if T0 < 4 hours, the patient is normal and if T0 is much larger than that, the patient has mild diabetes.

Homework

Exercise 14.1.16

Exercise 14.1.17

Exercise 14.1.18

Exercise 14.1.19

Exercise 14.1.20
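Here is the promised sketch of handing the minimization of Equation 14.1 to a canned optimizer, in this case Octave's fminsearch. The data and the rough starting guess below match the ones we use later, and everything we will say about sensitivity to the starting point applies here as well; this is a sketch, not a substitute for the gradient descent development to come.

% minimize the least squares error of Equation 14.1 with a simplex method
Data = [0,95;1,180;2,155;3,140];
Time = Data(:,1); G = Data(:,2);
% X = [G0; R; alpha; Omega; delta]
E = @(X) sum( (G - X(1) - X(2)*exp(-X(3)*Time).*cos(X(4)*Time - X(5))).^2 );
X0 = [148.64; 53.64; log(17/5); pi; -pi];   % a rough starting guess
[Xbest, Ebest] = fminsearch(E, X0)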

14.1.1.3 Fitting the Data

Here is some typical glucose versus time data.

Time    Glucose Level
0       95
1       180
2       155
3       140

We will now try to find the parameter values which minimize the nonlinear least squares problem of Equation 14.1. This appears to be a simple problem, but you will see all numerical optimization problems are actually fairly difficult. Our problem is to find the free parameters G0, R, α, Ω and δ which minimize

E(G0, R, α, Ω, δ) = Σ_{i=1}^{N} ( Gi − G0 − R e^{−α ti} cos(Ω ti − δ) )²

For convenience, let X = [G0, R, α, Ω, δ]′ and fi(X) = Gi − G0 − R e^{−α ti} cos(Ω ti − δ); then we can rewrite the error function as E(X) = Σ_{i=1}^{N} fi²(X). Gradient descent requires the gradient of this error function. This is just a messy calculation;

∂E/∂G0 = 2 Σ_{i=1}^{N} fi(X) ∂fi/∂G0

∂E/∂R = 2 Σ_{i=1}^{N} fi(X) ∂fi/∂R

∂E/∂α = 2 Σ_{i=1}^{N} fi(X) ∂fi/∂α

∂E/∂Ω = 2 Σ_{i=1}^{N} fi(X) ∂fi/∂Ω

∂E/∂δ = 2 Σ_{i=1}^{N} fi(X) ∂fi/∂δ

where the fi partials are given by

∂fi/∂G0 = −1

∂fi/∂R = −e^{−α ti} cos(Ω ti − δ)

∂fi/∂α = ti R e^{−α ti} cos(Ω ti − δ)

∂fi/∂Ω = ti R e^{−α ti} sin(Ω ti − δ)

∂fi/∂δ = −R e^{−α ti} sin(Ω ti − δ)

and so

∂E/∂G0 = −2 Σ_{i=1}^{N} fi(X)

∂E/∂R = −2 Σ_{i=1}^{N} fi(X) e^{−α ti} cos(Ω ti − δ)

∂E/∂α = 2 Σ_{i=1}^{N} fi(X) ti R e^{−α ti} cos(Ω ti − δ)

∂E/∂Ω = 2 Σ_{i=1}^{N} fi(X) ti R e^{−α ti} sin(Ω ti − δ)

∂E/∂δ = −2 Σ_{i=1}^{N} fi(X) R e^{−α ti} sin(Ω ti − δ)
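Analytic gradients like these are easy to get wrong, so it is worth checking them against a finite difference approximation before building a descent code on top of them. Here is a minimal Octave sketch of such a check; the error function, the test point and the step size h are illustrative choices.

% compare the analytic gradient formulas against central finite differences
Data = [0,95;1,180;2,155;3,140];
Time = Data(:,1); G = Data(:,2);
E = @(X) sum( (G - X(1) - X(2)*exp(-X(3)*Time).*cos(X(4)*Time - X(5))).^2 );
X = [148.64; 53.64; log(17/5); pi; -pi];
h = 1.0e-6;
gradFD = zeros(5,1);
for k = 1:5
  Xp = X; Xp(k) = Xp(k) + h;
  Xm = X; Xm(k) = Xm(k) - h;
  gradFD(k) = ( E(Xp) - E(Xm) )/(2*h);
end
gradFD   % should match the five analytic partials entry by entry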

Now suppose we are at the point X0 and we want to know how much of the descent vector D to use. Note, if we use the amount ξ of the descent vector at X0, we compute the new error value E(X0 − ξ D(X0)). Let g(ξ) = E(X0 − ξ D(X0)). We see g(0) = E(X0) and, given a first choice of ξ = λ, we have g(λ) = E(X0 − λ D(X0)). Next, let Y = X0 − ξ D(X0). Then, using the chain rule, we can calculate the directional derivative of g at 0. First, we have

g′(ξ) = −⟨∇(E), D(X0)⟩


and using the normalized gradient of E as the descent vector, we find

g′(0) = −||∇(E)||.

Now let's approximate g using a quadratic model. Since we are trying for a minimum, in general we try to take a step in the direction of the negative gradient which makes the error function go down. Then, we have g(0) = E(X0) is less than g(λ) = E(X0 − λ D(X0)) and the directional derivative gives g′(0) = −||∇(E)|| < 0. Hence, if we approximate g by a simple quadratic model, g(ξ) = A + Bξ + Cξ², this model will have a unique minimizer and we can use the value of ξ where the minimum occurs as our next choice of descent step. This technique is called a Line Search Method and it is quite useful. To summarize, we fit our g model and find

g(0) = E(X0)
g′(0) = −||∇(E)||
g(λ) = A + Bλ + Cλ²  ⟹  C = ( E(X0 − λ D(X0)) − E(X0) + ||∇(E)|| λ ) / λ²

The minimum of this quadratic occurs at λ∗ = −B/(2C) and this gives us our next update X0 − λ∗ D(X0).

Homework

Exercise 14.1.21

Exercise 14.1.22

Exercise 14.1.23

Exercise 14.1.24

Exercise 14.1.25

14.1.1.4 Gradient Descent Implementation

Let's get started on how to find the optimal parameters numerically. Along the way, we will show you how hard this is. We start with a minimal implementation. What is complicated here is that we have lots of functions that depend on the data we are trying to fit. So the number of functions depends on the size of the data set, which makes it harder to set up.

Listing 14.1: NonLinear LS For Diabetes Model: Version One

function [Error, G0, R, alpha, Omega, delta, normgrad, update] = DiabetesGradOne(Initial, lambda, maxiter, data)
  %
  % Initial = initial guess for G0, R, alpha, Omega, delta
  % data    = collection of (time, glucose) pairs
  % maxiter = maximum number of iterations to use
  % lambda  = how much of the descent vector to use
  %
  % setup least squares error function
  N = length(data);
  E = 0.0;
  Time = data(:,1);
  G = data(:,2);
  % Initial = [initial G; initial R; initial alpha; initial Omega; initial delta]
  g = Initial(1); r = Initial(2); a = Initial(3); o = Initial(4); d = Initial(5);
  f = @(t,equilG,r,a,o,d) equilG + r*exp(-a^2*t).*(cos(o*t-d));
  E = DiabetesError(f,g,r,a,o,d,Time,G)
  for i = 1:maxiter
    % calculate error
    E = DiabetesError(f,g,r,a,o,d,Time,G);
    if (i == 1)
      Error(1) = E;
    end
    Error = [Error;E];
    % find grad of E
    [gradE, normgrad] = ErrorGradient(Time,G,g,r,a,o,d);
    % find descent direction D
    if (normgrad > 1)
      Descent = gradE/normgrad;
    else
      Descent = gradE;
    end
    Del = [g;r;a;o;d] - lambda*Descent;
    g = Del(1); r = Del(2); a = Del(3); o = Del(4); d = Del(5);
  end
  G0 = g; R = r; alpha = a; Omega = o; delta = d;
  E = DiabetesError(f,g,r,a,o,d,Time,G)
  update = [G0;R;alpha;Omega;delta];
end

Note inside this function, we call another function to calculate the gradient of the error and its norm. This is given below and implements the formulae we presented earlier for these partial derivatives. Note the code parameterizes the decay as e^{−a² t}, so the a in the code plays the role of √α and the chain rule brings an extra factor of 2 a t into the ∂/∂a sums.

Listing 14.2: The Error Gradient Function

function [gradE, normgrad] = ErrorGradient(Time, G, g, r, a, o, d)
  %
  % Time is the time data
  % G is the glucose data
  % g = current equilG
  % r = current R
  % a = current alpha
  % o = current Omega
  % d = current delta
  %
  % Calculate Error gradient
  ferror = @(t,G,g,r,a,o,d) G - g - r*exp(-a^2*t).*(cos(o*t-d));
  gerror = @(t,r,a,o,d) r*exp(-a^2*t).*(cos(o*t-d));
  herror = @(t,r,a,o,d) r*exp(-a^2*t).*(sin(o*t-d));
  N = length(Time);
  sum = 0;
  for k = 1:N
    sum = sum + ferror(Time(k),G(k),g,r,a,o,d);
  end
  pEpequilg = -2*sum;
  sum = 0;
  for k = 1:N
    sum = sum + ferror(Time(k),G(k),g,r,a,o,d)*gerror(Time(k),r,a,o,d)/r;
  end
  pEpR = -2*sum;
  sum = 0;
  for k = 1:N
    % chain rule for the e^(-a^2 t) parameterization gives the factor 2*a*t
    sum = sum + ferror(Time(k),G(k),g,r,a,o,d)*2*a*Time(k)*gerror(Time(k),r,a,o,d);
  end
  pEpa = 2*sum;
  sum = 0;
  for k = 1:N
    sum = sum + ferror(Time(k),G(k),g,r,a,o,d)*Time(k)*herror(Time(k),r,a,o,d);
  end
  pEpo = 2*sum;
  sum = 0;
  for k = 1:N
    sum = sum + ferror(Time(k),G(k),g,r,a,o,d)*herror(Time(k),r,a,o,d);
  end
  pEpd = -2*sum;
  gradE = [pEpequilg; pEpR; pEpa; pEpo; pEpd];
  normgrad = norm(gradE);
end

We also need code for the error calculations which is given here.

Listing 14.3: Diabetes Error Calculation

function E = DiabetesError(f, g, r, a, o, d, Time, G)
  %
  % Time = time data
  % G = glucose values
  % f = nonlinear insulin model
  % g, r, a, o, d = parameters in diabetes nonlinear model
  % N = size of data
  N = length(G);
  E = 0.0;
  % calculate error function
  for i = 1:N
    E = E + ( G(i) - f(Time(i),g,r,a,o,d) )^2;
  end
end

Let’s look at some run time results using this code. We will use Octave for the computations.

Listing 14.4: Run time results for gradient descent on the original data

Data = [0,95;1,180;2,155;3,140];
Time = Data(:,1);
G = Data(:,2);
f = @(t,equilG,r,a,o,d) equilG + r*exp(-a^2*t).*(cos(o*t-d));
time = linspace(0,3,41);
RInitial = 53.64;
GOInitial = 95 + RInitial;
AInitial = sqrt(log(17/5));
OInitial = pi;
dInitial = -pi;
Initial = [GOInitial; RInitial; AInitial; OInitial; dInitial];
[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,20000,Data,0);
InitialE = 463.94
E = 376.40
[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,40000,Data,0);
InitialE = 463.94
E = 377.77
[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,100000,Data,0);
InitialE = 463.94
E = 377.77

After 100,000 iterations we still do not have a good fit. Note we start with a small constant λ = 5.0e−4 here. Try it yourself. If you let this value be larger, the optimization spins out of control. Also, we have not said how we chose our initial values. We actually looked at the data on a sheet of paper and did some rough calculations to try for some decent values. We will leave that to you to figure out. If the initial values are poorly chosen, gradient descent optimization is a great way to generate really bad values! So be warned. You will have to exercise due diligence to find a sweet starting spot.

We can see how we did by looking at the resulting curve fit in Figure 14.1.

Homework

Exercise 14.1.26

Exercise 14.1.27

Exercise 14.1.28

Exercise 14.1.29

Exercise 14.1.30

14.1.1.5 Adding Line Search

Now let's add line search and see if it gets better. We will also try scaling the data so all the variables in question are roughly the same size. For us, a good choice is to scale the G0 and the R value by 50, although we could try other choices. We have already discussed line search for our problem, but here it is again in a quick nutshell. If we are minimizing a function of M variables, say f(X), then if we are at the point X0, we can look at the slice of this function as we move out from the base point X0 in the direction of the negative gradient −∇(f0), where ∇(f0) = ∇(f(X0)). Define a function of the single variable ξ as g(ξ) = f(X0 − ξ ∇(f0)). Then, we can try to approximate g as a quadratic, g(ξ) ≈ A + Bξ + Cξ². Of course, the actual function might not be approximated nicely by such a quadratic, but it is worth a shot! Once we fit the parameters A, B and C, we see this quadratic model is minimized at λ∗ = −B/(2C). The code now adds the line search code which is contained in the block


Figure 14.1: The Diabetes Model Curve Fit: No Line Search

Listing 14.5: Line Search Code

function [lambdastar, DelOptimal] = LineSearch(Del, EStart, EFullStep, Base, Descent, normgrad, lambda)
  %
  % Del is the update due to the step lambda
  % EStart is E at the base value
  % EFullStep is E at the given lambda value
  % Base is the current parameter vector
  % Descent is the current descent vector
  % normgrad is the norm of grad of E at the base value
  % lambda is the current step
  %
  % we have enough information to do a line search here
  A = EStart;
  BCheck = -normgrad;
  C = ( EFullStep - A - BCheck*lambda )/( lambda^2 );
  lambdastar = -BCheck/(2*C);
  if (C < 0 || lambdastar < 0)
    % we are going to a maximum on the line search; reject
    DelOptimal = Del;
  else
    % we have a minimum on the line search
    DelOptimal = Base - lambdastar*Descent;
  end
end
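A quick way to see the quadratic fit at work is to feed LineSearch a problem where the answer is known. For the one dimensional quadratic E(x) = x² starting at the base point x = 1, the gradient is 2, the normalized descent direction is 1, and the fit recovers the exact minimizer step λ∗ = 1. This is a hypothetical test call, not part of the model code.

% sanity check on E(x) = x^2 at base point x = 1
Base = 1; Descent = 1; normgrad = 2; lambda = 0.5;
Del = Base - lambda*Descent;     % the trial update
EStart = Base^2;                 % E at the base point
EFullStep = Del^2;               % E at the trial update
[lambdastar, DelOptimal] = LineSearch(Del, EStart, EFullStep, Base, Descent, normgrad, lambda)
% returns lambdastar = 1 and DelOptimal = 0, the true minimizer of x^2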


Let’s put it all together now.

Listing 14.6: NonLinear LS For Diabetes Model

function [Error, G0, R, alpha, Omega, delta, normgrad, update] = DiabetesGrad(Initial, lambda, maxiter, data, dolinesearch)
  %
  % Initial = initial guess for G0, R, alpha, Omega, delta
  % data = collection of (time, glucose) pairs
  % maxiter = maximum number of iterations to use
  % lambda = how much of the descent vector to use
  % dolinesearch = 1 to use the line search step
  %
  % setup least squares error function
  N = length(data);
  E = 0.0;
  Time = data(:,1);
  G = data(:,2);
  % Initial = [initial G; initial R; initial alpha; initial Omega; initial delta]
  g = Initial(1);
  r = Initial(2);
  a = Initial(3);
  o = Initial(4);
  d = Initial(5);
  f = @(t,equilG,r,a,o,d) equilG + r*exp(-a^2*t).*(cos(o*t-d));
  InitialE = DiabetesError(f,g,r,a,o,d,Time,G)
  for i = 1:maxiter
    % calculate error
    E = DiabetesError(f,g,r,a,o,d,Time,G);
    if (i == 1)
      Error(1) = E;
    end
    Error = [Error;E];
    % Calculate Error gradient
    [gradE, normgrad] = ErrorGradient(Time,G,g,r,a,o,d);
    % find descent direction
    if (normgrad > 1)
      Descent = gradE/normgrad;
    else
      Descent = gradE;
    end
    Base = [g;r;a;o;d];
    Del = Base - lambda*Descent;
    newg = Del(1); newr = Del(2); newa = Del(3); newo = Del(4); newd = Del(5);
    EStart = E;
    EFullStep = DiabetesError(f,newg,newr,newa,newo,newd,Time,G);
    if (dolinesearch == 1)
      [lambdastar, DelOptimal] = LineSearch(Del,EStart,EFullStep,Base,Descent,normgrad,lambda);
      g = DelOptimal(1); r = DelOptimal(2); a = DelOptimal(3); o = DelOptimal(4); d = DelOptimal(5);
      EOptimalStep = DiabetesError(f,g,r,a,o,d,Time,G);
    else
      g = Del(1); r = Del(2); a = Del(3); o = Del(4); d = Del(5);
    end
  end
  G0 = g; R = r; alpha = a; Omega = o; delta = d;
  E = DiabetesError(f,g,r,a,o,d,Time,G)
  update = [G0;R;alpha;Omega;delta];
end

We can see how this is working by letting some of the temporary calculations print. Here are two iterations of line search printing out the A, B and C and the relevant energy values. Our initial values don't matter much here as we are just checking out the line search algorithm.

Listing 14.7: Some Details of the Line Search

Data = [0,95;1,180;2,155;3,140];
Time = Data(:,1);
G = Data(:,2);
f = @(t,equilG,r,a,o,d) equilG + r*exp(-a^2*t).*(cos(o*t-d));
time = linspace(0,3,41);
RInitial = 85*17/21;
GOInitial = 95 + RInitial;
AInitial = sqrt(log(17/4));
OInitial = pi;
dInitial = -pi;
Initial = [GOInitial; RInitial; AInitial; OInitial; dInitial];
[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,2,Data,1);
InitialE = 635.38
EFullStep = 635.31
A = 635.38
BCheck = -595.35
C = 9.1002e+05
lambdastar = 3.2711e-04
EFullStep = 635.27
A = 635.33
BCheck = -592.34
C = 9.0725e+05
lambdastar = 3.2645e-04
E = 635.29

Now let's remove those prints and let it run for a while. We are using the original data here to try to find the fit.

Listing 14.8: The Full Run with Line Search

[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,20000,Data,1);
InitialE = 635.38
E = 389.07
[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,30000,Data,1);
InitialE = 635.38
E = 0.067171

We have success! The line search got the job done in 30,000 iterations while the attempt using just gradient descent without line search failed. But remember, we do additional processing at each step. We show the resulting curve fit in Figure 14.2.

Figure 14.2: The Diabetes Model Curve Fit with Line Search on Unscaled Data

The qualitative look of this fit is a bit different. We leave it to you to think about how we are supposed to choose which fit is better; i.e. which fit is a better one to use for the biological reality we are trying to model? This is a really hard question. Finally, the optimal values of the parameters are

Listing 14.9: Optimal Parameter Values

update
update =

   154.37240
    62.73116
     0.57076
     3.77081
    -3.47707

The critical value here is T0 = 2π/Ω = 1.6663, which is less than 4, so this patient is normal! Also, note these sorts of optimizations are very frustrating. If we use the scaled version of the first starting point, Initial = [GOInitial; RInitial; AInitial; OInitial; dInitial], we make no progress even though we run for 60,000 iterations, and these iterations are a bit more expensive because we use line search. So let's perturb the starting point a bit and see what happens.


Listing 14.10: Runtime Results

% use scaled data
Data = [0,95/50;1,180/50;2,155/50;3,140/50];
Time = Data(:,1);
G = Data(:,2);
RInitial = 53.64/50;
GOInitial = 95/50 + RInitial;
AInitial = sqrt(log(17/5));
OInitial = pi;
dInitial = -pi;
% perturb the start point a bit
Initial = [GOInitial; .7*RInitial; AInitial; 1.1*OInitial; .9*dInitial];
[Error,G0,R,alpha,Omega,delta,normgrad,update] = DiabetesGrad(Initial,5.0e-4,40000,Data,1);
InitialE = 0.36485
E = 1.5064e-06

We have success! We see the fit to the data in Figure 14.3.

Figure 14.3: The Diabetes Model Curve Fit with Line Search on Scaled Data

The moral here is that it is quite difficult to automate our investigations. This is why truly professional optimization code is so complicated. In our problem here, we have a hard time finding a good starting point and we even find that scaling the data – which seems like a good idea – is not as helpful as we thought it would be. Breathe a deep sigh and accept this as our lot!

Homework


Exercise 14.1.31

Exercise 14.1.32

Exercise 14.1.33

Exercise 14.1.34

Exercise 14.1.35

14.1.2 An Autoimmune Model

We assume we have a large population of cells T which consists of cells which are infected or altered in two distinct ways by a trigger V, based on signals I, J and K. These two distinct populations of cells will be labeled M and N. There are also non infected cells, H, and non infected cells which will be removed due to auto immune action, which we call C, for collateral damage. We will be using the same approach to studying nonlinear interactions that was used in (Peterson et al. (11) June 21, 2017).

We assume the dynamics here are

C′(t) = F1(C, M, N)
M′(t) = F2(C, M, N)
N′(t) = F3(C, M, N)

There are then three nonlinear interaction functions F1, F2 and F3 because we know C, M and N depend on each other's levels in very complicated ways. Usually, we assume the initial trigger dose V0 gives rise to some fraction of infected cells and the effect of the trigger will be different in the two cell populations M and N.

Assumption 14.1.1

We assume the number of infected cells is p0 V0 which is split into p1 p0 V0 in population N and p2 p0 V0 in M, where p1 + p2 = 1.

For example, a reasonable choice is p1 = 0.99 and p2 = 0.01. Thus, the total amount of trigger that goes into altered cells is p0 V0 and the amount of free trigger is therefore (1 − p0) V0. Thus, we could expect C0 = 0, M0 = p2 p0 V0 and N0 = p1 p0 V0. However, we will explicitly assume we are starting from a point of equilibrium prior to the administration of the viral dose V0. We could assume there is always some level of collateral damage C0 in a host, but we will not do that. We will therefore assume C, M and N have achieved the values C0 = 0, M0 = 0 and N0 = 0 right before the moment of alteration by the trigger. Hence, we don't expect there to be an initial contribution to C′(0), M′(0) and N′(0); i.e. F1(C0, M0, N0) = 0, F2(C0, M0, N0) = 0 and F3(C0, M0, N0) = 0. We are interested in the deviation of C, M and N from their optimal values C0, M0 and N0, so let c = C − C0, m = M − M0 and n = N − N0. We can then write C = C0 + c, M = M0 + m and N = N0 + n. The model can then be rewritten as

(C0 + c)′(t) = F1(C0 + c, M0 + m, N0 + n)
(M0 + m)′(t) = F2(C0 + c, M0 + m, N0 + n)
(N0 + n)′(t) = F3(C0 + c, M0 + m, N0 + n)

or

c′(t) = F1(C0 + c, M0 + m, N0 + n)
m′(t) = F2(C0 + c, M0 + m, N0 + n)
n′(t) = F3(C0 + c, M0 + m, N0 + n)

Next, we do a standard tangent plane approximation on the nonlinear dynamics functions F1, F2 and F3 to derive approximation dynamics. We find the approximate dynamics are

  [c′]     [F°1c  F°1m  F°1n] [c]
  [m′]  ≈  [F°2c  F°2m  F°2n] [m]
  [n′]     [F°3c  F°3m  F°3n] [n]

where we now use a standard subscript scheme to indicate the partials. In (Peterson et al. (12) June 21, 2017), we show how to build additional models which include the signals IFN-γ (I), J and K. However, you should be able to see how powerful the idea of linear approximation is even if we do not know the values of the partial derivatives at the equilibrium points. A small simulation of the linearized dynamics appears below.
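Even though the partials are unknown, we can explore the qualitative behavior of the linearized system by choosing trial values. The coefficient matrix below is purely hypothetical; a simple Euler stepping loop then shows how the deviations c, m and n evolve from a small initial alteration.

% purely hypothetical entries for the matrix of partials at equilibrium
A = [ -0.10  0.40  0.05;
       0.00 -0.20  0.00;
       0.00  0.10 -0.30 ];
y = [0; 0.01; 0.99];               % small initial deviations (c, m, n)
dt = 0.01; T = 20; steps = round(T/dt);
Y = zeros(3, steps+1); Y(:,1) = y;
for k = 1:steps
  y = y + dt*(A*y);                % forward Euler step on y' = A y
  Y(:,k+1) = y;
end
plot(linspace(0,T,steps+1), Y');
legend('c','m','n');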

14.1.2.1 Homework

Exercise 14.1.36

Exercise 14.1.37

Exercise 14.1.38

Exercise 14.1.39

Exercise 14.1.40

14.2 Finite Difference Approximations in PDE

Another way to approximate the solution of linear partial differential equations is to use what are called finite difference techniques. Essentially, in these methods, the various partial derivatives are replaced by tangent line like approximations. We first must discuss how we handle the resulting approximation error carefully. Let's review how to approximate a function of two variables using Taylor series. Let's assume that u(x, t) is a nice, smooth function of the two variables x (our spatial variable) and t (our time variable). For ease of discussion, we will focus on the space interval [0, L] and the time interval [0, M] where L and M are both positive numbers. We assume that u is continuous on the rectangle D = [0, L] × [0, M]. This means that lim_{(x,t)→(x0,t0)} u(x, t) = u(x0, t0) for all pairs (x0, t0) in D. Note that since the rectangle is two dimensional, there are an infinite number of ways the pairs (x, t) can approach (x0, t0). Continuity at the point (x0, t0) means that it does not matter how we do the approach; we always get the same answer. More precisely, if the positive tolerance ε is given, there is a positive number δ so that

√( (x − x0)² + (t − t0)² ) < δ and (x, t) ∈ D  ⟹  |u(x, t) − u(x0, t0)| < ε.

We also assume the partial derivatives up to order 4 are continuous on D. Thus, we assume the continuity of many orders of partials. These terms get hard to write down as there are so many of them, so let's define D^k_{ij}u to be the kth partial derivative of u with respect to x i times and t j times where i + j must add up to k. More formally,

D^k_{ij}u = ∂^k u / ( ∂x^i ∂t^j ), for all i + j = k, with i, j ≥ 0.

So using this notation, D^1_{10}u = ∂u/∂x, D^2_{11}u = ∂²u/(∂x ∂t), D^4_{40}u = ∂⁴u/∂x⁴ and so forth. We will assume the partials D^k_{ij}u are continuous on D up to and including k = 4. It is known that continuous functions


on a rectangle of the form D must be bounded. Hence, there are positive numbers B^k_{ij} so that for all (x, t) in D, we have

|D^k_{ij}u(x, t)| < B^k_{ij}, for all (x, t) ∈ D.

Hence, the maximum of |D^k_{ij}u(x, t)| over D exists and, letting ||D^k_{ij}u|| = max |D^k_{ij}u(x, t)| over all (x, t) ∈ D, we have each ||D^k_{ij}u|| ≤ B^k_{ij}. This implies we can take the maximum of all these individual bounds and state that there is a single constant C so that

||D^k_{ij}u|| ≤ C.    (14.2)

14.2.1 Approximating First Order Partials

Define h(t) = u(x0, t). The second order Taylor expansion of h is then

h(t) = h(t0) + h′(t0)(t − t0) + (1/2) h′′(c_t)(t − t0)²

where c_t is some point between t0 and t. Using the chain rule for functions of two variables, it is easy to see h(t0) = u(x0, t0), h′(t) = ut(x0, t) and h′′(t) = utt(x0, t). Hence, we can rewrite the expansion as

u(x0, t) = u(x0, t0) + ut(x0, t0)(t − t0) + (1/2) utt(x0, c_t)(t − t0)².

We can then write these another way by letting ∆t = t− t0. This gives

u(x0, t0 + ∆t) = u(x0, t0) + ut(x0, t0)∆t + (1/2) utt(x0, c_t)(∆t)².

We can use these expansions to estimate the first order partials in a variety of ways.

Homework

Exercise 14.2.1

Exercise 14.2.2

Exercise 14.2.3

Exercise 14.2.4

Exercise 14.2.5

14.2.1.1 Central Differences

First, let’s look at the central difference for the first order partial derivative with respect to t. We have

u(x0, t0 − ∆t) = u(x0, t0) − ut(x0, t0)∆t + (1/2) utt(x0, c_t^−)(∆t)²
u(x0, t0 + ∆t) = u(x0, t0) + ut(x0, t0)∆t + (1/2) utt(x0, c_t^+)(∆t)².

Subtracting, we find


u(x0, t0 + ∆t) − u(x0, t0 − ∆t) = 2 ut(x0, t0)∆t + (1/2)( utt(x0, c_t^+) − utt(x0, c_t^−) )(∆t)².

Thus,

ut(x0, t0) = ( u(x0, t0 + ∆t) − u(x0, t0 − ∆t) ) / (2∆t) − (1/4)( utt(x0, c_t^+) − utt(x0, c_t^−) )∆t

Homework

Exercise 14.2.6

Exercise 14.2.7

Exercise 14.2.8

Exercise 14.2.9

Exercise 14.2.10

14.2.1.2 Forward Differences

Next, let’s look at the forward difference. We have

u(x0, t0 + ∆t) = u(x0, t0) + ut(x0, t0)∆t + (1/2) utt(x0, c_t^+)(∆t)².

Subtracting, we find

u(x0, t0 + ∆t) − u(x0, t0) = ut(x0, t0)∆t + (1/2) utt(x0, c_t^+)(∆t)².

Thus,

ut(x0, t0) = ( u(x0, t0 + ∆t) − u(x0, t0) ) / ∆t − (1/2) utt(x0, c_t^+)∆t
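We can watch these error terms numerically. The forward difference error shrinks like ∆t, while the central difference error actually shrinks like (∆t)², since utt(x0, c_t^+) − utt(x0, c_t^−) is itself small as both intermediate points squeeze down to t0. Here is a minimal Octave check using the illustrative choice u(x, t) = sin(x) e^{−t}, for which ut(x, t) = −sin(x) e^{−t}.

% compare forward and central difference approximations of u_t
u  = @(x,t) sin(x).*exp(-t);
ut = @(x,t) -sin(x).*exp(-t);
x0 = 1.0; t0 = 0.5;
for dt = [0.1 0.05 0.025 0.0125]
  fwd = ( u(x0,t0+dt) - u(x0,t0) )/dt;
  ctr = ( u(x0,t0+dt) - u(x0,t0-dt) )/(2*dt);
  printf('dt = %8.5f fwd err = %10.3e ctr err = %10.3e\n', ...
         dt, abs(fwd - ut(x0,t0)), abs(ctr - ut(x0,t0)));
end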

Homework

Exercise 14.2.11

Exercise 14.2.12

Exercise 14.2.13

Exercise 14.2.14

Exercise 14.2.15


14.2.2 Approximating Second Order Partials

The approximation we wish to use for the second order partial uxx is obtained like this. We fix the point (x0, t0) as usual and let g(x) = u(x, t0). The fourth order Taylor expansion of g is then

g(x) = g(x0) + g′(x0)(x − x0) + (1/2) g′′(x0)(x − x0)² + (1/6) g′′′(x0)(x − x0)³ + (1/24) g′′′′(c_x^+)(x − x0)⁴

where c_x^+ is some point between x0 and x. It is easy to see g(x0) = u(x0, t0), g′(x) = ux(x, t0) and g′′(x) = uxx(x, t0) and so forth. Hence, letting ∆x = x − x0, we can rewrite the expansion as

u(x0 + ∆x, t0) = u(x0, t0) + ux(x0, t0)∆x + (1/2) uxx(x0, t0)(∆x)² + (1/6) uxxx(x0, t0)(∆x)³ + (1/24) uxxxx(c_x^+, t0)(∆x)⁴

A similar expansion gives

u(x0 − ∆x, t0) = u(x0, t0) − ux(x0, t0)∆x + (1/2) uxx(x0, t0)(∆x)² − (1/6) uxxx(x0, t0)(∆x)³ + (1/24) uxxxx(c_x^−, t0)(∆x)⁴

Adding, we have

u(x0 + ∆x, t0) + u(x0 − ∆x, t0) = 2 u(x0, t0) + uxx(x0, t0)(∆x)² + (1/24)( uxxxx(c_x^+, t0) + uxxxx(c_x^−, t0) )(∆x)⁴

The intermediate value theorem tells us that a continuous function takes on every value between two points. Hence, there is a point c_x between c_x^+ and c_x^− so that

uxxxx(c_x, t0) = (1/2)( uxxxx(c_x^+, t0) + uxxxx(c_x^−, t0) ).

We can then write the approximation as

u(x0 + ∆x, t0) + u(x0 − ∆x, t0) = 2 u(x0, t0) + uxx(x0, t0)(∆x)² + (1/12) uxxxx(c_x, t0)(∆x)⁴

which tells us that

uxx(x0, t0) = ( u(x0 + ∆x, t0) + u(x0 − ∆x, t0) − 2 u(x0, t0) ) / (∆x)² − (1/12) uxxxx(c_x, t0)(∆x)²
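The same kind of numerical check applies to the second order formula: the error should fall like (∆x)². A minimal Octave sketch, again with the illustrative choice u(x, t) = sin(x) e^{−t}, for which uxx(x, t) = −sin(x) e^{−t}:

% check the second central difference approximation to u_xx
u   = @(x,t) sin(x).*exp(-t);
uxx = @(x,t) -sin(x).*exp(-t);
x0 = 1.0; t0 = 0.5;
for dx = [0.1 0.05 0.025]
  approx = ( u(x0+dx,t0) + u(x0-dx,t0) - 2*u(x0,t0) )/dx^2;
  printf('dx = %7.4f err = %10.3e\n', dx, abs(approx - uxx(x0,t0)));
end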

14.2.3 Homework

Exercise 14.2.16

Exercise 14.2.17

Exercise 14.2.18


Exercise 14.2.19

Exercise 14.2.20

14.3 Approximating the Diffusion Equation

Using these approximations, we can approximate the diffusion equation ut = D uxx using a forward difference for the first order partial with respect to time as

ut(x0, t0) − D uxx(x0, t0) = ( u(x0, t0 + ∆t) − u(x0, t0) ) / ∆t − (1/2) utt(x0, c_t^+)∆t
    − D( ( u(x0 + ∆x, t0) + u(x0 − ∆x, t0) − 2 u(x0, t0) ) / (∆x)² − (1/12) uxxxx(c_x, t0)(∆x)² )

This can be reorganized as

ut(x0, t0) − D uxx(x0, t0) = ( u(x0, t0 + ∆t) − u(x0, t0) ) / ∆t
    − D( ( u(x0 + ∆x, t0) + u(x0 − ∆x, t0) − 2 u(x0, t0) ) / (∆x)² )
    − (1/2) utt(x0, c_t^+)∆t + (D/12) uxxxx(c_x, t0)(∆x)²

Now consider the error

| ut(x0, t0) − D uxx(x0, t0) − ( u(x0, t0 + ∆t) − u(x0, t0) ) / ∆t
    + D( ( u(x0 + ∆x, t0) + u(x0 − ∆x, t0) − 2 u(x0, t0) ) / (∆x)² ) |
    ≤ (1/2) |utt(x0, c_t^+)| ∆t + (D/12) |uxxxx(c_x, t0)| (∆x)².

Then, using the estimates from Equation 14.2, we have

| ut(x0, t0) − D uxx(x0, t0) − ( u(x0, t0 + ∆t) − u(x0, t0) ) / ∆t
    + D( ( u(x0 + ∆x, t0) + u(x0 − ∆x, t0) − 2 u(x0, t0) ) / (∆x)² ) |
    ≤ (1/2) C ∆t + (D/12) C (∆x)²,    (14.3)

which clearly goes to zero as ∆t and ∆x go to zero. Hence, we know the replacement of the original partial derivatives in the diffusion equation by the finite difference approximations can be made as accurate as we wish by suitable choice of ∆t and ∆x. To see how we translate this into an approximate numerical method, divide the space interval [0, L] into pieces of size ∆x; then there will be N ≈ L/∆x such pieces within the interval. Divide the time interval [0, T] in a similar way by ∆t into M ≈ T/∆t subintervals.


The true solution at the point (n∆x, m∆t) is then u(n∆x, m∆t), which we will denote by u_{n,m}. Hence, we know that the pair (n∆x, m∆t) satisfies

( u(n∆x, (m+1)∆t) − u(n∆x, m∆t) ) / ∆t
    − D( ( u((n+1)∆x, m∆t) + u((n−1)∆x, m∆t) − 2 u(n∆x, m∆t) ) / (∆x)² )
    = −(1/2) utt(x0, c_t^+)∆t + (D/12) uxxxx(c_x, t0)(∆x)²

Letting E¹_{n,m} = −(1/2) utt(x0, c_t^+) and E²_{n,m} = (D/12) uxxxx(c_x, t0), this is then rewritten as

( u_{n,m+1} − u_{n,m} ) / ∆t − D( ( u_{n+1,m} + u_{n−1,m} − 2 u_{n,m} ) / (∆x)² ) = E¹_{n,m}∆t + E²_{n,m}(∆x)².

Now solve for the term u_{n,m+1} to find

u_{n,m+1} = u_{n,m} + ( D∆t/(∆x)² )( u_{n+1,m} + u_{n−1,m} − 2 u_{n,m} ) + E¹_{n,m}(∆t)² + E²_{n,m}∆t(∆x)².    (14.4)

We therefore know the error we make in computing

u_{n,m+1} = u_{n,m} + ( D∆t/(∆x)² )( u_{n+1,m} + u_{n−1,m} − 2 u_{n,m} )

using the true solution values is reasonably small when ∆x and ∆t are also sufficiently small due to the error estimates in Equation 14.3. At this point, there is no relationship between ∆x and ∆t that is required for the approximation to work. However, when we take Equation 14.4 and solve it iteratively as a recursive equation, problems arise. Consider a full diffusion equation model on the domain D:

∂u/∂t − D ∂²u/∂x² = 0
u(0, t) = 0, for 0 ≤ t ≤ T
u(L, t) = 0, for 0 ≤ t ≤ T
u(x, 0) = f(x), for 0 ≤ x ≤ L

The discrete boundary conditions at each grid point (n∆x,m∆t) are then

u_{0,m} = u(0, m∆t) = 0, 1 ≤ m ≤ M    (14.5)
u_{N,m} = u(N∆x, m∆t) = 0, 1 ≤ m ≤ M    (14.6)
u_{n,0} = u(n∆x, 0) = f(n∆x) = f_n, 0 ≤ n ≤ N    (14.7)

as N∆x = L and M∆t = T. Using these boundary conditions coupled with the discrete dynamics of Equation 14.4, we obtain a full recursive system to solve. First, recall the true values satisfy Equation 14.4. Letting r = ∆t/(∆x)² and adding the boundary conditions, we obtain the full system


u_{n,m+1} = u_{n,m} + D r ( u_{n+1,m} + u_{n−1,m} − 2 u_{n,m} ) + E¹_{n,m}(∆t)² + E²_{n,m}∆t(∆x)²    (14.8)
u_{0,m} = 0, 1 ≤ m ≤ M    (14.9)
u_{N,m} = 0, 1 ≤ m ≤ M    (14.10)
u_{n,0} = f_n, 0 ≤ n ≤ N    (14.11)

This gives us a recursive system we can solve of the form

v_{n,m+1} = v_{n,m} + D r ( v_{n+1,m} + v_{n−1,m} − 2 v_{n,m} )    (14.12)
v_{0,m} = 0, 1 ≤ m ≤ M    (14.13)
v_{N,m} = 0, 1 ≤ m ≤ M    (14.14)
v_{n,0} = f_n, 0 ≤ n ≤ N    (14.15)

where v_{n,m} is the solution to this discrete recursion system at the grid points. To solve this system, we start at time 0, time index value m = 0, and use Equation 14.12 to get

v_{1,1} = v_{1,0} + D r ( v_{2,0} + v_{0,0} − 2 v_{1,0} ).

We know v_{0,0} = f_0, v_{1,0} = f_1 and v_{2,0} = f_2. Hence, we have

v_{1,1} = f_1 + D r ( f_2 + f_0 − 2 f_1 ).

We can do this for any n to find

v_{n,1} = v_{n,0} + D r ( v_{n+1,0} + v_{n−1,0} − 2 v_{n,0} ) = f_n + D r ( f_{n+1} + f_{n−1} − 2 f_n )

Once the values at m = 1 have been found, we use the recursion to find the values at the next time step m = 2 and so on. A minimal sketch of the m = 1 sweep in code is given below.
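Here is that single sweep as a minimal Octave sketch. The grid, the diffusion constant and the initial data are illustrative choices; note Dr = 0.4 respects the stability constraint we derive in the next section.

% one sweep of the recursion: the time level m = 1 values from the f_n
dx = 0.1; dt = 0.004; D = 1.0; Dr = D*dt/dx^2;   % here Dr = 0.4
xg = (0:dx:1)';                                  % grid on [0,1]
f = sin(pi*xg);                                  % illustrative initial data f_n
N = length(f) - 1;
v = zeros(N+1,1);
for k = 2:N                                      % interior grid points
  v(k) = f(k) + Dr*( f(k+1) + f(k-1) - 2*f(k) );
end
v(1) = 0; v(N+1) = 0;                            % Dirichlet boundary values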

14.3.0.1 Homework

Exercise 14.3.1

Exercise 14.3.2

Exercise 14.3.3

Exercise 14.3.4

Exercise 14.3.5

14.3.1 Error Analysis

Let's denote the difference between the true grid point values and the discrete solution values by w_{n,m} = u_{n,m} − v_{n,m}. A simple subtraction shows that w_{n,m} satisfies the following discrete system:


w_{n,m+1} = w_{n,m} + D r ( w_{n+1,m} + w_{n−1,m} − 2 w_{n,m} ) + E¹_{n,m}(∆t)² + E²_{n,m}∆t(∆x)²    (14.16)
w_{0,m} = 0, 1 ≤ m ≤ M    (14.17)
w_{N,m} = 0, 1 ≤ m ≤ M    (14.18)
w_{n,0} = 0, 0 ≤ n ≤ N    (14.19)

From our previous estimates, we know E¹_{n,m}(∆t)² + E²_{n,m}∆t(∆x)² ≤ C( (∆t)² + ∆t(∆x)² ) and so we can overestimate these terms to obtain

|w_{n,m+1}| ≤ |1 − 2Dr| |w_{n,m}| + Dr( |w_{n+1,m}| + |w_{n−1,m}| ) + C( (∆t)² + ∆t(∆x)² )

Now since this equation holds for all x grid positions, letting ||w_p|| = max_{0≤n≤N} |w_{n,p}|, we see

|w_{n,m+1}| ≤ |1 − 2Dr| ||w_m|| + 2Dr ||w_m|| + E(∆t, ∆x)

where E(∆t, ∆x) is the error term C( (∆t)² + ∆t(∆x)² ). This inequality holds for all indices n on the left hand side, which implies

||w_{m+1}|| ≤ ( |1 − 2Dr| + 2Dr ) ||w_m|| + E(∆t, ∆x)

Letting g = |1 − 2Dr| + 2Dr, we find the estimate

||w_1|| ≤ g ||w_0|| + E(∆t, ∆x) = E(∆t, ∆x)

since ||w_0|| = 0 due to our error boundary conditions. The next steps are

||w_2|| ≤ g ||w_1|| + E(∆t, ∆x) ≤ (1 + g) E(∆t, ∆x)
||w_3|| ≤ g ||w_2|| + E(∆t, ∆x) ≤ (1 + g + g²) E(∆t, ∆x)
...
||w_N|| ≤ g ||w_{N−1}|| + E(∆t, ∆x) ≤ (1 + g + g² + ... + g^{N−1}) E(∆t, ∆x).

If g ≤ 1, we find ||w_N|| ≤ N E(∆t, ∆x). However, since g ≤ 1, we know |1 − 2Dr| + 2Dr ≤ 1. This says |1 − 2Dr| ≤ 1 − 2Dr, which implies Dr ≤ 1/2. Therefore, there is a relationship between ∆t and ∆x: ∆t ≤ (∆x)²/(2D), so certainly ∆t ≤ (∆x)²/D. Thus, the equation g = |1 − 2Dr| + 2Dr becomes g = 1 − 2Dr + 2Dr = 1 and g is never actually less than 1. Hence, since N ≈ L/∆x, we have

||w_N|| ≤ (LC/∆x)( (∆t)² + ∆t(∆x)² ) ≤ (LC/∆x)( (∆x)⁴/D² + (∆x)⁴/D ) = ( LC(1 + D)/D² ) (∆x)³

which goes to zero as ∆x goes to zero. Hence, in this case, the recursive equation has a well behaved error and the solution to the recursion equation approaches the solution to the diffusion equation. We will say our finite difference approximation is stable if the relationship Dr ≤ 1/2 is satisfied.


Homework

Exercise 14.3.6

Exercise 14.3.7

Exercise 14.3.8

Exercise 14.3.9

Exercise 14.3.10

Note we can arrive at this result a little faster by ignoring the error E(∆t, ∆x) and looking at the solutions to the pure discrete recursion system only. We assume these have the form w_{n,m} = g^m e^{inβ} for a positive g, where recall e^{iθ} = cos(θ) + i sin(θ). Then, we find

g^{m+1} e^{inβ} = g^m e^{inβ} + Dr( g^m e^{i(n+1)β} + g^m e^{i(n−1)β} − 2 g^m e^{inβ} )
            = g^m e^{inβ}( 1 + Dr( e^{iβ} + e^{−iβ} − 2 ) )

But 2 cos(β) = e^{iβ} + e^{−iβ} and so we have

g^{m+1} e^{inβ} = g^m e^{inβ}( 1 + 2Dr( cos(β) − 1 ) )

Dividing, we have

g = 1 + 2Dr(cos(β)− 1).

This is the multiplier that is applied at each time step. We can do the same analysis to show that solutions of the form w_{n,m} = g^m e^{−inβ} have the same multiplier. Then combinations of these two complex solutions lead to the usual two real solutions, φ_{n,m} = g^m cos(nβ) and ψ_{n,m} = g^m sin(nβ). So the use of the complex form of the assumed solution is a useful way to obtain the multiplier g with minimal algebraic complications. To ensure the error does not go to infinity, we must have |g| < 1. We want

−1 < 1 − 2Dr( 1 − cos(β) ) < 1

Since Dr > 0 and 1 − cos(β) ≥ 0, we have 1 − 2Dr(1 − cos(β)) < 1 always, so the top inequality is satisfied. For the bottom inequality, the worst case is 1 − cos(β) = 2, so we need 1 − 4Dr > −1. This implies Dr < 1/2. Thus, stability is not guaranteed unless

D∆t < (∆x)²/2.
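It is easy to watch this stability boundary numerically. The sketch below runs the recursion twice on illustrative data, once with Dr just under 1/2 and once just over; the first run decays as it should, while the second is eventually destroyed by exponentially growing oscillations seeded by roundoff.

% illustrative demonstration of the stability boundary Dr = 1/2
N = 50; steps = 2000;
for Dr = [0.45 0.55]
  v = sin(pi*(0:N)'/N);            % initial data on the grid over [0,1]
  for m = 1:steps
    vn = v;
    vn(2:N) = v(2:N) + Dr*( v(3:N+1) + v(1:N-1) - 2*v(2:N) );
    v = vn; v(1) = 0; v(N+1) = 0;
  end
  printf('Dr = %4.2f max |v| after %d steps = %e\n', Dr, steps, max(abs(v)));
end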

14.3.2 Homework

Exercise 14.3.11

Exercise 14.3.12


Exercise 14.3.13

Exercise 14.3.14

Exercise 14.3.15

14.4 Implementing The Diffusion Equation Finite Difference Approximation

We can now implement the finite difference scheme for a typical diffusion/heat equation model. This is done in the function NumericalHeatDirichlet which uses the arguments dataf, the name of the data f(x) on the bottom edge, gamma, the diffusion constant D, and L and T, the dimensions of the rectangle D = [0, L] × [0, T]. Stability implies that ∆t can at most be (1/(2γ))(∆x)², and we allow the use of a fraction of that maximum by using the parameter deltfrac to determine ∆t via the equation ∆t = deltfrac · (1/(2γ))(∆x)². Finally, we set the desired ∆x using the parameter delx. Here are the details. First we calculate the desired ∆t.

Listing 14.11: Find ∆t

% calculate delt
delt = .5*deltfrac*delx^2/gamma

We then find the number of time and space steps and set up the x and t data points for the grid.

Listing 14.12: Setup time and space grid data points

% calculate N and M
N = round(L/delx);
M = round(T/delt);

% setup linspaces
x = linspace(0,L,N+1);
t = linspace(0,T,M+1);

We then implement the finite difference scheme using a double for loop construction.

Listing 14.13: Implement Finite Difference Scheme

% setup V
V = zeros(N+1,M+1);

% setup V(n,1) = data function(n)
for n = 1:N+1
  V(n,1) = dataf(x(n));
end

% setup r value
r = delt/(delx^2);

% find numerical solution to the heat equation
for m = 1:M
  for n = 2:N
    V(n,m+1) = gamma*r*V(n+1,m) + (1-2*gamma*r)*V(n,m) + gamma*r*V(n-1,m);
  end
end

Finally, we plot the solution surface.

Listing 14.14: Plot Solution Surface

% setup surface plot
[X, Time] = meshgrid(x,t);
mesh(X,Time,V','EdgeColor','blue');
xlabel('x axis');
ylabel('t axis');
zlabel('Solution');
title('Heat Equation on Square');

The full code is below.

Listing 14.15: NumericalHeatDirichlet.m

function NumericalHeatDirichlet(dataf, gamma, L, T, deltfrac, delx)
  %
  % dataf is our data function
  % gamma is the diffusion coefficient
  % L is the x interval
  % T is the time interval
  % deltfrac is how much of the maximum delt to use
  % delx is our chosen delta x
  % we calculate delt = .5*deltfrac*(delx^2)/gamma
  % N = round(L/delx)
  % M = round(T/delt)
  %
  % calculate delt
  delt = .5*deltfrac*delx^2/gamma

  % calculate N and M
  N = round(L/delx);
  M = round(T/delt);

  % setup linspaces
  x = linspace(0,L,N+1);
  t = linspace(0,T,M+1);

  % setup V
  V = zeros(N+1,M+1);

  % setup V(n,1) = data function(n)
  for n = 1:N+1
    V(n,1) = dataf(x(n));
  end

  % setup r value
  r = delt/(delx^2);
  % find numerical solution to the heat equation
  for m = 1:M
    for n = 2:N
      V(n,m+1) = gamma*r*V(n+1,m) + (1-2*gamma*r)*V(n,m) + gamma*r*V(n-1,m);
    end
  end

  % setup surface plot
  [X, Time] = meshgrid(x,t);
  mesh(X,Time,V','EdgeColor','blue');
  xlabel('x axis');
  ylabel('t axis');
  zlabel('Solution');
  title('Heat Equation on Square');
end

Here is a typical use for a heat equation model with diffusion coefficient 0.09 for a space interval of [0, 5] and a time interval of [0, 10] using a space step size of 0.2.

Listing 14.16: Generating a Finite Difference Heat Equation Solution

pulse = @(x) pulsefunc(x,2,.2,100);
NumericalHeatDirichlet(pulse,.09,5,10,.9,.2);
delt = 0.20000

This generates the plot we see in Figure 14.4.
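The helper pulsefunc is not listed in this section. A minimal version consistent with the call above might look like the following, where the arguments are taken to be the pulse center, half width and height; this interface is an assumption, not the author's original code.

function y = pulsefunc(x, center, width, height)
  % assumed interface: a rectangular pulse of the given height
  % on the interval [center - width, center + width]
  y = height*( x >= center - width & x <= center + width );
end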

14.4.1 Homework

Exercise 14.4.1

Exercise 14.4.2

Exercise 14.4.3

Exercise 14.4.4

Exercise 14.4.5


Figure 14.4: Finite Difference Solution To Heat Equation


Part IV

Integration



Chapter 15

Integration in Multiple Dimensions

In this chapter, we are going to discuss Riemann Integration in ℜⁿ. This is a much more interesting thing than the development laid out in (Peterson (8) 2019) as the topology of ℜ is much simpler than the topology of ℜⁿ for n > 1. We wish to do this from a more abstract perspective.

15.1 The Darboux Integral

Let f : A ⊂ ℜⁿ → ℜ be a bounded function on the bounded set A. Enclose the set A inside a bounded hyperrectangle

R = [a1, b1]× [a2, b2]× . . .× [an, bn] =

n∏i=1

[ai, bi]

There is no need to try to find the tightest bounded box possible. We will show later the definition ofintegration we develop is independent of this choice. We extend f to R to f on R as follows:

f(x) =

f(x), x ∈ A0, x ∈ R \A

In <2, we would have what we see in Figure 15.1(a) and in <3, it could look like the image in Figure15.1(b). The extension f allows us to define what we mean by integration without worrying about

(a) A bounded set $A$ in $\Re^2$ (b) A bounded set $A$ in $\Re^3$

Figure 15.1: Bounded Integration Domains


Figure 15.2: A bounded A and a partition P

We begin by slicing $R$ up into a rectangular grid to form what are called partitions.

Definition 15.1.1 Partitions of the bounded set A

A partition $P$ of $A$ is defined as follows: For any bounded rectangle $R$ containing $A$, subdivide each axis using a finite collection of points this way:
$$P_1 = \{x_{11} = a_1 < x_{12} < x_{13} < \ldots < x_{1,p_1} = b_1\}$$
$$P_2 = \{x_{21} = a_2 < x_{22} < x_{23} < \ldots < x_{2,p_2} = b_2\}$$
$$\vdots$$
$$P_k = \{x_{k1} = a_k < x_{k2} < x_{k3} < \ldots < x_{k,p_k} = b_k\}$$
$$\vdots$$
$$P_n = \{x_{n1} = a_n < x_{n2} < x_{n3} < \ldots < x_{n,p_n} = b_n\}$$
The partition $P$ is the collection of points in $\Re^n$ defined by
$$P = P_1 \times P_2 \times \ldots \times P_n$$
Note each $P_k$ is a traditional partition as used in one dimensional integration theory. The sizes of the $P_k$ are determined by the integers $p_k$, which need not all be the same, of course.

Such a $P$ defines a set of grid points in $R$, and the lines of constant $x_{i,j}$ slice $R$, and therefore also $A$, into hyperrectangles. Figure 15.2 shows what this could look like in two dimensions. It is a lot more cluttered to try to draw this in three dimensions! And in higher than three dimensions, we cannot come up with visualizations at all, so it is best to be able to understand this abstractly. In the figure, we use the partition
$$P = P_1 \times P_2 = \{x_{11}, x_{12}, x_{13}, x_{14}, x_{15}\} \times \{x_{21}, x_{22}, x_{23}, x_{24}, x_{25}\}$$

If you are familiar with MATLAB code, you should see that if $P_1$ is represented by one linspace command and $P_2$ by another, the meshgrid command creates the partition $P$.
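For instance, a minimal sketch (the endpoints and number of points are illustrative choices):

% one linspace per axis gives the one dimensional partitions
P1 = linspace(0,1,5);          % x_{11},...,x_{15}
P2 = linspace(0,2,5);          % x_{21},...,x_{25}
% meshgrid generates all the grid points of the partition P = P1 x P2
[X1,X2] = meshgrid(P1,P2);     % (X1(i,j),X2(i,j)) runs through the points of P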


For simplicity, let's note a partition $P$ determines a finite number of hyperrectangles of the form
$$S = [x_{1,j_1}; x_{1,j_1+1}] \times [x_{2,j_2}; x_{2,j_2+1}] \times \ldots \times [x_{n,j_n}; x_{n,j_n+1}]$$
where we are separating the points in each $[\cdot\,;\cdot]$ pair by semicolons, as using a comma becomes very confusing. Here the integers $j_1$ through $j_n$ range over all the possible choices each $P_i$ allows; i.e. $1 \le j_i < p_i$. There are a large number of these rectangles, but still finitely many. The partition $P$ will determine $(p_1 - 1)(p_2 - 1) \cdots (p_n - 1)$ such rectangles and each rectangle will have a hypervolume
$$V(S) = \underbrace{(x_{1,j_1+1} - x_{1,j_1})}_{\text{length on axis }1} \cdot \underbrace{(x_{2,j_2+1} - x_{2,j_2})}_{\text{length on axis }2} \cdots \underbrace{(x_{n,j_n+1} - x_{n,j_n})}_{\text{length on axis }n}$$
In general, it is not important how we label these finitely many rectangles, so we will just say the partition $P$ determines a finite number of rectangles $N$ and label them as $S_1, \ldots, S_N$ or $(S_i)_{i=1}^{N}$.
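In two dimensions, the hypervolumes (areas) of all the rectangles a partition determines can be computed at once with diff; a small sketch, with illustrative nonuniform partitions:

P1 = [0 0.2 0.5 1.0];      % p1 = 4 points, so 3 subintervals on axis 1
P2 = [0 0.4 1.0];          % p2 = 3 points, so 2 subintervals on axis 2
dx = diff(P1);             % subinterval lengths on axis 1
dy = diff(P2);             % subinterval lengths on axis 2
V  = dx' * dy;             % V(i,j) = area of rectangle S_{ij}
% numel(V) = (p1-1)*(p2-1) = 6, matching the rectangle count formula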

Note a rectangle $S_j$ need not intersect $A$! For each rectangle $S_i$ define
$$m_i(f) = \inf_{(x_1,\ldots,x_n) \in S_i} f(x_1,\ldots,x_n), \quad M_i(f) = \sup_{(x_1,\ldots,x_n) \in S_i} f(x_1,\ldots,x_n)$$
These numbers are finite as we assume $f$ is bounded on $A$. Since it is cumbersome to use the $n$-tuple notation for the points in $S_i$, we will use $x$ to indicate this instead. Thus
$$m_i(f) = \inf_{x \in S_i} f(x), \quad M_i(f) = \sup_{x \in S_i} f(x)$$

We can now define the Darboux lower and upper sums for f on A.

Definition 15.1.2 Darboux Lower and Upper Sums for f on A

Given a partition $P$ of $A$ with a bounding hyperrectangle $R$, $P$ determines $N$ hyperrectangles and the numbers $m_i(f)$ and $M_i(f)$ for $1 \le i \le N$. The Darboux Lower Sum associated with $f$, $P$, $A$ and $R$ is
$$L(f,P,A,R) = \sum_{i=1}^{N} m_i(f)\, V(S_i)$$
and the Darboux Upper Sum associated with $f$, $P$, $A$ and $R$ is
$$U(f,P,A,R) = \sum_{i=1}^{N} M_i(f)\, V(S_i)$$
The $A$ is often understood from context and so we would usually write $L(f,P,R)$ and $U(f,P,R)$ to indicate these sums might depend on the choice of bounding hyperrectangle $R$. Since $N$ depends on $P$, we generally write these as
$$L(f,P,R) = \sum_{i \in P} m_i(f)\, V(S_i), \quad U(f,P,R) = \sum_{i \in P} M_i(f)\, V(S_i)$$

Comment 15.1.1 For a given partition $P$, it is always true that $L(f,P,R) \le U(f,P,R)$ from the definition of $m_i$ and $M_i$.
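For a function which is increasing in each variable, such as $f(x,y) = x + y$, the infimum and supremum over each rectangle occur at the lower left and upper right corners, so the lower and upper sums are easy to compute; a small sketch (the function and partitions are illustrative choices):

f  = @(x,y) x + y;               % increasing in each variable
P1 = linspace(0,1,21);
P2 = linspace(0,1,21);
[DX,DY] = meshgrid(diff(P1),diff(P2));        % rectangle side lengths
[XL,YL] = meshgrid(P1(1:end-1),P2(1:end-1));  % lower left corners
[XU,YU] = meshgrid(P1(2:end),P2(2:end));      % upper right corners
L = sum(sum(f(XL,YL).*DX.*DY));  % lower sum: infimum sits at lower left corner
U = sum(sum(f(XU,YU).*DX.*DY));  % upper sum: supremum sits at upper right corner
% L <= U always; here both approach the integral value 1 under refinement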


Comment 15.1.2 Since each partition $P$ is dependent on the choice of $R$, if we had two bounding rectangles with partitions $P(R_1)$ and $P(R_2)$, we would know $R_1 = (R_1 \cap R_2) \cup (R_1 \cap R_2^C)$ and $R_2 = (R_1 \cap R_2) \cup (R_2 \cap R_1^C)$. Since $\bar{f}$ vanishes off $A$ and $A \subset R_1 \cap R_2$, the partition $P(R_1)$ corresponds to possibly nonzero values of $m_i$ and $M_i$ only on $R_1 \cap R_2$. A similar argument shows the partition $Q(R_2)$ corresponds to possibly nonzero values of $m_i$ and $M_i$ only on $R_1 \cap R_2$. You can see $P(R_1)$ induces a partition of $R_1 \cap R_2$, $P(R_1 \cap R_2)$, and also $Q(R_2)$ induces a partition of $R_1 \cap R_2$, $Q(R_1 \cap R_2)$. Hence,
$$L(f, P(R_1), R_1) = L(f, P(R_1 \cap R_2), R_1 \cap R_2)$$
$$L(f, Q(R_2), R_2) = L(f, Q(R_1 \cap R_2), R_1 \cap R_2)$$
These sums could be different, of course.

However, by thinking about this just a little differently, we can establish the following result.

Theorem 15.1.1 Lower and Upper Darboux Sums are Independent of the Choice of Bounding Rectangle

For any two bounding rectangles $R_1$ and $R_2$ for $A$ whose partitions induce the same partition on $R_1 \cap R_2$,
$$L(f,P,R_1) = L(f,P,R_2), \quad U(f,P,R_1) = U(f,P,R_2)$$

Proof 15.1.1
First, note $R_1 = (R_1 \cap R_2) \cup (R_1 \cap R_2^C)$. So a partition $P(R_1)$ induces a partition on $R_1 \cap R_2$ and there are infinitely many ways to extend $P(R_1)$ to the piece $R_2 \cap R_1^C$. Let $Q(R_2)$ be any such partition of $R_2$ which retains the partition of $P(R_1)$ on $R_1 \cap R_2$ and extends it on $R_2 \cap R_1^C$. Both of the partitions $P(R_1)$ and $Q(R_2)$ correspond to possibly nonzero values of $m_i$ and $M_i$ only on $R_1 \cap R_2$. Let $P(R_1 \cap R_2)$ denote the partition induced on $R_1 \cap R_2$ by $P(R_1)$. Then
$$L(f, P(R_1), R_1) = L(f, P(R_1 \cap R_2), R_1 \cap R_2) = L(f, Q(R_2), R_2)$$
Hence, the value of the lower sum doesn't depend on the choice of bounding rectangle. A similar argument shows the upper sums are independent of the choice of bounding rectangle also.

Comment 15.1.3 From Theorem 15.1.1, we know the lower and upper sums are independent of the choice of bounding rectangle and hence we no longer need to write $L(f,P,R)$ and $U(f,P,R)$. Hence, from now on we will simply write $L(f,P)$ and $U(f,P)$ where the partitions are determined from some choice of bounding rectangle $R$.

If we add more points to any of the $P_i$ collections that comprise $P$, we generate a new partition $P'$ called a refinement of $P$.

Definition 15.1.3 A Refinement of a Partition

Given a partition $P$ of $A$, a refinement $P'$ of $P$ is any partition of $A$ which contains all the original points of $P$. Hence, $P$ is a refinement of itself. Of course, the interesting refinements are the ones which add at least one point to $P$. These extra points will then introduce new rectangles.

In Figure 15.3, we see a refinement $P'$ that adds one additional point. We also show the new rectangles generated. In this example, you can count the number of additional rectangles and see it is 18. It is very cluttered to draw what happens if you introduce two points and so on, so you have to learn how to understand this from an abstract point of view. The new partition in the figure is
$$P' = \{x_{11}, x_{12}, x_{13}, x_{14}, \alpha, x_{15}, x_{16}\} \times \{x_{21}, x_{22}, x_{23}, \beta, x_{24}, x_{25}\}$$


Figure 15.3: One Point (α, β) is added to the partition P

Given two partitions $P$ and $Q$ of $A$, we often want to merge them into one partition by counting each point in the union of the collection of points that comprise $P$ and the points that comprise $Q$ only once.

Definition 15.1.4 The Common Refinement of Two Partitions

Let $P$ and $Q$ be two partitions of $A$. Let $U$ be the union of the collection of points from $P$ and $Q$. On the $i^{th}$ axis, $P$ and $Q$ determine the collections of points
$$P_i = \{x_{i1}, \ldots, x_{i,p_i}\} \quad \text{and} \quad Q_i = \{y_{i1}, \ldots, y_{i,q_i}\}$$
The union of $P_i$ and $Q_i$ gives the collection
$$P_i \cup Q_i = \{x_{i1}, \ldots, x_{i,p_i}, y_{i1}, \ldots, y_{i,q_i}\}$$
which, once all duplicates are removed and the remaining unique entries are ordered from low to high, determines a new collection of points on the $i^{th}$ axis, $G_i = \{z_{i1}, \ldots, z_{i,g_i}\}$. The collection $G_1 \times \ldots \times G_n$ gives the common refinement of $P$ and $Q$, denoted by $P \vee Q$.

Comment 15.1.4 Note P ∨Q is a refinement of both P and Q.
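In MATLAB, forming the axis collections $G_i$ of $P \vee Q$ is a one-liner, since unique both sorts and removes duplicates; a small sketch with illustrative axis collections:

Pi_ax = [0 0.25 0.5 1.0];        % points of P on axis i
Qi_ax = [0 0.3 0.5 0.75 1.0];    % points of Q on axis i
Gi = unique([Pi_ax Qi_ax]);      % sorted union with duplicates removed
% Gi = [0 0.25 0.3 0.5 0.75 1.0], the axis i collection for P v Q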

Homework

Exercise 15.1.1 Modify the Riemann sum MATLAB code in (Peterson (8) 2019) for the two dimensional case.

Exercise 15.1.2 Modify the Riemann sum MATLAB code in (Peterson (8) 2019) for the two dimensional case with uniform partitions and modify the code to draw the Riemann sums as surfaces.

Exercise 15.1.3 For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$ and $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$, draw the resulting partition $\pi = P \times Q$ and find the maximum area that a rectangle $S_i$ in $\pi$ can have.

Exercise 15.1.4 For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$ and $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$, add the point 1.5 to $P$ to create $P'$. Draw the resulting partition $\pi' = P' \times Q$ and find the maximum area that a rectangle $S_i$ in $\pi'$ can have.


Exercise 15.1.5 Let $A$ be the circle of radius 2 centered at $(2,6)$. For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$ and $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$, $\pi = P \times Q$ is a partition of a bounding rectangle containing this circle. Let $f(x,y) = 2x^2 + 5y^2$. Find the lower and upper sums associated with this partition. Draw the bounding box, the partition and the circle, and shade all the rectangles $S_i$ that are outside the circle one color, the rectangles inside the circle another color and the rectangles that intersect the boundary of the circle another color.

Exercise 15.1.6 Let $A$ be the circle of radius 2 centered at $(2,6)$. For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$ and $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$, $\pi = P \times Q$ is a partition of a bounding rectangle containing this circle. Let $f(x,y) = 2x^2 + 5y^2$. Add the point 1.5 to $P$ to create $P'$, giving the refinement $\pi' = P' \times Q$. Find the lower and upper sums associated with this partition. Draw the bounding box, the partition and the circle, and shade all the rectangles $S_i$ that are outside the circle one color, the rectangles inside the circle another color and the rectangles that intersect the boundary of the circle another color. What effect did adding the extra point have here?

The next question is: what happens to the Lower and Upper Darboux sums when we refine a partition?

Theorem 15.1.2 Lower and Upper Darboux Sums and Partition Refinements

Let $P'$ be a refinement of $P$ for $A$ with bounding hyperrectangle $R$. Then
$$L(f,P,R) \le L(f,P',R), \quad U(f,P,R) \ge U(f,P',R)$$
Of course, these results are independent of the bounding rectangle and could be written
$$L(f,P) \le L(f,P'), \quad U(f,P) \ge U(f,P')$$

Proof 15.1.2
From now on, we will just call hyperrectangles by the simpler phrase rectangles. We all know these are abstract things once $n > 3$! Let's start by adding one point to $P$. Let $z$ be the added point. Then $z$ will be in some rectangle $S^*$ determined by $P$. Note it could be on the boundary of $S^*$ and not necessarily in the interior. Let's organize the indices of $P$ and $P'$. Look at Figure 15.3 to see the additional rectangles in a specific case.

• The indices of $P$ can be put into three disjoint sets:

1. The singleton index $j^*$ which is the index corresponding to $S^*$.

2. The indices which correspond to rectangles due to $P$ which are not broken into pieces by the addition of the extra point $z$ to $P$. The rectangles here are exactly the same as the rectangles generated by $P'$. Let $I$ denote this set of indices.

3. The indices which correspond to rectangles which are broken into two pieces. Each of these rectangles $S_j$ splits into two rectangles for $P'$. Let $J$ denote this set of indices.

• The indices of $P'$ can be put into disjoint sets also.

1. For the index $j^*$ from $P$, the additional point $z$ causes the creation of $2^n$ new rectangles $T_{j^*,i}$ for $1 \le i \le 2^n$. So $\sum_{i=1}^{2^n} V(T_{j^*,i}) = V(S^*)$. Also, since each $T_{j^*,i} \subset S^*$,
$$\inf_{x \in T_{j^*,i}} f(x) \ge \inf_{x \in S^*} f(x)$$
we have
$$m_{T_{j^*,i}}(f) \ge m_{S^*}(f)$$


which tells us
$$\sum_{i=1}^{2^n} m_{T_{j^*,i}}(f)\, V(T_{j^*,i}) \ge m_{S^*}(f) \sum_{i=1}^{2^n} V(T_{j^*,i}) = m_{j^*}(f)\, V(S^*).$$

2. The rectangles due to $P'$ that are the same as the rectangles due to $P$, i.e. those untouched by the extra point, give a set of indices we call $U$. Let the infimum values here be called $m_i'$ to distinguish them from the infimum values $m_j$ due to $P$, and let the rectangles be called $S_i'$ for the same reason. Then we have
$$\sum_{i \in U} m_i'(f)\, V(S_i') = \sum_{j \in I} m_j(f)\, V(S_j)$$

3. The remaining indices correspond to the pairs of rectangles that come from the splitting due to the addition of the extra point for all rectangles from $P$ except $S^*$. Call these indices $V$. For each rectangle $S_j$ from the index set $J$ of $P$, we get two new rectangles: $S_j = H_{j,1} \cup H_{j,2}$ and $V(S_j) = V(H_{j,1}) + V(H_{j,2})$. Also,
$$m_{H_{j,i}}(f) \ge m_{S_j}(f)$$
and so
$$\sum_{j \in J} \left( m_{H_{j,1}}(f)\, V(H_{j,1}) + m_{H_{j,2}}(f)\, V(H_{j,2}) \right) \ge \sum_{j \in J} m_{S_j}(f) \left( V(H_{j,1}) + V(H_{j,2}) \right) = \sum_{j \in J} m_j(f)\, V(S_j).$$

So the lower sums are
$$L(f,P,R) = \sum_{j \in P} m_j(f)\, V(S_j) = m_{j^*}(f)\, V(S^*) + \sum_{j \in I} m_j(f)\, V(S_j) + \sum_{j \in J} m_j(f)\, V(S_j)$$
$$L(f,P',R) = \sum_{j \in P'} m_j'(f)\, V(S_j') \ge m_{j^*}(f)\, V(S^*) + \sum_{j \in I} m_j(f)\, V(S_j) + \sum_{j \in J} m_j(f)\, V(S_j) = L(f,P,R)$$

This shows the result is true for the addition of one extra point into one rectangle due to P .

The next step is an induction argument on the number of points added to one rectangle of $P$. Assume we have added $k$ points to the rectangle $S^*$ and that the result holds. Now add one more point to $S^*$. The partition we get by adding $k$ points to $S^*$ is what we call the partition $Q$, and the partition we get by adding the $(k+1)^{st}$ point is the partition $Q'$. The argument above still works and so $L(f,Q',R) \ge L(f,Q,R)$.

Now we use induction again, but this time in the case where we add a finite number of points to more than one rectangle. We assume we have added a finite number of points to $k$ rectangles of $P$. Now add a finite number of points to a $(k+1)^{st}$ rectangle. Call the partition we get from adding a finite number of points to $k$ rectangles $S$, and the new partition we get by adding a finite number of points to the $(k+1)^{st}$ rectangle $S'$. Then the argument from the previous step works and we have $L(f,S',R) \ge L(f,S,R)$.

Of course, a very similar argument works for the other case and we prove U(f, P ′, R) ≤ U(f, P,R).


There are many consequences of this result and it allows us to define Darboux Integration.

Theorem 15.1.3 Lower Darboux Sums are always less than Upper Darboux Sums

Let $P$ and $Q$ be partitions of $A$ for any choice of bounding rectangle $R$. Then $L(f,P) \le U(f,Q)$.

Proof 15.1.3
Let $P \vee Q$ be the common refinement of $P$ and $Q$. Then, we have
$$L(f,P) \le L(f,P \vee Q) \le U(f,P \vee Q) \le U(f,Q)$$

Theorem 15.1.3 immediately implies that for any fixed partition $Q$,
$$\sup_P L(f,P) \le U(f,Q)$$
Further, since $\sup_P L(f,P)$ is a lower bound for all the $U(f,Q)$ no matter what $Q$ is, we see
$$\sup_P L(f,P) \le \inf_Q U(f,Q)$$

This leads to the lower and upper Darboux integrals.

Definition 15.1.5 The Lower and Upper Darboux Integral

For a bounded $f : A \subset \Re^n \rightarrow \Re$ where $A$ is bounded, the Lower Darboux Integral of $f$ over $A$ is
$$\underline{DI}(f,A) = \sup_P L(f,P)$$
and the Upper Darboux Integral of $f$ over $A$ is
$$\overline{DI}(f,A) = \inf_P U(f,P)$$

Comment 15.1.5 From the definitions, it is clear $\underline{DI}(f,A) \le \overline{DI}(f,A)$.

When the lower and upper Darboux integral match, we obtain the Darboux integral.

Definition 15.1.6 The Darboux Integral

We say the bounded function $f : A \subset \Re^n \rightarrow \Re$, where $A$ is bounded, is Darboux Integrable on $A$ if $\underline{DI}(f,A) = \overline{DI}(f,A)$. We denote this common value by $DI(f,A)$.

We eventually can tie this idea to our usual Riemann Integral, but first we need to establish some results.

Theorem 15.1.4 The Riemann Criterion For Darboux Integrability

$f$ is Darboux Integrable on $A$ if and only if for all $\varepsilon > 0$, there is a partition $P_0$ so that $U(f,P) - L(f,P) < \varepsilon$ for all refinements $P$ of $P_0$.


Proof 15.1.4
(⇒): Note the choice of bounding rectangle is not important here, so we just assume there is one in the background used to determine the partitions. Assume $f$ is Darboux Integrable on $A$. Then $\underline{DI}(f,A) = \overline{DI}(f,A)$. Using the Infimum and Supremum Tolerance Lemmas, given $\varepsilon > 0$, there are partitions $P_\varepsilon$ and $Q_\varepsilon$ so that
$$\overline{DI}(f,A) \le U(f,Q_\varepsilon) < \overline{DI}(f,A) + \varepsilon/2, \quad \underline{DI}(f,A) - \varepsilon/2 < L(f,P_\varepsilon) \le \underline{DI}(f,A)$$
Let $P_0 = P_\varepsilon \vee Q_\varepsilon$. Then,
$$U(f,P_0) - L(f,P_0) \le U(f,Q_\varepsilon) - L(f,P_\varepsilon) < \overline{DI}(f,A) + \varepsilon/2 - \underline{DI}(f,A) + \varepsilon/2$$
But $f$ is Darboux Integrable on $A$ and so $\overline{DI}(f,A) - \underline{DI}(f,A) = 0$. Thus, $U(f,P_0) - L(f,P_0) < \varepsilon$. Now if $P$ is a refinement of $P_0$,
$$U(f,P) - L(f,P) \le U(f,P_0) - L(f,P_0) < \varepsilon$$
This completes the forward direction.

(⇐): Assume for all $\varepsilon > 0$, there is a partition $P_0$ of $A$ so that $U(f,P) - L(f,P) < \varepsilon$ for all refinements $P$ of $P_0$. Let $\varepsilon > 0$ be given. Then there is a partition $P_0^\varepsilon$ so that $U(f,P_0^\varepsilon) < L(f,P_0^\varepsilon) + \varepsilon$. Hence,
$$\overline{DI}(f,A) = \inf_P U(f,P) \le U(f,P_0^\varepsilon) < L(f,P_0^\varepsilon) + \varepsilon$$
But then
$$\overline{DI}(f,A) - \varepsilon < L(f,P_0^\varepsilon) \le \underline{DI}(f,A)$$
This implies $0 \le \overline{DI}(f,A) - \underline{DI}(f,A) < \varepsilon$. Since $\varepsilon > 0$ is arbitrary, we have $\underline{DI}(f,A) = \overline{DI}(f,A)$. Hence $f$ is Darboux Integrable on $A$.

Homework

Exercise 15.1.7 Let $A$ be the circle of radius 1.5 centered at $(2,6)$. For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$ and $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$, $\pi = P \times Q$ is a partition of a bounding rectangle containing this circle. Let $f(x,y) = 3x^2 + 2y^2$. Add the point 1.5 to $P$ to create $P'$, giving the refinement $\pi' = P' \times Q$. Find the lower and upper sums associated with this partition. Draw the bounding box, the partition and the circle, and shade all the rectangles $S_i$ that are outside the circle one color, the rectangles inside the circle another color and the rectangles that intersect the boundary of the circle another color. What effect did adding the extra point have here?

Exercise 15.1.8 Let $A$ be the circle of radius 1.5 centered at $(2,6)$. For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$ and $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$, $\pi = P \times Q$ is a partition of a bounding rectangle containing this circle. Let $f(x,y) = 3x^2 + 2y^2$. Add the point 1.5 to $P$ to create $P'$ and add the point 5.8 to $Q$ to create $Q'$, giving the refinement $\pi' = P' \times Q'$. Find the lower and upper sums associated with both the $\pi$ and $\pi'$ partitions. Draw the bounding box, the partition and the circle, and shade all the rectangles $S_i$ that are outside the circle one color, the rectangles inside the circle another color and the rectangles that intersect the boundary of the circle another color, for both partitions. What effect did adding the extra two points have here?


Exercise 15.1.9 Let $f(x,y,z) = c$ for some constant $c$. What are the lower and upper sums of $f$ for any set $A$ and partition $P$ of a rectangle $R$ containing $A$?

Exercise 15.1.10 Let $A$ be the sphere of radius 1.5 centered at $(2,6,0.5)$. For $P = \{-1, -0.7, 0, 0.5, 1.1, 1.8, 2.1, 2.7, 3.0\}$, $Q = \{4, 4.3, 5.1, 5.6, 6.1, 6.8, 7.3, 8\}$ and $S = \{-2, -1.5, -1, 0, 0.6, 1.7, 2\}$, $\pi = P \times Q \times S$ is a partition of a bounding cube containing this sphere. Let $f(x,y,z) = 3x^2 + 2y^2 + 4z^2$. Add the point 1.6 to $P$ to create $P'$, add the point 5.9 to $Q$ to create $Q'$ and the point 0.3 to $S$ to create $S'$, giving the refinement $\pi' = P' \times Q' \times S'$. Find the lower and upper sums associated with both the $\pi$ and $\pi'$ partitions. Draw the bounding cube, the partition and the sphere, and shade all the boxes $S_i$ that are outside the sphere one color, the boxes inside the sphere another color and the boxes that intersect the boundary of the sphere another color, for both partitions. What effect did adding the extra three points have here?

Next, we prove an approximation result. But first a definition.

Definition 15.1.7 The Norm of a Partition

For the partition $P$ of $A$, $P$ determines $N$ rectangles of the form $S_k = \prod_{i=1}^{n} [a_i^k, b_i^k]$. For each rectangle, we can compute $d_k = \max_{1 \le i \le n} |b_i^k - a_i^k|$. The norm of $P$ is
$$\|P\| = \max_k \left( \max_{1 \le i \le n} |b_i^k - a_i^k| \right) = \max_k (d_k)$$
Note, for a given rectangle $S_k$,
$$V(S_k) = \prod_{1 \le i \le n} |b_i^k - a_i^k| \le \prod_{1 \le i \le n} d_k = d_k^n \le (\max_k d_k)^n = \|P\|^n$$

Comment 15.1.6 Suppose you were in 2D and you fixed the partition on the $x_2$ axis but let the partitioning of the $x_1$ axis get finer and finer. Then we would have $V(S) \rightarrow 0$ for every rectangle $S$ even though $\|P\|$ would not go to zero, as the maximum axis distance is determined by the fixed partition of the $x_2$ axis.
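A small numerical sketch of this phenomenon (the partitions are illustrative choices):

P2 = [0 0.5 1.0];                 % fixed x2 partition: max edge stays 0.5
for N = [10 100 1000]
  P1 = linspace(0,1,N+1);         % finer and finer x1 partition
  normP = max(max(diff(P1)), max(diff(P2)));   % ||P|| = largest edge length
  maxV  = max(diff(P1))*max(diff(P2));         % largest rectangle area V(S)
  fprintf('N = %4d: ||P|| = %g, max V(S) = %g\n', N, normP, maxV);
end
% ||P|| stays at 0.5 while the rectangle areas go to zero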

Theorem 15.1.5 Approximation of the Darboux Integral

If $f$ is Darboux Integrable on $A$, then given a sequence of partitions $(P_n)$ with $\|P_n\| \rightarrow 0$, we have
$$U(f,P_n) \downarrow DI(f,A), \quad L(f,P_n) \uparrow DI(f,A)$$

Proof 15.1.5
We will only do the upper sum case and leave the other one to you. Let $\varepsilon > 0$ be given. Then there is a partition $P_\varepsilon$ so that
$$DI(f,A) \le U(f,P_\varepsilon) < DI(f,A) + \varepsilon/2$$
Let $Q_n = P_n \vee P_\varepsilon$. $P_\varepsilon$ determines rectangles $S_k$ of the form $\prod_{i=1}^{n} [a_i^k, b_i^k]$. Let
$$e_k = \min_{1 \le i \le n} (b_i^k - a_i^k), \quad e = \min_k (e_k)$$


For the tolerance $\xi = e/2$, there is an $N$ so that if $n > N$, $\|P_n\| < e/2$. Pick any $P_n$ with $n > N$. The points of $P_\varepsilon$ that lie in a rectangle $S$ determined by $P_\varepsilon$ are its corner points. If the rectangle $S$ is given by $\prod [a_i^k, b_i^k]$, then the minimum separation between two corner points of $S$ is the minimum edge length: i.e. $\min_{1 \le i \le n} (b_i^k - a_i^k) = e_k \ge e$. For a rectangle $T$ given by $P_n$ for $n > N$, if it is given by $\prod [c_i, d_i]$, the maximal separation along any axis between two points in $T$ is $\max_{1 \le i \le n} (d_i - c_i) \le \|P_n\| < e/2$. So if $T$ contained two points $x$ and $y$ of $P_\varepsilon$ from some rectangle $S$ of $P_\varepsilon$, letting $d(x,y)$ denote their separation along the axis where they differ most, we would have
$$d(x,y) < e/2 \quad \text{and} \quad d(x,y) \ge e$$
This is not possible. Hence, for $n > N$, the rectangles determined by $P_n$ contain at most one point of $P_\varepsilon$.

For a rectangle $S$ of $P_n$ containing one point of $P_\varepsilon$, this single point divides $S$ into $2^n$ pieces $T_k$. Then in $U(f,P_n) - U(f,P_n \vee P_\varepsilon)$ we have a term of the form
$$\left( \sup_{x \in S} f(x) \right) V(S) - \sum_{k=1}^{2^n} \left( \sup_{x \in T_k} f(x) \right) V(T_k)$$
We can overestimate this term as
$$\left( \sup_{x \in S} f(x) \right) V(S) - \sum_{k=1}^{2^n} \left( \sup_{x \in T_k} f(x) \right) V(T_k) \le \left| \sup_{x \in S} f(x) \right| V(S) + \sum_{k=1}^{2^n} \left| \sup_{x \in T_k} f(x) \right| V(T_k)$$
Let $B = \sup_{x \in A} |f(x)|$. Each of the suprema above is bounded in absolute value by $B$, so for this rectangle,
$$\left( \sup_{x \in S} f(x) \right) V(S) - \sum_{k=1}^{2^n} \left( \sup_{x \in T_k} f(x) \right) V(T_k) \le B\, V(S) + B \sum_{k=1}^{2^n} V(T_k) = 2B\, V(S)$$

We can do this for each point of $P_\varepsilon$ and its associated rectangle $S$ from $P_n$. Let $M$ be the number of points in $P_\varepsilon$. Overestimate each $V(S)$ by
$$V(S) \le \max_{S \text{ determined by } P_n} V(S)$$
Now consider $U(f,P_n) - U(f,P_n \vee P_\varepsilon)$. The rectangles from $P_n$ that do not contain a point of $P_\varepsilon$ occur in both sums, so these contributions cancel. What is left is the sum over the rectangles that contain points of $P_\varepsilon$, which we have overestimated in the calculations above. Thus
$$U(f,P_n) - U(f,P_n \vee P_\varepsilon) \le 2BM \max_{S \text{ determined by } P_n} V(S)$$
We assume $\|P_n\| \rightarrow 0$, so we also know $\max_{S \text{ determined by } P_n} V(S) \rightarrow 0$. Hence, there is an $\tilde{N}$ so that
$$\max_{S \text{ determined by } P_n} V(S) \le \frac{\varepsilon}{4M(B+1)}$$

We conclude if $n > \max(N, \tilde{N})$, then $U(f,P_n) - U(f,P_n \vee P_\varepsilon) \le 2BM \frac{\varepsilon}{4M(B+1)} \le \varepsilon/2$. We also know
$$DI(f,A) \le U(f,P_n \vee P_\varepsilon) \le U(f,P_\varepsilon) < DI(f,A) + \varepsilon/2$$


and so
$$U(f,P_n) - DI(f,A) = U(f,P_n) - U(f,P_n \vee P_\varepsilon) + U(f,P_n \vee P_\varepsilon) - DI(f,A) < \varepsilon/2 + \varepsilon/2 = \varepsilon$$
if $n > \max(N, \tilde{N})$. This shows $U(f,P_n) \downarrow DI(f,A)$. A similar argument shows $L(f,P_n) \uparrow DI(f,A)$.

15.1.1 Homework

Exercise 15.1.11

Exercise 15.1.12

Exercise 15.1.13

Exercise 15.1.14

Exercise 15.1.15

15.2 The Riemann Integral in n Dimensions

There is another way to develop integration in $\Re^n$. We have gone through this carefully in $\Re$ in (Peterson (8) 2019) and what we do in $\Re^n$ is quite similar.

Let $f : A \subset \Re^n \rightarrow \Re$ be a bounded function on the bounded set $A$. Let $R$ be any bounding rectangle that contains $A$ and let $P$ be any partition of $A$ based on $R$. As before, this choice of bounding rectangle is immaterial and we simply choose one that is useful. Let $S_1, \ldots, S_N$ be the rectangles determined by $P$. An in-between or evaluation set for $P$ is any collection of points $\{z_1, \ldots, z_N\}$ where $z_i \in S_i$ for each rectangle $S_i$ determined by $P$. Such an evaluation set is denoted by $\sigma$.

Definition 15.2.1 The Riemann Sum

The Riemann sum of $f$ over $A$ for partition $P$ and evaluation set $\sigma$ is denoted by $S(f,\sigma,P)$ where
$$S(f,\sigma,P) = \sum_{i=1}^{N} f(z_i)\, V(S_i)$$
where $S_1, \ldots, S_N$ are the rectangles determined by $P$ and $z_i \in \sigma$. For convenience, we typically simply write $S(f,\sigma,P) = \sum_{S_i \in P} f(z_i)\, V(S_i)$. It is clear we have
$$L(f,P) \le S(f,\sigma,P) \le U(f,P)$$
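A minimal MATLAB sketch of a two dimensional Riemann sum, using the midpoints of the rectangles as the evaluation set (the integrand and partitions are illustrative choices):

f  = @(x,y) x.^2 + y.^2;          % the integrand
P1 = linspace(0,1,41);            % axis 1 partition
P2 = linspace(0,2,81);            % axis 2 partition
dx = diff(P1); dy = diff(P2);
z1 = P1(1:end-1) + dx/2;          % midpoint coordinates on axis 1
z2 = P2(1:end-1) + dy/2;          % midpoint coordinates on axis 2
[Z1,Z2] = meshgrid(z1,z2);        % the evaluation points z_i
[DX,DY] = meshgrid(dx,dy);
S = sum(sum(f(Z1,Z2).*DX.*DY));   % S(f,sigma,P) = sum of f(z_i) V(S_i)
% the exact integral over [0,1] x [0,2] is 2/3 + 8/3 = 10/3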

The Riemann Integral is then defined like this:

Definition 15.2.2 The Riemann Integral


We say $f$ is Riemann Integrable over $A$ if there is a real number $I$ so that for all $\varepsilon > 0$, there is a partition $P_\varepsilon$ so that
$$|S(f,\sigma,P) - I| < \varepsilon$$
for all partitions $P$ that refine $P_\varepsilon$ and any evaluation set $\sigma$ for $P$. We typically use a relaxed notation for this and say $\sigma \subset P$ for short. We denote the number $I$ by the symbol $RI(f,A)$.

There is a wonderful connection between the Riemann Integral and the Darboux Integral of $f$ over $A$.

Theorem 15.2.1 The Equivalence of the Riemann and Darboux Integral

The following are true statements:

1. $f$ is Riemann Integrable on $A$ implies $f$ is Darboux Integrable on $A$ and $RI(f,A) = DI(f,A)$.

2. $f$ is Darboux Integrable on $A$ implies $f$ is Riemann Integrable on $A$ and $DI(f,A) = RI(f,A)$.

Proof 15.2.1
1: We assume $f$ is Riemann Integrable on $A$. Then there is a number $I$ so that given $\varepsilon > 0$, there is a partition $P_\varepsilon$ with
$$I - \varepsilon/6 < S(f,\sigma,P) < I + \varepsilon/6$$
for all refinements $P$ of $P_\varepsilon$ and $\sigma \subset P$. Let $R$ denote our arbitrary choice of bounding rectangle for $A$. Then given any rectangle $S$ determined by $P_\varepsilon$, the Infimum and Supremum Tolerance Lemmas tell us there are points $y_S$ and $z_S$ in $S$ so that
$$m_S(f) \le f(y_S) < m_S(f) + \frac{\varepsilon}{6V(R)}, \quad M_S(f) - \frac{\varepsilon}{6V(R)} < f(z_S) \le M_S(f)$$
where $m_S$ and $M_S$ are the usual infimum and supremum of $f$ over $S$. Then
$$L(f,P_\varepsilon) \le \sum_{S \in P_\varepsilon} f(y_S)\, V(S) < \sum_{S \in P_\varepsilon} \left( m_S(f) + \frac{\varepsilon}{6V(R)} \right) V(S)$$
or
$$L(f,P_\varepsilon) \le \sum_{S \in P_\varepsilon} f(y_S)\, V(S) < L(f,P_\varepsilon) + \frac{\varepsilon}{6V(R)} \sum_{S \in P_\varepsilon} V(S) = L(f,P_\varepsilon) + \frac{\varepsilon}{6V(R)}\, V(R) = L(f,P_\varepsilon) + \frac{\varepsilon}{6}$$
Let $\sigma_{Lower} = \{y_S \,|\, S \in P_\varepsilon\}$. Then we have
$$L(f,P_\varepsilon) \le S(f,\sigma_{Lower},P_\varepsilon) < L(f,P_\varepsilon) + \frac{\varepsilon}{6}$$


In a similar way, for $\sigma_{Upper} = \{z_S \,|\, S \in P_\varepsilon\}$, we find
$$U(f,P_\varepsilon) - \frac{\varepsilon}{6} < S(f,\sigma_{Upper},P_\varepsilon) \le U(f,P_\varepsilon)$$
Hence,
$$U(f,P_\varepsilon) - L(f,P_\varepsilon) < S(f,\sigma_{Upper},P_\varepsilon) - S(f,\sigma_{Lower},P_\varepsilon) + \varepsilon/6 + \varepsilon/6$$
But we know how close the Riemann sums are to $I$. We find
$$S(f,\sigma_{Upper},P_\varepsilon) - S(f,\sigma_{Lower},P_\varepsilon) < I + \varepsilon/6 - I + \varepsilon/6 = \varepsilon/3$$
Combining, we conclude
$$U(f,P_\varepsilon) - L(f,P_\varepsilon) < 2\varepsilon/3 < \varepsilon$$
The same argument holds for any refinement of $P_\varepsilon$. Hence $f$ satisfies the Riemann Criterion and so $f$ is Darboux Integrable on $A$.

It remains to show $DI(f,A) = RI(f,A)$. Here are the details. Let $\varepsilon > 0$ be given. From the definitions of $\underline{DI}(f,A)$ and $\overline{DI}(f,A)$, there are partitions $U_\varepsilon$ and $V_\varepsilon$ so that
$$U(f,U_\varepsilon) < \overline{DI}(f,A) + \varepsilon/4 = DI(f,A) + \varepsilon/4$$
$$L(f,V_\varepsilon) > \underline{DI}(f,A) - \varepsilon/4 = DI(f,A) - \varepsilon/4$$
Thus for the refinement $U_\varepsilon \vee V_\varepsilon$, we have
$$DI(f,A) - \varepsilon/4 < L(f,V_\varepsilon) \le L(f,U_\varepsilon \vee V_\varepsilon) \le U(f,U_\varepsilon \vee V_\varepsilon) \le U(f,U_\varepsilon) < DI(f,A) + \varepsilon/4$$
Also, since $f$ is Riemann Integrable, there is a partition $P_\varepsilon$ so that
$$|S(f,\sigma,P) - RI(f,A)| < \varepsilon/4$$
for all refinements $P$ of $P_\varepsilon$ and any $\sigma \subset P$. Let $Q_\varepsilon = U_\varepsilon \vee V_\varepsilon \vee P_\varepsilon$. Then $|S(f,\sigma,Q_\varepsilon) - RI(f,A)| < \varepsilon/4$ and
$$DI(f,A) - \varepsilon/4 < L(f,U_\varepsilon \vee V_\varepsilon) \le L(f,Q_\varepsilon) \le S(f,\sigma,Q_\varepsilon) \le U(f,Q_\varepsilon) \le U(f,U_\varepsilon \vee V_\varepsilon) < DI(f,A) + \varepsilon/4$$
So
$$|RI(f,A) - DI(f,A)| \le |RI(f,A) - S(f,\sigma,Q_\varepsilon)| + |S(f,\sigma,Q_\varepsilon) - DI(f,A)| < \varepsilon/4 + \varepsilon/4 < \varepsilon$$
Since $\varepsilon > 0$ is arbitrary, we see $RI(f,A) = DI(f,A)$.

2: We now assume $f$ is Darboux Integrable on $A$. Then, just as we argued in the last part of the proof of 1, we see there is a partition $U_\varepsilon \vee V_\varepsilon$ so that
$$DI(f,A) - \varepsilon < L(f,U_\varepsilon \vee V_\varepsilon) \le L(f,Q) \le S(f,\sigma,Q) \le U(f,Q) \le U(f,U_\varepsilon \vee V_\varepsilon) < DI(f,A) + \varepsilon$$


for any refinement $Q$ of $U_\varepsilon \vee V_\varepsilon$ and $\sigma \subset Q$. We conclude $|S(f,\sigma,Q) - DI(f,A)| < \varepsilon$ for any refinement $Q$ of $U_\varepsilon \vee V_\varepsilon$ and $\sigma \subset Q$. This says $f$ is Riemann Integrable on $A$ with value $RI(f,A) = DI(f,A)$.

Next, we prove the approximation result for Riemann Integration.

Theorem 15.2.2 Approximation of the Riemann Integral

If $f$ is Riemann Integrable on $A$, then given a sequence of partitions $(P_n)$ with $\|P_n\| \rightarrow 0$, we have $S(f,\sigma_n,P_n) \rightarrow RI(f,A)$ for any sequence of evaluation sets $\sigma_n \subset P_n$.

Proof 15.2.2
We know $L(f,P_n) \le S(f,\sigma_n,P_n) \le U(f,P_n)$. Since $f$ is also Darboux Integrable by the equivalence theorem, we know $DI(f,A) = RI(f,A)$ and $U(f,P_n) \downarrow DI(f,A)$ and $L(f,P_n) \uparrow DI(f,A)$. So $S(f,\sigma_n,P_n) \rightarrow DI(f,A) = RI(f,A)$.
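We can watch this convergence numerically. A small sketch for $f(x,y) = xy$ on $[0,1] \times [0,1]$, whose Riemann integral is $1/4$, using uniform partitions and the lower left corners as the evaluation sets:

f = @(x,y) x.*y;
for N = [10 100 1000]
  h = 1/N;                        % uniform partition, so ||P_n|| = h
  [Z1,Z2] = meshgrid(0:h:1-h);    % lower left corner evaluation set sigma_n
  S = sum(sum(f(Z1,Z2)))*h^2;     % S(f,sigma_n,P_n)
  fprintf('N = %4d: S = %.6f\n', N, S);
end
% the sums approach RI(f,A) = 1/4 as ||P_n|| -> 0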

Example 15.2.1

Example 15.2.2

15.2.1 Homework

Exercise 15.2.1

Exercise 15.2.2

Exercise 15.2.3

Exercise 15.2.4

Exercise 15.2.5

15.3 Volume Zero and Measure Zero

Now that we have moved explicitly into $\Re^n$, you should see that the ideas of length, area, volume and so forth seem hazily defined. In (Peterson (8) 2019), one of the projects is the construction of Cantor type sets, and the size of these sets can only be approached abstractly through the content of a set. We also developed a version of Lebesgue's Theorem which tells us a function is Riemann Integrable if and only if its set of discontinuities has content zero. Now we want to extend this discussion to more dimensions. Let's go back to the beginning first.

15.3.1 Measure Zero

What is the length of a single point? Note given any positive integer $n$, $x \in B(1/n,x) = [x - 1/n, x + 1/n]$ and this interval has length $2/n$. Since $n$ is arbitrary, it seems reasonable to define the length of $\{x\}$ to be
$$\text{length } \{x\} = \lim_{n \rightarrow \infty} 2/n = 0$$

What if we had a finite number of distinct points $\{x_1, \ldots, x_p\}$? Since the points are distinct, there is an $N_0$ so that
$$x_1 \in [x_1 - 1/n, x_1 + 1/n], \; x_2 \in [x_2 - 1/n, x_2 + 1/n], \; \ldots,


\; x_p \in [x_p - 1/n, x_p + 1/n]$$
with all these intervals disjoint when $n > N_0$. Hence,
$$\text{length } \{x_1, \ldots, x_p\} \le 2/n + \ldots + 2/n = 2p/n$$
and as $n \rightarrow \infty$, we have length $\{x_1, \ldots, x_p\} = 0$. Let's try to make this more precise for sets in $\Re$ and for sets in $\Re^n$ also. Also, since this notion of length is not quite a normal length, let's start using the term measure instead.

Definition 15.3.1 Sets of Measure Zero

A set $A \subset \Re^n$ is said to have measure zero if for all $\varepsilon > 0$, there is a collection of rectangles $(S_i)$, finite or countably infinite, so that $A \subset \cup_i S_i$ with $\sum_i V(S_i) < \varepsilon$.

Using this definition, we see measure $\{x\} = 0$ for any $x \in \Re$, and measure $\{x_1, \ldots, x_p\} = 0$ for any finite number of points in $\Re$. Here are some more examples.

Example 15.3.1 $Q$ is countable, so it can be enumerated as $Q = (r_i)_{1 \le i < \infty}$. Then, $r_i \in [r_i - \varepsilon/2^{i+2}, r_i + \varepsilon/2^{i+2}]$ and $Q \subset \cup_i [r_i - \varepsilon/2^{i+2}, r_i + \varepsilon/2^{i+2}]$ with $\sum_{i=1}^{\infty} \varepsilon/2^{i+1} = \varepsilon/2 < \varepsilon$. Hence measure $Q = 0$.

Example 15.3.2 The line $y = mx + b$ has measure zero in $\Re^2$. We will show this a bit indirectly. First consider the segment of the line for $0 \le x \le 1$. Divide $[0,1]$ into $n$ pieces. Pick $n+1$ boxes as shown in Figure 15.4(a) so that they overlap. The distance between $(i/n, m(i/n) + b)$ and $((i+1)/n, m((i+1)/n) + b)$ is
$$d\big( (i/n, m(i/n)+b),\, ((i+1)/n, m((i+1)/n)+b) \big) = \sqrt{(1/n)^2 + (m/n)^2} = \sqrt{(m^2+1)/n^2}$$
So if we use boxes as shown in Figure 15.4(b), the boxes overlap. We see the line segment of $y = mx + b$ for $0 \le x \le 1$ is contained in $\cup_{i=0}^{n} \text{Box}_i$ and
$$\sum_{i=0}^{n} V(\text{Box}_i) = \sum_{i=0}^{n} \frac{4(1+m^2)}{n^2} = \frac{n+1}{n^2}\, 4(1+m^2)$$

(a) Covering Boxes for $y = mx + b$, $0 \le x \le 1$ (b) Overlapping Covering Boxes for $y = mx + b$, $0 \le x \le 1$

Figure 15.4: Showing the line $y = mx + b$ has measure zero in $\Re^2$

This goes to zero as $n \rightarrow \infty$. So the line segment of $y = mx + b$ for $0 \le x \le 1$ has measure zero.


We can do this for $y = mx + b$ on $1 \le x \le 2$, on $2 \le x \le 3$ and so on, and each of these line segments has measure zero. Let $E_p$ denote the line segment for $y = mx + b$, $p-1 \le x \le p$. Then
$$(\text{the line segment for } y = mx + b,\; x \ge 0) = \cup_{p=1}^{\infty} E_p$$
Since each $E_p$ has measure zero, given $\varepsilon > 0$, there is a collection of rectangles $\{S_{i,p}\}$ so that
$$E_p \subset \cup_i S_{i,p}, \quad \sum_i V(S_{i,p}) < \varepsilon/2^p$$
Hence,
$$(\text{the line segment for } y = mx + b,\; x \ge 0) \subset \cup_p \cup_i S_{i,p}, \quad \sum_p \sum_i V(S_{i,p}) < \sum_p \varepsilon/2^p = \varepsilon$$
This shows measure(line segment for $y = mx + b$, $x \ge 0$) $= 0$. A similar argument shows measure(line segment for $y = mx + b$, $x \le 0$) $= 0$ also. From this, it follows that measure($y = mx + b$) $= 0$.

The examples above indicate we can prove the following propositions.

Theorem 15.3.1 Finite Unions of Sets of Measure Zero Also Have Measure Zero

If $\{A_1, \ldots, A_p\}$ is a finite collection of sets of measure zero, then $\cup_{i=1}^{p} A_i$ has measure zero also.

Proof 15.3.1
This is left to you.

Theorem 15.3.2 Countable Unions of Sets of Measure Zero are Also Measure Zero

If $\{A_i \,|\, i \ge 1\}$ is a countably infinite collection of sets of measure zero, then $\cup_{i=1}^{\infty} A_i$ has measure zero also.

Proof 15.3.2
Again, this is left for you.

It is time for a harder example.

Example 15.3.3 Consider the surface $z = x^2 + y^2$ in $\Re^3$. That is, $f(x,y) = x^2 + y^2$. We will show this is a set of measure zero in $\Re^3$. Let's consider only a finite piece of it first. The projection to the $x$-$y$ plane of the part of the surface with $z \le 1$ is the disc $\{(x,y) \,|\, x^2 + y^2 \le 1\}$, so the square $[-1,1] \times [-1,1]$ contains this projection. Divide the square into uniform pieces; these induce a corresponding subdivision of the disc. Working on $[0,1] \times [0,1]$ divided into $n^2$ pieces of area $1/n^2$, the corners of a patch $P_{i,j}$ in the $x$-$y$ plane are
$$P_{i,j} = \begin{pmatrix} (i/n, (j+1)/n) & \cdots & ((i+1)/n, (j+1)/n) \\ \vdots & & \vdots \\ (i/n, j/n) & \cdots & ((i+1)/n, j/n) \end{pmatrix}$$


$$= \begin{pmatrix} (x_i, y_{j+1}) & \cdots & (x_{i+1}, y_{j+1}) \\ \vdots & & \vdots \\ (x_i, y_j) & \cdots & (x_{i+1}, y_j) \end{pmatrix}$$
and this square is mapped to a patch with corners
$$f(P_{i,j}) = \begin{pmatrix} z_{i,j+1} & \cdots & z_{i+1,j+1} \\ \vdots & & \vdots \\ z_{i,j} & \cdots & z_{i+1,j} \end{pmatrix}$$
where $z_{pq} = x_p^2 + y_q^2$. Look at the square $[0,1] \times [0,1]$. Now $z = x^2 + y^2$ is increasing in each variable there, so over a patch the maximum occurs at the largest $i$ and $j$ corner and the minimum at the smallest. So on the square $P_{i,j}$, the maximum and minimum surface values are
$$z_{M,i,j} = x_{i+1}^2 + y_{j+1}^2 = \left( \frac{i+1}{n} \right)^2 + \left( \frac{j+1}{n} \right)^2, \quad z_{m,i,j} = x_i^2 + y_j^2 = \left( \frac{i}{n} \right)^2 + \left( \frac{j}{n} \right)^2$$
Thus,
$$z_{M,i,j} - z_{m,i,j} = \frac{2i+1}{n^2} + \frac{2j+1}{n^2}$$
This patch of surface is contained in the box $B_{i,j} = P_{i,j} \times [z_{m,i,j} - 1/n, z_{M,i,j} + 1/n]$ which has volume
$$V(B_{i,j}) = \frac{1}{n^2} \left( z_{M,i,j} - z_{m,i,j} + \frac{2}{n} \right) = \frac{2}{n^3} + \frac{2i + 2j + 2}{n^4}$$
The surface above $[0,1] \times [0,1]$ is contained in the union of the boxes $B_{i,j}$ and, using $\sum_{i=1}^{n} i = n(n+1)/2$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} V(B_{i,j}) = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{2}{n^3} + \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{2i + 2j + 2}{n^4} = \frac{2}{n} + \frac{2(n+1)}{n^2} + \frac{2}{n^2}$$
We see $\sum_{i=1}^{n} \sum_{j=1}^{n} V(B_{i,j}) \rightarrow 0$ as $n \rightarrow \infty$. Thus, we conclude the measure of the piece of the surface above $[0,1] \times [0,1]$ is 0.

A similar argument shows the measures of the pieces of the surface above $[0,1] \times [-1,0]$, above $[-1,0] \times [0,1]$ and above $[-1,0] \times [-1,0]$ are zero too. Hence the measure of the piece of the surface above $[-1,1] \times [-1,1]$ is zero. Let $\Omega_n$ be the surface above the annular region $\{(x,y) \,|\, n^2 \le x^2 + y^2 \le (n+1)^2\}$; on $\Omega_n$, $n^2 \le z \le (n+1)^2$. Using an argument similar to what we just did, we can show the measure of the surface above this annular region is zero. Note $\Omega_0$ is contained in the surface above $[-1,1] \times [-1,1]$ and so the measure of $\Omega_0$ is zero. The entire surface is the union of all the $\Omega_n$ for $n \ge 0$ and since each of these has measure zero, the entire surface has measure zero in $\Re^3$.


Homework

Exercise 15.3.1 Prove $y = x^2$ has measure zero in $\Re^2$.

Exercise 15.3.2 Prove the surface $z = 2x^2 + 3y^2$ has measure zero in $\Re^3$.

Exercise 15.3.3 Prove $y = 2x + 5$ has measure zero in $\Re^2$.

15.3.2 Volume Zero

Do all sets $A \subset \Re^n$ have associated with them a generalization of the volume of a box? Let's see if we can define a suitable notion. The volume of a rectangle $R = \prod_{i=1}^{n} [a_i, b_i]$ in $\Re^n$ is given by $V(R) = \prod_{i=1}^{n} (b_i - a_i)$. To generalize this idea, let's consider characteristic functions.

Definition 15.3.2 The Characteristic Function of a Set

If $A \subset \Re^n$, the characteristic function of $A$ is denoted by $1_A$ and defined by
$$1_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A \end{cases}$$

Let $A$ be a rectangle in $\Re^n$. Then $A$ is its own bounding rectangle. Let $P$ be any partition of $A$ with associated rectangles $S_i$. Then $\inf_{x \in S_i} 1_A(x) = 1$ and $\sup_{x \in S_i} 1_A(x) = 1$ also. Thus,
$$L(1_A,P) = \sum_{S_i \in P} 1 \cdot V(S_i) = V(A), \quad U(1_A,P) = \sum_{S_i \in P} 1 \cdot V(S_i) = V(A)$$
Thus, $1_A$ is Darboux Integrable, and hence Riemann Integrable, over $A$ with value $V(A)$.

It is time to use a different notation for the value of our integrals.

Definition 15.3.3 The Integration Symbol

If $f : A \subset \Re^n \rightarrow \Re$, for bounded $A$ and bounded $f$, is Riemann Integrable on $A$, the value of the Riemann Integral of $f$ on $A$ will be denoted by $\int_A f$. We sometimes want to remind ourselves explicitly that the integration is done over a subset of $\Re^n$. Since lower and upper sums involve volumes of rectangles in $\Re^n$, we will use the symbol $dV_n$ to indicate this in the integration symbol. Hence, we can also write $RI(f,A) = \int_A f\, dV_n$.

Comment 15.3.1 Our traditional one variable calculus integral would be $\int_a^b f(x)\,dx$ for the Riemann Integral of the bounded function $f$ on $[a,b]$. We could also write this as $\int_{[a,b]} f\, dV_1$ which, of course, is a bit much. Just remember there are many ways to represent things. The new notation is better for an arbitrary number of dimensions $n$ and not so great for $n = 1$.

Using this definition, we see $V(A) = \int_A 1_A$ when $A$ is a rectangle in $\Re^n$. This suggests a way to extend the idea of volume to some sets $A$ of $\Re^n$.

Definition 15.3.4 The Volume of a Set


If $A$ is a bounded set in $\Re^n$ and $1_A$ is integrable, then the volume of $A$ is defined to be $V(A) = \int_A 1_A$. As usual, the function $1_A$ is extended to any bounding rectangle $R$ by
$$\bar{1}_A(x) = \begin{cases} 1_A(x), & x \in A \\ 0, & x \in R \setminus A \end{cases}$$

It is easy to show that if a set has volume zero, it also has measure zero.

Theorem 15.3.3 Volume Zero Implies Measure Zero

If $A \subset \Re^n$ has volume zero, it also has measure zero.

Proof 15.3.3
Since $A$ has volume zero, given $\varepsilon > 0$, there is a partition $P_\varepsilon$ so that $0 \le U(\bar{1}_A, P_\varepsilon) < \varepsilon$ (remember all the lower sums here are zero!). But $U(\bar{1}_A, P_\varepsilon) = \sum_{S_i \in P_\varepsilon,\, S_i \cap A \ne \emptyset} V(S_i)$, and the rectangles $S_i$ which touch $A$ satisfy $A \subset \cup_{S_i \cap A \ne \emptyset} S_i$. So this collection of rectangles covers $A$ with $\sum V(S_i) < \varepsilon$. Since $\varepsilon > 0$ is arbitrary, this shows $A$ has measure zero.

The converse is not true. Let $A = \{(x,y,z) \in [0,1] \times [0,1] \times [0,1] \,|\, x, y, z \in Q\}$. Then
$$1_A(x,y,z) = \begin{cases} 1, & x, y, z \in Q \cap [0,1] \\ 0, & \text{otherwise} \end{cases}$$
Let the bounding rectangle be $R = [0,1] \times [0,1] \times [0,1]$. Then for any partition $P$,
$$L(1_A,P) = \sum_{S_i \in P} 0 \cdot V(S_i) = 0, \quad U(1_A,P) = \sum_{S_i \in P} 1 \cdot V(S_i) = V(R) = 1$$
This is because any $S_i \in P$ contains both rational triples and points which are not rational triples. Thus, $\underline{DI}(1_A,A) = 0$ and $\overline{DI}(1_A,A) = 1$. We conclude $1_A$ is not integrable and so the volume of $A$ is not defined. However, the rational triples here are countable and so $A$ has measure zero.

We are now ready to consider the question of when a function is Riemann Integrable on A.

15.4 When is a Function Riemann Integrable?

The result we need is called Lebesgue's Theorem and we have already proven a version of this in (Peterson (8) 2019). But now we want to look at the situation in $\Re^n$ for $n > 1$ in general. First, a new way to look at continuity.

Definition 15.4.1 The Oscillation of a Function

Let $f : A \subset \Re^n \rightarrow \Re$ where $A$ is an open set. The oscillation of $f$ at $x_0 \in A$ is defined to be
$$\omega(f, x_0) = \inf_{B(r,x_0),\, r > 0} \left( \sup_{x_1, x_2 \in B(r,x_0)} |f(x_1) - f(x_2)| \right)$$

We leave it to you to prove this result.


Theorem 15.4.1 Oscillation is Zero if and only if Continuous

Let $f : A \subset \Re^n \rightarrow \Re$ where $A$ is an open set. Then $\omega(f,x_0) = 0$ if and only if $f$ is continuous at $x_0$.

Proof 15.4.1
This one's for you!

Now on to Lebesgue’s Theorem.

Theorem 15.4.2 Lebesgue’s Theorem

Let $f : A \subset \Re^n \rightarrow \Re$, with $A$ bounded and $f$ bounded on $A$. Then $f$ is Riemann Integrable on $A$ if and only if the set of points where $f$ is not continuous is a set of measure zero.

Proof 15.4.2
(⇐): We assume the set of discontinuities of $f$ has measure zero. Call this set $D$. For each $\varepsilon > 0$, let $D_\varepsilon = \{x \,|\, \omega(f,x) \ge \varepsilon\}$. Then $D = \cup_{\varepsilon > 0} D_\varepsilon$ and each $D_\varepsilon \subset D$ by Theorem 15.4.1. Assume $y$ is a limit point of $D_\varepsilon$. If $y$ is already in $D_\varepsilon$, there is nothing to show. If not, there is a sequence $(y_n) \subset D_\varepsilon$ with $y_n \ne y$ so that $y_n \rightarrow y$. For any $r > 0$, since $y_n \rightarrow y$, there is $N_r$ so that $y_n \in B(r,y)$ for $n > N_r$. Thus, for any such $n$, there is an $s < r$ so that $B(s,y_n) \subset B(r,y)$. But then
$$\sup_{x_1, x_2 \in B(r,y)} |f(x_1) - f(x_2)| \ge \sup_{x_1, x_2 \in B(s,y_n)} |f(x_1) - f(x_2)| \ge \varepsilon$$
because $y_n \in D_\varepsilon$. Since this is true for all $r$, we have
$$\inf_{B(r,y),\, r > 0} \left( \sup_{x_1, x_2 \in B(r,y)} |f(x_1) - f(x_2)| \right) \ge \varepsilon$$
which tells us $\omega(f,y) \ge \varepsilon$, or $y \in D_\varepsilon$. Thus, $D_\varepsilon$ is closed. Since it is inside the bounded set $A$, it is also bounded. We conclude $D_\varepsilon$ is compact.

We know $D$ has measure zero; thus, $D_\varepsilon$ has measure zero as well. Then, there is a collection $\{B_i\}$ of open rectangles (the definition uses closed rectangles, but it is easy to expand them a bit to make them open) so that $D_\varepsilon \subset \cup_i B_i$ and $\sum_i V(\bar{B}_i) < \varepsilon$, where $\bar{B}_i$ is the closure of $B_i$. Since $D_\varepsilon$ is compact, the open cover $(B_i)$ has a finite subcover $B_{i_1}, \ldots, B_{i_p}$, which we relabel as $B_1, \ldots, B_N$. Hence, $D_\varepsilon \subset \cup_{i=1}^{N} B_i$ and $\sum_{i=1}^{N} V(\bar{B}_i) < \varepsilon$.

The collection $B_1, \ldots, B_N$ of possibly overlapping rectangles determines a partition $P_0$ of the bounding rectangle we choose for $A$. Pick a partition $P$ which refines $P_0$. The new rectangles we form will either be inside the closed rectangles $\bar{B}_j$ or have interiors completely outside them. Thus, the rectangles $S_i$ determined by $P$ can be divided into two pieces with index sets
$$I = \{i \,|\, \exists\, j \ni S_i \subset \bar{B}_j\}, \quad J = \{i \,|\, \text{Int}(S_i) \subset (\cup_{j=1}^{N} \bar{B}_j)^C\}$$


Note, for each $j \in J$, $S_j$ is disjoint from $D_\varepsilon$. Thus, if $z \in S_j$ for $j \in J$, $\omega(f,z) < \varepsilon$. Since $\omega(f,z)$ is defined using an infimum, there is $r_z > 0$ so that
$$\omega(f,z) \le \sup_{x_1, x_2 \in B(r_z,z)} |f(x_1) - f(x_2)| < \frac{1}{2}(\omega(f,z) + \varepsilon)$$

It is easier to use rectangles, so let's switch to rectangles. Let $R(r_z,z)$ be an open rectangle containing $z$ and contained in $B(r_z,z)$. Then we can say
$$\omega(f,z) \le \sup_{x_1, x_2 \in R(r_z,z)} |f(x_1) - f(x_2)| < \frac{1}{2}(\omega(f,z) + \varepsilon)$$

Let $\hat{\varepsilon} = \frac{1}{2}(\omega(f,z) + \varepsilon) < \varepsilon$. Then, we have
$$-\hat{\varepsilon} < f(x_1) - f(x_2) < \hat{\varepsilon}, \quad \forall\, x_1, x_2 \in R(r_z,z)$$
or
$$f(x_2) - \hat{\varepsilon} < \sup_{x_1 \in R(r_z,z)} f(x_1) \le f(x_2) + \hat{\varepsilon}, \quad \forall\, x_2 \in R(r_z,z)$$
implying
$$\sup_{x_1 \in R(r_z,z)} f(x_1) \le \inf_{x_2 \in R(r_z,z)} f(x_2) + \hat{\varepsilon}$$
We conclude
$$\sup_{x_1 \in R(r_z,z)} f(x_1) - \inf_{x_2 \in R(r_z,z)} f(x_2) \le \hat{\varepsilon} < \varepsilon$$

We know $S_j \subset \cup_{z \in S_j} R(r_z,z)$ and since $S_j$ is compact, there is a finite subcover
$$R(r_{z_1},z_1), \ldots, R(r_{z_q},z_q), \quad S_j \subset \cup_{i=1}^{q} R(r_{z_i},z_i)$$

The rectangles in the subcover define a partitioning scheme inside $S_j$. Choose any additional refinement of $P$ so that the new rectangles inside $S_j$ are each inside some $R(r_{z_i},z_i)$. Call this refinement $P'$. So for any rectangle $E$ of this new refinement inside $S_j$, since it is contained in one of the sets in the subcover, we have
$$\sup_{x_1 \in E} f(x_1) - \inf_{x_2 \in E} f(x_2) \le \hat{\varepsilon} < \varepsilon$$
Thus,
$$\sum_{E \in P',\, E \subset S_j} \left( \sup_{x \in E} f(x) \right) V(E) - \sum_{E \in P',\, E \subset S_j} \left( \inf_{x \in E} f(x) \right) V(E) < \varepsilon \sum_{E \in P',\, E \subset S_j} V(E) = \varepsilon\, V(S_j)$$

We can carry out this process of refinement for each $j \in J$ to get a succession of refinements $P \rightarrow P_1' \rightarrow P_2' \rightarrow \ldots$, one for each index in $J$. This gives a partition $Q$. The indices of $Q$ still split into two pieces $I$ and $J$ as described, but the additional refinements we have introduced allow us to make estimates for each $S_j$, $j \in J$. We can say
$$U(f,Q) - L(f,Q) = \sum_{E_i,\, i \in I} \left( \sup_{x \in E_i} f(x) \right) V(E_i) - \sum_{E_i,\, i \in I} \left( \inf_{x \in E_i} f(x) \right) V(E_i) + \sum_{E_i,\, i \in J} \left( \sup_{x \in E_i} f(x) \right) V(E_i) - \sum_{E_i,\, i \in J} \left( \inf_{x \in E_i} f(x) \right) V(E_i)$$
$$< \sum_{E_i,\, i \in I} \left( \sup_{x \in E_i} f(x) - \inf_{x \in E_i} f(x) \right) V(E_i) + \varepsilon \sum_{E_i,\, i \in J} V(E_i) \le \sum_{E_i,\, i \in I} \left( \sup_{x \in E_i} f(x) - \inf_{x \in E_i} f(x) \right) V(E_i) + \varepsilon\, V(R)$$

where $R$ is the bounding rectangle for $A$. Now $f$ is bounded on $A$ so there is a constant $M$ so that $|f| < M$ on $R$. Thus,
$$U(f,Q) - L(f,Q) < 2M \sum_{E_i,\, i \in I} V(E_i) + \varepsilon\, V(R)$$
But for each index $i \in I$, $S_i$ is contained in some $\bar{B}_j$, and so $\cup_{i \in I} S_i \subset \cup_{j=1}^{N} \bar{B}_j$ with $\sum_{j=1}^{N} V(\bar{B}_j) < \varepsilon$. We conclude
$$U(f,Q) - L(f,Q) < 2M\varepsilon + \varepsilon\, V(R)$$
This is enough to show $f$ satisfies the Riemann Criterion and so $f$ is integrable on $A$.

(⇒): Now we assume $f$ is integrable on $A$. The set of discontinuities of $f$ is the set
$$D = \{x \in \Re^n \,|\, \omega(f,x) \ne 0\}$$
We also know $D = \cup_n D_{1/n}$ where $D_{1/n}$ is defined as in the first part of the proof. Since $f$ is integrable, given $\xi > 0$, there is a partition $P_\xi$ so that
$$U(f,P) - L(f,P) = \sum_{S_i \in P} \left( \sup_{x \in S_i} f(x) - \inf_{x \in S_i} f(x) \right) V(S_i) < \xi$$
for all refinements $P$ of $P_\xi$. Earlier, we showed $D_{1/n}$ is compact. Let's decompose $D_{1/n}$ in this way:
$$D_{1/n} = \underbrace{\{x \in D_{1/n} \,|\, \exists\, i \ni x \in \partial S_i,\; S_i \in P_\xi\}}_{Q_1^n} \cup \underbrace{\{x \in D_{1/n} \,|\, \exists\, i \ni x \in \text{Int}(S_i),\; S_i \in P_\xi\}}_{Q_2^n}$$
where $\partial S_i$ is the usual notation for the boundary of $S_i$ and $\text{Int}(S_i)$ denotes the interior of $S_i$. We know the measure of an edge of a rectangle is zero and we know countable unions of sets of measure zero are also measure zero. So we can conclude the measure of $Q_1^n$ must be zero. It remains to see what the measure of $Q_2^n$ is.

If $S_i$ is a rectangle whose interior intersects $Q_2^n$, then there is a point $u \in \text{Int}(S_i)$ which is in $D_{1/n}$. Thus,
$$\omega(f,u) = \inf_{B(r,u),\, r>0}\; \sup_{x_1,x_2 \in B(r,u)} |f(x_1) - f(x_2)| \ge 1/n$$
In particular, this can be rewritten in terms of open rectangles as
$$\omega(f,u) = \inf_{R(r,u),\, r>0}\; \sup_{x_1,x_2 \in R(r,u)} |f(x_1) - f(x_2)| \ge 1/n$$


where $R(r,u)$ is an open rectangle with center $u$ and axis dimensions $r$. Since $u$ is interior to $S_i$, we can pick $r$ small enough that $R(r,u) \subset S_i$, and then
$$\sup_{x_1,x_2 \in S_i} |f(x_1) - f(x_2)| \ge \sup_{x_1,x_2 \in R(r,u)} |f(x_1) - f(x_2)| \ge 1/n$$
Then, using the same (complicated!) arguments we used in the first part of this proof, we can say
$$\sup_{x \in S_i} f(x) - \inf_{x \in S_i} f(x) \ge 1/n$$

Applying this argument to all the rectangles $S$ of $P_\xi$ that intersect $Q_2^n$, we find
$$\frac{1}{n} \sum_{S \cap Q_2^n \ne \emptyset} V(S) \le \sum_{S \cap Q_2^n \ne \emptyset} \left( \sup_{x \in S} f(x) - \inf_{x \in S} f(x) \right) V(S) \le \sum_{S \in P_\xi} \left( \sup_{x \in S} f(x) - \inf_{x \in S} f(x) \right) V(S) < \xi$$

Thus, the collection $V_\xi = \{S \,|\, S \cap Q_2^n \ne \emptyset\}$ is a set of rectangles that covers $Q_2^n$ with $\sum_{S \cap Q_2^n \ne \emptyset} V(S) < n\xi$. Now since the measure of $Q_1^n$ is zero, for $\xi > 0$, there is a collection of rectangles covering $Q_1^n$, $U_\xi = \{U_{1,\xi}, \ldots, U_{p,\xi}\}$, so that $\sum_{i=1}^{p} V(U_{i,\xi}) < \xi$. The collection $U_\xi \cup V_\xi$ is a cover for $D_{1/n}$ with $\sum_{S \in U_\xi \cup V_\xi} V(S) < (n+1)\xi$. Finally, letting $\xi = \varepsilon/(n+1)$, we have found a collection of rectangles whose union contains $D_{1/n}$ with $\sum_{S \in U_\xi \cup V_\xi} V(S) < \varepsilon$. Since $\varepsilon > 0$ is arbitrary because $\xi$ was arbitrary, we see $D_{1/n}$ has measure zero. This tells us $D$ also has measure zero.

Whew! That is an intense proof! Compare the approach we have used here to the one we used in $\Re$ in (Peterson (8) 2019). This one had to be able to handle rectangles in spaces of more than one dimension, which is why it is much harder. Of course, this argument works just fine for $\Re$, although the rectangles degenerate to just closed intervals.

Example 15.4.1 Consider the curve $x^2 + y^2 = 1$ in $\Re^2$. Let's show it has measure zero in $\Re^2$. We'll divide the curve into the four quadrants and just do the argument for quadrant one. In quadrant one, we have $y = \sqrt{1 - x^2}$ with $0 \le x \le 1$. Divide $[0,1]$ into $n$ uniform pieces and choose rectangles $B_i$ centered at the points $z_i = (i/n, \sqrt{1 - (i/n)^2})$ as follows. The distance between $z_i$ and $z_{i+1}$ is
$$d(z_i, z_{i+1}) = \sqrt{(1/n)^2 + \left( \sqrt{1 - ((i+1)/n)^2} - \sqrt{1 - (i/n)^2} \right)^2}$$
Choose $B_i$ to be the square centered at $z_i$ whose distance to $z_{i+1}$ is $\sqrt{2}\, d(z_i, z_{i+1})$. Then these squares overlap and their union covers the arc of $x^2 + y^2 = 1$ in quadrant one with summed area
$$\sum_{i=1}^{n} V(B_i) < \sum_{i=1}^{n} 4/n^2 = 4/n$$
Thus, for a given $\varepsilon > 0$, there is $N$ so that if $n > N$, the collection of rectangles covers the arc with summed volume less than $\varepsilon$. Hence the arc in quadrant one has measure zero in $\Re^2$. A similar argument works in the other quadrants and thus this curve has measure zero in $\Re^2$.


Example 15.4.2 Now look at the integral of $f(x,y) = 1 - x^2 - y^2$ on $A$, the interior of the disc of radius one. Let $R = [-1,1] \times [-1,1]$. Then
$$\bar{f}(x,y) = \begin{cases} 1 - x^2 - y^2, & (x,y) \in D = \{(x,y) \,|\, x^2 + y^2 < 1\} \\ 0, & (x,y) \in R \setminus D \end{cases}$$
$\bar{f}$ is clearly continuous on $D$ and on $\{(x,y) \in R \,|\, x^2 + y^2 > 1\}$. Also, the set of points where $x^2 + y^2 = 1$ is a set of measure zero in $\Re^2$. Hence, we know $\bar{f}$ is continuous everywhere except possibly on a set of measure zero and so $f$ is integrable on $D$. So $\int_D f\, dV_2$ exists and we can approximate it using any sequence of partitions whose norms go to zero. We usually would write $\int_D f\, dA$ as we are using area ideas in $\Re^2$. Finally, we would normally write $\int_{x^2+y^2<1} (1 - x^2 - y^2)\, dA_{xy}$ to be more specific.

15.4.1 Homework

Exercise 15.4.1 Prove $x^2 + 4y^2$ is integrable on $[-1,2] \times [3,5]$.

Exercise 15.4.2 Prove $\sin(x^2 + 4xy + y^3)$ is integrable on the set $A = \{(x,y) \,|\, 0 \le y \le x^2\} \cap \{(x,y) \,|\, 0 \le y \le 4\}$.

Exercise 15.4.3 Prove $\sin(x^2 + 4xy + y^3)$ is integrable on the set $A = \{(x,y) \,|\, x^2 \le y \le 6\}$.

Exercise 15.4.4 Prove $x^2 + y^2 + 4z^2$ is integrable on $D = \{(x,y,z) \,|\, x^2 + y^2 + z^2 \le 25\}$.

15.5 Integration and Sets of Measure Zero

Let $A \subset \Re^n$ be a set of measure zero. If $A$ contained a nontrivial rectangle $E = \prod_{i=1}^{n} [a_i,b_i]$, we would know $V(E) > 0$. But we also know $A$ has measure zero, so given $\varepsilon = V(E)/2$, there is a collection $\{B_i\}$ with $A \subset \cup_i B_i$ and $\sum_i V(B_i) < V(E)/2$. However, $E \subset A \subset \cup_i B_i$ implies $V(E) \le \sum_i V(B_i)$, or $V(E) < V(E)/2$, which is not possible. We conclude if $A$ is of measure zero, it cannot contain a nontrivial rectangle such as $E$.

Let $S$ be a rectangle containing $A$ and now let's assume $A$ is a bounded set of measure zero. Let $f : A \rightarrow \Re$ be a bounded function which is integrable on $A$. Extend $f$ to $\bar{f}$ on $S$ as usual. Let $P$ be a partition of $S$ with rectangles $S_1, \ldots, S_N$. Since $f$ is bounded, there are positive numbers $m$ and $M$ so that $-m \le f(x) \le M$ for $x \in A$. Then
$$L(\bar{f},P) = \sum_{i \in P} \left( \inf_{x \in S_i} \bar{f}(x) \right) V(S_i)$$

Now, on $S$ we have $\bar{f}(x) \le M\, \bar{1}_A(x)$, since $\bar{f} = f \le M$ on $A$ and $\bar{f} = 0 = M\bar{1}_A$ off $A$. Hence
$$\inf_{x \in S_i} \bar{f}(x) \le M \inf_{x \in S_i} \bar{1}_A(x)$$
Thus,
$$L(\bar{f},P) \le M \sum_{i \in P} \left( \inf_{x \in S_i} \bar{1}_A(x) \right) V(S_i)$$
If $\inf_{x \in S_i} \bar{1}_A = 1$ for some index $i$, this would force $S_i \subset A$. But by our argument at the start of this discussion, since $A$ has measure zero, it is not possible for a nontrivial rectangle to be inside $A$. Thus, we must have $\inf_{x \in S_i} \bar{1}_A = 0$


always. This tells us for all $P$,
$$L(\bar{f},P) \le M \sum_{i \in P} \left( \inf_{x \in S_i} \bar{1}_A(x) \right) V(S_i) = 0$$
We conclude $\underline{DI}(f,A) \le 0$.

Next, we know $\sup_{x \in S_i} \bar{f} = -\inf_{x \in S_i} (-\bar{f})$, so
$$U(\bar{f},P) = \sum_{i \in P} \left( \sup_{x \in S_i} \bar{f}(x) \right) V(S_i) = -\sum_{i \in P} \left( \inf_{x \in S_i} (-\bar{f}(x)) \right) V(S_i) = -L(-\bar{f},P)$$

Then, by the same arguments we just used, since $-f \le m$ on $A$, we see
$$L(-\bar{f},P) \le m \sum_{i \in P} \left( \inf_{x \in S_i} \bar{1}_A(x) \right) V(S_i) = 0$$
Thus, $-L(-\bar{f},P) \ge 0$ for all $P$. Therefore $U(\bar{f},P) \ge 0$ for all $P$ and so $\overline{DI}(f,A) \ge 0$. We conclude
$$\underline{DI}(f,A) \le 0 \le \overline{DI}(f,A)$$
But $f$ is integrable on $A$, so we have $0 \le \int_A f \le 0$, or $\int_A f = 0$. We have proven an important theorem.

Theorem 15.5.1 If $f$ is integrable on $A$ and $A$ has measure zero, then the integral of $f$ on $A$ is zero

Let $A \subset \Re^n$ be bounded and have measure zero. If $f : A \rightarrow \Re$ is integrable on $A$, then $\int_A f = 0$.

Proof 15.5.1This is the argument we have just done.

Another interesting result is this:

Theorem 15.5.2 If the nonnegative $f$ is integrable on $A$ with integral zero, then the set of points where $f > 0$ has measure zero.

Let $A \subset \Re^n$ be bounded and $f : A \to \Re$ be integrable on $A$ and nonnegative, with $\int_A f = 0$. Then the set of points where $f > 0$ has measure zero.

Proof 15.5.2
Let $A_m = \{x \in A \mid f(x) > 1/m\}$ for any positive integer $m$. Pick $\epsilon > 0$. Since $\int_A f = 0$, there is a partition $P_m$ so that

$$\int_A f \le U(\hat{f}, P_m) < \int_A f + \epsilon/m \ \Longrightarrow \ 0 \le U(\hat{f}, P_m) < \epsilon/m$$

as $\int_A f = 0$. Then,

$$\sum_{i \in P_m : S_i \cap A_m \ne \emptyset} \Big( \sup_{x \in S_i} \hat{f}(x) \Big) V(S_i) \le \sum_{i \in P_m} \Big( \sup_{x \in S_i} \hat{f}(x) \Big) V(S_i) < \epsilon/m$$

If $S_i \cap A_m \ne \emptyset$, then there is an $x \in S_i \cap A_m$, implying there is an $x \in S_i$ with $f(x) > 1/m$. Hence,

$$\sum_{i \in P_m : S_i \cap A_m \ne \emptyset} (1/m) \, V(S_i) \le \sum_{i \in P_m : S_i \cap A_m \ne \emptyset} \Big( \sup_{x \in S_i} \hat{f}(x) \Big) V(S_i) < \epsilon/m$$

or $\sum_{i \in P_m : S_i \cap A_m \ne \emptyset} V(S_i) < \epsilon$. This tells us this collection of rectangles covers $A_m$ with total summed volume smaller than $\epsilon$. Since $\epsilon$ is arbitrary, we know $A_m$ has measure zero. Finally, $\{x \in A \mid f(x) > 0\} = \cup_m A_m$ is a countable union of sets of measure zero and hence has measure zero as well.

And, a result that seems natural.

Theorem 15.5.3 If $f$ is integrable over $A$ and $f$ is zero except on a set of measure zero, then the integral is zero.

Let $f : A \subset \Re^n \to \Re$, where $A$ is bounded, be integrable. Assume $B = \{x \in A \mid f(x) \ne 0\}$ has measure zero. Then $\int_A f = 0$.

Proof 15.5.3
Since $f$ is integrable, for any sequence of partitions $P_n$ with $\|P_n\| \to 0$, $S(f, \sigma_n, P_n) \to \int_A f$ where $\sigma_n$ is any evaluation set from $P_n$. Now if $S_{i_n}$ is a rectangle in $P_n$ and all points $z$ from $S_{i_n}$ were in $B$, this would mean $B$ contains a rectangle. But previous arguments tell us this is not possible since $B$ has measure zero. Thus, each $S_{i_n}$ contains a point not in $B$. Call this point $z_{i_n}$ and let $\sigma_n = \{z_{i_n} \mid i \in P_n\}$. Then

$$S(f, \sigma_n, P_n) = \sum_{i \in P_n} f(z_{i_n}) \, V(S_{i_n}) = 0$$

This shows $S(f, \sigma_n, P_n) \to 0 = \int_A f$.
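For a concrete instance of Theorem 15.5.3 (our illustration, not in the original text), consider

$$f(x, y) = \begin{cases} 1, & y = 0, \ 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \quad \text{on } A = [0, 1] \times [0, 1]$$

The set $B = \{(x, 0) \mid 0 \le x \le 1\}$ is a line segment and so has measure zero in $\Re^2$. Since the discontinuity set of $f$ is exactly $B$, $f$ is integrable, and the theorem gives $\int_A f = 0$.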


Chapter 16

Change of Variables and Fubini's Theorem

We now begin our discussion of how to make a change of variables in an integral. We start with linear maps.

16.1 Linear Maps

We start with the simplest case. We have this hazy notion of what it means for a set to have a volume: its indicator function must be integrable. But we really only know how to compute simple volumes such as the volumes of rectangles. The next proof starts the process of understanding change of variable theorems by looking at what happens to a set with volume when it is transformed with a linear map.

Theorem 16.1.1 Change of Variable for a Linear Map

If $L : \Re^n \to \Re^n$ is a linear map and $D \subset \Re^n$ is a set which has volume, i.e. $1_D$ is integrable and $V(D) = \int_D 1_D$, then $L(D)$ also has volume and

$$V(L(D)) = \int_{L(D)} 1_{L(D)} = |\det L| \int_D 1_D = \int_D |\det L| \, 1_D$$

Proof 16.1.1
Note if $L$ is not invertible, the range of $L$ is at most an $n-1$ dimensional subspace of $\Re^n$, which has measure zero. Thus, $0 = \int_{L(D)} 1_{L(D)}$ and, since $\det L = 0$, this matches $|\det L| \int_D 1_D = 0$. So the result holds even if $L$ is not invertible.

For any invertible matrix $A$ there is a sequence of elementary matrices $L_i$ so that

$$L_p L_{p-1} \cdots L_1 A = I$$


where each $L_i$ is invertible. We discuss these things carefully in Chapter 10.2, but let's review a bit here. These matrices $L_i$ come in two types, $T_1$ and $T_2$. A Type 1 matrix is the identity with an extra $1$ placed at an off-diagonal $(i, j)$ position:

$$T_1 = I + \Lambda_{ij}$$

where $I$ is the usual identity matrix and $\Lambda_{ij}$ is the matrix of all zeros except a $1$ at the $(i, j)$ position. Note $\det(T_1) = 1$ and so $\det(T_1^{-1}) = 1$ also. It is easy to see

$$T_1^{-1} = I - \Lambda_{ij}$$

A Type 2 matrix is the identity with the $(j, j)$ diagonal entry replaced by a nonzero constant $c$:

$$T_2 = I + (c - 1) \Lambda_{jj}$$

It is easy to see

$$T_2^{-1} = I + (1/c - 1) \Lambda_{jj}$$

Note $\det(T_2) = c$ and so $\det(T_2^{-1}) = 1/c$ also. So $T_2 A$ creates a new matrix which scales row $j$ by $c$, and $T_1 A$ adds row $j$ to row $i$. Thus, $T_1 T_2 A$ adds $c$ times row $j$ to row $i$ to create a new row $i$. The same is true for the inverses: $T_2^{-1} A$ creates a new matrix which scales row $j$ by $1/c$, and $T_1^{-1} A$ subtracts row $j$ from row $i$. Thus, $T_1^{-1} T_2^{-1} A$ subtracts $1/c$ times row $j$ from row $i$ to create a new row $i$.

The linear map $L$ has a matrix representation $A$ and so there are invertible linear maps $\lambda_1$ to $\lambda_p$ so that $L = \lambda_1^{-1} \lambda_2^{-1} \cdots \lambda_p^{-1}$. Thus,

$$\det(L) = \det(\lambda_1^{-1}) \cdots \det(\lambda_p^{-1}) = \det(L_1^{-1}) \cdots \det(L_p^{-1}) = \prod_{i \in I} (1/c_i)$$

where $I$ is the set of indices where $L_i$ corresponds to a Type 2 matrix. Note this is independent of the matrix representations of these maps.

If $\lambda_i^{-1}$ is of type $T_2$, then $\lambda_i(x_1, x_2, \ldots, x_n) = (x_1, x_2, \ldots, (1/c) x_j, \ldots, x_n)$ for some $j$, and if $S$ is a rectangle, we see $V(\lambda_i S) = |1/c| \, V(S)$.


If $\lambda_i$ is of Type $T_1$, we have $\lambda_i(x_1, x_2, \ldots, x_n) = (x_1, x_2, \ldots, x_i + x_j, \ldots, x_n)$ for some $i$ and $j$. Now apply $\lambda_i$ to a rectangle $S$. The new set $\lambda_i S$ preserves all the structure except in the $i$-$j$ plane. For example, with $i = 2$ and $j = 3$ and $S = [a_1, b_1] \times \cdots \times [a_n, b_n]$, all of $S$ is preserved except in the $2$-$3$ plane. There we have

$$\begin{array}{ccc} (a_2, b_3) & \cdots & (b_2, b_3) \\ \vdots & & \vdots \\ (a_2, a_3) & \cdots & (b_2, a_3) \end{array} \ \longrightarrow \ \begin{array}{ccc} (a_2 + b_3, b_3) & \cdots & (b_2 + b_3, b_3) \\ \vdots & & \vdots \\ (a_2 + a_3, a_3) & \cdots & (b_2 + a_3, a_3) \end{array}$$

We see the area of the original rectangle in the $2$-$3$ plane is $(b_2 - a_2)(b_3 - a_3)$, which is the same as the area of the parallelogram in the transformed set. This experiment can be repeated for other choices. We have found that $T_1$ applied to a rectangle $S$ does not change its volume. A similar analysis shows $T_1^{-1}$ applied to a rectangle $S$ does not change its volume.

Now pick a bounding rectangle $R$ large enough to contain both $D$ and $L(D)$. Since $D$ has volume, by the integral approximation theorem, given a sequence of partitions $P_n$ with $\|P_n\| \to 0$, $U(1_D, P_n) \downarrow \int_D 1_D = V(D)$ and $L(1_D, P_n) \uparrow \int_D 1_D = V(D)$. Thus, for a given $\epsilon > 0$ there is an $N$ so that if $n > N$, $U(1_D, P_n) < V(D) + \epsilon/|\det(L)|$ and $L(1_D, P_n) > V(D) - \epsilon/|\det(L)|$. Hence, there are rectangles $S_1, \ldots, S_u$ in $P_n$ so that $\sum_{j=1}^u V(S_j) < V(D) + \epsilon/|\det(L)|$, and rectangles (possibly different) $T_1, \ldots, T_v$ in $P_n$ so that $\sum_{j=1}^v V(T_j) > V(D) - \epsilon/|\det(L)|$. Then, from our arguments above,

$$V(\lambda_1^{-1} \cdots \lambda_p^{-1}(S)) = |\det(L)| \, V(S)$$

Consider

$$U(1_{L(D)}, P_n) = \sum_{S \in P_n} \Big( \sup_{x \in S} 1_{L(D)}(x) \Big) V(S)$$

Now, if $\sup_{x \in S} 1_{L(D)}(x) = 1$, this means there is $x \in S$ so that $L^{-1}(x) \in D$. This means there is a rectangle $T \in P_n$ so that $\sup_{x \in T} 1_D(x) = 1$. Therefore, this $1$ in the upper sum $U(1_{L(D)}, P_n)$ also occurs in $U(1_D, P_n)$. So every $1$ in the upper sum $U(1_{L(D)}, P_n)$ also occurs in $U(1_D, P_n)$. We conclude

$$U(1_{L(D)}, P_n) \le U(1_D, P_n)$$

Also,

$$L(1_{L(D)}, P_n) = \sum_{S \in P_n} \Big( \inf_{x \in S} 1_{L(D)}(x) \Big) V(S)$$

Now, if $\inf_{x \in S} 1_{L(D)}(x) = 1$, this means $S \subset L(D)$, so that $L^{-1}(x) \in D$ for all $x \in S$; i.e. $L^{-1}(S) \subset D$. This does not mean we can be sure we can find a $T \in P_n$ which is inside $D$. Thus, this $1$ in $L(1_{L(D)}, P_n)$ need not be repeated in $L(1_D, P_n)$. We conclude

$$L(1_{L(D)}, P_n) \ge L(1_D, P_n)$$

Therefore,

$$U(1_{L(D)}, P_n) - L(1_{L(D)}, P_n) \le U(1_D, P_n) - L(1_D, P_n)$$


We also know

$$U(1_D, P_n) - L(1_D, P_n) < \epsilon$$

Combining, we see $1_{L(D)}$ is integrable since it satisfies the Riemann Criterion. Also, since $V(L(S_j)) = |\det(L)| \, V(S_j)$ and $V(L(T_j)) = |\det(L)| \, V(T_j)$,

$$\sum_{j=1}^{u} V(L(S_j))/|\det(L)| < V(D) + \epsilon/|\det(L)|, \qquad \sum_{j=1}^{v} V(L(T_j))/|\det(L)| > V(D) - \epsilon/|\det(L)|$$

or

$$U(1_{L(D)}, P_n) \le \sum_{j=1}^{u} V(L(S_j)) < |\det(L)| \, V(D) + \epsilon$$

$$L(1_{L(D)}, P_n) \ge \sum_{j=1}^{v} V(L(T_j)) > |\det(L)| \, V(D) - \epsilon$$

Since $P_n$ is a sequence of partitions with $\|P_n\| \to 0$, we have

$$\int_{L(D)} 1_{L(D)} = \lim_{P_n} U(1_{L(D)}, P_n) \le |\det(L)| \, V(D) + \epsilon$$

$$\int_{L(D)} 1_{L(D)} = \lim_{P_n} L(1_{L(D)}, P_n) \ge |\det(L)| \, V(D) - \epsilon$$

giving, for all $\epsilon > 0$,

$$\Big| \int_{L(D)} 1_{L(D)} - |\det(L)| \, V(D) \Big| \le \epsilon$$

Thus, $L(D)$ does have volume and $\int_{L(D)} 1_{L(D)} = |\det(L)| \, V(D)$.
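Here is a small numerical sketch of Theorem 16.1.1 (ours, not from the text; the matrix, sample counts, and the Monte Carlo approach are our own choices). We apply a linear map to the unit square $D = [0,1]^2$ and estimate the area of $L(D)$; it should approach $|\det L| \cdot V(D)$.

% Monte Carlo check that V(L(D)) = |det L| V(D) for D = [0,1]^2.
L = [2 1; 1 3];                    % an invertible linear map, det = 5
N = 200000;                        % number of sample points (our choice)
P = rand(2, N);                    % uniform samples in D
Q = L * P;                         % images, used to bound L(D) by a box
lo = min(Q, [], 2); hi = max(Q, [], 2);
B = lo + (hi - lo) .* rand(2, N);  % uniform samples in the bounding box
X = L \ B;                         % pull each sample back to the domain
inLD = all(X >= 0 & X <= 1, 1);    % sample lies in L(D) iff pullback in D
vol = prod(hi - lo) * mean(inLD)   % estimate of V(L(D)); about 5
abs(det(L))                        % compare: |det L| * V(D) = 5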

It is easy to show that given two linearly independent vectors in $\Re^2$, say $V$ and $W$, the parallelogram formed by these two vectors has an area that can be computed by looking at the projection of $W$ onto $V$. If the angle between $V$ and $W$ is $\theta$, then the area of the parallelogram is

$$A = \|V\| \, \|W\| \sin(\theta) = \|V\| \, \|W\| \sqrt{\frac{\|V\|^2 \|W\|^2 - \langle V, W \rangle^2}{\|V\|^2 \|W\|^2}} = \sqrt{\|V\|^2 \|W\|^2 - \langle V, W \rangle^2} = \sqrt{\det(T)^2} = |\det(T)|$$

where $T$ is the matrix formed using $V$ and $W$ as its columns. It is a lot harder to work this out in $\Re^3$, and trying to extend this idea to $\Re^n$ shows you there must be another way to reason this out. Theorem 16.1.1 is a reasonable way to attack this. It does require the large setup of a general integration theory in $\Re^n$ and the notion of extending volume in general, but it has the advantage of being quite clear logically. We suspect the volume of the parallelepiped in $\Re^n$ formed by $n$ linearly independent vectors should be $|\det(T)|$ where $T$ is the matrix formed using the vectors as columns. We can state this as follows:

Theorem 16.1.2 The Volume of a Parallelepiped in $\Re^n$


If $V_1, \ldots, V_n$ are linearly independent in $\Re^n$, the volume of the parallelepiped determined by them is $|\det(T)|$, where $T$ is the matrix formed using the $V_i$ as its columns.

Proof 16.1.2
First, note if $D$ is a rectangle in $\Re^n$, $1_D(x) = 1$ on all of $D$ and it lacks continuity only on $\partial D$, which is a set of measure zero. Thus, it is integrable. Theorem 16.1.1 applied with $L = I$ tells us $\int_D 1_D = V(D)$, which is the usual $\prod (b_i - a_i)$ where $[a_i, b_i]$ is the $i$th edge of the rectangle $D$. You can also prove this directly by looking at upper and lower sums without using this Theorem, and you should try that. In this case, $V_1, \ldots, V_n$ determine an invertible linear map $L$ on $D = [0, 1] \times \cdots \times [0, 1]$ and so

$$\int_{L(D)} 1_{L(D)} = |\det(L)| \, V(D) = |\det(L)|$$

Example 16.1.1 Here is a sample volume calculation in $\Re^3$. We pick three independent vectors, set up the corresponding matrix and find its determinant.

Listing 16.1: Sample Volume Calculation

V1 = [ 1 ; 2 ; -3 ];
V2 = [ -3 ; 4 ; 10 ];
V3 = [ 11 ; 22 ; 1 ];
A = [ V1 , V2 , V3 ]
A =

     1    -3    11
     2     4    22
    -3    10     1

det(A)
ans = 340.00

We see $V(A) = 340$.

16.1.1 Homework

Exercise 16.1.1 Prove if $D$ is a rectangle in $\Re^n$, $V(D) = \prod (b_i - a_i)$ where $[a_i, b_i]$ is the $i$th edge of the rectangle $D$, from a lower and upper sum argument without using Theorem 16.1.1.

Exercise 16.1.2 Here are the vectors. Find the volume of the parallelepiped determined by them.

Listing 16.2: Volume Calculation

V1 = [ -1 ; 2 ; 4 ; 10 ];
V2 = [ 2 ; -3 ; 5 ; 1 ];
V3 = [ 8 ; -6 ; -7 ; 1 ];
V4 = [ 9 ; 1 ; 2 ; 4 ];

Exercise 16.1.3 Show the volume of a parallelepiped in $\Re^n$ is independent of the choice of orthonormal basis.


16.2 The Change of Variable Theorem

The Change of Variable result is a bit complicated to prove, so be prepared to follow some careful arguments! We start with an overestimate.

Theorem 16.2.1 Subsets of Sets That Have Volume

Let $D \subset \Re^n$ be open and bounded and let $g : D \to \Re^n$. Assume $g(D)$ has volume. Then if $V \subset D$, $g(V)$ also has volume.

Proof 16.2.1
Since $g(D)$ has volume, for a given $\epsilon > 0$, there is a partition $P_\epsilon$ of a rectangle containing $g(D)$ so that

$$U(1_{g(D)}, P) - L(1_{g(D)}, P) < \epsilon$$

for any refinement $P$ of $P_\epsilon$. The partition $P_\epsilon$ also serves as a partition for $g(V)$ and, with $h = 1_{g(V)}$, it is easy to see

$$U(h, P) - L(h, P) \le U(1_{g(D)}, P) - L(1_{g(D)}, P) < \epsilon$$

which implies $h$ satisfies the Riemann Criterion; so $h$ is integrable and thus $g(V)$ has volume.

Our first change of variable theorem involves the integration of a simple constant.

Theorem 16.2.2 The Change of Variable Theorem for a Constant Map

Let $D \subset \Re^n$ be open and bounded. Let $g : D \to \Re^n$ have continuous partial derivatives on $D$ and assume $g$ is $1-1$ on $D$ with $\det J_g(x) \ne 0$ on $D$. If $g(D)$ has volume, then

$$\int_{g(D)} 1 = \int_D |\det J_g|$$

Proof 16.2.2
Let $C(s, x_0)$ be a cube centered at $x_0$ with half-edge $s$ inside the open set $D$. The cube $C(s, x_0)$ has the form

$$C(s, x_0) = \{ x \in \Re^n \mid \|x - x_0\|_\infty < s \}$$

and $V(C(s, x_0)) = (2s)^n$. Since $g$ has continuous partial derivatives, using the Mean Value Theorem, we can say there are points $u_i$ on $[x_0, x]$ so that

$$g_i(x) - g_i(x_0) = \sum_{k=1}^n g_{i, x_k}(u_i) \, (x_k - x_{0,k})$$

Note each $u_i \in C(s, x_0)$. We can then overestimate

$$|g_i(x) - g_i(x_0)| \le \sum_{k=1}^n |g_{i, x_k}(u_i)| \, \|x - x_0\|_\infty \le \sum_{k=1}^n |g_{i, x_k}(u_i)| \, s \le n \, \|J_g(u_i)\|_F \, s \le n \max_{y \in C(s, x_0)} \|J_g(y)\|_F \, s$$

as $\sqrt{n} < n$. Hence,

$$\|g(x) - g(x_0)\|_\infty \le \Big( \max_{y \in C(s, x_0)} \|J_g(y)\|_F \Big) \, n s$$


implying

$$V(g(C(s, x_0))) \le \Big( \max_{y \in C(s, x_0)} \|J_g(y)\|_F \Big)^n n^n \, V(C(s, x_0))$$

Thus, $g(C(s, x_0))$ is contained in a cube centered at $g(x_0)$ defined by

$$E(s, x_0) = \Big\{ z \ \Big| \ \|z - g(x_0)\|_\infty \le \Big( \max_{y \in C(s, x_0)} \|J_g(y)\|_F \Big) n s \Big\}$$

Now let $L$ be an invertible linear map. Then, if $S$ has volume, $L^{-1}(S)$ also has volume and

$$V(L^{-1}(S)) = \int_{L^{-1}(S)} 1 = \int_S |\det(L^{-1})| \, 1 = |\det(L^{-1})| \, V(S)$$

Let $S = g(C(s, x_0))$. We know $S$ has volume because of Theorem 16.2.1.

Thus, for any invertible map $L$, applying the cube estimate above to $L^{-1} \circ g$, whose Jacobian is $L^{-1} J_g$,

$$|\det(L^{-1})| \, V(g(C(s, x_0))) = V((L^{-1} \circ g)(C(s, x_0))) \le \Big( \max_{y \in C(s, x_0)} \|(L^{-1} J_g)(y)\|_F \Big)^n n^n \, V(C(s, x_0))$$

implying

$$V(g(C(s, x_0))) \le |\det L| \Big( \max_{y \in C(s, x_0)} \|(L^{-1} J_g)(y)\|_F \Big)^n n^n \, V(C(s, x_0))$$

Now divide $C(s, x_0)$ into $M$ cubes $C_1, \ldots, C_M$ with centers $x_1, \ldots, x_M$ and apply $L_i = J_g(x_i)$ on each cube $C_i$. We find

$$V(g(C_i)) \le |\det L_i| \Big( \max_{y \in C_i} \|(L_i^{-1} J_g)(y)\|_F \Big)^n n^n \, V(C_i)$$

and thus

$$V(g(C(s, x_0))) = \sum_{i=1}^M V(g(C_i)) \le \sum_{i=1}^M |\det L_i| \Big( \max_{y \in C_i} \|(L_i^{-1} J_g)(y)\|_F \Big)^n n^n \, V(C_i)$$

The function $(J_g(y))^{-1}$ is uniformly continuous on $C(s, x_0)$ with respect to the Frobenius norm on matrices (you should be able to check this easily) and so

$$\lim_{z \to y} (J_g(z))^{-1} = (J_g(y))^{-1}$$

which tells us

$$\lim_{z \to y} (J_g(z))^{-1} J_g(y) = (J_g(y))^{-1} J_g(y) = I$$

Thus, given $\epsilon > 0$, there is a $\delta > 0$ so that

$$\|z - y\| < \delta \Longrightarrow \|(J_g(z))^{-1} J_g(y) - I\|_\infty < \epsilon$$

But this says

$$\|z - y\| < \delta \Longrightarrow \|I\|_\infty - \epsilon < \|(J_g(z))^{-1} J_g(y)\|_\infty < \|I\|_\infty + \epsilon$$


or

$$\|z - y\| < \delta \Longrightarrow 1 - \epsilon < \|(J_g(z))^{-1} J_g(y)\|_\infty < 1 + \epsilon$$

Now we find the subdivision of cubes we need. Let $\epsilon = 1/p$ and choose the associated $\delta_p$ so that also $\delta_p < 1/p$. Choose the cubes $C_1, \ldots, C_{M_p}$ with centers $x_1, \ldots, x_{M_p}$ small enough that $\|x_i - y\| < \delta_p$ for every $y \in C_i$. Then, we must have

$$\|y - x_i\| < \delta_p \Longrightarrow 1 - 1/p < \|(J_g(x_i))^{-1} J_g(y)\|_\infty < 1 + 1/p$$

We now have the estimate

$$V(g(C(s, x_0))) \le \sum_{i=1}^{M_p} \Big( \max_{y \in C_i} (1 + 1/p) \Big)^n |\det(J_g(x_i))| \, n^n \, V(C_i) \le \sum_{i=1}^{M_p} (1 + 1/p)^n \, n^n \, |\det(J_g(x_i))| \, V(C_i)$$

Since $(1 + 1/p) \to 1$ as $p \to \infty$, for a given $\epsilon$, there is a $p_0$ so that $(1 + 1/p)^n n^n < 1 + \epsilon$. We conclude if $p > p_0$,

$$V(g(C(s, x_0))) \le \sum_{i=1}^{M_p} (1 + \epsilon) \, |\det(J_g(x_i))| \, V(C_i)$$

Now, let's connect this to Riemann sums. Let $P_p$ be the partition of $C(s, x_0)$ determined by these cubes. Then for the evaluation set $\sigma_p = \{x_1, \ldots, x_{M_p}\}$, we have

$$\sum_{i \in P_p} |\det(J_g(x_i))| \, V(C_i) = S(|\det(J_g)|, \sigma_p, P_p)$$

implying for $p > p_0$,

$$V(g(C(s, x_0))) \le (1 + \epsilon) \, S(|\det(J_g)|, \sigma_p, P_p)$$

We know $|\det(J_g)|$ is integrable on $C(s, x_0)$ and, since $\|P_p\| \to 0$, we know $S(|\det(J_g)|, \sigma_p, P_p) \to \int_{C(s, x_0)} |\det(J_g)|$. Since $\epsilon > 0$ is arbitrary, we know

$$V(g(C(s, x_0))) \le \int_{C(s, x_0)} |\det(J_g)|$$

To finish, we let $P$ be a partition of $D$ into cubes $C_i$. Then each $C_i$ is contained in a cube $W_i$ and $D \subset \cup W_i$. Since $g(D)$ has volume, so does each $g(C_i)$ and

$$\int_{g(C_i)} 1_{g(C_i)} = V(g(C_i))$$

Hence,

$$\int_{g(D)} 1_{g(D)} = V(g(D)) = \sum_i \int_{g(C_i)} 1_{g(C_i)} = \sum_i V(g(C_i)) \le \sum_i \int_{C_i} |\det(J_g)| = \int_D |\det(J_g)|$$


We have shown one part of the inequality we need: $\int_{g(D)} 1_{g(D)} \le \int_D |\det(J_g)|$.

The argument we just gave works just as well for another case: if $f : g(D) \to \Re$ is continuous on $g(D)$, then applying it to the composite map shows

$$\int_{g(D)} f \le \int_D (f \circ g) \, |\det(J_g)|$$

Let $\phi = (f \circ g) \, |\det(J_g)|$ and $\psi = g^{-1}$. Then our last result applies and

$$\int_{\psi(g(D))} \phi \le \int_{g(D)} (\phi \circ \psi) \, |\det(J_\psi)|$$

or

$$\int_D (f \circ g) \, |\det(J_g)| \le \int_{g(D)} \big( ((f \circ g) \, |\det(J_g)|) \circ g^{-1} \big) \, |\det(J_{g^{-1}})|$$

At any particular point $y \in g(D)$, we have

$$(\phi \circ \psi)(y) \, |\det(J_{g^{-1}})(y)| = f(g(g^{-1}(y))) \, |\det(J_g)(g^{-1}(y))| \, |\det(J_{g^{-1}})(y)| = f(y)$$

as the last two terms are inverses of one another. Thus, $\big( ((f \circ g) \, |\det(J_g)|) \circ g^{-1} \big) \, |\det(J_{g^{-1}})| = f$ and we have

$$\int_D (f \circ g) \, |\det(J_g)| \le \int_{g(D)} f$$

Then for the choice $f = 1$, since $1 \circ g = 1$, we find

$$\int_D |\det(J_g)| \le \int_{g(D)} 1 = \int_{g(D)} 1_{g(D)}$$

This provides the other half of the inequality we need. Combining, we have

$$\int_D |\det(J_g)| = \int_{g(D)} 1_{g(D)}$$

So we have shown the Change of Variable Theorem works for constant maps $f$.

Next, we extend from a constant map to a general map.

Theorem 16.2.3 The Change of Variable Theorem for a General Map

Let $D \subset \Re^n$ be open and bounded. Let $g : D \to \Re^n$ have continuous partial derivatives on $D$ and assume $g$ is $1-1$ on $D$ with $\det J_g(x) \ne 0$ on $D$. If $g(D)$ has volume, then for $f : g(D) \to \Re$ integrable, $(f \circ g) \, |\det J_g|$ is also integrable and

$$\int_{g(D)} f = \int_D (f \circ g) \, |\det J_g|$$

Proof 16.2.3
We assume $f$ is integrable on $g(D)$. Let $R$ be a rectangle containing $g(D)$ and $P$ be a partition of $g(D)$ using this rectangle. For any rectangle $S$ determined by $P$, define the constant function $h_S$ by $h_S = \inf_{x \in S} f(x)$. Then

$$L(f, P) = \sum_{S \in P} \Big( \inf_{x \in S} f(x) \Big) V(S) = \sum_{S \in P} h_S \, V(S) = \sum_{S \in P} \int_S h_S$$

Now apply the Change of Variable Theorem for a constant map to the constant function $h_S$. We have

$$\int_S h_S = \int_{g^{-1}(S)} (h_S \circ g) \, |\det J_g|$$


If $x \in g^{-1}(S)$, then $(h_S \circ g)(x) = h_S(g(x))$. Also, if $x \in g^{-1}(S)$, $g(x) \in S$. We conclude

$$(h_S \circ g)(x) = h_S(g(x)) = \inf_{g(x) \in S} f(g(x)) = \inf_{x \in g^{-1}(S)} f(g(x))$$

So

$$\int_{g^{-1}(S)} (h_S \circ g) \, |\det J_g| = \int_{g^{-1}(S)} \Big( \inf_{x \in g^{-1}(S)} f(g(x)) \Big) \, |\det J_g|$$

Combining,

$$L(f, P) = \sum_{S \in P} \int_S h_S = \sum_{S \in P} \int_{g^{-1}(S)} \Big( \inf_{x \in g^{-1}(S)} f(g(x)) \Big) \, |\det J_g| \le \sum_{S \in P} \int_{g^{-1}(S)} f(g(x)) \, |\det J_g|$$

But $\sum_{S \in P} \int_{g^{-1}(S)} = \int_D$. Hence, we have shown $L(f, P) \le \int_D (f \circ g) \, |\det J_g|$. This implies

$$\sup_P L(f, P) \le \int_D (f \circ g) \, |\det J_g|$$

Since we assume $f$ is integrable over $g(D)$, we conclude $\int_{g(D)} f \le \int_D (f \circ g) \, |\det J_g|$.

Since we now know the Change of Variable Theorem holds for constant $f$, we can use a similar argument, with the constant functions $\sup_{x \in S} f(x)$ in place of $h_S$, to show

$$\inf_P U(f, P) \ge \int_D (f \circ g) \, |\det J_g|$$

Thus, $\int_{g(D)} f \ge \int_D (f \circ g) \, |\det J_g|$. But we assume $f$ is integrable on $g(D)$ and therefore, we have shown

$$\int_D (f \circ g) \, |\det J_g| \le \int_{g(D)} f \le \int_D (f \circ g) \, |\det J_g|$$

which shows $\int_{g(D)} f = \int_D (f \circ g) \, |\det J_g|$. We have shown the Change of Variable Theorem holds for a general $f$.
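As a concrete illustration (ours, not in the original text), consider the polar coordinate map $g(r, \theta) = (r \cos\theta, r \sin\theta)$ on $D = (0, 1) \times (0, 2\pi)$. Here $g$ is $1-1$ with

$$J_g = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}, \qquad \det J_g = r\cos^2\theta + r\sin^2\theta = r > 0$$

and $g(D)$ is the open unit disc minus one radius, a set of measure zero. Theorem 16.2.3 then gives, for example with $f(x, y) = 1 - x^2 - y^2$,

$$\int_{g(D)} f = \int_D (f \circ g) \, |\det J_g| = \int_0^{2\pi} \int_0^1 (1 - r^2) \, r \, dr \, d\theta = 2\pi \Big( \frac{1}{2} - \frac{1}{4} \Big) = \frac{\pi}{2}$$

which matches Example 15.4.2. The iterated evaluation here anticipates the Fubini results of the next section.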

16.2.1 Homework

Exercise 16.2.1 Work out the Change of Variable results for polar coordinate transformations.

Exercise 16.2.2 Work out the Change of Variable results for cylindrical coordinate transformations.

Exercise 16.2.3 Work out the Change of Variable results for spherical coordinate transformations.

Exercise 16.2.4 Work out the Change of Variable results for ellipsoidal coordinate transformations.

Exercise 16.2.5 Work out the Change of Variable results for paraboloidal coordinate transformations.


16.3 Fubini Type Results

We all remember how to evaluate double integrals for the area of a region $D$ by rewriting them like this:

$$\int_D F = \int_a^b \int_{g(x)}^{f(x)} F(x, y) \, dy \, dx$$

where $\int_D F$ is the two dimensional integral we have been developing. Expressing this integral in terms of successive single variable integrals leads to what are called iterated integrals, which is how we commonly evaluate them in our calculus classes. Thus, using the notation $\int dx$, $\int dy$, etc. to indicate these one dimensional integrals along one axis only, we want to know when

$$\int_D F = \int \Big( \int F(x, y) \, dy \Big) dx = \int \int F(x, y) \, dy \, dx$$

$$\int_D F = \int \Big( \int F(x, y) \, dx \Big) dy = \int \int F(x, y) \, dx \, dy$$

$$\int_\Omega G = \int \Big( \int \Big( \int G(x, y, z) \, dy \Big) dx \Big) dz = \int \int \int G(x, y, z) \, dy \, dx \, dz$$

$$\int_\Omega G = \int \Big( \int \Big( \int G(x, y, z) \, dy \Big) dz \Big) dx = \int \int \int G(x, y, z) \, dy \, dz \, dx$$

$$\int_\Omega G = \int \Big( \int \Big( \int G(x, y, z) \, dx \Big) dy \Big) dz = \int \int \int G(x, y, z) \, dx \, dy \, dz$$

$$\int_\Omega G = \int \Big( \int \Big( \int G(x, y, z) \, dx \Big) dz \Big) dy = \int \int \int G(x, y, z) \, dx \, dz \, dy$$

$$\int_\Omega G = \int \Big( \int \Big( \int G(x, y, z) \, dz \Big) dx \Big) dy = \int \int \int G(x, y, z) \, dz \, dx \, dy$$

$$\int_\Omega G = \int \Big( \int \Big( \int G(x, y, z) \, dz \Big) dy \Big) dx = \int \int \int G(x, y, z) \, dz \, dy \, dx$$

where $D$ is a bounded open subset of $\Re^2$ and $\Omega$ is a bounded open subset of $\Re^3$. Note in the $\Omega$ integration there are many different ways to organize the integration. If we were doing an integration over $\Re^n$, there would in general be a lot of choices! The conditions under which we can do this give rise to what are called Fubini type theorems. In general, you think of the bounded open set $V$ as being in $\Re^{n+m} = \Re^n \times \Re^m$, label the volume elements as $dV^n$ and $dV^m$, and want to know when

$$\int_V F = \int \Big( \int F(x_1, \ldots, x_n, x_{n+1}, \ldots, x_{n+m}) \, dV^n \Big) dV^m = \int \Big( \int F \, dV^n \Big) dV^m = \int \int F \, dV^n \, dV^m$$

We are only going to work out a few of these type theorems so that you can get a taste for the ways we prove these results.

16.3.1 Fubini on a Rectangle

Let's specialize to the bounded set $D = [a, b] \times [c, d] \subset \Re^2$. Assume $f : D \to \Re$ is continuous. There are three integrals here:

• The integral over the two dimensional set $D$, $\int_D f \, dV^2$, where the label 2 reminds us that this is the two dimensional situation.

• The integral of functions like $g : [a, b] \to \Re$. Assume $g$ is continuous. Then the one dimensional integral would be represented as $\int_a^b g(x) \, dx$ in the integration treatment (Peterson (8) 2019), but here we will use $\int_{[a,b]} g \, dV^1$.

• The integral of functions like $h : [c, d] \to \Re$. Assume $h$ is continuous. Then the one dimensional integral $\int_c^d h(y) \, dy$ is called $\int_{[c,d]} h \, dV^1$. Note although the choice of integration variable $y$ in $\int_c^d h(y) \, dy$ is completely arbitrary, it makes a lot of sense to use this notation instead of $\int_c^d h(x) \, dx$ because we want to cleanly separate what we are doing on each axis from the other axis. So we have to exercise a bit of discipline here to set up the variable names and notation to help us understand.

Now any partition $P$ of $D$ divides the bounding box $R$ we choose to enclose $D$ into rectangles. If $P = P_1 \times P_2$ where $P_1 = \{a = x_0, x_1, \ldots, x_{n-1}, x_n = b\}$ and $P_2 = \{c = y_0, y_1, \ldots, y_{m-1}, y_m = d\}$, the rectangles have the form

$$S_{ij} = [x_i, x_{i+1}] \times [y_j, y_{j+1}]$$

where $\Delta x_i = x_{i+1} - x_i$ and $\Delta y_j = y_{j+1} - y_j$. Thus for an evaluation set $\sigma$ with points $(z_i, w_j) \in S_{ij}$, the Riemann sum is

$$S(f, \sigma, P) = \sum_{(i,j) \in P} f(z_i, w_j) \, \Delta x_i \, \Delta y_j = \sum_{(i,j) \in P} f(z_i, w_j) \, \Delta y_j \, \Delta x_i$$

Note the order of the terms in the area calculation for each rectangle does not matter. Also, we could call $\Delta x_i = \Delta V_{1i}$ (for the first axis) and $\Delta y_j = \Delta V_{2j}$ (for the second axis) and rewrite as

$$S(f, \sigma, P) = \sum_{(i,j) \in P} f(z_i, w_j) \, \Delta V_{1i} \, \Delta V_{2j} = \sum_{(i,j) \in P} f(z_i, w_j) \, \Delta V_{2j} \, \Delta V_{1i}$$

We can also revert to writing $\sum_{(i,j) \in P}$ as a double sum to get

$$S(f, \sigma, P) = \sum_{i=1}^n \Big( \sum_{j=1}^m f(z_i, w_j) \, \Delta V_{2j} \Big) \Delta V_{1i} = \sum_{j=1}^m \Big( \sum_{i=1}^n f(z_i, w_j) \, \Delta V_{1i} \Big) \Delta V_{2j}$$

Then the idea is this:

• y first then x

For $\sum_{i=1}^n \Big( \sum_{j=1}^m f(z_i, w_j) \, \Delta V_{2j} \Big) \Delta V_{1i}$, fix the variables on axis one and suppose

$$\lim_{\|P_2\| \to 0} \sum_{j=1}^m f(z_i, w_j) \, \Delta V_{2j} = \int_{\{z_i\} \times [c,d]} f(z_i, y) \, dy$$

To avoid using specific sizes like $m$ for $P_2$ here, we would usually just say

$$\lim_{\|P_2\| \to 0} \sum_{j \in P_2} f(z_i, w_j) \, \Delta V_{2j} = \int_{\{z_i\} \times [c,d]} f(z_i, y) \, dy$$

Now for this to happen, we need the function $f(x, \cdot)$ to be integrable in $y$ for each choice of $x$ in slot one. An easy way to do this is to assume $f$ is continuous in $(x, y) \in D$, which will force $f(x, \cdot)$ to be a continuous function of $y$ for each $x$. Let's look at this integral more carefully.


We have defined a new function $G(x)$ by

$$G(x) = \int_{[c,d]} f(x, y) \, dy = \int_c^d f(x, y) \, dy$$

If $G(x)$ is integrable on $[a, b]$, we can say

$$\lim_{\|P_1\| \to 0} \sum_{i=1}^n \Big( \lim_{\|P_2\| \to 0} \sum_{j=1}^m f(z_i, w_j) \, \Delta V_{2j} \Big) \Delta V_{1i} = \lim_{\|P_1\| \to 0} \sum_{i=1}^n \Big( \int_{\{z_i\} \times [c,d]} f(z_i, y) \, dy \Big) \Delta V_{1i} = \lim_{\|P_1\| \to 0} \sum_{i=1}^n G(z_i) \, \Delta V_{1i} = \int_a^b G(x) \, dx$$

We can resay this again. Let $\Delta V_{ij}$ be the area $\Delta x_i \, \Delta y_j$. Then

$$\int_D f \, dV^2 = \lim_{\|P_1 \times P_2\| \to 0} \sum_{(i,j) \in P_1 \times P_2} f(z_i, w_j) \, \Delta V_{ij} = \lim_{\|P_1\| \to 0} \sum_{i \in P_1} \Big( \lim_{\|P_2\| \to 0} \sum_{j \in P_2} f(z_i, w_j) \, \Delta V_{2j} \Big) \Delta V_{1i}$$

$$= \lim_{\|P_1\| \to 0} \sum_{i \in P_1} \Big( \int_c^d f(z_i, y) \, dy \Big) \Delta V_{1i} = \int_a^b \Big( \int_c^d f(x, y) \, dy \Big) dx = \int_a^b \int_c^d f(x, y) \, dy \, dx$$

• x first then y

For $\sum_{j=1}^m \Big( \sum_{i=1}^n f(z_i, w_j) \, \Delta V_{1i} \Big) \Delta V_{2j}$, fix the variables on axis two and suppose

$$\lim_{\|P_1\| \to 0} \sum_{i=1}^n f(z_i, w_j) \, \Delta V_{1i} = \int_{[a,b] \times \{w_j\}} f(x, w_j) \, dx$$

Of course, this is the same as

$$\lim_{\|P_1\| \to 0} \sum_{i \in P_1} f(z_i, w_j) \, \Delta V_{1i} = \int_{[a,b] \times \{w_j\}} f(x, w_j) \, dx$$

Now for this to happen, we need the function $f(\cdot, y)$ to be integrable in $x$ for each choice of $y$ in slot two. If we assume $f$ is continuous in $(x, y) \in D$, this will force $f(\cdot, y)$ to be a continuous function of $x$ for each $y$. Let's look at this integral more carefully. We have defined a new function $H(y)$ by

$$H(y) = \int_{[a,b]} f(x, y) \, dx = \int_a^b f(x, y) \, dx$$

If $H(y)$ is integrable on $[c, d]$, we can say

$$\lim_{\|P_2\| \to 0} \sum_{j=1}^m \Big( \lim_{\|P_1\| \to 0} \sum_{i=1}^n f(z_i, w_j) \, \Delta V_{1i} \Big) \Delta V_{2j} = \lim_{\|P_2\| \to 0} \sum_{j=1}^m \Big( \int_{[a,b] \times \{w_j\}} f(x, w_j) \, dx \Big) \Delta V_{2j} = \lim_{\|P_2\| \to 0} \sum_{j=1}^m H(w_j) \, \Delta V_{2j} = \int_c^d H(y) \, dy$$

or

$$\int_D f \, dV^2 = \lim_{\|P_1 \times P_2\| \to 0} \sum_{(i,j) \in P_1 \times P_2} f(z_i, w_j) \, \Delta V_{ij} = \lim_{\|P_2\| \to 0} \sum_{j \in P_2} \Big( \lim_{\|P_1\| \to 0} \sum_{i \in P_1} f(z_i, w_j) \, \Delta V_{1i} \Big) \Delta V_{2j}$$

$$= \lim_{\|P_2\| \to 0} \sum_{j \in P_2} \Big( \int_a^b f(x, w_j) \, dx \Big) \Delta V_{2j} = \int_c^d \Big( \int_a^b f(x, y) \, dx \Big) dy = \int_c^d \int_a^b f(x, y) \, dx \, dy$$

We have laid out an approximate chain of reasoning to prove our first Fubini result. Let's get to it.

Theorem 16.3.1 Fubini's Theorem on a Rectangle: 2D

Let $f : D = [a, b] \times [c, d] \to \Re$ be continuous. Then

$$\int_D f \, dV^2 = \int_a^b \Big( \int_c^d f(x, y) \, dy \Big) dx = \int_a^b \int_c^d f(x, y) \, dy \, dx$$

$$\int_D f \, dV^2 = \int_c^d \Big( \int_a^b f(x, y) \, dx \Big) dy = \int_c^d \int_a^b f(x, y) \, dx \, dy$$

Proof 16.3.1
For $x \in [a, b]$, let $G$ be defined by $G(x) = \int_c^d f(x, y) \, dy$. We need to show $G$ is continuous at each $x$. Note since $f$ is continuous on the compact set $[a, b] \times [c, d]$, it is uniformly continuous there: given $\epsilon > 0$, there is a $\delta > 0$ so that

$$\sqrt{(x' - x)^2 + (y' - y)^2} < \delta \Longrightarrow |f(x, y) - f(x', y')| < \epsilon/(d - c)$$

In particular, taking $y' = y$, this $\delta$ works uniformly in $y$:

$$|x' - x| < \delta \Longrightarrow |f(x', y) - f(x, y)| < \epsilon/(d - c)$$

Thus, if $|x' - x| < \delta$,

$$|G(x') - G(x)| = \Big| \int_c^d \big( f(x', y) - f(x, y) \big) \, dy \Big| \le \int_c^d |f(x', y) - f(x, y)| \, dy \le (d - c) \, \epsilon/(d - c) = \epsilon$$

and so we know $G$ is continuous on $[a, b]$. A similar argument shows the function $H$ defined at each $y \in [c, d]$ by $H(y) = \int_a^b f(x, y) \, dx$ is continuous on $[c, d]$.

Since $f$ is continuous on $[a, b] \times [c, d]$ and the boundary of $D$ has measure zero, we see the set of discontinuities of the extension of $f$ is a set of measure zero and so $f$ is integrable on $D$. Let $P_n = P_{1n} \times P_{2n}$


be a sequence of partitions with $\|P_n\| \to 0$. Then, we know for any evaluation set $\sigma$ of $P_n$, that

$$\int_D f = \lim_{\|P_n\| \to 0} \sum_{i \in P_{1n}} \sum_{j \in P_{2n}} f(z_i, w_j) \, \Delta x_i \, \Delta y_j$$

and, following the reasoning we sketched out earlier, we can choose to organize this as x first then y, or y first then x. We will do the case of x first then y only and leave the details of the other case to you. Fix $\epsilon > 0$. Then there is $N$ so that $n > N$ implies

$$\Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i \Big) \Delta y_j - \int_D f \Big| < \epsilon/2$$

Since $f(x, w_j)$ is continuous in $x$ for each $w_j$, we know $\int_a^b f(x, w_j) \, dx$ exists, and so for any sequence of partitions $Q_n$ of $[a, b]$ with $\|Q_n\| \to 0$, using as the evaluation set the points $z_i$, we have

$$\sum_{i \in Q_n} f(z_i, w_j) \, \Delta x_i \to \int_a^b f(x, w_j) \, dx$$

Since $\|P_n\| \to 0$, so does the sequence $P_{1n}$; hence there is $N_1 > N$ so that $n > N_1$ implies

$$\Big| \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i - \int_a^b f(x, w_j) \, dx \Big| = \Big| \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i - H(w_j) \Big| < \epsilon/(4(d - c))$$

Similarly, since $\|P_{2n}\| \to 0$ and $H$ is integrable, there is $N_2 > N_1 > N$ so that

$$\Big| \sum_{j \in P_{2n}} H(w_j) \, \Delta y_j - \int_c^d H(y) \, dy \Big| < \epsilon/4$$

where the points $w_j$ form the evaluation set we use in each $P_{2n}$. Thus, for $n > N_2$,

$$\Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i \Big) \Delta y_j - \int_c^d H(y) \, dy \Big|$$

$$= \Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i - H(w_j) + H(w_j) \Big) \Delta y_j - \int_c^d H(y) \, dy \Big|$$

$$\le \Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i - H(w_j) \Big) \Delta y_j \Big| + \Big| \sum_{j \in P_{2n}} H(w_j) \, \Delta y_j - \int_c^d H(y) \, dy \Big|$$

$$\le \sum_{j \in P_{2n}} \Big| \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i - H(w_j) \Big| \, \Delta y_j + \Big| \sum_{j \in P_{2n}} H(w_j) \, \Delta y_j - \int_c^d H(y) \, dy \Big|$$

$$< \sum_{j \in P_{2n}} \frac{\epsilon}{4(d - c)} \, \Delta y_j + \epsilon/4 = \epsilon/2$$

We can now complete the argument. For $n > N_2$,

$$\Big| \int_c^d H(y) \, dy - \int_D f \Big| \le \Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i \Big) \Delta y_j - \int_D f \Big| + \Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i \Big) \Delta y_j - \int_c^d H(y) \, dy \Big| < \epsilon/2 + \epsilon/2 = \epsilon$$

Thus

$$\int_D f = \int_c^d H(y) \, dy = \int_c^d \Big( \int_a^b f(x, y) \, dx \Big) dy = \int_c^d \int_a^b f(x, y) \, dx \, dy$$

A very similar argument handles the case y first then x, giving

$$\int_D f = \int_a^b G(x) \, dx = \int_a^b \Big( \int_c^d f(x, y) \, dy \Big) dx = \int_a^b \int_c^d f(x, y) \, dy \, dx$$

As you can see, the argument to prove Theorem 16.3.1 is somewhat straightforward, but you have to be careful to organize the inequality estimates. A similar approach can prove the more general theorem.
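Before stating the general theorem, here is a small Octave sketch (ours, not from the text; the integrand, intervals, and grid sizes are our own choices) checking the two orders of integration numerically for $f(x, y) = x y^2$ on $[0, 1] \times [0, 2]$; the exact value is $(1/2)(8/3) = 4/3$ either way.

% Numerical check of Fubini's Theorem with midpoint sums.
n = 200; m = 300;
dx = 1/n; dy = 2/m;
x = dx*((1:n) - 0.5);                        % midpoints on [0,1]
y = dy*((1:m) - 0.5);                        % midpoints on [0,2]
f = @(x, y) x .* y.^2;
% y first, then x:
G = arrayfun(@(xi) sum(f(xi, y))*dy, x);     % G(x) = integral over y
I1 = sum(G)*dx
% x first, then y:
H = arrayfun(@(yj) sum(f(x, yj))*dx, y);     % H(y) = integral over x
I2 = sum(H)*dy
% I1 and I2 both approach 4/3 = 1.3333...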

Theorem 16.3.2 Fubini's Theorem for a Rectangle: n Dimensional

Let $f : R_1 \times R_2 \to \Re$ be continuous, where $R_1$ is a rectangle in $\Re^n$ and $R_2$ is a rectangle in $\Re^m$. Then, with $D = R_1 \times R_2$,

$$\int_D f \, dV^{n+m} = \int_{R_1} \Big( \int_{R_2} f(x, y) \, dV^m \Big) dV^n = \int_{R_1} \int_{R_2} f(x, y) \, dV^m \, dV^n$$

$$\int_D f \, dV^{n+m} = \int_{R_2} \Big( \int_{R_1} f(x, y) \, dV^n \Big) dV^m = \int_{R_2} \int_{R_1} f(x, y) \, dV^n \, dV^m$$

where $x$ denotes the variables $x_1, \ldots, x_n$ from $\Re^n$ and $y$ denotes the variables $x_{n+1}, \ldots, x_{n+m}$.

Proof 16.3.2
For example, $f : [a_1, b_1] \times [a_2, b_2] \times [a_3, b_3] \times [a_4, b_4] \to \Re$ would use $R_1 = [a_1, b_1] \times [a_2, b_2]$ and $R_2 = [a_3, b_3] \times [a_4, b_4]$. The proof here is quite similar. We prove $G(x)$ and $H(y)$ are continuous since $f$ is continuous. Then, since $f$ is integrable, given any sequence of partitions $P_n$ with $\|P_n\| \to 0$, there is $N$ so that $n > N$ implies

$$\Big| \sum_{j \in P_{2n}} \Big( \sum_{i \in P_{1n}} f(z_i, w_j) \, \Delta x_i \Big) \Delta y_j - \int_D f \Big| < \epsilon/2$$

where $P_n = P_{1n} \times P_{2n}$ as usual, except $P_{1n}$ is a partition of $R_1$ and $P_{2n}$ is a partition of $R_2$. The summations like $\sum_{j \in P_{2n}}$ are still interpreted in the same way. The rest of the argument is straightforward.

Now let's specialize to some useful two dimensional situations. Consider the situation shown in Figure 16.1.

Theorem 16.3.3 Fubini’s Theorem: 2D: a top and bottom curve


Figure 16.1: Fubini’s Theorem in 2D: a top and a bottom curve

Let $f$ and $g$ be continuous real valued functions on $[a, b]$ with $g \le f$, and let $F : D \to \Re$ be continuous on $D$ where $D = \{(x, y) : a \le x \le b, \ g(x) \le y \le f(x)\}$. Then

$$\int_D F \, dV^2 = \int_a^b \int_{g(x)}^{f(x)} F(x, y) \, dy \, dx$$

Proof 16.3.3
Let's apply Theorem 16.3.2 on the rectangle $[a, b] \times [g_m, f_M]$ where $g_m$ is the minimum value of $g$ on $[a, b]$, which we know exists because $g$ is continuous on a compact domain. Similarly, $f_M$ is the maximum value of $f$ on $[a, b]$, which also exists. Then the $D$ shown in Figure 16.1 is contained in $[a, b] \times [g_m, f_M]$. We see the extension of $F$ by zero is continuous on this rectangle except possibly on the curves given by $f$ and $g$. We also know the set $S_f = \{(x, f(x)) \mid a \le x \le b\}$ and the set $S_g = \{(x, g(x)) \mid a \le x \le b\}$ both have measure zero in $\Re^2$. Thus $F$ has a discontinuity set of measure zero and so $F$ is integrable on $[a, b] \times [g_m, f_M]$, and this integral matches the integral of $F$ on $D$.

Next, since $F$ is continuous on the interior of $D$, for fixed $x$ in $[a, b]$, $G(y) = \hat{F}(x, y)$ is continuous on $[g(x), f(x)]$. Of course, this function need not be continuous at the endpoints, but that is a set of measure zero in $\Re^1$ anyway. Hence, $G$ is integrable on $[g_m, f_M]$ and hence $G$ is integrable on $[g(x), f(x)]$. By Theorem 16.3.2, we conclude

$$\int_{[a,b] \times [g_m, f_M]} \hat{F} \, dV^2 = \int_D F \, dV^2 = \int_a^b \int_{g_m}^{f_M} G(y) \, dy \, dx = \int_a^b \int_{g(x)}^{f(x)} F(x, y) \, dy \, dx$$

Next, look at a typical closed curve situation.

Theorem 16.3.4 Fubini’s Theorem: 2D: a top and bottom closed curve


Figure 16.2: Fubini’s Theorem in 2D: a top and bottom closed curve

Let $f$ and $g$ be continuous real valued functions on $[a, b]$ which form the bottom and top halves of a closed curve, and let $F : D \to \Re$ be continuous on $D$ where $D = \{(x, y) : a \le x \le b, \ g(x) \le y \le f(x)\}$. Then

$$\int_D F \, dV^2 = \int_a^b \int_{g(x)}^{f(x)} F(x, y) \, dy \, dx$$

Proof 16.3.4
From Figure 16.2, we see the closed curve defines a box $[a, b] \times [g_m, f_M]$ where $g_m$ is the minimum value of $g$ and $f_M$ is the maximum value of $f$ on $[a, b]$. These exist because $f$ and $g$ are continuous on the compact domain $[a, b]$. The rest of the argument is exactly the same as the proof of Theorem 16.3.3. We have

$$\int_{[a,b] \times [g_m, f_M]} \hat{F} \, dV^2 = \int_D F \, dV^2 = \int_a^b \int_{g_m}^{f_M} G(y) \, dy \, dx = \int_a^b \int_{g(x)}^{f(x)} F(x, y) \, dy \, dx$$

It is easy to see we could prove similar theorems for a region $D = \{(x, y) \mid c \le y \le d, \ g(y) \le x \le f(y)\}$ for functions $f$ and $g$ continuous on $[c, d]$ and $F$ continuous on $D$. We would find

$$\int_{[g_m, f_M] \times [c, d]} \hat{F} \, dV^2 = \int_D F \, dV^2 = \int_c^d \int_{g(y)}^{f(y)} F(x, y) \, dx \, dy$$

Call the top-bottom region a TB region and the left-right region a LR region. Then you should be able to see how you can break a lot of regions into finite combinations of TB and LR regions and apply Fubini's Theorem to each piece to complete the integration.

Comment 16.3.1 We can work out similar theorems in $\Re^3$, $\Re^4$ and so forth. Naturally, this process gets complicated as the dimension increases. Still, the principles of the arguments are the same. The boundary surfaces are all of measure zero as they are lower dimensional pieces of $\Re^n$. You should try a few to see how the arguments are constructed. This is how you learn how things work: by doing.


16.3.2 Homework

Exercise 16.3.1 Prove Fubini's Theorem for rectangles in $\Re^3$. It is enough to prove it for one of the six cases.

Exercise 16.3.2 Since we can approximate these integrals using Riemann sums, can you estimate how computationally expensive this gets using uniform partitions? Would it be easy to compute a Riemann approximation to a 10 dimensional integral over a rectangle?

Exercise 16.3.3 This problem shows you how we can use these ideas to do something interesting!

1. Compute the two dimensional integral $I_R = \int_{D_R} e^{-x^2 - y^2}$, where $D_R = B(R, (0, 0))$ is the disc of radius $R$ centered at the origin, for all $R$ by using the polar coordinate transformation. Can you compute $\lim_{R \to \infty} I_R$?

2. Let $J_R = \int_{S_R} e^{-x^2 - y^2} \, dx \, dy$, where $S_R$ is the square $[-R, R] \times [-R, R]$. Prove $\lim_{R \to \infty} J_R = \lim_{R \to \infty} I_R$.

3. What is the value of $\int_{-\infty}^{\infty} e^{-x^2} \, dx$?

Exercise 16.3.4 Draw a bounded 2D region in $\Re^2$ which is constructed from two TBs and one LR and convince yourself how Fubini's Theorem is applied to compute the integral.

Exercise 16.3.5 Draw an arbitrary bounded region in $\Re^2$ and decompose it into appropriate TB and LR regions on which Fubini's Theorem can be applied and convince yourself how the total integral is computed.


Chapter 17

Line Integrals

It is now time to set the stage for the connection between a type of one dimensional integral and an equivalent two dimensional integral. This is challenging material and it has deep meaning. However, we will start out slowly in this chapter and show you a low level version of what it all means. Let's start with the idea of a path or curve in $\Re^2$.

17.1 Paths

Definition 17.1.1 Two Dimensional Curves

Let $[a, b]$ be a finite interval of real numbers. Let $f$ and $g$ be two continuously differentiable functions on $[a, b]$. The path defined by these functions is the set of ordered pairs

$$C = \{(x, y) \mid x = f(t), \ y = g(t), \ a \le t \le b\}$$

If the starting and ending point of the curve are the same, we say the curve is closed.

Comment 17.1.1 We often abuse this notation and talk about the path $C$ being given by the pairs $(x, y)$ with

$$x = x(t), \quad y = y(t), \quad a \le t \le b$$

even though the letters $x$ and $y$ are being used in two different ways. In practice, it is not that hard to keep the two uses separate.

The curve $C$ has a tangent line defined at each point $t_0$ given by

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x(t_0) \\ y(t_0) \end{bmatrix} + c \begin{bmatrix} x'(t_0) \\ y'(t_0) \end{bmatrix}$$

and has a tangent vector $T(t_0)$ given by

$$T(t_0) = \begin{bmatrix} x'(t_0) \\ y'(t_0) \end{bmatrix}$$

and an associated vector perpendicular to $T(t_0)$ given by

$$N(t_0) = \begin{bmatrix} -y'(t_0) \\ x'(t_0) \end{bmatrix}$$


You can see $\langle T(t_0), N(t_0) \rangle = 0$. Also note

$$N^*(t_0) = \begin{bmatrix} y'(t_0) \\ -x'(t_0) \end{bmatrix}$$

is also perpendicular to $T(t_0)$. Both choices for this perpendicular vector are called the normal vector to the curve $C$ at $t_0$. Using the right hand rule, we can calculate $T(t_0) \times N(t_0)$. This will either be $k$ or $-k$, depending on whether, to get to $N(t_0)$ from $T(t_0)$, we need to move our right hand counterclockwise (ccw) (this gives $+k$) or clockwise (cw) (this gives $-k$). Look at Figure 17.1 to get a feel for this. We are deliberately showing you hand drawn figures as that is what we would do in a lecture and what you should learn how to do on your scratch paper as you read this.

Figure 17.1: A Simple Closed Curve with Tangent and Normal Vectors
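As a quick illustration (ours, not in the original text), take the unit circle traversed ccw: $x(t) = \cos(t)$, $y(t) = \sin(t)$ for $0 \le t \le 2\pi$. Then

$$T(t) = \begin{bmatrix} -\sin(t) \\ \cos(t) \end{bmatrix}, \qquad N(t) = \begin{bmatrix} -\cos(t) \\ -\sin(t) \end{bmatrix}$$

and $N(t)$ points from the point $(\cos(t), \sin(t))$ back toward the origin; that is, with ccw traversal, $N$ points to the interior of the curve, just as the discussion below describes.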

In Figure 17.2, we show a self intersecting curve like a figure eight. Note in the right half, the tangent vector moves ccw and in the left half, the tangent vector moves cw. We indicate the direction of both $N$ and $N^* = -N$ also. Note, if $T$ is chosen, we cannot decide which of $N$ or $-N$ to use. The point $Q$ here indicates where the tangent vector switches from moving ccw to moving cw.

Figure 17.2: A Figure Eight Curve with Tangent and Normal Vectors

Then in Figure 17.3, we show the same figure eight curve with what we would clearly decide was the inside and the outside of the curve. Now the curve $C$ is defined by the two functions $x(t)$ and $y(t)$, which are the components of a vector function we can call $\gamma$. Thus, we can identify the curve $C$ with the range of the vector function $\gamma$. Hence, in Figure 17.3, instead of labeling the curve as $C$ we call it $\gamma$.

Figure 17.3: The Inside and Outside of a Curve

We use the $\gamma$ notation with some changes in definition in Chapter 18, where we discuss another way of looking at all of this using the language of differential forms. And, thinking more carefully about insides and outsides of curves will lead us into ideas from homology. So there is much to do! But for right now, we can already see it is not so easy to decide how to pick the normal: should it be $N$ or $-N$? It appears it should be $N$ if $T$ is moving ccw and $-N$ if $T$ is moving cw. Now, scribble out a picture of a curve which has two self intersections, then four self intersections, with concomitant changes in ccw and cw movement of $T$ as the parameter $t$ increases. We want to believe, based on simple examples, that each closed curve $\gamma$ divides $\Re^2$ into three disjoint pieces: the interior, the exterior and the curve itself. Hence, if we want to point towards the interior of the closed curve, we should use $N(t)$ if a point on the curve is moving ccw and otherwise use $-N(t)$. Also, note we can choose $\pm N$ and $T$ to be unit vectors by simply dividing them by their lengths $\|T(t)\|$ and $\|N(t)\|$, respectively. The only time this becomes impossible is when these norms are zero, which only occurs at a point $t$ where both $x'(t) = 0$ and $y'(t) = 0$. We can usually figure out a way to define a nice tangent and normal anyway, but we won't complicate the discussion here with that. Now assume the functions $x(t)$ and $y(t)$ are continuously differentiable on $[a, b]$, and consider a vector field

$$F = \begin{bmatrix} A(x, y) \\ B(x, y) \end{bmatrix}$$

defined on an open set $U$ of $\Re^2$ which contains the curve $C$. We define the action of $F$ along $C$ to be the line integral of $F$ along $C$ from the starting point of the curve to its ending point using a very specific Riemann integral.

]defined on a open set U of <2 which contains the curve C . We define the action of F along C to bethe line integral of F along C from the starting point of the curve to its ending point using a veryspecific Riemann integral.

Definition 17.1.2 The Line Integral of a Force Field along a Curve


Let the vector field $F$ be defined on an open set $U$ in $\Re^2$ by

$$F = \begin{bmatrix} A(x, y) \\ B(x, y) \end{bmatrix}$$

where $A$ and $B$ are at least continuous at each point on $C$. Then the line integral of $F$ along $C$ from $P = (x(a), y(a))$ to $Q = (x(b), y(b))$ is defined by

$$\int_C \Big\langle F, \begin{bmatrix} x' \\ y' \end{bmatrix} \Big\rangle \, dt = \int_C \Big\langle \begin{bmatrix} A(x, y) \\ B(x, y) \end{bmatrix}, \begin{bmatrix} x'(t) \\ y'(t) \end{bmatrix} \Big\rangle \, dt = \int_a^b \big( A(x(t), y(t)) \, x'(t) + B(x(t), y(t)) \, y'(t) \big) \, dt$$

Since $\Phi(t) = A(x(t), y(t)) \, x'(t) + B(x(t), y(t)) \, y'(t)$ is continuous on $[a, b]$, this Riemann integral exists and so the line integral is well defined.
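For a concrete computation (our illustration, not in the original text), take $F$ with $A(x, y) = -y$ and $B(x, y) = x$, and let $C$ be the unit circle $x(t) = \cos(t)$, $y(t) = \sin(t)$, $0 \le t \le 2\pi$. Then

$$\int_C \Big\langle F, \begin{bmatrix} x' \\ y' \end{bmatrix} \Big\rangle \, dt = \int_0^{2\pi} \big( (-\sin(t))(-\sin(t)) + \cos(t)\cos(t) \big) \, dt = \int_0^{2\pi} 1 \, dt = 2\pi$$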

Example 17.1.1 Suppose $F$ represents the electric field at each point of $C$. How do we compute the total work done in moving a test charge from the beginning point of the curve to the ending point? First, note the function

$$\Phi(t) = A(x(t), y(t)) \, x'(t) + B(x(t), y(t)) \, y'(t)$$

is a continuous function on $[a, b]$ and so is Riemann integrable. Hence, the value of this integral is given by

$$\lim_{n \to \infty} S(\Phi, \pi_n, \sigma_n)$$

for any sequence of partitions $\pi_n$ whose norms go to zero, where $\sigma_n$ is any evaluation set in $\pi_n$. Then, for a given partition, the work done on the test charge along the portion of the path between partition points $t_j$ and $t_{j+1}$ can be approximated by

$$A(x(z_j), y(z_j)) \, (x(t_{j+1}) - x(t_j)) + B(x(z_j), y(z_j)) \, (y(t_{j+1}) - y(t_j))$$

$$= \Big( A(x(z_j), y(z_j)) \, \frac{x(t_{j+1}) - x(t_j)}{t_{j+1} - t_j} + B(x(z_j), y(z_j)) \, \frac{y(t_{j+1}) - y(t_j)}{t_{j+1} - t_j} \Big) (t_{j+1} - t_j)$$

for any choice of $z_j$ in $[t_j, t_{j+1}]$. Now apply the Mean Value Theorem to see

$$\Big( A(x(z_j), y(z_j)) \, \frac{x(t_{j+1}) - x(t_j)}{t_{j+1} - t_j} + B(x(z_j), y(z_j)) \, \frac{y(t_{j+1}) - y(t_j)}{t_{j+1} - t_j} \Big) (t_{j+1} - t_j) = \Big( A(x(z_j), y(z_j)) \, x'(s_j) + B(x(z_j), y(z_j)) \, y'(s_j) \Big) (t_{j+1} - t_j)$$

The approximation to the work on the portion $[t_j, t_{j+1}]$ can just as well be done setting $z_j = s_j$ in the $A$ and $B$ terms. Thus, the approximation can be

$$\Phi(s_j) \, (t_{j+1} - t_j) = \Big( A(x(s_j), y(s_j)) \, x'(s_j) + B(x(s_j), y(s_j)) \, y'(s_j) \Big) (t_{j+1} - t_j)$$

The Riemann Approximation Theorem then tells us

$$\lim_{n \to \infty} \sum_{\pi_n} \Phi(s_j) \, \Delta t_j = \int_a^b \Phi(t) \, dt$$


which is how we defined the line integral.

Comment 17.1.2 In the same setting as the previous example, we could find the work done along aclosed path C . The same line integral would represent this work.

Example 17.1.2 What if the mass density at a point on the string represented by the curve $C$ is given by the scalar $m(x(t), y(t))$? The mass of the piece of the string corresponding to the subinterval $[t_j, t_{j+1}]$ for a given partition $\pi$ of $[a, b]$ can be approximated by

$$M_j = m(x(s_j), y(s_j)) \sqrt{(x(t_{j+1}) - x(t_j))^2 + (y(t_{j+1}) - y(t_j))^2}$$

$$= m(x(s_j), y(s_j)) \sqrt{\Big( \frac{x(t_{j+1}) - x(t_j)}{t_{j+1} - t_j} \Big)^2 + \Big( \frac{y(t_{j+1}) - y(t_j)}{t_{j+1} - t_j} \Big)^2} \ (t_{j+1} - t_j)$$

From the Mean Value Theorem, we know there is a $z_j$ between $t_j$ and $t_{j+1}$ so that

$$\begin{bmatrix} x(t_{j+1}) - x(t_j) \\ y(t_{j+1}) - y(t_j) \end{bmatrix} = (t_{j+1} - t_j) \begin{bmatrix} x'(z_j) \\ y'(z_j) \end{bmatrix}$$

Thus, we have

$$M_j = m(x(s_j), y(s_j)) \sqrt{(x'(z_j))^2 + (y'(z_j))^2} \ (t_{j+1} - t_j)$$

Choosing $s_j = z_j$ always, we find the Riemann sums here are

$$S(\Phi, \pi, \sigma) = \sum_{\pi} m(x(z_j), y(z_j)) \sqrt{(x'(z_j))^2 + (y'(z_j))^2} \ (t_{j+1} - t_j)$$

where $\Phi(t) = m(x(t), y(t)) \sqrt{x'(t)^2 + y'(t)^2}$, which is a nice continuous function on $[a, b]$. Hence, the Riemann integral $\int_a^b \Phi(t) \, dt$ exists and can be used to define the mass of the curve. Note, this argument, while similar to the one we used for the line integrals, is not the same, as $m(x, y)$ is not a vector function.

17.1.1 Homework

Exercise 17.1.1

Exercise 17.1.2

Exercise 17.1.3

Exercise 17.1.4

Exercise 17.1.5

17.2 Conservative Force Fields

Let's restrict our attention to nice closed curves. We want them to be simple, which means they do not have self intersections, so the figure eight type curves are not allowed. We also want them to have finite length, as we can calculate from the usual arc length formulae. Recall the length of the curve $C$ from $a$ to $b$ is

$$L_a^b = \int_a^b \sqrt{x'(t)^2 + y'(t)^2} \, dt$$


Since we assume $x(t)$ and $y(t)$ are continuously differentiable on $[a, b]$, we know the integrand is continuous on $[a, b]$ and hence it is bounded by some $B > 0$. Thus, $L_a^b \le B(b - a)$. So our curves do have finite arc length. Such curves are called rectifiable. We also want the normal vector to be uniquely defined so it points towards the interior of the closed curve. We usually think of the tangent vector as moving ccw here, so the normal we pick is always $N$ instead of $-N$. This kind of curve is called orientable. Together, using the first letters of these properties, we want to consider SCROCs: i.e. simple, closed, rectifiable and orientable curves. So what is a conservative force field $F$?

Definition 17.2.1 Conservative Force Field

Let the vector field $F$ be defined on an open set $U$ in $\Re^2$ by

$$F = \begin{bmatrix} A(x, y) \\ B(x, y) \end{bmatrix}$$

where $A$ and $B$ are at least continuous at each point in $U$. Let $C$ be a SCROC corresponding to the vector function on $[a, b]$ (this domain is dependent on the choice of $C$)

$$\gamma = \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}$$

We say $F$ is a conservative force field if $\int_C \langle F, \gamma' \rangle \, dt = 0$ for all SCROCs $C$.

Comment 17.2.1 Let the start point be $P$ and the endpoint be $Q$. Let $C_1$ be a path that goes from $P$ to $Q$ and $C_2$ be a path that moves in the reverse direction, from $Q$ back to $P$. Let's also assume the path $C$ obtained by traversing $C_1$ followed by $C_2$ is a SCROC. The paths are associated with vector functions $\gamma_1$ and $\gamma_2$. We would write this as $C = C_1 + C_2$ for convenience. Then, if $F$ is conservative, we have

$$\int_C \langle F, \gamma' \rangle \, dt = \int_{C_1} \langle F, \gamma_1' \rangle + \int_{C_2} \langle F, \gamma_2' \rangle = 0$$

This implies

$$\int_{C_1} \langle F, \gamma_1' \rangle = \int_{-C_2} \langle F, \gamma_2' \rangle$$

where the integration over $-C_2$ simply indicates we are moving on $C_2$ backwards. This says the path we take from $P$ to $Q$ doesn't matter. In other words, the value we obtain by the line integral depends only on the start and final point. We call this property path independence of the line integral.

Let's assume $F$ is a conservative force field. Let the start point be $P = (x_s, y_s)$ and the endpoint be $Q = (x_e, y_e) = (x_s + h, y_s + h)$. Since $P$ is an interior point of $U$, there is a ball of radius $r > 0$ about $P$ which is in $U$, so we can pick an endpoint $Q$ like this in this ball. You can look at Figure 17.4 to see the kind of paths we are going to use in the argument below. Since the value of the line integral is independent of path, look at the paths

$$C_1 : \gamma_1 = \begin{bmatrix} x_1(t) = x_s + th \\ y_1(t) = y_s \end{bmatrix}, \quad 0 \le t \le 1$$

$$C_2 : \gamma_2 = \begin{bmatrix} x_2(t) = x_s + h \\ y_2(t) = y_s + th \end{bmatrix}, \quad 0 \le t \le 1$$


Figure 17.4: A Closed Path to prove Ay = Bx

Then, we can calculate the line integrals $\int_{C_1}$ and $\int_{C_2}$ to find

$$\int_{C_1} \langle F, \gamma_1' \rangle + \int_{C_2} \langle F, \gamma_2' \rangle = \int_0^1 A(x_s + th, y_s) \, h \, dt + \int_0^1 B(x_e, y_s + th) \, h \, dt$$

Now we can go back to $P$ along the paths

$$C_3 : \gamma_3 = \begin{bmatrix} x_3(t) = x_e - th = x_s + (1 - t)h \\ y_3(t) = y_e \end{bmatrix}, \quad 0 \le t \le 1$$

$$C_4 : \gamma_4 = \begin{bmatrix} x_4(t) = x_s \\ y_4(t) = y_e - th = y_s + (1 - t)h \end{bmatrix}, \quad 0 \le t \le 1$$

Then, we can calculate the line integrals $\int_{C_3}$ and $\int_{C_4}$ to find

$$\int_{C_3} \langle F, \gamma_3' \rangle + \int_{C_4} \langle F, \gamma_4' \rangle = \int_0^1 A(x_e - th, y_e)(-h) \, dt + \int_0^1 B(x_s, y_e - th)(-h) \, dt$$

Because of path independence, we have $\int_{C_1} + \int_{C_2} = -\int_{C_3} - \int_{C_4}$ and so

$$\int_0^1 A(x_s + th, y_s) \, h \, dt + \int_0^1 B(x_e, y_s + th) \, h \, dt = -\int_0^1 A(x_e - th, y_e)(-h) \, dt - \int_0^1 B(x_s, y_e - th)(-h) \, dt$$

Thus,

$$h \int_0^1 \big( A(x_s + th, y_s) - A(x_e - th, y_e) \big) \, dt = h \int_0^1 \big( B(x_s, y_e - th) - B(x_e, y_s + th) \big) \, dt$$

Hence, cancelling the common $h$,

$$\int_0^1 \big( A(x_s + th, y_s) - A(x_e - th, y_e) \big) \, dt = \int_0^1 \big( B(x_s, y_e - th) - B(x_e, y_s + th) \big) \, dt$$

Next, let $h \to 0$:

$$\lim_{h \to 0} \int_0^1 \big( A(x_s + th, y_s) - A(x_e - th, y_e) \big) \, dt = \lim_{h \to 0} \int_0^1 \big( B(x_s, y_e - th) - B(x_e, y_s + th) \big) \, dt$$

The integrals are continuous with respect to $h$ and so

$$\int_0^1 \big( A(x_s, y_s) - A(x_s, y_e) \big) \, dt = \int_0^1 \big( B(x_s, y_s) - B(x_e, y_s) \big) \, dt$$

If we assume $A$ and $B$ have first order partials, we can expand $A(x_s, y_s) - A(x_s, y_e) = -A_y(x_s, y_s) h - E_A(x_s, y_s, h)$ and $B(x_s, y_s) - B(x_e, y_s) = -B_x(x_s, y_s) h - E_B(x_s, y_s, h)$, where the error terms satisfy $E_A(x_s, y_s, h)/h \to 0$ and $E_B(x_s, y_s, h)/h \to 0$ as $h \to 0$. Then

$$\int_0^1 \big( -A_y(x_s, y_s) h - E_A(x_s, y_s, h) \big) \, dt = \int_0^1 \big( -B_x(x_s, y_s) h - E_B(x_s, y_s, h) \big) \, dt$$

Thus, dividing by $-h$,

$$A_y(x_s, y_s) + \int_0^1 \frac{E_A(x_s, y_s, h)}{h} \, dt = B_x(x_s, y_s) + \int_0^1 \frac{E_B(x_s, y_s, h)}{h} \, dt$$

Now let $h \to 0$ again to obtain the final result: $A_y(x_s, y_s) = B_x(x_s, y_s)$. We can state this as an important result.

Theorem 17.2.1 Conservative Force Fields Imply $A_y = B_x$

Let the conservative vector field $F$ be defined on an open set $U$ in $\Re^2$ by

$$F = \begin{bmatrix} A(x, y) \\ B(x, y) \end{bmatrix}$$

where $A$ and $B$ have first order partials at each point in $U$. Then $A_y = B_x$ in $U$.

Proof 17.2.1
We have just gone through this argument.
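As a quick use of this necessary condition (our illustration, not from the text): for $F$ with $A(x, y) = -y$ and $B(x, y) = x$, we have $A_y = -1$ and $B_x = 1$, so $F$ cannot be conservative; indeed, we computed $\int_C \langle F, \gamma' \rangle \, dt = 2\pi \ne 0$ around the unit circle earlier. On the other hand, $F$ with $A(x, y) = 2xy$ and $B(x, y) = x^2$ passes the test since $A_y = 2x = B_x$.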

17.2.1 Homework

Exercise 17.2.1

Exercise 17.2.2

Exercise 17.2.3

Exercise 17.2.4


Exercise 17.2.5

17.3 Potential Functions

To understand this at a deeper level, we need to look at connectedness more. Let's introduce the idea of a set $U$ being path connected.

Definition 17.3.1 Path Connected Sets

The set $V$ in $\Re^2$ is path connected if given $P$ and $Q$ in $V$, there is a path $C$ connecting $P$ to $Q$.

Comment 17.3.1 It is not true that a connected set has to be path connected. There is a standard counterexample of this which you should see. We need to extend connectedness to closed sets.

Definition 17.3.2 Connected Closed Sets

The closed set $C$ in $\Re^2$ is connected if we cannot find two open sets $A$ and $B$ so that $C = A' \cup B'$ with $A' \cap B = \emptyset$ and $B' \cap A = \emptyset$.

To find our counterexample, consider

$$S = \{(x, y) \in \Re^2 \mid y = \sin(1/x), \ x > 0\} \cup (\{0\} \times [-1, 1])$$

Let $S^+ = \{(x, y) \in \Re^2 \mid y = \sin(1/x), \ x > 0\}$ and $S^0 = \{0\} \times [-1, 1]$. It is easy to see any two points in $S^+$ can be connected by a path. Just parameterize the portion of the curve $y = \sin(1/x)$ that connects the two points: $x(t) = t$ and $y(t) = \sin(1/t)$ for suitable $[a, b]$. Now take a point in $S^+$ and a point in $S^0$ and try to find a path that connects the two points. There is no way a path that begins in $S^+$ can jump to $S^0$. Hence, the set $S$ is not path connected. If you think about it though, $S$ itself is connected.

Since the set of cluster points of $\sin(1/x)$ at $0$ is all of $[-1, 1]$, the closure of $S^+$ must be all of $S$. Now if $S$ were not connected, we could write $S = A' \cup B'$ where $A$ and $B$ are open sets with $A' \cap B = \emptyset$ and $B' \cap A = \emptyset$. But then $A \cap S$ and $B \cap S$ give a decomposition of $S$ and, since $S^+$ is connected, we can assume without loss of generality that $B \cap S^+ = \emptyset$. Hence, $S^+ = A \cap S^+ \subset A$, which implies $S \subset A'$. That tells us $S$ does not intersect $B'$ and so $S = A'$, implying $S$ is connected. This argument works in the other case, $A \cap S^+ = \emptyset$, as well. Thus, as $S^+$ is connected, its closure $S$ must be connected also.

So $S$ is a subset of $\Re^2$ which is connected but not path connected.

We can now say much more about conservative force fields. Let's consider a force field F in U and now assume A and B have continuous first order partials with Ay = Bx in U. Let's also assume the open set U is path connected. Pick a point (x0, y0) in U and any other point (x, y) in U. Since Ay = Bx in U, we know the line integrals over paths connecting these two points are independent of path. The point (x, y) is an interior point in U and so there is a radius r > 0 with Br(x, y) in U. For h sufficiently small, the lines connecting (x, y) to (x + h, y) and (x + h, y) to (x + h, y + h) are in U too. Let a curve connecting (x0, y0) to (x, y) be chosen and call it C with corresponding vector function γ. Define the function φ(x, y) = ∫_C <F, γ′> dt as usual. Then since we are free to choose any path to get from (x, y) to (x + h, y + h) we wish, we can consider the new paths

C1: γ1 = [x1(t) = x + th, y1(t) = y], 0 ≤ t ≤ 1
C2: γ2 = [x2(t) = x + h, y2(t) = y + th], 0 ≤ t ≤ 1


We illustrate this in Figure 17.5. Then

Figure 17.5: Defining the function φ on U

lim_{h→0+} [φ(x + h, y) − φ(x, y)]/h = lim_{h→0+} (1/h) ∫_0^1 A(x + th, y) h dt = lim_{h→0+} ∫_0^1 A(x + th, y) dt

Now apply the Mean Value Theorem for integrals to find

lim_{h→0+} [φ(x + h, y) − φ(x, y)]/h = lim_{h→0+} A(x + ch, y)

where c is between 0 and h. Hence, as h → 0+, we find (φx(x, y))+ = A(x, y). A similar argument, with a slightly different picture of course, shows (φx(x, y))− = A(x, y). Hence, φx = A. We can argue in a very similar fashion to show φy = B. We have proven another result. This function φ is called a potential function corresponding to F.

Theorem 17.3.1 The Potential Function for F

Let the vector field F be defined on a path connected open set U in <2 by

F = [A(x, y), B(x, y)]^T

where A and B have continuous first order partials in U. Then F is a conservative force field if and only if there is a function φ on U so that ∇φ = F. Moreover, for any path C with start point P = (xs, ys) and end point Q = (xe, ye),

∫_C <F, γ′> dt = φ(xe, ye) − φ(xs, ys)

Proof 17.3.1
(=⇒): If F is a conservative force field, we know the line integrals are path independent and by the argument above, we can find the desired φ function.

(⇐=): If such a function φ exists, then Ay = Bx and F is a conservative force field.

Finally, if F is conservative,

∫_C <F, γ′> dt = ∫_a^b (A(x(t), y(t)) x′(t) + B(x(t), y(t)) y′(t)) dt
             = ∫_a^b (φx(x(t), y(t)) x′(t) + φy(x(t), y(t)) y′(t)) dt
             = ∫_a^b (φ(x(t), y(t)))′ dt = φ(xe, ye) − φ(xs, ys)
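The recapture formula is also easy to test numerically. The sketch below assumes numpy; the potential φ and the path γ(t) = (t, t³) are hypothetical choices made for illustration.

import numpy as np

phi = lambda x, y: x**2 * y + np.cos(y)   # hypothetical potential
A = lambda x, y: 2 * x * y                # A = phi_x
B = lambda x, y: x**2 - np.sin(y)         # B = phi_y

# A smooth path from (0, 0) to (1, 1): gamma(t) = (t, t**3) on [0, 1].
t = np.linspace(0.0, 1.0, 200001)
x, y = t, t**3
xp, yp = np.ones_like(t), 3 * t**2

line_int = np.trapz(A(x, y) * xp + B(x, y) * yp, t)
print(line_int, phi(1.0, 1.0) - phi(0.0, 0.0))   # both are cos(1) = 0.5403...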

17.3.1 Homework

Exercise 17.3.1

Exercise 17.3.2

Exercise 17.3.3

Exercise 17.3.4

Exercise 17.3.5

17.4 Finding Potential Functions

17.4.1 Homework

Exercise 17.4.1

Exercise 17.4.2

Exercise 17.4.3

Exercise 17.4.4

Exercise 17.4.5

17.5 Green’s Theorem

There is a connection between the line integral over a SCROC and a double integral over the area enclosed by the curve. From our earlier discussions, since curves can be very complicated with their interiors and exteriors not so clear, by restricting our interest to a SCROC, we can look carefully at the meat of the idea and not get lost in extraneous details. Remember this is what we always must do: look for the core of the thing. We start with a very simple SCROC as shown in Figure 17.6. The curve bounds the finite rectangle [a, b] × [c, d], the open set U = (a, b) × (c, d), and we let ∂U, the boundary of U, be given by the curve which traverses the rectangle ccw. Thus, the normal vector always points in.


Figure 17.6: Green’s Theorem for a Rectangle

The curve C = C1 + C2 + C3 + C4 as shown in the Figure. We use very simple definitions for γ1 through γ4 here.

C1: γ1 = (x, c), a ≤ x ≤ b,    C2: γ2 = (b, y), c ≤ y ≤ d,
C3: γ3 = (x, d), x: b → a,     C4: γ4 = (a, y), y: d → c

The line integral over C is then

∫_a^b A(x, c) dx + ∫_c^d B(b, y) dy + ∫_b^a A(x, d) dx + ∫_d^c B(a, y) dy
   = ∫_a^b (A(x, c) − A(x, d)) dx + ∫_c^d (B(b, y) − B(a, y)) dy
   = ∫_a^b ∫_c^d −Ay(x, y) dy dx + ∫_c^d ∫_a^b Bx(x, y) dx dy

This can be rewritten as

∫_C <F, γ′> dt = ∫_a^b ∫_c^d (Bx − Ay) dy dx

This is our first version of this result:

∫_∂U <F, γ′> dt = ∫∫_U (Bx − Ay) dA

where dA is the usual area element notation in a double integral. The open set U here is particularly simple. Next, let's look at a more interesting ∂U. This is a SCROC which encloses an area which can be described either with a left and right curve (so the area is ∫∫ dx dy) or with a bottom and top curve (so the area is ∫∫ dy dx).
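Before tackling the more interesting regions, it is worth checking the rectangle version numerically. The sketch below assumes numpy; the field F = (A, B) on [0, 2] × [0, 1] is a hypothetical choice, and both sides come out to sin(2) − 2.

import numpy as np

A = lambda x, y: x * y**2
B = lambda x, y: np.sin(x) + y
a, b, c, d = 0.0, 2.0, 0.0, 1.0

s = np.linspace(0.0, 1.0, 200001)

# The four edges traversed ccw: bottom, right, top (reversed), left (reversed).
bottom = np.trapz(A(a + (b - a)*s, c) * (b - a), s)
right  = np.trapz(B(b, c + (d - c)*s) * (d - c), s)
top    = np.trapz(A(b + (a - b)*s, d) * (a - b), s)
left   = np.trapz(B(a, d + (c - d)*s) * (c - d), s)
line_integral = bottom + right + top + left

# Double integral of (B_x - A_y) = cos(x) - 2xy over the rectangle.
x = np.linspace(a, b, 801)
y = np.linspace(c, d, 801)
X, Y = np.meshgrid(x, y)
double_integral = np.trapz(np.trapz(np.cos(X) - 2*X*Y, x, axis=1), y)

print(line_integral, double_integral)   # both are sin(2) - 2 = -1.0907...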


Figure 17.7: Green’s Theorem for a SCROC: Right/ Left and Bottom/ Top

The curve C = C1 + C2 as shown in the Figure. We use

C1: γ1 = (xR(y), y), B ≤ y ≤ T,    C2: γ2 = (xL(y), y), y: T → B

The line integral over C is then

∫_B^T A(xR(y), y) x′R(y) dy + ∫_B^T B(xR(y), y) dy − ∫_B^T A(xL(y), y) x′L(y) dy − ∫_B^T B(xL(y), y) dy
   = ∫_B^T A(xR(y), y) x′R(y) dy − ∫_B^T A(xL(y), y) x′L(y) dy + ∫_B^T (B(xR(y), y) − B(xL(y), y)) dy

First, note

∫_B^T (B(xR(y), y) − B(xL(y), y)) dy = ∫_B^T ∫_{xL(y)}^{xR(y)} Bx(x, y) dx dy

Next, consider

∫_B^T A(xR(y), y) x′R(y) dy − ∫_B^T A(xL(y), y) x′L(y) dy = ∫_B^d A(xR(y), y) x′R(y) dy
   + ∫_d^T A(xR(y), y) x′R(y) dy − ∫_B^d A(xL(y), y) x′L(y) dy − ∫_d^T A(xL(y), y) x′L(y) dy

We can regroup this as follows:

∫_B^d (A(xR(y), y) x′R(y) − A(xL(y), y) x′L(y)) dy + ∫_d^T (A(xR(y), y) x′R(y) − A(xL(y), y) x′L(y)) dy

Now when x = xR(y) on [B, d], y = yB(x) on [c, R]; on [d, T], y = yT(x) on [R, c]. Also, when x = xL(y) on [B, d], y = yB(x) on [c, L]; on [d, T], y = yT(x) on [L, c]. Making these changes of variables, we now have

∫_B^T A(xR(y), y) x′R(y) dy − ∫_B^T A(xL(y), y) x′L(y) dy
   = ∫_c^R (A(x, yB(x)) − A(x, yT(x))) dx + ∫_L^c (A(x, yB(x)) − A(x, yT(x))) dx

or

∫_B^T A(xR(y), y) x′R(y) dy − ∫_B^T A(xL(y), y) x′L(y) dy
   = ∫_L^R (A(x, yB(x)) − A(x, yT(x))) dx = ∫_L^R ∫_{yB(x)}^{yT(x)} −Ay(x, y) dy dx

Combining our results, we see we have shown

∫_C <F, γ′> dt = ∫∫_U (Bx − Ay) dA

This is our second version of this result:

∫_∂U <F, γ′> dt = ∫∫_U (Bx − Ay) dA

Also, we would get the same result if we had chosen to move around the boundary of our region using the yB and yT curves, which would have determined γ3 and γ4. For our arguments to work, it was essential that we could invert the equations x = xR(y) and x = xL(y) on suitable intervals. If the region determined by the SCROC doesn't allow us to do this, we have difficulties as we can see in the next example. We now examine a SCROC laid out bottom to top. We show this in Figure 17.8. In this figure, for a change of pace, we label the bottom and top curves fB and fT, respectively. You should get used to using a variety of notations for these kinds of arguments.

Figure 17.8: Green’s Theorem for a SCROC: Bottom Top


The curve C = C1 + C2 as shown in the Figure. We use

C1: γ1 = (x, fB(x)), xL ≤ x ≤ xR,    C2: γ2 = (x, fT(x)), x: xR → xL

The line integral over C is then

∫_{xL}^{xR} A(x, fB(x)) dx + ∫_{xL}^{xR} B(x, fB(x)) f′B(x) dx + ∫_{xR}^{xL} A(x, fT(x)) dx + ∫_{xR}^{xL} B(x, fT(x)) f′T(x) dx
   = ∫_{xL}^{xR} (A(x, fB(x)) − A(x, fT(x))) dx + ∫_{xL}^{xR} B(x, fB(x)) f′B(x) dx − ∫_{xL}^{xR} B(x, fT(x)) f′T(x) dx

The first integral can be rewritten giving

∫_{xL}^{xR} A(x, fB(x)) dx + ∫_{xL}^{xR} B(x, fB(x)) f′B(x) dx + ∫_{xR}^{xL} A(x, fT(x)) dx + ∫_{xR}^{xL} B(x, fT(x)) f′T(x) dx
   = ∫_{xL}^{xR} ∫_{fB(x)}^{fT(x)} −Ay(x, y) dy dx + ∫_{xL}^{xR} B(x, fB(x)) f′B(x) dx − ∫_{xL}^{xR} B(x, fT(x)) f′T(x) dx

However, in the last two integrals above, we can not use the arguments from the last example as we don't know how to invert fB and fT on appropriate intervals. This is why the last example was structured the way it was. We need the inversion information. But, all is not lost. Consider the region shown in Figure 17.9.

Figure 17.9: Green’s Theorem for a more general SCROC

In this more general case, we divide the region into pieces which have the right structure (i.e. a nice xL, xR, yB and yT) so that the inversions can be done. In the figure, you see we define many closed paths which are set up so that the horizontal movements across the figure always cancel out. The lower path and the path above it move in opposite directions and so the net contribution to the line integral is zero. The part that is nonzero is the line integral we want. The second version of Green's Theorem applies to each piece. So in general, we can prove Green's Theorem for any SCROC although it does get quite involved! So let's state Green's Theorem now.

Theorem 17.5.1 Green’s Theorem for a SCROC

Let C be a SCROC enclosing the open set U in <2. Let γ be the vector function used in C and let ∂U denote the boundary of U which is traversed ccw by γ. Then, for any vector field F,

∫_∂U <F, γ′> dt = ∫∫_U (Bx − Ay) dA

where A and B are the components of F.
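A numerical check on a non-rectangular SCROC is also instructive. The sketch below assumes numpy; the curve is the ccw unit circle bounding the unit disk and the field F = (−y³, x³) is a hypothetical choice. Both sides come out to 3π/2.

import numpy as np

A = lambda x, y: -y**3
B = lambda x, y: x**3

# Boundary term: integrate A x' + B y' around the ccw unit circle.
t = np.linspace(0.0, 2*np.pi, 400001)
x, y = np.cos(t), np.sin(t)
line = np.trapz(A(x, y) * (-np.sin(t)) + B(x, y) * np.cos(t), t)

# Area term: B_x - A_y = 3x^2 + 3y^2 = 3r^2, integrated in polar coordinates.
r = np.linspace(0.0, 1.0, 200001)
area = 2 * np.pi * np.trapz(3 * r**2 * r, r)   # the extra r is the area element

print(line, area)   # both are 3*pi/2 = 4.7123...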

17.5.1 Homework

Exercise 17.5.1

Exercise 17.5.2

Exercise 17.5.3

Exercise 17.5.4

Exercise 17.5.5

17.6 Green’s Theorem for Images of the Unit Square

We can also prove a version of Green’s Theorem for mappings that smoothly take [0, 1] × [0, 1] to <2 in a one - to - one way.

Definition 17.6.1 Oriented Two Cells

We assume the coordinates of I2 are (u, v) and the coordinates of the image G(I2) = D are (x, y). We say D in <2 is an oriented two cell if there is a 1-1 continuously differentiable map G : [0, 1] × [0, 1] → D where G is actually defined on an open set U containing [0, 1] × [0, 1] satisfying G([0, 1] × [0, 1]) = D. In addition, we assume det JG(u, v) > 0 on U. For convenience, we let I2 = [0, 1] × [0, 1].

Consider the picture shown in Figure 17.10. We use G1 and G2 as the component functions for G. Thus,

G = [G1, G2]^T =⇒ JG(u, v) = [G1u(u, v)  G1v(u, v); G2u(u, v)  G2v(u, v)]

The edges of I2 are mapped by G into curves in <2 as shown in the figure. The edge corresponding to γ1 is the trace G(u, 0) and the tangent along this trace is

T(u, 0) = [G1u(u, 0)  G1v(u, 0); G2u(u, 0)  G2v(u, 0)] [1, 0]^T = [G1u(u, 0), G2u(u, 0)]^T

Hence, the limiting tangent vector as we approach (1, 0) along G(γ1) is

Tγ1(1, 0) = [G1u(1, 0), G2u(1, 0)]^T


Figure 17.10: An Oriented Two Cell

On the other hand, if we look at the trace for the edge given by γ2, we find

T(1, v) = [G1u(1, v)  G1v(1, v); G2u(1, v)  G2v(1, v)] [0, 1]^T = [G1v(1, v), G2v(1, v)]^T

At the point (1, 0), on the curve G(γ2), we find the limiting tangent vector

Tγ2(1, 0) = [G1v(1, 0), G2v(1, 0)]^T

Note

Tγ1(1, 0) × Tγ2(1, 0) = det JG(1, 0) k = (G1u(1, 0)G2v(1, 0) − G1v(1, 0)G2u(1, 0)) k

and det JG(1, 0) > 0 by assumption. Hence as we rotate Tγ1(1, 0) into Tγ2(1, 0), by the right hand rule the cross product points up. Hence, the curve is moving ccw because of the assumption on the positivity of the determinant of the Jacobian. We can do this sort of analysis at all the vertices we get when the edges in <2 under the map G are joined together. Since det JG(1, 0) > 0, it is not possible for the tangent vectors at the vertices to align as if they did, the determinant would be zero there. Thus, the mapping G will always generate four curves with four vertices where the tangent vectors do not line up. We often call the angles between the limiting value tangent vectors at the vertices the interior angles of D. From what we have just said, since the curve must be moving ccw, these interior angles must be less than π. That rules out some D's we could draw.

To connect the line integrals around ∂I2 to the line integrals around G(∂I2), we define what is called the pullback of F where F is the vector field in the line integral. This pulls F, which acts on G(∂I2) in (x, y) space, back to a function defined on (u, v) space. The pullback of F is denoted by F∗ and is defined by

(F∗)(u, v) = [A(G1(u, v), G2(u, v))  B(G1(u, v), G2(u, v))] [G1u  G1v; G2u  G2v]


where, for convenience, we leave out the (u, v) in the Jacobian. So the line integral is

∫_∂I2 <F∗, γ′> dt = ∫_∂I2 ⟨[A(G1, G2), B(G1, G2)]^T, [G1u  G1v; G2u  G2v] γ′⟩ dt

Now make the change of variables x = G1(u, v) and y = G2(u, v). Then

[dx, dy]^T = [G1u(u, v)  G1v(u, v); G2u(u, v)  G2v(u, v)] [du, dv]^T

Thus, after the change of variables, we have

∫_∂I2 <F∗, γ′> dt = ∫_{G(∂I2)} ⟨[A(x, y), B(x, y)]^T, (G(γ))′⟩ dt

Thus, we know how to define the line integral around the boundary of the image of the unit square under a nice mapping. Green's Theorem applies to the left side, so we have

∫∫_I2 ((A G1v + B G2v)u − (A G1u + B G2u)v) dAuv = ∫_{G(∂I2)} ⟨[A(x, y), B(x, y)]^T, (G(γ))′⟩ dt

Let's simplify the integrand for the ∫∫ over I2. We use A1, A2 etc. to indicate the partials of A with respect to the first and second argument. Also, we must now assume more smoothness of G as we want to take second order partials and we want the mixed order partials to match. So now G has continuous second order partials.

(A G1v + B G2v)u − (A G1u + B G2u)v = (A1 G1u + A2 G2u) G1v + A G1vu + (B1 G1u + B2 G2u) G2v + B G2vu
   − (A1 G1v + A2 G2v) G1u − A G1uv − (B1 G1v + B2 G2v) G2u − B G2uv

Now cancel some terms to get

(A G1v + B G2v)u − (A G1u + B G2u)v = (A1 G1u + A2 G2u) G1v + (B1 G1u + B2 G2u) G2v − (A1 G1v + A2 G2v) G1u − (B1 G1v + B2 G2v) G2u

The terms with A1 and B2 also cancel giving

(A G1v + B G2v)u − (A G1u + B G2u)v = (B1 − A2)(G1u G2v − G1v G2u) = (B1 − A2) det JG

What have we shown? We have this chain of results:

∫∫_{G(I2)} (B1 − A2) dAxy = ∫∫_I2 (B1 − A2) det JG dAuv = ∫∫_I2 ((A G1v + B G2v)u − (A G1u + B G2u)v) dAuv


   = ∫_∂I2 <F∗, γ′> dt = ∫_{G(∂I2)} ⟨[A(x, y), B(x, y)]^T, (G(γ))′⟩ dt

This is the proof for Green’s Theorem for Oriented 2 - Cells. Let’s state it formally.

Theorem 17.6.1 Green’s Theorem for Oriented 2 - Cells

Assume G is a 1-1 map from the oriented 2 cell I2 to G(I2) = D in <2. Assume G has continuous second order partials on an open set U which contains [0, 1] × [0, 1]. Let C be a path around ∂I2 with associated vector function γ which traverses the boundary ccw. Then,

∫_{G(∂I2)} ⟨[A(x, y), B(x, y)]^T, (G(γ))′⟩ dt = ∫∫_{G(I2)} (B1 − A2) dAxy

Proof 17.6.1
We have just gone through this argument.
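The chain of results can be checked numerically for a concrete 2 - cell. The sketch below assumes numpy; the map G and the field F = (A, B) are hypothetical choices, with det JG = 1 − 0.08uv > 0 on I2. We compare the line integral around G(∂I2) with the area integral computed back on I2 using det JG.

import numpy as np

G = lambda u, v: (u + 0.2*v**2, v + 0.1*u**2)   # hypothetical 1-1 map
A = lambda x, y: x * y
B = lambda x, y: x**2 - y

# Boundary: push the four ccw edges of I2 through G and integrate A dx + B dy.
s = np.linspace(0.0, 1.0, 400001)
line = 0.0
for u, v in [(s, 0*s), (1 + 0*s, s), (1 - s, 1 + 0*s), (0*s, 1 - s)]:
    x, y = G(u, v)
    line += np.trapz(A(x, y) * np.gradient(x, s) + B(x, y) * np.gradient(y, s), s)

# Area: (B_1 - A_2)(G(u, v)) det JG(u, v) over the unit square; B_x - A_y = x.
g = np.linspace(0.0, 1.0, 1201)
U, V = np.meshgrid(g, g)
X, Y = G(U, V)
detJ = 1.0 - 0.08 * U * V
area = np.trapz(np.trapz(X * detJ, g, axis=1), g)

print(line, area)   # the two values agree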

Comment 17.6.1 You probably noticed how awkward these arguments sometimes were and there was a considerable amount of manipulation involved. Everything is considerably simplified when we begin using the language of differential forms which is discussed in Chapter 18. If ω is a 1 - form defined on G(I2) and dω is the 2 - form obtained from ω, the above theorem looks like this:

∫_{G(∂I2)} ω = ∫∫_{G(I2)} dω

Further, the integral symbols are simplified a bit: it is understood an ∫ over a two dimensional subset can be handled by ∫∫. So context tells what to do and we just say

∫_{G(∂I2)} ω = ∫_{G(I2)} dω

Comment 17.6.2 When you study one dimensional things, the work is much simpler. You can see how much more complicated our arguments become when we pass to differentiation and integration over portions of <2. Make sure you think about this as it helps you begin to see how to free yourself from one dimensional bias and prepares you for higher dimensions than two. Some of the things we have begun to wrestle with are

• What exactly is a curve in <2? What do we mean by the interior and exterior of a closed curve?

• What do we mean by the smoothness of functions on subsets of <2?

• Clearly, some of what we do with line integrals is a way to generalize the Fundamental Theorem of Calculus. You should see this is both an interesting problem and a difficult one as well.


17.6.1 Homework

Exercise 17.6.1

Exercise 17.6.2

Exercise 17.6.3

Exercise 17.6.4

Exercise 17.6.5

17.7 Motivational Notation

Recall, we often use the notation dx = x′(t) dt as a convenient way to handle substitution in an integration. We can use this idea to repackage the idea of a line integral of a force field along a curve. We know

∫_C ⟨F, [x′, y′]^T⟩ dt = ∫_C ⟨[A(x, y), B(x, y)]^T, [x′(t), y′(t)]^T⟩ dt = ∫_a^b (A(x(t), y(t)) x′(t) + B(x(t), y(t)) y′(t)) dt

Let's rewrite using differential notation. We get

∫_C ⟨F, [x′, y′]^T⟩ dt = ∫_a^b (A(x(t), y(t)) dx + B(x(t), y(t)) dy)

This suggests we combine the vector field and differential notation and define something we will call a 1 - form which has the look

ω = A dx + B dy

and define

∫_C ω = ∫_a^b (A(x(t), y(t)) dx + B(x(t), y(t)) dy)

We define the exterior derivative (more on this later) of ω to be

dω = (B1 −A2) dxdy = (B1 −A2) dAxy

Now go back to the oriented 2 - cell problem. Then,

ω = A du + B dv
∫_{G(I2)} ω = ∫_{G(γ)} (A(u(t), v(t)) du + B(u(t), v(t)) dv)
dω = (B1(u, v) − A2(u, v)) du dv = (B1 − A2) dAuv

The pullback of ω under G is then

G∗(ω) = A(G1, G2)(G1u du + G1v dv) + B(G1, G2)(G2u du + G2v dv)
     = (A(G1, G2) G1u + B(G1, G2) G2u) du + (A(G1, G2) G1v + B(G1, G2) G2v) dv


and so the exterior derivative gives

d(G∗(ω)) = ((A(G1, G2) G1v + B(G1, G2) G2v)1 − (A(G1, G2) G1u + B(G1, G2) G2u)2) du dv

We also define the pullback of the exterior derivative

G∗(dω) = (B1 − A2) det JG du dv

Using the same calculations as before, we then find

d(G∗(ω)) = (B1 −A2) detJG dudv = G∗(dω)

Now go back to our arguments for Green's Theorem for the oriented 2 - cells. Using this new notation, our proof now goes like this:

∫∫_{G(I2)} dω = ∫∫_I2 G∗(dω) = ∫∫_I2 d(G∗(ω)) = ∫_∂I2 G∗(ω) = ∫_{G(∂I2)} ω

We hide the details of the differential area dx dy and so forth. We can also rewrite using context for whether it is a single or double integral and just write

∫_{G(I2)} dω = ∫_I2 G∗(dω) = ∫_I2 d(G∗(ω)) = ∫_∂I2 G∗(ω) = ∫_{G(∂I2)} ω

In the next chapter we will explore differential forms more carefully.
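The key identity d(G∗(ω)) = G∗(dω) can be verified symbolically before we take up forms in earnest. The sketch below assumes sympy; the map G and the 1 - form ω are hypothetical choices, not ones from the text.

import sympy as sp

u, v, x, y = sp.symbols('u v x y')

G1, G2 = u + u*v**2, v + u**2          # hypothetical smooth map G = (G1, G2)
A, B = x**2 * y, x - y**3              # hypothetical omega = A dx + B dy

# Pullback G*(omega) = P du + Q dv with P = A(G) G1u + B(G) G2u, etc.
AG, BG = A.subs({x: G1, y: G2}), B.subs({x: G1, y: G2})
P = AG * sp.diff(G1, u) + BG * sp.diff(G2, u)
Q = AG * sp.diff(G1, v) + BG * sp.diff(G2, v)

# d(G*(omega)) has coefficient Q_u - P_v; G*(d omega) has (B_x - A_y)(G) det JG.
lhs = sp.diff(Q, u) - sp.diff(P, v)
detJG = sp.diff(G1, u)*sp.diff(G2, v) - sp.diff(G1, v)*sp.diff(G2, u)
rhs = (sp.diff(B, x) - sp.diff(A, y)).subs({x: G1, y: G2}) * detJG

print(sp.simplify(lhs - rhs))   # prints 0, confirming d(G*(omega)) = G*(d omega)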

17.7.1 Homework

Exercise 17.7.1

Exercise 17.7.2

Exercise 17.7.3

Exercise 17.7.4

Exercise 17.7.5


Part V

Differential Forms



Chapter 18

Differential Forms

The idea of a differential form is very powerful and one that you are not exposed to in the first courses in analysis. So it is time to rectify that and introduce you to a new and useful way of looking at functions of many variables.

Definition 18.0.1 Tangent Vector Fields

A tangent vector field V on <n is a function V which assigns to each point P ∈ <n a vector vP. In addition, we let <nP = TP(<n) denote the set of all vectors w in <n of the form w = u + P where u in <n is arbitrary.

Comment 18.0.1 The set <nP = TP(<n) is clearly an n dimensional vector space over < which consists of all the translations of vectors in <n by a fixed vector P. We also call this the vector space rooted at P.

We also want to look at smooth functions on an open set U in <n.

Definition 18.0.2 Smooth Functions on <n

A smooth function or C∞ function on the open set U in <n is a mapping f : U → < whose partial derivatives of all orders exist and are continuous.

18.1 One Forms

We define a differential 1 - form on U or simply a 1 - form on U as follows:

Definition 18.1.1 One Forms on an open set U



A differential 1 - form or just 1-form ω on U consists of a pair of smooth functions A and B on U. The one form ω is therefore associated with a vector field

H = [A, B]^T

which means H : U → <2 with smooth component functions. We also write this as

H = A i + B j

where i and j are the standard orthonormal basis for <2. We define the action of ω on a vector V rooted at the point P, where V has components V1 and V2 and P has components P1 and P2, to be

ω(V) = A(P1, P2) V1 + B(P1, P2) V2

and it is easy to see ω is linear.

To understand this better, we need to look at the dual vector space to a vector space.

Definition 18.1.2 Algebraic Dual of a Vector Space

Let V be a vector space over <. The algebraic dual to V is denoted by V∗ and is defined by

V∗ = {f : V → < | f is linear}

Such a function f is called a linear functional on V.

Comment 18.1.1 If the vector space V was also an inner product space and hence had an induced norm, or if V had a norm without an inner product, then we could ask if a linear functional was continuous. Let ‖ · ‖ be this norm. Recall f is continuous at p ∈ V if for any ε > 0, there is a ball Bδ(p) = {x ∈ V : ‖x − p‖ < δ} in V so that |f(x) − f(p)| < ε on Bδ(p). We can not prove this here but we will show you where the proof breaks down. We discuss this properly in (Peterson (9) 2019). Let E = {E1, . . . , En} be a basis for V with norm ‖ · ‖, and we can assume each Ej has norm one. If f is a linear functional on V, let ξj = f(Ej). Then for any x ∈ V, there is a unique representation with respect to this basis

x = c1 E1 + . . . + cn En

and so by the linearity of f

f(x) = c1 f(E1) + . . . + cn f(En) = Σ_{i=1}^n ci ξi

Thus

|f(x) − f(p)| ≤ Σ_{i=1}^n |ci − pi| |ξi| ≤ fE Σ_{i=1}^n |ci − pi|

where fE = max_{1≤i≤n} |ξi| and pi are the components of p with respect to the basis E. So given an ε > 0, we can make |f(x) − f(p)| < ε if we choose Σ_{i=1}^n |ci − pi| < ε/fE. However, this does not


help as

‖x − p‖ = ‖Σ_{i=1}^n (ci − pi) Ei‖ ≤ Σ_{i=1}^n |ci − pi| ‖Ei‖ = Σ_{i=1}^n |ci − pi|

But choosing ‖x − p‖ < δ = ε/fE does not guarantee that Σ_{i=1}^n |ci − pi| < ε/fE, which we need. So we are missing a way to handle the ‖x − p‖ correctly here. To fill this in, we need more details about how normed linear spaces work which we go over in (Peterson (9) 2019). The set of continuous linear functionals on a vector space V is denoted by V′ and it turns out V′ = V∗ when V is a finite dimensional normed linear space.

The vector space X = (<2, {i, j}) is traditionally thought of as column vectors and the dual space to (<2, {i, j}) is then X∗, the set of linear functionals on X, which is the same as X′, the set of continuous linear functionals on X. The dual to (<2, {i, j}) is interpreted as the set of row vectors and a basis dual to any basis E = {E1, E2} is F = {F1, F2} where

F1(E1) = 1, F1(E2) = 0
F2(E1) = 0, F2(E2) = 1

Hence,

F1([V1, V2]^T) = V1,    F2([V1, V2]^T) = V2

Hence, we can identify F1 and F2 as follows:

F1 = [1 0] =⇒ F1([V1, V2]^T) = [1 0][V1, V2]^T = V1
F2 = [0 1] =⇒ F2([V1, V2]^T) = [0 1][V1, V2]^T = V2

The usual notation for the dual basis to {i, j} is not F1 and F2. Instead, it is dx = F1 and dy = F2. Hence, we define the action of a 1-form ω on a vector V rooted at P as follows:

ω([V1, V2]^T) = ([F1 F2] H)([V1, V2]^T) = ([F1 F2][A, B]^T)([V1, V2]^T) = (A F1 + B F2)([V1, V2]^T)
   = A(P1, P2) F1([V1, V2]^T) + B(P1, P2) F2([V1, V2]^T) = A(P1, P2) V1 + B(P1, P2) V2

This is quite a mouthful. It is easier to follow if we use the dx and dy notation and define

ω([V1, V2]^T) = (A dx + B dy)([V1, V2]^T) = A(P1, P2) dx([V1, V2]^T) + B(P1, P2) dy([V1, V2]^T) = A(P1, P2) V1 + B(P1, P2) V2

To be even more succinct, for a vector V with the usual components V1 and V2 rooted at P:

ω(V) = (A dx + B dy)(V) = A(P1, P2) dx(V) + B(P1, P2) dy(V) = A(P1, P2) V1 + B(P1, P2) V2
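This action translates directly into code. The sketch below is a minimal Python illustration; the component functions A and B, the point P, and the vector V are hypothetical choices.

def one_form(A, B):
    # Return the 1-form omega = A dx + B dy as a function of (P, V):
    # omega(V) = A(P1, P2) V1 + B(P1, P2) V2 for V rooted at P.
    def omega(P, V):
        P1, P2 = P
        V1, V2 = V
        return A(P1, P2) * V1 + B(P1, P2) * V2
    return omega

# Example: omega = y dx + x^2 dy on V = (3, -1) rooted at P = (2, 5).
omega = one_form(lambda x, y: y, lambda x, y: x**2)
print(omega((2.0, 5.0), (3.0, -1.0)))   # 5*3 + 4*(-1) = 11.0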


We can interpret these results a bit differently by thinking of the dual space to <2P where we now think of this space as the space of all tangent vectors attached to a point P where P has coordinates (x0, y0). Consider a curve C in <2 given by the smooth curve f(x, y) = c for some constant c. Then, the tangent plane at (x0, y0) is given by

z = f0x (x − x0) + f0y (y − y0)

and this plane has the normal vector N = f0x i + f0y j. Each curve f determines a plane of this type. The simplest plane is the one determined by f(x, y) = x + y = c. The tangent plane is then z = 1(x − x0) + 1(y − y0). Any point in <2P can thus be interpreted as the vector Q = f0x i + f0y j for some smooth function f(x, y) = c where i and j are rooted at P. Another way to look at it is that i is a unit vector in the direction of x rooted at x0 and j is a unit vector in the direction of y rooted at y0. A linear functional φ acting on <2P thus gives

φ(c1(x − x0)i + c2(y − y0)j) = c1 φ((x − x0)i) + c2 φ((y − y0)j)

Now the simplest function, x + y = c, gives the tangent plane z = 1(x − x0) + 1(y − y0) and we have

φ(1(x − x0)i + 1(y − y0)j) = 1 φ((x − x0)i) + 1 φ((y − y0)j)

This suggests we rethink the dual basis. We define

F1([x − x0, 0]^T) = ∂/∂x (x − x0) = d/dx (x − x0) = 1
F2([0, y − y0]^T) = ∂/∂y (y − y0) = d/dy (y − y0) = 1

With this reinterpretation, we have

φ(c1(x − x0)i + c2(y − y0)j) = c1 ∂/∂x + c2 ∂/∂y

and so

φ(f0x(x − x0)i + f0y(y − y0)j) = f0x ∂/∂x + f0y ∂/∂y

Thus, in terms of the dual basis, we would write φ = c1 ∂/∂x + c2 ∂/∂y and

φ(f0x(x − x0)i + f0y(y − y0)j) = (c1 ∂/∂x + c2 ∂/∂y)(f0x(x − x0)i + f0y(y − y0)j) = c1 f0x + c2 f0y

The common symbols for this dual basis are dx = ∂/∂x and dy = ∂/∂y. Thus, a linear functional is written φ = c1 dx + c2 dy.

Next, we need to define a smooth path in U .

Definition 18.1.3 Smooth Paths


Let U be an open set in <2. A smooth path in U is a mapping γ : [a, b] → U for some finite interval [a, b] in < which is continuous on [a, b] and differentiable on (a, b). We assume γ can be extended to a smooth C∞ map in a neighborhood of [a, b]. Hence

γ(t) = [x(t), y(t)]^T, a ≤ t ≤ b

where x and y are C∞ on (a − ε, b + ε) for some positive ε.

Comment 18.1.2 The initial point of the path is γ(a) and the final point is γ(b). We say γ is a path from γ(a) to γ(b).

If γ′(t0) 6= 0, then at the point γ(t0), there is a two dimensional vector space rooted at γ(t0) we can call T(t0) defined by the span of the vectors

E1(t0) = i, E2(t0) = j

Note T(t0) is determined by the tangent vector to the curve γ(t) at the point t0. Any vector in this vector space has the representation rooted at γ(t0) given by

[p, q]^T = (t − t0)[α x′(t0), β y′(t0)]^T = α x′(t0)(t − t0) E1(t0) + β y′(t0)(t − t0) E2(t0)

This shows {E1(t0), E2(t0)} is a basis for T(t0). We define a dual basis {dx(t0), dy(t0)} by

dx(t0) = E1(t0)^T, dy(t0) = E2(t0)^T

Then at each t, we can represent γ(t) in T(t0) by

[x(t), y(t)]^T = γ(t0) + [α(t) x′(t0), β(t) y′(t0)]^T = γ(t0) + α(t) x′(t0) E1(t0) + β(t) y′(t0) E2(t0)

Now if f is a linear functional on T(t0), then

f(α(t) x′(t0) E1(t0) + β(t) y′(t0) E2(t0)) = α(t) x′(t0) f(E1(t0)) + β(t) y′(t0) f(E2(t0))
   = α(t) x′(t0) cf + β(t) y′(t0) df

as the action of f is completely determined by how it acts on the basis for T (t0). Now consider

(c1 dx + c2 dy)(γ(t0) + α(t) x′(t0) E1(t0) + β(t) y′(t0) E2(t0)) = c1 α(t) x′(t0) + c2 β(t) y′(t0)

For equality, we need

α(t) x′(t0) cf = c1 α(t) x′(t0) =⇒ c1 = cf

c2 β(t) y′(t0) = β(t) y′(t0) df =⇒ c2 = df


We conclude the span of {dx, dy} is the dual of T(t0). Thus, the action of dx and dy on the path γ should be defined to be

dx(γ(t)) = x′(t), dy(γ(t)) = y′(t)

Note this is the same as what we had before except now we have the chain rule involved. We have a linear functional

φ = c1 dx + c2 dy = c1 ∂/∂x d/dt + c2 ∂/∂y d/dt

With all this done, we have an understanding of how to interpret the action of the 1 - form ω on the path γ. At each t, we define

ω(γ) = P(x(t0), y(t0)) dx(t0)(γ(t)) + Q(x(t0), y(t0)) dy(t0)(γ(t)) = P(x(t0), y(t0)) x′(t0) + Q(x(t0), y(t0)) y′(t0)

This is quite a notational morass, so we usually say this more sloppily and let context be our guide. The action of the 1 - form ω on the smooth path γ is

ω(γ(t)) = P(γ(t)) dx(γ(t)) + Q(γ(t)) dy(γ(t)) = P(x(t), y(t)) x′(t) + Q(x(t), y(t)) y′(t) = ⟨[P, Q]^T(γ(t)), γ′(t)⟩

We can then define the integral ∫_γ ω.

Definition 18.1.4 Integration of a 1 -form over a smooth path

Let γ be a smooth path in the open set U in <2 and let ω = P dx + Q dy be a 1 - form. Then

∫_γ ω = ∫_a^b (P(x(t), y(t)) x′(t) + Q(x(t), y(t)) y′(t)) dt

Note this is a well - defined Riemann integral as the integrand is continuous.
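The definition translates directly into a numerical routine. The sketch below assumes numpy and approximates ∫_γ ω on a fine grid; the functions P and Q and the path are hypothetical choices.

import numpy as np

def integrate_one_form(P, Q, x, y, dx_dt, dy_dt, a, b, n=200001):
    # Approximate int_a^b (P(x, y) x' + Q(x, y) y') dt.
    t = np.linspace(a, b, n)
    integrand = P(x(t), y(t)) * dx_dt(t) + Q(x(t), y(t)) * dy_dt(t)
    return np.trapz(integrand, t)

# Example: omega = y dx + x^2 dy over gamma(t) = (t, t^2) on [0, 1].
val = integrate_one_form(
    P=lambda x, y: y, Q=lambda x, y: x**2,
    x=lambda t: t, y=lambda t: t**2,
    dx_dt=lambda t: np.ones_like(t), dy_dt=lambda t: 2*t,
    a=0.0, b=1.0)
print(val)   # 5/6: int_0^1 (t^2 + 2 t^3) dt = 1/3 + 1/2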

Recall, in Chapter 17, these types of integrals are called Line Integrals. Hence, what we went through above can be interpreted from another point of view. Suppose each point on the smooth path γ(t) represents the charge of the path at the point t. Then how do we compute the total charge of the path? First, note the function

U(t) = P(x(t), y(t)) x′(t) + Q(x(t), y(t)) y′(t)

is a continuous function on [a, b] and hence, it is Riemann Integrable. Hence, the value of this integral is given by

lim_{n→∞} S(U, πn, σn)

for any sequence of partitions πn whose norms go to zero, where σn is any evaluation set in πn. Then, for a given partition, the charge in the portion of the path between partition points tj and tj+1 can be approximated by

P(x(tj), y(tj))(x(tj+1) − x(tj)) + Q(x(tj), y(tj))(y(tj+1) − y(tj)) =

(P(x(tj), y(tj)) [x(tj+1) − x(tj)]/[tj+1 − tj] + Q(x(tj), y(tj)) [y(tj+1) − y(tj)]/[tj+1 − tj]) (tj+1 − tj)

Now apply the Mean Value Theorem to see

(P(x(tj), y(tj)) [x(tj+1) − x(tj)]/[tj+1 − tj] + Q(x(tj), y(tj)) [y(tj+1) − y(tj)]/[tj+1 − tj]) (tj+1 − tj)
   = (P(x(tj), y(tj)) x′(sj) + Q(x(tj), y(tj)) y′(sj)) (tj+1 − tj)

The approximation to the charge on the portion [tj, tj+1] can just as well be done setting tj = sj in the P and Q terms. Thus, the approximation can be

U(sj)(tj+1 − tj) = (P(x(sj), y(sj)) x′(sj) + Q(x(sj), y(sj)) y′(sj)) (tj+1 − tj)

The Riemann Approximation Theorem then tells us

lim_{n→∞} Σ_{πn} U(sj) Δtj = ∫_a^b U(t) dt

which is the same result as before. We see our interpretation of the integration of the 1 - form ω over the smooth path γ is the same as what we call the line integral of the vector field P i + Q j on the path given by the smooth curve γ.

18.1.1 A Simple Example

Consider a simple circle in the plane parameterized by x(t) = cos(t), y(t) = sin(t) for 0 ≤ t ≤ 2π. Hence, the smooth path here is

γ(t) = [cos(t), sin(t)]^T

for 0 ≤ t ≤ 2π where we start at the point (1, 0) at t = 0 and end there as well at t = 2π. This is a circle in the plane of radius 1 centered at the origin. From the center of the circle, if you draw a vector to a point outside the circle, the variable t represents the angle from the positive x axis measured counterclockwise (ccw) to this line. You can imagine that the picture we draw would be very similar even if we chose an anchor point different from the center. We would still have a well defined angle t. If we choose an arbitrary starting angle t0, the angle we measure as we move ccw around the circle would start at t0 and would end right back at the start point with the new angle t0 + 2π. Hence, there is no way around the fact that as we move ccw around the circle, the angle we measure has a discontinuity. To allow for more generality later, let this angle be denoted by θ(t) which in our simple example is just θ(t) = t. Since tan(θ(t)) = y(t)/x(t), we have the general equation for the rate of change of the angle:

θ′(t) = (−y(t) x′(t) + x(t) y′(t)) / (x²(t) + y²(t))

Of course, here this reduces to θ′(t) = 1 as we really just have θ(t) = t which has a very simple derivative! Note we can use this to define a 1 - form ω by

ω = (−y/(x² + y²)) dx + (x/(x² + y²)) dy


Then, consider this scaled integral of ω on the path γ:

(1/2π) ∫_γ ω = (1/2π) ∫_0^{2π} [−y(t) x′(t) + x(t) y′(t)] / [x²(t) + y²(t)] dt = (1/2π) ∫_0^{2π} 1 dt = 1.

On the other hand, suppose we put the anchor point of our angle measurement system outside the circle. So imagine the reference line for the angle θ(t) to be moved from a vector starting at the origin of the circle to a point outside the circle to a new vector starting outside the circle. For convenience, assume this new vector is in quadrant one. At the point this vector is rooted, it determines a local coordinate system for the plane whose positive x axis direction is given by the bottom of the usual reference triangle we draw at this point. In Figure 18.1, we show a typical setup. Note the angles measured start at θ1, increase to θ2 and then decrease back to θ1. Then, since the angles are now measured from the base point (x0, y0) outside the circle, we set up the line integral a bit differently. We find the change in angle is now zero and there is no discontinuous jump in the angle.

(1/2π) ∫_{θ1}^{θ2} [−(y(t) − y0) x′(t) + (x(t) − x0) y′(t)] / [(x(t) − x0)² + (y(t) − y0)²] dt
   + (1/2π) ∫_{θ2}^{θ1} [−(y(t) − y0) x′(t) + (x(t) − x0) y′(t)] / [(x(t) − x0)² + (y(t) − y0)²] dt = 0.

[Figure: the angle reference system is outside the circle; the angle θ1 is the first angle needed for the circle and the angle θ2 is the last.]

Figure 18.1: Angle Measurements From Outside the Circle

In general, if we try to measure the angle using a point inside the circle, we always get +1 for this integral and if we try to measure the angle from a reference point outside the circle we get 0. This integer we are calculating is called the winding number associated with the particular curve given by our circle and we will return to it later in more detail. Now, if we had a parameterized curve in the plane given by the pair of functions (x(t), y(t)), the line integrals we have just used are still a good way to define the change in angle as we move around the curve. We will define the winding number of the smooth path γ relative to a point P to mean we are measuring the change in angle as we move around the curve γ from its start point to its finish point where we assume the start and finish point are the same. Such smooth paths are called closed. So if the functions that determine γ are defined on [a, b], a closed curve means (x(a), y(a)) = (x(b), y(b)). Hence, closed curves are nice analogues of the simplest possible closed curve - the circle at the origin. The winding number, W(γ, P), is defined to be

W(γ, P) = (1/2π) ∫_γ ω

where we use as 1 - form

ω = (−(y − y0)/((x − x0)² + (y − y0)²)) dx + ((x − x0)/((x − x0)² + (y − y0)²)) dy

where the coordinates of P are (x0, y0). Now let's back up a bit. In Figure 18.1, we found the change in angle as we moved around the circle was zero, giving us a winding number of zero, because we had placed the angle measurement system's reference outside the circle. Another way of saying this is that W(γ, P) = 0 for γ being the circle centered at the origin when P is outside the circle. A little thought tells you that this should be true for more arbitrary paths γ although, of course, there is lots of detail we are leaving out as well as subtleties. Now look at Figure 18.1 again and think of the circle and its interior as a collection of points we can not probe with our paths γ. We can form closed loops around any point P in the plane in general and we can deform these closed paths into new closed paths that let P pop out of the inside of γ as long as we are free to alter the path. Altering the path means we find new parameterizations (x(t), y(t)) which pass through new points in the plane. We are free to do this kind of alteration as freely as we want as long as we don't pick a P inside the gray circle. Think about it. If we had a closed path γ that enclosed a point P inside the gray circle, then we can not deform this γ into a new path that excludes P from its interior because we are not allowed to let the points (x(t), y(t)) on the new path enter the gray circle's interior. So we can probe for the existence of this collection of points which have been excluded by looking at the winding numbers of paths γ. Paths γ around a point P inside the gray circle will all have nonzero winding numbers. Another way of saying this is that if P is not inside the gray circle, there are paths γ with P not in their interior and so their winding number is zero. But for P inside the gray circle, we can not find any closed paths γ whose interior excludes P and hence their winding numbers are always not zero (in our simple thought experiment, they are +1). Also, please note we keep talking about the interior of a smooth path γ as if it is always easy to figure out what that is. In general, it is not! So we still have much to discuss. A first book you can read in this area is the one on Algebraic Topology (Fulton (4) 1995), which requires you to be focused and ready to learn new ways to think. But these are sophisticated ideas and you should not be disheartened by their difficulty!
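The winding number is easy to approximate numerically straight from this definition. The sketch below assumes numpy; the helper name is hypothetical. It recovers W = 1 for the unit circle about an interior point and W = 0 about an exterior point.

import numpy as np

def winding_number(x, y, x0, y0, n=200001):
    # Sample the closed path on [0, 2*pi] and integrate the 1-form above.
    t = np.linspace(0.0, 2*np.pi, n)
    X, Y = x(t) - x0, y(t) - y0
    dX, dY = np.gradient(X, t), np.gradient(Y, t)
    return np.trapz((X*dY - Y*dX) / (X**2 + Y**2), t) / (2*np.pi)

print(winding_number(np.cos, np.sin, 0.0, 0.0))   # about 1: P inside the circle
print(winding_number(np.cos, np.sin, 2.0, 0.0))   # about 0: P outside the circle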

18.2 Exact and Closed 1 - Forms

Let U be an open set in <2. We have already defined what a smooth function on U is, but now we need to know more about the topology of the open set U itself. We need the idea of a connected set. There are several ways to do this. We will start with the simplest.

Definition 18.2.1 Connected Sets

An open set U in <n is connected if it is not possible to find two disjoint nonempty open sets A and B so that U = A ∪ B. If U is not connected, the sets A and B are called the components of U.

An important fact about the relationship between connected open sets and smooth functions is this:


Theorem 18.2.1 All Smooth Functions with zero derivatives are constant if and only if the domain is connected

Let f : U ⊂ <n → < be a smooth function on the open set U. Then

all smooth f with ∇f = 0 are constant ⇐⇒ U is connected

Proof 18.2.1
(=⇒): If ∇f = 0 in U, then it is easy to see the second order partials of f are identically 0. Given x0 in U, it is an interior point and so there is a neighborhood of it, Br(x0), in U. Hence, for any x ∈ Br(x0), we know

f(x) = f(x0) + ∇f(x0)(x − x0) + (1/2) [x − x0]^T [f11(c) . . . f1n(c); . . . ; fn1(c) . . . fnn(c)] [x − x0]

where c is a point on the line between x0 and x. This point is also in Br(x0). However, since all the second order partials are zero, the Hessian here is always the zero matrix and we have f(x) = f(x0). Since these points were arbitrarily chosen, we see f is constant on Br(x0); i.e. we have shown such functions f must be locally constant on U.

For the given x0 and a given f, let A = f−1(d) where d = f(x0). If y ∈ A, then because y is an interior point of U, there is a neighborhood Bs(y) contained in U. Since f is locally constant, we can choose s so that f(z) = d for all z in Bs(y); i.e. Bs(y) ⊂ A. This shows y is an interior point of A and so A is open. Now let B = U \ A. Pick any y in B. Then f(y) ≠ d and since f is locally constant at y, we know there is a neighborhood about y with corresponding function values not equal to d. Hence, they are in B which shows y is an interior point of B. We conclude B is an open set and we have written U = A ∪ B with A and B disjoint and open. If B were nonempty, this would show U is not connected, a contradiction, and so we must conclude B is empty and U = f−1(d). Thus any such f must be constant on all of U.

(⇐=): If U were not connected, then by the argument in the first part, any f with ∇f = 0 is constant on each component of U; choosing different constants on different components produces a smooth f with ∇f = 0 which is not constant on U. Thus, if all such f are constant, U must be connected.

Note we could use any orthonormal basis E here, but we might as well just do the internal mapping from E to {i, j} mentally and just think of everything in terms of a starting basis which is the traditional standard one.

Let’s look at a few results in this area.

Theorem 18.2.2 A Simple Recapture Theorem in <2

Let f be a C∞ function in the open set U in <2. Let the 1 - form ω = fx dx + fy dy. Then ∫_γ ω = f(γ(b)) − f(γ(a)) where the notation f(γ(t)) means f(x(t), y(t)) where x(t) and y(t) are the component functions of the smooth path γ.


Proof 18.2.2
This is a simple application of the Fundamental Theorem of Calculus:

(f(γ(t)))′ = fx(x(t), y(t)) x′(t) + fy(x(t), y(t)) y′(t)
   =⇒ ∫_γ ω = ∫_a^b (f(γ(t)))′ dt = f(γ(b)) − f(γ(a))

Comment 18.2.1 The 1 - form ω = fx dx + fy dy occurs very frequently and we use a special notation for it: df = fx dx + fy dy.

An important idea is that of exactness.

Definition 18.2.2 Exact 1 - forms

Let f be a C∞ function in the open set U in <2 with corresponding 1 - form df = fx dx + fy dy. A 1 - form ω is a differential of f if ω = df. A 1 - form ω is called exact when ω = df for some smooth f.

Theorem 18.2.3 The angle function is not exact on <2 \ {(0, 0)}

For f(x, y) = tan−1(y/x),

df = (−y/(x² + y²)) dx + (x/(x² + y²)) dy

Note f is not a smooth function on <2 \ {(0, 0)} but df is a smooth 1 - form on <2 \ {(0, 0)}. In fact, there is no smooth function g on <2 \ {(0, 0)} so that df = dg.

Proof 18.2.3
By direct calculation, for the smooth path

γ(t) = [cos(t), sin(t)]^T

for 0 ≤ t ≤ 2π, we have ∫_γ df = 2π and γ(0) = γ(2π). Note, fx and fy fail to exist at some points on the path, so the argument we used in Theorem 18.2.2 does not apply. However, the argument we used in Theorem 18.2.2 does work for a smooth g; we would have

∫_γ dg = ∫_0^{2π} (g(γ(t)))′ dt = g(γ(2π)) − g(γ(0)) = 0

Hence, no such g can exist.
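The contrast in this proof shows up numerically: over the closed unit circle, the angle form integrates to 2π while an exact form dg integrates to 0. The sketch below assumes numpy; the smooth g is a hypothetical choice.

import numpy as np

t = np.linspace(0.0, 2*np.pi, 400001)
x, y = np.cos(t), np.sin(t)
xp, yp = -np.sin(t), np.cos(t)

# The angle form: -y/(x^2 + y^2) dx + x/(x^2 + y^2) dy.
angle = np.trapz((-y * xp + x * yp) / (x**2 + y**2), t)

# An exact form dg with g(x, y) = x y^2, smooth on all of R^2.
gx, gy = lambda x, y: y**2, lambda x, y: 2*x*y
exact = np.trapz(gx(x, y) * xp + gy(x, y) * yp, t)

print(angle)   # about 6.2831... = 2*pi
print(exact)   # about 0, as Theorem 18.2.2 demands on a closed path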

We can also glue smooth paths together to build segmented paths.

Definition 18.2.3 Smooth Segmented Paths


A smooth segmented path γ is a finite sequence of smooth paths {γ1, . . . , γn} where each γi is a smooth path and the final point of γi is the initial point of γi+1. We write

γ = γ1 + . . . + γn

and for any 1 - form ω, we define

∫_γ ω = ∫_{γ1} ω + . . . + ∫_{γn} ω

The first result is to extend Theorem 18.2.2 to segmented paths.

Theorem 18.2.4 The Recapture Theorem for Segmented Paths

Let γ be a smooth segmented path in the open set U in <2. Then if ω = df in U,

∫_γ ω = f(Q) − f(P)

where P = γ1(a) and Q = γn(b). P is the initial and Q is the final point of the path.

Proof 18.2.4
We can apply Theorem 18.2.2 to each piece of the segmented path:

∫_γ ω = Σ_{i=1}^n (f(γi(b)) − f(γi(a)))

But γi(b) = γi+1(a). Hence,

∫_γ ω = f(γn(b)) − f(γn(a)) + Σ_{i=1}^{n−1} (f(γi+1(a)) − f(γi(a)))

But the summation here is telescoping, equal to f(γn(a)) − f(γ1(a)), and this gives

∫_γ ω = f(γn(b)) − f(γ1(a))

which is the result.

What about the converse?

Theorem 18.2.5 Equivalent Characterizations of Exactness

Let ω be a 1 - form on the open set U in <2. Then the following are equivalent:

1. ∫_γ ω = ∫_δ ω for any smooth segmented paths γ and δ with the same domain, [a, b], and the same initial and final points.

2. ∫_γ ω = 0 for smooth segmented paths γ that are closed; i.e. their initial and final points are the same.

3. ω = df for a smooth function f on U; i.e. ω is exact.


Proof 18.2.5
First, the inverse of the path γ will be denoted by γ−1 and is defined by γ−1(t) = γ(b + a − t).

(2 =⇒ 1): Note ∫_{γ−1} ω = −∫_γ ω. So given paths γ and δ in U with the same initial and final points, let τ be the closed path

τ = γ1 + . . . + γn − (δ1 + . . . + δm)

Then by (2), ∫_τ ω = 0 which implies ∫_γ ω = ∫_δ ω.

(1 =⇒ 2): Let γ be a closed path and let δ be a constant path, i.e. it never leaves the initial point. By (1), ∫_γ ω = ∫_δ ω = 0.

(1 =⇒ 3): Let ω = U dx + V dy for concreteness. Assume U is connected. If U has more than one component, you can do the argument that follows on each component. Pick a fixed point P0 in U. Let the smooth segmented path γ have initial point P0 and final point P = (x, y) in U. For the point P′ = (x + h, y), define the path σ by σ(t) = (x + t, y) for 0 ≤ t ≤ h. Then γ + σ is a segmented path from P0 to P′. For any Q in U, define f in U by f(Q) = ∫_τ ω for any smooth segmented path τ from P0 to Q; by (1), this value does not depend on the path chosen. Since P is in U, P is an interior point and so for sufficiently small h, P′ is in U also. Thus f(P′) is well - defined and

[f(x + h, y) − f(x, y)]/h = (1/h)(∫_{γ+σ} ω − ∫_γ ω) = (1/h) ∫_σ ω = (1/h) ∫_0^h U(x + t, y) dt

Now apply the Mean Value Theorem to g(h) = ∫_0^h U(x + t, y) dt whose derivative is g′(h) = U(x + h, y) by the Fundamental Theorem of Calculus. Thus, g(h) − g(0) = g′(c) h for some c between 0 and h. Hence,

(1/h) ∫_0^h U(x + t, y) dt = (1/h)(U(x + c, y) h) = U(x + c, y)

But U is continuous, so as h → 0,

∂f/∂x (P) = lim_{h→0} [f(x + h, y) − f(x, y)]/h = lim_{h→0} U(x + c, y) = U(x, y) = U(P)

A similar argument shows ∂f/∂y (P) = V(P). This shows ω = df.

(3 =⇒ 1): If ω = df, then ∫_γ ω = f(Q) − f(P) where P and Q are the initial and final points of γ. This is also true for another path δ with the same initial and final points. This shows (1) holds.


18.3 Two and Three Forms

A 1 - form ω = A dx + B dy is therefore a mapping ω : <2P = TP(<2) → < defined by

ω(VP) = A(P1, P2) V1 + B(P1, P2) V2

Now consider mappings of the form ψ = A dx dy. Then, we define the action of ψ as follows:

Definition 18.3.1 2 - Forms

Let U be an open set in <2. A 2 - form ψ : <2P × <2P → < of the form A dx dy for a smooth function A on U has action defined by

ψ(VP, WP) = A dx dy (VP, WP) = A(P1, P2)(dx(VP) dy(WP) − dx(WP) dy(VP)) = A(P1, P2)(V1 W2 − W1 V2)

Note it is clear from this definition that dx dy = −dy dx. Further, dx dx = dy dy = 0.

Next, we can look at 3 - forms. To do this, we have to figure out a way to alternate the algebraic sign of the terms in the expansion of dx dy dz as well as move the setting to open sets U in <3. You probably have seen the permutation groups Sn on n symbols before. Let's refresh that memory. First, let's look at two symbols:

Definition 18.3.2 The Symmetric Group S2

For concreteness, the symmetric group S2 will be phrased in terms of positive integers. S2 is the group consisting of the two ways the symbols 1 and 2 can be shuffled. These are the ordering 12 and the ordering 21. The ordering 21 can be obtained from 12 by flipping the entries: i.e. 12 → 21. Since one flip is involved, we define the algebraic sign of the ordering 21 to be −1 while since there are an even number of flips (i.e. zero, to be exact) the algebraic sign of 12 is +1. We let σ(i1 i2) be the sign of the ordering.

Next, three symbols:

Definition 18.3.3 The Symmetric Group S3

For concreteness, the symmetric group S3 will be phrased in terms of positive integers. S3 is the group consisting of the ways the symbols 1, 2 and 3 can be shuffled. We let σ(i1 i2 i3) be the sign of the ordering.

• 123 does not need reordering so σ = +1.

• 213: 123 → 213; one flip implying σ = −1.

• 321: 123 → 132 → 312 → 321; three flips implying σ = (−1)³ = −1.

• 132: 123 → 132; one flip implying σ = −1.

• 312: 123 → 132 → 312; two flips implying σ = (−1)² = +1.

• 231: 123 → 213 → 231; two flips implying σ = (−1)² = +1.

Using the symmetric group S3, we can define what we mean by dxdydz.


Definition 18.3.4 3 - Forms

Let U be an open subset of <3. Then a 3 - form ψ : <3P × <3P × <3P → < of the form A dx1 dx2 dx3 for a smooth function A on U has action defined by

ψ(X1P, X2P, X3P) = A dx1 dx2 dx3 (X1P, X2P, X3P) = A(P1, P2, P3) Σ_{(i1 i2 i3) ∈ S3} σ(i1 i2 i3) dx1(Xi1P) dx2(Xi2P) dx3(Xi3P)

Comment 18.3.1 It is easier to remember this using determinants. For a 2 - form,

dx dy (VP, WP) = dx(VP) dy(WP) − dx(WP) dy(VP) = V1 W2 − W1 V2 = det [V1  V2; W1  W2]
   = ⟨det [i  j  k; V1  V2  0; W1  W2  0], k⟩

Comment 18.3.2 Hence, for a 3 - Form, we can use determinants also. Writing out the sum defining dx1 dx2 dx3(X1P, X2P, X3P), the six signed terms are

(+) dx1(X1P) dx2(X2P) dx3(X3P)    (−) dx1(X2P) dx2(X1P) dx3(X3P)
(−) dx1(X3P) dx2(X2P) dx3(X1P)    (−) dx1(X1P) dx2(X3P) dx3(X2P)
(+) dx1(X3P) dx2(X1P) dx3(X2P)    (+) dx1(X2P) dx2(X3P) dx3(X1P)

Thus, letting Xij denote the jth component of XiP, we have

dx1 dx2 dx3(X1P, X2P, X3P) = X11 X22 X33 − X21 X12 X33 − X31 X22 X13 − X11 X32 X23 + X31 X12 X23 + X21 X32 X13

We can rewrite this as

dx1 dx2 dx3(X1P, X2P, X3P) = X11(X22 X33 − X23 X32) − X12(X21 X33 − X23 X31) + X13(X21 X32 − X22 X31)

This can be reorganized as follows:

dx1 dx2 dx3(X1P, X2P, X3P) = X11 det [X22  X23; X32  X33] − X12 det [X21  X23; X31  X33] + X13 det [X21  X22; X31  X32]
   = det [X11  X12  X13; X21  X22  X23; X31  X32  X33]

Comment 18.3.3 From the 3 - Form expansion, if you look at dx2 dx2 dx3, we would find

dx2 dx2 dx3(X1P, X2P, X3P) = det [X21  X22  X23; X21  X22  X23; X31  X32  X33]


Since two rows in the determinant are the same, the determinant is zero. A similar argument shows dxi dxj dxk is zero if any two indices are the same. Thus,

dx dx dz = 0, dx dy dy = 0, dx dz dz = 0

and so forth. Finally, changing order gives

dx2 dx1 dx3(X1P, X2P, X3P) = det [X21  X22  X23; X11  X12  X13; X31  X32  X33]

which interchanges two rows in the matrix and hence the determinant changes sign. Thus, we know

dx dy dz = −dy dx dz, dx dy dz = −dx dz dy, dx dy dz = −dz dy dx

and so forth.
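These determinant formulas are easy to confirm numerically. The sketch below assumes numpy; it evaluates the 2 - form and 3 - form as determinants and checks the sign change and vanishing properties just described.

import numpy as np

def two_form(V, W):
    # dx dy (V, W) = V1 W2 - W1 V2
    return np.linalg.det(np.array([V, W]))

def three_form(X1, X2, X3):
    # dx1 dx2 dx3 (X1, X2, X3) = det of the matrix with rows X1, X2, X3
    return np.linalg.det(np.array([X1, X2, X3]))

V, W = [1.0, 2.0], [3.0, 4.0]
print(two_form(V, W), two_form(W, V))     # -2.0 and 2.0: dx dy = -dy dx

X1, X2, X3 = [1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]
print(three_form(X1, X2, X3))             # 1.0
print(three_form(X2, X1, X3))             # -1.0: a swap flips the sign
print(three_form(X1, X1, X3))             # 0.0: a repeated argument gives zero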

18.4 Exterior Derivatives

18.5 Integration of Forms

18.6 Green’s Theorem Revisited


Part VI

Applications


Chapter 19

The Exponential Matrix

Let A be an n × n matrix. We will let ‖A‖Fr be the Frobenius norm of A; hence, we know ‖Ax‖ ≤ ‖A‖Fr ‖x‖. For convenience, we will simply let ‖A‖Fr be denoted by ‖A‖. Before we begin, let's consider the product of two n × n matrices A and B. We have

Cij = (AB)ij = Σ_{k=1}^n Aik Bkj ≤ sqrt(Σ_{k=1}^n Aik²) sqrt(Σ_{k=1}^n Bkj²)

Thus

Σ_{i=1}^n Σ_{j=1}^n Cij² ≤ Σ_{i=1}^n Σ_{j=1}^n (Σ_{k=1}^n Aik²)(Σ_{k=1}^n Bkj²) = (Σ_{i=1}^n Σ_{k=1}^n Aik²)(Σ_{j=1}^n Σ_{k=1}^n Bkj²) = ‖A‖²Fr ‖B‖²Fr

Therefore ‖AB‖Fr ≤ ‖A‖Fr ‖B‖Fr. We will use this in a bit.

19.1 The Exponential Matrix

Now consider the finite sum

Pn = I + A + A²/2! + A³/3! + · · · + Aⁿ/n!

Does lim_{n→∞} Pn exist? i.e. is there a matrix P such that P = lim_{n→∞} Pn? Thus, we want to show

∀ ε > 0, ∃ N so that ‖P − Pn‖ < ε if n > N

Formally, letting A⁰ = I, we suspect the matrix we seek is expressed as the following infinite series:

P = Σ_{j=0}^∞ A^j/j!

It is straightforward to show the set of all n× n real matrices form a complete normed vector spaceunder the Frobenius norm. This is not surprising as if Mn,n denotes this set of matrices, it can be

381

Page 390: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 382 — #390

382 CHAPTER 19. THE EXPONENTIAL MATRIX

identified with <n2

and this space is complete using the usual ‖ · ‖2 norm. The Frobenius norm isessentially just the ‖ · ‖2 norm applied to objects requiring a double summation. So if we can showthe sequence of partial sums is a Cauchy Sequence with respect to the Frobenius norm, we knowthere is a unique matrix P which can be denoted by the series. For n > m, Pn−Pm =

∑nj=m+1

Aj

j!and

‖Pn − Pm‖ ≤n∑

j=m+1

‖Aj‖j!≤

n∑j=m+1

‖A‖j

j!

where we have used induction to show ‖An‖ ≤ ‖A‖n. But ‖A‖ = α is a constant, so

‖Pn − Pm‖ ≤n∑

j=m+1

(α)j

j!

We know that

eα =

∞∑j=0

(α)j

j!

converges and so it is a Cauchy Sequence of real numbers. Therefore, given ε > 0, there is a N sothat

∑nj=m+1 α

j/(j!) < ε when n > m > N . We conclude if n > m > N ,

‖Pn − Pm‖ ≤n∑

j=m+1

(α)j

j!< ε

Thus the sequence of partial sums is a Cauchy Sequence and by the completeness of Mn,n with theFrobenius norm, there is a unique matrix P so that P = limn→∞ Pn in the Frobenius norm. Thepartial sums Pn are exactly the partial sums we would expect for a function like eA even though A isa matrix. Hence, we use this sequence of partial sums to define what we mean by eA which is calledthe matrix exponential function. Since this analysis works for any matrix A, in particular, given avalue of t, it works for the matrix At to give the matrix eAt.

Homework

Exercise 19.1.1

Exercise 19.1.2

Exercise 19.1.3

Exercise 19.1.4

Exercise 19.1.5

Let’s apply this to systems of linear differential equations Let P (t) = eAt denote the matrix expo-nential function. consider the system x′1

...x′n

=

a11 · · · a1n

... · · ·...

an1 · · · ann

x1

...xn

Page 391: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 383 — #391

19.2. THE JORDAN CANONICAL FORM 383

i.e. the linear system of differential equations

x′ = Ax

where A is a constant matrix. We will show the columns of eAt give the basis of the set of solutionsto this problem. Note in Chapter 6, we looked at this type of system for a symmetric 2× 2 matrix Aand found the general solution to that system x′ = Ax is x(t) = eAtC where

eAt =

∞∑j=0

(At)j/(j!)

=[E1 E2

] [∑nk=0(λk1t

k)/k! = eλ1t 00

∑nk=0(λk2t

k)/k! = eλ2t

] [E1 E2

]Twhere E1 and E2 were the orthogonal basis of eigenvectors to A known to exist because A was2 × 2 symmetric. So in this case, the matrix exponential function was easy to compute. In general,the n× n matrix A does not have such a nice structure, but we still now how to build it using partialsums of powers of A.

19.1.1 Homework

Exercise 19.1.6

Exercise 19.1.7

Exercise 19.1.8

Exercise 19.1.9

Exercise 19.1.10

19.2 The Jordan Canonical Form

In an advanced course in linear algebra, you will learn about the Jordan Canonical Form of amatrix. We can prove every matrix A has one and it is similar to the decomposition we find forsymmetric matrices in a way. A crucial part of the Jordan Canonical form are what are called JordanBlocks. Consider the three 3× 3 matrices below where λ is an eigenvalue. (These could possibly besubmatrices of a larger matrix.) λ 1 0

0 λ 10 0 λ

λ 0 00 λ 10 0 λ

λ 0 00 λ 00 0 λ

(1) (2) (3)

How many eigenvectors do we get for each of these? We know that for (3) we will get 3 distincteigenvectors, but what about for (1) and (2)? Recall, to get the eigenvalues of a matrix we look at

det(A− µI) = 0 (det(µI −A) = 0)

and then to find the eigenvectors we solve

(A− µI)(v) = 0

Page 392: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 384 — #392

384 CHAPTER 19. THE EXPONENTIAL MATRIX

(Here µ is used to denote the eigenvalues because λ is being used in the matrices.) So, for matrix (1)we get

det(A− µI) =

∣∣∣∣∣∣λ− µ 1 0

0 λ− µ 10 0 λ− µ

∣∣∣∣∣∣ = 0

(λ− µ)3 = 0

So µ = λ and λ is our eigenvalue. To get the eigenvectors of (1) we solve λ 1 00 λ 10 0 λ

− µ 0 0

0 µ 00 0 µ

v1

v2

v3

= 0

or λ− µ 1 00 λ− µ 10 0 λ− µ

v1

v2

v3

=

000

We see that v1 = v2 = 0 and v3 is arbitrary. Thus from (1) we only get one distinct eigenvector. Theeigenvalue λ has algebraic multiplicity 3 as it occurs three times in the characteristic equation. It hasgeometric multiplicity 1 as it has only one eigenvector. Hence, the eigenspace for this eigenvalue ofalgebraic multiplicity three is one one dimensional. It is clear means in this case the eigenvectorsassociated with the eigenvalue do not give a subspace of the same dimension as the algebraic mul-tiplicity. Note the eigenvalues are on the diagonal and the superdiagonal right above that is all 1’s.This is the first of the Jordan Canonical Blocks possible for this eigenvalue of multiplicity 3. Notethere are 2 1s in the superdiagonal and there are (3− 2) eigenvectors here.

In a similar fashion we can show that matrix (2) has 2 distinct eigenvectors. Note it can be writtenlike λ 0 0

0 λ 10 0 λ

We interpret this as follows. Here the algebraic multiplicity is still 3 but now the geometric multiplic-ity is now 2. Note the structure of this matrix is different from (1). The (1, 1) entry is just λ and thistop position corresponds to a one dimensional eigenspace. The bottom 2×2 submatrix has 1’s on thediagonal and a 1 on the superdiagonal about that. Since it is a 2× 2 submatrix, the superdiagonal hasonly one entry and there are (3 − 1) eigenvectors here. This is the second type of Jordan Canonicalblock associated with an eigenvalue of algebraic multiplicity 3.

In case (3), the eigenvalues form the diagonal and the superdiagonal about it is all 0’s. Note there are3 eigenvectors now which is the same as (3−0) where the 0 is the number of 1s on the superdiagonal.Here the eigenvalue has algebraic multiplicity 3 and geometric multiplicity 3 also. This is the thirdtype of Jordan Canonical block associated with an eigenvalue of algebraic multiplicity 3.

Page 393: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 385 — #393

19.2. THE JORDAN CANONICAL FORM 385

The Jordan Canonical Blocks for an eigenvalue λ of algebraic multiplicity 4 are thenλ 1 0 00 λ 1 00 0 λ 10 0 0 λ

λ 0 0 00 λ 1 00 0 λ 10 0 0 λ

λ 0 0 00 λ 0 00 0 λ 10 0 0 λ

λ 0 0 00 λ 0 00 0 λ 00 0 0 λ

(1) (2) (3) (4)

AM = 4 AM = 4 AM = 4 AM = 4GM = 1 GM = 2 GM = 3 GM = 4

where AM and GM are the algebraic and geometric multiplicity of the eigenvalue in each JordanCanonical Block. From now on, let a Jordan Canonical Block be denoted by JCB for convenience.

Homework

Exercise 19.2.1

Exercise 19.2.2

Exercise 19.2.3

Exercise 19.2.4

Exercise 19.2.5

What happens if we multiply a JCB by itself many times?

A =

λ 1 00 λ 10 0 λ

A2 =

λ 1 00 λ 10 0 λ

λ 1 00 λ 10 0 λ

=

λ2 2λ 10 λ2 2λ0 0 λ2

A3 =

λ2 2λ 10 λ2 2λ0 0 λ2

λ 1 00 λ 10 0 λ

=

λ3 3λ2 3λ0 λ3 3λ0 0 λ3

We will get the same sort of formulas for A4, A5, etc. You can see if we could compute eAt for aJCB then we could compute it for a matrix written in a form containing only JCB’s.

The general Jordan Canonical Block is thus

J =

λ 1 0

. . . . . .. . . 1

0 λ

n

︸ ︷︷ ︸n

Page 394: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 386 — #394

386 CHAPTER 19. THE EXPONENTIAL MATRIX

We can rewrite this as

J = λI + Z, Z =

0 1 0

. . . . . .. . . 1

0 0

with ones on the super diagonal. It turns out that the matrix Z is nilpotent, i.e. Zn = 0 for some n.Let’s show this:

Z =

0 1 0

. . . . . .. . . 1

0 0

Z2 =

0 1 0

. . . . . .. . . 1

0 0

0 1 0. . . . . .

. . . 10 0

z2ij =

n∑l=1

zilzlj = zi,i+1zi+1,j =

1 if j = i+ 20 otherwise

So

Z2 =

0 0 1 0. . . . . . . . .

. . . . . . 1. . . 0

0 0

Z3 = Z2Z = ZZ2

z3ij =

n∑l=1

z2ilzlj = zi,i+2zi+2,j =

1 if j = i+ 30 otherwise

Page 395: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 387 — #395

19.3. EXPONENTIAL MATRIX CALCULATIONS 387

We find

Z3 =

0 0 0 1 0. . . . . . . . . . . .

. . . . . . . . . 1. . . . . . 0

. . . 00 0

← main-3 (i, i+ 3)

← main-2 (i, i+ 2)

← main-1 (i, i+ 1)

← main

We can only go so far though. After n− 1 iterations, we have

Zn−1 =

0 1. . .

0 0

, Zn = 0

Therefore Z is a nilpotent matrix.

19.2.1 Homework

Exercise 19.2.6

Exercise 19.2.7

Exercise 19.2.8

Exercise 19.2.9

Exercise 19.2.10

19.3 Exponential Matrix Calculations

Theorem 19.3.1 If A and B commute, eA+B = eAeB

Let A and B be n×n matrices which commute; i.e. AB = BA. Then f A and B commutethen eA+B = eAeB

Proof 19.3.1We claim

n∑i=0

(Ai/(i!))

n∑j=0

(Bj/(j!)) =

n∑k=0

k∑j=0

AjBk−j

j!(k − j)!

To see this, consider the array (0, 0) (0, 1) (0, 2) . . . (0, n)(1, 0) (1, 1) (1, 2) . . . (1, n)

......

......

...(n, 0) (n, 1) (n, 2) . . . (n, n)

Page 396: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 388 — #396

388 CHAPTER 19. THE EXPONENTIAL MATRIX

Summing over all (n+ 1)2 entries indexed by this array is the same as summing over the diagonalshere

(i+ j) = 0 (i+ j) = 1 (i+ j) = 2 . . . (i+ j) = n

(i+ j) = 1 (i+ j) = 2...

... (i+ j) = n+ 1

(i+ j) = 2...

......

...... (i+ j) = n

......

...(i+ j) = n (i+ j) = n+ 1 . . . . . . (i+ j) = 2n

Let Z0 be the set of non negative integers. The scheme shown above defines a 1-1 and onto mappingg from Z0 × Z0 to Z0. If we let aij = AiBj/(i!j!), we are wondering if the

∑i

∑j aij is the same

as the∑ag(i,j) where ag(i,j) is the particular index in the scheme above. Thus, we are wondering if

n∑i=0

n∑j=0

(Ai/(i!)) (Bj/(j!)) = ag(0,0) + (ag(1,0) + ag(0,1)) + (ag(2,0) + ag(1,1) + ag(0,2)) + . . .

=

n∑k=0

k∑j=0

AjBk−j

j!(k − j)!

The∑ni=0

∑nj=0 (Ai/(i!)) (Bj/(j!)) converges to eAeB . Hence, it is a Cauchy Sequence in the

Frobenius norm. If you think about the indexing scheme, we can see( ∑i+j=2n

+∑

i+j=3n

+ . . .+∑

i+j=pn

)AiBj/(i!j!)

does not contain the terms∑ni=0

∑nj=0A

iBj/(i!j!). Further all of these terms are contained in thedifference ( pn∑

i=0

pn∑j=0

−n∑i=0

n∑j=0

)Ai/(i!)Bj/(j!)

( n∑i=0

pn∑j=n+1

+

pn∑i=n+1

n∑j=0

+

pn∑i=n+1

pn∑j=n+1

)Ai/(i!)Bj/(j!)

Thus, ∥∥∥∥( ∑i+j=2n

+∑

i+j=3n

+ . . .+∑

i+j=pn

)AiBj/(i!j!)

∥∥∥∥≤( ∑i+j=2n

+∑

i+j=3n

+ . . .+∑

i+j=pn

)‖A‖i ‖B‖j/(i!j!)

≤( n∑i=0

pn∑j=n+1

+

pn∑i=n+1

n∑j=0

+

pn∑i=n+1

pn∑j=n+1

)‖A‖i ‖B‖j/(i!j!)

Now we can simplify. We have

n∑i=0

‖A‖i/(i!)pn∑

j=n+1

‖B‖j/(j!) +

pn∑i=n+1

‖A‖i/(i!)n∑j=0

‖B‖j/(j!)

Page 397: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 389 — #397

19.3. EXPONENTIAL MATRIX CALCULATIONS 389

+

pn∑i=n+1

‖A‖i/(i!)pn∑

j=n+1

‖B‖j/(j!)

≤ e‖A‖pn∑

j=n+1

‖B‖j/(j!) + e‖B‖pn∑

i=n+1

‖A‖i/(i!) +

pn∑i=n+1

pn∑j=n+1

‖A‖i ‖B‖j/(i!j!)

Hence, given ε > 0, there are integers Q1, Q2 and Q3 so that

pn > n+ 1 > Q1 =⇒ e‖A‖pn∑

j=n+1

‖B‖j/(j!) < ε/3

pn > n+ 1 > Q2 =⇒ e‖B‖pn∑

i=n+1

‖A‖j/(j!) < ε/3

pn > n+ 1 > Q3 =⇒pn∑

i=n+1

pn∑j=n+1

‖A‖i ‖B‖j/(i!j!) < ε/3

because∑ni=0 ‖A‖i/(i!) → e‖A‖ and

∑nj=0 ‖B‖j/(j!) → e‖B‖. Thus, for pn > n + 1 >

maxQ1, Q2, Q3, we have∥∥∥∥( ∑i+j=2n

+∑

i+j=3n

+ . . .+∑

i+j=pn

)AiBj/(i!j!)

∥∥∥∥< ε

which tells us the sequence∑g(i,j)=ng(i,j)=0 is a Cauchy Sequence and so converges to matrix C. Finally,

consider

‖eAeB − C‖ ≤∥∥∥∥eAeB − n∑

i=0

n∑j=0

AiBj/(i!j!)

∥∥∥∥+

∥∥∥∥C − g(i,j)=N∑g(i,j)=0

AiBj/(i!j!)

∥∥∥∥+

∥∥∥∥ n∑i=0

n∑j=0

AiBj/(i!j!)−g(i,j)=N∑g(i,j)=0

AiBj/(i!j!)

∥∥∥∥For a given ε > 0, the first and second terms can be made less than ε/3 because of the convergenceof the sums to EAeB and C respectively. Thus, there is an Q1 so that if n > Q1,

‖eAeB − C‖ < 2ε/3 +

∥∥∥∥ n∑i=0

n∑j=0

AiBj/(i!j!)−g(i,j)=N∑g(i,j)=0

AiBj/(i!j!)

∥∥∥∥The last term is analyzed just like we did in the argument to show the sequence defining C was aCauchy Sequence. We find there is a Q2 so that is n > N > Q2∥∥∥∥ n∑

i=0

M∑j=0

AiBj/(i!j!)−g(i,j)=N∑g(i,j)=0

AiBj/(i!j!)

∥∥∥∥ < ε/3

Thus, for large enough choices of n and N , we have ‖eAeB − C‖ < ε. Since ε is arbitrary, thisshows C = eAeB .

Page 398: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 390 — #398

390 CHAPTER 19. THE EXPONENTIAL MATRIX

Thus,

eAeB =

(I +A+

A2

2!+A3

3!+ · · ·+ Ak

k!+ · · ·

)(I +B +

B2

2!+B3

3!+ · · ·+ Bk

k!+ · · ·

)= I +

B

1!+A

1!+

B2

2!+AB

1!1!+A2

2!+

B3

3!+AB2

1!2!+A2B

2!1!+A3

3!+

...Bk

k!+

ABk−1

1!(k − 1)!+

A2Bk−2

2!(k − 2)!+

...

Then, using the fact that A and B commute,

eAeB =

∞∑n=0

n∑j=0

AjBn−j

j!(n− j)!

=

∞∑n=0

1

n!

n∑j=0

n!

j!(n− j)!AjBn−j

=

∞∑n=0

1

n!

n∑j=0

(nj

)AjBn−j

=

∞∑n=0

(A+B)n

n!since

n∑j=0

(nj

)AjBn−j = (A+B)n

= eA+B

Homework

Exercise 19.3.1

Exercise 19.3.2

Exercise 19.3.3

Exercise 19.3.4

Exercise 19.3.5

19.3.1 Jordan Block Matrices

In applying this to a Jordan Block we get

Page 399: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 391 — #399

19.3. EXPONENTIAL MATRIX CALCULATIONS 391

J = λI + Z

eJt = eλIt+Zt = eλIteZt

where

eλIt = I + tλI +(tλI)2

2!+ · · ·+ (tλI)n

n!+ · · ·

= I + tλI +(tλ)2I

2!+ · · ·+ (tλ)nI

n!+ · · ·

= diag[1 + λt+(λt)2

2!+ · · ·+ (λt)n

n!+ · · · ]

= diag[eλt] = eλtI

where diag denotes a diagonal matrix. We also find

eZt = I + tZ +t2Z2

2!+ · · ·+ tn−1Zn−1

(n− 1)!+ · · ·

=

1

. . .. . .

1

+ t

0 1

. . . . . .. . . 1

0

+ · · ·+ tn−1

(n− 1)!

0 0 1

. . . . . .. . . 0

0

=

1 t t2

2!tn−1

(n−1)!

. . . . . . . . .. . . . . . t2

2!. . . t

1

← main-2

← main-1

← main

So then

eJt = eλIteZt = eλt

1 t t2

2tn−1

(n−1)!

. . . . . . . . .. . . . . . t2

2. . . t

1

Page 400: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 392 — #400

392 CHAPTER 19. THE EXPONENTIAL MATRIX

So if the Jordan form of the matrix A was

J =

2

2 12

−3 0−3

−3 1−3

then

eJt =

e2t

e2t te2t

e2t

e−3t

e−3t

e−3t te−3t

e−3t

Homework

Exercise 19.3.6

Exercise 19.3.7

Exercise 19.3.8

Exercise 19.3.9

Exercise 19.3.10

19.3.2 General Matrices

In general, for an n× n matrix A there is an invertible matrix P such that

A = PJP−1

where J is in Jordan canonical form. So then,

eAt = I + PJP−1t+

PJ2P−1︷ ︸︸ ︷(PJP−1PJP−1)

t2

2!+ · · ·+ (PJnP−1)

tn

n!

and we can easily prove we can factor out the P and P−1 to get

eAt = P (I + J +J2

2!+ · · ·+ Jn

n!)P−1

= PeJtP−1

Page 401: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 393 — #401

19.4. APPLICATIONS TO LINEAR ODE 393

But how easy are P and P−1 to find? Look at a 2 × 2 matrix A having eigenvalues λ1 and λ2 andeigenvectors v1 and v2 such that v1 ⊥ v2 and each has unit length. Then, as we know(

vT1vT2

)︸ ︷︷ ︸

Q

A(v1 v2

)︸ ︷︷ ︸QT

=

(vT1vT2

)(Av1 Av2

)

=

(vT1vT2

)(λ1v1 λ2v2

)=

(λ1v

T1 v1 λ2v

T1 v2

λ1vT2 v1 λ2v

T2 v2

)=

(λ1 00 λ2

)So QAQT = J where J is the above matrix in Jordan canonical form. Then

AQT = QTJ

A = QTJQ

and set

P = QT = (v1, v2)

P−1 = Q

We can therefore do this fairly easily for symmetric matrices. From this we can see that certainmatrices have some nice forms and things just fall out nicely.

19.3.3 Homework

Exercise 19.3.11

Exercise 19.3.12

Exercise 19.3.13

Exercise 19.3.14

Exercise 19.3.15

19.4 Applications to Linear ODE

Consider the following system:

x′ = =

x′1x′2...x′n

=

a11 · · · · · · a1n

... · · · · · ·...

... · · · · · ·...

an1 · · · · · · ann

x1

x2

...xn

= Ax, x(t0) = x0 =

x01

x02

...x0n

Theorem 19.4.1 The Derivative of the exponential matrix

Page 402: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 394 — #402

394 CHAPTER 19. THE EXPONENTIAL MATRIX

Let A be an n× n matrix. Then ddt

(eAt)

= AeAt.

Proof 19.4.1Since eAt =

∑∞j=0

Aj

j! tj , to find its derivative, we look at

d

dt

(eAt)

=d

dt

∞∑j=0

Aj

j!tj

We want to use an analogue of the derivative interchange theorem we proved in (Peterson (8) 2019)to show

d

dt

∞∑j=0

Aj

j!tj

=

∞∑j=0

d

dt

(Aj

j!tj)

The series we have here is a series of matrices and convergence is with respect to the Frobeniusnorm so it is not immediately clear the earlier results apply. For a series of real numbers, we wouldperform the following checks applied to the sequence of partial sums Sn. First, we fix a T . We needto check

1. Sn is differentiable on [0, T ]. It is easy to see( n∑j=0

Aj

j!tj)′

=

n∑j=0

jAj

j!tj−1

as the derivative Bf(t) for a matrix B and a differentiable function f is easily seen to beBf ′(t) by looking at the derivative of each component.

2. S′n is Riemann Integrable on [0, t]: This is true as the integral of Bf(t) for any integrable f isB∫ t

0f(s)ds for any matrix B by looking at components.

3. There is at least one point t0 ∈ [0, t] such that the sequence (Sn(t0)) converges. This is trueas the series here converges at all t.

4. S′nunif−→ y on [0, T ] and the limit function y is continuous. This one is tougher. What we

have shown in our work so far is that Sn is a Cauchy Sequence with respect to the Frobeniusnorm. In fact, we know that if Tn(t) =

∑ni=0 ‖A‖i/(i!) is a partial sum of e‖A‖t which we

know converges uniformly on any finite interval [0, T ], we use it to bound the partial sumsSn. Hence, for any ε > 0, there is N so that n > m > N implies ‖Sn(t) − Sm(t)‖Fr ≤|Tn(t) − Tm(t)| < ε on [0, T ]. If you look at the arguments we use in proving these sorts oftheorems in (Peterson (8) 2019), you can see how they can be modified to show Sn convergesuniformly on [0, T ] to a continuous function eAt. Everything thing just said can be modified abit to prove S′n converges uniformly to some matrix function D on [0, T ] as well.

You can then modify the proof of the derivative interchange theorem to show there is a function W

on [0, T ] so that Snunif−→ W on [0, T ] and W ′ = D. Since limits are unique, we then have W = eAt

with (eAt)′ = D. This is the same as saying the interchange works and

d

dt

∞∑j=0

Aj

j!tj

=

∞∑j=0

d

dt

(Aj

j!tj)

Page 403: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 395 — #403

19.4. APPLICATIONS TO LINEAR ODE 395

Now

∞∑j=0

d

dt

(AjtJ

j!

)=

∞∑j=0

Ajjtj−1

j!=

∞∑j=1

Ajjtj−1

j!=

∞∑j=1

Ajtj−1

(j − 1)!

= A(I + tA+t2

2A2 + · · ·+ tn

n!An + · · · ) = AeAt

where it is straightforward to prove the we can factor theA out of the series by a partial sum argumentwhich we leave to you.So we get that d

dt (eAt) = AeAt.

Homework

Exercise 19.4.1

Exercise 19.4.2

Exercise 19.4.3

Exercise 19.4.4

Exercise 19.4.5

19.4.1 The Homogeneous SolutionTheorem 19.4.2 The general solution to a linear ODE system

Let A be a n× n matrix. Then Φ(t) = eAt · C is a solution to x′ = Ax.

Proof 19.4.2We note

Φ′(t) =d

dt

(eAt · C

)and if we write eAt using its column vectors

eAt = [Φ1(t), · · · ,Φn(t)]

then

d

dt

(eAt · C

)=

d

dt

[ Φ1(t) · · · Φn(t)] C1

...Cn

=d

dt

(n∑i=1

Φi(t)Ci

)=

n∑i=1

d

dt(Φi(t)Ci) =

n∑i=1

CiΦ′i(t)

=[

Φ′i(t) · · · Φ′n(t)] C1

...Cn

=

d

dt

(eAt)· C = AeAt · C = AΦ(t)

This shows Φ is a solution to the system.

Page 404: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 396 — #404

396 CHAPTER 19. THE EXPONENTIAL MATRIX

We also know that

eAt = PeJtP−1C

where J is the Jordan canonical form of A. For the structure of the Jordan Canonical Form of A, youcan see its columns are linearly independent functions. Thus, letting the columns of eJt be Ψ(t), wesee [

Φ1 . . . Φn]

= P[Ψ1 . . . Ψn

]P−1

To see the columns of eAt are linearly independent, consider the usual linear dependence equation

α1Φ1 + . . .+ αnΦn = 0 =⇒[Φ1 . . . Φn

] α1

...αn

= 0

This implies

P[Ψ1 . . . Ψn

]P−1

α1

...αn

= 0

or

[Ψ1 . . . Ψn

]P−1

α1

...αn

= P−10 = 0

Since eJt has independent columns, the only solution here is

P−1

α1

...αn

= 0 =⇒

α1

...αn

= 0

We conclude the functions Φ are independent. Since Φ′i = AΦi, we also know each Φ is a solutionto x′ = Ax. The solutions Φ = eAtC for arbitrary C ∈ <n are the span of the linearly independentfunctions Φ1 to Φn. So the functions Φ1 to Φn form a basis for the set of solutions to x′ = Ax.

When we search for solutions to x′ = Ax, we look for solutions of the form x = V ert and we findthe only scalars r that work are the eigenvalues of A. Of course, if a root is repeated the eigenspacefor that eigenvalue need not have the same geometric multiplicity as its algebraic multiplicity. Thestructure of the Jordan Canonical block here shows us what happens. Any solution satisfies

x′ = P J P−1x

and letting y = P−1x, we see we are searching for solutions to y′ = Jy. Now focus on one JCB,say Ji. Then setting all variables in yi to be zero except for the ones that concern Ji, we look forsolutions

yi′ = Jiyi

and we know we either get eλi solutions with corresponding eigenvectors or if the eigenspace is too

Page 405: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 397 — #405

19.5. THE NON HOMOGENEOUS SOLUTION 397

small we also get solutions such as tkeλi for appropriate k. So the only solutions we can get to thissystem are linear combinations of the columns of eJt. Hence, any solution is a linear combination ofthe columns of eAt. We conclude the solution space of x′ = Ax is the n dimensional vector spacewith basis Φ1, . . . , Phin and any function in the span of this basis can be written as eAtC for someC ∈ <n. Let’s summarize:

Theorem 19.4.3 The Solution Space to a linear System of ODE

Let S = y | y′ = Ay Then

1. S is an n-dimensional vector space and any basis for S consists of n linearly inde-pendent solutions to x′ = Ax.

2. The columns of eAt give a basis for S.

3. eAt is invertible and we write its inverse as e−At.

4. Φ(t) = eA(t−t0)x0 is the solution to x′ = Ax, x(t0) = x0.

Proof 19.4.3Most of this has already been proven. Note the columns of eAt are linearly independent, it does havean inverse. Thus

e(A−A)t = I =⇒ eA e−A = I =⇒ e−A = (eA)−1

Next, note x′ = Ax, x(t0) = x0 has general solution Φ(t) = eAtC. Using the initial condition wecan find

Φ(t0) = x0 = eA t0C =⇒ C = e−A t0x0

So

Φ(t) = eAt(e−At0

)x0 = eA(t−t0)x0

19.4.2 HomeworkExercise 19.4.6

Exercise 19.4.7

Exercise 19.4.8

Exercise 19.4.9

Exercise 19.4.10

19.5 The Non homogeneous SolutionNow let’s look at what are called non homogeneous systems. Now consider the nonhomogeneoussystem

x′ = Ax+ b(t)

Page 406: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 398 — #406

398 CHAPTER 19. THE EXPONENTIAL MATRIX

x(t0) = x0

We say Ψ is a particular solution if we have

Ψ′ = AΨ + b(t)

These are hard to find in practice, but we can find a general formula that is a good start. If we have aparticular solution in hand, the general solution is then

Φ(t) = EA(t−t0)C + Ψ(t)

Theorem 19.5.1 A Particular Solution to a linear ODE System

Let b be a continuous function. Then Ψ(t) = eAt∫ tt0e−As b(s)ds is a particular solution

to

x′ = Ax+ b(t)

Here, integration is defined on each component in the usual way. Hence, the solution tothe system is

Φ(t) = eA(t−t0)x0 +

∫ t

t0

e−A(t−s)b(s)ds

Proof 19.5.1This is a straightforward calculation. The general solution to the nonhomogeneous system should be

Φ(t) = eAtC + eAt∫ t

t0

e−Asb(s)ds

Using the initial condition we get that the actual solution to the above nonhomogeneous system is

Φ(t) = eA(t−t0)x0 +

∫ t

t0

e−A(t−s)b(s)ds

where the second part holds since

eAt∫ t

t0

e−Asb(s)ds =

∫ t

t0

eAte−Asb(s)ds =

∫ t

t0

eA(t−s)b(s)ds

Note,

Ψ′ = AeAt∫ t

t0

e−Asb(s)ds+ eAt(∫ t

t0

e−Asb(s)ds

)′= AeAt

∫ t

t0

e−Asb(s)ds+ eAt(e−Atb(t)

)as the Fundamental Theorem of Calculus holds for these vectors functions which is easily seen bylooking at components. Therefore

Ψ′ = AeAt∫ t

t0

e−Asb(s)ds+ b(t) = AΨ + b(t)

Page 407: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 399 — #407

19.6. A DIAGONALIZABLE TEST PROBLEM 399

The solution to the system is thus

Φ(t) = eA(t−t0)x0 +

∫ t

t0

e−A(t−s)b(s)ds

Comment 19.5.1 We see then that if we were able to find eAt we would then be able to solve manyof these linear systems!

19.5.1 Homework

Exercise 19.5.1

Exercise 19.5.2

Exercise 19.5.3

Exercise 19.5.4

Exercise 19.5.5

19.6 A Diagonalizable Test Problem

Take

A =

[0 1−2 −3

]Let’s find eAt.

1. Find eigenvalues:∣∣∣∣ −r 1−2 3− r

∣∣∣∣ = 3r + r2 − 2 = (r + 2)(r + 1)− 0

So eigenvalues are r1 = −2, r2 = −1.

2. The Cayley–Hamilton Theorem ( which we have not discussed and that is covered in youradvanced linear algebra course) tells us A satisfies the characteristic equation. Hence, weknow A2 + 3A+ 2I = 0 So

A2 = −3A− 2I

A3 = A2A = −3A2 − 2A = −3(−3A− 2I)− 2A = 7A− 6I

...An = anA+ bn I

Now

eA = I +A+A2

2!+A3

3!+ · · ·+ An

n!+ · · ·

= I +A+

(a2

2!A+

b22!I

)+ · · ·+

(ann!A+

bnn!I

)+ · · ·

Page 408: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 400 — #408

400 CHAPTER 19. THE EXPONENTIAL MATRIX

=

(1 +

b22!

+b33!

+ · · ·)I +

(1 +

a2

2!+a3

3!+ · · ·

)A

and we know that both of these sequences in parenthesis converge because eA converges. SoeA = αI + βA for some α and β. Then eAt can be written as eAt = α(t)I + β(t)A. Next,

[0 1−2 −3

]= P

J︷ ︸︸ ︷[−2 0

0 −1

]P−1

From before we know that eAt = PeJtP−1, so

PeJtP−1 = Pα(t)IP−1 + β(t)PJP−1

P

[e−2t 0

0 e−t

]P−1 = P

[α(t)I + β(t)

[−2 0

0 −1

] ]P−1

i.e.

e−2t = α(t)− 2β(t) (1)

e−t = α(t)− β(t) (2)

Taking the negative of (2) we get

e−2t = α(t)− 2β(t)

−e−t = −α(t) + β(t)

Adding these we get

e−2t − e−t = −β(t) or β(t) = e−t − e−2t

So

α(t) = β(t) + e−t

= 2e−t − e−2t

Therefore

eAt =(2e−t − e−2t

)( 1 00 1

)+(e−t − e−2t

)( 0 1−2 −3

)=

(2e−t − e−2t e−t − e−2t

−2e−t + 2e−2t −e−t + 2e−2t

)and we know that each of the columns in eAt should be solutions to the linear system ofequations associated with the matrix A.

Solve the system x′ = Ax Now, as mentioned before, both columns of eAt should be solutions tothe above system. Let’s check it for

Φ1 =

(2e−t − e−2t

−2e−t + 2e−2t

)

Page 409: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 401 — #409

19.7. A NON DIAGONALIZABLE TEST PROBLEM 401

First,

Φ′1 =

(−2e−t + 2e−2t

2e−t − 4e−2t

)and second,

AΦ1 =

(0 1−2 −3

)(2e−t − e−2t

−2e−t + 2e−2t

)=

(−2e−t + 2e−2t

2e−t − 4e−2t

)So Φ′1 = AΦ1 and Φ1 is a solution to the system. In a similar fashion it can be shown that Φ2 is alsoa solution. Thus eAt contains the fundamental solutions to the system.

19.6.1 Homework

Exercise 19.6.1

Exercise 19.6.2

Exercise 19.6.3

Exercise 19.6.4

Exercise 19.6.5

19.7 A Non diagonalizable Test Problem

Now try finding eAt using this method for a matrix A which is not diagonalizable, and has multipleeigenvalues. Try

A =

(1 10 1

)This is a Jordan block and we know that for this matrix

eAt =

(et tet

0 et

)Now let’s use the method above to find eAt.

1. Find eigenvalues: We know they are r1 = 1, r2 = 1.

2. Use Cayley–Hamilton Theorem:

(A− I)(A− I) = 0

A2 − 2A+ I = 0

A2 = 2A− I

3. eAt = α(t)I + β(t)A

PeJtP−1 = P [α(t)I + β(t)J ]P−1 (Here A=J)

Page 410: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 402 — #410

402 CHAPTER 19. THE EXPONENTIAL MATRIX

Pe

1 10 1

tP−1 = P

[α(t)I + β(t)

[1 10 1

] ]P−1

[et tet

0 et

]=

[α(t) 0

0 α(t)

] [β(t) β(t)

0 β(t)

]So,

et = α(t) + β(t) (1)

tet = β(t) (2)

et = α(t) + β(t) (3)

From (2) we get β(t) = tet, and using (2), (1) gives us α(t) = et − tet. Therefore,

eAt = (1− t)etI + tetA

= (1− t)et(

1 00 1

)+ tet

(1 10 1

)=

((1− t)et 0

0 (1− t)et)

+

(tet tet

0 tet

)=

(et tet

0 et

)and this is what we expected. So this method works for matrices having repeated eigenvaluesand for matrices that are not diagonalizable. Again, note that we can find eAt and we neverhave to find P or P−1.

19.7.1 Homework

Exercise 19.7.1

Exercise 19.7.2

Exercise 19.7.3

Exercise 19.7.4

Exercise 19.7.5

19.8 A Non homogeneous Problem

Let’s look at a typical nonhomogeneous system to show you the kinds of calculations you have to doto solve a problem like this. It is pretty intense! Consider the non homogeneous problem

x′ =

[1 24 3

]x+

[t

cos(t)

]and we want x1(t0) = 1 and x2(t0) = −1. We have already shown that P (t) =

∫ tt0eA(t−s)b(s)ds

is a particular solution to the nonhomogeneous system. Instead of using eAt, let’s use Φ(t) to getQ(t) = Φ(t)

∫ tt0

Φ−1(s)b(s)dt as a particular solution. The check that it is a particular solution we

Page 411: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 403 — #411

19.8. A NON HOMOGENEOUS PROBLEM 403

know

Q′(t) = Φ′(t)

∫ t

t0

Φ−1(s) b(s) ds+ Φ(t) Φ−1′(t) b(t)

where

Φ′ =[

Φ′1 Φ′2]

=[AΦ1 AΦ2

]= A

[Φ1 Φ2

]= AΦ

since Φ1 and Φ2 are solutions to the homogeneous system. So then

Q′ = AΦ

∫ t

t0

Φ−1(s) b(s) ds+ b(t)

= AQ+ b(t)

and we see that Q(t) satisfies the nonhomogeneous system and is therefore a particular solution. Sowe see that we don’t need to use eAt; all we need is a matrix containing two linearly independentsolutions. (Here we used Φ(t)).

Homework

Exercise 19.8.1

Exercise 19.8.2

Exercise 19.8.3

Exercise 19.8.4

Exercise 19.8.5

19.8.1 Louiville’s Formula

Now we know that Q(t) = Φ(t)∫ tt0

Φ′(s) b(s) ds is also a particular solution where

Φ(t) =

[ [12

]e5t

[1−1

]e−t

]=[V1e

5t V2e−t]

where V1 and V2 are the corresponding eigenvectors. A quick calculation shows

detΦ(t) = det[V1e

5t V2e−t] = det

[V1 V2

]e4t

where the det[V1 V2

]is nonzero because the eigenvectors are linearly independent. It is well

known the inverse is then

Φ−1(t) =1

det[V1 V2

] [ V22e−t −V12e

5t

−V12e5t V11e

5t

]Now the trace of a matrix is the sum of its diagonal elements and for the matrix B is denoted bytr(B). Here, tr(A) = 4 and detΦ(0) = det

[V1 V2

]. Thus, this formula

detΦ(t) = detΦ(0)e∫ t0tr(A) ds = det

[V1 V2

]e4t

Page 412: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 404 — #412

404 CHAPTER 19. THE EXPONENTIAL MATRIX

which we have already found. However, if these matrices were not 2×2 we can define the submatrices

Φij = (−1)i+j det(

submatrix of Φ obtained bydeleting row i and column j

)and form the adjoint of Φ adj(Φ)(t) = (Φij)

T (t). From advanced linear algebra, we find the inverseof Φ(t) is

(Φ(t))−1

=1

detΦadj(Φ)

with

detΦ(t) = detΦ(t0)e∫ tt0tr(A) ds

Now for t0 = 0 we get the determinant we already found which is detΦ(t) = −3e4t. This formulais called Louiville’s Formula and it holds for linear systems.Homework

Exercise 19.8.6

Exercise 19.8.7

Exercise 19.8.8

Exercise 19.8.9

Exercise 19.8.10

19.8.2 Finding the Particular Solution

How do we find adjΦ? Using the formulas given above we get that

adj Φ =

[−e−t −2e5t

−e−t e5t

]Twhich is what we found earlier. So then

Φ−1(t) =

[−e−t −2e5t

−e−t e5t

]T−3e4t

=

[13e−5t 1

3e−5t

23et − 1

3et

]Now, if the inverse was calculated correctly then ΦΦ−1 = I. To verify this we see that[

e5t e−t

2e5t −e−t] [

13e−5t 1

3e−5t

23et − 1

3et

]=

[1 00 1

]So now our particular solution looks like

Q(t) =

[e5t e−t

2e5t −e−t] ∫ t

0

[13e−5s 1

3e−5s

23es − 1

3es

] [s

cos(s)

]ds

=

[e5t e−t

2e5t −e−t][ ∫ t

0

[13e−5s + 1

3 cos(s)e−5s]ds∫ t

0

[23es − 1

3 cos(s)es]ds

]

Page 413: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 405 — #413

19.8. A NON HOMOGENEOUS PROBLEM 405

and in evaluating these integrals we will get the particular solution for our example. With this then,the general solution to the nonhomogeneous system is

Φ(t) = c1

[12

]e5t + c2

[1−1

]e−t +Q(t)

where c1 and c2 can be found using the initial conditions given. As you can see this is a very intenseprocess!Homework

Exercise 19.8.11

Exercise 19.8.12

Exercise 19.8.13

Exercise 19.8.14

Exercise 19.8.15

Page 414: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 406 — #414

406 CHAPTER 19. THE EXPONENTIAL MATRIX

Page 415: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 407 — #415

Chapter 20

Nonlinear Parametric OptimizationTheory

Let’s look at some optimization problems to see how the material we have been working throughapplies. First, here is a very relaxed way to think about a classical optimization problem. Considerthe task of

min

∫ t

0

fo(s, x(s), x′(s), µ(s)) subject to

x′(t) = f(t, x(t), µ(t)), 0 ≤ t ≤ Tx(0) = x0

As stated, it is not very clear what is going on. We are supposed to find a minimum value of anintegral, so there should be some conditions of the integrand fo to ensure fo(s, x(s), x(s), µ(s)) isan integrable function. This problem probably makes sense even if the arguments of f0 are vectorstoo. We clearly need to restrict our attention to some class of interesting functions. For example, wecould rephrase the problem as

minx,x′,µ∈C1([0,T )

∫ t

0

fo(s, x(s), x′(s), µ(s)) subject to

x′(t) = f(t, x(t), µ(t)), 0 ≤ t ≤ Tx(0) = x0

This still does not say anything about f0. A more careful versions would be Let’s assume f0 :[0, t]×D ⊂ <n×<n×<n → < is continuous onD with continuous partials onD. The minimizationproblem is

minx,x′,µ∈C1([0,T )

∫ t

0

fo(s,x(s),x′(s),µ(s)) subject to

x′(t) = f(t,x(t),µ(t)), 0 ≤ t ≤ Tx(0) = x0

Here is a specific one.

407

Page 416: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 408 — #416

408 CHAPTER 20. NONLINEAR PARAMETRIC OPTIMIZATION THEORY

Example 20.0.1

min

∫ 1

0

f0−not dependent on time︷ ︸︸ ︷[x2(s) + (x′(s))2︸ ︷︷ ︸

like energy

+ µ2(s)︸ ︷︷ ︸control cost

] ds

subject to

x′(t) = x(t) + µ(t) + e−t︸︷︷︸damping term

0 ≤ t ≤ T

x(0) = 1

We call x is the state variable and µ the control variable. It’s possible that we could have a controlconstraint which bounds the control. So we could have something like −1 ≤ µ(t) ≤ 1. We couldalso have a state constraint which says that x(t) lives in a certain set; i.e. x(t) ∈ C for some set C.Hence, these problems can be pretty hard.

20.1 The More Precise and Careful Way

Let T be a fixed real number and define some underlying spaces.

• The control space is U = µ : [0, T ]→ <m|µ has some properties. These properties mightbe the control functions are continuous, piecewise continuous (meaning they can have a finitenumber of discontinuities), piecewise differentiable ( meaning they can have a finite numberof points where the right and left hand derivatives exists but don’t match) and so forth. Eachfunction in the control space is called a control.

• The State Space is X = x : [0, T ] → <n | x′ continuous on [0, T ]. This is the same asC1([0, T ]). Each function in the state space is called a state.

Let f0 : [0, T ]×<n ×<n ×<m → < and assume f0 is continuous on [0, T ]×<n ×<n ×<m i.e.,∀ ε > 0, ∃ δ > 0 so that

‖(t,u,v,w)− (t0,u0,v0,w0‖ < δ =⇒ |f0(t,u,v,w)− f0(t0,u0,v0,w0)| < ε

The norm ‖ · ‖ here is

‖(t,u,v,w)− (t0,u0,v0,w0‖ =√|t− t0|2 + ‖u− u0‖2<n + ‖v − v0‖2<n + ‖w −w0‖2<m

This means that if x(t),x′(t) and µ(t) are continuous functions on [0, T ], then since the compositionof continuous functions is continuous, f(t,xx(t),x′(t),µ(t)) will be continuous on [0, T ]. Let’schoose U = PC[0, T ], the set of piecewise continuous functions on [0, T ]. We know that evenif µ(t) is a piecewise continuous function the above will still hold except at the places where thecontrol is not continuous because there is only a finite number of discontinuities. Hence, f0 will alsobe piecewise continuous. Now the ODE isx

′1...x′n

=

f1(t,x(t),µ(t))...

fn(t,x(t),µ(t))

, 0 ≤ t ≤ T,

x1(0)...

xn(0)

=

x01

...x0n

Page 417: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 409 — #417

20.1. THE MORE PRECISE AND CAREFUL WAY 409

or

x′ = f(t,x(t),µ(t)). 0 ≤ t ≤ T, x(0) = x0

where f : [0, T ]×<n ×<m → < and f is continuous on [0, T ]×<n ×<m.

Now, how do we describe the minimization? Let

J(x0,µ) =

∫ T

0

f0(s,x(s),x′(s),µ(s))ds

where µ is a specific control function and x is the solution to the ODE for that control functionchoice. Hence, if there is such a solution and the solution is at least piecewise continuous, then forthe piecewise continuous control µ, the integrand is piecewise continuous and so the integral exists.But perhaps the ODE does not have a continuous solution for a given controlµwhich in other courseswe would find would mean the solution x would have ‖x‖ → ∞. In that case, the integral could gounbounded. Hence, in general, J : <n × U → <

⋃±∞. We want to find

J(x0) = infµ∈U

J(x0,µ)

This function is called the Optimal Value Function and understanding its properties is very impor-tant in many application areas. For example, a really good question would be how does the optimalvalue function vary when the initial data x0 changes? So our full optimization problem is

infµ∈U

J(x0,µ), subject to

x′ = f(t,x(t),µ(t)), 0 ≤ t ≤ Tx(0) = x0

or

infµ∈U

∫ T

0

f0(s,x(s),x′(s),µ(s))ds, subject to

x′ = f(t,x(t),µ(t)), 0 ≤ t ≤ Tx(0) = x0

The most general way to write this is: Find

J(t0, x0) = infµ∈U

J(t0, x0,µ)

where

J(t0, x0, µ) = `(t0,x0) +

∫ T

0

f0(s,x(s),x′(s),µ(s))ds

subject to

x′ = f(t,x(t),µ(t)), 0 ≤ t ≤ Tx(t0) = x0

x(t) ∈ Ω(t,µ(t))

µ(t) ∈ Λ(t,x(t))

Page 418: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 410 — #418

410 CHAPTER 20. NONLINEAR PARAMETRIC OPTIMIZATION THEORY

with `(t0,x0) a penalty on initial conditions. For example, this could be some random noise addedfrom some probability distribution. The set Ω is a constraint set on the values the state can take attime t given the value of the control at the time. The set Λ is a constraint set on the values the controlis permitted to have at time t given the state value at that time. As you may imagine, adding stateand control constraints makes this problem much harder to find solutions to. We would like to seethat J(t0,x0) is a finite number. So we want to find a control which actually allows us to achieve theinfimum. This is a very hard problem both theoretically and practically.

Example 20.1.1 A Constrained Optimal Control ProblemConsider this problem which has constraints on the control.

inf∫ 1

0

[1

2xTQx+

1

2(x′)TRx′ +

1

2µTSµ

]ds

Subject to:

x′ = Ax+Bµ

−1 ≤ µ ≤ 1

x(t0) = x0

The idea here is to take a control from any space that we want. Then we solve the differentialequation, hopefully obtaining a solution, and then use this solution along with the control to find avalue for the integral. Our hope is that there is a control which gives us a smallest value for theintegral. For the above problem the solution to the differential equation is given by

x(t) = eA(t−t0)x0 + eAt∫ t

t0

e−As B µ(s)ds

= eA(t−t0)x0 +

∫ t

t0

eA(t−s)Bµ(s)ds.

Now how big is x(t)?

‖x(t)‖ ≤ ‖eA(t−t0)‖ ‖x0‖+

∫ t

t0

‖eA(t−s)‖ ‖B‖ ‖µ(s)‖ds

We get a nice bound on the first term if all of the eigenvalues have negative real part. We knoweAt = PeJtP−1 so the norm of the exponential matrix is determined by the structure of the JordanCanonical Blocks. Any eigenvalue which is positive or has positive real part will give us a normwhich is unbounded over long time scales. We encourage you to work this out. Hence, we valueproblems where we can guarantee the eigenvalues of A have negative real part. And, provided ourcontrol is bounded, we can bound the second term. So we can get a bound on x(t) in these cases.We get a bound even if the eigenvalues have positive real part but then the norm of the solution canget arbitrary large which is not what we usually want. Now, in general, we can write the solution as

x(t) = Φ(t)Φ′(t0)x0 + Φ(t)

∫ t

t0

Φ−1(s)Bµ(s)ds

where Φ(t) is the fundamental matrix composed of linearly independent solutions. For nonlinearproblems this is very difficult to find and it is here that the problem lies.

Comment 20.1.1 This control problem is called a LQR for the Linear Quadratic regulator prob-lem. Often there is state variables x and variables y which are actually observed, called observedvariables. This simply adds an extra constraint of the form y = Cx.

Page 419: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 411 — #419

20.2. UNCONSTRAINED PARAMETRIC OPTIMIZATION 411

20.2 Unconstrained Parametric Optimization

Given L : <n → <, the problem is to find infx∈<n L(x) We have already discussed such extremumproblems, so we will be brief here. Under sufficient smoothness conditions, if x0 is a local minimum,then

∇L(x0) =

∂L∂x1

(x0)...

∂L∂xn

(x0)

= 0

The directional derivative of L in the direction of the unit vector u from a point x is

DuL(x) =

⟨∇L(x),u

=

(∇L(x)

)Tu =

∂L∂x1

(x)...

∂L∂xn

(x)

T u1

...un

where u2

1 + · · · + u2n = 1. In two dimensions, <2, assume L has a local min at ~x0 =

[x01

x02

],

i.e. ∃ r > 0 such that if x ∈ B(x0; r), then L(x) ≥ L(x0); a generic illustration of this is shownin Figure 20.1. This looks like a paraboloid locally. In fact, if we look at the cross–section of this

Figure 20.1: Two Dimensional Local Minimum

surface through the line Lθ, 0 ≤ θ ≤ 2π, we obtain the two–dimensional slice shown in Figure 20.2.Now restrict your attention to the trace in the L surface above this line lying inside the cylinder ofradius r

2 about x0 as shown in Figure 20.3. Then for a given θ, we obtain the parameterized curvegiven by x3(t) = L(x01 + t cos(θ), x02 + t sin(θ) for− r2 < t < r

2 . This is illustrated in Figure 20.4.

Now x =

[x1

x2

], so

dx3

dt=

⟨∇L(x),

[cos θsin θ

]⟩=

∂L

∂x1(x) cos θ +

∂L

∂x2(x) sin θ.

Page 420: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 412 — #420

412 CHAPTER 20. NONLINEAR PARAMETRIC OPTIMIZATION THEORY

Figure 20.2: Lθ Slice

Figure 20.3: Local Cylinder

Then

d2x3

dt2= cos θLx1,x1(x)

∂x1

∂t+ cos θLx1,x2(x)

∂x2

∂t+ sin θLx2,x1(x)

∂x1

∂t+ sin θLx2,x2(x)

∂x2

∂t

= cos2 θLx1,x1(x) + 2 sin θ cos θLx1,x2(x) + sin2 θLx2,x2(x)

where we assume the mixed order partials match. Of course, if we assume the partials are continuouslocally this is true. From this we get the general quadratic form

d2x3

dt2=

[cos θsin θ

]THL(x)

[cos θsin θ

].

Hence, at the local minimum x0,

d2x3

dt2=

[cos θsin θ

]THL(x0)

[cos θsin θ

].

Page 421: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 413 — #421

20.2. UNCONSTRAINED PARAMETRIC OPTIMIZATION 413

Figure 20.4: Local Cylinder Cross–Section

Since x0 is a local minimum

d2x3

dt2= ET

θHL(x0Eθ > 0 ∀ 0 < θ < 2π

where

Eθ =

[cos θsin θ

]Any vector in <2 can be written as rEθ, so we have shown

xTHL(x0)x > 0 ∀ x ∈ <2

i.e. HL(x0) is a positive definite matrix implying it is diagonalizable and has determinant greaterthan zero). We note that at θ = 0 we get Lx1,x1(x0) > 0. So for the matrix HL(x0) we know at aminimum:

1. The first principle minor is Lx1,x1(x0) and its determinant is positive.

2. The next principle minor isHL(x0) itself and since it is positive definite symmetric we knowits determinant is positive.

So at the local minimum x0 we expect this behavior

• ∇L(x0) = 0.

• HL(x0) is symmetric positive definite

• The first principle minor has positive determinant.

• The second principle minor has positive determinant.

These are all results we have already worked out, but here we used a different method of attack. Forthis <2 case, we have

d2x3

dt2= cos2 θfx1,x1

(x0) + 2 sin θ cos θfx1,x2(x0) + sin2 θfx2,x2

(x0) ∀ 0 < θ < 2π

Rewrite this as

α2

α2 + β2fx1,x1

(x0) +2αβ

α2 + β2fx1,x2

(x0) +β2

α2 + β2fx2,x2

(x0) > 0 ∀ α, β

Page 422: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 414 — #422

414 CHAPTER 20. NONLINEAR PARAMETRIC OPTIMIZATION THEORY

So

α2fx1,x1(x0) + 2αβfx1,x2(x0) + β2fx1,x1(x0) > 0 ∀ α, β

Now factoring out fx1,x1(x0) and completing the square we get the following:

fx1,x1(x0)

[α2 +

2βfx1,x2(x0)

fx1,x1(x0)

+β2(fx1,x1(x0))2

(fx1,x1(x0))2

]− β2(fx1,x2(x0))2

fx1,x1(x0)

+ β2fx2,x2(x0) > 0

Simplifying,

fx1,x1(x0)

[α+

βfx1,x2(x0)

fx1,x1(x0)

]2

+ β2

[fx2,x2(x0)− (fx1,x2

(x0))2

fx1,x1(x0)

]> 0

Thus

fx1,x1(x0)

[α+

βfx1,x2(x0)

fx1,x1(x0)

]2

+ β2

[fx1,x1

(x0) fx2,x2(x0)− (fx1,x2

(x0))2

fx1,x1(x0)

]> 0

Since we know fx1,x1(x0) > 0, this forces

fx1,x1(x0) fx2,x2

(x0)− (fx1,x2(x0))2 > 0

This is precisely the determinant of the second minor which we expected from our theory to bepositive at a minimum. You should go back and look at the argument for the <2 case presented in(Peterson (8) 2019). We used a somewhat different argument there based on Hessian approximationsand had to work a little more to be able to focus on what happened at the critical point x0. From ourmore general theory in Chapter 12.2, we know there are similar conditions at the minimum in <n.

20.3 Constrained Parametric Optimization

Consider the problem:

infµ∈<m

L(x,µ)

subject to

f(x,µ) = 0

where f(x,µ) = 0 is the same as saying

f1(x, µ) = 0...

fn(x, µ) = 0

where x ∈ <n is the state, µ ∈ <m is the control and L : <n × <m → < is the performanceindex. The function f : <n × <m → <n is the constraint. We assume the functions f and L havesufficient smoothness.

Example 20.3.1

infµ∈<

x2 + µ2

Page 423: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 415 — #423

20.3. CONSTRAINED PARAMETRIC OPTIMIZATION 415

subject to

x2 = 2µ

The Figures 20.5 and 20.6 show what is happening.

Figure 20.5: Minimize x2 + µ2 Subject to x2 = 2µ

Figure 20.6: Solution to Minimize x2 + µ2 Subject to x2 = 2µ

Here the minimum is clearly at (0, 0). The idea here is to pick µ and then find an x which satisfiesthe given constraints. This is done until the minimum of L(x, µ) over all µ is found. Note: Here xand µ are not independent of one another. They may be linked in the constraints somehow.

Let’s look at the general problem in detail. We start by looking at Hessian approximations in detail.We want to learn how to use them to derive estimates.

20.3.1 Hessian Error EstimatesTo save some typing, we will start using the superscript ()o to indicate an expression evaluated at thepoint (x0,µ0). The superscript ()θ refers to the expression evaluated at the point on the line segment[(x0,µ0), (x0 + ∆x,µ0 + ∆µ)]. In general, given a matrix A, we know that

‖Ax‖ ≤ ‖A‖Fr ‖x‖

Page 424: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 416 — #424

416 CHAPTER 20. NONLINEAR PARAMETRIC OPTIMIZATION THEORY

and so

‖xTAx‖ = | < x, Ax > | ≤ ‖x‖ ‖Ax‖ ≤ ‖A‖Fr ‖x‖2

The Frobenius norm of A is an example of an operator norm and this kind of behavior is studied ingreater detail in (Peterson (9) 2019).

Now consider the Hessian approximations of the functions L and fi.

L(x0 + ∆x,µ0 + ∆µ) = L0 + (∇xL0)T∆x+ (∇µL

0)T∆µ+1

2

[∆x∆µ

]T [HL

θx,x HL

θx,µ

HLθµ,x HL

θµ,µ

] [∆x∆µ

]where

HLx,x =

Lx1,x1. . . Lx1,xn

......

Lxn,x1. . . Lxn,xn

, HLx,µ =

Lx1,µ1. . . Lx1,µm

......

Lxn,µ1. . . Lxn,µm

HLµ,x =

Lµ1,x1. . . Lµ1,xn

......

Lµm,x1. . . Lµm,xn

, HLµ,µ =

Lµ1,µ1. . . Lµ1,µm

......

Lµm,µ1. . . Lµm,µm

Expand the Hessian term:

1

2

[∆x∆µ

]T [HL

θx,x HL

θx,µ

HLθµ,x HL

θµ,µ

] [∆x∆µ

]=

1

2

[∆x∆µ

]T [HL

θx,x∆x+HL

θx,µ∆µ

HLθµ,x∆x HL

θµ,µ∆µ

]=

1

2

(HL

θx,x(∆x)T∆x+HL

θx,µ(∆x)T∆µ+HL

θµ,x(∆µ)T∆x+HL

θµ,µ(∆µ)T∆µ

)Assuming sufficient smoothness for L, in any closed and bounded domain Ω ⊂ <n × <m, thereexists a bound BL(Ω) such that

maxΩ

(∣∣∣∣ ∂2L

∂xi∂µj

∣∣∣∣ , ∣∣∣∣ ∂2L

∂xi∂xj

∣∣∣∣ , ∣∣∣∣ ∂2L

∂µi∂µj

∣∣∣∣) < B2L(Ω)

Thus,

‖HLx,x‖Fr ≤ nML(Ω), ‖HLx,µ‖Fr ≤√nmML(Ω)

‖HLµ,x‖Fr ≤√nmML(Ω), ‖HLµ,µ‖Fr ≤ mML(Ω)

Let ML(Ω) = max√nm, n,mML(Ω). Then we have

‖HLx,x‖Fr < ML(Ω), ‖HLx,µ‖Fr < ML(Ω)

‖HLµ,x‖Fr < ML(Ω), ‖HLµ,µ‖Fr < ML(Ω)

for any Ω containing (x,µ). So then

1

2

[∆x∆µ

]T [HL

θx,x HL

θx,µ

HLθµ,x HL

θµ,µ

] [∆x∆µ

]≤ ML(Ω)

2

[‖∆x‖2 + 2‖∆x‖‖∆µ‖+ ‖∆µ‖2

]

Page 425: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 417 — #425

20.3. CONSTRAINED PARAMETRIC OPTIMIZATION 417

=ML(Ω)

2(‖∆x‖+ ‖∆µ‖)2

We know for two non negative numbers a and b, (a+ b)2 ≤ 2(a2 + b2). Thus,

1

2

[∆x∆µ

]T [HL

θx,x HL

θx,µ

HLθµ,x HL

θµ,µ

] [∆x∆µ

]≤ ML(Ω)(‖∆x‖2 + ‖∆µ‖2)

Now

L(x0 + ∆x,µ0 + ∆µ)− L0 −∇xL0∆x−∇µL

0∆µ =1

2

[∆x∆µ

]T [HL

θx,x HL

θx,µ

HLθµ,x HL

θµ,µ

] [∆x∆µ

]Let

ηL(∆x,∆µ) =1

2

[∆x∆µ

]T [HL

θx,x HL

θx,µ

HLθµ,x HL

θµ,µ

] [∆x∆µ

]Now for any fixed R we can set Ω = B(R, (x,µ)). Then in this ball

∆L = (∇xL0)T∆x+ (∇µL

0)T∆µ+ ηL(∆x,∆µ)

where ∆L = L(x0 + ∆x,µ0 + ∆µ) − L0. Since all the partials of L are continuous locally,for sufficiently small R, L is differentiable and thus we know and ηL → 0 as ∆x,∆µ → 0 and‖etaL‖/‖(∆(x,∆µ) → 0 as ‖(∆x,∆µ)‖ → 0. Of course, we have already shown this ourselvesas the estimate

ηL(∆x,∆µ) ≤ ML(Ω)(‖∆x‖2 + ‖∆µ‖2)

clearly shows this. Also, in using the same argument as above on each fi we get

∆fi =∇xfoi ∆x+∇µf

0i ∆µ+ ηfi(∆x,∆µ)

and ηfi → 0 as ∆x,∆µ→ 0. Here

ηfi(∆x,∆µ) =1

2

[∆x∆µ

]T [Hfiφx,x Hfi

φx,µ

Hfiφµ,x Hfi

φµ,µ

] [∆x∆µ

]with

|ηfi(∆x,∆µ)| ≤Mfi(Ω)(‖∆x‖2 + ‖∆µ‖2

)for all i, 1 ≤ i ≤ n. So in putting all of the ∆fi into one vector we see

∆f1

∆f2

...∆fn

︸ ︷︷ ︸n×1

=

∂f0

1

∂x1· · · ∂f0

1

∂xn∂f0

2

∂x1· · · ∂f0

2

∂xn...

......

∂f0n

∂x1· · · ∂f0

n

∂xn

︸ ︷︷ ︸

n×n

∆x1

∆x2

...∆xn

︸ ︷︷ ︸n×1

+

∂f0

1

∂µ1· · · ∂f0

1

∂µm∂f0

2

∂µ1· · · ∂f0

2

∂µm...

......

∂f0n

∂µ1· · · ∂f0

n

∂µm

︸ ︷︷ ︸

n×m

∆µ1

∆µ2

...∆µm

︸ ︷︷ ︸m×1

Page 426: Basic Analysis II: A Modern Calculus in Many Variablesastrirg.org/resources/analysis/BasicAnalysisTwo-07102018.pdf · Basic Analysis II: A Modern Calculus in Many Variables The cephalopods

“BasicAnalysisTwo-07102018” — 2018/7/10 — 11:56 — page 418 — #426

418 CHAPTER 20. NONLINEAR PARAMETRIC OPTIMIZATION THEORY

+

ηf1(∆x,∆µ)ηf2(∆x,∆µ)

...ηfn(∆x,∆µ)

︸ ︷︷ ︸

n×1

We can rewrite this in terms of gradients:∆f1

...∆fn

︸ ︷︷ ︸n×1

=

(∇0f1,x)T

...(∇0

fn,x)T

︸ ︷︷ ︸

n×n

∆x1

...∆xn

︸ ︷︷ ︸n×1

+

(∇0f1,µ)T

...(∇0

fn,µ)T

︸ ︷︷ ︸

n×m

∆µ1

...∆µm

︸ ︷︷ ︸m×1

+

ηf1(∆x,∆µ)...

ηfn(∆x,∆µ)

︸ ︷︷ ︸

n×1

Now let

Jf0x =

(∇0f1,x)T

...(∇0

fn,x)T

, Jf0µ =

(∇0f1,µ)T

...(∇0

fn,µ)T

, ηf =

ηf1(∆x,∆µ)...

ηfn(∆x,∆µ)

Then, we have

∆f = Jf0x∆x+ Jf

0µ∆µ+ ηf (∆x,∆µ)

We also have the estimate

‖ηf (∆x,∆µ)‖ ≤ Mf (Ω)(‖∆x‖2 + ‖∆µ‖2)n

where

Mf (Ω) = max(Mf1(Ω), · · · ,Mfn(Ω)).

To summarize, we can write

∆L = (∇xL0)T∆x+ (∇µL

0)T∆µ+ ηL(∆x,∆µ)

∆f = Jf0x∆x+ Jf

0µ∆µ+ ηf (∆x,∆µ)

and we know that the error terms go to zero as ∆x and ∆µ go to zero.

20.3.2 A First Look at Constraint Satisfaction

Now we also know that constraint satisfaction, i.e. satisfying $f = 0$, implies that $\Delta f = 0$. So then

\[
0 = Jf^0_x \Delta x + Jf^0_\mu \Delta\mu + \eta_f(\Delta x, \Delta\mu)
\]

Assuming $(Jf^0_x)^{-1}$ exists and is continuous, we then get

\[
\Delta x = -(Jf^0_x)^{-1} Jf^0_\mu \Delta\mu - (Jf^0_x)^{-1} \eta_f(\Delta x, \Delta\mu)
\]

Note the $\Delta x$ on both sides of this equation. We cannot really solve for $\Delta x$, of course, but ignoring that, we see the above expression yields

\[
\|(Jf^0_x)^{-1} \eta_f(\Delta x, \Delta\mu)\| \le \|(Jf^0_x)^{-1} Jf^0_\mu \Delta\mu\| + \|\Delta x\| \le \|(Jf^0_x)^{-1}\| \, \|Jf^0_\mu\| \, \|\Delta\mu\| + \|\Delta x\|
\]

From our discussions of the bounds due to the smoothness of the functions here, there is an overestimate $B(\Omega)$ such that

\[
\|(Jf^0_x)^{-1}\| \, \|Jf^0_\mu\| \le B(\Omega)
\]

and so

\[
\|(Jf^0_x)^{-1} \eta_f(\Delta x, \Delta\mu)\| \le B(\Omega) \|\Delta\mu\| + \|\Delta x\|
\]

Using this, we get that the change in performance $\Delta L$ is given by

\[
\Delta L = (\nabla_x L^0)^T \left( -(Jf^0_x)^{-1} Jf^0_\mu \Delta\mu - (Jf^0_x)^{-1} \eta_f(\Delta x, \Delta\mu) \right) + (\nabla_\mu L^0)^T \Delta\mu + \eta_L(\Delta x, \Delta\mu)
\]
\[
= -(\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu \Delta\mu - (\nabla_x L^0)^T (Jf^0_x)^{-1} \eta_f(\Delta x, \Delta\mu) + (\nabla_\mu L^0)^T \Delta\mu + \eta_L(\Delta x, \Delta\mu)
\]

Letting $\alpha = (\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu$, we see that at a minimum $\Delta L \ge 0$. Thus,

\[
0 \le \alpha \Delta\mu - (\nabla_x L^0)^T (Jf^0_x)^{-1} \eta_f(\Delta x, \Delta\mu) + \eta_L(\Delta x, \Delta\mu)
\]

20.3.3 Constraint Satisfaction and the Implicit Function Theorem

We want to show that $\alpha = 0$ at the minimum of $L$ subject to the constraints, but as written $\Delta x$ and $\Delta\mu$ are dependent on one another. What we need is to find how $\Delta x$ depends on $\Delta\mu$, so that we can write an inequality in $\Delta\mu$ only. We will do this by stepping back, rewriting a few things and invoking the Implicit Function Theorem. It is the Implicit Function Theorem that will allow us to write $\Delta x$ in terms of $\Delta\mu$ only, rather than in terms of $\Delta\mu$ and itself. At the local minimum $(x^0, \mu^0)$, we have constraint satisfaction and so

\[
0 \le \Delta L = (\nabla_x L^0)^T \Delta x + (\nabla_\mu L^0)^T \Delta\mu + \eta_L(\Delta x, \Delta\mu)
\]
\[
0 = \Delta f = Jf^0_x \Delta x + Jf^0_\mu \Delta\mu + \eta_f(\Delta x, \Delta\mu)
\]

where the Hessian terms in $\eta_L$ and $\eta_f$ are evaluated at the in between points

\[
(x_\theta, \mu_\theta) = (1 - \theta)(x^0, \mu^0) + \theta(x^0 + \Delta x, \mu^0 + \Delta\mu)
\]

for some $\theta \in (0,1)$, with a corresponding $\phi \in (0,1)$ for $\eta_f$.

Here we could assume that $(Jf^0_x)^{-1}$ exists and solve for $\Delta x$ as we did last time, but we must realize that there is a relationship between $\Delta x$ and $\Delta\mu$ and that we need to do something else to be precise. We can use the Implicit Function Theorem to write $\Delta x$ in terms of $\Delta\mu$ to provide this precision. The Implicit Function Theorem, Theorem 13.3.1, has been carefully proven and discussed in Chapter 13. Translated to our situation, it reads:

Implicit Function Theorem
Let $U \subset \Re^{n+m}$ be an open set. Let $u \in U$ be written as $u = (x, \mu)$ where $x \in \Re^n$ and $\mu \in \Re^m$. Assume $f : U \to \Re^n$ has continuous first order partials in $U$ and there is a point $(x^0, \mu^0) \in U$ satisfying $f(x^0, \mu^0) = 0$. This is the constraint satisfaction condition. Also assume $\det Jf^0_x \ne 0$. Then there is an open set $V_0 \subset \Re^m$ containing $\mu^0$ and a function $g : V_0 \to \Re^n$ with continuous partials so that $g(\mu^0) = x^0$ and $f(g(\mu), \mu) = 0$ on $V_0$.

This tells us we have constraint satisfaction: $f(g(\mu^0 + \Delta\mu), \mu^0 + \Delta\mu) = 0$ for all $\mu^0 + \Delta\mu \in V_0$. Also, since $g(\mu^0) = x^0$, letting $g(\mu^0 + \Delta\mu) = x$, we get $\Delta x = g(\mu^0 + \Delta\mu) - g(\mu^0) = x - x^0$, which is our usual notation.
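The function $g$ promised by the theorem can also be realized numerically: for each $\mu$ near $\mu^0$, Newton's method applied to $x \mapsto f(x, \mu)$ converges to $g(\mu)$. A minimal sketch, using the same toy constraint as in the earlier fragment (again an illustrative choice of ours, not from the text):

import numpy as np

def f(x, mu):
    return np.array([x[0] + mu[0] * x[1] - 1.0, x[1] - mu[1]])

def Jfx(x, mu):
    return np.array([[1.0, mu[0]], [0.0, 1.0]])

def g(mu, x_start, tol=1e-12):
    # Newton's method on x -> f(x, mu): returns x with f(x, mu) ~ 0.
    x = x_start.astype(float).copy()
    for _ in range(50):
        r = f(x, mu)
        if np.linalg.norm(r) < tol:
            break
        x -= np.linalg.solve(Jfx(x, mu), r)
    return x

x0 = np.array([1.0, 0.0]); mu0 = np.array([0.5, 0.0])
for dmu in [np.array([0.01, 0.02]), np.array([-0.03, 0.01])]:
    x = g(mu0 + dmu, x0)
    print(x, np.linalg.norm(f(x, mu0 + dmu)))   # constraint is satisfied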


Recall that we had from before

\[
0 \le \Delta L = (\nabla_x L^0)^T \Delta x + (\nabla_\mu L^0)^T \Delta\mu + \frac{1}{2} \begin{bmatrix} \Delta x \\ \Delta\mu \end{bmatrix}^T \begin{bmatrix} H^L_{\theta,xx} & H^L_{\theta,x\mu} \\ H^L_{\theta,\mu x} & H^L_{\theta,\mu\mu} \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\mu \end{bmatrix}
\]

Let $\Omega = B(r, (x^0, \mu^0))$ and apply the Implicit Function Theorem. Associated with $\Omega$ will be an open set $V_0$. Choose a ball $B(\rho, \mu^0)$ such that $B(\rho, \mu^0) \subset V_0$. Then for all $\mu^0 + \Delta\mu \in B(\rho, \mu^0)$ we know

\[
0 \le \Delta L = (\nabla_x L^0)^T \Delta x + (\nabla_\mu L^0)^T \Delta\mu + \frac{1}{2} \begin{bmatrix} g(\mu^0 + \Delta\mu) - g(\mu^0) \\ \Delta\mu \end{bmatrix}^T \begin{bmatrix} H^L_{\theta,xx} & H^L_{\theta,x\mu} \\ H^L_{\theta,\mu x} & H^L_{\theta,\mu\mu} \end{bmatrix} \begin{bmatrix} g(\mu^0 + \Delta\mu) - g(\mu^0) \\ \Delta\mu \end{bmatrix}
\]

Now consider

\[
H^L_\theta(\Delta\mu) = \frac{1}{2} \begin{bmatrix} g(\mu^0 + \Delta\mu) - g(\mu^0) \\ \Delta\mu \end{bmatrix}^T \begin{bmatrix} H^L_{\theta,xx} & H^L_{\theta,x\mu} \\ H^L_{\theta,\mu x} & H^L_{\theta,\mu\mu} \end{bmatrix} \begin{bmatrix} g(\mu^0 + \Delta\mu) - g(\mu^0) \\ \Delta\mu \end{bmatrix}
\]

The matrix in the middle is evaluated at the point

\[
(x_\theta, \mu_\theta) = (1 - \theta)(x^0, \mu^0) + \theta(x^0 + \Delta x, \mu^0 + \Delta\mu) = (x^0, \mu^0) + \theta(\Delta x, \Delta\mu) = (g(\mu^0), \mu^0) + \theta(g(\mu^0 + \Delta\mu) - g(\mu^0), \Delta\mu)
\]

Note here that

\[
(x_\theta, \mu_\theta) - (x^0, \mu^0) = \theta(\Delta x, \Delta\mu) = \theta(g(\mu^0 + \Delta\mu) - g(\mu^0), \Delta\mu)
\]

is in $B(r, (0, 0))$ as $(g(\mu^0 + \Delta\mu), \mu^0 + \Delta\mu)$ is in $B(r, (x^0, \mu^0))$. Note also that by our assumptions $H^L_\theta(\Delta\mu)$ is continuous. So now

\[
0 \le \Delta L = (\nabla_x L^0)^T \Delta x + (\nabla_\mu L^0)^T \Delta\mu + H^L_\theta(\Delta\mu)
\]

and because constraints are satisfied

\[
0 = Jf^0_x \Delta x + Jf^0_\mu \Delta\mu + H^f_\phi(\Delta\mu)
\]

where

\[
H^{f_i}_\phi(\Delta\mu) = \frac{1}{2} \begin{bmatrix} g(\mu^0 + \Delta\mu) - g(\mu^0) \\ \Delta\mu \end{bmatrix}^T \begin{bmatrix} H^{f_i}_{\phi,xx} & H^{f_i}_{\phi,x\mu} \\ H^{f_i}_{\phi,\mu x} & H^{f_i}_{\phi,\mu\mu} \end{bmatrix} \begin{bmatrix} g(\mu^0 + \Delta\mu) - g(\mu^0) \\ \Delta\mu \end{bmatrix}, \qquad H^f_\phi(\Delta\mu) = \begin{bmatrix} H^{f_1}_\phi(\Delta\mu) \\ \vdots \\ H^{f_n}_\phi(\Delta\mu) \end{bmatrix}
\]

Hence, $H^f_\phi(\Delta\mu)$ is constructed in the same way as $H^L_\theta(\Delta\mu)$, but the intermediate point is now $(x_\phi, \mu_\phi)$. Then, since we know $Jf^0_x$ is invertible, we have

\[
\Delta x = -(Jf^0_x)^{-1} Jf^0_\mu \Delta\mu - (Jf^0_x)^{-1} H^f_\phi(\Delta\mu)
\]


which is an expression for $\Delta x$ written in terms of $\Delta\mu$ only. This is what we wanted all along. Now, plugging this in for $\Delta x$ in $\Delta L$, we get

\[
0 \le \Delta L = (\nabla_x L^0)^T \left( -(Jf^0_x)^{-1} Jf^0_\mu \Delta\mu - (Jf^0_x)^{-1} H^f_\phi(\Delta\mu) \right) + (\nabla_\mu L^0)^T \Delta\mu + H^L_\theta(\Delta\mu)
\]
\[
= \left( (\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu \right) \Delta\mu - (\nabla_x L^0)^T (Jf^0_x)^{-1} H^f_\phi(\Delta\mu) + H^L_\theta(\Delta\mu)
\]

Now let $\alpha = (\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu$ and $\beta(\Delta\mu) = -(\nabla_x L^0)^T (Jf^0_x)^{-1} H^f_\phi(\Delta\mu) + H^L_\theta(\Delta\mu)$. We then have

\[
0 \le \alpha \Delta\mu + \beta(\Delta\mu)
\]

for $\Delta\mu \in B(\rho, 0)$, and we see $\beta$ is continuous on $B(\rho, 0)$. Our goal is to show that $\alpha = 0$, so that we will have the known necessary conditions for constrained optimization. To show this, we will attempt to get a bound on $\beta(\Delta\mu)$ and then trap $\alpha$ between two arbitrarily small quantities, thus showing that $\alpha$ must be $0$. There are two approaches that we will take. Our first attempt will be an incorrect approach and will lead us to a dead end. Our second approach will back off a little, summarizing some of this material once again, and proceed on to the result we are after: namely, that $\alpha = 0$.

20.3.3.1 An Incorrect Approach

Picking up where we left off: since $\beta$ is continuous on $B(\rho, 0)$, we know that given $\varepsilon > 0$, there is a $\delta > 0$ so that

\[
\Delta\mu \in B(\delta, 0) \Longrightarrow |\beta(\Delta\mu) - \beta(0)| < \varepsilon
\]

But $\beta(0) = 0$, so

\[
\Delta\mu \in B(\delta, 0) \Longrightarrow |\beta(\Delta\mu)| < \varepsilon
\]

Choose $\Delta\mu = c E_j$, where $E_j$ is the $j^{th}$ standard basis vector and $c$ is sufficiently small. Then, if $|c| < \delta$,

\[
\Delta\mu = \begin{bmatrix} 0 \\ \vdots \\ c \\ \vdots \\ 0 \end{bmatrix} \leftarrow j^{th} \text{ slot}, \qquad \text{and} \qquad 0 \le \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_j \\ \vdots \\ \alpha_m \end{bmatrix}^T \begin{bmatrix} 0 \\ \vdots \\ c \\ \vdots \\ 0 \end{bmatrix} + \varepsilon = \alpha_j c + \varepsilon
\]

So $0 \le \alpha_j c + \varepsilon$ if $|c| < \delta$.

Choose $c = \frac{\delta}{2}$: then $0 \le \alpha_j \frac{\delta}{2} + \varepsilon$, so $-\varepsilon \le \alpha_j \frac{\delta}{2}$ and $-\frac{2\varepsilon}{\delta} \le \alpha_j$.

Choose $c = -\frac{\delta}{2}$: then $0 \le -\alpha_j \frac{\delta}{2} + \varepsilon$, so $\alpha_j \frac{\delta}{2} \le \varepsilon$ and $\alpha_j \le \frac{2\varepsilon}{\delta}$.


Combining these two inequalities, we get

\[
-\frac{2\varepsilon}{\delta} \le \alpha_j \le \frac{2\varepsilon}{\delta}
\]

We would like to say here that both sides of this inequality go to zero as $\varepsilon \to 0$, but since we do not know anything about how $\delta$ depends on $\varepsilon$, we have no idea what the ratio $\frac{\varepsilon}{\delta}$ does as $\varepsilon \to 0$. So we can see that this approach is not going to work! Arghh!

20.3.3.2 The Correct Approach

To back off a little bit, we know that there is always a linkage between the controls and the states. Look at $B(\frac{\rho}{2}, \mu^0)$. Then

\[
x^0 = g(\mu^0), \qquad x^0 + \Delta x = g(\mu^0 + \Delta\mu),
\]
\[
(g(\mu^0 + \Delta\mu), \mu^0 + \Delta\mu) \in B(r, (x^0, \mu^0)),
\]
\[
f(g(\mu^0 + \Delta\mu), \mu^0 + \Delta\mu) = 0, \qquad \mu^0 + \Delta\mu \in B\left( \frac{\rho}{2}, \mu^0 \right)
\]

Constraint satisfaction then gives us

\[
0 = Jf^0_x \Delta x + Jf^0_\mu \Delta\mu + H^f_\phi(\Delta\mu)
\]

for $\mu^0 + \Delta\mu \in B(\frac{\rho}{2}, \mu^0)$. Now examine the last term $H^f_\phi(\Delta\mu)$. Writing $H^f_\phi(\Delta\mu)$ out, we see the $i^{th}$ component is

\[
(H^f_\phi(\Delta\mu))_i = \frac{1}{2} (\Delta x)^T H^{f_i}_{\phi,xx} \Delta x + \frac{1}{2} (\Delta x)^T H^{f_i}_{\phi,x\mu} \Delta\mu + \frac{1}{2} (\Delta\mu)^T H^{f_i}_{\phi,\mu x} \Delta x + \frac{1}{2} (\Delta\mu)^T H^{f_i}_{\phi,\mu\mu} \Delta\mu
\]

and so

\[
2 \, |(H^f_\phi(\Delta\mu))_i| \le \|H^{f_i}_{\phi,xx}\| \, \|\Delta x\|^2 + \|H^{f_i}_{\phi,x\mu}\| \, \|\Delta x\| \|\Delta\mu\| + \|H^{f_i}_{\phi,\mu x}\| \, \|\Delta x\| \|\Delta\mu\| + \|H^{f_i}_{\phi,\mu\mu}\| \, \|\Delta\mu\|^2
\]

But,

\[
g(\mu^0 + \Delta\mu) - g(\mu^0) = Jg^0_\mu \Delta\mu + \frac{1}{2} (\Delta\mu)^T Hg_{\xi,\mu} \Delta\mu
\]

for an intermediate point $\xi$. Therefore

\[
\|\Delta x\| \le \|Jg^0_\mu\| \, \|\Delta\mu\| + \frac{1}{2} \|Hg_{\xi,\mu}\| \, \|\Delta\mu\|^2
\]

Now for any fixed $R$ we can set $\Omega = B(R, (x, \mu))$, which is a compact set. Since all the partials here are continuous, all the gradient and Hessian terms have maximum values on $\Omega$. Thus (this is a bit messy as there are a lot of upper bounds!)

\[
\|H^{f_i}_{\phi,xx}\| \le B_{1i}(\Omega), \qquad \|H^{f_i}_{\phi,x\mu}\| \le B_{2i}(\Omega),
\]
\[
\|H^{f_i}_{\phi,\mu x}\| \le B_{3i}(\Omega), \qquad \|H^{f_i}_{\phi,\mu\mu}\| \le B_{4i}(\Omega),
\]
\[
\|Hg_{\xi,\mu}\| \le C(\Omega)
\]


Thus,

\[
2 \, |(H^f_\phi(\Delta\mu))_i| \le B_{1i}(\Omega) \|\Delta x\|^2 + B_{2i}(\Omega) \|\Delta x\| \|\Delta\mu\| + B_{3i}(\Omega) \|\Delta x\| \|\Delta\mu\| + B_{4i}(\Omega) \|\Delta\mu\|^2
\]

Let

\[
B(\Omega) = \max_{1 \le i \le n} \left\{ B_{1i}(\Omega), B_{2i}(\Omega), B_{3i}(\Omega), B_{4i}(\Omega) \right\}, \qquad D(\Omega) = \max\left\{ \|Jg^0_\mu\|, \frac{1}{2} C(\Omega) \right\}
\]

so that

\[
\|\Delta x\| \le D(\Omega) \|\Delta\mu\| + D(\Omega) \|\Delta\mu\|^2
\]

Now assume $\|\Delta\mu\| < 1$. Then $\|\Delta\mu\|^2 \le \|\Delta\mu\|$, so $\|\Delta x\| \le 2 D(\Omega) \|\Delta\mu\|$. We can now do our final estimate:

\[
2 \max_{1 \le i \le n} |(H^f_\phi(\Delta\mu))_i| \le B(\Omega) \left( \|\Delta x\|^2 + 2 \|\Delta x\| \|\Delta\mu\| + \|\Delta\mu\|^2 \right) = B(\Omega) \left( \|\Delta x\| + \|\Delta\mu\| \right)^2 \le 2 B(\Omega) \left( \|\Delta x\|^2 + \|\Delta\mu\|^2 \right)
\]

Thus

\[
\max_{1 \le i \le n} |(H^f_\phi(\Delta\mu))_i| \le B(\Omega) \left( \|\Delta x\|^2 + \|\Delta\mu\|^2 \right)
\]

Next, using the estimate for $\|\Delta x\|$, we find

\[
\max_{1 \le i \le n} |(H^f_\phi(\Delta\mu))_i| \le B(\Omega) \left( 4 D(\Omega)^2 + 1 \right) \|\Delta\mu\|^2
\]

Let $\xi(\Omega) = B(\Omega)(4 D(\Omega)^2 + 1)$ and we have shown

\[
\max_{1 \le i \le n} |(H^f_\phi(\Delta\mu))_i| \le \xi(\Omega) \|\Delta\mu\|^2
\]

Thus, to repeat, we have the constraint equation

\[
0 = Jf^0_x \Delta x + Jf^0_\mu \Delta\mu + H^f_\phi(\Delta\mu)
\]

where

\[
\|H^f_\phi(\Delta\mu)\| = \max_{1 \le i \le n} |(H^f_\phi(\Delta\mu))_i| \le \xi(\Omega) \|\Delta\mu\|^2
\]
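This quadratic decay is easy to observe numerically. With $\Delta x = g(\mu^0 + \Delta\mu) - g(\mu^0)$ computed by the Newton iteration from the earlier sketch, the remainder $H^f_\phi(\Delta\mu) = -Jf^0_x \Delta x - Jf^0_\mu \Delta\mu$ should shrink like $\|\Delta\mu\|^2$, so the ratio printed below stays roughly constant (same illustrative toy constraint as before):

import numpy as np

# Same toy constraint and Newton based g as in the earlier sketches.
def f(x, mu):
    return np.array([x[0] + mu[0] * x[1] - 1.0, x[1] - mu[1]])

def g(mu, x_start=np.array([1.0, 0.0])):
    x = x_start.copy()
    for _ in range(50):
        J = np.array([[1.0, mu[0]], [0.0, 1.0]])
        x = x - np.linalg.solve(J, f(x, mu))
    return x

x0 = np.array([1.0, 0.0]); mu0 = np.array([0.5, 0.0])
Jfx0 = np.array([[1.0, mu0[0]], [0.0, 1.0]])
Jfmu0 = np.array([[x0[1], 0.0], [0.0, -1.0]])

d = np.array([0.3, -0.4])
for t in [1e-1, 1e-2, 1e-3]:
    dmu = t * d
    dxg = g(mu0 + dmu) - x0                 # Delta x from the implicit g
    Hf = -Jfx0 @ dxg - Jfmu0 @ dmu          # remainder in the constraint eq
    print(t, np.linalg.norm(Hf) / t ** 2)   # roughly constant ratio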

Using the same argument we used above, we will also get that

\[
0 \le \Delta L = (\nabla_x L^0)^T \Delta x + (\nabla_\mu L^0)^T \Delta\mu + H^L_\theta(\Delta\mu), \qquad |H^L_\theta(\Delta\mu)| \le \zeta(\Omega) \|\Delta\mu\|^2
\]

for some constant $\zeta(\Omega)$. And now, since we know $Jf^0_x$ is invertible, we know from the constraint equation that

tion that

∆x = −(Jf0x))−1 Jf

0µ∆µ− (Jf

0x)−1 Hf

φ(∆µ)

So, putting this into the expression for $\Delta L$, we get an expression that we have seen before. Namely,

\[
0 \le \Delta L = (\nabla_x L^0)^T \left( -(Jf^0_x)^{-1} Jf^0_\mu \Delta\mu - (Jf^0_x)^{-1} H^f_\phi(\Delta\mu) \right) + (\nabla_\mu L^0)^T \Delta\mu + H^L_\theta(\Delta\mu)
\]
\[
= \left( (\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu \right) \Delta\mu - (\nabla_x L^0)^T (Jf^0_x)^{-1} H^f_\phi(\Delta\mu) + H^L_\theta(\Delta\mu)
\]

Letting

\[
\alpha = (\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu
\]

and

\[
Q(\Delta\mu) = -(\nabla_x L^0)^T (Jf^0_x)^{-1} H^f_\phi(\Delta\mu) + H^L_\theta(\Delta\mu)
\]

we get

\[
0 \le \Delta L = \alpha \Delta\mu + Q(\Delta\mu).
\]

From all of the above work and the fact that $(Jf^0_x)^{-1}$ exists, we have

\[
|Q(\Delta\mu)| \le \|\nabla_x L^0\| \, \|(Jf^0_x)^{-1}\| \, \xi(\Omega) \|\Delta\mu\|^2 + \zeta(\Omega) \|\Delta\mu\|^2 = \left( \|\nabla_x L^0\| \, \|(Jf^0_x)^{-1}\| \, \xi(\Omega) + \zeta(\Omega) \right) \|\Delta\mu\|^2
\]

Now let

\[
K(\Omega) = \|\nabla_x L^0\| \, \|(Jf^0_x)^{-1}\| \, \xi(\Omega) + \zeta(\Omega)
\]

and we have shown $|Q(\Delta\mu)| \le K(\Omega) \|\Delta\mu\|^2$. So now we have the inequality

\[
0 \le \alpha \Delta\mu + K(\Omega) \|\Delta\mu\|^2.
\]

Notice that this is similar to the inequality we had at the end of the other approach. This time, though, we will be able to get the result we want, i.e. $\alpha = 0$. Write $K$ for $K(\Omega)$. Now choose all $\Delta\mu_i = 0$ except for $\Delta\mu_j$, and let $\Delta\mu_j = \frac{1}{\sqrt{K} p}$. For $p$ large enough, we know that $\mu^0 + \Delta\mu \in B(\frac{\rho}{2}, \mu^0)$. Looking at the $j^{th}$ component of the inequality, we see

\[
0 \le \alpha_j \frac{1}{\sqrt{K} p} + K \frac{1}{K p^2} \Longrightarrow -\frac{1}{p^2} \le \frac{1}{\sqrt{K} p} \alpha_j \Longrightarrow -\frac{p \sqrt{K}}{p^2} \le \alpha_j \Longrightarrow -\frac{\sqrt{K}}{p} \le \alpha_j
\]

Now choose $\Delta\mu_j = -\frac{1}{\sqrt{K} p}$. Then the $j^{th}$ component of the inequality becomes

\[
0 \le \alpha_j \left( \frac{-1}{\sqrt{K} p} \right) + K \frac{1}{K p^2} \Longrightarrow \alpha_j \left( \frac{1}{\sqrt{K} p} \right) \le \frac{1}{p^2} \Longrightarrow \alpha_j \le \frac{\sqrt{K}}{p}
\]


Combining these, we get

\[
-\frac{\sqrt{K}}{p} \le \alpha_j \le \frac{\sqrt{K}}{p}
\]

Since $p$ is arbitrarily large, this implies $\alpha_j = 0$. This same argument works for any of the components $\alpha_1, \alpha_2, \cdots, \alpha_m$. So we are finally able to conclude that $\alpha = 0$, and we have found the necessary conditions for constrained optimization.

Theorem 20.3.1 Constrained Optimization

Consider the problem

\[
\min_{\mu \in \Re^m} L(x, \mu)
\]

subject to

\[
f_1(x, \mu) = 0, \; \ldots, \; f_n(x, \mu) = 0, \quad \text{i.e.} \quad f(x, \mu) = 0,
\]

where $x \in \Re^n$ is the state, $\mu \in \Re^m$ is the control and $L : \Re^n \times \Re^m \to \Re$ is the performance index. The function $f : \Re^n \times \Re^m \to \Re^n$ is the constraint. We assume the functions $f$ and $L$ have continuous partials locally about $(x^0, \mu^0)$. If $(x^0, \mu^0)$ is a point where $L$ is minimized and $(Jf_x(x^0, \mu^0))^{-1}$ exists, then the following equation must be satisfied:

\[
(\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu = 0
\]

where $(\cdot)^0$ indicates terms evaluated at $(x^0, \mu^0)$. Hence, the points $(x^0, \mu^0)$ satisfying this equation are the possible candidates for the minima of $L$ subject to $f = 0$.

Proof 20.3.1
We have just finished a very long-winded explanation!
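The theorem is easy to test on a problem we can solve by hand. In the sketch below (an example of our own devising, not from the text), we minimize $L = (x_1 - 1)^2 + x_2^2 + \mu^2$ subject to $x_1 - \mu = 0$ and $x_2 - 2\mu = 0$. Eliminating $x$ gives the reduced objective $6\mu^2 - 2\mu + 1$, whose minimizer is $\mu^0 = 1/6$, and $\alpha$ indeed vanishes there:

import numpy as np

# Example: L = (x1 - 1)^2 + x2^2 + mu^2 with constraints
# f = (x1 - mu, x2 - 2 mu) = 0, so n = 2 states and m = 1 control.
def alpha(mu):
    x = np.array([mu, 2 * mu])                     # constraint satisfying state
    grad_xL = np.array([2 * (x[0] - 1), 2 * x[1]])
    grad_muL = np.array([2 * mu])
    Jfx = np.eye(2)
    Jfmu = np.array([[-1.0], [-2.0]])
    return grad_muL - grad_xL @ np.linalg.inv(Jfx) @ Jfmu

print(alpha(1 / 6))   # ~[0.] at the minimizer
print(alpha(0.7))     # nonzero away from it: 12(0.7) - 2 = 6.4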

20.4 Lagrange Multipliers

We could also approach this problem using the Lagrange Multiplier technique. If we let

\[
\Phi(x, \mu, \lambda) = L(x, \mu) + \lambda^T f(x, \mu)
\]

then at a minimizer of $\Phi$ we must have

\[
\nabla_x \Phi = 0 \Longrightarrow (\nabla_x L^0)^T + \lambda^T Jf^0_x = 0, \qquad (*)
\]
\[
\nabla_\mu \Phi = 0 \Longrightarrow (\nabla_\mu L^0)^T + \lambda^T Jf^0_\mu = 0, \qquad (**)
\]
\[
\nabla_\lambda \Phi = 0 \Longrightarrow f(x, \mu) = 0, \qquad \text{the constraint equations}
\]

If $Jf^0_x$ is invertible, we get from $(*)$ that

\[
\lambda^T = -(\nabla_x L^0)^T (Jf^0_x)^{-1}
\]


and so using $(**)$ we see

\[
0 = (\nabla_\mu L^0)^T - (\nabla_x L^0)^T (Jf^0_x)^{-1} Jf^0_\mu
\]

which is exactly the equation we had before.
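For the toy problem from the previous sketch, the multiplier formula can be evaluated directly, and we can check that $(**)$ holds at the constrained minimizer (illustrative example, ours):

import numpy as np

# Same toy problem: L = (x1 - 1)^2 + x2^2 + mu^2, f = (x1 - mu, x2 - 2 mu).
mu = 1 / 6                              # the constrained minimizer
x = np.array([mu, 2 * mu])
grad_xL = np.array([2 * (x[0] - 1), 2 * x[1]])
grad_muL = np.array([2 * mu])
Jfx = np.eye(2)
Jfmu = np.array([[-1.0], [-2.0]])

lam = -np.linalg.solve(Jfx.T, grad_xL)  # from (*): lambda = -(Jfx)^{-T} grad_x L
print(lam)                              # multipliers: [5/3, -2/3]
print(grad_muL + lam @ Jfmu)            # (**) is satisfied: ~[0.]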

To get some sort of idea of what the $\lambda_i$'s are, we know that to first order

\[
\Delta L \approx (\nabla_x L^0)^T \Delta x + (\nabla_\mu L^0)^T \Delta\mu, \qquad \Delta f \approx Jf^0_x \Delta x + Jf^0_\mu \Delta\mu.
\]

Now suppose we set $\Delta\mu = 0$ and let $\Delta x$ vary. Then

\[
\Delta L \approx (\nabla_x L^0)^T \Delta x, \qquad \Delta f \approx Jf^0_x \Delta x.
\]

So

\[
\Delta L \approx (\nabla_x L^0)^T (Jf^0_x)^{-1} \Delta f \Longrightarrow \Delta L \approx -\lambda^T \Delta f \Longrightarrow \Delta L \approx -\sum_i \lambda_i \, \Delta f_i
\]

and thus

\[
\lambda_i \approx -\frac{\Delta L}{\Delta f_i}.
\]

Here the Lagrange multipliers $\lambda_i$ have an interpretation as the cost of violating the constraint $f_i$; i.e., they measure the change in performance with respect to the change in a constraint. Of course, the argument above is very loose! But suggestive arguments can often be proven later with precision. The hardest part of all, in some ways, is trying to discover conjectures. Think of the pricing interpretation as a challenge: how would you prove it carefully?
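We can at least test the pricing interpretation numerically. In the sketch below (same toy problem as above; the perturbation scheme is our own), the constraints are shifted to $x_1 - \mu = c_1$ and $x_2 - 2\mu = c_2$, the reduced problem is re-minimized, and the observed change in $L$ is compared with $-\sum_i \lambda_i c_i$:

import numpy as np

# Shadow price check: perturb the constraints and re-minimize the reduced
# objective (mu + c1 - 1)^2 + (2 mu + c2)^2 + mu^2, which is quadratic in mu.
def Lmin(c):
    mu = (2 - 2 * c[0] - 4 * c[1]) / 12            # closed form minimizer
    return (mu + c[0] - 1) ** 2 + (2 * mu + c[1]) ** 2 + mu ** 2

lam = np.array([5 / 3, -2 / 3])                    # multipliers found above
for c in [np.array([1e-3, 0.0]), np.array([0.0, 1e-3])]:
    dL = Lmin(c) - Lmin(np.zeros(2))
    print(dL, -lam @ c)                            # approximately equal

The agreement is only to first order in $c$, which is exactly the looseness the argument above warns about.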


Part VII

Summing It All Up



Chapter 21

Summing It All Up

We have now come to the end of these notes. We have not covered all of the things we wanted to, but we view that as a plus: there is more to look forward to! In particular, we hope we have encouraged your interest in

• A thorough discussion of the result

\[
\int_{\partial U} \omega = \int_U d\omega
\]

for 1-forms $\omega$ in $\Re^3$.

• The topology of $\Re^n$ and how we generalize this to $n$ dimensional manifolds.

• More ideas in algebraic topology and how it can be used in applied models in various areas of science such as immunology and cognition.

• Learning more computational science.

So here's to the future! The next book (Peterson [9], 2019) focuses on the study of metric, normed and inner product spaces in more depth. You are welcome to get started on that one next.


Part VIII

References


References

[1] R. Bate, D. Mueller, and J. White. Fundamentals of Astrodynamics. Dover, 1971.

[2] M. Braun. Differential Equations and Their Applications. Springer-Verlag, 1978.

[3] H. Curtis. Orbital Mechanics for Engineering Students. Elsevier, 2014.

[4] W. Fulton. Algebraic Topology: A First Course. Graduate Texts in Mathematics 153, Springer, New York, 1995.

[5] J. Peterson. Calculus for Cognitive Scientists: Higher Order Models and Their Analysis. Springer Series on Cognitive Science and Technology, Springer Science+Business Media Singapore Pte Ltd., 152 Beach Road, #22-06/08 Gateway East, Singapore 189721, Singapore, 2016. doi: 10.1007/978-981-287-877-9.

[6] J. Peterson. Basic Analysis V: Linear Functional Analysis and Topology. CRC Press, a Division of the Taylor and Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, Florida 33487, 2019. Contract December 15, 2016, 450 pages, 100 illustrations.

[7] J. Peterson. Basic Analysis IV: Abstract Measure Theory. CRC Press, a Division of the Taylor and Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, Florida 33487, 2019. Contract December 15, 2016, 450 pages, 100 illustrations.

[8] J. Peterson. Basic Analysis I: Functions of a Real Variable. CRC Press, a Division of the Taylor and Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, Florida 33487, 2019. Contract December 15, 2016, 450 pages, 100 illustrations.

[9] J. Peterson. Basic Analysis III: Mappings on Infinite Dimensional Spaces. CRC Press, a Division of the Taylor and Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, Florida 33487, 2019. Contract December 15, 2016, 450 pages, 100 illustrations.

[10] J. Peterson. Basic Analysis II: Functions of Multiple Variables. CRC Press, a Division of the Taylor and Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, Florida 33487, 2019. Contract December 15, 2016, 450 pages, 100 illustrations.

[11] J. Peterson, A. M. Kesson, and N. J. C. King. A Theoretical Model of the West Nile Virus Survival Data. BMC Immunology, 18(Suppl 1)(22):24-38, June 21, 2017. URL http://dx.doi.org/10.1186/s12865-017-0206-z.

[12] J. Peterson, A. M. Kesson, and N. J. C. King. A Model of Auto Immune Response. BMC Immunology, 18(Suppl 1)(24):48-65, June 21, 2017. URL http://dx.doi.org/10.1186/s12865-017-0208-x.

[13] A. E. Roy. Orbital Motion. Adam Hilger, Ltd, Bristol, 1982.

[14] G. Sutton and O. Biblarz. Rocket Propulsion Elements. John Wiley & Sons, Inc, New York, 2001.

[15] W. Thompson. Introduction to Space Dynamics. Dover, 1961.

[16] R. Wrede and M. Spiegel. Schaum's Outline on Advanced Calculus. McGraw-Hill Education, third edition, 2010.


Part IX

Detailed Indices


Index

Axiom
  The Completeness Axiom, 10

Definition
  $\Re^n$ to $\Re^m$ Continuity, 113
  2 Forms, 376
  3 Forms, 376
  A Refinement of a Partition, 294
  Abstract Vector Space, 24
  Algebraic Dual of a Vector Space, 364
  Balls in $\Re^n$, 103
  Boundary Points of a Subset in $\Re^n$, 103
  Bounded Sets, 9
  Cauchy Sequences in $\Re^n$, 106
  Connected Closed Sets, 347
  Connected Sets, 371
  Conservative Force Field, 344
  Critical Points, 224
  Darboux Lower and Upper Sums for $f$ on $A$, 293
  Error Form of Differentiability For Scalar Function of $n$ Variables, 192
  Exact 1-forms, 373
  Integration of a 1-form over a smooth path, 368
  Least Upper Bound and Greatest Lower Bound, 9
  Limit Inferior and Limit Superior of a Sequence, 14
  Limit Points, Cluster Points and Accumulation Points, 104
  Linear Independence, 26
  Linear Independence in a Vector Space, 26
  Linear Independence Of N Objects, 24
  Linear Independence of Two Objects, 24
  Linear Transformations Between Vector Spaces, 64
  Norm Convergence in n Dimensions, 104
  Norm on a Vector Space, 56
  One Forms on an open set U, 363
  Open and Closed Sets in $\Re^n$, 104
  Oriented Two Cells, 354
  Partial Derivatives, 187
  Partitions of the bounded set A, 292
  Path Connected Sets, 347
  Planes in n Dimensions, 189
  Real Matrices, 64
  Second Order Partials, 202
  Sequential Compactness, 11
  Sequentially Compact Subsets of $\Re^n$, 109
  Sets of Measure Zero, 306
  Smooth Functions on $\Re^n$, 363
  Smooth Paths, 366
  Smooth Segmented Paths, 373
  Tangent Vector Fields, 363
  The $\ell^p$ Sequence Space, 57
  The $\Re^n$ to $\Re^m$ Limit, 112
  The Characteristic Function of a Set, 309
  The Closure of a Set, 105
  The Common Refinement of Two Partitions, 295
  The Darboux Integral, 298
  The Differentiability of a Vector Function of n variables, 209
  The Frobenius Norm of a Linear Transformation, 72
  The Gradient, 191
  The Hessian Matrix, 203
  The Inherited Inner Product on a Finite Dimensional Vector Space is Independent of Choice of Orthonormal Basis, 68
  The Integration Symbol, 309
  The Jacobian, 208
  The Line Integral of a Force Field along a Curve, 341
  The Lower and Upper Darboux Integral, 298
  The Minimum and Maximum of a Set, 10
  The Norm of a Partition, 300
  The Norm of a Symmetric Matrix, 120
  The Oscillation of a Function, 310
  The Riemann Integral, 302
  The Riemann Sum, 302
  The Symmetric Group $S_2$, 376
  The Symmetric Group $S_3$, 376
  The Tangent Plane to $z = f(x)$ at $x^0$ in $\Re^n$, 191
  The Volume of a Set, 309
  Topologically Compact Subsets of $\Re^n$, 109
  Two Dimensional Curves, 339
  Vector Spaces
    A Basis For A Finite Dimensional Vector Space, 27
    A Basis For An Infinite Dimensional Vector Space, 28
    Linear Independence For Non Finite Sets, 28
    Real Inner Product, 29
    The Span Of A Set Of Vectors, 27

Finite Difference Techniques for PDE
  Approximation for Diffusion Equation, 279
  Central Difference for First Order, 277
  Central Difference for Second Order, 278
  Discretization of Underlying Domain, 279
  Forward Difference for First Order, 277
  Recursive System for the Diffusion Equation, 281
  Recursive System for the Diffusion Equation Errors, 282

Fourier Series
  Fourier Series for f, 54
  Normalized Cosine functions, 53

Functions of Two Variables
  Continuity, 275
  Fourth Order Taylor Expansion, 278
  Partial Derivative Bounds, 276
  Second Order Taylor Expansion, 276

General Autoimmune Models
  The CMN model
    Assumption 1: the number of infected cells, 274
    Deviations of C, M and N from nominal values, 274
    Dynamics, 274
    Dynamics: the F1, F2 and F3 nonlinear functions, 274
    Linearizing F1, F2 and F3, 275
    The 3 x 3 F1, F2 and F3 nominal partial values matrix, 275
    The initial conditions for the dynamics, 274
  The CMN model: M and N are two distinct populations of cells and C is autoimmune action, 274

Immune Theory
  Assumptions
    Number of infected cells, 274

Lemma
  Infimum Tolerance Lemma, 11
  Supremum Tolerance Lemma, 11

Linear ODE Systems
  Complex Eigenvalues, 84
  First Real Solution, 85
  General Complex Solution, 84
  General Real Solution, 85
  Matrix Representation of the Real Solution, 86
  Rewriting the Real Solution, 86
  Second Real Solution, 85
  Theoretical Analysis, 84
  Worked Out Example, 85

MultiVariable Calculus
  Smoothness
    Continuous first order partials imply mixed second order partials match, 213

Theorem, 65, 113, 307, 387
  $(\Re^n, \|\cdot\|_p)$ is Complete, 63
  $D$ in $\Re^n$ is closed if and only if it contains all its limit points, 105
  $D$ is Sequentially Compact if and only if it is closed and bounded, 109
  $D$ is topologically compact if and only if it is closed and bounded, 111
  $D$ is topologically compact if and only if sequentially compact if and only if closed and bounded, 111
  $D$ is topologically compact implies it is closed and bounded, 110
  $S$ has a maximal element if and only if $\sup(S) \in S$, 10
  $S$ has a minimal element if and only if $\inf(S) \in S$, 10
  $\Re^n$ is Complete, 106
  Linear Transformation between Two Finite Dimensional Vector Spaces Determines an Equivalence Class of Matrices, 71
  A Set $D$ is Closed if and only if $D = \bar{D}$, 105
  A Hyperrectangle is topologically compact, 110
  A Particular Solution to a linear ODE System, 398
  A Set $S$ is Sequentially Compact if and only if it is Closed and Bounded, 11
  A Simple Recapture Theorem in $\Re^2$, 372
  All Eigenvalues, 125
  Alternate Definition of the limit inferior and limit superior of a sequence, 17
  An Extension of the Implicit Function Theorem, 248
  Approximation of the Darboux Integral, 300
  Approximation of the Riemann Integral, 305
  At an Interior Extreme Point, The Partials Vanish, 223
  Basic Topology Results in $\Re^2$, 19
  Best Finite Dimensional Approximation Theorem, 52
  Bolzano-Weierstrass Theorem in $\Re^n$, 107
  Bolzano-Weierstrass Theorem in 2D, 20
  Cauchy-Schwartz Inequality, 29
  Change of Variable for a Linear Map, 319
  Closed Subsets of Hyperrectangles are topologically compact, 111
  Conservative Force Fields Imply $A_y = B_x$, 346
  Constrained Optimization, 425
  Continuous functions on compact domains have a minimum and maximum value, 13
  Definiteness Test for nD Extrema Using Definiteness Of the Hessian, 229
  Derivative of $f$ at an interior point local extremum is zero, 14
  Determinant, 168
    Algorithm, 169
    Properties, 169
    Smoothness, 169
  Differentiable Implies Continuous: Scalar Function of n Variables, 193
  Equivalent Characterizations of Exactness, 374
  Finding where $(x^2 + y^2, 2xy)$ is a Bijection, 234
  Finite Unions of Sets of Measure Zero Also Have Measure Zero, 307
  First Eigenvalue, 122
  Fubini's Theorem for a Rectangle: n dimensional, 334
  Fubini's Theorem on a Rectangle: 2D, 332
  Fubini's Theorem: 2D: a top and bottom closed curve, 335
  Fubini's Theorem: 2D: a top and bottom curve, 334
  Green's Theorem for a SCROC, 354
  Green's Theorem for Oriented 2-Cells, 357
  Hölder's Inequality, 57
  Hölder's Inequality in $\Re^n$, 59
  Hölder's Theorem for $p = 1$ and $q = \infty$, 58
  If $f$ continuous on a compact domain, it has global extrema, 117
  If $f$ is continuous on a compact domain, its range is compact, 116
  If $f$ is integrable on $A$ and $A$ has measure zero, then the integral of $f$ on $A$ is zero, 316
  If $f$ is integrable over $A$ and $f$ is zero except on a set of measure zero, then the integral is zero, 317
  If the nonnegative $f$ is integrable on $A$ with value 0, then the set of points where $f > 0$ has measure zero, 316
  If the Vector Function $f$ is Differentiable at $x^0$, $L(x^0) = Jf(x^0)$, 210
  Implicit Function Theorem, 246
  Interior Points of the Range of $f$, 238
  Inverse Images of open sets under continuous maps are open, 240
  Lebesgue's Theorem, 311
  Linear Transformations Between Finite Dimensional Vector Spaces, 69
  Lower and Upper Darboux Sums and Partition Refinements, 296
  Lower and Upper Darboux Sums are Independent of the choice of Bounding Rectangle, 294
  Lower Darboux Sums are always less than Upper Darboux Sums, 298
  Minkowski's Inequality, 59
  Minkowski's Inequality in $\Re^n$, 61
  Oscillation is Zero if and only if Continuous, 311
  Smooth Functions with zero derivatives are constant if and only if the domain is connected, 371
  Subsets of Sets That have Volume, 324
  Sufficient Conditions for the Mixed Order Partials to Match, 205
  Sufficient Conditions On the First Order Partials to Guarantee Differentiability, 200
  The $\alpha$-$\beta$ Lemma, 57
  The angle function is not exact on $\Re^2 \setminus \{(0,0)\}$, 373
  The Chain Rule for Scalar Functions of $n$ Variables, 197
  The Chain Rule for Vector Functions, 211
  The Change of Basis Mapping, 67
  The Change of Basis Mapping In a Finite Dimensional Inner Product Space, 67
  The Change of Variable Theorem for a Constant Map, 324
  The Change of Variable Theorem for a General Map, 327
  The Completeness of $(\Re^n, \|\cdot\|_\infty)$, 62
  The Derivative of the exponential matrix, 393
  The Eigenvalue Representation of the norm of a symmetric matrix, 126
  The Equivalence of the Riemann and Darboux Integral, 303
  The First Inverse Function Theorem, 237
  The Frobenius Norm Fundamental Inequality, 72
  The general solution to a linear ODE system, 395
  The Inverse Function Theorem, 240
  The Mean Value Theorem, 213
  The Potential Function for $F$, 348
  The range of a continuous function on a sequentially compact domain is also sequentially compact, 12
  The Recapture Theorem for Segmented Paths, 374
  The Riemann Criterion For Darboux Integrability, 298
  The scalar function $f$ is Differentiable implies $Df = (\nabla(f))^T$, 194
  The Second Eigenvalue, 123
  The Solution Space to a linear System of ODE, 397
  The Third Eigenvalue, 124
  The Volume of a Rectangle in $\Re^n$, 322
  Two dimensional Extrema Test for minima and maxima, 228
  Two Equivalent Ways To Calculate the Norm of a Symmetric Matrix, 121
  Volume Zero Implies Measure Zero, 310